* [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
@ 2014-09-01 17:39 Andy Lutomirski
  2014-09-01 17:39 ` [PATCH v4 1/4] virtio_ring: Support DMA APIs if requested Andy Lutomirski
                   ` (4 more replies)
  0 siblings, 5 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-01 17:39 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

This fixes virtio on Xen guests as well as on any other platform
that uses virtio_pci on which physical addresses don't match bus
addresses.

This can be tested with:

    virtme-run --xen xen --kimg arch/x86/boot/bzImage --console

using virtme from here:

    https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git

Without these patches, the guest hangs forever.  With these patches,
everything works.

This should be safe on all platforms that I'm aware of.  That
doesn't mean that there isn't anything that I missed.

Thanks to everyone for putting up with the development of this
series.  Hopefully it'll be the end of DMA issues in virtio. :)

Changes from v3:
 - virtio_pci only asks virtio_ring to use the DMA API if
   !PCI_DMA_BUS_IS_PHYS.
 - Reduce tools/virtio breakage.  It's now merely as broken as before
   instead of being even more broken.
 - Drop the sg_next changes -- Rusty's version is better.

Changes from v2:
 - Reordered patches.
 - Fixed a virtio_net OOPS.

Changes from v1:
 - Using the DMA API is optional now.  It would be nice to improve the
   DMA API to the point that it could be used unconditionally, but s390
   proves that we're not there yet.
 - Includes patch 4, which fixes DMA debugging warnings from virtio_net.

Andy Lutomirski (4):
  virtio_ring: Support DMA APIs if requested
  virtio_pci: Use the DMA API for virtqueues
  virtio_net: Don't set the end flag on reusable sg entries
  virtio_net: Stop doing DMA from the stack

 drivers/lguest/lguest_device.c         |   3 +-
 drivers/misc/mic/card/mic_virtio.c     |   2 +-
 drivers/net/virtio_net.c               |  59 +++++++----
 drivers/remoteproc/remoteproc_virtio.c |   4 +-
 drivers/s390/kvm/kvm_virtio.c          |   2 +-
 drivers/s390/kvm/virtio_ccw.c          |   4 +-
 drivers/virtio/virtio_mmio.c           |   5 +-
 drivers/virtio/virtio_pci.c            |  41 ++++++--
 drivers/virtio/virtio_ring.c           | 187 +++++++++++++++++++++++++++++----
 include/linux/virtio_ring.h            |   1 +
 tools/virtio/linux/dma-mapping.h       |  17 +++
 tools/virtio/linux/virtio.h            |   1 +
 tools/virtio/virtio_test.c             |   2 +-
 tools/virtio/vringh_test.c             |   3 +-
 14 files changed, 268 insertions(+), 63 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
1.9.3

* [PATCH v4 1/4] virtio_ring: Support DMA APIs if requested
  2014-09-01 17:39 [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Andy Lutomirski
@ 2014-09-01 17:39 ` Andy Lutomirski
  2014-09-01 17:39 ` [PATCH v4 2/4] virtio_pci: Use the DMA API for virtqueues Andy Lutomirski
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-01 17:39 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if DMA API usage is enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.  virtio-net warns (correctly) about DMA from
the stack in virtnet_set_rx_mode.

This explicitly supports !CONFIG_HAS_DMA.  If vring is asked to use
the DMA API and CONFIG_HAS_DMA is not set, then vring will refuse to
create the virtqueue.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/lguest/lguest_device.c         |   3 +-
 drivers/misc/mic/card/mic_virtio.c     |   2 +-
 drivers/remoteproc/remoteproc_virtio.c |   4 +-
 drivers/s390/kvm/kvm_virtio.c          |   2 +-
 drivers/s390/kvm/virtio_ccw.c          |   4 +-
 drivers/virtio/virtio_mmio.c           |   5 +-
 drivers/virtio/virtio_pci.c            |   3 +-
 drivers/virtio/virtio_ring.c           | 187 +++++++++++++++++++++++++++++----
 include/linux/virtio_ring.h            |   1 +
 tools/virtio/linux/dma-mapping.h       |  17 +++
 tools/virtio/linux/virtio.h            |   1 +
 tools/virtio/virtio_test.c             |   2 +-
 tools/virtio/vringh_test.c             |   3 +-
 13 files changed, 199 insertions(+), 35 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index d0a1d8a45c81..f0eafbe82ed4 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -301,7 +301,8 @@ static struct virtqueue *lg_find_vq(struct virtio_device *vdev,
 	 * barriers.
 	 */
 	vq = vring_new_virtqueue(index, lvq->config.num, LGUEST_VRING_ALIGN, vdev,
-				 true, lvq->pages, lg_notify, callback, name);
+				 true, false, lvq->pages,
+				 lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto unmap;
diff --git a/drivers/misc/mic/card/mic_virtio.c b/drivers/misc/mic/card/mic_virtio.c
index f14b60080c21..d633964417b1 100644
--- a/drivers/misc/mic/card/mic_virtio.c
+++ b/drivers/misc/mic/card/mic_virtio.c
@@ -256,7 +256,7 @@ static struct virtqueue *mic_find_vq(struct virtio_device *vdev,
 	mvdev->vr[index] = va;
 	memset_io(va, 0x0, _vr_size);
 	vq = vring_new_virtqueue(index, le16_to_cpu(config.num),
-				 MIC_VIRTIO_RING_ALIGN, vdev, false,
+				 MIC_VIRTIO_RING_ALIGN, vdev, false, false,
 				 (void __force *)va, mic_notify, callback,
 				 name);
 	if (!vq) {
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index a34b50690b4e..e31f2fefa76e 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -107,8 +107,8 @@ static struct virtqueue *rp_find_vq(struct virtio_device *vdev,
 	 * Create the new vq, and tell virtio we're not interested in
 	 * the 'weak' smp barriers, since we're talking with a real device.
 	 */
-	vq = vring_new_virtqueue(id, len, rvring->align, vdev, false, addr,
-					rproc_virtio_notify, callback, name);
+	vq = vring_new_virtqueue(id, len, rvring->align, vdev, false, false,
+				 addr, rproc_virtio_notify, callback, name);
 	if (!vq) {
 		dev_err(dev, "vring_new_virtqueue %s failed\n", name);
 		rproc_free_vring(rvring);
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index a1349653c6d9..91abcdc196d0 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -206,7 +206,7 @@ static struct virtqueue *kvm_find_vq(struct virtio_device *vdev,
 		goto out;
 
 	vq = vring_new_virtqueue(index, config->num, KVM_S390_VIRTIO_RING_ALIGN,
-				 vdev, true, (void *) config->address,
+				 vdev, true, false, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
index d2c0b442bce5..2462a443358a 100644
--- a/drivers/s390/kvm/virtio_ccw.c
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -478,8 +478,8 @@ static struct virtqueue *virtio_ccw_setup_vq(struct virtio_device *vdev,
 	}
 
 	vq = vring_new_virtqueue(i, info->num, KVM_VIRTIO_CCW_RING_ALIGN, vdev,
-				 true, info->queue, virtio_ccw_kvm_notify,
-				 callback, name);
+				 true, false, info->queue,
+				 virtio_ccw_kvm_notify, callback, name);
 	if (!vq) {
 		/* For now, we fail if we can't get the requested size. */
 		dev_warn(&vcdev->cdev->dev, "no vq\n");
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index c600ccfd6922..693254e52a5d 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -366,8 +366,9 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
 	/* Create the vring */
-	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
-				 true, info->queue, vm_notify, callback, name);
+	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN,
+				 vdev, true, false, info->queue,
+				 vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 3d1463c6b120..a1f299fa4626 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -430,7 +430,8 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 
 	/* create the vring */
 	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
-				 true, info->queue, vp_notify, callback, name);
+				 true, false, info->queue,
+				 vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 4d08f45a9c29..7e10770edd0f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include <linux/module.h>
 #include <linux/hrtimer.h>
 #include <linux/kmemleak.h>
+#include <linux/dma-mapping.h>
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,6 +55,12 @@
 #define END_USE(vq)
 #endif
 
+struct vring_desc_state
+{
+	void *data;			/* Data for callback. */
+	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
+};
+
 struct vring_virtqueue
 {
 	struct virtqueue vq;
@@ -64,6 +71,9 @@ struct vring_virtqueue
 	/* Can we use weak barriers? */
 	bool weak_barriers;
 
+	/* Should we use the DMA API? */
+	bool use_dma_api;
+
 	/* Other side has made a mess, don't try any more. */
 	bool broken;
 
@@ -93,8 +103,8 @@ struct vring_virtqueue
 	ktime_t last_add_time;
 #endif
 
-	/* Tokens for callbacks. */
-	void *data[];
+	/* Per-descriptor state. */
+	struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
@@ -113,6 +123,83 @@ static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
 	return sg + 1;
 }
 
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+				   struct scatterlist *sg,
+				   enum dma_data_direction direction)
+{
+#ifdef CONFIG_HAS_DMA
+	/*
+	 * We can't use dma_map_sg, because we don't use scatterlists in
+	 * the way it expects (we sometimes use unterminated
+	 * scatterlists, and we don't guarantee that the scatterlist
+	 * will exist for the lifetime of the mapping).
+	 */
+	if (vq->use_dma_api)
+		return dma_map_page(vq->vq.vdev->dev.parent,
+				    sg_page(sg), sg->offset, sg->length,
+				    direction);
+#endif
+
+	return sg_phys(sg);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+				   void *cpu_addr, size_t size,
+				   enum dma_data_direction direction)
+{
+#ifdef CONFIG_HAS_DMA
+	if (vq->use_dma_api)
+		return dma_map_single(vq->vq.vdev->dev.parent,
+				      cpu_addr, size,
+				      direction);
+#endif
+
+	return virt_to_phys(cpu_addr);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+			    struct vring_desc *desc)
+{
+#ifdef CONFIG_HAS_DMA
+	if (!vq->use_dma_api)
+		return;		/* Nothing to do. */
+
+	if (desc->flags & VRING_DESC_F_INDIRECT) {
+		dma_unmap_single(vq->vq.vdev->dev.parent,
+				 desc->addr, desc->len,
+				 (desc->flags & VRING_DESC_F_WRITE) ?
+				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	} else {
+		dma_unmap_page(vq->vq.vdev->dev.parent,
+			       desc->addr, desc->len,
+			       (desc->flags & VRING_DESC_F_WRITE) ?
+			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	}
+#endif
+}
+
+static void vring_unmap_indirect(const struct vring_virtqueue *vq,
+				 struct vring_desc *desc, int total)
+{
+	int i;
+
+	if (vq->use_dma_api)
+		for (i = 0; i < total; i++)
+			vring_unmap_one(vq, &desc[i]);
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+			       dma_addr_t addr)
+{
+#ifdef CONFIG_HAS_DMA
+	return vq->use_dma_api &&
+		dma_mapping_error(vq->vq.vdev->dev.parent, addr);
+#else
+	return 0;
+#endif
+}
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
 				     struct scatterlist *sgs[],
@@ -146,7 +233,10 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
 			desc[i].flags = VRING_DESC_F_NEXT;
-			desc[i].addr = sg_phys(sg);
+			desc[i].addr =
+				vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
+			if (vring_mapping_error(vq, desc[i].addr))
+				goto unmap_free;
 			desc[i].len = sg->length;
 			desc[i].next = i+1;
 			i++;
@@ -155,7 +245,10 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
 			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-			desc[i].addr = sg_phys(sg);
+			desc[i].addr =
+				vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
+			if (vring_mapping_error(vq, desc[i].addr))
+				goto unmap_free;
 			desc[i].len = sg->length;
 			desc[i].next = i+1;
 			i++;
@@ -173,15 +266,26 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	/* Use a single buffer which doesn't continue */
 	head = vq->free_head;
 	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
-	vq->vring.desc[head].addr = virt_to_phys(desc);
-	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
-	kmemleak_ignore(desc);
+	vq->vring.desc[head].addr =
+		vring_map_single(vq,
+				 desc, i * sizeof(struct vring_desc),
+				 DMA_TO_DEVICE);
+	if (vring_mapping_error(vq, vq->vring.desc[head].addr))
+		goto unmap_free;
 	vq->vring.desc[head].len = i * sizeof(struct vring_desc);
 
 	/* Update free pointer */
 	vq->free_head = vq->vring.desc[head].next;
 
+	/* Save the indirect block */
+	vq->desc_state[head].indir_desc = desc;
+
 	return head;
+
+unmap_free:
+	vring_unmap_indirect(vq, desc, i);
+	kfree(desc);
+	return -ENOMEM;
 }
 
 static inline int virtqueue_add(struct virtqueue *_vq,
@@ -197,7 +301,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
-	unsigned int i, n, avail, uninitialized_var(prev), total_sg;
+	unsigned int i, n, avail, uninitialized_var(prev), total_sg, err_idx;
 	int head;
 
 	START_USE(vq);
@@ -256,7 +360,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
-			vq->vring.desc[i].addr = sg_phys(sg);
+			vq->vring.desc[i].addr =
+				vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
+			if (vring_mapping_error(vq, vq->vring.desc[i].addr))
+				goto unmap_release;
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
 			i = vq->vring.desc[i].next;
@@ -265,7 +372,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-			vq->vring.desc[i].addr = sg_phys(sg);
+			vq->vring.desc[i].addr =
+				vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
+			if (vring_mapping_error(vq, vq->vring.desc[i].addr))
+				goto unmap_release;
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
 			i = vq->vring.desc[i].next;
@@ -279,7 +389,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
 add_head:
 	/* Set token. */
-	vq->data[head] = data;
+	vq->desc_state[head].data = data;
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync). */
@@ -301,6 +411,20 @@ add_head:
 	END_USE(vq);
 
 	return 0;
+
+unmap_release:
+	err_idx = i;
+	i = head;
+
+	for (n = 0; n < total_sg; n++) {
+		if (i == err_idx)
+			break;
+		vring_unmap_one(vq, &vq->vring.desc[i]);
+		i = vq->vring.desc[i].next;
+	}
+
+	vq->vq.num_free += total_sg;
+	return -EIO;
 }
 
 /**
@@ -480,22 +604,33 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 	unsigned int i;
 
 	/* Clear data ptr. */
-	vq->data[head] = NULL;
+	vq->desc_state[head].data = NULL;
 
 	/* Put back on free list: find end */
 	i = head;
 
 	/* Free the indirect table */
-	if (vq->vring.desc[i].flags & VRING_DESC_F_INDIRECT)
-		kfree(phys_to_virt(vq->vring.desc[i].addr));
+	if (vq->desc_state[head].indir_desc) {
+		u32 len = vq->vring.desc[i].len;
+
+		BUG_ON(!(vq->vring.desc[i].flags & VRING_DESC_F_INDIRECT));
+		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+		vring_unmap_indirect(vq, vq->desc_state[head].indir_desc,
+				     len / sizeof(struct vring_desc));
+		kfree(vq->desc_state[head].indir_desc);
+		vq->desc_state[head].indir_desc = NULL;
+	}
 
 	while (vq->vring.desc[i].flags & VRING_DESC_F_NEXT) {
+		vring_unmap_one(vq, &vq->vring.desc[i]);
 		i = vq->vring.desc[i].next;
 		vq->vq.num_free++;
 	}
 
+	vring_unmap_one(vq, &vq->vring.desc[i]);
 	vq->vring.desc[i].next = vq->free_head;
 	vq->free_head = head;
+
 	/* Plus final descriptor */
 	vq->vq.num_free++;
 }
@@ -552,13 +687,13 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 		BAD_RING(vq, "id %u out of range\n", i);
 		return NULL;
 	}
-	if (unlikely(!vq->data[i])) {
+	if (unlikely(!vq->desc_state[i].data)) {
 		BAD_RING(vq, "id %u is not a head!\n", i);
 		return NULL;
 	}
 
 	/* detach_buf clears data, so grab it now. */
-	ret = vq->data[i];
+	ret = vq->desc_state[i].data;
 	detach_buf(vq, i);
 	vq->last_used_idx++;
 	/* If we expect an interrupt for the next entry, tell host
@@ -719,10 +854,10 @@ void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
 	START_USE(vq);
 
 	for (i = 0; i < vq->vring.num; i++) {
-		if (!vq->data[i])
+		if (!vq->desc_state[i].data)
 			continue;
 		/* detach_buf clears data, so grab it now. */
-		buf = vq->data[i];
+		buf = vq->desc_state[i].data;
 		detach_buf(vq, i);
 		vq->vring.avail->idx--;
 		END_USE(vq);
@@ -761,6 +896,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *),
 				      void (*callback)(struct virtqueue *),
@@ -775,7 +911,13 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 		return NULL;
 	}
 
-	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+#ifndef CONFIG_HAS_DMA
+	if (use_dma_api)
+		return NULL;
+#endif
+
+	vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
+		     GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
@@ -787,6 +929,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 	vq->vq.index = index;
 	vq->notify = notify;
 	vq->weak_barriers = weak_barriers;
+	vq->use_dma_api = use_dma_api;
 	vq->broken = false;
 	vq->last_used_idx = 0;
 	vq->num_added = 0;
@@ -805,11 +948,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 
 	/* Put everything in free lists. */
 	vq->free_head = 0;
-	for (i = 0; i < num-1; i++) {
+	for (i = 0; i < num-1; i++)
 		vq->vring.desc[i].next = i+1;
-		vq->data[i] = NULL;
-	}
-	vq->data[i] = NULL;
+	memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
 
 	return &vq->vq;
 }
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 67e06fe18c03..60f761a38a09 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -70,6 +70,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *vq),
 				      void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h
new file mode 100644
index 000000000000..4f93af89ae16
--- /dev/null
+++ b/tools/virtio/linux/dma-mapping.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_DMA_MAPPING_H
+#define _LINUX_DMA_MAPPING_H
+
+#ifdef CONFIG_HAS_DMA
+# error Virtio userspace code does not support CONFIG_HAS_DMA
+#endif
+
+#define PCI_DMA_BUS_IS_PHYS 1
+
+enum dma_data_direction {
+	DMA_BIDIRECTIONAL = 0,
+	DMA_TO_DEVICE = 1,
+	DMA_FROM_DEVICE = 2,
+	DMA_NONE = 3,
+};
+
+#endif
diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
index 5a2d1f0f6bc7..5d42dc6a6201 100644
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -78,6 +78,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *vq),
 				      void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index 00ea679b3826..860cc89900a7 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -99,7 +99,7 @@ static void vq_info_add(struct vdev_info *dev, int num)
 	vring_init(&info->vring, num, info->ring, 4096);
 	info->vq = vring_new_virtqueue(info->idx,
 				       info->vring.num, 4096, &dev->vdev,
-				       true, info->ring,
+				       true, false, info->ring,
 				       vq_notify, vq_callback, "test");
 	assert(info->vq);
 	info->vq->priv = info;
diff --git a/tools/virtio/vringh_test.c b/tools/virtio/vringh_test.c
index 14a4f4cab5b9..67d3c3a1ba88 100644
--- a/tools/virtio/vringh_test.c
+++ b/tools/virtio/vringh_test.c
@@ -312,7 +312,8 @@ static int parallel_test(unsigned long features,
 		if (sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set))
 			err(1, "Could not set affinity to cpu %u", first_cpu);
 
-		vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &gvdev.vdev, true,
+		vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &gvdev.vdev,
+					 true, false,
 					 guest_map, fast_vringh ? no_notify_host
 					 : parallel_notify_host,
 					 never_callback_guest, "guest vq");
-- 
1.9.3

* [PATCH v4 2/4] virtio_pci: Use the DMA API for virtqueues
  2014-09-01 17:39 [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Andy Lutomirski
  2014-09-01 17:39 ` [PATCH v4 1/4] virtio_ring: Support DMA APIs if requested Andy Lutomirski
@ 2014-09-01 17:39 ` Andy Lutomirski
  2014-09-01 17:39 ` [PATCH v4 3/4] virtio_net: Don't set the end flag on reusable sg entries Andy Lutomirski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-01 17:39 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

A virtqueue is a coherent DMA mapping.  Use the DMA API for it.
This fixes virtio_pci on Xen.

As an optimization, this only asks virtio_ring to use the
DMA API if !PCI_DMA_BUS_IS_PHYS.  Eventually, once the DMA API is
known to be efficient on all relevant architectures, this
optimization can be removed.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/virtio/virtio_pci.c | 40 +++++++++++++++++++++++++++++++---------
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index a1f299fa4626..226b46b08727 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -80,8 +80,9 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the virtual address of the ring queue */
-	void *queue;
+	/* the ring queue */
+	void *queue;			/* virtual address */
+	dma_addr_t queue_dma_addr;	/* bus address */
 
 	/* the list node for the virtqueues list */
 	struct list_head node;
@@ -417,20 +418,32 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 	info->num = num;
 	info->msix_vector = msix_vec;
 
-	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
+	info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
+					  &info->queue_dma_addr, GFP_KERNEL);
 	if (info->queue == NULL) {
 		err = -ENOMEM;
 		goto out_info;
 	}
 
 	/* activate the queue */
-	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-	/* create the vring */
+	/*
+	 * Create the vring.  If there is an IOMMU of any sort, including
+	 * Xen paravirt's ersatz IOMMU, use it.  If the host wants physical
+	 * addresses instead of bus addresses, the host shouldn't expose
+	 * an IOMMU.
+	 *
+	 * As an optimization, if the platform promises to have physical
+	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
+	 * platform's DMA API implementation is well optimized, this
+	 * should have almost no effect, but that's a dangerous thing to
+	 * rely on.
+	 */
 	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
-				 true, false, info->queue,
+				 true, !PCI_DMA_BUS_IS_PHYS, info->queue,
 				 vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
@@ -463,7 +476,8 @@ out_assign:
 	vring_del_virtqueue(vq);
 out_activate_queue:
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-	free_pages_exact(info->queue, size);
+	dma_free_coherent(vdev->dev.parent, size,
+			  info->queue, info->queue_dma_addr);
 out_info:
 	kfree(info);
 	return ERR_PTR(err);
@@ -494,7 +508,8 @@ static void vp_del_vq(struct virtqueue *vq)
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
 	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-	free_pages_exact(info->queue, size);
+	dma_free_coherent(vq->vdev->dev.parent, size,
+			  info->queue, info->queue_dma_addr);
 	kfree(info);
 }
 
@@ -713,6 +728,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
 	if (err)
 		goto out;
 
+	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+	if (err)
+		err = dma_set_mask_and_coherent(&pci_dev->dev,
+						DMA_BIT_MASK(32));
+	if (err)
+		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
+
 	err = pci_request_regions(pci_dev, "virtio-pci");
 	if (err)
 		goto out_enable_device;
-- 
1.9.3

* [PATCH v4 3/4] virtio_net: Don't set the end flag on reusable sg entries
  2014-09-01 17:39 [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Andy Lutomirski
  2014-09-01 17:39 ` [PATCH v4 1/4] virtio_ring: Support DMA APIs if requested Andy Lutomirski
  2014-09-01 17:39 ` [PATCH v4 2/4] virtio_pci: Use the DMA API for virtqueues Andy Lutomirski
@ 2014-09-01 17:39 ` Andy Lutomirski
  2014-09-01 17:39 ` [PATCH v4 4/4] virtio_net: Stop doing DMA from the stack Andy Lutomirski
  2014-09-01 22:16 ` [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Benjamin Herrenschmidt
  4 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-01 17:39 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

Every time virtio_net calls skb_to_sgvec, an end flag gets set
somewhere on the queue's scatterlist and never gets cleared.  As
soon as a larger request happens, virtio_net sends the virtqueue a
scatterlist with an end mark set in the middle.  Once the vring code
starts using for_each_sg, this will blow up.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/net/virtio_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 59caa06f34a6..c90466a4fab0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -548,7 +548,7 @@ static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 	hdr = skb_vnet_hdr(skb);
 	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
 
-	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
+	skb_to_sgvec_nomark(skb, rq->sg + 1, 0, skb->len);
 
 	err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp);
 	if (err < 0)
@@ -901,12 +901,12 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 
 	if (can_push) {
 		__skb_push(skb, hdr_len);
-		num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len);
+		num_sg = skb_to_sgvec_nomark(skb, sq->sg, 0, skb->len);
 		/* Pull header back to avoid skew in tx bytes calculations. */
 		__skb_pull(skb, hdr_len);
 	} else {
 		sg_set_buf(sq->sg, hdr, hdr_len);
-		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
+		num_sg = skb_to_sgvec_nomark(skb, sq->sg + 1, 0, skb->len) + 1;
 	}
 	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
-- 
1.9.3

* [PATCH v4 4/4] virtio_net: Stop doing DMA from the stack
  2014-09-01 17:39 [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Andy Lutomirski
                   ` (2 preceding siblings ...)
  2014-09-01 17:39 ` [PATCH v4 3/4] virtio_net: Don't set the end flag on reusable sg entries Andy Lutomirski
@ 2014-09-01 17:39 ` Andy Lutomirski
  2014-09-01 22:16 ` [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Benjamin Herrenschmidt
  4 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-01 17:39 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

Now that virtio supports real DMA, drivers should play by the rules.
For virtio_net, that means that DMA should be done to and from
dynamically-allocated memory, not the kernel stack.

This should have no effect on any performance-critical code paths.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/net/virtio_net.c | 53 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c90466a4fab0..25703fd2df28 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -966,31 +966,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 				 struct scatterlist *out)
 {
 	struct scatterlist *sgs[4], hdr, stat;
-	struct virtio_net_ctrl_hdr ctrl;
-	virtio_net_ctrl_ack status = ~0;
+
+	struct {
+		struct virtio_net_ctrl_hdr ctrl;
+		virtio_net_ctrl_ack status;
+	} *buf;
+
 	unsigned out_num = 0, tmp;
+	bool ret;
 
 	/* Caller should know better */
 	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-	ctrl.class = class;
-	ctrl.cmd = cmd;
+	buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+	if (!buf)
+		return false;
+	buf->status = ~0;
+
+	buf->ctrl.class = class;
+	buf->ctrl.cmd = cmd;
 	/* Add header */
-	sg_init_one(&hdr, &ctrl, sizeof(ctrl));
+	sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl));
 	sgs[out_num++] = &hdr;
 
 	if (out)
 		sgs[out_num++] = out;
 
 	/* Add return status. */
-	sg_init_one(&stat, &status, sizeof(status));
+	sg_init_one(&stat, &buf->status, sizeof(buf->status));
 	sgs[out_num] = &stat;
 
 	BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
 	virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-	if (unlikely(!virtqueue_kick(vi->cvq)))
-		return status == VIRTIO_NET_OK;
+	if (unlikely(!virtqueue_kick(vi->cvq))) {
+		ret = (buf->status == VIRTIO_NET_OK);
+		goto out;
+	}
 
 	/* Spin for a response, the kick causes an ioport write, trapping
 	 * into the hypervisor, so the request should be handled immediately.
@@ -999,7 +1011,11 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 	       !virtqueue_is_broken(vi->cvq))
 		cpu_relax();
 
-	return status == VIRTIO_NET_OK;
+	ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+	kfree(buf);
+	return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1140,7 +1156,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg[2];
-	u8 promisc, allmulti;
+	u8 *cmdbyte;
 	struct virtio_net_ctrl_mac *mac_data;
 	struct netdev_hw_addr *ha;
 	int uc_count;
@@ -1152,22 +1168,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
 		return;
 
-	promisc = ((dev->flags & IFF_PROMISC) != 0);
-	allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+	cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+	if (!cmdbyte)
+		return;
 
-	sg_init_one(sg, &promisc, sizeof(promisc));
+	sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+	*cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-			 promisc ? "en" : "dis");
-
-	sg_init_one(sg, &allmulti, sizeof(allmulti));
+			 *cmdbyte ? "en" : "dis");
 
+	*cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-			 allmulti ? "en" : "dis");
+			 *cmdbyte ? "en" : "dis");
+
+	kfree(cmdbyte);
 
 	uc_count = netdev_uc_count(dev);
 	mc_count = netdev_mc_count(dev);
-- 
1.9.3

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-01 17:39 [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Andy Lutomirski
                   ` (3 preceding siblings ...)
  2014-09-01 17:39 ` [PATCH v4 4/4] virtio_net: Stop doing DMA from the stack Andy Lutomirski
@ 2014-09-01 22:16 ` Benjamin Herrenschmidt
  2014-09-02  5:55   ` Andy Lutomirski
  4 siblings, 1 reply; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-01 22:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390

On Mon, 2014-09-01 at 10:39 -0700, Andy Lutomirski wrote:
> Changes from v1:
>  - Using the DMA API is optional now.  It would be nice to improve the
>    DMA API to the point that it could be used unconditionally, but s390
>    proves that we're not there yet.
>  - Includes patch 4, which fixes DMA debugging warnings from virtio_net.

I'm not sure if you saw my reply on the other thread but I have a few
comments based on the above "it would be nice if ..."

So here we have both a yes and a no :-)

It would be nice to avoid those if () games all over and indeed just
use the DMA API, *however* we most certainly don't want to actually
create IOMMU mappings for the KVM virtio case. This would be a massive
loss in performance on several platforms and generally doesn't make
much sense.

However, we can still use the API without that on any architecture
where the dma mapping API ends up calling the generic dma_map_ops,
it becomes just a matter of virtio setting up some special "nop" ops
when needed.

The difficulty here resides in the fact that we have never completely
made the dma_map_ops generic. The ops themselves are defined generically
as are the dma_map_* interfaces based on them, but the location of the
ops pointer is still more/less arch specific and some architectures
still chose not to use that indirection at all I believe.
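
A minimal sketch of what such special "nop" ops could look like, assuming
the generic struct dma_map_ops and an arch-provided set_dma_ops() helper;
the virtio_nop_* names are purely illustrative, not existing kernel symbols:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hand back physical addresses; no IOMMU mapping is ever created. */
static dma_addr_t virtio_nop_map_page(struct device *dev, struct page *page,
				      unsigned long offset, size_t size,
				      enum dma_data_direction dir,
				      struct dma_attrs *attrs)
{
	return page_to_phys(page) + offset;
}

static int virtio_nop_map_sg(struct device *dev, struct scatterlist *sgl,
			     int nents, enum dma_data_direction dir,
			     struct dma_attrs *attrs)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		sg_dma_address(sg) = sg_phys(sg);
		sg_dma_len(sg) = sg->length;
	}
	return nents;
}

/* A complete version would also need .alloc/.free for coherent memory. */
static struct dma_map_ops virtio_nop_dma_ops = {
	.map_page = virtio_nop_map_page,
	.map_sg   = virtio_nop_map_sg,
};

/* Installed by the transport on whichever device the ring maps against. */
static void virtio_install_nop_dma_ops(struct device *dev)
{
	set_dma_ops(dev, &virtio_nop_dma_ops);
}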

Cheers,
Ben.

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-01 22:16 ` [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Benjamin Herrenschmidt
@ 2014-09-02  5:55   ` Andy Lutomirski
  2014-09-02 20:53     ` Benjamin Herrenschmidt
  2014-09-02 21:10     ` Michael S. Tsirkin
  0 siblings, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-02  5:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Mon, Sep 1, 2014 at 3:16 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2014-09-01 at 10:39 -0700, Andy Lutomirski wrote:
>> Changes from v1:
>>  - Using the DMA API is optional now.  It would be nice to improve the
>>    DMA API to the point that it could be used unconditionally, but s390
>>    proves that we're not there yet.
>>  - Includes patch 4, which fixes DMA debugging warnings from virtio_net.
>
> I'm not sure if you saw my reply on the other thread but I have a few
> comments based on the above "it would be nice if ..."
>

Yeah, sorry, I sort of thought I responded, but I didn't do a very good job.

> So here we have both a yes and a no :-)
>
> It would be nice to avoid those if () games all over and indeed just
> use the DMA API, *however* we most certainly don't want to actually
> create IOMMU mappings for the KVM virtio case. This would be a massive
> loss in performance on several platforms and generally doesn't make
> much sense.
>
> However, we can still use the API without that on any architecture
> where the dma mapping API ends up calling the generic dma_map_ops,
> it becomes just a matter of virtio setting up some special "nop" ops
> when needed.

I'm not quite convinced that this is a good idea.  I think that there
are three relevant categories of virtio devices:

a) Any virtio device where the normal DMA ops are nops.  This includes
x86 without an IOMMU (e.g. in a QEMU/KVM guest), 32-bit ARM, and
probably many other architectures.  In this case, what we do only
matters for performance, not for correctness.  Ideally the arch DMA
ops are fast.

b) Virtio devices that use physical addressing on systems where DMA
ops either don't exist at all (most s390) or do something nontrivial.
In this case, we must either override the DMA ops or just not use
them.

c) Virtio devices that use bus addressing.  This includes everything
on Xen (because the "physical" addresses are nonsense) and any actual
physical PCI device that speaks virtio on a system with an IOMMU.  In
this case, we must use the DMA ops.

The issue is that, on systems with DMA ops that do something, we need
to make sure that we know whether we're in case (b) or (c).  In these
patches, I've made the assumption that, if the virtio devices lives on
the PCI bus, then it uses the same type of addressing that any other
device on that PCI bus would use.

On x86, at least, I doubt that we'll ever see a physically addressed
PCI virtio device for which ACPI advertises an IOMMU, since any sane
hypervisor will just not advertise an IOMMU for the virtio device.
But are there arm64 or PPC guests that use virtio_pci, that have
IOMMUs, and that will malfunction if the virtio_pci driver ends up
using the IOMMU?  I certainly hope not, since these systems might be
very hard-pressed to work right if someone plugged in a physical
virtio-speaking PCI device.

>
> The difficulty here resides in the fact that we have never completely
> made the dma_map_ops generic. The ops themselves are defined generically
> as are the dma_map_* interfaces based on them, but the location of the
> ops pointer is still more/less arch specific and some architectures
> still chose not to use that indirection at all I believe.
>

I'd be happy to update the patches if someone does this, but I don't
really want to attack the DMA API on all architectures right now.  In
the mean time, at least s390 requires that we be able to compile out
the DMA API calls.  I'd rather see s390 provide working no-op dma ops
for all of the struct devices that provide virtio interfaces.

On a related note, shouldn't virtio be doing something to provide dma
ops to the virtio device and any of its children?  I don't know how it
would even try to do this, given how architecture-dependent this code
currently is.  Calling dma_map_single on the virtio device (as opposed
to its parent) is currently likely to crash on x86.  Fortunately,
nothing does this.
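
For concreteness, here is a hedged sketch of that distinction; it mirrors
what vring_map_single() in patch 1 does, and the helper name and arguments
are placeholders rather than anything in the series:

/* Illustrative only: map a buffer the way this series does, i.e. against
 * the virtio device's parent -- the bus device that actually has DMA ops. */
static dma_addr_t map_like_the_series(struct vring_virtqueue *vq,
				      void *buf, size_t len)
{
	return dma_map_single(vq->vq.vdev->dev.parent, buf, len,
			      DMA_TO_DEVICE);
	/*
	 * By contrast, dma_map_single(&vq->vq.vdev->dev, buf, len, ...),
	 * i.e. mapping against the virtio device itself, is the call that
	 * would likely crash on x86 today, since that device has no DMA
	 * ops of its own.
	 */
}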

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02  5:55   ` Andy Lutomirski
@ 2014-09-02 20:53     ` Benjamin Herrenschmidt
  2014-09-02 20:56       ` Konrad Rzeszutek Wilk
  2014-09-02 21:37       ` Andy Lutomirski
  2014-09-02 21:10     ` Michael S. Tsirkin
  1 sibling, 2 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-02 20:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Mon, 2014-09-01 at 22:55 -0700, Andy Lutomirski wrote:
> 
> On x86, at least, I doubt that we'll ever see a physically addressed
> PCI virtio device for which ACPI advertises an IOMMU, since any sane
> hypervisor will just not advertise an IOMMU for the virtio device.
> But are there arm64 or PPC guests that use virtio_pci, that have
> IOMMUs, and that will malfunction if the virtio_pci driver ends up
> using the IOMMU?  I certainly hope not, since these systems might be
> very hard-pressed to work right if someone plugged in a physical
> virtio-speaking PCI device.

It will definitely not work on ppc64. We always have IOMMUs on pseries,
all PCI busses do, and because it's a paravirtualized environment,
mapping/unmapping pages means hypercalls -> expensive.

But our virtio implementation bypasses it in qemu, so if virtio-pci
starts using the DMA mapping API without changing the DMA ops under the
hood, it will break for us.

Cheers,
Ben.

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 20:53     ` Benjamin Herrenschmidt
@ 2014-09-02 20:56       ` Konrad Rzeszutek Wilk
  2014-09-02 21:08         ` Benjamin Herrenschmidt
  2014-09-02 21:37       ` Andy Lutomirski
  1 sibling, 1 reply; 108+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-02 20:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Michael S. Tsirkin, Andy Lutomirski,
	Christian Borntraeger, Paolo Bonzini, linux390,
	Linux Virtualization

On Wed, Sep 03, 2014 at 06:53:33AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2014-09-01 at 22:55 -0700, Andy Lutomirski wrote:
> > 
> > On x86, at least, I doubt that we'll ever see a physically addressed
> > PCI virtio device for which ACPI advertises an IOMMU, since any sane
> > hypervisor will just not advertise an IOMMU for the virtio device.
> > But are there arm64 or PPC guests that use virtio_pci, that have
> > IOMMUs, and that will malfunction if the virtio_pci driver ends up
> > using the IOMMU?  I certainly hope not, since these systems might be
> > very hard-pressed to work right if someone plugged in a physical
> > virtio-speaking PCI device.
> 
> It will definitely not work on ppc64. We always have IOMMUs on pseries,
> all PCI busses do, and because it's a paravirtualized environment,
> mapping/unmapping pages means hypercalls -> expensive.
> 
> But our virtio implementation bypasses it in qemu, so if virtio-pci
> starts using the DMA mapping API without changing the DMA ops under the
> hood, it will break for us.

What is the default dma_ops that the Linux guests start with as
guests under ppc64?

Thanks!
> 
> Cheers,
> Ben.
> 
> 

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 20:56       ` Konrad Rzeszutek Wilk
@ 2014-09-02 21:08         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-02 21:08 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-s390, Michael S. Tsirkin, Andy Lutomirski,
	Christian Borntraeger, Paolo Bonzini, linux390,
	Linux Virtualization

On Tue, 2014-09-02 at 16:56 -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Sep 03, 2014 at 06:53:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-09-01 at 22:55 -0700, Andy Lutomirski wrote:
> > > 
> > > On x86, at least, I doubt that we'll ever see a physically addressed
> > > PCI virtio device for which ACPI advertises an IOMMU, since any sane
> > > hypervisor will just not advertise an IOMMU for the virtio device.
> > > But are there arm64 or PPC guests that use virtio_pci, that have
> > > IOMMUs, and that will malfunction if the virtio_pci driver ends up
> > > using the IOMMU?  I certainly hope not, since these systems might be
> > > very hard-pressed to work right if someone plugged in a physical
> > > virtio-speaking PCI device.
> > 
> > It will definitely not work on ppc64. We always have IOMMUs on pseries,
> > all PCI busses do, and because it's a paravirtualized environment,
> > mapping/unmapping pages means hypercalls -> expensive.
> > 
> > But our virtio implementation bypasses it in qemu, so if virtio-pci
> > starts using the DMA mapping API without changing the DMA ops under the
> > hood, it will break for us.
> 
> What is the default dma_ops that the Linux guests start with as
> guests under ppc64?

On pseries (which is what we care the most about nowadays) it's
dma_iommu_ops(), which in turn calls into the "TCE" code for populating
the IOMMU entries, which calls the hypervisor.

Cheers,
Ben.

> Thanks!
> > 
> > Cheers,
> > Ben.
> > 
> > 

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02  5:55   ` Andy Lutomirski
  2014-09-02 20:53     ` Benjamin Herrenschmidt
@ 2014-09-02 21:10     ` Michael S. Tsirkin
  2014-09-02 21:49       ` Andy Lutomirski
  1 sibling, 1 reply; 108+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 21:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Mon, Sep 01, 2014 at 10:55:29PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 1, 2014 at 3:16 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Mon, 2014-09-01 at 10:39 -0700, Andy Lutomirski wrote:
> >> Changes from v1:
> >>  - Using the DMA API is optional now.  It would be nice to improve the
> >>    DMA API to the point that it could be used unconditionally, but s390
> >>    proves that we're not there yet.
> >>  - Includes patch 4, which fixes DMA debugging warnings from virtio_net.
> >
> > I'm not sure if you saw my reply on the other thread but I have a few
> > comments based on the above "it would be nice if ..."
> >
> 
> Yeah, sorry, I sort of thought I responded, but I didn't do a very good job.
> 
> > So here we have both a yes and a no :-)
> >
> > It would be nice to avoid those if () games all over and indeed just
> > use the DMA API, *however* we most certainly don't want to actually
> > create IOMMU mappings for the KVM virtio case. This would be a massive
> > loss in performance on several platforms and generally doesn't make
> > much sense.
> >
> > However, we can still use the API without that on any architecture
> > where the dma mapping API ends up calling the generic dma_map_ops,
> > it becomes just a matter of virtio setting up some special "nop" ops
> > when needed.
> 
> I'm not quite convinced that this is a good idea.  I think that there
> are three relevant categories of virtio devices:
> 
> a) Any virtio device where the normal DMA ops are nops.  This includes
> x86 without an IOMMU (e.g. in a QEMU/KVM guest), 32-bit ARM, and
> probably many other architectures.  In this case, what we do only
> matters for performance, not for correctness.  Ideally the arch DMA
> ops are fast.
> 
> b) Virtio devices that use physical addressing on systems where DMA
> ops either don't exist at all (most s390) or do something nontrivial.
> In this case, we must either override the DMA ops or just not use
> them.
> 
> c) Virtio devices that use bus addressing.  This includes everything
> on Xen (because the "physical" addresses are nonsense) and any actual
> physical PCI device that speaks virtio on a system with an IOMMU.  In
> this case, we must use the DMA ops.
> 
> The issue is that, on systems with DMA ops that do something, we need
> to make sure that we know whether we're in case (b) or (c).  In these
> patches, I've made the assumption that, if the virtio devices lives on
> the PCI bus, then it uses the same type of addressing that any other
> device on that PCI bus would use.
> 
> On x86, at least, I doubt that we'll ever see a physically addressed
> PCI virtio device for which ACPI advertises an IOMMU, since any sane
> hypervisor will just not advertise an IOMMU for the virtio device.

How exactly does one not advertise an IOMMU for a specific
device? Could you please clarify?

> But are there arm64 or PPC guests that use virtio_pci, that have
> IOMMUs, and that will malfunction if the virtio_pci driver ends up
> using the IOMMU?  I certainly hope not, since these systems might be
> very hard-pressed to work right if someone plugged in a physical
> virtio-speaking PCI device.

One simple fix is to defer this all until virtio 1.0.
virtio 1.0 has an alternative set of IDs for virtio pci,
that can be used if you are making an incompatible change.
We can use that if there's an iommu.


> >
> > The difficulty here resides in the fact that we have never completely
> > made the dma_map_ops generic. The ops themselves are defined generically
> > as are the dma_map_* interfaces based on them, but the location of the
> > ops pointer is still more/less arch specific and some architectures
> > still chose not to use that indirection at all I believe.
> >
> 
> I'd be happy to update the patches if someone does this, but I don't
> really want to attack the DMA API on all architectures right now.  In
> the mean time, at least s390 requires that we be able to compile out
> the DMA API calls.  I'd rather see s390 provide working no-op dma ops
> for all of the struct devices that provide virtio interfaces.
> 
> On a related note, shouldn't virtio be doing something to provide dma
> ops to the virtio device and any of its children?  I don't know how it
> would even try to do this, given how architecture-dependent this code
> currently is.  Calling dma_map_single on the virtio device (as opposed
> to its parent) is currently likely to crash on x86.  Fortunately,
> nothing does this.
> 
> --Andy

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 20:53     ` Benjamin Herrenschmidt
  2014-09-02 20:56       ` Konrad Rzeszutek Wilk
@ 2014-09-02 21:37       ` Andy Lutomirski
  2014-09-02 22:10         ` Benjamin Herrenschmidt
  2014-09-03  6:42         ` Rusty Russell
  1 sibling, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-02 21:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, Sep 2, 2014 at 1:53 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2014-09-01 at 22:55 -0700, Andy Lutomirski wrote:
>>
>> On x86, at least, I doubt that we'll ever see a physically addressed
>> PCI virtio device for which ACPI advertises an IOMMU, since any sane
>> hypervisor will just not advertise an IOMMU for the virtio device.
>> But are there arm64 or PPC guests that use virtio_pci, that have
>> IOMMUs, and that will malfunction if the virtio_pci driver ends up
>> using the IOMMU?  I certainly hope not, since these systems might be
>> very hard-pressed to work right if someone plugged in a physical
>> virtio-speaking PCI device.
>
> It will definitely not work on ppc64. We always have IOMMUs on pseries,
> all PCI busses do, and because it's a paravirtualized environment,
> mapping/unmapping pages means hypercalls -> expensive.
>
> But our virtio implementation bypasses it in qemu, so if virtio-pci
> starts using the DMA mapping API without changing the DMA ops under the
> hood, it will break for us.
>

Let's take a step back from the implementation.  What is a driver
for a virtio PCI device (i.e. a PCI device with vendor 0x1af4)
supposed to do on ppc64?

It can send the device physical addresses and ignore the normal PCI
DMA semantics, which is what the current virtio_pci driver does.  This
seems like a layering violation, and this won't work if the device is
a real PCI device.  Alternatively, it can treat the device like any
other PCI device and use the IOMMU.  This is a bit slower, and it is
also incompatible with current hypervisors.

There really are virtio devices that are pieces of silicon and not
figments of a hypervisor's imagination [1].  We could teach virtio_pci
to use physical addressing on ppc64, but that seems like a pretty
awful hack, and it'll start needing quirks as soon as someone tries to
plug a virtio-speaking PCI card into a ppc64 machine.

Ideas?  x86 and arm seem to be safe here, since AFAIK there is no such
thing as a physically addressed virtio "PCI" device on a bus with an
IOMMU on x86, arm, or arm64.

[1] https://lwn.net/Articles/580186/

> Cheers,
> Ben.
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 21:10     ` Michael S. Tsirkin
@ 2014-09-02 21:49       ` Andy Lutomirski
  0 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-02 21:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, Sep 2, 2014 at 2:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Sep 01, 2014 at 10:55:29PM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 1, 2014 at 3:16 PM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>> > On Mon, 2014-09-01 at 10:39 -0700, Andy Lutomirski wrote:
>> >> Changes from v1:
>> >>  - Using the DMA API is optional now.  It would be nice to improve the
>> >>    DMA API to the point that it could be used unconditionally, but s390
>> >>    proves that we're not there yet.
>> >>  - Includes patch 4, which fixes DMA debugging warnings from virtio_net.
>> >
>> > I'm not sure if you saw my reply on the other thread but I have a few
>> > comments based on the above "it would be nice if ..."
>> >
>>
>> Yeah, sorry, I sort of thought I responded, but I didn't do a very good job.
>>
>> > So here we have both a yes and a no :-)
>> >
>> > It would be nice to avoid those if () games all over and indeed just
>> > use the DMA API, *however* we most certainly don't want to actually
>> > create IOMMU mappings for the KVM virtio case. This would be a massive
>> > loss in performance on several platforms and generally doesn't make
>> > much sense.
>> >
>> > However, we can still use the API without that on any architecture
>> > where the dma mapping API ends up calling the generic dma_map_ops,
>> > it becomes just a matter of virtio setting up some special "nop" ops
>> > when needed.
>>
>> I'm not quite convinced that this is a good idea.  I think that there
>> are three relevant categories of virtio devices:
>>
>> a) Any virtio device where the normal DMA ops are nops.  This includes
>> x86 without an IOMMU (e.g. in a QEMU/KVM guest), 32-bit ARM, and
>> probably many other architectures.  In this case, what we do only
>> matters for performance, not for correctness.  Ideally the arch DMA
>> ops are fast.
>>
>> b) Virtio devices that use physical addressing on systems where DMA
>> ops either don't exist at all (most s390) or do something nontrivial.
>> In this case, we must either override the DMA ops or just not use
>> them.
>>
>> c) Virtio devices that use bus addressing.  This includes everything
>> on Xen (because the "physical" addresses are nonsense) and any actual
>> physical PCI device that speaks virtio on a system with an IOMMU.  In
>> this case, we must use the DMA ops.
>>
>> The issue is that, on systems with DMA ops that do something, we need
>> to make sure that we know whether we're in case (b) or (c).  In these
>> patches, I've made the assumption that, if the virtio devices lives on
>> the PCI bus, then it uses the same type of addressing that any other
>> device on that PCI bus would use.
>>
>> On x86, at least, I doubt that we'll ever see a physically addressed
>> PCI virtio device for which ACPI advertises an IOMMU, since any sane
>> hypervisor will just not advertise an IOMMU for the virtio device.
>
> How exactly does one not advertise an IOMMU for a specific
> device? Could you please clarify?

See https://software.intel.com/en-us/blogs/2009/09/11/decoding-the-dmar-tables-in-acpiiommu-part-2

I think that all that needs to happen is for ACPI to not list the
device in the scope of any drhd unit.  I don't know whether this works
correctly, but it looks like the iommu_dummy and the
init_no_remapping_devices code in intel-iommu.c exists for almost
exactly this purpose.
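
A rough sketch of that lookup, for illustration only (the real logic lives
in drivers/iommu/dmar.c, e.g. dmar_find_matched_drhd_unit(); the scope
helper below is a made-up name):

    #include <linux/dmar.h>
    #include <linux/pci.h>

    static bool device_behind_intel_iommu(struct pci_dev *pdev)
    {
            struct dmar_drhd_unit *drhd;

            /* A device is only remapped if some DRHD unit claims it,
             * either via INCLUDE_PCI_ALL or its device scope. */
            for_each_drhd_unit(drhd) {
                    if (drhd->include_all ||
                        drhd_scope_contains(drhd, pdev)) /* hypothetical */
                            return true;
            }
            return false;
    }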

>
>> But are there arm64 or PPC guests that use virtio_pci, that have
>> IOMMUs, and that will malfunction if the virtio_pci driver ends up
>> using the IOMMU?  I certainly hope not, since these systems might be
>> very hard-pressed to work right if someone plugged in a physical
>> virtio-speaking PCI device.
>
> One simple fix is to defer this all until virtio 1.0.
> virtio 1.0 has an alternative set of IDs for virtio pci,
> that can be used if you are making an incompatible change.
> We can use that if there's an iommu.

How?  If someone builds a physical device compliant with the virtio
1.0 specification, how can that device know whether it's behind an
IOMMU?  The IOMMU is part of the host (or Xen, sort of), not the PCI
device.  I suppose that virtio 1.0 could add a bit indicating that the
virtio device is a physical piece of hardware (presumably this should
be PCI-specific).

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 21:37       ` Andy Lutomirski
@ 2014-09-02 22:10         ` Benjamin Herrenschmidt
  2014-09-02 23:11           ` Andy Lutomirski
  2014-09-03  6:42         ` Rusty Russell
  1 sibling, 1 reply; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-02 22:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, 2014-09-02 at 14:37 -0700, Andy Lutomirski wrote:

> Let's take a step back from the implementation.  What is a driver
> for a virtio PCI device (i.e. a PCI device with vendor 0x1af4)
> supposed to do on ppc64?

Today, it's supposed to send guest physical addresses. We can make that
optional via some nego or capabilities to support more esoteric setups
but for backward compatibility, this must remain the default behaviour.

> It can send the device physical addresses and ignore the normal PCI
> DMA semantics, which is what the current virtio_pci driver does.  This
> seems like a layering violation, and this won't work if the device is
> a real PCI device.

Correct, it's an original virtio implementation choice for maximum
performances.

>   Alternatively, it can treat the device like any
> other PCI device and use the IOMMU.  This is a bit slower, and it is
> also incompatible with current hypervisors.

This is potentially a LOT slower and is backward incompatible with
current qemu/KVM and kvmtool yes.

The slowness can be alleviated using various techniques, for example on
ppc64 we can create a DMA window that contains a permanent mapping of
the entire guest space, so we could use such a thing for virtio.

Another thing we could potentially do is advertise via the device-tree
that such a bus uses a direct mapping and have the guest use appropriate
"direct map" dma_ops.

But we need to keep backward compatibility with existing
guest/hypervisors so the default must remain as it is.

> There really are virtio devices that are pieces of silicon and not
> figments of a hypervisor's imagination [1].

I am aware of that. There are also attempts at using virtio to make two
machines communicate via a PCIe link (either with one as endpoint of the
other or via a non-transparent switch).

Which is why I'm not objecting to what you are trying to do ;-)

My suggestion was that it might be a cleaner approach to do that by
having the individual virtio drivers always use the dma_map_* API, and
limiting the kludgery to a combination of virtio_pci "core" and arch
code by selecting an appropriate set of dma_map_ops, defaulting with a
"transparent" (or direct) one as our current default case (and thus
overriding the iommu ones provided by the arch).
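
For concreteness, a minimal sketch of what such a "transparent"/direct set
of dma_map_ops could look like (illustrative only, not from this series; a
real version would also need map_sg and the sync/mapping_error hooks, and
the exact signatures depend on the kernel version):

    #include <linux/dma-mapping.h>

    static dma_addr_t direct_map_page(struct device *dev, struct page *page,
                    unsigned long offset, size_t size,
                    enum dma_data_direction dir, struct dma_attrs *attrs)
    {
            /* bus address == physical address, no IOMMU involved */
            return page_to_phys(page) + offset;
    }

    static void direct_unmap_page(struct device *dev, dma_addr_t addr,
                    size_t size, enum dma_data_direction dir,
                    struct dma_attrs *attrs)
    {
            /* nothing to tear down for a 1:1 mapping */
    }

    static struct dma_map_ops virtio_direct_dma_ops = {
            .map_page   = direct_map_page,
            .unmap_page = direct_unmap_page,
    };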

>   We could teach virtio_pci
> to use physical addressing on ppc64, but that seems like a pretty
> awful hack, and it'll start needing quirks as soon as someone tries to
> plug a virtio-speaking PCI card into a ppc64 machine.

But x86_64 is the same no ? The day it starts growing an iommu emulation
in qemu (and I've heard it's happening) it will still want to do direct
bypass for virtio for performance.

> Ideas?  x86 and arm seem to be safe here, since AFAIK there is no such
> thing as a physically addressed virtio "PCI" device on a bus with an
> IOMMU on x86, arm, or arm64.

Today .... I wouldn't bet on it to remain that way. The qemu
implementation of virtio is physically addressed and you don't
necessarily have a choice of which device gets an iommu and which not.

Cheers,
Ben.

> [1] https://lwn.net/Articles/580186/
> 
> > Cheers,
> > Ben.
> >
> >
> 
> 
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 22:10         ` Benjamin Herrenschmidt
@ 2014-09-02 23:11           ` Andy Lutomirski
  2014-09-02 23:20             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-02 23:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, Sep 2, 2014 at 3:10 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2014-09-02 at 14:37 -0700, Andy Lutomirski wrote:
>
>> Let's take a step back from the implementation.  What is a driver
>> for a virtio PCI device (i.e. a PCI device with vendor 0x1af4)
>> supposed to do on ppc64?
>
> Today, it's supposed to send guest physical addresses. We can make that
> optional via some nego or capabilities to support more esoteric setups
> but for backward compatibility, this must remain the default behaviour.

I think it only needs to remain the default in cases where the
alternative (bus addressing) won't work.  I think that, so far, this
is just ppc64.  But see below...

>
> My suggestion was that it might be a cleaner approach to do that by
> having the individual virtio drivers always use the dma_map_* API, and
> limiting the kludgery to a combination of virtio_pci "core" and arch
> code by selecting an appropriate set of dma_map_ops, defaulting with a
> "transparent" (or direct) one as our current default case (and thus
> overriding the iommu ones provided by the arch).

I think the cleanest way of all would be to get the bus drivers to do
the right thing so that all of the virtio code can just use the dma
api.  I don't know whether this is achievable.

>
>>   We could teach virtio_pci
>> to use physical addressing on ppc64, but that seems like a pretty
>> awful hack, and it'll start needing quirks as soon as someone tries to
>> plug a virtio-speaking PCI card into a ppc64 machine.
>
> But x86_64 is the same no ? The day it starts growing an iommu emulation
> in qemu (and I've heard it's happening) it will still want to do direct
> bypass for virtio for performance.

I don't think so.  I would argue that it's a straight-up bug for QEMU
to expose a physically-addressed virtio-pci device to the guest behind
an emulated IOMMU.  QEMU may already be doing that on ppc64, but it
isn't on x86_64 or arm (yet).

On x86_64, I'm pretty sure that QEMU can emulate an IOMMU for
everything except the virtio-pci devices.  The ACPI DMAR stuff is
quite expressive.

On ARM, I hope the QEMU will never implement a PCI IOMMU.  As far as I
could tell when I looked last week, none of the newer QEMU-emulated
ARM machines even support PCI.  Even if QEMU were to implement a PCI
IOMMU on some future ARM machine, it could continue using virtio-mmio
for virtio devices.

So ppc might actually be the only system that has or will have
physically-addressed virtio PCI devices that are behind an IOMMU.  Can
this be handled in a ppc64-specific way?  Is there any way that the
kernel can distinguish a QEMU-provided virtio PCI device from a
physical PCIe thing?  It would be kind of nice to address this without
adding complexity to the virtio spec.  Maybe virtio 1.0 devices could
be assumed to use bus addressing unless a new devicetree property says
otherwise.
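
A sketch of what that could look like on the guest side (the property name
here is invented purely for illustration; nothing like it exists today):

    #include <linux/of.h>

    static bool virtio_bus_uses_phys_addrs(struct device *dev)
    {
            /* Hypothetical: let the platform's device tree opt a bus back
             * into guest-physical addressing for virtio. */
            return dev->of_node &&
                   of_property_read_bool(dev->of_node,
                                         "virtio,physical-addressing");
    }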

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 23:11           ` Andy Lutomirski
@ 2014-09-02 23:20             ` Benjamin Herrenschmidt
  2014-09-02 23:42               ` Andy Lutomirski
  2014-09-03  7:43               ` Paolo Bonzini
  0 siblings, 2 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-02 23:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, 2014-09-02 at 16:11 -0700, Andy Lutomirski wrote:

> I don't think so.  I would argue that it's a straight-up bug for QEMU
> to expose a physically-addressed virtio-pci device to the guest behind
> an emulated IOMMU.  QEMU may already be doing that on ppc64, but it
> isn't on x86_64 or arm (yet).

Last I looked, it does on everything; it bypasses the DMA layer in qemu,
which is where IOMMUs are implemented.

> On x86_64, I'm pretty sure that QEMU can emulate an IOMMU for
> everything except the virtio-pci devices.  The ACPI DMAR stuff is
> quite expressive.

Well, *except* virtio, exactly...

> On ARM, I hope the QEMU will never implement a PCI IOMMU.  As far as I
> could tell when I looked last week, none of the newer QEMU-emulated
> ARM machines even support PCI.  Even if QEMU were to implement a PCI
> IOMMU on some future ARM machine, it could continue using virtio-mmio
> for virtio devices.

Possibly...

> So ppc might actually be the only system that has or will have
> physically-addressed virtio PCI devices that are behind an IOMMU.  Can
> this be handled in a ppc64-specific way?

I wouldn't be so certain, as I said, the way virtio is implemented in
qemu bypasses the DMA layer which is where IOMMUs sit. The fact that
currently x86 doesn't put an IOMMU there is not even guaranteed, is it ?
What happens if you try to mix and match virtio and other emulated
devices that require the iommu on the same bus ?

If we could discriminate virtio devices to a specific host bridge and
guarantee no mix & match, we could probably add a concept of
"IOMMU-less" bus but that would require guest changes which limits the
usefulness.

>   Is there any way that the
> kernel can distinguish a QEMU-provided virtio PCI device from a
> physical PCIe thing? 

Not with existing guests which cannot be changed. Existing distros are
out with those drivers. If we add a backward compatibility mechanism,
then we could add something yes, provided we can segregate virtio onto a
dedicated host bridge (which can be a problem with the libvirt
trainwreck...)

>  It would be kind of nice to address this without
> adding complexity to the virtio spec.  Maybe virtio 1.0 devices could
> be assumed to use bus addressing unless a new devicetree property says
> otherwise.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 23:20             ` Benjamin Herrenschmidt
@ 2014-09-02 23:42               ` Andy Lutomirski
  2014-09-03  0:25                 ` Benjamin Herrenschmidt
  2014-09-03  7:43               ` Paolo Bonzini
  1 sibling, 1 reply; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-02 23:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, Sep 2, 2014 at 4:20 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
> On Tue, 2014-09-02 at 16:11 -0700, Andy Lutomirski wrote:
>
>> I don't think so.  I would argue that it's a straight-up bug for QEMU
>> to expose a physically-addressed virtio-pci device to the guest behind
>> an emulated IOMMU.  QEMU may already be doing that on ppc64, but it
>> isn't on x86_64 or arm (yet).
>
> Last I looked, it does on everything; it bypasses the DMA layer in qemu,
> which is where IOMMUs are implemented.

I believe you, but I'm not convinced that this means much from the
guest's POV, except on ppc64.

>
>> On x86_64, I'm pretty sure that QEMU can emulate an IOMMU for
>> everything except the virtio-pci devices.  The ACPI DMAR stuff is
>> quite expressive.
>
> Well, *except* virtio, exactly...

But there aren't any ACPI systems with both virtio-pci and IOMMUs,
right?  So we could say that, henceforth, ACPI systems must declare
whether virtio-pci devices live behind IOMMUs without breaking
backward compatibility.

>
>> On ARM, I hope the QEMU will never implement a PCI IOMMU.  As far as I
>> could tell when I looked last week, none of the newer QEMU-emulated
>> ARM machines even support PCI.  Even if QEMU were to implement a PCI
>> IOMMU on some future ARM machine, it could continue using virtio-mmio
>> for virtio devices.
>
> Possibly...
>
>> So ppc might actually be the only system that has or will have
>> physically-addressed virtio PCI devices that are behind an IOMMU.  Can
>> this be handled in a ppc64-specific way?
>
> I wouldn't be so certain, as I said, the way virtio is implemented in
> qemu bypasses the DMA layer which is where IOMMUs sit. The fact that
> currently x86 doesn't put an IOMMU there is not even guaranteed, is it ?
> What happens if you try to mix and match virtio and other emulated
> devices that require the iommu on the same bus ?

AFAIK QEMU doesn't support IOMMUs at all on x86, so current versions
of QEMU really do guarantee that virtio-pci on x86 has no IOMMU, even
if that guarantee is purely accidental.

>
> If we could discriminate virtio devices to a specific host bridge and
> guarantee no mix & match, we could probably add a concept of
> "IOMMU-less" bus but that would require guest changes which limits the
> usefulness.
>
>>   Is there any way that the
>> kernel can distinguish a QEMU-provided virtio PCI device from a
>> physical PCIe thing?
>
> Not with existing guests which cannot be changed. Existing distros are
> out with those drivers. If we add a backward compatibility mechanism,
> then we could add something yes, provided we can segregate virtio onto a
> dedicated host bridge (which can be a problem with the libvirt
> trainwreck...)

Ugh.

So here's an ugly proposal:

Step 1: Make virtio-pci use the DMA API only on x86.  This will at
least fix Xen and people experimenting with virtio hardware on x86,
and it won't break anything, since there are no emulated IOMMUs on
x86.

Step 2: Update the virtio spec.  Virtio 1.0 PCI devices should set a
new bit if they are physically addressed.  If that bit is clear, then
the device is assumed to be addressed in accordance with the
platform's standard addressing model for PCI.  Presumably this would
be something like VIRTIO_F_BUS_ADDRESSING = 33, and the spec would say
something like "Physical devices compatible with this specification
MUST offer VIRTIO_F_BUS_ADDRESSING.  Drivers MUST implement this
feature."  Alternatively, this could live in a PCI configuration
capability.

Step 3: Update virtio-pci to use the DMA API for all devices on x86
and for devices that advertise bus addressing on other architectures.

I think this proposal will work, but I also think it sucks and I'd
really like to see a better counter-proposal.
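
For illustration, the driver-side check for step 3 could be as small as
this (VIRTIO_F_BUS_ADDRESSING and the helper are hypothetical, per the
proposal above, and bit 33 assumes the virtio 1.0 64-bit feature space):

    #include <linux/virtio_config.h>

    #define VIRTIO_F_BUS_ADDRESSING 33      /* hypothetical, see step 2 */

    static bool virtio_use_dma_api(struct virtio_device *vdev)
    {
            if (IS_ENABLED(CONFIG_X86))     /* step 1: always on x86 */
                    return true;
            /* step 3: other archs honour the feature bit */
            return virtio_has_feature(vdev, VIRTIO_F_BUS_ADDRESSING);
    }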


--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 23:42               ` Andy Lutomirski
@ 2014-09-03  0:25                 ` Benjamin Herrenschmidt
  2014-09-03  0:32                   ` Andy Lutomirski
  2014-09-03  7:47                   ` Paolo Bonzini
  0 siblings, 2 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-03  0:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, 2014-09-02 at 16:42 -0700, Andy Lutomirski wrote:

> But there aren't any ACPI systems with both virtio-pci and IOMMUs,
> right?  So we could say that, henceforth, ACPI systems must declare
> whether virtio-pci devices live behind IOMMUs without breaking
> backward compatibility.

I don't know for sure whether that's the case and whether we can rely on
that not happening, we'll need x86 folks opinion here.

> >> On ARM, I hope the QEMU will never implement a PCI IOMMU.  As far as I
> >> could tell when I looked last week, none of the newer QEMU-emulated
> >> ARM machines even support PCI.  Even if QEMU were to implement a PCI
> >> IOMMU on some future ARM machine, it could continue using virtio-mmio
> >> for virtio devices.
> >
> > Possibly...
> >
> >> So ppc might actually be the only system that has or will have
> >> physically-addressed virtio PCI devices that are behind an IOMMU.  Can
> >> this be handled in a ppc64-specific way?
> >
> > I wouldn't be so certain, as I said, the way virtio is implemented in
> > qemu bypasses the DMA layer which is where IOMMUs sit. The fact that
> > currently x86 doesn't put an IOMMU there is not even guaranteed, is it ?
> > What happens if you try to mix and match virtio and other emulated
> > devices that require the iommu on the same bus ?
> 
> AFAIK QEMU doesn't support IOMMUs at all on x86, so current versions
> of QEMU really do guarantee that virtio-pci on x86 has no IOMMU, even
> if that guarantee is purely accidental.

Right.

> > If we could discriminate virtio devices to a specific host bridge and
> > guarantee no mix & match, we could probably add a concept of
> > "IOMMU-less" bus but that would require guest changes which limits the
> > usefulness.
> >
> >>   Is there any way that the
> >> kernel can distinguish a QEMU-provided virtio PCI device from a
> >> physical PCIe thing?
> >
> > Not with existing guests which cannot be changed. Existing distros are
> > out with those drivers. If we add a backward compatibility mechanism,
> > then we could add something yes, provided we can segregate virtio onto a
> > dedicated host bridge (which can be a problem with the libvirt
> > trainwreck...)
> 
> Ugh.
> 
> So here's an ugly proposal:
> 
> Step 1: Make virtio-pci use the DMA API only on x86.  This will at
> least fix Xen and people experimenting with virtio hardware on x86,
> and it won't break anything, since there are no emulated IOMMUs on
> x86.

I think we should make all virtio drivers use the DMA API and just have
a different set of dma_ops. We can make a simple ifdef powerpc if needed
in virtio-pci that forces the dma_ops of the device to some direct
"bypass" ops at init time.

That way no need to select whether to use the DMA API or not, just
always use it, and add a tweak to replace the DMA ops with the direct
ones on the archs/platforms that need that. That was my original
proposal and I still think it's the best approach.

> Step 2: Update the virtio spec.  Virtio 1.0 PCI devices should set a
> new bit if they are physically addressed.  If that bit is clear, then
> the device is assumed to be addressed in accordance with the
> platform's standard addressing model for PCI.  Presumably this would
> be something like VIRTIO_F_BUS_ADDRESSING = 33, and the spec would say
> something like "Physical devices compatible with this specification
> MUST offer VIRTIO_F_BUS_ADDRESSING.  Drivers MUST implement this
> feature."  Alternatively, this could live in a PCI configuration
> capability.

I'll let you sort that out with Rusty but it makes sense.

> Step 3: Update virtio-pci to use the DMA API for all devices on x86
> and for devices that advertise bus addressing on other architectures.
> 
> I think this proposal will work, but I also think it sucks and I'd
> really like to see a better counter-proposal.

As I said, make it always use the DMA API, but add a quirk to replace
the dma_ops with some NULL ops on platforms that need it.

The only issue with that is the location of the dma ops is arch
specific, so that one function will contain some ifdefs, but the rest of
the code can just use the DMA API.
 
Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  0:25                 ` Benjamin Herrenschmidt
@ 2014-09-03  0:32                   ` Andy Lutomirski
  2014-09-03  0:43                     ` Benjamin Herrenschmidt
  2014-09-03  7:47                   ` Paolo Bonzini
  1 sibling, 1 reply; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-03  0:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, Sep 2, 2014 at 5:25 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
> On Tue, 2014-09-02 at 16:42 -0700, Andy Lutomirski wrote:
>> So here's an ugly proposal:
>>
>> Step 1: Make virtio-pci use the DMA API only on x86.  This will at
>> least fix Xen and people experimenting with virtio hardware on x86,
>> and it won't break anything, since there are no emulated IOMMUs on
>> x86.
>
> I think we should make all virtio drivers use the DMA API and just have
> a different set of dma_ops. We can make a simple ifdef powerpc if needed
> in virtio-pci that forces the dma_ops of the device to some direct
> "bypass" ops at init time.
>
> That way no need to select whether to use the DMA API or not, just
> always use it, and add a tweak to replace the DMA ops with the direct
> ones on the archs/platforms that need that. That was my original
> proposal and I still think it's the best approach.

I agree *except* that implementing it will be a real PITA and (I
think) can't be done without changing code in arch/.  My patches plus
an ifdef powerpc will be functionally equivalent, just uglier.

>
> As I said, make it always use the DMA API, but add a quirk to replace
> the dma_ops with some NULL ops on platforms that need it.
>
> The only issue with that is the location of the dma ops is arch
> specific, so that one function will contain some ifdefs, but the rest of
> the code can just use the DMA API.

Bigger quirk: on a standard s390 virtio guest configuration,
dma_map_single etc will fail to link.  I tried this in v1 of these
patches.  So we can poke at the archdata all day, but we can't build a
kernel like that :(

So until the dma_ops pointer move into struct device and
CONFIG_HAS_DMA becomes mandatory (or mandatory enough that virtio can
depend on it), I don't think we can do it this way.

I'll send a v5 that is the same as v4 except with physical addressing
hardcoded in for powerpc.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  0:32                   ` Andy Lutomirski
@ 2014-09-03  0:43                     ` Benjamin Herrenschmidt
  2014-09-04  2:03                       ` Andy Lutomirski
  0 siblings, 1 reply; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-03  0:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, 2014-09-02 at 17:32 -0700, Andy Lutomirski wrote:
> 
> I agree *except* that implementing it will be a real PITA and (I
> think) can't be done without changing code in arch/.  My patches plus
> an ifdef powerpc will be functionally equivalent, just uglier.

So for powerpc, it's a 2 liner inside virtio-pci, but yes, it might be
more of a problem for s390, I'm not too sure what they do in that area.
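
Roughly like this, presumably (sketch only; set_dma_ops() and
dma_direct_ops are the existing powerpc helpers, the wrapper name is made
up, and pci_dev would be virtio_pci's probe argument):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    static void vp_quirk_dma_ops(struct pci_dev *pci_dev)
    {
    #ifdef CONFIG_PPC64
            /* keep today's behaviour: virtio bypasses the guest IOMMU */
            set_dma_ops(&pci_dev->dev, &dma_direct_ops);
    #endif
    }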

> Bigger quirk: on a standard s390 virtio guest configuration,
> dma_map_single etc will fail to link. 

Yuck

>  I tried this in v1 of these
> patches.  So we can poke at the archdata all day, but we can't build a
> kernel like that :(

I would like the s390 people to chime in here, it still looks like the
best way to go if they can fix things on their side :-)

> So until the dma_ops pointer move into struct device and
> CONFIG_HAS_DMA becomes mandatory (or mandatory enough that virtio can
> depend on it), I don't think we can do it this way.

I see, it's a bummer because it would be a lot cleaner.

> I'll send a v5 that is the same as v4 except with physical addressing
> hardcoded in for powerpc.

Thanks. That will do for now, but ideally we want to make it a function
of some flag from the implementation, so let's see what Rusty has to
say.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 21:37       ` Andy Lutomirski
  2014-09-02 22:10         ` Benjamin Herrenschmidt
@ 2014-09-03  6:42         ` Rusty Russell
  2014-09-03  7:50           ` Andy Lutomirski
  2014-09-03 12:51           ` Michael S. Tsirkin
  1 sibling, 2 replies; 108+ messages in thread
From: Rusty Russell @ 2014-09-03  6:42 UTC (permalink / raw)
  To: Andy Lutomirski, Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390, virtio-dev

Andy Lutomirski <luto@amacapital.net> writes:
> There really are virtio devices that are pieces of silicon and not
> figments of a hypervisor's imagination [1].

Hi Andy,

        As you're discovering, there's a reason no one has done the DMA
API before.

So the problem is that ppc64's IOMMU is a platform thing, not a bus
thing.  They really do carve out an exception for virtio devices,
because performance (LOTS of performance).  It remains to be seen if
other platforms have the same performance issues, but in absence of
other evidence, the answer is yes.

It's a hack.  But having specific virtual-only devices is an even
bigger hack.

Physical virtio devices have been talked about, but don't actually exist
in Real Life.  And someone making a virtio PCI card is going to have serious
performance issues: mainly because they'll want the rings in the card's
MMIO region, not allocated by the driver.  Being broken on PPC is really
the least of their problems.

So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
exposed by virtio_pci.c is out.

I think the best approach is to have a new feature bit (25 is free),
VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
use the mapping for the bus it is on.  A real device would set this,
or it won't work behind an IOMMU.  A Xen device would also set this.

Thoughts?
Rusty.

PS.  I cc'd OASIS virtio-dev: it's subscriber only for IP reasons (to
     subscribe you have to promise we can use your suggestion in the
     standard).  Feel free to remove in any replies, but it's part of
     the world we live in...

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-02 23:20             ` Benjamin Herrenschmidt
  2014-09-02 23:42               ` Andy Lutomirski
@ 2014-09-03  7:43               ` Paolo Bonzini
  1 sibling, 0 replies; 108+ messages in thread
From: Paolo Bonzini @ 2014-09-03  7:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, linux390

Il 03/09/2014 01:20, Benjamin Herrenschmidt ha scritto:
> I wouldn't be so certain, as I said, the way virtio is implemented in
> qemu bypasses the DMA layer which is where IOMMUs sit. The fact that
> currently x86 doesn't put an IOMMU there is not even guaranteed, is it ?
> What happens if you try to mix and match virtio and other emulated
> devices that require the iommu on the same bus ?

As far as QEMU is concerned, it's trivial to add a property like
"direct-ram-access" that selects whether to bypass the IOMMU or not.
And it would have zero performance cost if direct RAM access is enabled,
compared to the current code.

If possible, I would quirk it in the PPC code.

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  0:25                 ` Benjamin Herrenschmidt
  2014-09-03  0:32                   ` Andy Lutomirski
@ 2014-09-03  7:47                   ` Paolo Bonzini
  2014-09-03  7:52                     ` Andy Lutomirski
  2014-09-03  8:05                     ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 108+ messages in thread
From: Paolo Bonzini @ 2014-09-03  7:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, linux390

Il 03/09/2014 02:25, Benjamin Herrenschmidt ha scritto:
> > But there aren't any ACPI systems with both virtio-pci and IOMMUs,
> > right?  So we could say that, henceforth, ACPI systems must declare
> > whether virtio-pci devices live behind IOMMUs without breaking
> > backward compatibility.
> 
> I don't know for sure whether that's the case and whether we can rely on
> that not happening, we'll need x86 folks opinion here.

IOMMU support for x86 is going to go in this week.

However, it is and likely will remain niche enough that I don't really
care about performance loss from IOMMU support.  If you enable it, you
want it.

So from the QEMU point of view we can simply add the direct-ram-access
property, and have the pseries machine turn it on by default (while
other machines can leave it off by default---they have no IOMMU and thus
no performance cost).

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  6:42         ` Rusty Russell
@ 2014-09-03  7:50           ` Andy Lutomirski
  2014-09-05  2:31             ` Rusty Russell
  2014-09-03 12:51           ` Michael S. Tsirkin
  1 sibling, 1 reply; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-03  7:50 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtio-dev, linux-s390, Michael S. Tsirkin,
	Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Linux Virtualization, Christian Borntraeger, linux390,
	Paolo Bonzini

On Sep 2, 2014 11:53 PM, "Rusty Russell" <rusty@rustcorp.com.au> wrote:
>
> Andy Lutomirski <luto@amacapital.net> writes:
> > There really are virtio devices that are pieces of silicon and not
> > figments of a hypervisor's imagination [1].
>
> Hi Andy,
>
>         As you're discovering, there's a reason no one has done the DMA
> API before.
>
> So the problem is that ppc64's IOMMU is a platform thing, not a bus
> thing.  They really do carve out an exception for virtio devices,
> because performance (LOTS of performance).  It remains to be seen if
> other platforms have the same performance issues, but in absence of
> other evidence, the answer is yes.
>
> It's a hack.  But having specific virtual-only devices is an even
> bigger hack.
>
> Physical virtio devices have been talked about, but don't actually exist
> in Real Life.  And someone making a virtio PCI card is going to have serious
> performance issues: mainly because they'll want the rings in the card's
> MMIO region, not allocated by the driver.  Being broken on PPC is really
> the least of their problems.
>
> So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
> though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
> a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
> exposed by virtio_pci.c is out.

Xen does expose dma_ops.  The trick is knowing when to use it.

>
> I think the best approach is to have a new feature bit (25 is free),
> VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
> use the mapping for the bus it is on.  A real device would set this,
> or it won't work behind an IOMMU.  A Xen device would also set this.

The devices I care about aren't actually Xen devices.  They're devices
supplied by QEMU/KVM, booting a Xen hypervisor, which in turn passes
the virtio device (along with every other PCI device) through to dom0.
So this is exactly the same virtio device that regular x86 KVM guests
would see.  The reason that current code fails is that Xen guest
physical addresses aren't the same as the addresses seen by the outer
hypervisor.

These devices don't know that physical addresses != bus addresses, so
they can't advertise that fact.

If we ever end up with a virtio_pci device with physical addressing,
behind an IOMMU (but ignoring it), on Xen, we'll have a problem, since
neither "physical" addressing nor dma ops will work.

That being said, there are also proposals for virtio devices supplied
by Xen dom0 to domU, and these will presumably work the same way,
except that the device implementation will know that it's on Xen.

Grr.  This is mostly a result of the fact that virtio_pci devices
aren't really PCI devices.  I still think that virtio_pci shouldn't
have to worry about this; ideally this would all be handled higher up
in the device hierarchy.  x86 already gets this right.


Are there any hypervisors except PPC that use virtio_pci, have IOMMUs
on the pci slot that virtio_pci lives in, and that use physical
addressing?  If not, I think that just quirking PPC will work (at
least until someone wants IOMMU support in virtio_pci on PPC, in which
case doing something using devicetree seems like a reasonable
solution).

--Andy

>
> Thoughts?
> Rusty.
>
> PS.  I cc'd OASIS virtio-dev: it's subscriber only for IP reasons (to
>      subscribe you have to promise we can use your suggestion in the
>      standard).  Feel free to remove in any replies, but it's part of
>      the world we live in...

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  7:47                   ` Paolo Bonzini
@ 2014-09-03  7:52                     ` Andy Lutomirski
  2014-09-03  8:01                       ` Paolo Bonzini
  2014-09-03  8:05                     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-03  7:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390

On Wed, Sep 3, 2014 at 12:47 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 03/09/2014 02:25, Benjamin Herrenschmidt ha scritto:
>> > But there aren't any ACPI systems with both virtio-pci and IOMMUs,
>> > right?  So we could say that, henceforth, ACPI systems must declare
>> > whether virtio-pci devices live behind IOMMUs without breaking
>> > backward compatibility.
>>
>> I don't know for sure whether that's the case and whether we can rely on
>> that not happening, we'll need x86 folks opinion here.
>
> IOMMU support for x86 is going to go in this week.
>

Can you try to make sure that qemu-system-x86_64 -device iommu -device
virtio-balloon-pci (or whatever the syntax is) doesn't put the
virtio-pci device behind the IOMMU?  Because, if it does, then the
kernel will have to support that, and it'll be messy.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  7:52                     ` Andy Lutomirski
@ 2014-09-03  8:01                       ` Paolo Bonzini
  0 siblings, 0 replies; 108+ messages in thread
From: Paolo Bonzini @ 2014-09-03  8:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390

Il 03/09/2014 09:52, Andy Lutomirski ha scritto:
>> > IOMMU support for x86 is going to go in this week.
>> >
> Can you try to make sure that qemu-system-x86_64 -device iommu -device
> virtio-balloon-pci (or whatever the syntax is) doesn't put the
> virtio-pci device behind the IOMMU?  Because, if it does, then the
> kernel will have to support that, and it'll be messy.

Right now it will not put the device behind the IOMMU, but I'm fairly
sure that the DMAR will show the device as being behind the IOMMU.

We have time till QEMU 2.2 is out to make it use the IOMMU for
virtio-pci devices on x86.  I'm not worried about that.

The virtio-pci devices do set the "bus master" bit in the command
register, right?  I think they do, because otherwise MSIs will not be
received by the guest.
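
(For reference, setting that bit from the driver is just the standard PCI
helper, called somewhere in the probe path; whether and where virtio_pci
does it is exactly the question above:)

    /* pci_dev here is the device being probed */
    pci_set_master(pci_dev);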

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  7:47                   ` Paolo Bonzini
  2014-09-03  7:52                     ` Andy Lutomirski
@ 2014-09-03  8:05                     ` Benjamin Herrenschmidt
  2014-09-03 12:11                       ` Paolo Bonzini
  1 sibling, 1 reply; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-03  8:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Andy Lutomirski, Christian Borntraeger, linux390,
	Linux Virtualization

On Wed, 2014-09-03 at 09:47 +0200, Paolo Bonzini wrote:
> 
> IOMMU support for x86 is going to go in this week.

But won't that break virtio on x86 ? Or will virtio continue bypassing
it ? IE, the guest side virtio doesn't expect an IOMMU and doesn't call
the dma mappings ops.

> However, it is and likely will remain niche enough that I don't really
> care about performance loss from IOMMU support.  If you enable it, you
> want it.
> 
> So from the QEMU point of view we can simply add the direct-ram-access
> property, and have the pseries machine turn it on by default (while
> other machines can leave it off by default---they have no IOMMU and
> thus no performance cost).

Well, it's only for virtio and should be on by default on x86 as well if
an iommu is installed no ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  8:05                     ` Benjamin Herrenschmidt
@ 2014-09-03 12:11                       ` Paolo Bonzini
  2014-09-03 15:07                         ` Andy Lutomirski
  0 siblings, 1 reply; 108+ messages in thread
From: Paolo Bonzini @ 2014-09-03 12:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Andy Lutomirski, Christian Borntraeger, linux390,
	Linux Virtualization

Il 03/09/2014 10:05, Benjamin Herrenschmidt ha scritto:
> On Wed, 2014-09-03 at 09:47 +0200, Paolo Bonzini wrote:
>>
>> IOMMU support for x86 is going to go in this week.
> 
> But won't that break virtio on x86 ? Or will virtio continue bypassing
> it ? IE, the guest side virtio doesn't expect an IOMMU and doesn't call
> the dma mappings ops.
> 
>> However, it is and likely will remain niche enough that I don't really
>> care about performance loss from IOMMU support.  If you enable it, you
>> want it.
>>
>> So from the QEMU point of view we can simply add the direct-ram-access
>> property, and have the pseries machine turn it on by default (while
>> other machines can leave it off by default---they have no IOMMU and
>> thus no performance cost).
> 
> Well, it's only for virtio and should be on by default on x86 as well if
> an iommu is installed no ?

Yes, only for virtio---but for x86 I think it should be off by default,
even if that means virtio+IOMMU requires a new kernel.

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  6:42         ` Rusty Russell
  2014-09-03  7:50           ` Andy Lutomirski
@ 2014-09-03 12:51           ` Michael S. Tsirkin
  2014-09-05  2:32             ` Rusty Russell
  1 sibling, 1 reply; 108+ messages in thread
From: Michael S. Tsirkin @ 2014-09-03 12:51 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390, Andy Lutomirski, virtio-dev

On Wed, Sep 03, 2014 at 04:12:01PM +0930, Rusty Russell wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
> > There really are virtio devices that are pieces of silicon and not
> > figments of a hypervisor's imagination [1].
> 
> Hi Andy,
> 
>         As you're discovering, there's a reason no one has done the DMA
> API before.
> 
> So the problem is that ppc64's IOMMU is a platform thing, not a bus
> thing.  They really do carve out an exception for virtio devices,
> because performance (LOTS of performance).  It remains to be seen if
> other platforms have the same performance issues, but in absence of
> other evidence, the answer is yes.
> 
> It's a hack.  But having specific virtual-only devices is an even
> bigger hack.
>
> Physical virtio devices have been talked about, but don't actually exist
> in Real Life.  And someone making a virtio PCI card is going to have serious
> performance issues: mainly because they'll want the rings in the card's
> MMIO region, not allocated by the driver.

Why? What's wrong with rings in memory?

>  Being broken on PPC is really
> the least of their problems.
> 
> So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
> though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
> a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
> exposed by virtio_pci.c is out.

Well, virtio could probe for Xen; it's not a lot of code.
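
Something along these lines, presumably (sketch; xen_domain() is the
existing helper from <xen/xen.h>, the function name is made up):

    #include <xen/xen.h>

    static bool vp_needs_dma_api(void)
    {
            /* Under Xen, guest-physical addresses are not bus addresses,
             * so buffers must be mapped through the DMA API. */
            return xen_domain();
    }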

> I think the best approach is to have a new feature bit (25 is free),
> VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
> use the mapping for the bus it is on.  A real device would set this,
> or it won't work behind an IOMMU.  A Xen device would also set this.
> 
> Thoughts?
> Rusty.

OK and it should then be active even if guest does not ack
the feature (so in fact, it would have to be a mandatory feature).
That can work, but I still find this a bit inelegant: this is
a property of the platform, not of the device.


> PS.  I cc'd OASIS virtio-dev: it's subscriber only for IP reasons (to
>      subscribe you have to promise we can use your suggestion in the
>      standard).  Feel free to remove in any replies, but it's part of
>      the world we live in...

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03 12:11                       ` Paolo Bonzini
@ 2014-09-03 15:07                         ` Andy Lutomirski
  2014-09-03 15:11                           ` Paolo Bonzini
  2014-09-03 16:39                           ` Michael S. Tsirkin
  0 siblings, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-03 15:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390

On Sep 3, 2014 5:11 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>
> Il 03/09/2014 10:05, Benjamin Herrenschmidt ha scritto:
> > On Wed, 2014-09-03 at 09:47 +0200, Paolo Bonzini wrote:
> >>
> >> IOMMU support for x86 is going to go in this week.
> >
> > But won't that break virtio on x86 ? Or will virtio continue bypassing
> > it ? IE, the guest side virtio doesn't expect an IOMMU and doesn't call
> > the dma mappings ops.
> >
> >> However, it is and likely will remain niche enough that I don't really
> >> care about performance loss from IOMMU support.  If you enable it, you
> >> want it.
> >>
> >> So from the QEMU point of view we can simply add the direct-ram-access
> >> property, and have the pseries machine turn it on by default (while
> >> other machines can leave it off by default---they have no IOMMU and
> >> thus no performance cost).
> >
> > Well, it's only for virtio and should be on by default on x86 as well if
> > an iommu is installed no ?
>
> Yes, only for virtio---but for x86 I think it should be off by default,
> even if that means virtio+IOMMU requires a new kernel.

Just to clarify: is "it" the direct-ram-access property?  If so, I
think I might agree.

Alternatively, could QEMU easily teach the IOMMU code to generate the
ACPI tables such that virtio-pci devices aren't advertised as living
behind the IOMMU?  This would work both with and without my patches.
On the other hand, maybe this gets complicated when hotplug is
involved.

--Andy

>
> Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03 15:07                         ` Andy Lutomirski
@ 2014-09-03 15:11                           ` Paolo Bonzini
  2014-09-03 16:39                           ` Michael S. Tsirkin
  1 sibling, 0 replies; 108+ messages in thread
From: Paolo Bonzini @ 2014-09-03 15:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390

Il 03/09/2014 17:07, Andy Lutomirski ha scritto:
>> > Yes, only for virtio---but for x86 I think it should be off by default,
>> > even if that means virtio+IOMMU requires a new kernel.
> Just to clarify: is "it" the direct-ram-access property?  If so, I
> think I might agree.

Yes.

> Alternatively, could QEMU easily teach the IOMMU code to generate the
> ACPI tables such that virtio-pci devices aren't advertised as living
> behind the IOMMU?  This would work both with and without my patches.
> On the other hand, maybe this gets complicated when hotplug is
> involved.

That could be possible.  For hot-plug, you can simply forbid hotplugging
with direct-ram-access=on.  If you want hotplug, then we should add the
direct-ram-access property to PCI bridges too (with the same limitation
on hotplug).  All devices under such a bridge would be outside the
IOMMU, including virtio devices with direct-ram-access=off.

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03 15:07                         ` Andy Lutomirski
  2014-09-03 15:11                           ` Paolo Bonzini
@ 2014-09-03 16:39                           ` Michael S. Tsirkin
  2014-09-03 20:38                             ` Andy Lutomirski
  1 sibling, 1 reply; 108+ messages in thread
From: Michael S. Tsirkin @ 2014-09-03 16:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Linux Virtualization, Christian Borntraeger, linux390,
	Paolo Bonzini

On Wed, Sep 03, 2014 at 08:07:15AM -0700, Andy Lutomirski wrote:
> On Sep 3, 2014 5:11 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
> >
> > Il 03/09/2014 10:05, Benjamin Herrenschmidt ha scritto:
> > > On Wed, 2014-09-03 at 09:47 +0200, Paolo Bonzini wrote:
> > >>
> > >> IOMMU support for x86 is going to go in this week.
> > >
> > > But won't that break virtio on x86 ? Or will virtio continue bypassing
> > > it ? IE, the guest side virtio doesn't expect an IOMMU and doesn't call
> > > the dma mappings ops.
> > >
> > >> However, it is and likely will remain niche enough that I don't really
> > >> care about performance loss from IOMMU support.  If you enable it, you
> > >> want it.
> > >>
> > >> So from the QEMU point of view we can simply add the direct-ram-access
> > >> property, and have the pseries machine turn it on by default (while
> > >> other machines can leave it off by default---they have no IOMMU and
> > >> thus no performance cost).
> > >
> > > Well, it's only for virtio and should be on by default on x86 as well if
> > > an iommu is installed no ?
> >
> > Yes, only for virtio---but for x86 I think it should be off by default,
> > even if that means virtio+IOMMU requires a new kernel.
> 
> Just to clarify: is "it" the direct-ram-access property?  If so, I
> think I might agree.
> 
> Alternatively, could QEMU easily teach the IOMMU code to generate the
> ACPI tables such that virtio-pci devices aren't advertised as living
> behind the IOMMU?  This would work both with and without my patches.

How exactly does this look in ACPI?

> On the other hand, maybe this gets complicated when hotplug is
> involved.
> 
> --Andy
> 
> >
> > Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03 16:39                           ` Michael S. Tsirkin
@ 2014-09-03 20:38                             ` Andy Lutomirski
  0 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-03 20:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Linux Virtualization, Christian Borntraeger, linux390,
	Paolo Bonzini

On Wed, Sep 3, 2014 at 9:39 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Sep 03, 2014 at 08:07:15AM -0700, Andy Lutomirski wrote:
>> On Sep 3, 2014 5:11 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>> >
>> > Il 03/09/2014 10:05, Benjamin Herrenschmidt ha scritto:
>> > > On Wed, 2014-09-03 at 09:47 +0200, Paolo Bonzini wrote:
>> > >>
>> > >> IOMMU support for x86 is going to go in this week.
>> > >
>> > > But won't that break virtio on x86 ? Or will virtio continue bypassing
>> > > it ? IE, the guest side virtio doesn't expect an IOMMU and doesn't call
>> > > the dma mappings ops.
>> > >
>> > >> However, it is and likely will remain niche enough that I don't really
>> > >> care about performance loss from IOMMU support.  If you enable it, you
>> > >> want it.
>> > >>
>> > >> So from the QEMU point of view we can simply add the direct-ram-access
>> > >> property, and have the pseries machine turn it on by default (while
>> > >> other machines can leave it off by default---they have no IOMMU and
>> > >> thus no performance cost).
>> > >
>> > > Well, it's only for virtio and should be on by default on x86 as well if
>> > > an iommu is installed no ?
>> >
>> > Yes, only for virtio---but for x86 I think it should be off by default,
>> > even if that means virtio+IOMMU requires a new kernel.
>>
>> Just to clarify: is "it" the direct-ram-access property?  If so, I
>> think I might agree.
>>
>> Alternatively, could QEMU easily teach the IOMMU code to generate the
>> ACPI tables such that virtio-pci devices aren't advertised as living
>> behind the IOMMU?  This would work both with and without my patches.
>
> How exactly does this look in ACPI?

I think that all you need is a PCI device or segment that isn't
included in the scope of any DRHD.  I could be wrong.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  0:43                     ` Benjamin Herrenschmidt
@ 2014-09-04  2:03                       ` Andy Lutomirski
  0 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-04  2:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390

On Tue, Sep 2, 2014 at 5:43 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
> On Tue, 2014-09-02 at 17:32 -0700, Andy Lutomirski wrote:
>>
>> I agree *except* that implementing it will be a real PITA and (I
>> think) can't be done without changing code in arch/.  My patches plus
>> an ifdef powerpc will be functionally equivalent, just uglier.
>
> So for powerpc, it's a 2 liner inside virtio-pci, but yes, it might be
> more of a problem for s390, I'm not too sure what they do in that area.
>
>> Bigger quirk: on a standard s390 virtio guest configuration,
>> dma_map_single etc will fail to link.
>
> Yuck
>
>>  I tried this in v1 of these
>> patches.  So we can poke at the archdata all day, but we can't build a
>> kernel like that :(
>
> I would like the s390 people to chime in here, it still looks like the
> best way to go if they can fix things on their side :-)
>
>> So until the dma_ops pointer move into struct device and
>> CONFIG_HAS_DMA becomes mandatory (or mandatory enough that virtio can
>> depend on it), I don't think we can do it this way.
>
> I see, it's a bummer because it would be a lot cleaner.
>
>> I'll send a v5 that is the same as v4 except with physical addressing
>> hardcoded in for powerpc.
>
> Thanks. That will do for now, but ideally we want to make it a function
> of some flag from the implementation, so let's see what Rusty has to
> say.

I've confirmed that ppc64 (on QEMU) breaks without the ppc special
case and that ppc64 keeps working with the special case.  Once Rusty's
patches settle down, I'll rebase onto them and send v5.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03  7:50           ` Andy Lutomirski
@ 2014-09-05  2:31             ` Rusty Russell
  2014-09-05  2:57               ` Andy Lutomirski
                                 ` (2 more replies)
  0 siblings, 3 replies; 108+ messages in thread
From: Rusty Russell @ 2014-09-05  2:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: virtio-dev, linux-s390, Michael S. Tsirkin,
	Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Linux Virtualization, Christian Borntraeger, linux390,
	Paolo Bonzini

Andy Lutomirski <luto@amacapital.net> writes:
> On Sep 2, 2014 11:53 PM, "Rusty Russell" <rusty@rustcorp.com.au> wrote:
>>
>> Andy Lutomirski <luto@amacapital.net> writes:
>> > There really are virtio devices that are pieces of silicon and not
>> > figments of a hypervisor's imagination [1].
>>
>> Hi Andy,
>>
>>         As you're discovering, there's a reason no one has done the DMA
>> API before.
>>
>> So the problem is that ppc64's IOMMU is a platform thing, not a bus
>> thing.  They really do carve out an exception for virtio devices,
>> because performance (LOTS of performance).  It remains to be seen if
>> other platforms have the same performance issues, but in absence of
>> other evidence, the answer is yes.
>>
>> It's a hack.  But having specific virtual-only devices is an even
>> bigger hack.
>>
>> Physical virtio devices have been talked about, but don't actually exist
>> in Real Life.  And someone making a virtio PCI card is going to have serious
>> performance issues: mainly because they'll want the rings in the card's
>> MMIO region, not allocated by the driver.  Being broken on PPC is really
>> the least of their problems.
>>
>> So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
>> though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
>> a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
>> exposed by virtio_pci.c is out.
>
> Xen does expose dma_ops.  The trick is knowing when to use it.
>
>>
>> I think the best approach is to have a new feature bit (25 is free),
>> VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
>> use the mapping for the bus it is on.  A real device would set this,
>> or it won't work behind an IOMMU.  A Xen device would also set this.
>
> The devices I care about aren't actually Xen devices.  They're devices
> supplied by QEMU/KVM, booting a Xen hypervisor, which in turn passes
> the virtio device (along with every other PCI device) through to dom0.
> So this is exactly the same virtio device that regular x86 KVM guests
> would see.  The reason that current code fails is that Xen guest
> physical addresses aren't the same as the addresses seen by the outer
> hypervisor.
>
> These devices don't know that physical addresses != bus addresses, so
> they can't advertise that fact.

Ah, I see.  Then we will need a Xen-specific hack.

> Grr.  This is mostly a result of the fact that virtio_pci devices
> aren't really PCI devices.  I still think that virtio_pci shouldn't
> have to worry about this; ideally this would all be handled higher up
> in the device hierarchy.  x86 already gets this right.

Yes.  Adding a feature to say "I am a real PCI device" is possible, but
has other issues (particularly as Michael Tsirkin pointed out, what do
you do if the driver doesn't understand the feature).

> Are there any hypervisors except PPC that use virtio_pci, have IOMMUs
> on the pci slot that virtio_pci lives in, and that use physical
> addressing?  If not, I think that just quirking PPC will work (at
> least until someone wants IOMMU support in virtio_pci on PPC, in which
> case doing something using devicetree seems like a reasonable
> solution).

We can either patch to make PPC weird or make Xen weird.  I'm on the
fence.

Two questions for Paolo:
1) When QEMU supports IOMMU on x86, will the virtio devices behind it
   respect the IOMMU (do they use the right memory access primitives)?

2) Are we really going to be able to exclude virtio devices from using
   the x86 IOMMU in a portable way which will always work?  If it's
   per-bus granularity, will qemu really put them on their own PCI bus
   and get this right?  Or will it sometimes get it wrong and users will
   end up using virtio devices via IOMMU by accident?

If the answers are both "yes", then x86 is going to be able to use
virtio+IOMMU, so PPC looks like the odd one out.  Otherwise it looks
like we're really going to want to stick with the "ignore IOMMU" rule
until (handwave future), and we make an exception for Xen.
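
For concreteness, the "exception for Xen" alternative could be as small
as the following sketch (assuming a hypothetical vring_use_dma_api()
hook in virtio_ring.c; xen_domain() is the existing helper from
<xen/xen.h>):

	static bool vring_use_dma_api(void)
	{
		/* Under Xen, guest-physical addresses are not bus addresses. */
		if (xen_domain())
			return true;

		/* Historical behaviour everywhere else: bypass any IOMMU. */
		return false;
	}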

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-03 12:51           ` Michael S. Tsirkin
@ 2014-09-05  2:32             ` Rusty Russell
  2014-09-05  3:06               ` Andy Lutomirski
  0 siblings, 1 reply; 108+ messages in thread
From: Rusty Russell @ 2014-09-05  2:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390, Andy Lutomirski, virtio-dev

"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Wed, Sep 03, 2014 at 04:12:01PM +0930, Rusty Russell wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>> > There really are virtio devices that are pieces of silicon and not
>> > figments of a hypervisor's imagination [1].
>> 
>> Hi Andy,
>> 
>>         As you're discovering, there's a reason no one has done the DMA
>> API before.
>> 
>> So the problem is that ppc64's IOMMU is a platform thing, not a bus
>> thing.  They really do carve out an exception for virtio devices,
>> because performance (LOTS of performance).  It remains to be seen if
>> other platforms have the same performance issues, but in absence of
>> other evidence, the answer is yes.
>> 
>> It's a hack.  But having specific virtual-only devices are an even
>> bigger hack.
>> 
>> Physical virtio devices have been talked about, but don't actually exist
>> in Real Life.  And someone a virtio PCI card is going to have serious
>> performance issues: mainly because they'll want the rings in the card's
>> MMIO region, not allocated by the driver.
>
> Why? What's wrong with rings in memory?

AFAICT, the card would have to access guest memory to read it, using
multiple DMA cycles.  That's going to be slow.

>>  Being broken on PPC is really
>> the least of their problems.
>> 
>> So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
>> though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
>> a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
>> exposed by virtio_pci.c is out.
>
> Well virtio could probe for xen, it's not a lot of code.

We could, but I think this is going to be a more general problem in
future.  x86 is heading down the IOMMU path, and they're likely to
suffer similarly.

>> I think the best approach is to have a new feature bit (25 is free),
>> VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
>> use the mapping for the bus it is on.  A real device would set this,
>> or it won't work behind an IOMMU.  A Xen device would also set this.
>> 
>> Thoughts?
>> Rusty.
>
> OK and it should then be active even if guest does not ack
> the feature (so in fact, it would have to be a mandatory feature).
> That can work, but I still find this a bit inelegant: this is
> a property of the platform, not of the device.

True.  If a device needs it though, we're no worse off having a device
which doesn't work when the driver doesn't understand the feature than
we were before.
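
As a sketch, a driver-side consumer of such a mandatory bit would be
tiny (the bit number is the one floated earlier in the thread; the
helper name is illustrative, virtio_has_feature() is the existing API):

	#define VIRTIO_F_USE_BUS_MAPPING	25

	static bool vring_use_bus_mapping(struct virtio_device *vdev)
	{
		return virtio_has_feature(vdev, VIRTIO_F_USE_BUS_MAPPING);
	}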

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-05  2:31             ` Rusty Russell
@ 2014-09-05  2:57               ` Andy Lutomirski
  2014-09-05  5:20                 ` Benjamin Herrenschmidt
                                   ` (2 more replies)
  2014-09-05  5:16               ` Benjamin Herrenschmidt
  2014-09-14  8:58               ` Michael S. Tsirkin
  2 siblings, 3 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-05  2:57 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtio-dev, linux-s390, Michael S. Tsirkin,
	Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Linux Virtualization, Christian Borntraeger, linux390,
	Paolo Bonzini

On Thu, Sep 4, 2014 at 7:31 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>> On Sep 2, 2014 11:53 PM, "Rusty Russell" <rusty@rustcorp.com.au> wrote:
>>>
>>> Andy Lutomirski <luto@amacapital.net> writes:
>>> > There really are virtio devices that are pieces of silicon and not
>>> > figments of a hypervisor's imagination [1].
>>>
>>> Hi Andy,
>>>
>>>         As you're discovering, there's a reason no one has done the DMA
>>> API before.
>>>
>>> So the problem is that ppc64's IOMMU is a platform thing, not a bus
>>> thing.  They really do carve out an exception for virtio devices,
>>> because performance (LOTS of performance).  It remains to be seen if
>>> other platforms have the same performance issues, but in absence of
>>> other evidence, the answer is yes.
>>>
>>> It's a hack.  But having specific virtual-only devices are an even
>>> bigger hack.
>>>
>>> Physical virtio devices have been talked about, but don't actually exist
>>> in Real Life.  And someone a virtio PCI card is going to have serious
>>> performance issues: mainly because they'll want the rings in the card's
>>> MMIO region, not allocated by the driver.  Being broken on PPC is really
>>> the least of their problems.
>>>
>>> So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
>>> though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
>>> a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
>>> exposed by virtio_pci.c is out.
>>
>> Xen does expose dma_ops.  The trick is knowing when to use it.
>>
>>>
>>> I think the best approach is to have a new feature bit (25 is free),
>>> VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
>>> use the mapping for the bus it is on.  A real device would set this,
>>> or it won't work behind an IOMMU.  A Xen device would also set this.
>>
>> The devices I care about aren't actually Xen devices.  They're devices
>> supplied by QEMU/KVM, booting a Xen hypervisor, which in turn passes
>> the virtio device (along with every other PCI device) through to dom0.
>> So this is exactly the same virtio device that regular x86 KVM guests
>> would see.  The reason that current code fails is that Xen guest
>> physical addresses aren't the same as the addresses seen by the outer
>> hypervisor.
>>
>> These devices don't know that physical addresses != bus addresses, so
>> they can't advertise that fact.
>
> Ah, I see.  Then we will need a Xen-specific hack.
>
>> Grr.  This is mostly a result of the fact that virtio_pci devices
>> aren't really PCI devices.  I still think that virtio_pci shouldn't
>> have to worry about this; ideally this would all be handled higher up
>> in the device hierarchy.  x86 already gets this right.
>
> Yes.  Adding a feature to say "I am a real PCI device" is possible, but
> has other issues (particularly as Michael Tsirkin pointed out, what do
> you do if the driver doesn't understand the feature).
>
>> Are there any hypervisors except PPC that use virtio_pci, have IOMMUs
>> on the pci slot that virtio_pci lives in, and that use physical
>> addressing?  If not, I think that just quirking PPC will work (at
>> least until someone wants IOMMU support in virtio_pci on PPC, in which
>> case doing something using devicetree seems like a reasonable
>> solution).
>
> We can either patch to make PPC weird or make Xen weird.  I'm on the
> fence.
>
> Two questions for Paulo:
> 1) When QEMU support IOMMU on x86, will the virtio devices behind it
>    respect the IOMMU (do they use the right memory access primitives?).
>
> 2) Are we really going to be able to exclude virtio devices from using
>    the x86 IOMMU in a portable way which will always work?  If it's
>    per-bus granularity, will qemu really put them on their own PCI bus
>    and get this right?  Or will it sometimes get it wrong and users will
>    end up using virtio devices via IOMMU by accident?
>
> If the answers are both "yes", then x86 is going to be able to use
> virtio+IOMMU, so PPC looks like the odd one out.  Otherwise it looks
> like we're really going to want to stick with the "ignore IOMMU" rule
> until (handwave future), and we make an exception for Xen.

There's a third option: try to make virtio-mmio work everywhere
(except s390), at least in the long run.  This has other benefits: it
makes minimal hypervisors simpler, and I think it'll get rid of the
limits on the number of virtio devices in a system.  ARM is already
going this direction, and I imagine that PPC support would be
straightforward (it's already using devicetree).

Does virtio-mmio have any reasonable way of doing hotplug?  It could
also eventually make sense to have a standard for virtio on virtio.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-05  2:32             ` Rusty Russell
@ 2014-09-05  3:06               ` Andy Lutomirski
  0 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-05  3:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-s390, Michael S. Tsirkin, Benjamin Herrenschmidt,
	Konrad Rzeszutek Wilk, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390, virtio-dev

On Thu, Sep 4, 2014 at 7:32 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
>> On Wed, Sep 03, 2014 at 04:12:01PM +0930, Rusty Russell wrote:
>>> Andy Lutomirski <luto@amacapital.net> writes:
>>> > There really are virtio devices that are pieces of silicon and not
>>> > figments of a hypervisor's imagination [1].
>>>
>>> Hi Andy,
>>>
>>>         As you're discovering, there's a reason no one has done the DMA
>>> API before.
>>>
>>> So the problem is that ppc64's IOMMU is a platform thing, not a bus
>>> thing.  They really do carve out an exception for virtio devices,
>>> because performance (LOTS of performance).  It remains to be seen if
>>> other platforms have the same performance issues, but in absence of
>>> other evidence, the answer is yes.
>>>
>>> It's a hack.  But having specific virtual-only devices are an even
>>> bigger hack.
>>>
>>> Physical virtio devices have been talked about, but don't actually exist
>>> in Real Life.  And someone a virtio PCI card is going to have serious
>>> performance issues: mainly because they'll want the rings in the card's
>>> MMIO region, not allocated by the driver.
>>
>> Why? What's wrong with rings in memory?
>
> AFAICT, the card would have to access guest memory to read it, using
> multiple DMA cycles.  That's going to be slow.

I don't personally know all the considerations, but AFAICT NVMe puts
its rings in memory, and NVMe is very much focused on performance.

There might be an argument for trying to avoid using indirect rings on
real hardware to reduce the number of DMA round-trips needed for a
command.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-05  2:31             ` Rusty Russell
  2014-09-05  2:57               ` Andy Lutomirski
@ 2014-09-05  5:16               ` Benjamin Herrenschmidt
  2014-09-14  8:58               ` Michael S. Tsirkin
  2 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-05  5:16 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtio-dev, Michael S. Tsirkin, linux-s390,
	Konrad Rzeszutek Wilk, Linux Virtualization,
	Christian Borntraeger, linux390, Paolo Bonzini, Andy Lutomirski

On Fri, 2014-09-05 at 12:01 +0930, Rusty Russell wrote:
> If the answers are both "yes", then x86 is going to be able to use
> virtio+IOMMU, so PPC looks like the odd one out. 

Well, yes and no ... ppc will be able to do that too, it's just
pointless and will suck performances.

Additionally, it will be incompatible with existing guests since today
the guest assumes physical addressing (it doesn't use the DMA mapping
routines), so even if x86 grows the ability to have virtio behind an
IOMMU in qemu, that will break existing guests.

>  Otherwise it looks
> like we're really going to want to stick with the "ignore IOMMU" rule
> until (handwave future), and we make an exception for Xen.

Either that or we have a capability that can be negotiated.

There are other reasons for wanting to allow the use of the DMA ops,
such as people using virtio as a transport between two physically
connected machines (such as a CPU running a PCIe endpoint to a CPU
running a PCIe host, or two hosts connected to a non-transparent switch,
essentially using PCIe as a fast network fabric).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-05  2:57               ` Andy Lutomirski
@ 2014-09-05  5:20                 ` Benjamin Herrenschmidt
  2014-09-05  7:33                 ` Christian Borntraeger
  2014-09-10 15:36                 ` Christopher Covington
  2 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-05  5:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: virtio-dev, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	linux-s390, Linux Virtualization, Christian Borntraeger,
	linux390, Paolo Bonzini

On Thu, 2014-09-04 at 19:57 -0700, Andy Lutomirski wrote:

> There's a third option: try to make virtio-mmio work everywhere
> (except s390), at least in the long run.  This other benefits: it
> makes minimal hypervisors simpler, I think it'll get rid of the limits
> on the number of virtio devices in a system.  ARM is already going
> this direction, and I imagine that PPC support would be
> straightforward (it's already using devicetree).

PCI has advantages though. Management stacks know about PCI and nothing
else really. We already have all the infra to do hotplug with PCI,
etc...

> Does virtio-mmio have any reasonable way of doing hotplug?  It could
> also eventually make sense to have a standard for virtio on virtio.

That would be very platform specific.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-05  2:57               ` Andy Lutomirski
  2014-09-05  5:20                 ` Benjamin Herrenschmidt
@ 2014-09-05  7:33                 ` Christian Borntraeger
  2014-09-10 15:36                 ` Christopher Covington
  2 siblings, 0 replies; 108+ messages in thread
From: Christian Borntraeger @ 2014-09-05  7:33 UTC (permalink / raw)
  To: Andy Lutomirski, Rusty Russell
  Cc: virtio-dev, linux-s390, Michael S. Tsirkin,
	Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Linux Virtualization, linux390, Paolo Bonzini

On 05/09/14 04:57, Andy Lutomirski wrote:
> There's a third option: try to make virtio-mmio work everywhere
> (except s390), at least in the long run.  This other benefits: it
> makes minimal hypervisors simpler, I think it'll get rid of the limits
> on the number of virtio devices in a system.  ARM is already going
> this direction, and I imagine that PPC support would be
> straightforward (it's already using devicetree).

Well, this chance is gone.
When virtio was first introduced we thought about abstraction (MMIO,
hypercalls, PCI ops depending on the platform as part of the transport;
there was even virtio over a serial line as a potential implementation),
but we had to do a fully PCI variant to please Windows guests IIRC.

Christian

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-05  2:57               ` Andy Lutomirski
  2014-09-05  5:20                 ` Benjamin Herrenschmidt
  2014-09-05  7:33                 ` Christian Borntraeger
@ 2014-09-10 15:36                 ` Christopher Covington
  2014-09-10 16:15                   ` Andy Lutomirski
  2 siblings, 1 reply; 108+ messages in thread
From: Christopher Covington @ 2014-09-10 15:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: virtio-dev, linux-s390, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390

On 09/04/2014 10:57 PM, Andy Lutomirski wrote:
> On Thu, Sep 4, 2014 at 7:31 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>> On Sep 2, 2014 11:53 PM, "Rusty Russell" <rusty@rustcorp.com.au> wrote:
>>>>
>>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>>> There really are virtio devices that are pieces of silicon and not
>>>>> figments of a hypervisor's imagination [1].
>>>>
>>>> Hi Andy,
>>>>
>>>>         As you're discovering, there's a reason no one has done the DMA
>>>> API before.
>>>>
>>>> So the problem is that ppc64's IOMMU is a platform thing, not a bus
>>>> thing.  They really do carve out an exception for virtio devices,
>>>> because performance (LOTS of performance).  It remains to be seen if
>>>> other platforms have the same performance issues, but in absence of
>>>> other evidence, the answer is yes.
>>>>
>>>> It's a hack.  But having specific virtual-only devices are an even
>>>> bigger hack.
>>>>
>>>> Physical virtio devices have been talked about, but don't actually exist
>>>> in Real Life.  And someone a virtio PCI card is going to have serious
>>>> performance issues: mainly because they'll want the rings in the card's
>>>> MMIO region, not allocated by the driver.  Being broken on PPC is really
>>>> the least of their problems.
>>>>
>>>> So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
>>>> though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
>>>> a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
>>>> exposed by virtio_pci.c is out.
>>>
>>> Xen does expose dma_ops.  The trick is knowing when to use it.
>>>
>>>>
>>>> I think the best approach is to have a new feature bit (25 is free),
>>>> VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
>>>> use the mapping for the bus it is on.  A real device would set this,
>>>> or it won't work behind an IOMMU.  A Xen device would also set this.
>>>
>>> The devices I care about aren't actually Xen devices.  They're devices
>>> supplied by QEMU/KVM, booting a Xen hypervisor, which in turn passes
>>> the virtio device (along with every other PCI device) through to dom0.
>>> So this is exactly the same virtio device that regular x86 KVM guests
>>> would see.  The reason that current code fails is that Xen guest
>>> physical addresses aren't the same as the addresses seen by the outer
>>> hypervisor.
>>>
>>> These devices don't know that physical addresses != bus addresses, so
>>> they can't advertise that fact.
>>
>> Ah, I see.  Then we will need a Xen-specific hack.
>>
>>> Grr.  This is mostly a result of the fact that virtio_pci devices
>>> aren't really PCI devices.  I still think that virtio_pci shouldn't
>>> have to worry about this; ideally this would all be handled higher up
>>> in the device hierarchy.  x86 already gets this right.
>>
>> Yes.  Adding a feature to say "I am a real PCI device" is possible, but
>> has other issues (particularly as Michael Tsirkin pointed out, what do
>> you do if the driver doesn't understand the feature).
>>
>>> Are there any hypervisors except PPC that use virtio_pci, have IOMMUs
>>> on the pci slot that virtio_pci lives in, and that use physical
>>> addressing?  If not, I think that just quirking PPC will work (at
>>> least until someone wants IOMMU support in virtio_pci on PPC, in which
>>> case doing something using devicetree seems like a reasonable
>>> solution).
>>
>> We can either patch to make PPC weird or make Xen weird.  I'm on the
>> fence.
>>
>> Two questions for Paulo:
>> 1) When QEMU support IOMMU on x86, will the virtio devices behind it
>>    respect the IOMMU (do they use the right memory access primitives?).
>>
>> 2) Are we really going to be able to exclude virtio devices from using
>>    the x86 IOMMU in a portable way which will always work?  If it's
>>    per-bus granularity, will qemu really put them on their own PCI bus
>>    and get this right?  Or will it sometimes get it wrong and users will
>>    end up using virtio devices via IOMMU by accident?
>>
>> If the answers are both "yes", then x86 is going to be able to use
>> virtio+IOMMU, so PPC looks like the odd one out.  Otherwise it looks
>> like we're really going to want to stick with the "ignore IOMMU" rule
>> until (handwave future), and we make an exception for Xen.
> 
> There's a third option: try to make virtio-mmio work everywhere
> (except s390), at least in the long run.  This other benefits: it
> makes minimal hypervisors simpler, I think it'll get rid of the limits
> on the number of virtio devices in a system.  ARM is already going
> this direction, and I imagine that PPC support would be
> straightforward (it's already using devicetree).

In my opinion, a uniform "virt" machine for every instruction set would be
very beneficial. I would guess that MMIO is more universally available than
PCI, and as you point out, simpler to implement.

> Does virtio-mmio have any reasonable way of doing hotplug?  It could
> also eventually make sense to have a standard for virtio on virtio.

I don't think so, but it seems possible. My bystander understanding is that
QEMU allocates some fixed number of VirtIO-MMIO devices, maybe a dozen, in the
device tree. The ones that don't actually get hooked up to something real like
a block device or network interface are populated with a dummy device. One
naive approach might be to allow the dummy devices to tell the kernel that
they are now changing to a real device.

Also, higher level hotplug for at least SCSI sounds possible.

https://bugzilla.redhat.com/show_bug.cgi?id=1123390

Christopher

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-10 15:36                 ` Christopher Covington
@ 2014-09-10 16:15                   ` Andy Lutomirski
  0 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2014-09-10 16:15 UTC (permalink / raw)
  To: Christopher Covington
  Cc: virtio-dev, linux-s390, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390

On Wed, Sep 10, 2014 at 8:36 AM, Christopher Covington
<cov@codeaurora.org> wrote:
> On 09/04/2014 10:57 PM, Andy Lutomirski wrote:
>> There's a third option: try to make virtio-mmio work everywhere
>> (except s390), at least in the long run.  This other benefits: it
>> makes minimal hypervisors simpler, I think it'll get rid of the limits
>> on the number of virtio devices in a system.  ARM is already going
>> this direction, and I imagine that PPC support would be
>> straightforward (it's already using devicetree).
>
> In my opinion, a uniform "virt" machine for every instruction set would be
> very beneficial. I would guess that MMIO is more universally available than
> PCI, and as you point out, simpler to implement.

Except for x86 :(  That's presumably fixable, though.

>
>> Does virtio-mmio have any reasonable way of doing hotplug?  It could
>> also eventually make sense to have a standard for virtio on virtio.
>
> I don't think so, but it seems possible. My bystander understanding is that
> QEMU allocates some fixed number of VirtIO-MMIO devices, maybe a dozen, in the
> device tree. The ones that don't actually get hooked up to something real like
> a block device or network interface are populated with a dummy device. One
> naive approach might be to allow the dummy devices to tell the kernel that
> they are now changing to a real device.

My thought (which I completely failed to articulate) was to have a
spec for a virtio device that exposes a complete virtio bus along with
hotplug and per-cpu interrupts (a la MSI-X).  This might be a bit
complicated, but it would work everywhere without any firmware or
platform issues.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2014-09-05  2:31             ` Rusty Russell
  2014-09-05  2:57               ` Andy Lutomirski
  2014-09-05  5:16               ` Benjamin Herrenschmidt
@ 2014-09-14  8:58               ` Michael S. Tsirkin
  2 siblings, 0 replies; 108+ messages in thread
From: Michael S. Tsirkin @ 2014-09-14  8:58 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtio-dev, pawel.moll, linux-s390, Benjamin Herrenschmidt,
	Konrad Rzeszutek Wilk, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390, Andy Lutomirski

On Fri, Sep 05, 2014 at 12:01:33PM +0930, Rusty Russell wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
> > On Sep 2, 2014 11:53 PM, "Rusty Russell" <rusty@rustcorp.com.au> wrote:
> >>
> >> Andy Lutomirski <luto@amacapital.net> writes:
> >> > There really are virtio devices that are pieces of silicon and not
> >> > figments of a hypervisor's imagination [1].
> >>
> >> Hi Andy,
> >>
> >>         As you're discovering, there's a reason no one has done the DMA
> >> API before.
> >>
> >> So the problem is that ppc64's IOMMU is a platform thing, not a bus
> >> thing.  They really do carve out an exception for virtio devices,
> >> because performance (LOTS of performance).  It remains to be seen if
> >> other platforms have the same performance issues, but in absence of
> >> other evidence, the answer is yes.
> >>
> >> It's a hack.  But having specific virtual-only devices are an even
> >> bigger hack.
> >>
> >> Physical virtio devices have been talked about, but don't actually exist
> >> in Real Life.  And someone a virtio PCI card is going to have serious
> >> performance issues: mainly because they'll want the rings in the card's
> >> MMIO region, not allocated by the driver.  Being broken on PPC is really
> >> the least of their problems.
> >>
> >> So, what do we do?  It'd be nice if Linux virtio Just Worked under Xen,
> >> though Xen's IOMMU is outside the virtio spec.  Since virtio_pci can be
> >> a module, obvious hacks like having xen_arch_setup initialize a dma_ops pointer
> >> exposed by virtio_pci.c is out.
> >
> > Xen does expose dma_ops.  The trick is knowing when to use it.
> >
> >>
> >> I think the best approach is to have a new feature bit (25 is free),
> >> VIRTIO_F_USE_BUS_MAPPING which indicates that a device really wants to
> >> use the mapping for the bus it is on.  A real device would set this,
> >> or it won't work behind an IOMMU.  A Xen device would also set this.
> >
> > The devices I care about aren't actually Xen devices.  They're devices
> > supplied by QEMU/KVM, booting a Xen hypervisor, which in turn passes
> > the virtio device (along with every other PCI device) through to dom0.
> > So this is exactly the same virtio device that regular x86 KVM guests
> > would see.  The reason that current code fails is that Xen guest
> > physical addresses aren't the same as the addresses seen by the outer
> > hypervisor.
> >
> > These devices don't know that physical addresses != bus addresses, so
> > they can't advertise that fact.
> 
> Ah, I see.  Then we will need a Xen-specific hack.
> 
> > Grr.  This is mostly a result of the fact that virtio_pci devices
> > aren't really PCI devices.  I still think that virtio_pci shouldn't
> > have to worry about this; ideally this would all be handled higher up
> > in the device hierarchy.  x86 already gets this right.
> 
> Yes.  Adding a feature to say "I am a real PCI device" is possible, but
> has other issues (particularly as Michael Tsirkin pointed out, what do
> you do if the driver doesn't understand the feature).
> 
> > Are there any hypervisors except PPC that use virtio_pci, have IOMMUs
> > on the pci slot that virtio_pci lives in, and that use physical
> > addressing?  If not, I think that just quirking PPC will work (at
> > least until someone wants IOMMU support in virtio_pci on PPC, in which
> > case doing something using devicetree seems like a reasonable
> > solution).
> 
> We can either patch to make PPC weird or make Xen weird.  I'm on the
> fence.
> 
> Two questions for Paulo:
> 1) When QEMU support IOMMU on x86, will the virtio devices behind it
>    respect the IOMMU (do they use the right memory access primitives?).
> 
> 2) Are we really going to be able to exclude virtio devices from using
>    the x86 IOMMU in a portable way which will always work?  If it's
>    per-bus granularity, will qemu really put them on their own PCI bus
>    and get this right?  Or will it sometimes get it wrong and users will
>    end up using virtio devices via IOMMU by accident?
> 
> If the answers are both "yes", then x86 is going to be able to use
> virtio+IOMMU, so PPC looks like the odd one out.  Otherwise it looks
> like we're really going to want to stick with the "ignore IOMMU" rule
> until (handwave future), and we make an exception for Xen.
> 
> Cheers,
> Rusty.

In theory, it's yes to both questions.
In practice, with patches merged recently it's no to both questions :).

It's a work in progress, but some extra effort to support multiple
PCI roots will be needed on the QEMU side.
What problems will surface when we try to do multiple roots?
Only time will tell.

If it's felt that it's much cleaner to make PPC the odd one out, we can
defer enabling iommu in qemu on x86 until ways to bypass it are
implemented.

But I would be inclined, for pre-1.0 drivers, to make Xen weird.
For 1.0 drivers, we have a bit of time to consider this,
and maybe PPC guys can come up with some way (can be PV)
to tell the guest "these devices bypass the IOMMU".

> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-29  8:17                                       ` Paolo Bonzini
                                                           ` (2 preceding siblings ...)
  2015-07-29  9:21                                         ` Benjamin Herrenschmidt
@ 2015-07-29  9:21                                         ` Benjamin Herrenschmidt
  3 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-29  9:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin, xen-devel,
	Christian Borntraeger, Jan Kiszka, linux390, Andy Lutomirski,
	Linux Virtualization

On Wed, 2015-07-29 at 10:17 +0200, Paolo Bonzini wrote:
> 
> On 29/07/2015 02:47, Andy Lutomirski wrote:
> > > > If new kernels ignore the IOMMU for devices that don't set the flag
> > > > and there are physical devices that already exist and don't set the
> > > > flag, then those devices won't work reliably on most modern
> > > > non-virtual platforms, PPC included.
> > >
> > > Are there many virtio physical devices out there ? We are talking about
> > > a virtio flag right ? Or have you been considering something else ?
> >
> > Yes, virtio flag.  I dislike having a virtio flag at all, but so far
> > no one has come up with any better ideas.  If there was a reliable,
> > cross-platform mechanism for per-device PCI bus properties, I'd be all
> > for using that instead.
> 
> No, a virtio flag doesn't make sense.

It wouldn't if we were creating virtio from scratch.

However we have to be realistic here, we are contending with existing
practices and implementation. The fact is qemu *does* bypass any iommu
and has been doing so for a long time, *and* the guest drivers are
written today *also* bypassing all DMA mapping mechanisms and just
passing everything across.

So if it's a bug, it's a bug on both sides of the fence. We are no
longer in "bug fixing" territory here, it's a fundamental change of ABI.
The ABI might not be what was intended (but that's arguable, see below),
but it is that way.

Arguably it was even known and considered a *feature* by some (including
myself) at the time. It somewhat improved performance on archs where
otherwise every page would have to be mapped/unmapped in the guest IOMMU. In
fact, it also makes vhost a lot easier.

So I disagree, it's de-facto a feature (even if unintended) of the
existing virtio implementations and changing that would be a major
interface change, and thus should be exposed as such.

> Blindly using system memory is a bug in QEMU; it has to be fixed to use
> the right address space, and then whatever the system provides to
> describe "the right address space" can be used (like the DMAR table on x86).

Except that it's not so easy.

For example, on PPC PAPR guests, there is no such thing as a "no IOMMU"
space, the concept doesn't exist. So we have at least three things to
deal with:

 - Existing guests, so we must preserve the existing behaviour for
backward compatibility.

 - vhost is made more complex because it now needs to be informed of the
guest iommu updates

 - New guests with the "new driver" that knows how to map and unmap
would take a performance hit unless some mechanism to create a "no
iommu" space exists, which for us would need to be added. Either that
or we rely on DDW, which is a way for a guest to create a permanent
mapping of its entire address space in an IOMMU, but that incurs a
significant waste of host kernel memory.

> On PPC I suppose you could use the host bridge's device tree?  If you
> need a hook, you can add a

No because we can mix and match virtio and other devices on the same
host bridge. Unless we put a property that only applies to virtio
children of the host bridge.

> 	bool virtio_should_bypass_iommu(void)
> 	{
> 		/* lookup something in the device tree?!? */
> 	}
> 	EXPORT_SYMBOL_GPL(virtio_should_bypass_iommu);
> 
> in some pseries.c file, and in the driver:
> 
> 	static bool virtio_bypass_iommu(void)
> 	{
> 		bool (*fn)(void);
> 	
> 		fn = symbol_get(virtio_should_bypass_iommu);
> 		return fn && fn();
> 	}
> 
> Awful, but that's what this thing is.

Ben.

> Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-29  8:17                                       ` Paolo Bonzini
  2015-07-29  8:20                                         ` Jan Kiszka
@ 2015-07-29  8:20                                         ` Jan Kiszka
  2015-07-29  9:21                                         ` Benjamin Herrenschmidt
  2015-07-29  9:21                                         ` Benjamin Herrenschmidt
  3 siblings, 0 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-29  8:20 UTC (permalink / raw)
  To: Paolo Bonzini, Andy Lutomirski, Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin, xen-devel,
	Christian Borntraeger, linux390, Linux Virtualization

On 2015-07-29 10:17, Paolo Bonzini wrote:
> 
> 
> On 29/07/2015 02:47, Andy Lutomirski wrote:
>>>> If new kernels ignore the IOMMU for devices that don't set the flag
>>>> and there are physical devices that already exist and don't set the
>>>> flag, then those devices won't work reliably on most modern
>>>> non-virtual platforms, PPC included.
>>>
>>> Are there many virtio physical devices out there ? We are talking about
>>> a virtio flag right ? Or have you been considering something else ?
>>
>> Yes, virtio flag.  I dislike having a virtio flag at all, but so far
>> no one has come up with any better ideas.  If there was a reliable,
>> cross-platform mechanism for per-device PCI bus properties, I'd be all
>> for using that instead.
> 
> No, a virtio flag doesn't make sense.

That will create the risk of subtly breaking old guests over new setups.
I wouldn't suggest this.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-29  0:47                                     ` Andy Lutomirski
  2015-07-29  0:54                                       ` Benjamin Herrenschmidt
  2015-07-29  0:54                                       ` Benjamin Herrenschmidt
@ 2015-07-29  8:17                                       ` Paolo Bonzini
  2015-07-29  8:20                                         ` Jan Kiszka
                                                           ` (3 more replies)
  2015-07-29  8:17                                       ` Paolo Bonzini
  3 siblings, 4 replies; 108+ messages in thread
From: Paolo Bonzini @ 2015-07-29  8:17 UTC (permalink / raw)
  To: Andy Lutomirski, Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Jan Kiszka, xen-devel, Christian Borntraeger, linux390,
	Linux Virtualization



On 29/07/2015 02:47, Andy Lutomirski wrote:
> > > If new kernels ignore the IOMMU for devices that don't set the flag
> > > and there are physical devices that already exist and don't set the
> > > flag, then those devices won't work reliably on most modern
> > > non-virtual platforms, PPC included.
> >
> > Are there many virtio physical devices out there ? We are talking about
> > a virtio flag right ? Or have you been considering something else ?
>
> Yes, virtio flag.  I dislike having a virtio flag at all, but so far
> no one has come up with any better ideas.  If there was a reliable,
> cross-platform mechanism for per-device PCI bus properties, I'd be all
> for using that instead.

No, a virtio flag doesn't make sense.

Blindly using system memory is a bug in QEMU; it has to be fixed to use
the right address space, and then whatever the system provides to
describe "the right address space" can be used (like the DMAR table on x86).

On PPC I suppose you could use the host bridge's device tree?  If you
need a hook, you can add a

	bool virtio_should_bypass_iommu(void)
	{
		/* lookup something in the device tree?!? */
	}
	EXPORT_SYMBOL_GPL(virtio_should_bypass_iommu);

in some pseries.c file, and in the driver:

	static bool virtio_bypass_iommu(void)
	{
		bool (*fn)(void);
	
		fn = symbol_get(virtio_should_bypass_iommu);
		return fn && fn();
	}

Awful, but that's what this thing is.
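
One way the pseries-side hook could be fleshed out, purely as a sketch
and assuming a made-up "linux,virtio-bypass-iommu" device-tree property,
would be:

	#include <linux/of.h>
	#include <linux/export.h>

	bool virtio_should_bypass_iommu(void)
	{
		struct device_node *root = of_find_node_by_path("/");
		bool bypass = false;

		if (root) {
			bypass = of_property_read_bool(root,
					"linux,virtio-bypass-iommu");
			of_node_put(root);
		}
		return bypass;
	}
	EXPORT_SYMBOL_GPL(virtio_should_bypass_iommu);

Note that the driver side above would also want a symbol_put() once the
value has been read, so the module reference taken by symbol_get() isn't
leaked.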

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 23:21                               ` Benjamin Herrenschmidt
                                                   ` (2 preceding siblings ...)
  2015-07-29  8:07                                 ` Jan Kiszka
@ 2015-07-29  8:07                                 ` Jan Kiszka
  3 siblings, 0 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-29  8:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin, xen-devel,
	Christian Borntraeger, Paolo Bonzini, linux390,
	Linux Virtualization

On 2015-07-29 01:21, Benjamin Herrenschmidt wrote:
> On Tue, 2015-07-28 at 15:43 -0700, Andy Lutomirski wrote:
>>   New QEMU
>> always advertises this feature flag.  If iommu=on, QEMU's virtio
>> devices refuse to work unless the driver acknowledges the flag.
> 
> This should be configurable.

Advertisement of that flag must be configurable, or we won't be able to
run older guests anymore, which don't know it and will thus reject it. The
only precondition: there must be no IOMMU if we turn it off.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-29  0:47                                     ` Andy Lutomirski
  2015-07-29  0:54                                       ` Benjamin Herrenschmidt
@ 2015-07-29  0:54                                       ` Benjamin Herrenschmidt
  2015-07-29  8:17                                       ` Paolo Bonzini
  2015-07-29  8:17                                       ` Paolo Bonzini
  3 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-29  0:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Jan Kiszka, xen-devel, Christian Borntraeger, Paolo Bonzini,
	linux390, Linux Virtualization

On Tue, 2015-07-28 at 17:47 -0700, Andy Lutomirski wrote:

> Yes, virtio flag.  I dislike having a virtio flag at all, but so far
> no one has come up with any better ideas.  If there was a reliable,
> cross-platform mechanism for per-device PCI bus properties, I'd be all
> for using that instead.

There isn't that I know of, so I think it's the best approach we have.

 .../...

> >  - The kernel should just honor what qemu says, ie, whether the qemu
> > device honors or bypasses the iommu.
> 
> Except for vfio, which maybe just needs a special case: vfio checks if
> the device claims to be virtio and doesn't set the flag, in which case
> vfio just refuses to bind the device.

Right but passing virtio through isn't the highest priority on the
radar, but yes, indeed, it should identify them and reject them.

> >  - Qemu default behaviour should be set via a machine attribute which
> > can be overriden both globally (the machine one) or per-device.
> >
> >> I think that, in an ideal world, there would be no feature flag and
> >> all virtio devices would always respect the IOMMU.  Unfortunately we
> >> have existing practice in the form of PPC and Q35 iommu=on that
> >> conflict with that.
> >
> > And possibly more as in this is how the qemu virtio devices are written
> > today, they do not use the proper DMA accessors, they always bypass,
> > whatever the platform is (so sparc would be in the same boat for
> > example).
> 
> Except that AFAIK Q35 is the only QEMU platform that supports a
> nontrivial IOMMU in the first place.  Are there pseries hosts that
> have a working IOMMU?  Maybe I've just misunderstood.

You may well be correct, I remember that we actually created the iommu
infrastructure to a large extent in qemu for ppc/pseries, then it got
extended when q35 came in.

> >> >>   New QEMU
> >> >> always advertises this feature flag.  If iommu=on, QEMU's virtio
> >> >> devices refuse to work unless the driver acknowledges the flag.
> >> >
> >> > This should be configurable.
> >>
> >> Would any non-PPC user ever configure it differently?  I suppose if
> >> you want to support old kernels on new QEMU, you'd flip the switch.
> >
> > Possibly, have we looked at what ia64, sparc, arm, ... do ? At least
> > sparc has iommus as well.
> 
> I think (I hope!) that ia64 is irrelevant, and last I checked ARM
> didn't have a QEMU-emulated IOMMU.  Maybe things have changed.

Not yet...

 .../...
> >
> > On new machine types, we shouldn't change the behaviour of an existing
> > machine type, and we should keep the default to 0 on ppc/pseries because
> > of backward compatibility issue. But that should be the only place that
> > is "ppc specific", ie, a default value in a machine def structure.
> 
> Fair enough, except I still think we should change the default to be
> "respect IOMMU" on machine types that don't have an IOMMU in the first
> place. 

Ok, but do it in a separate patch because it *is* a behaviour change to
some extent.

>  That way Xen works with old machine types, and I don't think
> we lose anything.
> 
> >
> >> That's the setting that will work in all cases on new guest + new
> >> host, and it's the setting that's safest.  vfio will probably always
> >> malfunction if given a device that looks like it's behind an IOMMU but
> >> doesn't respect it.  For people who need the last bit of performance,
> >> they should use bus-level controls where available (they should be
> >> available everywhere except PPC and maybe arm64) and, ideally, someone
> >> would teach PPC how to exclude devices from the IOMMU cleanly if
> >> possible.  If that can't be done, then there can be an option to
> >> bypass the IOMMU the way it's currently done and no one except PPC
> >> would do it.
> >>
> >> PPC really is different from everything except x86 Q35 iommu=on, and
> >> the latter is experimental.  AFAIK in all other cases, the IOMMU is
> >> respected by virtio, but there is no non-1:1 IOMMU.
> >
> > What about sparc ? I though it was pretty similar to PPC in that
> > regard...
> 
> No clue, honestly.  I could be wrong about the set of existing QEMU
> machine types.

Ok.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-29  0:36                                   ` Benjamin Herrenschmidt
  2015-07-29  0:47                                     ` Andy Lutomirski
@ 2015-07-29  0:47                                     ` Andy Lutomirski
  2015-07-29  0:54                                       ` Benjamin Herrenschmidt
                                                         ` (3 more replies)
  1 sibling, 4 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-29  0:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Jan Kiszka, xen-devel, Christian Borntraeger, Paolo Bonzini,
	linux390, Linux Virtualization

On Tue, Jul 28, 2015 at 5:36 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2015-07-28 at 16:33 -0700, Andy Lutomirski wrote:
>> On Tue, Jul 28, 2015 at 4:21 PM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>> > On Tue, 2015-07-28 at 15:43 -0700, Andy Lutomirski wrote:
>> >> Let me try to summarize a proposal:
>> >>
>> >> Add a feature flag that indicates IOMMU support.
>> >>
>> >> New kernels acknowledge that flag on any device that advertises it.
>> >>
>> >> New kernels always respect the IOMMU (except on PowerPC).
>> >
>> > Why ? I disagree, the flag should be honored when set in any
>> > architecture. PowerPC is no different than any other platform in that
>> > regard.
>>
>> Perhaps I should have said instead "someone more familiar with PPC
>> than I am should figure out what PPC should do".  For the non-PPC
>> case, there is only one instance that I know of in which ignoring the
>> IOMMU is beneficial, and that case is the experimental Q35 thing.
>
> "ppc" is many fairly different platforms, some with iommu, some without,
> some benefiting from bypass, some less etc... I think ARM will soon be
> in a similar basket.
>
>> If new kernels ignore the IOMMU for devices that don't set the flag
>> and there are physical devices that already exist and don't set the
>> flag, then those devices won't work reliably on most modern
>> non-virtual platforms, PPC included.
>
> Are there many virtio physical devices out there ? We are talking about
> a virtio flag right ? Or have you been considering something else ?

Yes, virtio flag.  I dislike having a virtio flag at all, but so far
no one has come up with any better ideas.  If there was a reliable,
cross-platform mechanism for per-device PCI bus properties, I'd be all
for using that instead.

>
>> >>   New kernels
>> >> optionally refuse to talk to devices that don't have that feature flag
>> >> if the device appears to be behind an IOMMU.  (This presumably
>> >> includes any device whatsoever on an x86 platform with an IOMMU,
>> >> including Xen's fake IOMMU.)
>> >>
>> >> New QEMU always respects the IOMMU, if any, except on PPC.
>> >
>> > This is just a matter of what is the default of the flag, ie we
>> > should have a machine flag that indicates what the default is for
>> > new virtio devices, otherwise, it should be specified per device
>> > as an attribute of the device instance.
>>
>> On x86, I think that even super-performance-critical virtio devices
>> should always honor the iommu, but that the iommu in question should
>> be a 1:1 iommu.  I *think* that x86 supports that.  IOW x86 would
>> always set the feature flag.
>
> Ok.
>
>> > I would argue that we should default to "bypass IOMMU" on *all*
>> > architectures due to the performance impact, and to essentially
>> > default to the same behaviour as today. With things like DDW even
>> > powerpc might be able to mostly alleviate the performance impact
>> > so we might want to change in the long term, but I tend to prefer
>> > more incremental approaches.
>>
>> As above, there's a difference between "bypass IOMMU" and "there is no
>> IOMMU".  x86 and, I think, most other platforms are capable of the
>> latter.  I'm not sure PPC is.
>
> Depends on the platform. "pseries" isn't since it's already a
> paravirtualized platform, but there are other ppc platforms out there
> which behave differently. That's why I think:
>
>  - The kernel should just honor what qemu says, ie, whether the qemu
> device honors or bypasses the iommu.

Except for vfio, which maybe just needs a special case: vfio checks if
the device claims to be virtio and doesn't set the flag, in which case
vfio just refuses to bind the device.
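
As a rough sketch of what that check could look like (the only real
detail below is the virtio PCI vendor ID; virtio_honors_iommu() is a
made-up stand-in for however the feature bit would actually be
probed):

    #include <linux/pci.h>
    #include <linux/errno.h>

    #define VIRTIO_PCI_VENDOR_ID	0x1af4	/* all virtio PCI devices */

    /* Made-up stand-in: real code would read the device's virtio
     * feature bits to see whether it promises to honor the IOMMU. */
    static bool virtio_honors_iommu(struct pci_dev *pdev)
    {
            return false;
    }

    /* Sketch: refuse to bind a virtio device that would bypass the
     * IOMMU, since such a device can't be assigned safely. */
    static int vfio_virtio_iommu_check(struct pci_dev *pdev)
    {
            if (pdev->vendor != VIRTIO_PCI_VENDOR_ID)
                    return 0;	/* not virtio, nothing to check */
            if (!virtio_honors_iommu(pdev))
                    return -EPERM;
            return 0;
    }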

>
>  - Qemu default behaviour should be set via a machine attribute which
> can be overridden both globally (the machine one) and per-device.
>
>> I think that, in an ideal world, there would be no feature flag and
>> all virtio devices would always respect the IOMMU.  Unfortunately we
>> have existing practice in the form of PPC and Q35 iommu=on that
>> conflict with that.
>
> And possibly more as in this is how the qemu virtio devices are written
> today, they do not use the proper DMA accessors, they always bypass,
> whatever the platform is (so sparc would be in the same boat for
> example).

Except that AFAIK Q35 is the only QEMU platform that supports a
nontrivial IOMMU in the first place.  Are there pseries hosts that
have a working IOMMU?  Maybe I've just misunderstood.

>
>> >>   New QEMU
>> >> always advertises this feature flag.  If iommu=on, QEMU's virtio
>> >> devices refuse to work unless the driver acknowledges the flag.
>> >
>> > This should be configurable.
>>
>> Would any non-PPC user ever configure it differently?  I suppose if
>> you want to support old kernels on new QEMU, you'd flip the switch.
>
> Possibly, have we looked at what ia64, sparc, arm, ... do ? At least
> sparc has iommus as well.

I think (I hope!) that ia64 is irrelevant, and last I checked ARM
didn't have a QEMU-emulated IOMMU.  Maybe things have changed.

>
> Let's try to not make it an architecture issue. As I said above, we have
> a kernel that just reacts appropriately based on what qemu says it's
> doing, and what qemu does is a per-machine flag to set the default.
>
>> >> On PPC, new QEMU will not respect the IOMMU and will not set the flag.
>> >> New kernels will not talk to devices that set the flag.  If someone
>> >> wants to fix that, then they get to figure out how.
>> >
>> > I disagree with the kernel bit and I disagree with special casing PPC in
>> > any shape or form in the code. The only difference should be a default
>> > value for the iommu mode of virtio in qemu set per machine.
>> >
>> > You can then feel free to change that default (in a separate patch for
>> > bisectability) on x86 for the sake of Xen.
>>
>> I think we should flip the default everywhere to "respects IOMMU".
>
> On new machine types, we shouldn't change the behaviour of an existing
> machine type, and we should keep the default to 0 on ppc/pseries because
> of backward compatibility issues. But that should be the only place that
> is "ppc specific", ie, a default value in a machine def structure.

Fair enough, except I still think we should change the default to be
"respect IOMMU" on machine types that don't have an IOMMU in the first
place.  That way Xen works with old machine types, and I don't think
we lose anything.

>
>> That's the setting that will work in all cases on new guest + new
>> host, and it's the setting that's safest.  vfio will probably always
>> malfunction if given a device that looks like it's behind an IOMMU but
>> doesn't respect it.  For people who need the last bit of performance,
>> they should use bus-level controls where available (they should be
>> available everywhere except PPC and maybe arm64) and, ideally, someone
>> would teach PPC how to exclude devices from the IOMMU cleanly if
>> possible.  If that can't be done, then there can be an option to
>> bypass the IOMMU the way it's currently done and no one except PPC
>> would do it.
>>
>> PPC really is different from everything except x86 Q35 iommu=on, and
>> the latter is experimental.  AFAIK in all other cases, the IOMMU is
>> respected by virtio, but there is no non-1:1 IOMMU.
>
> What about sparc ? I thought it was pretty similar to PPC in that
> regard...

No clue, honestly.  I could be wrong about the set of existing QEMU
machine types.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 23:33                                 ` Andy Lutomirski
  2015-07-29  0:36                                   ` Benjamin Herrenschmidt
@ 2015-07-29  0:36                                   ` Benjamin Herrenschmidt
  2015-07-29  0:47                                     ` Andy Lutomirski
  2015-07-29  0:47                                     ` Andy Lutomirski
  1 sibling, 2 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-29  0:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Jan Kiszka, xen-devel, Christian Borntraeger, Paolo Bonzini,
	linux390, Linux Virtualization

On Tue, 2015-07-28 at 16:33 -0700, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 4:21 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Tue, 2015-07-28 at 15:43 -0700, Andy Lutomirski wrote:
> >> Let me try to summarize a proposal:
> >>
> >> Add a feature flag that indicates IOMMU support.
> >>
> >> New kernels acknowledge that flag on any device that advertises it.
> >>
> >> New kernels always respect the IOMMU (except on PowerPC).
> >
> > Why ? I disagree, the flag should be honored when set in any
> > architecture. PowerPC is no different than any other platform in that
> > regard.
> 
> Perhaps I should have said instead "someone more familiar with PPC
> than I am should figure out what PPC should do".  For the non-PPC
> case, there is only one instance that I know of in which ignoring the
> IOMMU is beneficial, and that case is the experimental Q35 thing.

"ppc" is many fairly different platforms, some with iommu, some without,
some benefiting from bypass, some less etc... I think ARM will soon be
in a similar basket.

> If new kernels ignore the IOMMU for devices that don't set the flag
> and there are physical devices that already exist and don't set the
> flag, then those devices won't work reliably on most modern
> non-virtual platforms, PPC included.

Are there many virtio physical devices out there ? We are talking about
a virtio flag right ? Or have you been considering something else ?

> >>   New kernels
> >> optionally refuse to talk to devices that don't have that feature flag
> >> if the device appears to be behind an IOMMU.  (This presumably
> >> includes any device whatsoever on an x86 platform with an IOMMU,
> >> including Xen's fake IOMMU.)
> >>
> >> New QEMU always respects the IOMMU, if any, except on PPC.
> >
> > This is just a matter of what is the default of the flag, ie we
> > should have a machine flag that indicates what the default is for
> > new virtio devices, otherwise, it should be specified per device
> > as an attribute of the device instance.
> 
> On x86, I think that even super-performance-critical virtio devices
> should always honor the iommu, but that the iommu in question should
> be a 1:1 iommu.  I *think* that x86 supports that.  IOW x86 would
> always set the feature flag.

Ok.

> > I would argue that we should default to "bypass IOMMU" on *all*
> > architectures due to the performance impact, and to essentially
> > default to the same behaviour as today. With things like DDW even
> > powerpc might be able to mostly alleviate the performance impact
> > so we might want to change in the long term, but I tend to prefer
> > more incremental approaches.
> 
> As above, there's a difference between "bypass IOMMU" and "there is no
> IOMMU".  x86 and, I think, most other platforms are capable of the
> latter.  I'm not sure PPC is.

Depends on the platform. "pseries" isn't since it's already a
paravirtualized platform, but there are other ppc platforms out there
which behave differently. That's why I think:

 - The kernel should just honor what qemu says, ie, whether the qemu
device honors or bypasses the iommu.

 - Qemu default behaviour should be set via a machine attribute which
can be overridden both globally (the machine one) and per-device.

> I think that, in an ideal world, there would be no feature flag and
> all virtio devices would always respect the IOMMU.  Unfortunately we
> have existing practice in the form of PPC and Q35 iommu=on that
> conflict with that.

And possibly more as in this is how the qemu virtio devices are written
today, they do not use the proper DMA accessors, they always bypass,
whatever the platform is (so sparc would be in the same boat for
example).

> >>   New QEMU
> >> always advertises this feature flag.  If iommu=on, QEMU's virtio
> >> devices refuse to work unless the driver acknowledges the flag.
> >
> > This should be configurable.
> 
> Would any non-PPC user ever configure it differently?  I suppose if
> you want to support old kernels on new QEMU, you'd flip the switch.

Possibly, have we looked at what ia64, sparc, arm, ... do ? At least
sparc has iommus as well.

Let's try to not make it an architecture issue. As I said above, we have
a kernel that just reacts appropriately based on what qemu says it's
doing, and what qemu does is a per-machine flag to set the default.

> >> On PPC, new QEMU will not respect the IOMMU and will not set the flag.
> >> New kernels will not talk to devices that set the flag.  If someone
> >> wants to fix that, then they get to figure out how.
> >
> > I disagree with the kernel bit and I disagree with special casing PPC in
> > any shape or form in the code. The only difference should be a default
> > value for the iommu mode of virtio in qemu set per machine.
> >
> > You can then feel free to change that default (in a separate patch for
> > bisectability) on x86 for the sake of Xen.
> 
> I think we should flip the default everywhere to "respects IOMMU".

On new machine types, we shouldn't change the behaviour of an existing
machine type, and we should keep the default to 0 on ppc/pseries because
of backward compatibility issues. But that should be the only place that
is "ppc specific", ie, a default value in a machine def structure.

> That's the setting that will work in all cases on new guest + new
> host, and it's the setting that's safest.  vfio will probably always
> malfunction if given a device that looks like it's behind an IOMMU but
> doesn't respect it.  For people who need the last bit of performance,
> they should use bus-level controls where available (they should be
> available everywhere except PPC and maybe arm64) and, ideally, someone
> would teach PPC how to exclude devices from the IOMMU cleanly if
> possible.  If that can't be done, then there can be an option to
> bypass the IOMMU the way it's currently done and no one except PPC
> would do it.
> 
> PPC really is different from everything except x86 Q35 iommu=on, and
> the latter is experimental.  AFAIK in all other cases, the IOMMU is
> respected by virtio, but there is no non-1:1 IOMMU.

What about sparc ? I thought it was pretty similar to PPC in that
regard...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 23:21                               ` Benjamin Herrenschmidt
  2015-07-28 23:33                                 ` Andy Lutomirski
@ 2015-07-28 23:33                                 ` Andy Lutomirski
  2015-07-29  0:36                                   ` Benjamin Herrenschmidt
  2015-07-29  0:36                                   ` Benjamin Herrenschmidt
  2015-07-29  8:07                                 ` Jan Kiszka
  2015-07-29  8:07                                 ` Jan Kiszka
  3 siblings, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28 23:33 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Jan Kiszka, xen-devel, Christian Borntraeger, Paolo Bonzini,
	linux390, Linux Virtualization

On Tue, Jul 28, 2015 at 4:21 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2015-07-28 at 15:43 -0700, Andy Lutomirski wrote:
>> Let me try to summarize a proposal:
>>
>> Add a feature flag that indicates IOMMU support.
>>
>> New kernels acknowledge that flag on any device that advertises it.
>>
>> New kernels always respect the IOMMU (except on PowerPC).
>
> Why ? I disagree, the flag should be honored when set in any
> architecture. PowerPC is no different than any other platform in that
> regard.

Perhaps I should have said instead "someone more familiar with PPC
than I am should figure out what PPC should do".  For the non-PPC
case, there is only one instance that I know of in which ignoring the
IOMMU is beneficial, and that case is the experimental Q35 thing.

If new kernels ignore the IOMMU for devices that don't set the flag
and there are physical devices that already exist and don't set the
flag, then those devices won't work reliably on most modern
non-virtual platforms, PPC included.

>
>>   New kernels
>> optionally refuse to talk to devices that don't have that feature flag
>> if the device appears to be behind an IOMMU.  (This presumably
>> includes any device whatsoever on an x86 platform with an IOMMU,
>> including Xen's fake IOMMU.)
>>
>> New QEMU always respects the IOMMU, if any, except on PPC.
>
> This is just a matter of what is the default of the flag, ie we
> should have a machine flag that indicates what the default is for
> new virtio devices, otherwise, it should be specified per device
> as an attribute of the device instance.

On x86, I think that even super-performance-critical virtio devices
should always honor the iommu, but that the iommu in question should
be a 1:1 iommu.  I *think* that x86 supports that.  IOW x86 would
always set the feature flag.
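
The driver side is unchanged either way: code that goes through the
DMA API neither knows nor cares whether the IOMMU behind it is a 1:1
map or a real translation.  A minimal sketch using the ordinary DMA
API (nothing virtio-specific, error handling trimmed to the basics):

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /*
     * With a 1:1 ("passthrough") IOMMU the returned bus address happens
     * to equal the physical address; with a translating IOMMU it does
     * not.  The mapping call is identical in both cases.
     */
    static int sketch_map_buffer(struct device *dev, void *buf, size_t len,
                                 dma_addr_t *out)
    {
            *out = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
            if (dma_mapping_error(dev, *out))
                    return -ENOMEM;
            return 0;
    }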

>
> I would argue that we should default to "bypass IOMMU" on *all*
> architectures due to the performance impact, and to essentially
> default to the same behaviour as today. With things like DDW even
> powerpc might be able to mostly alleviate the performance impact
> so we might want to change in the long term, but I tend to prefer
> more incremental approaches.

As above, there's a difference between "bypass IOMMU" and "there is no
IOMMU".  x86 and, I think, most other platforms are capable of the
latter.  I'm not sure PPC is.

I think that, in an ideal world, there would be no feature flag and
all virtio devices would always respect the IOMMU.  Unfortunately we
have existing practice in the form of PPC and Q35 iommu=on that
conflict with that.

>
>>   New QEMU
>> always advertises this feature flag.  If iommu=on, QEMU's virtio
>> devices refuse to work unless the driver acknowledges the flag.
>
> This should be configurable.

Would any non-PPC user ever configure it differently?  I suppose if
you want to support old kernels on new QEMU, you'd flip the switch.

>
>> On PPC, new QEMU will not respect the IOMMU and will not set the flag.
>> New kernels will not talk to devices that set the flag.  If someone
>> wants to fix that, then they get to figure out how.
>
> I disagree with the kernel bit and I disagree with special casing PPC in
> any shape or form in the code. The only difference should be a default
> value for the iommu mode of virtio in qemu set per machine.
>
> You can then feel free to change that default (in a separate patch for
> bisectability) on x86 for the sake of Xen.

I think we should flip the default everywhere to "respects IOMMU".
That's the setting that will work in all cases on new guest + new
host, and it's the setting that's safest.  vfio will probably always
malfunction if given a device that looks like it's behind an IOMMU but
doesn't respect it.  For people who need the last bit of performance,
they should use bus-level controls where available (they should be
available everywhere except PPC and maybe arm64) and, ideally, someone
would teach PPC how to exclude devices from the IOMMU cleanly if
possible.  If that can't be done, then there can be an option to
bypass the IOMMU the way it's currently done and no one except PPC
would do it.

PPC really is different from everything except x86 Q35 iommu=on, and
the latter is experimental.  AFAIK in all other cases, the IOMMU is
respected by virtio, but there is no non-1:1 IOMMU.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 22:43                             ` Andy Lutomirski
@ 2015-07-28 23:21                               ` Benjamin Herrenschmidt
  2015-07-28 23:33                                 ` Andy Lutomirski
                                                   ` (3 more replies)
  2015-07-28 23:21                               ` Benjamin Herrenschmidt
  1 sibling, 4 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-28 23:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Jan Kiszka, xen-devel, Christian Borntraeger, Paolo Bonzini,
	linux390, Linux Virtualization

On Tue, 2015-07-28 at 15:43 -0700, Andy Lutomirski wrote:
> Let me try to summarize a proposal:
> 
> Add a feature flag that indicates IOMMU support.
> 
> New kernels acknowledge that flag on any device that advertises it.
> 
> New kernels always respect the IOMMU (except on PowerPC).

Why ? I disagree, the flag should be honored when set in any
architecture. PowerPC is no different than any other platform in that
regard.

>   New kernels
> optionally refuse to talk to devices that don't have that feature flag
> if the device appears to be behind an IOMMU.  (This presumably
> includes any device whatsoever on an x86 platform with an IOMMU,
> including Xen's fake IOMMU.)
> 
> New QEMU always respects the IOMMU, if any, except on PPC.

This is just a matter of what is the default of the flag, ie we
should have a machine flag that indicates what the default is for
new virtio devices, otherwise, it should be specified per device
as an attribute of the device instance.

I would argue that we should default to "bypass IOMMU" on *all*
architectures due to the performance impact, and to essentially
default to the same behaviour as today. With things like DDW even
powerpc might be able to mostly alleviate the performance impact
so we might want to change in the long term, but I tend to prefer
more incremental approaches.

>   New QEMU
> always advertises this feature flag.  If iommu=on, QEMU's virtio
> devices refuse to work unless the driver acknowledges the flag.

This should be configurable.

> On PPC, new QEMU will not respect the IOMMU and will not set the flag.
> New kernels will not talk to devices that set the flag.  If someone
> wants to fix that, then they get to figure out how.

I disagree with the kernel bit and I disagree with special casing PPC in
any shape or form in the code. The only difference should be a default
value for the iommu mode of virtio in qemu set per machine.

You can then feel free to change that default (in a separate patch for
bisectability) on x86 for the sake of Xen.

Ben.

> This results in:
> 
> New kernels work fine with old QEMU unless iommu=on.
> 
> New kernels work with new devices (QEMU and physical devices that set
> the flag) under all circumstances, except on PPC where physical
> devices are and remain broken.
> 
> Xen works with new QEMU and cleanly refuses to interoperate with old
> QEMU.  (This is worse than with just my patches, but it's better than
> the status quo in which the Xen guest corrupts itself and possibly
> corrupts the Xen hypervisor.)
> 
> New kernels with old QEMU with iommu=on optionally refuse to interoperate.
> 
> Old kernels are oblivious.  They work exactly the same as they do
> today except that they fail cleanly with new QEMU with iommu=on.  Old
> kernels continue to fail with physical virtio devices if they're
> behind an iommu.
> 
> Old physical virtio devices that don't advertise the flag fail cleanly
> if the host uses an iommu.  The driver could optionally whitelist such
> devices.
> 
> PPC works as well as it currently does.
> 
> I'm unsure about the arm64 situation.
> 
> 
> Did I get this right?
> 
> --Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 21:16                           ` Andy Lutomirski
  2015-07-28 22:43                             ` Andy Lutomirski
@ 2015-07-28 22:43                             ` Andy Lutomirski
  2015-07-28 23:21                               ` Benjamin Herrenschmidt
  2015-07-28 23:21                               ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28 22:43 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

Let me try to summarize a proposal:

Add a feature flag that indicates IOMMU support.

New kernels acknowledge that flag on any device that advertises it.

New kernels always respect the IOMMU (except on PowerPC).  New kernels
optionally refuse to talk to devices that don't have that feature flag
if the device appears to be behind an IOMMU.  (This presumably
includes any device whatsoever on an x86 platform with an IOMMU,
including Xen's fake IOMMU.)

New QEMU always respects the IOMMU, if any, except on PPC.  New QEMU
always advertises this feature flag.  If iommu=on, QEMU's virtio
devices refuse to work unless the driver acknowledges the flag.
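
To make the driver side of that negotiation concrete, a minimal
sketch of the acknowledgment check (the feature bit name and number
are placeholders for whatever would actually get defined, and the
helper is not a function from this series):

    #include <linux/virtio.h>
    #include <linux/virtio_config.h>

    /* Placeholder feature bit; the real name and number are TBD. */
    #define VIRTIO_F_IOMMU_PLATFORM_SKETCH	33

    /*
     * Sketch only: if the device advertised the bit and the driver
     * acknowledged it, the device promises to honor the platform
     * IOMMU, so the ring must be fed addresses obtained from the
     * DMA API rather than raw physical addresses.
     */
    static bool vring_sketch_use_dma_api(struct virtio_device *vdev)
    {
            return virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM_SKETCH);
    }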

On PPC, new QEMU will not respect the IOMMU and will not set the flag.
New kernels will not talk to devices that set the flag.  If someone
wants to fix that, then they get to figure out how.

This results in:

New kernels work fine with old QEMU unless iommu=on.

New kernels work with new devices (QEMU and physical devices that set
the flag) under all circumstances, except on PPC where physical
devices are and remain broken.

Xen works with new QEMU and cleanly refuses to interoperate with old
QEMU.  (This is worse than with just my patches, but it's better than
the status quo in which the Xen guest corrupts itself and possibly
corrupts the Xen hypervisor.)

New kernels with old QEMU with iommu=on optionally refuse to interoperate.

Old kernels are oblivious.  They work exactly the same as they do
today except that they fail cleanly with new QEMU with iommu=on.  Old
kernels continue to fail with physical virtio devices if they're
behind an iommu.

Old physical virtio devices that don't advertise the flag fail cleanly
if the host uses an iommu.  The driver could optionally whitelist such
devices.
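
Such a whitelist could be as dumb as a static table keyed on PCI IDs;
a sketch (the entry shown is a placeholder, not a claim about any
real device):

    #include <linux/pci.h>

    /* Sketch: devices known to bypass the IOMMU even though they
     * don't advertise the flag.  The entry below is a placeholder. */
    static const struct pci_device_id iommu_bypass_whitelist[] = {
            { PCI_DEVICE(0x1af4, 0x1000) },	/* placeholder ID */
            { 0 }
    };

    static bool sketch_whitelisted(u16 vendor, u16 device)
    {
            const struct pci_device_id *id;

            for (id = iommu_bypass_whitelist; id->vendor; id++)
                    if (id->vendor == vendor && id->device == device)
                            return true;
            return false;
    }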

PPC works as well as it currently does.

I'm unsure about the arm64 situation.


Did I get this right?

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 19:33                         ` Jan Kiszka
  2015-07-28 21:16                           ` Andy Lutomirski
@ 2015-07-28 21:16                           ` Andy Lutomirski
  2015-07-28 22:43                             ` Andy Lutomirski
  2015-07-28 22:43                             ` Andy Lutomirski
  1 sibling, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28 21:16 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On Tue, Jul 28, 2015 at 12:33 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2015-07-28 21:24, Andy Lutomirski wrote:
>> On Tue, Jul 28, 2015 at 12:06 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> On 2015-07-28 20:22, Andy Lutomirski wrote:
>>>> On Tue, Jul 28, 2015 at 10:17 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>>>> On 2015-07-28 19:10, Andy Lutomirski wrote:
>>>>>> The trouble is that this is really a property of the bus and not of
>>>>>> the device.  If you build a virtio device that physically plugs into a
>>>>>> PCIe slot, the device has no concept of an IOMMU in the first place.
>>>>>
>>>>> If one would build a real virtio device today, it would be broken
>>>>> because every IOMMU would start to translate its requests. Already from
>>>>> that POV, we really need to introduce a feature flag "I will be
>>>>> IOMMU-translated" so that a potential physical implementation can carry
>>>>> it unconditionally.
>>>>>
>>>>
>>>> Except that, with my patches, it would work correctly.  ISTM the thing
>>>
>>> I haven't looked at your patches yet - they make the virtio PCI driver
>>> in Linux IOMMU-compatible? Perfect - except for a compatibility check,
>>> right?
>>
>> Yes.  (virtio_pci_legacy, anyway.  Presumably virtio_pci_modern is
>> easy to adapt, too.)
>>
>>>
>>>> that's broken right now is QEMU and the virtio_pci driver.  My patches
>>>> fix the driver.  Last year that would have been the end of the story
>>>> except for PPC.  Now we have to deal with QEMU.
>>>>
>>>>>> Similarly, if you take an L0-provided IOMMU-supporting device and pass
>>>>>> it through to L2 using current QEMU on L1 (with Q35 emulation and
>>>>>> iommu enabled), then, from L2's perspective, the device is 1:1 no
>>>>>> matter what the device thinks.
>>>>>>
>>>>>> IOW, I think the original design was wrong and now we have to deal
>>>>>> with it.  I think the best solution would be to teach QEMU to fix its
>>>>>> ACPI tables so that 1:1 virtio devices are actually exposed as 1:1.
>>>>>
>>>>> Only the current drivers are broken. And we can easily tell them apart
>>>>> from newer ones via feature flags. Sorry, don't get the problem.
>>>>
>>>> I still don't see how feature flags solve the problem.  Suppose we
>>>> added a feature flag meaning "respects IOMMU".
>>>>
>>>> Bad case 1:  Build a malicious device that advertises
>>>> non-IOMMU-respecting virtio.  Plug it in behind an IOMMU.  Host starts
>>>> leaking physical addresses to the device (and the device doesn't work,
>>>> of course).  Maybe that's only barely a security problem, but still...
>>>
>>> I don't see right now how critical such a hypothetical case could be.
>>> But the OS / its drivers could still decide to refuse talking to such a
>>> device.
>>>
>>
>> How does OS know it's such a device as opposed to a QEMU-supplied thing?
>
> It can restrict itself to virtio devices exposing the feature if it
> feels uncomfortable that it might be talking to some evil piece of
> silicon (instead of the hypervisor, which has to be trusted anyway).
>
>>
>>>>
>>>> Bad case 3: Some hypothetical well-behaved new QEMU provides a virtio
>>>> device that *does* respect the IOMMU and sets the feature flag.  They
>>>> emulate Q35 with an IOMMU.  They boot Linux 4.1.  Data corruption in
>>>> the guest.
>>>
>>> No. In that case, the feature negotiation of "virtio-with-iommu-support"
>>> would have failed for older drivers, and the device would have never
>>> been used by the guest.
>>
>> So are you suggesting that newer virtio devices always provide this
>> feature flag and, if supplied by QEMU with iommu=on, simply refuse to
>> operate if the driver doesn't support that flag?
>
> Exactly.
>
>>
>> That could work as long as QEMU with the current (broken?) iommu=on
>> never exposes such a device.
>
> QEMU would have to be adjusted first so that all its virtio-pci device
> models take IOMMUs into account - whether they exist or not. Only then
> could it expose the feature and expect the guest to acknowledge it.
>
> For compat reasons, QEMU should still be able to expose virtio devices
> without the flag set - but then without any IOMMU emulation enabled as
> well. That would prevent the current setup we are using today, but it's
> trivial to update the guest kernel to a newer virtio driver which would
> restore our scenario again.

Seems reasonable.

>>
>> If we apply something similar enough to my patches, then even old
>> hypervisors (e.g. Amazon's hardware virt systems) will support Xen
>> with virtio devices passed in just fine.
>
> Then it seems we can make everyone happy - perfect. :)

Yay.

FWIW, I have no intention to touch the QEMU code for this.  I'm
willing to do the vring bit and the virtio-pci bit as long as it's
well specified.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 19:24                       ` Andy Lutomirski
@ 2015-07-28 19:33                         ` Jan Kiszka
  2015-07-28 21:16                           ` Andy Lutomirski
  2015-07-28 21:16                           ` Andy Lutomirski
  2015-07-28 19:33                         ` Jan Kiszka
  1 sibling, 2 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-28 19:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On 2015-07-28 21:24, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 12:06 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2015-07-28 20:22, Andy Lutomirski wrote:
>>> On Tue, Jul 28, 2015 at 10:17 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>>> On 2015-07-28 19:10, Andy Lutomirski wrote:
>>>>> The trouble is that this is really a property of the bus and not of
>>>>> the device.  If you build a virtio device that physically plugs into a
>>>>> PCIe slot, the device has no concept of an IOMMU in the first place.
>>>>
>>>> If one would build a real virtio device today, it would be broken
>>>> because every IOMMU would start to translate its requests. Already from
>>>> that POV, we really need to introduce a feature flag "I will be
>>>> IOMMU-translated" so that a potential physical implementation can carry
>>>> it unconditionally.
>>>>
>>>
>>> Except that, with my patches, it would work correctly.  ISTM the thing
>>
>> I haven't looked at your patches yet - they make the virtio PCI driver
>> in Linux IOMMU-compatible? Perfect - except for a compatibility check,
>> right?
> 
> Yes.  (virtio_pci_legacy, anyway.  Presumably virtio_pci_modern is
> easy to adapt, too.)
> 
>>
>>> that's broken right now is QEMU and the virtio_pci driver.  My patches
>>> fix the driver.  Last year that would have been the end of the story
>>> except for PPC.  Now we have to deal with QEMU.
>>>
>>>>> Similarly, if you take an L0-provided IOMMU-supporting device and pass
>>>>> it through to L2 using current QEMU on L1 (with Q35 emulation and
>>>>> iommu enabled), then, from L2's perspective, the device is 1:1 no
>>>>> matter what the device thinks.
>>>>>
>>>>> IOW, I think the original design was wrong and now we have to deal
>>>>> with it.  I think the best solution would be to teach QEMU to fix its
>>>>> ACPI tables so that 1:1 virtio devices are actually exposed as 1:1.
>>>>
>>>> Only the current drivers are broken. And we can easily tell them apart
>>>> from newer ones via feature flags. Sorry, don't get the problem.
>>>
>>> I still don't see how feature flags solve the problem.  Suppose we
>>> added a feature flag meaning "respects IOMMU".
>>>
>>> Bad case 1:  Build a malicious device that advertises
>>> non-IOMMU-respecting virtio.  Plug it in behind an IOMMU.  Host starts
>>> leaking physical addresses to the device (and the device doesn't work,
>>> of course).  Maybe that's only barely a security problem, but still...
>>
>> I don't see right now how critical such a hypothetical case could be.
>> But the OS / its drivers could still decide to refuse talking to such a
>> device.
>>
> 
> How does OS know it's such a device as opposed to a QEMU-supplied thing?

It can restrict itself to virtio devices exposing the feature if it
feels uncomfortable that it might be talking to some evil piece of
silicon (instead of the hypervisor, which has to be trusted anyway).

> 
>>>
>>> Bad case 3: Some hypothetical well-behaved new QEMU provides a virtio
>>> device that *does* respect the IOMMU and sets the feature flag.  They
>>> emulate Q35 with an IOMMU.  They boot Linux 4.1.  Data corruption in
>>> the guest.
>>
>> No. In that case, the feature negotiation of "virtio-with-iommu-support"
>> would have failed for older drivers, and the device would have never
>> been used by the guest.
> 
> So are you suggesting that newer virtio devices always provide this
> feature flag and, if supplied by QEMU with iommu=on, simply refuse to
> operate if the driver doesn't support that flag?

Exactly.

> 
> That could work as long as QEMU with the current (broken?) iommu=on
> never exposes such a device.

QEMU would have to be adjusted first so that all its virtio-pci device
models take IOMMUs into account - whether they exist or not. Only then
could it expose the feature and expect the guest to acknowledge it.

For compat reasons, QEMU should still be able to expose virtio devices
without the flag set - but then without any IOMMU emulation enabled as
well. That would prevent the current setup we are using today, but it's
trivial to update the guest kernel to a newer virtio driver which would
restore our scenario again.
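
A minimal sketch of that gating on the device-model side, with purely
illustrative names (this is not actual QEMU code):

/* Offer the flag only when DMA really goes through the emulated IOMMU,
 * and with iommu=on refuse drivers that do not acknowledge it. */
#include <stdbool.h>
#include <stdint.h>

#define F_IOMMU_PLATFORM (1ULL << 33)   /* hypothetical feature bit */

struct vdev_model {
    uint64_t host_features;   /* features this device model offers */
    bool honours_iommu;       /* DMA path goes through the emulated IOMMU */
    bool iommu_enabled;       /* machine was started with iommu=on */
};

static uint64_t offered_features(const struct vdev_model *d)
{
    uint64_t f = d->host_features;

    if (d->honours_iommu)
        f |= F_IOMMU_PLATFORM;
    return f;
}

static bool accept_driver_features(const struct vdev_model *d, uint64_t acked)
{
    /* Old drivers never set the bit, so negotiation fails and the device
     * is never used, instead of silently corrupting guest memory. */
    if (d->iommu_enabled && !(acked & F_IOMMU_PLATFORM))
        return false;
    return true;
}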

> 
>>
>>>
>>> We could make the rule that *all* virtio-pci devices (except on PPC)
>>> respect the bus rules.  We'd have to fix QEMU so that virtio devices
>>> on Q35 iommu=on systems set up a PCI topology where the devices
>>> *aren't* behind the IOMMU or are protected by RMRRs or whatever.  Then
>>> old kernels would work correctly on new hosts, new kernels would work
>>> correctly except on old iommu-providing hosts, and Xen would work.
>>
>> I don't see a point in doing anything about old QEMU with IOMMU enabled
>> and virtio devices plugged except declaring such setups broken. No one
>> should have configured this for production purposes, only for test
>> setups (like us, with the knowledge about the limitations).
>>
> 
> I'm fine with that.  In fact, I proposed these patches before QEMU had
> this feature in the first place.
> 
>>>
>>> In fact, on Xen, it's impossible without colossal hacks to support
>>> non-IOMMU-respecting virtio devices because Xen acts as an
>>> intermediate IOMMU between the Linux dom0 guest and the actual host.
>>> The QEMU host doesn't even know that Xen is involved.  This is why Xen
>>> and virtio don't currently work together (without my patches): the
>>> device thinks it doesn't respect the IOMMU, the driver thinks the
>>> device doesn't respect the IOMMU, and they're both wrong.
>>>
>>> TL;DR: I think there are only two cases.  Either a device respects the
>>> IOMMU or a device doesn't know whether it respects the IOMMU.  The
>>> latter case is problematic.
>>
>> See above, the latter is only problematic on setups that actually use an
>> IOMMU. If that includes Xen, then no one should use it until virtio can
>> declare itself IOMMU compatible, and drivers exist that process this.
> 
> Xen works right now with my patches on standard QEMU (as long as
> iommu=off).  Certainly no one except me uses it now with virtio
> because it doesn't work with mainline kernels.
> 
> If we apply something similar enough to my patches, then even old
> hypervisors (e.g. Amazon's hardware virt systems) will support Xen
> with virtio devices passed in just fine.

Then it seems we can make everyone happy - perfect. :)

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 19:06                     ` Jan Kiszka
  2015-07-28 19:24                       ` Andy Lutomirski
@ 2015-07-28 19:24                       ` Andy Lutomirski
  2015-07-28 19:33                         ` Jan Kiszka
  2015-07-28 19:33                         ` Jan Kiszka
  1 sibling, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28 19:24 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On Tue, Jul 28, 2015 at 12:06 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2015-07-28 20:22, Andy Lutomirski wrote:
>> On Tue, Jul 28, 2015 at 10:17 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> On 2015-07-28 19:10, Andy Lutomirski wrote:
>>>> The trouble is that this is really a property of the bus and not of
>>>> the device.  If you build a virtio device that physically plugs into a
>>>> PCIe slot, the device has no concept of an IOMMU in the first place.
>>>
>>> If one would build a real virtio device today, it would be broken
>>> because every IOMMU would start to translate its requests. Already from
>>> that POV, we really need to introduce a feature flag "I will be
>>> IOMMU-translated" so that a potential physical implementation can carry
>>> it unconditionally.
>>>
>>
>> Except that, with my patches, it would work correctly.  ISTM the thing
>
> I haven't looked at your patches yet - they make the virtio PCI driver
> in Linux IOMMU-compatible? Perfect - except for a compatibility check,
> right?

Yes.  (virtio_pci_legacy, anyway.  Presumably virtio_pci_modern is
easy to adapt, too.)

>
>> that's broken right now is QEMU and the virtio_pci driver.  My patches
>> fix the driver.  Last year that would have been the end of the story
>> except for PPC.  Now we have to deal with QEMU.
>>
>>>> Similarly, if you take an L0-provided IOMMU-supporting device and pass
>>>> it through to L2 using current QEMU on L1 (with Q35 emulation and
>>>> iommu enabled), then, from L2's perspective, the device is 1:1 no
>>>> matter what the device thinks.
>>>>
>>>> IOW, I think the original design was wrong and now we have to deal
>>>> with it.  I think the best solution would be to teach QEMU to fix its
>>>> ACPI tables so that 1:1 virtio devices are actually exposed as 1:1.
>>>
>>> Only the current drivers are broken. And we can easily tell them apart
>>> from newer ones via feature flags. Sorry, don't get the problem.
>>
>> I still don't see how feature flags solve the problem.  Suppose we
>> added a feature flag meaning "respects IOMMU".
>>
>> Bad case 1:  Build a malicious device that advertises
>> non-IOMMU-respecting virtio.  Plug it in behind an IOMMU.  Host starts
>> leaking physical addresses to the device (and the device doesn't work,
>> of course).  Maybe that's only barely a security problem, but still...
>
> I don't see right now how critical such a hypothetical case could be.
> But the OS / its drivers could still decide to refuse talking to such a
> device.
>

How does OS know it's such a device as opposed to a QEMU-supplied thing?

>>
>> Bad case 3: Some hypothetical well-behaved new QEMU provides a virtio
>> device that *does* respect the IOMMU and sets the feature flag.  They
>> emulate Q35 with an IOMMU.  They boot Linux 4.1.  Data corruption in
>> the guest.
>
> No. In that case, the feature negotiation of "virtio-with-iommu-support"
> would have failed for older drivers, and the device would have never
> been used by the guest.

So are you suggesting that newer virtio devices always provide this
feature flag and, if supplied by QEMU with iommu=on, simply refuse to
operate if the driver doesn't support that flag?

That could work as long as QEMU with the current (broken?) iommu=on
never exposes such a device.

>
>>
>> We could make the rule that *all* virtio-pci devices (except on PPC)
>> respect the bus rules.  We'd have to fix QEMU so that virtio devices
>> on Q35 iommu=on systems set up a PCI topology where the devices
>> *aren't* behind the IOMMU or are protected by RMRRs or whatever.  Then
>> old kernels would work correctly on new hosts, new kernels would work
>> correctly except on old iommu-providing hosts, and Xen would work.
>
> I don't see a point in doing anything about old QEMU with IOMMU enabled
> and virtio devices plugged except declaring such setups broken. No one
> should have configured this for production purposes, only for test
> setups (like us, with the knowledge about the limitations).
>

I'm fine with that.  In fact, I proposed these patches before QEMU had
this feature in the first place.

>>
>> In fact, on Xen, it's impossible without colossal hacks to support
>> non-IOMMU-respecting virtio devices because Xen acts as an
>> intermediate IOMMU between the Linux dom0 guest and the actual host.
>> The QEMU host doesn't even know that Xen is involved.  This is why Xen
>> and virtio don't currently work together (without my patches): the
>> device thinks it doesn't respect the IOMMU, the driver thinks the
>> device doesn't respect the IOMMU, and they're both wrong.
>>
>> TL;DR: I think there are only two cases.  Either a device respects the
>> IOMMU or a device doesn't know whether it respects the IOMMU.  The
>> latter case is problematic.
>
> See above, the latter is only problematic on setups that actually use an
> IOMMU. If that includes Xen, then no one should use it until virtio can
> declare itself IOMMU compatible, and drivers exist that process this.

Xen works right now with my patches on standard QEMU (as long as
iommu=off).  Certainly no one except me uses it now with virtio
because it doesn't work with mainline kernels.

If we apply something similar enough to my patches, then even old
hypervisors (e.g. Amazon's hardware virt systems) will support Xen
with virtio devices passed in just fine.

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 18:22                   ` Andy Lutomirski
  2015-07-28 19:06                     ` Jan Kiszka
@ 2015-07-28 19:06                     ` Jan Kiszka
  2015-07-28 19:24                       ` Andy Lutomirski
  2015-07-28 19:24                       ` Andy Lutomirski
  1 sibling, 2 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-28 19:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On 2015-07-28 20:22, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 10:17 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2015-07-28 19:10, Andy Lutomirski wrote:
>>> The trouble is that this is really a property of the bus and not of
>>> the device.  If you build a virtio device that physically plugs into a
>>> PCIe slot, the device has no concept of an IOMMU in the first place.
>>
>> If one would build a real virtio device today, it would be broken
>> because every IOMMU would start to translate its requests. Already from
>> that POV, we really need to introduce a feature flag "I will be
>> IOMMU-translated" so that a potential physical implementation can carry
>> it unconditionally.
>>
> 
> Except that, with my patches, it would work correctly.  ISTM the thing

I haven't looked at your patches yet - they make the virtio PCI driver
in Linux IOMMU-compatible? Perfect - except for a compatibility check,
right?

> that's broken right now is QEMU and the virtio_pci driver.  My patches
> fix the driver.  Last year that would have been the end of the story
> except for PPC.  Now we have to deal with QEMU.
> 
>>> Similarly, if you take an L0-provided IOMMU-supporting device and pass
>>> it through to L2 using current QEMU on L1 (with Q35 emulation and
>>> iommu enabled), then, from L2's perspective, the device is 1:1 no
>>> matter what the device thinks.
>>>
>>> IOW, I think the original design was wrong and now we have to deal
>>> with it.  I think the best solution would be to teach QEMU to fix its
>>> ACPI tables so that 1:1 virtio devices are actually exposed as 1:1.
>>
>> Only the current drivers are broken. And we can easily tell them apart
>> from newer ones via feature flags. Sorry, don't get the problem.
> 
> I still don't see how feature flags solve the problem.  Suppose we
> added a feature flag meaning "respects IOMMU".
> 
> Bad case 1:  Build a malicious device that advertises
> non-IOMMU-respecting virtio.  Plug it in behind an IOMMU.  Host starts
> leaking physical addresses to the device (and the device doesn't work,
> of course).  Maybe that's only barely a security problem, but still...

I don't see right now how critical such a hypothetical case could be.
But the OS / its drivers could still decide to refuse talking to such a
device.

> 
> Bad case 2:  Use current QEMU w/ IOMMU enabled.  Assign a virtio
> device provided by L0 QEMU to L2.  L1 crashes.  I consider *that* to
> be a security problem, although in practice no one will configure
> their system that way because it has zero chance of actually working.
> Nonetheless, the device does work if L1 accesses it directly?  The
> issue is vfio doesn't notice that the device doesn't respect the IOMMU
> because "respects-IOMMU" is a property of the PCI bus and the platform
> IOMMU, and vfio assumes it works correctly.

I would have no problem with rejecting configurations in future QEMU
that try to expose unconfined virtio devices in the presence of IOMMU
emulation. Once we can do better, it's just about letting the guest know
about the difference.

The current situation is indeed just broken; we don't need to discuss
this, as we can't change history to prevent it.

> 
> Bad case 3: Some hypothetical well-behaved new QEMU provides a virtio
> device that *does* respect the IOMMU and sets the feature flag.  They
> emulate Q35 with an IOMMU.  They boot Linux 4.1.  Data corruption in
> the guest.

No. In that case, the feature negotiation of "virtio-with-iommu-support"
would have failed for older drivers, and the device would have never
been used by the guest.
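
For instance, with virtio-1.0 style negotiation the device can veto the
driver's feature subset through the FEATURES_OK status bit; a minimal
sketch with illustrative accessor names:

/* Sketch of the driver-side handshake; struct vdev and the accessors are
 * illustrative stand-ins for the transport-specific code. */
#include <stdint.h>

#define STATUS_FEATURES_OK 0x08   /* virtio 1.0 device status bit */

struct vdev;
uint8_t read_status(struct vdev *vd);                     /* illustrative */
void write_status(struct vdev *vd, uint8_t s);            /* illustrative */
void write_driver_features(struct vdev *vd, uint64_t f);  /* illustrative */

static int negotiate(struct vdev *vd, uint64_t driver_features)
{
    write_driver_features(vd, driver_features);
    write_status(vd, read_status(vd) | STATUS_FEATURES_OK);

    /* A device that requires the IOMMU flag leaves FEATURES_OK clear for
     * a driver that did not acknowledge it; the driver must then give up
     * instead of touching the virtqueues. */
    if (!(read_status(vd) & STATUS_FEATURES_OK))
        return -1;
    return 0;
}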

> 
> We could make the rule that *all* virtio-pci devices (except on PPC)
> respect the bus rules.  We'd have to fix QEMU so that virtio devices
> on Q35 iommu=on systems set up a PCI topology where the devices
> *aren't* behind the IOMMU or are protected by RMRRs or whatever.  Then
> old kernels would work correctly on new hosts, new kernels would work
> correctly except on old iommu-providing hosts, and Xen would work.

I don't see a point in doing anything about old QEMU with IOMMU enabled
and virtio devices plugged except declaring such setups broken. No one
should have configured this for production purposes, only for test
setups (like us, with the knowledge about the limitations).

> 
> In fact, on Xen, it's impossible without colossal hacks to support
> non-IOMMU-respecting virtio devices because Xen acts as an
> intermediate IOMMU between the Linux dom0 guest and the actual host.
> The QEMU host doesn't even know that Xen is involved.  This is why Xen
> and virtio don't currently work together (without my patches): the
> device thinks it doesn't respect the IOMMU, the driver thinks the
> device doesn't respect the IOMMU, and they're both wrong.
> 
> TL;DR: I think there are only two cases.  Either a device respects the
> IOMMU or a device doesn't know whether it respects the IOMMU.  The
> latter case is problematic.

See above, the latter is only problematic on setups that actually use an
IOMMU. If that includes Xen, then no one should use it until virtio can
declare itself IOMMU compatible, and drivers exist that process this.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 17:17                 ` Jan Kiszka
  2015-07-28 18:22                   ` Andy Lutomirski
@ 2015-07-28 18:22                   ` Andy Lutomirski
  2015-07-28 19:06                     ` Jan Kiszka
  2015-07-28 19:06                     ` Jan Kiszka
  1 sibling, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28 18:22 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On Tue, Jul 28, 2015 at 10:17 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2015-07-28 19:10, Andy Lutomirski wrote:
>> The trouble is that this is really a property of the bus and not of
>> the device.  If you build a virtio device that physically plugs into a
>> PCIe slot, the device has no concept of an IOMMU in the first place.
>
> If one would build a real virtio device today, it would be broken
> because every IOMMU would start to translate its requests. Already from
> that POV, we really need to introduce a feature flag "I will be
> IOMMU-translated" so that a potential physical implementation can carry
> it unconditionally.
>

Except that, with my patches, it would work correctly.  ISTM the thing
that's broken right now is QEMU and the virtio_pci driver.  My patches
fix the driver.  Last year that would have been the end of the story
except for PPC.  Now we have to deal with QEMU.

>> Similarly, if you take an L0-provided IOMMU-supporting device and pass
>> it through to L2 using current QEMU on L1 (with Q35 emulation and
>> iommu enabled), then, from L2's perspective, the device is 1:1 no
>> matter what the device thinks.
>>
>> IOW, I think the original design was wrong and now we have to deal
>> with it.  I think the best solution would be to teach QEMU to fix its
>> ACPI tables so that 1:1 virtio devices are actually exposed as 1:1.
>
> Only the current drivers are broken. And we can easily tell them apart
> from newer ones via feature flags. Sorry, don't get the problem.

I still don't see how feature flags solve the problem.  Suppose we
added a feature flag meaning "respects IOMMU".

Bad case 1:  Build a malicious device that advertises
non-IOMMU-respecting virtio.  Plug it in behind an IOMMU.  Host starts
leaking physical addresses to the device (and the device doesn't work,
of course).  Maybe that's only barely a security problem, but still...

Bad case 2:  Use current QEMU w/ IOMMU enabled.  Assign a virtio
device provided by L0 QEMU to L2.  L1 crashes.  I consider *that* to
be a security problem, although in practice no one will configure
their system that way because it has zero chance of actually working.
Nonetheless, the device does work if L1 accesses it directly?  The
issue is vfio doesn't notice that the device doesn't respect the IOMMU
because "respects-IOMMU" is a property of the PCI bus and the platform
IOMMU, and vfio assumes it works correctly.

Bad case 3: Some hypothetical well-behaved new QEMU provides a virtio
device that *does* respect the IOMMU and sets the feature flag.  They
emulate Q35 with an IOMMU.  They boot Linux 4.1.  Data corruption in
the guest.

We could make the rule that *all* virtio-pci devices (except on PPC)
respect the bus rules.  We'd have to fix QEMU so that virtio devices
on Q35 iommu=on systems set up a PCI topology where the devices
*aren't* behind the IOMMU or are protected by RMRRs or whatever.  Then
old kernels would work correctly on new hosts, new kernels would work
correctly except on old iommu-providing hosts, and Xen would work.

In fact, on Xen, it's impossible without colossal hacks to support
non-IOMMU-respecting virtio devices because Xen acts as an
intermediate IOMMU between the Linux dom0 guest and the actual host.
The QEMU host doesn't even know that Xen is involved.  This is why Xen
and virtio don't currently work together (without my patches): the
device thinks it doesn't respect the IOMMU, the driver thinks the
device doesn't respect the IOMMU, and they're both wrong.

TL;DR: I think there are only two cases.  Either a device respects the
IOMMU or a device doesn't know whether it respects the IOMMU.  The
latter case is problematic.
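
In driver terms, "respects the IOMMU" comes down to which address is put
into the vring descriptor.  A minimal sketch (a simplification, not the
actual virtio_ring.c changes):

/* Simplified sketch of the two address paths a vring entry could take. */
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/virtio.h>

static dma_addr_t addr_for_descriptor(struct virtio_device *vdev, void *buf,
                                      size_t len, bool use_dma_api)
{
    if (use_dma_api)
        /* Respects the IOMMU: ask the DMA API for a bus address. */
        return dma_map_single(vdev->dev.parent, buf, len,
                              DMA_BIDIRECTIONAL);

    /* Legacy assumption: guest-physical address == bus address. */
    return (dma_addr_t)virt_to_phys(buf);
}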

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 17:19                 ` Jan Kiszka
@ 2015-07-28 17:31                   ` Paolo Bonzini
  2015-07-28 17:31                   ` Paolo Bonzini
  1 sibling, 0 replies; 108+ messages in thread
From: Paolo Bonzini @ 2015-07-28 17:31 UTC (permalink / raw)
  To: Jan Kiszka, Michael S. Tsirkin
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	linux390, Linux Virtualization



On 28/07/2015 19:19, Jan Kiszka wrote:
> On 2015-07-28 19:15, Paolo Bonzini wrote:
>>
>>
>> On 28/07/2015 18:42, Jan Kiszka wrote:
>>>> On the other hand interrupt remapping is absolutely necessary for
>>>> production use, hence my point that x86 does not promise API stability.
>>>
>>> Well, we currently implement the features that the Q35 used to expose.
>>> Adding interrupt remapping will require a new chipset and/or a hack
>>> switch to ignore compatibility.
>>
>> Isn't the VT-d register space separate from other Q35 features and
>> backwards-compatible?  You could even add it to PIIX in theory just by
>> adding a DMAR.
> 
> Yes, it's practically working, but it's not accurate /wrt how that
> hardware looked like in reality.

We've done that for a long time.  Real PIIX3 didn't have ACPI either,
for example (and it had a USB UHCI that is optional in QEMU).

Of course I'm not advocating adding the IOMMU to PIIX (assuming that
would work even just practically)... but I don't think adding interrupt
remapping to Q35 is a big deal.  It would be optional, just in case you
want to debug something without interrupt remapping, but it can be added.

>>>> The Google patches for userspace PIC and IOAPIC are proceeding well, so
>>>> hopefully we can have interrupt remapping soon.
>>>
>>> If the day had 48 hours... I'd love to look into this, first adding QEMU
>>> support for the new irqchip architecture.
>>
>> I hope I can squeeze in some time for that...  Google also had an intern
>> that was looking at it.
> 
> Great!

In theory it's easy with the latest series.  All you need is support for
converting IOAPIC routes to KVM routes (and of course the glue code to
enable the capability and create the userspace devices); everything else
should work just by reusing the -machine kernel_irqchip=on code.  In
theory...

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 17:15               ` Paolo Bonzini
  2015-07-28 17:19                 ` Jan Kiszka
@ 2015-07-28 17:19                 ` Jan Kiszka
  2015-07-28 17:31                   ` Paolo Bonzini
  2015-07-28 17:31                   ` Paolo Bonzini
  1 sibling, 2 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-28 17:19 UTC (permalink / raw)
  To: Paolo Bonzini, Michael S. Tsirkin
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	linux390, Linux Virtualization

On 2015-07-28 19:15, Paolo Bonzini wrote:
> 
> 
> On 28/07/2015 18:42, Jan Kiszka wrote:
>>> On the other hand interrupt remapping is absolutely necessary for
>>> production use, hence my point that x86 does not promise API stability.
>>
>> Well, we currently implement the features that the Q35 used to expose.
>> Adding interrupt remapping will require a new chipset and/or a hack
>> switch to ignore compatibility.
> 
> Isn't the VT-d register space separate from other Q35 features and
> backwards-compatible?  You could even add it to PIIX in theory just by
> adding a DMAR.

Yes, it's practically working, but it's not accurate w.r.t. how that
hardware looked in reality.

> 
> It's not like for example SMRAM, where the registers are in the
> northbridge configuration space and move around in every chipset generation.
> 
>>> ("Any kind of stability" actually didn't include crashes; those are not
>>> expected :))
>>>
>>> The Google patches for userspace PIC and IOAPIC are proceeding well, so
>>> hopefully we can have interrupt remapping soon.
>>
>> If the day had 48 hours... I'd love to look into this, first adding QEMU
>> support for the new irqchip architecture.
> 
> I hope I can squeeze in some time for that...  Google also had an intern
> that was looking at it.

Great!

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 17:10               ` Andy Lutomirski
  2015-07-28 17:17                 ` Jan Kiszka
@ 2015-07-28 17:17                 ` Jan Kiszka
  2015-07-28 18:22                   ` Andy Lutomirski
  2015-07-28 18:22                   ` Andy Lutomirski
  1 sibling, 2 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-28 17:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On 2015-07-28 19:10, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 9:44 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> The ability to have virtio on systems with IOMMU in place makes testing
>> much more efficient for us. Ideally, we would have it in non-identity
>> mapping scenarios as well, e.g. to start secondary Linux instances in
>> the test VMs, giving them their own virtio devices. And we will
>> eventually have this need on ARM as well.
>>
>> Virtio needs to be backward compatible, so the change to put these
>> devices under IOMMU control could be advertised during feature
>> negotiations and controlled on QEMU side via a device property. Newer
>> guest drivers would have to acknowledge that they support virtio via
>> IOMMUs. Older ones would refuse to work, and the admin could instead
>> spawn VMs with this feature disabled.
>>
> 
> The trouble is that this is really a property of the bus and not of
> the device.  If you build a virtio device that physically plugs into a
> PCIe slot, the device has no concept of an IOMMU in the first place.

If one were to build a real virtio device today, it would be broken
because every IOMMU would start to translate its requests. From that
point of view alone, we really need to introduce a feature flag "I will
be IOMMU-translated" so that a potential physical implementation can
carry it unconditionally.

> Similarly, if you take an L0-provided IOMMU-supporting device and pass
> it through to L2 using current QEMU on L1 (with Q35 emulation and
> iommu enabled), then, from L2's perspective, the device is 1:1 no
> matter what the device thinks.
> 
> IOW, I think the original design was wrong and now we have to deal
> with it.  I think the best solution would be to teach QEMU to fix its
> ACPI tables so that 1:1 virtio devices are actually exposed as 1:1.

Only the current drivers are broken, and we can easily tell them apart
from newer ones via feature flags. Sorry, I don't get the problem.
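
To sketch what that negotiation could look like on the driver side (the
VIRTIO_F_IOMMU_TRANSLATED name and its bit number are hypothetical,
invented for this example; virtio_has_feature() is the existing helper):

/*
 * Hypothetical sketch of the feature-flag handshake discussed above;
 * VIRTIO_F_IOMMU_TRANSLATED stands in for the proposed "I will be
 * IOMMU-translated" flag and is not a real bit at the time of writing.
 */
#include <linux/errno.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

#define VIRTIO_F_IOMMU_TRANSLATED       33      /* made-up bit number */

static int check_iommu_translation(struct virtio_device *vdev,
                                   bool behind_iommu)
{
        if (!behind_iommu)
                return 0;       /* 1:1 device, nothing to negotiate */

        if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_TRANSLATED))
                return 0;       /* device and driver agree on bus addresses */

        /*
         * Old device (or old host) behind an IOMMU: refuse to drive it
         * rather than hand it untranslated guest-physical addresses.
         */
        dev_err(&vdev->dev, "device does not advertise IOMMU translation\n");
        return -ENODEV;
}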

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 16:42             ` Jan Kiszka
  2015-07-28 17:15               ` Paolo Bonzini
@ 2015-07-28 17:15               ` Paolo Bonzini
  2015-07-28 17:19                 ` Jan Kiszka
  2015-07-28 17:19                 ` Jan Kiszka
  1 sibling, 2 replies; 108+ messages in thread
From: Paolo Bonzini @ 2015-07-28 17:15 UTC (permalink / raw)
  To: Jan Kiszka, Michael S. Tsirkin
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	linux390, Linux Virtualization



On 28/07/2015 18:42, Jan Kiszka wrote:
> > On the other hand interrupt remapping is absolutely necessary for
> > production use, hence my point that x86 does not promise API stability.
> 
> Well, we currently implement the features that the Q35 used to expose.
> Adding interrupt remapping will require a new chipset and/or a hack
> switch to ignore compatibility.

Isn't the VT-d register space separate from other Q35 features and
backwards-compatible?  You could even add it to PIIX in theory just by
adding a DMAR.
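
For reference, a DMAR is a small table; roughly (the layout below is
recalled from the VT-d spec and the struct and field names are invented,
so double-check against the spec), the part a machine type would have to
generate looks like this:

/* Rough sketch of the ACPI DMAR table and one DRHD entry; names are
 * invented and the layout is from memory of the VT-d spec. */
#include <stdint.h>

struct acpi_header {                    /* common 36-byte ACPI table header */
        char     signature[4];          /* "DMAR" */
        uint32_t length;
        uint8_t  revision;
        uint8_t  checksum;
        char     oem_id[6];
        char     oem_table_id[8];
        uint32_t oem_revision;
        char     creator_id[4];
        uint32_t creator_revision;
} __attribute__((packed));

struct dmar_table {
        struct acpi_header hdr;
        uint8_t  host_address_width;    /* supported DMA address width, encoded per spec */
        uint8_t  flags;                 /* bit 0: interrupt remapping supported */
        uint8_t  reserved[10];
        /* followed by remapping structures such as the DRHD below */
} __attribute__((packed));

struct dmar_drhd {                      /* one remapping hardware unit (IOMMU) */
        uint16_t type;                  /* 0 = DRHD */
        uint16_t length;
        uint8_t  flags;                 /* bit 0: INCLUDE_PCI_ALL */
        uint8_t  reserved;
        uint16_t segment;               /* PCI segment number */
        uint64_t register_base;         /* MMIO base of the remapping unit */
        /* followed by device-scope entries naming the covered devices */
} __attribute__((packed));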

It's not like for example SMRAM, where the registers are in the
northbridge configuration space and move around in every chipset generation.

> > ("Any kind of stability" actually didn't include crashes; those are not
> > expected :))
> > 
> > The Google patches for userspace PIC and IOAPIC are proceeding well, so
> > hopefully we can have interrupt remapping soon.
> 
> If the day had 48 hours... I'd love to look into this, first adding QEMU
> support for the new irqchip architecture.

I hope I can squeeze in some time for that...  Google also had an intern
that was looking at it.

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 16:44             ` Jan Kiszka
  2015-07-28 17:10               ` Andy Lutomirski
@ 2015-07-28 17:10               ` Andy Lutomirski
  2015-07-28 17:17                 ` Jan Kiszka
  2015-07-28 17:17                 ` Jan Kiszka
  1 sibling, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28 17:10 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On Tue, Jul 28, 2015 at 9:44 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> The ability to have virtio on systems with IOMMU in place makes testing
> much more efficient for us. Ideally, we would have it in non-identity
> mapping scenarios as well, e.g. to start secondary Linux instances in
> the test VMs, giving them their own virtio devices. And we will
> eventually have this need on ARM as well.
>
> Virtio needs to be backward compatible, so the change to put these
> devices under IOMMU control could be advertised during feature
> negotiations and controlled on QEMU side via a device property. Newer
> guest drivers would have to acknowledge that they support virtio via
> IOMMUs. Older ones would refuse to work, and the admin could instead
> spawn VMs with this feature disabled.
>

The trouble is that this is really a property of the bus and not of
the device.  If you build a virtio device that physically plugs into a
PCIe slot, the device has no concept of an IOMMU in the first place.
Similarly, if you take an L0-provided IOMMU-supporting device and pass
it through to L2 using current QEMU on L1 (with Q35 emulation and
iommu enabled), then, from L2's perspective, the device is 1:1 no
matter what the device thinks.

IOW, I think the original design was wrong and now we have to deal
with it.  I think the best solution would be to teach QEMU to fix its
ACPI tables so that 1:1 virtio devices are actually exposed as 1:1.
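
As a guest-side illustration of what "actually exposed as 1:1" buys:
once the ACPI tables are truthful, generic code can tell whether
translation applies and the driver no longer has to guess.
device_iommu_mapped() is a helper from much newer kernels than this
thread, used here purely to illustrate the idea:

#include <linux/device.h>
#include <linux/pci.h>

/* Illustration: with truthful ACPI tables, "is this device behind an
 * IOMMU?" becomes a question the core code can answer. */
static bool virtio_pci_device_is_translated(struct pci_dev *pdev)
{
        return device_iommu_mapped(&pdev->dev);
}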

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 16:11           ` Andy Lutomirski
  2015-07-28 16:44             ` Jan Kiszka
@ 2015-07-28 16:44             ` Jan Kiszka
  2015-07-28 17:10               ` Andy Lutomirski
  2015-07-28 17:10               ` Andy Lutomirski
  1 sibling, 2 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-28 16:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On 2015-07-28 18:11, Andy Lutomirski wrote:
> On Jul 28, 2015 6:11 AM, "Jan Kiszka" <jan.kiszka@siemens.com> wrote:
>>
>> On 2015-07-28 15:06, Michael S. Tsirkin wrote:
>>> On Tue, Jul 28, 2015 at 02:46:20PM +0200, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 28/07/2015 12:12, Benjamin Herrenschmidt wrote:
>>>>>>> That is an experimental feature (it's x-iommu), so it can change.
>>>>>>>
>>>>>>> The plan was:
>>>>>>>
>>>>>>> - for PPC, virtio never honors IOMMU
>>>>>>>
>>>>>>> - for non-PPC, either have virtio always honor IOMMU, or enforce that
>>>>>>> virtio is not under IOMMU.
>>>>>>>
>>>>> I dislike having PPC special cased.
>>>>>
>>>>> In fact, today x86 guests also assume that virtio bypasses IOMMU I
>>>>> believe. In fact *all* guests do.
>>>>
>>>> This doesn't matter much, since the only guests that implement an IOMMU
>>>> in QEMU are (afaik) PPC and x86, and x86 does not yet promise any kind
>>>> of stability.
>>>
>>> Hmm I think Jan (cc) said it was already used out there.
>>
>> Yes, no known issues with vt-d emulation for almost a year now. Error
>> reporting could be improved, and interrupt remapping is still missing,
>> but those are minor issues in this context.
>>
>> In my testing setups, I also have virtio devices in use, passed through
>> to an L2 guest, but only in 1:1 mapping so that their broken IOMMU
>> support causes no practical problems.
>>
> 
> How are you getting 1:1 to work?  Is it something that L0 QEMU can
> advertise to L1?  If so, can we just do that unconditionally, which
> would make my patch work?

The guest hypervisor is Jailhouse, and the guest is the root cell that
loaded the hypervisor and thus continues with identity mappings. You
usually don't have a 1:1 mapping with other setups - maybe with some Xen
configuration? Dunno.

> 
> I have no objection to 1:1 devices in general.  It's only devices that
> the PCI code on the guest identifies as not 1:1 but that are
> nonetheless 1:1 that cause problems.

The ability to have virtio on systems with IOMMU in place makes testing
much more efficient for us. Ideally, we would have it in non-identity
mapping scenarios as well, e.g. to start secondary Linux instances in
the test VMs, giving them their own virtio devices. And we will
eventually have this need on ARM as well.

Virtio needs to be backward compatible, so the change to put these
devices under IOMMU control could be advertised during feature
negotiations and controlled on QEMU side via a device property. Newer
guest drivers would have to acknowledge that they support virtio via
IOMMUs. Older ones would refuse to work, and the admin could instead
spawn VMs with this feature disabled.
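
A toy model of that flow (plain C, nothing from QEMU; the property and
feature names are invented) showing why old drivers end up refused when
the admin enables the feature:

/*
 * Toy model of the negotiation described above: the host offers a
 * hypothetical "IOMMU-translated" bit only when the device property is
 * set, and a driver that does not acknowledge the bit is refused.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define F_IOMMU_TRANSLATED (UINT64_C(1) << 33)  /* made-up feature bit */

struct toy_device {
        bool     property_iommu;        /* the proposed QEMU device property */
        uint64_t offered;
        uint64_t acked;
};

static bool toy_negotiate(struct toy_device *dev, uint64_t driver_features)
{
        dev->offered = dev->property_iommu ? F_IOMMU_TRANSLATED : 0;
        dev->acked = dev->offered & driver_features;

        /* VM configured for translation but the (old) driver did not ack
         * the bit: fail the device rather than corrupt guest memory. */
        if (dev->property_iommu && !(dev->acked & F_IOMMU_TRANSLATED))
                return false;
        return true;
}

int main(void)
{
        struct toy_device dev = { .property_iommu = true };

        printf("old driver: %s\n", toy_negotiate(&dev, 0) ? "ok" : "refused");
        printf("new driver: %s\n",
               toy_negotiate(&dev, F_IOMMU_TRANSLATED) ? "ok" : "refused");
        return 0;
}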

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 16:36           ` Paolo Bonzini
@ 2015-07-28 16:42             ` Jan Kiszka
  2015-07-28 17:15               ` Paolo Bonzini
  2015-07-28 17:15               ` Paolo Bonzini
  2015-07-28 16:42             ` Jan Kiszka
  1 sibling, 2 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-28 16:42 UTC (permalink / raw)
  To: Paolo Bonzini, Michael S. Tsirkin
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	linux390, Linux Virtualization

On 2015-07-28 18:36, Paolo Bonzini wrote:
> On 28/07/2015 15:11, Jan Kiszka wrote:
>>>>>>
>>>>>> This doesn't matter much, since the only guests that implement an IOMMU
>>>>>> in QEMU are (afaik) PPC and x86, and x86 does not yet promise any kind
>>>>>> of stability.
>>>>
>>>> Hmm I think Jan (cc) said it was already used out there.
>> Yes, no known issues with vt-d emulation for almost a year now. Error
>> reporting could be improved, and interrupt remapping is still missing,
>> but those are minor issues in this context.
> 
> On the other hand interrupt remapping is absolutely necessary for
> production use, hence my point that x86 does not promise API stability.

Well, we currently implement the features that the Q35 used to expose.
Adding interrupt remapping will require a new chipset and/or a hack
switch to ignore compatibility.

> 
> ("Any kind of stability" actually didn't include crashes; those are not
> expected :))
> 
> The Google patches for userspace PIC and IOAPIC are proceeding well, so
> hopefully we can have interrupt remapping soon.

If the day had 48 hours... I'd love to look into this, first adding QEMU
support for the new irqchip architecture.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 13:11         ` Jan Kiszka
                             ` (2 preceding siblings ...)
  2015-07-28 16:36           ` Paolo Bonzini
@ 2015-07-28 16:36           ` Paolo Bonzini
  2015-07-28 16:42             ` Jan Kiszka
  2015-07-28 16:42             ` Jan Kiszka
  3 siblings, 2 replies; 108+ messages in thread
From: Paolo Bonzini @ 2015-07-28 16:36 UTC (permalink / raw)
  To: Jan Kiszka, Michael S. Tsirkin
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	linux390, Linux Virtualization



On 28/07/2015 15:11, Jan Kiszka wrote:
>>> >>
>>> >> This doesn't matter much, since the only guests that implement an IOMMU
>>> >> in QEMU are (afaik) PPC and x86, and x86 does not yet promise any kind
>>> >> of stability.
>> > 
>> > Hmm I think Jan (cc) said it was already used out there.
> Yes, no known issues with vt-d emulation for almost a year now. Error
> reporting could be improved, and interrupt remapping is still missing,
> but those are minor issues in this context.

On the other hand interrupt remapping is absolutely necessary for
production use, hence my point that x86 does not promise API stability.

("Any kind of stability" actually didn't include crashes; those are not
expected :))

The Google patches for userspace PIC and IOAPIC are proceeding well, so
hopefully we can have interrupt remapping soon.

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 13:11         ` Jan Kiszka
@ 2015-07-28 16:11           ` Andy Lutomirski
  2015-07-28 16:44             ` Jan Kiszka
  2015-07-28 16:44             ` Jan Kiszka
  2015-07-28 16:11           ` Andy Lutomirski
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28 16:11 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-s390, Benjamin Herrenschmidt, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, xen-devel, Christian Borntraeger,
	Paolo Bonzini, linux390, Linux Virtualization

On Jul 28, 2015 6:11 AM, "Jan Kiszka" <jan.kiszka@siemens.com> wrote:
>
> On 2015-07-28 15:06, Michael S. Tsirkin wrote:
> > On Tue, Jul 28, 2015 at 02:46:20PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 28/07/2015 12:12, Benjamin Herrenschmidt wrote:
> >>>>> That is an experimental feature (it's x-iommu), so it can change.
> >>>>>
> >>>>> The plan was:
> >>>>>
> >>>>> - for PPC, virtio never honors IOMMU
> >>>>>
> >>>>> - for non-PPC, either have virtio always honor IOMMU, or enforce that
> >>>>> virtio is not under IOMMU.
> >>>>>
> >>> I dislike having PPC special cased.
> >>>
> >>> In fact, today x86 guests also assume that virtio bypasses IOMMU I
> >>> believe. In fact *all* guests do.
> >>
> >> This doesn't matter much, since the only guests that implement an IOMMU
> >> in QEMU are (afaik) PPC and x86, and x86 does not yet promise any kind
> >> of stability.
> >
> > Hmm I think Jan (cc) said it was already used out there.
>
> Yes, no known issues with vt-d emulation for almost a year now. Error
> reporting could be improved, and interrupt remapping is still missing,
> but those are minor issues in this context.
>
> In my testing setups, I also have virtio devices in use, passed through
> to an L2 guest, but only in 1:1 mapping so that their broken IOMMU
> support causes no practical problems.
>

How are you getting 1:1 to work?  Is it something that L0 QEMU can
advertise to L1?  If so, can we just do that unconditionally, which
would make my patch work?

I have no objection to 1:1 devices in general.  It's only devices that
the PCI code on the guest identifies as not 1:1 but that are
nonetheless 1:1 that cause problems.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 13:06       ` Michael S. Tsirkin
@ 2015-07-28 13:11         ` Jan Kiszka
  2015-07-28 16:11           ` Andy Lutomirski
                             ` (3 more replies)
  2015-07-28 13:11         ` Jan Kiszka
  1 sibling, 4 replies; 108+ messages in thread
From: Jan Kiszka @ 2015-07-28 13:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, Paolo Bonzini
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	linux390, Linux Virtualization

On 2015-07-28 15:06, Michael S. Tsirkin wrote:
> On Tue, Jul 28, 2015 at 02:46:20PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 28/07/2015 12:12, Benjamin Herrenschmidt wrote:
>>>>> That is an experimental feature (it's x-iommu), so it can change.
>>>>>
>>>>> The plan was:
>>>>>
>>>>> - for PPC, virtio never honors IOMMU
>>>>>
>>>>> - for non-PPC, either have virtio always honor IOMMU, or enforce that
>>>>> virtio is not under IOMMU.
>>>>>
>>> I dislike having PPC special cased.
>>>
>>> In fact, today x86 guests also assume that virtio bypasses IOMMU I
>>> believe. In fact *all* guests do.
>>
>> This doesn't matter much, since the only guests that implement an IOMMU
>> in QEMU are (afaik) PPC and x86, and x86 does not yet promise any kind
>> of stability.
> 
> Hmm I think Jan (cc) said it was already used out there.

Yes, no known issues with vt-d emulation for almost a year now. Error
reporting could be improved, and interrupt remapping is still missing,
but those are minor issues in this context.

In my testing setups, I also have virtio devices in use, passed through
to an L2 guest, but only in 1:1 mapping so that their broken IOMMU
support causes no practical problems.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28  1:08 Andy Lutomirski
                   ` (3 preceding siblings ...)
  2015-07-28  8:16 ` Paolo Bonzini
@ 2015-07-28 13:08 ` Michael S. Tsirkin
  2015-07-28 13:08 ` Michael S. Tsirkin
  5 siblings, 0 replies; 108+ messages in thread
From: Michael S. Tsirkin @ 2015-07-28 13:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	Linux Virtualization, Christian Borntraeger, Paolo Bonzini,
	linux390, xen-devel

On Mon, Jul 27, 2015 at 06:08:59PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 1, 2014 at 10:39 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > This fixes virtio on Xen guests as well as on any other platform
> > that uses virtio_pci on which physical addresses don't match bus
> > addresses.
> >
> > This can be tested with:
> >
> >     virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
> >
> > using virtme from here:
> >
> >     https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
> >
> > Without these patches, the guest hangs forever.  With these patches,
> > everything works.
> >
> 
> Dusting off an ancient thread.
> 
> Now that the dust has accumulated^Wsettled, is it worth pursuing this?
>  I think the situation is considerably worse than it was when I
> originally wrote these patches: I think that QEMU now supports a nasty
> mode in which the guest's PCI bus appears to be behind an IOMMU but
> the virtio devices on that bus punch straight through that IOMMU.
> 
> I have a half-hearted port to modern kernels here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_ring_xen
> 
> I didn't implement DMA API access for virtio_pci_modern, and I have no
> idea what to do about detecting whether a given virtio device honors
> its IOMMU or not.
> 
> --Andy

It's worth thinking about. I'll need to measure the overhead of
supporting both modes - probably after I'm back from the KVM Forum.

-- 
MST

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 12:46     ` Paolo Bonzini
  2015-07-28 13:06       ` Michael S. Tsirkin
@ 2015-07-28 13:06       ` Michael S. Tsirkin
  2015-07-28 13:11         ` Jan Kiszka
  2015-07-28 13:11         ` Jan Kiszka
  1 sibling, 2 replies; 108+ messages in thread
From: Michael S. Tsirkin @ 2015-07-28 13:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	jan.kiszka, linux390, Linux Virtualization

On Tue, Jul 28, 2015 at 02:46:20PM +0200, Paolo Bonzini wrote:
> 
> 
> On 28/07/2015 12:12, Benjamin Herrenschmidt wrote:
> >> > That is an experimental feature (it's x-iommu), so it can change.
> >> > 
> >> > The plan was:
> >> > 
> >> > - for PPC, virtio never honors IOMMU
> >> > 
> >> > - for non-PPC, either have virtio always honor IOMMU, or enforce that
> >> > virtio is not under IOMMU.
> >> > 
> > I dislike having PPC special cased.
> > 
> > In fact, today x86 guests also assume that virtio bypasses IOMMU I
> > believe. In fact *all* guests do.
> 
> This doesn't matter much, since the only guests that implement an IOMMU
> in QEMU are (afaik) PPC and x86, and x86 does not yet promise any kind
> of stability.

Hmm I think Jan (cc) said it was already used out there.


> > I would much prefer if the information as to whether it honors or not
> > gets passed to the guest somewhat. My preference goes for passing it via
> > the virtio config space but there were objections that it should be a
> > bus property (which is tricky to do with PCI and doesn't properly
> > reflect the fact that in qemu you can mix & match IOMMU-honoring devices
> > and bypassing-virtio on the same bus). 
> 
> Yes, for example on x86 it must be passed through the DMAR table.
> virtio-pci device must have a separate DRHD for them.  In QEMU, you
> could add an "under-iommu" property to PCI bridges, and walk the
> hierarchy of bridges to build the DRHDs.
> 
> Paolo

-- 
MST

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28 10:12   ` Benjamin Herrenschmidt
@ 2015-07-28 12:46     ` Paolo Bonzini
  2015-07-28 13:06       ` Michael S. Tsirkin
  2015-07-28 13:06       ` Michael S. Tsirkin
  2015-07-28 12:46     ` Paolo Bonzini
  1 sibling, 2 replies; 108+ messages in thread
From: Paolo Bonzini @ 2015-07-28 12:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Andy Lutomirski, Christian Borntraeger, linux390,
	Linux Virtualization



On 28/07/2015 12:12, Benjamin Herrenschmidt wrote:
>> > That is an experimental feature (it's x-iommu), so it can change.
>> > 
>> > The plan was:
>> > 
>> > - for PPC, virtio never honors IOMMU
>> > 
>> > - for non-PPC, either have virtio always honor IOMMU, or enforce that
>> > virtio is not under IOMMU.
>> > 
> I dislike having PPC special cased.
> 
> In fact, today x86 guests also assume that virtio bypasses IOMMU I
> believe. In fact *all* guests do.

This doesn't matter much, since the only guests that implement an IOMMU
in QEMU are (afaik) PPC and x86, and x86 does not yet promise any kind
of stability.

> I would much prefer if the information as to whether it honors or not
> gets passed to the guest somewhat. My preference goes for passing it via
> the virtio config space but there were objections that it should be a
> bus property (which is tricky to do with PCI and doesn't properly
> reflect the fact that in qemu you can mix & match IOMMU-honoring devices
> and bypassing-virtio on the same bus). 

Yes, for example on x86 it must be passed through the DMAR table.
virtio-pci devices must have a separate DRHD of their own.  In QEMU, you
could add an "under-iommu" property to PCI bridges, and walk the
hierarchy of bridges to build the DRHDs.
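
A toy model of that walk (invented types and a guessed rule, nothing
from QEMU): mark bridges with an under-iommu flag and let a device's
position in the bridge hierarchy decide whether it lands in a DRHD
scope:

#include <stdbool.h>
#include <stdio.h>

struct toy_bridge {
        struct toy_bridge *parent;      /* NULL for the root bus */
        bool under_iommu;               /* the proposed bridge property */
};

struct toy_dev {
        const char *name;
        struct toy_bridge *bus;
};

/* One plausible rule: a device is translated only if every bridge on
 * the path to the root is marked under-iommu. */
static bool dev_under_iommu(const struct toy_dev *dev)
{
        struct toy_bridge *b;

        for (b = dev->bus; b; b = b->parent)
                if (!b->under_iommu)
                        return false;
        return true;
}

int main(void)
{
        struct toy_bridge root = { .parent = NULL, .under_iommu = true };
        struct toy_bridge bypass = { .parent = &root, .under_iommu = false };
        struct toy_dev nic = { "virtio-net (1:1)", &bypass };
        struct toy_dev blk = { "virtio-blk (translated)", &root };

        /* Translated devices would be listed in a DRHD device scope;
         * the 1:1 ones would be left out of every DRHD. */
        printf("%s -> %s\n", nic.name, dev_under_iommu(&nic) ? "DRHD" : "bypass");
        printf("%s -> %s\n", blk.name, dev_under_iommu(&blk) ? "DRHD" : "bypass");
        return 0;
}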

Paolo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28  8:16 ` Paolo Bonzini
  2015-07-28 10:12   ` Benjamin Herrenschmidt
@ 2015-07-28 10:12   ` Benjamin Herrenschmidt
  2015-07-28 12:46     ` Paolo Bonzini
  2015-07-28 12:46     ` Paolo Bonzini
  1 sibling, 2 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-28 10:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	Andy Lutomirski, Christian Borntraeger, linux390,
	Linux Virtualization

On Tue, 2015-07-28 at 10:16 +0200, Paolo Bonzini wrote:
> 
> On 28/07/2015 03:08, Andy Lutomirski wrote:
> > On Mon, Sep 1, 2014 at 10:39 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> This fixes virtio on Xen guests as well as on any other platform
> >> that uses virtio_pci on which physical addresses don't match bus
> >> addresses.
> >>
> >> This can be tested with:
> >>
> >>     virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
> >>
> >> using virtme from here:
> >>
> >>     https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
> >>
> >> Without these patches, the guest hangs forever.  With these patches,
> >> everything works.
> >>
> > 
> > Dusting off an ancient thread.
> > 
> > Now that the dust has accumulated^Wsettled, is it worth pursuing this?
> >  I think the situation is considerably worse than it was when I
> > originally wrote these patches: I think that QEMU now supports a nasty
> > mode in which the guest's PCI bus appears to be behind an IOMMU but
> > the virtio devices on that bus punch straight through that IOMMU.
> 
> That is an experimental feature (it's x-iommu), so it can change.
> 
> The plan was:
> 
> - for PPC, virtio never honors IOMMU
> 
> - for non-PPC, either have virtio always honor IOMMU, or enforce that
> virtio is not under IOMMU.
> 

I dislike having PPC special cased.

In fact, today x86 guests also assume that virtio bypasses the IOMMU,
I believe. In fact *all* guests do.

I would much prefer if the information as to whether it honors the IOMMU
or not gets passed to the guest somehow. My preference goes for passing it
via the virtio config space, but there were objections that it should be a
bus property (which is tricky to do with PCI and doesn't properly
reflect the fact that in qemu you can mix & match IOMMU-honoring devices
and IOMMU-bypassing virtio devices on the same bus).
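
As a hypothetical guest-side sketch of the config-space idea (no such field
exists in the spec; VIRTIO_CFG_DMA_TRANSLATED and vring_use_dma_api() are
made-up names), the device would expose one config byte saying whether its
DMA is translated, and the driver would only go through the DMA API when
that byte is set:

    /* Hypothetical sketch -- the config offset below does not exist. */
    #include <linux/types.h>
    #include <linux/virtio.h>
    #include <linux/virtio_config.h>

    #define VIRTIO_CFG_DMA_TRANSLATED  0xf0   /* invented config-space offset */

    static bool vring_use_dma_api(struct virtio_device *vdev)
    {
        /* virtio_cread8() reads a single byte of device config space. */
        return virtio_cread8(vdev, VIRTIO_CFG_DMA_TRANSLATED) != 0;
    }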

Ben.

> Paolo
> 
> > I have a half-hearted port to modern kernels here:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_ring_xen
> > 
> > I didn't implement DMA API access for virtio_pci_modern, and I have no
> > idea what to do about detecting whether a given virtio device honors
> > its IOMMU or not.
> > 
> > --Andy
> > 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28  1:08 Andy Lutomirski
                   ` (2 preceding siblings ...)
  2015-07-28  8:16 ` Paolo Bonzini
@ 2015-07-28  8:16 ` Paolo Bonzini
  2015-07-28 10:12   ` Benjamin Herrenschmidt
  2015-07-28 10:12   ` Benjamin Herrenschmidt
  2015-07-28 13:08 ` Michael S. Tsirkin
  2015-07-28 13:08 ` Michael S. Tsirkin
  5 siblings, 2 replies; 108+ messages in thread
From: Paolo Bonzini @ 2015-07-28  8:16 UTC (permalink / raw)
  To: Andy Lutomirski, Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	Linux Virtualization, Christian Borntraeger, linux390, xen-devel



On 28/07/2015 03:08, Andy Lutomirski wrote:
> On Mon, Sep 1, 2014 at 10:39 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> This fixes virtio on Xen guests as well as on any other platform
>> that uses virtio_pci on which physical addresses don't match bus
>> addresses.
>>
>> This can be tested with:
>>
>>     virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
>>
>> using virtme from here:
>>
>>     https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
>>
>> Without these patches, the guest hangs forever.  With these patches,
>> everything works.
>>
> 
> Dusting off an ancient thread.
> 
> Now that the dust has accumulated^Wsettled, is it worth pursuing this?
>  I think the situation is considerably worse than it was when I
> originally wrote these patches: I think that QEMU now supports a nasty
> mode in which the guest's PCI bus appears to be behind an IOMMU but
> the virtio devices on that bus punch straight through that IOMMU.

That is an experimental feature (it's x-iommu), so it can change.

The plan was:

- for PPC, virtio never honors IOMMU

- for non-PPC, either have virtio always honor IOMMU, or enforce that
virtio is not under IOMMU.

Paolo

> I have a half-hearted port to modern kernels here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_ring_xen
> 
> I didn't implement DMA API access for virtio_pci_modern, and I have no
> idea what to do about detecting whether a given virtio device honors
> its IOMMU or not.
> 
> --Andy
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
  2015-07-28  1:08 Andy Lutomirski
@ 2015-07-28  7:05 ` Christian Borntraeger
  2015-07-28  7:05 ` Christian Borntraeger
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 108+ messages in thread
From: Christian Borntraeger @ 2015-07-28  7:05 UTC (permalink / raw)
  To: Andy Lutomirski, Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	Linux Virtualization, Paolo Bonzini, linux390, xen-devel

Am 28.07.2015 um 03:08 schrieb Andy Lutomirski:
> On Mon, Sep 1, 2014 at 10:39 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> This fixes virtio on Xen guests as well as on any other platform
>> that uses virtio_pci on which physical addresses don't match bus
>> addresses.
>>
>> This can be tested with:
>>
>>     virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
>>
>> using virtme from here:
>>
>>     https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
>>
>> Without these patches, the guest hangs forever.  With these patches,
>> everything works.
>>
> 
> Dusting off an ancient thread.
> 
> Now that the dust has accumulated^Wsettled, is it worth pursuing this?
>  I think the situation is considerably worse than it was when I
> originally wrote these patches: I think that QEMU now supports a nasty
> mode in which the guest's PCI bus appears to be behind an IOMMU but
> the virtio devices on that bus punch straight through that IOMMU.
> 
> I have a half-hearted port to modern kernels here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_ring_xen
> 
> I didn't implement DMA API access for virtio_pci_modern, and I have no
> idea what to do about detecting whether a given virtio device honors
> its IOMMU or not.

I think it's really tricky.

Looking at where virtio came from, the virtio ring was always accessed natively,
without an IOMMU. This was true for the early lguest things and then the early s390
transport (which is quite close to the lguest interface). virtio-pci used the same
scheme, ignoring all IOMMU considerations.

I understand that for PCI we actually might want to follow IOMMU restrictions from
a correctness and security point of view, while from the ccw point of view we do not.
No idea about virtio-mmio.

I think the proper way of handling this is to take this to the TC for virtio - I don't
know what would be the right thing to do. A feature bit, always IOMMU for PCI, something
else?
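
A kernel-style sketch of the "feature bit" option (VIRTIO_F_PLATFORM_DMA and
vring_map_buf() are invented here; no such bit exists in the spec at this
point, and the bit number is arbitrary): if the device advertises that its
DMA is translated, the ring code maps buffers through the DMA API, otherwise
it keeps handing the host guest-physical addresses as all transports do today.

    /* Sketch only -- the feature bit below is invented. */
    #include <linux/dma-mapping.h>
    #include <linux/io.h>
    #include <linux/virtio.h>
    #include <linux/virtio_config.h>

    #define VIRTIO_F_PLATFORM_DMA  33   /* invented feature bit */

    static dma_addr_t vring_map_buf(struct virtio_device *vdev, void *buf,
                                    size_t len, enum dma_data_direction dir)
    {
        if (virtio_has_feature(vdev, VIRTIO_F_PLATFORM_DMA))
            return dma_map_single(vdev->dev.parent, buf, len, dir);

        /* Legacy behaviour: bus address == guest physical address. */
        return virt_to_phys(buf);
    }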

Michael, Conny, 

do you agree?

Christian

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
@ 2015-07-28  1:08 Andy Lutomirski
  2015-07-28  7:05 ` Christian Borntraeger
                   ` (5 more replies)
  0 siblings, 6 replies; 108+ messages in thread
From: Andy Lutomirski @ 2015-07-28  1:08 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390, Andy Lutomirski

On Mon, Sep 1, 2014 at 10:39 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> This fixes virtio on Xen guests as well as on any other platform
> that uses virtio_pci on which physical addresses don't match bus
> addresses.
>
> This can be tested with:
>
>     virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
>
> using virtme from here:
>
>     https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
>
> Without these patches, the guest hangs forever.  With these patches,
> everything works.
>

Dusting off an ancient thread.

Now that the dust has accumulated^Wsettled, is it worth pursuing this?
 I think the situation is considerably worse than it was when I
originally wrote these patches: I think that QEMU now supports a nasty
mode in which the guest's PCI bus appears to be behind an IOMMU but
the virtio devices on that bus punch straight through that IOMMU.

I have a half-hearted port to modern kernels here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_ring_xen

I didn't implement DMA API access for virtio_pci_modern, and I have no
idea what to do about detecting whether a given virtio device honors
its IOMMU or not.
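
For context, a stripped-down sketch of what "DMA API access" means for the
ring itself (simplified, not the actual patch; vring_alloc_queue() is just an
illustrative name): the queue memory comes from the DMA API, so the device is
given a dma_addr_t instead of the driver assuming that bus addresses equal
guest physical addresses, which is exactly what breaks on Xen.

    /* Simplified sketch -- not the actual patch. */
    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>

    static void *vring_alloc_queue(struct device *dma_dev, size_t size,
                                   dma_addr_t *dma_handle)
    {
        /* On Xen, or behind a real IOMMU, *dma_handle may differ from
         * virt_to_phys() of the returned buffer. */
        return dma_alloc_coherent(dma_dev, size, dma_handle, GFP_KERNEL);
    }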

--Andy

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2015-07-29  9:21 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-01 17:39 [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Andy Lutomirski
2014-09-01 17:39 ` [PATCH v4 1/4] virtio_ring: Support DMA APIs if requested Andy Lutomirski
2014-09-01 17:39 ` [PATCH v4 2/4] virtio_pci: Use the DMA API for virtqueues Andy Lutomirski
2014-09-01 17:39 ` [PATCH v4 3/4] virtio_net: Don't set the end flag on reusable sg entries Andy Lutomirski
2014-09-01 17:39 ` [PATCH v4 4/4] virtio_net: Stop doing DMA from the stack Andy Lutomirski
2014-09-01 22:16 ` [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API Benjamin Herrenschmidt
2014-09-02  5:55   ` Andy Lutomirski
2014-09-02 20:53     ` Benjamin Herrenschmidt
2014-09-02 20:56       ` Konrad Rzeszutek Wilk
2014-09-02 21:08         ` Benjamin Herrenschmidt
2014-09-02 21:37       ` Andy Lutomirski
2014-09-02 22:10         ` Benjamin Herrenschmidt
2014-09-02 23:11           ` Andy Lutomirski
2014-09-02 23:20             ` Benjamin Herrenschmidt
2014-09-02 23:42               ` Andy Lutomirski
2014-09-03  0:25                 ` Benjamin Herrenschmidt
2014-09-03  0:32                   ` Andy Lutomirski
2014-09-03  0:43                     ` Benjamin Herrenschmidt
2014-09-04  2:03                       ` Andy Lutomirski
2014-09-03  7:47                   ` Paolo Bonzini
2014-09-03  7:52                     ` Andy Lutomirski
2014-09-03  8:01                       ` Paolo Bonzini
2014-09-03  8:05                     ` Benjamin Herrenschmidt
2014-09-03 12:11                       ` Paolo Bonzini
2014-09-03 15:07                         ` Andy Lutomirski
2014-09-03 15:11                           ` Paolo Bonzini
2014-09-03 16:39                           ` Michael S. Tsirkin
2014-09-03 20:38                             ` Andy Lutomirski
2014-09-03  7:43               ` Paolo Bonzini
2014-09-03  6:42         ` Rusty Russell
2014-09-03  7:50           ` Andy Lutomirski
2014-09-05  2:31             ` Rusty Russell
2014-09-05  2:57               ` Andy Lutomirski
2014-09-05  5:20                 ` Benjamin Herrenschmidt
2014-09-05  7:33                 ` Christian Borntraeger
2014-09-10 15:36                 ` Christopher Covington
2014-09-10 16:15                   ` Andy Lutomirski
2014-09-05  5:16               ` Benjamin Herrenschmidt
2014-09-14  8:58               ` Michael S. Tsirkin
2014-09-03 12:51           ` Michael S. Tsirkin
2014-09-05  2:32             ` Rusty Russell
2014-09-05  3:06               ` Andy Lutomirski
2014-09-02 21:10     ` Michael S. Tsirkin
2014-09-02 21:49       ` Andy Lutomirski
2015-07-28  1:08 Andy Lutomirski
2015-07-28  7:05 ` Christian Borntraeger
2015-07-28  7:05 ` Christian Borntraeger
2015-07-28  8:16 ` Paolo Bonzini
2015-07-28  8:16 ` Paolo Bonzini
2015-07-28 10:12   ` Benjamin Herrenschmidt
2015-07-28 10:12   ` Benjamin Herrenschmidt
2015-07-28 12:46     ` Paolo Bonzini
2015-07-28 13:06       ` Michael S. Tsirkin
2015-07-28 13:06       ` Michael S. Tsirkin
2015-07-28 13:11         ` Jan Kiszka
2015-07-28 16:11           ` Andy Lutomirski
2015-07-28 16:44             ` Jan Kiszka
2015-07-28 16:44             ` Jan Kiszka
2015-07-28 17:10               ` Andy Lutomirski
2015-07-28 17:10               ` Andy Lutomirski
2015-07-28 17:17                 ` Jan Kiszka
2015-07-28 17:17                 ` Jan Kiszka
2015-07-28 18:22                   ` Andy Lutomirski
2015-07-28 18:22                   ` Andy Lutomirski
2015-07-28 19:06                     ` Jan Kiszka
2015-07-28 19:06                     ` Jan Kiszka
2015-07-28 19:24                       ` Andy Lutomirski
2015-07-28 19:24                       ` Andy Lutomirski
2015-07-28 19:33                         ` Jan Kiszka
2015-07-28 21:16                           ` Andy Lutomirski
2015-07-28 21:16                           ` Andy Lutomirski
2015-07-28 22:43                             ` Andy Lutomirski
2015-07-28 22:43                             ` Andy Lutomirski
2015-07-28 23:21                               ` Benjamin Herrenschmidt
2015-07-28 23:33                                 ` Andy Lutomirski
2015-07-28 23:33                                 ` Andy Lutomirski
2015-07-29  0:36                                   ` Benjamin Herrenschmidt
2015-07-29  0:36                                   ` Benjamin Herrenschmidt
2015-07-29  0:47                                     ` Andy Lutomirski
2015-07-29  0:47                                     ` Andy Lutomirski
2015-07-29  0:54                                       ` Benjamin Herrenschmidt
2015-07-29  0:54                                       ` Benjamin Herrenschmidt
2015-07-29  8:17                                       ` Paolo Bonzini
2015-07-29  8:20                                         ` Jan Kiszka
2015-07-29  8:20                                         ` Jan Kiszka
2015-07-29  9:21                                         ` Benjamin Herrenschmidt
2015-07-29  9:21                                         ` Benjamin Herrenschmidt
2015-07-29  8:17                                       ` Paolo Bonzini
2015-07-29  8:07                                 ` Jan Kiszka
2015-07-29  8:07                                 ` Jan Kiszka
2015-07-28 23:21                               ` Benjamin Herrenschmidt
2015-07-28 19:33                         ` Jan Kiszka
2015-07-28 16:11           ` Andy Lutomirski
2015-07-28 16:36           ` Paolo Bonzini
2015-07-28 16:36           ` Paolo Bonzini
2015-07-28 16:42             ` Jan Kiszka
2015-07-28 17:15               ` Paolo Bonzini
2015-07-28 17:15               ` Paolo Bonzini
2015-07-28 17:19                 ` Jan Kiszka
2015-07-28 17:19                 ` Jan Kiszka
2015-07-28 17:31                   ` Paolo Bonzini
2015-07-28 17:31                   ` Paolo Bonzini
2015-07-28 16:42             ` Jan Kiszka
2015-07-28 13:11         ` Jan Kiszka
2015-07-28 12:46     ` Paolo Bonzini
2015-07-28 13:08 ` Michael S. Tsirkin
2015-07-28 13:08 ` Michael S. Tsirkin
2015-07-28  1:08 Andy Lutomirski
