linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/6] virtio core DMA API conversion
@ 2015-10-30  1:09 Andy Lutomirski
  2015-10-30  1:09 ` [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack Andy Lutomirski
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization, Andy Lutomirski

This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it appears to work on native and Xen x86_64
using both the modern 1.0 and legacy variants of virtio-pci, and it
also appears to work on arm and arm64.

It definitely won't work as-is on s390x, and I haven't been able to
test Christian's patches because I can't get virtio-ccw to work in
QEMU at all.  I don't know what I'm doing wrong.

It doesn't work on ppc64.  Ben, consider yourself pinged to send me
a patch :)

It doesn't work on sparc64.  I didn't realize at Kernel Summit that
sparc64 has the same problem as ppc64.

DaveM, for background, we're trying to fix virtio to use the DMA
API.  That will require that every platform that uses virtio
supplies valid DMA operations on devices that use virtio_ring.
Unfortunately, QEMU historically ignores the IOMMU on virtio
devices.

On x86, this isn't really a problem.  x86 has a nice way for the
platform to describe which devices are behind an IOMMU, and QEMU
will be adjusted accordingly.  The only thing that will break is a
recently-added experimental mode.

Ben's plan for powerpc is to add a quirk for existing virtio-pci
devices and to eventually update the devicetree stuff to allow QEMU
to tell the guest which devices use the IOMMU.

AFAICT sparc has a similar problem to powerpc.  DaveM, can you come
up with a straightforward way to get sparc's DMA API to work
correctly for virtio-pci devices?

NB: Sadly, the platforms I've successfully tested on don't include any
big-endian platforms, so there could still be lurking endian problems.

Changes from v3:
 - More big-endian fixes.
 - Added better virtio-ring APIs that handle allocation and use them in
   virtio-mmio and virtio-pci.
 - Switch to Michael's virtio-net patch.

Changes from v2:
 - Fix vring_mapping_error incorrect argument

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (5):
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API
  virtio: Add improved queue allocation API
  virtio_mmio: Use the DMA API
  virtio_pci: Use the DMA API

Michael S. Tsirkin (1):
  virtio-net: Stop doing DMA from the stack

 drivers/net/virtio_net.c           |  34 ++--
 drivers/virtio/Kconfig             |   2 +-
 drivers/virtio/virtio_mmio.c       |  67 ++-----
 drivers/virtio/virtio_pci_common.h |   6 -
 drivers/virtio/virtio_pci_legacy.c |  42 ++---
 drivers/virtio/virtio_pci_modern.c |  61 ++-----
 drivers/virtio/virtio_ring.c       | 348 ++++++++++++++++++++++++++++++-------
 include/linux/virtio.h             |  23 ++-
 include/linux/virtio_ring.h        |  35 ++++
 tools/virtio/linux/dma-mapping.h   |  17 ++
 10 files changed, 426 insertions(+), 209 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
@ 2015-10-30  1:09 ` Andy Lutomirski
  2015-10-30 13:55   ` Christian Borntraeger
  2015-10-30  1:09 ` [PATCH v4 2/6] virtio_ring: Support DMA APIs Andy Lutomirski
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization, Andy Lutomirski

From: "Michael S. Tsirkin" <mst@redhat.com>

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

[I wrote the subject and commit message.  mst wrote the code. --luto]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..f94ab786088f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -140,6 +140,12 @@ struct virtnet_info {
 
 	/* CPU hot plug notifier */
 	struct notifier_block nb;
+
+	/* Control VQ buffers: protected by the rtnl lock */
+	struct virtio_net_ctrl_hdr ctrl_hdr;
+	virtio_net_ctrl_ack ctrl_status;
+	u8 ctrl_promisc;
+	u8 ctrl_allmulti;
 };
 
 struct padded_vnet_hdr {
@@ -976,31 +982,30 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 				 struct scatterlist *out)
 {
 	struct scatterlist *sgs[4], hdr, stat;
-	struct virtio_net_ctrl_hdr ctrl;
-	virtio_net_ctrl_ack status = ~0;
 	unsigned out_num = 0, tmp;
 
 	/* Caller should know better */
 	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-	ctrl.class = class;
-	ctrl.cmd = cmd;
+	vi->ctrl_status = ~0;
+	vi->ctrl_hdr.class = class;
+	vi->ctrl_hdr.cmd = cmd;
 	/* Add header */
-	sg_init_one(&hdr, &ctrl, sizeof(ctrl));
+	sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr));
 	sgs[out_num++] = &hdr;
 
 	if (out)
 		sgs[out_num++] = out;
 
 	/* Add return status. */
-	sg_init_one(&stat, &status, sizeof(status));
+	sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status));
 	sgs[out_num] = &stat;
 
 	BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
 	virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
 	if (unlikely(!virtqueue_kick(vi->cvq)))
-		return status == VIRTIO_NET_OK;
+		return vi->ctrl_status == VIRTIO_NET_OK;
 
 	/* Spin for a response, the kick causes an ioport write, trapping
 	 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1014,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 	       !virtqueue_is_broken(vi->cvq))
 		cpu_relax();
 
-	return status == VIRTIO_NET_OK;
+	return vi->ctrl_status == VIRTIO_NET_OK;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1156,6 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg[2];
-	u8 promisc, allmulti;
 	struct virtio_net_ctrl_mac *mac_data;
 	struct netdev_hw_addr *ha;
 	int uc_count;
@@ -1163,22 +1167,22 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
 		return;
 
-	promisc = ((dev->flags & IFF_PROMISC) != 0);
-	allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+	vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0);
+	vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
 
-	sg_init_one(sg, &promisc, sizeof(promisc));
+	sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-			 promisc ? "en" : "dis");
+			 vi->ctrl_promisc ? "en" : "dis");
 
-	sg_init_one(sg, &allmulti, sizeof(allmulti));
+	sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-			 allmulti ? "en" : "dis");
+			 vi->ctrl_allmulti ? "en" : "dis");
 
 	uc_count = netdev_uc_count(dev);
 	mc_count = netdev_mc_count(dev);
-- 
2.4.3



* [PATCH v4 2/6] virtio_ring: Support DMA APIs
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
  2015-10-30  1:09 ` [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack Andy Lutomirski
@ 2015-10-30  1:09 ` Andy Lutomirski
  2015-10-30 12:01   ` Cornelia Huck
  2015-10-30  1:09 ` [PATCH v4 3/6] virtio_pci: Use the DMA API Andy Lutomirski
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization, Andy Lutomirski

virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/virtio/Kconfig           |   2 +-
 drivers/virtio/virtio_ring.c     | 190 +++++++++++++++++++++++++++++++--------
 tools/virtio/linux/dma-mapping.h |  17 ++++
 3 files changed, 172 insertions(+), 37 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
 	tristate "Platform bus driver for memory mapped virtio devices"
-	depends on HAS_IOMEM
+	depends on HAS_IOMEM && HAS_DMA
  	select VIRTIO
  	---help---
  	 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..a872eb89587f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include <linux/module.h>
 #include <linux/hrtimer.h>
 #include <linux/kmemleak.h>
+#include <linux/dma-mapping.h>
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+	void *data;			/* Data for callback. */
+	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
 	struct virtqueue vq;
 
 	/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
 	ktime_t last_add_time;
 #endif
 
-	/* Tokens for callbacks. */
-	void *data[];
+	/* Per-descriptor state. */
+	struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+static struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+	return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+				   struct scatterlist *sg,
+				   enum dma_data_direction direction)
+{
+	/*
+	 * We can't use dma_map_sg, because we don't use scatterlists in
+	 * the way it expects (we don't guarantee that the scatterlist
+	 * will exist for the lifetime of the mapping).
+	 */
+	return dma_map_page(vring_dma_dev(vq),
+			    sg_page(sg), sg->offset, sg->length,
+			    direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+				   void *cpu_addr, size_t size,
+				   enum dma_data_direction direction)
+{
+	return dma_map_single(vring_dma_dev(vq),
+			      cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+			    struct vring_desc *desc)
+{
+	u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+	if (flags & VRING_DESC_F_INDIRECT) {
+		dma_unmap_single(vring_dma_dev(vq),
+				 virtio64_to_cpu(vq->vq.vdev, desc->addr),
+				 virtio32_to_cpu(vq->vq.vdev, desc->len),
+				 (flags & VRING_DESC_F_WRITE) ?
+				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	} else {
+		dma_unmap_page(vring_dma_dev(vq),
+			       virtio64_to_cpu(vq->vq.vdev, desc->addr),
+			       virtio32_to_cpu(vq->vq.vdev, desc->len),
+			       (flags & VRING_DESC_F_WRITE) ?
+			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	}
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+			       dma_addr_t addr)
+{
+	return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 					 unsigned int total_sg, gfp_t gfp)
 {
@@ -131,7 +198,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
 	struct vring_desc *desc;
-	unsigned int i, n, avail, descs_used, uninitialized_var(prev);
+	unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx;
 	int head;
 	bool indirect;
 
@@ -171,21 +238,15 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
 	if (desc) {
 		/* Use a single buffer which doesn't continue */
-		vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT);
-		vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, virt_to_phys(desc));
-		/* avoid kmemleak false positive (hidden by virt_to_phys) */
-		kmemleak_ignore(desc);
-		vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, total_sg * sizeof(struct vring_desc));
-
+		indirect = true;
 		/* Set up rest to use this indirect table. */
 		i = 0;
 		descs_used = 1;
-		indirect = true;
 	} else {
+		indirect = false;
 		desc = vq->vring.desc;
 		i = head;
 		descs_used = total_sg;
-		indirect = false;
 	}
 
 	if (vq->vq.num_free < descs_used) {
@@ -200,13 +261,14 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 		return -ENOSPC;
 	}
 
-	/* We're about to use some buffers from the free list. */
-	vq->vq.num_free -= descs_used;
-
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
+			dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
+			if (vring_mapping_error(vq, addr))
+				goto unmap_release;
+
 			desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT);
-			desc[i].addr = cpu_to_virtio64(_vq->vdev, sg_phys(sg));
+			desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
 			desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
 			prev = i;
 			i = virtio16_to_cpu(_vq->vdev, desc[i].next);
@@ -214,8 +276,12 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
+			dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
+			if (vring_mapping_error(vq, addr))
+				goto unmap_release;
+
 			desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT | VRING_DESC_F_WRITE);
-			desc[i].addr = cpu_to_virtio64(_vq->vdev, sg_phys(sg));
+			desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
 			desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
 			prev = i;
 			i = virtio16_to_cpu(_vq->vdev, desc[i].next);
@@ -224,14 +290,33 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* Last one doesn't continue. */
 	desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
 
+	if (indirect) {
+		/* Now that the indirect table is filled in, map it. */
+		dma_addr_t addr = vring_map_single(
+			vq, desc, total_sg * sizeof(struct vring_desc),
+			DMA_TO_DEVICE);
+		if (vring_mapping_error(vq, addr))
+			goto unmap_release;
+
+		vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT);
+		vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, addr);
+
+		vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, total_sg * sizeof(struct vring_desc));
+	}
+
+	/* We're using some buffers from the free list. */
+	vq->vq.num_free -= descs_used;
+
 	/* Update free pointer */
 	if (indirect)
 		vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next);
 	else
 		vq->free_head = i;
 
-	/* Set token. */
-	vq->data[head] = data;
+	/* Store token and indirect buffer state. */
+	vq->desc_state[head].data = data;
+	if (indirect)
+		vq->desc_state[head].indir_desc = desc;
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync). */
@@ -253,6 +338,24 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 		virtqueue_kick(_vq);
 
 	return 0;
+
+unmap_release:
+	err_idx = i;
+	i = head;
+
+	for (n = 0; n < total_sg; n++) {
+		if (i == err_idx)
+			break;
+		vring_unmap_one(vq, &desc[i]);
+		i = vq->vring.desc[i].next;
+	}
+
+	vq->vq.num_free += total_sg;
+
+	if (indirect)
+		kfree(desc);
+
+	return -EIO;
 }
 
 /**
@@ -423,27 +526,43 @@ EXPORT_SYMBOL_GPL(virtqueue_kick);
 
 static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 {
-	unsigned int i;
+	unsigned int i, j;
+	u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
 
 	/* Clear data ptr. */
-	vq->data[head] = NULL;
+	vq->desc_state[head].data = NULL;
 
-	/* Put back on free list: find end */
+	/* Put back on free list: unmap first-level descriptors and find end */
 	i = head;
 
-	/* Free the indirect table */
-	if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))
-		kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, vq->vring.desc[i].addr)));
-
-	while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT)) {
+	while (vq->vring.desc[i].flags & nextflag) {
+		vring_unmap_one(vq, &vq->vring.desc[i]);
 		i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
 		vq->vq.num_free++;
 	}
 
+	vring_unmap_one(vq, &vq->vring.desc[i]);
 	vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
 	vq->free_head = head;
+
 	/* Plus final descriptor */
 	vq->vq.num_free++;
+
+	/* Free the indirect table, if any, now that it's unmapped. */
+	if (vq->desc_state[head].indir_desc) {
+		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
+		u32 len = vq->vring.desc[head].len;
+
+		BUG_ON(!(vq->vring.desc[head].flags &
+			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
+		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+
+		for (j = 0; j < len / sizeof(struct vring_desc); j++)
+			vring_unmap_one(vq, &indir_desc[j]);
+
+		kfree(vq->desc_state[head].indir_desc);
+		vq->desc_state[head].indir_desc = NULL;
+	}
 }
 
 static inline bool more_used(const struct vring_virtqueue *vq)
@@ -498,13 +617,13 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 		BAD_RING(vq, "id %u out of range\n", i);
 		return NULL;
 	}
-	if (unlikely(!vq->data[i])) {
+	if (unlikely(!vq->desc_state[i].data)) {
 		BAD_RING(vq, "id %u is not a head!\n", i);
 		return NULL;
 	}
 
 	/* detach_buf clears data, so grab it now. */
-	ret = vq->data[i];
+	ret = vq->desc_state[i].data;
 	detach_buf(vq, i);
 	vq->last_used_idx++;
 	/* If we expect an interrupt for the next entry, tell host
@@ -665,10 +784,10 @@ void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
 	START_USE(vq);
 
 	for (i = 0; i < vq->vring.num; i++) {
-		if (!vq->data[i])
+		if (!vq->desc_state[i].data)
 			continue;
 		/* detach_buf clears data, so grab it now. */
-		buf = vq->data[i];
+		buf = vq->desc_state[i].data;
 		detach_buf(vq, i);
 		vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) - 1);
 		END_USE(vq);
@@ -721,7 +840,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 		return NULL;
 	}
 
-	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+	vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
+		     GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
@@ -751,11 +871,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 
 	/* Put everything in free lists. */
 	vq->free_head = 0;
-	for (i = 0; i < num-1; i++) {
+	for (i = 0; i < num-1; i++)
 		vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
-		vq->data[i] = NULL;
-	}
-	vq->data[i] = NULL;
+	memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
 
 	return &vq->vq;
 }
diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h
new file mode 100644
index 000000000000..4f93af89ae16
--- /dev/null
+++ b/tools/virtio/linux/dma-mapping.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_DMA_MAPPING_H
+#define _LINUX_DMA_MAPPING_H
+
+#ifdef CONFIG_HAS_DMA
+# error Virtio userspace code does not support CONFIG_HAS_DMA
+#endif
+
+#define PCI_DMA_BUS_IS_PHYS 1
+
+enum dma_data_direction {
+	DMA_BIDIRECTIONAL = 0,
+	DMA_TO_DEVICE = 1,
+	DMA_FROM_DEVICE = 2,
+	DMA_NONE = 3,
+};
+
+#endif
-- 
2.4.3



* [PATCH v4 3/6] virtio_pci: Use the DMA API
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
  2015-10-30  1:09 ` [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack Andy Lutomirski
  2015-10-30  1:09 ` [PATCH v4 2/6] virtio_ring: Support DMA APIs Andy Lutomirski
@ 2015-10-30  1:09 ` Andy Lutomirski
  2015-10-30  1:09 ` [PATCH v4 4/6] virtio: Add improved queue allocation API Andy Lutomirski
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization, Andy Lutomirski

This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++++++++++++++----
 drivers/virtio/virtio_pci_modern.c | 34 ++++++++++++++++++++++++----------
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
 	/* the number of entries in the queue */
 	int num;
 
-	/* the virtual address of the ring queue */
+	/* the ring queue */
 	void *queue;
+	dma_addr_t queue_dma_addr;      /* bus address */
 
 	/* the list node for the virtqueues list */
 	struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	info->msix_vector = msix_vec;
 
 	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+	info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size,
+					  &info->queue_dma_addr,
+					  GFP_KERNEL);
 	if (info->queue == NULL)
 		return ERR_PTR(-ENOMEM);
 
 	/* activate the queue */
-	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
 	/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
 	vring_del_virtqueue(vq);
 out_activate_queue:
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-	free_pages_exact(info->queue, size);
+	dma_free_coherent(&vp_dev->pci_dev->dev, size,
+			  info->queue, info->queue_dma_addr);
 	return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
 	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-	free_pages_exact(info->queue, size);
+	dma_free_coherent(&vp_dev->pci_dev->dev, size,
+			  info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev)
 		return -ENODEV;
 	}
 
+	rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+	if (rc)
+		rc = dma_set_mask_and_coherent(&pci_dev->dev,
+						DMA_BIT_MASK(32));
+	if (rc)
+		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
+
 	rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
 	if (rc)
 		return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
 	return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+				   int *num, dma_addr_t *dma_addr)
 {
 	void *pages;
 
 	/* TODO: allocate each queue chunk individually */
 	for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-		pages = alloc_pages_exact(vring_pci_size(*num),
-					  GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+		pages = dma_zalloc_coherent(
+			&vp_dev->pci_dev->dev, vring_pci_size(*num),
+			dma_addr, GFP_KERNEL|__GFP_NOWARN);
 		if (pages)
 			return pages;
 	}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
 		return NULL;
 
 	/* Try to get a single page. You are my only hope! */
-	return alloc_pages_exact(vring_pci_size(*num), GFP_KERNEL|__GFP_ZERO);
+	return dma_zalloc_coherent(
+		&vp_dev->pci_dev->dev, vring_pci_size(*num),
+		dma_addr, GFP_KERNEL);
 }
 
 static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
@@ -346,7 +350,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	info->num = num;
 	info->msix_vector = msix_vec;
 
-	info->queue = alloc_virtqueue_pages(&info->num);
+	info->queue = alloc_virtqueue_pages(vp_dev, &info->num,
+					    &info->queue_dma_addr);
 	if (info->queue == NULL)
 		return ERR_PTR(-ENOMEM);
 
@@ -361,11 +366,11 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 
 	/* activate the queue */
 	vp_iowrite16(num, &cfg->queue_size);
-	vp_iowrite64_twopart(virt_to_phys(info->queue),
+	vp_iowrite64_twopart(info->queue_dma_addr,
 			     &cfg->queue_desc_lo, &cfg->queue_desc_hi);
-	vp_iowrite64_twopart(virt_to_phys(virtqueue_get_avail(vq)),
+	vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_avail(vq) - (char *)info->queue),
 			     &cfg->queue_avail_lo, &cfg->queue_avail_hi);
-	vp_iowrite64_twopart(virt_to_phys(virtqueue_get_used(vq)),
+	vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_used(vq) - (char *)info->queue),
 			     &cfg->queue_used_lo, &cfg->queue_used_hi);
 
 	if (vp_dev->notify_base) {
@@ -411,7 +416,8 @@ err_assign_vector:
 err_map_notify:
 	vring_del_virtqueue(vq);
 err_new_queue:
-	free_pages_exact(info->queue, vring_pci_size(info->num));
+	dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num),
+			  info->queue, info->queue_dma_addr);
 	return ERR_PTR(err);
 }
 
@@ -457,7 +463,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
 
 	vring_del_virtqueue(vq);
 
-	free_pages_exact(info->queue, vring_pci_size(info->num));
+	dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num),
+			  info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
@@ -641,6 +648,13 @@ int virtio_pci_modern_probe(struct virtio_pci_device *vp_dev)
 		return -EINVAL;
 	}
 
+	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+	if (err)
+		err = dma_set_mask_and_coherent(&pci_dev->dev,
+						DMA_BIT_MASK(32));
+	if (err)
+		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
+
 	/* Device capability is only mandatory for devices that have
 	 * device-specific configuration.
 	 */
-- 
2.4.3



* [PATCH v4 4/6] virtio: Add improved queue allocation API
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
                   ` (2 preceding siblings ...)
  2015-10-30  1:09 ` [PATCH v4 3/6] virtio_pci: Use the DMA API Andy Lutomirski
@ 2015-10-30  1:09 ` Andy Lutomirski
  2015-10-30  1:09 ` [PATCH v4 5/6] virtio_mmio: Use the DMA API Andy Lutomirski
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization, Andy Lutomirski

This leaves vring_new_virtqueue alone for compatibility, but it
adds two new improved APIs:

vring_create_virtqueue: Creates a virtqueue backed by automatically
allocated coherent memory.  (Some day this could be extended to
support non-coherent memory, too, if there ends up being a platform
on which it's worthwhile.)

__vring_new_virtqueue: Creates a virtqueue with a manually-specified
layout.  This should allow mic_virtio to work much more cleanly.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/virtio/virtio_ring.c | 164 +++++++++++++++++++++++++++++++++++--------
 include/linux/virtio.h       |  23 +++++-
 include/linux/virtio_ring.h  |  35 +++++++++
 3 files changed, 190 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index a872eb89587f..f269e1cba00c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -91,6 +91,11 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	bool (*notify)(struct virtqueue *vq);
 
+	/* DMA, allocation, and size information */
+	bool we_own_ring;
+	size_t queue_size_in_bytes;
+	dma_addr_t queue_dma_addr;
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -821,36 +826,31 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
 }
 EXPORT_SYMBOL_GPL(vring_interrupt);
 
-struct virtqueue *vring_new_virtqueue(unsigned int index,
-				      unsigned int num,
-				      unsigned int vring_align,
-				      struct virtio_device *vdev,
-				      bool weak_barriers,
-				      void *pages,
-				      bool (*notify)(struct virtqueue *),
-				      void (*callback)(struct virtqueue *),
-				      const char *name)
+struct virtqueue *__vring_new_virtqueue(unsigned int index,
+					struct vring vring,
+					struct virtio_device *vdev,
+					bool weak_barriers,
+					bool (*notify)(struct virtqueue *),
+					void (*callback)(struct virtqueue *),
+					const char *name)
 {
-	struct vring_virtqueue *vq;
 	unsigned int i;
+	struct vring_virtqueue *vq;
 
-	/* We assume num is a power of 2. */
-	if (num & (num - 1)) {
-		dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
-		return NULL;
-	}
-
-	vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
+	vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
 		     GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
-	vring_init(&vq->vring, num, pages, vring_align);
+	vq->vring = vring;
 	vq->vq.callback = callback;
 	vq->vq.vdev = vdev;
 	vq->vq.name = name;
-	vq->vq.num_free = num;
+	vq->vq.num_free = vring.num;
 	vq->vq.index = index;
+	vq->we_own_ring = false;
+	vq->queue_dma_addr = 0;
+	vq->queue_size_in_bytes = 0;
 	vq->notify = notify;
 	vq->weak_barriers = weak_barriers;
 	vq->broken = false;
@@ -871,18 +871,105 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 
 	/* Put everything in free lists. */
 	vq->free_head = 0;
-	for (i = 0; i < num-1; i++)
+	for (i = 0; i < vring.num-1; i++)
 		vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
-	memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
+	memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
 
 	return &vq->vq;
 }
+EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
+
+struct virtqueue *vring_create_virtqueue(
+	unsigned int index,
+	unsigned int num,
+	unsigned int vring_align,
+	struct virtio_device *vdev,
+	bool weak_barriers,
+	bool may_reduce_num,
+	bool (*notify)(struct virtqueue *),
+	void (*callback)(struct virtqueue *),
+	const char *name)
+{
+	struct virtqueue *vq;
+	void *queue;
+	dma_addr_t dma_addr;
+	size_t queue_size_in_bytes;
+	struct vring vring;
+
+	/* We assume num is a power of 2. */
+	if (num & (num - 1)) {
+		dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
+		return NULL;
+	}
+
+	/* TODO: allocate each queue chunk individually */
+	for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
+		queue = dma_zalloc_coherent(
+			vdev->dev.parent, vring_size(num, vring_align),
+			&dma_addr, GFP_KERNEL|__GFP_NOWARN);
+		if (queue)
+			break;
+	}
+
+	if (!num)
+		return NULL;
+
+	if (!queue) {
+		/* Try to get a single page. You are my only hope! */
+		queue = dma_zalloc_coherent(
+			vdev->dev.parent, vring_size(num, vring_align),
+			&dma_addr, GFP_KERNEL);
+	}
+	if (!queue)
+		return NULL;
+
+	queue_size_in_bytes = vring_size(num, vring_align);
+	vring_init(&vring, num, queue, vring_align);
+
+	vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers,
+				   notify, callback, name);
+	if (!vq) {
+		dma_free_coherent(vdev->dev.parent,
+				  queue_size_in_bytes, queue,
+				  dma_addr);
+		return NULL;
+	}
+
+	to_vvq(vq)->queue_dma_addr = dma_addr;
+	to_vvq(vq)->queue_size_in_bytes = queue_size_in_bytes;
+	to_vvq(vq)->we_own_ring = true;
+
+	return vq;
+}
+EXPORT_SYMBOL_GPL(vring_create_virtqueue);
+
+struct virtqueue *vring_new_virtqueue(unsigned int index,
+				      unsigned int num,
+				      unsigned int vring_align,
+				      struct virtio_device *vdev,
+				      bool weak_barriers,
+				      void *pages,
+				      bool (*notify)(struct virtqueue *vq),
+				      void (*callback)(struct virtqueue *vq),
+				      const char *name)
+{
+	struct vring vring;
+	vring_init(&vring, num, pages, vring_align);
+	return __vring_new_virtqueue(index, vring, vdev, weak_barriers,
+				     notify, callback, name);
+}
 EXPORT_SYMBOL_GPL(vring_new_virtqueue);
 
-void vring_del_virtqueue(struct virtqueue *vq)
+void vring_del_virtqueue(struct virtqueue *_vq)
 {
-	list_del(&vq->list);
-	kfree(to_vvq(vq));
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	if (vq->we_own_ring) {
+		dma_free_coherent(vring_dma_dev(vq), vq->queue_size_in_bytes,
+				  vq->vring.desc, vq->queue_dma_addr);
+	}
+	list_del(&_vq->list);
+	kfree(vq);
 }
 EXPORT_SYMBOL_GPL(vring_del_virtqueue);
 
@@ -946,20 +1033,37 @@ void virtio_break_device(struct virtio_device *dev)
 }
 EXPORT_SYMBOL_GPL(virtio_break_device);
 
-void *virtqueue_get_avail(struct virtqueue *_vq)
+dma_addr_t virtqueue_get_desc_addr(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	return vq->vring.avail;
+	BUG_ON(!vq->we_own_ring);
+	return vq->queue_dma_addr;
 }
-EXPORT_SYMBOL_GPL(virtqueue_get_avail);
+EXPORT_SYMBOL_GPL(virtqueue_get_desc_addr);
 
-void *virtqueue_get_used(struct virtqueue *_vq)
+dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	return vq->vring.used;
+	BUG_ON(!vq->we_own_ring);
+	return vq->queue_dma_addr + ((char *)vq->vring.avail - (char *)vq->vring.desc);
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_avail_addr);
+
+dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	BUG_ON(!vq->we_own_ring);
+	return vq->queue_dma_addr + ((char *)vq->vring.used - (char *)vq->vring.desc);
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
+
+const struct vring *virtqueue_get_vring(struct virtqueue *vq)
+{
+	return &to_vvq(vq)->vring;
 }
-EXPORT_SYMBOL_GPL(virtqueue_get_used);
+EXPORT_SYMBOL_GPL(virtqueue_get_vring);
 
 MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8f4d4bfa6d46..d5eb5479a425 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -75,8 +75,27 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *vq);
 
 bool virtqueue_is_broken(struct virtqueue *vq);
 
-void *virtqueue_get_avail(struct virtqueue *vq);
-void *virtqueue_get_used(struct virtqueue *vq);
+const struct vring *virtqueue_get_vring(struct virtqueue *vq);
+dma_addr_t virtqueue_get_desc_addr(struct virtqueue *vq);
+dma_addr_t virtqueue_get_avail_addr(struct virtqueue *vq);
+dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
+
+/*
+ * Legacy accessors -- in almost all cases, these are the wrong functions
+ * to use.
+ */
+static inline void *virtqueue_get_desc(struct virtqueue *vq)
+{
+	return virtqueue_get_vring(vq)->desc;
+}
+static inline void *virtqueue_get_avail(struct virtqueue *vq)
+{
+	return virtqueue_get_vring(vq)->avail;
+}
+static inline void *virtqueue_get_used(struct virtqueue *vq)
+{
+	return virtqueue_get_vring(vq)->used;
+}
 
 /**
  * virtio_device - representation of a device using virtio
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 8e50888a6d59..b6b6821917fa 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -50,6 +50,35 @@ static inline void virtio_wmb(bool weak_barriers)
 struct virtio_device;
 struct virtqueue;
 
+/*
+ * Creates a virtqueue and allocates the descriptor ring.  If
+ * may_reduce_num is set, then this may allocate a smaller ring than
+ * expected.  The caller should query virtqueue_get_vring_size to learn
+ * the actual size of the ring.
+ */
+struct virtqueue *vring_create_virtqueue(unsigned int index,
+					 unsigned int num,
+					 unsigned int vring_align,
+					 struct virtio_device *vdev,
+					 bool weak_barriers,
+					 bool may_reduce_num,
+					 bool (*notify)(struct virtqueue *vq),
+					 void (*callback)(struct virtqueue *vq),
+					 const char *name);
+
+/* Creates a virtqueue with a custom layout. */
+struct virtqueue *__vring_new_virtqueue(unsigned int index,
+					struct vring vring,
+					struct virtio_device *vdev,
+					bool weak_barriers,
+					bool (*notify)(struct virtqueue *),
+					void (*callback)(struct virtqueue *),
+					const char *name);
+
+/*
+ * Creates a virtqueue with a standard layout but a caller-allocated
+ * ring.
+ */
 struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int num,
 				      unsigned int vring_align,
@@ -59,7 +88,13 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      bool (*notify)(struct virtqueue *vq),
 				      void (*callback)(struct virtqueue *vq),
 				      const char *name);
+
+/*
+ * Destroys a virtqueue.  If created with vring_create_virtqueue, this
+ * also frees the ring.
+ */
 void vring_del_virtqueue(struct virtqueue *vq);
+
 /* Filter out transport-specific feature bits. */
 void vring_transport_features(struct virtio_device *vdev);
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v4 5/6] virtio_mmio: Use the DMA API
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
                   ` (3 preceding siblings ...)
  2015-10-30  1:09 ` [PATCH v4 4/6] virtio: Add improved queue allocation API Andy Lutomirski
@ 2015-10-30  1:09 ` Andy Lutomirski
  2015-10-30  1:09 ` [PATCH v4 6/6] virtio_pci: " Andy Lutomirski
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization, Andy Lutomirski

This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/virtio/virtio_mmio.c | 67 ++++++++++----------------------------------
 1 file changed, 15 insertions(+), 52 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index f499d9da7237..2b9fab52a3cb 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -99,12 +99,6 @@ struct virtio_mmio_vq_info {
 	/* the actual virtqueue */
 	struct virtqueue *vq;
 
-	/* the number of entries in the queue */
-	unsigned int num;
-
-	/* the virtual address of the ring queue */
-	void *queue;
-
 	/* the list node for the virtqueues list */
 	struct list_head node;
 };
@@ -322,15 +316,13 @@ static void vm_del_vq(struct virtqueue *vq)
 {
 	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
 	struct virtio_mmio_vq_info *info = vq->priv;
-	unsigned long flags, size;
+	unsigned long flags;
 	unsigned int index = vq->index;
 
 	spin_lock_irqsave(&vm_dev->lock, flags);
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vm_dev->lock, flags);
 
-	vring_del_virtqueue(vq);
-
 	/* Select and deactivate the queue */
 	writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
 	if (vm_dev->version == 1) {
@@ -340,8 +332,8 @@ static void vm_del_vq(struct virtqueue *vq)
 		WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY));
 	}
 
-	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN));
-	free_pages_exact(info->queue, size);
+	vring_del_virtqueue(vq);
+
 	kfree(info);
 }
 
@@ -356,8 +348,6 @@ static void vm_del_vqs(struct virtio_device *vdev)
 	free_irq(platform_get_irq(vm_dev->pdev, 0), vm_dev);
 }
 
-
-
 static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name)
@@ -365,7 +355,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
 	struct virtio_mmio_vq_info *info;
 	struct virtqueue *vq;
-	unsigned long flags, size;
+	unsigned long flags;
+	unsigned int num;
 	int err;
 
 	if (!name)
@@ -388,66 +379,40 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 		goto error_kmalloc;
 	}
 
-	/* Allocate pages for the queue - start with a queue as big as
-	 * possible (limited by maximum size allowed by device), drop down
-	 * to a minimal size, just big enough to fit descriptor table
-	 * and two rings (which makes it "alignment_size * 2")
-	 */
-	info->num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX);
-
-	/* If the device reports a 0 entry queue, we won't be able to
-	 * use it to perform I/O, and vring_new_virtqueue() can't create
-	 * empty queues anyway, so don't bother to set up the device.
-	 */
-	if (info->num == 0) {
+	num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX);
+	if (num == 0) {
 		err = -ENOENT;
-		goto error_alloc_pages;
-	}
-
-	while (1) {
-		size = PAGE_ALIGN(vring_size(info->num,
-				VIRTIO_MMIO_VRING_ALIGN));
-		/* Did the last iter shrink the queue below minimum size? */
-		if (size < VIRTIO_MMIO_VRING_ALIGN * 2) {
-			err = -ENOMEM;
-			goto error_alloc_pages;
-		}
-
-		info->queue = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
-		if (info->queue)
-			break;
-
-		info->num /= 2;
+		goto error_new_virtqueue;
 	}
 
 	/* Create the vring */
-	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
-				 true, info->queue, vm_notify, callback, name);
+	vq = vring_create_virtqueue(index, num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+				 true, true, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;
 	}
 
 	/* Activate the queue */
-	writel(info->num, vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
+	writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
 	if (vm_dev->version == 1) {
 		writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
-		writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
+		writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
 				vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 	} else {
 		u64 addr;
 
-		addr = virt_to_phys(info->queue);
+		addr = virtqueue_get_desc_addr(vq);
 		writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_LOW);
 		writel((u32)(addr >> 32),
 				vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_HIGH);
 
-		addr = virt_to_phys(virtqueue_get_avail(vq));
+		addr = virtqueue_get_avail_addr(vq);
 		writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_AVAIL_LOW);
 		writel((u32)(addr >> 32),
 				vm_dev->base + VIRTIO_MMIO_QUEUE_AVAIL_HIGH);
 
-		addr = virt_to_phys(virtqueue_get_used(vq));
+		addr = virtqueue_get_used_addr(vq);
 		writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_USED_LOW);
 		writel((u32)(addr >> 32),
 				vm_dev->base + VIRTIO_MMIO_QUEUE_USED_HIGH);
@@ -471,8 +436,6 @@ error_new_virtqueue:
 		writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_READY);
 		WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY));
 	}
-	free_pages_exact(info->queue, size);
-error_alloc_pages:
 	kfree(info);
 error_kmalloc:
 error_available:
-- 
2.4.3



* [PATCH v4 6/6] virtio_pci: Use the DMA API
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
                   ` (4 preceding siblings ...)
  2015-10-30  1:09 ` [PATCH v4 5/6] virtio_mmio: Use the DMA API Andy Lutomirski
@ 2015-10-30  1:09 ` Andy Lutomirski
  2015-10-30  1:17 ` [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization, Andy Lutomirski

This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/virtio/virtio_pci_common.h |  7 -----
 drivers/virtio/virtio_pci_legacy.c | 39 +++++++-----------------
 drivers/virtio/virtio_pci_modern.c | 61 ++++++--------------------------------
 3 files changed, 19 insertions(+), 88 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index cd6196b513ad..1a3c689d1b9e 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -35,13 +35,6 @@ struct virtio_pci_vq_info {
 	/* the actual virtqueue */
 	struct virtqueue *vq;
 
-	/* the number of entries in the queue */
-	int num;
-
-	/* the ring queue */
-	void *queue;
-	dma_addr_t queue_dma_addr;      /* bus address */
-
 	/* the list node for the virtqueues list */
 	struct list_head node;
 
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index b5293e5f2af4..8c4e61783441 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -119,7 +119,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 				  u16 msix_vec)
 {
 	struct virtqueue *vq;
-	unsigned long size;
 	u16 num;
 	int err;
 
@@ -131,29 +130,19 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!num || ioread32(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN))
 		return ERR_PTR(-ENOENT);
 
-	info->num = num;
 	info->msix_vector = msix_vec;
 
-	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-	info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size,
-					  &info->queue_dma_addr,
-					  GFP_KERNEL);
-	if (info->queue == NULL)
+	/* create the vring */
+	vq = vring_create_virtqueue(index, num,
+				    VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev,
+				    true, false, vp_notify, callback, name);
+	if (!vq)
 		return ERR_PTR(-ENOMEM);
 
 	/* activate the queue */
-	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+	iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-	/* create the vring */
-	vq = vring_new_virtqueue(index, info->num,
-				 VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev,
-				 true, info->queue, vp_notify, callback, name);
-	if (!vq) {
-		err = -ENOMEM;
-		goto out_activate_queue;
-	}
-
 	vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
 
 	if (msix_vec != VIRTIO_MSI_NO_VECTOR) {
@@ -161,18 +150,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 		msix_vec = ioread16(vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR);
 		if (msix_vec == VIRTIO_MSI_NO_VECTOR) {
 			err = -EBUSY;
-			goto out_assign;
+			goto out_deactivate;
 		}
 	}
 
 	return vq;
 
-out_assign:
-	vring_del_virtqueue(vq);
-out_activate_queue:
+out_deactivate:
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-	dma_free_coherent(&vp_dev->pci_dev->dev, size,
-			  info->queue, info->queue_dma_addr);
+	vring_del_virtqueue(vq);
 	return ERR_PTR(err);
 }
 
@@ -180,7 +166,6 @@ static void del_vq(struct virtio_pci_vq_info *info)
 {
 	struct virtqueue *vq = info->vq;
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	unsigned long size;
 
 	iowrite16(vq->index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
@@ -191,14 +176,10 @@ static void del_vq(struct virtio_pci_vq_info *info)
 		ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);
 	}
 
-	vring_del_virtqueue(vq);
-
 	/* Select and deactivate the queue */
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-	dma_free_coherent(&vp_dev->pci_dev->dev, size,
-			  info->queue, info->queue_dma_addr);
+	vring_del_virtqueue(vq);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index fbe0bd1c4881..50b0cd5a501e 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -287,35 +287,6 @@ static u16 vp_config_vector(struct virtio_pci_device *vp_dev, u16 vector)
 	return vp_ioread16(&vp_dev->common->msix_config);
 }
 
-static size_t vring_pci_size(u16 num)
-{
-	/* We only need a cacheline separation. */
-	return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
-}
-
-static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
-				   int *num, dma_addr_t *dma_addr)
-{
-	void *pages;
-
-	/* TODO: allocate each queue chunk individually */
-	for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-		pages = dma_zalloc_coherent(
-			&vp_dev->pci_dev->dev, vring_pci_size(*num),
-			dma_addr, GFP_KERNEL|__GFP_NOWARN);
-		if (pages)
-			return pages;
-	}
-
-	if (!*num)
-		return NULL;
-
-	/* Try to get a single page. You are my only hope! */
-	return dma_zalloc_coherent(
-		&vp_dev->pci_dev->dev, vring_pci_size(*num),
-		dma_addr, GFP_KERNEL);
-}
-
 static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 				  struct virtio_pci_vq_info *info,
 				  unsigned index,
@@ -347,30 +318,22 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	/* get offset of notification word for this vq */
 	off = vp_ioread16(&cfg->queue_notify_off);
 
-	info->num = num;
 	info->msix_vector = msix_vec;
 
-	info->queue = alloc_virtqueue_pages(vp_dev, &info->num,
-					    &info->queue_dma_addr);
-	if (info->queue == NULL)
-		return ERR_PTR(-ENOMEM);
-
 	/* create the vring */
-	vq = vring_new_virtqueue(index, info->num,
-				 SMP_CACHE_BYTES, &vp_dev->vdev,
-				 true, info->queue, vp_notify, callback, name);
-	if (!vq) {
-		err = -ENOMEM;
-		goto err_new_queue;
-	}
+	vq = vring_create_virtqueue(index, num,
+				    SMP_CACHE_BYTES, &vp_dev->vdev,
+				    true, true, vp_notify, callback, name);
+	if (!vq)
+		return ERR_PTR(-ENOMEM);
 
 	/* activate the queue */
-	vp_iowrite16(num, &cfg->queue_size);
-	vp_iowrite64_twopart(info->queue_dma_addr,
+	vp_iowrite16(virtqueue_get_vring_size(vq), &cfg->queue_size);
+	vp_iowrite64_twopart(virtqueue_get_desc_addr(vq),
 			     &cfg->queue_desc_lo, &cfg->queue_desc_hi);
-	vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_avail(vq) - (char *)info->queue),
+	vp_iowrite64_twopart(virtqueue_get_avail_addr(vq),
 			     &cfg->queue_avail_lo, &cfg->queue_avail_hi);
-	vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_used(vq) - (char *)info->queue),
+	vp_iowrite64_twopart(virtqueue_get_used_addr(vq),
 			     &cfg->queue_used_lo, &cfg->queue_used_hi);
 
 	if (vp_dev->notify_base) {
@@ -415,9 +378,6 @@ err_assign_vector:
 		pci_iounmap(vp_dev->pci_dev, (void __iomem __force *)vq->priv);
 err_map_notify:
 	vring_del_virtqueue(vq);
-err_new_queue:
-	dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num),
-			  info->queue, info->queue_dma_addr);
 	return ERR_PTR(err);
 }
 
@@ -462,9 +422,6 @@ static void del_vq(struct virtio_pci_vq_info *info)
 		pci_iounmap(vp_dev->pci_dev, (void __force __iomem *)vq->priv);
 
 	vring_del_virtqueue(vq);
-
-	dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num),
-			  info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
-- 
2.4.3



* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
                   ` (5 preceding siblings ...)
  2015-10-30  1:09 ` [PATCH v4 6/6] virtio_pci: " Andy Lutomirski
@ 2015-10-30  1:17 ` Andy Lutomirski
  2015-10-30  9:57 ` Christian Borntraeger
  2015-11-09 12:15 ` Michael S. Tsirkin
  8 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30  1:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, David S. Miller, sparclinux, Joerg Roedel,
	Christian Borntraeger, Cornelia Huck, Sebastian Ott,
	Paolo Bonzini, Christoph Hellwig, Benjamin Herrenschmidt, KVM,
	David Woodhouse, Martin Schwidefsky, linux-s390,
	Michael S. Tsirkin, Linux Virtualization

On Thu, Oct 29, 2015 at 6:09 PM, Andy Lutomirski <luto@kernel.org> wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.

...

> Andy Lutomirski (5):
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
>   virtio: Add improved queue allocation API
>   virtio_mmio: Use the DMA API
>   virtio_pci: Use the DMA API

Ugh.  The two virtio_pci patches should be squashed together.  I'll do
that for v5, but I'm not going to send it until there's more feedback.

FWIW, I'm collecting this stuff here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_dma

That branch includes this series (with the squash) and the s390
patches.  I'll keep it up to date, since it seems silly to declare it
stable enough to stop rebasing yet.

--Andy


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
                   ` (6 preceding siblings ...)
  2015-10-30  1:17 ` [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
@ 2015-10-30  9:57 ` Christian Borntraeger
  2015-11-09 12:15 ` Michael S. Tsirkin
  8 siblings, 0 replies; 38+ messages in thread
From: Christian Borntraeger @ 2015-10-30  9:57 UTC (permalink / raw)
  To: Andy Lutomirski, linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Cornelia Huck, Sebastian Ott, Paolo Bonzini,
	Christoph Hellwig, benh, KVM, dwmw2, Martin Schwidefsky,
	linux-s390, Michael S. Tsirkin, virtualization

Am 30.10.2015 um 02:09 schrieb Andy Lutomirski:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.
> 
> This appears to work on native and Xen x86_64 using both modern and
> legacy virtio-pci.  It also appears to work on arm and arm64.
> 
> It definitely won't work as-is on s390x, and I haven't been able to
> test Christian's patches because I can't get virtio-ccw to work in
> QEMU at all.  I don't know what I'm doing wrong.


[...]
>   virtio-net: Stop doing DMA from the stack
> 
>  drivers/net/virtio_net.c           |  34 ++--
>  drivers/virtio/Kconfig             |   2 +-
>  drivers/virtio/virtio_mmio.c       |  67 ++-----
>  drivers/virtio/virtio_pci_common.h |   6 -
>  drivers/virtio/virtio_pci_legacy.c |  42 ++---
>  drivers/virtio/virtio_pci_modern.c |  61 ++-----
>  drivers/virtio/virtio_ring.c       | 348 ++++++++++++++++++++++++++++++-------

do you also have an untested patch for drivers/s390/virtio/* ?





* Re: [PATCH v4 2/6] virtio_ring: Support DMA APIs
  2015-10-30  1:09 ` [PATCH v4 2/6] virtio_ring: Support DMA APIs Andy Lutomirski
@ 2015-10-30 12:01   ` Cornelia Huck
  2015-10-30 12:05     ` Christian Borntraeger
  0 siblings, 1 reply; 38+ messages in thread
From: Cornelia Huck @ 2015-10-30 12:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, David S. Miller, sparclinux, Joerg Roedel,
	Christian Borntraeger, Sebastian Ott, Paolo Bonzini,
	Christoph Hellwig, benh, KVM, dwmw2, Martin Schwidefsky,
	linux-s390, Michael S. Tsirkin, virtualization

On Thu, 29 Oct 2015 18:09:47 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> virtio_ring currently sends the device (usually a hypervisor)
> physical addresses of its I/O buffers.  This is okay when DMA
> addresses and physical addresses are the same thing, but this isn't
> always the case.  For example, this never works on Xen guests, and
> it is likely to fail if a physical "virtio" device ever ends up
> behind an IOMMU or swiotlb.
> 
> The immediate use case for me is to enable virtio on Xen guests.
> For that to work, we need vring to support DMA address translation
> as well as a corresponding change to virtio_pci or to another
> driver.
> 
> With this patch, if enabled, virtfs survives kmemleak and
> CONFIG_DMA_API_DEBUG.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  drivers/virtio/Kconfig           |   2 +-
>  drivers/virtio/virtio_ring.c     | 190 +++++++++++++++++++++++++++++++--------
>  tools/virtio/linux/dma-mapping.h |  17 ++++
>  3 files changed, 172 insertions(+), 37 deletions(-)
>  create mode 100644 tools/virtio/linux/dma-mapping.h

>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>  {
> -	unsigned int i;
> +	unsigned int i, j;
> +	u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
> 
>  	/* Clear data ptr. */
> -	vq->data[head] = NULL;
> +	vq->desc_state[head].data = NULL;
> 
> -	/* Put back on free list: find end */
> +	/* Put back on free list: unmap first-level descriptors and find end */
>  	i = head;
> 
> -	/* Free the indirect table */
> -	if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))
> -		kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, vq->vring.desc[i].addr)));
> -
> -	while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT)) {
> +	while (vq->vring.desc[i].flags & nextflag) {
> +		vring_unmap_one(vq, &vq->vring.desc[i]);
>  		i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
>  		vq->vq.num_free++;
>  	}
> 
> +	vring_unmap_one(vq, &vq->vring.desc[i]);
>  	vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
>  	vq->free_head = head;
> +
>  	/* Plus final descriptor */
>  	vq->vq.num_free++;
> +
> +	/* Free the indirect table, if any, now that it's unmapped. */
> +	if (vq->desc_state[head].indir_desc) {
> +		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
> +		u32 len = vq->vring.desc[head].len;

This one needs to be virtio32_to_cpu(...) as well.

> +
> +		BUG_ON(!(vq->vring.desc[head].flags &
> +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> +		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> +
> +		for (j = 0; j < len / sizeof(struct vring_desc); j++)
> +			vring_unmap_one(vq, &indir_desc[j]);
> +
> +		kfree(vq->desc_state[head].indir_desc);
> +		vq->desc_state[head].indir_desc = NULL;
> +	}
>  }

With that change on top of your current branch, I can boot (root on
virtio-blk, either virtio-1 or legacy virtio) on current qemu master
with kvm enabled on s390. Haven't tried anything further.



* Re: [PATCH v4 2/6] virtio_ring: Support DMA APIs
  2015-10-30 12:01   ` Cornelia Huck
@ 2015-10-30 12:05     ` Christian Borntraeger
  2015-10-30 18:51       ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Christian Borntraeger @ 2015-10-30 12:05 UTC (permalink / raw)
  To: Cornelia Huck, Andy Lutomirski
  Cc: linux-kernel, David S. Miller, sparclinux, Joerg Roedel,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, benh, KVM,
	dwmw2, Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	virtualization

Am 30.10.2015 um 13:01 schrieb Cornelia Huck:
> On Thu, 29 Oct 2015 18:09:47 -0700
> Andy Lutomirski <luto@kernel.org> wrote:
> 
>> virtio_ring currently sends the device (usually a hypervisor)
>> physical addresses of its I/O buffers.  This is okay when DMA
>> addresses and physical addresses are the same thing, but this isn't
>> always the case.  For example, this never works on Xen guests, and
>> it is likely to fail if a physical "virtio" device ever ends up
>> behind an IOMMU or swiotlb.
>>
>> The immediate use case for me is to enable virtio on Xen guests.
>> For that to work, we need vring to support DMA address translation
>> as well as a corresponding change to virtio_pci or to another
>> driver.
>>
>> With this patch, if enabled, virtfs survives kmemleak and
>> CONFIG_DMA_API_DEBUG.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  drivers/virtio/Kconfig           |   2 +-
>>  drivers/virtio/virtio_ring.c     | 190 +++++++++++++++++++++++++++++++--------
>>  tools/virtio/linux/dma-mapping.h |  17 ++++
>>  3 files changed, 172 insertions(+), 37 deletions(-)
>>  create mode 100644 tools/virtio/linux/dma-mapping.h
> 
>>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>>  {
>> -	unsigned int i;
>> +	unsigned int i, j;
>> +	u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>>
>>  	/* Clear data ptr. */
>> -	vq->data[head] = NULL;
>> +	vq->desc_state[head].data = NULL;
>>
>> -	/* Put back on free list: find end */
>> +	/* Put back on free list: unmap first-level descriptors and find end */
>>  	i = head;
>>
>> -	/* Free the indirect table */
>> -	if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))
>> -		kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, vq->vring.desc[i].addr)));
>> -
>> -	while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT)) {
>> +	while (vq->vring.desc[i].flags & nextflag) {
>> +		vring_unmap_one(vq, &vq->vring.desc[i]);
>>  		i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
>>  		vq->vq.num_free++;
>>  	}
>>
>> +	vring_unmap_one(vq, &vq->vring.desc[i]);
>>  	vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
>>  	vq->free_head = head;
>> +
>>  	/* Plus final descriptor */
>>  	vq->vq.num_free++;
>> +
>> +	/* Free the indirect table, if any, now that it's unmapped. */
>> +	if (vq->desc_state[head].indir_desc) {
>> +		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
>> +		u32 len = vq->vring.desc[head].len;
> 
> This one needs to be virtio32_to_cpu(...) as well.

Yes, just did the exact same change
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f269e1c..f2249df 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -556,7 +556,7 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
        /* Free the indirect table, if any, now that it's unmapped. */
        if (vq->desc_state[head].indir_desc) {
                struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
-               u32 len = vq->vring.desc[head].len;
+               u32 len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
 
                BUG_ON(!(vq->vring.desc[head].flags &
                         cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));


now it boots.
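For anyone following along, the underlying issue is that virtio 1.0 ring fields are little-endian on the wire regardless of guest endianness, so a big-endian guest such as s390 must byte-swap on every access. A minimal userspace sketch of what `virtio32_to_cpu()` amounts to for a modern device (the function names here are illustrative, not the kernel's implementation):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for the kernel's virtio32_to_cpu() on a
 * modern (virtio 1.0) device: ring fields are little-endian on the
 * wire, so a big-endian CPU must swap while a little-endian CPU can
 * read them as-is. */
static int cpu_is_big_endian(void)
{
	const uint16_t probe = 1;

	return *(const uint8_t *)&probe == 0;
}

static uint32_t virtio32_to_cpu_sketch(uint32_t v)
{
	if (!cpu_is_big_endian())
		return v;	/* little-endian host: already native */
	return ((v & 0x000000ffu) << 24) |
	       ((v & 0x0000ff00u) << 8)  |
	       ((v & 0x00ff0000u) >> 8)  |
	       ((v & 0xff000000u) >> 24);
}

/* Load a descriptor field that is stored little-endian in memory. */
static uint32_t load_le32(const uint8_t b[4])
{
	uint32_t raw;

	memcpy(&raw, b, sizeof(raw));	/* raw bytes in host order */
	return virtio32_to_cpu_sketch(raw);
}
```

On s390 the bare `memcpy` value comes out byte-swapped, which is exactly why reading `vq->vring.desc[head].len` without the conversion broke the boot above.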
> 
>> +
>> +		BUG_ON(!(vq->vring.desc[head].flags &
>> +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
>> +		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
>> +
>> +		for (j = 0; j < len / sizeof(struct vring_desc); j++)
>> +			vring_unmap_one(vq, &indir_desc[j]);
>> +
>> +		kfree(vq->desc_state[head].indir_desc);
>> +		vq->desc_state[head].indir_desc = NULL;
>> +	}
>>  }
> 
> With that change on top of your current branch, I can boot (root on
> virtio-blk, either virtio-1 or legacy virtio) on current qemu master
> with kvm enabled on s390. Haven't tried anything further.
> 


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack
  2015-10-30  1:09 ` [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack Andy Lutomirski
@ 2015-10-30 13:55   ` Christian Borntraeger
  2015-10-31  5:02     ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Christian Borntraeger @ 2015-10-30 13:55 UTC (permalink / raw)
  To: Andy Lutomirski, linux-kernel, David S. Miller, sparclinux
  Cc: Joerg Roedel, Cornelia Huck, Sebastian Ott, Paolo Bonzini,
	Christoph Hellwig, benh, KVM, dwmw2, Martin Schwidefsky,
	linux-s390, Michael S. Tsirkin, virtualization

Am 30.10.2015 um 02:09 schrieb Andy Lutomirski:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> 
> Once virtio starts using the DMA API, we won't be able to safely DMA
> from the stack.  virtio-net does a couple of config DMA requests
> from small stack buffers -- switch to using dynamically-allocated
> memory.
> 
> This should have no effect on any performance-critical code paths.
> 
> [I wrote the subject and commit message.  mst wrote the code. --luto]
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> signed-off-by: Michael S. Tsirkin <mst@redhat.com>

I still get an error when using multiqueue:

#  ethtool -L eth0 combined 4
[   33.534686] virtio_ccw 0.0.000d: DMA-API: device driver maps memory from stack [addr=00000000629e7c06]
[   33.534704] ------------[ cut here ]------------
[   33.534705] WARNING: at lib/dma-debug.c:1169
[   33.534706] Modules linked in: dm_multipath
[   33.534709] CPU: 1 PID: 1087 Comm: ethtool Not tainted 4.3.0-rc3+ #269
[   33.534710] task: 00000000616f9978 ti: 00000000629e4000 task.ti: 00000000629e4000
[   33.534712] Krnl PSW : 0704d00180000000 00000000005869d2 (check_for_stack+0xb2/0x118)
[   33.534716]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 EA:3
Krnl GPRS: 000000000000006a 0000000000d60f44 000000000000005a 0000000064ee0870
[   33.534718]            00000000005869ce 0000000000000000 0000000000000001 00000000629e7c06
[   33.534719]            0000000000000000 0000000000000c06 0000000000000002 000000006467f800
[   33.534720]            0000000064673428 00000000629e7c06 00000000005869ce 00000000629e7928
[   33.534726] Krnl Code: 00000000005869c2: c0200024ad4e	larl	%r2,a1c45e
           00000000005869c8: c0e5ffe6d6fc	brasl	%r14,2617c0
          #00000000005869ce: a7f40001		brc	15,5869d0
          >00000000005869d2: c010003465eb	larl	%r1,c135a8
           00000000005869d8: e31010000012	lt	%r1,0(%r1)
           00000000005869de: a784000a		brc	8,5869f2
           00000000005869e2: e340f0b00004	lg	%r4,176(%r15)
           00000000005869e8: ebcff0a00004	lmg	%r12,%r15,160(%r15)
[   33.534736] Call Trace:
[   33.534737] ([<00000000005869ce>] check_for_stack+0xae/0x118)
[   33.534738]  [<0000000000586e3c>] debug_dma_map_page+0x114/0x160
[   33.534740]  [<00000000005a31f8>] vring_map_one_sg.isra.7+0x98/0xc0
[   33.534742]  [<00000000005a3b72>] virtqueue_add_sgs+0x1e2/0x788
[   33.534744]  [<0000000000618afc>] virtnet_send_command+0xcc/0x140
[   33.534745]  [<0000000000618c0c>] virtnet_set_queues+0x9c/0x110
[   33.534747]  [<0000000000619928>] virtnet_set_channels+0x78/0xe0
[   33.534748]  [<00000000006f63ea>] ethtool_set_channels+0x62/0x88
[   33.534750]  [<00000000006f8900>] dev_ethtool+0x10d8/0x1a48
[   33.534752]  [<000000000070c540>] dev_ioctl+0x190/0x510
[   33.534754]  [<00000000006cf2da>] sock_do_ioctl+0x7a/0x90
[   33.534755]  [<00000000006cf840>] sock_ioctl+0x1e8/0x2d0
[   33.534758]  [<00000000002e6c78>] do_vfs_ioctl+0x3a8/0x508
[   33.534759]  [<00000000002e6e7c>] SyS_ioctl+0xa4/0xb8
[   33.534762]  [<00000000008231ec>] system_call+0x244/0x264
[   33.534763]  [<000003ff922026d2>] 0x3ff922026d2
[   33.534764] Last Breaking-Event-Address:
[   33.534765]  [<00000000005869ce>] check_for_stack+0xae/0x118
[   33.534766] ---[ end trace 2379df65f4decfc4 ]---


> ---
>  drivers/net/virtio_net.c | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d8838dedb7a4..f94ab786088f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -140,6 +140,12 @@ struct virtnet_info {
> 
>  	/* CPU hot plug notifier */
>  	struct notifier_block nb;
> +
> +	/* Control VQ buffers: protected by the rtnl lock */
> +	struct virtio_net_ctrl_hdr ctrl_hdr;
> +	virtio_net_ctrl_ack ctrl_status;
> +	u8 ctrl_promisc;
> +	u8 ctrl_allmulti;
>  };
> 
>  struct padded_vnet_hdr {
> @@ -976,31 +982,30 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>  				 struct scatterlist *out)
>  {
>  	struct scatterlist *sgs[4], hdr, stat;
> -	struct virtio_net_ctrl_hdr ctrl;
> -	virtio_net_ctrl_ack status = ~0;
>  	unsigned out_num = 0, tmp;
> 
>  	/* Caller should know better */
>  	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> 
> -	ctrl.class = class;
> -	ctrl.cmd = cmd;
> +	vi->ctrl_status = ~0;
> +	vi->ctrl_hdr.class = class;
> +	vi->ctrl_hdr.cmd = cmd;
>  	/* Add header */
> -	sg_init_one(&hdr, &ctrl, sizeof(ctrl));
> +	sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr));
>  	sgs[out_num++] = &hdr;
> 
>  	if (out)
>  		sgs[out_num++] = out;
> 
>  	/* Add return status. */
> -	sg_init_one(&stat, &status, sizeof(status));
> +	sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status));
>  	sgs[out_num] = &stat;
> 
>  	BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
>  	virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
> 
>  	if (unlikely(!virtqueue_kick(vi->cvq)))
> -		return status == VIRTIO_NET_OK;
> +		return vi->ctrl_status == VIRTIO_NET_OK;
> 
>  	/* Spin for a response, the kick causes an ioport write, trapping
>  	 * into the hypervisor, so the request should be handled immediately.
> @@ -1009,7 +1014,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>  	       !virtqueue_is_broken(vi->cvq))
>  		cpu_relax();
> 
> -	return status == VIRTIO_NET_OK;
> +	return vi->ctrl_status == VIRTIO_NET_OK;
>  }
> 
>  static int virtnet_set_mac_address(struct net_device *dev, void *p)
> @@ -1151,7 +1156,6 @@ static void virtnet_set_rx_mode(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	struct scatterlist sg[2];
> -	u8 promisc, allmulti;
>  	struct virtio_net_ctrl_mac *mac_data;
>  	struct netdev_hw_addr *ha;
>  	int uc_count;
> @@ -1163,22 +1167,22 @@ static void virtnet_set_rx_mode(struct net_device *dev)
>  	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
>  		return;
> 
> -	promisc = ((dev->flags & IFF_PROMISC) != 0);
> -	allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
> +	vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0);
> +	vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
> 
> -	sg_init_one(sg, &promisc, sizeof(promisc));
> +	sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc));
> 
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
>  				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
>  		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
> -			 promisc ? "en" : "dis");
> +			 vi->ctrl_promisc ? "en" : "dis");
> 
> -	sg_init_one(sg, &allmulti, sizeof(allmulti));
> +	sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti));
> 
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
>  				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
>  		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
> -			 allmulti ? "en" : "dis");
> +			 vi->ctrl_allmulti ? "en" : "dis");
> 
>  	uc_count = netdev_uc_count(dev);
>  	mc_count = netdev_mc_count(dev);
> 

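The warning fires because `vring_map_one_sg()` now hands the buffer to the DMA API, and DMA-debug's `check_for_stack()` notices the address lies inside the task's stack. Conceptually that check is just a range test against the thread stack, which the following simplified userspace sketch models (the kernel's version differs in detail; the struct and function names here are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified model of the object_is_on_stack()-style range test that
 * DMA-debug performs.  Moving the control buffers out of the stack
 * frame of virtnet_send_command() into the heap-allocated struct
 * virtnet_info, as the patch above does, is what keeps this check
 * from firing. */
struct stack_region {
	uintptr_t base;	/* lowest address of the task stack */
	size_t    size;	/* stack size, e.g. THREAD_SIZE */
};

static int on_stack_sketch(const struct stack_region *st,
			   uintptr_t addr, size_t len)
{
	/* object overlaps the stack iff it starts at or above the base
	 * and ends at or below the top */
	return addr >= st->base && addr + len <= st->base + st->size;
}
```

Any remaining on-stack buffer passed into `virtqueue_add_sgs()` (as in the multiqueue path triggering the trace above) would still trip this test.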


* Re: [PATCH v4 2/6] virtio_ring: Support DMA APIs
  2015-10-30 12:05     ` Christian Borntraeger
@ 2015-10-30 18:51       ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-30 18:51 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Cornelia Huck, Andy Lutomirski, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Sebastian Ott, Paolo Bonzini,
	Christoph Hellwig, Benjamin Herrenschmidt, KVM, David Woodhouse,
	Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	Linux Virtualization

On Fri, Oct 30, 2015 at 5:05 AM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> Am 30.10.2015 um 13:01 schrieb Cornelia Huck:
>> On Thu, 29 Oct 2015 18:09:47 -0700
>> Andy Lutomirski <luto@kernel.org> wrote:
>>
>>> virtio_ring currently sends the device (usually a hypervisor)
>>> physical addresses of its I/O buffers.  This is okay when DMA
>>> addresses and physical addresses are the same thing, but this isn't
>>> always the case.  For example, this never works on Xen guests, and
>>> it is likely to fail if a physical "virtio" device ever ends up
>>> behind an IOMMU or swiotlb.
>>>
>>> The immediate use case for me is to enable virtio on Xen guests.
>>> For that to work, we need vring to support DMA address translation
>>> as well as a corresponding change to virtio_pci or to another
>>> driver.
>>>
>>> With this patch, if enabled, virtfs survives kmemleak and
>>> CONFIG_DMA_API_DEBUG.
>>>
>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>> ---
>>>  drivers/virtio/Kconfig           |   2 +-
>>>  drivers/virtio/virtio_ring.c     | 190 +++++++++++++++++++++++++++++++--------
>>>  tools/virtio/linux/dma-mapping.h |  17 ++++
>>>  3 files changed, 172 insertions(+), 37 deletions(-)
>>>  create mode 100644 tools/virtio/linux/dma-mapping.h
>>
>>>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>>>  {
>>> -    unsigned int i;
>>> +    unsigned int i, j;
>>> +    u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>>>
>>>      /* Clear data ptr. */
>>> -    vq->data[head] = NULL;
>>> +    vq->desc_state[head].data = NULL;
>>>
>>> -    /* Put back on free list: find end */
>>> +    /* Put back on free list: unmap first-level descriptors and find end */
>>>      i = head;
>>>
>>> -    /* Free the indirect table */
>>> -    if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))
>>> -            kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, vq->vring.desc[i].addr)));
>>> -
>>> -    while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT)) {
>>> +    while (vq->vring.desc[i].flags & nextflag) {
>>> +            vring_unmap_one(vq, &vq->vring.desc[i]);
>>>              i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
>>>              vq->vq.num_free++;
>>>      }
>>>
>>> +    vring_unmap_one(vq, &vq->vring.desc[i]);
>>>      vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
>>>      vq->free_head = head;
>>> +
>>>      /* Plus final descriptor */
>>>      vq->vq.num_free++;
>>> +
>>> +    /* Free the indirect table, if any, now that it's unmapped. */
>>> +    if (vq->desc_state[head].indir_desc) {
>>> +            struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
>>> +            u32 len = vq->vring.desc[head].len;
>>
>> This one needs to be virtio32_to_cpu(...) as well.
>
> Yes, just did the exact same change
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index f269e1c..f2249df 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -556,7 +556,7 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>         /* Free the indirect table, if any, now that it's unmapped. */
>         if (vq->desc_state[head].indir_desc) {
>                 struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
> -               u32 len = vq->vring.desc[head].len;
> +               u32 len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
>
>                 BUG_ON(!(vq->vring.desc[head].flags &
>                          cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
>
>
> now it boots.

Thanks!  I applied this to my tree.  I won't send a new version quite
yet, though, to reduce inbox load.

--Andy


* Re: [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack
  2015-10-30 13:55   ` Christian Borntraeger
@ 2015-10-31  5:02     ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-10-31  5:02 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Andy Lutomirski, linux-kernel, David S. Miller, sparclinux,
	Joerg Roedel, Cornelia Huck, Sebastian Ott, Paolo Bonzini,
	Christoph Hellwig, Benjamin Herrenschmidt, KVM, David Woodhouse,
	Martin Schwidefsky, linux-s390, Michael S. Tsirkin,
	Linux Virtualization

On Fri, Oct 30, 2015 at 6:55 AM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> Am 30.10.2015 um 02:09 schrieb Andy Lutomirski:
>> From: "Michael S. Tsirkin" <mst@redhat.com>
>>
>> Once virtio starts using the DMA API, we won't be able to safely DMA
>> from the stack.  virtio-net does a couple of config DMA requests
>> from small stack buffers -- switch to using dynamically-allocated
>> memory.
>>
>> This should have no effect on any performance-critical code paths.
>>
>> [I wrote the subject and commit message.  mst wrote the code. --luto]
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> I still get an error when using multiqueue:
>
> #  ethtool -L eth0 combined 4
> [   33.534686] virtio_ccw 0.0.000d: DMA-API: device driver maps memory from stack [addr=00000000629e7c06]

Fixed in my branch, I think.

--Andy


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
                   ` (7 preceding siblings ...)
  2015-10-30  9:57 ` Christian Borntraeger
@ 2015-11-09 12:15 ` Michael S. Tsirkin
  2015-11-09 12:27   ` Paolo Bonzini
  2015-11-09 22:58   ` Benjamin Herrenschmidt
  8 siblings, 2 replies; 38+ messages in thread
From: Michael S. Tsirkin @ 2015-11-09 12:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, David S. Miller, sparclinux, Joerg Roedel,
	Christian Borntraeger, Cornelia Huck, Sebastian Ott,
	Paolo Bonzini, Christoph Hellwig, benh, KVM, dwmw2,
	Martin Schwidefsky, linux-s390, virtualization

On Thu, Oct 29, 2015 at 06:09:45PM -0700, Andy Lutomirski wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.
> 
> This appears to work on native and Xen x86_64 using both modern and
> legacy virtio-pci.  It also appears to work on arm and arm64.
> 
> It definitely won't work as-is on s390x, and I haven't been able to
> test Christian's patches because I can't get virtio-ccw to work in
> QEMU at all.  I don't know what I'm doing wrong.
> 
> It doesn't work on ppc64.  Ben, consider yourself pinged to send me
> a patch :)
> 
> It doesn't work on sparc64.  I didn't realize at Kernel Summit that
> sparc64 has the same problem as ppc64.
> 
> DaveM, for background, we're trying to fix virtio to use the DMA
> API.  That will require that every platform that uses virtio
> supplies valid DMA operations on devices that use virtio_ring.
> Unfortunately, QEMU historically ignores the IOMMU on virtio
> devices.
> 
> On x86, this isn't really a problem.  x86 has a nice way for the
> platform to describe which devices are behind an IOMMU, and QEMU
> will be adjusted accordingly.  The only thing that will break is a
> recently-added experimental mode.

Well that's not exactly true. I think we would like to make
it possible to put virtio devices behind an IOMMU on x86,
but if this means existing guests break, then many people won't be able
to use this option: having to find out which kernel version your guest
is running is a significant burden.


So on the host side, we need to detect guests that
don't program the IOMMU and make QEMU ignore it.
I think we need to figure out a way to do this
before we commit to the guest change.

Additionally, IOMMU overhead is very high when running within the VM.
So for uses such as VFIO, we'd like a way to make something like
iommu-pt the default.



> Ben's plan for powerpc is to add a quirk for existing virtio-pci
> devices and to eventually update the devicetree stuff to allow QEMU
> to tell the guest which devices use the IOMMU.
> 
> AFAICT sparc has a similar problem to powerpc.  DaveM, can you come
> up with a straightforward way to get sparc's DMA API to work
> correctly for virtio-pci devices?
> 
> NB: Sadly, the platforms I've successfully tested on don't include any
> big-endian platforms, so there could still be lurking endian problems.
> 
> Changes from v3:
>  - More big-endian fixes.
>  - Added better virtio-ring APIs that handle allocation and use them in
>    virtio-mmio and virtio-pci.
>  - Switch to Michael's virtio-net patch.
> 
> Changes from v2:
>  - Fix vring_mapping_error incorrect argument
> 
> Changes from v1:
>  - Fix an endian conversion error causing a BUG to hit.
>  - Fix a DMA ordering issue (swiotlb=force works now).
>  - Minor cleanups.
> 
> Andy Lutomirski (5):
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
>   virtio: Add improved queue allocation API
>   virtio_mmio: Use the DMA API
>   virtio_pci: Use the DMA API
> 
> Michael S. Tsirkin (1):
>   virtio-net: Stop doing DMA from the stack
> 
>  drivers/net/virtio_net.c           |  34 ++--
>  drivers/virtio/Kconfig             |   2 +-
>  drivers/virtio/virtio_mmio.c       |  67 ++-----
>  drivers/virtio/virtio_pci_common.h |   6 -
>  drivers/virtio/virtio_pci_legacy.c |  42 ++---
>  drivers/virtio/virtio_pci_modern.c |  61 ++-----
>  drivers/virtio/virtio_ring.c       | 348 ++++++++++++++++++++++++++++++-------
>  include/linux/virtio.h             |  23 ++-
>  include/linux/virtio_ring.h        |  35 ++++
>  tools/virtio/linux/dma-mapping.h   |  17 ++
>  10 files changed, 426 insertions(+), 209 deletions(-)
>  create mode 100644 tools/virtio/linux/dma-mapping.h
> 
> -- 
> 2.4.3


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-09 12:15 ` Michael S. Tsirkin
@ 2015-11-09 12:27   ` Paolo Bonzini
  2015-11-09 22:58   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2015-11-09 12:27 UTC (permalink / raw)
  To: Michael S. Tsirkin, Andy Lutomirski
  Cc: linux-kernel, David S. Miller, sparclinux, Joerg Roedel,
	Christian Borntraeger, Cornelia Huck, Sebastian Ott,
	Christoph Hellwig, benh, KVM, dwmw2, Martin Schwidefsky,
	linux-s390, virtualization



On 09/11/2015 13:15, Michael S. Tsirkin wrote:
> Well that's not exactly true. I think we would like to make
> it possible to put virtio devices behind an IOMMU on x86,
> but if this means existing guests break, then many people won't be able
> to use this option: having to find out which kernel version your guest
> is running is a significant burden.
> 
> So on the host side, we need to detect guests that
> don't program the IOMMU and make QEMU ignore it.
> I think we need to figure out a way to do this
> before we commit to the guest change.

What is the use case for putting virtio devices behind an IOMMU, apart from:

1) "because you can"

2) using VFIO within the guest

?  Case 1 can be ignored, and in case 2 the guest will do the right thing.

> Additionally, IOMMU overhead is very high when running within the VM.
> So for uses such as VFIO, we'd like a way to make something like
> iommu-pt the default.

That's not something that the kernel cares about.  It's just a
configuration issue.

Paolo


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-09 12:15 ` Michael S. Tsirkin
  2015-11-09 12:27   ` Paolo Bonzini
@ 2015-11-09 22:58   ` Benjamin Herrenschmidt
  2015-11-10  0:46     ` Andy Lutomirski
  1 sibling, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-09 22:58 UTC (permalink / raw)
  To: Andy Lutomirski, dwmw2
  Cc: linux-kernel, David S. Miller, sparclinux, Joerg Roedel,
	Christian Borntraeger, Cornelia Huck, Sebastian Ott,
	Paolo Bonzini, Christoph Hellwig, KVM, Martin Schwidefsky,
	linux-s390, virtualization, Michael S. Tsirkin

So ...

I've finally tried to sort that out for powerpc and I can't find a way
to make that work that isn't a complete pile of stinking shit.

I'm very tempted to go back to my original idea: virtio itself should
indicate its "bypassing ability" via the virtio config space or some
other bit (like the ProgIf of the PCI header etc...).

I don't see how I can make it work otherwise.

The problem with the statement "it's a platform matter" is that:

  - It's not entirely correct. It's actually a feature of a specific
virtio implementation (qemu's) that it bypasses all the platform IOMMU
facilities.

  - The platforms (on powerpc there are at least 3 or 4 that have qemu
emulation and support some form of PCI) don't have an existing way to
convey the information that a device bypasses the IOMMU (if any).

  - Even if we add such a mechanism (new property in the device-tree),
we end up with a big turd: Because we need to be compatible with older
qemus, we essentially need a quirk that will make all virtio devices
assume that property is present. That will of course break whenever we
try to use another implementation of virtio on powerpc which doesn't
bypass the iommu.

We don't have a way to differentiate a virtio device that does the
bypass from one that doesn't, and the backward compatibility requirement
forces us to treat all existing virtio devices as doing bypass.

The only way out of this while keeping the "platform" stuff would be to
also bump some kind of version in the virtio config (or PCI header). I
have no other way to differentiate between "this is an old qemu that
doesn't do the 'bypass property' yet" from "this is a virtio device
that doesn't bypass".

Any better idea?

Cheers,
Ben.




* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-09 22:58   ` Benjamin Herrenschmidt
@ 2015-11-10  0:46     ` Andy Lutomirski
  2015-11-10  2:04       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2015-11-10  0:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, Nov 9, 2015 at 2:58 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> So ...
>
> I've finally tried to sort that out for powerpc and I can't find a way
> to make that work that isn't a complete pile of stinking shit.
>
> I'm very tempted to go back to my original idea: virtio itself should
> indicate its "bypassing ability" via the virtio config space or some
> other bit (like the ProgIf of the PCI header etc...).
>
> I don't see how I can make it work otherwise.
>
> The problem with the statement "it's a platform matter" is that:
>
>   - It's not entirely correct. It's actually a feature of a specific
> virtio implementation (qemu's) that it bypasses all the platform IOMMU
> facilities.
>
>   - The platforms (on powerpc there are at least 3 or 4 that have qemu
> emulation and support some form of PCI) don't have an existing way to
> convey the information that a device bypasses the IOMMU (if any).
>
>   - Even if we add such a mechanism (new property in the device-tree),
> we end up with a big turd: Because we need to be compatible with older
> qemus, we essentially need a quirk that will make all virtio devices
> assume that property is present. That will of course break whenever we
> try to use another implementation of virtio on powerpc which doesn't
> bypass the iommu.
>
> We don't have a way to differentiate a virtio device that does the
> bypass from one that doesn't and the backward compatibility requirement
> forces us to treat all existing virtio devices as doing bypass.

The problem here is that in some of the problematic cases the virtio
driver may not even be loaded.  If someone runs an L1 guest with an
IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
*boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)

>
> The only way out of this while keeping the "platform" stuff would be to
> also bump some kind of version in the virtio config (or PCI header). I
> have no other way to differentiate between "this is an old qemu that
> doesn't do the 'bypass property' yet" from "this is a virtio device
> that doesn't bypass".
>
> Any better idea ?

I'd suggest that, in the absence of the new DT binding, we assume that
any PCI device with the virtio vendor ID is passthrough on powerpc.  I
can do this in the virtio driver, but if it's in the platform code
then vfio gets it right too (i.e. fails to load).

--Andy


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  0:46     ` Andy Lutomirski
@ 2015-11-10  2:04       ` Benjamin Herrenschmidt
  2015-11-10  2:18         ` Andy Lutomirski
                           ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10  2:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
> The problem here is that in some of the problematic cases the virtio
> driver may not even be loaded.  If someone runs an L1 guest with an
> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
> *boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)
> 
> >
> > The only way out of this while keeping the "platform" stuff would be to
> > also bump some kind of version in the virtio config (or PCI header). I
> > have no other way to differentiate between "this is an old qemu that
> > doesn't do the 'bypass property' yet" from "this is a virtio device
> > that doesn't bypass".
> >
> > Any better idea ?
> 
> I'd suggest that, in the absence of the new DT binding, we assume that
> any PCI device with the virtio vendor ID is passthrough on powerpc.  I
> can do this in the virtio driver, but if it's in the platform code
> then vfio gets it right too (i.e. fails to load).

The problem is there isn't *a* virtio vendor ID. It's the Red Hat vendor
ID, which will be used by more than just virtio, so we need to
specifically list the devices.

Additionally, that still means that once we have a virtio device that
actually uses the iommu, powerpc will not work since the "workaround"
above will kick in.

The "in the absence of the new DT binding" part doesn't make that much sense.

Those platforms use device-trees defined since the dawn of ages by
actual open firmware implementations, they either have no iommu
representation in there (Macs, the platform code hooks it all up) or
have various properties related to the iommu but no concept of "bypass"
in there.

We can *add* a new property under some circumstances that indicates a
bypass on a per-device basis, however that doesn't completely solve it:

  - As I said above, what does the absence of that property mean? An
old qemu that does bypass on all virtio or a new qemu trying to tell
you that the virtio device actually does use the iommu (or some other
environment that isn't qemu) ?

  - On things like macs, the device-tree is generated by openbios, it
would have to have some added logic to try to figure that out, which
means it needs to know *via different means* that some or all virtio
devices bypass the iommu.

I thus go back to my original statement: it's a LOT easier to handle if
the device itself is self-describing, indicating whether it is set to
bypass a host iommu or not. For L1->L2, well, that wouldn't be the
first time qemu/VFIO plays tricks with the passed through device
configuration space...

Note that the above can be solved via some kind of compromise: the
device self-describes the ability to honor the iommu, along with the
property (or ACPI table entry) that indicates whether or not it does.

IE. We could use the revision or ProgIf field of the config space for
example. Or something in virtio config. If it's an "old" device, we
know it always bypass. If it's a new device, we know it only bypasses
if the corresponding property is in. I still would have to sort out the
openbios case for mac among others but it's at least a workable
direction.

BTW, don't you have a similar problem on x86, in that today qemu's
ACPI claims that everything honors the iommu?

Unless somebody can come up with a better idea...

Cheers,
Ben.



* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  2:04       ` Benjamin Herrenschmidt
@ 2015-11-10  2:18         ` Andy Lutomirski
  2015-11-10  5:26           ` Benjamin Herrenschmidt
                             ` (2 more replies)
  2015-11-10  9:45         ` Knut Omang
  2015-11-10 10:27         ` Joerg Roedel
  2 siblings, 3 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-11-10  2:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, Nov 9, 2015 at 6:04 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
>> The problem here is that in some of the problematic cases the virtio
>> driver may not even be loaded.  If someone runs an L1 guest with an
>> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
>> *boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)
>>
>> >
>> > The only way out of this while keeping the "platform" stuff would be to
>> > also bump some kind of version in the virtio config (or PCI header). I
>> > have no other way to differentiate between "this is an old qemu that
>> > doesn't do the 'bypass property' yet" from "this is a virtio device
>> > that doesn't bypass".
>> >
>> > Any better idea ?
>>
>> I'd suggest that, in the absence of the new DT binding, we assume that
>> any PCI device with the virtio vendor ID is passthrough on powerpc.  I
>> can do this in the virtio driver, but if it's in the platform code
>> then vfio gets it right too (i.e. fails to load).
>
> The problem is there isn't *a* virtio vendor ID. It's the Red Hat vendor
> ID which will be used by more than just virtio, so we need to
> specifically list the devices.

Really?

/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
static const struct pci_device_id virtio_pci_id_table[] = {
        { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
        { 0 }
};

Can we match on that range?
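The range check itself is simple. As a sketch (a real platform quirk would hook into the PCI fixup machinery rather than open-coding this as a standalone helper):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: match the whole donated range (vendor 0x1af4, devices
 * 0x1000 through 0x10ff) rather than listing individual device IDs. */
static bool in_virtio_id_range(unsigned short vendor, unsigned short device)
{
	return vendor == 0x1af4 && device >= 0x1000 && device <= 0x10ff;
}
```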

>
> Additionally, that still means that once we have a virtio device that
> actually uses the iommu, powerpc will not work since the "workaround"
> above will kick in.

I don't know how to solve that problem, though, especially since the
vendor of such a device (especially if it's real hardware) might not
set any new bit.

>
> The "in absence of the new DT binding" doesn't make that much sense.
>
> Those platforms use device-trees defined since the dawn of ages by
> actual open firmware implementations, they either have no iommu
> representation in there (Macs, the platform code hooks it all up) or
> have various properties related to the iommu but no concept of "bypass"
> in there.
>
> We can *add* a new property under some circumstances that indicates a
> bypass on a per-device basis, however that doesn't completely solve it:
>
>   - As I said above, what does the absence of that property mean ? An
> old qemu that does bypass on all virtio or a new qemu trying to tell
> you that the virtio device actually does use the iommu (or some other
> environment that isn't qemu) ?
>
>   - On things like Macs, the device-tree is generated by OpenBIOS; it
> would have to have some added logic to try to figure that out, which
> means it needs to know *via different means* that some or all virtio
> devices bypass the iommu.
>
> I thus go back to my original statement, it's a LOT easier to handle if
> the device itself is self-describing, indicating whether it is set to
> bypass a host iommu or not. For L1->L2, well, that wouldn't be the
> first time qemu/VFIO plays tricks with the passed through device
> configuration space...

Which leaves the special case of Xen, where even preexisting devices
don't bypass the IOMMU.  Can we keep this specific to powerpc and
sparc?  On x86, this problem is basically nonexistent, since the IOMMU
is properly self-describing.

IOW, I think that on x86 we should assume that all virtio devices
honor the IOMMU.

>
> Note that the above can be solved via some kind of compromise: The
> device self-describes the ability to honor the iommu, along with the
> property (or ACPI table entry) that indicates whether or not it does.
>
> I.e., we could use the revision or ProgIf field of the config space for
> example, or something in virtio config. If it's an "old" device, we
> know it always bypasses. If it's a new device, we know it only bypasses
> if the corresponding property is present. I still would have to sort out the
> openbios case for mac among others but it's at least a workable
> direction.
>
> BTW, don't you have a similar problem on x86, in that today qemu
> claims that everything honors the iommu in ACPI?

Only on a single experimental configuration, and that can apparently
just be fixed going forward without any real problems being caused.

--Andy


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  2:18         ` Andy Lutomirski
@ 2015-11-10  5:26           ` Benjamin Herrenschmidt
  2015-11-10  5:33             ` Andy Lutomirski
  2015-11-10  5:28           ` Benjamin Herrenschmidt
  2015-11-10  7:28           ` Jan Kiszka
  2 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10  5:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> 
> Which leaves the special case of Xen, where even preexisting devices
> don't bypass the IOMMU.  Can we keep this specific to powerpc and
> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
> is properly self-describing.
> 
> IOW, I think that on x86 we should assume that all virtio devices
> honor the IOMMU.

You don't like performance? :-)

Cheers,
Ben.


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  2:18         ` Andy Lutomirski
  2015-11-10  5:26           ` Benjamin Herrenschmidt
@ 2015-11-10  5:28           ` Benjamin Herrenschmidt
  2015-11-10  5:35             ` Andy Lutomirski
  2015-11-10  7:28           ` Jan Kiszka
  2 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10  5:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> 
> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF.
> */
> static const struct pci_device_id virtio_pci_id_table[] = {
>         { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
>         { 0 }
> };
> 
> Can we match on that range?

We can, but the problem remains: how do we differentiate an existing
device that does bypass vs. a newer one that needs the IOMMU and thus
doesn't have the new "bypass" property in the device-tree?

Cheers,
Ben.



* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  5:26           ` Benjamin Herrenschmidt
@ 2015-11-10  5:33             ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2015-11-10  5:33 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, Nov 9, 2015 at 9:26 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> Which leaves the special case of Xen, where even preexisting devices
>> don't bypass the IOMMU.  Can we keep this specific to powerpc and
>> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
>> is properly self-describing.
>>
>> IOW, I think that on x86 we should assume that all virtio devices
>> honor the IOMMU.
>
> You don't like performance? :-)

This should have basically no effect.  Every non-experimental x86
virtio setup in existence either doesn't work at all (Xen) or has DMA
ops that are no-ops.

--Andy


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  5:28           ` Benjamin Herrenschmidt
@ 2015-11-10  5:35             ` Andy Lutomirski
  2015-11-10 10:37               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2015-11-10  5:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, Nov 9, 2015 at 9:28 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF.
>> */
>> static const struct pci_device_id virtio_pci_id_table[] = {
>>         { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
>>         { 0 }
>> };
>>
>> Can we match on that range?
>
> We can, but the problem remains: how do we differentiate an existing
> device that does bypass vs. a newer one that needs the IOMMU and thus
> doesn't have the new "bypass" property in the device-tree?
>

We could do it the other way around: on powerpc, if a PCI device is in
that range and doesn't have the "bypass" property at all, then it's
assumed to bypass the IOMMU.  This means that everything that
currently works continues working.  If someone builds a physical
virtio device or uses another system in PCIe target mode speaking
virtio, then it won't work until they upgrade their firmware to set
bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
bypass=0 and no ambiguity.
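As a sketch of that default (the "bypass" property name and the tri-state encoding below are illustrative only, not an existing binding):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the proposed powerpc default: -1 models a device-tree node
 * with no "bypass" property at all. Absent or bypass=1 means the
 * device is assumed to bypass the IOMMU, so everything that works
 * today keeps working; only an explicit bypass=0 opts into translation.
 */
static bool assume_iommu_bypass(int bypass_prop)
{
	return bypass_prop != 0;
}
```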

vfio will presumably notice the bypass and correctly refuse to map any
current virtio devices.

Would that work?

--Andy


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  2:18         ` Andy Lutomirski
  2015-11-10  5:26           ` Benjamin Herrenschmidt
  2015-11-10  5:28           ` Benjamin Herrenschmidt
@ 2015-11-10  7:28           ` Jan Kiszka
  2 siblings, 0 replies; 38+ messages in thread
From: Jan Kiszka @ 2015-11-10  7:28 UTC (permalink / raw)
  To: Andy Lutomirski, Benjamin Herrenschmidt
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On 2015-11-10 03:18, Andy Lutomirski wrote:
> On Mon, Nov 9, 2015 at 6:04 PM, Benjamin Herrenschmidt
>> I thus go back to my original statement, it's a LOT easier to handle if
>> the device itself is self-describing, indicating whether it is set to
>> bypass a host iommu or not. For L1->L2, well, that wouldn't be the
>> first time qemu/VFIO plays tricks with the passed through device
>> configuration space...
> 
> Which leaves the special case of Xen, where even preexisting devices
> don't bypass the IOMMU.  Can we keep this specific to powerpc and
> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
> is properly self-describing.
> 
> IOW, I think that on x86 we should assume that all virtio devices
> honor the IOMMU.

From the guest driver POV, that is OK because either there is no IOMMU
to program (the current situation with qemu), there is one that virtio
doesn't need to use (the current situation with qemu and iommu=on), or
there is (Xen) or will be (future qemu) one that requires it.

> 
>>
>> Note that the above can be solved via some kind of compromise: The
>> device self-describes the ability to honor the iommu, along with the
>> property (or ACPI table entry) that indicates whether or not it does.
>>
>> I.e., we could use the revision or ProgIf field of the config space for
>> example, or something in virtio config. If it's an "old" device, we
>> know it always bypasses. If it's a new device, we know it only bypasses
>> if the corresponding property is present. I still would have to sort out the
>> openbios case for mac among others but it's at least a workable
>> direction.
>>
>> BTW, don't you have a similar problem on x86, in that today qemu
>> claims that everything honors the iommu in ACPI?
> 
> Only on a single experimental configuration, and that can apparently
> just be fixed going forward without any real problems being caused.

BTW, I once tried to describe the current situation on QEMU x86 with
IOMMU enabled via ACPI. While you can easily add IOMMU device exceptions
to the static tables, the fun starts when considering device hotplug for
virtio. Unless I missed some trick, ACPI doesn't seem to be designed
for that level of flexibility.

You would have to reserve a complete PCI bus, declare that one as not
being IOMMU-governed, and then only add new virtio devices to that bus.
Possible, but a lot of restrictions that existing management software
would have to be aware of as well.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  2:04       ` Benjamin Herrenschmidt
  2015-11-10  2:18         ` Andy Lutomirski
@ 2015-11-10  9:45         ` Knut Omang
  2015-11-10 10:26           ` Benjamin Herrenschmidt
  2015-11-10 10:27         ` Joerg Roedel
  2 siblings, 1 reply; 38+ messages in thread
From: Knut Omang @ 2015-11-10  9:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andy Lutomirski
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Tue, 2015-11-10 at 13:04 +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
> > The problem here is that in some of the problematic cases the
> > virtio
> > driver may not even be loaded.  If someone runs an L1 guest with an
> > IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
> > *boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)
> > 
> > > 
> > > The only way out of this while keeping the "platform" stuff would
> > > be to also bump some kind of version in the virtio config (or PCI
> > > header). I have no other way to differentiate between "this is an
> > > old qemu that doesn't do the 'bypass property' yet" from "this is
> > > a virtio device that doesn't bypass".
> > > 
> > > Any better idea ?
> > 
> > I'd suggest that, in the absence of the new DT binding, we assume
> > that any PCI device with the virtio vendor ID is passthrough on
> > powerpc.  I can do this in the virtio driver, but if it's in the
> > platform code then vfio gets it right too (i.e. fails to load).
> 
> The problem is there isn't *a* virtio vendor ID. It's the Red Hat
> vendor ID which will be used by more than just virtio, so we need to
> specifically list the devices.
> 
> Additionally, that still means that once we have a virtio device that
> actually uses the iommu, powerpc will not work since the "workaround"
> above will kick in.
> 
> The "in absence of the new DT binding" doesn't make that much sense.
> 
> Those platforms use device-trees defined since the dawn of ages by
> actual open firmware implementations, they either have no iommu
> representation in there (Macs, the platform code hooks it all up) or
> have various properties related to the iommu but no concept of
> "bypass" in there.
> 
> We can *add* a new property under some circumstances that indicates a
> bypass on a per-device basis, however that doesn't completely solve it:
> 
>   - As I said above, what does the absence of that property mean ? An
> old qemu that does bypass on all virtio or a new qemu trying to tell
> you that the virtio device actually does use the iommu (or some other
> environment that isn't qemu) ?
> 
>   - On things like Macs, the device-tree is generated by OpenBIOS; it
> would have to have some added logic to try to figure that out, which
> means it needs to know *via different means* that some or all virtio
> devices bypass the iommu.
> 
> I thus go back to my original statement, it's a LOT easier to handle
> if the device itself is self-describing, indicating whether it is set
> to bypass a host iommu or not. For L1->L2, well, that wouldn't be the
> first time qemu/VFIO plays tricks with the passed through device
> configuration space...
> 
> Note that the above can be solved via some kind of compromise: The
> device self-describes the ability to honor the iommu, along with the
> property (or ACPI table entry) that indicates whether or not it does.
> 
> I.e., we could use the revision or ProgIf field of the config space
> for example, or something in virtio config. If it's an "old" device,
> we know it always bypasses. If it's a new device, we know it only
> bypasses if the corresponding property is present. I still would have
> to sort out the openbios case for mac among others, but it's at least
> a workable direction.
> 
> BTW, don't you have a similar problem on x86, in that today qemu
> claims that everything honors the iommu in ACPI?
> 
> Unless somebody can come up with a better idea...

Can something be done by means of PCIe capabilities?
ATS (Address Translation Services) seems like a natural choice?

Knut

> Cheers,
> Ben.
> 


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  9:45         ` Knut Omang
@ 2015-11-10 10:26           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10 10:26 UTC (permalink / raw)
  To: Knut Omang, Andy Lutomirski
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Tue, 2015-11-10 at 10:45 +0100, Knut Omang wrote:
> Can something be done by means of PCIe capabilities?
> ATS (Address Translation Services) seems like a natural choice?

Euh no... ATS is something else completely....

Cheers,
Ben.



* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  2:04       ` Benjamin Herrenschmidt
  2015-11-10  2:18         ` Andy Lutomirski
  2015-11-10  9:45         ` Knut Omang
@ 2015-11-10 10:27         ` Joerg Roedel
  2015-11-10 19:36           ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 38+ messages in thread
From: Joerg Roedel @ 2015-11-10 10:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andy Lutomirski, Andy Lutomirski, David Woodhouse, linux-kernel,
	David S. Miller, sparclinux, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Tue, Nov 10, 2015 at 01:04:36PM +1100, Benjamin Herrenschmidt wrote:
> The "in absence of the new DT binding" doesn't make that much sense.
> 
> Those platforms use device-trees defined since the dawn of ages by
> actual open firmware implementations, they either have no iommu
> representation in there (Macs, the platform code hooks it all up) or
> have various properties related to the iommu but no concept of "bypass"
> in there.
> 
> We can *add* a new property under some circumstances that indicates a
> bypass on a per-device basis, however that doesn't completely solve it:
> 
>   - As I said above, what does the absence of that property mean ? An
> old qemu that does bypass on all virtio or a new qemu trying to tell
> you that the virtio device actually does use the iommu (or some other
> environment that isn't qemu) ?

You have the same problem when real PCIe devices appear that speak
virtio. I think the only real (still not very nice) solution is to add a
quirk to powerpc platform code that sets noop dma-ops for the existing
virtio vendor/device-ids and add a DT property to opt out of that quirk.

New vendor/device-ids (as for real devices) would just not be covered by
the quirk and existing emulated devices continue to work.

The absence of the property just means that the quirk is in place and
the system assumes no translation for virtio devices.
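In sketch form (the opt-out flag below models the proposed DT property; none of these names are a real binding):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the quirk described above: install no-op DMA ops for the
 * existing virtio vendor/device IDs unless the device-tree opts out.
 * New vendor/device IDs fall outside the range and are never quirked.
 */
static bool wants_noop_dma_ops(unsigned short vendor, unsigned short device,
			       bool dt_opt_out)
{
	bool existing_id = vendor == 0x1af4 &&
			   device >= 0x1000 && device <= 0x10ff;
	return existing_id && !dt_opt_out;
}
```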


	Joerg



* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10  5:35             ` Andy Lutomirski
@ 2015-11-10 10:37               ` Benjamin Herrenschmidt
  2015-11-10 12:43                 ` Michael S. Tsirkin
  2015-11-10 18:54                 ` Andy Lutomirski
  0 siblings, 2 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10 10:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, David Woodhouse, linux-kernel, David S. Miller,
	sparclinux, Joerg Roedel, Christian Borntraeger, Cornelia Huck,
	Sebastian Ott, Paolo Bonzini, Christoph Hellwig, KVM,
	Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> 
> We could do it the other way around: on powerpc, if a PCI device is in
> that range and doesn't have the "bypass" property at all, then it's
> assumed to bypass the IOMMU.  This means that everything that
> currently works continues working.  If someone builds a physical
> virtio device or uses another system in PCIe target mode speaking
> virtio, then it won't work until they upgrade their firmware to set
> bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
> bypass=0 and no ambiguity.
>
> vfio will presumably notice the bypass and correctly refuse to map any
> current virtio devices.
> 
> Would that work?

That would be extremely strange from a platform perspective. Any device
in that vendor/device range would bypass the iommu unless some new
property "actually-works-like-a-real-pci-device" happens to exist in
the device-tree, which we would then need to define somewhere and
handle across at least 3 different platforms that get their device-tree
from wildly different places.

Also if tomorrow I create a PCI device that implements virtio-net and
put it in a machine running IBM proprietary firmware (or Apple's or
Sun's), it won't have that property...

This is not hypothetical. People are using virtio to do point-to-point
communication between machines via PCIe today.

Cheers,
Ben.




* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10 10:37               ` Benjamin Herrenschmidt
@ 2015-11-10 12:43                 ` Michael S. Tsirkin
  2015-11-10 19:37                   ` Benjamin Herrenschmidt
  2015-11-10 18:54                 ` Andy Lutomirski
  1 sibling, 1 reply; 38+ messages in thread
From: Michael S. Tsirkin @ 2015-11-10 12:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andy Lutomirski, Andy Lutomirski, David Woodhouse, linux-kernel,
	David S. Miller, sparclinux, Joerg Roedel, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390, Linux Virtualization

On Tue, Nov 10, 2015 at 09:37:54PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> > 
> > We could do it the other way around: on powerpc, if a PCI device is in
> > that range and doesn't have the "bypass" property at all, then it's
> > assumed to bypass the IOMMU.  This means that everything that
> > currently works continues working.  If someone builds a physical
> > virtio device or uses another system in PCIe target mode speaking
> > virtio, then it won't work until they upgrade their firmware to set
> > bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
> > bypass=0 and no ambiguity.
> >
> > vfio will presumably notice the bypass and correctly refuse to map any
> > current virtio devices.
> > 
> > Would that work?
> 
> That would be extremely strange from a platform perspective. Any device
> in that vendor/device range would bypass the iommu unless some new
> property "actually-works-like-a-real-pci-device" happens to exist in
> the device-tree, which we would then need to define somewhere and
> handle across at least 3 different platforms that get their device-tree
> from wildly different places.

Then we are back to virtio driver telling DMA core
whether it wants a 1:1 mapping in the iommu?
If that's acceptable to others, I don't think that's too bad.


> Also if tomorrow I create a PCI device that implements virtio-net and
> put it in a machine running IBM proprietary firmware (or Apple's or
> Sun's), it won't have that property...
> 
> This is not hypothetical. People are using virtio to do point-to-point
> communication between machines via PCIe today.
> 
> Cheers,
> Ben.

But not virtio-pci I think - that's broken for that use case since we use
weaker barriers than required for real IO, as these have measurable
overhead.  We could have a feature "is a real PCI device";
that's completely reasonable.

-- 
MST


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10 10:37               ` Benjamin Herrenschmidt
  2015-11-10 12:43                 ` Michael S. Tsirkin
@ 2015-11-10 18:54                 ` Andy Lutomirski
  2015-11-10 22:27                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2015-11-10 18:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Paolo Bonzini, David Woodhouse,
	Martin Schwidefsky, Michael S. Tsirkin, Sebastian Ott,
	David S. Miller, linux-s390, Cornelia Huck, Joerg Roedel, KVM,
	Christoph Hellwig, Linux Virtualization, linux-kernel,
	sparclinux

On Nov 10, 2015 2:38 AM, "Benjamin Herrenschmidt"
<benh@kernel.crashing.org> wrote:
>
> On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> >
> > We could do it the other way around: on powerpc, if a PCI device is in
> > that range and doesn't have the "bypass" property at all, then it's
> > assumed to bypass the IOMMU.  This means that everything that
> > currently works continues working.  If someone builds a physical
> > virtio device or uses another system in PCIe target mode speaking
> > virtio, then it won't work until they upgrade their firmware to set
> > bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
> > bypass=0 and no ambiguity.
> >
> > vfio will presumably notice the bypass and correctly refuse to map any
> > current virtio devices.
> >
> > Would that work?
>
> That would be extremely strange from a platform perspective. Any device
> in that vendor/device range would bypass the iommu unless some new
> property "actually-works-like-a-real-pci-device" happens to exist in
> the device-tree, which we would then need to define somewhere and
> handle across at least 3 different platforms that get their device-tree
> from wildly different places.
>
> Also if tomorrow I create a PCI device that implements virtio-net and
> put it in a machine running IBM proprietary firmware (or Apple's or
> Sun's), it won't have that property...
>
> This is not hypothetical. People are using virtio to do point-to-point
> communication between machines via PCIe today.

Does that work on powerpc on existing kernels?

Anyway, here's another crazy idea: make the quirk assume that the
IOMMU is bypassed if and only if the weak barriers bit is set on
systems that are missing the new DT binding.

--Andy

>
> Cheers,
> Ben.
>
>


* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10 10:27         ` Joerg Roedel
@ 2015-11-10 19:36           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10 19:36 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Andy Lutomirski, David Woodhouse, linux-kernel,
	David S. Miller, sparclinux, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390, Linux Virtualization,
	Michael S. Tsirkin

On Tue, 2015-11-10 at 11:27 +0100, Joerg Roedel wrote:
> 
> You have the same problem when real PCIe devices appear that speak
> virtio. I think the only real (still not very nice) solution is to add a
> quirk to powerpc platform code that sets noop dma-ops for the existing
> virtio vendor/device-ids and add a DT property to opt out of that quirk.
>
> New vendor/device-ids (as for real devices) would just not be covered by
> the quirk and existing emulated devices continue to work.

Why would real devices use new vendor/device IDs? Also there are other
cases such as using virtio between 2 partitions, which we could do
under PowerVM ... that would require proper iommu usage with existing
IDs.

> The absence of the property just means that the quirk is in place and
> the system assumes no translation for virtio devices.

The only way that works forward for me (and possibly sparc & others,
what about ARM?) is if we *change* something in virtio qemu at the
same time as we add some kind of property. For example the ProgIf field
or revision ID field.

That way I can key on that change.

It's still tricky because I would have to somehow tell my various firmwares
(SLOF, OpenBIOS, OPAL, ...) so they can create the appropriate property;
it's still hacky, but it would be workable.

Ben.



* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10 12:43                 ` Michael S. Tsirkin
@ 2015-11-10 19:37                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10 19:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andy Lutomirski, Andy Lutomirski, David Woodhouse, linux-kernel,
	David S. Miller, sparclinux, Joerg Roedel, Christian Borntraeger,
	Cornelia Huck, Sebastian Ott, Paolo Bonzini, Christoph Hellwig,
	KVM, Martin Schwidefsky, linux-s390, Linux Virtualization

On Tue, 2015-11-10 at 14:43 +0200, Michael S. Tsirkin wrote:
> But not virtio-pci I think - that's broken for that use case since we use
> weaker barriers than required for real IO, as these have measurable
> overhead.  We could have a feature "is a real PCI device";
> that's completely reasonable.

Do we use weaker barriers on the Linux driver side? I didn't think so
... 

Cheers,
Ben.



* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10 18:54                 ` Andy Lutomirski
@ 2015-11-10 22:27                   ` Benjamin Herrenschmidt
  2015-11-10 23:44                     ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-10 22:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Borntraeger, Paolo Bonzini, David Woodhouse,
	Martin Schwidefsky, Michael S. Tsirkin, Sebastian Ott,
	David S. Miller, linux-s390, Cornelia Huck, Joerg Roedel, KVM,
	Christoph Hellwig, Linux Virtualization, linux-kernel,
	sparclinux

On Tue, 2015-11-10 at 10:54 -0800, Andy Lutomirski wrote:
> 
> Does that work on powerpc on existing kernels?
> 
> Anyway, here's another crazy idea: make the quirk assume that the
> IOMMU is bypassed if and only if the weak barriers bit is set on
> systems that are missing the new DT binding.

"New DT bindings" doesn't mean much ... how do we change DT bindings on
existing machines with a FW in flash ?

What about partition <-> partition virtio such as what we could do on
PAPR systems? That would have the weak barrier bit.

Ben.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10 22:27                   ` Benjamin Herrenschmidt
@ 2015-11-10 23:44                     ` Andy Lutomirski
  2015-11-11  0:44                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2015-11-10 23:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Paolo Bonzini, David Woodhouse,
	Martin Schwidefsky, Michael S. Tsirkin, Sebastian Ott,
	David S. Miller, linux-s390, Cornelia Huck, Joerg Roedel, KVM,
	Christoph Hellwig, Linux Virtualization, linux-kernel,
	sparclinux

On Tue, Nov 10, 2015 at 2:27 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2015-11-10 at 10:54 -0800, Andy Lutomirski wrote:
>>
>> Does that work on powerpc on existing kernels?
>>
>> Anyway, here's another crazy idea: make the quirk assume that the
>> IOMMU is bypassed if and only if the weak barriers bit is set on
>> systems that are missing the new DT binding.
>
> "New DT bindings" doesn't mean much ... how do we change DT bindings on
> existing machines with a FW in flash ?
>
> What about partition <-> partition virtio such as what we could do on
> PAPR systems? That would have the weak barrier bit.
>

Is it partition <-> partition, bypassing IOMMU?

I think I'd settle for just something that doesn't regress
non-experimental setups that actually work today and that allows new
setups (x86 with fixed QEMU and maybe something more complicated on
powerpc and/or sparc) to work in all cases.

We could certainly just make powerpc and sparc continue bypassing the
IOMMU until someone comes up with a way to fix it.  I'll send out some
patches that do that, and maybe that'll help this make progress.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-10 23:44                     ` Andy Lutomirski
@ 2015-11-11  0:44                       ` Benjamin Herrenschmidt
  2015-11-11  4:46                         ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-11  0:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Borntraeger, Paolo Bonzini, David Woodhouse,
	Martin Schwidefsky, Michael S. Tsirkin, Sebastian Ott,
	David S. Miller, linux-s390, Cornelia Huck, Joerg Roedel, KVM,
	Christoph Hellwig, Linux Virtualization, linux-kernel,
	sparclinux

On Tue, 2015-11-10 at 15:44 -0800, Andy Lutomirski wrote:
> 
> > What about partition <-> partition virtio such as what we could do on
> > PAPR systems? That would have the weak barrier bit.
> >
> 
> Is it partition <-> partition, bypassing IOMMU?

No.

> I think I'd settle for just something that doesn't regress
> non-experimental setups that actually work today and that allows new
> setups (x86 with fixed QEMU and maybe something more complicated on
> powerpc and/or sparc) to work in all cases.
> 
> We could certainly just make powerpc and sparc continue bypassing the
> IOMMU until someone comes up with a way to fix it.  I'll send out some
> patches that do that, and maybe that'll help this make progress.

But we haven't found a solution that works. All we have come up with is
a quirk that will always force bypass on virtio and will not allow us
to operate non-bypassing devices on either of those architectures in
the future.

I'm not too happy about this.

Ben.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-11  0:44                       ` Benjamin Herrenschmidt
@ 2015-11-11  4:46                         ` Andy Lutomirski
  2015-11-11  5:08                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2015-11-11  4:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, David Woodhouse, Paolo Bonzini,
	Martin Schwidefsky, Michael S. Tsirkin, David S. Miller,
	Sebastian Ott, linux-s390, Joerg Roedel, Cornelia Huck, KVM,
	Christoph Hellwig, linux-kernel, Linux Virtualization,
	sparclinux

On Nov 10, 2015 4:44 PM, "Benjamin Herrenschmidt"
<benh@kernel.crashing.org> wrote:
>
> On Tue, 2015-11-10 at 15:44 -0800, Andy Lutomirski wrote:
> >
> > > What about partition <-> partition virtio such as what we could do on
> > > PAPR systems? That would have the weak barrier bit.
> > >
> >
> > Is it partition <-> partition, bypassing IOMMU?
>
> No.
>
> > I think I'd settle for just something that doesn't regress
> > non-experimental setups that actually work today and that allows new
> > setups (x86 with fixed QEMU and maybe something more complicated on
> > powerpc and/or sparc) to work in all cases.
> >
> > We could certainly just make powerpc and sparc continue bypassing the
> > IOMMU until someone comes up with a way to fix it.  I'll send out some
> > patches that do that, and maybe that'll help this make progress.
>
> But we haven't found a solution that works. All we have come up with is
> a quirk that will always force bypass on virtio and will not allow us
> to operate non-bypassing devices on either of those architectures in
> the future.
>
> I'm not too happy about this.

Me neither.  At least it wouldn't be a regression, but it's still crappy.

I think that arm is fine, at least.  I was unable to find an arm QEMU
config that has any problems with my patches.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 0/6] virtio core DMA API conversion
  2015-11-11  4:46                         ` Andy Lutomirski
@ 2015-11-11  5:08                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2015-11-11  5:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Borntraeger, David Woodhouse, Paolo Bonzini,
	Martin Schwidefsky, Michael S. Tsirkin, David S. Miller,
	Sebastian Ott, linux-s390, Joerg Roedel, Cornelia Huck, KVM,
	Christoph Hellwig, linux-kernel, Linux Virtualization,
	sparclinux

On Tue, 2015-11-10 at 20:46 -0800, Andy Lutomirski wrote:
> Me neither.  At least it wouldn't be a regression, but it's still
> crappy.
> 
> I think that arm is fine, at least.  I was unable to find an arm QEMU
> config that has any problems with my patches.

Ok, give me a few days for my headache & fever to subside and see if I
can find something better. David, any ideas from your side? :-)

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2015-11-11  5:09 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-30  1:09 [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
2015-10-30  1:09 ` [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack Andy Lutomirski
2015-10-30 13:55   ` Christian Borntraeger
2015-10-31  5:02     ` Andy Lutomirski
2015-10-30  1:09 ` [PATCH v4 2/6] virtio_ring: Support DMA APIs Andy Lutomirski
2015-10-30 12:01   ` Cornelia Huck
2015-10-30 12:05     ` Christian Borntraeger
2015-10-30 18:51       ` Andy Lutomirski
2015-10-30  1:09 ` [PATCH v4 3/6] virtio_pci: Use the DMA API Andy Lutomirski
2015-10-30  1:09 ` [PATCH v4 4/6] virtio: Add improved queue allocation API Andy Lutomirski
2015-10-30  1:09 ` [PATCH v4 5/6] virtio_mmio: Use the DMA API Andy Lutomirski
2015-10-30  1:09 ` [PATCH v4 6/6] virtio_pci: " Andy Lutomirski
2015-10-30  1:17 ` [PATCH v4 0/6] virtio core DMA API conversion Andy Lutomirski
2015-10-30  9:57 ` Christian Borntraeger
2015-11-09 12:15 ` Michael S. Tsirkin
2015-11-09 12:27   ` Paolo Bonzini
2015-11-09 22:58   ` Benjamin Herrenschmidt
2015-11-10  0:46     ` Andy Lutomirski
2015-11-10  2:04       ` Benjamin Herrenschmidt
2015-11-10  2:18         ` Andy Lutomirski
2015-11-10  5:26           ` Benjamin Herrenschmidt
2015-11-10  5:33             ` Andy Lutomirski
2015-11-10  5:28           ` Benjamin Herrenschmidt
2015-11-10  5:35             ` Andy Lutomirski
2015-11-10 10:37               ` Benjamin Herrenschmidt
2015-11-10 12:43                 ` Michael S. Tsirkin
2015-11-10 19:37                   ` Benjamin Herrenschmidt
2015-11-10 18:54                 ` Andy Lutomirski
2015-11-10 22:27                   ` Benjamin Herrenschmidt
2015-11-10 23:44                     ` Andy Lutomirski
2015-11-11  0:44                       ` Benjamin Herrenschmidt
2015-11-11  4:46                         ` Andy Lutomirski
2015-11-11  5:08                           ` Benjamin Herrenschmidt
2015-11-10  7:28           ` Jan Kiszka
2015-11-10  9:45         ` Knut Omang
2015-11-10 10:26           ` Benjamin Herrenschmidt
2015-11-10 10:27         ` Joerg Roedel
2015-11-10 19:36           ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).