linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 00/13] vhost: format independence
@ 2020-06-02 13:05 Michael S. Tsirkin
  2020-06-02 13:05 ` [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct Michael S. Tsirkin
                   ` (12 more replies)
  0 siblings, 13 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

We let the specifics of the ring format seep through to vhost API
callers - mostly because there was only one format, so it was hard to
imagine what an independent API would look like.
Now that there's an alternative in the form of the packed ring, it's
easier to see the issues, and fixing them is perhaps the cleanest way
to add support for more formats.

This patchset does this by introducing two new structures: vhost_buf to
represent a buffer and vhost_desc to represent a descriptor.
Descriptors aren't normally of interest to devices but do occasionally
get exposed, e.g. for logging.
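
For reference, these are the two structures as they end up defined later
in the series (vhost_desc in patch 01, vhost_buf in patch 07):

	struct vhost_desc {
		u64 addr;
		u32 len;
		u16 flags; /* VRING_DESC_F_WRITE, VRING_DESC_F_NEXT */
		u16 id;
	};

	struct vhost_buf {
		u32 out_len;
		u32 in_len;
		u16 descs;
		u16 id;
	};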

Perhaps surprisingly, the higher-level API actually makes things a bit
easier for callers, and it allows more freedom for the vhost core.
The end result is basically unchanged performance (based on preliminary
testing) even though we are forced to go through a format conversion.

The conversion also exposed (more) bugs in vhost scsi - which isn't
really surprising: that driver needs a lot more love than it's getting.

Very lightly tested. Would appreciate feedback and testing.

Michael S. Tsirkin (13):
  vhost: option to fetch descriptors through an independent struct
  vhost: use batched version by default
  vhost: batching fetches
  vhost: cleanup fetch_buf return code handling
  vhost/net: pass net specific struct pointer
  vhost: reorder functions
  vhost: format-independent API for used buffers
  vhost/net: convert to new API: heads->bufs
  vhost/net: avoid iov length math
  vhost/test: convert to the buf API
  vhost/scsi: switch to buf APIs
  vhost/vsock: switch to the buf API
  vhost: drop head based APIs

 drivers/vhost/net.c   | 173 +++++++++----------
 drivers/vhost/scsi.c  |  73 ++++----
 drivers/vhost/test.c  |  22 +--
 drivers/vhost/vhost.c | 375 +++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h |  46 ++++--
 drivers/vhost/vsock.c |  30 ++--
 6 files changed, 436 insertions(+), 283 deletions(-)

-- 
MST



* [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
@ 2020-06-02 13:05 ` Michael S. Tsirkin
  2020-06-03  7:13   ` Jason Wang
  2020-06-02 13:05 ` [PATCH RFC 02/13] vhost: use batched version by default Michael S. Tsirkin
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

The idea is to support multiple ring formats by converting
to a format-independent array of descriptors.

This costs extra cycles, but we gain the ability
to fetch a batch of descriptors in one go, which
is good for code cache locality.

When used, this causes a minor performance degradation; the code has
been kept as simple as possible for ease of review.
A follow-up patch gets us back the performance by adding batching.

To simplify benchmarking, I kept the old code around so one can switch
back and forth between old and new code. This will go away in the final
submission.
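
In outline, the new path added below looks like this (a sketch of the
call flow only; the names are those used in the diff):

	vhost_get_vq_desc_batch()
		fetch_descs()		- walk the split ring once, flattening
					  each chain into vq->descs[] via
					  push_split_desc(), expanding indirect
					  chains through fetch_indirect_descs()
		for each entry in vq->descs[]
			translate_desc()  - convert the flattened descriptor
					    to iovecs, independently of the
					    ring format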

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Link: https://lore.kernel.org/r/20200401183118.8334-2-eperezma@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 297 +++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h |  16 +++
 2 files changed, 312 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 96d9871fa0cb..105fc97af2c8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -298,6 +298,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
 {
 	vq->num = 1;
+	vq->ndescs = 0;
 	vq->desc = NULL;
 	vq->avail = NULL;
 	vq->used = NULL;
@@ -368,6 +369,9 @@ static int vhost_worker(void *data)
 
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
 {
+	kfree(vq->descs);
+	vq->descs = NULL;
+	vq->max_descs = 0;
 	kfree(vq->indirect);
 	vq->indirect = NULL;
 	kfree(vq->log);
@@ -384,6 +388,10 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
+		vq->max_descs = dev->iov_limit;
+		vq->descs = kmalloc_array(vq->max_descs,
+					  sizeof(*vq->descs),
+					  GFP_KERNEL);
 		vq->indirect = kmalloc_array(UIO_MAXIOV,
 					     sizeof(*vq->indirect),
 					     GFP_KERNEL);
@@ -391,7 +399,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
 					GFP_KERNEL);
 		vq->heads = kmalloc_array(dev->iov_limit, sizeof(*vq->heads),
 					  GFP_KERNEL);
-		if (!vq->indirect || !vq->log || !vq->heads)
+		if (!vq->indirect || !vq->log || !vq->heads || !vq->descs)
 			goto err_nomem;
 	}
 	return 0;
@@ -2277,6 +2285,293 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 }
 EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
+static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
+{
+	BUG_ON(!vq->ndescs);
+	return &vq->descs[vq->ndescs - 1];
+}
+
+static void pop_split_desc(struct vhost_virtqueue *vq)
+{
+	BUG_ON(!vq->ndescs);
+	--vq->ndescs;
+}
+
+#define VHOST_DESC_FLAGS (VRING_DESC_F_INDIRECT | VRING_DESC_F_WRITE | \
+			  VRING_DESC_F_NEXT)
+static int push_split_desc(struct vhost_virtqueue *vq, struct vring_desc *desc, u16 id)
+{
+	struct vhost_desc *h;
+
+	if (unlikely(vq->ndescs >= vq->max_descs))
+		return -EINVAL;
+	h = &vq->descs[vq->ndescs++];
+	h->addr = vhost64_to_cpu(vq, desc->addr);
+	h->len = vhost32_to_cpu(vq, desc->len);
+	h->flags = vhost16_to_cpu(vq, desc->flags) & VHOST_DESC_FLAGS;
+	h->id = id;
+
+	return 0;
+}
+
+static int fetch_indirect_descs(struct vhost_virtqueue *vq,
+				struct vhost_desc *indirect,
+				u16 head)
+{
+	struct vring_desc desc;
+	unsigned int i = 0, count, found = 0;
+	u32 len = indirect->len;
+	struct iov_iter from;
+	int ret;
+
+	/* Sanity check */
+	if (unlikely(len % sizeof desc)) {
+		vq_err(vq, "Invalid length in indirect descriptor: "
+		       "len 0x%llx not multiple of 0x%zx\n",
+		       (unsigned long long)len,
+		       sizeof desc);
+		return -EINVAL;
+	}
+
+	ret = translate_desc(vq, indirect->addr, len, vq->indirect,
+			     UIO_MAXIOV, VHOST_ACCESS_RO);
+	if (unlikely(ret < 0)) {
+		if (ret != -EAGAIN)
+			vq_err(vq, "Translation failure %d in indirect.\n", ret);
+		return ret;
+	}
+	iov_iter_init(&from, READ, vq->indirect, ret, len);
+
+	/* We will use the result as an address to read from, so most
+	 * architectures only need a compiler barrier here. */
+	read_barrier_depends();
+
+	count = len / sizeof desc;
+	/* Buffers are chained via a 16 bit next field, so
+	 * we can have at most 2^16 of these. */
+	if (unlikely(count > USHRT_MAX + 1)) {
+		vq_err(vq, "Indirect buffer length too big: %d\n",
+		       indirect->len);
+		return -E2BIG;
+	}
+	if (unlikely(vq->ndescs + count > vq->max_descs)) {
+		vq_err(vq, "Too many indirect + direct descs: %d + %d\n",
+		       vq->ndescs, indirect->len);
+		return -E2BIG;
+	}
+
+	do {
+		if (unlikely(++found > count)) {
+			vq_err(vq, "Loop detected: last one at %u "
+			       "indirect size %u\n",
+			       i, count);
+			return -EINVAL;
+		}
+		if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
+			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
+			       i, (size_t)indirect->addr + i * sizeof desc);
+			return -EINVAL;
+		}
+		if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
+			vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
+			       i, (size_t)indirect->addr + i * sizeof desc);
+			return -EINVAL;
+		}
+
+		push_split_desc(vq, &desc, head);
+	} while ((i = next_desc(vq, &desc)) != -1);
+	return 0;
+}
+
+static int fetch_descs(struct vhost_virtqueue *vq)
+{
+	unsigned int i, head, found = 0;
+	struct vhost_desc *last;
+	struct vring_desc desc;
+	__virtio16 avail_idx;
+	__virtio16 ring_head;
+	u16 last_avail_idx;
+	int ret;
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	last_avail_idx = vq->last_avail_idx;
+
+	if (vq->avail_idx == vq->last_avail_idx) {
+		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
+			vq_err(vq, "Failed to access avail idx at %p\n",
+				&vq->avail->idx);
+			return -EFAULT;
+		}
+		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+
+		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
+			vq_err(vq, "Guest moved used index from %u to %u",
+				last_avail_idx, vq->avail_idx);
+			return -EFAULT;
+		}
+
+		/* If there's nothing new since last we looked, return
+		 * invalid.
+		 */
+		if (vq->avail_idx == last_avail_idx)
+			return vq->num;
+
+		/* Only get avail ring entries after they have been
+		 * exposed by guest.
+		 */
+		smp_rmb();
+	}
+
+	/* Grab the next descriptor number they're advertising */
+	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
+		vq_err(vq, "Failed to read head: idx %d address %p\n",
+		       last_avail_idx,
+		       &vq->avail->ring[last_avail_idx % vq->num]);
+		return -EFAULT;
+	}
+
+	head = vhost16_to_cpu(vq, ring_head);
+
+	/* If their number is silly, that's an error. */
+	if (unlikely(head >= vq->num)) {
+		vq_err(vq, "Guest says index %u > %u is available",
+		       head, vq->num);
+		return -EINVAL;
+	}
+
+	i = head;
+	do {
+		if (unlikely(i >= vq->num)) {
+			vq_err(vq, "Desc index is %u > %u, head = %u",
+			       i, vq->num, head);
+			return -EINVAL;
+		}
+		if (unlikely(++found > vq->num)) {
+			vq_err(vq, "Loop detected: last one at %u "
+			       "vq size %u head %u\n",
+			       i, vq->num, head);
+			return -EINVAL;
+		}
+		ret = vhost_get_desc(vq, &desc, i);
+		if (unlikely(ret)) {
+			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+			       i, vq->desc + i);
+			return -EFAULT;
+		}
+		ret = push_split_desc(vq, &desc, head);
+		if (unlikely(ret)) {
+			vq_err(vq, "Failed to save descriptor: idx %d\n", i);
+			return -EINVAL;
+		}
+	} while ((i = next_desc(vq, &desc)) != -1);
+
+	last = peek_split_desc(vq);
+	if (unlikely(last->flags & VRING_DESC_F_INDIRECT)) {
+		pop_split_desc(vq);
+		ret = fetch_indirect_descs(vq, last, head);
+		if (unlikely(ret < 0)) {
+			if (ret != -EAGAIN)
+				vq_err(vq, "Failure detected "
+				       "in indirect descriptor at idx %d\n", head);
+			return ret;
+		}
+	}
+
+	/* Assume notifications from guest are disabled at this point,
+	 * if they aren't we would need to update avail_event index. */
+	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
+
+	/* On success, increment avail index. */
+	vq->last_avail_idx++;
+
+	return 0;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found.  A negative code is
+ * returned on error. */
+int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
+		      struct iovec iov[], unsigned int iov_size,
+		      unsigned int *out_num, unsigned int *in_num,
+		      struct vhost_log *log, unsigned int *log_num)
+{
+	int ret = fetch_descs(vq);
+	int i;
+
+	if (ret)
+		return ret;
+
+	/* Now convert to IOV */
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+	if (unlikely(log))
+		*log_num = 0;
+
+	for (i = 0; i < vq->ndescs; ++i) {
+		unsigned iov_count = *in_num + *out_num;
+		struct vhost_desc *desc = &vq->descs[i];
+		int access;
+
+		if (desc->flags & ~VHOST_DESC_FLAGS) {
+			vq_err(vq, "Unexpected flags: 0x%x at descriptor id 0x%x\n",
+			       desc->flags, desc->id);
+			ret = -EINVAL;
+			goto err;
+		}
+		if (desc->flags & VRING_DESC_F_WRITE)
+			access = VHOST_ACCESS_WO;
+		else
+			access = VHOST_ACCESS_RO;
+		ret = translate_desc(vq, desc->addr,
+				     desc->len, iov + iov_count,
+				     iov_size - iov_count, access);
+		if (unlikely(ret < 0)) {
+			if (ret != -EAGAIN)
+				vq_err(vq, "Translation failure %d descriptor idx %d\n",
+					ret, i);
+			goto err;
+		}
+		if (access == VHOST_ACCESS_WO) {
+			/* If this is an input descriptor,
+			 * increment that count. */
+			*in_num += ret;
+			if (unlikely(log && ret)) {
+				log[*log_num].addr = desc->addr;
+				log[*log_num].len = desc->len;
+				++*log_num;
+			}
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (unlikely(*in_num)) {
+				vq_err(vq, "Descriptor has out after in: "
+				       "idx %d\n", i);
+				ret = -EINVAL;
+				goto err;
+			}
+			*out_num += ret;
+		}
+
+		ret = desc->id;
+	}
+
+	vq->ndescs = 0;
+
+	return ret;
+
+err:
+	vhost_discard_vq_desc(vq, 1);
+	vq->ndescs = 0;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
+
 /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
 void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 60cab4c78229..0976a2853935 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -60,6 +60,13 @@ enum vhost_uaddr_type {
 	VHOST_NUM_ADDRS = 3,
 };
 
+struct vhost_desc {
+	u64 addr;
+	u32 len;
+	u16 flags; /* VRING_DESC_F_WRITE, VRING_DESC_F_NEXT */
+	u16 id;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -71,6 +78,11 @@ struct vhost_virtqueue {
 	vring_avail_t __user *avail;
 	vring_used_t __user *used;
 	const struct vhost_iotlb_map *meta_iotlb[VHOST_NUM_ADDRS];
+
+	struct vhost_desc *descs;
+	int ndescs;
+	int max_descs;
+
 	struct file *kick;
 	struct eventfd_ctx *call_ctx;
 	struct eventfd_ctx *error_ctx;
@@ -175,6 +187,10 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
 bool vhost_log_access_ok(struct vhost_dev *);
 
+int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
+		      struct iovec iov[], unsigned int iov_count,
+		      unsigned int *out_num, unsigned int *in_num,
+		      struct vhost_log *log, unsigned int *log_num);
 int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
-- 
MST



* [PATCH RFC 02/13] vhost: use batched version by default
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
  2020-06-02 13:05 ` [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct Michael S. Tsirkin
@ 2020-06-02 13:05 ` Michael S. Tsirkin
  2020-06-03  7:15   ` Jason Wang
  2020-06-02 13:06 ` [PATCH RFC 03/13] vhost: batching fetches Michael S. Tsirkin
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

As testing shows no performance change, switch to the batched
descriptor-fetching version by default now.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Link: https://lore.kernel.org/r/20200401183118.8334-3-eperezma@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 251 +-----------------------------------------
 drivers/vhost/vhost.h |   4 -
 2 files changed, 2 insertions(+), 253 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 105fc97af2c8..8f9a07282625 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2038,253 +2038,6 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
 	return next;
 }
 
-static int get_indirect(struct vhost_virtqueue *vq,
-			struct iovec iov[], unsigned int iov_size,
-			unsigned int *out_num, unsigned int *in_num,
-			struct vhost_log *log, unsigned int *log_num,
-			struct vring_desc *indirect)
-{
-	struct vring_desc desc;
-	unsigned int i = 0, count, found = 0;
-	u32 len = vhost32_to_cpu(vq, indirect->len);
-	struct iov_iter from;
-	int ret, access;
-
-	/* Sanity check */
-	if (unlikely(len % sizeof desc)) {
-		vq_err(vq, "Invalid length in indirect descriptor: "
-		       "len 0x%llx not multiple of 0x%zx\n",
-		       (unsigned long long)len,
-		       sizeof desc);
-		return -EINVAL;
-	}
-
-	ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
-			     UIO_MAXIOV, VHOST_ACCESS_RO);
-	if (unlikely(ret < 0)) {
-		if (ret != -EAGAIN)
-			vq_err(vq, "Translation failure %d in indirect.\n", ret);
-		return ret;
-	}
-	iov_iter_init(&from, READ, vq->indirect, ret, len);
-
-	/* We will use the result as an address to read from, so most
-	 * architectures only need a compiler barrier here. */
-	read_barrier_depends();
-
-	count = len / sizeof desc;
-	/* Buffers are chained via a 16 bit next field, so
-	 * we can have at most 2^16 of these. */
-	if (unlikely(count > USHRT_MAX + 1)) {
-		vq_err(vq, "Indirect buffer length too big: %d\n",
-		       indirect->len);
-		return -E2BIG;
-	}
-
-	do {
-		unsigned iov_count = *in_num + *out_num;
-		if (unlikely(++found > count)) {
-			vq_err(vq, "Loop detected: last one at %u "
-			       "indirect size %u\n",
-			       i, count);
-			return -EINVAL;
-		}
-		if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
-			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
-			       i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
-			return -EINVAL;
-		}
-		if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
-			vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
-			       i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
-			return -EINVAL;
-		}
-
-		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
-			access = VHOST_ACCESS_WO;
-		else
-			access = VHOST_ACCESS_RO;
-
-		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
-				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count, access);
-		if (unlikely(ret < 0)) {
-			if (ret != -EAGAIN)
-				vq_err(vq, "Translation failure %d indirect idx %d\n",
-					ret, i);
-			return ret;
-		}
-		/* If this is an input descriptor, increment that count. */
-		if (access == VHOST_ACCESS_WO) {
-			*in_num += ret;
-			if (unlikely(log && ret)) {
-				log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
-				log[*log_num].len = vhost32_to_cpu(vq, desc.len);
-				++*log_num;
-			}
-		} else {
-			/* If it's an output descriptor, they're all supposed
-			 * to come before any input descriptors. */
-			if (unlikely(*in_num)) {
-				vq_err(vq, "Indirect descriptor "
-				       "has out after in: idx %d\n", i);
-				return -EINVAL;
-			}
-			*out_num += ret;
-		}
-	} while ((i = next_desc(vq, &desc)) != -1);
-	return 0;
-}
-
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access.  Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found.  A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
-		      struct iovec iov[], unsigned int iov_size,
-		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num)
-{
-	struct vring_desc desc;
-	unsigned int i, head, found = 0;
-	u16 last_avail_idx;
-	__virtio16 avail_idx;
-	__virtio16 ring_head;
-	int ret, access;
-
-	/* Check it isn't doing very strange things with descriptor numbers. */
-	last_avail_idx = vq->last_avail_idx;
-
-	if (vq->avail_idx == vq->last_avail_idx) {
-		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-			vq_err(vq, "Failed to access avail idx at %p\n",
-				&vq->avail->idx);
-			return -EFAULT;
-		}
-		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-			vq_err(vq, "Guest moved used index from %u to %u",
-				last_avail_idx, vq->avail_idx);
-			return -EFAULT;
-		}
-
-		/* If there's nothing new since last we looked, return
-		 * invalid.
-		 */
-		if (vq->avail_idx == last_avail_idx)
-			return vq->num;
-
-		/* Only get avail ring entries after they have been
-		 * exposed by guest.
-		 */
-		smp_rmb();
-	}
-
-	/* Grab the next descriptor number they're advertising, and increment
-	 * the index we've seen. */
-	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
-		vq_err(vq, "Failed to read head: idx %d address %p\n",
-		       last_avail_idx,
-		       &vq->avail->ring[last_avail_idx % vq->num]);
-		return -EFAULT;
-	}
-
-	head = vhost16_to_cpu(vq, ring_head);
-
-	/* If their number is silly, that's an error. */
-	if (unlikely(head >= vq->num)) {
-		vq_err(vq, "Guest says index %u > %u is available",
-		       head, vq->num);
-		return -EINVAL;
-	}
-
-	/* When we start there are none of either input nor output. */
-	*out_num = *in_num = 0;
-	if (unlikely(log))
-		*log_num = 0;
-
-	i = head;
-	do {
-		unsigned iov_count = *in_num + *out_num;
-		if (unlikely(i >= vq->num)) {
-			vq_err(vq, "Desc index is %u > %u, head = %u",
-			       i, vq->num, head);
-			return -EINVAL;
-		}
-		if (unlikely(++found > vq->num)) {
-			vq_err(vq, "Loop detected: last one at %u "
-			       "vq size %u head %u\n",
-			       i, vq->num, head);
-			return -EINVAL;
-		}
-		ret = vhost_get_desc(vq, &desc, i);
-		if (unlikely(ret)) {
-			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
-			       i, vq->desc + i);
-			return -EFAULT;
-		}
-		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
-			ret = get_indirect(vq, iov, iov_size,
-					   out_num, in_num,
-					   log, log_num, &desc);
-			if (unlikely(ret < 0)) {
-				if (ret != -EAGAIN)
-					vq_err(vq, "Failure detected "
-						"in indirect descriptor at idx %d\n", i);
-				return ret;
-			}
-			continue;
-		}
-
-		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
-			access = VHOST_ACCESS_WO;
-		else
-			access = VHOST_ACCESS_RO;
-		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
-				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count, access);
-		if (unlikely(ret < 0)) {
-			if (ret != -EAGAIN)
-				vq_err(vq, "Translation failure %d descriptor idx %d\n",
-					ret, i);
-			return ret;
-		}
-		if (access == VHOST_ACCESS_WO) {
-			/* If this is an input descriptor,
-			 * increment that count. */
-			*in_num += ret;
-			if (unlikely(log && ret)) {
-				log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
-				log[*log_num].len = vhost32_to_cpu(vq, desc.len);
-				++*log_num;
-			}
-		} else {
-			/* If it's an output descriptor, they're all supposed
-			 * to come before any input descriptors. */
-			if (unlikely(*in_num)) {
-				vq_err(vq, "Descriptor has out after in: "
-				       "idx %d\n", i);
-				return -EINVAL;
-			}
-			*out_num += ret;
-		}
-	} while ((i = next_desc(vq, &desc)) != -1);
-
-	/* On success, increment avail index. */
-	vq->last_avail_idx++;
-
-	/* Assume notifications from guest are disabled at this point,
-	 * if they aren't we would need to update avail_event index. */
-	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
-	return head;
-}
-EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
-
 static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
 {
 	BUG_ON(!vq->ndescs);
@@ -2495,7 +2248,7 @@ static int fetch_descs(struct vhost_virtqueue *vq)
  * This function returns the descriptor number found, or vq->num (which is
  * never a valid descriptor number) if none was found.  A negative code is
  * returned on error. */
-int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
+int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		      struct iovec iov[], unsigned int iov_size,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num)
@@ -2570,7 +2323,7 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
+EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
 /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
 void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0976a2853935..76356edee8e5 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -187,10 +187,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
 bool vhost_log_access_ok(struct vhost_dev *);
 
-int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
-		      struct iovec iov[], unsigned int iov_count,
-		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num);
 int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
-- 
MST



* [PATCH RFC 03/13] vhost: batching fetches
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
  2020-06-02 13:05 ` [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct Michael S. Tsirkin
  2020-06-02 13:05 ` [PATCH RFC 02/13] vhost: use batched version by default Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-03  7:27   ` Jason Wang
  2020-06-02 13:06 ` [PATCH RFC 04/13] vhost: cleanup fetch_buf return code handling Michael S. Tsirkin
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

With this patch applied, new and old code perform identically.

Lots of extra optimizations are now possible, e.g. we can now fetch
multiple heads with copy_from/to_user, and we can get rid of
maintaining the log array, among other things.
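
The consumption model after this patch, as a sketch (field and helper
names are those of the diff below):

	/*
	 * Per-vq descriptor cache:
	 *
	 *   vq->descs[0 .. first_desc - 1]       already handed out to the caller
	 *   vq->descs[first_desc .. ndescs - 1]  fetched from the ring, not yet used
	 *
	 * fetch_descs() refills the cache only when it is empty, looping over
	 * fetch_buf() until roughly vhost_vq_num_batch_descs() descriptors are
	 * cached; vhost_get_vq_desc() then returns one chain at a time,
	 * advancing vq->first_desc.
	 */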

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Link: https://lore.kernel.org/r/20200401183118.8334-4-eperezma@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/test.c  |  2 +-
 drivers/vhost/vhost.c | 47 ++++++++++++++++++++++++++++++++++++++-----
 drivers/vhost/vhost.h |  5 ++++-
 3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 9a3a09005e03..02806d6f84ef 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 	dev = &n->dev;
 	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
+	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
 		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, NULL);
 
 	f->private_data = n;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8f9a07282625..aca2a5b0d078 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -299,6 +299,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 {
 	vq->num = 1;
 	vq->ndescs = 0;
+	vq->first_desc = 0;
 	vq->desc = NULL;
 	vq->avail = NULL;
 	vq->used = NULL;
@@ -367,6 +368,11 @@ static int vhost_worker(void *data)
 	return 0;
 }
 
+static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
+{
+	return vq->max_descs - UIO_MAXIOV;
+}
+
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
 {
 	kfree(vq->descs);
@@ -389,6 +395,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
 		vq->max_descs = dev->iov_limit;
+		if (vhost_vq_num_batch_descs(vq) < 0) {
+			return -EINVAL;
+		}
 		vq->descs = kmalloc_array(vq->max_descs,
 					  sizeof(*vq->descs),
 					  GFP_KERNEL);
@@ -1570,6 +1579,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 		vq->last_avail_idx = s.num;
 		/* Forget the cached index value. */
 		vq->avail_idx = vq->last_avail_idx;
+		vq->ndescs = vq->first_desc = 0;
 		break;
 	case VHOST_GET_VRING_BASE:
 		s.index = idx;
@@ -2136,7 +2146,7 @@ static int fetch_indirect_descs(struct vhost_virtqueue *vq,
 	return 0;
 }
 
-static int fetch_descs(struct vhost_virtqueue *vq)
+static int fetch_buf(struct vhost_virtqueue *vq)
 {
 	unsigned int i, head, found = 0;
 	struct vhost_desc *last;
@@ -2149,7 +2159,11 @@ static int fetch_descs(struct vhost_virtqueue *vq)
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
 
-	if (vq->avail_idx == vq->last_avail_idx) {
+	if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
+		/* If we already have work to do, don't bother re-checking. */
+		if (likely(vq->ndescs))
+			return vq->num;
+
 		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
 			vq_err(vq, "Failed to access avail idx at %p\n",
 				&vq->avail->idx);
@@ -2240,6 +2254,24 @@ static int fetch_descs(struct vhost_virtqueue *vq)
 	return 0;
 }
 
+static int fetch_descs(struct vhost_virtqueue *vq)
+{
+	int ret = 0;
+
+	if (unlikely(vq->first_desc >= vq->ndescs)) {
+		vq->first_desc = 0;
+		vq->ndescs = 0;
+	}
+
+	if (vq->ndescs)
+		return 0;
+
+	while (!ret && vq->ndescs <= vhost_vq_num_batch_descs(vq))
+		ret = fetch_buf(vq);
+
+	return vq->ndescs ? 0 : ret;
+}
+
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
@@ -2265,7 +2297,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	if (unlikely(log))
 		*log_num = 0;
 
-	for (i = 0; i < vq->ndescs; ++i) {
+	for (i = vq->first_desc; i < vq->ndescs; ++i) {
 		unsigned iov_count = *in_num + *out_num;
 		struct vhost_desc *desc = &vq->descs[i];
 		int access;
@@ -2311,14 +2343,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		}
 
 		ret = desc->id;
+
+		if (!(desc->flags & VRING_DESC_F_NEXT))
+			break;
 	}
 
-	vq->ndescs = 0;
+	vq->first_desc = i + 1;
 
 	return ret;
 
 err:
-	vhost_discard_vq_desc(vq, 1);
+	for (i = vq->first_desc; i < vq->ndescs; ++i)
+		if (!(vq->descs[i].flags & VRING_DESC_F_NEXT))
+			vhost_discard_vq_desc(vq, 1);
 	vq->ndescs = 0;
 
 	return ret;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 76356edee8e5..a67bda9792ec 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -81,6 +81,7 @@ struct vhost_virtqueue {
 
 	struct vhost_desc *descs;
 	int ndescs;
+	int first_desc;
 	int max_descs;
 
 	struct file *kick;
@@ -229,7 +230,7 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
 			  struct vhost_iotlb_map *map);
 
 #define vq_err(vq, fmt, ...) do {                                  \
-		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
+		pr_err(pr_fmt(fmt), ##__VA_ARGS__);       \
 		if ((vq)->error_ctx)                               \
 				eventfd_signal((vq)->error_ctx, 1);\
 	} while (0)
@@ -255,6 +256,8 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
 					void *private_data)
 {
 	vq->private_data = private_data;
+	vq->ndescs = 0;
+	vq->first_desc = 0;
 }
 
 /**
-- 
MST



* [PATCH RFC 04/13] vhost: cleanup fetch_buf return code handling
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 03/13] vhost: batching fetches Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-03  7:29   ` Jason Wang
  2020-06-02 13:06 ` [PATCH RFC 05/13] vhost/net: pass net specific struct pointer Michael S. Tsirkin
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

The return code of fetch_buf() is confusing, so callers resort to
tricks to get sane values. Let's switch to something standard:
0 if empty, >0 if non-empty, <0 on error.
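
With that, the caller (vhost_get_vq_desc(), see the diff below) reduces
to the usual pattern:

	ret = fetch_descs(vq);
	if (ret <= 0)		/* 0: nothing available, < 0: error */
		return ret;
	/* ret > 0: at least one descriptor chain is cached, convert it to iovecs */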

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index aca2a5b0d078..bd52b44b0d23 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2146,6 +2146,8 @@ static int fetch_indirect_descs(struct vhost_virtqueue *vq,
 	return 0;
 }
 
+/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
+ * A negative code is returned on error. */
 static int fetch_buf(struct vhost_virtqueue *vq)
 {
 	unsigned int i, head, found = 0;
@@ -2162,7 +2164,7 @@ static int fetch_buf(struct vhost_virtqueue *vq)
 	if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
 		/* If we already have work to do, don't bother re-checking. */
 		if (likely(vq->ndescs))
-			return vq->num;
+			return 0;
 
 		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
 			vq_err(vq, "Failed to access avail idx at %p\n",
@@ -2181,7 +2183,7 @@ static int fetch_buf(struct vhost_virtqueue *vq)
 		 * invalid.
 		 */
 		if (vq->avail_idx == last_avail_idx)
-			return vq->num;
+			return 0;
 
 		/* Only get avail ring entries after they have been
 		 * exposed by guest.
@@ -2251,12 +2253,14 @@ static int fetch_buf(struct vhost_virtqueue *vq)
 	/* On success, increment avail index. */
 	vq->last_avail_idx++;
 
-	return 0;
+	return 1;
 }
 
+/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
+ * A negative code is returned on error. */
 static int fetch_descs(struct vhost_virtqueue *vq)
 {
-	int ret = 0;
+	int ret;
 
 	if (unlikely(vq->first_desc >= vq->ndescs)) {
 		vq->first_desc = 0;
@@ -2266,10 +2270,14 @@ static int fetch_descs(struct vhost_virtqueue *vq)
 	if (vq->ndescs)
 		return 0;
 
-	while (!ret && vq->ndescs <= vhost_vq_num_batch_descs(vq))
-		ret = fetch_buf(vq);
+	for (ret = 1;
+	     ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
+	     ret = fetch_buf(vq))
+		;
 
-	return vq->ndescs ? 0 : ret;
+	/* On success we expect some descs */
+	BUG_ON(ret > 0 && !vq->ndescs);
+	return ret;
 }
 
 /* This looks in the virtqueue and for the first available buffer, and converts
@@ -2288,7 +2296,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	int ret = fetch_descs(vq);
 	int i;
 
-	if (ret)
+	if (ret <= 0)
 		return ret;
 
 	/* Now convert to IOV */
-- 
MST



* [PATCH RFC 05/13] vhost/net: pass net specific struct pointer
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (3 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 04/13] vhost: cleanup fetch_buf return code handling Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-02 13:06 ` [PATCH RFC 06/13] vhost: reorder functions Michael S. Tsirkin
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

In preparation for further cleanup, pass a net-specific struct pointer
to the ubuf callbacks so we can move net-specific fields out into the
net structures.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/net.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2927f02cc7e1..749a9cf51a59 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -94,7 +94,7 @@ struct vhost_net_ubuf_ref {
 	 */
 	atomic_t refcount;
 	wait_queue_head_t wait;
-	struct vhost_virtqueue *vq;
+	struct vhost_net_virtqueue *nvq;
 };
 
 #define VHOST_NET_BATCH 64
@@ -231,7 +231,7 @@ static void vhost_net_enable_zcopy(int vq)
 }
 
 static struct vhost_net_ubuf_ref *
-vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
+vhost_net_ubuf_alloc(struct vhost_net_virtqueue *nvq, bool zcopy)
 {
 	struct vhost_net_ubuf_ref *ubufs;
 	/* No zero copy backend? Nothing to count. */
@@ -242,7 +242,7 @@ vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
 		return ERR_PTR(-ENOMEM);
 	atomic_set(&ubufs->refcount, 1);
 	init_waitqueue_head(&ubufs->wait);
-	ubufs->vq = vq;
+	ubufs->nvq = nvq;
 	return ubufs;
 }
 
@@ -384,13 +384,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 {
 	struct vhost_net_ubuf_ref *ubufs = ubuf->ctx;
-	struct vhost_virtqueue *vq = ubufs->vq;
+	struct vhost_net_virtqueue *nvq = ubufs->nvq;
 	int cnt;
 
 	rcu_read_lock_bh();
 
 	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = success ?
+	nvq->vq.heads[ubuf->desc].in_len = success ?
 		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
 	cnt = vhost_net_ubuf_put(ubufs);
 
@@ -402,7 +402,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	 * less than 10% of times).
 	 */
 	if (cnt <= 1 || !(cnt % 16))
-		vhost_poll_queue(&vq->poll);
+		vhost_poll_queue(&nvq->vq.poll);
 
 	rcu_read_unlock_bh();
 }
@@ -1525,7 +1525,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	/* start polling new socket */
 	oldsock = vhost_vq_get_backend(vq);
 	if (sock != oldsock) {
-		ubufs = vhost_net_ubuf_alloc(vq,
+		ubufs = vhost_net_ubuf_alloc(nvq,
 					     sock && vhost_sock_zcopy(sock));
 		if (IS_ERR(ubufs)) {
 			r = PTR_ERR(ubufs);
-- 
MST



* [PATCH RFC 06/13] vhost: reorder functions
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (4 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 05/13] vhost/net: pass net specific struct pointer Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-02 13:06 ` [PATCH RFC 07/13] vhost: format-independent API for used buffers Michael S. Tsirkin
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

Reorder functions in the file so they do not rely on forward
declarations, in preparation for making them static
down the road.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bd52b44b0d23..b4a6e44d56a8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2256,6 +2256,13 @@ static int fetch_buf(struct vhost_virtqueue *vq)
 	return 1;
 }
 
+/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
+void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
+{
+	vq->last_avail_idx -= n;
+}
+EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
+
 /* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
  * A negative code is returned on error. */
 static int fetch_descs(struct vhost_virtqueue *vq)
@@ -2370,26 +2377,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 }
 EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
-/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
-void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
-{
-	vq->last_avail_idx -= n;
-}
-EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
-
-/* After we've used one of their buffers, we tell them about it.  We'll then
- * want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
-{
-	struct vring_used_elem heads = {
-		cpu_to_vhost32(vq, head),
-		cpu_to_vhost32(vq, len)
-	};
-
-	return vhost_add_used_n(vq, &heads, 1);
-}
-EXPORT_SYMBOL_GPL(vhost_add_used);
-
 static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			    struct vring_used_elem *heads,
 			    unsigned count)
@@ -2459,6 +2446,19 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_n);
 
+/* After we've used one of their buffers, we tell them about it.  We'll then
+ * want to notify the guest, using eventfd. */
+int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
+{
+	struct vring_used_elem heads = {
+		cpu_to_vhost32(vq, head),
+		cpu_to_vhost32(vq, len)
+	};
+
+	return vhost_add_used_n(vq, &heads, 1);
+}
+EXPORT_SYMBOL_GPL(vhost_add_used);
+
 static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	__u16 old, new;
-- 
MST



* [PATCH RFC 07/13] vhost: format-independent API for used buffers
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (5 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 06/13] vhost: reorder functions Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-03  7:58   ` Jason Wang
  2020-06-02 13:06 ` [PATCH RFC 08/13] vhost/net: convert to new API: heads->bufs Michael S. Tsirkin
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

Add a new API that doesn't assume a used ring, heads, etc.
For now, we keep the old APIs around to make it easier
to convert drivers.
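
A minimal sketch of how a converted driver is expected to use the new
calls (the prototypes are the ones added to vhost.h below; the loop body
and the way in_len is filled in are illustrative assumptions, not part
of this patch):

	struct vhost_buf buf;
	unsigned int out, in;
	int ret;

	for (;;) {
		ret = vhost_get_avail_buf(vq, &buf, vq->iov, ARRAY_SIZE(vq->iov),
					  &out, &in, NULL, NULL);
		if (ret <= 0)		/* 0: nothing available, < 0: error */
			break;
		/* ... consume the out/in iovecs in vq->iov ... */
		buf.in_len = written;	/* hypothetical: bytes the device wrote back */
		vhost_put_used_buf(vq, &buf);	/* reports buf.id / buf.in_len as used */
	}
	vhost_signal(vq->dev, vq);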

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 52 ++++++++++++++++++++++++++++++++++---------
 drivers/vhost/vhost.h | 17 +++++++++++++-
 2 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b4a6e44d56a8..be822f0c9428 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2292,13 +2292,12 @@ static int fetch_descs(struct vhost_virtqueue *vq)
  * number of output then some number of input descriptors, it's actually two
  * iovecs, but we pack them into one and note how many of each there were.
  *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found.  A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
-		      struct iovec iov[], unsigned int iov_size,
-		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num)
+ * This function returns a value > 0 if a descriptor was found, or 0 if none were found.
+ * A negative code is returned on error. */
+int vhost_get_avail_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf,
+			struct iovec iov[], unsigned int iov_size,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num)
 {
 	int ret = fetch_descs(vq);
 	int i;
@@ -2311,6 +2310,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	*out_num = *in_num = 0;
 	if (unlikely(log))
 		*log_num = 0;
+	buf->in_len = buf->out_len = 0;
+	buf->descs = 0;
 
 	for (i = vq->first_desc; i < vq->ndescs; ++i) {
 		unsigned iov_count = *in_num + *out_num;
@@ -2340,6 +2341,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			/* If this is an input descriptor,
 			 * increment that count. */
 			*in_num += ret;
+			buf->in_len += desc->len;
 			if (unlikely(log && ret)) {
 				log[*log_num].addr = desc->addr;
 				log[*log_num].len = desc->len;
@@ -2355,9 +2357,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 				goto err;
 			}
 			*out_num += ret;
+			buf->out_len += desc->len;
 		}
 
-		ret = desc->id;
+		buf->id = desc->id;
+		++buf->descs;
 
 		if (!(desc->flags & VRING_DESC_F_NEXT))
 			break;
@@ -2365,7 +2369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	vq->first_desc = i + 1;
 
-	return ret;
+	return 1;
 
 err:
 	for (i = vq->first_desc; i < vq->ndescs; ++i)
@@ -2375,7 +2379,15 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
+EXPORT_SYMBOL_GPL(vhost_get_avail_buf);
+
+/* Reverse the effect of vhost_get_avail_buf. Useful for error handling. */
+void vhost_discard_avail_bufs(struct vhost_virtqueue *vq,
+			      struct vhost_buf *buf, unsigned count)
+{
+	vhost_discard_vq_desc(vq, count);
+}
+EXPORT_SYMBOL_GPL(vhost_discard_avail_bufs);
 
 static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			    struct vring_used_elem *heads,
@@ -2459,6 +2471,26 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 }
 EXPORT_SYMBOL_GPL(vhost_add_used);
 
+int vhost_put_used_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf)
+{
+	return vhost_add_used(vq, buf->id, buf->in_len);
+}
+EXPORT_SYMBOL_GPL(vhost_put_used_buf);
+
+int vhost_put_used_n_bufs(struct vhost_virtqueue *vq,
+			  struct vhost_buf *bufs, unsigned count)
+{
+	unsigned i;
+
+	for (i = 0; i < count; ++i) {
+		vq->heads[i].id = cpu_to_vhost32(vq, bufs[i].id);
+		vq->heads[i].len = cpu_to_vhost32(vq, bufs[i].in_len);
+	}
+
+	return vhost_add_used_n(vq, vq->heads, count);
+}
+EXPORT_SYMBOL_GPL(vhost_put_used_n_bufs);
+
 static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	__u16 old, new;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a67bda9792ec..6c10e99ff334 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -67,6 +67,13 @@ struct vhost_desc {
 	u16 id;
 };
 
+struct vhost_buf {
+	u32 out_len;
+	u32 in_len;
+	u16 descs;
+	u16 id;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -193,7 +200,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
-
+int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
+			struct iovec iov[], unsigned int iov_count,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num);
+void vhost_discard_avail_bufs(struct vhost_virtqueue *,
+			      struct vhost_buf *, unsigned count);
 int vhost_vq_init_access(struct vhost_virtqueue *);
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
 int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
@@ -202,6 +214,9 @@ void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
 			       unsigned int id, int len);
 void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
 			       struct vring_used_elem *heads, unsigned count);
+int vhost_put_used_buf(struct vhost_virtqueue *, struct vhost_buf *buf);
+int vhost_put_used_n_bufs(struct vhost_virtqueue *,
+			  struct vhost_buf *bufs, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);
-- 
MST



* [PATCH RFC 08/13] vhost/net: convert to new API: heads->bufs
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (6 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 07/13] vhost: format-independent API for used buffers Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-03  8:11   ` Jason Wang
  2020-06-02 13:06 ` [PATCH RFC 09/13] vhost/net: avoid iov length math Michael S. Tsirkin
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

Convert vhost net to use the new format-agnostic API.
In particular, don't poke at vq internals such as the
heads array.
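
The shape of the conversion, taken from the diff below: where the driver
used to record a head index and a length in the shared heads array, it
now stores the vhost_buf it got back from vhost_get_avail_buf():

	/* before */
	vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
	vq->heads[nvq->done_idx].len = 0;

	/* after */
	nvq->bufs[nvq->done_idx] = buf;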

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/net.c | 153 +++++++++++++++++++++++---------------------
 1 file changed, 81 insertions(+), 72 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 749a9cf51a59..47af3d1ce3dd 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -59,13 +59,13 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	((__force __virtio32)3)
+#define VHOST_DMA_FAILED_LEN	(3)
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	((__force __virtio32)2)
+#define VHOST_DMA_DONE_LEN	(2)
 /* Lower device DMA in progress */
-#define VHOST_DMA_IN_PROGRESS	((__force __virtio32)1)
+#define VHOST_DMA_IN_PROGRESS	(1)
 /* Buffer unused */
-#define VHOST_DMA_CLEAR_LEN	((__force __virtio32)0)
+#define VHOST_DMA_CLEAR_LEN	(0)
 
 #define VHOST_DMA_IS_DONE(len) ((__force u32)(len) >= (__force u32)VHOST_DMA_DONE_LEN)
 
@@ -112,9 +112,12 @@ struct vhost_net_virtqueue {
 	/* last used idx for outstanding DMA zerocopy buffers */
 	int upend_idx;
 	/* For TX, first used idx for DMA done zerocopy buffers
-	 * For RX, number of batched heads
+	 * For RX, number of batched bufs
 	 */
 	int done_idx;
+	/* Outstanding user bufs. UIO_MAXIOV in length. */
+	/* TODO: we can make this smaller for sure. */
+	struct vhost_buf *bufs;
 	/* Number of XDP frames batched */
 	int batched_xdp;
 	/* an array of userspace buffers info */
@@ -271,6 +274,8 @@ static void vhost_net_clear_ubuf_info(struct vhost_net *n)
 	int i;
 
 	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+		kfree(n->vqs[i].bufs);
+		n->vqs[i].bufs = NULL;
 		kfree(n->vqs[i].ubuf_info);
 		n->vqs[i].ubuf_info = NULL;
 	}
@@ -282,6 +287,12 @@ static int vhost_net_set_ubuf_info(struct vhost_net *n)
 	int i;
 
 	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+		n->vqs[i].bufs = kmalloc_array(UIO_MAXIOV,
+					       sizeof(*n->vqs[i].bufs),
+					       GFP_KERNEL);
+		if (!n->vqs[i].bufs)
+			goto err;
+
 		zcopy = vhost_net_zcopy_mask & (0x1 << i);
 		if (!zcopy)
 			continue;
@@ -364,18 +375,18 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 	int j = 0;
 
 	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+		if (nvq->bufs[i].in_len == VHOST_DMA_FAILED_LEN)
 			vhost_net_tx_err(net);
-		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
-			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+		if (VHOST_DMA_IS_DONE(nvq->bufs[i].in_len)) {
+			nvq->bufs[i].in_len = VHOST_DMA_CLEAR_LEN;
 			++j;
 		} else
 			break;
 	}
 	while (j) {
 		add = min(UIO_MAXIOV - nvq->done_idx, j);
-		vhost_add_used_and_signal_n(vq->dev, vq,
-					    &vq->heads[nvq->done_idx], add);
+		vhost_put_used_n_bufs(vq, &nvq->bufs[nvq->done_idx], add);
+		vhost_signal(vq->dev, vq);
 		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
 		j -= add;
 	}
@@ -390,7 +401,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	rcu_read_lock_bh();
 
 	/* set len to mark this desc buffers done DMA */
-	nvq->vq.heads[ubuf->desc].in_len = success ?
+	nvq->bufs[ubuf->desc].in_len = success ?
 		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
 	cnt = vhost_net_ubuf_put(ubufs);
 
@@ -452,7 +463,8 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
 	if (!nvq->done_idx)
 		return;
 
-	vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
+	vhost_put_used_n_bufs(vq, nvq->bufs, nvq->done_idx);
+	vhost_signal(dev, vq);
 	nvq->done_idx = 0;
 }
 
@@ -558,6 +570,7 @@ static void vhost_net_busy_poll(struct vhost_net *net,
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_net_virtqueue *tnvq,
+				    struct vhost_buf *buf,
 				    unsigned int *out_num, unsigned int *in_num,
 				    struct msghdr *msghdr, bool *busyloop_intr)
 {
@@ -565,10 +578,10 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 	struct vhost_virtqueue *rvq = &rnvq->vq;
 	struct vhost_virtqueue *tvq = &tnvq->vq;
 
-	int r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
-				  out_num, in_num, NULL, NULL);
+	int r = vhost_get_avail_buf(tvq, buf, tvq->iov, ARRAY_SIZE(tvq->iov),
+				    out_num, in_num, NULL, NULL);
 
-	if (r == tvq->num && tvq->busyloop_timeout) {
+	if (!r && tvq->busyloop_timeout) {
 		/* Flush batched packets first */
 		if (!vhost_sock_zcopy(vhost_vq_get_backend(tvq)))
 			vhost_tx_batch(net, tnvq,
@@ -577,8 +590,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 
 		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, false);
 
-		r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
-				      out_num, in_num, NULL, NULL);
+		r = vhost_get_avail_buf(tvq, buf, tvq->iov, ARRAY_SIZE(tvq->iov),
+					out_num, in_num, NULL, NULL);
 	}
 
 	return r;
@@ -607,6 +620,7 @@ static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
 
 static int get_tx_bufs(struct vhost_net *net,
 		       struct vhost_net_virtqueue *nvq,
+		       struct vhost_buf *buf,
 		       struct msghdr *msg,
 		       unsigned int *out, unsigned int *in,
 		       size_t *len, bool *busyloop_intr)
@@ -614,9 +628,9 @@ static int get_tx_bufs(struct vhost_net *net,
 	struct vhost_virtqueue *vq = &nvq->vq;
 	int ret;
 
-	ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, msg, busyloop_intr);
+	ret = vhost_net_tx_get_vq_desc(net, nvq, buf, out, in, msg, busyloop_intr);
 
-	if (ret < 0 || ret == vq->num)
+	if (ret <= 0)
 		return ret;
 
 	if (*in) {
@@ -761,7 +775,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned out, in;
-	int head;
+	int ret;
 	struct msghdr msg = {
 		.msg_name = NULL,
 		.msg_namelen = 0,
@@ -773,6 +787,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 	int err;
 	int sent_pkts = 0;
 	bool sock_can_batch = (sock->sk->sk_sndbuf == INT_MAX);
+	struct vhost_buf buf;
 
 	do {
 		bool busyloop_intr = false;
@@ -780,13 +795,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 		if (nvq->done_idx == VHOST_NET_BATCH)
 			vhost_tx_batch(net, nvq, sock, &msg);
 
-		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
-				   &busyloop_intr);
+		ret = get_tx_bufs(net, nvq, &buf, &msg, &out, &in, &len,
+				  &busyloop_intr);
 		/* On error, stop handling until the next kick. */
-		if (unlikely(head < 0))
+		if (unlikely(ret < 0))
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
-		if (head == vq->num) {
+		if (!ret) {
 			if (unlikely(busyloop_intr)) {
 				vhost_poll_queue(&vq->poll);
 			} else if (unlikely(vhost_enable_notify(&net->dev,
@@ -808,7 +823,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 				goto done;
 			} else if (unlikely(err != -ENOSPC)) {
 				vhost_tx_batch(net, nvq, sock, &msg);
-				vhost_discard_vq_desc(vq, 1);
+				vhost_discard_avail_bufs(vq, &buf, 1);
 				vhost_net_enable_vq(net, vq);
 				break;
 			}
@@ -829,7 +844,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq, 1);
+			vhost_discard_avail_bufs(vq, &buf, 1);
 			vhost_net_enable_vq(net, vq);
 			break;
 		}
@@ -837,8 +852,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
 done:
-		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
-		vq->heads[nvq->done_idx].len = 0;
+		nvq->bufs[nvq->done_idx] = buf;
 		++nvq->done_idx;
 	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
 
@@ -850,7 +864,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned out, in;
-	int head;
+	int ret;
 	struct msghdr msg = {
 		.msg_name = NULL,
 		.msg_namelen = 0,
@@ -864,6 +878,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
 	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
 	bool zcopy_used;
 	int sent_pkts = 0;
+	struct vhost_buf buf;
 
 	do {
 		bool busyloop_intr;
@@ -872,13 +887,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
 		vhost_zerocopy_signal_used(net, vq);
 
 		busyloop_intr = false;
-		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
-				   &busyloop_intr);
+		ret = get_tx_bufs(net, nvq, &buf, &msg, &out, &in, &len,
+				  &busyloop_intr);
 		/* On error, stop handling until the next kick. */
-		if (unlikely(head < 0))
+		if (unlikely(ret < 0))
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
-		if (head == vq->num) {
+		if (!ret) {
 			if (unlikely(busyloop_intr)) {
 				vhost_poll_queue(&vq->poll);
 			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -897,8 +912,8 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
 			struct ubuf_info *ubuf;
 			ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-			vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
-			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+			nvq->bufs[nvq->upend_idx] = buf;
+			nvq->bufs[nvq->upend_idx].in_len = VHOST_DMA_IN_PROGRESS;
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
@@ -930,17 +945,19 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
 				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
 					% UIO_MAXIOV;
 			}
-			vhost_discard_vq_desc(vq, 1);
+			vhost_discard_avail_bufs(vq, &buf, 1);
 			vhost_net_enable_vq(net, vq);
 			break;
 		}
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		if (!zcopy_used)
-			vhost_add_used_and_signal(&net->dev, vq, head, 0);
-		else
+		if (!zcopy_used) {
+			vhost_put_used_buf(vq, &buf);
+			vhost_signal(&net->dev, vq);
+		} else {
 			vhost_zerocopy_signal_used(net, vq);
+		}
 		vhost_net_tx_packet(net);
 	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
 }
@@ -1004,7 +1021,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 	int len = peek_head_len(rnvq, sk);
 
 	if (!len && rvq->busyloop_timeout) {
-		/* Flush batched heads first */
+		/* Flush batched bufs first */
 		vhost_net_signal_used(rnvq);
 		/* Both tx vq and rx socket were polled here */
 		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, true);
@@ -1022,11 +1039,11 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
  * @iovcount	- returned count of io vectors we fill
  * @log		- vhost log
  * @log_num	- log offset
- * @quota       - headcount quota, 1 for big buffer
- *	returns number of buffer heads allocated, negative on error
+ * @quota       - bufcount quota, 1 for big buffer
+ *	returns number of buffers allocated, negative on error
  */
 static int get_rx_bufs(struct vhost_virtqueue *vq,
-		       struct vring_used_elem *heads,
+		       struct vhost_buf *bufs,
 		       int datalen,
 		       unsigned *iovcount,
 		       struct vhost_log *log,
@@ -1035,30 +1052,24 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 {
 	unsigned int out, in;
 	int seg = 0;
-	int headcount = 0;
-	unsigned d;
+	int bufcount = 0;
 	int r, nlogs = 0;
 	/* len is always initialized before use since we are always called with
 	 * datalen > 0.
 	 */
 	u32 uninitialized_var(len);
 
-	while (datalen > 0 && headcount < quota) {
+	while (datalen > 0 && bufcount < quota) {
 		if (unlikely(seg >= UIO_MAXIOV)) {
 			r = -ENOBUFS;
 			goto err;
 		}
-		r = vhost_get_vq_desc(vq, vq->iov + seg,
-				      ARRAY_SIZE(vq->iov) - seg, &out,
-				      &in, log, log_num);
-		if (unlikely(r < 0))
+		r = vhost_get_avail_buf(vq, bufs + bufcount, vq->iov + seg,
+					ARRAY_SIZE(vq->iov) - seg, &out,
+					&in, log, log_num);
+		if (unlikely(r <= 0))
 			goto err;
 
-		d = r;
-		if (d == vq->num) {
-			r = 0;
-			goto err;
-		}
 		if (unlikely(out || in <= 0)) {
 			vq_err(vq, "unexpected descriptor format for RX: "
 				"out %d, in %d\n", out, in);
@@ -1069,14 +1080,12 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 			nlogs += *log_num;
 			log += *log_num;
 		}
-		heads[headcount].id = cpu_to_vhost32(vq, d);
 		len = iov_length(vq->iov + seg, in);
-		heads[headcount].len = cpu_to_vhost32(vq, len);
 		datalen -= len;
-		++headcount;
+		++bufcount;
 		seg += in;
 	}
-	heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
+	bufs[bufcount - 1].in_len = len + datalen;
 	*iovcount = seg;
 	if (unlikely(log))
 		*log_num = nlogs;
@@ -1086,9 +1095,9 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 		r = UIO_MAXIOV + 1;
 		goto err;
 	}
-	return headcount;
+	return bufcount;
 err:
-	vhost_discard_vq_desc(vq, headcount);
+	vhost_discard_avail_bufs(vq, bufs, bufcount);
 	return r;
 }
 
@@ -1113,7 +1122,7 @@ static void handle_rx(struct vhost_net *net)
 	};
 	size_t total_len = 0;
 	int err, mergeable;
-	s16 headcount;
+	int bufcount;
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
 	bool busyloop_intr = false;
@@ -1147,14 +1156,14 @@ static void handle_rx(struct vhost_net *net)
 			break;
 		sock_len += sock_hlen;
 		vhost_len = sock_len + vhost_hlen;
-		headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
-					vhost_len, &in, vq_log, &log,
-					likely(mergeable) ? UIO_MAXIOV : 1);
+		bufcount = get_rx_bufs(vq, nvq->bufs + nvq->done_idx,
+				       vhost_len, &in, vq_log, &log,
+				       likely(mergeable) ? UIO_MAXIOV : 1);
 		/* On error, stop handling until the next kick. */
-		if (unlikely(headcount < 0))
+		if (unlikely(bufcount < 0))
 			goto out;
 		/* OK, now we need to know about added descriptors. */
-		if (!headcount) {
+		if (!bufcount) {
 			if (unlikely(busyloop_intr)) {
 				vhost_poll_queue(&vq->poll);
 			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -1171,7 +1180,7 @@ static void handle_rx(struct vhost_net *net)
 		if (nvq->rx_ring)
 			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
 		/* On overrun, truncate and discard */
-		if (unlikely(headcount > UIO_MAXIOV)) {
+		if (unlikely(bufcount > UIO_MAXIOV)) {
 			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
 			err = sock->ops->recvmsg(sock, &msg,
 						 1, MSG_DONTWAIT | MSG_TRUNC);
@@ -1195,7 +1204,7 @@ static void handle_rx(struct vhost_net *net)
 		if (unlikely(err != sock_len)) {
 			pr_debug("Discarded rx packet: "
 				 " len %d, expected %zd\n", err, sock_len);
-			vhost_discard_vq_desc(vq, headcount);
+			vhost_discard_avail_bufs(vq, nvq->bufs + nvq->done_idx, bufcount);
 			continue;
 		}
 		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
@@ -1214,15 +1223,15 @@ static void handle_rx(struct vhost_net *net)
 		}
 		/* TODO: Should check and handle checksum. */
 
-		num_buffers = cpu_to_vhost16(vq, headcount);
+		num_buffers = cpu_to_vhost16(vq, bufcount);
 		if (likely(mergeable) &&
 		    copy_to_iter(&num_buffers, sizeof num_buffers,
 				 &fixup) != sizeof num_buffers) {
 			vq_err(vq, "Failed num_buffers write");
-			vhost_discard_vq_desc(vq, headcount);
+			vhost_discard_avail_bufs(vq, nvq->bufs + nvq->done_idx, bufcount);
 			goto out;
 		}
-		nvq->done_idx += headcount;
+		nvq->done_idx += bufcount;
 		if (nvq->done_idx > VHOST_NET_BATCH)
 			vhost_net_signal_used(nvq);
 		if (unlikely(vq_log))
-- 
MST


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 09/13] vhost/net: avoid iov length math
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (7 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 08/13] vhost/net: convert to new API: heads->bufs Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-02 13:06 ` [PATCH RFC 10/13] vhost/test: convert to the buf API Michael S. Tsirkin
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

Now that the API exposes the buffer length, we no longer need to
scan the IOVs to figure it out.
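
For reference, iov_length() is just a linear scan of the vector - roughly
the helper below (a sketch of the generic uio helper, not code from this
patch) - so taking out_len/in_len straight from vhost_get_avail_buf()
saves one pass over the iovecs per packet:

static inline size_t iov_length(const struct iovec *iov, unsigned long nr_segs)
{
	unsigned long seg;
	size_t ret = 0;

	/* sum the length of every segment in the vector */
	for (seg = 0; seg < nr_segs; seg++)
		ret += iov[seg].iov_len;
	return ret;
}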

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/net.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 47af3d1ce3dd..36843058182b 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -607,11 +607,9 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
 }
 
 static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
-			    size_t hdr_size, int out)
+			    size_t len, size_t hdr_size, int out)
 {
 	/* Skip header. TODO: support TSO. */
-	size_t len = iov_length(vq->iov, out);
-
 	iov_iter_init(iter, WRITE, vq->iov, out, len);
 	iov_iter_advance(iter, hdr_size);
 
@@ -640,7 +638,7 @@ static int get_tx_bufs(struct vhost_net *net,
 	}
 
 	/* Sanity check */
-	*len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
+	*len = init_iov_iter(vq, &msg->msg_iter, buf->out_len, nvq->vhost_hlen, *out);
 	if (*len == 0) {
 		vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
 			*len, nvq->vhost_hlen);
@@ -1080,7 +1078,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 			nlogs += *log_num;
 			log += *log_num;
 		}
-		len = iov_length(vq->iov + seg, in);
+		len = bufs[bufcount].in_len;
 		datalen -= len;
 		++bufcount;
 		seg += in;
-- 
MST


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 10/13] vhost/test: convert to the buf API
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (8 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 09/13] vhost/net: avoid iov length math Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-02 13:06 ` [PATCH RFC 11/13] vhost/scsi: switch to buf APIs Michael S. Tsirkin
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/test.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 02806d6f84ef..251fd2bf74a3 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -44,9 +44,10 @@ static void handle_vq(struct vhost_test *n)
 {
 	struct vhost_virtqueue *vq = &n->vqs[VHOST_TEST_VQ];
 	unsigned out, in;
-	int head;
+	int ret;
 	size_t len, total_len = 0;
 	void *private;
+	struct vhost_buf buf;
 
 	mutex_lock(&vq->mutex);
 	private = vhost_vq_get_backend(vq);
@@ -58,15 +59,15 @@ static void handle_vq(struct vhost_test *n)
 	vhost_disable_notify(&n->dev, vq);
 
 	for (;;) {
-		head = vhost_get_vq_desc(vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 NULL, NULL);
+		ret = vhost_get_avail_buf(vq, &buf, vq->iov,
+					  ARRAY_SIZE(vq->iov),
+					  &out, &in,
+					  NULL, NULL);
 		/* On error, stop handling until the next kick. */
-		if (unlikely(head < 0))
+		if (unlikely(ret < 0))
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
-		if (head == vq->num) {
+		if (!ret) {
 			if (unlikely(vhost_enable_notify(&n->dev, vq))) {
 				vhost_disable_notify(&n->dev, vq);
 				continue;
@@ -78,13 +79,14 @@ static void handle_vq(struct vhost_test *n)
 			       "out %d, int %d\n", out, in);
 			break;
 		}
-		len = iov_length(vq->iov, out);
+		len = buf.out_len;
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected 0 len for TX\n");
 			break;
 		}
-		vhost_add_used_and_signal(&n->dev, vq, head, 0);
+		vhost_put_used_buf(vq, &buf);
+		vhost_signal(&n->dev, vq);
 		total_len += len;
 		if (unlikely(vhost_exceeds_weight(vq, 0, total_len)))
 			break;
-- 
MST


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 11/13] vhost/scsi: switch to buf APIs
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (9 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 10/13] vhost/test: convert to the buf API Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-05  8:36   ` Stefan Hajnoczi
  2020-06-02 13:06 ` [PATCH RFC 12/13] vhost/vsock: switch to the buf API Michael S. Tsirkin
  2020-06-02 13:06 ` [PATCH RFC 13/13] vhost: drop head based APIs Michael S. Tsirkin
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev,
	Paolo Bonzini, Stefan Hajnoczi

Switch to the buf APIs. Doing this exposes a spec violation in vhost scsi:
all used bufs are marked with length 0.
Fixing that is left for another day.
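
For illustration, a compliant completion would report the number of bytes
actually written into the guest-visible response buffer instead of 0 -
roughly the sketch below, reusing the helpers from this series (the
function name and the "written" parameter are invented here; this is not
part of the patch):

static void vhost_scsi_signal_written(struct vhost_dev *vdev,
				      struct vhost_virtqueue *vq,
				      struct vhost_buf *bufp, size_t written)
{
	struct vhost_buf buf = *bufp;

	/* e.g. written == sizeof(struct virtio_scsi_cmd_resp) for a response */
	buf.in_len = written;
	vhost_put_used_buf(vq, &buf);
	vhost_signal(vdev, vq);
}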

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/scsi.c | 73 ++++++++++++++++++++++++++------------------
 1 file changed, 44 insertions(+), 29 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index c39952243fd3..c426c4e899c7 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -71,8 +71,8 @@ struct vhost_scsi_inflight {
 };
 
 struct vhost_scsi_cmd {
-	/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
-	int tvc_vq_desc;
+	/* Descriptor from vhost_get_avail_buf() for virt_queue segment */
+	struct vhost_buf tvc_vq_desc;
 	/* virtio-scsi initiator task attribute */
 	int tvc_task_attr;
 	/* virtio-scsi response incoming iovecs */
@@ -213,7 +213,7 @@ struct vhost_scsi {
  * Context for processing request and control queue operations.
  */
 struct vhost_scsi_ctx {
-	int head;
+	struct vhost_buf buf;
 	unsigned int out, in;
 	size_t req_size, rsp_size;
 	size_t out_size, in_size;
@@ -443,6 +443,20 @@ static int vhost_scsi_check_stop_free(struct se_cmd *se_cmd)
 	return target_put_sess_cmd(se_cmd);
 }
 
+/* Signal to guest that request finished with no input buffer. */
+/* TODO: also called when writing into the buffer, which is most likely a bug */
+static void vhost_scsi_signal_noinput(struct vhost_dev *vdev,
+				      struct vhost_virtqueue *vq,
+				      struct vhost_buf *bufp)
+{
+	struct vhost_buf buf = *bufp;
+
+	buf.in_len = 0;
+	vhost_put_used_buf(vq, &buf);
+	vhost_signal(vdev, vq);
+}
+
+
 static void
 vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
 {
@@ -450,7 +464,8 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
 	struct virtio_scsi_event *event = &evt->event;
 	struct virtio_scsi_event __user *eventp;
 	unsigned out, in;
-	int head, ret;
+	struct vhost_buf buf;
+	int ret;
 
 	if (!vhost_vq_get_backend(vq)) {
 		vs->vs_events_missed = true;
@@ -459,14 +474,14 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
 
 again:
 	vhost_disable_notify(&vs->dev, vq);
-	head = vhost_get_vq_desc(vq, vq->iov,
-			ARRAY_SIZE(vq->iov), &out, &in,
-			NULL, NULL);
-	if (head < 0) {
+	ret = vhost_get_avail_buf(vq, &buf,
+				  vq->iov, ARRAY_SIZE(vq->iov), &out, &in,
+				  NULL, NULL);
+	if (ret < 0) {
 		vs->vs_events_missed = true;
 		return;
 	}
-	if (head == vq->num) {
+	if (!ret) {
 		if (vhost_enable_notify(&vs->dev, vq))
 			goto again;
 		vs->vs_events_missed = true;
@@ -488,7 +503,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
 	eventp = vq->iov[out].iov_base;
 	ret = __copy_to_user(eventp, event, sizeof(*event));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_scsi_signal_noinput(&vs->dev, vq, &buf);
 	else
 		vq_err(vq, "Faulted on vhost_scsi_send_event\n");
 }
@@ -549,7 +564,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 		ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
 		if (likely(ret == sizeof(v_rsp))) {
 			struct vhost_scsi_virtqueue *q;
-			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
+			vhost_put_used_buf(cmd->tvc_vq, &cmd->tvc_vq_desc);
 			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
 			vq = q - vs->vqs;
 			__set_bit(vq, signal);
@@ -793,7 +808,7 @@ static void vhost_scsi_submission_work(struct work_struct *work)
 static void
 vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 			   struct vhost_virtqueue *vq,
-			   int head, unsigned out)
+			   struct vhost_buf *buf, unsigned out)
 {
 	struct virtio_scsi_cmd_resp __user *resp;
 	struct virtio_scsi_cmd_resp rsp;
@@ -804,7 +819,7 @@ vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 	resp = vq->iov[out].iov_base;
 	ret = __copy_to_user(resp, &rsp, sizeof(rsp));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_scsi_signal_noinput(&vs->dev, vq, buf);
 	else
 		pr_err("Faulted on virtio_scsi_cmd_resp\n");
 }
@@ -813,21 +828,21 @@ static int
 vhost_scsi_get_desc(struct vhost_scsi *vs, struct vhost_virtqueue *vq,
 		    struct vhost_scsi_ctx *vc)
 {
-	int ret = -ENXIO;
+	int r, ret = -ENXIO;
 
-	vc->head = vhost_get_vq_desc(vq, vq->iov,
-				     ARRAY_SIZE(vq->iov), &vc->out, &vc->in,
-				     NULL, NULL);
+	r = vhost_get_avail_buf(vq, &vc->buf,
+				vq->iov, ARRAY_SIZE(vq->iov), &vc->out, &vc->in,
+				NULL, NULL);
 
-	pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
-		 vc->head, vc->out, vc->in);
+	pr_debug("vhost_get_avail_buf: buf: %d, out: %u in: %u\n",
+		 vc->buf.id, vc->out, vc->in);
 
 	/* On error, stop handling until the next kick. */
-	if (unlikely(vc->head < 0))
+	if (unlikely(r < 0))
 		goto done;
 
 	/* Nothing new?  Wait for eventfd to tell us they refilled. */
-	if (vc->head == vq->num) {
+	if (!r) {
 		if (unlikely(vhost_enable_notify(&vs->dev, vq))) {
 			vhost_disable_notify(&vs->dev, vq);
 			ret = -EAGAIN;
@@ -1093,11 +1108,11 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 			}
 		}
 		/*
-		 * Save the descriptor from vhost_get_vq_desc() to be used to
+		 * Save the descriptor from vhost_get_avail_buf() to be used to
 		 * complete the virtio-scsi request in TCM callback context via
 		 * vhost_scsi_queue_data_in() and vhost_scsi_queue_status()
 		 */
-		cmd->tvc_vq_desc = vc.head;
+		cmd->tvc_vq_desc = vc.buf;
 		/*
 		 * Dispatch cmd descriptor for cmwq execution in process
 		 * context provided by vhost_scsi_workqueue.  This also ensures
@@ -1117,7 +1132,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		if (ret == -ENXIO)
 			break;
 		else if (ret == -EIO)
-			vhost_scsi_send_bad_target(vs, vq, vc.head, vc.out);
+			vhost_scsi_send_bad_target(vs, vq, &vc.buf, vc.out);
 	} while (likely(!vhost_exceeds_weight(vq, ++c, 0)));
 out:
 	mutex_unlock(&vq->mutex);
@@ -1139,9 +1154,9 @@ vhost_scsi_send_tmf_reject(struct vhost_scsi *vs,
 	iov_iter_init(&iov_iter, READ, &vq->iov[vc->out], vc->in, sizeof(rsp));
 
 	ret = copy_to_iter(&rsp, sizeof(rsp), &iov_iter);
-	if (likely(ret == sizeof(rsp)))
-		vhost_add_used_and_signal(&vs->dev, vq, vc->head, 0);
-	else
+	if (likely(ret == sizeof(rsp))) {
+		vhost_scsi_signal_noinput(&vs->dev, vq, &vc->buf);
+	} else
 		pr_err("Faulted on virtio_scsi_ctrl_tmf_resp\n");
 }
 
@@ -1162,7 +1177,7 @@ vhost_scsi_send_an_resp(struct vhost_scsi *vs,
 
 	ret = copy_to_iter(&rsp, sizeof(rsp), &iov_iter);
 	if (likely(ret == sizeof(rsp)))
-		vhost_add_used_and_signal(&vs->dev, vq, vc->head, 0);
+		vhost_scsi_signal_noinput(&vs->dev, vq, &vc->buf);
 	else
 		pr_err("Faulted on virtio_scsi_ctrl_an_resp\n");
 }
@@ -1269,7 +1284,7 @@ vhost_scsi_ctl_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		if (ret == -ENXIO)
 			break;
 		else if (ret == -EIO)
-			vhost_scsi_send_bad_target(vs, vq, vc.head, vc.out);
+			vhost_scsi_send_bad_target(vs, vq, &vc.buf, vc.out);
 	} while (likely(!vhost_exceeds_weight(vq, ++c, 0)));
 out:
 	mutex_unlock(&vq->mutex);
-- 
MST


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 12/13] vhost/vsock: switch to the buf API
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (10 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 11/13] vhost/scsi: switch to buf APIs Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  2020-06-05  8:36   ` Stefan Hajnoczi
  2020-06-02 13:06 ` [PATCH RFC 13/13] vhost: drop head based APIs Michael S. Tsirkin
  12 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev,
	Stefan Hajnoczi, Stefano Garzarella

A straightforward conversion.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vsock.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index fb4e944c4d0d..07d1fb340fb4 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -103,7 +103,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		unsigned out, in;
 		size_t nbytes;
 		size_t iov_len, payload_len;
-		int head;
+		struct vhost_buf buf;
+		int ret;
 
 		spin_lock_bh(&vsock->send_pkt_list_lock);
 		if (list_empty(&vsock->send_pkt_list)) {
@@ -117,16 +118,17 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		list_del_init(&pkt->list);
 		spin_unlock_bh(&vsock->send_pkt_list_lock);
 
-		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-					 &out, &in, NULL, NULL);
-		if (head < 0) {
+		ret = vhost_get_avail_buf(vq, &buf,
+					  vq->iov, ARRAY_SIZE(vq->iov),
+					  &out, &in, NULL, NULL);
+		if (ret < 0) {
 			spin_lock_bh(&vsock->send_pkt_list_lock);
 			list_add(&pkt->list, &vsock->send_pkt_list);
 			spin_unlock_bh(&vsock->send_pkt_list_lock);
 			break;
 		}
 
-		if (head == vq->num) {
+		if (!ret) {
 			spin_lock_bh(&vsock->send_pkt_list_lock);
 			list_add(&pkt->list, &vsock->send_pkt_list);
 			spin_unlock_bh(&vsock->send_pkt_list_lock);
@@ -186,7 +188,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		 */
 		virtio_transport_deliver_tap_pkt(pkt);
 
-		vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
+		buf.in_len = sizeof(pkt->hdr) + payload_len;
+		vhost_put_used_buf(vq, &buf);
 		added = true;
 
 		pkt->off += payload_len;
@@ -440,7 +443,8 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
 	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
 						 dev);
 	struct virtio_vsock_pkt *pkt;
-	int head, pkts = 0, total_len = 0;
+	int ret, pkts = 0, total_len = 0;
+	struct vhost_buf buf;
 	unsigned int out, in;
 	bool added = false;
 
@@ -461,12 +465,13 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
 			goto no_more_replies;
 		}
 
-		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-					 &out, &in, NULL, NULL);
-		if (head < 0)
+		ret = vhost_get_avail_buf(vq, &buf,
+					  vq->iov, ARRAY_SIZE(vq->iov),
+					  &out, &in, NULL, NULL);
+		if (ret < 0)
 			break;
 
-		if (head == vq->num) {
+		if (!ret) {
 			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
 				vhost_disable_notify(&vsock->dev, vq);
 				continue;
@@ -494,7 +499,8 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
 			virtio_transport_free_pkt(pkt);
 
 		len += sizeof(pkt->hdr);
-		vhost_add_used(vq, head, len);
+		buf.in_len = len;
+		vhost_put_used_buf(vq, &buf);
 		total_len += len;
 		added = true;
 	} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
-- 
MST


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 13/13] vhost: drop head based APIs
  2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
                   ` (11 preceding siblings ...)
  2020-06-02 13:06 ` [PATCH RFC 12/13] vhost/vsock: switch to the buf API Michael S. Tsirkin
@ 2020-06-02 13:06 ` Michael S. Tsirkin
  12 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-02 13:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eugenio Pérez, Jason Wang, kvm, virtualization, netdev

Everyone is using the buf APIs now, so there is no need for the head-based
ones anymore.
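
For reference, the pattern every caller follows now looks roughly like the
condensed sketch below, distilled from the conversions in this series.
"vq", "dev" and "bytes_written" stand in for driver-specific state, and
error handling is trimmed:

	for (;;) {
		struct vhost_buf buf;
		unsigned int out, in;
		int ret;

		ret = vhost_get_avail_buf(vq, &buf, vq->iov, ARRAY_SIZE(vq->iov),
					  &out, &in, NULL, NULL);
		/* On error, stop handling until the next kick. */
		if (ret < 0)
			break;
		/* Nothing new?  Re-enable notifications and wait. */
		if (!ret) {
			if (unlikely(vhost_enable_notify(dev, vq))) {
				vhost_disable_notify(dev, vq);
				continue;
			}
			break;
		}

		/* ... consume the out/in iovecs: buf.out_len / buf.in_len bytes ... */

		buf.in_len = bytes_written;	/* bytes written back to the guest */
		vhost_put_used_buf(vq, &buf);
		vhost_signal(dev, vq);
	}

On a failure after the buffer was fetched, vhost_discard_avail_bufs(vq, &buf, 1)
hands it back so it will be fetched again later.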

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 36 ++++++++----------------------------
 drivers/vhost/vhost.h | 12 ------------
 2 files changed, 8 insertions(+), 40 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index be822f0c9428..412923cc96df 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2256,12 +2256,12 @@ static int fetch_buf(struct vhost_virtqueue *vq)
 	return 1;
 }
 
-/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
+/* Revert the effect of fetch_buf. Useful for error handling. */
+static
 void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
 {
 	vq->last_avail_idx -= n;
 }
-EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
 
 /* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
  * A negative code is returned on error. */
@@ -2421,8 +2421,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 	return 0;
 }
 
-/* After we've used one of their buffers, we tell them about it.  We'll then
- * want to notify the guest, using eventfd. */
+static
 int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 		     unsigned count)
 {
@@ -2456,10 +2455,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 	}
 	return r;
 }
-EXPORT_SYMBOL_GPL(vhost_add_used_n);
 
-/* After we've used one of their buffers, we tell them about it.  We'll then
- * want to notify the guest, using eventfd. */
+static
 int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 {
 	struct vring_used_elem heads = {
@@ -2469,14 +2466,17 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 
 	return vhost_add_used_n(vq, &heads, 1);
 }
-EXPORT_SYMBOL_GPL(vhost_add_used);
 
+/* After we've used one of their buffers, we tell them about it.  We'll then
+ * want to notify the guest, using vhost_signal. */
 int vhost_put_used_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf)
 {
 	return vhost_add_used(vq, buf->id, buf->in_len);
 }
 EXPORT_SYMBOL_GPL(vhost_put_used_buf);
 
+/* After we've used one of their buffers, we tell them about it.  We'll then
+ * want to notify the guest, using vhost_signal. */
 int vhost_put_used_n_bufs(struct vhost_virtqueue *vq,
 			  struct vhost_buf *bufs, unsigned count)
 {
@@ -2537,26 +2537,6 @@ void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_signal);
 
-/* And here's the combo meal deal.  Supersize me! */
-void vhost_add_used_and_signal(struct vhost_dev *dev,
-			       struct vhost_virtqueue *vq,
-			       unsigned int head, int len)
-{
-	vhost_add_used(vq, head, len);
-	vhost_signal(dev, vq);
-}
-EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
-
-/* multi-buffer version of vhost_add_used_and_signal */
-void vhost_add_used_and_signal_n(struct vhost_dev *dev,
-				 struct vhost_virtqueue *vq,
-				 struct vring_used_elem *heads, unsigned count)
-{
-	vhost_add_used_n(vq, heads, count);
-	vhost_signal(dev, vq);
-}
-EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
-
 /* return true if we're sure that avaiable ring is empty */
 bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6c10e99ff334..4fcf59153fc7 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -195,11 +195,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
 bool vhost_log_access_ok(struct vhost_dev *);
 
-int vhost_get_vq_desc(struct vhost_virtqueue *,
-		      struct iovec iov[], unsigned int iov_count,
-		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num);
-void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
 			struct iovec iov[], unsigned int iov_count,
 			unsigned int *out_num, unsigned int *in_num,
@@ -207,13 +202,6 @@ int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
 void vhost_discard_avail_bufs(struct vhost_virtqueue *,
 			      struct vhost_buf *, unsigned count);
 int vhost_vq_init_access(struct vhost_virtqueue *);
-int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
-		     unsigned count);
-void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
-			       unsigned int id, int len);
-void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
-			       struct vring_used_elem *heads, unsigned count);
 int vhost_put_used_buf(struct vhost_virtqueue *, struct vhost_buf *buf);
 int vhost_put_used_n_bufs(struct vhost_virtqueue *,
 			  struct vhost_buf *bufs, unsigned count);
-- 
MST


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct
  2020-06-02 13:05 ` [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct Michael S. Tsirkin
@ 2020-06-03  7:13   ` Jason Wang
  2020-06-03  9:48     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-03  7:13 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/2 9:05 PM, Michael S. Tsirkin wrote:
> The idea is to support multiple ring formats by converting
> to a format-independent array of descriptors.
>
> This costs extra cycles, but we gain in ability
> to fetch a batch of descriptors in one go, which
> is good for code cache locality.
>
> When used, this causes a minor performance degradation,
> it's been kept as simple as possible for ease of review.
> A follow-up patch gets us back the performance by adding batching.
>
> To simplify benchmarking, I kept the old code around so one can switch
> back and forth between old and new code. This will go away in the final
> submission.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Link: https://lore.kernel.org/r/20200401183118.8334-2-eperezma@redhat.com
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>   drivers/vhost/vhost.c | 297 +++++++++++++++++++++++++++++++++++++++++-
>   drivers/vhost/vhost.h |  16 +++
>   2 files changed, 312 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 96d9871fa0cb..105fc97af2c8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -298,6 +298,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   			   struct vhost_virtqueue *vq)
>   {
>   	vq->num = 1;
> +	vq->ndescs = 0;
>   	vq->desc = NULL;
>   	vq->avail = NULL;
>   	vq->used = NULL;
> @@ -368,6 +369,9 @@ static int vhost_worker(void *data)
>   
>   static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
>   {
> +	kfree(vq->descs);
> +	vq->descs = NULL;
> +	vq->max_descs = 0;
>   	kfree(vq->indirect);
>   	vq->indirect = NULL;
>   	kfree(vq->log);
> @@ -384,6 +388,10 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>   
>   	for (i = 0; i < dev->nvqs; ++i) {
>   		vq = dev->vqs[i];
> +		vq->max_descs = dev->iov_limit;
> +		vq->descs = kmalloc_array(vq->max_descs,
> +					  sizeof(*vq->descs),
> +					  GFP_KERNEL);
>   		vq->indirect = kmalloc_array(UIO_MAXIOV,
>   					     sizeof(*vq->indirect),
>   					     GFP_KERNEL);
> @@ -391,7 +399,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>   					GFP_KERNEL);
>   		vq->heads = kmalloc_array(dev->iov_limit, sizeof(*vq->heads),
>   					  GFP_KERNEL);
> -		if (!vq->indirect || !vq->log || !vq->heads)
> +		if (!vq->indirect || !vq->log || !vq->heads || !vq->descs)
>   			goto err_nomem;
>   	}
>   	return 0;
> @@ -2277,6 +2285,293 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   }
>   EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
>   
> +static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
> +{
> +	BUG_ON(!vq->ndescs);
> +	return &vq->descs[vq->ndescs - 1];
> +}
> +
> +static void pop_split_desc(struct vhost_virtqueue *vq)
> +{
> +	BUG_ON(!vq->ndescs);
> +	--vq->ndescs;
> +}
> +
> +#define VHOST_DESC_FLAGS (VRING_DESC_F_INDIRECT | VRING_DESC_F_WRITE | \
> +			  VRING_DESC_F_NEXT)
> +static int push_split_desc(struct vhost_virtqueue *vq, struct vring_desc *desc, u16 id)
> +{
> +	struct vhost_desc *h;
> +
> +	if (unlikely(vq->ndescs >= vq->max_descs))
> +		return -EINVAL;
> +	h = &vq->descs[vq->ndescs++];
> +	h->addr = vhost64_to_cpu(vq, desc->addr);
> +	h->len = vhost32_to_cpu(vq, desc->len);
> +	h->flags = vhost16_to_cpu(vq, desc->flags) & VHOST_DESC_FLAGS;
> +	h->id = id;
> +
> +	return 0;
> +}
> +
> +static int fetch_indirect_descs(struct vhost_virtqueue *vq,
> +				struct vhost_desc *indirect,
> +				u16 head)
> +{
> +	struct vring_desc desc;
> +	unsigned int i = 0, count, found = 0;
> +	u32 len = indirect->len;
> +	struct iov_iter from;
> +	int ret;
> +
> +	/* Sanity check */
> +	if (unlikely(len % sizeof desc)) {
> +		vq_err(vq, "Invalid length in indirect descriptor: "
> +		       "len 0x%llx not multiple of 0x%zx\n",
> +		       (unsigned long long)len,
> +		       sizeof desc);
> +		return -EINVAL;
> +	}
> +
> +	ret = translate_desc(vq, indirect->addr, len, vq->indirect,
> +			     UIO_MAXIOV, VHOST_ACCESS_RO);
> +	if (unlikely(ret < 0)) {
> +		if (ret != -EAGAIN)
> +			vq_err(vq, "Translation failure %d in indirect.\n", ret);
> +		return ret;
> +	}
> +	iov_iter_init(&from, READ, vq->indirect, ret, len);
> +
> +	/* We will use the result as an address to read from, so most
> +	 * architectures only need a compiler barrier here. */
> +	read_barrier_depends();
> +
> +	count = len / sizeof desc;
> +	/* Buffers are chained via a 16 bit next field, so
> +	 * we can have at most 2^16 of these. */
> +	if (unlikely(count > USHRT_MAX + 1)) {
> +		vq_err(vq, "Indirect buffer length too big: %d\n",
> +		       indirect->len);
> +		return -E2BIG;
> +	}
> +	if (unlikely(vq->ndescs + count > vq->max_descs)) {
> +		vq_err(vq, "Too many indirect + direct descs: %d + %d\n",
> +		       vq->ndescs, indirect->len);
> +		return -E2BIG;
> +	}
> +
> +	do {
> +		if (unlikely(++found > count)) {
> +			vq_err(vq, "Loop detected: last one at %u "
> +			       "indirect size %u\n",
> +			       i, count);
> +			return -EINVAL;
> +		}
> +		if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
> +			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
> +			       i, (size_t)indirect->addr + i * sizeof desc);
> +			return -EINVAL;
> +		}
> +		if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
> +			vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
> +			       i, (size_t)indirect->addr + i * sizeof desc);
> +			return -EINVAL;
> +		}
> +
> +		push_split_desc(vq, &desc, head);


The error is ignored.
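
A minimal fix, mirroring what the direct-descriptor walk in fetch_descs()
already does, would be something like this (sketch only):

	ret = push_split_desc(vq, &desc, head);
	if (unlikely(ret)) {
		vq_err(vq, "Failed to save indirect descriptor: idx %d\n", i);
		return ret;
	}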


> +	} while ((i = next_desc(vq, &desc)) != -1);
> +	return 0;
> +}
> +
> +static int fetch_descs(struct vhost_virtqueue *vq)
> +{
> +	unsigned int i, head, found = 0;
> +	struct vhost_desc *last;
> +	struct vring_desc desc;
> +	__virtio16 avail_idx;
> +	__virtio16 ring_head;
> +	u16 last_avail_idx;
> +	int ret;
> +
> +	/* Check it isn't doing very strange things with descriptor numbers. */
> +	last_avail_idx = vq->last_avail_idx;
> +
> +	if (vq->avail_idx == vq->last_avail_idx) {
> +		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> +			vq_err(vq, "Failed to access avail idx at %p\n",
> +				&vq->avail->idx);
> +			return -EFAULT;
> +		}
> +		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +
> +		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> +			vq_err(vq, "Guest moved used index from %u to %u",
> +				last_avail_idx, vq->avail_idx);
> +			return -EFAULT;
> +		}
> +
> +		/* If there's nothing new since last we looked, return
> +		 * invalid.
> +		 */
> +		if (vq->avail_idx == last_avail_idx)
> +			return vq->num;
> +
> +		/* Only get avail ring entries after they have been
> +		 * exposed by guest.
> +		 */
> +		smp_rmb();
> +	}
> +
> +	/* Grab the next descriptor number they're advertising */
> +	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> +		       last_avail_idx,
> +		       &vq->avail->ring[last_avail_idx % vq->num]);
> +		return -EFAULT;
> +	}
> +
> +	head = vhost16_to_cpu(vq, ring_head);
> +
> +	/* If their number is silly, that's an error. */
> +	if (unlikely(head >= vq->num)) {
> +		vq_err(vq, "Guest says index %u > %u is available",
> +		       head, vq->num);
> +		return -EINVAL;
> +	}
> +
> +	i = head;
> +	do {
> +		if (unlikely(i >= vq->num)) {
> +			vq_err(vq, "Desc index is %u > %u, head = %u",
> +			       i, vq->num, head);
> +			return -EINVAL;
> +		}
> +		if (unlikely(++found > vq->num)) {
> +			vq_err(vq, "Loop detected: last one at %u "
> +			       "vq size %u head %u\n",
> +			       i, vq->num, head);
> +			return -EINVAL;
> +		}
> +		ret = vhost_get_desc(vq, &desc, i);
> +		if (unlikely(ret)) {
> +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> +			       i, vq->desc + i);
> +			return -EFAULT;
> +		}
> +		ret = push_split_desc(vq, &desc, head);
> +		if (unlikely(ret)) {
> +			vq_err(vq, "Failed to save descriptor: idx %d\n", i);
> +			return -EINVAL;
> +		}
> +	} while ((i = next_desc(vq, &desc)) != -1);
> +
> +	last = peek_split_desc(vq);
> +	if (unlikely(last->flags & VRING_DESC_F_INDIRECT)) {
> +		pop_split_desc(vq);
> +		ret = fetch_indirect_descs(vq, last, head);


Note that this means we don't support chained indirect descriptors,
which complies with the spec, but we do support them in vhost_get_vq_desc().

We probably need to either fail early or just support that.
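
Failing early could be as simple as rejecting the INDIRECT+NEXT
combination (which the spec forbids) during the descriptor walk - an
untested sketch, exact placement to be decided:

	if (unlikely((desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) &&
		     (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_NEXT)))) {
		vq_err(vq, "Chained indirect descriptor: idx %d\n", i);
		return -EINVAL;
	}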

Thanks


> +		if (unlikely(ret < 0)) {
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Failure detected "
> +				       "in indirect descriptor at idx %d\n", head);
> +			return ret;
> +		}
> +	}
> +
> +	/* Assume notifications from guest are disabled at this point,
> +	 * if they aren't we would need to update avail_event index. */
> +	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
> +
> +	/* On success, increment avail index. */
> +	vq->last_avail_idx++;
> +
> +	return 0;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which is
> + * never a valid descriptor number) if none was found.  A negative code is
> + * returned on error. */
> +int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> +		      struct iovec iov[], unsigned int iov_size,
> +		      unsigned int *out_num, unsigned int *in_num,
> +		      struct vhost_log *log, unsigned int *log_num)
> +{
> +	int ret = fetch_descs(vq);
> +	int i;
> +
> +	if (ret)
> +		return ret;
> +
> +	/* Now convert to IOV */
> +	/* When we start there are none of either input nor output. */
> +	*out_num = *in_num = 0;
> +	if (unlikely(log))
> +		*log_num = 0;
> +
> +	for (i = 0; i < vq->ndescs; ++i) {
> +		unsigned iov_count = *in_num + *out_num;
> +		struct vhost_desc *desc = &vq->descs[i];
> +		int access;
> +
> +		if (desc->flags & ~VHOST_DESC_FLAGS) {
> +			vq_err(vq, "Unexpected flags: 0x%x at descriptor id 0x%x\n",
> +			       desc->flags, desc->id);
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +		if (desc->flags & VRING_DESC_F_WRITE)
> +			access = VHOST_ACCESS_WO;
> +		else
> +			access = VHOST_ACCESS_RO;
> +		ret = translate_desc(vq, desc->addr,
> +				     desc->len, iov + iov_count,
> +				     iov_size - iov_count, access);
> +		if (unlikely(ret < 0)) {
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Translation failure %d descriptor idx %d\n",
> +					ret, i);
> +			goto err;
> +		}
> +		if (access == VHOST_ACCESS_WO) {
> +			/* If this is an input descriptor,
> +			 * increment that count. */
> +			*in_num += ret;
> +			if (unlikely(log && ret)) {
> +				log[*log_num].addr = desc->addr;
> +				log[*log_num].len = desc->len;
> +				++*log_num;
> +			}
> +		} else {
> +			/* If it's an output descriptor, they're all supposed
> +			 * to come before any input descriptors. */
> +			if (unlikely(*in_num)) {
> +				vq_err(vq, "Descriptor has out after in: "
> +				       "idx %d\n", i);
> +				ret = -EINVAL;
> +				goto err;
> +			}
> +			*out_num += ret;
> +		}
> +
> +		ret = desc->id;
> +	}
> +
> +	vq->ndescs = 0;
> +
> +	return ret;
> +
> +err:
> +	vhost_discard_vq_desc(vq, 1);
> +	vq->ndescs = 0;
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
> +
>   /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
>   void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
>   {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 60cab4c78229..0976a2853935 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -60,6 +60,13 @@ enum vhost_uaddr_type {
>   	VHOST_NUM_ADDRS = 3,
>   };
>   
> +struct vhost_desc {
> +	u64 addr;
> +	u32 len;
> +	u16 flags; /* VRING_DESC_F_WRITE, VRING_DESC_F_NEXT */
> +	u16 id;
> +};
> +
>   /* The virtqueue structure describes a queue attached to a device. */
>   struct vhost_virtqueue {
>   	struct vhost_dev *dev;
> @@ -71,6 +78,11 @@ struct vhost_virtqueue {
>   	vring_avail_t __user *avail;
>   	vring_used_t __user *used;
>   	const struct vhost_iotlb_map *meta_iotlb[VHOST_NUM_ADDRS];
> +
> +	struct vhost_desc *descs;
> +	int ndescs;
> +	int max_descs;
> +
>   	struct file *kick;
>   	struct eventfd_ctx *call_ctx;
>   	struct eventfd_ctx *error_ctx;
> @@ -175,6 +187,10 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>   bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
>   bool vhost_log_access_ok(struct vhost_dev *);
>   
> +int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
> +		      struct iovec iov[], unsigned int iov_count,
> +		      unsigned int *out_num, unsigned int *in_num,
> +		      struct vhost_log *log, unsigned int *log_num);
>   int vhost_get_vq_desc(struct vhost_virtqueue *,
>   		      struct iovec iov[], unsigned int iov_count,
>   		      unsigned int *out_num, unsigned int *in_num,


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 02/13] vhost: use batched version by default
  2020-06-02 13:05 ` [PATCH RFC 02/13] vhost: use batched version by default Michael S. Tsirkin
@ 2020-06-03  7:15   ` Jason Wang
  0 siblings, 0 replies; 35+ messages in thread
From: Jason Wang @ 2020-06-03  7:15 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/2 9:05 PM, Michael S. Tsirkin wrote:
> As testing shows no performance change, switch to that now.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Link: https://lore.kernel.org/r/20200401183118.8334-3-eperezma@redhat.com
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>   drivers/vhost/vhost.c | 251 +-----------------------------------------
>   drivers/vhost/vhost.h |   4 -
>   2 files changed, 2 insertions(+), 253 deletions(-)


Since we don't have a way to switch back, it's better to remove "by
default" from the title.

Thanks


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/13] vhost: batching fetches
  2020-06-02 13:06 ` [PATCH RFC 03/13] vhost: batching fetches Michael S. Tsirkin
@ 2020-06-03  7:27   ` Jason Wang
  2020-06-04  8:59     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-03  7:27 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> With this patch applied, new and old code perform identically.
>
> Lots of extra optimizations are now possible, e.g.
> we can fetch multiple heads with copy_from/to_user now.
> We can get rid of maintaining the log array.  Etc etc.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Link: https://lore.kernel.org/r/20200401183118.8334-4-eperezma@redhat.com
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>   drivers/vhost/test.c  |  2 +-
>   drivers/vhost/vhost.c | 47 ++++++++++++++++++++++++++++++++++++++-----
>   drivers/vhost/vhost.h |  5 ++++-
>   3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index 9a3a09005e03..02806d6f84ef 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
>   	dev = &n->dev;
>   	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
>   	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> -	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> +	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
>   		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, NULL);
>   
>   	f->private_data = n;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 8f9a07282625..aca2a5b0d078 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -299,6 +299,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   {
>   	vq->num = 1;
>   	vq->ndescs = 0;
> +	vq->first_desc = 0;
>   	vq->desc = NULL;
>   	vq->avail = NULL;
>   	vq->used = NULL;
> @@ -367,6 +368,11 @@ static int vhost_worker(void *data)
>   	return 0;
>   }
>   
> +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> +{
> +	return vq->max_descs - UIO_MAXIOV;
> +}


1 descriptor does not mean 1 iov, e.g. userspace may pass several 1-byte
memory regions for us to translate.


> +
>   static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
>   {
>   	kfree(vq->descs);
> @@ -389,6 +395,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>   	for (i = 0; i < dev->nvqs; ++i) {
>   		vq = dev->vqs[i];
>   		vq->max_descs = dev->iov_limit;
> +		if (vhost_vq_num_batch_descs(vq) < 0) {
> +			return -EINVAL;
> +		}
>   		vq->descs = kmalloc_array(vq->max_descs,
>   					  sizeof(*vq->descs),
>   					  GFP_KERNEL);
> @@ -1570,6 +1579,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>   		vq->last_avail_idx = s.num;
>   		/* Forget the cached index value. */
>   		vq->avail_idx = vq->last_avail_idx;
> +		vq->ndescs = vq->first_desc = 0;
>   		break;
>   	case VHOST_GET_VRING_BASE:
>   		s.index = idx;
> @@ -2136,7 +2146,7 @@ static int fetch_indirect_descs(struct vhost_virtqueue *vq,
>   	return 0;
>   }
>   
> -static int fetch_descs(struct vhost_virtqueue *vq)
> +static int fetch_buf(struct vhost_virtqueue *vq)
>   {
>   	unsigned int i, head, found = 0;
>   	struct vhost_desc *last;
> @@ -2149,7 +2159,11 @@ static int fetch_descs(struct vhost_virtqueue *vq)
>   	/* Check it isn't doing very strange things with descriptor numbers. */
>   	last_avail_idx = vq->last_avail_idx;
>   
> -	if (vq->avail_idx == vq->last_avail_idx) {
> +	if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
> +		/* If we already have work to do, don't bother re-checking. */
> +		if (likely(vq->ndescs))
> +			return vq->num;
> +
>   		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
>   			vq_err(vq, "Failed to access avail idx at %p\n",
>   				&vq->avail->idx);
> @@ -2240,6 +2254,24 @@ static int fetch_descs(struct vhost_virtqueue *vq)
>   	return 0;
>   }
>   
> +static int fetch_descs(struct vhost_virtqueue *vq)
> +{
> +	int ret = 0;
> +
> +	if (unlikely(vq->first_desc >= vq->ndescs)) {
> +		vq->first_desc = 0;
> +		vq->ndescs = 0;
> +	}
> +
> +	if (vq->ndescs)
> +		return 0;
> +
> +	while (!ret && vq->ndescs <= vhost_vq_num_batch_descs(vq))
> +		ret = fetch_buf(vq);
> +
> +	return vq->ndescs ? 0 : ret;
> +}
> +
>   /* This looks in the virtqueue and for the first available buffer, and converts
>    * it to an iovec for convenient access.  Since descriptors consist of some
>    * number of output then some number of input descriptors, it's actually two
> @@ -2265,7 +2297,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   	if (unlikely(log))
>   		*log_num = 0;
>   
> -	for (i = 0; i < vq->ndescs; ++i) {
> +	for (i = vq->first_desc; i < vq->ndescs; ++i) {
>   		unsigned iov_count = *in_num + *out_num;
>   		struct vhost_desc *desc = &vq->descs[i];
>   		int access;
> @@ -2311,14 +2343,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   		}
>   
>   		ret = desc->id;
> +
> +		if (!(desc->flags & VRING_DESC_F_NEXT))
> +			break;
>   	}
>   
> -	vq->ndescs = 0;
> +	vq->first_desc = i + 1;
>   
>   	return ret;
>   
>   err:
> -	vhost_discard_vq_desc(vq, 1);
> +	for (i = vq->first_desc; i < vq->ndescs; ++i)
> +		if (!(vq->descs[i].flags & VRING_DESC_F_NEXT))
> +			vhost_discard_vq_desc(vq, 1);
>   	vq->ndescs = 0;
>   
>   	return ret;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 76356edee8e5..a67bda9792ec 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -81,6 +81,7 @@ struct vhost_virtqueue {
>   
>   	struct vhost_desc *descs;
>   	int ndescs;
> +	int first_desc;
>   	int max_descs;
>   
>   	struct file *kick;
> @@ -229,7 +230,7 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
>   			  struct vhost_iotlb_map *map);
>   
>   #define vq_err(vq, fmt, ...) do {                                  \
> -		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> +		pr_err(pr_fmt(fmt), ##__VA_ARGS__);       \


Need a separate patch for this?

Thanks


>   		if ((vq)->error_ctx)                               \
>   				eventfd_signal((vq)->error_ctx, 1);\
>   	} while (0)
> @@ -255,6 +256,8 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
>   					void *private_data)
>   {
>   	vq->private_data = private_data;
> +	vq->ndescs = 0;
> +	vq->first_desc = 0;
>   }
>   
>   /**


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 04/13] vhost: cleanup fetch_buf return code handling
  2020-06-02 13:06 ` [PATCH RFC 04/13] vhost: cleanup fetch_buf return code handling Michael S. Tsirkin
@ 2020-06-03  7:29   ` Jason Wang
  2020-06-04  9:01     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-03  7:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> Return code of fetch_buf is confusing, so callers resort to
> tricks to get to sane values. Let's switch to something standard:
> 0 empty, >0 non-empty, <0 error.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>   drivers/vhost/vhost.c | 24 ++++++++++++++++--------
>   1 file changed, 16 insertions(+), 8 deletions(-)


Why not squash this into patch 2 or 3?

Thanks


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 07/13] vhost: format-independent API for used buffers
  2020-06-02 13:06 ` [PATCH RFC 07/13] vhost: format-independent API for used buffers Michael S. Tsirkin
@ 2020-06-03  7:58   ` Jason Wang
  2020-06-04  9:03     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-03  7:58 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> Add a new API that doesn't assume used ring, heads, etc.
> For now, we keep the old APIs around to make it easier
> to convert drivers.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>   drivers/vhost/vhost.c | 52 ++++++++++++++++++++++++++++++++++---------
>   drivers/vhost/vhost.h | 17 +++++++++++++-
>   2 files changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b4a6e44d56a8..be822f0c9428 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2292,13 +2292,12 @@ static int fetch_descs(struct vhost_virtqueue *vq)
>    * number of output then some number of input descriptors, it's actually two
>    * iovecs, but we pack them into one and note how many of each there were.
>    *
> - * This function returns the descriptor number found, or vq->num (which is
> - * never a valid descriptor number) if none was found.  A negative code is
> - * returned on error. */
> -int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> -		      struct iovec iov[], unsigned int iov_size,
> -		      unsigned int *out_num, unsigned int *in_num,
> -		      struct vhost_log *log, unsigned int *log_num)
> + * This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> + * A negative code is returned on error. */
> +int vhost_get_avail_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf,
> +			struct iovec iov[], unsigned int iov_size,
> +			unsigned int *out_num, unsigned int *in_num,
> +			struct vhost_log *log, unsigned int *log_num)
>   {
>   	int ret = fetch_descs(vq);
>   	int i;
> @@ -2311,6 +2310,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   	*out_num = *in_num = 0;
>   	if (unlikely(log))
>   		*log_num = 0;
> +	buf->in_len = buf->out_len = 0;
> +	buf->descs = 0;
>   
>   	for (i = vq->first_desc; i < vq->ndescs; ++i) {
>   		unsigned iov_count = *in_num + *out_num;
> @@ -2340,6 +2341,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   			/* If this is an input descriptor,
>   			 * increment that count. */
>   			*in_num += ret;
> +			buf->in_len += desc->len;
>   			if (unlikely(log && ret)) {
>   				log[*log_num].addr = desc->addr;
>   				log[*log_num].len = desc->len;
> @@ -2355,9 +2357,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   				goto err;
>   			}
>   			*out_num += ret;
> +			buf->out_len += desc->len;
>   		}
>   
> -		ret = desc->id;
> +		buf->id = desc->id;
> +		++buf->descs;
>   
>   		if (!(desc->flags & VRING_DESC_F_NEXT))
>   			break;
> @@ -2365,7 +2369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   
>   	vq->first_desc = i + 1;
>   
> -	return ret;
> +	return 1;
>   
>   err:
>   	for (i = vq->first_desc; i < vq->ndescs; ++i)
> @@ -2375,7 +2379,15 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   
>   	return ret;
>   }
> -EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
> +EXPORT_SYMBOL_GPL(vhost_get_avail_buf);
> +
> +/* Reverse the effect of vhost_get_avail_buf. Useful for error handling. */
> +void vhost_discard_avail_bufs(struct vhost_virtqueue *vq,
> +			      struct vhost_buf *buf, unsigned count)
> +{
> +	vhost_discard_vq_desc(vq, count);
> +}
> +EXPORT_SYMBOL_GPL(vhost_discard_avail_bufs);
>   
>   static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>   			    struct vring_used_elem *heads,
> @@ -2459,6 +2471,26 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
>   }
>   EXPORT_SYMBOL_GPL(vhost_add_used);
>   
> +int vhost_put_used_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf)
> +{
> +	return vhost_add_used(vq, buf->id, buf->in_len);
> +}
> +EXPORT_SYMBOL_GPL(vhost_put_used_buf);
> +
> +int vhost_put_used_n_bufs(struct vhost_virtqueue *vq,
> +			  struct vhost_buf *bufs, unsigned count)
> +{
> +	unsigned i;
> +
> +	for (i = 0; i < count; ++i) {
> +		vq->heads[i].id = cpu_to_vhost32(vq, bufs[i].id);
> +		vq->heads[i].len = cpu_to_vhost32(vq, bufs[i].in_len);
> +	}
> +
> +	return vhost_add_used_n(vq, vq->heads, count);
> +}
> +EXPORT_SYMBOL_GPL(vhost_put_used_n_bufs);
> +
>   static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>   {
>   	__u16 old, new;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index a67bda9792ec..6c10e99ff334 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -67,6 +67,13 @@ struct vhost_desc {
>   	u16 id;
>   };
>   
> +struct vhost_buf {
> +	u32 out_len;
> +	u32 in_len;
> +	u16 descs;
> +	u16 id;
> +};


So it looks to me that struct vhost_buf can work for both the split ring 
and the packed ring.

If this is true, we'd better make struct vhost_desc work for both.

Thanks


> +
>   /* The virtqueue structure describes a queue attached to a device. */
>   struct vhost_virtqueue {
>   	struct vhost_dev *dev;
> @@ -193,7 +200,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
>   		      unsigned int *out_num, unsigned int *in_num,
>   		      struct vhost_log *log, unsigned int *log_num);
>   void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
> -
> +int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
> +			struct iovec iov[], unsigned int iov_count,
> +			unsigned int *out_num, unsigned int *in_num,
> +			struct vhost_log *log, unsigned int *log_num);
> +void vhost_discard_avail_bufs(struct vhost_virtqueue *,
> +			      struct vhost_buf *, unsigned count);
>   int vhost_vq_init_access(struct vhost_virtqueue *);
>   int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
>   int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
> @@ -202,6 +214,9 @@ void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
>   			       unsigned int id, int len);
>   void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
>   			       struct vring_used_elem *heads, unsigned count);
> +int vhost_put_used_buf(struct vhost_virtqueue *, struct vhost_buf *buf);
> +int vhost_put_used_n_bufs(struct vhost_virtqueue *,
> +			  struct vhost_buf *bufs, unsigned count);
>   void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
>   void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>   bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 08/13] vhost/net: convert to new API: heads->bufs
  2020-06-02 13:06 ` [PATCH RFC 08/13] vhost/net: convert to new API: heads->bufs Michael S. Tsirkin
@ 2020-06-03  8:11   ` Jason Wang
  2020-06-04  9:05     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-03  8:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> Convert vhost net to use the new format-agnostic API.
> In particular, don't poke at vq internals such as the
> heads array.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>   drivers/vhost/net.c | 153 +++++++++++++++++++++++---------------------
>   1 file changed, 81 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 749a9cf51a59..47af3d1ce3dd 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -59,13 +59,13 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>    * status internally; used for zerocopy tx only.
>    */
>   /* Lower device DMA failed */
> -#define VHOST_DMA_FAILED_LEN	((__force __virtio32)3)
> +#define VHOST_DMA_FAILED_LEN	(3)
>   /* Lower device DMA done */
> -#define VHOST_DMA_DONE_LEN	((__force __virtio32)2)
> +#define VHOST_DMA_DONE_LEN	(2)
>   /* Lower device DMA in progress */
> -#define VHOST_DMA_IN_PROGRESS	((__force __virtio32)1)
> +#define VHOST_DMA_IN_PROGRESS	(1)
>   /* Buffer unused */
> -#define VHOST_DMA_CLEAR_LEN	((__force __virtio32)0)
> +#define VHOST_DMA_CLEAR_LEN	(0)


Another patch for this?


>   
>   #define VHOST_DMA_IS_DONE(len) ((__force u32)(len) >= (__force u32)VHOST_DMA_DONE_LEN)
>   
> @@ -112,9 +112,12 @@ struct vhost_net_virtqueue {
>   	/* last used idx for outstanding DMA zerocopy buffers */
>   	int upend_idx;
>   	/* For TX, first used idx for DMA done zerocopy buffers
> -	 * For RX, number of batched heads
> +	 * For RX, number of batched bufs
>   	 */
>   	int done_idx;
> +	/* Outstanding user bufs. UIO_MAXIOV in length. */
> +	/* TODO: we can make this smaller for sure. */
> +	struct vhost_buf *bufs;
>   	/* Number of XDP frames batched */
>   	int batched_xdp;
>   	/* an array of userspace buffers info */
> @@ -271,6 +274,8 @@ static void vhost_net_clear_ubuf_info(struct vhost_net *n)
>   	int i;
>   
>   	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
> +		kfree(n->vqs[i].bufs);
> +		n->vqs[i].bufs = NULL;
>   		kfree(n->vqs[i].ubuf_info);
>   		n->vqs[i].ubuf_info = NULL;
>   	}
> @@ -282,6 +287,12 @@ static int vhost_net_set_ubuf_info(struct vhost_net *n)
>   	int i;
>   
>   	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
> +		n->vqs[i].bufs = kmalloc_array(UIO_MAXIOV,
> +					       sizeof(*n->vqs[i].bufs),
> +					       GFP_KERNEL);
> +		if (!n->vqs[i].bufs)
> +			goto err;
> +
>   		zcopy = vhost_net_zcopy_mask & (0x1 << i);
>   		if (!zcopy)
>   			continue;
> @@ -364,18 +375,18 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
>   	int j = 0;
>   
>   	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
> -		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
> +		if (nvq->bufs[i].in_len == VHOST_DMA_FAILED_LEN)
>   			vhost_net_tx_err(net);
> -		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
> -			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
> +		if (VHOST_DMA_IS_DONE(nvq->bufs[i].in_len)) {
> +			nvq->bufs[i].in_len = VHOST_DMA_CLEAR_LEN;
>   			++j;
>   		} else
>   			break;
>   	}
>   	while (j) {
>   		add = min(UIO_MAXIOV - nvq->done_idx, j);
> -		vhost_add_used_and_signal_n(vq->dev, vq,
> -					    &vq->heads[nvq->done_idx], add);
> +		vhost_put_used_n_bufs(vq, &nvq->bufs[nvq->done_idx], add);
> +		vhost_signal(vq->dev, vq);
>   		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
>   		j -= add;
>   	}
> @@ -390,7 +401,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
>   	rcu_read_lock_bh();
>   
>   	/* set len to mark this desc buffers done DMA */
> -	nvq->vq.heads[ubuf->desc].in_len = success ?
> +	nvq->bufs[ubuf->desc].in_len = success ?
>   		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
>   	cnt = vhost_net_ubuf_put(ubufs);
>   
> @@ -452,7 +463,8 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
>   	if (!nvq->done_idx)
>   		return;
>   
> -	vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
> +	vhost_put_used_n_bufs(vq, nvq->bufs, nvq->done_idx);
> +	vhost_signal(dev, vq);
>   	nvq->done_idx = 0;
>   }
>   
> @@ -558,6 +570,7 @@ static void vhost_net_busy_poll(struct vhost_net *net,
>   
>   static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>   				    struct vhost_net_virtqueue *tnvq,
> +				    struct vhost_buf *buf,
>   				    unsigned int *out_num, unsigned int *in_num,
>   				    struct msghdr *msghdr, bool *busyloop_intr)
>   {
> @@ -565,10 +578,10 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>   	struct vhost_virtqueue *rvq = &rnvq->vq;
>   	struct vhost_virtqueue *tvq = &tnvq->vq;
>   
> -	int r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
> -				  out_num, in_num, NULL, NULL);
> +	int r = vhost_get_avail_buf(tvq, buf, tvq->iov, ARRAY_SIZE(tvq->iov),
> +				    out_num, in_num, NULL, NULL);
>   
> -	if (r == tvq->num && tvq->busyloop_timeout) {
> +	if (!r && tvq->busyloop_timeout) {
>   		/* Flush batched packets first */
>   		if (!vhost_sock_zcopy(vhost_vq_get_backend(tvq)))
>   			vhost_tx_batch(net, tnvq,
> @@ -577,8 +590,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>   
>   		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, false);
>   
> -		r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
> -				      out_num, in_num, NULL, NULL);
> +		r = vhost_get_avail_buf(tvq, buf, tvq->iov, ARRAY_SIZE(tvq->iov),
> +					out_num, in_num, NULL, NULL);
>   	}
>   
>   	return r;
> @@ -607,6 +620,7 @@ static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
>   
>   static int get_tx_bufs(struct vhost_net *net,
>   		       struct vhost_net_virtqueue *nvq,
> +		       struct vhost_buf *buf,
>   		       struct msghdr *msg,
>   		       unsigned int *out, unsigned int *in,
>   		       size_t *len, bool *busyloop_intr)
> @@ -614,9 +628,9 @@ static int get_tx_bufs(struct vhost_net *net,
>   	struct vhost_virtqueue *vq = &nvq->vq;
>   	int ret;
>   
> -	ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, msg, busyloop_intr);
> +	ret = vhost_net_tx_get_vq_desc(net, nvq, buf, out, in, msg, busyloop_intr);
>   
> -	if (ret < 0 || ret == vq->num)
> +	if (ret <= 0)
>   		return ret;
>   
>   	if (*in) {
> @@ -761,7 +775,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>   	struct vhost_virtqueue *vq = &nvq->vq;
>   	unsigned out, in;
> -	int head;
> +	int ret;
>   	struct msghdr msg = {
>   		.msg_name = NULL,
>   		.msg_namelen = 0,
> @@ -773,6 +787,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   	int err;
>   	int sent_pkts = 0;
>   	bool sock_can_batch = (sock->sk->sk_sndbuf == INT_MAX);
> +	struct vhost_buf buf;
>   
>   	do {
>   		bool busyloop_intr = false;
> @@ -780,13 +795,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   		if (nvq->done_idx == VHOST_NET_BATCH)
>   			vhost_tx_batch(net, nvq, sock, &msg);
>   
> -		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
> -				   &busyloop_intr);
> +		ret = get_tx_bufs(net, nvq, &buf, &msg, &out, &in, &len,
> +				  &busyloop_intr);
>   		/* On error, stop handling until the next kick. */
> -		if (unlikely(head < 0))
> +		if (unlikely(ret < 0))
>   			break;
>   		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> -		if (head == vq->num) {
> +		if (!ret) {
>   			if (unlikely(busyloop_intr)) {
>   				vhost_poll_queue(&vq->poll);
>   			} else if (unlikely(vhost_enable_notify(&net->dev,
> @@ -808,7 +823,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   				goto done;
>   			} else if (unlikely(err != -ENOSPC)) {
>   				vhost_tx_batch(net, nvq, sock, &msg);
> -				vhost_discard_vq_desc(vq, 1);
> +				vhost_discard_avail_bufs(vq, &buf, 1);
>   				vhost_net_enable_vq(net, vq);
>   				break;
>   			}
> @@ -829,7 +844,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>   		err = sock->ops->sendmsg(sock, &msg, len);
>   		if (unlikely(err < 0)) {
> -			vhost_discard_vq_desc(vq, 1);
> +			vhost_discard_avail_bufs(vq, &buf, 1);


Do we need to decrease first_desc in vhost_discard_avail_bufs()?
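One way to read that (a rough sketch only, not part of the posted
series): when the discarded buffers were taken out of the vq->descs
cache, rewind the cache cursor so vhost_get_avail_buf() hands them out
again, instead of rewinding last_avail_idx and re-reading the ring:

void vhost_discard_avail_bufs(struct vhost_virtqueue *vq,
			      struct vhost_buf *buf, unsigned count)
{
	unsigned i;

	/* buf[i].descs was filled in by vhost_get_avail_buf(), so this
	 * steps first_desc back over exactly what was consumed.
	 */
	for (i = 0; i < count; ++i)
		vq->first_desc -= buf[i].descs;
}

That would keep vq->descs and last_avail_idx consistent with each other
when a batch is only partially processed.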


>   			vhost_net_enable_vq(net, vq);
>   			break;
>   		}
> @@ -837,8 +852,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   			pr_debug("Truncated TX packet: len %d != %zd\n",
>   				 err, len);
>   done:
> -		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
> -		vq->heads[nvq->done_idx].len = 0;
> +		nvq->bufs[nvq->done_idx] = buf;
>   		++nvq->done_idx;
>   	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>   
> @@ -850,7 +864,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>   	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>   	struct vhost_virtqueue *vq = &nvq->vq;
>   	unsigned out, in;
> -	int head;
> +	int ret;
>   	struct msghdr msg = {
>   		.msg_name = NULL,
>   		.msg_namelen = 0,
> @@ -864,6 +878,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>   	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>   	bool zcopy_used;
>   	int sent_pkts = 0;
> +	struct vhost_buf buf;
>   
>   	do {
>   		bool busyloop_intr;
> @@ -872,13 +887,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>   		vhost_zerocopy_signal_used(net, vq);
>   
>   		busyloop_intr = false;
> -		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
> -				   &busyloop_intr);
> +		ret = get_tx_bufs(net, nvq, &buf, &msg, &out, &in, &len,
> +				  &busyloop_intr);
>   		/* On error, stop handling until the next kick. */
> -		if (unlikely(head < 0))
> +		if (unlikely(ret < 0))
>   			break;
>   		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> -		if (head == vq->num) {
> +		if (!ret) {
>   			if (unlikely(busyloop_intr)) {
>   				vhost_poll_queue(&vq->poll);
>   			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> @@ -897,8 +912,8 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>   			struct ubuf_info *ubuf;
>   			ubuf = nvq->ubuf_info + nvq->upend_idx;
>   
> -			vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
> -			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
> +			nvq->bufs[nvq->upend_idx] = buf;
> +			nvq->bufs[nvq->upend_idx].in_len = VHOST_DMA_IN_PROGRESS;
>   			ubuf->callback = vhost_zerocopy_callback;
>   			ubuf->ctx = nvq->ubufs;
>   			ubuf->desc = nvq->upend_idx;
> @@ -930,17 +945,19 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>   				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
>   					% UIO_MAXIOV;
>   			}
> -			vhost_discard_vq_desc(vq, 1);
> +			vhost_discard_avail_bufs(vq, &buf, 1);
>   			vhost_net_enable_vq(net, vq);
>   			break;
>   		}
>   		if (err != len)
>   			pr_debug("Truncated TX packet: "
>   				 " len %d != %zd\n", err, len);
> -		if (!zcopy_used)
> -			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> -		else
> +		if (!zcopy_used) {
> +			vhost_put_used_buf(vq, &buf);
> +			vhost_signal(&net->dev, vq);


Do we need something like vhost_put_used_and_signal()?
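Something along those lines would be easy to add on top - a minimal
sketch (name and placement hypothetical), mirroring
vhost_add_used_and_signal() for the new buf API:

static inline void vhost_put_used_and_signal(struct vhost_dev *dev,
					     struct vhost_virtqueue *vq,
					     struct vhost_buf *buf)
{
	vhost_put_used_buf(vq, buf);
	vhost_signal(dev, vq);
}

That would turn the two calls above back into a single one at each
call site.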

Thanks


> +		} else {
>   			vhost_zerocopy_signal_used(net, vq);
> +		}
>   		vhost_net_tx_packet(net);
>   	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>   }
> @@ -1004,7 +1021,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>   	int len = peek_head_len(rnvq, sk);
>   
>   	if (!len && rvq->busyloop_timeout) {
> -		/* Flush batched heads first */
> +		/* Flush batched bufs first */
>   		vhost_net_signal_used(rnvq);
>   		/* Both tx vq and rx socket were polled here */
>   		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, true);
> @@ -1022,11 +1039,11 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>    * @iovcount	- returned count of io vectors we fill
>    * @log		- vhost log
>    * @log_num	- log offset
> - * @quota       - headcount quota, 1 for big buffer
> - *	returns number of buffer heads allocated, negative on error
> + * @quota       - bufcount quota, 1 for big buffer
> + *	returns number of buffers allocated, negative on error
>    */
>   static int get_rx_bufs(struct vhost_virtqueue *vq,
> -		       struct vring_used_elem *heads,
> +		       struct vhost_buf *bufs,
>   		       int datalen,
>   		       unsigned *iovcount,
>   		       struct vhost_log *log,
> @@ -1035,30 +1052,24 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>   {
>   	unsigned int out, in;
>   	int seg = 0;
> -	int headcount = 0;
> -	unsigned d;
> +	int bufcount = 0;
>   	int r, nlogs = 0;
>   	/* len is always initialized before use since we are always called with
>   	 * datalen > 0.
>   	 */
>   	u32 uninitialized_var(len);
>   
> -	while (datalen > 0 && headcount < quota) {
> +	while (datalen > 0 && bufcount < quota) {
>   		if (unlikely(seg >= UIO_MAXIOV)) {
>   			r = -ENOBUFS;
>   			goto err;
>   		}
> -		r = vhost_get_vq_desc(vq, vq->iov + seg,
> -				      ARRAY_SIZE(vq->iov) - seg, &out,
> -				      &in, log, log_num);
> -		if (unlikely(r < 0))
> +		r = vhost_get_avail_buf(vq, bufs + bufcount, vq->iov + seg,
> +					ARRAY_SIZE(vq->iov) - seg, &out,
> +					&in, log, log_num);
> +		if (unlikely(r <= 0))
>   			goto err;
>   
> -		d = r;
> -		if (d == vq->num) {
> -			r = 0;
> -			goto err;
> -		}
>   		if (unlikely(out || in <= 0)) {
>   			vq_err(vq, "unexpected descriptor format for RX: "
>   				"out %d, in %d\n", out, in);
> @@ -1069,14 +1080,12 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>   			nlogs += *log_num;
>   			log += *log_num;
>   		}
> -		heads[headcount].id = cpu_to_vhost32(vq, d);
>   		len = iov_length(vq->iov + seg, in);
> -		heads[headcount].len = cpu_to_vhost32(vq, len);
>   		datalen -= len;
> -		++headcount;
> +		++bufcount;
>   		seg += in;
>   	}
> -	heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
> +	bufs[bufcount - 1].in_len = len + datalen;
>   	*iovcount = seg;
>   	if (unlikely(log))
>   		*log_num = nlogs;
> @@ -1086,9 +1095,9 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>   		r = UIO_MAXIOV + 1;
>   		goto err;
>   	}
> -	return headcount;
> +	return bufcount;
>   err:
> -	vhost_discard_vq_desc(vq, headcount);
> +	vhost_discard_avail_bufs(vq, bufs, bufcount);
>   	return r;
>   }
>   
> @@ -1113,7 +1122,7 @@ static void handle_rx(struct vhost_net *net)
>   	};
>   	size_t total_len = 0;
>   	int err, mergeable;
> -	s16 headcount;
> +	int bufcount;
>   	size_t vhost_hlen, sock_hlen;
>   	size_t vhost_len, sock_len;
>   	bool busyloop_intr = false;
> @@ -1147,14 +1156,14 @@ static void handle_rx(struct vhost_net *net)
>   			break;
>   		sock_len += sock_hlen;
>   		vhost_len = sock_len + vhost_hlen;
> -		headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
> -					vhost_len, &in, vq_log, &log,
> -					likely(mergeable) ? UIO_MAXIOV : 1);
> +		bufcount = get_rx_bufs(vq, nvq->bufs + nvq->done_idx,
> +				       vhost_len, &in, vq_log, &log,
> +				       likely(mergeable) ? UIO_MAXIOV : 1);
>   		/* On error, stop handling until the next kick. */
> -		if (unlikely(headcount < 0))
> +		if (unlikely(bufcount < 0))
>   			goto out;
>   		/* OK, now we need to know about added descriptors. */
> -		if (!headcount) {
> +		if (!bufcount) {
>   			if (unlikely(busyloop_intr)) {
>   				vhost_poll_queue(&vq->poll);
>   			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> @@ -1171,7 +1180,7 @@ static void handle_rx(struct vhost_net *net)
>   		if (nvq->rx_ring)
>   			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>   		/* On overrun, truncate and discard */
> -		if (unlikely(headcount > UIO_MAXIOV)) {
> +		if (unlikely(bufcount > UIO_MAXIOV)) {
>   			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
>   			err = sock->ops->recvmsg(sock, &msg,
>   						 1, MSG_DONTWAIT | MSG_TRUNC);
> @@ -1195,7 +1204,7 @@ static void handle_rx(struct vhost_net *net)
>   		if (unlikely(err != sock_len)) {
>   			pr_debug("Discarded rx packet: "
>   				 " len %d, expected %zd\n", err, sock_len);
> -			vhost_discard_vq_desc(vq, headcount);
> +			vhost_discard_avail_bufs(vq, nvq->bufs + nvq->done_idx, bufcount);
>   			continue;
>   		}
>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> @@ -1214,15 +1223,15 @@ static void handle_rx(struct vhost_net *net)
>   		}
>   		/* TODO: Should check and handle checksum. */
>   
> -		num_buffers = cpu_to_vhost16(vq, headcount);
> +		num_buffers = cpu_to_vhost16(vq, bufcount);
>   		if (likely(mergeable) &&
>   		    copy_to_iter(&num_buffers, sizeof num_buffers,
>   				 &fixup) != sizeof num_buffers) {
>   			vq_err(vq, "Failed num_buffers write");
> -			vhost_discard_vq_desc(vq, headcount);
> +			vhost_discard_avail_bufs(vq, nvq->bufs + nvq->done_idx, bufcount);
>   			goto out;
>   		}
> -		nvq->done_idx += headcount;
> +		nvq->done_idx += bufcount;
>   		if (nvq->done_idx > VHOST_NET_BATCH)
>   			vhost_net_signal_used(nvq);
>   		if (unlikely(vq_log))


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct
  2020-06-03  7:13   ` Jason Wang
@ 2020-06-03  9:48     ` Michael S. Tsirkin
  2020-06-03 12:04       ` Jason Wang
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-03  9:48 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Wed, Jun 03, 2020 at 03:13:56PM +0800, Jason Wang wrote:
> 
> On 2020/6/2 9:05 PM, Michael S. Tsirkin wrote:
> > The idea is to support multiple ring formats by converting
> > to a format-independent array of descriptors.
> > 
> > This costs extra cycles, but we gain in ability
> > to fetch a batch of descriptors in one go, which
> > is good for code cache locality.
> > 
> > When used, this causes a minor performance degradation,
> > it's been kept as simple as possible for ease of review.
> > A follow-up patch gets us back the performance by adding batching.
> > 
> > To simplify benchmarking, I kept the old code around so one can switch
> > back and forth between old and new code. This will go away in the final
> > submission.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > Link: https://lore.kernel.org/r/20200401183118.8334-2-eperezma@redhat.com
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   drivers/vhost/vhost.c | 297 +++++++++++++++++++++++++++++++++++++++++-
> >   drivers/vhost/vhost.h |  16 +++
> >   2 files changed, 312 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 96d9871fa0cb..105fc97af2c8 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -298,6 +298,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> >   			   struct vhost_virtqueue *vq)
> >   {
> >   	vq->num = 1;
> > +	vq->ndescs = 0;
> >   	vq->desc = NULL;
> >   	vq->avail = NULL;
> >   	vq->used = NULL;
> > @@ -368,6 +369,9 @@ static int vhost_worker(void *data)
> >   static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> >   {
> > +	kfree(vq->descs);
> > +	vq->descs = NULL;
> > +	vq->max_descs = 0;
> >   	kfree(vq->indirect);
> >   	vq->indirect = NULL;
> >   	kfree(vq->log);
> > @@ -384,6 +388,10 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> >   	for (i = 0; i < dev->nvqs; ++i) {
> >   		vq = dev->vqs[i];
> > +		vq->max_descs = dev->iov_limit;
> > +		vq->descs = kmalloc_array(vq->max_descs,
> > +					  sizeof(*vq->descs),
> > +					  GFP_KERNEL);
> >   		vq->indirect = kmalloc_array(UIO_MAXIOV,
> >   					     sizeof(*vq->indirect),
> >   					     GFP_KERNEL);
> > @@ -391,7 +399,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> >   					GFP_KERNEL);
> >   		vq->heads = kmalloc_array(dev->iov_limit, sizeof(*vq->heads),
> >   					  GFP_KERNEL);
> > -		if (!vq->indirect || !vq->log || !vq->heads)
> > +		if (!vq->indirect || !vq->log || !vq->heads || !vq->descs)
> >   			goto err_nomem;
> >   	}
> >   	return 0;
> > @@ -2277,6 +2285,293 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   }
> >   EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
> > +static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
> > +{
> > +	BUG_ON(!vq->ndescs);
> > +	return &vq->descs[vq->ndescs - 1];
> > +}
> > +
> > +static void pop_split_desc(struct vhost_virtqueue *vq)
> > +{
> > +	BUG_ON(!vq->ndescs);
> > +	--vq->ndescs;
> > +}
> > +
> > +#define VHOST_DESC_FLAGS (VRING_DESC_F_INDIRECT | VRING_DESC_F_WRITE | \
> > +			  VRING_DESC_F_NEXT)
> > +static int push_split_desc(struct vhost_virtqueue *vq, struct vring_desc *desc, u16 id)
> > +{
> > +	struct vhost_desc *h;
> > +
> > +	if (unlikely(vq->ndescs >= vq->max_descs))
> > +		return -EINVAL;
> > +	h = &vq->descs[vq->ndescs++];
> > +	h->addr = vhost64_to_cpu(vq, desc->addr);
> > +	h->len = vhost32_to_cpu(vq, desc->len);
> > +	h->flags = vhost16_to_cpu(vq, desc->flags) & VHOST_DESC_FLAGS;
> > +	h->id = id;
> > +
> > +	return 0;
> > +}
> > +
> > +static int fetch_indirect_descs(struct vhost_virtqueue *vq,
> > +				struct vhost_desc *indirect,
> > +				u16 head)
> > +{
> > +	struct vring_desc desc;
> > +	unsigned int i = 0, count, found = 0;
> > +	u32 len = indirect->len;
> > +	struct iov_iter from;
> > +	int ret;
> > +
> > +	/* Sanity check */
> > +	if (unlikely(len % sizeof desc)) {
> > +		vq_err(vq, "Invalid length in indirect descriptor: "
> > +		       "len 0x%llx not multiple of 0x%zx\n",
> > +		       (unsigned long long)len,
> > +		       sizeof desc);
> > +		return -EINVAL;
> > +	}
> > +
> > +	ret = translate_desc(vq, indirect->addr, len, vq->indirect,
> > +			     UIO_MAXIOV, VHOST_ACCESS_RO);
> > +	if (unlikely(ret < 0)) {
> > +		if (ret != -EAGAIN)
> > +			vq_err(vq, "Translation failure %d in indirect.\n", ret);
> > +		return ret;
> > +	}
> > +	iov_iter_init(&from, READ, vq->indirect, ret, len);
> > +
> > +	/* We will use the result as an address to read from, so most
> > +	 * architectures only need a compiler barrier here. */
> > +	read_barrier_depends();
> > +
> > +	count = len / sizeof desc;
> > +	/* Buffers are chained via a 16 bit next field, so
> > +	 * we can have at most 2^16 of these. */
> > +	if (unlikely(count > USHRT_MAX + 1)) {
> > +		vq_err(vq, "Indirect buffer length too big: %d\n",
> > +		       indirect->len);
> > +		return -E2BIG;
> > +	}
> > +	if (unlikely(vq->ndescs + count > vq->max_descs)) {
> > +		vq_err(vq, "Too many indirect + direct descs: %d + %d\n",
> > +		       vq->ndescs, indirect->len);
> > +		return -E2BIG;
> > +	}
> > +
> > +	do {
> > +		if (unlikely(++found > count)) {
> > +			vq_err(vq, "Loop detected: last one at %u "
> > +			       "indirect size %u\n",
> > +			       i, count);
> > +			return -EINVAL;
> > +		}
> > +		if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
> > +			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
> > +			       i, (size_t)indirect->addr + i * sizeof desc);
> > +			return -EINVAL;
> > +		}
> > +		if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
> > +			vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
> > +			       i, (size_t)indirect->addr + i * sizeof desc);
> > +			return -EINVAL;
> > +		}
> > +
> > +		push_split_desc(vq, &desc, head);
> 
> 
> The error is ignored.

See above:

     	if (unlikely(vq->ndescs + count > vq->max_descs)) 

So it can't fail here, we never fetch unless there's space.

I guess we can add a WARN_ON here.
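For example (sketch only), at the call in fetch_indirect_descs():

	ret = push_split_desc(vq, &desc, head);
	/* Can't fail: we checked vq->ndescs + count <= vq->max_descs
	 * before starting to copy the indirect table.
	 */
	WARN_ON(ret);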

> 
> > +	} while ((i = next_desc(vq, &desc)) != -1);
> > +	return 0;
> > +}
> > +
> > +static int fetch_descs(struct vhost_virtqueue *vq)
> > +{
> > +	unsigned int i, head, found = 0;
> > +	struct vhost_desc *last;
> > +	struct vring_desc desc;
> > +	__virtio16 avail_idx;
> > +	__virtio16 ring_head;
> > +	u16 last_avail_idx;
> > +	int ret;
> > +
> > +	/* Check it isn't doing very strange things with descriptor numbers. */
> > +	last_avail_idx = vq->last_avail_idx;
> > +
> > +	if (vq->avail_idx == vq->last_avail_idx) {
> > +		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> > +			vq_err(vq, "Failed to access avail idx at %p\n",
> > +				&vq->avail->idx);
> > +			return -EFAULT;
> > +		}
> > +		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > +
> > +		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> > +			vq_err(vq, "Guest moved used index from %u to %u",
> > +				last_avail_idx, vq->avail_idx);
> > +			return -EFAULT;
> > +		}
> > +
> > +		/* If there's nothing new since last we looked, return
> > +		 * invalid.
> > +		 */
> > +		if (vq->avail_idx == last_avail_idx)
> > +			return vq->num;
> > +
> > +		/* Only get avail ring entries after they have been
> > +		 * exposed by guest.
> > +		 */
> > +		smp_rmb();
> > +	}
> > +
> > +	/* Grab the next descriptor number they're advertising */
> > +	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> > +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> > +		       last_avail_idx,
> > +		       &vq->avail->ring[last_avail_idx % vq->num]);
> > +		return -EFAULT;
> > +	}
> > +
> > +	head = vhost16_to_cpu(vq, ring_head);
> > +
> > +	/* If their number is silly, that's an error. */
> > +	if (unlikely(head >= vq->num)) {
> > +		vq_err(vq, "Guest says index %u > %u is available",
> > +		       head, vq->num);
> > +		return -EINVAL;
> > +	}
> > +
> > +	i = head;
> > +	do {
> > +		if (unlikely(i >= vq->num)) {
> > +			vq_err(vq, "Desc index is %u > %u, head = %u",
> > +			       i, vq->num, head);
> > +			return -EINVAL;
> > +		}
> > +		if (unlikely(++found > vq->num)) {
> > +			vq_err(vq, "Loop detected: last one at %u "
> > +			       "vq size %u head %u\n",
> > +			       i, vq->num, head);
> > +			return -EINVAL;
> > +		}
> > +		ret = vhost_get_desc(vq, &desc, i);
> > +		if (unlikely(ret)) {
> > +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> > +			       i, vq->desc + i);
> > +			return -EFAULT;
> > +		}
> > +		ret = push_split_desc(vq, &desc, head);
> > +		if (unlikely(ret)) {
> > +			vq_err(vq, "Failed to save descriptor: idx %d\n", i);
> > +			return -EINVAL;
> > +		}
> > +	} while ((i = next_desc(vq, &desc)) != -1);
> > +
> > +	last = peek_split_desc(vq);
> > +	if (unlikely(last->flags & VRING_DESC_F_INDIRECT)) {
> > +		pop_split_desc(vq);
> > +		ret = fetch_indirect_descs(vq, last, head);
> 
> 
> Note that this means we don't support chained indirect descriptors, which
> complies with the spec, but we do support this in vhost_get_vq_desc().

Well the spec says:
	A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT in flags.

Did I miss anything?




> We probably need to either fail early or just support that.
> 
> Thanks
> 
> 
> > +		if (unlikely(ret < 0)) {
> > +			if (ret != -EAGAIN)
> > +				vq_err(vq, "Failure detected "
> > +				       "in indirect descriptor at idx %d\n", head);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	/* Assume notifications from guest are disabled at this point,
> > +	 * if they aren't we would need to update avail_event index. */
> > +	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
> > +
> > +	/* On success, increment avail index. */
> > +	vq->last_avail_idx++;
> > +
> > +	return 0;
> > +}
> > +
> > +/* This looks in the virtqueue and for the first available buffer, and converts
> > + * it to an iovec for convenient access.  Since descriptors consist of some
> > + * number of output then some number of input descriptors, it's actually two
> > + * iovecs, but we pack them into one and note how many of each there were.
> > + *
> > + * This function returns the descriptor number found, or vq->num (which is
> > + * never a valid descriptor number) if none was found.  A negative code is
> > + * returned on error. */
> > +int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> > +		      struct iovec iov[], unsigned int iov_size,
> > +		      unsigned int *out_num, unsigned int *in_num,
> > +		      struct vhost_log *log, unsigned int *log_num)
> > +{
> > +	int ret = fetch_descs(vq);
> > +	int i;
> > +
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Now convert to IOV */
> > +	/* When we start there are none of either input nor output. */
> > +	*out_num = *in_num = 0;
> > +	if (unlikely(log))
> > +		*log_num = 0;
> > +
> > +	for (i = 0; i < vq->ndescs; ++i) {
> > +		unsigned iov_count = *in_num + *out_num;
> > +		struct vhost_desc *desc = &vq->descs[i];
> > +		int access;
> > +
> > +		if (desc->flags & ~VHOST_DESC_FLAGS) {
> > +			vq_err(vq, "Unexpected flags: 0x%x at descriptor id 0x%x\n",
> > +			       desc->flags, desc->id);
> > +			ret = -EINVAL;
> > +			goto err;
> > +		}
> > +		if (desc->flags & VRING_DESC_F_WRITE)
> > +			access = VHOST_ACCESS_WO;
> > +		else
> > +			access = VHOST_ACCESS_RO;
> > +		ret = translate_desc(vq, desc->addr,
> > +				     desc->len, iov + iov_count,
> > +				     iov_size - iov_count, access);
> > +		if (unlikely(ret < 0)) {
> > +			if (ret != -EAGAIN)
> > +				vq_err(vq, "Translation failure %d descriptor idx %d\n",
> > +					ret, i);
> > +			goto err;
> > +		}
> > +		if (access == VHOST_ACCESS_WO) {
> > +			/* If this is an input descriptor,
> > +			 * increment that count. */
> > +			*in_num += ret;
> > +			if (unlikely(log && ret)) {
> > +				log[*log_num].addr = desc->addr;
> > +				log[*log_num].len = desc->len;
> > +				++*log_num;
> > +			}
> > +		} else {
> > +			/* If it's an output descriptor, they're all supposed
> > +			 * to come before any input descriptors. */
> > +			if (unlikely(*in_num)) {
> > +				vq_err(vq, "Descriptor has out after in: "
> > +				       "idx %d\n", i);
> > +				ret = -EINVAL;
> > +				goto err;
> > +			}
> > +			*out_num += ret;
> > +		}
> > +
> > +		ret = desc->id;
> > +	}
> > +
> > +	vq->ndescs = 0;
> > +
> > +	return ret;
> > +
> > +err:
> > +	vhost_discard_vq_desc(vq, 1);
> > +	vq->ndescs = 0;
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
> > +
> >   /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> >   void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
> >   {
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 60cab4c78229..0976a2853935 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -60,6 +60,13 @@ enum vhost_uaddr_type {
> >   	VHOST_NUM_ADDRS = 3,
> >   };
> > +struct vhost_desc {
> > +	u64 addr;
> > +	u32 len;
> > +	u16 flags; /* VRING_DESC_F_WRITE, VRING_DESC_F_NEXT */
> > +	u16 id;
> > +};
> > +
> >   /* The virtqueue structure describes a queue attached to a device. */
> >   struct vhost_virtqueue {
> >   	struct vhost_dev *dev;
> > @@ -71,6 +78,11 @@ struct vhost_virtqueue {
> >   	vring_avail_t __user *avail;
> >   	vring_used_t __user *used;
> >   	const struct vhost_iotlb_map *meta_iotlb[VHOST_NUM_ADDRS];
> > +
> > +	struct vhost_desc *descs;
> > +	int ndescs;
> > +	int max_descs;
> > +
> >   	struct file *kick;
> >   	struct eventfd_ctx *call_ctx;
> >   	struct eventfd_ctx *error_ctx;
> > @@ -175,6 +187,10 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >   bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
> >   bool vhost_log_access_ok(struct vhost_dev *);
> > +int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
> > +		      struct iovec iov[], unsigned int iov_count,
> > +		      unsigned int *out_num, unsigned int *in_num,
> > +		      struct vhost_log *log, unsigned int *log_num);
> >   int vhost_get_vq_desc(struct vhost_virtqueue *,
> >   		      struct iovec iov[], unsigned int iov_count,
> >   		      unsigned int *out_num, unsigned int *in_num,


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct
  2020-06-03  9:48     ` Michael S. Tsirkin
@ 2020-06-03 12:04       ` Jason Wang
  2020-06-07 13:59         ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-03 12:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/3 5:48 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 03, 2020 at 03:13:56PM +0800, Jason Wang wrote:
>> On 2020/6/2 9:05 PM, Michael S. Tsirkin wrote:


[...]


>>> +
>>> +static int fetch_indirect_descs(struct vhost_virtqueue *vq,
>>> +				struct vhost_desc *indirect,
>>> +				u16 head)
>>> +{
>>> +	struct vring_desc desc;
>>> +	unsigned int i = 0, count, found = 0;
>>> +	u32 len = indirect->len;
>>> +	struct iov_iter from;
>>> +	int ret;
>>> +
>>> +	/* Sanity check */
>>> +	if (unlikely(len % sizeof desc)) {
>>> +		vq_err(vq, "Invalid length in indirect descriptor: "
>>> +		       "len 0x%llx not multiple of 0x%zx\n",
>>> +		       (unsigned long long)len,
>>> +		       sizeof desc);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	ret = translate_desc(vq, indirect->addr, len, vq->indirect,
>>> +			     UIO_MAXIOV, VHOST_ACCESS_RO);
>>> +	if (unlikely(ret < 0)) {
>>> +		if (ret != -EAGAIN)
>>> +			vq_err(vq, "Translation failure %d in indirect.\n", ret);
>>> +		return ret;
>>> +	}
>>> +	iov_iter_init(&from, READ, vq->indirect, ret, len);
>>> +
>>> +	/* We will use the result as an address to read from, so most
>>> +	 * architectures only need a compiler barrier here. */
>>> +	read_barrier_depends();
>>> +
>>> +	count = len / sizeof desc;
>>> +	/* Buffers are chained via a 16 bit next field, so
>>> +	 * we can have at most 2^16 of these. */
>>> +	if (unlikely(count > USHRT_MAX + 1)) {
>>> +		vq_err(vq, "Indirect buffer length too big: %d\n",
>>> +		       indirect->len);
>>> +		return -E2BIG;
>>> +	}
>>> +	if (unlikely(vq->ndescs + count > vq->max_descs)) {
>>> +		vq_err(vq, "Too many indirect + direct descs: %d + %d\n",
>>> +		       vq->ndescs, indirect->len);
>>> +		return -E2BIG;
>>> +	}
>>> +
>>> +	do {
>>> +		if (unlikely(++found > count)) {
>>> +			vq_err(vq, "Loop detected: last one at %u "
>>> +			       "indirect size %u\n",
>>> +			       i, count);
>>> +			return -EINVAL;
>>> +		}
>>> +		if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
>>> +			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
>>> +			       i, (size_t)indirect->addr + i * sizeof desc);
>>> +			return -EINVAL;
>>> +		}
>>> +		if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
>>> +			vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
>>> +			       i, (size_t)indirect->addr + i * sizeof desc);
>>> +			return -EINVAL;
>>> +		}
>>> +
>>> +		push_split_desc(vq, &desc, head);
>>
>> The error is ignored.
> See above:
>
>       	if (unlikely(vq->ndescs + count > vq->max_descs))
>
> So it can't fail here, we never fetch unless there's space.
>
> I guess we can add a WARN_ON here.


Yes.


>
>>> +	} while ((i = next_desc(vq, &desc)) != -1);
>>> +	return 0;
>>> +}
>>> +
>>> +static int fetch_descs(struct vhost_virtqueue *vq)
>>> +{
>>> +	unsigned int i, head, found = 0;
>>> +	struct vhost_desc *last;
>>> +	struct vring_desc desc;
>>> +	__virtio16 avail_idx;
>>> +	__virtio16 ring_head;
>>> +	u16 last_avail_idx;
>>> +	int ret;
>>> +
>>> +	/* Check it isn't doing very strange things with descriptor numbers. */
>>> +	last_avail_idx = vq->last_avail_idx;
>>> +
>>> +	if (vq->avail_idx == vq->last_avail_idx) {
>>> +		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
>>> +			vq_err(vq, "Failed to access avail idx at %p\n",
>>> +				&vq->avail->idx);
>>> +			return -EFAULT;
>>> +		}
>>> +		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>>> +
>>> +		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
>>> +			vq_err(vq, "Guest moved used index from %u to %u",
>>> +				last_avail_idx, vq->avail_idx);
>>> +			return -EFAULT;
>>> +		}
>>> +
>>> +		/* If there's nothing new since last we looked, return
>>> +		 * invalid.
>>> +		 */
>>> +		if (vq->avail_idx == last_avail_idx)
>>> +			return vq->num;
>>> +
>>> +		/* Only get avail ring entries after they have been
>>> +		 * exposed by guest.
>>> +		 */
>>> +		smp_rmb();
>>> +	}
>>> +
>>> +	/* Grab the next descriptor number they're advertising */
>>> +	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
>>> +		vq_err(vq, "Failed to read head: idx %d address %p\n",
>>> +		       last_avail_idx,
>>> +		       &vq->avail->ring[last_avail_idx % vq->num]);
>>> +		return -EFAULT;
>>> +	}
>>> +
>>> +	head = vhost16_to_cpu(vq, ring_head);
>>> +
>>> +	/* If their number is silly, that's an error. */
>>> +	if (unlikely(head >= vq->num)) {
>>> +		vq_err(vq, "Guest says index %u > %u is available",
>>> +		       head, vq->num);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	i = head;
>>> +	do {
>>> +		if (unlikely(i >= vq->num)) {
>>> +			vq_err(vq, "Desc index is %u > %u, head = %u",
>>> +			       i, vq->num, head);
>>> +			return -EINVAL;
>>> +		}
>>> +		if (unlikely(++found > vq->num)) {
>>> +			vq_err(vq, "Loop detected: last one at %u "
>>> +			       "vq size %u head %u\n",
>>> +			       i, vq->num, head);
>>> +			return -EINVAL;
>>> +		}
>>> +		ret = vhost_get_desc(vq, &desc, i);
>>> +		if (unlikely(ret)) {
>>> +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
>>> +			       i, vq->desc + i);
>>> +			return -EFAULT;
>>> +		}
>>> +		ret = push_split_desc(vq, &desc, head);
>>> +		if (unlikely(ret)) {
>>> +			vq_err(vq, "Failed to save descriptor: idx %d\n", i);
>>> +			return -EINVAL;
>>> +		}
>>> +	} while ((i = next_desc(vq, &desc)) != -1);
>>> +
>>> +	last = peek_split_desc(vq);
>>> +	if (unlikely(last->flags & VRING_DESC_F_INDIRECT)) {
>>> +		pop_split_desc(vq);
>>> +		ret = fetch_indirect_descs(vq, last, head);
>>
>> Note that this means we don't support chained indirect descriptors, which
>> complies with the spec, but we do support this in vhost_get_vq_desc().
> Well the spec says:
> 	A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT in flags.
>
> Did I miss anything?
>

No, but I meant that the current vhost_get_vq_desc() supports chained 
indirect descriptors. Not sure if there's an application that silently 
depends on this.
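
If we go the "fail early" route instead, a check along these lines in
the fetch_descs() loop would at least make the new behaviour explicit
(sketch only, not from the posted series):

		if (unlikely((desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) &&
			     (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_NEXT)))) {
			vq_err(vq, "Chained indirect descriptor: idx %d\n", i);
			return -EINVAL;
		}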

Thanks



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/13] vhost: batching fetches
  2020-06-03  7:27   ` Jason Wang
@ 2020-06-04  8:59     ` Michael S. Tsirkin
  2020-06-05  3:40       ` Jason Wang
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-04  8:59 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Wed, Jun 03, 2020 at 03:27:39PM +0800, Jason Wang wrote:
> 
> On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> > With this patch applied, new and old code perform identically.
> > 
> > Lots of extra optimizations are now possible, e.g.
> > we can fetch multiple heads with copy_from/to_user now.
> > We can get rid of maintaining the log array.  Etc etc.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > Link: https://lore.kernel.org/r/20200401183118.8334-4-eperezma@redhat.com
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   drivers/vhost/test.c  |  2 +-
> >   drivers/vhost/vhost.c | 47 ++++++++++++++++++++++++++++++++++++++-----
> >   drivers/vhost/vhost.h |  5 ++++-
> >   3 files changed, 47 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> > index 9a3a09005e03..02806d6f84ef 100644
> > --- a/drivers/vhost/test.c
> > +++ b/drivers/vhost/test.c
> > @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
> >   	dev = &n->dev;
> >   	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
> >   	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> > -	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> > +	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
> >   		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, NULL);
> >   	f->private_data = n;
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 8f9a07282625..aca2a5b0d078 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -299,6 +299,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> >   {
> >   	vq->num = 1;
> >   	vq->ndescs = 0;
> > +	vq->first_desc = 0;
> >   	vq->desc = NULL;
> >   	vq->avail = NULL;
> >   	vq->used = NULL;
> > @@ -367,6 +368,11 @@ static int vhost_worker(void *data)
> >   	return 0;
> >   }
> > +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> > +{
> > +	return vq->max_descs - UIO_MAXIOV;
> > +}
> 
> 
> One descriptor does not mean one iov; e.g. userspace may pass several
> 1-byte memory regions for us to translate.
> 


Yes but I don't see the relevance. This tells us how many descriptors to
batch, not how many IOVs.

> > +
> >   static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> >   {
> >   	kfree(vq->descs);
> > @@ -389,6 +395,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> >   	for (i = 0; i < dev->nvqs; ++i) {
> >   		vq = dev->vqs[i];
> >   		vq->max_descs = dev->iov_limit;
> > +		if (vhost_vq_num_batch_descs(vq) < 0) {
> > +			return -EINVAL;
> > +		}
> >   		vq->descs = kmalloc_array(vq->max_descs,
> >   					  sizeof(*vq->descs),
> >   					  GFP_KERNEL);
> > @@ -1570,6 +1579,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >   		vq->last_avail_idx = s.num;
> >   		/* Forget the cached index value. */
> >   		vq->avail_idx = vq->last_avail_idx;
> > +		vq->ndescs = vq->first_desc = 0;
> >   		break;
> >   	case VHOST_GET_VRING_BASE:
> >   		s.index = idx;
> > @@ -2136,7 +2146,7 @@ static int fetch_indirect_descs(struct vhost_virtqueue *vq,
> >   	return 0;
> >   }
> > -static int fetch_descs(struct vhost_virtqueue *vq)
> > +static int fetch_buf(struct vhost_virtqueue *vq)
> >   {
> >   	unsigned int i, head, found = 0;
> >   	struct vhost_desc *last;
> > @@ -2149,7 +2159,11 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> >   	/* Check it isn't doing very strange things with descriptor numbers. */
> >   	last_avail_idx = vq->last_avail_idx;
> > -	if (vq->avail_idx == vq->last_avail_idx) {
> > +	if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
> > +		/* If we already have work to do, don't bother re-checking. */
> > +		if (likely(vq->ndescs))
> > +			return vq->num;
> > +
> >   		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> >   			vq_err(vq, "Failed to access avail idx at %p\n",
> >   				&vq->avail->idx);
> > @@ -2240,6 +2254,24 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> >   	return 0;
> >   }
> > +static int fetch_descs(struct vhost_virtqueue *vq)
> > +{
> > +	int ret = 0;
> > +
> > +	if (unlikely(vq->first_desc >= vq->ndescs)) {
> > +		vq->first_desc = 0;
> > +		vq->ndescs = 0;
> > +	}
> > +
> > +	if (vq->ndescs)
> > +		return 0;
> > +
> > +	while (!ret && vq->ndescs <= vhost_vq_num_batch_descs(vq))
> > +		ret = fetch_buf(vq);
> > +
> > +	return vq->ndescs ? 0 : ret;
> > +}
> > +
> >   /* This looks in the virtqueue and for the first available buffer, and converts
> >    * it to an iovec for convenient access.  Since descriptors consist of some
> >    * number of output then some number of input descriptors, it's actually two
> > @@ -2265,7 +2297,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   	if (unlikely(log))
> >   		*log_num = 0;
> > -	for (i = 0; i < vq->ndescs; ++i) {
> > +	for (i = vq->first_desc; i < vq->ndescs; ++i) {
> >   		unsigned iov_count = *in_num + *out_num;
> >   		struct vhost_desc *desc = &vq->descs[i];
> >   		int access;
> > @@ -2311,14 +2343,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   		}
> >   		ret = desc->id;
> > +
> > +		if (!(desc->flags & VRING_DESC_F_NEXT))
> > +			break;
> >   	}
> > -	vq->ndescs = 0;
> > +	vq->first_desc = i + 1;
> >   	return ret;
> >   err:
> > -	vhost_discard_vq_desc(vq, 1);
> > +	for (i = vq->first_desc; i < vq->ndescs; ++i)
> > +		if (!(vq->descs[i].flags & VRING_DESC_F_NEXT))
> > +			vhost_discard_vq_desc(vq, 1);
> >   	vq->ndescs = 0;
> >   	return ret;
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 76356edee8e5..a67bda9792ec 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -81,6 +81,7 @@ struct vhost_virtqueue {
> >   	struct vhost_desc *descs;
> >   	int ndescs;
> > +	int first_desc;
> >   	int max_descs;
> >   	struct file *kick;
> > @@ -229,7 +230,7 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
> >   			  struct vhost_iotlb_map *map);
> >   #define vq_err(vq, fmt, ...) do {                                  \
> > -		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> > +		pr_err(pr_fmt(fmt), ##__VA_ARGS__);       \
> 
> 
> Need a separate patch for this?
> 
> Thanks


Oh that's a debugging thing. I will drop it.

> 
> >   		if ((vq)->error_ctx)                               \
> >   				eventfd_signal((vq)->error_ctx, 1);\
> >   	} while (0)
> > @@ -255,6 +256,8 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
> >   					void *private_data)
> >   {
> >   	vq->private_data = private_data;
> > +	vq->ndescs = 0;
> > +	vq->first_desc = 0;
> >   }
> >   /**


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 04/13] vhost: cleanup fetch_buf return code handling
  2020-06-03  7:29   ` Jason Wang
@ 2020-06-04  9:01     ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-04  9:01 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Wed, Jun 03, 2020 at 03:29:02PM +0800, Jason Wang wrote:
> 
> On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> > Return code of fetch_buf is confusing, so callers resort to
> > tricks to get to sane values. Let's switch to something standard:
> > 0 empty, >0 non-empty, <0 error.
> > 
> > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > ---
> >   drivers/vhost/vhost.c | 24 ++++++++++++++++--------
> >   1 file changed, 16 insertions(+), 8 deletions(-)
> 
> 
> Why not squash this into patch 2 or 3?
> 
> Thanks

It makes the tricky patches smaller. I'll consider it; for now, this
split also reflects the fact that patches 1-3 have already been tested.

-- 
MST


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 07/13] vhost: format-independent API for used buffers
  2020-06-03  7:58   ` Jason Wang
@ 2020-06-04  9:03     ` Michael S. Tsirkin
  2020-06-04  9:18       ` Jason Wang
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-04  9:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Wed, Jun 03, 2020 at 03:58:26PM +0800, Jason Wang wrote:
> 
> On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> > Add a new API that doesn't assume used ring, heads, etc.
> > For now, we keep the old APIs around to make it easier
> > to convert drivers.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   drivers/vhost/vhost.c | 52 ++++++++++++++++++++++++++++++++++---------
> >   drivers/vhost/vhost.h | 17 +++++++++++++-
> >   2 files changed, 58 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index b4a6e44d56a8..be822f0c9428 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2292,13 +2292,12 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> >    * number of output then some number of input descriptors, it's actually two
> >    * iovecs, but we pack them into one and note how many of each there were.
> >    *
> > - * This function returns the descriptor number found, or vq->num (which is
> > - * never a valid descriptor number) if none was found.  A negative code is
> > - * returned on error. */
> > -int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> > -		      struct iovec iov[], unsigned int iov_size,
> > -		      unsigned int *out_num, unsigned int *in_num,
> > -		      struct vhost_log *log, unsigned int *log_num)
> > + * This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> > + * A negative code is returned on error. */
> > +int vhost_get_avail_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf,
> > +			struct iovec iov[], unsigned int iov_size,
> > +			unsigned int *out_num, unsigned int *in_num,
> > +			struct vhost_log *log, unsigned int *log_num)
> >   {
> >   	int ret = fetch_descs(vq);
> >   	int i;
> > @@ -2311,6 +2310,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   	*out_num = *in_num = 0;
> >   	if (unlikely(log))
> >   		*log_num = 0;
> > +	buf->in_len = buf->out_len = 0;
> > +	buf->descs = 0;
> >   	for (i = vq->first_desc; i < vq->ndescs; ++i) {
> >   		unsigned iov_count = *in_num + *out_num;
> > @@ -2340,6 +2341,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   			/* If this is an input descriptor,
> >   			 * increment that count. */
> >   			*in_num += ret;
> > +			buf->in_len += desc->len;
> >   			if (unlikely(log && ret)) {
> >   				log[*log_num].addr = desc->addr;
> >   				log[*log_num].len = desc->len;
> > @@ -2355,9 +2357,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   				goto err;
> >   			}
> >   			*out_num += ret;
> > +			buf->out_len += desc->len;
> >   		}
> > -		ret = desc->id;
> > +		buf->id = desc->id;
> > +		++buf->descs;
> >   		if (!(desc->flags & VRING_DESC_F_NEXT))
> >   			break;
> > @@ -2365,7 +2369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   	vq->first_desc = i + 1;
> > -	return ret;
> > +	return 1;
> >   err:
> >   	for (i = vq->first_desc; i < vq->ndescs; ++i)
> > @@ -2375,7 +2379,15 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> >   	return ret;
> >   }
> > -EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
> > +EXPORT_SYMBOL_GPL(vhost_get_avail_buf);
> > +
> > +/* Reverse the effect of vhost_get_avail_buf. Useful for error handling. */
> > +void vhost_discard_avail_bufs(struct vhost_virtqueue *vq,
> > +			      struct vhost_buf *buf, unsigned count)
> > +{
> > +	vhost_discard_vq_desc(vq, count);
> > +}
> > +EXPORT_SYMBOL_GPL(vhost_discard_avail_bufs);
> >   static int __vhost_add_used_n(struct vhost_virtqueue *vq,
> >   			    struct vring_used_elem *heads,
> > @@ -2459,6 +2471,26 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
> >   }
> >   EXPORT_SYMBOL_GPL(vhost_add_used);
> > +int vhost_put_used_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf)
> > +{
> > +	return vhost_add_used(vq, buf->id, buf->in_len);
> > +}
> > +EXPORT_SYMBOL_GPL(vhost_put_used_buf);
> > +
> > +int vhost_put_used_n_bufs(struct vhost_virtqueue *vq,
> > +			  struct vhost_buf *bufs, unsigned count)
> > +{
> > +	unsigned i;
> > +
> > +	for (i = 0; i < count; ++i) {
> > +		vq->heads[i].id = cpu_to_vhost32(vq, bufs[i].id);
> > +		vq->heads[i].len = cpu_to_vhost32(vq, bufs[i].in_len);
> > +	}
> > +
> > +	return vhost_add_used_n(vq, vq->heads, count);
> > +}
> > +EXPORT_SYMBOL_GPL(vhost_put_used_n_bufs);
> > +
> >   static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >   {
> >   	__u16 old, new;
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index a67bda9792ec..6c10e99ff334 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -67,6 +67,13 @@ struct vhost_desc {
> >   	u16 id;
> >   };
> > +struct vhost_buf {
> > +	u32 out_len;
> > +	u32 in_len;
> > +	u16 descs;
> > +	u16 id;
> > +};
> 
> 
> So it looks to me that struct vhost_buf can work for both the split ring
> and the packed ring.
> 
> If this is true, we'd better make struct vhost_desc work for both.
> 
> Thanks

Both vhost_desc and vhost_buf can work for split and packed.

Do you mean we should add packed ring support based on this?
For sure, this is one of the motivators for the patchset.
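Right - to illustrate (rough sketch, not in this series; struct
vring_packed_desc is the existing uapi definition, and packed rings are
little-endian since they depend on VIRTIO_F_VERSION_1), a packed-ring
fetch could fill the very same vhost_desc entries:

static void push_packed_desc(struct vhost_virtqueue *vq,
			     const struct vring_packed_desc *desc)
{
	struct vhost_desc *h = &vq->descs[vq->ndescs++];

	h->addr  = le64_to_cpu(desc->addr);
	h->len   = le32_to_cpu(desc->len);
	/* keep only the flags the format-independent path cares about */
	h->flags = le16_to_cpu(desc->flags) & VHOST_DESC_FLAGS;
	h->id    = le16_to_cpu(desc->id);
}

Everything from vhost_get_avail_buf() onwards then stays format-agnostic.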


> 
> > +
> >   /* The virtqueue structure describes a queue attached to a device. */
> >   struct vhost_virtqueue {
> >   	struct vhost_dev *dev;
> > @@ -193,7 +200,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
> >   		      unsigned int *out_num, unsigned int *in_num,
> >   		      struct vhost_log *log, unsigned int *log_num);
> >   void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
> > -
> > +int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
> > +			struct iovec iov[], unsigned int iov_count,
> > +			unsigned int *out_num, unsigned int *in_num,
> > +			struct vhost_log *log, unsigned int *log_num);
> > +void vhost_discard_avail_bufs(struct vhost_virtqueue *,
> > +			      struct vhost_buf *, unsigned count);
> >   int vhost_vq_init_access(struct vhost_virtqueue *);
> >   int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
> >   int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
> > @@ -202,6 +214,9 @@ void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
> >   			       unsigned int id, int len);
> >   void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
> >   			       struct vring_used_elem *heads, unsigned count);
> > +int vhost_put_used_buf(struct vhost_virtqueue *, struct vhost_buf *buf);
> > +int vhost_put_used_n_bufs(struct vhost_virtqueue *,
> > +			  struct vhost_buf *bufs, unsigned count);
> >   void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
> >   void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
> >   bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 08/13] vhost/net: convert to new API: heads->bufs
  2020-06-03  8:11   ` Jason Wang
@ 2020-06-04  9:05     ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-04  9:05 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Wed, Jun 03, 2020 at 04:11:54PM +0800, Jason Wang wrote:
> 
> On 2020/6/2 下午9:06, Michael S. Tsirkin wrote:
> > Convert vhost net to use the new format-agnostic API.
> > In particular, don't poke at vq internals such as the
> > heads array.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   drivers/vhost/net.c | 153 +++++++++++++++++++++++---------------------
> >   1 file changed, 81 insertions(+), 72 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 749a9cf51a59..47af3d1ce3dd 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -59,13 +59,13 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> >    * status internally; used for zerocopy tx only.
> >    */
> >   /* Lower device DMA failed */
> > -#define VHOST_DMA_FAILED_LEN	((__force __virtio32)3)
> > +#define VHOST_DMA_FAILED_LEN	(3)
> >   /* Lower device DMA done */
> > -#define VHOST_DMA_DONE_LEN	((__force __virtio32)2)
> > +#define VHOST_DMA_DONE_LEN	(2)
> >   /* Lower device DMA in progress */
> > -#define VHOST_DMA_IN_PROGRESS	((__force __virtio32)1)
> > +#define VHOST_DMA_IN_PROGRESS	(1)
> >   /* Buffer unused */
> > -#define VHOST_DMA_CLEAR_LEN	((__force __virtio32)0)
> > +#define VHOST_DMA_CLEAR_LEN	(0)
> 
> 
> Another patch for this?

It can't be a separate patch. Without switching to vhost_buf we are
passing vring_used_elem structs around, and those have a __virtio32 length.
Once we switch to vhost_buf, the length is a plain u32.
It's just 4 lines; not a lot would be gained by splitting it out anyway.
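For reference, the two layouts in play look roughly like this (vring_used_elem
as in the virtio uapi headers, vhost_buf as introduced by this series); just a
sketch to illustrate why the endian casts go away:

	/* Old heads array element: guest-endian fields. */
	struct vring_used_elem {
		__virtio32 id;
		__virtio32 len;
	};

	/* New per-buffer state: native-endian, so the DMA tracking
	 * constants above no longer need __virtio32 casts. */
	struct vhost_buf {
		u32 out_len;
		u32 in_len;
		u16 descs;
		u16 id;
	};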

> 
> >   #define VHOST_DMA_IS_DONE(len) ((__force u32)(len) >= (__force u32)VHOST_DMA_DONE_LEN)
> > @@ -112,9 +112,12 @@ struct vhost_net_virtqueue {
> >   	/* last used idx for outstanding DMA zerocopy buffers */
> >   	int upend_idx;
> >   	/* For TX, first used idx for DMA done zerocopy buffers
> > -	 * For RX, number of batched heads
> > +	 * For RX, number of batched bufs
> >   	 */
> >   	int done_idx;
> > +	/* Outstanding user bufs. UIO_MAXIOV in length. */
> > +	/* TODO: we can make this smaller for sure. */
> > +	struct vhost_buf *bufs;
> >   	/* Number of XDP frames batched */
> >   	int batched_xdp;
> >   	/* an array of userspace buffers info */
> > @@ -271,6 +274,8 @@ static void vhost_net_clear_ubuf_info(struct vhost_net *n)
> >   	int i;
> >   	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
> > +		kfree(n->vqs[i].bufs);
> > +		n->vqs[i].bufs = NULL;
> >   		kfree(n->vqs[i].ubuf_info);
> >   		n->vqs[i].ubuf_info = NULL;
> >   	}
> > @@ -282,6 +287,12 @@ static int vhost_net_set_ubuf_info(struct vhost_net *n)
> >   	int i;
> >   	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
> > +		n->vqs[i].bufs = kmalloc_array(UIO_MAXIOV,
> > +					       sizeof(*n->vqs[i].bufs),
> > +					       GFP_KERNEL);
> > +		if (!n->vqs[i].bufs)
> > +			goto err;
> > +
> >   		zcopy = vhost_net_zcopy_mask & (0x1 << i);
> >   		if (!zcopy)
> >   			continue;
> > @@ -364,18 +375,18 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
> >   	int j = 0;
> >   	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
> > -		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
> > +		if (nvq->bufs[i].in_len == VHOST_DMA_FAILED_LEN)
> >   			vhost_net_tx_err(net);
> > -		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
> > -			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
> > +		if (VHOST_DMA_IS_DONE(nvq->bufs[i].in_len)) {
> > +			nvq->bufs[i].in_len = VHOST_DMA_CLEAR_LEN;
> >   			++j;
> >   		} else
> >   			break;
> >   	}
> >   	while (j) {
> >   		add = min(UIO_MAXIOV - nvq->done_idx, j);
> > -		vhost_add_used_and_signal_n(vq->dev, vq,
> > -					    &vq->heads[nvq->done_idx], add);
> > +		vhost_put_used_n_bufs(vq, &nvq->bufs[nvq->done_idx], add);
> > +		vhost_signal(vq->dev, vq);
> >   		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
> >   		j -= add;
> >   	}
> > @@ -390,7 +401,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
> >   	rcu_read_lock_bh();
> >   	/* set len to mark this desc buffers done DMA */
> > -	nvq->vq.heads[ubuf->desc].in_len = success ?
> > +	nvq->bufs[ubuf->desc].in_len = success ?
> >   		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
> >   	cnt = vhost_net_ubuf_put(ubufs);
> > @@ -452,7 +463,8 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
> >   	if (!nvq->done_idx)
> >   		return;
> > -	vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
> > +	vhost_put_used_n_bufs(vq, nvq->bufs, nvq->done_idx);
> > +	vhost_signal(dev, vq);
> >   	nvq->done_idx = 0;
> >   }
> > @@ -558,6 +570,7 @@ static void vhost_net_busy_poll(struct vhost_net *net,
> >   static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> >   				    struct vhost_net_virtqueue *tnvq,
> > +				    struct vhost_buf *buf,
> >   				    unsigned int *out_num, unsigned int *in_num,
> >   				    struct msghdr *msghdr, bool *busyloop_intr)
> >   {
> > @@ -565,10 +578,10 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> >   	struct vhost_virtqueue *rvq = &rnvq->vq;
> >   	struct vhost_virtqueue *tvq = &tnvq->vq;
> > -	int r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
> > -				  out_num, in_num, NULL, NULL);
> > +	int r = vhost_get_avail_buf(tvq, buf, tvq->iov, ARRAY_SIZE(tvq->iov),
> > +				    out_num, in_num, NULL, NULL);
> > -	if (r == tvq->num && tvq->busyloop_timeout) {
> > +	if (!r && tvq->busyloop_timeout) {
> >   		/* Flush batched packets first */
> >   		if (!vhost_sock_zcopy(vhost_vq_get_backend(tvq)))
> >   			vhost_tx_batch(net, tnvq,
> > @@ -577,8 +590,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> >   		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, false);
> > -		r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
> > -				      out_num, in_num, NULL, NULL);
> > +		r = vhost_get_avail_buf(tvq, buf, tvq->iov, ARRAY_SIZE(tvq->iov),
> > +					out_num, in_num, NULL, NULL);
> >   	}
> >   	return r;
> > @@ -607,6 +620,7 @@ static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
> >   static int get_tx_bufs(struct vhost_net *net,
> >   		       struct vhost_net_virtqueue *nvq,
> > +		       struct vhost_buf *buf,
> >   		       struct msghdr *msg,
> >   		       unsigned int *out, unsigned int *in,
> >   		       size_t *len, bool *busyloop_intr)
> > @@ -614,9 +628,9 @@ static int get_tx_bufs(struct vhost_net *net,
> >   	struct vhost_virtqueue *vq = &nvq->vq;
> >   	int ret;
> > -	ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, msg, busyloop_intr);
> > +	ret = vhost_net_tx_get_vq_desc(net, nvq, buf, out, in, msg, busyloop_intr);
> > -	if (ret < 0 || ret == vq->num)
> > +	if (ret <= 0)
> >   		return ret;
> >   	if (*in) {
> > @@ -761,7 +775,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >   	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> >   	struct vhost_virtqueue *vq = &nvq->vq;
> >   	unsigned out, in;
> > -	int head;
> > +	int ret;
> >   	struct msghdr msg = {
> >   		.msg_name = NULL,
> >   		.msg_namelen = 0,
> > @@ -773,6 +787,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >   	int err;
> >   	int sent_pkts = 0;
> >   	bool sock_can_batch = (sock->sk->sk_sndbuf == INT_MAX);
> > +	struct vhost_buf buf;
> >   	do {
> >   		bool busyloop_intr = false;
> > @@ -780,13 +795,13 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >   		if (nvq->done_idx == VHOST_NET_BATCH)
> >   			vhost_tx_batch(net, nvq, sock, &msg);
> > -		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
> > -				   &busyloop_intr);
> > +		ret = get_tx_bufs(net, nvq, &buf, &msg, &out, &in, &len,
> > +				  &busyloop_intr);
> >   		/* On error, stop handling until the next kick. */
> > -		if (unlikely(head < 0))
> > +		if (unlikely(ret < 0))
> >   			break;
> >   		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> > -		if (head == vq->num) {
> > +		if (!ret) {
> >   			if (unlikely(busyloop_intr)) {
> >   				vhost_poll_queue(&vq->poll);
> >   			} else if (unlikely(vhost_enable_notify(&net->dev,
> > @@ -808,7 +823,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >   				goto done;
> >   			} else if (unlikely(err != -ENOSPC)) {
> >   				vhost_tx_batch(net, nvq, sock, &msg);
> > -				vhost_discard_vq_desc(vq, 1);
> > +				vhost_discard_avail_bufs(vq, &buf, 1);
> >   				vhost_net_enable_vq(net, vq);
> >   				break;
> >   			}
> > @@ -829,7 +844,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >   		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> >   		err = sock->ops->sendmsg(sock, &msg, len);
> >   		if (unlikely(err < 0)) {
> > -			vhost_discard_vq_desc(vq, 1);
> > +			vhost_discard_avail_bufs(vq, &buf, 1);
> 
> 
> Do we need to decrease first_desc in vhost_discard_avail_bufs()?
> 
> 
> >   			vhost_net_enable_vq(net, vq);
> >   			break;
> >   		}
> > @@ -837,8 +852,7 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >   			pr_debug("Truncated TX packet: len %d != %zd\n",
> >   				 err, len);
> >   done:
> > -		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
> > -		vq->heads[nvq->done_idx].len = 0;
> > +		nvq->bufs[nvq->done_idx] = buf;
> >   		++nvq->done_idx;
> >   	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
> > @@ -850,7 +864,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
> >   	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> >   	struct vhost_virtqueue *vq = &nvq->vq;
> >   	unsigned out, in;
> > -	int head;
> > +	int ret;
> >   	struct msghdr msg = {
> >   		.msg_name = NULL,
> >   		.msg_namelen = 0,
> > @@ -864,6 +878,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
> >   	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> >   	bool zcopy_used;
> >   	int sent_pkts = 0;
> > +	struct vhost_buf buf;
> >   	do {
> >   		bool busyloop_intr;
> > @@ -872,13 +887,13 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
> >   		vhost_zerocopy_signal_used(net, vq);
> >   		busyloop_intr = false;
> > -		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
> > -				   &busyloop_intr);
> > +		ret = get_tx_bufs(net, nvq, &buf, &msg, &out, &in, &len,
> > +				  &busyloop_intr);
> >   		/* On error, stop handling until the next kick. */
> > -		if (unlikely(head < 0))
> > +		if (unlikely(ret < 0))
> >   			break;
> >   		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> > -		if (head == vq->num) {
> > +		if (!ret) {
> >   			if (unlikely(busyloop_intr)) {
> >   				vhost_poll_queue(&vq->poll);
> >   			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> > @@ -897,8 +912,8 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
> >   			struct ubuf_info *ubuf;
> >   			ubuf = nvq->ubuf_info + nvq->upend_idx;
> > -			vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
> > -			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
> > +			nvq->bufs[nvq->upend_idx] = buf;
> > +			nvq->bufs[nvq->upend_idx].in_len = VHOST_DMA_IN_PROGRESS;
> >   			ubuf->callback = vhost_zerocopy_callback;
> >   			ubuf->ctx = nvq->ubufs;
> >   			ubuf->desc = nvq->upend_idx;
> > @@ -930,17 +945,19 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
> >   				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
> >   					% UIO_MAXIOV;
> >   			}
> > -			vhost_discard_vq_desc(vq, 1);
> > +			vhost_discard_avail_bufs(vq, &buf, 1);
> >   			vhost_net_enable_vq(net, vq);
> >   			break;
> >   		}
> >   		if (err != len)
> >   			pr_debug("Truncated TX packet: "
> >   				 " len %d != %zd\n", err, len);
> > -		if (!zcopy_used)
> > -			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > -		else
> > +		if (!zcopy_used) {
> > +			vhost_put_used_buf(vq, &buf);
> > +			vhost_signal(&net->dev, vq);
> 
> 
> Do we need something like vhost_put_used_and_signal()?
> 
> Thanks
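For illustration, such a wrapper would simply pair the two new calls
(hypothetical helper, name mirroring vhost_add_used_and_signal(); not
something this series adds):

	static inline void vhost_put_used_and_signal(struct vhost_dev *dev,
						     struct vhost_virtqueue *vq,
						     struct vhost_buf *buf)
	{
		vhost_put_used_buf(vq, buf);
		vhost_signal(dev, vq);
	}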
> 
> 
> > +		} else {
> >   			vhost_zerocopy_signal_used(net, vq);
> > +		}
> >   		vhost_net_tx_packet(net);
> >   	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
> >   }
> > @@ -1004,7 +1021,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
> >   	int len = peek_head_len(rnvq, sk);
> >   	if (!len && rvq->busyloop_timeout) {
> > -		/* Flush batched heads first */
> > +		/* Flush batched bufs first */
> >   		vhost_net_signal_used(rnvq);
> >   		/* Both tx vq and rx socket were polled here */
> >   		vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, true);
> > @@ -1022,11 +1039,11 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
> >    * @iovcount	- returned count of io vectors we fill
> >    * @log		- vhost log
> >    * @log_num	- log offset
> > - * @quota       - headcount quota, 1 for big buffer
> > - *	returns number of buffer heads allocated, negative on error
> > + * @quota       - bufcount quota, 1 for big buffer
> > + *	returns number of buffers allocated, negative on error
> >    */
> >   static int get_rx_bufs(struct vhost_virtqueue *vq,
> > -		       struct vring_used_elem *heads,
> > +		       struct vhost_buf *bufs,
> >   		       int datalen,
> >   		       unsigned *iovcount,
> >   		       struct vhost_log *log,
> > @@ -1035,30 +1052,24 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
> >   {
> >   	unsigned int out, in;
> >   	int seg = 0;
> > -	int headcount = 0;
> > -	unsigned d;
> > +	int bufcount = 0;
> >   	int r, nlogs = 0;
> >   	/* len is always initialized before use since we are always called with
> >   	 * datalen > 0.
> >   	 */
> >   	u32 uninitialized_var(len);
> > -	while (datalen > 0 && headcount < quota) {
> > +	while (datalen > 0 && bufcount < quota) {
> >   		if (unlikely(seg >= UIO_MAXIOV)) {
> >   			r = -ENOBUFS;
> >   			goto err;
> >   		}
> > -		r = vhost_get_vq_desc(vq, vq->iov + seg,
> > -				      ARRAY_SIZE(vq->iov) - seg, &out,
> > -				      &in, log, log_num);
> > -		if (unlikely(r < 0))
> > +		r = vhost_get_avail_buf(vq, bufs + bufcount, vq->iov + seg,
> > +					ARRAY_SIZE(vq->iov) - seg, &out,
> > +					&in, log, log_num);
> > +		if (unlikely(r <= 0))
> >   			goto err;
> > -		d = r;
> > -		if (d == vq->num) {
> > -			r = 0;
> > -			goto err;
> > -		}
> >   		if (unlikely(out || in <= 0)) {
> >   			vq_err(vq, "unexpected descriptor format for RX: "
> >   				"out %d, in %d\n", out, in);
> > @@ -1069,14 +1080,12 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
> >   			nlogs += *log_num;
> >   			log += *log_num;
> >   		}
> > -		heads[headcount].id = cpu_to_vhost32(vq, d);
> >   		len = iov_length(vq->iov + seg, in);
> > -		heads[headcount].len = cpu_to_vhost32(vq, len);
> >   		datalen -= len;
> > -		++headcount;
> > +		++bufcount;
> >   		seg += in;
> >   	}
> > -	heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
> > +	bufs[bufcount - 1].in_len = len + datalen;
> >   	*iovcount = seg;
> >   	if (unlikely(log))
> >   		*log_num = nlogs;
> > @@ -1086,9 +1095,9 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
> >   		r = UIO_MAXIOV + 1;
> >   		goto err;
> >   	}
> > -	return headcount;
> > +	return bufcount;
> >   err:
> > -	vhost_discard_vq_desc(vq, headcount);
> > +	vhost_discard_avail_bufs(vq, bufs, bufcount);
> >   	return r;
> >   }
> > @@ -1113,7 +1122,7 @@ static void handle_rx(struct vhost_net *net)
> >   	};
> >   	size_t total_len = 0;
> >   	int err, mergeable;
> > -	s16 headcount;
> > +	int bufcount;
> >   	size_t vhost_hlen, sock_hlen;
> >   	size_t vhost_len, sock_len;
> >   	bool busyloop_intr = false;
> > @@ -1147,14 +1156,14 @@ static void handle_rx(struct vhost_net *net)
> >   			break;
> >   		sock_len += sock_hlen;
> >   		vhost_len = sock_len + vhost_hlen;
> > -		headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
> > -					vhost_len, &in, vq_log, &log,
> > -					likely(mergeable) ? UIO_MAXIOV : 1);
> > +		bufcount = get_rx_bufs(vq, nvq->bufs + nvq->done_idx,
> > +				       vhost_len, &in, vq_log, &log,
> > +				       likely(mergeable) ? UIO_MAXIOV : 1);
> >   		/* On error, stop handling until the next kick. */
> > -		if (unlikely(headcount < 0))
> > +		if (unlikely(bufcount < 0))
> >   			goto out;
> >   		/* OK, now we need to know about added descriptors. */
> > -		if (!headcount) {
> > +		if (!bufcount) {
> >   			if (unlikely(busyloop_intr)) {
> >   				vhost_poll_queue(&vq->poll);
> >   			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> > @@ -1171,7 +1180,7 @@ static void handle_rx(struct vhost_net *net)
> >   		if (nvq->rx_ring)
> >   			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >   		/* On overrun, truncate and discard */
> > -		if (unlikely(headcount > UIO_MAXIOV)) {
> > +		if (unlikely(bufcount > UIO_MAXIOV)) {
> >   			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> >   			err = sock->ops->recvmsg(sock, &msg,
> >   						 1, MSG_DONTWAIT | MSG_TRUNC);
> > @@ -1195,7 +1204,7 @@ static void handle_rx(struct vhost_net *net)
> >   		if (unlikely(err != sock_len)) {
> >   			pr_debug("Discarded rx packet: "
> >   				 " len %d, expected %zd\n", err, sock_len);
> > -			vhost_discard_vq_desc(vq, headcount);
> > +			vhost_discard_avail_bufs(vq, nvq->bufs + nvq->done_idx, bufcount);
> >   			continue;
> >   		}
> >   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> > @@ -1214,15 +1223,15 @@ static void handle_rx(struct vhost_net *net)
> >   		}
> >   		/* TODO: Should check and handle checksum. */
> > -		num_buffers = cpu_to_vhost16(vq, headcount);
> > +		num_buffers = cpu_to_vhost16(vq, bufcount);
> >   		if (likely(mergeable) &&
> >   		    copy_to_iter(&num_buffers, sizeof num_buffers,
> >   				 &fixup) != sizeof num_buffers) {
> >   			vq_err(vq, "Failed num_buffers write");
> > -			vhost_discard_vq_desc(vq, headcount);
> > +			vhost_discard_avail_bufs(vq, nvq->bufs + nvq->done_idx, bufcount);
> >   			goto out;
> >   		}
> > -		nvq->done_idx += headcount;
> > +		nvq->done_idx += bufcount;
> >   		if (nvq->done_idx > VHOST_NET_BATCH)
> >   			vhost_net_signal_used(nvq);
> >   		if (unlikely(vq_log))


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 07/13] vhost: format-independent API for used buffers
  2020-06-04  9:03     ` Michael S. Tsirkin
@ 2020-06-04  9:18       ` Jason Wang
  2020-06-04 10:17         ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-04  9:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/4 5:03 PM, Michael S. Tsirkin wrote:
>>>    static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>>>    {
>>>    	__u16 old, new;
>>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>>> index a67bda9792ec..6c10e99ff334 100644
>>> --- a/drivers/vhost/vhost.h
>>> +++ b/drivers/vhost/vhost.h
>>> @@ -67,6 +67,13 @@ struct vhost_desc {
>>>    	u16 id;
>>>    };
>>> +struct vhost_buf {
>>> +	u32 out_len;
>>> +	u32 in_len;
>>> +	u16 descs;
>>> +	u16 id;
>>> +};
>> So it looks to me the struct vhost_buf can work for both split ring and
>> packed ring.
>>
>> If this is true, we'd better make struct vhost_desc work for both.
>>
>> Thanks
> Both vhost_desc and vhost_buf can work for split and packed.
>
> Do you mean we should add packed ring support based on this?
> For sure, this is one of the motivators for the patchset.
>

Sort of. But the reason I ask is that I see the "split" suffix used in
patch 1:

peek_split_desc()
pop_split_desc()
push_split_desc()

But that suffix is not used for the new used ring API introduced in this
patch.

Thanks



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 07/13] vhost: format-independent API for used buffers
  2020-06-04  9:18       ` Jason Wang
@ 2020-06-04 10:17         ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-04 10:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Thu, Jun 04, 2020 at 05:18:00PM +0800, Jason Wang wrote:
> 
> On 2020/6/4 5:03 PM, Michael S. Tsirkin wrote:
> > > >    static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> > > >    {
> > > >    	__u16 old, new;
> > > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > > index a67bda9792ec..6c10e99ff334 100644
> > > > --- a/drivers/vhost/vhost.h
> > > > +++ b/drivers/vhost/vhost.h
> > > > @@ -67,6 +67,13 @@ struct vhost_desc {
> > > >    	u16 id;
> > > >    };
> > > > +struct vhost_buf {
> > > > +	u32 out_len;
> > > > +	u32 in_len;
> > > > +	u16 descs;
> > > > +	u16 id;
> > > > +};
> > > So it looks to me the struct vhost_buf can work for both split ring and
> > > packed ring.
> > > 
> > > If this is true, we'd better make struct vhost_desc work for both.
> > > 
> > > Thanks
> > Both vhost_desc and vhost_buf can work for split and packed.
> > 
> > Do you mean we should add packed ring support based on this?
> > For sure, this is one of the motivators for the patchset.
> > 
> 
> Sort of. But the reason I ask is that I see the "split" suffix used in
> patch 1:
> 
> peek_split_desc()
> pop_split_desc()
> push_split_desc()
> 
> But that suffix is not used for the new used ring API introduced in this
> patch.
> 
> Thanks
> 

And that is intentional: "split" is *not* part of the API. The whole idea is
that the ring APIs are format agnostic, using the "buffer" terminology from
the spec. The split helpers are all static within vhost.c.

So where I had to add a bunch of new format-specific code, it was tagged
as "split" to make it easier to spot that it only supports a specific
format. At the same time, I did not rename existing code to add "split"
to its name. I agree it's a useful additional step for packed ring
format support, and it's fairly easy. I just didn't want to do it
automatically.
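To illustrate how packed support could later slot in without leaking the
format to callers, the internal fetch path could dispatch on the negotiated
feature bit. This is only a sketch: fetch_split_buf()/fetch_packed_buf() are
made-up names, nothing like this exists in the series yet.

	static int fetch_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf)
	{
		/* Callers only ever see struct vhost_buf; the ring layout
		 * stays private to vhost.c. */
		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
			return fetch_packed_buf(vq, buf);	/* hypothetical */
		return fetch_split_buf(vq, buf);		/* hypothetical */
	}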



-- 
MST


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/13] vhost: batching fetches
  2020-06-04  8:59     ` Michael S. Tsirkin
@ 2020-06-05  3:40       ` Jason Wang
  2020-06-07 13:57         ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-05  3:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/4 4:59 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 03, 2020 at 03:27:39PM +0800, Jason Wang wrote:
>> On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
>>> With this patch applied, new and old code perform identically.
>>>
>>> Lots of extra optimizations are now possible, e.g.
>>> we can fetch multiple heads with copy_from/to_user now.
>>> We can get rid of maintaining the log array.  Etc etc.
>>>
>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>>> Link:https://lore.kernel.org/r/20200401183118.8334-4-eperezma@redhat.com
>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>> ---
>>>    drivers/vhost/test.c  |  2 +-
>>>    drivers/vhost/vhost.c | 47 ++++++++++++++++++++++++++++++++++++++-----
>>>    drivers/vhost/vhost.h |  5 ++++-
>>>    3 files changed, 47 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
>>> index 9a3a09005e03..02806d6f84ef 100644
>>> --- a/drivers/vhost/test.c
>>> +++ b/drivers/vhost/test.c
>>> @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
>>>    	dev = &n->dev;
>>>    	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
>>>    	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
>>> -	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
>>> +	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
>>>    		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, NULL);
>>>    	f->private_data = n;
>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>> index 8f9a07282625..aca2a5b0d078 100644
>>> --- a/drivers/vhost/vhost.c
>>> +++ b/drivers/vhost/vhost.c
>>> @@ -299,6 +299,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>>>    {
>>>    	vq->num = 1;
>>>    	vq->ndescs = 0;
>>> +	vq->first_desc = 0;
>>>    	vq->desc = NULL;
>>>    	vq->avail = NULL;
>>>    	vq->used = NULL;
>>> @@ -367,6 +368,11 @@ static int vhost_worker(void *data)
>>>    	return 0;
>>>    }
>>> +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
>>> +{
>>> +	return vq->max_descs - UIO_MAXIOV;
>>> +}
>> 1 descriptor does not mean 1 iov, e.g userspace may pass several 1 byte
>> length memory regions for us to translate.
>>
> Yes but I don't see the relevance. This tells us how many descriptors to
> batch, not how many IOVs.


Yes, but the questions are:

- this introduces another obstacle to supporting a queue size of more than 1K
- if we support a 1K queue size, does it mean we need to cache 1K
descriptors, which seems like a large stress on the cache?

Thanks


>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 11/13] vhost/scsi: switch to buf APIs
  2020-06-02 13:06 ` [PATCH RFC 11/13] vhost/scsi: switch to buf APIs Michael S. Tsirkin
@ 2020-06-05  8:36   ` Stefan Hajnoczi
  0 siblings, 0 replies; 35+ messages in thread
From: Stefan Hajnoczi @ 2020-06-05  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, netdev, virtualization, Eugenio Pérez,
	Stefan Hajnoczi, Paolo Bonzini


On Tue, Jun 02, 2020 at 09:06:20AM -0400, Michael S. Tsirkin wrote:
> Switch to buf APIs. Doing this exposes a spec violation in vhost scsi:
> all used bufs are marked with length 0.
> Fix that is left for another day.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/vhost/scsi.c | 73 ++++++++++++++++++++++++++------------------
>  1 file changed, 44 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index c39952243fd3..c426c4e899c7 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -71,8 +71,8 @@ struct vhost_scsi_inflight {
>  };
>  
>  struct vhost_scsi_cmd {
> -	/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
> -	int tvc_vq_desc;
> +	/* Descriptor from vhost_get_avail_buf() for virt_queue segment */
> +	struct vhost_buf tvc_vq_desc;
>  	/* virtio-scsi initiator task attribute */
>  	int tvc_task_attr;
>  	/* virtio-scsi response incoming iovecs */
> @@ -213,7 +213,7 @@ struct vhost_scsi {
>   * Context for processing request and control queue operations.
>   */
>  struct vhost_scsi_ctx {
> -	int head;
> +	struct vhost_buf buf;
>  	unsigned int out, in;
>  	size_t req_size, rsp_size;
>  	size_t out_size, in_size;
> @@ -443,6 +443,20 @@ static int vhost_scsi_check_stop_free(struct se_cmd *se_cmd)
>  	return target_put_sess_cmd(se_cmd);
>  }
>  
> +/* Signal to guest that request finished with no input buffer. */
> +/* TODO calling this when writing into buffer and most likely a bug */
> +static void vhost_scsi_signal_noinput(struct vhost_dev *vdev,
> +				      struct vhost_virtqueue *vq,
> +				      struct vhost_buf *bufp)
> +{
> +	struct vhost_buf buf = *bufp;
> +
> +	buf.in_len = 0;
> +	vhost_put_used_buf(vq, &buf);

Yes, this behavior differs from the QEMU virtio-scsi device
implementation. I think it's just a quirk that is probably my fault (I
guess I thought the length information is already encoded in the payload
SCSI headers so we have no use for the used descriptor length field).

Whether it's worth changing now is an interesting question. In theory
it would make vhost-scsi more spec compliant and guest drivers might be
happier (especially drivers for niche OSes that were only tested against
QEMU's virtio-scsi). On the other hand, it's a guest-visible change that
could break similar niche drivers that assume length is always 0.

I'd leave it as-is unless people hit issues that justify the risk of
changing it.
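For reference, a spec-compliant variant would report how many bytes were
actually written into the guest-writable buffers instead of hard-coding 0.
A minimal sketch (hypothetical helper name; the series deliberately keeps
the old behaviour):

	static void vhost_scsi_signal_written(struct vhost_virtqueue *vq,
					      struct vhost_buf *bufp, u32 written)
	{
		struct vhost_buf buf = *bufp;

		/* The used length is the number of bytes the device wrote. */
		buf.in_len = written;
		vhost_put_used_buf(vq, &buf);
	}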

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 12/13] vhost/vsock: switch to the buf API
  2020-06-02 13:06 ` [PATCH RFC 12/13] vhost/vsock: switch to the buf API Michael S. Tsirkin
@ 2020-06-05  8:36   ` Stefan Hajnoczi
  0 siblings, 0 replies; 35+ messages in thread
From: Stefan Hajnoczi @ 2020-06-05  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, netdev, virtualization, Eugenio Pérez,
	Stefan Hajnoczi


On Tue, Jun 02, 2020 at 09:06:22AM -0400, Michael S. Tsirkin wrote:
> A straight-forward conversion.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/vhost/vsock.c | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/13] vhost: batching fetches
  2020-06-05  3:40       ` Jason Wang
@ 2020-06-07 13:57         ` Michael S. Tsirkin
  2020-06-08  3:35           ` Jason Wang
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-07 13:57 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Fri, Jun 05, 2020 at 11:40:17AM +0800, Jason Wang wrote:
> 
> On 2020/6/4 4:59 PM, Michael S. Tsirkin wrote:
> > On Wed, Jun 03, 2020 at 03:27:39PM +0800, Jason Wang wrote:
> > > On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> > > > With this patch applied, new and old code perform identically.
> > > > 
> > > > Lots of extra optimizations are now possible, e.g.
> > > > we can fetch multiple heads with copy_from/to_user now.
> > > > We can get rid of maintaining the log array.  Etc etc.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > > Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> > > > Link:https://lore.kernel.org/r/20200401183118.8334-4-eperezma@redhat.com
> > > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > > ---
> > > >    drivers/vhost/test.c  |  2 +-
> > > >    drivers/vhost/vhost.c | 47 ++++++++++++++++++++++++++++++++++++++-----
> > > >    drivers/vhost/vhost.h |  5 ++++-
> > > >    3 files changed, 47 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> > > > index 9a3a09005e03..02806d6f84ef 100644
> > > > --- a/drivers/vhost/test.c
> > > > +++ b/drivers/vhost/test.c
> > > > @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
> > > >    	dev = &n->dev;
> > > >    	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
> > > >    	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> > > > -	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> > > > +	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
> > > >    		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, NULL);
> > > >    	f->private_data = n;
> > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > index 8f9a07282625..aca2a5b0d078 100644
> > > > --- a/drivers/vhost/vhost.c
> > > > +++ b/drivers/vhost/vhost.c
> > > > @@ -299,6 +299,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > > >    {
> > > >    	vq->num = 1;
> > > >    	vq->ndescs = 0;
> > > > +	vq->first_desc = 0;
> > > >    	vq->desc = NULL;
> > > >    	vq->avail = NULL;
> > > >    	vq->used = NULL;
> > > > @@ -367,6 +368,11 @@ static int vhost_worker(void *data)
> > > >    	return 0;
> > > >    }
> > > > +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> > > > +{
> > > > +	return vq->max_descs - UIO_MAXIOV;
> > > > +}
> > > 1 descriptor does not mean 1 iov, e.g userspace may pass several 1 byte
> > > length memory regions for us to translate.
> > > 
> > Yes but I don't see the relevance. This tells us how many descriptors to
> > batch, not how many IOVs.
> 
> 
> Yes, but the questions are:
> 
> - this introduces another obstacle to supporting a queue size of more than 1K
> - if we support a 1K queue size, does it mean we need to cache 1K descriptors,
> which seems like a large stress on the cache?
> 
> Thanks
> 
> 
> > 

Still don't understand the relevance. We support up to 1K descriptors
per buffer just for IOV since we always did. This adds 64 more
descriptors - is that a big deal?
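Concretely, assuming UIO_MAXIOV is 1024 as on current kernels, the numbers
work out like this (just the arithmetic, not new code):

	max_descs    = UIO_MAXIOV + 64             = 1088
	batch window = max_descs - UIO_MAXIOV      = 64 descriptors per fetch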


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct
  2020-06-03 12:04       ` Jason Wang
@ 2020-06-07 13:59         ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-07 13:59 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Wed, Jun 03, 2020 at 08:04:45PM +0800, Jason Wang wrote:
> 
> On 2020/6/3 5:48 PM, Michael S. Tsirkin wrote:
> > On Wed, Jun 03, 2020 at 03:13:56PM +0800, Jason Wang wrote:
> > > On 2020/6/2 9:05 PM, Michael S. Tsirkin wrote:
> 
> 
> [...]
> 
> 
> > > > +
> > > > +static int fetch_indirect_descs(struct vhost_virtqueue *vq,
> > > > +				struct vhost_desc *indirect,
> > > > +				u16 head)
> > > > +{
> > > > +	struct vring_desc desc;
> > > > +	unsigned int i = 0, count, found = 0;
> > > > +	u32 len = indirect->len;
> > > > +	struct iov_iter from;
> > > > +	int ret;
> > > > +
> > > > +	/* Sanity check */
> > > > +	if (unlikely(len % sizeof desc)) {
> > > > +		vq_err(vq, "Invalid length in indirect descriptor: "
> > > > +		       "len 0x%llx not multiple of 0x%zx\n",
> > > > +		       (unsigned long long)len,
> > > > +		       sizeof desc);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +	ret = translate_desc(vq, indirect->addr, len, vq->indirect,
> > > > +			     UIO_MAXIOV, VHOST_ACCESS_RO);
> > > > +	if (unlikely(ret < 0)) {
> > > > +		if (ret != -EAGAIN)
> > > > +			vq_err(vq, "Translation failure %d in indirect.\n", ret);
> > > > +		return ret;
> > > > +	}
> > > > +	iov_iter_init(&from, READ, vq->indirect, ret, len);
> > > > +
> > > > +	/* We will use the result as an address to read from, so most
> > > > +	 * architectures only need a compiler barrier here. */
> > > > +	read_barrier_depends();
> > > > +
> > > > +	count = len / sizeof desc;
> > > > +	/* Buffers are chained via a 16 bit next field, so
> > > > +	 * we can have at most 2^16 of these. */
> > > > +	if (unlikely(count > USHRT_MAX + 1)) {
> > > > +		vq_err(vq, "Indirect buffer length too big: %d\n",
> > > > +		       indirect->len);
> > > > +		return -E2BIG;
> > > > +	}
> > > > +	if (unlikely(vq->ndescs + count > vq->max_descs)) {
> > > > +		vq_err(vq, "Too many indirect + direct descs: %d + %d\n",
> > > > +		       vq->ndescs, indirect->len);
> > > > +		return -E2BIG;
> > > > +	}
> > > > +
> > > > +	do {
> > > > +		if (unlikely(++found > count)) {
> > > > +			vq_err(vq, "Loop detected: last one at %u "
> > > > +			       "indirect size %u\n",
> > > > +			       i, count);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +		if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
> > > > +			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
> > > > +			       i, (size_t)indirect->addr + i * sizeof desc);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +		if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
> > > > +			vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
> > > > +			       i, (size_t)indirect->addr + i * sizeof desc);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +
> > > > +		push_split_desc(vq, &desc, head);
> > > 
> > > The error is ignored.
> > See above:
> > 
> >       	if (unlikely(vq->ndescs + count > vq->max_descs))
> > 
> > So it can't fail here, we never fetch unless there's space.
> > 
> > I guess we can add a WARN_ON here.
> 
> 
> Yes.
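Something along these lines in fetch_indirect_descs() would make that
invariant explicit (untested sketch; the posted patch just drops the return
value):

	/* The capacity check above guarantees there is room, so a failure
	 * here would be a vhost bug worth flagging loudly. */
	ret = push_split_desc(vq, &desc, head);
	WARN_ON_ONCE(ret);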
> 
> 
> > 
> > > > +	} while ((i = next_desc(vq, &desc)) != -1);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int fetch_descs(struct vhost_virtqueue *vq)
> > > > +{
> > > > +	unsigned int i, head, found = 0;
> > > > +	struct vhost_desc *last;
> > > > +	struct vring_desc desc;
> > > > +	__virtio16 avail_idx;
> > > > +	__virtio16 ring_head;
> > > > +	u16 last_avail_idx;
> > > > +	int ret;
> > > > +
> > > > +	/* Check it isn't doing very strange things with descriptor numbers. */
> > > > +	last_avail_idx = vq->last_avail_idx;
> > > > +
> > > > +	if (vq->avail_idx == vq->last_avail_idx) {
> > > > +		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> > > > +			vq_err(vq, "Failed to access avail idx at %p\n",
> > > > +				&vq->avail->idx);
> > > > +			return -EFAULT;
> > > > +		}
> > > > +		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > > > +
> > > > +		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> > > > +			vq_err(vq, "Guest moved used index from %u to %u",
> > > > +				last_avail_idx, vq->avail_idx);
> > > > +			return -EFAULT;
> > > > +		}
> > > > +
> > > > +		/* If there's nothing new since last we looked, return
> > > > +		 * invalid.
> > > > +		 */
> > > > +		if (vq->avail_idx == last_avail_idx)
> > > > +			return vq->num;
> > > > +
> > > > +		/* Only get avail ring entries after they have been
> > > > +		 * exposed by guest.
> > > > +		 */
> > > > +		smp_rmb();
> > > > +	}
> > > > +
> > > > +	/* Grab the next descriptor number they're advertising */
> > > > +	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> > > > +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> > > > +		       last_avail_idx,
> > > > +		       &vq->avail->ring[last_avail_idx % vq->num]);
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	head = vhost16_to_cpu(vq, ring_head);
> > > > +
> > > > +	/* If their number is silly, that's an error. */
> > > > +	if (unlikely(head >= vq->num)) {
> > > > +		vq_err(vq, "Guest says index %u > %u is available",
> > > > +		       head, vq->num);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +	i = head;
> > > > +	do {
> > > > +		if (unlikely(i >= vq->num)) {
> > > > +			vq_err(vq, "Desc index is %u > %u, head = %u",
> > > > +			       i, vq->num, head);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +		if (unlikely(++found > vq->num)) {
> > > > +			vq_err(vq, "Loop detected: last one at %u "
> > > > +			       "vq size %u head %u\n",
> > > > +			       i, vq->num, head);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +		ret = vhost_get_desc(vq, &desc, i);
> > > > +		if (unlikely(ret)) {
> > > > +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> > > > +			       i, vq->desc + i);
> > > > +			return -EFAULT;
> > > > +		}
> > > > +		ret = push_split_desc(vq, &desc, head);
> > > > +		if (unlikely(ret)) {
> > > > +			vq_err(vq, "Failed to save descriptor: idx %d\n", i);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +	} while ((i = next_desc(vq, &desc)) != -1);
> > > > +
> > > > +	last = peek_split_desc(vq);
> > > > +	if (unlikely(last->flags & VRING_DESC_F_INDIRECT)) {
> > > > +		pop_split_desc(vq);
> > > > +		ret = fetch_indirect_descs(vq, last, head);
> > > 
> > > Note that this means we don't support chained indirect descriptors, which
> > > complies with the spec, but we do support this in vhost_get_vq_desc().
> > Well the spec says:
> > 	A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT in flags.
> > 
> > Did I miss anything?
> > 
> 
> No, but I meant that the current vhost_get_vq_desc() supports chained
> indirect descriptors. Not sure if there's an application that silently
> depends on this.
> 
> Thanks
> 

I don't think we need to worry about that unless this actually
surfaces.

-- 
MST


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/13] vhost: batching fetches
  2020-06-07 13:57         ` Michael S. Tsirkin
@ 2020-06-08  3:35           ` Jason Wang
  2020-06-08  6:01             ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Wang @ 2020-06-08  3:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev


On 2020/6/7 9:57 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 05, 2020 at 11:40:17AM +0800, Jason Wang wrote:
>> On 2020/6/4 4:59 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 03, 2020 at 03:27:39PM +0800, Jason Wang wrote:
>>>> On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
>>>>> With this patch applied, new and old code perform identically.
>>>>>
>>>>> Lots of extra optimizations are now possible, e.g.
>>>>> we can fetch multiple heads with copy_from/to_user now.
>>>>> We can get rid of maintaining the log array.  Etc etc.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>>>>> Link:https://lore.kernel.org/r/20200401183118.8334-4-eperezma@redhat.com
>>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>>> ---
>>>>>     drivers/vhost/test.c  |  2 +-
>>>>>     drivers/vhost/vhost.c | 47 ++++++++++++++++++++++++++++++++++++++-----
>>>>>     drivers/vhost/vhost.h |  5 ++++-
>>>>>     3 files changed, 47 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
>>>>> index 9a3a09005e03..02806d6f84ef 100644
>>>>> --- a/drivers/vhost/test.c
>>>>> +++ b/drivers/vhost/test.c
>>>>> @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
>>>>>     	dev = &n->dev;
>>>>>     	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
>>>>>     	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
>>>>> -	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
>>>>> +	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
>>>>>     		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, NULL);
>>>>>     	f->private_data = n;
>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>> index 8f9a07282625..aca2a5b0d078 100644
>>>>> --- a/drivers/vhost/vhost.c
>>>>> +++ b/drivers/vhost/vhost.c
>>>>> @@ -299,6 +299,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>>>>>     {
>>>>>     	vq->num = 1;
>>>>>     	vq->ndescs = 0;
>>>>> +	vq->first_desc = 0;
>>>>>     	vq->desc = NULL;
>>>>>     	vq->avail = NULL;
>>>>>     	vq->used = NULL;
>>>>> @@ -367,6 +368,11 @@ static int vhost_worker(void *data)
>>>>>     	return 0;
>>>>>     }
>>>>> +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
>>>>> +{
>>>>> +	return vq->max_descs - UIO_MAXIOV;
>>>>> +}
>>>> 1 descriptor does not mean 1 iov, e.g userspace may pass several 1 byte
>>>> length memory regions for us to translate.
>>>>
>>> Yes but I don't see the relevance. This tells us how many descriptors to
>>> batch, not how many IOVs.
>> Yes, but the questions are:
>>
>> - this introduces another obstacle to supporting a queue size of more than 1K
>> - if we support a 1K queue size, does it mean we need to cache 1K descriptors,
>> which seems like a large stress on the cache?
>>
>> Thanks
>>
>>
> Still don't understand the relevance. We support up to 1K descriptors
> per buffer just for IOV since we always did. This adds 64 more
> descriptors - is that a big deal?


If I understand correctly, for net, the code tries to batch
descriptors for at least one packet.

If we allow a 1K queue size then we allow a packet that consists of 1K
descriptors, so we would need to cache 1K descriptors.
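To put a rough number on it (back-of-the-envelope, assuming struct vhost_desc
stays around 16 bytes):

	1024 descriptors * 16 bytes = 16 KiB of cached descriptors per virtqueue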

Thanks


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/13] vhost: batching fetches
  2020-06-08  3:35           ` Jason Wang
@ 2020-06-08  6:01             ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2020-06-08  6:01 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, Eugenio Pérez, kvm, virtualization, netdev

On Mon, Jun 08, 2020 at 11:35:40AM +0800, Jason Wang wrote:
> 
> On 2020/6/7 9:57 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 05, 2020 at 11:40:17AM +0800, Jason Wang wrote:
> > > On 2020/6/4 4:59 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 03, 2020 at 03:27:39PM +0800, Jason Wang wrote:
> > > > > On 2020/6/2 9:06 PM, Michael S. Tsirkin wrote:
> > > > > > With this patch applied, new and old code perform identically.
> > > > > > 
> > > > > > Lots of extra optimizations are now possible, e.g.
> > > > > > we can fetch multiple heads with copy_from/to_user now.
> > > > > > We can get rid of maintaining the log array.  Etc etc.
> > > > > > 
> > > > > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > > > > Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> > > > > > Link:https://lore.kernel.org/r/20200401183118.8334-4-eperezma@redhat.com
> > > > > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > > > > ---
> > > > > >     drivers/vhost/test.c  |  2 +-
> > > > > >     drivers/vhost/vhost.c | 47 ++++++++++++++++++++++++++++++++++++++-----
> > > > > >     drivers/vhost/vhost.h |  5 ++++-
> > > > > >     3 files changed, 47 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> > > > > > index 9a3a09005e03..02806d6f84ef 100644
> > > > > > --- a/drivers/vhost/test.c
> > > > > > +++ b/drivers/vhost/test.c
> > > > > > @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
> > > > > >     	dev = &n->dev;
> > > > > >     	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
> > > > > >     	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> > > > > > -	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> > > > > > +	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
> > > > > >     		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, NULL);
> > > > > >     	f->private_data = n;
> > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > index 8f9a07282625..aca2a5b0d078 100644
> > > > > > --- a/drivers/vhost/vhost.c
> > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > @@ -299,6 +299,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > > > > >     {
> > > > > >     	vq->num = 1;
> > > > > >     	vq->ndescs = 0;
> > > > > > +	vq->first_desc = 0;
> > > > > >     	vq->desc = NULL;
> > > > > >     	vq->avail = NULL;
> > > > > >     	vq->used = NULL;
> > > > > > @@ -367,6 +368,11 @@ static int vhost_worker(void *data)
> > > > > >     	return 0;
> > > > > >     }
> > > > > > +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> > > > > > +{
> > > > > > +	return vq->max_descs - UIO_MAXIOV;
> > > > > > +}
> > > > > 1 descriptor does not mean 1 iov, e.g userspace may pass several 1 byte
> > > > > length memory regions for us to translate.
> > > > > 
> > > > Yes but I don't see the relevance. This tells us how many descriptors to
> > > > batch, not how many IOVs.
> > > Yes, but the questions are:
> > > 
> > > - this introduces another obstacle to supporting a queue size of more than 1K
> > > - if we support a 1K queue size, does it mean we need to cache 1K descriptors,
> > > which seems like a large stress on the cache?
> > > 
> > > Thanks
> > > 
> > > 
> > Still don't understand the relevance. We support up to 1K descriptors
> > per buffer just for IOV since we always did. This adds 64 more
> > descriptors - is that a big deal?
> 
> 
> If I understand correctly, for net, the code tries to batch descriptors
> for at least one packet.
> 
> If we allow a 1K queue size then we allow a packet that consists of 1K
> descriptors, so we would need to cache 1K descriptors.
> 
> Thanks

That case is already so pathological that I am not at all worried about
it performing well.

-- 
MST


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-06-08  6:02 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
2020-06-02 13:05 [PATCH RFC 00/13] vhost: format independence Michael S. Tsirkin
2020-06-02 13:05 ` [PATCH RFC 01/13] vhost: option to fetch descriptors through an independent struct Michael S. Tsirkin
2020-06-03  7:13   ` Jason Wang
2020-06-03  9:48     ` Michael S. Tsirkin
2020-06-03 12:04       ` Jason Wang
2020-06-07 13:59         ` Michael S. Tsirkin
2020-06-02 13:05 ` [PATCH RFC 02/13] vhost: use batched version by default Michael S. Tsirkin
2020-06-03  7:15   ` Jason Wang
2020-06-02 13:06 ` [PATCH RFC 03/13] vhost: batching fetches Michael S. Tsirkin
2020-06-03  7:27   ` Jason Wang
2020-06-04  8:59     ` Michael S. Tsirkin
2020-06-05  3:40       ` Jason Wang
2020-06-07 13:57         ` Michael S. Tsirkin
2020-06-08  3:35           ` Jason Wang
2020-06-08  6:01             ` Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 04/13] vhost: cleanup fetch_buf return code handling Michael S. Tsirkin
2020-06-03  7:29   ` Jason Wang
2020-06-04  9:01     ` Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 05/13] vhost/net: pass net specific struct pointer Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 06/13] vhost: reorder functions Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 07/13] vhost: format-independent API for used buffers Michael S. Tsirkin
2020-06-03  7:58   ` Jason Wang
2020-06-04  9:03     ` Michael S. Tsirkin
2020-06-04  9:18       ` Jason Wang
2020-06-04 10:17         ` Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 08/13] vhost/net: convert to new API: heads->bufs Michael S. Tsirkin
2020-06-03  8:11   ` Jason Wang
2020-06-04  9:05     ` Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 09/13] vhost/net: avoid iov length math Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 10/13] vhost/test: convert to the buf API Michael S. Tsirkin
2020-06-02 13:06 ` [PATCH RFC 11/13] vhost/scsi: switch to buf APIs Michael S. Tsirkin
2020-06-05  8:36   ` Stefan Hajnoczi
2020-06-02 13:06 ` [PATCH RFC 12/13] vhost/vsock: switch to the buf API Michael S. Tsirkin
2020-06-05  8:36   ` Stefan Hajnoczi
2020-06-02 13:06 ` [PATCH RFC 13/13] vhost: drop head based APIs Michael S. Tsirkin
