* [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
@ 2012-12-18 12:32 ` Paolo Bonzini
  0 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 12:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, gaowanlong, hutao, linux-scsi, virtualization, mst, rusty,
	asias, stefanha, nab

Hi all,

this series adds multiqueue support to the virtio-scsi driver, based
on Jason Wang's work on virtio-net.  It uses a simple queue steering
algorithm that expects one queue per CPU.  LUNs in the same target always
use the same queue (so that commands are not reordered); queue switching
occurs when the request being queued is the only one for the target.
Also based on Jason's patches, the virtqueue affinity is set so that
each CPU is associated with one virtqueue.
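
In rough pseudo-C, the steering rule boils down to the following (a simplified
sketch: the real code in patch 5 takes a per-target lock around this and adds
the memory barriers it needs; tgt and vscsi are the per-target and per-adapter
state used there):

	/* Target was idle: any queue will do, pick the current CPU's. */
	if (atomic_inc_return(&tgt->reqs) == 1)
		tgt->req_vq = &vscsi->req_vqs[smp_processor_id() % vscsi->num_queues];
	/* Otherwise keep tgt->req_vq, so commands to the target stay in FIFO order. */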

I tested the patches with fio, using up to 32 virtio-scsi disks backed
by tmpfs on the host.  These numbers are with 1 LUN per target.

FIO configuration
-----------------
[global]
rw=read
bsrange=4k-64k
ioengine=libaio
direct=1
iodepth=4
loops=20
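
The per-disk job sections are not shown; with 1 LUN per target each job simply
points at one of the disks, roughly as follows (the device names are only an
example and assume the virtio-scsi disks appear in the guest as /dev/sdb,
/dev/sdc, and so on):

	[disk0]
	filename=/dev/sdb

	[disk1]
	filename=/dev/sdc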

overall bandwidth (MB/s)
------------------------

# of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
1                  540               626                     599
2                  795               965                     925
4                  997              1376                    1500
8                 1136              2130                    2060
16                1440              2269                    2474
24                1408              2179                    2436
32                1515              1978                    2319

(These numbers for single-queue are with 4 VCPUs, but the impact of adding
more VCPUs is very limited).

avg bandwidth per LUN (MB/s)
----------------------------

# of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
1                  540               626                     599
2                  397               482                     462
4                  249               344                     375
8                  142               266                     257
16                  90               141                     154
24                  58                90                     101
32                  47                61                      72

Patch 1 adds a new API for piecewise addition of buffers, which enables
various simplifications in virtio-scsi (patches 2-3) and a small performance
improvement of 2-6%.  Patches 4 and 5 add multiqueuing.

I'm mostly looking for comments on the new API of patch 1, for inclusion
in the 3.9 kernel.

Thanks to Wanlong Gao for help rebasing and benchmarking these patches.

Paolo Bonzini (5):
  virtio: add functions for piecewise addition of buffers
  virtio-scsi: use functions for piecewise composition of buffers
  virtio-scsi: redo allocation of target data
  virtio-scsi: pass struct virtio_scsi to virtqueue completion function
  virtio-scsi: introduce multiqueue support

 drivers/scsi/virtio_scsi.c   |  374 +++++++++++++++++++++++++++++-------------
 drivers/virtio/virtio_ring.c |  205 ++++++++++++++++++++++++
 include/linux/virtio.h       |   21 +++
 3 files changed, 485 insertions(+), 115 deletions(-)


* [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 12:32 ` Paolo Bonzini
@ 2012-12-18 12:32   ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 12:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, gaowanlong, hutao, linux-scsi, virtualization, mst, rusty,
	asias, stefanha, nab

The virtqueue_add_buf function has two limitations:

1) it requires the caller to provide all the buffers in a single call;

2) it does not support chained scatterlists: the buffers must be
provided as an array of struct scatterlist.

Because of these limitations, virtio-scsi has to copy each request into
a scatterlist internal to the driver.  It cannot just use the one that
was prepared by the upper SCSI layers.

This patch adds a different set of APIs for adding a buffer to a virtqueue.
The new API lets you pass the buffers piecewise, wrapping multiple calls
to virtqueue_add_sg between virtqueue_start_buf and virtqueue_end_buf.
virtio-scsi can then call virtqueue_add_sg three or four times: for the request
header, for the write buffer (if present), for the response header, and
finally for the read buffer (again if present).  It saves the copying
and the related locking.

Note that this API is not needed in virtio-blk, because it does all the
work of the upper SCSI layers itself in the blk_rq_map_sg call.  Then
it simply hands the resulting scatterlist to virtqueue_add_buf.
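
For illustration, a driver that queues a request header followed by a data-out
scatterlist would use the new API roughly as follows.  This is a minimal
sketch, not taken from the patch: vq, req, table and token stand in for the
caller's own virtqueue, request structure, sg_table and completion token.

	struct virtqueue_buf vbuf;
	struct scatterlist hdr;
	int err;

	sg_init_one(&hdr, &req, sizeof(req));

	/* Two sg lists, 1 + table->nents descriptors in total. */
	err = virtqueue_start_buf(vq, &vbuf, token,
				  1 + table->nents, 2, GFP_ATOMIC);
	if (err < 0)
		return err;

	virtqueue_add_sg(&vbuf, &hdr, 1, DMA_TO_DEVICE);
	virtqueue_add_sg(&vbuf, table->sgl, table->nents, DMA_TO_DEVICE);
	virtqueue_end_buf(&vbuf);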

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	v1->v2: new

 drivers/virtio/virtio_ring.c |  205 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio.h       |   21 ++++
 2 files changed, 226 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index ffd7e7d..ccfa97c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -394,6 +394,211 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 	vq->vq.num_free++;
 }
 
+/**
+ * virtqueue_start_buf - start building buffer for the other end
+ * @vq: the struct virtqueue we're talking about.
+ * @buf: a struct keeping the state of the buffer
+ * @data: the token identifying the buffer.
+ * @count: the number of buffers that will be added
+ * @count_sg: the number of sg lists that will be added
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted), and that a successful call is
+ * followed by one or more calls to virtqueue_add_sg, and finally a call
+ * to virtqueue_end_buf.
+ *
+ * Returns zero or a negative error (ie. ENOSPC).
+ */
+int virtqueue_start_buf(struct virtqueue *_vq,
+			struct virtqueue_buf *buf,
+			void *data,
+			unsigned int count,
+			unsigned int count_sg,
+			gfp_t gfp)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_desc *desc = NULL;
+	int head;
+	int ret = -ENOMEM;
+
+	START_USE(vq);
+
+	BUG_ON(data == NULL);
+
+#ifdef DEBUG
+	{
+		ktime_t now = ktime_get();
+
+		/* No kick or get, with .1 second between?  Warn. */
+		if (vq->last_add_time_valid)
+			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+					    > 100);
+		vq->last_add_time = now;
+		vq->last_add_time_valid = true;
+	}
+#endif
+
+	BUG_ON(count < count_sg);
+	BUG_ON(count_sg == 0);
+
+	/* If the host supports indirect descriptor tables, and there is
+	 * no space for direct buffers or there are multi-item scatterlists,
+	 * go indirect.
+	 */
+	head = vq->free_head;
+	if (vq->indirect && (count > count_sg || vq->vq.num_free < count)) {
+		if (vq->vq.num_free == 0)
+			goto no_space;
+
+		desc = kmalloc(count * sizeof(struct vring_desc), gfp);
+		if (!desc)
+			goto error;
+
+		/* We're about to use a buffer */
+		vq->vq.num_free--;
+
+		/* Use a single buffer which doesn't continue */
+		vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
+		vq->vring.desc[head].addr = virt_to_phys(desc);
+		vq->vring.desc[head].len = count * sizeof(struct vring_desc);
+
+		/* Update free pointer */
+		vq->free_head = vq->vring.desc[head].next;
+	}
+
+	/* Set token. */
+	vq->data[head] = data;
+
+	pr_debug("Started buffer head %i for %p\n", head, vq);
+
+	buf->vq = _vq;
+	buf->indirect = desc;
+	buf->tail = NULL;
+	buf->head = head;
+	return 0;
+
+no_space:
+	ret = -ENOSPC;
+error:
+	pr_debug("Can't add buf (%d) - count = %i, avail = %i\n",
+		 ret, count, vq->vq.num_free);
+	END_USE(vq);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtqueue_start_buf);
+
+/**
+ * virtqueue_add_sg - add sglist to buffer
+ * @buf: the struct that was passed to virtqueue_start_buf
+ * @sgl: the description of the buffer(s).
+ * @count: the number of items to process in sgl
+ * @dir: whether the sgl is read or written (DMA_TO_DEVICE/DMA_FROM_DEVICE only)
+ *
+ * Note that, unlike virtqueue_add_buf, this function follows chained
+ * scatterlists, and stops before the @count-th item if a scatterlist item
+ * has a marker.
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ */
+void virtqueue_add_sg(struct virtqueue_buf *buf,
+		      struct scatterlist sgl[],
+		      unsigned int count,
+		      enum dma_data_direction dir)
+{
+	struct vring_virtqueue *vq = to_vvq(buf->vq);
+	unsigned int i, uninitialized_var(prev), n;
+	struct scatterlist *sg;
+	struct vring_desc *tail;
+	u32 flags;
+
+#ifdef DEBUG
+	BUG_ON(!vq->in_use);
+#endif
+
+	BUG_ON(dir != DMA_FROM_DEVICE && dir != DMA_TO_DEVICE);
+	BUG_ON(count == 0);
+
+	flags = (dir == DMA_FROM_DEVICE ? VRING_DESC_F_WRITE : 0);
+	flags |= VRING_DESC_F_NEXT;
+
+	/* If using indirect descriptor tables, fill in the buffers
+	 * at buf->indirect.  */
+	if (buf->indirect != NULL) {
+		i = 0;
+		if (likely(buf->tail != NULL))
+			i = buf->tail - buf->indirect + 1;
+
+		for_each_sg(sgl, sg, count, n) {
+			tail = &buf->indirect[i];
+			tail->flags = flags;
+			tail->addr = sg_phys(sg);
+			tail->len = sg->length;
+			tail->next = ++i;
+		}
+	} else {
+		BUG_ON(vq->vq.num_free < count);
+
+		i = vq->free_head;
+		for_each_sg(sgl, sg, count, n) {
+			tail = &vq->vring.desc[i];
+			tail->flags = flags;
+			tail->addr = sg_phys(sg);
+			tail->len = sg->length;
+			i = vq->vring.desc[i].next;
+			vq->vq.num_free--;
+		}
+
+		vq->free_head = i;
+	}
+	buf->tail = tail;
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_sg);
+
+/**
+ * virtqueue_end_buf - expose buffer to other end
+ * @buf: the struct that was passed to virtqueue_start_buf
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ */
+void virtqueue_end_buf(struct virtqueue_buf *buf)
+{
+	struct vring_virtqueue *vq = to_vvq(buf->vq);
+	unsigned int avail;
+	int head = buf->head;
+	struct vring_desc *tail = buf->tail;
+
+#ifdef DEBUG
+	BUG_ON(!vq->in_use);
+#endif
+	BUG_ON(tail == NULL);
+
+	/* The last one does not have the next flag set.  */
+	tail->flags &= ~VRING_DESC_F_NEXT;
+
+	/* Put entry in available array (but don't update avail->idx until
+	 * virtqueue_end_buf). */
+	avail = (vq->vring.avail->idx & (vq->vring.num-1));
+	vq->vring.avail->ring[avail] = head;
+
+	/* Descriptors and available array need to be set before we expose the
+	 * new available array entries. */
+	virtio_wmb(vq);
+	vq->vring.avail->idx++;
+	vq->num_added++;
+
+	/* This is very unlikely, but theoretically possible.  Kick
+	 * just in case. */
+	if (unlikely(vq->num_added == (1 << 16) - 1))
+		virtqueue_kick(buf->vq);
+
+	pr_debug("Added buffer head %i to %p\n", head, vq);
+	END_USE(vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_end_buf);
+
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
 	return vq->last_used_idx != vq->vring.used->idx;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index cf8adb1..39d56c4 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -7,6 +7,7 @@
 #include <linux/spinlock.h>
 #include <linux/device.h>
 #include <linux/mod_devicetable.h>
+#include <linux/dma-direction.h>
 #include <linux/gfp.h>
 
 /**
@@ -40,6 +41,26 @@ int virtqueue_add_buf(struct virtqueue *vq,
 		      void *data,
 		      gfp_t gfp);
 
+struct virtqueue_buf {
+	struct virtqueue *vq;
+	struct vring_desc *indirect, *tail;
+	int head;
+};
+
+int virtqueue_start_buf(struct virtqueue *_vq,
+			struct virtqueue_buf *buf,
+			void *data,
+			unsigned int count,
+			unsigned int count_sg,
+			gfp_t gfp);
+
+void virtqueue_add_sg(struct virtqueue_buf *buf,
+		      struct scatterlist sgl[],
+		      unsigned int count,
+		      enum dma_data_direction dir);
+
+void virtqueue_end_buf(struct virtqueue_buf *buf);
+
 void virtqueue_kick(struct virtqueue *vq);
 
 bool virtqueue_kick_prepare(struct virtqueue *vq);
-- 
1.7.1



* [PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
  2012-12-18 12:32 ` Paolo Bonzini
@ 2012-12-18 12:32   ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 12:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, gaowanlong, hutao, linux-scsi, virtualization, mst, rusty,
	asias, stefanha, nab

Using the new virtqueue_add_sg function lets us simplify the queueing
path.  In particular, all data protected by the tgt_lock is just gone
(multiqueue will find a new use for the lock).

The speedup is relatively small (2-4%) but it is worthwhile because of
the code simplification---both in this patch and in the next ones.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	v1->v2: new

 drivers/scsi/virtio_scsi.c |   94 +++++++++++++++++++------------------------
 1 files changed, 42 insertions(+), 52 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 74ab67a..2b93b6e 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -59,11 +59,8 @@ struct virtio_scsi_vq {
 
 /* Per-target queue state */
 struct virtio_scsi_target_state {
-	/* Protects sg.  Lock hierarchy is tgt_lock -> vq_lock.  */
+	/* Never held at the same time as vq_lock.  */
 	spinlock_t tgt_lock;
-
-	/* For sglist construction when adding commands to the virtqueue.  */
-	struct scatterlist sg[];
 };
 
 /* Driver instance state */
@@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq)
 	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
 };
 
-static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
-			     struct scsi_data_buffer *sdb)
-{
-	struct sg_table *table = &sdb->table;
-	struct scatterlist *sg_elem;
-	unsigned int idx = *p_idx;
-	int i;
-
-	for_each_sg(table->sgl, sg_elem, table->nents, i)
-		sg[idx++] = *sg_elem;
-
-	*p_idx = idx;
-}
-
 /**
- * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
+ * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue
  * @vscsi	: virtio_scsi state
  * @cmd		: command structure
- * @out_num	: number of read-only elements
- * @in_num	: number of write-only elements
  * @req_size	: size of the request buffer
  * @resp_size	: size of the response buffer
- *
- * Called with tgt_lock held.
+ * @gfp	: flags to use for memory allocations
  */
-static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
-			     struct virtio_scsi_cmd *cmd,
-			     unsigned *out_num, unsigned *in_num,
-			     size_t req_size, size_t resp_size)
+static int virtscsi_add_cmd(struct virtqueue *vq,
+			    struct virtio_scsi_cmd *cmd,
+			    size_t req_size, size_t resp_size, gfp_t gfp)
 {
 	struct scsi_cmnd *sc = cmd->sc;
-	struct scatterlist *sg = tgt->sg;
-	unsigned int idx = 0;
+	struct scatterlist sg;
+	unsigned int count, count_sg;
+	struct sg_table *out, *in;
+	struct virtqueue_buf buf;
+	int ret;
+
+	out = in = NULL;
+
+	if (sc && sc->sc_data_direction != DMA_NONE) {
+		if (sc->sc_data_direction != DMA_FROM_DEVICE)
+			out = &scsi_out(sc)->table;
+		if (sc->sc_data_direction != DMA_TO_DEVICE)
+			in = &scsi_in(sc)->table;
+	}
+
+	count_sg = 2 + (out ? 1 : 0)          + (in ? 1 : 0);
+	count    = 2 + (out ? out->nents : 0) + (in ? in->nents : 0);
+	ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp);
+	if (ret < 0)
+		return ret;
 
 	/* Request header.  */
-	sg_set_buf(&sg[idx++], &cmd->req, req_size);
+	sg_init_one(&sg, &cmd->req, req_size);
+	virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE);
 
 	/* Data-out buffer.  */
-	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
-		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
-
-	*out_num = idx;
+	if (out)
+		virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE);
 
 	/* Response header.  */
-	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
+	sg_init_one(&sg, &cmd->resp, resp_size);
+	virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE);
 
 	/* Data-in buffer */
-	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
-		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
+	if (in)
+		virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE);
 
-	*in_num = idx - *out_num;
+	virtqueue_end_buf(&buf);
+	return 0;
 }
 
 static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
@@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
 			     struct virtio_scsi_cmd *cmd,
 			     size_t req_size, size_t resp_size, gfp_t gfp)
 {
-	unsigned int out_num, in_num;
 	unsigned long flags;
-	int err;
+	int ret;
 	bool needs_kick = false;
 
-	spin_lock_irqsave(&tgt->tgt_lock, flags);
-	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
-
-	spin_lock(&vq->vq_lock);
-	err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
-	spin_unlock(&tgt->tgt_lock);
-	if (!err)
+	spin_lock_irqsave(&vq->vq_lock, flags);
+	ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp);
+	if (!ret)
 		needs_kick = virtqueue_kick_prepare(vq->vq);
 
 	spin_unlock_irqrestore(&vq->vq_lock, flags);
 
 	if (needs_kick)
 		virtqueue_notify(vq->vq);
-	return err;
+	return ret;
 }
 
 static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
@@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
 	gfp_t gfp_mask = GFP_KERNEL;
 
 	/* We need extra sg elements at head and tail.  */
-	tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2),
-		      gfp_mask);
-
+	tgt = kmalloc(sizeof(*tgt), gfp_mask);
 	if (!tgt)
 		return NULL;
 
 	spin_lock_init(&tgt->tgt_lock);
-	sg_init_table(tgt->sg, sg_elems + 2);
 	return tgt;
 }
 
-- 
1.7.1



* [PATCH v2 3/5] virtio-scsi: redo allocation of target data
  2012-12-18 12:32 ` Paolo Bonzini
@ 2012-12-18 12:32   ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 12:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, gaowanlong, hutao, linux-scsi, virtualization, mst, rusty,
	asias, stefanha, nab

virtio_scsi_target_state is now almost empty, but we will find new uses
for it in the next few patches.  Dropping the sglist, however, lets us
turn the array of pointers into a simple array of structs, which
simplifies the allocation.

Note that we do not keep the virtio_scsi_target_state structs in a
flexible array member at the end of struct virtio_scsi, because the
virtqueues will be placed there in the next patches.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	v1->v2: new

 drivers/scsi/virtio_scsi.c |   43 ++++++++++++++-----------------------------
 1 files changed, 14 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 2b93b6e..4a3abaf 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -74,7 +74,7 @@ struct virtio_scsi {
 	/* Get some buffers ready for event vq */
 	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
 
-	struct virtio_scsi_target_state *tgt[];
+	struct virtio_scsi_target_state *tgt;
 };
 
 static struct kmem_cache *virtscsi_cmd_cache;
@@ -426,7 +426,7 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
 static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 {
 	struct virtio_scsi *vscsi = shost_priv(sh);
-	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 	struct virtio_scsi_cmd *cmd;
 	int ret;
 
@@ -474,7 +474,7 @@ out:
 static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
 {
 	DECLARE_COMPLETION_ONSTACK(comp);
-	struct virtio_scsi_target_state *tgt = vscsi->tgt[cmd->sc->device->id];
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[cmd->sc->device->id];
 	int ret = FAILED;
 
 	cmd->comp = &comp;
@@ -578,19 +578,9 @@ static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
 	virtscsi_vq->vq = vq;
 }
 
-static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
-	struct virtio_device *vdev, int sg_elems)
+static void virtscsi_init_tgt(struct virtio_scsi_target_state *tgt)
 {
-	struct virtio_scsi_target_state *tgt;
-	gfp_t gfp_mask = GFP_KERNEL;
-
-	/* We need extra sg elements at head and tail.  */
-	tgt = kmalloc(sizeof(*tgt), gfp_mask);
-	if (!tgt)
-		return NULL;
-
 	spin_lock_init(&tgt->tgt_lock);
-	return tgt;
 }
 
 static void virtscsi_scan(struct virtio_device *vdev)
@@ -604,16 +594,12 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev)
 {
 	struct Scsi_Host *sh = virtio_scsi_host(vdev);
 	struct virtio_scsi *vscsi = shost_priv(sh);
-	u32 i, num_targets;
 
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 
-	num_targets = sh->max_id;
-	for (i = 0; i < num_targets; i++) {
-		kfree(vscsi->tgt[i]);
-		vscsi->tgt[i] = NULL;
-	}
+	kfree(vscsi->tgt);
+	vscsi->tgt = NULL;
 
 	vdev->config->del_vqs(vdev);
 }
@@ -654,13 +640,14 @@ static int virtscsi_init(struct virtio_device *vdev,
 	/* We need to know how many segments before we allocate.  */
 	sg_elems = virtscsi_config_get(vdev, seg_max) ?: 1;
 
-	for (i = 0; i < num_targets; i++) {
-		vscsi->tgt[i] = virtscsi_alloc_tgt(vdev, sg_elems);
-		if (!vscsi->tgt[i]) {
-			err = -ENOMEM;
-			goto out;
-		}
+	vscsi->tgt = kmalloc(num_targets * sizeof(vscsi->tgt[0]), GFP_KERNEL);
+	if (!vscsi->tgt) {
+		err = -ENOMEM;
+		goto out;
 	}
+	for (i = 0; i < num_targets; i++)
+		virtscsi_init_tgt(&vscsi->tgt[i]);
+
 	err = 0;
 
 out:
@@ -679,9 +666,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 
 	/* Allocate memory and link the structs together.  */
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
-	shost = scsi_host_alloc(&virtscsi_host_template,
-		sizeof(*vscsi)
-		+ num_targets * sizeof(struct virtio_scsi_target_state));
+	shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi));
 
 	if (!shost)
 		return -ENOMEM;
-- 
1.7.1



* [PATCH v2 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function
  2012-12-18 12:32 ` Paolo Bonzini
@ 2012-12-18 12:32   ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 12:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, gaowanlong, hutao, linux-scsi, virtualization, mst, rusty,
	asias, stefanha, nab

This will be needed soon in order to retrieve the per-target
struct.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 drivers/scsi/virtio_scsi.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 4a3abaf..4f6c6a3 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -104,7 +104,7 @@ static void virtscsi_compute_resid(struct scsi_cmnd *sc, u32 resid)
  *
  * Called with vq_lock held.
  */
-static void virtscsi_complete_cmd(void *buf)
+static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 {
 	struct virtio_scsi_cmd *cmd = buf;
 	struct scsi_cmnd *sc = cmd->sc;
@@ -165,7 +165,8 @@ static void virtscsi_complete_cmd(void *buf)
 	sc->scsi_done(sc);
 }
 
-static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf))
+static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq,
+			     void (*fn)(struct virtio_scsi *vscsi, void *buf))
 {
 	void *buf;
 	unsigned int len;
@@ -173,7 +174,7 @@ static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf))
 	do {
 		virtqueue_disable_cb(vq);
 		while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
-			fn(buf);
+			fn(vscsi, buf);
 	} while (!virtqueue_enable_cb(vq));
 }
 
@@ -184,11 +185,11 @@ static void virtscsi_req_done(struct virtqueue *vq)
 	unsigned long flags;
 
 	spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags);
-	virtscsi_vq_done(vq, virtscsi_complete_cmd);
+	virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd);
 	spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags);
 };
 
-static void virtscsi_complete_free(void *buf)
+static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
 {
 	struct virtio_scsi_cmd *cmd = buf;
 
@@ -205,7 +206,7 @@ static void virtscsi_ctrl_done(struct virtqueue *vq)
 	unsigned long flags;
 
 	spin_lock_irqsave(&vscsi->ctrl_vq.vq_lock, flags);
-	virtscsi_vq_done(vq, virtscsi_complete_free);
+	virtscsi_vq_done(vscsi, vq, virtscsi_complete_free);
 	spin_unlock_irqrestore(&vscsi->ctrl_vq.vq_lock, flags);
 };
 
@@ -329,7 +330,7 @@ static void virtscsi_handle_event(struct work_struct *work)
 	virtscsi_kick_event(vscsi, event_node);
 }
 
-static void virtscsi_complete_event(void *buf)
+static void virtscsi_complete_event(struct virtio_scsi *vscsi, void *buf)
 {
 	struct virtio_scsi_event_node *event_node = buf;
 
@@ -344,7 +345,7 @@ static void virtscsi_event_done(struct virtqueue *vq)
 	unsigned long flags;
 
 	spin_lock_irqsave(&vscsi->event_vq.vq_lock, flags);
-	virtscsi_vq_done(vq, virtscsi_complete_event);
+	virtscsi_vq_done(vscsi, vq, virtscsi_complete_event);
 	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
 };
 
-- 
1.7.1



* [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 12:32 ` Paolo Bonzini
@ 2012-12-18 12:32 ` Paolo Bonzini
  2012-12-18 13:57     ` Michael S. Tsirkin
  -1 siblings, 3 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 12:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, gaowanlong, hutao, linux-scsi, virtualization, mst, rusty,
	asias, stefanha, nab

This patch adds queue steering to virtio-scsi.  When a target is sent
multiple requests, we always drive them to the same queue so that FIFO
processing order is kept.  However, if a target was idle, we can choose
a queue arbitrarily.  In this case the queue is chosen according to the
current VCPU, so the driver expects the number of request queues to be
equal to the number of VCPUs.  This makes it easy and fast to select
the queue, and also lets the driver optimize the IRQ affinity for the
virtqueues (each virtqueue's affinity is set to the CPU that "owns"
the queue).

The speedup comes from improving cache locality and giving CPU affinity
to the virtqueues, which is why this scheme was selected.  Assuming that
the thread that is sending requests to the device is I/O-bound, it is
likely to be sleeping at the time the ISR is executed, and thus executing
the ISR on the same processor that sent the requests is cheap.

However, the kernel will not execute the ISR on the "best" processor
unless you explicitly set the affinity.  This is because in practice
you will have many such I/O-bound processes and thus many otherwise
idle processors.  Then the kernel will execute the ISR on a random
processor, rather than the one that is sending requests to the device.

The alternative to per-CPU virtqueues is per-target virtqueues.  To
achieve the same locality, we could dynamically choose the virtqueue's
affinity based on the CPU of the last task that sent a request.  This
is less appealing because we do not set the affinity directly---we only
provide a hint to the irqbalance daemon running in userspace.  Dynamically
changing the affinity only works if userspace applies the hint fast enough.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	v1->v2: improved comments and commit messages, added memory barriers

 drivers/scsi/virtio_scsi.c |  234 +++++++++++++++++++++++++++++++++++++------
 1 files changed, 201 insertions(+), 33 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 4f6c6a3..ca9d29d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
 
 #define VIRTIO_SCSI_MEMPOOL_SZ 64
 #define VIRTIO_SCSI_EVENT_LEN 8
+#define VIRTIO_SCSI_VQ_BASE 2
 
 /* Command queue element */
 struct virtio_scsi_cmd {
@@ -57,24 +58,57 @@ struct virtio_scsi_vq {
 	struct virtqueue *vq;
 };
 
-/* Per-target queue state */
+/*
+ * Per-target queue state.
+ *
+ * This struct holds the data needed by the queue steering policy.  When a
+ * target is sent multiple requests, we need to drive them to the same queue so
+ * that FIFO processing order is kept.  However, if a target was idle, we can
+ * choose a queue arbitrarily.  In this case the queue is chosen according to
+ * the current VCPU, so the driver expects the number of request queues to be
+ * equal to the number of VCPUs.  This makes it easy and fast to select the
+ * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
+ * (each virtqueue's affinity is set to the CPU that "owns" the queue).
+ *
+ * An interesting effect of this policy is that only writes to req_vq need to
+ * take the tgt_lock.  Read can be done outside the lock because:
+ *
+ * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1.
+ *   In that case, no other CPU is reading req_vq: even if they were in
+ *   virtscsi_queuecommand_multi, they would be spinning on tgt_lock.
+ *
+ * - reads of req_vq only occur when the target is not idle (reqs != 0).
+ *   A CPU that enters virtscsi_queuecommand_multi will not modify req_vq.
+ *
+ * Similarly, decrements of reqs are never concurrent with writes of req_vq.
+ * Thus they can happen outside the tgt_lock, provided of course we make reqs
+ * an atomic_t.
+ */
 struct virtio_scsi_target_state {
-	/* Never held at the same time as vq_lock.  */
+	/* This spinlock is never held at the same time as vq_lock.  */
 	spinlock_t tgt_lock;
+
+	/* Count of outstanding requests.  */
+	atomic_t reqs;
+
+	/* Currently active virtqueue for requests sent to this target.  */
+	struct virtio_scsi_vq *req_vq;
 };
 
 /* Driver instance state */
 struct virtio_scsi {
 	struct virtio_device *vdev;
 
-	struct virtio_scsi_vq ctrl_vq;
-	struct virtio_scsi_vq event_vq;
-	struct virtio_scsi_vq req_vq;
-
 	/* Get some buffers ready for event vq */
 	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
 
 	struct virtio_scsi_target_state *tgt;
+
+	u32 num_queues;
+
+	struct virtio_scsi_vq ctrl_vq;
+	struct virtio_scsi_vq event_vq;
+	struct virtio_scsi_vq req_vqs[];
 };
 
 static struct kmem_cache *virtscsi_cmd_cache;
@@ -109,6 +143,7 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 	struct virtio_scsi_cmd *cmd = buf;
 	struct scsi_cmnd *sc = cmd->sc;
 	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 
 	dev_dbg(&sc->device->sdev_gendev,
 		"cmd %p response %u status %#02x sense_len %u\n",
@@ -163,6 +198,8 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 
 	mempool_free(cmd, virtscsi_cmd_pool);
 	sc->scsi_done(sc);
+
+	atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq,
@@ -182,11 +219,45 @@ static void virtscsi_req_done(struct virtqueue *vq)
 {
 	struct Scsi_Host *sh = virtio_scsi_host(vq->vdev);
 	struct virtio_scsi *vscsi = shost_priv(sh);
+	int index = vq->index - VIRTIO_SCSI_VQ_BASE;
+	struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];
 	unsigned long flags;
 
-	spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags);
+	/*
+	 * Read req_vq before decrementing the reqs field in
+	 * virtscsi_complete_cmd.
+	 *
+	 * With barriers:
+	 *
+	 * 	CPU #0			virtscsi_queuecommand_multi (CPU #1)
+	 * 	------------------------------------------------------------
+	 * 	lock vq_lock
+	 * 	read req_vq
+	 * 	read reqs (reqs = 1)
+	 * 	write reqs (reqs = 0)
+	 * 				increment reqs (reqs = 1)
+	 * 				write req_vq
+	 *
+	 * Possible reordering without barriers:
+	 *
+	 * 	CPU #0			virtscsi_queuecommand_multi (CPU #1)
+	 * 	------------------------------------------------------------
+	 * 	lock vq_lock
+	 * 	read reqs (reqs = 1)
+	 * 	write reqs (reqs = 0)
+	 * 				increment reqs (reqs = 1)
+	 * 				write req_vq
+	 * 	read (wrong) req_vq
+	 *
+	 * We do not need a full smp_rmb, because req_vq is required to get
+	 * to tgt->reqs: tgt is &vscsi->tgt[sc->device->id], where sc is stored
+	 * in the virtqueue as the user token.
+	 */
+	smp_read_barrier_depends();
+
+	spin_lock_irqsave(&req_vq->vq_lock, flags);
 	virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd);
-	spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags);
+	spin_unlock_irqrestore(&req_vq->vq_lock, flags);
 };
 
 static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
@@ -424,11 +495,12 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
 	return ret;
 }
 
-static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
+static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
+				 struct virtio_scsi_target_state *tgt,
+				 struct scsi_cmnd *sc)
 {
-	struct virtio_scsi *vscsi = shost_priv(sh);
-	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 	struct virtio_scsi_cmd *cmd;
+	struct virtio_scsi_vq *req_vq;
 	int ret;
 
 	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
@@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
 	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
 
-	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
+	req_vq = ACCESS_ONCE(tgt->req_vq);
+	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
 			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
 			      GFP_ATOMIC) == 0)
 		ret = 0;
@@ -472,6 +545,48 @@ out:
 	return ret;
 }
 
+static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
+					struct scsi_cmnd *sc)
+{
+	struct virtio_scsi *vscsi = shost_priv(sh);
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
+
+	atomic_inc(&tgt->reqs);
+	return virtscsi_queuecommand(vscsi, tgt, sc);
+}
+
+static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
+				       struct scsi_cmnd *sc)
+{
+	struct virtio_scsi *vscsi = shost_priv(sh);
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
+	unsigned long flags;
+	u32 queue_num;
+
+	/*
+	 * Using an atomic_t for tgt->reqs lets the virtqueue handler
+	 * decrement it without taking the spinlock.
+	 *
+	 * We still need a critical section to prevent concurrent submissions
+	 * from picking two different req_vqs.
+	 */
+	spin_lock_irqsave(&tgt->tgt_lock, flags);
+	if (atomic_inc_return(&tgt->reqs) == 1) {
+		queue_num = smp_processor_id();
+		while (unlikely(queue_num >= vscsi->num_queues))
+			queue_num -= vscsi->num_queues;
+
+		/*
+		 * Write reqs before writing req_vq, matching the
+		 * smp_read_barrier_depends() in virtscsi_req_done.
+		 */
+		smp_wmb();
+		tgt->req_vq = &vscsi->req_vqs[queue_num];
+	}
+	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+	return virtscsi_queuecommand(vscsi, tgt, sc);
+}
+
 static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
 {
 	DECLARE_COMPLETION_ONSTACK(comp);
@@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
 	return virtscsi_tmf(vscsi, cmd);
 }
 
-static struct scsi_host_template virtscsi_host_template = {
+static struct scsi_host_template virtscsi_host_template_single = {
 	.module = THIS_MODULE,
 	.name = "Virtio SCSI HBA",
 	.proc_name = "virtio_scsi",
-	.queuecommand = virtscsi_queuecommand,
 	.this_id = -1,
+	.queuecommand = virtscsi_queuecommand_single,
+	.eh_abort_handler = virtscsi_abort,
+	.eh_device_reset_handler = virtscsi_device_reset,
+
+	.can_queue = 1024,
+	.dma_boundary = UINT_MAX,
+	.use_clustering = ENABLE_CLUSTERING,
+};
+
+static struct scsi_host_template virtscsi_host_template_multi = {
+	.module = THIS_MODULE,
+	.name = "Virtio SCSI HBA",
+	.proc_name = "virtio_scsi",
+	.this_id = -1,
+	.queuecommand = virtscsi_queuecommand_multi,
 	.eh_abort_handler = virtscsi_abort,
 	.eh_device_reset_handler = virtscsi_device_reset,
 
@@ -572,16 +701,27 @@ static struct scsi_host_template virtscsi_host_template = {
 				  &__val, sizeof(__val)); \
 	})
 
+
 static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
-			     struct virtqueue *vq)
+			     struct virtqueue *vq, bool affinity)
 {
 	spin_lock_init(&virtscsi_vq->vq_lock);
 	virtscsi_vq->vq = vq;
+	if (affinity)
+		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);
 }
 
-static void virtscsi_init_tgt(struct virtio_scsi_target_state *tgt)
+static void virtscsi_init_tgt(struct virtio_scsi *vscsi, int i)
 {
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[i];
 	spin_lock_init(&tgt->tgt_lock);
+	atomic_set(&tgt->reqs, 0);
+
+	/*
+	 * The default is unused for multiqueue, but with a single queue
+	 * or target we use it in virtscsi_queuecommand.
+	 */
+	tgt->req_vq = &vscsi->req_vqs[0];
 }
 
 static void virtscsi_scan(struct virtio_device *vdev)
@@ -609,28 +749,41 @@ static int virtscsi_init(struct virtio_device *vdev,
 			 struct virtio_scsi *vscsi, int num_targets)
 {
 	int err;
-	struct virtqueue *vqs[3];
 	u32 i, sg_elems;
+	u32 num_vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct virtqueue **vqs;
 
-	vq_callback_t *callbacks[] = {
-		virtscsi_ctrl_done,
-		virtscsi_event_done,
-		virtscsi_req_done
-	};
-	const char *names[] = {
-		"control",
-		"event",
-		"request"
-	};
+	num_vqs = vscsi->num_queues + VIRTIO_SCSI_VQ_BASE;
+	vqs = kmalloc(num_vqs * sizeof(struct virtqueue *), GFP_KERNEL);
+	callbacks = kmalloc(num_vqs * sizeof(vq_callback_t *), GFP_KERNEL);
+	names = kmalloc(num_vqs * sizeof(char *), GFP_KERNEL);
+
+	if (!callbacks || !vqs || !names) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	callbacks[0] = virtscsi_ctrl_done;
+	callbacks[1] = virtscsi_event_done;
+	names[0] = "control";
+	names[1] = "event";
+	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++) {
+		callbacks[i] = virtscsi_req_done;
+		names[i] = "request";
+	}
 
 	/* Discover virtqueues and write information to configuration.  */
-	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
+	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
 	if (err)
 		return err;
 
-	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
-	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
-	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
+	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
+	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
+	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
+		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
+				 vqs[i], vscsi->num_queues > 1);
 
 	virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE);
 	virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE);
@@ -647,11 +800,14 @@ static int virtscsi_init(struct virtio_device *vdev,
 		goto out;
 	}
 	for (i = 0; i < num_targets; i++)
-		virtscsi_init_tgt(&vscsi->tgt[i]);
+		virtscsi_init_tgt(vscsi, i);
 
 	err = 0;
 
 out:
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	if (err)
 		virtscsi_remove_vqs(vdev);
 	return err;
@@ -664,11 +820,22 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	int err;
 	u32 sg_elems, num_targets;
 	u32 cmd_per_lun;
+	u32 num_queues;
+	struct scsi_host_template *hostt;
+
+	/* We need to know how many queues before we allocate.  */
+	num_queues = virtscsi_config_get(vdev, num_queues) ?: 1;
 
 	/* Allocate memory and link the structs together.  */
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
-	shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi));
 
+	if (num_queues == 1)
+		hostt = &virtscsi_host_template_single;
+	else
+		hostt = &virtscsi_host_template_multi;
+
+	shost = scsi_host_alloc(hostt,
+		sizeof(*vscsi) + sizeof(vscsi->req_vqs[0]) * num_queues);
 	if (!shost)
 		return -ENOMEM;
 
@@ -676,6 +843,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	shost->sg_tablesize = sg_elems;
 	vscsi = shost_priv(shost);
 	vscsi->vdev = vdev;
+	vscsi->num_queues = num_queues;
 	vdev->priv = shost;
 
 	err = virtscsi_init(vdev, vscsi, num_targets);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 12:32 ` Paolo Bonzini
@ 2012-12-18 12:32 ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 12:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, kvm, mst, hutao, virtualization, stefanha

This patch adds queue steering to virtio-scsi.  When a target is sent
multiple requests, we always drive them to the same queue so that FIFO
processing order is kept.  However, if a target was idle, we can choose
a queue arbitrarily.  In this case the queue is chosen according to the
current VCPU, so the driver expects the number of request queues to be
equal to the number of VCPUs.  This makes it easy and fast to select
the queue, and also lets the driver optimize the IRQ affinity for the
virtqueues (each virtqueue's affinity is set to the CPU that "owns"
the queue).
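
In condensed pseudo-C, the steering policy described above amounts to the
following (a simplified sketch of what virtscsi_queuecommand_multi below
does; the function name is made up for illustration, and the real code also
orders the reqs and req_vq writes with a memory barrier):

	/* Sketch only: pick the request virtqueue for a command. */
	static struct virtio_scsi_vq *pick_req_vq(struct virtio_scsi *vscsi,
						  struct virtio_scsi_target_state *tgt)
	{
		struct virtio_scsi_vq *vq;
		unsigned long flags;

		spin_lock_irqsave(&tgt->tgt_lock, flags);
		if (atomic_inc_return(&tgt->reqs) == 1)
			/* Target was idle: bind it to the current CPU's queue. */
			tgt->req_vq = &vscsi->req_vqs[smp_processor_id() % vscsi->num_queues];
		/* Otherwise keep the queue already in flight, preserving FIFO order. */
		vq = tgt->req_vq;
		spin_unlock_irqrestore(&tgt->tgt_lock, flags);
		return vq;
	}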

The speedup comes from improving cache locality and giving CPU affinity
to the virtqueues, which is why this scheme was selected.  Assuming that
the thread that is sending requests to the device is I/O-bound, it is
likely to be sleeping at the time the ISR is executed, and thus executing
the ISR on the same processor that sent the requests is cheap.

However, the kernel will not execute the ISR on the "best" processor
unless you explicitly set the affinity.  This is because in practice
you will have many such I/O-bound processes and thus many otherwise
idle processors.  Then the kernel will execute the ISR on a random
processor, rather than the one that is sending requests to the device.
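
For reference, the patch below sets that affinity explicitly when the
request virtqueues are initialized (see virtscsi_init_vq):

	if (affinity)
		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);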

The alternative to per-CPU virtqueues is per-target virtqueues.  To
achieve the same locality, we could dynamically choose the virtqueue's
affinity based on the CPU of the last task that sent a request.  This
is less appealing because we do not set the affinity directly---we only
provide a hint to the irqbalance daemon running in userspace.  Dynamically
changing the affinity only works if userspace applies the hint
fast enough.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	v1->v2: improved comments and commit messages, added memory barriers

 drivers/scsi/virtio_scsi.c |  234 +++++++++++++++++++++++++++++++++++++------
 1 files changed, 201 insertions(+), 33 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 4f6c6a3..ca9d29d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
 
 #define VIRTIO_SCSI_MEMPOOL_SZ 64
 #define VIRTIO_SCSI_EVENT_LEN 8
+#define VIRTIO_SCSI_VQ_BASE 2
 
 /* Command queue element */
 struct virtio_scsi_cmd {
@@ -57,24 +58,57 @@ struct virtio_scsi_vq {
 	struct virtqueue *vq;
 };
 
-/* Per-target queue state */
+/*
+ * Per-target queue state.
+ *
+ * This struct holds the data needed by the queue steering policy.  When a
+ * target is sent multiple requests, we need to drive them to the same queue so
+ * that FIFO processing order is kept.  However, if a target was idle, we can
+ * choose a queue arbitrarily.  In this case the queue is chosen according to
+ * the current VCPU, so the driver expects the number of request queues to be
+ * equal to the number of VCPUs.  This makes it easy and fast to select the
+ * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
+ * (each virtqueue's affinity is set to the CPU that "owns" the queue).
+ *
+ * An interesting effect of this policy is that only writes to req_vq need to
+ * take the tgt_lock.  Reads can be done outside the lock because:
+ *
+ * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1.
+ *   In that case, no other CPU is reading req_vq: even if they were in
+ *   virtscsi_queuecommand_multi, they would be spinning on tgt_lock.
+ *
+ * - reads of req_vq only occur when the target is not idle (reqs != 0).
+ *   A CPU that enters virtscsi_queuecommand_multi will not modify req_vq.
+ *
+ * Similarly, decrements of reqs are never concurrent with writes of req_vq.
+ * Thus they can happen outside the tgt_lock, provided of course we make reqs
+ * an atomic_t.
+ */
 struct virtio_scsi_target_state {
-	/* Never held at the same time as vq_lock.  */
+	/* This spinlock is never held at the same time as vq_lock.  */
 	spinlock_t tgt_lock;
+
+	/* Count of outstanding requests.  */
+	atomic_t reqs;
+
+	/* Currently active virtqueue for requests sent to this target.  */
+	struct virtio_scsi_vq *req_vq;
 };
 
 /* Driver instance state */
 struct virtio_scsi {
 	struct virtio_device *vdev;
 
-	struct virtio_scsi_vq ctrl_vq;
-	struct virtio_scsi_vq event_vq;
-	struct virtio_scsi_vq req_vq;
-
 	/* Get some buffers ready for event vq */
 	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
 
 	struct virtio_scsi_target_state *tgt;
+
+	u32 num_queues;
+
+	struct virtio_scsi_vq ctrl_vq;
+	struct virtio_scsi_vq event_vq;
+	struct virtio_scsi_vq req_vqs[];
 };
 
 static struct kmem_cache *virtscsi_cmd_cache;
@@ -109,6 +143,7 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 	struct virtio_scsi_cmd *cmd = buf;
 	struct scsi_cmnd *sc = cmd->sc;
 	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 
 	dev_dbg(&sc->device->sdev_gendev,
 		"cmd %p response %u status %#02x sense_len %u\n",
@@ -163,6 +198,8 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 
 	mempool_free(cmd, virtscsi_cmd_pool);
 	sc->scsi_done(sc);
+
+	atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq,
@@ -182,11 +219,45 @@ static void virtscsi_req_done(struct virtqueue *vq)
 {
 	struct Scsi_Host *sh = virtio_scsi_host(vq->vdev);
 	struct virtio_scsi *vscsi = shost_priv(sh);
+	int index = vq->index - VIRTIO_SCSI_VQ_BASE;
+	struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];
 	unsigned long flags;
 
-	spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags);
+	/*
+	 * Read req_vq before decrementing the reqs field in
+	 * virtscsi_complete_cmd.
+	 *
+	 * With barriers:
+	 *
+	 * 	CPU #0			virtscsi_queuecommand_multi (CPU #1)
+	 * 	------------------------------------------------------------
+	 * 	lock vq_lock
+	 * 	read req_vq
+	 * 	read reqs (reqs = 1)
+	 * 	write reqs (reqs = 0)
+	 * 				increment reqs (reqs = 1)
+	 * 				write req_vq
+	 *
+	 * Possible reordering without barriers:
+	 *
+	 * 	CPU #0			virtscsi_queuecommand_multi (CPU #1)
+	 * 	------------------------------------------------------------
+	 * 	lock vq_lock
+	 * 	read reqs (reqs = 1)
+	 * 	write reqs (reqs = 0)
+	 * 				increment reqs (reqs = 1)
+	 * 				write req_vq
+	 * 	read (wrong) req_vq
+	 *
+	 * We do not need a full smp_rmb, because req_vq is required to get
+	 * to tgt->reqs: tgt is &vscsi->tgt[sc->device->id], where sc is stored
+	 * in the virtqueue as the user token.
+	 */
+	smp_read_barrier_depends();
+
+	spin_lock_irqsave(&req_vq->vq_lock, flags);
 	virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd);
-	spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags);
+	spin_unlock_irqrestore(&req_vq->vq_lock, flags);
 };
 
 static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
@@ -424,11 +495,12 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
 	return ret;
 }
 
-static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
+static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
+				 struct virtio_scsi_target_state *tgt,
+				 struct scsi_cmnd *sc)
 {
-	struct virtio_scsi *vscsi = shost_priv(sh);
-	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 	struct virtio_scsi_cmd *cmd;
+	struct virtio_scsi_vq *req_vq;
 	int ret;
 
 	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
@@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
 	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
 
-	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
+	req_vq = ACCESS_ONCE(tgt->req_vq);
+	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
 			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
 			      GFP_ATOMIC) == 0)
 		ret = 0;
@@ -472,6 +545,48 @@ out:
 	return ret;
 }
 
+static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
+					struct scsi_cmnd *sc)
+{
+	struct virtio_scsi *vscsi = shost_priv(sh);
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
+
+	atomic_inc(&tgt->reqs);
+	return virtscsi_queuecommand(vscsi, tgt, sc);
+}
+
+static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
+				       struct scsi_cmnd *sc)
+{
+	struct virtio_scsi *vscsi = shost_priv(sh);
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
+	unsigned long flags;
+	u32 queue_num;
+
+	/*
+	 * Using an atomic_t for tgt->reqs lets the virtqueue handler
+	 * decrement it without taking the spinlock.
+	 *
+	 * We still need a critical section to prevent concurrent submissions
+	 * from picking two different req_vqs.
+	 */
+	spin_lock_irqsave(&tgt->tgt_lock, flags);
+	if (atomic_inc_return(&tgt->reqs) == 1) {
+		queue_num = smp_processor_id();
+		while (unlikely(queue_num >= vscsi->num_queues))
+			queue_num -= vscsi->num_queues;
+
+		/*
+		 * Write reqs before writing req_vq, matching the
+		 * smp_read_barrier_depends() in virtscsi_req_done.
+		 */
+		smp_wmb();
+		tgt->req_vq = &vscsi->req_vqs[queue_num];
+	}
+	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+	return virtscsi_queuecommand(vscsi, tgt, sc);
+}
+
 static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
 {
 	DECLARE_COMPLETION_ONSTACK(comp);
@@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
 	return virtscsi_tmf(vscsi, cmd);
 }
 
-static struct scsi_host_template virtscsi_host_template = {
+static struct scsi_host_template virtscsi_host_template_single = {
 	.module = THIS_MODULE,
 	.name = "Virtio SCSI HBA",
 	.proc_name = "virtio_scsi",
-	.queuecommand = virtscsi_queuecommand,
 	.this_id = -1,
+	.queuecommand = virtscsi_queuecommand_single,
+	.eh_abort_handler = virtscsi_abort,
+	.eh_device_reset_handler = virtscsi_device_reset,
+
+	.can_queue = 1024,
+	.dma_boundary = UINT_MAX,
+	.use_clustering = ENABLE_CLUSTERING,
+};
+
+static struct scsi_host_template virtscsi_host_template_multi = {
+	.module = THIS_MODULE,
+	.name = "Virtio SCSI HBA",
+	.proc_name = "virtio_scsi",
+	.this_id = -1,
+	.queuecommand = virtscsi_queuecommand_multi,
 	.eh_abort_handler = virtscsi_abort,
 	.eh_device_reset_handler = virtscsi_device_reset,
 
@@ -572,16 +701,27 @@ static struct scsi_host_template virtscsi_host_template = {
 				  &__val, sizeof(__val)); \
 	})
 
+
 static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
-			     struct virtqueue *vq)
+			     struct virtqueue *vq, bool affinity)
 {
 	spin_lock_init(&virtscsi_vq->vq_lock);
 	virtscsi_vq->vq = vq;
+	if (affinity)
+		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);
 }
 
-static void virtscsi_init_tgt(struct virtio_scsi_target_state *tgt)
+static void virtscsi_init_tgt(struct virtio_scsi *vscsi, int i)
 {
+	struct virtio_scsi_target_state *tgt = &vscsi->tgt[i];
 	spin_lock_init(&tgt->tgt_lock);
+	atomic_set(&tgt->reqs, 0);
+
+	/*
+	 * The default is unused for multiqueue, but with a single queue
+	 * or target we use it in virtscsi_queuecommand.
+	 */
+	tgt->req_vq = &vscsi->req_vqs[0];
 }
 
 static void virtscsi_scan(struct virtio_device *vdev)
@@ -609,28 +749,41 @@ static int virtscsi_init(struct virtio_device *vdev,
 			 struct virtio_scsi *vscsi, int num_targets)
 {
 	int err;
-	struct virtqueue *vqs[3];
 	u32 i, sg_elems;
+	u32 num_vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct virtqueue **vqs;
 
-	vq_callback_t *callbacks[] = {
-		virtscsi_ctrl_done,
-		virtscsi_event_done,
-		virtscsi_req_done
-	};
-	const char *names[] = {
-		"control",
-		"event",
-		"request"
-	};
+	num_vqs = vscsi->num_queues + VIRTIO_SCSI_VQ_BASE;
+	vqs = kmalloc(num_vqs * sizeof(struct virtqueue *), GFP_KERNEL);
+	callbacks = kmalloc(num_vqs * sizeof(vq_callback_t *), GFP_KERNEL);
+	names = kmalloc(num_vqs * sizeof(char *), GFP_KERNEL);
+
+	if (!callbacks || !vqs || !names) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	callbacks[0] = virtscsi_ctrl_done;
+	callbacks[1] = virtscsi_event_done;
+	names[0] = "control";
+	names[1] = "event";
+	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++) {
+		callbacks[i] = virtscsi_req_done;
+		names[i] = "request";
+	}
 
 	/* Discover virtqueues and write information to configuration.  */
-	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
+	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
 	if (err)
 		return err;
 
-	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
-	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
-	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
+	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
+	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
+	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
+		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
+				 vqs[i], vscsi->num_queues > 1);
 
 	virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE);
 	virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE);
@@ -647,11 +800,14 @@ static int virtscsi_init(struct virtio_device *vdev,
 		goto out;
 	}
 	for (i = 0; i < num_targets; i++)
-		virtscsi_init_tgt(&vscsi->tgt[i]);
+		virtscsi_init_tgt(vscsi, i);
 
 	err = 0;
 
 out:
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	if (err)
 		virtscsi_remove_vqs(vdev);
 	return err;
@@ -664,11 +820,22 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	int err;
 	u32 sg_elems, num_targets;
 	u32 cmd_per_lun;
+	u32 num_queues;
+	struct scsi_host_template *hostt;
+
+	/* We need to know how many queues before we allocate.  */
+	num_queues = virtscsi_config_get(vdev, num_queues) ?: 1;
 
 	/* Allocate memory and link the structs together.  */
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
-	shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi));
 
+	if (num_queues == 1)
+		hostt = &virtscsi_host_template_single;
+	else
+		hostt = &virtscsi_host_template_multi;
+
+	shost = scsi_host_alloc(hostt,
+		sizeof(*vscsi) + sizeof(vscsi->req_vqs[0]) * num_queues);
 	if (!shost)
 		return -ENOMEM;
 
@@ -676,6 +843,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	shost->sg_tablesize = sg_elems;
 	vscsi = shost_priv(shost);
 	vscsi->vdev = vdev;
+	vscsi->num_queues = num_queues;
 	vdev->priv = shost;
 
 	err = virtscsi_init(vdev, vscsi, num_targets);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
  2012-12-18 13:37     ` Michael S. Tsirkin
@ 2012-12-18 13:35       ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 13:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On 18/12/2012 14:37, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2012 at 01:32:49PM +0100, Paolo Bonzini wrote:
>> Using the new virtio_scsi_add_sg function lets us simplify the queueing
>> path.  In particular, all data protected by the tgt_lock is just gone
>> (multiqueue will find a new use for the lock).
> 
> vq access still needs some protection: virtio is not reentrant
> by itself. with tgt_lock gone what protects vq against
> concurrent add_buf calls?

vq_lock.
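
For reference, the add path is serialized on that lock; the hunk quoted
below boils down to:

	spin_lock_irqsave(&vq->vq_lock, flags);
	ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp);
	if (!ret)
		needs_kick = virtqueue_kick_prepare(vq->vq);
	spin_unlock_irqrestore(&vq->vq_lock, flags);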

Paolo

>> The speedup is relatively small (2-4%) but it is worthwhile because of
>> the code simplification---both in this patches and in the next ones.
>>
>> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> 	v1->v2: new
>>
>>  drivers/scsi/virtio_scsi.c |   94 +++++++++++++++++++------------------------
>>  1 files changed, 42 insertions(+), 52 deletions(-)
>>
>> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
>> index 74ab67a..2b93b6e 100644
>> --- a/drivers/scsi/virtio_scsi.c
>> +++ b/drivers/scsi/virtio_scsi.c
>> @@ -59,11 +59,8 @@ struct virtio_scsi_vq {
>>  
>>  /* Per-target queue state */
>>  struct virtio_scsi_target_state {
>> -	/* Protects sg.  Lock hierarchy is tgt_lock -> vq_lock.  */
>> +	/* Never held at the same time as vq_lock.  */
>>  	spinlock_t tgt_lock;
>> -
>> -	/* For sglist construction when adding commands to the virtqueue.  */
>> -	struct scatterlist sg[];
>>  };
>>  
>>  /* Driver instance state */
>> @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq)
>>  	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
>>  };
>>  
>> -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
>> -			     struct scsi_data_buffer *sdb)
>> -{
>> -	struct sg_table *table = &sdb->table;
>> -	struct scatterlist *sg_elem;
>> -	unsigned int idx = *p_idx;
>> -	int i;
>> -
>> -	for_each_sg(table->sgl, sg_elem, table->nents, i)
>> -		sg[idx++] = *sg_elem;
>> -
>> -	*p_idx = idx;
>> -}
>> -
>>  /**
>> - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
>> + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue
>>   * @vscsi	: virtio_scsi state
>>   * @cmd		: command structure
>> - * @out_num	: number of read-only elements
>> - * @in_num	: number of write-only elements
>>   * @req_size	: size of the request buffer
>>   * @resp_size	: size of the response buffer
>> - *
>> - * Called with tgt_lock held.
>> + * @gfp	: flags to use for memory allocations
>>   */
>> -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
>> -			     struct virtio_scsi_cmd *cmd,
>> -			     unsigned *out_num, unsigned *in_num,
>> -			     size_t req_size, size_t resp_size)
>> +static int virtscsi_add_cmd(struct virtqueue *vq,
>> +			    struct virtio_scsi_cmd *cmd,
>> +			    size_t req_size, size_t resp_size, gfp_t gfp)
>>  {
>>  	struct scsi_cmnd *sc = cmd->sc;
>> -	struct scatterlist *sg = tgt->sg;
>> -	unsigned int idx = 0;
>> +	struct scatterlist sg;
>> +	unsigned int count, count_sg;
>> +	struct sg_table *out, *in;
>> +	struct virtqueue_buf buf;
>> +	int ret;
>> +
>> +	out = in = NULL;
>> +
>> +	if (sc && sc->sc_data_direction != DMA_NONE) {
>> +		if (sc->sc_data_direction != DMA_FROM_DEVICE)
>> +			out = &scsi_out(sc)->table;
>> +		if (sc->sc_data_direction != DMA_TO_DEVICE)
>> +			in = &scsi_in(sc)->table;
>> +	}
>> +
>> +	count_sg = 2 + (out ? 1 : 0)          + (in ? 1 : 0);
>> +	count    = 2 + (out ? out->nents : 0) + (in ? in->nents : 0);
>> +	ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp);
>> +	if (ret < 0)
>> +		return ret;
>>  
>>  	/* Request header.  */
>> -	sg_set_buf(&sg[idx++], &cmd->req, req_size);
>> +	sg_init_one(&sg, &cmd->req, req_size);
>> +	virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE);
>>  
>>  	/* Data-out buffer.  */
>> -	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
>> -		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
>> -
>> -	*out_num = idx;
>> +	if (out)
>> +		virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE);
>>  
>>  	/* Response header.  */
>> -	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
>> +	sg_init_one(&sg, &cmd->resp, resp_size);
>> +	virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE);
>>  
>>  	/* Data-in buffer */
>> -	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
>> -		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
>> +	if (in)
>> +		virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE);
>>  
>> -	*in_num = idx - *out_num;
>> +	virtqueue_end_buf(&buf);
>> +	return 0;
>>  }
>>  
>>  static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>> @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>>  			     struct virtio_scsi_cmd *cmd,
>>  			     size_t req_size, size_t resp_size, gfp_t gfp)
>>  {
>> -	unsigned int out_num, in_num;
>>  	unsigned long flags;
>> -	int err;
>> +	int ret;
>>  	bool needs_kick = false;
>>  
>> -	spin_lock_irqsave(&tgt->tgt_lock, flags);
>> -	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
>> -
>> -	spin_lock(&vq->vq_lock);
>> -	err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
>> -	spin_unlock(&tgt->tgt_lock);
>> -	if (!err)
>> +	spin_lock_irqsave(&vq->vq_lock, flags);
>> +	ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp);
>> +	if (!ret)
>>  		needs_kick = virtqueue_kick_prepare(vq->vq);
>>  
>>  	spin_unlock_irqrestore(&vq->vq_lock, flags);
>>  
>>  	if (needs_kick)
>>  		virtqueue_notify(vq->vq);
>> -	return err;
>> +	return ret;
>>  }
>>  
>>  static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>> @@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
>>  	gfp_t gfp_mask = GFP_KERNEL;
>>  
>>  	/* We need extra sg elements at head and tail.  */
>> -	tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2),
>> -		      gfp_mask);
>> -
>> +	tgt = kmalloc(sizeof(*tgt), gfp_mask);
>>  	if (!tgt)
>>  		return NULL;
>>  
>>  	spin_lock_init(&tgt->tgt_lock);
>> -	sg_init_table(tgt->sg, sg_elems + 2);
>>  	return tgt;
>>  }
>>  
>> -- 
>> 1.7.1
>>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
@ 2012-12-18 13:35       ` Paolo Bonzini
  0 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 13:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-scsi, kvm, hutao, linux-kernel, virtualization, stefanha

On 18/12/2012 14:37, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2012 at 01:32:49PM +0100, Paolo Bonzini wrote:
>> Using the new virtio_scsi_add_sg function lets us simplify the queueing
>> path.  In particular, all data protected by the tgt_lock is just gone
>> (multiqueue will find a new use for the lock).
> 
> vq access still needs some protection: virtio is not reentrant
> by itself. with tgt_lock gone what protects vq against
> concurrent add_buf calls?

vq_lock.

Paolo

>> The speedup is relatively small (2-4%) but it is worthwhile because of
>> the code simplification---both in this patches and in the next ones.
>>
>> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> 	v1->v2: new
>>
>>  drivers/scsi/virtio_scsi.c |   94 +++++++++++++++++++------------------------
>>  1 files changed, 42 insertions(+), 52 deletions(-)
>>
>> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
>> index 74ab67a..2b93b6e 100644
>> --- a/drivers/scsi/virtio_scsi.c
>> +++ b/drivers/scsi/virtio_scsi.c
>> @@ -59,11 +59,8 @@ struct virtio_scsi_vq {
>>  
>>  /* Per-target queue state */
>>  struct virtio_scsi_target_state {
>> -	/* Protects sg.  Lock hierarchy is tgt_lock -> vq_lock.  */
>> +	/* Never held at the same time as vq_lock.  */
>>  	spinlock_t tgt_lock;
>> -
>> -	/* For sglist construction when adding commands to the virtqueue.  */
>> -	struct scatterlist sg[];
>>  };
>>  
>>  /* Driver instance state */
>> @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq)
>>  	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
>>  };
>>  
>> -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
>> -			     struct scsi_data_buffer *sdb)
>> -{
>> -	struct sg_table *table = &sdb->table;
>> -	struct scatterlist *sg_elem;
>> -	unsigned int idx = *p_idx;
>> -	int i;
>> -
>> -	for_each_sg(table->sgl, sg_elem, table->nents, i)
>> -		sg[idx++] = *sg_elem;
>> -
>> -	*p_idx = idx;
>> -}
>> -
>>  /**
>> - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
>> + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue
>>   * @vscsi	: virtio_scsi state
>>   * @cmd		: command structure
>> - * @out_num	: number of read-only elements
>> - * @in_num	: number of write-only elements
>>   * @req_size	: size of the request buffer
>>   * @resp_size	: size of the response buffer
>> - *
>> - * Called with tgt_lock held.
>> + * @gfp	: flags to use for memory allocations
>>   */
>> -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
>> -			     struct virtio_scsi_cmd *cmd,
>> -			     unsigned *out_num, unsigned *in_num,
>> -			     size_t req_size, size_t resp_size)
>> +static int virtscsi_add_cmd(struct virtqueue *vq,
>> +			    struct virtio_scsi_cmd *cmd,
>> +			    size_t req_size, size_t resp_size, gfp_t gfp)
>>  {
>>  	struct scsi_cmnd *sc = cmd->sc;
>> -	struct scatterlist *sg = tgt->sg;
>> -	unsigned int idx = 0;
>> +	struct scatterlist sg;
>> +	unsigned int count, count_sg;
>> +	struct sg_table *out, *in;
>> +	struct virtqueue_buf buf;
>> +	int ret;
>> +
>> +	out = in = NULL;
>> +
>> +	if (sc && sc->sc_data_direction != DMA_NONE) {
>> +		if (sc->sc_data_direction != DMA_FROM_DEVICE)
>> +			out = &scsi_out(sc)->table;
>> +		if (sc->sc_data_direction != DMA_TO_DEVICE)
>> +			in = &scsi_in(sc)->table;
>> +	}
>> +
>> +	count_sg = 2 + (out ? 1 : 0)          + (in ? 1 : 0);
>> +	count    = 2 + (out ? out->nents : 0) + (in ? in->nents : 0);
>> +	ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp);
>> +	if (ret < 0)
>> +		return ret;
>>  
>>  	/* Request header.  */
>> -	sg_set_buf(&sg[idx++], &cmd->req, req_size);
>> +	sg_init_one(&sg, &cmd->req, req_size);
>> +	virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE);
>>  
>>  	/* Data-out buffer.  */
>> -	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
>> -		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
>> -
>> -	*out_num = idx;
>> +	if (out)
>> +		virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE);
>>  
>>  	/* Response header.  */
>> -	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
>> +	sg_init_one(&sg, &cmd->resp, resp_size);
>> +	virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE);
>>  
>>  	/* Data-in buffer */
>> -	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
>> -		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
>> +	if (in)
>> +		virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE);
>>  
>> -	*in_num = idx - *out_num;
>> +	virtqueue_end_buf(&buf);
>> +	return 0;
>>  }
>>  
>>  static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>> @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>>  			     struct virtio_scsi_cmd *cmd,
>>  			     size_t req_size, size_t resp_size, gfp_t gfp)
>>  {
>> -	unsigned int out_num, in_num;
>>  	unsigned long flags;
>> -	int err;
>> +	int ret;
>>  	bool needs_kick = false;
>>  
>> -	spin_lock_irqsave(&tgt->tgt_lock, flags);
>> -	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
>> -
>> -	spin_lock(&vq->vq_lock);
>> -	err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
>> -	spin_unlock(&tgt->tgt_lock);
>> -	if (!err)
>> +	spin_lock_irqsave(&vq->vq_lock, flags);
>> +	ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp);
>> +	if (!ret)
>>  		needs_kick = virtqueue_kick_prepare(vq->vq);
>>  
>>  	spin_unlock_irqrestore(&vq->vq_lock, flags);
>>  
>>  	if (needs_kick)
>>  		virtqueue_notify(vq->vq);
>> -	return err;
>> +	return ret;
>>  }
>>  
>>  static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>> @@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
>>  	gfp_t gfp_mask = GFP_KERNEL;
>>  
>>  	/* We need extra sg elements at head and tail.  */
>> -	tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2),
>> -		      gfp_mask);
>> -
>> +	tgt = kmalloc(sizeof(*tgt), gfp_mask);
>>  	if (!tgt)
>>  		return NULL;
>>  
>>  	spin_lock_init(&tgt->tgt_lock);
>> -	sg_init_table(tgt->sg, sg_elems + 2);
>>  	return tgt;
>>  }
>>  
>> -- 
>> 1.7.1
>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 12:32   ` Paolo Bonzini
@ 2012-12-18 13:36     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 13:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

Some comments without arguing about whether the performance
benefit is worth it.

On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index cf8adb1..39d56c4 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -7,6 +7,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/device.h>
>  #include <linux/mod_devicetable.h>
> +#include <linux/dma-direction.h>
>  #include <linux/gfp.h>
>  
>  /**
> @@ -40,6 +41,26 @@ int virtqueue_add_buf(struct virtqueue *vq,
>  		      void *data,
>  		      gfp_t gfp);
>  
> +struct virtqueue_buf {
> +	struct virtqueue *vq;
> +	struct vring_desc *indirect, *tail;

This is wrong: virtio.h does not include virtio_ring.h,
and by design it shouldn't depend on it.

> +	int head;
> +};
> +

Can't we track state internally to the virtqueue?
Exposing it seems to buy us nothing since you can't
call add_buf between start and end anyway.
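
A hypothetical shape for that (names and signatures below are only an
illustration of the suggestion, not code from this series) would be to drop
the caller-visible struct and keep the cursor inside the virtqueue
implementation:

	/* Hypothetical: the piecewise-add state lives inside struct virtqueue. */
	int virtqueue_start_buf(struct virtqueue *vq, void *data,
				unsigned int count, unsigned int count_sg,
				gfp_t gfp);
	void virtqueue_add_sg(struct virtqueue *vq, struct scatterlist sgl[],
			      unsigned int count, enum dma_data_direction dir);
	void virtqueue_end_buf(struct virtqueue *vq);

That would also sidestep the virtio.h/virtio_ring.h layering problem noted
above, since no vring_desc pointers would be exposed to callers.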


> +int virtqueue_start_buf(struct virtqueue *_vq,
> +			struct virtqueue_buf *buf,
> +			void *data,
> +			unsigned int count,
> +			unsigned int count_sg,
> +			gfp_t gfp);
> +
> +void virtqueue_add_sg(struct virtqueue_buf *buf,
> +		      struct scatterlist sgl[],
> +		      unsigned int count,
> +		      enum dma_data_direction dir);
> +

An idea: in practice virtio-scsi seems to always call sg_init_one, no?
So how about we pass in void* or something and avoid using sg and count?
This would make it useful for -net BTW.
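
For example, a single-buffer variant might look roughly like this
(hypothetical signature, sketched only to illustrate the suggestion):

	/* Hypothetical helper: add one contiguous buffer without an sg list. */
	void virtqueue_add_buf_single(struct virtqueue_buf *buf, void *addr,
				      unsigned int len,
				      enum dma_data_direction dir);

A caller like virtio-scsi could then pass &cmd->req and &cmd->resp directly
instead of wrapping each of them in a one-element scatterlist.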

> +void virtqueue_end_buf(struct virtqueue_buf *buf);
> +
>  void virtqueue_kick(struct virtqueue *vq);
>  
>  bool virtqueue_kick_prepare(struct virtqueue *vq);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
@ 2012-12-18 13:36     ` Michael S. Tsirkin
  0 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 13:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-scsi, kvm, hutao, linux-kernel, virtualization, stefanha

Some comments without arguing about whether the performance
benefit is worth it.

On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index cf8adb1..39d56c4 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -7,6 +7,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/device.h>
>  #include <linux/mod_devicetable.h>
> +#include <linux/dma-direction.h>
>  #include <linux/gfp.h>
>  
>  /**
> @@ -40,6 +41,26 @@ int virtqueue_add_buf(struct virtqueue *vq,
>  		      void *data,
>  		      gfp_t gfp);
>  
> +struct virtqueue_buf {
> +	struct virtqueue *vq;
> +	struct vring_desc *indirect, *tail;

This is wrong: virtio.h does not include virtio_ring.h,
and by design it shouldn't depend on it.

> +	int head;
> +};
> +

Can't we track state internally to the virtqueue?
Exposing it seems to buy us nothing since you can't
call add_buf between start and end anyway.


> +int virtqueue_start_buf(struct virtqueue *_vq,
> +			struct virtqueue_buf *buf,
> +			void *data,
> +			unsigned int count,
> +			unsigned int count_sg,
> +			gfp_t gfp);
> +
> +void virtqueue_add_sg(struct virtqueue_buf *buf,
> +		      struct scatterlist sgl[],
> +		      unsigned int count,
> +		      enum dma_data_direction dir);
> +

An idea: in practice virtio-scsi seems to always call sg_init_one, no?
So how about we pass in void* or something and avoid using sg and count?
This would make it useful for -net BTW.

> +void virtqueue_end_buf(struct virtqueue_buf *buf);
> +
>  void virtqueue_kick(struct virtqueue *vq);
>  
>  bool virtqueue_kick_prepare(struct virtqueue *vq);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
  2012-12-18 12:32   ` Paolo Bonzini
@ 2012-12-18 13:37     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 13:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On Tue, Dec 18, 2012 at 01:32:49PM +0100, Paolo Bonzini wrote:
> Using the new virtio_scsi_add_sg function lets us simplify the queueing
> path.  In particular, all data protected by the tgt_lock is just gone
> (multiqueue will find a new use for the lock).

vq access still needs some protection: virtio is not reentrant
by itself. with tgt_lock gone what protects vq against
concurrent add_buf calls?

> The speedup is relatively small (2-4%) but it is worthwhile because of
> the code simplification---both in this patches and in the next ones.
> 
> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> 	v1->v2: new
> 
>  drivers/scsi/virtio_scsi.c |   94 +++++++++++++++++++------------------------
>  1 files changed, 42 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 74ab67a..2b93b6e 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -59,11 +59,8 @@ struct virtio_scsi_vq {
>  
>  /* Per-target queue state */
>  struct virtio_scsi_target_state {
> -	/* Protects sg.  Lock hierarchy is tgt_lock -> vq_lock.  */
> +	/* Never held at the same time as vq_lock.  */
>  	spinlock_t tgt_lock;
> -
> -	/* For sglist construction when adding commands to the virtqueue.  */
> -	struct scatterlist sg[];
>  };
>  
>  /* Driver instance state */
> @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq)
>  	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
>  };
>  
> -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
> -			     struct scsi_data_buffer *sdb)
> -{
> -	struct sg_table *table = &sdb->table;
> -	struct scatterlist *sg_elem;
> -	unsigned int idx = *p_idx;
> -	int i;
> -
> -	for_each_sg(table->sgl, sg_elem, table->nents, i)
> -		sg[idx++] = *sg_elem;
> -
> -	*p_idx = idx;
> -}
> -
>  /**
> - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
> + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue
>   * @vscsi	: virtio_scsi state
>   * @cmd		: command structure
> - * @out_num	: number of read-only elements
> - * @in_num	: number of write-only elements
>   * @req_size	: size of the request buffer
>   * @resp_size	: size of the response buffer
> - *
> - * Called with tgt_lock held.
> + * @gfp	: flags to use for memory allocations
>   */
> -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
> -			     struct virtio_scsi_cmd *cmd,
> -			     unsigned *out_num, unsigned *in_num,
> -			     size_t req_size, size_t resp_size)
> +static int virtscsi_add_cmd(struct virtqueue *vq,
> +			    struct virtio_scsi_cmd *cmd,
> +			    size_t req_size, size_t resp_size, gfp_t gfp)
>  {
>  	struct scsi_cmnd *sc = cmd->sc;
> -	struct scatterlist *sg = tgt->sg;
> -	unsigned int idx = 0;
> +	struct scatterlist sg;
> +	unsigned int count, count_sg;
> +	struct sg_table *out, *in;
> +	struct virtqueue_buf buf;
> +	int ret;
> +
> +	out = in = NULL;
> +
> +	if (sc && sc->sc_data_direction != DMA_NONE) {
> +		if (sc->sc_data_direction != DMA_FROM_DEVICE)
> +			out = &scsi_out(sc)->table;
> +		if (sc->sc_data_direction != DMA_TO_DEVICE)
> +			in = &scsi_in(sc)->table;
> +	}
> +
> +	count_sg = 2 + (out ? 1 : 0)          + (in ? 1 : 0);
> +	count    = 2 + (out ? out->nents : 0) + (in ? in->nents : 0);
> +	ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp);
> +	if (ret < 0)
> +		return ret;
>  
>  	/* Request header.  */
> -	sg_set_buf(&sg[idx++], &cmd->req, req_size);
> +	sg_init_one(&sg, &cmd->req, req_size);
> +	virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE);
>  
>  	/* Data-out buffer.  */
> -	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
> -		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
> -
> -	*out_num = idx;
> +	if (out)
> +		virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE);
>  
>  	/* Response header.  */
> -	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
> +	sg_init_one(&sg, &cmd->resp, resp_size);
> +	virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE);
>  
>  	/* Data-in buffer */
> -	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
> -		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
> +	if (in)
> +		virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE);
>  
> -	*in_num = idx - *out_num;
> +	virtqueue_end_buf(&buf);
> +	return 0;
>  }
>  
>  static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
> @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>  			     struct virtio_scsi_cmd *cmd,
>  			     size_t req_size, size_t resp_size, gfp_t gfp)
>  {
> -	unsigned int out_num, in_num;
>  	unsigned long flags;
> -	int err;
> +	int ret;
>  	bool needs_kick = false;
>  
> -	spin_lock_irqsave(&tgt->tgt_lock, flags);
> -	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
> -
> -	spin_lock(&vq->vq_lock);
> -	err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
> -	spin_unlock(&tgt->tgt_lock);
> -	if (!err)
> +	spin_lock_irqsave(&vq->vq_lock, flags);
> +	ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp);
> +	if (!ret)
>  		needs_kick = virtqueue_kick_prepare(vq->vq);
>  
>  	spin_unlock_irqrestore(&vq->vq_lock, flags);
>  
>  	if (needs_kick)
>  		virtqueue_notify(vq->vq);
> -	return err;
> +	return ret;
>  }
>  
>  static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> @@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
>  	gfp_t gfp_mask = GFP_KERNEL;
>  
>  	/* We need extra sg elements at head and tail.  */
> -	tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2),
> -		      gfp_mask);
> -
> +	tgt = kmalloc(sizeof(*tgt), gfp_mask);
>  	if (!tgt)
>  		return NULL;
>  
>  	spin_lock_init(&tgt->tgt_lock);
> -	sg_init_table(tgt->sg, sg_elems + 2);
>  	return tgt;
>  }
>  
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
@ 2012-12-18 13:37     ` Michael S. Tsirkin
  0 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 13:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-scsi, kvm, hutao, linux-kernel, virtualization, stefanha

On Tue, Dec 18, 2012 at 01:32:49PM +0100, Paolo Bonzini wrote:
> Using the new virtio_scsi_add_sg function lets us simplify the queueing
> path.  In particular, all data protected by the tgt_lock is just gone
> (multiqueue will find a new use for the lock).

vq access still needs some protection: virtio is not reentrant
by itself. with tgt_lock gone what protects vq against
concurrent add_buf calls?

> The speedup is relatively small (2-4%) but it is worthwhile because of
> the code simplification---both in this patches and in the next ones.
> 
> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> 	v1->v2: new
> 
>  drivers/scsi/virtio_scsi.c |   94 +++++++++++++++++++------------------------
>  1 files changed, 42 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 74ab67a..2b93b6e 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -59,11 +59,8 @@ struct virtio_scsi_vq {
>  
>  /* Per-target queue state */
>  struct virtio_scsi_target_state {
> -	/* Protects sg.  Lock hierarchy is tgt_lock -> vq_lock.  */
> +	/* Never held at the same time as vq_lock.  */
>  	spinlock_t tgt_lock;
> -
> -	/* For sglist construction when adding commands to the virtqueue.  */
> -	struct scatterlist sg[];
>  };
>  
>  /* Driver instance state */
> @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq)
>  	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
>  };
>  
> -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
> -			     struct scsi_data_buffer *sdb)
> -{
> -	struct sg_table *table = &sdb->table;
> -	struct scatterlist *sg_elem;
> -	unsigned int idx = *p_idx;
> -	int i;
> -
> -	for_each_sg(table->sgl, sg_elem, table->nents, i)
> -		sg[idx++] = *sg_elem;
> -
> -	*p_idx = idx;
> -}
> -
>  /**
> - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
> + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue
>   * @vscsi	: virtio_scsi state
>   * @cmd		: command structure
> - * @out_num	: number of read-only elements
> - * @in_num	: number of write-only elements
>   * @req_size	: size of the request buffer
>   * @resp_size	: size of the response buffer
> - *
> - * Called with tgt_lock held.
> + * @gfp	: flags to use for memory allocations
>   */
> -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
> -			     struct virtio_scsi_cmd *cmd,
> -			     unsigned *out_num, unsigned *in_num,
> -			     size_t req_size, size_t resp_size)
> +static int virtscsi_add_cmd(struct virtqueue *vq,
> +			    struct virtio_scsi_cmd *cmd,
> +			    size_t req_size, size_t resp_size, gfp_t gfp)
>  {
>  	struct scsi_cmnd *sc = cmd->sc;
> -	struct scatterlist *sg = tgt->sg;
> -	unsigned int idx = 0;
> +	struct scatterlist sg;
> +	unsigned int count, count_sg;
> +	struct sg_table *out, *in;
> +	struct virtqueue_buf buf;
> +	int ret;
> +
> +	out = in = NULL;
> +
> +	if (sc && sc->sc_data_direction != DMA_NONE) {
> +		if (sc->sc_data_direction != DMA_FROM_DEVICE)
> +			out = &scsi_out(sc)->table;
> +		if (sc->sc_data_direction != DMA_TO_DEVICE)
> +			in = &scsi_in(sc)->table;
> +	}
> +
> +	count_sg = 2 + (out ? 1 : 0)          + (in ? 1 : 0);
> +	count    = 2 + (out ? out->nents : 0) + (in ? in->nents : 0);
> +	ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp);
> +	if (ret < 0)
> +		return ret;
>  
>  	/* Request header.  */
> -	sg_set_buf(&sg[idx++], &cmd->req, req_size);
> +	sg_init_one(&sg, &cmd->req, req_size);
> +	virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE);
>  
>  	/* Data-out buffer.  */
> -	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
> -		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
> -
> -	*out_num = idx;
> +	if (out)
> +		virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE);
>  
>  	/* Response header.  */
> -	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
> +	sg_init_one(&sg, &cmd->resp, resp_size);
> +	virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE);
>  
>  	/* Data-in buffer */
> -	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
> -		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
> +	if (in)
> +		virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE);
>  
> -	*in_num = idx - *out_num;
> +	virtqueue_end_buf(&buf);
> +	return 0;
>  }
>  
>  static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
> @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>  			     struct virtio_scsi_cmd *cmd,
>  			     size_t req_size, size_t resp_size, gfp_t gfp)
>  {
> -	unsigned int out_num, in_num;
>  	unsigned long flags;
> -	int err;
> +	int ret;
>  	bool needs_kick = false;
>  
> -	spin_lock_irqsave(&tgt->tgt_lock, flags);
> -	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
> -
> -	spin_lock(&vq->vq_lock);
> -	err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
> -	spin_unlock(&tgt->tgt_lock);
> -	if (!err)
> +	spin_lock_irqsave(&vq->vq_lock, flags);
> +	ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp);
> +	if (!ret)
>  		needs_kick = virtqueue_kick_prepare(vq->vq);
>  
>  	spin_unlock_irqrestore(&vq->vq_lock, flags);
>  
>  	if (needs_kick)
>  		virtqueue_notify(vq->vq);
> -	return err;
> +	return ret;
>  }
>  
>  static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> @@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
>  	gfp_t gfp_mask = GFP_KERNEL;
>  
>  	/* We need extra sg elements at head and tail.  */
> -	tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2),
> -		      gfp_mask);
> -
> +	tgt = kmalloc(sizeof(*tgt), gfp_mask);
>  	if (!tgt)
>  		return NULL;
>  
>  	spin_lock_init(&tgt->tgt_lock);
> -	sg_init_table(tgt->sg, sg_elems + 2);
>  	return tgt;
>  }
>  
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
  2012-12-18 12:32 ` Paolo Bonzini
@ 2012-12-18 13:42   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 13:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On Tue, Dec 18, 2012 at 01:32:47PM +0100, Paolo Bonzini wrote:
> Hi all,
> 
> this series adds multiqueue support to the virtio-scsi driver, based
> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> algorithm that expects one queue per CPU.  LUNs in the same target always
> use the same queue (so that commands are not reordered); queue switching
> occurs when the request being queued is the only one for the target.
> Also based on Jason's patches, the virtqueue affinity is set so that
> each CPU is associated to one virtqueue.
> 
> I tested the patches with fio, using up to 32 virtio-scsi disks backed
> by tmpfs on the host.  These numbers are with 1 LUN per target.
> 
> FIO configuration
> -----------------
> [global]
> rw=read
> bsrange=4k-64k
> ioengine=libaio
> direct=1
> iodepth=4
> loops=20
> 
> overall bandwidth (MB/s)
> ------------------------
> 
> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
> 1                  540               626                     599
> 2                  795               965                     925
> 4                  997              1376                    1500
> 8                 1136              2130                    2060
> 16                1440              2269                    2474
> 24                1408              2179                    2436
> 32                1515              1978                    2319
> 
> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
> more VCPUs is very limited).
> 
> avg bandwidth per LUN (MB/s)
> ----------------------------
> 
> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
> 1                  540               626                     599
> 2                  397               482                     462
> 4                  249               344                     375
> 8                  142               266                     257
> 16                  90               141                     154
> 24                  58                90                     101
> 32                  47                61                      72


Could you please try and measure host CPU utilization?
Without this data it is possible that your host
is undersubscribed and you are simply burning more host CPU.

Another thing to note is that at the moment you might need to
test with idle=poll on the host; otherwise there is a strange interaction
with power management, where reducing the overhead lets the CPU switch
to a lower-power state and so gives you worse IOPS.


> Patch 1 adds a new API to add functions for piecewise addition for buffers,
> which enables various simplifications in virtio-scsi (patches 2-3) and a
> small performance improvement of 2-6%.  Patches 4 and 5 add multiqueuing.
> 
> I'm mostly looking for comments on the new API of patch 1 for inclusion
> into the 3.9 kernel.
> 
> Thanks to Wanlong Gao for help rebasing and benchmarking these patches.
> 
> Paolo Bonzini (5):
>   virtio: add functions for piecewise addition of buffers
>   virtio-scsi: use functions for piecewise composition of buffers
>   virtio-scsi: redo allocation of target data
>   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
>   virtio-scsi: introduce multiqueue support
> 
>  drivers/scsi/virtio_scsi.c   |  374 +++++++++++++++++++++++++++++-------------
>  drivers/virtio/virtio_ring.c |  205 ++++++++++++++++++++++++
>  include/linux/virtio.h       |   21 +++
>  3 files changed, 485 insertions(+), 115 deletions(-)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 13:36     ` Michael S. Tsirkin
@ 2012-12-18 13:43       ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 13:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On 18/12/2012 14:36, Michael S. Tsirkin wrote:
> Some comments without arguing about whether the performance
> benefit is worth it.
> 
> On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>> index cf8adb1..39d56c4 100644
>> --- a/include/linux/virtio.h
>> +++ b/include/linux/virtio.h
>> @@ -7,6 +7,7 @@
>>  #include <linux/spinlock.h>
>>  #include <linux/device.h>
>>  #include <linux/mod_devicetable.h>
>> +#include <linux/dma-direction.h>
>>  #include <linux/gfp.h>
>>  
>>  /**
>> @@ -40,6 +41,26 @@ int virtqueue_add_buf(struct virtqueue *vq,
>>  		      void *data,
>>  		      gfp_t gfp);
>>  
>> +struct virtqueue_buf {
>> +	struct virtqueue *vq;
>> +	struct vring_desc *indirect, *tail;
> 
> This is wrong: virtio.h does not include virtio_ring.h,
> and it shouldn't by design depend on it.
> 
>> +	int head;
>> +};
>> +
> 
> Can't we track state internally to the virtqueue?
> Exposing it seems to buy us nothing since you can't
> call add_buf between start and end anyway.

I wanted to keep the state for these functions separate from the rest.
I don't think it makes much sense to move it to struct virtqueue unless
virtqueue_add_buf is converted to use the new API (doesn't make much
sense, could even be a tad slower).

On the other hand moving it there would eliminate the dependency on
virtio_ring.h.  Rusty, what do you think?

>> +int virtqueue_start_buf(struct virtqueue *_vq,
>> +			struct virtqueue_buf *buf,
>> +			void *data,
>> +			unsigned int count,
>> +			unsigned int count_sg,
>> +			gfp_t gfp);
>> +
>> +void virtqueue_add_sg(struct virtqueue_buf *buf,
>> +		      struct scatterlist sgl[],
>> +		      unsigned int count,
>> +		      enum dma_data_direction dir);
>> +
> 
> And idea: in practice virtio scsi seems to always call sg_init_one, no?
> So how about we pass in void* or something and avoid using sg and count?
> This would make it useful for -net BTW.

It also passes the scatterlist from the LLD.  It calls sg_init_one for
the request/response headers.
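
To make the calling convention concrete, here is a rough sketch of how a
data-out command could be queued with the proposed API, using only the
prototypes quoted above.  The reading of count as "total buffers" and
count_sg as "number of virtqueue_add_sg calls" is an assumption, and the
surrounding variables (vq, cmd, sc) are illustrative:

	struct virtqueue_buf vbuf;
	struct scatterlist hdr;
	int err;

	/* assumed: count = total buffers, count_sg = add_sg calls to follow */
	err = virtqueue_start_buf(vq, &vbuf, cmd,
				  2 + scsi_sg_count(sc), 3, GFP_ATOMIC);
	if (err)
		return err;

	/* request header, built with sg_init_one */
	sg_init_one(&hdr, &cmd->req.cmd, sizeof(cmd->req.cmd));
	virtqueue_add_sg(&vbuf, &hdr, 1, DMA_TO_DEVICE);

	/* data payload: the scatterlist handed down by the LLD */
	virtqueue_add_sg(&vbuf, scsi_sglist(sc), scsi_sg_count(sc),
			 DMA_TO_DEVICE);

	/* response header, also built with sg_init_one */
	sg_init_one(&hdr, &cmd->resp.cmd, sizeof(cmd->resp.cmd));
	virtqueue_add_sg(&vbuf, &hdr, 1, DMA_FROM_DEVICE);

	virtqueue_end_buf(&vbuf);
	virtqueue_kick(vq);

For a read, the data payload would instead follow the response header and
be added with DMA_FROM_DEVICE.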

Paolo

>> +void virtqueue_end_buf(struct virtqueue_buf *buf);
>> +
>>  void virtqueue_kick(struct virtqueue *vq);
>>  
>>  bool virtqueue_kick_prepare(struct virtqueue *vq);
>> -- 
>> 1.7.1
>>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 12:32 ` [PATCH v2 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
@ 2012-12-18 13:57     ` Michael S. Tsirkin
  2012-12-19 11:27   ` Stefan Hajnoczi
  2 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 13:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On Tue, Dec 18, 2012 at 01:32:52PM +0100, Paolo Bonzini wrote:
> This patch adds queue steering to virtio-scsi.  When a target is sent
> multiple requests, we always drive them to the same queue so that FIFO
> processing order is kept.  However, if a target was idle, we can choose
> a queue arbitrarily.  In this case the queue is chosen according to the
> current VCPU, so the driver expects the number of request queues to be
> equal to the number of VCPUs.  This makes it easy and fast to select
> the queue, and also lets the driver optimize the IRQ affinity for the
> virtqueues (each virtqueue's affinity is set to the CPU that "owns"
> the queue).
> 
> The speedup comes from improving cache locality and giving CPU affinity
> to the virtqueues, which is why this scheme was selected.  Assuming that
> the thread that is sending requests to the device is I/O-bound, it is
> likely to be sleeping at the time the ISR is executed, and thus executing
> the ISR on the same processor that sent the requests is cheap.
> 
> However, the kernel will not execute the ISR on the "best" processor
> unless you explicitly set the affinity.  This is because in practice
> you will have many such I/O-bound processes and thus many otherwise
> idle processors.  Then the kernel will execute the ISR on a random
> processor, rather than the one that is sending requests to the device.
> 
> The alternative to per-CPU virtqueues is per-target virtqueues.  To
> achieve the same locality, we could dynamically choose the virtqueue's
> affinity based on the CPU of the last task that sent a request.  This
> is less appealing because we do not set the affinity directly---we only
> provide a hint to the irqbalanced running in userspace.  Dynamically
> changing the affinity only works if the userspace applies the hint
> fast enough.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> 	v1->v2: improved comments and commit messages, added memory barriers
> 
>  drivers/scsi/virtio_scsi.c |  234 +++++++++++++++++++++++++++++++++++++------
>  1 files changed, 201 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 4f6c6a3..ca9d29d 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -26,6 +26,7 @@
>  
>  #define VIRTIO_SCSI_MEMPOOL_SZ 64
>  #define VIRTIO_SCSI_EVENT_LEN 8
> +#define VIRTIO_SCSI_VQ_BASE 2
>  
>  /* Command queue element */
>  struct virtio_scsi_cmd {
> @@ -57,24 +58,57 @@ struct virtio_scsi_vq {
>  	struct virtqueue *vq;
>  };
>  
> -/* Per-target queue state */
> +/*
> + * Per-target queue state.
> + *
> + * This struct holds the data needed by the queue steering policy.  When a
> + * target is sent multiple requests, we need to drive them to the same queue so
> + * that FIFO processing order is kept.  However, if a target was idle, we can
> + * choose a queue arbitrarily.  In this case the queue is chosen according to
> + * the current VCPU, so the driver expects the number of request queues to be
> + * equal to the number of VCPUs.  This makes it easy and fast to select the
> + * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
> + * (each virtqueue's affinity is set to the CPU that "owns" the queue).
> + *
> + * An interesting effect of this policy is that only writes to req_vq need to
> + * take the tgt_lock.  Read can be done outside the lock because:
> + *
> + * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1.
> + *   In that case, no other CPU is reading req_vq: even if they were in
> + *   virtscsi_queuecommand_multi, they would be spinning on tgt_lock.
> + *
> + * - reads of req_vq only occur when the target is not idle (reqs != 0).
> + *   A CPU that enters virtscsi_queuecommand_multi will not modify req_vq.
> + *
> + * Similarly, decrements of reqs are never concurrent with writes of req_vq.
> + * Thus they can happen outside the tgt_lock, provided of course we make reqs
> + * an atomic_t.
> + */
>  struct virtio_scsi_target_state {
> -	/* Never held at the same time as vq_lock.  */
> +	/* This spinlock never held at the same time as vq_lock.  */
>  	spinlock_t tgt_lock;
> +
> +	/* Count of outstanding requests.  */
> +	atomic_t reqs;
> +
> +	/* Currently active virtqueue for requests sent to this target.  */
> +	struct virtio_scsi_vq *req_vq;
>  };
>  
>  /* Driver instance state */
>  struct virtio_scsi {
>  	struct virtio_device *vdev;
>  
> -	struct virtio_scsi_vq ctrl_vq;
> -	struct virtio_scsi_vq event_vq;
> -	struct virtio_scsi_vq req_vq;
> -
>  	/* Get some buffers ready for event vq */
>  	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
>  
>  	struct virtio_scsi_target_state *tgt;
> +
> +	u32 num_queues;
> +
> +	struct virtio_scsi_vq ctrl_vq;
> +	struct virtio_scsi_vq event_vq;
> +	struct virtio_scsi_vq req_vqs[];
>  };
>  
>  static struct kmem_cache *virtscsi_cmd_cache;
> @@ -109,6 +143,7 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
>  	struct virtio_scsi_cmd *cmd = buf;
>  	struct scsi_cmnd *sc = cmd->sc;
>  	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>  
>  	dev_dbg(&sc->device->sdev_gendev,
>  		"cmd %p response %u status %#02x sense_len %u\n",
> @@ -163,6 +198,8 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
>  
>  	mempool_free(cmd, virtscsi_cmd_pool);
>  	sc->scsi_done(sc);
> +
> +	atomic_dec(&tgt->reqs);
>  }
>  
>  static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq,
> @@ -182,11 +219,45 @@ static void virtscsi_req_done(struct virtqueue *vq)
>  {
>  	struct Scsi_Host *sh = virtio_scsi_host(vq->vdev);
>  	struct virtio_scsi *vscsi = shost_priv(sh);
> +	int index = vq->index - VIRTIO_SCSI_VQ_BASE;
> +	struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags);
> +	/*
> +	 * Read req_vq before decrementing the reqs field in
> +	 * virtscsi_complete_cmd.
> +	 *
> +	 * With barriers:
> +	 *
> +	 * 	CPU #0			virtscsi_queuecommand_multi (CPU #1)
> +	 * 	------------------------------------------------------------
> +	 * 	lock vq_lock
> +	 * 	read req_vq
> +	 * 	read reqs (reqs = 1)
> +	 * 	write reqs (reqs = 0)
> +	 * 				increment reqs (reqs = 1)
> +	 * 				write req_vq
> +	 *
> +	 * Possible reordering without barriers:
> +	 *
> +	 * 	CPU #0			virtscsi_queuecommand_multi (CPU #1)
> +	 * 	------------------------------------------------------------
> +	 * 	lock vq_lock
> +	 * 	read reqs (reqs = 1)
> +	 * 	write reqs (reqs = 0)
> +	 * 				increment reqs (reqs = 1)
> +	 * 				write req_vq
> +	 * 	read (wrong) req_vq
> +	 *
> +	 * We do not need a full smp_rmb, because req_vq is required to get
> +	 * to tgt->reqs: tgt is &vscsi->tgt[sc->device->id], where sc is stored
> +	 * in the virtqueue as the user token.
> +	 */
> +	smp_read_barrier_depends();
> +
> +	spin_lock_irqsave(&req_vq->vq_lock, flags);
>  	virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd);
> -	spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags);
> +	spin_unlock_irqrestore(&req_vq->vq_lock, flags);
>  };
>  
>  static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
> @@ -424,11 +495,12 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>  	return ret;
>  }
>  
> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
> +				 struct virtio_scsi_target_state *tgt,
> +				 struct scsi_cmnd *sc)
>  {
> -	struct virtio_scsi *vscsi = shost_priv(sh);
> -	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>  	struct virtio_scsi_cmd *cmd;
> +	struct virtio_scsi_vq *req_vq;
>  	int ret;
>  
>  	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
> @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>  	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
>  	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
>  
> -	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
> +	req_vq = ACCESS_ONCE(tgt->req_vq);

This ACCESS_ONCE without a barrier looks strange to me.
Can req_vq change? Needs a comment.

> +	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
>  			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
>  			      GFP_ATOMIC) == 0)
>  		ret = 0;
> @@ -472,6 +545,48 @@ out:
>  	return ret;
>  }
>  
> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
> +					struct scsi_cmnd *sc)
> +{
> +	struct virtio_scsi *vscsi = shost_priv(sh);
> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
> +
> +	atomic_inc(&tgt->reqs);

And here we don't have barrier after atomic? Why? Needs a comment.

> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> +}
> +
> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
> +				       struct scsi_cmnd *sc)
> +{
> +	struct virtio_scsi *vscsi = shost_priv(sh);
> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
> +	unsigned long flags;
> +	u32 queue_num;
> +
> +	/*
> +	 * Using an atomic_t for tgt->reqs lets the virtqueue handler
> +	 * decrement it without taking the spinlock.
> +	 *
> +	 * We still need a critical section to prevent concurrent submissions
> +	 * from picking two different req_vqs.
> +	 */
> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
> +	if (atomic_inc_return(&tgt->reqs) == 1) {
> +		queue_num = smp_processor_id();
> +		while (unlikely(queue_num >= vscsi->num_queues))
> +			queue_num -= vscsi->num_queues;
> +
> +		/*
> +		 * Write reqs before writing req_vq, matching the
> +		 * smp_read_barrier_depends() in virtscsi_req_done.
> +		 */
> +		smp_wmb();
> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
> +	}
> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> +}
> +
>  static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
>  {
>  	DECLARE_COMPLETION_ONSTACK(comp);
> @@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
>  	return virtscsi_tmf(vscsi, cmd);
>  }
>  
> -static struct scsi_host_template virtscsi_host_template = {
> +static struct scsi_host_template virtscsi_host_template_single = {
>  	.module = THIS_MODULE,
>  	.name = "Virtio SCSI HBA",
>  	.proc_name = "virtio_scsi",
> -	.queuecommand = virtscsi_queuecommand,
>  	.this_id = -1,
> +	.queuecommand = virtscsi_queuecommand_single,
> +	.eh_abort_handler = virtscsi_abort,
> +	.eh_device_reset_handler = virtscsi_device_reset,
> +
> +	.can_queue = 1024,
> +	.dma_boundary = UINT_MAX,
> +	.use_clustering = ENABLE_CLUSTERING,
> +};
> +
> +static struct scsi_host_template virtscsi_host_template_multi = {
> +	.module = THIS_MODULE,
> +	.name = "Virtio SCSI HBA",
> +	.proc_name = "virtio_scsi",
> +	.this_id = -1,
> +	.queuecommand = virtscsi_queuecommand_multi,
>  	.eh_abort_handler = virtscsi_abort,
>  	.eh_device_reset_handler = virtscsi_device_reset,
>  
> @@ -572,16 +701,27 @@ static struct scsi_host_template virtscsi_host_template = {
>  				  &__val, sizeof(__val)); \
>  	})
>  
> +
>  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
> -			     struct virtqueue *vq)
> +			     struct virtqueue *vq, bool affinity)
>  {
>  	spin_lock_init(&virtscsi_vq->vq_lock);
>  	virtscsi_vq->vq = vq;
> +	if (affinity)
> +		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);

I've been thinking about how set_affinity
interacts with online/offline CPUs.
Any idea?


>  }
>  
> -static void virtscsi_init_tgt(struct virtio_scsi_target_state *tgt)
> +static void virtscsi_init_tgt(struct virtio_scsi *vscsi, int i)
>  {
> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[i];
>  	spin_lock_init(&tgt->tgt_lock);
> +	atomic_set(&tgt->reqs, 0);
> +
> +	/*
> +	 * The default is unused for multiqueue, but with a single queue
> +	 * or target we use it in virtscsi_queuecommand.
> +	 */
> +	tgt->req_vq = &vscsi->req_vqs[0];
>  }
>  
>  static void virtscsi_scan(struct virtio_device *vdev)
> @@ -609,28 +749,41 @@ static int virtscsi_init(struct virtio_device *vdev,
>  			 struct virtio_scsi *vscsi, int num_targets)
>  {
>  	int err;
> -	struct virtqueue *vqs[3];
>  	u32 i, sg_elems;
> +	u32 num_vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct virtqueue **vqs;
>  
> -	vq_callback_t *callbacks[] = {
> -		virtscsi_ctrl_done,
> -		virtscsi_event_done,
> -		virtscsi_req_done
> -	};
> -	const char *names[] = {
> -		"control",
> -		"event",
> -		"request"
> -	};
> +	num_vqs = vscsi->num_queues + VIRTIO_SCSI_VQ_BASE;
> +	vqs = kmalloc(num_vqs * sizeof(struct virtqueue *), GFP_KERNEL);
> +	callbacks = kmalloc(num_vqs * sizeof(vq_callback_t *), GFP_KERNEL);
> +	names = kmalloc(num_vqs * sizeof(char *), GFP_KERNEL);
> +
> +	if (!callbacks || !vqs || !names) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	callbacks[0] = virtscsi_ctrl_done;
> +	callbacks[1] = virtscsi_event_done;
> +	names[0] = "control";
> +	names[1] = "event";
> +	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++) {
> +		callbacks[i] = virtscsi_req_done;
> +		names[i] = "request";
> +	}
>  
>  	/* Discover virtqueues and write information to configuration.  */
> -	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
> +	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
>  	if (err)
>  		return err;
>  
> -	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
> -	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
> -	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
> +	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
> +	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
> +	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
> +		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
> +				 vqs[i], vscsi->num_queues > 1);

So affinity is true if >1 vq? I am guessing this is not
going to do the right thing unless you have at least
as many vqs as CPUs.
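
For example, with 8 VCPUs but only 4 request vqs the selection loop in
virtscsi_queuecommand_multi quoted earlier reduces to
queue_num = smp_processor_id() % num_queues, so requests from CPUs 4-7 land
on vqs whose IRQ affinity was set to CPUs 0-3.  A trivial user-space check
of that wrap-around (illustrative only):

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned int num_queues = 4;	/* fewer request vqs than CPUs */
	unsigned int cpu;

	for (cpu = 0; cpu < 8; cpu++) {
		unsigned int queue_num = cpu;

		/* the selection loop from virtscsi_queuecommand_multi */
		while (queue_num >= num_queues)
			queue_num -= num_queues;

		assert(queue_num == cpu % num_queues);
		printf("submitting cpu %u -> request vq %u (vq affinity: cpu %u)\n",
		       cpu, queue_num, queue_num);
	}
	return 0;
}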

>  
>  	virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE);
>  	virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE);
> @@ -647,11 +800,14 @@ static int virtscsi_init(struct virtio_device *vdev,
>  		goto out;
>  	}
>  	for (i = 0; i < num_targets; i++)
> -		virtscsi_init_tgt(&vscsi->tgt[i]);
> +		virtscsi_init_tgt(vscsi, i);
>  
>  	err = 0;
>  
>  out:
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	if (err)
>  		virtscsi_remove_vqs(vdev);
>  	return err;
> @@ -664,11 +820,22 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
>  	int err;
>  	u32 sg_elems, num_targets;
>  	u32 cmd_per_lun;
> +	u32 num_queues;
> +	struct scsi_host_template *hostt;
> +
> +	/* We need to know how many queues before we allocate.  */
> +	num_queues = virtscsi_config_get(vdev, num_queues) ?: 1;
>  
>  	/* Allocate memory and link the structs together.  */
>  	num_targets = virtscsi_config_get(vdev, max_target) + 1;
> -	shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi));
>  
> +	if (num_queues == 1)
> +		hostt = &virtscsi_host_template_single;
> +	else
> +		hostt = &virtscsi_host_template_multi;
> +
> +	shost = scsi_host_alloc(hostt,
> +		sizeof(*vscsi) + sizeof(vscsi->req_vqs[0]) * num_queues);
>  	if (!shost)
>  		return -ENOMEM;
>  
> @@ -676,6 +843,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
>  	shost->sg_tablesize = sg_elems;
>  	vscsi = shost_priv(shost);
>  	vscsi->vdev = vdev;
> +	vscsi->num_queues = num_queues;
>  	vdev->priv = shost;
>  
>  	err = virtscsi_init(vdev, vscsi, num_targets);
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 13:43       ` Paolo Bonzini
@ 2012-12-18 13:59         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 13:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On Tue, Dec 18, 2012 at 02:43:51PM +0100, Paolo Bonzini wrote:
> On 18/12/2012 14:36, Michael S. Tsirkin wrote:
> > Some comments without arguing about whether the performance
> > benefit is worth it.
> > 
> > On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
> >> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> >> index cf8adb1..39d56c4 100644
> >> --- a/include/linux/virtio.h
> >> +++ b/include/linux/virtio.h
> >> @@ -7,6 +7,7 @@
> >>  #include <linux/spinlock.h>
> >>  #include <linux/device.h>
> >>  #include <linux/mod_devicetable.h>
> >> +#include <linux/dma-direction.h>
> >>  #include <linux/gfp.h>
> >>  
> >>  /**
> >> @@ -40,6 +41,26 @@ int virtqueue_add_buf(struct virtqueue *vq,
> >>  		      void *data,
> >>  		      gfp_t gfp);
> >>  
> >> +struct virtqueue_buf {
> >> +	struct virtqueue *vq;
> >> +	struct vring_desc *indirect, *tail;
> > 
> > This is wrong: virtio.h does not include virtio_ring.h,
> > and it shouldn't by design depend on it.
> > 
> >> +	int head;
> >> +};
> >> +
> > 
> > Can't we track state internally to the virtqueue?
> > Exposing it seems to buy us nothing since you can't
> > call add_buf between start and end anyway.
> 
> I wanted to keep the state for these functions separate from the rest.
> I don't think it makes much sense to move it to struct virtqueue unless
> virtqueue_add_buf is converted to use the new API (doesn't make much
> sense, could even be a tad slower).

Why would it be slower?

> On the other hand moving it there would eliminate the dependency on
> virtio_ring.h.  Rusty, what do you think?
> 
> >> +int virtqueue_start_buf(struct virtqueue *_vq,
> >> +			struct virtqueue_buf *buf,
> >> +			void *data,
> >> +			unsigned int count,
> >> +			unsigned int count_sg,
> >> +			gfp_t gfp);
> >> +
> >> +void virtqueue_add_sg(struct virtqueue_buf *buf,
> >> +		      struct scatterlist sgl[],
> >> +		      unsigned int count,
> >> +		      enum dma_data_direction dir);
> >> +
> > 
> > And idea: in practice virtio scsi seems to always call sg_init_one, no?
> > So how about we pass in void* or something and avoid using sg and count?
> > This would make it useful for -net BTW.
> 
> It also passes the scatterlist from the LLD.  It calls sg_init_one for
> the request/response headers.
> 
> Paolo

Try adding a _single variant. You might see unrolling a loop
gives more of a benefit than this whole optimization.
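
One possible shape for such a variant, sketched purely as a wrapper over the
API from patch 1 (the name is hypothetical, and the speedup hinted at above
would really come from open-coding the single-descriptor case inside
virtio_ring.c rather than going through the generic scatterlist loop):

/* Hypothetical helper, not part of the posted series: queue one linear
 * buffer without making the caller build a scatterlist. */
static inline void virtqueue_add_single(struct virtqueue_buf *buf,
					void *addr, unsigned int len,
					enum dma_data_direction dir)
{
	struct scatterlist sg;

	sg_init_one(&sg, addr, len);
	virtqueue_add_sg(buf, &sg, 1, dir);
}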

> >> +void virtqueue_end_buf(struct virtqueue_buf *buf);
> >> +
> >>  void virtqueue_kick(struct virtqueue *vq);
> >>  
> >>  bool virtqueue_kick_prepare(struct virtqueue *vq);
> >> -- 
> >> 1.7.1
> >>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 13:57     ` Michael S. Tsirkin
@ 2012-12-18 14:08       ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 14:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On 18/12/2012 14:57, Michael S. Tsirkin wrote:
>> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
>> +				 struct virtio_scsi_target_state *tgt,
>> +				 struct scsi_cmnd *sc)
>>  {
>> -	struct virtio_scsi *vscsi = shost_priv(sh);
>> -	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>>  	struct virtio_scsi_cmd *cmd;
>> +	struct virtio_scsi_vq *req_vq;
>>  	int ret;
>>  
>>  	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
>> @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>>  	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
>>  	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
>>  
>> -	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
>> +	req_vq = ACCESS_ONCE(tgt->req_vq);
> 
> This ACCESS_ONCE without a barrier looks strange to me.
> Can req_vq change? Needs a comment.

Barriers are needed to order two things.  Here I don't have the second thing
to order against, hence no barrier.

Accessing req_vq lockless is safe, and there's a comment about it, but you
still want ACCESS_ONCE to ensure the compiler doesn't play tricks.  It
shouldn't be necessary, because the critical section of
virtscsi_queuecommand_multi will already include the appropriate
compiler barriers, but it is actually clearer this way to me. :)
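
For reference, the macro in question is just a volatile cast (from
include/linux/compiler.h), so its only effect here is to force a single
load of tgt->req_vq:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

/* Without the volatile cast the compiler would in principle be free to
 * re-read tgt->req_vq later (e.g. if virtscsi_kick_cmd is inlined) and
 * observe a value written by a concurrent virtscsi_queuecommand_multi;
 * the cast pins the value to the one read done at the assignment. */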

>> +	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
>>  			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
>>  			      GFP_ATOMIC) == 0)
>>  		ret = 0;
>> @@ -472,6 +545,48 @@ out:
>>  	return ret;
>>  }
>>  
>> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
>> +					struct scsi_cmnd *sc)
>> +{
>> +	struct virtio_scsi *vscsi = shost_priv(sh);
>> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>> +
>> +	atomic_inc(&tgt->reqs);
> 
> And here we don't have barrier after atomic? Why? Needs a comment.

Because we don't write req_vq here, there are no two writes to order.  Barrier
against what?

>> +	return virtscsi_queuecommand(vscsi, tgt, sc);
>> +}
>> +
>> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
>> +				       struct scsi_cmnd *sc)
>> +{
>> +	struct virtio_scsi *vscsi = shost_priv(sh);
>> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>> +	unsigned long flags;
>> +	u32 queue_num;
>> +
>> +	/*
>> +	 * Using an atomic_t for tgt->reqs lets the virtqueue handler
>> +	 * decrement it without taking the spinlock.
>> +	 *
>> +	 * We still need a critical section to prevent concurrent submissions
>> +	 * from picking two different req_vqs.
>> +	 */
>> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
>> +	if (atomic_inc_return(&tgt->reqs) == 1) {
>> +		queue_num = smp_processor_id();
>> +		while (unlikely(queue_num >= vscsi->num_queues))
>> +			queue_num -= vscsi->num_queues;
>> +
>> +		/*
>> +		 * Write reqs before writing req_vq, matching the
>> +		 * smp_read_barrier_depends() in virtscsi_req_done.
>> +		 */
>> +		smp_wmb();
>> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
>> +	}
>> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
>> +	return virtscsi_queuecommand(vscsi, tgt, sc);
>> +}
>> +
>>  static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
>>  {
>>  	DECLARE_COMPLETION_ONSTACK(comp);
>> @@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
>>  	return virtscsi_tmf(vscsi, cmd);
>>  }
>>  
>> -static struct scsi_host_template virtscsi_host_template = {
>> +static struct scsi_host_template virtscsi_host_template_single = {
>>  	.module = THIS_MODULE,
>>  	.name = "Virtio SCSI HBA",
>>  	.proc_name = "virtio_scsi",
>> -	.queuecommand = virtscsi_queuecommand,
>>  	.this_id = -1,
>> +	.queuecommand = virtscsi_queuecommand_single,
>> +	.eh_abort_handler = virtscsi_abort,
>> +	.eh_device_reset_handler = virtscsi_device_reset,
>> +
>> +	.can_queue = 1024,
>> +	.dma_boundary = UINT_MAX,
>> +	.use_clustering = ENABLE_CLUSTERING,
>> +};
>> +
>> +static struct scsi_host_template virtscsi_host_template_multi = {
>> +	.module = THIS_MODULE,
>> +	.name = "Virtio SCSI HBA",
>> +	.proc_name = "virtio_scsi",
>> +	.this_id = -1,
>> +	.queuecommand = virtscsi_queuecommand_multi,
>>  	.eh_abort_handler = virtscsi_abort,
>>  	.eh_device_reset_handler = virtscsi_device_reset,
>>  
>> @@ -572,16 +701,27 @@ static struct scsi_host_template virtscsi_host_template = {
>>  				  &__val, sizeof(__val)); \
>>  	})
>>  
>> +
>>  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
>> -			     struct virtqueue *vq)
>> +			     struct virtqueue *vq, bool affinity)
>>  {
>>  	spin_lock_init(&virtscsi_vq->vq_lock);
>>  	virtscsi_vq->vq = vq;
>> +	if (affinity)
>> +		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);
> 
> I've been thinking about how set_affinity
> interacts with online/offline CPUs.
> Any idea?

No, I haven't tried.

>>  
>>  	/* Discover virtqueues and write information to configuration.  */
>> -	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
>> +	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
>>  	if (err)
>>  		return err;
>>  
>> -	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
>> -	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
>> -	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
>> +	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
>> +	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
>> +	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
>> +		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
>> +				 vqs[i], vscsi->num_queues > 1);
> 
> So affinity is true if >1 vq? I am guessing this is not
> going to do the right thing unless you have at least
> as many vqs as CPUs.

Yes, and then you're not setting up the thing correctly.

Isn't the same thing true for virtio-net mq?

Paolo

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 13:59         ` Michael S. Tsirkin
@ 2012-12-18 14:32           ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 14:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

Il 18/12/2012 14:59, Michael S. Tsirkin ha scritto:
>>> Can't we track state internally to the virtqueue? Exposing it
>>> seems to buy us nothing since you can't call add_buf between
>>> start and end anyway.
>> 
>> I wanted to keep the state for these functions separate from the
>> rest. I don't think it makes much sense to move it to struct
>> virtqueue unless virtqueue_add_buf is converted to use the new API
>> (doesn't make much sense, could even be a tad slower).
> 
> Why would it be slower?

virtqueue_add_buf could be slower if it used the new API.  That's
because of the overhead of writing to and reading from struct
virtqueue_buf, instead of keeping the values in registers.

>> On the other hand moving it there would eliminate the dependency
>> on virtio_ring.h.  Rusty, what do you think?
>> 
>>> And idea: in practice virtio scsi seems to always call
>>> sg_init_one, no? So how about we pass in void* or something and
>>> avoid using sg and count? This would make it useful for -net
>>> BTW.
>> 
>> It also passes the scatterlist from the LLD.  It calls sg_init_one
>> for the request/response headers.
> 
> Try adding a _single variant. You might see unrolling a loop gives
> more of a benefit than this whole optimization.

Makes sense, I'll try.  However, note that I *do* need the
infrastructure in this patch because virtio-scsi could never use a
hypothetical virtqueue_add_buf_single; requests always have at least 2
buffers for the headers.

However I could add virtqueue_add_sg_single and use it for those
headers.  The I/O buffer can keep using virtqueue_add_sg.
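
Very roughly, and purely as an illustration -- virtqueue_add_sg_single
does not exist yet, and every name and signature below (including the
vbuf state and the direction handling, which I omit) is a guess, not
something from this series:

	/* hypothetical sketch: _single for the two headers, plain sg for data */
	virtqueue_add_sg_single(&vbuf, &cmd->req.cmd, sizeof cmd->req.cmd);
	virtqueue_add_sg(&vbuf, data_sg, data_sg_count);	/* I/O buffer, unchanged */
	virtqueue_add_sg_single(&vbuf, &cmd->resp.cmd, sizeof cmd->resp.cmd);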

Paolo

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 14:08       ` Paolo Bonzini
@ 2012-12-18 15:03         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 15:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On Tue, Dec 18, 2012 at 03:08:08PM +0100, Paolo Bonzini wrote:
> Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
> >> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> >> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
> >> +				 struct virtio_scsi_target_state *tgt,
> >> +				 struct scsi_cmnd *sc)
> >>  {
> >> -	struct virtio_scsi *vscsi = shost_priv(sh);
> >> -	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
> >>  	struct virtio_scsi_cmd *cmd;
> >> +	struct virtio_scsi_vq *req_vq;
> >>  	int ret;
> >>  
> >>  	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
> >> @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> >>  	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
> >>  	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
> >>  
> >> -	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
> >> +	req_vq = ACCESS_ONCE(tgt->req_vq);
> > 
> > This ACCESS_ONCE without a barrier looks strange to me.
> > Can req_vq change? Needs a comment.
> 
> Barriers are needed to order two things.  Here I don't have the second thing
> to order against, hence no barrier.
> 
> Accessing req_vq lockless is safe, and there's a comment about it, but you
> still want ACCESS_ONCE to ensure the compiler doesn't play tricks.

That's just it.
Why don't you want the compiler to play tricks?

ACCESS_ONCE is needed if the value can change
while you access it; it helps ensure
a consistent value is evaluated.

If it can change, you almost always need a barrier.  If it can't,
you don't need ACCESS_ONCE.

>  It
> shouldn't be necessary, because the critical section of
> virtscsi_queuecommand_multi will already include the appropriate
> compiler barriers,

So if there's a barrier then pls add a comment saying where
it is.

> but it is actually clearer this way to me. :)

No barriers are needed I think, because by the time you queue the
command reqs has already been incremented, so req_vq can not change.
But this also means ACCESS_ONCE is not needed either.

> >> +	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
> >>  			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
> >>  			      GFP_ATOMIC) == 0)
> >>  		ret = 0;
> >> @@ -472,6 +545,48 @@ out:
> >>  	return ret;
> >>  }
> >>  
> >> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
> >> +					struct scsi_cmnd *sc)
> >> +{
> >> +	struct virtio_scsi *vscsi = shost_priv(sh);
> >> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
> >> +
> >> +	atomic_inc(&tgt->reqs);
> > 
> > And here we don't have barrier after atomic? Why? Needs a comment.
> 
> Because we don't write req_vq, so there's no two writes to order.  Barrier
> against what?

Between the atomic update and the command.  Once you queue the command
it can complete and decrement reqs; if this happens before the
increment, reqs can even become negative.

> >> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> >> +}
> >> +
> >> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
> >> +				       struct scsi_cmnd *sc)
> >> +{
> >> +	struct virtio_scsi *vscsi = shost_priv(sh);
> >> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
> >> +	unsigned long flags;
> >> +	u32 queue_num;
> >> +
> >> +	/*
> >> +	 * Using an atomic_t for tgt->reqs lets the virtqueue handler
> >> +	 * decrement it without taking the spinlock.
> >> +	 *
> >> +	 * We still need a critical section to prevent concurrent submissions
> >> +	 * from picking two different req_vqs.
> >> +	 */
> >> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
> >> +	if (atomic_inc_return(&tgt->reqs) == 1) {
> >> +		queue_num = smp_processor_id();
> >> +		while (unlikely(queue_num >= vscsi->num_queues))
> >> +			queue_num -= vscsi->num_queues;
> >> +
> >> +		/*
> >> +		 * Write reqs before writing req_vq, matching the
> >> +		 * smp_read_barrier_depends() in virtscsi_req_done.
> >> +		 */
> >> +		smp_wmb();
> >> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
> >> +	}
> >> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> >> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> >> +}
> >> +
> >>  static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
> >>  {
> >>  	DECLARE_COMPLETION_ONSTACK(comp);
> >> @@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
> >>  	return virtscsi_tmf(vscsi, cmd);
> >>  }
> >>  
> >> -static struct scsi_host_template virtscsi_host_template = {
> >> +static struct scsi_host_template virtscsi_host_template_single = {
> >>  	.module = THIS_MODULE,
> >>  	.name = "Virtio SCSI HBA",
> >>  	.proc_name = "virtio_scsi",
> >> -	.queuecommand = virtscsi_queuecommand,
> >>  	.this_id = -1,
> >> +	.queuecommand = virtscsi_queuecommand_single,
> >> +	.eh_abort_handler = virtscsi_abort,
> >> +	.eh_device_reset_handler = virtscsi_device_reset,
> >> +
> >> +	.can_queue = 1024,
> >> +	.dma_boundary = UINT_MAX,
> >> +	.use_clustering = ENABLE_CLUSTERING,
> >> +};
> >> +
> >> +static struct scsi_host_template virtscsi_host_template_multi = {
> >> +	.module = THIS_MODULE,
> >> +	.name = "Virtio SCSI HBA",
> >> +	.proc_name = "virtio_scsi",
> >> +	.this_id = -1,
> >> +	.queuecommand = virtscsi_queuecommand_multi,
> >>  	.eh_abort_handler = virtscsi_abort,
> >>  	.eh_device_reset_handler = virtscsi_device_reset,
> >>  
> >> @@ -572,16 +701,27 @@ static struct scsi_host_template virtscsi_host_template = {
> >>  				  &__val, sizeof(__val)); \
> >>  	})
> >>  
> >> +
> >>  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
> >> -			     struct virtqueue *vq)
> >> +			     struct virtqueue *vq, bool affinity)
> >>  {
> >>  	spin_lock_init(&virtscsi_vq->vq_lock);
> >>  	virtscsi_vq->vq = vq;
> >> +	if (affinity)
> >> +		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);
> > 
> > I've been thinking about how set_affinity
> > interacts with online/offline CPUs.
> > Any idea?
> 
> No, I haven't tried.

We need a TODO, for -net too.

> >>  
> >>  	/* Discover virtqueues and write information to configuration.  */
> >> -	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
> >> +	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
> >>  	if (err)
> >>  		return err;
> >>  
> >> -	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
> >> -	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
> >> -	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
> >> +	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
> >> +	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
> >> +	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
> >> +		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
> >> +				 vqs[i], vscsi->num_queues > 1);
> > 
> > So affinity is true if >1 vq? I am guessing this is not
> > going to do the right thing unless you have at least
> > as many vqs as CPUs.
> 
> Yes, and then you're not setting up the thing correctly.

Why not just check instead of doing the wrong thing?

> Isn't the same thing true for virtio-net mq?
> 
> Paolo

Last I looked it checked vi->max_queue_pairs == num_online_cpus().
That is even too aggressive, I think; max_queue_pairs >=
num_online_cpus() should be enough.

-- 
MST

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 14:32           ` Paolo Bonzini
@ 2012-12-18 15:06             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 15:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On Tue, Dec 18, 2012 at 03:32:15PM +0100, Paolo Bonzini wrote:
> Il 18/12/2012 14:59, Michael S. Tsirkin ha scritto:
> >>> Can't we track state internally to the virtqueue? Exposing it
> >>> seems to buy us nothing since you can't call add_buf between
> >>> start and end anyway.
> >> 
> >> I wanted to keep the state for these functions separate from the
> >> rest. I don't think it makes much sense to move it to struct
> >> virtqueue unless virtqueue_add_buf is converted to use the new API
> >> (doesn't make much sense, could even be a tad slower).
> > 
> > Why would it be slower?
> 
> virtqueue_add_buf could be slower if it used the new API.  That's
> because of the overhead of writing and reading from struct
> virtqueue_buf, instead of using variables in registers.

Yes but we'll get rid of virtqueue_buf.

> >> On the other hand moving it there would eliminate the dependency
> >> on virtio_ring.h.  Rusty, what do you think?
> >> 
> >>> And idea: in practice virtio scsi seems to always call
> >>> sg_init_one, no? So how about we pass in void* or something and
> >>> avoid using sg and count? This would make it useful for -net
> >>> BTW.
> >> 
> >> It also passes the scatterlist from the LLD.  It calls sg_init_one
> >> for the request/response headers.
> > 
> > Try adding a _single variant. You might see unrolling a loop gives
> > more of a benefit than this whole optimization.
> 
> Makes sense, I'll try.  However, note that I *do* need the
> infrastructure in this patch because virtio-scsi could never use a
> hypothetical virtqueue_add_buf_single; requests always have at least 2
> buffers for the headers.
> 
> However I could add virtqueue_add_sg_single and use it for those
> headers.

Right.

>  The I/O buffer can keep using virtqueue_add_sg.
> 
> Paolo

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 15:03         ` Michael S. Tsirkin
@ 2012-12-18 15:51           ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-18 15:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

Il 18/12/2012 16:03, Michael S. Tsirkin ha scritto:
> On Tue, Dec 18, 2012 at 03:08:08PM +0100, Paolo Bonzini wrote:
>> Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
>>>> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>>>> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
>>>> +				 struct virtio_scsi_target_state *tgt,
>>>> +				 struct scsi_cmnd *sc)
>>>>  {
>>>> -	struct virtio_scsi *vscsi = shost_priv(sh);
>>>> -	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>>>>  	struct virtio_scsi_cmd *cmd;
>>>> +	struct virtio_scsi_vq *req_vq;
>>>>  	int ret;
>>>>  
>>>>  	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
>>>> @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>>>>  	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
>>>>  	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
>>>>  
>>>> -	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
>>>> +	req_vq = ACCESS_ONCE(tgt->req_vq);
>>>
>>> This ACCESS_ONCE without a barrier looks strange to me.
>>> Can req_vq change? Needs a comment.
>>
>> Barriers are needed to order two things.  Here I don't have the second thing
>> to order against, hence no barrier.
>>
>> Accessing req_vq lockless is safe, and there's a comment about it, but you
>> still want ACCESS_ONCE to ensure the compiler doesn't play tricks.
> 
> That's just it.
> Why don't you want compiler to play tricks?

Because I want the lockless access to occur exactly when I write it.
Otherwise I have one more thing to think about, i.e. what a crazy
compiler writer could do with my code.  And having been on the other
side of the trench, compiler writers can have *really* crazy ideas.

Anyhow, I'll reorganize the code to move the ACCESS_ONCE closer to the
write and make it clearer.
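
To spell out the kind of trick I'm worried about -- this fragment is
only an illustration, not code from the series:

	/* what the patch does: one guaranteed load of tgt->req_vq */
	req_vq = ACCESS_ONCE(tgt->req_vq);
	virtscsi_kick_cmd(tgt, req_vq, cmd,
			  sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
			  GFP_ATOMIC);

	/* with a plain "req_vq = tgt->req_vq" the compiler may discard the
	 * local and re-read tgt->req_vq at each use, so two uses in the
	 * same function could observe two different queues if another CPU
	 * updates the pointer in between.  ACCESS_ONCE rules that out.
	 */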

>>>> +	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
>>>>  			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
>>>>  			      GFP_ATOMIC) == 0)
>>>>  		ret = 0;
>>>> @@ -472,6 +545,48 @@ out:
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
>>>> +					struct scsi_cmnd *sc)
>>>> +{
>>>> +	struct virtio_scsi *vscsi = shost_priv(sh);
>>>> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>>>> +
>>>> +	atomic_inc(&tgt->reqs);
>>>
>>> And here we don't have barrier after atomic? Why? Needs a comment.
>>
>> Because we don't write req_vq, so there's no two writes to order.  Barrier
>> against what?
> 
> Between atomic update and command. Once you queue command it
> can complete and decrement reqs, if this happens before
> increment reqs can become negative even.

This is not a problem.  Please read Documentation/memory-barriers.txt:

   The following also do _not_ imply memory barriers, and so may
   require explicit memory barriers under some circumstances
   (smp_mb__before_atomic_dec() for instance):

        atomic_add();
        atomic_sub();
        atomic_inc();
        atomic_dec();

   If they're used for statistics generation, then they probably don't
   need memory barriers, unless there's a coupling between statistical
   data.

This is the single-queue case, so it falls under this case.

>>>>  	/* Discover virtqueues and write information to configuration.  */
>>>> -	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
>>>> +	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
>>>>  	if (err)
>>>>  		return err;
>>>>  
>>>> -	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
>>>> -	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
>>>> -	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
>>>> +	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
>>>> +	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
>>>> +	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
>>>> +		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
>>>> +				 vqs[i], vscsi->num_queues > 1);
>>>
>>> So affinity is true if >1 vq? I am guessing this is not
>>> going to do the right thing unless you have at least
>>> as many vqs as CPUs.
>>
>> Yes, and then you're not setting up the thing correctly.
> 
> Why not just check instead of doing the wrong thing?

The right thing could be to set the affinity with a stride: with 8 CPUs
and 4 virtqueues, for example, CPUs 0 and 4 would map to virtqueue 0,
and so on up to CPUs 3 and 7 for virtqueue 3.
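
Just to make the mapping concrete -- this helper is hypothetical (the
name is made up) and only restates the modulo steering that
virtscsi_queuecommand_multi already does:

	/* sketch: virtqueue n serves CPUs n, n + num_queues, n + 2*num_queues, ...;
	 * the interrupt affinity would follow the same stride.
	 */
	static u32 virtscsi_vq_for_cpu(struct virtio_scsi *vscsi, u32 cpu)
	{
		return cpu % vscsi->num_queues;
	}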

Paolo

>> Isn't the same thing true for virtio-net mq?
>>
>> Paolo
> 
> Last I looked it checked vi->max_queue_pairs == num_online_cpus().
> This is even too aggressive I think, max_queue_pairs >=
> num_online_cpus() should be enough.
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 15:51           ` Paolo Bonzini
@ 2012-12-18 16:02             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-18 16:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	rusty, asias, stefanha, nab

On Tue, Dec 18, 2012 at 04:51:28PM +0100, Paolo Bonzini wrote:
> Il 18/12/2012 16:03, Michael S. Tsirkin ha scritto:
> > On Tue, Dec 18, 2012 at 03:08:08PM +0100, Paolo Bonzini wrote:
> >> Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
> >>>> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> >>>> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
> >>>> +				 struct virtio_scsi_target_state *tgt,
> >>>> +				 struct scsi_cmnd *sc)
> >>>>  {
> >>>> -	struct virtio_scsi *vscsi = shost_priv(sh);
> >>>> -	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
> >>>>  	struct virtio_scsi_cmd *cmd;
> >>>> +	struct virtio_scsi_vq *req_vq;
> >>>>  	int ret;
> >>>>  
> >>>>  	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
> >>>> @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> >>>>  	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
> >>>>  	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
> >>>>  
> >>>> -	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
> >>>> +	req_vq = ACCESS_ONCE(tgt->req_vq);
> >>>
> >>> This ACCESS_ONCE without a barrier looks strange to me.
> >>> Can req_vq change? Needs a comment.
> >>
> >> Barriers are needed to order two things.  Here I don't have the second thing
> >> to order against, hence no barrier.
> >>
> >> Accessing req_vq lockless is safe, and there's a comment about it, but you
> >> still want ACCESS_ONCE to ensure the compiler doesn't play tricks.
> > 
> > That's just it.
> > Why don't you want compiler to play tricks?
> 
> Because I want the lockless access to occur exactly when I write it.

It doesn't occur when you write it: the CPU can still move accesses
around.  That's why you either need both ACCESS_ONCE and a barrier,
or neither.

> Otherwise I have one more thing to think about, i.e. what a crazy
> compiler writer could do with my code.  And having been on the other
> side of the trench, compiler writers can have *really* crazy ideas.
> 
> Anyhow, I'll reorganize the code to move the ACCESS_ONCE closer to the
> write and make it clearer.
> 
> >>>> +	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
> >>>>  			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
> >>>>  			      GFP_ATOMIC) == 0)
> >>>>  		ret = 0;
> >>>> @@ -472,6 +545,48 @@ out:
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
> >>>> +					struct scsi_cmnd *sc)
> >>>> +{
> >>>> +	struct virtio_scsi *vscsi = shost_priv(sh);
> >>>> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
> >>>> +
> >>>> +	atomic_inc(&tgt->reqs);
> >>>
> >>> And here we don't have barrier after atomic? Why? Needs a comment.
> >>
> >> Because we don't write req_vq, so there's no two writes to order.  Barrier
> >> against what?
> > 
> > Between atomic update and command. Once you queue command it
> > can complete and decrement reqs, if this happens before
> > increment reqs can become negative even.
> 
> This is not a problem.  Please read Documentation/memory-barriers.txt:
> 
>    The following also do _not_ imply memory barriers, and so may
>    require explicit memory barriers under some circumstances
>    (smp_mb__before_atomic_dec() for instance):
> 
>         atomic_add();
>         atomic_sub();
>         atomic_inc();
>         atomic_dec();
> 
>    If they're used for statistics generation, then they probably don't
>    need memory barriers, unless there's a coupling between statistical
>    data.
> 
> This is the single-queue case, so it falls under this case.

Aha, I missed that it's single queue.  Correct, but please add a comment.

> >>>>  	/* Discover virtqueues and write information to configuration.  */
> >>>> -	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
> >>>> +	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
> >>>>  	if (err)
> >>>>  		return err;
> >>>>  
> >>>> -	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
> >>>> -	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
> >>>> -	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
> >>>> +	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
> >>>> +	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
> >>>> +	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
> >>>> +		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
> >>>> +				 vqs[i], vscsi->num_queues > 1);
> >>>
> >>> So affinity is true if >1 vq? I am guessing this is not
> >>> going to do the right thing unless you have at least
> >>> as many vqs as CPUs.
> >>
> >> Yes, and then you're not setting up the thing correctly.
> > 
> > Why not just check instead of doing the wrong thing?
> 
> The right thing could be to set the affinity with a stride, e.g. CPUs
> 0-4 for virtqueue 0 and so on until CPUs 3-7 for virtqueue 3.
> 
> Paolo

I think a simple #vqs == #cpus check would be OK for starters;
otherwise, let userspace set the affinity.
Again, we need to think about what happens with CPU hotplug.
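
For concreteness, a minimal sketch of the kind of check I mean --
whether num_online_cpus() is the right thing (as opposed to possible
CPUs) is exactly the hotplug question above:

	/* sketch only: use per-vq affinity only when each CPU gets its own vq */
	bool affinity = vscsi->num_queues == num_online_cpus();

	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
				 vqs[i], affinity);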

> >> Isn't the same thing true for virtio-net mq?
> >>
> >> Paolo
> > 
> > Last I looked it checked vi->max_queue_pairs == num_online_cpus().
> > This is even too aggressive I think, max_queue_pairs >=
> > num_online_cpus() should be enough.
> > 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
  2012-12-18 12:32 ` Paolo Bonzini
                   ` (7 preceding siblings ...)
  (?)
@ 2012-12-18 22:18 ` Rolf Eike Beer
  2012-12-19  8:52     ` Paolo Bonzini
  -1 siblings, 1 reply; 86+ messages in thread
From: Rolf Eike Beer @ 2012-12-18 22:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	mst, rusty, asias, stefanha, nab

Paolo Bonzini wrote:
> Hi all,
> 
> this series adds multiqueue support to the virtio-scsi driver, based
> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> algorithm that expects one queue per CPU.  LUNs in the same target always
> use the same queue (so that commands are not reordered); queue switching
> occurs when the request being queued is the only one for the target.
> Also based on Jason's patches, the virtqueue affinity is set so that
> each CPU is associated to one virtqueue.
> 
> I tested the patches with fio, using up to 32 virtio-scsi disks backed
> by tmpfs on the host.  These numbers are with 1 LUN per target.
> 
> FIO configuration
> -----------------
> [global]
> rw=read
> bsrange=4k-64k
> ioengine=libaio
> direct=1
> iodepth=4
> loops=20
> 
> overall bandwidth (MB/s)
> ------------------------
> 
> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
> 1                  540               626                     599
> 2                  795               965                     925
> 4                  997              1376                    1500
> 8                 1136              2130                    2060
> 16                1440              2269                    2474
> 24                1408              2179                    2436
> 32                1515              1978                    2319
> 
> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
> more VCPUs is very limited).
> 
> avg bandwidth per LUN (MB/s)
> ----------------------------
> 
> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
> 1                  540               626                     599
> 2                  397               482                     462
> 4                  249               344                     375
> 8                  142               266                     257
> 16                  90               141                     154
> 24                  58                90                     101
> 32                  47                61                      72

Is there an explanation why 8x8 is slower than 4x8 in both cases? 8x1 and 8x2
being slower than 4x1 and 4x2 is more or less expected, but 8x8 loses against
4x8 while 8x4 wins against 4x4 and 8x16 against 4x16.

Eike


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
  2012-12-18 22:18 ` Rolf Eike Beer
@ 2012-12-19  8:52     ` Paolo Bonzini
  0 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-19  8:52 UTC (permalink / raw)
  To: Rolf Eike Beer
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	mst, rusty, asias, stefanha, nab

Il 18/12/2012 23:18, Rolf Eike Beer ha scritto:
> Paolo Bonzini wrote:
>> Hi all,
>>
>> this series adds multiqueue support to the virtio-scsi driver, based
>> on Jason Wang's work on virtio-net.  It uses a simple queue steering
>> algorithm that expects one queue per CPU.  LUNs in the same target always
>> use the same queue (so that commands are not reordered); queue switching
>> occurs when the request being queued is the only one for the target.
>> Also based on Jason's patches, the virtqueue affinity is set so that
>> each CPU is associated to one virtqueue.
>>
>> I tested the patches with fio, using up to 32 virtio-scsi disks backed
>> by tmpfs on the host.  These numbers are with 1 LUN per target.
>>
>> FIO configuration
>> -----------------
>> [global]
>> rw=read
>> bsrange=4k-64k
>> ioengine=libaio
>> direct=1
>> iodepth=4
>> loops=20
>>
>> overall bandwidth (MB/s)
>> ------------------------
>>
>> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
>> 1                  540               626                     599
>> 2                  795               965                     925
>> 4                  997              1376                    1500
>> 8                 1136              2130                    2060
>> 16                1440              2269                    2474
>> 24                1408              2179                    2436
>> 32                1515              1978                    2319
>>
>> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
>> more VCPUs is very limited).
>>
>> avg bandwidth per LUN (MB/s)
>> ----------------------------
>>
>> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
>> 1                  540               626                     599
>> 2                  397               482                     462
>> 4                  249               344                     375
>> 8                  142               266                     257
>> 16                  90               141                     154
>> 24                  58                90                     101
>> 32                  47                61                      72
> 
> Is there an explanation why 8x8 is slower then 4x8 in both cases?

Regarding the "in both cases" part, it's because the second table has
the same data as the first, but divided by the first column.
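(For example, the 8-target row: 1136 MB/s overall for single-queue divided by
8 targets gives the 142 MB/s per-LUN figure.)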

In general, the "strangenesses" you find are probably within statistical
noise or due to other effects such as host CPU utilization or contention
on the big QEMU lock.

Paolo


 8x1 and 8x2
> being slower than 4x1 and 4x2 is more or less expected, but 8x8 loses against 
> 4x8 while 8x4 wins against 4x4 and 8x16 against 4x16.
> 
> Eike
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 12:32   ` Paolo Bonzini
                     ` (2 preceding siblings ...)
  (?)
@ 2012-12-19 10:47   ` Stefan Hajnoczi
  2012-12-19 12:04       ` Paolo Bonzini
  -1 siblings, 1 reply; 86+ messages in thread
From: Stefan Hajnoczi @ 2012-12-19 10:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	mst, rusty, asias, nab

On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
> +/**
> + * virtqueue_start_buf - start building buffer for the other end
> + * @vq: the struct virtqueue we're talking about.
> + * @buf: a struct keeping the state of the buffer
> + * @data: the token identifying the buffer.
> + * @count: the number of buffers that will be added

Perhaps count should be named count_bufs or num_bufs.

> + * @count_sg: the number of sg lists that will be added

What is the purpose of count_sg?

Stefan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 12:32 ` [PATCH v2 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
  2012-12-18 13:57     ` Michael S. Tsirkin
@ 2012-12-19 11:27   ` Stefan Hajnoczi
  2012-12-19 11:27   ` Stefan Hajnoczi
  2 siblings, 0 replies; 86+ messages in thread
From: Stefan Hajnoczi @ 2012-12-19 11:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	mst, rusty, asias, nab

On Tue, Dec 18, 2012 at 01:32:52PM +0100, Paolo Bonzini wrote:
>  struct virtio_scsi_target_state {
> -	/* Never held at the same time as vq_lock.  */
> +	/* This spinlock ever held at the same time as vq_lock.  */

s/ever/is never/

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
  2012-12-19  8:52     ` Paolo Bonzini
@ 2012-12-19 11:32       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-19 11:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rolf Eike Beer, linux-kernel, kvm, gaowanlong, hutao, linux-scsi,
	virtualization, rusty, asias, stefanha, nab

On Wed, Dec 19, 2012 at 09:52:59AM +0100, Paolo Bonzini wrote:
> Il 18/12/2012 23:18, Rolf Eike Beer ha scritto:
> > Paolo Bonzini wrote:
> >> Hi all,
> >>
> >> this series adds multiqueue support to the virtio-scsi driver, based
> >> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> >> algorithm that expects one queue per CPU.  LUNs in the same target always
> >> use the same queue (so that commands are not reordered); queue switching
> >> occurs when the request being queued is the only one for the target.
> >> Also based on Jason's patches, the virtqueue affinity is set so that
> >> each CPU is associated to one virtqueue.
> >>
> >> I tested the patches with fio, using up to 32 virtio-scsi disks backed
> >> by tmpfs on the host.  These numbers are with 1 LUN per target.
> >>
> >> FIO configuration
> >> -----------------
> >> [global]
> >> rw=read
> >> bsrange=4k-64k
> >> ioengine=libaio
> >> direct=1
> >> iodepth=4
> >> loops=20
> >>
> >> overall bandwidth (MB/s)
> >> ------------------------
> >>
> >> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
> >> 1                  540               626                     599
> >> 2                  795               965                     925
> >> 4                  997              1376                    1500
> >> 8                 1136              2130                    2060
> >> 16                1440              2269                    2474
> >> 24                1408              2179                    2436
> >> 32                1515              1978                    2319
> >>
> >> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
> >> more VCPUs is very limited).
> >>
> >> avg bandwidth per LUN (MB/s)
> >> ----------------------------
> >>
> >> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
> >> 1                  540               626                     599
> >> 2                  397               482                     462
> >> 4                  249               344                     375
> >> 8                  142               266                     257
> >> 16                  90               141                     154
> >> 24                  58                90                     101
> >> 32                  47                61                      72
> > 
> > Is there an explanation why 8x8 is slower then 4x8 in both cases?
> 
> Regarding the "in both cases" part, it's because the second table has
> the same data as the first, but divided by the first column.
> 
> In general, the "strangenesses" you find are probably within statistical
> noise or due to other effects such as host CPU utilization or contention
> on the big QEMU lock.
> 
> Paolo
> 

That's exactly what bothers me. If the IOPS divided by host CPU
goes down, then the win on lightly loaded host will become a regression
on a loaded host.

Need to measure that.

>  8x1 and 8x2
> > being slower than 4x1 and 4x2 is more or less expected, but 8x8 loses against 
> > 4x8 while 8x4 wins against 4x4 and 8x16 against 4x16.
> > 
> > Eike
> > 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-19 10:47   ` Stefan Hajnoczi
@ 2012-12-19 12:04       ` Paolo Bonzini
  0 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2012-12-19 12:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-kernel, kvm, gaowanlong, hutao, linux-scsi, virtualization,
	mst, rusty, asias, nab

Il 19/12/2012 11:47, Stefan Hajnoczi ha scritto:
> On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
>> +/**
>> + * virtqueue_start_buf - start building buffer for the other end
>> + * @vq: the struct virtqueue we're talking about.
>> + * @buf: a struct keeping the state of the buffer
>> + * @data: the token identifying the buffer.
>> + * @count: the number of buffers that will be added
> 
> Perhaps count should be named count_bufs or num_bufs.

Ok.

>> + * @count_sg: the number of sg lists that will be added
> 
> What is the purpose of count_sg?

It is needed to decide whether to use an indirect or a direct buffer.
The idea is to avoid a memory allocation if the driver is providing us
with separate sg elements (under the assumption that they will be few).

Originally I wanted to use a mix of direct and indirect buffer (direct
if add_buf received a one-element scatterlist, otherwise indirect).  It
would have had the same effect, without having to specify count_sg in
advance.  The spec is not clear if that is allowed or not, but in the
end they do not work with either QEMU or vhost, so I chose this
alternative instead.
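
To make the tradeoff concrete, the shape of the decision is roughly the
following (a rough sketch only -- the threshold and field names here are
illustrative, not the actual code in patch 1):

	/* Sketch: pick the descriptor layout when the buffer is started.
	 * Few separate sg elements -> direct descriptors, no allocation;
	 * otherwise allocate one indirect table with a slot per buffer.
	 */
	if (count_sg <= 2) {
		buf->indirect = NULL;
	} else {
		buf->indirect = kmalloc(count * sizeof(struct vring_desc), gfp);
		if (!buf->indirect)
			return -ENOMEM;
	}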

Paolo


> Stefan
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-19 12:04       ` Paolo Bonzini
@ 2012-12-19 12:40         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 86+ messages in thread
From: Stefan Hajnoczi @ 2012-12-19 12:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, linux-scsi, kvm, Michael S. Tsirkin, hutao,
	linux-kernel, Linux Virtualization

On Wed, Dec 19, 2012 at 1:04 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 19/12/2012 11:47, Stefan Hajnoczi ha scritto:
>> On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
>> What is the purpose of count_sg?
>
> It is needed to decide whether to use an indirect or a direct buffer.
> The idea is to avoid a memory allocation if the driver is providing us
> with separate sg elements (under the assumption that they will be few).

Ah, this makes sense now.  I saw it affects the decision whether to go
indirect or not but it wasn't obvious why.

> Originally I wanted to use a mix of direct and indirect buffer (direct
> if add_buf received a one-element scatterlist, otherwise indirect).  It
> would have had the same effect, without having to specify count_sg in
> advance.  The spec is not clear if that is allowed or not, but in the
> end they do not work with either QEMU or vhost, so I chose this
> alternative instead.

Okay.

Stefan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-19 12:04       ` Paolo Bonzini
@ 2012-12-19 16:51         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-19 16:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, linux-kernel, kvm, gaowanlong, hutao,
	linux-scsi, virtualization, rusty, asias, nab

On Wed, Dec 19, 2012 at 01:04:08PM +0100, Paolo Bonzini wrote:
> Il 19/12/2012 11:47, Stefan Hajnoczi ha scritto:
> > On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
> >> +/**
> >> + * virtqueue_start_buf - start building buffer for the other end
> >> + * @vq: the struct virtqueue we're talking about.
> >> + * @buf: a struct keeping the state of the buffer
> >> + * @data: the token identifying the buffer.
> >> + * @count: the number of buffers that will be added
> > 
> > Perhaps count should be named count_bufs or num_bufs.
> 
> Ok.
> 
> >> + * @count_sg: the number of sg lists that will be added
> > 
> > What is the purpose of count_sg?
> 
> It is needed to decide whether to use an indirect or a direct buffer.
> The idea is to avoid a memory allocation if the driver is providing us
> with separate sg elements (under the assumption that they will be few).
> 
> Originally I wanted to use a mix of direct and indirect buffer (direct
> if add_buf received a one-element scatterlist, otherwise indirect).  It
> would have had the same effect, without having to specify count_sg in
> advance.  The spec is not clear if that is allowed or not, but in the
> end they do not work with either QEMU or vhost, so I chose this
> alternative instead.
> 
> Paolo

Hmm it should work with vhost.

> 
> > Stefan
> > 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-19 16:51         ` Michael S. Tsirkin
@ 2012-12-19 16:52           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 86+ messages in thread
From: Michael S. Tsirkin @ 2012-12-19 16:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, linux-kernel, kvm, gaowanlong, hutao,
	linux-scsi, virtualization, rusty, asias, nab

On Wed, Dec 19, 2012 at 06:51:30PM +0200, Michael S. Tsirkin wrote:
> On Wed, Dec 19, 2012 at 01:04:08PM +0100, Paolo Bonzini wrote:
> > Il 19/12/2012 11:47, Stefan Hajnoczi ha scritto:
> > > On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote:
> > >> +/**
> > >> + * virtqueue_start_buf - start building buffer for the other end
> > >> + * @vq: the struct virtqueue we're talking about.
> > >> + * @buf: a struct keeping the state of the buffer
> > >> + * @data: the token identifying the buffer.
> > >> + * @count: the number of buffers that will be added
> > > 
> > > Perhaps count should be named count_bufs or num_bufs.
> > 
> > Ok.
> > 
> > >> + * @count_sg: the number of sg lists that will be added
> > > 
> > > What is the purpose of count_sg?
> > 
> > It is needed to decide whether to use an indirect or a direct buffer.
> > The idea is to avoid a memory allocation if the driver is providing us
> > with separate sg elements (under the assumption that they will be few).
> > 
> > Originally I wanted to use a mix of direct and indirect buffer (direct
> > if add_buf received a one-element scatterlist, otherwise indirect).  It
> > would have had the same effect, without having to specify count_sg in
> > advance.  The spec is not clear if that is allowed or not, but in the
> > end they do not work with either QEMU or vhost, so I chose this
> > alternative instead.
> > 
> > Paolo
> 
> Hmm it should work with vhost.

BTW passing in num_in + num_out would be nicer than explicit direction:
closer to the current add_buf.
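
Something along these lines (the signatures below are illustrative sketches,
not copied from the patches):

	/* roughly how this series does it: direction given per call */
	int virtqueue_add_sg(struct virtqueue_buf *buf, struct scatterlist *sg,
			     unsigned int count, enum dma_data_direction dir);

	/* the suggestion: declare out/in counts up front, like add_buf */
	int virtqueue_start_buf(struct virtqueue *vq, struct virtqueue_buf *buf,
				void *data, unsigned int num_out,
				unsigned int num_in, gfp_t gfp);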

> > 
> > > Stefan
> > > 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
  2012-12-18 13:42   ` Michael S. Tsirkin
@ 2012-12-24  6:44     ` Wanlong Gao
  -1 siblings, 0 replies; 86+ messages in thread
From: Wanlong Gao @ 2012-12-24  6:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, linux-kernel, kvm, hutao, linux-scsi,
	virtualization, rusty, asias, stefanha, nab

On 12/18/2012 09:42 PM, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2012 at 01:32:47PM +0100, Paolo Bonzini wrote:
>> Hi all,
>>
>> this series adds multiqueue support to the virtio-scsi driver, based
>> on Jason Wang's work on virtio-net.  It uses a simple queue steering
>> algorithm that expects one queue per CPU.  LUNs in the same target always
>> use the same queue (so that commands are not reordered); queue switching
>> occurs when the request being queued is the only one for the target.
>> Also based on Jason's patches, the virtqueue affinity is set so that
>> each CPU is associated to one virtqueue.
>>
>> I tested the patches with fio, using up to 32 virtio-scsi disks backed
>> by tmpfs on the host.  These numbers are with 1 LUN per target.
>>
>> FIO configuration
>> -----------------
>> [global]
>> rw=read
>> bsrange=4k-64k
>> ioengine=libaio
>> direct=1
>> iodepth=4
>> loops=20
>>
>> overall bandwidth (MB/s)
>> ------------------------
>>
>> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
>> 1                  540               626                     599
>> 2                  795               965                     925
>> 4                  997              1376                    1500
>> 8                 1136              2130                    2060
>> 16                1440              2269                    2474
>> 24                1408              2179                    2436
>> 32                1515              1978                    2319
>>
>> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
>> more VCPUs is very limited).
>>
>> avg bandwidth per LUN (MB/s)
>> ----------------------------
>>
>> # of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
>> 1                  540               626                     599
>> 2                  397               482                     462
>> 4                  249               344                     375
>> 8                  142               266                     257
>> 16                  90               141                     154
>> 24                  58                90                     101
>> 32                  47                61                      72
> 
> 
> Could you please try and measure host CPU utilization?

I measured and didn't see any CPU utilization regression here.

> Without this data it is possible that your host
> is undersubscribed and you are drinking up more host CPU.
> 
> Another thing to note is that ATM you might need to
> test with idle=poll on host otherwise we have strange interaction
> with power management where reducing the overhead
> switches to lower power so gives you a worse IOPS.

Yeah, I measured with idle=poll on the host and saw that the performance
improved by about 68%.

Thanks,
Wanlong Gao

> 
> 
>> Patch 1 adds a new API to add functions for piecewise addition for buffers,
>> which enables various simplifications in virtio-scsi (patches 2-3) and a
>> small performance improvement of 2-6%.  Patches 4 and 5 add multiqueuing.
>>
>> I'm mostly looking for comments on the new API of patch 1 for inclusion
>> into the 3.9 kernel.
>>
>> Thanks to Wao Ganlong for help rebasing and benchmarking these patches.
>>
>> Paolo Bonzini (5):
>>   virtio: add functions for piecewise addition of buffers
>>   virtio-scsi: use functions for piecewise composition of buffers
>>   virtio-scsi: redo allocation of target data
>>   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
>>   virtio-scsi: introduce multiqueue support
>>
>>  drivers/scsi/virtio_scsi.c   |  374 +++++++++++++++++++++++++++++-------------
>>  drivers/virtio/virtio_ring.c |  205 ++++++++++++++++++++++++
>>  include/linux/virtio.h       |   21 +++
>>  3 files changed, 485 insertions(+), 115 deletions(-)
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
  2012-12-18 16:02             ` Michael S. Tsirkin
@ 2012-12-25 12:41               ` Wanlong Gao
  -1 siblings, 0 replies; 86+ messages in thread
From: Wanlong Gao @ 2012-12-25 12:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, linux-kernel, kvm, hutao, linux-scsi,
	virtualization, rusty, asias, stefanha, nab

On 12/19/2012 12:02 AM, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2012 at 04:51:28PM +0100, Paolo Bonzini wrote:
>> Il 18/12/2012 16:03, Michael S. Tsirkin ha scritto:
>>> On Tue, Dec 18, 2012 at 03:08:08PM +0100, Paolo Bonzini wrote:
>>>> Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
>>>>>> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>>>>>> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
>>>>>> +				 struct virtio_scsi_target_state *tgt,
>>>>>> +				 struct scsi_cmnd *sc)
>>>>>>  {
>>>>>> -	struct virtio_scsi *vscsi = shost_priv(sh);
>>>>>> -	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>>>>>>  	struct virtio_scsi_cmd *cmd;
>>>>>> +	struct virtio_scsi_vq *req_vq;
>>>>>>  	int ret;
>>>>>>  
>>>>>>  	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
>>>>>> @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>>>>>>  	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
>>>>>>  	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
>>>>>>  
>>>>>> -	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
>>>>>> +	req_vq = ACCESS_ONCE(tgt->req_vq);
>>>>>
>>>>> This ACCESS_ONCE without a barrier looks strange to me.
>>>>> Can req_vq change? Needs a comment.
>>>>
>>>> Barriers are needed to order two things.  Here I don't have the second thing
>>>> to order against, hence no barrier.
>>>>
>>>> Accessing req_vq lockless is safe, and there's a comment about it, but you
>>>> still want ACCESS_ONCE to ensure the compiler doesn't play tricks.
>>>
>>> That's just it.
>>> Why don't you want compiler to play tricks?
>>
>> Because I want the lockless access to occur exactly when I write it.
> 
> It doesn't occur when you write it. CPU can still move accesses
> around. That's why you either need both ACCESS_ONCE and a barrier
> or none.
> 
>> Otherwise I have one more thing to think about, i.e. what a crazy
>> compiler writer could do with my code.  And having been on the other
>> side of the trench, compiler writers can have *really* crazy ideas.
>>
>> Anyhow, I'll reorganize the code to move the ACCESS_ONCE closer to the
>> write and make it clearer.
>>
>>>>>> +	if (virtscsi_kick_cmd(tgt, req_vq, cmd,
>>>>>>  			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
>>>>>>  			      GFP_ATOMIC) == 0)
>>>>>>  		ret = 0;
>>>>>> @@ -472,6 +545,48 @@ out:
>>>>>>  	return ret;
>>>>>>  }
>>>>>>  
>>>>>> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
>>>>>> +					struct scsi_cmnd *sc)
>>>>>> +{
>>>>>> +	struct virtio_scsi *vscsi = shost_priv(sh);
>>>>>> +	struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
>>>>>> +
>>>>>> +	atomic_inc(&tgt->reqs);
>>>>>
>>>>> And here we don't have barrier after atomic? Why? Needs a comment.
>>>>
>>>> Because we don't write req_vq, so there's no two writes to order.  Barrier
>>>> against what?
>>>
>>> Between atomic update and command. Once you queue command it
>>> can complete and decrement reqs, if this happens before
>>> increment reqs can become negative even.
>>
>> This is not a problem.  Please read Documentation/memory-barrier.txt:
>>
>>    The following also do _not_ imply memory barriers, and so may
>>    require explicit memory barriers under some circumstances
>>    (smp_mb__before_atomic_dec() for instance):
>>
>>         atomic_add();
>>         atomic_sub();
>>         atomic_inc();
>>         atomic_dec();
>>
>>    If they're used for statistics generation, then they probably don't
>>    need memory barriers, unless there's a coupling between statistical
>>    data.
>>
>> This is the single-queue case, so it falls under this case.
> 
> Aha I missed it's single queue. Correct but please add a comment.
> 
>>>>>>  	/* Discover virtqueues and write information to configuration.  */
>>>>>> -	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
>>>>>> +	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
>>>>>>  	if (err)
>>>>>>  		return err;
>>>>>>  
>>>>>> -	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
>>>>>> -	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
>>>>>> -	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
>>>>>> +	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
>>>>>> +	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
>>>>>> +	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
>>>>>> +		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
>>>>>> +				 vqs[i], vscsi->num_queues > 1);
>>>>>
>>>>> So affinity is true if >1 vq? I am guessing this is not
>>>>> going to do the right thing unless you have at least
>>>>> as many vqs as CPUs.
>>>>
>>>> Yes, and then you're not setting up the thing correctly.
>>>
>>> Why not just check instead of doing the wrong thing?
>>
>> The right thing could be to set the affinity with a stride, e.g. CPUs
>> 0-4 for virtqueue 0 and so on until CPUs 3-7 for virtqueue 3.
>>
>> Paolo
> 
> I think a simple #vqs == #cpus check would be kind of OK for
> starters, otherwise let userspace set affinity.
> Again need to think what happens with CPU hotplug.
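
A strided mapping like the one Paolo sketches above would amount to something
like the following (purely illustrative, not from any posted patch):

	/* e.g. 8 CPUs, 4 queues: vq0 serves CPUs 0 and 4, vq1 CPUs 1 and 5, ... */
	static u32 virtscsi_queue_for_cpu(unsigned int cpu, u32 num_queues)
	{
		return cpu % num_queues;
	}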

How about dynamically setting affinity this way?
========================================================================
Subject: [PATCH] virtio-scsi: set virtqueue affinity under cpu hotplug

We set the virtqueue affinity when num_queues equals the number
of VCPUs.  Register a hotcpu notifier so that we are notified when
the number of VCPUs changes; if it changes, force the virtqueue
affinity to be set again.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
 drivers/scsi/virtio_scsi.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 3641d5f..1b28e03 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -20,6 +20,7 @@
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
 #include <linux/virtio_scsi.h>
+#include <linux/cpu.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_cmnd.h>
@@ -106,6 +107,9 @@ struct virtio_scsi {
 
 	u32 num_queues;
 
+	/* Is the affinity hint set for the virtqueues? */
+	bool affinity_hint_set;
+
 	struct virtio_scsi_vq ctrl_vq;
 	struct virtio_scsi_vq event_vq;
 	struct virtio_scsi_vq req_vqs[];
@@ -113,6 +117,7 @@ struct virtio_scsi {
 
 static struct kmem_cache *virtscsi_cmd_cache;
 static mempool_t *virtscsi_cmd_pool;
+static bool cpu_hotplug = false;
 
 static inline struct Scsi_Host *virtio_scsi_host(struct virtio_device *vdev)
 {
@@ -555,6 +560,52 @@ static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
 	return virtscsi_queuecommand(vscsi, tgt, sc);
 }
 
+static void virtscsi_set_affinity(struct virtio_scsi *vscsi,
+				  bool affinity)
+{
+	int i;
+
+	/* In multiqueue mode, when the number of CPUs is equal to the number
+	 * of request queues, we make each queue private to one CPU by setting
+	 * the affinity hint, which eliminates contention on the queue.
+	 */
+	if ((vscsi->num_queues == 1 ||
+	    vscsi->num_queues != num_online_cpus()) && affinity) {
+		if (vscsi->affinity_hint_set)
+			affinity = false;
+		else
+			return;
+	}
+
+	for (i = 0; i < vscsi->num_queues - VIRTIO_SCSI_VQ_BASE; i++) {
+		int cpu = affinity ? i : -1;
+		virtqueue_set_affinity(vscsi->req_vqs[i].vq, cpu);
+	}
+
+	if (affinity)
+		vscsi->affinity_hint_set = true;
+	else
+		vscsi->affinity_hint_set = false;
+}
+
+static int __cpuinit virtscsi_cpu_callback(struct notifier_block *nfb,
+					   unsigned long action, void *hcpu)
+{
+	switch (action) {
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		cpu_hotplug = true;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata virtscsi_cpu_notifier =
+{
+	.notifier_call = virtscsi_cpu_callback,
+};
+
 static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
 				       struct scsi_cmnd *sc)
 {
@@ -563,6 +614,11 @@ static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
 	unsigned long flags;
 	u32 queue_num;
 
+	if (unlikely(cpu_hotplug == true)) {
+		virtscsi_set_affinity(vscsi, true);
+		cpu_hotplug = false;
+	}
+
 	/*
 	 * Using an atomic_t for tgt->reqs lets the virtqueue handler
 	 * decrement it without taking the spinlock.
@@ -703,12 +759,10 @@ static struct scsi_host_template virtscsi_host_template_multi = {
 
 
 static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
-			     struct virtqueue *vq, bool affinity)
+			     struct virtqueue *vq)
 {
 	spin_lock_init(&virtscsi_vq->vq_lock);
 	virtscsi_vq->vq = vq;
-	if (affinity)
-		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);
 }
 
 static void virtscsi_init_tgt(struct virtio_scsi *vscsi, int i)
@@ -736,6 +790,8 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev)
 	struct Scsi_Host *sh = virtio_scsi_host(vdev);
 	struct virtio_scsi *vscsi = shost_priv(sh);
 
+	virtscsi_set_affinity(vscsi, false);
+
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 
@@ -779,11 +835,14 @@ static int virtscsi_init(struct virtio_device *vdev,
 	if (err)
 		return err;
 
-	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
-	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
+	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
+	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
 	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
 		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
-				 vqs[i], vscsi->num_queues > 1);
+				 vqs[i]);
+
+	virtscsi_set_affinity(vscsi, true);
+	register_hotcpu_notifier(&virtscsi_cpu_notifier);
 
 	virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE);
 	virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE);
@@ -882,6 +941,7 @@ static void __devexit virtscsi_remove(struct virtio_device *vdev)
 	struct Scsi_Host *shost = virtio_scsi_host(vdev);
 	struct virtio_scsi *vscsi = shost_priv(shost);
 
+	unregister_hotcpu_notifier(&virtscsi_cpu_notifier);
 	if (virtio_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG))
 		virtscsi_cancel_event_work(vscsi);
 
-- 
1.8.0

Thanks,
Wanlong Gao

> 
>>>> Isn't the same thing true for virtio-net mq?
>>>>
>>>> Paolo
>>>
>>> Last I looked it checked vi->max_queue_pairs == num_online_cpus().
>>> This is even too aggressive I think, max_queue_pairs >=
>>> num_online_cpus() should be enough.
>>>
> 


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2012-12-18 12:32   ` Paolo Bonzini
@ 2013-01-02  5:03     ` Rusty Russell
  -1 siblings, 0 replies; 86+ messages in thread
From: Rusty Russell @ 2013-01-02  5:03 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel
  Cc: kvm, gaowanlong, hutao, linux-scsi, virtualization, mst, asias,
	stefanha, nab

Paolo Bonzini <pbonzini@redhat.com> writes:
> The virtqueue_add_buf function has two limitations:
>
> 1) it requires the caller to provide all the buffers in a single call;
>
> 2) it does not support chained scatterlists: the buffers must be
> provided as an array of struct scatterlist;

Chained scatterlists are a horrible interface, but that doesn't mean we
shouldn't support them if there's a need.

I think I once even had a patch which passed two chained sgs, rather
than a combo sg and two length numbers.  It's very old, but I've pasted
it below.

Duplicating the implementation by having another interface is pretty
nasty; I think I'd prefer the chained scatterlists, if that's optimal
for you.
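
For comparison, the two calling conventions under discussion look roughly like
this (a sketch only, with vq/req/resp/status/token as placeholder names; the
exact signatures are in the patch pasted below):

	/* Current API: a single sg[] plus out/in counts. */
	sg_init_table(sg, 3);
	sg_set_buf(&sg[0], &req, sizeof(req));		/* readable by the device */
	sg_set_buf(&sg[1], &resp, sizeof(resp));	/* writable by the device */
	sg_set_buf(&sg[2], &status, sizeof(status));	/* writable by the device */
	err = virtqueue_add_buf(vq, sg, 1, 2, token, GFP_ATOMIC);

	/* Proposed API: two (possibly chained) sg lists, no counts. */
	sg_init_one(out, &req, sizeof(req));
	sg_init_table(in, 2);
	sg_set_buf(&in[0], &resp, sizeof(resp));
	sg_set_buf(&in[1], &status, sizeof(status));
	err = virtqueue_add_buf(vq, out, in, token, GFP_ATOMIC);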

Cheers,
Rusty.

From: Rusty Russell <rusty@rustcorp.com.au>
Subject: virtio: use chained scatterlists.

Rather than handing a scatterlist[] and out and in numbers to
virtqueue_add_buf(), hand two separate ones which can be chained.

I shall refrain from ranting about what a disgusting hack chained
scatterlists are.  I'll just note that this doesn't make things
simpler (see diff).

The scatterlists we use can be too large for the stack, so we put them
in our device struct and reuse them.  But in many cases we don't want
to pay the cost of sg_init_table() as we don't know how many elements
we'll have and we'd have to initialize the entire table.

This means we have two choices: carefully reset the end markers after
we call virtqueue_add_buf(), which we do in virtio_net for the xmit
path where it's easy and we want to be optimal.  Elsewhere we
implement a helper to unset the end markers after we've filled the
array.
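
Concretely, the helper pattern ends up looking something like this (a sketch;
the real users are the virtio_blk and 9p hunks below, and vq/token are just
placeholders):

	/* sg[0..out+in-1] was filled without sg_init_table(); fix the markers. */
	sg_unset_end_markers(sg, out + in);	/* clear stale end bits */
	sg_mark_end(&sg[out - 1]);		/* terminate the readable part */
	sg_mark_end(&sg[out + in - 1]);		/* terminate the writable part */
	err = virtqueue_add_buf(vq, sg, sg + out, token, GFP_ATOMIC);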

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/block/virtio_blk.c          |   37 +++++++++++++-----
 drivers/char/hw_random/virtio-rng.c |    2 -
 drivers/char/virtio_console.c       |    6 +--
 drivers/net/virtio_net.c            |   67 ++++++++++++++++++---------------
 drivers/virtio/virtio_balloon.c     |    6 +--
 drivers/virtio/virtio_ring.c        |   71 ++++++++++++++++++++++--------------
 include/linux/virtio.h              |    5 +-
 net/9p/trans_virtio.c               |   38 +++++++++++++++++--
 8 files changed, 151 insertions(+), 81 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -100,6 +100,14 @@ static void blk_done(struct virtqueue *v
 	spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
 }
 
+static void sg_unset_end_markers(struct scatterlist *sg, unsigned int num)
+{
+	unsigned int i;
+
+	for (i = 0; i < num; i++)
+		sg[i].page_link &= ~0x02;
+}
+
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		   struct request *req)
 {
@@ -140,6 +148,7 @@ static bool do_req(struct request_queue 
 		}
 	}
 
+	/* We lay out our scatterlist in a single array, out then in. */
 	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 
 	/*
@@ -151,17 +160,8 @@ static bool do_req(struct request_queue 
 	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
 		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
 
+	/* This marks the end of the sg list at vblk->sg[out]. */
 	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
-
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
-		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
-		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
-			   sizeof(vbr->in_hdr));
-	}
-
-	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
-		   sizeof(vbr->status));
-
 	if (num) {
 		if (rq_data_dir(vbr->req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
@@ -172,7 +172,22 @@ static bool do_req(struct request_queue 
 		}
 	}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
+	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		sg_set_buf(&vblk->sg[out + in++], vbr->req->sense,
+			   SCSI_SENSE_BUFFERSIZE);
+		sg_set_buf(&vblk->sg[out + in++], &vbr->in_hdr,
+			   sizeof(vbr->in_hdr));
+	}
+
+	sg_set_buf(&vblk->sg[out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	sg_unset_end_markers(vblk->sg, out+in);
+	sg_mark_end(&vblk->sg[out-1]);
+	sg_mark_end(&vblk->sg[out+in-1]);
+
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, vblk->sg+out, vbr, GFP_ATOMIC)
+	    < 0) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -47,7 +47,7 @@ static void register_buffer(u8 *buf, siz
 	sg_init_one(&sg, buf, size);
 
 	/* There should always be room for one buffer. */
-	if (virtqueue_add_buf(vq, &sg, 0, 1, buf, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, NULL, &sg, buf, GFP_KERNEL) < 0)
 		BUG();
 
 	virtqueue_kick(vq);
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -392,7 +392,7 @@ static int add_inbuf(struct virtqueue *v
 
 	sg_init_one(sg, buf->buf, buf->size);
 
-	ret = virtqueue_add_buf(vq, sg, 0, 1, buf, GFP_ATOMIC);
+	ret = virtqueue_add_buf(vq, NULL, sg, buf, GFP_ATOMIC);
 	virtqueue_kick(vq);
 	return ret;
 }
@@ -457,7 +457,7 @@ static ssize_t __send_control_msg(struct
 	vq = portdev->c_ovq;
 
 	sg_init_one(sg, &cpkt, sizeof(cpkt));
-	if (virtqueue_add_buf(vq, sg, 1, 0, &cpkt, GFP_ATOMIC) >= 0) {
+	if (virtqueue_add_buf(vq, sg, NULL, &cpkt, GFP_ATOMIC) >= 0) {
 		virtqueue_kick(vq);
 		while (!virtqueue_get_buf(vq, &len))
 			cpu_relax();
@@ -506,7 +506,7 @@ static ssize_t send_buf(struct port *por
 	reclaim_consumed_buffers(port);
 
 	sg_init_one(sg, in_buf, in_count);
-	ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC);
+	ret = virtqueue_add_buf(out_vq, sg, NULL, in_buf, GFP_ATOMIC);
 
 	/* Tell Host to go! */
 	virtqueue_kick(out_vq);
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -376,11 +376,11 @@ static int add_recvbuf_small(struct virt
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
+	sg_init_table(vi->rx_sg, 2);
 	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
-
 	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
@@ -393,6 +393,8 @@ static int add_recvbuf_big(struct virtne
 	char *p;
 	int i, err, offset;
 
+	sg_init_table(vi->rx_sg, MAX_SKB_FRAGS + 1);
+
 	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
 		first = get_a_page(vi, gfp);
@@ -425,8 +427,8 @@ static int add_recvbuf_big(struct virtne
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
-				first, gfp);
+
+	err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, first, gfp);
 	if (err < 0)
 		give_pages(vi, first);
 
@@ -444,7 +446,7 @@ static int add_recvbuf_mergeable(struct 
 
 	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, page, gfp);
 	if (err < 0)
 		give_pages(vi, page);
 
@@ -581,6 +583,7 @@ static int xmit_skb(struct virtnet_info 
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+	int ret;
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
@@ -620,8 +623,16 @@ static int xmit_skb(struct virtnet_info 
 		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
 	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
-				 0, skb, GFP_ATOMIC);
+
+	ret = virtqueue_add_buf(vi->svq, vi->tx_sg, NULL, skb, GFP_ATOMIC);
+
+	/*
+	 * An optimization: clear the end bit set by skb_to_sgvec, so
+	 * we can simply re-use vi->tx_sg[] next time.
+	 */
+	vi->tx_sg[hdr->num_sg-1].page_link &= ~0x02;
+
+	return ret;
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -757,32 +768,31 @@ static int virtnet_open(struct net_devic
  * never fail unless improperly formated.
  */
 static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
-				 struct scatterlist *data, int out, int in)
+				 struct scatterlist *cmdsg)
 {
-	struct scatterlist *s, sg[VIRTNET_SEND_COMMAND_SG_MAX + 2];
+	struct scatterlist in[1], out[2];
 	struct virtio_net_ctrl_hdr ctrl;
 	virtio_net_ctrl_ack status = ~0;
 	unsigned int tmp;
-	int i;
 
 	/* Caller should know better */
-	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ||
-		(out + in > VIRTNET_SEND_COMMAND_SG_MAX));
+	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-	out++; /* Add header */
-	in++; /* Add return status */
-
+	/* Prepend header to output. */
+	sg_init_table(out, 2);
 	ctrl.class = class;
 	ctrl.cmd = cmd;
+	sg_set_buf(&out[0], &ctrl, sizeof(ctrl));
+	if (cmdsg)
+		sg_chain(out, 2, cmdsg);
+	else
+		sg_mark_end(&out[0]);
 
-	sg_init_table(sg, out + in);
+	/* Status response. */
+	sg_init_one(in, &status, sizeof(status));
 
-	sg_set_buf(&sg[0], &ctrl, sizeof(ctrl));
-	for_each_sg(data, s, out + in - 2, i)
-		sg_set_buf(&sg[i + 1], sg_virt(s), s->length);
-	sg_set_buf(&sg[out + in - 1], &status, sizeof(status));
 
-	BUG_ON(virtqueue_add_buf(vi->cvq, sg, out, in, vi, GFP_ATOMIC) < 0);
+	BUG_ON(virtqueue_add_buf(vi->cvq, out, in, vi, GFP_ATOMIC) < 0);
 
 	virtqueue_kick(vi->cvq);
 
@@ -800,8 +810,7 @@ static void virtnet_ack_link_announce(st
 {
 	rtnl_lock();
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE,
-				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL,
-				  0, 0))
+				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL))
 		dev_warn(&vi->dev->dev, "Failed to ack link announce.\n");
 	rtnl_unlock();
 }
@@ -839,16 +848,14 @@ static void virtnet_set_rx_mode(struct n
 	sg_init_one(sg, &promisc, sizeof(promisc));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
-				  VIRTIO_NET_CTRL_RX_PROMISC,
-				  sg, 1, 0))
+				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
 			 promisc ? "en" : "dis");
 
 	sg_init_one(sg, &allmulti, sizeof(allmulti));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
-				  VIRTIO_NET_CTRL_RX_ALLMULTI,
-				  sg, 1, 0))
+				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
 			 allmulti ? "en" : "dis");
 
@@ -887,7 +894,7 @@ static void virtnet_set_rx_mode(struct n
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
 				  VIRTIO_NET_CTRL_MAC_TABLE_SET,
-				  sg, 2, 0))
+				  sg))
 		dev_warn(&dev->dev, "Failed to set MAC fitler table.\n");
 
 	kfree(buf);
@@ -901,7 +908,7 @@ static int virtnet_vlan_rx_add_vid(struc
 	sg_init_one(&sg, &vid, sizeof(vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
-				  VIRTIO_NET_CTRL_VLAN_ADD, &sg, 1, 0))
+				  VIRTIO_NET_CTRL_VLAN_ADD, &sg))
 		dev_warn(&dev->dev, "Failed to add VLAN ID %d.\n", vid);
 	return 0;
 }
@@ -914,7 +921,7 @@ static int virtnet_vlan_rx_kill_vid(stru
 	sg_init_one(&sg, &vid, sizeof(vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
-				  VIRTIO_NET_CTRL_VLAN_DEL, &sg, 1, 0))
+				  VIRTIO_NET_CTRL_VLAN_DEL, &sg))
 		dev_warn(&dev->dev, "Failed to kill VLAN ID %d.\n", vid);
 	return 0;
 }
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -102,7 +102,7 @@ static void tell_host(struct virtio_ball
 	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
 	/* We should always be able to add one buffer to an empty queue. */
-	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
 
@@ -246,7 +246,7 @@ static void stats_handle_request(struct 
 	if (!virtqueue_get_buf(vq, &len))
 		return;
 	sg_init_one(&sg, vb->stats, sizeof(vb->stats));
-	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
 }
@@ -331,7 +331,7 @@ static int init_vqs(struct virtio_balloo
 		 * use it to signal us later.
 		 */
 		sg_init_one(&sg, vb->stats, sizeof vb->stats);
-		if (virtqueue_add_buf(vb->stats_vq, &sg, 1, 0, vb, GFP_KERNEL)
+		if (virtqueue_add_buf(vb->stats_vq, &sg, NULL, vb, GFP_KERNEL)
 		    < 0)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -121,35 +121,41 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/* This doesn't have the counter that for_each_sg() has */
+#define foreach_sg(sglist, i)			\
+	for (i = (sglist); i; i = sg_next(i))
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static int vring_add_indirect(struct vring_virtqueue *vq,
-			      struct scatterlist sg[],
-			      unsigned int out,
-			      unsigned int in,
+			      unsigned int num,
+			      const struct scatterlist *out,
+			      const struct scatterlist *in,
 			      gfp_t gfp)
 {
+	const struct scatterlist *sg;
 	struct vring_desc *desc;
 	unsigned head;
 	int i;
 
-	desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+	desc = kmalloc(num * sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return -ENOMEM;
 
 	/* Transfer entries from the sg list into the indirect page */
-	for (i = 0; i < out; i++) {
+	i = 0;
+	foreach_sg(out, sg) {
 		desc[i].flags = VRING_DESC_F_NEXT;
 		desc[i].addr = sg_phys(sg);
 		desc[i].len = sg->length;
 		desc[i].next = i+1;
-		sg++;
+		i++;
 	}
-	for (; i < (out + in); i++) {
+	foreach_sg(in, sg) {
 		desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 		desc[i].addr = sg_phys(sg);
 		desc[i].len = sg->length;
 		desc[i].next = i+1;
-		sg++;
+		i++;
 	}
 
 	/* Last one doesn't continue. */
@@ -171,12 +177,21 @@ static int vring_add_indirect(struct vri
 	return head;
 }
 
+static unsigned int count_sg(const struct scatterlist *sg)
+{
+	unsigned int count = 0;
+	const struct scatterlist *i;
+
+	foreach_sg(sg, i)
+		count++;
+	return count;
+}
+
 /**
  * virtqueue_add_buf - expose buffer to other end
  * @vq: the struct virtqueue we're talking about.
- * @sg: the description of the buffer(s).
- * @out_num: the number of sg readable by other side
- * @in_num: the number of sg which are writable (after readable ones)
+ * @out: the description of the output buffer(s).
+ * @in: the description of the input buffer(s).
  * @data: the token identifying the buffer.
  * @gfp: how to do memory allocations (if necessary).
  *
@@ -189,20 +204,23 @@ static int vring_add_indirect(struct vri
  * we can put an entire sg[] array inside a single queue entry.
  */
 int virtqueue_add_buf(struct virtqueue *_vq,
-		      struct scatterlist sg[],
-		      unsigned int out,
-		      unsigned int in,
+		      const struct scatterlist *out,
+		      const struct scatterlist *in,
 		      void *data,
 		      gfp_t gfp)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	unsigned int i, avail, uninitialized_var(prev);
+	unsigned int i, avail, uninitialized_var(prev), num;
+	const struct scatterlist *sg;
 	int head;
 
 	START_USE(vq);
 
 	BUG_ON(data == NULL);
 
+	num = count_sg(out) + count_sg(in);
+	BUG_ON(num == 0);
+
 #ifdef DEBUG
 	{
 		ktime_t now = ktime_get();
@@ -218,18 +236,17 @@ int virtqueue_add_buf(struct virtqueue *
 
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && (out + in) > 1 && vq->num_free) {
-		head = vring_add_indirect(vq, sg, out, in, gfp);
+	if (vq->indirect && num > 1 && vq->num_free) {
+		head = vring_add_indirect(vq, num, out, in, gfp);
 		if (likely(head >= 0))
 			goto add_head;
 	}
 
-	BUG_ON(out + in > vq->vring.num);
-	BUG_ON(out + in == 0);
+	BUG_ON(num > vq->vring.num);
 
-	if (vq->num_free < out + in) {
+	if (vq->num_free < num) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
-			 out + in, vq->num_free);
+			 num, vq->num_free);
 		/* FIXME: for historical reasons, we force a notify here if
 		 * there are outgoing parts to the buffer.  Presumably the
 		 * host should service the ring ASAP. */
@@ -240,22 +257,24 @@ int virtqueue_add_buf(struct virtqueue *
 	}
 
 	/* We're about to use some buffers from the free list. */
-	vq->num_free -= out + in;
+	vq->num_free -= num;
 
 	head = vq->free_head;
-	for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
+
+	i = vq->free_head;
+	foreach_sg(out, sg) {
 		vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
 		vq->vring.desc[i].addr = sg_phys(sg);
 		vq->vring.desc[i].len = sg->length;
 		prev = i;
-		sg++;
+		i = vq->vring.desc[i].next;
 	}
-	for (; in; i = vq->vring.desc[i].next, in--) {
+	foreach_sg(in, sg) {
 		vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 		vq->vring.desc[i].addr = sg_phys(sg);
 		vq->vring.desc[i].len = sg->length;
 		prev = i;
-		sg++;
+		i = vq->vring.desc[i].next;
 	}
 	/* Last one doesn't continue. */
 	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -26,9 +26,8 @@ struct virtqueue {
 };
 
 int virtqueue_add_buf(struct virtqueue *vq,
-		      struct scatterlist sg[],
-		      unsigned int out_num,
-		      unsigned int in_num,
+		      const struct scatterlist *out,
+		      const struct scatterlist *in,
 		      void *data,
 		      gfp_t gfp);
 
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -244,6 +244,14 @@ pack_sg_list_p(struct scatterlist *sg, i
 	return index - start;
 }
 
+static void sg_unset_end_markers(struct scatterlist *sg, unsigned int num)
+{
+	unsigned int i;
+
+	for (i = 0; i < num; i++)
+		sg[i].page_link &= ~0x02;
+}
+
 /**
  * p9_virtio_request - issue a request
  * @client: client instance issuing the request
@@ -258,6 +266,7 @@ p9_virtio_request(struct p9_client *clie
 	int in, out;
 	unsigned long flags;
 	struct virtio_chan *chan = client->trans;
+	const struct scatterlist *outsg = NULL, *insg = NULL;
 
 	p9_debug(P9_DEBUG_TRANS, "9p debug: virtio request\n");
 
@@ -268,12 +277,21 @@ req_retry:
 	/* Handle out VirtIO ring buffers */
 	out = pack_sg_list(chan->sg, 0,
 			   VIRTQUEUE_NUM, req->tc->sdata, req->tc->size);
+	if (out) {
+		sg_unset_end_markers(chan->sg, out-1);
+		sg_mark_end(&chan->sg[out-1]);
+		outsg = chan->sg;
+	}
 
 	in = pack_sg_list(chan->sg, out,
 			  VIRTQUEUE_NUM, req->rc->sdata, req->rc->capacity);
+	if (in) {
+		sg_unset_end_markers(chan->sg+out, in-1);
+		sg_mark_end(&chan->sg[out+in-1]);
+		insg = chan->sg+out;
+	}
 
-	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
-				GFP_ATOMIC);
+	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
 	if (err < 0) {
 		if (err == -ENOSPC) {
 			chan->ring_bufs_avail = 0;
@@ -355,6 +377,7 @@ p9_virtio_zc_request(struct p9_client *c
 	int in_nr_pages = 0, out_nr_pages = 0;
 	struct page **in_pages = NULL, **out_pages = NULL;
 	struct virtio_chan *chan = client->trans;
+	struct scatterlist *insg = NULL, *outsg = NULL;
 
 	p9_debug(P9_DEBUG_TRANS, "virtio request\n");
 
@@ -402,6 +425,13 @@ req_retry_pinned:
 	if (out_pages)
 		out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
 				      out_pages, out_nr_pages, uodata, outlen);
+
+	if (out) {
+		sg_unset_end_markers(chan->sg, out-1);
+		sg_mark_end(&chan->sg[out-1]);
+		outsg = chan->sg;
+	}
+
 	/*
 	 * Take care of in data
 	 * For example TREAD have 11.
@@ -414,9 +446,13 @@ req_retry_pinned:
 	if (in_pages)
 		in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
 				     in_pages, in_nr_pages, uidata, inlen);
+	if (in) {
+		sg_unset_end_markers(chan->sg+out, in-1);
+		sg_mark_end(&chan->sg[out+in-1]);
+		insg = chan->sg + out;
+	}
 
-	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
-				GFP_ATOMIC);
+	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
 	if (err < 0) {
 		if (err == -ENOSPC) {
 			chan->ring_bufs_avail = 0;

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2013-01-02  5:03     ` Rusty Russell
  (?)
@ 2013-01-03  8:58       ` Wanlong Gao
  -1 siblings, 0 replies; 86+ messages in thread
From: Wanlong Gao @ 2013-01-03  8:58 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Paolo Bonzini, linux-kernel, kvm, hutao, linux-scsi,
	virtualization, mst, asias, stefanha, nab, Wanlong Gao

On 01/02/2013 01:03 PM, Rusty Russell wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>> The virtqueue_add_buf function has two limitations:
>>
>> 1) it requires the caller to provide all the buffers in a single call;
>>
>> 2) it does not support chained scatterlists: the buffers must be
>> provided as an array of struct scatterlist;
> 
> Chained scatterlists are a horrible interface, but that doesn't mean we
> shouldn't support them if there's a need.
> 
> I think I once even had a patch which passed two chained sgs, rather
> than a combo sg and two length numbers.  It's very old, but I've pasted
> it below.
> 
> Duplicating the implementation by having another interface is pretty
> nasty; I think I'd prefer the chained scatterlists, if that's optimal
> for you.

I rebased it against virtio-next and used it in virtio-scsi, and tested it with 4 virtio-scsi
target devices and host cpu idle=poll. I saw a small performance regression here.

General:
Run status group 0 (all jobs):
   READ: io=34675MB, aggrb=248257KB/s, minb=248257KB/s, maxb=248257KB/s, mint=143025msec, maxt=143025msec
  WRITE: io=34625MB, aggrb=247902KB/s, minb=247902KB/s, maxb=247902KB/s, mint=143025msec, maxt=143025msec

Chained:
Run status group 0 (all jobs):
   READ: io=34863MB, aggrb=242320KB/s, minb=242320KB/s, maxb=242320KB/s, mint=147325msec, maxt=147325msec
  WRITE: io=34437MB, aggrb=239357KB/s, minb=239357KB/s, maxb=239357KB/s, mint=147325msec, maxt=147325msec

Thanks,
Wanlong Gao

From d3181b3f9bbdebbd3f2928b64821b406774757f8 Mon Sep 17 00:00:00 2001
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Wed, 2 Jan 2013 16:43:49 +0800
Subject: [PATCH] virtio: use chained scatterlists

Rather than handing a scatterlist[] and out and in numbers to
virtqueue_add_buf(), hand two separate ones which can be chained.

I shall refrain from ranting about what a disgusting hack chained
scatterlists are.  I'll just note that this doesn't make things
simpler (see diff).

The scatterlists we use can be too large for the stack, so we put them
in our device struct and reuse them.  But in many cases we don't want
to pay the cost of sg_init_table() as we don't know how many elements
we'll have and we'd have to initialize the entire table.

This means we have two choices: carefully reset the end markers after
we call virtqueue_add_buf(), which we do in virtio_net for the xmit
path where it's easy and we want to be optimal.  Elsewhere we
implement a helper to unset the end markers after we've filled the
array.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
 drivers/block/virtio_blk.c          | 57 ++++++++++++++++++-----------
 drivers/char/hw_random/virtio-rng.c |  2 +-
 drivers/char/virtio_console.c       |  6 ++--
 drivers/net/virtio_net.c            | 68 ++++++++++++++++++-----------------
 drivers/scsi/virtio_scsi.c          | 18 ++++++----
 drivers/virtio/virtio_balloon.c     |  6 ++--
 drivers/virtio/virtio_ring.c        | 71 +++++++++++++++++++++++--------------
 include/linux/virtio.h              | 14 ++++++--
 net/9p/trans_virtio.c               | 31 +++++++++++++---
 9 files changed, 172 insertions(+), 101 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0bdde8f..17cf0b7 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -102,8 +102,8 @@ static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
 
 static void virtblk_add_buf_wait(struct virtio_blk *vblk,
 				 struct virtblk_req *vbr,
-				 unsigned long out,
-				 unsigned long in)
+				 struct scatterlist *out,
+				 struct scatterlist *in)
 {
 	DEFINE_WAIT(wait);
 
@@ -112,7 +112,7 @@ static void virtblk_add_buf_wait(struct virtio_blk *vblk,
 					  TASK_UNINTERRUPTIBLE);
 
 		spin_lock_irq(vblk->disk->queue->queue_lock);
-		if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+		if (virtqueue_add_buf(vblk->vq, out, in, vbr,
 				      GFP_ATOMIC) < 0) {
 			spin_unlock_irq(vblk->disk->queue->queue_lock);
 			io_schedule();
@@ -128,12 +128,13 @@ static void virtblk_add_buf_wait(struct virtio_blk *vblk,
 }
 
 static inline void virtblk_add_req(struct virtblk_req *vbr,
-				   unsigned int out, unsigned int in)
+				   struct scatterlist *out,
+				   struct scatterlist *in)
 {
 	struct virtio_blk *vblk = vbr->vblk;
 
 	spin_lock_irq(vblk->disk->queue->queue_lock);
-	if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+	if (unlikely(virtqueue_add_buf(vblk->vq, out, in, vbr,
 					GFP_ATOMIC) < 0)) {
 		spin_unlock_irq(vblk->disk->queue->queue_lock);
 		virtblk_add_buf_wait(vblk, vbr, out, in);
@@ -154,7 +155,11 @@ static int virtblk_bio_send_flush(struct virtblk_req *vbr)
 	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 	sg_set_buf(&vbr->sg[out + in++], &vbr->status, sizeof(vbr->status));
 
-	virtblk_add_req(vbr, out, in);
+	sg_unset_end_markers(vbr->sg, out + in);
+	sg_mark_end(&vbr->sg[out - 1]);
+	sg_mark_end(&vbr->sg[out + in - 1]);
+
+	virtblk_add_req(vbr, vbr->sg, vbr->sg + out);
 
 	return 0;
 }
@@ -174,9 +179,6 @@ static int virtblk_bio_send_data(struct virtblk_req *vbr)
 
 	num = blk_bio_map_sg(vblk->disk->queue, bio, vbr->sg + out);
 
-	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
-		   sizeof(vbr->status));
-
 	if (num) {
 		if (bio->bi_rw & REQ_WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
@@ -187,7 +189,13 @@ static int virtblk_bio_send_data(struct virtblk_req *vbr)
 		}
 	}
 
-	virtblk_add_req(vbr, out, in);
+	sg_set_buf(&vbr->sg[out + in++], &vbr->status, sizeof(vbr->status));
+
+	sg_unset_end_markers(vbr->sg, out + in);
+	sg_mark_end(&vbr->sg[out - 1]);
+	sg_mark_end(&vbr->sg[out + in - 1]);
+
+	virtblk_add_req(vbr, vbr->sg, vbr->sg + out);
 
 	return 0;
 }
@@ -335,6 +343,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
+	/* We lay out our scatterlist in a single array, out then in. */
 	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 
 	/*
@@ -346,17 +355,9 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
 		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
 
+	/* This marks the end of the sg list at vblk->sg[out]. */
 	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
 
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
-		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
-		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
-			   sizeof(vbr->in_hdr));
-	}
-
-	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
-		   sizeof(vbr->status));
-
 	if (num) {
 		if (rq_data_dir(vbr->req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
@@ -367,8 +368,22 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
-			      GFP_ATOMIC) < 0) {
+	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		sg_set_buf(&vblk->sg[out + in++], vbr->req->sense,
+			   SCSI_SENSE_BUFFERSIZE);
+		sg_set_buf(&vblk->sg[out + in++], &vbr->in_hdr,
+			   sizeof(vbr->in_hdr));
+	}
+
+	sg_set_buf(&vblk->sg[out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	sg_unset_end_markers(vblk->sg, out + in);
+	sg_mark_end(&vblk->sg[out - 1]);
+	sg_mark_end(&vblk->sg[out + in - 1]);
+
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, vblk->sg + out, vbr, GFP_ATOMIC)
+	    < 0) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index 621f595..4dec874 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -47,7 +47,7 @@ static void register_buffer(u8 *buf, size_t size)
 	sg_init_one(&sg, buf, size);
 
 	/* There should always be room for one buffer. */
-	if (virtqueue_add_buf(vq, &sg, 0, 1, buf, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, NULL, &sg, buf, GFP_KERNEL) < 0)
 		BUG();
 
 	virtqueue_kick(vq);
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index c594cb1..bc56ff5 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -508,7 +508,7 @@ static int add_inbuf(struct virtqueue *vq, struct port_buffer *buf)
 
 	sg_init_one(sg, buf->buf, buf->size);
 
-	ret = virtqueue_add_buf(vq, sg, 0, 1, buf, GFP_ATOMIC);
+	ret = virtqueue_add_buf(vq, NULL, sg, buf, GFP_ATOMIC);
 	virtqueue_kick(vq);
 	if (!ret)
 		ret = vq->num_free;
@@ -575,7 +575,7 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
 	vq = portdev->c_ovq;
 
 	sg_init_one(sg, &cpkt, sizeof(cpkt));
-	if (virtqueue_add_buf(vq, sg, 1, 0, &cpkt, GFP_ATOMIC) == 0) {
+	if (virtqueue_add_buf(vq, sg, NULL, &cpkt, GFP_ATOMIC) == 0) {
 		virtqueue_kick(vq);
 		while (!virtqueue_get_buf(vq, &len))
 			cpu_relax();
@@ -624,7 +624,7 @@ static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
 
 	reclaim_consumed_buffers(port);
 
-	err = virtqueue_add_buf(out_vq, sg, nents, 0, data, GFP_ATOMIC);
+	err = virtqueue_add_buf(out_vq, sg, NULL, data, GFP_ATOMIC);
 
 	/* Tell Host to go! */
 	virtqueue_kick(out_vq);
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a6fcf15..32f6e13 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -432,11 +432,12 @@ static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
+	sg_init_table(rq->sg, 2);
 	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
 
 	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
 
-	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf(rq->vq, NULL, rq->sg, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
@@ -449,6 +450,8 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	char *p;
 	int i, err, offset;
 
+	sg_init_table(rq->sg, MAX_SKB_FRAGS + 1);
+
 	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
 		first = get_a_page(rq, gfp);
@@ -481,8 +484,7 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
-				first, gfp);
+	err = virtqueue_add_buf(rq->vq, NULL, rq->sg, first, gfp);
 	if (err < 0)
 		give_pages(rq, first);
 
@@ -500,7 +502,7 @@ static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 
 	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(rq->vq, NULL, rq->sg, page, gfp);
 	if (err < 0)
 		give_pages(rq, page);
 
@@ -664,6 +666,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	unsigned num_sg;
+	int ret;
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
@@ -703,8 +706,15 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 		sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);
 
 	num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(sq->vq, sq->sg, num_sg,
-				 0, skb, GFP_ATOMIC);
+	ret = virtqueue_add_buf(sq->vq, sq->sg, NULL, skb, GFP_ATOMIC);
+
+	/*
+	 * An optimization: clear the end bit set by skb_to_sgvec, so
+	 * we can simply re-use sq->sg[] next time.
+	 */
+	sq->sg[num_sg-1].page_link &= ~0x02;
+
+	return ret;
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -825,32 +835,30 @@ static void virtnet_netpoll(struct net_device *dev)
  * never fail unless improperly formated.
  */
 static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
-				 struct scatterlist *data, int out, int in)
+				 struct scatterlist *cmdsg)
 {
-	struct scatterlist *s, sg[VIRTNET_SEND_COMMAND_SG_MAX + 2];
+	struct scatterlist in[1], out[2];
 	struct virtio_net_ctrl_hdr ctrl;
 	virtio_net_ctrl_ack status = ~0;
 	unsigned int tmp;
-	int i;
 
 	/* Caller should know better */
-	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ||
-		(out + in > VIRTNET_SEND_COMMAND_SG_MAX));
-
-	out++; /* Add header */
-	in++; /* Add return status */
+	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
+	/* Prepend header to output */
+	sg_init_table(out, 2);
 	ctrl.class = class;
 	ctrl.cmd = cmd;
+	sg_set_buf(&out[0], &ctrl, sizeof(ctrl));
+	if (cmdsg)
+		sg_chain(out, 2, cmdsg);
+	else
+		sg_mark_end(&out[0]);
 
-	sg_init_table(sg, out + in);
-
-	sg_set_buf(&sg[0], &ctrl, sizeof(ctrl));
-	for_each_sg(data, s, out + in - 2, i)
-		sg_set_buf(&sg[i + 1], sg_virt(s), s->length);
-	sg_set_buf(&sg[out + in - 1], &status, sizeof(status));
+	/* Status response */
+	sg_init_one(in, &status, sizeof(status));
 
-	BUG_ON(virtqueue_add_buf(vi->cvq, sg, out, in, vi, GFP_ATOMIC) < 0);
+	BUG_ON(virtqueue_add_buf(vi->cvq, out, in, vi, GFP_ATOMIC) < 0);
 
 	virtqueue_kick(vi->cvq);
 
@@ -868,8 +876,7 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
 {
 	rtnl_lock();
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE,
-				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL,
-				  0, 0))
+				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL))
 		dev_warn(&vi->dev->dev, "Failed to ack link announce.\n");
 	rtnl_unlock();
 }
@@ -887,7 +894,7 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
 	sg_init_one(&sg, &s, sizeof(s));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
-				  VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &sg, 1, 0)){
+				  VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &sg)){
 		dev_warn(&dev->dev, "Fail to set num of queue pairs to %d\n",
 			 queue_pairs);
 		return -EINVAL;
@@ -933,16 +940,14 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	sg_init_one(sg, &promisc, sizeof(promisc));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
-				  VIRTIO_NET_CTRL_RX_PROMISC,
-				  sg, 1, 0))
+				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
 			 promisc ? "en" : "dis");
 
 	sg_init_one(sg, &allmulti, sizeof(allmulti));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
-				  VIRTIO_NET_CTRL_RX_ALLMULTI,
-				  sg, 1, 0))
+				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
 			 allmulti ? "en" : "dis");
 
@@ -980,8 +985,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 		   sizeof(mac_data->entries) + (mc_count * ETH_ALEN));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
-				  VIRTIO_NET_CTRL_MAC_TABLE_SET,
-				  sg, 2, 0))
+				  VIRTIO_NET_CTRL_MAC_TABLE_SET, sg))
 		dev_warn(&dev->dev, "Failed to set MAC fitler table.\n");
 
 	kfree(buf);
@@ -995,7 +999,7 @@ static int virtnet_vlan_rx_add_vid(struct net_device *dev, u16 vid)
 	sg_init_one(&sg, &vid, sizeof(vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
-				  VIRTIO_NET_CTRL_VLAN_ADD, &sg, 1, 0))
+				  VIRTIO_NET_CTRL_VLAN_ADD, &sg))
 		dev_warn(&dev->dev, "Failed to add VLAN ID %d.\n", vid);
 	return 0;
 }
@@ -1008,7 +1012,7 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev, u16 vid)
 	sg_init_one(&sg, &vid, sizeof(vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
-				  VIRTIO_NET_CTRL_VLAN_DEL, &sg, 1, 0))
+				  VIRTIO_NET_CTRL_VLAN_DEL, &sg))
 		dev_warn(&dev->dev, "Failed to kill VLAN ID %d.\n", vid);
 	return 0;
 }
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 74ab67a..5021c64 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -223,7 +223,7 @@ static int virtscsi_kick_event(struct virtio_scsi *vscsi,
 
 	spin_lock_irqsave(&vscsi->event_vq.vq_lock, flags);
 
-	err = virtqueue_add_buf(vscsi->event_vq.vq, &sg, 0, 1, event_node,
+	err = virtqueue_add_buf(vscsi->event_vq.vq, NULL, &sg, event_node,
 				GFP_ATOMIC);
 	if (!err)
 		virtqueue_kick(vscsi->event_vq.vq);
@@ -378,7 +378,7 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
  */
 static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
 			     struct virtio_scsi_cmd *cmd,
-			     unsigned *out_num, unsigned *in_num,
+			     unsigned *out, unsigned *in,
 			     size_t req_size, size_t resp_size)
 {
 	struct scsi_cmnd *sc = cmd->sc;
@@ -392,7 +392,7 @@ static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
 	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
 		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
 
-	*out_num = idx;
+	*out = idx;
 
 	/* Response header.  */
 	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
@@ -401,7 +401,11 @@ static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
 	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
 		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
 
-	*in_num = idx - *out_num;
+	*in = idx - *out;
+
+	sg_unset_end_markers(sg, *out + *in);
+	sg_mark_end(&sg[*out - 1]);
+	sg_mark_end(&sg[*out + *in - 1]);
 }
 
 static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
@@ -409,16 +413,16 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
 			     struct virtio_scsi_cmd *cmd,
 			     size_t req_size, size_t resp_size, gfp_t gfp)
 {
-	unsigned int out_num, in_num;
+	unsigned int out, in;
 	unsigned long flags;
 	int err;
 	bool needs_kick = false;
 
 	spin_lock_irqsave(&tgt->tgt_lock, flags);
-	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
+	virtscsi_map_cmd(tgt, cmd, &out, &in, req_size, resp_size);
 
 	spin_lock(&vq->vq_lock);
-	err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
+	err = virtqueue_add_buf(vq->vq, tgt->sg, tgt->sg + out, cmd, gfp);
 	spin_unlock(&tgt->tgt_lock);
 	if (!err)
 		needs_kick = virtqueue_kick_prepare(vq->vq);
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d19fe3e..181cef1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -108,7 +108,7 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
 	/* We should always be able to add one buffer to an empty queue. */
-	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
 
@@ -256,7 +256,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	if (!virtqueue_get_buf(vq, &len))
 		return;
 	sg_init_one(&sg, vb->stats, sizeof(vb->stats));
-	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
 }
@@ -341,7 +341,7 @@ static int init_vqs(struct virtio_balloon *vb)
 		 * use it to signal us later.
 		 */
 		sg_init_one(&sg, vb->stats, sizeof vb->stats);
-		if (virtqueue_add_buf(vb->stats_vq, &sg, 1, 0, vb, GFP_KERNEL)
+		if (virtqueue_add_buf(vb->stats_vq, &sg, NULL, vb, GFP_KERNEL)
 		    < 0)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index ffd7e7d..277021b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -119,13 +119,18 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/* This doesn't have the counter that for_each_sg() has */
+#define foreach_sg(sglist, i)			\
+	for (i = (sglist); i; i = sg_next(i))
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static int vring_add_indirect(struct vring_virtqueue *vq,
-			      struct scatterlist sg[],
-			      unsigned int out,
-			      unsigned int in,
+			      unsigned int num,
+			      struct scatterlist *out,
+			      struct scatterlist *in,
 			      gfp_t gfp)
 {
+	struct scatterlist *sg;
 	struct vring_desc *desc;
 	unsigned head;
 	int i;
@@ -137,24 +142,25 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	 */
 	gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH);
 
-	desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+	desc = kmalloc(num * sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return -ENOMEM;
 
 	/* Transfer entries from the sg list into the indirect page */
-	for (i = 0; i < out; i++) {
+	i = 0;
+	foreach_sg(out, sg) {
 		desc[i].flags = VRING_DESC_F_NEXT;
 		desc[i].addr = sg_phys(sg);
 		desc[i].len = sg->length;
 		desc[i].next = i+1;
-		sg++;
+		i++;
 	}
-	for (; i < (out + in); i++) {
+	foreach_sg(in, sg) {
 		desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 		desc[i].addr = sg_phys(sg);
 		desc[i].len = sg->length;
 		desc[i].next = i+1;
-		sg++;
+		i++;
 	}
 
 	/* Last one doesn't continue. */
@@ -176,12 +182,21 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	return head;
 }
 
+static unsigned int count_sg(struct scatterlist *sg)
+{
+	unsigned int count = 0;
+	struct scatterlist *i;
+
+	foreach_sg(sg, i)
+		count++;
+	return count;
+}
+
 /**
  * virtqueue_add_buf - expose buffer to other end
  * @vq: the struct virtqueue we're talking about.
- * @sg: the description of the buffer(s).
- * @out_num: the number of sg readable by other side
- * @in_num: the number of sg which are writable (after readable ones)
+ * @out: the description of the output buffer(s).
+ * @in: the description of the input buffer(s).
  * @data: the token identifying the buffer.
  * @gfp: how to do memory allocations (if necessary).
  *
@@ -191,20 +206,23 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
  * Returns zero or a negative error (ie. ENOSPC, ENOMEM).
  */
 int virtqueue_add_buf(struct virtqueue *_vq,
-		      struct scatterlist sg[],
-		      unsigned int out,
-		      unsigned int in,
+		      struct scatterlist *out,
+		      struct scatterlist *in,
 		      void *data,
 		      gfp_t gfp)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	unsigned int i, avail, uninitialized_var(prev);
+	unsigned int i, avail, uninitialized_var(prev), num;
+	struct scatterlist *sg;
 	int head;
 
 	START_USE(vq);
 
 	BUG_ON(data == NULL);
 
+	num = count_sg(out) + count_sg(in);
+	BUG_ON(num == 0);
+
 #ifdef DEBUG
 	{
 		ktime_t now = ktime_get();
@@ -220,18 +238,17 @@ int virtqueue_add_buf(struct virtqueue *_vq,
 
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && (out + in) > 1 && vq->vq.num_free) {
-		head = vring_add_indirect(vq, sg, out, in, gfp);
+	if (vq->indirect && num > 1 && vq->vq.num_free) {
+		head = vring_add_indirect(vq, num, out, in, gfp);
 		if (likely(head >= 0))
 			goto add_head;
 	}
 
-	BUG_ON(out + in > vq->vring.num);
-	BUG_ON(out + in == 0);
+	BUG_ON(num > vq->vring.num);
 
-	if (vq->vq.num_free < out + in) {
+	if (vq->vq.num_free < num) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
-			 out + in, vq->vq.num_free);
+			 num, vq->vq.num_free);
 		/* FIXME: for historical reasons, we force a notify here if
 		 * there are outgoing parts to the buffer.  Presumably the
 		 * host should service the ring ASAP. */
@@ -242,22 +259,22 @@ int virtqueue_add_buf(struct virtqueue *_vq,
 	}
 
 	/* We're about to use some buffers from the free list. */
-	vq->vq.num_free -= out + in;
+	vq->vq.num_free -= num;
 
-	head = vq->free_head;
-	for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
+	i = head = vq->free_head;
+	foreach_sg(out, sg) {
 		vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
 		vq->vring.desc[i].addr = sg_phys(sg);
 		vq->vring.desc[i].len = sg->length;
 		prev = i;
-		sg++;
+		i = vq->vring.desc[i].next;
 	}
-	for (; in; i = vq->vring.desc[i].next, in--) {
+	foreach_sg(in, sg) {
 		vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 		vq->vring.desc[i].addr = sg_phys(sg);
 		vq->vring.desc[i].len = sg->length;
 		prev = i;
-		sg++;
+		i = vq->vring.desc[i].next;
 	}
 	/* Last one doesn't continue. */
 	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index cf8adb1..69509a8 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -33,10 +33,18 @@ struct virtqueue {
 	void *priv;
 };
 
+static inline void sg_unset_end_markers(struct scatterlist *sg,
+					unsigned int num)
+{
+	unsigned int i;
+
+	for (i = 0; i < num; i++)
+		sg[i].page_link &= ~0x02;
+}
+
 int virtqueue_add_buf(struct virtqueue *vq,
-		      struct scatterlist sg[],
-		      unsigned int out_num,
-		      unsigned int in_num,
+		      struct scatterlist *out,
+		      struct scatterlist *in,
 		      void *data,
 		      gfp_t gfp);
 
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index fd05c81..7c5ac34 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -259,6 +259,7 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
 	int in, out;
 	unsigned long flags;
 	struct virtio_chan *chan = client->trans;
+	struct scatterlist *outsg = NULL, *insg = NULL;
 
 	p9_debug(P9_DEBUG_TRANS, "9p debug: virtio request\n");
 
@@ -269,12 +270,21 @@ req_retry:
 	/* Handle out VirtIO ring buffers */
 	out = pack_sg_list(chan->sg, 0,
 			   VIRTQUEUE_NUM, req->tc->sdata, req->tc->size);
+	if (out) {
+		sg_unset_end_markers(chan->sg, out - 1);
+		sg_mark_end(&chan->sg[out - 1]);
+		outsg = chan->sg;
+	}
 
 	in = pack_sg_list(chan->sg, out,
 			  VIRTQUEUE_NUM, req->rc->sdata, req->rc->capacity);
+	if (in) {
+		sg_unset_end_markers(chan->sg + out, in - 1);
+		sg_mark_end(&chan->sg[out + in - 1]);
+		insg = chan->sg + out;
+	}
 
-	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
-				GFP_ATOMIC);
+	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
 	if (err < 0) {
 		if (err == -ENOSPC) {
 			chan->ring_bufs_avail = 0;
@@ -356,6 +366,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	int in_nr_pages = 0, out_nr_pages = 0;
 	struct page **in_pages = NULL, **out_pages = NULL;
 	struct virtio_chan *chan = client->trans;
+	struct scatterlist *insg = NULL, *outsg = NULL;
 
 	p9_debug(P9_DEBUG_TRANS, "virtio request\n");
 
@@ -403,6 +414,13 @@ req_retry_pinned:
 	if (out_pages)
 		out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
 				      out_pages, out_nr_pages, uodata, outlen);
+
+	if (out) {
+		sg_unset_end_markers(chan->sg, out - 1);
+		sg_mark_end(&chan->sg[out - 1]);
+		outsg = chan->sg;
+	}
+
 	/*
 	 * Take care of in data
 	 * For example TREAD have 11.
@@ -416,8 +434,13 @@ req_retry_pinned:
 		in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
 				     in_pages, in_nr_pages, uidata, inlen);
 
-	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
-				GFP_ATOMIC);
+	if (in) {
+		sg_unset_end_markers(chan->sg + out, in - 1);
+		sg_mark_end(&chan->sg[out + in - 1]);
+		insg = chan->sg + out;
+	}
+
+	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
 	if (err < 0) {
 		if (err == -ENOSPC) {
 			chan->ring_bufs_avail = 0;
-- 
1.8.1



> 
> Cheers,
> Rusty.


^ permalink raw reply related	[flat|nested] 86+ messages in thread


* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
@ 2013-01-03  8:58       ` Wanlong Gao
  0 siblings, 0 replies; 86+ messages in thread
From: Wanlong Gao @ 2013-01-03  8:58 UTC (permalink / raw)
  To: Rusty Russell
  Cc: kvm, linux-scsi, mst, hutao, linux-kernel, virtualization,
	stefanha, Paolo Bonzini

On 01/02/2013 01:03 PM, Rusty Russell wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>> The virtqueue_add_buf function has two limitations:
>>
>> 1) it requires the caller to provide all the buffers in a single call;
>>
>> 2) it does not support chained scatterlists: the buffers must be
>> provided as an array of struct scatterlist;
> 
> Chained scatterlists are a horrible interface, but that doesn't mean we
> shouldn't support them if there's a need.
> 
> I think I once even had a patch which passed two chained sgs, rather
> than a combo sg and two length numbers.  It's very old, but I've pasted
> it below.
> 
> Duplicating the implementation by having another interface is pretty
> nasty; I think I'd prefer the chained scatterlists, if that's optimal
> for you.
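
(To make the interface change concrete, here is a rough before/after sketch
for the single-buffer callers converted below; "sg", "data" and "gfp" are
just placeholders.  An out-only buffer goes from

	sg_init_one(&sg, buf, len);
	err = virtqueue_add_buf(vq, &sg, 1, 0, data, gfp);

to

	sg_init_one(&sg, buf, len);
	err = virtqueue_add_buf(vq, &sg, NULL, data, gfp);

and an in-only buffer from

	err = virtqueue_add_buf(vq, &sg, 0, 1, data, gfp);

to

	err = virtqueue_add_buf(vq, NULL, &sg, data, gfp);

i.e. the out/in counts are replaced by two scatterlists -- each possibly
chained -- describing the readable and the writable parts.)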

I rebased this against virtio-next and used it in virtio-scsi, then tested it with 4 virtio-scsi
target devices and the host CPU set to idle=poll. I saw a small performance regression here.

General:
Run status group 0 (all jobs):
   READ: io=34675MB, aggrb=248257KB/s, minb=248257KB/s, maxb=248257KB/s, mint=143025msec, maxt=143025msec
  WRITE: io=34625MB, aggrb=247902KB/s, minb=247902KB/s, maxb=247902KB/s, mint=143025msec, maxt=143025msec

Chained:
Run status group 0 (all jobs):
   READ: io=34863MB, aggrb=242320KB/s, minb=242320KB/s, maxb=242320KB/s, mint=147325msec, maxt=147325msec
  WRITE: io=34437MB, aggrb=239357KB/s, minb=239357KB/s, maxb=239357KB/s, mint=147325msec, maxt=147325msec

Thanks,
Wanlong Gao

From d3181b3f9bbdebbd3f2928b64821b406774757f8 Mon Sep 17 00:00:00 2001
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Wed, 2 Jan 2013 16:43:49 +0800
Subject: [PATCH] virtio: use chained scatterlists

Rather than handing a scatterlist[] and out and in numbers to
virtqueue_add_buf(), hand two separate ones which can be chained.

I shall refrain from ranting about what a disgusting hack chained
scatterlists are.  I'll just note that this doesn't make things
simpler (see diff).

The scatterlists we use can be too large for the stack, so we put them
in our device struct and reuse them.  But in many cases we don't want
to pay the cost of sg_init_table() as we don't know how many elements
we'll have and we'd have to initialize the entire table.

This means we have two choices: carefully reset the end markers after
we call virtqueue_add_buf(), which we do in virtio_net for the xmit
path where it's easy and we want to be optimal.  Elsewhere we
implement a helper to unset the end markers after we've filled the
array.
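
A minimal sketch of that helper-based pattern, as used by the virtio_blk and
virtio_scsi conversions below (the buffer names are placeholders):

	/* Fill one reused sg[] array: readable entries first, then writable. */
	sg_set_buf(&sg[0], &out_hdr, sizeof(out_hdr));	/* out */
	sg_set_buf(&sg[1], &status, sizeof(status));	/* in */

	/*
	 * Clear end markers left over from earlier use of the array, then
	 * terminate the out part and the in part separately.
	 */
	sg_unset_end_markers(sg, 2);
	sg_mark_end(&sg[0]);
	sg_mark_end(&sg[1]);

	/* The out list starts at sg, the in list at sg + 1. */
	err = virtqueue_add_buf(vq, sg, sg + 1, data, gfp);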

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
 drivers/block/virtio_blk.c          | 57 ++++++++++++++++++-----------
 drivers/char/hw_random/virtio-rng.c |  2 +-
 drivers/char/virtio_console.c       |  6 ++--
 drivers/net/virtio_net.c            | 68 ++++++++++++++++++-----------------
 drivers/scsi/virtio_scsi.c          | 18 ++++++----
 drivers/virtio/virtio_balloon.c     |  6 ++--
 drivers/virtio/virtio_ring.c        | 71 +++++++++++++++++++++++--------------
 include/linux/virtio.h              | 14 ++++++--
 net/9p/trans_virtio.c               | 31 +++++++++++++---
 9 files changed, 172 insertions(+), 101 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0bdde8f..17cf0b7 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -102,8 +102,8 @@ static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
 
 static void virtblk_add_buf_wait(struct virtio_blk *vblk,
 				 struct virtblk_req *vbr,
-				 unsigned long out,
-				 unsigned long in)
+				 struct scatterlist *out,
+				 struct scatterlist *in)
 {
 	DEFINE_WAIT(wait);
 
@@ -112,7 +112,7 @@ static void virtblk_add_buf_wait(struct virtio_blk *vblk,
 					  TASK_UNINTERRUPTIBLE);
 
 		spin_lock_irq(vblk->disk->queue->queue_lock);
-		if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+		if (virtqueue_add_buf(vblk->vq, out, in, vbr,
 				      GFP_ATOMIC) < 0) {
 			spin_unlock_irq(vblk->disk->queue->queue_lock);
 			io_schedule();
@@ -128,12 +128,13 @@ static void virtblk_add_buf_wait(struct virtio_blk *vblk,
 }
 
 static inline void virtblk_add_req(struct virtblk_req *vbr,
-				   unsigned int out, unsigned int in)
+				   struct scatterlist *out,
+				   struct scatterlist *in)
 {
 	struct virtio_blk *vblk = vbr->vblk;
 
 	spin_lock_irq(vblk->disk->queue->queue_lock);
-	if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+	if (unlikely(virtqueue_add_buf(vblk->vq, out, in, vbr,
 					GFP_ATOMIC) < 0)) {
 		spin_unlock_irq(vblk->disk->queue->queue_lock);
 		virtblk_add_buf_wait(vblk, vbr, out, in);
@@ -154,7 +155,11 @@ static int virtblk_bio_send_flush(struct virtblk_req *vbr)
 	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 	sg_set_buf(&vbr->sg[out + in++], &vbr->status, sizeof(vbr->status));
 
-	virtblk_add_req(vbr, out, in);
+	sg_unset_end_markers(vbr->sg, out + in);
+	sg_mark_end(&vbr->sg[out - 1]);
+	sg_mark_end(&vbr->sg[out + in - 1]);
+
+	virtblk_add_req(vbr, vbr->sg, vbr->sg + out);
 
 	return 0;
 }
@@ -174,9 +179,6 @@ static int virtblk_bio_send_data(struct virtblk_req *vbr)
 
 	num = blk_bio_map_sg(vblk->disk->queue, bio, vbr->sg + out);
 
-	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
-		   sizeof(vbr->status));
-
 	if (num) {
 		if (bio->bi_rw & REQ_WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
@@ -187,7 +189,13 @@ static int virtblk_bio_send_data(struct virtblk_req *vbr)
 		}
 	}
 
-	virtblk_add_req(vbr, out, in);
+	sg_set_buf(&vbr->sg[out + in++], &vbr->status, sizeof(vbr->status));
+
+	sg_unset_end_markers(vbr->sg, out + in);
+	sg_mark_end(&vbr->sg[out - 1]);
+	sg_mark_end(&vbr->sg[out + in - 1]);
+
+	virtblk_add_req(vbr, vbr->sg, vbr->sg + out);
 
 	return 0;
 }
@@ -335,6 +343,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
+	/* We lay out our scatterlist in a single array, out then in. */
 	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 
 	/*
@@ -346,17 +355,9 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
 		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
 
+	/* This marks the end of the sg list at vblk->sg[out]. */
 	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
 
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
-		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
-		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
-			   sizeof(vbr->in_hdr));
-	}
-
-	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
-		   sizeof(vbr->status));
-
 	if (num) {
 		if (rq_data_dir(vbr->req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
@@ -367,8 +368,22 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
-			      GFP_ATOMIC) < 0) {
+	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		sg_set_buf(&vblk->sg[out + in++], vbr->req->sense,
+			   SCSI_SENSE_BUFFERSIZE);
+		sg_set_buf(&vblk->sg[out + in++], &vbr->in_hdr,
+			   sizeof(vbr->in_hdr));
+	}
+
+	sg_set_buf(&vblk->sg[out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	sg_unset_end_markers(vblk->sg, out + in);
+	sg_mark_end(&vblk->sg[out - 1]);
+	sg_mark_end(&vblk->sg[out + in - 1]);
+
+	if (virtqueue_add_buf(vblk->vq, vblk->sg, vblk->sg + out, vbr, GFP_ATOMIC)
+	    < 0) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index 621f595..4dec874 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -47,7 +47,7 @@ static void register_buffer(u8 *buf, size_t size)
 	sg_init_one(&sg, buf, size);
 
 	/* There should always be room for one buffer. */
-	if (virtqueue_add_buf(vq, &sg, 0, 1, buf, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, NULL, &sg, buf, GFP_KERNEL) < 0)
 		BUG();
 
 	virtqueue_kick(vq);
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index c594cb1..bc56ff5 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -508,7 +508,7 @@ static int add_inbuf(struct virtqueue *vq, struct port_buffer *buf)
 
 	sg_init_one(sg, buf->buf, buf->size);
 
-	ret = virtqueue_add_buf(vq, sg, 0, 1, buf, GFP_ATOMIC);
+	ret = virtqueue_add_buf(vq, NULL, sg, buf, GFP_ATOMIC);
 	virtqueue_kick(vq);
 	if (!ret)
 		ret = vq->num_free;
@@ -575,7 +575,7 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
 	vq = portdev->c_ovq;
 
 	sg_init_one(sg, &cpkt, sizeof(cpkt));
-	if (virtqueue_add_buf(vq, sg, 1, 0, &cpkt, GFP_ATOMIC) == 0) {
+	if (virtqueue_add_buf(vq, sg, NULL, &cpkt, GFP_ATOMIC) == 0) {
 		virtqueue_kick(vq);
 		while (!virtqueue_get_buf(vq, &len))
 			cpu_relax();
@@ -624,7 +624,7 @@ static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
 
 	reclaim_consumed_buffers(port);
 
-	err = virtqueue_add_buf(out_vq, sg, nents, 0, data, GFP_ATOMIC);
+	err = virtqueue_add_buf(out_vq, sg, NULL, data, GFP_ATOMIC);
 
 	/* Tell Host to go! */
 	virtqueue_kick(out_vq);
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a6fcf15..32f6e13 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -432,11 +432,12 @@ static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
+	sg_init_table(rq->sg, 2);
 	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
 
 	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
 
-	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf(rq->vq, NULL, rq->sg, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
@@ -449,6 +450,8 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	char *p;
 	int i, err, offset;
 
+	sg_init_table(rq->sg, MAX_SKB_FRAGS + 1);
+
 	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
 		first = get_a_page(rq, gfp);
@@ -481,8 +484,7 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
-				first, gfp);
+	err = virtqueue_add_buf(rq->vq, NULL, rq->sg, first, gfp);
 	if (err < 0)
 		give_pages(rq, first);
 
@@ -500,7 +502,7 @@ static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 
 	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(rq->vq, NULL, rq->sg, page, gfp);
 	if (err < 0)
 		give_pages(rq, page);
 
@@ -664,6 +666,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	unsigned num_sg;
+	int ret;
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
@@ -703,8 +706,15 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 		sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);
 
 	num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(sq->vq, sq->sg, num_sg,
-				 0, skb, GFP_ATOMIC);
+	ret = virtqueue_add_buf(sq->vq, sq->sg, NULL, skb, GFP_ATOMIC);
+
+	/*
+	 * An optimization: clear the end bit set by skb_to_sgvec, so
+	 * we can simply re-use sq->sg[] next time.
+	 */
+	sq->sg[num_sg-1].page_link &= ~0x02;
+
+	return ret;
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -825,32 +835,30 @@ static void virtnet_netpoll(struct net_device *dev)
  * never fail unless improperly formated.
  */
 static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
-				 struct scatterlist *data, int out, int in)
+				 struct scatterlist *cmdsg)
 {
-	struct scatterlist *s, sg[VIRTNET_SEND_COMMAND_SG_MAX + 2];
+	struct scatterlist in[1], out[2];
 	struct virtio_net_ctrl_hdr ctrl;
 	virtio_net_ctrl_ack status = ~0;
 	unsigned int tmp;
-	int i;
 
 	/* Caller should know better */
-	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ||
-		(out + in > VIRTNET_SEND_COMMAND_SG_MAX));
-
-	out++; /* Add header */
-	in++; /* Add return status */
+	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
+	/* Prepend header to output */
+	sg_init_table(out, 2);
 	ctrl.class = class;
 	ctrl.cmd = cmd;
+	sg_set_buf(&out[0], &ctrl, sizeof(ctrl));
+	if (cmdsg)
+		sg_chain(out, 2, cmdsg);
+	else
+		sg_mark_end(&out[0]);
 
-	sg_init_table(sg, out + in);
-
-	sg_set_buf(&sg[0], &ctrl, sizeof(ctrl));
-	for_each_sg(data, s, out + in - 2, i)
-		sg_set_buf(&sg[i + 1], sg_virt(s), s->length);
-	sg_set_buf(&sg[out + in - 1], &status, sizeof(status));
+	/* Status response */
+	sg_init_one(in, &status, sizeof(status));
 
-	BUG_ON(virtqueue_add_buf(vi->cvq, sg, out, in, vi, GFP_ATOMIC) < 0);
+	BUG_ON(virtqueue_add_buf(vi->cvq, out, in, vi, GFP_ATOMIC) < 0);
 
 	virtqueue_kick(vi->cvq);
 
@@ -868,8 +876,7 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
 {
 	rtnl_lock();
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE,
-				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL,
-				  0, 0))
+				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL))
 		dev_warn(&vi->dev->dev, "Failed to ack link announce.\n");
 	rtnl_unlock();
 }
@@ -887,7 +894,7 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
 	sg_init_one(&sg, &s, sizeof(s));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
-				  VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &sg, 1, 0)){
+				  VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &sg)){
 		dev_warn(&dev->dev, "Fail to set num of queue pairs to %d\n",
 			 queue_pairs);
 		return -EINVAL;
@@ -933,16 +940,14 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	sg_init_one(sg, &promisc, sizeof(promisc));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
-				  VIRTIO_NET_CTRL_RX_PROMISC,
-				  sg, 1, 0))
+				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
 			 promisc ? "en" : "dis");
 
 	sg_init_one(sg, &allmulti, sizeof(allmulti));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
-				  VIRTIO_NET_CTRL_RX_ALLMULTI,
-				  sg, 1, 0))
+				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
 			 allmulti ? "en" : "dis");
 
@@ -980,8 +985,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 		   sizeof(mac_data->entries) + (mc_count * ETH_ALEN));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
-				  VIRTIO_NET_CTRL_MAC_TABLE_SET,
-				  sg, 2, 0))
+				  VIRTIO_NET_CTRL_MAC_TABLE_SET, sg))
 		dev_warn(&dev->dev, "Failed to set MAC fitler table.\n");
 
 	kfree(buf);
@@ -995,7 +999,7 @@ static int virtnet_vlan_rx_add_vid(struct net_device *dev, u16 vid)
 	sg_init_one(&sg, &vid, sizeof(vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
-				  VIRTIO_NET_CTRL_VLAN_ADD, &sg, 1, 0))
+				  VIRTIO_NET_CTRL_VLAN_ADD, &sg))
 		dev_warn(&dev->dev, "Failed to add VLAN ID %d.\n", vid);
 	return 0;
 }
@@ -1008,7 +1012,7 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev, u16 vid)
 	sg_init_one(&sg, &vid, sizeof(vid));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
-				  VIRTIO_NET_CTRL_VLAN_DEL, &sg, 1, 0))
+				  VIRTIO_NET_CTRL_VLAN_DEL, &sg))
 		dev_warn(&dev->dev, "Failed to kill VLAN ID %d.\n", vid);
 	return 0;
 }
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 74ab67a..5021c64 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -223,7 +223,7 @@ static int virtscsi_kick_event(struct virtio_scsi *vscsi,
 
 	spin_lock_irqsave(&vscsi->event_vq.vq_lock, flags);
 
-	err = virtqueue_add_buf(vscsi->event_vq.vq, &sg, 0, 1, event_node,
+	err = virtqueue_add_buf(vscsi->event_vq.vq, NULL, &sg, event_node,
 				GFP_ATOMIC);
 	if (!err)
 		virtqueue_kick(vscsi->event_vq.vq);
@@ -378,7 +378,7 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
  */
 static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
 			     struct virtio_scsi_cmd *cmd,
-			     unsigned *out_num, unsigned *in_num,
+			     unsigned *out, unsigned *in,
 			     size_t req_size, size_t resp_size)
 {
 	struct scsi_cmnd *sc = cmd->sc;
@@ -392,7 +392,7 @@ static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
 	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
 		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
 
-	*out_num = idx;
+	*out = idx;
 
 	/* Response header.  */
 	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
@@ -401,7 +401,11 @@ static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
 	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
 		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
 
-	*in_num = idx - *out_num;
+	*in = idx - *out;
+
+	sg_unset_end_markers(sg, *out + *in);
+	sg_mark_end(&sg[*out - 1]);
+	sg_mark_end(&sg[*out + *in - 1]);
 }
 
 static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
@@ -409,16 +413,16 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
 			     struct virtio_scsi_cmd *cmd,
 			     size_t req_size, size_t resp_size, gfp_t gfp)
 {
-	unsigned int out_num, in_num;
+	unsigned int out, in;
 	unsigned long flags;
 	int err;
 	bool needs_kick = false;
 
 	spin_lock_irqsave(&tgt->tgt_lock, flags);
-	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
+	virtscsi_map_cmd(tgt, cmd, &out, &in, req_size, resp_size);
 
 	spin_lock(&vq->vq_lock);
-	err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
+	err = virtqueue_add_buf(vq->vq, tgt->sg, tgt->sg + out, cmd, gfp);
 	spin_unlock(&tgt->tgt_lock);
 	if (!err)
 		needs_kick = virtqueue_kick_prepare(vq->vq);
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d19fe3e..181cef1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -108,7 +108,7 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
 	/* We should always be able to add one buffer to an empty queue. */
-	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
 
@@ -256,7 +256,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	if (!virtqueue_get_buf(vq, &len))
 		return;
 	sg_init_one(&sg, vb->stats, sizeof(vb->stats));
-	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
 }
@@ -341,7 +341,7 @@ static int init_vqs(struct virtio_balloon *vb)
 		 * use it to signal us later.
 		 */
 		sg_init_one(&sg, vb->stats, sizeof vb->stats);
-		if (virtqueue_add_buf(vb->stats_vq, &sg, 1, 0, vb, GFP_KERNEL)
+		if (virtqueue_add_buf(vb->stats_vq, &sg, NULL, vb, GFP_KERNEL)
 		    < 0)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index ffd7e7d..277021b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -119,13 +119,18 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/* This doesn't have the counter that for_each_sg() has */
+#define foreach_sg(sglist, i)			\
+	for (i = (sglist); i; i = sg_next(i))
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static int vring_add_indirect(struct vring_virtqueue *vq,
-			      struct scatterlist sg[],
-			      unsigned int out,
-			      unsigned int in,
+			      unsigned int num,
+			      struct scatterlist *out,
+			      struct scatterlist *in,
 			      gfp_t gfp)
 {
+	struct scatterlist *sg;
 	struct vring_desc *desc;
 	unsigned head;
 	int i;
@@ -137,24 +142,25 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	 */
 	gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH);
 
-	desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+	desc = kmalloc(num * sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return -ENOMEM;
 
 	/* Transfer entries from the sg list into the indirect page */
-	for (i = 0; i < out; i++) {
+	i = 0;
+	foreach_sg(out, sg) {
 		desc[i].flags = VRING_DESC_F_NEXT;
 		desc[i].addr = sg_phys(sg);
 		desc[i].len = sg->length;
 		desc[i].next = i+1;
-		sg++;
+		i++;
 	}
-	for (; i < (out + in); i++) {
+	foreach_sg(in, sg) {
 		desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 		desc[i].addr = sg_phys(sg);
 		desc[i].len = sg->length;
 		desc[i].next = i+1;
-		sg++;
+		i++;
 	}
 
 	/* Last one doesn't continue. */
@@ -176,12 +182,21 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	return head;
 }
 
+static unsigned int count_sg(struct scatterlist *sg)
+{
+	unsigned int count = 0;
+	struct scatterlist *i;
+
+	foreach_sg(sg, i)
+		count++;
+	return count;
+}
+
 /**
  * virtqueue_add_buf - expose buffer to other end
  * @vq: the struct virtqueue we're talking about.
- * @sg: the description of the buffer(s).
- * @out_num: the number of sg readable by other side
- * @in_num: the number of sg which are writable (after readable ones)
+ * @out: the description of the output buffer(s).
+ * @in: the description of the input buffer(s).
  * @data: the token identifying the buffer.
  * @gfp: how to do memory allocations (if necessary).
  *
@@ -191,20 +206,23 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
  * Returns zero or a negative error (ie. ENOSPC, ENOMEM).
  */
 int virtqueue_add_buf(struct virtqueue *_vq,
-		      struct scatterlist sg[],
-		      unsigned int out,
-		      unsigned int in,
+		      struct scatterlist *out,
+		      struct scatterlist *in,
 		      void *data,
 		      gfp_t gfp)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	unsigned int i, avail, uninitialized_var(prev);
+	unsigned int i, avail, uninitialized_var(prev), num;
+	struct scatterlist *sg;
 	int head;
 
 	START_USE(vq);
 
 	BUG_ON(data == NULL);
 
+	num = count_sg(out) + count_sg(in);
+	BUG_ON(num == 0);
+
 #ifdef DEBUG
 	{
 		ktime_t now = ktime_get();
@@ -220,18 +238,17 @@ int virtqueue_add_buf(struct virtqueue *_vq,
 
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && (out + in) > 1 && vq->vq.num_free) {
-		head = vring_add_indirect(vq, sg, out, in, gfp);
+	if (vq->indirect && num > 1 && vq->vq.num_free) {
+		head = vring_add_indirect(vq, num, out, in, gfp);
 		if (likely(head >= 0))
 			goto add_head;
 	}
 
-	BUG_ON(out + in > vq->vring.num);
-	BUG_ON(out + in == 0);
+	BUG_ON(num > vq->vring.num);
 
-	if (vq->vq.num_free < out + in) {
+	if (vq->vq.num_free < num) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
-			 out + in, vq->vq.num_free);
+			 num, vq->vq.num_free);
 		/* FIXME: for historical reasons, we force a notify here if
 		 * there are outgoing parts to the buffer.  Presumably the
 		 * host should service the ring ASAP. */
@@ -242,22 +259,22 @@ int virtqueue_add_buf(struct virtqueue *_vq,
 	}
 
 	/* We're about to use some buffers from the free list. */
-	vq->vq.num_free -= out + in;
+	vq->vq.num_free -= num;
 
-	head = vq->free_head;
-	for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
+	i = head = vq->free_head;
+	foreach_sg(out, sg) {
 		vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
 		vq->vring.desc[i].addr = sg_phys(sg);
 		vq->vring.desc[i].len = sg->length;
 		prev = i;
-		sg++;
+		i = vq->vring.desc[i].next;
 	}
-	for (; in; i = vq->vring.desc[i].next, in--) {
+	foreach_sg(in, sg) {
 		vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 		vq->vring.desc[i].addr = sg_phys(sg);
 		vq->vring.desc[i].len = sg->length;
 		prev = i;
-		sg++;
+		i = vq->vring.desc[i].next;
 	}
 	/* Last one doesn't continue. */
 	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index cf8adb1..69509a8 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -33,10 +33,18 @@ struct virtqueue {
 	void *priv;
 };
 
+static inline void sg_unset_end_markers(struct scatterlist *sg,
+					unsigned int num)
+{
+	unsigned int i;
+
+	for (i = 0; i < num; i++)
+		sg[i].page_link &= ~0x02;
+}
+
 int virtqueue_add_buf(struct virtqueue *vq,
-		      struct scatterlist sg[],
-		      unsigned int out_num,
-		      unsigned int in_num,
+		      struct scatterlist *out,
+		      struct scatterlist *in,
 		      void *data,
 		      gfp_t gfp);
 
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index fd05c81..7c5ac34 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -259,6 +259,7 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
 	int in, out;
 	unsigned long flags;
 	struct virtio_chan *chan = client->trans;
+	struct scatterlist *outsg = NULL, *insg = NULL;
 
 	p9_debug(P9_DEBUG_TRANS, "9p debug: virtio request\n");
 
@@ -269,12 +270,21 @@ req_retry:
 	/* Handle out VirtIO ring buffers */
 	out = pack_sg_list(chan->sg, 0,
 			   VIRTQUEUE_NUM, req->tc->sdata, req->tc->size);
+	if (out) {
+		sg_unset_end_markers(chan->sg, out - 1);
+		sg_mark_end(&chan->sg[out - 1]);
+		outsg = chan->sg;
+	}
 
 	in = pack_sg_list(chan->sg, out,
 			  VIRTQUEUE_NUM, req->rc->sdata, req->rc->capacity);
+	if (in) {
+		sg_unset_end_markers(chan->sg + out, in - 1);
+		sg_mark_end(&chan->sg[out + in - 1]);
+		insg = chan->sg + out;
+	}
 
-	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
-				GFP_ATOMIC);
+	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
 	if (err < 0) {
 		if (err == -ENOSPC) {
 			chan->ring_bufs_avail = 0;
@@ -356,6 +366,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	int in_nr_pages = 0, out_nr_pages = 0;
 	struct page **in_pages = NULL, **out_pages = NULL;
 	struct virtio_chan *chan = client->trans;
+	struct scatterlist *insg = NULL, *outsg = NULL;
 
 	p9_debug(P9_DEBUG_TRANS, "virtio request\n");
 
@@ -403,6 +414,13 @@ req_retry_pinned:
 	if (out_pages)
 		out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
 				      out_pages, out_nr_pages, uodata, outlen);
+
+	if (out) {
+		sg_unset_end_markers(chan->sg, out - 1);
+		sg_mark_end(&chan->sg[out - 1]);
+		outsg = chan->sg;
+	}
+
 	/*
 	 * Take care of in data
 	 * For example TREAD have 11.
@@ -416,8 +434,13 @@ req_retry_pinned:
 		in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
 				     in_pages, in_nr_pages, uidata, inlen);
 
-	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
-				GFP_ATOMIC);
+	if (in) {
+		sg_unset_end_markers(chan->sg + out, in - 1);
+		sg_mark_end(&chan->sg[out + in - 1]);
+		insg = chan->sg + out;
+	}
+
+	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
 	if (err < 0) {
 		if (err == -ENOSPC) {
 			chan->ring_bufs_avail = 0;
-- 
1.8.1



> 
> Cheers,
> Rusty.

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2013-01-02  5:03     ` Rusty Russell
@ 2013-01-03  9:22       ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2013-01-03  9:22 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, linux-scsi, kvm, mst, hutao, virtualization, stefanha

On 02/01/2013 06:03, Rusty Russell wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>> The virtqueue_add_buf function has two limitations:
>>
>> 1) it requires the caller to provide all the buffers in a single call;
>>
>> 2) it does not support chained scatterlists: the buffers must be
>> provided as an array of struct scatterlist;
> 
> Chained scatterlists are a horrible interface, but that doesn't mean we
> shouldn't support them if there's a need.
> 
> I think I once even had a patch which passed two chained sgs, rather
> than a combo sg and two length numbers.  It's very old, but I've pasted
> it below.
> 
> Duplicating the implementation by having another interface is pretty
> nasty; I think I'd prefer the chained scatterlists, if that's optimal
> for you.

Unfortunately, that cannot work because not all architectures support
chained scatterlists.  Having two different implementations on the
driver side, with different locking rules, depending on the support for
chained scatterlists is awful.
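
For reference, a minimal sketch of the chaining pattern under discussion
(illustrative names only, not code from this series): a driver that wants
to prepend its own header to an existing payload scatterlist would do
roughly

	struct scatterlist head[2];

	sg_init_table(head, 2);
	sg_set_buf(&head[0], &hdr, sizeof(hdr));
	sg_chain(head, 2, payload_sg);	/* head[1] becomes a chain link */

and the consumer then walks the whole thing with sg_next()/for_each_sg().
On an architecture that does not define ARCH_HAS_SG_CHAIN, sg_chain()
simply BUG()s, which is exactly the portability problem above.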

(Also, as you mention, chained scatterlists are horrible.  They would
happen to work for virtio-scsi, but not for virtio-blk, where the response
status is part of the footer, not the header.)

We can move all drivers to virtqueue_add_sg; I just didn't do it in this
series because I first wanted to get the API right.
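
To illustrate the intent only (the names and signatures below are invented
for this sketch and are not the actual API from patch 1): the piecewise
style lets a driver hand a request over in pieces, instead of building one
flat sg[] with all the readable entries followed by all the writable ones:

	/* Hypothetical sketch, not the real patch 1 interface. */
	err = vq_start_request(vq, cmd /* token */, GFP_ATOMIC);
	if (err)
		return err;
	vq_add_readable(vq, &req_hdr_sg, 1);	/* device reads these  */
	vq_add_readable(vq, data_out_sg, out_nents);
	vq_add_writable(vq, &resp_sg, 1);	/* device writes these */
	vq_add_writable(vq, data_in_sg, in_nents);
	vq_end_request(vq);

Each piece is an ordinary scatterlist array, so no architecture support
for chaining is needed.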

Paolo

> Cheers,
> Rusty.
> 
> From: Rusty Russell <rusty@rustcorp.com.au>
> Subject: virtio: use chained scatterlists.
> 
> Rather than handing a scatterlist[] and out and in numbers to
> virtqueue_add_buf(), hand two separate ones which can be chained.
> 
> I shall refrain from ranting about what a disgusting hack chained
> scatterlists are.  I'll just note that this doesn't make things
> simpler (see diff).
> 
> The scatterlists we use can be too large for the stack, so we put them
> in our device struct and reuse them.  But in many cases we don't want
> to pay the cost of sg_init_table() as we don't know how many elements
> we'll have and we'd have to initialize the entire table.
> 
> This means we have two choices: carefully reset the end markers after
> we call virtqueue_add_buf(), which we do in virtio_net for the xmit
> path where it's easy and we want to be optimal.  Elsewhere we
> implement a helper to unset the end markers after we've filled the
> array.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> ---
>  drivers/block/virtio_blk.c          |   37 +++++++++++++-----
>  drivers/char/hw_random/virtio-rng.c |    2 -
>  drivers/char/virtio_console.c       |    6 +--
>  drivers/net/virtio_net.c            |   67 ++++++++++++++++++---------------
>  drivers/virtio/virtio_balloon.c     |    6 +--
>  drivers/virtio/virtio_ring.c        |   71 ++++++++++++++++++++++--------------
>  include/linux/virtio.h              |    5 +-
>  net/9p/trans_virtio.c               |   38 +++++++++++++++++--
>  8 files changed, 151 insertions(+), 81 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -100,6 +100,14 @@ static void blk_done(struct virtqueue *v
>  	spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
>  }
>  
> +static void sg_unset_end_markers(struct scatterlist *sg, unsigned int num)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < num; i++)
> +		sg[i].page_link &= ~0x02;
> +}
> +
>  static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>  		   struct request *req)
>  {
> @@ -140,6 +148,7 @@ static bool do_req(struct request_queue 
>  		}
>  	}
>  
> +	/* We lay out the scatterlist in a single array, out then in. */
>  	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
>  
>  	/*
> @@ -151,17 +160,8 @@ static bool do_req(struct request_queue 
>  	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
>  		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
>  
> +	/* This marks the end of the sg list at vblk->sg[out]. */
>  	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
> -
> -	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
> -		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
> -		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
> -			   sizeof(vbr->in_hdr));
> -	}
> -
> -	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
> -		   sizeof(vbr->status));
> -
>  	if (num) {
>  		if (rq_data_dir(vbr->req) == WRITE) {
>  			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
> @@ -172,7 +172,22 @@ static bool do_req(struct request_queue 
>  		}
>  	}
>  
> -	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
> +	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
> +		sg_set_buf(&vblk->sg[out + in++], vbr->req->sense,
> +			   SCSI_SENSE_BUFFERSIZE);
> +		sg_set_buf(&vblk->sg[out + in++], &vbr->in_hdr,
> +			   sizeof(vbr->in_hdr));
> +	}
> +
> +	sg_set_buf(&vblk->sg[out + in++], &vbr->status,
> +		   sizeof(vbr->status));
> +
> +	sg_unset_end_markers(vblk->sg, out+in);
> +	sg_mark_end(&vblk->sg[out-1]);
> +	sg_mark_end(&vblk->sg[out+in-1]);
> +
> +	if (virtqueue_add_buf(vblk->vq, vblk->sg, vblk->sg+out, vbr, GFP_ATOMIC)
> +	    < 0) {
>  		mempool_free(vbr, vblk->pool);
>  		return false;
>  	}
> diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
> --- a/drivers/char/hw_random/virtio-rng.c
> +++ b/drivers/char/hw_random/virtio-rng.c
> @@ -47,7 +47,7 @@ static void register_buffer(u8 *buf, siz
>  	sg_init_one(&sg, buf, size);
>  
>  	/* There should always be room for one buffer. */
> -	if (virtqueue_add_buf(vq, &sg, 0, 1, buf, GFP_KERNEL) < 0)
> +	if (virtqueue_add_buf(vq, NULL, &sg, buf, GFP_KERNEL) < 0)
>  		BUG();
>  
>  	virtqueue_kick(vq);
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -392,7 +392,7 @@ static int add_inbuf(struct virtqueue *v
>  
>  	sg_init_one(sg, buf->buf, buf->size);
>  
> -	ret = virtqueue_add_buf(vq, sg, 0, 1, buf, GFP_ATOMIC);
> +	ret = virtqueue_add_buf(vq, NULL, sg, buf, GFP_ATOMIC);
>  	virtqueue_kick(vq);
>  	return ret;
>  }
> @@ -457,7 +457,7 @@ static ssize_t __send_control_msg(struct
>  	vq = portdev->c_ovq;
>  
>  	sg_init_one(sg, &cpkt, sizeof(cpkt));
> -	if (virtqueue_add_buf(vq, sg, 1, 0, &cpkt, GFP_ATOMIC) >= 0) {
> +	if (virtqueue_add_buf(vq, sg, NULL, &cpkt, GFP_ATOMIC) >= 0) {
>  		virtqueue_kick(vq);
>  		while (!virtqueue_get_buf(vq, &len))
>  			cpu_relax();
> @@ -506,7 +506,7 @@ static ssize_t send_buf(struct port *por
>  	reclaim_consumed_buffers(port);
>  
>  	sg_init_one(sg, in_buf, in_count);
> -	ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC);
> +	ret = virtqueue_add_buf(out_vq, sg, NULL, in_buf, GFP_ATOMIC);
>  
>  	/* Tell Host to go! */
>  	virtqueue_kick(out_vq);
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -376,11 +376,11 @@ static int add_recvbuf_small(struct virt
>  	skb_put(skb, MAX_PACKET_LEN);
>  
>  	hdr = skb_vnet_hdr(skb);
> +	sg_init_table(vi->rx_sg, 2);
>  	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
> -
>  	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
> +	err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, skb, gfp);
>  	if (err < 0)
>  		dev_kfree_skb(skb);
>  
> @@ -393,6 +393,8 @@ static int add_recvbuf_big(struct virtne
>  	char *p;
>  	int i, err, offset;
>  
> +	sg_init_table(vi->rx_sg, MAX_SKB_FRAGS + 1);
> +
>  	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
>  	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
>  		first = get_a_page(vi, gfp);
> @@ -425,8 +427,8 @@ static int add_recvbuf_big(struct virtne
>  
>  	/* chain first in list head */
>  	first->private = (unsigned long)list;
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
> -				first, gfp);
> +
> +	err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, first, gfp);
>  	if (err < 0)
>  		give_pages(vi, first);
>  
> @@ -444,7 +446,7 @@ static int add_recvbuf_mergeable(struct 
>  
>  	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
> +	err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, page, gfp);
>  	if (err < 0)
>  		give_pages(vi, page);
>  
> @@ -581,6 +583,7 @@ static int xmit_skb(struct virtnet_info 
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> +	int ret;
>  
>  	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
>  
> @@ -620,8 +623,16 @@ static int xmit_skb(struct virtnet_info 
>  		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
>  
>  	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
> -				 0, skb, GFP_ATOMIC);
> +
> +	ret = virtqueue_add_buf(vi->svq, vi->tx_sg, NULL, skb, GFP_ATOMIC);
> +
> +	/*
> +	 * An optimization: clear the end bit set by skb_to_sgvec, so
> +	 * we can simply re-use vi->tx_sg[] next time.
> +	 */
> +	vi->tx_sg[hdr->num_sg-1].page_link &= ~0x02;
> +
> +	return ret;
>  }
>  
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> @@ -757,32 +768,31 @@ static int virtnet_open(struct net_devic
>   * never fail unless improperly formated.
>   */
>  static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> -				 struct scatterlist *data, int out, int in)
> +				 struct scatterlist *cmdsg)
>  {
> -	struct scatterlist *s, sg[VIRTNET_SEND_COMMAND_SG_MAX + 2];
> +	struct scatterlist in[1], out[2];
>  	struct virtio_net_ctrl_hdr ctrl;
>  	virtio_net_ctrl_ack status = ~0;
>  	unsigned int tmp;
> -	int i;
>  
>  	/* Caller should know better */
> -	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ||
> -		(out + in > VIRTNET_SEND_COMMAND_SG_MAX));
> +	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>  
> -	out++; /* Add header */
> -	in++; /* Add return status */
> -
> +	/* Prepend header to output. */
> +	sg_init_table(out, 2);
>  	ctrl.class = class;
>  	ctrl.cmd = cmd;
> +	sg_set_buf(&out[0], &ctrl, sizeof(ctrl));
> +	if (cmdsg)
> +		sg_chain(out, 2, cmdsg);
> +	else
> +		sg_mark_end(&out[0]);
>  
> -	sg_init_table(sg, out + in);
> +	/* Status response. */
> +	sg_init_one(in, &status, sizeof(status));
>  
> -	sg_set_buf(&sg[0], &ctrl, sizeof(ctrl));
> -	for_each_sg(data, s, out + in - 2, i)
> -		sg_set_buf(&sg[i + 1], sg_virt(s), s->length);
> -	sg_set_buf(&sg[out + in - 1], &status, sizeof(status));
>  
> -	BUG_ON(virtqueue_add_buf(vi->cvq, sg, out, in, vi, GFP_ATOMIC) < 0);
> +	BUG_ON(virtqueue_add_buf(vi->cvq, out, in, vi, GFP_ATOMIC) < 0);
>  
>  	virtqueue_kick(vi->cvq);
>  
> @@ -800,8 +810,7 @@ static void virtnet_ack_link_announce(st
>  {
>  	rtnl_lock();
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE,
> -				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL,
> -				  0, 0))
> +				  VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL))
>  		dev_warn(&vi->dev->dev, "Failed to ack link announce.\n");
>  	rtnl_unlock();
>  }
> @@ -839,16 +848,14 @@ static void virtnet_set_rx_mode(struct n
>  	sg_init_one(sg, &promisc, sizeof(promisc));
>  
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> -				  VIRTIO_NET_CTRL_RX_PROMISC,
> -				  sg, 1, 0))
> +				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
>  		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
>  			 promisc ? "en" : "dis");
>  
>  	sg_init_one(sg, &allmulti, sizeof(allmulti));
>  
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> -				  VIRTIO_NET_CTRL_RX_ALLMULTI,
> -				  sg, 1, 0))
> +				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
>  		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
>  			 allmulti ? "en" : "dis");
>  
> @@ -887,7 +894,7 @@ static void virtnet_set_rx_mode(struct n
>  
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
>  				  VIRTIO_NET_CTRL_MAC_TABLE_SET,
> -				  sg, 2, 0))
> +				  sg))
>  		dev_warn(&dev->dev, "Failed to set MAC fitler table.\n");
>  
>  	kfree(buf);
> @@ -901,7 +908,7 @@ static int virtnet_vlan_rx_add_vid(struc
>  	sg_init_one(&sg, &vid, sizeof(vid));
>  
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
> -				  VIRTIO_NET_CTRL_VLAN_ADD, &sg, 1, 0))
> +				  VIRTIO_NET_CTRL_VLAN_ADD, &sg))
>  		dev_warn(&dev->dev, "Failed to add VLAN ID %d.\n", vid);
>  	return 0;
>  }
> @@ -914,7 +921,7 @@ static int virtnet_vlan_rx_kill_vid(stru
>  	sg_init_one(&sg, &vid, sizeof(vid));
>  
>  	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
> -				  VIRTIO_NET_CTRL_VLAN_DEL, &sg, 1, 0))
> +				  VIRTIO_NET_CTRL_VLAN_DEL, &sg))
>  		dev_warn(&dev->dev, "Failed to kill VLAN ID %d.\n", vid);
>  	return 0;
>  }
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -102,7 +102,7 @@ static void tell_host(struct virtio_ball
>  	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>  
>  	/* We should always be able to add one buffer to an empty queue. */
> -	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
> +	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
>  		BUG();
>  	virtqueue_kick(vq);
>  
> @@ -246,7 +246,7 @@ static void stats_handle_request(struct 
>  	if (!virtqueue_get_buf(vq, &len))
>  		return;
>  	sg_init_one(&sg, vb->stats, sizeof(vb->stats));
> -	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
> +	if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
>  		BUG();
>  	virtqueue_kick(vq);
>  }
> @@ -331,7 +331,7 @@ static int init_vqs(struct virtio_balloo
>  		 * use it to signal us later.
>  		 */
>  		sg_init_one(&sg, vb->stats, sizeof vb->stats);
> -		if (virtqueue_add_buf(vb->stats_vq, &sg, 1, 0, vb, GFP_KERNEL)
> +		if (virtqueue_add_buf(vb->stats_vq, &sg, NULL, vb, GFP_KERNEL)
>  		    < 0)
>  			BUG();
>  		virtqueue_kick(vb->stats_vq);
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -121,35 +121,41 @@ struct vring_virtqueue
>  
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>  
> +/* This doesn't have the counter that for_each_sg() has */
> +#define foreach_sg(sglist, i)			\
> +	for (i = (sglist); i; i = sg_next(i))
> +
>  /* Set up an indirect table of descriptors and add it to the queue. */
>  static int vring_add_indirect(struct vring_virtqueue *vq,
> -			      struct scatterlist sg[],
> -			      unsigned int out,
> -			      unsigned int in,
> +			      unsigned int num,
> +			      const struct scatterlist *out,
> +			      const struct scatterlist *in,
>  			      gfp_t gfp)
>  {
> +	const struct scatterlist *sg;
>  	struct vring_desc *desc;
>  	unsigned head;
>  	int i;
>  
> -	desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
> +	desc = kmalloc(num * sizeof(struct vring_desc), gfp);
>  	if (!desc)
>  		return -ENOMEM;
>  
>  	/* Transfer entries from the sg list into the indirect page */
> -	for (i = 0; i < out; i++) {
> +	i = 0;
> +	foreach_sg(out, sg) {
>  		desc[i].flags = VRING_DESC_F_NEXT;
>  		desc[i].addr = sg_phys(sg);
>  		desc[i].len = sg->length;
>  		desc[i].next = i+1;
> -		sg++;
> +		i++;
>  	}
> -	for (; i < (out + in); i++) {
> +	foreach_sg(in, sg) {
>  		desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
>  		desc[i].addr = sg_phys(sg);
>  		desc[i].len = sg->length;
>  		desc[i].next = i+1;
> -		sg++;
> +		i++;
>  	}
>  
>  	/* Last one doesn't continue. */
> @@ -171,12 +177,21 @@ static int vring_add_indirect(struct vri
>  	return head;
>  }
>  
> +static unsigned int count_sg(const struct scatterlist *sg)
> +{
> +	unsigned int count = 0;
> +	const struct scatterlist *i;
> +
> +	foreach_sg(sg, i)
> +		count++;
> +	return count;
> +}
> +
>  /**
>   * virtqueue_add_buf - expose buffer to other end
>   * @vq: the struct virtqueue we're talking about.
> - * @sg: the description of the buffer(s).
> - * @out_num: the number of sg readable by other side
> - * @in_num: the number of sg which are writable (after readable ones)
> + * @out: the description of the output buffer(s).
> + * @in: the description of the input buffer(s).
>   * @data: the token identifying the buffer.
>   * @gfp: how to do memory allocations (if necessary).
>   *
> @@ -189,20 +204,23 @@ static int vring_add_indirect(struct vri
>   * we can put an entire sg[] array inside a single queue entry.
>   */
>  int virtqueue_add_buf(struct virtqueue *_vq,
> -		      struct scatterlist sg[],
> -		      unsigned int out,
> -		      unsigned int in,
> +		      const struct scatterlist *out,
> +		      const struct scatterlist *in,
>  		      void *data,
>  		      gfp_t gfp)
>  {
>  	struct vring_virtqueue *vq = to_vvq(_vq);
> -	unsigned int i, avail, uninitialized_var(prev);
> +	unsigned int i, avail, uninitialized_var(prev), num;
> +	const struct scatterlist *sg;
>  	int head;
>  
>  	START_USE(vq);
>  
>  	BUG_ON(data == NULL);
>  
> +	num = count_sg(out) + count_sg(in);
> +	BUG_ON(num == 0);
> +
>  #ifdef DEBUG
>  	{
>  		ktime_t now = ktime_get();
> @@ -218,18 +236,17 @@ int virtqueue_add_buf(struct virtqueue *
>  
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
> -	if (vq->indirect && (out + in) > 1 && vq->num_free) {
> -		head = vring_add_indirect(vq, sg, out, in, gfp);
> +	if (vq->indirect && num > 1 && vq->num_free) {
> +		head = vring_add_indirect(vq, num, out, in, gfp);
>  		if (likely(head >= 0))
>  			goto add_head;
>  	}
>  
> -	BUG_ON(out + in > vq->vring.num);
> -	BUG_ON(out + in == 0);
> +	BUG_ON(num > vq->vring.num);
>  
> -	if (vq->num_free < out + in) {
> +	if (vq->num_free < num) {
>  		pr_debug("Can't add buf len %i - avail = %i\n",
> -			 out + in, vq->num_free);
> +			 num, vq->num_free);
>  		/* FIXME: for historical reasons, we force a notify here if
>  		 * there are outgoing parts to the buffer.  Presumably the
>  		 * host should service the ring ASAP. */
> @@ -240,22 +257,24 @@ int virtqueue_add_buf(struct virtqueue *
>  	}
>  
>  	/* We're about to use some buffers from the free list. */
> -	vq->num_free -= out + in;
> +	vq->num_free -= num;
>  
>  	head = vq->free_head;
> -	for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
> +
> +	i = vq->free_head;
> +	foreach_sg(out, sg) {
>  		vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
>  		vq->vring.desc[i].addr = sg_phys(sg);
>  		vq->vring.desc[i].len = sg->length;
>  		prev = i;
> -		sg++;
> +		i = vq->vring.desc[i].next;
>  	}
> -	for (; in; i = vq->vring.desc[i].next, in--) {
> +	foreach_sg(in, sg) {
>  		vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
>  		vq->vring.desc[i].addr = sg_phys(sg);
>  		vq->vring.desc[i].len = sg->length;
>  		prev = i;
> -		sg++;
> +		i = vq->vring.desc[i].next;
>  	}
>  	/* Last one doesn't continue. */
>  	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -26,9 +26,8 @@ struct virtqueue {
>  };
>  
>  int virtqueue_add_buf(struct virtqueue *vq,
> -		      struct scatterlist sg[],
> -		      unsigned int out_num,
> -		      unsigned int in_num,
> +		      const struct scatterlist *out,
> +		      const struct scatterlist *in,
>  		      void *data,
>  		      gfp_t gfp);
>  
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -244,6 +244,14 @@ pack_sg_list_p(struct scatterlist *sg, i
>  	return index - start;
>  }
>  
> +static void sg_unset_end_markers(struct scatterlist *sg, unsigned int num)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < num; i++)
> +		sg[i].page_link &= ~0x02;
> +}
> +
>  /**
>   * p9_virtio_request - issue a request
>   * @client: client instance issuing the request
> @@ -258,6 +266,7 @@ p9_virtio_request(struct p9_client *clie
>  	int in, out;
>  	unsigned long flags;
>  	struct virtio_chan *chan = client->trans;
> +	const struct scatterlist *outsg = NULL, *insg = NULL;
>  
>  	p9_debug(P9_DEBUG_TRANS, "9p debug: virtio request\n");
>  
> @@ -268,12 +277,21 @@ req_retry:
>  	/* Handle out VirtIO ring buffers */
>  	out = pack_sg_list(chan->sg, 0,
>  			   VIRTQUEUE_NUM, req->tc->sdata, req->tc->size);
> +	if (out) {
> +		sg_unset_end_markers(chan->sg, out-1);
> +		sg_mark_end(&chan->sg[out-1]);
> +		outsg = chan->sg;
> +	}
>  
>  	in = pack_sg_list(chan->sg, out,
>  			  VIRTQUEUE_NUM, req->rc->sdata, req->rc->capacity);
> +	if (in) {
> +		sg_unset_end_markers(chan->sg+out, in-1);
> +		sg_mark_end(&chan->sg[out+in-1]);
> +		insg = chan->sg+out;
> +	}
>  
> -	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
> -				GFP_ATOMIC);
> +	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
>  	if (err < 0) {
>  		if (err == -ENOSPC) {
>  			chan->ring_bufs_avail = 0;
> @@ -355,6 +377,7 @@ p9_virtio_zc_request(struct p9_client *c
>  	int in_nr_pages = 0, out_nr_pages = 0;
>  	struct page **in_pages = NULL, **out_pages = NULL;
>  	struct virtio_chan *chan = client->trans;
> +	struct scatterlist *insg = NULL, *outsg = NULL;
>  
>  	p9_debug(P9_DEBUG_TRANS, "virtio request\n");
>  
> @@ -402,6 +425,13 @@ req_retry_pinned:
>  	if (out_pages)
>  		out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
>  				      out_pages, out_nr_pages, uodata, outlen);
> +
> +	if (out) {
> +		sg_unset_end_markers(chan->sg, out-1);
> +		sg_mark_end(&chan->sg[out-1]);
> +		outsg = chan->sg;
> +	}
> +
>  	/*
>  	 * Take care of in data
>  	 * For example TREAD have 11.
> @@ -414,9 +446,13 @@ req_retry_pinned:
>  	if (in_pages)
>  		in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
>  				     in_pages, in_nr_pages, uidata, inlen);
> +	if (in) {
> +		sg_unset_end_markers(chan->sg+out, in-1);
> +		sg_mark_end(&chan->sg[out+in-1]);
> +		insg = chan->sg + out;
> +	}
>  
> -	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
> -				GFP_ATOMIC);
> +	err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
>  	if (err < 0) {
>  		if (err == -ENOSPC) {
>  			chan->ring_bufs_avail = 0;
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2013-01-03  8:58       ` Wanlong Gao
@ 2013-01-06 23:32         ` Rusty Russell
  -1 siblings, 0 replies; 86+ messages in thread
From: Rusty Russell @ 2013-01-06 23:32 UTC (permalink / raw)
  To: Wanlong Gao
  Cc: Paolo Bonzini, linux-kernel, kvm, hutao, linux-scsi,
	virtualization, mst, asias, stefanha, nab, Wanlong Gao

Wanlong Gao <gaowanlong@cn.fujitsu.com> writes:
> On 01/02/2013 01:03 PM, Rusty Russell wrote:
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>> The virtqueue_add_buf function has two limitations:
>>>
>>> 1) it requires the caller to provide all the buffers in a single call;
>>>
>>> 2) it does not support chained scatterlists: the buffers must be
>>> provided as an array of struct scatterlist;
>> 
>> Chained scatterlists are a horrible interface, but that doesn't mean we
>> shouldn't support them if there's a need.
>> 
>> I think I once even had a patch which passed two chained sgs, rather
>> than a combo sg and two length numbers.  It's very old, but I've pasted
>> it below.
>> 
>> Duplicating the implementation by having another interface is pretty
>> nasty; I think I'd prefer the chained scatterlists, if that's optimal
>> for you.
>
> I rebased against virtio-next, used it in virtio-scsi, and tested it with 4 virtio-scsi
> target devices and the host cpu set to idle=poll. I saw a small performance regression here.

Sure, but now you should be able to eliminate virtscsi_map_sgl(), right?
You should be able to use scsi_out(sc) and scsi_in(sc) directly, which
is what Paolo wanted to do...

Right Paolo?
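
For illustration, here is a rough sketch (mine, not code from either patch) of
what queueing a command could then look like, assuming the two-scatterlist
virtqueue_add_buf() prototype quoted earlier in the thread and the
virtio_scsi_cmd layout from the current driver; error handling is omitted:

        /* sketch only: request header chained in front of data-out,
         * response header chained in front of data-in */
        static int virtscsi_queue_sketch(struct virtqueue *vq,
                                         struct scsi_cmnd *sc,
                                         struct virtio_scsi_cmd *cmd)
        {
                struct scatterlist req[2], resp[2];

                /* driver-readable: request header, then data-out if present */
                sg_init_table(req, 2);
                sg_set_buf(&req[0], &cmd->req.cmd, sizeof(cmd->req.cmd));
                if (sc->sc_data_direction == DMA_TO_DEVICE ||
                    sc->sc_data_direction == DMA_BIDIRECTIONAL)
                        sg_chain(req, 2, scsi_out(sc)->table.sgl);
                else
                        sg_mark_end(&req[0]);

                /* device-writable: response header, then data-in if present */
                sg_init_table(resp, 2);
                sg_set_buf(&resp[0], &cmd->resp.cmd, sizeof(cmd->resp.cmd));
                if (sc->sc_data_direction == DMA_FROM_DEVICE ||
                    sc->sc_data_direction == DMA_BIDIRECTIONAL)
                        sg_chain(resp, 2, scsi_in(sc)->table.sgl);
                else
                        sg_mark_end(&resp[0]);

                return virtqueue_add_buf(vq, req, resp, cmd, GFP_ATOMIC);
        }

The point being that the midlayer's own sg tables are handed to the ring
as-is, so virtscsi_map_sgl() and the per-target sg array go away.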

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2013-01-03  9:22       ` Paolo Bonzini
@ 2013-01-07  0:02         ` Rusty Russell
  -1 siblings, 0 replies; 86+ messages in thread
From: Rusty Russell @ 2013-01-07  0:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, mst, hutao, virtualization,
	stefanha, Jens Axboe

Paolo Bonzini <pbonzini@redhat.com> writes:
> Il 02/01/2013 06:03, Rusty Russell ha scritto:
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>> The virtqueue_add_buf function has two limitations:
>>>
>>> 1) it requires the caller to provide all the buffers in a single call;
>>>
>>> 2) it does not support chained scatterlists: the buffers must be
>>> provided as an array of struct scatterlist;
>> 
>> Chained scatterlists are a horrible interface, but that doesn't mean we
>> shouldn't support them if there's a need.
>> 
>> I think I once even had a patch which passed two chained sgs, rather
>> than a combo sg and two length numbers.  It's very old, but I've pasted
>> it below.
>> 
>> Duplicating the implementation by having another interface is pretty
>> nasty; I think I'd prefer the chained scatterlists, if that's optimal
>> for you.
>
> Unfortunately, that cannot work because not all architectures support
> chained scatterlists.

WHAT?  I can't figure out what an arch needs to do to support this?  Why
is it an option for archs?  Why is sg_chain() even defined for
non-ARCH_HAS_SG_CHAIN?

Jens, help!!

All archs we care about support them, though, so I think we can ignore
this issue for now.

> (Also, as you mention chained scatterlists are horrible.  They'd happen
> to work for virtio-scsi, but not for virtio-blk where the response
> status is part of the footer, not the header).

We lost that debate 5 years ago, so we hack around it as needed.  We can
add helpers to append if we need.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2013-01-07  0:02         ` Rusty Russell
  (?)
@ 2013-01-07 14:27         ` Paolo Bonzini
  2013-01-08  0:12             ` Rusty Russell
  -1 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2013-01-07 14:27 UTC (permalink / raw)
  To: Rusty Russell
  Cc: kvm, linux-scsi, mst, hutao, linux-kernel, virtualization,
	stefanha, Jens Axboe

Il 07/01/2013 01:02, Rusty Russell ha scritto:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>> Il 02/01/2013 06:03, Rusty Russell ha scritto:
>>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>>> The virtqueue_add_buf function has two limitations:
>>>>
>>>> 1) it requires the caller to provide all the buffers in a single call;
>>>>
>>>> 2) it does not support chained scatterlists: the buffers must be
>>>> provided as an array of struct scatterlist;
>>>
>>> Chained scatterlists are a horrible interface, but that doesn't mean we
>>> shouldn't support them if there's a need.
>>>
>>> I think I once even had a patch which passed two chained sgs, rather
>>> than a combo sg and two length numbers.  It's very old, but I've pasted
>>> it below.
>>>
>>> Duplicating the implementation by having another interface is pretty
>>> nasty; I think I'd prefer the chained scatterlists, if that's optimal
>>> for you.
>>
>> Unfortunately, that cannot work because not all architectures support
>> chained scatterlists.
> 
> WHAT?  I can't figure out what an arch needs to do to support this?

It needs to use the iterator functions in its DMA driver.
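
For the record, the difference boils down to walking the list with
for_each_sg(), which knows how to follow chain entries, instead of indexing
the array directly.  A trivial 1:1 sketch (not real arch code):

        static int sketch_dma_map_sg(struct device *dev, struct scatterlist *sgl,
                                     int nents, enum dma_data_direction dir)
        {
                struct scatterlist *sg;
                int i;

                /* for_each_sg() transparently hops across chain entries */
                for_each_sg(sgl, sg, nents, i) {
                        sg->dma_address = sg_phys(sg);
                        sg_dma_len(sg) = sg->length;
                }

                return nents;
        }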

> All archs we care about support them, though, so I think we can ignore
> this issue for now.

Kind of... In principle all QEMU-supported arches can use virtio, and
the speedup can be quite useful.  And there is no Kconfig symbol for SG
chains that I can use to disable virtio-scsi on unsupported arches. :/

Paolo

>> (Also, as you mention chained scatterlists are horrible.  They'd happen
>> to work for virtio-scsi, but not for virtio-blk where the response
>> status is part of the footer, not the header).
> 
> We lost that debate 5 years ago, so we hack around it as needed.  We can
> add helpers to append if we need.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2013-01-07 14:27         ` Paolo Bonzini
@ 2013-01-08  0:12             ` Rusty Russell
  0 siblings, 0 replies; 86+ messages in thread
From: Rusty Russell @ 2013-01-08  0:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-scsi, mst, hutao, linux-kernel, virtualization,
	stefanha, Jens Axboe

Paolo Bonzini <pbonzini@redhat.com> writes:
> Il 07/01/2013 01:02, Rusty Russell ha scritto:
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>> Il 02/01/2013 06:03, Rusty Russell ha scritto:
>>>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>>>> The virtqueue_add_buf function has two limitations:
>>>>>
>>>>> 1) it requires the caller to provide all the buffers in a single call;
>>>>>
>>>>> 2) it does not support chained scatterlists: the buffers must be
>>>>> provided as an array of struct scatterlist;
>>>>
>>>> Chained scatterlists are a horrible interface, but that doesn't mean we
>>>> shouldn't support them if there's a need.
>>>>
>>>> I think I once even had a patch which passed two chained sgs, rather
>>>> than a combo sg and two length numbers.  It's very old, but I've pasted
>>>> it below.
>>>>
>>>> Duplicating the implementation by having another interface is pretty
>>>> nasty; I think I'd prefer the chained scatterlists, if that's optimal
>>>> for you.
>>>
>>> Unfortunately, that cannot work because not all architectures support
>>> chained scatterlists.
>> 
>> WHAT?  I can't figure out what an arch needs to do to support this?
>
> It needs to use the iterator functions in its DMA driver.

But we don't care for virtio.

>> All archs we care about support them, though, so I think we can ignore
>> this issue for now.
>
> Kind of... In principle all QEMU-supported arches can use virtio, and
> the speedup can be quite useful.  And there is no Kconfig symbol for SG
> chains that I can use to disable virtio-scsi on unsupported arches. :/

Well, we #error if it's not supported.  Then the lazy architectures can
fix it.
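
Something as blunt as this near the top of virtio_ring.c would do (sketch
only; ARCH_HAS_SG_CHAIN is the existing per-arch opt-in macro from the
scatterlist headers):

        #ifndef ARCH_HAS_SG_CHAIN
        #error "virtio requires chained scatterlists; define ARCH_HAS_SG_CHAIN for this architecture"
        #endif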

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
  2013-01-08  0:12             ` Rusty Russell
  (?)
@ 2013-01-10  8:44             ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2013-01-10  8:44 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jens Axboe, kvm, linux-scsi, mst, hutao, linux-kernel,
	virtualization, stefanha

Il 08/01/2013 01:12, Rusty Russell ha scritto:
>>>> >>> Unfortunately, that cannot work because not all architectures support
>>>> >>> chained scatterlists.
>>> >> 
>>> >> WHAT?  I can't figure out what an arch needs to do to support this?
>> >
>> > It needs to use the iterator functions in its DMA driver.
> But we don't care for virtio.

True.

>>> >> All archs we care about support them, though, so I think we can ignore
>>> >> this issue for now.
>> >
>> > Kind of... In principle all QEMU-supported arches can use virtio, and
>> > the speedup can be quite useful.  And there is no Kconfig symbol for SG
>> > chains that I can use to disable virtio-scsi on unsupported arches. :/
> Well, we #error if it's not supported.  Then the lazy architectures can
> fix it.

Yeah, that would be one approach.

But frankly, your patch is really disgusting. :)  Not your fault, of
course, but I still prefer a limited amount of duplication.

Perhaps we can get the best of both worlds, I'll take a look when I have
some time.

Paolo

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 1/2] virtio-scsi: split out request queue set affinity function
  2012-12-18 12:32 ` Paolo Bonzini
@ 2013-01-15  9:48   ` Wanlong Gao
  -1 siblings, 0 replies; 86+ messages in thread
From: Wanlong Gao @ 2013-01-15  9:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, hutao, linux-scsi, virtualization, mst, rusty,
	asias, stefanha, nab

These two patches are based on the multi-queue virtio-scsi patch set.

We set CPU affinity when num_queues equals the number of VCPUs.
Split out the affinity-setting function; this also fixes a bug
when CPU IDs are not consecutive.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
 drivers/scsi/virtio_scsi.c | 50 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 3641d5f..16b0ef2 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -106,6 +106,9 @@ struct virtio_scsi {
 
 	u32 num_queues;
 
+	/* Is the affinity hint set for the virtqueues? */
+	bool affinity_hint_set;
+
 	struct virtio_scsi_vq ctrl_vq;
 	struct virtio_scsi_vq event_vq;
 	struct virtio_scsi_vq req_vqs[];
@@ -701,14 +704,45 @@ static struct scsi_host_template virtscsi_host_template_multi = {
 				  &__val, sizeof(__val)); \
 	})
 
+static void virtscsi_set_affinity(struct virtio_scsi *vscsi, bool affinity)
+{
+	int i;
+	int cpu;
+
+	/* In multiqueue mode, when the number of online CPUs equals
+	 * the number of request queues, make each queue private to
+	 * one CPU by setting the affinity hint, in order to
+	 * eliminate contention.
+	 */
+	if ((vscsi->num_queues == 1 ||
+	     vscsi->num_queues != num_online_cpus()) && affinity) {
+		if (vscsi->affinity_hint_set)
+			affinity = false;
+		else
+			return;
+	}
+
+	if (affinity) {
+		i = 0;
+		for_each_online_cpu(cpu) {
+			virtqueue_set_affinity(vscsi->req_vqs[i].vq, cpu);
+			i++;
+		}
+
+		vscsi->affinity_hint_set = true;
+	} else {
+		for (i = 0; i < vscsi->num_queues - VIRTIO_SCSI_VQ_BASE; i++)
+			virtqueue_set_affinity(vscsi->req_vqs[i].vq, -1);
+
+		vscsi->affinity_hint_set = false;
+	}
+}
 
 static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
-			     struct virtqueue *vq, bool affinity)
+			     struct virtqueue *vq)
 {
 	spin_lock_init(&virtscsi_vq->vq_lock);
 	virtscsi_vq->vq = vq;
-	if (affinity)
-		virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);
 }
 
 static void virtscsi_init_tgt(struct virtio_scsi *vscsi, int i)
@@ -736,6 +770,8 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev)
 	struct Scsi_Host *sh = virtio_scsi_host(vdev);
 	struct virtio_scsi *vscsi = shost_priv(sh);
 
+	virtscsi_set_affinity(vscsi, false);
+
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 
@@ -779,11 +815,13 @@ static int virtscsi_init(struct virtio_device *vdev,
 	if (err)
 		return err;
 
-	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
-	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
+	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
+	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
 	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
 		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
-				 vqs[i], vscsi->num_queues > 1);
+				 vqs[i]);
+
+	virtscsi_set_affinity(vscsi, true);
 
 	virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE);
 	virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE);
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 2/2] virtio-scsi: reset virtqueue affinity when doing cpu hotplug
  2013-01-15  9:48   ` Wanlong Gao
@ 2013-01-15  9:50     ` Wanlong Gao
  -1 siblings, 0 replies; 86+ messages in thread
From: Wanlong Gao @ 2013-01-15  9:50 UTC (permalink / raw)
  To: virtualization
  Cc: Paolo Bonzini, linux-kernel, kvm, hutao, linux-scsi, mst, rusty,
	asias, stefanha, nab, Wanlong Gao


Add hot cpu notifier to reset the request virtqueue affinity
when doing cpu hotplug.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
 drivers/scsi/virtio_scsi.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 16b0ef2..a3b5f12 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -20,6 +20,7 @@
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
 #include <linux/virtio_scsi.h>
+#include <linux/cpu.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_cmnd.h>
@@ -109,6 +110,9 @@ struct virtio_scsi {
 	/* Is the affinity hint set for the virtqueues? */
 	bool affinity_hint_set;
 
+	/* CPU hotplug notifier */
+	struct notifier_block nb;
+
 	struct virtio_scsi_vq ctrl_vq;
 	struct virtio_scsi_vq event_vq;
 	struct virtio_scsi_vq req_vqs[];
@@ -738,6 +742,23 @@ static void virtscsi_set_affinity(struct virtio_scsi *vscsi, bool affinity)
 	}
 }
 
+static int virtscsi_cpu_callback(struct notifier_block *nfb,
+				 unsigned long action, void *hcpu)
+{
+	struct virtio_scsi *vscsi = container_of(nfb, struct virtio_scsi, nb);
+	switch(action) {
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		virtscsi_set_affinity(vscsi, true);
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
 static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
 			     struct virtqueue *vq)
 {
@@ -888,6 +909,13 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	vscsi->nb.notifier_call = &virtscsi_cpu_callback;
+	err = register_hotcpu_notifier(&vscsi->nb);
+	if (err) {
+		pr_err("virtio_scsi: registering cpu notifier failed\n");
+		goto scsi_add_host_failed;
+	}
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
@@ -925,6 +953,8 @@ static void __devexit virtscsi_remove(struct virtio_device *vdev)
 
 	scsi_remove_host(shost);
 
+	unregister_hotcpu_notifier(&vscsi->nb);
+
 	virtscsi_remove_vqs(vdev);
 	scsi_host_put(shost);
 }
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/2] virtio-scsi: reset virtqueue affinity when doing cpu hotplug
  2013-01-15  9:50     ` Wanlong Gao
@ 2013-01-16  3:31       ` Rusty Russell
  -1 siblings, 0 replies; 86+ messages in thread
From: Rusty Russell @ 2013-01-16  3:31 UTC (permalink / raw)
  To: Wanlong Gao, virtualization
  Cc: Paolo Bonzini, linux-kernel, kvm, hutao, linux-scsi, mst, asias,
	stefanha, nab, Wanlong Gao

Wanlong Gao <gaowanlong@cn.fujitsu.com> writes:
> Add hot cpu notifier to reset the request virtqueue affinity
> when doing cpu hotplug.

You need to be careful to get_online_cpus() and put_online_cpus() here,
so CPUs can't go up and down in the middle of operations.

In particular, get_online_cpus()/put_online_cpus() around calls to
virtscsi_set_affinity() (except within notifiers).
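
In other words, something like the wrapper below for the non-notifier call
sites (the name is made up):

        static void virtscsi_update_affinity(struct virtio_scsi *vscsi, bool affinity)
        {
                /* pin the set of online CPUs while the hints are reassigned */
                get_online_cpus();
                virtscsi_set_affinity(vscsi, affinity);
                put_online_cpus();
        }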

> +static int virtscsi_cpu_callback(struct notifier_block *nfb,
> +				 unsigned long action, void *hcpu)
> +{
> +	struct virtio_scsi *vscsi = container_of(nfb, struct virtio_scsi, nb);
> +	switch(action) {
> +	case CPU_ONLINE:
> +	case CPU_ONLINE_FROZEN:
> +	case CPU_DEAD:
> +	case CPU_DEAD_FROZEN:
> +		virtscsi_set_affinity(vscsi, true);
> +		break;
> +	default:
> +		break;
> +	}
> +	return NOTIFY_OK;
> +}

You probably want to remove the affinities *before* the CPU goes down, and
restore them after the CPU comes up.

So, how about:

        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_ONLINE:
        case CPU_DOWN_FAILED:
                virtscsi_set_affinity(vscsi, true, -1);
                break;
        case CPU_DOWN_PREPARE:
                virtscsi_set_affinity(vscsi, true, (unsigned long)hcpu);
                break;
        }

The extra argument to virtscsi_set_affinity() is to tell it to ignore a
cpu which seems online (because it's going down).
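
Concretely, the function could end up looking roughly like this (a sketch
adapted from patch 1/2, not a tested patch; dying_cpu is -1 when no CPU is
on its way down):

        static void virtscsi_set_affinity(struct virtio_scsi *vscsi, bool affinity,
                                          long dying_cpu)
        {
                int i, cpu;

                if ((vscsi->num_queues == 1 ||
                     vscsi->num_queues != num_online_cpus()) && affinity) {
                        if (vscsi->affinity_hint_set)
                                affinity = false;
                        else
                                return;
                }

                if (affinity) {
                        i = 0;
                        for_each_online_cpu(cpu) {
                                if (cpu == dying_cpu)
                                        continue;       /* still online, but on its way out */
                                virtqueue_set_affinity(vscsi->req_vqs[i].vq, cpu);
                                i++;
                        }
                        vscsi->affinity_hint_set = true;
                } else {
                        for (i = 0; i < vscsi->num_queues; i++)
                                virtqueue_set_affinity(vscsi->req_vqs[i].vq, -1);
                        vscsi->affinity_hint_set = false;
                }
        }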

>  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
>  			     struct virtqueue *vq)
>  {
> @@ -888,6 +909,13 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto virtscsi_init_failed;
>  
> +	vscsi->nb.notifier_call = &virtscsi_cpu_callback;
> +	err = register_hotcpu_notifier(&vscsi->nb);
> +	if (err) {
> +		pr_err("virtio_scsi: registering cpu notifier failed\n");
> +		goto scsi_add_host_failed;
> +	}
> +

You have to clean this up if scsi_add_host() fails, I think.
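
I.e. something along these lines in virtscsi_probe() (the new label name is
made up; everything after the existing scsi_add_host_failed label stays as
it is):

        err = scsi_add_host(shost, &vdev->dev);
        if (err)
                goto unregister_notifier;

        /* ... remainder of probe ... */
        return 0;

unregister_notifier:
        unregister_hotcpu_notifier(&vscsi->nb);
scsi_add_host_failed:
        /* existing cleanup continues here */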

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/2] virtio-scsi: reset virtqueue affinity when doing cpu hotplug
  2013-01-16  3:31       ` Rusty Russell
@ 2013-01-16  3:55         ` Wanlong Gao
  -1 siblings, 0 replies; 86+ messages in thread
From: Wanlong Gao @ 2013-01-16  3:55 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, Paolo Bonzini, linux-kernel, kvm, hutao,
	linux-scsi, mst, asias, stefanha, nab, Wanlong Gao

On 01/16/2013 11:31 AM, Rusty Russell wrote:
> Wanlong Gao <gaowanlong@cn.fujitsu.com> writes:
>> Add hot cpu notifier to reset the request virtqueue affinity
>> when doing cpu hotplug.
> 
> You need to be careful to get_online_cpus() and put_online_cpus() here,
> so CPUs can't go up and down in the middle of operations.
> 
> In particular, get_online_cpus()/put_online_cpus() around calls to
> virtscsi_set_affinity() (except within notifiers).

Yes, I'll take care of this, thank you.

> 
>> +static int virtscsi_cpu_callback(struct notifier_block *nfb,
>> +				 unsigned long action, void *hcpu)
>> +{
>> +	struct virtio_scsi *vscsi = container_of(nfb, struct virtio_scsi, nb);
>> +	switch(action) {
>> +	case CPU_ONLINE:
>> +	case CPU_ONLINE_FROZEN:
>> +	case CPU_DEAD:
>> +	case CPU_DEAD_FROZEN:
>> +		virtscsi_set_affinity(vscsi, true);
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +	return NOTIFY_OK;
>> +}
> 
> You probably want to remove affinities *before* the CPU goes down, and
> restore it after the CPU comes up.
> 
> So, how about:
> 
>         switch (action & ~CPU_TASKS_FROZEN) {
>         case CPU_ONLINE:
>         case CPU_DOWN_FAILED:
>                 virtscsi_set_affinity(vscsi, true, -1);
>                 break;
>         case CPU_DOWN_PREPARE:
>                 virtscsi_set_affinity(vscsi, true, (unsigned long)hcpu);
>                 break;
>         }
> 
> The extra argument to virtscsi_set_affinity() is to tell it to ignore a
> cpu which seems online (because it's going down).

OK, thank you very much for teaching this.

> 
>>  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
>>  			     struct virtqueue *vq)
>>  {
>> @@ -888,6 +909,13 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
>>  	if (err)
>>  		goto virtscsi_init_failed;
>>  
>> +	vscsi->nb.notifier_call = &virtscsi_cpu_callback;
>> +	err = register_hotcpu_notifier(&vscsi->nb);
>> +	if (err) {
>> +		pr_err("virtio_scsi: registering cpu notifier failed\n");
>> +		goto scsi_add_host_failed;
>> +	}
>> +
> 
> You have to clean this up if scsi_add_host() fails, I think.

Yeah, will do, thank you very much. 


Regards,
Wanlong Gao

> 
> Cheers,
> Rusty.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/2] virtio-scsi: reset virtqueue affinity when doing cpu hotplug
  2013-01-16  3:55         ` Wanlong Gao
@ 2013-02-06 17:27           ` Paolo Bonzini
  -1 siblings, 0 replies; 86+ messages in thread
From: Paolo Bonzini @ 2013-02-06 17:27 UTC (permalink / raw)
  To: gaowanlong
  Cc: Rusty Russell, kvm, linux-scsi, mst, hutao, linux-kernel,
	virtualization, stefanha

Il 16/01/2013 04:55, Wanlong Gao ha scritto:
>>> >> Add hot cpu notifier to reset the request virtqueue affinity
>>> >> when doing cpu hotplug.
>> > 
>> > You need to be careful to get_online_cpus() and put_online_cpus() here,
>> > so CPUs can't go up and down in the middle of operations.
>> > 
>> > In particular, get_online_cpus()/put_online_cpus() around calls to
>> > virtscsi_set_affinity() (except within notifiers).
> Yes, I'll take care of this, thank you.
> 

I squashed patch 1 (plus changes to get/put_online_cpus) in my
multiqueue series, and applied this one as a separate patch.

Paolo

^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2013-02-06 17:27 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-18 12:32 [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission Paolo Bonzini
2012-12-18 12:32 ` Paolo Bonzini
2012-12-18 12:32 ` [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers Paolo Bonzini
2012-12-18 12:32   ` Paolo Bonzini
2012-12-18 13:36   ` Michael S. Tsirkin
2012-12-18 13:36     ` Michael S. Tsirkin
2012-12-18 13:43     ` Paolo Bonzini
2012-12-18 13:43       ` Paolo Bonzini
2012-12-18 13:59       ` Michael S. Tsirkin
2012-12-18 13:59         ` Michael S. Tsirkin
2012-12-18 14:32         ` Paolo Bonzini
2012-12-18 14:32           ` Paolo Bonzini
2012-12-18 15:06           ` Michael S. Tsirkin
2012-12-18 15:06             ` Michael S. Tsirkin
2012-12-19 10:47   ` Stefan Hajnoczi
2012-12-19 10:47   ` Stefan Hajnoczi
2012-12-19 12:04     ` Paolo Bonzini
2012-12-19 12:04       ` Paolo Bonzini
2012-12-19 12:40       ` Stefan Hajnoczi
2012-12-19 12:40         ` Stefan Hajnoczi
2012-12-19 16:51       ` Michael S. Tsirkin
2012-12-19 16:51         ` Michael S. Tsirkin
2012-12-19 16:52         ` Michael S. Tsirkin
2012-12-19 16:52           ` Michael S. Tsirkin
2013-01-02  5:03   ` Rusty Russell
2013-01-02  5:03     ` Rusty Russell
2013-01-03  8:58     ` Wanlong Gao
2013-01-03  8:58       ` Wanlong Gao
2013-01-03  8:58       ` Wanlong Gao
2013-01-06 23:32       ` Rusty Russell
2013-01-06 23:32       ` Rusty Russell
2013-01-06 23:32         ` Rusty Russell
2013-01-03  9:22     ` Paolo Bonzini
2013-01-03  9:22       ` Paolo Bonzini
2013-01-07  0:02       ` Rusty Russell
2013-01-07  0:02         ` Rusty Russell
2013-01-07 14:27         ` Paolo Bonzini
2013-01-08  0:12           ` Rusty Russell
2013-01-08  0:12             ` Rusty Russell
2013-01-10  8:44             ` Paolo Bonzini
2012-12-18 12:32 ` [PATCH v2 2/5] virtio-scsi: use functions for piecewise composition " Paolo Bonzini
2012-12-18 12:32   ` Paolo Bonzini
2012-12-18 13:37   ` Michael S. Tsirkin
2012-12-18 13:37     ` Michael S. Tsirkin
2012-12-18 13:35     ` Paolo Bonzini
2012-12-18 13:35       ` Paolo Bonzini
2012-12-18 12:32 ` [PATCH v2 3/5] virtio-scsi: redo allocation of target data Paolo Bonzini
2012-12-18 12:32   ` Paolo Bonzini
2012-12-18 12:32 ` [PATCH v2 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function Paolo Bonzini
2012-12-18 12:32   ` Paolo Bonzini
2012-12-18 12:32 ` [PATCH v2 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
2012-12-18 13:57   ` Michael S. Tsirkin
2012-12-18 13:57     ` Michael S. Tsirkin
2012-12-18 14:08     ` Paolo Bonzini
2012-12-18 14:08       ` Paolo Bonzini
2012-12-18 15:03       ` Michael S. Tsirkin
2012-12-18 15:03         ` Michael S. Tsirkin
2012-12-18 15:51         ` Paolo Bonzini
2012-12-18 15:51           ` Paolo Bonzini
2012-12-18 16:02           ` Michael S. Tsirkin
2012-12-18 16:02             ` Michael S. Tsirkin
2012-12-25 12:41             ` Wanlong Gao
2012-12-25 12:41               ` Wanlong Gao
2012-12-19 11:27   ` Stefan Hajnoczi
2012-12-19 11:27   ` Stefan Hajnoczi
2012-12-18 12:32 ` Paolo Bonzini
2012-12-18 13:42 ` [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission Michael S. Tsirkin
2012-12-18 13:42   ` Michael S. Tsirkin
2012-12-24  6:44   ` Wanlong Gao
2012-12-24  6:44     ` Wanlong Gao
2012-12-18 22:18 ` Rolf Eike Beer
2012-12-19  8:52   ` Paolo Bonzini
2012-12-19  8:52     ` Paolo Bonzini
2012-12-19 11:32     ` Michael S. Tsirkin
2012-12-19 11:32       ` Michael S. Tsirkin
2012-12-18 22:18 ` Rolf Eike Beer
2013-01-15  9:48 ` [PATCH 1/2] virtio-scsi: split out request queue set affinity function Wanlong Gao
2013-01-15  9:48   ` Wanlong Gao
2013-01-15  9:50   ` [PATCH 2/2] virtio-scsi: reset virtqueue affinity when doing cpu hotplug Wanlong Gao
2013-01-15  9:50     ` Wanlong Gao
2013-01-16  3:31     ` Rusty Russell
2013-01-16  3:31       ` Rusty Russell
2013-01-16  3:55       ` Wanlong Gao
2013-01-16  3:55         ` Wanlong Gao
2013-02-06 17:27         ` Paolo Bonzini
2013-02-06 17:27           ` Paolo Bonzini
