* [PATCH 0/5] Multiqueue virtio-scsi
@ 2012-08-28 11:54 Paolo Bonzini
  2012-08-28 11:54 ` [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue Paolo Bonzini
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-28 11:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, kvm, rusty, jasowang, mst, virtualization

Hi all,

this series adds multiqueue support to the virtio-scsi driver, based
on Jason Wang's work on virtio-net.  It uses a simple queue steering
algorithm that expects one queue per CPU.  LUNs in the same target always
use the same queue (so that commands are not reordered); queue switching
occurs when the request being queued is the only one for the target.
Also based on Jason's patches, the virtqueue affinity is set so that
each CPU is associated to one virtqueue.
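
Condensed to its core, the steering rule looks like the sketch below
(patch 5 has the real code, which avoids the modulo with a subtraction
loop; tgt->reqs counts the target's in-flight commands):

	/* Only retarget an idle target; busy targets keep their queue,
	 * which preserves per-target FIFO ordering. */
	if (atomic_inc_return(&tgt->reqs) == 1)
		tgt->req_vq = &vscsi->req_vqs[smp_processor_id() % vscsi->num_queues];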

I tested the patches with fio, using up to 32 virtio-scsi disks backed
by tmpfs on the host, and 1 LUN per target.

FIO configuration
-----------------
[global]
rw=read
bsrange=4k-64k
ioengine=libaio
direct=1
iodepth=4
loops=20

overall bandwidth (MB/s)
-----------------

# of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
1                  540               626                     599
2                  795               965                     925
4                  997              1376                    1500
8                 1136              2130                    2060
16                1440              2269                    2474
24                1408              2179                    2436
32                1515              1978                    2319

(The single-queue numbers are with 4 VCPUs; adding more VCPUs has a very
limited impact on them.)

avg bandwidth per LUN (MB/s)
---------------------

# of targets    single-queue    multi-queue, 4 VCPUs    multi-queue, 8 VCPUs
1                  540               626                     599
2                  397               482                     462
4                  249               344                     375
8                  142               266                     257
16                  90               141                     154
24                  58                90                     101
32                  47                61                      72

Testing this may require an irqbalance daemon that is built from git,
due to http://code.google.com/p/irqbalance/issues/detail?id=37.
Alternatively you can just set the affinity manually in /proc.

Rusty, can you please give your Acked-by to the first two patches?

Jason Wang (2):
  virtio-ring: move queue_index to vring_virtqueue
  virtio: introduce an API to set affinity for a virtqueue

Paolo Bonzini (3):
  virtio-scsi: allocate target pointers in a separate memory block
  virtio-scsi: pass struct virtio_scsi to virtqueue completion function
  virtio-scsi: introduce multiqueue support

 drivers/lguest/lguest_device.c         |    1 +
 drivers/remoteproc/remoteproc_virtio.c |    1 +
 drivers/s390/kvm/kvm_virtio.c          |    1 +
 drivers/scsi/virtio_scsi.c             |  200 ++++++++++++++++++++++++--------
 drivers/virtio/virtio_mmio.c           |   11 +-
 drivers/virtio/virtio_pci.c            |   58 ++++++++-
 drivers/virtio/virtio_ring.c           |   17 +++
 include/linux/virtio.h                 |    4 +
 include/linux/virtio_config.h          |   21 ++++
 9 files changed, 253 insertions(+), 61 deletions(-)



* [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue
  2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
@ 2012-08-28 11:54 ` Paolo Bonzini
  2012-08-29  7:54   ` Jason Wang
  2012-09-05 23:32   ` Rusty Russell
  2012-08-28 11:54 ` [PATCH 2/5] virtio: introduce an API to set affinity for a virtqueue Paolo Bonzini
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-28 11:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, kvm, rusty, jasowang, mst, virtualization

From: Jason Wang <jasowang@redhat.com>

Instead of storing the queue index in transport-specific virtio structs,
this patch moves it to vring_virtqueue and introduces a helper to get
the value.  This lets drivers simplify their management and tracing of
virtqueues.
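
For instance, a driver could now trace per-queue activity with something
like this sketch (the dev_dbg call is illustrative, not part of this
patch):

	dev_dbg(&vq->vdev->dev, "kicking vq %d (%s)\n",
		virtqueue_get_queue_index(vq), vq->name);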

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	I fixed the problems in Jason's v5 (posted at
	http://permalink.gmane.org/gmane.linux.kernel.virtualization/15910)
	and switched from virtio_set_queue_index to a new argument of
	vring_new_virtqueue.  This breaks at compile-time any virtio
	transport that is not updated.

 drivers/lguest/lguest_device.c         |    2 +-
 drivers/remoteproc/remoteproc_virtio.c |    2 +-
 drivers/s390/kvm/kvm_virtio.c          |    2 +-
 drivers/virtio/virtio_mmio.c           |   12 ++++--------
 drivers/virtio/virtio_pci.c            |   13 +++++--------
 drivers/virtio/virtio_ring.c           |   14 +++++++++++++-
 include/linux/virtio.h                 |    2 ++
 include/linux/virtio_ring.h            |    3 ++-
 8 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 9e8388e..ccb7dfb 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -296,7 +296,7 @@ static struct virtqueue *lg_find_vq(struct virtio_device *vdev,
 	 * to 'true': the host just a(nother) SMP CPU, so we only need inter-cpu
 	 * barriers.
 	 */
-	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+	vq = vring_new_virtqueue(index, lvq->config.num, LGUEST_VRING_ALIGN, vdev,
 				 true, lvq->pages, lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index 3541b44..343c194 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -103,7 +103,7 @@ static struct virtqueue *rp_find_vq(struct virtio_device *vdev,
 	 * Create the new vq, and tell virtio we're not interested in
 	 * the 'weak' smp barriers, since we're talking with a real device.
 	 */
-	vq = vring_new_virtqueue(len, rvring->align, vdev, false, addr,
+	vq = vring_new_virtqueue(id, len, rvring->align, vdev, false, addr,
 					rproc_virtio_notify, callback, name);
 	if (!vq) {
 		dev_err(dev, "vring_new_virtqueue %s failed\n", name);
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 47cccd5..5565af2 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(struct virtio_device *vdev,
 	if (err)
 		goto out;
 
-	vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
+	vq = vring_new_virtqueue(index, config->num, KVM_S390_VIRTIO_RING_ALIGN,
 				 vdev, true, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 453db0c..008bf58 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -131,9 +131,6 @@ struct virtio_mmio_vq_info {
 	/* the number of entries in the queue */
 	unsigned int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -225,11 +222,10 @@ static void vm_reset(struct virtio_device *vdev)
 static void vm_notify(struct virtqueue *vq)
 {
 	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
-	struct virtio_mmio_vq_info *info = vq->priv;
 
 	/* We write the queue's selector into the notification register to
 	 * signal the other end */
-	writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
+	writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
 }
 
 /* Notify all virtqueues on an interrupt. */
@@ -270,6 +266,7 @@ static void vm_del_vq(struct virtqueue *vq)
 	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
 	struct virtio_mmio_vq_info *info = vq->priv;
 	unsigned long flags, size;
+	unsigned int index = virtqueue_get_queue_index(vq);
 
 	spin_lock_irqsave(&vm_dev->lock, flags);
 	list_del(&info->node);
@@ -278,7 +275,7 @@ static void vm_del_vq(struct virtqueue *vq)
 	vring_del_virtqueue(vq);
 
 	/* Select and deactivate the queue */
-	writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
+	writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
 	writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
 	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN));
@@ -324,7 +321,6 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 		err = -ENOMEM;
 		goto error_kmalloc;
 	}
-	info->queue_index = index;
 
 	/* Allocate pages for the queue - start with a queue as big as
 	 * possible (limited by maximum size allowed by device), drop down
@@ -356,7 +352,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
 	/* Create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
 				 true, info->queue, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 2e03d41..d902464 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -79,9 +79,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -202,11 +199,11 @@ static void vp_reset(struct virtio_device *vdev)
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(virtqueue_get_queue_index(vq),
+		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -402,7 +399,6 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -418,7 +414,7 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
 	/* create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
 				 true, info->queue, vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
@@ -467,7 +463,8 @@ static void vp_del_vq(struct virtqueue *vq)
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(virtqueue_get_queue_index(vq),
+		vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..e639584 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -106,6 +106,9 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
+	/* Index of the queue */
+	int queue_index;
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -171,6 +174,13 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	return head;
 }
 
+int virtqueue_get_queue_index(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	return vq->queue_index;
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_queue_index);
+
 /**
  * virtqueue_add_buf - expose buffer to other end
  * @vq: the struct virtqueue we're talking about.
@@ -616,7 +626,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
 }
 EXPORT_SYMBOL_GPL(vring_interrupt);
 
-struct virtqueue *vring_new_virtqueue(unsigned int num,
+struct virtqueue *vring_new_virtqueue(unsigned int index,
+				      unsigned int num,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
@@ -647,6 +658,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 	vq->broken = false;
 	vq->last_used_idx = 0;
 	vq->num_added = 0;
+	vq->queue_index = index;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
 	vq->in_use = false;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index a1ba8bb..533b115 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -50,6 +50,8 @@ void *virtqueue_detach_unused_buf(struct virtqueue *vq);
 
 unsigned int virtqueue_get_vring_size(struct virtqueue *vq);
 
+int virtqueue_get_queue_index(struct virtqueue *vq);
+
 /**
  * virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index e338730..c2d793a 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -165,7 +165,8 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
 struct virtio_device;
 struct virtqueue;
 
-struct virtqueue *vring_new_virtqueue(unsigned int num,
+struct virtqueue *vring_new_virtqueue(unsigned int index,
+				      unsigned int num,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
-- 
1.7.1





* [PATCH 2/5] virtio: introduce an API to set affinity for a virtqueue
  2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
  2012-08-28 11:54 ` [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue Paolo Bonzini
@ 2012-08-28 11:54 ` Paolo Bonzini
  2012-09-05 23:32   ` Rusty Russell
  2012-08-28 11:54 ` [PATCH 3/5] virtio-scsi: allocate target pointers in a separate memory block Paolo Bonzini
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-28 11:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, kvm, rusty, jasowang, mst, virtualization

From: Jason Wang <jasowang@redhat.com>

Sometimes a virtio device needs to configure an IRQ affinity hint to
maximize performance.  Instead of just exposing the IRQ of a virtqueue,
this patch introduces an API to set the affinity for a virtqueue.

The API is best-effort: the affinity hint may not be set as expected,
depending on platform support, IRQ sharing and IRQ type.  Currently only
the PCI method is implemented, and it sets the affinity as follows:

- if the device uses INTX, we just ignore the request
- if the device has a per-vq vector, we force the affinity hint
- if the virtqueues share an MSI vector, we make the affinity the OR of
  all affinities requested
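
A driver would typically call the new wrapper once per request
virtqueue; a minimal sketch, assuming one virtqueue per CPU (vqs[] and
nvqs are illustrative names, not part of this patch):

	int i;

	/* Bind request virtqueue i to CPU i (a hint only, best-effort). */
	for (i = 0; i < nvqs; i++)
		virtqueue_set_affinity(vqs[i], i);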

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 drivers/virtio/virtio_pci.c   |   46 +++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_config.h |   21 ++++++++++++++++++
 2 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index adb24f2..2ff0451 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -48,6 +48,7 @@ struct virtio_pci_device
 	int msix_enabled;
 	int intx_enabled;
 	struct msix_entry *msix_entries;
+	cpumask_var_t *msix_affinity_masks;
 	/* Name strings for interrupts. This size should be enough,
 	 * and I'm too lazy to allocate each name separately. */
 	char (*msix_names)[256];
@@ -276,6 +277,10 @@ static void vp_free_vectors(struct virtio_device *vdev)
 	for (i = 0; i < vp_dev->msix_used_vectors; ++i)
 		free_irq(vp_dev->msix_entries[i].vector, vp_dev);
 
+	for (i = 0; i < vp_dev->msix_vectors; i++)
+		if (vp_dev->msix_affinity_masks[i])
+			free_cpumask_var(vp_dev->msix_affinity_masks[i]);
+
 	if (vp_dev->msix_enabled) {
 		/* Disable the vector used for configuration */
 		iowrite16(VIRTIO_MSI_NO_VECTOR,
@@ -293,6 +298,8 @@ static void vp_free_vectors(struct virtio_device *vdev)
 	vp_dev->msix_names = NULL;
 	kfree(vp_dev->msix_entries);
 	vp_dev->msix_entries = NULL;
+	kfree(vp_dev->msix_affinity_masks);
+	vp_dev->msix_affinity_masks = NULL;
 }
 
 static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
@@ -311,6 +318,15 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 				     GFP_KERNEL);
 	if (!vp_dev->msix_names)
 		goto error;
+	vp_dev->msix_affinity_masks
+		= kzalloc(nvectors * sizeof *vp_dev->msix_affinity_masks,
+			  GFP_KERNEL);
+	if (!vp_dev->msix_affinity_masks)
+		goto error;
+	for (i = 0; i < nvectors; ++i)
+		if (!alloc_cpumask_var(&vp_dev->msix_affinity_masks[i],
+					GFP_KERNEL))
+			goto error;
 
 	for (i = 0; i < nvectors; ++i)
 		vp_dev->msix_entries[i].entry = i;
@@ -607,6 +623,35 @@ static const char *vp_bus_name(struct virtio_device *vdev)
 	return pci_name(vp_dev->pci_dev);
 }
 
+/* Set up the affinity for a virtqueue:
+ * - force the affinity for per vq vector
+ * - OR over all affinities for shared MSI
+ * - ignore the affinity request if we're using INTX
+ */
+static int vp_set_vq_affinity(struct virtqueue *vq, int cpu)
+{
+	struct virtio_device *vdev = vq->vdev;
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct virtio_pci_vq_info *info = vq->priv;
+	struct cpumask *mask;
+	unsigned int irq;
+
+	if (!vq->callback)
+		return -EINVAL;
+
+	if (vp_dev->msix_enabled) {
+		mask = vp_dev->msix_affinity_masks[info->msix_vector];
+		irq = vp_dev->msix_entries[info->msix_vector].vector;
+		if (cpu == -1)
+			irq_set_affinity_hint(irq, NULL);
+		else {
+			cpumask_set_cpu(cpu, mask);
+			irq_set_affinity_hint(irq, mask);
+		}
+	}
+	return 0;
+}
+
 static struct virtio_config_ops virtio_pci_config_ops = {
 	.get		= vp_get,
 	.set		= vp_set,
@@ -618,6 +663,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
 	.bus_name	= vp_bus_name,
+	.set_vq_affinity = vp_set_vq_affinity,
 };
 
 static void virtio_pci_release_dev(struct device *_d)
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index fc457f4..2c4a989 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -98,6 +98,7 @@
  *	vdev: the virtio_device
  *      This returns a pointer to the bus name a la pci_name from which
  *      the caller can then copy.
+ * @set_vq_affinity: set the affinity for a virtqueue.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -116,6 +117,7 @@ struct virtio_config_ops {
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
+	int (*set_vq_affinity)(struct virtqueue *vq, int cpu);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -190,5 +192,24 @@ const char *virtio_bus_name(struct virtio_device *vdev)
 	return vdev->config->bus_name(vdev);
 }
 
+/**
+ * virtqueue_set_affinity - set the affinity for a virtqueue
+ * @vq: the virtqueue
+ * @cpu: the cpu number, or -1 to clear the affinity hint
+ *
+ * Note that this function is best-effort: the affinity hint may not be
+ * set, depending on config support, IRQ type and IRQ sharing.
+ *
+ */
+static inline
+int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
+{
+	struct virtio_device *vdev = vq->vdev;
+	if (vdev->config->set_vq_affinity)
+		return vdev->config->set_vq_affinity(vq, cpu);
+	return 0;
+}
+
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_VIRTIO_CONFIG_H */
-- 
1.7.1




* [PATCH 3/5] virtio-scsi: allocate target pointers in a separate memory block
  2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
  2012-08-28 11:54 ` [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue Paolo Bonzini
  2012-08-28 11:54 ` [PATCH 2/5] virtio: introduce an API to set affinity for a virtqueue Paolo Bonzini
@ 2012-08-28 11:54 ` Paolo Bonzini
  2012-08-28 14:07   ` Sasha Levin
  2012-08-28 11:54 ` [PATCH 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function Paolo Bonzini
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-28 11:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, kvm, rusty, jasowang, mst, virtualization

We will place the request virtqueues in the flexible array member, so
the target pointers must move out into a separately allocated block.

Refining the virtqueue API would let us drop the sglist copy, at
which point the pointer-to-array-of-pointers can become a simple
pointer-to-array.  It would both simplify the allocation and remove a
dereference in several hot paths.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 drivers/scsi/virtio_scsi.c |   23 +++++++++++++++--------
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 595af1a..62fec04 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -77,7 +77,7 @@ struct virtio_scsi {
 	/* Get some buffers ready for event vq */
 	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
 
-	struct virtio_scsi_target_state *tgt[];
+	struct virtio_scsi_target_state **tgt;
 };
 
 static struct kmem_cache *virtscsi_cmd_cache;
@@ -615,10 +615,13 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev)
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 
-	num_targets = sh->max_id;
-	for (i = 0; i < num_targets; i++) {
-		kfree(vscsi->tgt[i]);
-		vscsi->tgt[i] = NULL;
+	if (vscsi->tgt) {
+		num_targets = sh->max_id;
+		for (i = 0; i < num_targets; i++) {
+			kfree(vscsi->tgt[i]);
+			vscsi->tgt[i] = NULL;
+		}
+		kfree(vscsi->tgt);
 	}
 
 	vdev->config->del_vqs(vdev);
@@ -660,6 +663,12 @@ static int virtscsi_init(struct virtio_device *vdev,
 	/* We need to know how many segments before we allocate.  */
 	sg_elems = virtscsi_config_get(vdev, seg_max) ?: 1;
 
+	vscsi->tgt = kmalloc(num_targets *
+			sizeof(struct virtio_scsi_target_state *), GFP_KERNEL);
+	if (!vscsi->tgt) {
+		err = -ENOMEM;
+		goto out;
+	}
 	for (i = 0; i < num_targets; i++) {
 		vscsi->tgt[i] = virtscsi_alloc_tgt(vdev, sg_elems);
 		if (!vscsi->tgt[i]) {
@@ -685,9 +694,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 
 	/* Allocate memory and link the structs together.  */
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
-	shost = scsi_host_alloc(&virtscsi_host_template,
-		sizeof(*vscsi)
-		+ num_targets * sizeof(struct virtio_scsi_target_state));
+	shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi));
 
 	if (!shost)
 		return -ENOMEM;
-- 
1.7.1




* [PATCH 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function
  2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
                   ` (2 preceding siblings ...)
  2012-08-28 11:54 ` [PATCH 3/5] virtio-scsi: allocate target pointers in a separate memory block Paolo Bonzini
@ 2012-08-28 11:54 ` Paolo Bonzini
  2012-08-28 11:54 ` [PATCH 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-28 11:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, kvm, rusty, jasowang, mst, virtualization

This will be needed soon in order to retrieve the per-target
struct.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 drivers/scsi/virtio_scsi.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 62fec04..6414ea0 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -107,7 +107,7 @@ static void virtscsi_compute_resid(struct scsi_cmnd *sc, u32 resid)
  *
  * Called with vq_lock held.
  */
-static void virtscsi_complete_cmd(void *buf)
+static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 {
 	struct virtio_scsi_cmd *cmd = buf;
 	struct scsi_cmnd *sc = cmd->sc;
@@ -168,7 +168,8 @@ static void virtscsi_complete_cmd(void *buf)
 	sc->scsi_done(sc);
 }
 
-static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf))
+static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq,
+			     void (*fn)(struct virtio_scsi *vscsi, void *buf))
 {
 	void *buf;
 	unsigned int len;
@@ -176,7 +177,7 @@ static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf))
 	do {
 		virtqueue_disable_cb(vq);
 		while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
-			fn(buf);
+			fn(vscsi, buf);
 	} while (!virtqueue_enable_cb(vq));
 }
 
@@ -187,11 +188,11 @@ static void virtscsi_req_done(struct virtqueue *vq)
 	unsigned long flags;
 
 	spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags);
-	virtscsi_vq_done(vq, virtscsi_complete_cmd);
+	virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd);
 	spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags);
 };
 
-static void virtscsi_complete_free(void *buf)
+static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
 {
 	struct virtio_scsi_cmd *cmd = buf;
 
@@ -208,7 +209,7 @@ static void virtscsi_ctrl_done(struct virtqueue *vq)
 	unsigned long flags;
 
 	spin_lock_irqsave(&vscsi->ctrl_vq.vq_lock, flags);
-	virtscsi_vq_done(vq, virtscsi_complete_free);
+	virtscsi_vq_done(vscsi, vq, virtscsi_complete_free);
 	spin_unlock_irqrestore(&vscsi->ctrl_vq.vq_lock, flags);
 };
 
@@ -331,7 +332,7 @@ static void virtscsi_handle_event(struct work_struct *work)
 	virtscsi_kick_event(vscsi, event_node);
 }
 
-static void virtscsi_complete_event(void *buf)
+static void virtscsi_complete_event(struct virtio_scsi *vscsi, void *buf)
 {
 	struct virtio_scsi_event_node *event_node = buf;
 
@@ -346,7 +347,7 @@ static void virtscsi_event_done(struct virtqueue *vq)
 	unsigned long flags;
 
 	spin_lock_irqsave(&vscsi->event_vq.vq_lock, flags);
-	virtscsi_vq_done(vq, virtscsi_complete_event);
+	virtscsi_vq_done(vscsi, vq, virtscsi_complete_event);
 	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
 };
 
-- 
1.7.1




* [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
                   ` (3 preceding siblings ...)
  2012-08-28 11:54 ` [PATCH 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function Paolo Bonzini
@ 2012-08-28 11:54 ` Paolo Bonzini
  2012-09-04  2:21   ` Nicholas A. Bellinger
                     ` (2 more replies)
  2012-08-30  7:13 ` [PATCH 0/5] Multiqueue virtio-scsi Stefan Hajnoczi
  2012-08-30 14:53 ` Michael S. Tsirkin
  6 siblings, 3 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-28 11:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, kvm, rusty, jasowang, mst, virtualization

This patch adds queue steering to virtio-scsi.  When a target is sent
multiple requests, we always drive them to the same queue so that FIFO
processing order is kept.  However, if a target is idle, we can choose
a queue arbitrarily.  In this case the queue is chosen according to the
current VCPU, so the driver expects the number of request queues to be
equal to the number of VCPUs.  This makes it easy and fast to select
the queue, and also lets the driver optimize the IRQ affinity for the
virtqueues (each virtqueue's affinity is set to the CPU that "owns"
the queue).
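
The per-CPU ownership mentioned above amounts to one best-effort hint
per request queue; a sketch matching virtscsi_init_vq below:

	/* vq 0 is control, vq 1 is event, vqs 2..n are request queues;
	 * bind request vq i to CPU i - VIRTIO_SCSI_VQ_BASE. */
	virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
			       VIRTIO_SCSI_VQ_BASE);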

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 drivers/scsi/virtio_scsi.c |  162 +++++++++++++++++++++++++++++++++++---------
 1 files changed, 130 insertions(+), 32 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 6414ea0..0c4b096 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
 
 #define VIRTIO_SCSI_MEMPOOL_SZ 64
 #define VIRTIO_SCSI_EVENT_LEN 8
+#define VIRTIO_SCSI_VQ_BASE 2
 
 /* Command queue element */
 struct virtio_scsi_cmd {
@@ -59,9 +60,13 @@ struct virtio_scsi_vq {
 
 /* Per-target queue state */
 struct virtio_scsi_target_state {
-	/* Protects sg.  Lock hierarchy is tgt_lock -> vq_lock.  */
+	/* Protects sg, req_vq.  Lock hierarchy is tgt_lock -> vq_lock.  */
 	spinlock_t tgt_lock;
 
+	struct virtio_scsi_vq *req_vq;
+
+	atomic_t reqs;
+
 	/* For sglist construction when adding commands to the virtqueue.  */
 	struct scatterlist sg[];
 };
@@ -70,14 +75,15 @@ struct virtio_scsi_target_state {
 struct virtio_scsi {
 	struct virtio_device *vdev;
 
-	struct virtio_scsi_vq ctrl_vq;
-	struct virtio_scsi_vq event_vq;
-	struct virtio_scsi_vq req_vq;
-
 	/* Get some buffers ready for event vq */
 	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
 
+	u32 num_queues;
 	struct virtio_scsi_target_state **tgt;
+
+	struct virtio_scsi_vq ctrl_vq;
+	struct virtio_scsi_vq event_vq;
+	struct virtio_scsi_vq req_vqs[];
 };
 
 static struct kmem_cache *virtscsi_cmd_cache;
@@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 	struct virtio_scsi_cmd *cmd = buf;
 	struct scsi_cmnd *sc = cmd->sc;
 	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
+	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
+
+	atomic_dec(&tgt->reqs);
 
 	dev_dbg(&sc->device->sdev_gendev,
 		"cmd %p response %u status %#02x sense_len %u\n",
@@ -185,11 +194,13 @@ static void virtscsi_req_done(struct virtqueue *vq)
 {
 	struct Scsi_Host *sh = virtio_scsi_host(vq->vdev);
 	struct virtio_scsi *vscsi = shost_priv(sh);
+	int index = virtqueue_get_queue_index(vq) - VIRTIO_SCSI_VQ_BASE;
+	struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];
 	unsigned long flags;
 
-	spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags);
+	spin_lock_irqsave(&req_vq->vq_lock, flags);
 	virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd);
-	spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags);
+	spin_unlock_irqrestore(&req_vq->vq_lock, flags);
 };
 
 static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
@@ -429,10 +440,10 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
 	return ret;
 }
 
-static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
+static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
+				 struct virtio_scsi_target_state *tgt,
+				 struct scsi_cmnd *sc)
 {
-	struct virtio_scsi *vscsi = shost_priv(sh);
-	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
 	struct virtio_scsi_cmd *cmd;
 	int ret;
 
@@ -466,7 +477,7 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
 	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
 
-	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
+	if (virtscsi_kick_cmd(tgt, tgt->req_vq, cmd,
 			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
 			      GFP_ATOMIC) >= 0)
 		ret = 0;
@@ -475,6 +486,38 @@ out:
 	return ret;
 }
 
+static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
+					struct scsi_cmnd *sc)
+{
+	struct virtio_scsi *vscsi = shost_priv(sh);
+	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
+
+	atomic_inc(&tgt->reqs);
+	return virtscsi_queuecommand(vscsi, tgt, sc);
+}
+
+static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
+				       struct scsi_cmnd *sc)
+{
+	struct virtio_scsi *vscsi = shost_priv(sh);
+	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
+	unsigned long flags;
+	u32 queue_num;
+
+	/* Using an atomic_t for tgt->reqs lets the virtqueue handler
+	 * decrement it without taking the spinlock.
+	 */
+	spin_lock_irqsave(&tgt->tgt_lock, flags);
+	if (atomic_inc_return(&tgt->reqs) == 1) {
+		queue_num = smp_processor_id();
+		while (unlikely(queue_num >= vscsi->num_queues))
+			queue_num -= vscsi->num_queues;
+		tgt->req_vq = &vscsi->req_vqs[queue_num];
+	}
+	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+	return virtscsi_queuecommand(vscsi, tgt, sc);
+}
+
 static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
 {
 	DECLARE_COMPLETION_ONSTACK(comp);
@@ -544,12 +585,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
 	return virtscsi_tmf(vscsi, cmd);
 }
 
-static struct scsi_host_template virtscsi_host_template = {
+static struct scsi_host_template virtscsi_host_template_single = {
 	.module = THIS_MODULE,
 	.name = "Virtio SCSI HBA",
 	.proc_name = "virtio_scsi",
-	.queuecommand = virtscsi_queuecommand,
 	.this_id = -1,
+	.queuecommand = virtscsi_queuecommand_single,
+	.eh_abort_handler = virtscsi_abort,
+	.eh_device_reset_handler = virtscsi_device_reset,
+
+	.can_queue = 1024,
+	.dma_boundary = UINT_MAX,
+	.use_clustering = ENABLE_CLUSTERING,
+};
+
+static struct scsi_host_template virtscsi_host_template_multi = {
+	.module = THIS_MODULE,
+	.name = "Virtio SCSI HBA",
+	.proc_name = "virtio_scsi",
+	.this_id = -1,
+	.queuecommand = virtscsi_queuecommand_multi,
 	.eh_abort_handler = virtscsi_abort,
 	.eh_device_reset_handler = virtscsi_device_reset,
 
@@ -575,15 +630,19 @@ static struct scsi_host_template virtscsi_host_template = {
 				  &__val, sizeof(__val)); \
 	})
 
+
 static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
-			     struct virtqueue *vq)
+			     struct virtqueue *vq, bool affinity)
 {
 	spin_lock_init(&virtscsi_vq->vq_lock);
 	virtscsi_vq->vq = vq;
+	if (affinity)
+		virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
+				       VIRTIO_SCSI_VQ_BASE);
 }
 
 static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
-	struct virtio_device *vdev, int sg_elems)
+	struct virtio_scsi *vscsi, u32 sg_elems)
 {
 	struct virtio_scsi_target_state *tgt;
 	gfp_t gfp_mask = GFP_KERNEL;
@@ -597,6 +656,13 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
 
 	spin_lock_init(&tgt->tgt_lock);
 	sg_init_table(tgt->sg, sg_elems + 2);
+	atomic_set(&tgt->reqs, 0);
+
+	/*
+	 * The default is unused for multiqueue, but with a single queue
+	 * or target we use it in virtscsi_queuecommand.
+	 */
+	tgt->req_vq = &vscsi->req_vqs[0];
 	return tgt;
 }
 
@@ -632,28 +698,41 @@ static int virtscsi_init(struct virtio_device *vdev,
 			 struct virtio_scsi *vscsi, int num_targets)
 {
 	int err;
-	struct virtqueue *vqs[3];
 	u32 i, sg_elems;
+	u32 num_vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct virtqueue **vqs;
 
-	vq_callback_t *callbacks[] = {
-		virtscsi_ctrl_done,
-		virtscsi_event_done,
-		virtscsi_req_done
-	};
-	const char *names[] = {
-		"control",
-		"event",
-		"request"
-	};
+	num_vqs = vscsi->num_queues + VIRTIO_SCSI_VQ_BASE;
+	vqs = kmalloc(num_vqs * sizeof(struct virtqueue *), GFP_KERNEL);
+	callbacks = kmalloc(num_vqs * sizeof(vq_callback_t *), GFP_KERNEL);
+	names = kmalloc(num_vqs * sizeof(char *), GFP_KERNEL);
+
+	if (!callbacks || !vqs || !names) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	callbacks[0] = virtscsi_ctrl_done;
+	callbacks[1] = virtscsi_event_done;
+	names[0] = "control";
+	names[1] = "event";
+	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++) {
+		callbacks[i] = virtscsi_req_done;
+		names[i] = "request";
+	}
 
 	/* Discover virtqueues and write information to configuration.  */
-	err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
+	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
 	if (err)
 		return err;
 
-	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
-	virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
-	virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
+	virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
+	virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
+	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
+		virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
+				 vqs[i], vscsi->num_queues > 1);
 
 	virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE);
 	virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE);
@@ -671,7 +750,7 @@ static int virtscsi_init(struct virtio_device *vdev,
 		goto out;
 	}
 	for (i = 0; i < num_targets; i++) {
-		vscsi->tgt[i] = virtscsi_alloc_tgt(vdev, sg_elems);
+		vscsi->tgt[i] = virtscsi_alloc_tgt(vscsi, sg_elems);
 		if (!vscsi->tgt[i]) {
 			err = -ENOMEM;
 			goto out;
@@ -680,6 +759,9 @@ static int virtscsi_init(struct virtio_device *vdev,
 	err = 0;
 
 out:
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	if (err)
 		virtscsi_remove_vqs(vdev);
 	return err;
@@ -692,11 +774,26 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	int err;
 	u32 sg_elems, num_targets;
 	u32 cmd_per_lun;
+	u32 num_queues;
+	struct scsi_host_template *hostt;
+
+	/* We need to know how many queues before we allocate.  */
+	num_queues = virtscsi_config_get(vdev, num_queues) ?: 1;
 
 	/* Allocate memory and link the structs together.  */
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
-	shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi));
 
+	/* Multiqueue is not beneficial with a single target.  */
+	if (num_targets == 1)
+		num_queues = 1;
+
+	if (num_queues == 1)
+		hostt = &virtscsi_host_template_single;
+	else
+		hostt = &virtscsi_host_template_multi;
+
+	shost = scsi_host_alloc(hostt,
+		sizeof(*vscsi) + sizeof(vscsi->req_vqs[0]) * num_queues);
 	if (!shost)
 		return -ENOMEM;
 
@@ -704,6 +801,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	shost->sg_tablesize = sg_elems;
 	vscsi = shost_priv(shost);
 	vscsi->vdev = vdev;
+	vscsi->num_queues = num_queues;
 	vdev->priv = shost;
 
 	err = virtscsi_init(vdev, vscsi, num_targets);
-- 
1.7.1



* Re: [PATCH 3/5] virtio-scsi: allocate target pointers in a separate memory block
  2012-08-28 11:54 ` [PATCH 3/5] virtio-scsi: allocate target pointers in a separate memory block Paolo Bonzini
@ 2012-08-28 14:07   ` Sasha Levin
  2012-08-28 14:25     ` Paolo Bonzini
  0 siblings, 1 reply; 34+ messages in thread
From: Sasha Levin @ 2012-08-28 14:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, mst, virtualization

On 08/28/2012 01:54 PM, Paolo Bonzini wrote:
> We will place the request virtqueues in the flexible array member.
> 
> Refining the virtqueue API would let us drop the sglist copy, at
> which point the pointer-to-array-of-pointers can become a simple
> pointer-to-array.  It would both simplify the allocation and remove a
> dereference in several hot paths.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  drivers/scsi/virtio_scsi.c |   23 +++++++++++++++--------
>  1 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 595af1a..62fec04 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -77,7 +77,7 @@ struct virtio_scsi {
>  	/* Get some buffers ready for event vq */
>  	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
>  
> -	struct virtio_scsi_target_state *tgt[];
> +	struct virtio_scsi_target_state **tgt;
>  };
>  
>  static struct kmem_cache *virtscsi_cmd_cache;
> @@ -615,10 +615,13 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev)
>  	/* Stop all the virtqueues. */
>  	vdev->config->reset(vdev);
>  
> -	num_targets = sh->max_id;
> -	for (i = 0; i < num_targets; i++) {
> -		kfree(vscsi->tgt[i]);
> -		vscsi->tgt[i] = NULL;
> +	if (vscsi->tgt) {
> +		num_targets = sh->max_id;
> +		for (i = 0; i < num_targets; i++) {
> +			kfree(vscsi->tgt[i]);

Since we now kmalloc() the vscsi->tgt array, it doesn't get zeroed anymore.

This means that if for example, num_targets=3, and the second
virtscsi_alloc_tgt() in virtscsi_init() failed, we're going to kfree() garbage here.


Thanks,
Sasha




* Re: [PATCH 3/5] virtio-scsi: allocate target pointers in a separate memory block
  2012-08-28 14:07   ` Sasha Levin
@ 2012-08-28 14:25     ` Paolo Bonzini
  0 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-28 14:25 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, mst, virtualization

On 28/08/2012 16:07, Sasha Levin wrote:
>> > -	num_targets = sh->max_id;
>> > -	for (i = 0; i < num_targets; i++) {
>> > -		kfree(vscsi->tgt[i]);
>> > -		vscsi->tgt[i] = NULL;
>> > +	if (vscsi->tgt) {
>> > +		num_targets = sh->max_id;
>> > +		for (i = 0; i < num_targets; i++) {
>> > +			kfree(vscsi->tgt[i]);
> Since we now kmalloc() the vscsi->tgt array, it doesn't get zeroed anymore.
> 
> This means that if for example, num_targets=3, and the second
> virtscsi_alloc_tgt() in virtscsi_init() failed, we're going to kfree() garbage here.

Right, changed to kzalloc.
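
For reference, the fixed allocation would presumably become:

	vscsi->tgt = kzalloc(num_targets *
			sizeof(struct virtio_scsi_target_state *), GFP_KERNEL);

so unset slots stay NULL and kfree() of them is a harmless no-op.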

Paolo


* Re: [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue
  2012-08-28 11:54 ` [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue Paolo Bonzini
@ 2012-08-29  7:54   ` Jason Wang
  2012-09-05 23:32   ` Rusty Russell
  1 sibling, 0 replies; 34+ messages in thread
From: Jason Wang @ 2012-08-29  7:54 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, linux-scsi, kvm, rusty, mst, virtualization

On 08/28/2012 07:54 PM, Paolo Bonzini wrote:
> From: Jason Wang <jasowang@redhat.com>
>
> Instead of storing the queue index in transport-specific virtio structs,
> this patch moves it to vring_virtqueue and introduces a helper to get
> the value.  This lets drivers simplify their management and tracing of
> virtqueues.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> 	I fixed the problems in Jason's v5 (posted at
> 	http://permalink.gmane.org/gmane.linux.kernel.virtualization/15910)
> 	and switched from virtio_set_queue_index to a new argument of
> 	vring_new_virtqueue.  This breaks at compile-time any virtio
> 	transport that is not updated.

Thanks Paolo. I'm fine with this patch.


* Re: [PATCH 0/5] Multiqueue virtio-scsi
  2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
                   ` (4 preceding siblings ...)
  2012-08-28 11:54 ` [PATCH 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
@ 2012-08-30  7:13 ` Stefan Hajnoczi
  2012-08-30 14:53 ` Michael S. Tsirkin
  6 siblings, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2012-08-30  7:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, linux-scsi, kvm, mst, virtualization

On Tue, Aug 28, 2012 at 01:54:12PM +0200, Paolo Bonzini wrote:
> this series adds multiqueue support to the virtio-scsi driver, based
> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> algorithm that expects one queue per CPU.  LUNs in the same target always
> use the same queue (so that commands are not reordered); queue switching
> occurs when the request being queued is the only one for the target.
> Also based on Jason's patches, the virtqueue affinity is set so that
> each CPU is associated to one virtqueue.

Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>


* Re: [PATCH 0/5] Multiqueue virtio-scsi
  2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
                   ` (5 preceding siblings ...)
  2012-08-30  7:13 ` [PATCH 0/5] Multiqueue virtio-scsi Stefan Hajnoczi
@ 2012-08-30 14:53 ` Michael S. Tsirkin
  2012-08-30 15:45   ` Paolo Bonzini
  6 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-08-30 14:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, linux-scsi, kvm, virtualization

On Tue, Aug 28, 2012 at 01:54:12PM +0200, Paolo Bonzini wrote:
> Hi all,
> 
> this series adds multiqueue support to the virtio-scsi driver, based
> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> algorithm that expects one queue per CPU.  LUNs in the same target always
> use the same queue (so that commands are not reordered); queue switching
> occurs when the request being queued is the only one for the target.
> Also based on Jason's patches, the virtqueue affinity is set so that
> each CPU is associated to one virtqueue.

Is there a spec patch? I did not see one.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] Multiqueue virtio-scsi
  2012-08-30 14:53 ` Michael S. Tsirkin
@ 2012-08-30 15:45   ` Paolo Bonzini
  0 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-08-30 15:45 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, linux-scsi, kvm, virtualization

Il 30/08/2012 16:53, Michael S. Tsirkin ha scritto:
>> > this series adds multiqueue support to the virtio-scsi driver, based
>> > on Jason Wang's work on virtio-net.  It uses a simple queue steering
>> > algorithm that expects one queue per CPU.  LUNs in the same target always
>> > use the same queue (so that commands are not reordered); queue switching
>> > occurs when the request being queued is the only one for the target.
>> > Also based on Jason's patches, the virtqueue affinity is set so that
>> > each CPU is associated to one virtqueue.
> 
> Is there a spec patch? I did not see one.

It was already in the first version of the spec, just not implemented
until now.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-08-28 11:54 ` [PATCH 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
@ 2012-09-04  2:21   ` Nicholas A. Bellinger
  2012-09-04  6:46     ` Paolo Bonzini
  2012-09-04 12:48   ` Michael S. Tsirkin
  2012-09-04 14:47   ` Michael S. Tsirkin
  2 siblings, 1 reply; 34+ messages in thread
From: Nicholas A. Bellinger @ 2012-09-04  2:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, mst,
	virtualization, Christoph Hellwig, Jens Axboe, target-devel

On Tue, 2012-08-28 at 13:54 +0200, Paolo Bonzini wrote:
> This patch adds queue steering to virtio-scsi.  When a target is sent
> multiple requests, we always drive them to the same queue so that FIFO
> processing order is kept.  However, if a target was idle, we can choose
> a queue arbitrarily.  In this case the queue is chosen according to the
> current VCPU, so the driver expects the number of request queues to be
> equal to the number of VCPUs.  This makes it easy and fast to select
> the queue, and also lets the driver optimize the IRQ affinity for the
> virtqueues (each virtqueue's affinity is set to the CPU that "owns"
> the queue).
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Hey Paolo & Co,

I've not had a chance to try this with tcm_vhost just yet, but noticed
one thing wrt assumptions about virtio_scsi_target_state->reqs access
below..

>  drivers/scsi/virtio_scsi.c |  162 +++++++++++++++++++++++++++++++++++---------
>  1 files changed, 130 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 6414ea0..0c4b096 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -26,6 +26,7 @@
>  
>  #define VIRTIO_SCSI_MEMPOOL_SZ 64
>  #define VIRTIO_SCSI_EVENT_LEN 8
> +#define VIRTIO_SCSI_VQ_BASE 2
>  
>  /* Command queue element */
>  struct virtio_scsi_cmd {
> @@ -59,9 +60,13 @@ struct virtio_scsi_vq {
>  
>  /* Per-target queue state */
>  struct virtio_scsi_target_state {
> -	/* Protects sg.  Lock hierarchy is tgt_lock -> vq_lock.  */
> +	/* Protects sg, req_vq.  Lock hierarchy is tgt_lock -> vq_lock.  */
>  	spinlock_t tgt_lock;
>  
> +	struct virtio_scsi_vq *req_vq;
> +
> +	atomic_t reqs;
> +
>  	/* For sglist construction when adding commands to the virtqueue.  */
>  	struct scatterlist sg[];
>  };
> @@ -70,14 +75,15 @@ struct virtio_scsi_target_state {
>  struct virtio_scsi {
>  	struct virtio_device *vdev;
>  
> -	struct virtio_scsi_vq ctrl_vq;
> -	struct virtio_scsi_vq event_vq;
> -	struct virtio_scsi_vq req_vq;
> -
>  	/* Get some buffers ready for event vq */
>  	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
>  
> +	u32 num_queues;
>  	struct virtio_scsi_target_state **tgt;
> +
> +	struct virtio_scsi_vq ctrl_vq;
> +	struct virtio_scsi_vq event_vq;
> +	struct virtio_scsi_vq req_vqs[];
>  };
>  
>  static struct kmem_cache *virtscsi_cmd_cache;
> @@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
>  	struct virtio_scsi_cmd *cmd = buf;
>  	struct scsi_cmnd *sc = cmd->sc;
>  	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> +
> +	atomic_dec(&tgt->reqs);
>  

As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
atomic_inc_return(tgt->reqs) check, it seems like using atomic_dec() w/o
smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
accessors properly, no..?

>  	dev_dbg(&sc->device->sdev_gendev,
>  		"cmd %p response %u status %#02x sense_len %u\n",
> @@ -185,11 +194,13 @@ static void virtscsi_req_done(struct virtqueue *vq)
>  {
>  	struct Scsi_Host *sh = virtio_scsi_host(vq->vdev);
>  	struct virtio_scsi *vscsi = shost_priv(sh);
> +	int index = virtqueue_get_queue_index(vq) - VIRTIO_SCSI_VQ_BASE;
> +	struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags);
> +	spin_lock_irqsave(&req_vq->vq_lock, flags);
>  	virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd);
> -	spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags);
> +	spin_unlock_irqrestore(&req_vq->vq_lock, flags);
>  };
>  
>  static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
> @@ -429,10 +440,10 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
>  	return ret;
>  }
>  
> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
> +				 struct virtio_scsi_target_state *tgt,
> +				 struct scsi_cmnd *sc)
>  {
> -	struct virtio_scsi *vscsi = shost_priv(sh);
> -	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
>  	struct virtio_scsi_cmd *cmd;
>  	int ret;
>  
> @@ -466,7 +477,7 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
>  	BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
>  	memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
>  
> -	if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
> +	if (virtscsi_kick_cmd(tgt, tgt->req_vq, cmd,
>  			      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
>  			      GFP_ATOMIC) >= 0)
>  		ret = 0;
> @@ -475,6 +486,38 @@ out:
>  	return ret;
>  }
>  
> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
> +					struct scsi_cmnd *sc)
> +{
> +	struct virtio_scsi *vscsi = shost_priv(sh);
> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> +
> +	atomic_inc(&tgt->reqs);
> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> +}
> +

...

> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
> +				       struct scsi_cmnd *sc)
> +{
> +	struct virtio_scsi *vscsi = shost_priv(sh);
> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> +	unsigned long flags;
> +	u32 queue_num;
> +
> +	/* Using an atomic_t for tgt->reqs lets the virtqueue handler
> +	 * decrement it without taking the spinlock.
> +	 */
> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
> +	if (atomic_inc_return(&tgt->reqs) == 1) {
> +		queue_num = smp_processor_id();
> +		while (unlikely(queue_num >= vscsi->num_queues))
> +			queue_num -= vscsi->num_queues;
> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
> +	}
> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> +}
> +

The extra memory barriers to get this right for the current approach are
just going to slow things down even more for virtio-scsi-mq..

After hearing Jens's blk-mq talk last week in San Diego + having a look
at the new code in linux-block.git/new-queue, the approach of using a
per-cpu lock-less list hw -> sw queue that uses IPI + numa_node hints
to make smart decisions for the completion path is making a lot of sense.

Jens's approach is what we will ultimately need to re-architect in SCSI
core if we're ever going to move beyond the issues of legacy host_lock,
so I'm wondering if maybe this is the direction that virtio-scsi-mq
needs to go in as well..?

--nab


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04  2:21   ` Nicholas A. Bellinger
@ 2012-09-04  6:46     ` Paolo Bonzini
  2012-09-04  8:46       ` Michael S. Tsirkin
  2012-09-04 20:11       ` Nicholas A. Bellinger
  0 siblings, 2 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04  6:46 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, mst,
	virtualization, Christoph Hellwig, Jens Axboe, target-devel

Il 04/09/2012 04:21, Nicholas A. Bellinger ha scritto:
>> @@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
>>  	struct virtio_scsi_cmd *cmd = buf;
>>  	struct scsi_cmnd *sc = cmd->sc;
>>  	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
>> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
>> +
>> +	atomic_dec(&tgt->reqs);
>>  
> 
> As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
> atomic_inc_return(tgt->reqs) check, it seems like using atomic_dec() w/o
> smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
> accessors properly, no..?

No, only a single "thing" is being accessed, and there is no need to
order the decrement with respect to preceding or subsequent accesses to
other locations.

In other words, tgt->reqs is already synchronized with itself, and that
is enough.

(Besides, on x86 smp_mb__after_atomic_dec is a nop).
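To illustrate the distinction with a generic sketch (the fields here are
made up for illustration, not driver code): a barrier is only needed when
the atomic operation has to be ordered against accesses to *other*
locations.

	/* barrier needed: the store to ->last_error must be visible
	 * before a waiter sees the counter drop */
	tgt->last_error = err;
	smp_mb__before_atomic_dec();
	atomic_dec(&tgt->pending);

	/* no barrier needed: the counter is the only location the two
	 * sides share, as with tgt->reqs here */
	atomic_dec(&tgt->reqs);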

>> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
>> +				       struct scsi_cmnd *sc)
>> +{
>> +	struct virtio_scsi *vscsi = shost_priv(sh);
>> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
>> +	unsigned long flags;
>> +	u32 queue_num;
>> +
>> +	/* Using an atomic_t for tgt->reqs lets the virtqueue handler
>> +	 * decrement it without taking the spinlock.
>> +	 */
>> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
>> +	if (atomic_inc_return(&tgt->reqs) == 1) {
>> +		queue_num = smp_processor_id();
>> +		while (unlikely(queue_num >= vscsi->num_queues))
>> +			queue_num -= vscsi->num_queues;
>> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
>> +	}
>> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
>> +	return virtscsi_queuecommand(vscsi, tgt, sc);
>> +}
>> +
> 
> The extra memory barriers to get this right for the current approach are
> just going to slow things down even more for virtio-scsi-mq..

virtio-scsi multiqueue has a performance benefit up to 20% (for a single
LUN) or 40% (on overall bandwidth across multiple LUNs).  I doubt that a
single memory barrier can have that much impact. :)

The way to go to improve performance even more is to add new virtio APIs
for finer control of the usage of the ring.  These should let us avoid
copying the sg list and almost get rid of the tgt_lock; even though the
locking is quite efficient in virtio-scsi (see how tgt_lock and vq_lock
are "pipelined" so as to overlap the preparation of two requests), it
should give a nice improvement and especially avoid a kmalloc with small
requests.  I may have some time for it next month.
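
(The pipelining referred to is, roughly, the shape of virtscsi_kick_cmd()
in the existing driver; a condensed, approximate sketch with details
elided:)

	spin_lock_irqsave(&tgt->tgt_lock, flags);
	/* build the sg list in tgt->sg, under tgt_lock */
	virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);

	spin_lock(&vq->vq_lock);
	ret = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
	/* tgt->sg has been consumed: drop tgt_lock so another CPU can
	 * start preparing the next request while this one is kicked */
	spin_unlock(&tgt->tgt_lock);
	if (ret >= 0)
		virtqueue_kick(vq->vq);
	spin_unlock_irqrestore(&vq->vq_lock, flags);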

> Jens's approach is what we will ultimately need to re-architect in SCSI
> core if we're ever going to move beyond the issues of legacy host_lock,
> so I'm wondering if maybe this is the direction that virtio-scsi-mq
> needs to go in as well..?

We can see after the block layer multiqueue work goes in...  I also need
to look more closely at Jens's changes.

Have you measured the host_lock to be a bottleneck in high-iops
benchmarks, even for a modern driver that does not hold it in
queuecommand?  (Certainly it will become more important as the
virtio-scsi queuecommand becomes thinner and thinner).  If so, we can
start looking at limiting host_lock usage in the fast path.

BTW, supporting this in tcm-vhost should be quite trivial, as all the
request queues are the same and all serialization is done in the
virtio-scsi driver.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04  6:46     ` Paolo Bonzini
@ 2012-09-04  8:46       ` Michael S. Tsirkin
  2012-09-04 10:25         ` Paolo Bonzini
  2012-09-04 20:11       ` Nicholas A. Bellinger
  1 sibling, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04  8:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

On Tue, Sep 04, 2012 at 08:46:12AM +0200, Paolo Bonzini wrote:
> Il 04/09/2012 04:21, Nicholas A. Bellinger ha scritto:
> >> @@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
> >>  	struct virtio_scsi_cmd *cmd = buf;
> >>  	struct scsi_cmnd *sc = cmd->sc;
> >>  	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
> >> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> >> +
> >> +	atomic_dec(&tgt->reqs);
> >>  
> > 
> > As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
> > atomic_inc_return(tgt->reqs) check, it seems like using atomic_dec() w/o
> > smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
> > accessors properly, no..?
> 
> No, only a single "thing" is being accessed, and there is no need to
> order the decrement with respect to preceding or subsequent accesses to
> other locations.
>
> In other words, tgt->reqs is already synchronized with itself, and that
> is enough.

I think your logic is correct and barrier is not needed,
but this needs better documentation.

> (Besides, on x86 smp_mb__after_atomic_dec is a nop).
> >> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
> >> +				       struct scsi_cmnd *sc)
> >> +{
> >> +	struct virtio_scsi *vscsi = shost_priv(sh);
> >> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> >> +	unsigned long flags;
> >> +	u32 queue_num;
> >> +
> >> +	/* Using an atomic_t for tgt->reqs lets the virtqueue handler
> >> +	 * decrement it without taking the spinlock.
> >> +	 */

Above comment is not really helpful - reader can be safely assumed to
know what atomic_t is.

Please delete, and replace with the text from commit log
that explains the heuristic used to select req_vq.

Also please add a comment near 'reqs' definition.
Something like "number of outstanding requests - used to detect idle
target".


> >> +	spin_lock_irqsave(&tgt->tgt_lock, flags);

Looks like this lock can be removed - req_vq is only
modified when target is idle and only used when it is
not idle.

> >> +	if (atomic_inc_return(&tgt->reqs) == 1) {
> >> +		queue_num = smp_processor_id();
> >> +		while (unlikely(queue_num >= vscsi->num_queues))
> >> +			queue_num -= vscsi->num_queues;
> >> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
> >> +	}
> >> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> >> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> >> +}
> >> +
> >> +

.....

> >> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
> >> +                                       struct scsi_cmnd *sc)
> >> +{
> >> +       struct virtio_scsi *vscsi = shost_priv(sh);
> >> +       struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> >> +
> >> +       atomic_inc(&tgt->reqs);
> >> +       return virtscsi_queuecommand(vscsi, tgt, sc);
> >> +}
> >> +

Here, reqs is unused - why bother incrementing it?
A branch on completion would be cheaper IMHO.


>virtio-scsi multiqueue has a performance benefit up to 20%

To be fair, you could be running in single queue mode.
In that case the extra atomics and indirection that this code
brings will just add overhead without benefits.
I don't know how significant that would be.

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04  8:46       ` Michael S. Tsirkin
@ 2012-09-04 10:25         ` Paolo Bonzini
  2012-09-04 11:09           ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04 10:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

Il 04/09/2012 10:46, Michael S. Tsirkin ha scritto:
>>>> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
>>>> +				       struct scsi_cmnd *sc)
>>>> +{
>>>> +	struct virtio_scsi *vscsi = shost_priv(sh);
>>>> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
>>>> +	unsigned long flags;
>>>> +	u32 queue_num;
>>>> +
>>>> +	/* Using an atomic_t for tgt->reqs lets the virtqueue handler
>>>> +	 * decrement it without taking the spinlock.
>>>> +	 */
> 
> Above comment is not really helpful - reader can be safely assumed to
> know what atomic_t is.

Sure, the comment explains that we use an atomic because _elsewhere_ the
tgt_lock is not held while modifying reqs.

> Please delete, and replace with the text from commit log
> that explains the heuristic used to select req_vq.

Ok.

> Also please add a comment near 'reqs' definition.
> Something like "number of outstanding requests - used to detect idle
> target".

Ok.

> 
>>>> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
> 
> Looks like this lock can be removed - req_vq is only
> modified when target is idle and only used when it is
> not idle.

If you have two incoming requests at the same time, req_vq is also
modified when the target is not idle; that's the point of the lock.

Suppose tgt->reqs = 0 initially, and you have two processors/queues.
Initially tgt->req_vq is queue #1.  If you have this:

    queuecommand on CPU #0         queuecommand #2 on CPU #1
  --------------------------------------------------------------
    atomic_inc_return(...) == 1
                                   atomic_inc_return(...) == 2
                                   virtscsi_queuecommand to queue #1
    tgt->req_vq = queue #0
    virtscsi_queuecommand to queue #0

then two requests are issued to different queues without a quiescent
point in the middle.

>>>> +	if (atomic_inc_return(&tgt->reqs) == 1) {
>>>> +		queue_num = smp_processor_id();
>>>> +		while (unlikely(queue_num >= vscsi->num_queues))
>>>> +			queue_num -= vscsi->num_queues;
>>>> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
>>>> +	}
>>>> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
>>>> +	return virtscsi_queuecommand(vscsi, tgt, sc);
>>>> +}
>>>> +
>>>> +
> 
> .....
> 
>>>> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
>>>> +                                       struct scsi_cmnd *sc)
>>>> +{
>>>> +       struct virtio_scsi *vscsi = shost_priv(sh);
>>>> +       struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
>>>> +
>>>> +       atomic_inc(&tgt->reqs);
>>>> +       return virtscsi_queuecommand(vscsi, tgt, sc);
>>>> +}
>>>> +
> 
> Here, reqs is unused - why bother incrementing it?
> A branch on completion would be cheaper IMHO.

Well, I could also let tgt->reqs go negative, but it would be a bit untidy.

Another alternative is to access the target's target_busy field with
ACCESS_ONCE, and drop reqs altogether.  Too tricky to do this kind of
micro-optimization so early, though.
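
(For what it's worth, that alternative might look roughly like the
untested sketch below; it leans on the SCSI core's target_busy
accounting and is not part of this series:)

	/* hypothetical: the SCSI core's counter says this is the only
	 * command in flight, so the target can be re-steered */
	if (ACCESS_ONCE(scsi_target(sc->device)->target_busy) == 1)
		tgt->req_vq = &vscsi->req_vqs[smp_processor_id() %
					      vscsi->num_queues];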

>> virtio-scsi multiqueue has a performance benefit up to 20%
> 
> To be fair, you could be running in single queue mode.
> In that case the extra atomics and indirection that this code
> brings will just add overhead without benefits.
> I don't know how significant that would be.

Not measurable in my experiments.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 10:25         ` Paolo Bonzini
@ 2012-09-04 11:09           ` Michael S. Tsirkin
  2012-09-04 11:18             ` Paolo Bonzini
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 11:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

On Tue, Sep 04, 2012 at 12:25:03PM +0200, Paolo Bonzini wrote:
> Il 04/09/2012 10:46, Michael S. Tsirkin ha scritto:
> >>>> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
> >>>> +				       struct scsi_cmnd *sc)
> >>>> +{
> >>>> +	struct virtio_scsi *vscsi = shost_priv(sh);
> >>>> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> >>>> +	unsigned long flags;
> >>>> +	u32 queue_num;
> >>>> +
> >>>> +	/* Using an atomic_t for tgt->reqs lets the virtqueue handler
> >>>> +	 * decrement it without taking the spinlock.
> >>>> +	 */
> > 
> > Above comment is not really helpful - reader can be safely assumed to
> > know what atomic_t is.
> 
> Sure, the comment explains that we use an atomic because _elsewhere_ the
> tgt_lock is not held while modifying reqs.
> 
> > Please delete, and replace with the text from commit log
> > that explains the heuristic used to select req_vq.
> 
> Ok.
> 
> > Also please add a comment near 'reqs' definition.
> > Something like "number of outstanding requests - used to detect idle
> > target".
> 
> Ok.
> 
> > 
> >>>> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
> > 
> > Looks like this lock can be removed - req_vq is only
> > modified when target is idle and only used when it is
> > not idle.
> 
> If you have two incoming requests at the same time, req_vq is also
> modified when the target is not idle; that's the point of the lock.
> 
> Suppose tgt->reqs = 0 initially, and you have two processors/queues.
> Initially tgt->req_vq is queue #1.  If you have this:
> 
>     queuecommand on CPU #0         queuecommand #2 on CPU #1
>   --------------------------------------------------------------
>     atomic_inc_return(...) == 1
>                                    atomic_inc_return(...) == 2
>                                    virtscsi_queuecommand to queue #1
>     tgt->req_vq = queue #0
>     virtscsi_queuecommand to queue #0
> 
> then two requests are issued to different queues without a quiescent
> point in the middle.

What happens then? Does this break correctness?

> >>>> +	if (atomic_inc_return(&tgt->reqs) == 1) {
> >>>> +		queue_num = smp_processor_id();
> >>>> +		while (unlikely(queue_num >= vscsi->num_queues))
> >>>> +			queue_num -= vscsi->num_queues;
> >>>> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
> >>>> +	}
> >>>> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> >>>> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> >>>> +}
> >>>> +
> >>>> +
> > 
> > .....
> > 
> >>>> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
> >>>> +                                       struct scsi_cmnd *sc)
> >>>> +{
> >>>> +       struct virtio_scsi *vscsi = shost_priv(sh);
> >>>> +       struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> >>>> +
> >>>> +       atomic_inc(&tgt->reqs);
> >>>> +       return virtscsi_queuecommand(vscsi, tgt, sc);
> >>>> +}
> >>>> +
> > 
> > Here, reqs is unused - why bother incrementing it?
> > A branch on completion would be cheaper IMHO.
> 
> Well, I could also let tgt->reqs go negative, but it would be a bit untidy.
> 
> Another alternative is to access the target's target_busy field with
> ACCESS_ONCE, and drop reqs altogether.  Too tricky to do this kind of
> micro-optimization so early, though.

So keep it simple and just check a flag.

> >> virtio-scsi multiqueue has a performance benefit up to 20%
> > 
> > To be fair, you could be running in single queue mode.
> > In that case the extra atomics and indirection that this code
> > brings will just add overhead without benefits.
> > I don't know how significant that would be.
> 
> Not measurable in my experiments.
> 
> Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 11:09           ` Michael S. Tsirkin
@ 2012-09-04 11:18             ` Paolo Bonzini
  2012-09-04 13:35               ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04 11:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

Il 04/09/2012 13:09, Michael S. Tsirkin ha scritto:
>> >     queuecommand on CPU #0         queuecommand #2 on CPU #1
>> >   --------------------------------------------------------------
>> >     atomic_inc_return(...) == 1
>> >                                    atomic_inc_return(...) == 2
>> >                                    virtscsi_queuecommand to queue #1
>> >     tgt->req_vq = queue #0
>> >     virtscsi_queuecommand to queue #0
>> > 
>> > then two requests are issued to different queues without a quiescent
>> > point in the middle.
> What happens then? Does this break correctness?

Yes, requests to the same target should be processed in FIFO order, or
you have things like a flush issued before the write it was supposed to
flush.  This is why I can only change the queue when there is no request
pending.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-08-28 11:54 ` [PATCH 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
  2012-09-04  2:21   ` Nicholas A. Bellinger
@ 2012-09-04 12:48   ` Michael S. Tsirkin
  2012-09-04 13:49     ` Paolo Bonzini
  2012-09-04 14:47   ` Michael S. Tsirkin
  2 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 12:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

On Tue, Aug 28, 2012 at 01:54:17PM +0200, Paolo Bonzini wrote:
> This patch adds queue steering to virtio-scsi.  When a target is sent
> multiple requests, we always drive them to the same queue so that FIFO
> processing order is kept.  However, if a target was idle, we can choose
> a queue arbitrarily.  In this case the queue is chosen according to the
> current VCPU, so the driver expects the number of request queues to be
> equal to the number of VCPUs.  This makes it easy and fast to select
> the queue, and also lets the driver optimize the IRQ affinity for the
> virtqueues (each virtqueue's affinity is set to the CPU that "owns"
> the queue).
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

I guess an alternative is a per-target vq.
Is the reason you avoid this that you expect more targets
than cpus? If yes this is something you might want to
mention in the log.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 11:18             ` Paolo Bonzini
@ 2012-09-04 13:35               ` Michael S. Tsirkin
  2012-09-04 13:45                 ` Paolo Bonzini
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 13:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

On Tue, Sep 04, 2012 at 01:18:31PM +0200, Paolo Bonzini wrote:
> Il 04/09/2012 13:09, Michael S. Tsirkin ha scritto:
> >> >     queuecommand on CPU #0         queuecommand #2 on CPU #1
> >> >   --------------------------------------------------------------
> >> >     atomic_inc_return(...) == 1
> >> >                                    atomic_inc_return(...) == 2
> >> >                                    virtscsi_queuecommand to queue #1
> >> >     tgt->req_vq = queue #0
> >> >     virtscsi_queuecommand to queue #0
> >> > 
> >> > then two requests are issued to different queues without a quiescent
> >> > point in the middle.
> > What happens then? Does this break correctness?
> 
> Yes, requests to the same target should be processed in FIFO order, or
> you have things like a flush issued before the write it was supposed to
> flush.  This is why I can only change the queue when there is no request
> pending.
> 
> Paolo

I see.  I guess you can rewrite this as:

	atomic_inc(&tgt->reqs);
	if (atomic_read(&tgt->reqs) == 1)
		...

which is a bit cheaper, and makes it explicit that the
increment and the return do not need to be a single atomic
operation.

Another simple idea: store last processor id in target,
if it is unchanged no need to play with req_vq
and take spinlock.

Also - some kind of comment explaining why a similar race cannot happen
with this lock in place would be nice: I see why this specific race
cannot trigger, but since the lock is dropped later, before you submit
the command, I have a hard time convincing myself what exactly
guarantees that the vq is never switched before or even while the
command is submitted.




-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 13:35               ` Michael S. Tsirkin
@ 2012-09-04 13:45                 ` Paolo Bonzini
  2012-09-04 14:19                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04 13:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

Il 04/09/2012 15:35, Michael S. Tsirkin ha scritto:
> I see.  I guess you can rewrite this as:
>
> 	atomic_inc(&tgt->reqs);
> 	if (atomic_read(&tgt->reqs) == 1)
> 		...
>
> which is a bit cheaper, and makes it explicit that the
> increment and the return do not need to be a single atomic
> operation.

It seems more complicated to me for hardly any reason.  (Besides, is it
cheaper?  It has one less memory barrier on some architectures I frankly
do not care much about---not on x86---but it also has two memory
accesses instead of one on all architectures).

> Another simple idea: store last processor id in target,
> if it is unchanged no need to play with req_vq
> and take spinlock.

Not so sure, consider the previous example with last_processor_id equal
to 1.

    queuecommand on CPU #0         queuecommand #2 on CPU #1
  --------------------------------------------------------------
    atomic_inc_return(...) == 1
                                   atomic_inc_return(...) == 2
                                   virtscsi_queuecommand to queue #1
    last_processor_id == 0? no
    spin_lock
    tgt->req_vq = queue #0
    spin_unlock
    virtscsi_queuecommand to queue #0

This is not a network driver, there are still a lot of locks around.
This micro-optimization doesn't pay enough for the pain.

> Also - some kind of comment explaining why a similar race cannot happen
> with this lock in place would be nice: I see why this specific race
> cannot trigger, but since the lock is dropped later, before you submit
> the command, I have a hard time convincing myself what exactly
> guarantees that the vq is never switched before or even while the
> command is submitted.

Because tgt->reqs will never become zero (which is a necessary condition
for tgt->req_vq to change), as long as one request is executing
virtscsi_queuecommand.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 12:48   ` Michael S. Tsirkin
@ 2012-09-04 13:49     ` Paolo Bonzini
  2012-09-04 14:21       ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04 13:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

Il 04/09/2012 14:48, Michael S. Tsirkin ha scritto:
>> > This patch adds queue steering to virtio-scsi.  When a target is sent
>> > multiple requests, we always drive them to the same queue so that FIFO
>> > processing order is kept.  However, if a target was idle, we can choose
>> > a queue arbitrarily.  In this case the queue is chosen according to the
>> > current VCPU, so the driver expects the number of request queues to be
>> > equal to the number of VCPUs.  This makes it easy and fast to select
>> > the queue, and also lets the driver optimize the IRQ affinity for the
>> > virtqueues (each virtqueue's affinity is set to the CPU that "owns"
>> > the queue).
>> > 
>> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> I guess an alternative is a per-target vq.
> Is the reason you avoid this that you expect more targets
> than cpus? If yes this is something you might want to
> mention in the log.

One reason is that, even though in practice I expect roughly the same
number of targets and VCPUs, hotplug means the number of targets is
difficult to predict and is usually fixed to 256.

The other reason is that per-target vq didn't give any performance
advantage.  The bonus comes from cache locality and fewer process
migrations, more than from the independent virtqueues.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 13:45                 ` Paolo Bonzini
@ 2012-09-04 14:19                   ` Michael S. Tsirkin
  2012-09-04 14:25                     ` Paolo Bonzini
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 14:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

On Tue, Sep 04, 2012 at 03:45:57PM +0200, Paolo Bonzini wrote:
> > Also - some kind of comment explaining why a similar race cannot happen
> > with this lock in place would be nice: I see why this specific race
> > cannot trigger, but since the lock is dropped later, before you submit
> > the command, I have a hard time convincing myself what exactly
> > guarantees that the vq is never switched before or even while the
> > command is submitted.
> 
> Because tgt->reqs will never become zero (which is a necessary condition
> for tgt->req_vq to change), as long as one request is executing
> virtscsi_queuecommand.
> 
> Paolo

Yes, but this logic would apparently imply the lock is not necessary,
and it actually is.  I am not saying anything is wrong, just that it
looks scary.

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 13:49     ` Paolo Bonzini
@ 2012-09-04 14:21       ` Michael S. Tsirkin
  2012-09-04 14:30         ` Paolo Bonzini
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 14:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

On Tue, Sep 04, 2012 at 03:49:42PM +0200, Paolo Bonzini wrote:
> Il 04/09/2012 14:48, Michael S. Tsirkin ha scritto:
> >> > This patch adds queue steering to virtio-scsi.  When a target is sent
> >> > multiple requests, we always drive them to the same queue so that FIFO
> >> > processing order is kept.  However, if a target was idle, we can choose
> >> > a queue arbitrarily.  In this case the queue is chosen according to the
> >> > current VCPU, so the driver expects the number of request queues to be
> >> > equal to the number of VCPUs.  This makes it easy and fast to select
> >> > the queue, and also lets the driver optimize the IRQ affinity for the
> >> > virtqueues (each virtqueue's affinity is set to the CPU that "owns"
> >> > the queue).
> >> > 
> >> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > I guess an alternative is a per-target vq.
> > Is the reason you avoid this that you expect more targets
> > than cpus? If yes this is something you might want to
> > mention in the log.
> 
> One reason is that, even though in practice I expect roughly the same
> number of targets and VCPUs, hotplug means the number of targets is
> difficult to predict and is usually fixed to 256.
> 
> The other reason is that per-target vq didn't give any performance
> > advantage.  The bonus comes from cache locality and fewer process
> > migrations, more than from the independent virtqueues.
> 
> Paolo

Okay, and why is per-target worse for cache locality?

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 14:19                   ` Michael S. Tsirkin
@ 2012-09-04 14:25                     ` Paolo Bonzini
  0 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04 14:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nicholas A. Bellinger, linux-kernel, linux-scsi, kvm, rusty,
	jasowang, virtualization, Christoph Hellwig, Jens Axboe,
	target-devel

Il 04/09/2012 16:19, Michael S. Tsirkin ha scritto:
> > > Also - some kind of comment explaining why a similar race cannot happen
> > > with this lock in place would be nice: I see why this specific race
> > > cannot trigger, but since the lock is dropped later, before you submit
> > > the command, I have a hard time convincing myself what exactly
> > > guarantees that the vq is never switched before or even while the
> > > command is submitted.
> > 
> > Because tgt->reqs will never become zero (which is a necessary condition
> > for tgt->req_vq to change), as long as one request is executing
> > virtscsi_queuecommand.
> 
> Yes but this logic would apparently imply the lock is not necessary, and
> it actually is. I am not saying anything is wrong just that it
> looks scary.

Ok, I get the misunderstanding.  For the logic to hold, you need a
serialization point after which tgt->req_vq is not changed.  The lock
provides one such serialization point: after you unlock tgt->tgt_lock,
nothing else will change tgt->req_vq until your request completes.

Without the lock, there could always be a thread that is in the "then"
branch but has been scheduled out, and when rescheduled it will change
tgt->req_vq.

Perhaps the confusion comes from the atomic_inc_return, and that was
what my "why is this atomic" wanted to clear.  **tgt->reqs is only
atomic to avoid taking a spinlock in the ISR**.  If you read the code
with the lock, but with tgt->reqs as a regular non-atomic int, it should
be much easier to reason on the code.  I can split the patch if needed.
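
(Read that way, the steering logic reduces to the following sketch, with
reqs as a plain int and every access, including the completion path,
under tgt_lock:)

	/* submission */
	spin_lock_irqsave(&tgt->tgt_lock, flags);
	if (tgt->reqs++ == 0)
		tgt->req_vq = &vscsi->req_vqs[queue_num];
	spin_unlock_irqrestore(&tgt->tgt_lock, flags);

	/* completion - in the patch this is the lock-free atomic_dec */
	spin_lock_irqsave(&tgt->tgt_lock, flags);
	tgt->reqs--;
	spin_unlock_irqrestore(&tgt->tgt_lock, flags);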

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 14:21       ` Michael S. Tsirkin
@ 2012-09-04 14:30         ` Paolo Bonzini
  2012-09-04 14:41           ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04 14:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

Il 04/09/2012 16:21, Michael S. Tsirkin ha scritto:
> > One reason is that, even though in practice I expect roughly the same
> > number of targets and VCPUs, hotplug means the number of targets is
> > difficult to predict and is usually fixed to 256.
> > 
> > The other reason is that per-target vq didn't give any performance
> > advantage.  The bonus comes from cache locality and less process
> > migrations, more than from the independent virtqueues.
> 
> Okay, and why is per-target worse for cache locality?

Because a per-target vq doesn't have IRQ affinity for a particular CPU.

Assuming that the thread that is sending requests to the device is
I/O-bound, it is likely to be sleeping at the time the ISR is executed,
and thus executing the ISR on the same processor that sent the requests
is cheap.

But if you have many such I/O-bound processes, the kernel will execute
the ISR on a random processor, rather than the one that is sending
requests to the device.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 14:30         ` Paolo Bonzini
@ 2012-09-04 14:41           ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 14:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

On Tue, Sep 04, 2012 at 04:30:35PM +0200, Paolo Bonzini wrote:
> Il 04/09/2012 16:21, Michael S. Tsirkin ha scritto:
> > > One reason is that, even though in practice I expect roughly the same
> > > number of targets and VCPUs, hotplug means the number of targets is
> > > difficult to predict and is usually fixed to 256.
> > > 
> > > The other reason is that per-target vq didn't give any performance
> > > advantage.  The bonus comes from cache locality and fewer process
> > > migrations, more than from the independent virtqueues.
> > 
> > Okay, and why is per-target worse for cache locality?
> 
> Because per-target doesn't have IRQ affinity for a particular CPU.
> 
> Assuming that the thread that is sending requests to the device is
> I/O-bound, it is likely to be sleeping at the time the ISR is executed,
> and thus executing the ISR on the same processor that sent the requests
> is cheap.
> 
> But if you have many such I/O-bound processes, the kernel will execute
> the ISR on a random processor, rather than the one that is sending
> requests to the device.
> 
> Paolo

I see, another case where our irq balancing makes bad decisions.
You could do it differently - pin the irq to the cpu of the last task
that executed, and tweak the irq affinity when that changes (a rough
sketch follows below).
Still, if you want to support 256 targets, a vector per target
is not going to work.
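
(A rough sketch of the pin-to-last-task idea, reusing the affinity API
from patch 2/5; last_cpu is a hypothetical field, for illustration:)

	/* hypothetical: follow the submitting task around, re-pinning
	 * the queue's interrupt whenever the task shows up elsewhere */
	int cpu = smp_processor_id();

	if (tgt->last_cpu != cpu) {
		tgt->last_cpu = cpu;
		virtqueue_set_affinity(tgt->req_vq->vq, cpu);
	}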

Would be nice to add this motivation to commit log I think.

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-08-28 11:54 ` [PATCH 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
  2012-09-04  2:21   ` Nicholas A. Bellinger
  2012-09-04 12:48   ` Michael S. Tsirkin
@ 2012-09-04 14:47   ` Michael S. Tsirkin
  2012-09-04 14:55     ` Paolo Bonzini
  2 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 14:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

On Tue, Aug 28, 2012 at 01:54:17PM +0200, Paolo Bonzini wrote:
> @@ -575,15 +630,19 @@ static struct scsi_host_template virtscsi_host_template = {
>  				  &__val, sizeof(__val)); \
>  	})
>  
> +

Pls don't add empty lines.

>  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
> -			     struct virtqueue *vq)
> +			     struct virtqueue *vq, bool affinity)
>  {
>  	spin_lock_init(&virtscsi_vq->vq_lock);
>  	virtscsi_vq->vq = vq;
> +	if (affinity)
> +		virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
> +				       VIRTIO_SCSI_VQ_BASE);
>  }
>  

This means in practice that if you have fewer virtqueues than CPUs,
things are not going to work well, will they?

Any idea what to do?

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 14:47   ` Michael S. Tsirkin
@ 2012-09-04 14:55     ` Paolo Bonzini
  2012-09-04 15:03       ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-04 14:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

Il 04/09/2012 16:47, Michael S. Tsirkin ha scritto:
>> >  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
>> > -			     struct virtqueue *vq)
>> > +			     struct virtqueue *vq, bool affinity)
>> >  {
>> >  	spin_lock_init(&virtscsi_vq->vq_lock);
>> >  	virtscsi_vq->vq = vq;
>> > +	if (affinity)
>> > +		virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
>> > +				       VIRTIO_SCSI_VQ_BASE);
>> >  }
>> >  
> This means in practice if you have less virtqueues than CPUs,
> things are not going to work well, will they?

Not particularly.  It could be better or worse than single queue
depending on the workload.

> Any idea what to do?

Two possibilities:

1) Add a stride argument to virtqueue_set_affinity, and make it equal to
the number of queues.

2) Make multiqueue the default in QEMU, and make the default number of
queues equal to the number of VCPUs.

I was going for (2).
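
(Option (1) would need a mask-capable variant of the affinity API; a
hypothetical sketch, where virtqueue_set_affinity_mask() is assumed
rather than existing:)

	/* hypothetical: hint virtqueue 'first' at CPUs first,
	 * first+stride, first+2*stride, ... so that every CPU still
	 * maps onto some queue */
	static void virtscsi_set_affinity_stride(struct virtqueue *vq,
						 unsigned int first,
						 unsigned int stride)
	{
		cpumask_var_t mask;
		unsigned int cpu;

		if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
			return;
		for (cpu = first; cpu < nr_cpu_ids; cpu += stride)
			cpumask_set_cpu(cpu, mask);
		virtqueue_set_affinity_mask(vq, mask);	/* assumed API */
		free_cpumask_var(mask);
	}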

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 14:55     ` Paolo Bonzini
@ 2012-09-04 15:03       ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 15:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, virtualization

On Tue, Sep 04, 2012 at 04:55:56PM +0200, Paolo Bonzini wrote:
> Il 04/09/2012 16:47, Michael S. Tsirkin ha scritto:
> >> >  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
> >> > -			     struct virtqueue *vq)
> >> > +			     struct virtqueue *vq, bool affinity)
> >> >  {
> >> >  	spin_lock_init(&virtscsi_vq->vq_lock);
> >> >  	virtscsi_vq->vq = vq;
> >> > +	if (affinity)
> >> > +		virtqueue_set_affinity(vq, virtqueue_get_queue_index(vq) -
> >> > +				       VIRTIO_SCSI_VQ_BASE);
> >> >  }
> >> >  
> > This means in practice that if you have fewer virtqueues than CPUs,
> > things are not going to work well, will they?
> 
> Not particularly.  It could be better or worse than single queue
> depending on the workload.

Well, interrupts will go to a CPU different from the one
that sends commands, so ...

> > Any idea what to do?
> 
> Two possibilities:
> 
> 1) Add a stride argument to virtqueue_set_affinity, and make it equal to
> the number of queues.
> 
> 2) Make multiqueue the default in QEMU, and make the default number of
> queues equal to the number of VCPUs.
> 
> I was going for (2).
> 
> Paolo

3. use a per-target queue if there are fewer targets than cpus?

-- 
MST


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04  6:46     ` Paolo Bonzini
  2012-09-04  8:46       ` Michael S. Tsirkin
@ 2012-09-04 20:11       ` Nicholas A. Bellinger
  2012-09-05  7:03         ` Paolo Bonzini
  1 sibling, 1 reply; 34+ messages in thread
From: Nicholas A. Bellinger @ 2012-09-04 20:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, mst,
	virtualization, Christoph Hellwig, Jens Axboe, target-devel,
	Asias He

On Tue, 2012-09-04 at 08:46 +0200, Paolo Bonzini wrote:
> Il 04/09/2012 04:21, Nicholas A. Bellinger ha scritto:
> >> @@ -112,6 +118,9 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
> >>  	struct virtio_scsi_cmd *cmd = buf;
> >>  	struct scsi_cmnd *sc = cmd->sc;
> >>  	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
> >> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> >> +
> >> +	atomic_dec(&tgt->reqs);
> >>  
> > 
> > As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
> > atomic_inc_return(tgt->reqs) check, it seems like using atomic_dec() w/o
> > smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
> > accessors properly, no..?
> 
> No, only a single "thing" is being accessed, and there is no need to
> order the decrement with respect to preceding or subsequent accesses to
> other locations.
> 
> In other words, tgt->reqs is already synchronized with itself, and that
> is enough.
> 
> (Besides, on x86 smp_mb__after_atomic_dec is a nop).
> 

So the implementation detail wrt requests to the same target being
processed in FIFO ordering + only being able to change the queue when no
requests are pending helps to understand this code more.  Thanks for the
explanation on that bit..

However, it's still my understanding that the use of atomic_dec() in the
completion path means that smp_mb__after_atomic_dec() is a requirement to
be proper portable atomic.h code, no..?  Otherwise tgt->reqs should be
using something other than an atomic_t, right..?

> >> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
> >> +				       struct scsi_cmnd *sc)
> >> +{
> >> +	struct virtio_scsi *vscsi = shost_priv(sh);
> >> +	struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id];
> >> +	unsigned long flags;
> >> +	u32 queue_num;
> >> +
> >> +	/* Using an atomic_t for tgt->reqs lets the virtqueue handler
> >> +	 * decrement it without taking the spinlock.
> >> +	 */
> >> +	spin_lock_irqsave(&tgt->tgt_lock, flags);
> >> +	if (atomic_inc_return(&tgt->reqs) == 1) {
> >> +		queue_num = smp_processor_id();
> >> +		while (unlikely(queue_num >= vscsi->num_queues))
> >> +			queue_num -= vscsi->num_queues;
> >> +		tgt->req_vq = &vscsi->req_vqs[queue_num];
> >> +	}
> >> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> >> +	return virtscsi_queuecommand(vscsi, tgt, sc);
> >> +}
> >> +
> > 
> > The extra memory barriers to get this right for the current approach are
> > just going to slow things down even more for virtio-scsi-mq..
> 
> virtio-scsi multiqueue has a performance benefit up to 20% (for a single
> LUN) or 40% (on overall bandwidth across multiple LUNs).  I doubt that a
> single memory barrier can have that much impact. :)
> 

I've no doubt that this series improves the large-block, high-bandwidth
numbers for virtio-scsi, but historically that has always been the
easier workload to scale.  ;)

> The way to go to improve performance even more is to add new virtio APIs
> for finer control of the usage of the ring.  These should let us avoid
> copying the sg list and almost get rid of the tgt_lock; even though the
> locking is quite efficient in virtio-scsi (see how tgt_lock and vq_lock
> are "pipelined" so as to overlap the preparation of two requests), it
> should give a nice improvement and especially avoid a kmalloc with small
> requests.  I may have some time for it next month.
> 
> > Jens's approach is what we will ultimately need to re-architect in SCSI
> > core if we're ever going to move beyond the issues of legacy host_lock,
> > so I'm wondering if maybe this is the direction that virtio-scsi-mq
> > needs to go in as well..?
> 
> We can see after the block layer multiqueue work goes in...  I also need
> to look more closely at Jens's changes.
> 

Yes, I think Jens's new approach is providing some pretty significant
gains for raw block drivers with extremely high packet (small block
random I/O) workloads, esp. with hw block drivers that support genuine mq
with hw num_queues > 1.

He also has virtio-blk converted to run in num_queues=1 mode.

> Have you measured the host_lock to be a bottleneck in high-iops
> benchmarks, even for a modern driver that does not hold it in
> queuecommand?  (Certainly it will become more important as the
> virtio-scsi queuecommand becomes thinner and thinner).

This is exactly why it would make such a good vehicle to re-architect
SCSI core.  I'm thinking it can be the first sw LLD we attempt to get
running on an (currently) future scsi-mq prototype.

>   If so, we can
> start looking at limiting host_lock usage in the fast path.
> 

That would be a good incremental step for SCSI core, but I'm not sure
that we'll be able to scale compared to blk-mq without a new approach
for sw/hw LLDs along the lines of what Jens is doing.

> BTW, supporting this in tcm-vhost should be quite trivial, as all the
> request queues are the same and all serialization is done in the
> virtio-scsi driver.
> 

Looking forward to that too..  ;)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] virtio-scsi: introduce multiqueue support
  2012-09-04 20:11       ` Nicholas A. Bellinger
@ 2012-09-05  7:03         ` Paolo Bonzini
  0 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-09-05  7:03 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: linux-kernel, linux-scsi, kvm, rusty, jasowang, mst,
	virtualization, Christoph Hellwig, Jens Axboe, target-devel,
	Asias He

Il 04/09/2012 22:11, Nicholas A. Bellinger ha scritto:
>>> As tgt->tgt_lock is taken in virtscsi_queuecommand_multi() before the
>>> atomic_inc_return(tgt->reqs) check, it seems like using atomic_dec() w/o
>>> smp_mb__after_atomic_dec or tgt_lock access here is not using atomic.h
>>> accessors properly, no..?
>>
>> No, only a single "thing" is being accessed, and there is no need to
>> order the decrement with respect to preceding or subsequent accesses to
>> other locations.
>>
>> In other words, tgt->reqs is already synchronized with itself, and that
>> is enough.
>>
> 
> However, it's still my understanding that the use of atomic_dec() in the
> completion path means that smp_mb__after_atomic_dec() is a requirement to
> be proper portable atomic.h code, no..?  Otherwise tgt->reqs should be
> using something other than an atomic_t, right..?

Memory barriers aren't _always_ required, only when you need to order
accesses to multiple locations.

In this case, there is no other location that the
queuecommand/completion handlers need to synchronize against, so no
barrier is required.  You can see plenty of atomic_inc/atomic_dec in the
code without a barrier afterwards (the typical case is the opposite of
this patch: a refcount increment needs no barrier, while a refcount
decrement uses atomic_dec_return).
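
(A familiar instance of the same rule, as a generic refcount sketch
rather than driver code; obj and its refcount field are hypothetical:)

	/* taking a reference: no barrier, the caller already holds a
	 * reference so the object cannot disappear underneath us */
	atomic_inc(&obj->refcount);

	/* dropping what may be the last reference: all prior accesses
	 * to the object must be ordered before the free, hence the
	 * _return variant with its implied barrier */
	if (atomic_dec_return(&obj->refcount) == 0)
		kfree(obj);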

>> virtio-scsi multiqueue has a performance benefit up to 20% (for a single
>> LUN) or 40% (on overall bandwidth across multiple LUNs).  I doubt that a
>> single memory barrier can have that much impact. :)
>>
> 
> I've no doubt that this series improves the large-block, high-bandwidth
> numbers for virtio-scsi, but historically that has always been the
> easier workload to scale.  ;)

This is with a mixed workload (random 4k-64k) and tmpfs backend on the host.

> Yes, I think Jens's new approach is providing some pretty significant
> gains for raw block drivers with extremely high packet (small block
> random I/O) workloads, esp. with hw block drivers that support genuine mq
> with hw num_queues > 1.

I need to look into it, to understand how the queue steering here can be
adapted to his code.

>> Have you measured the host_lock to be a bottleneck in high-iops
>> benchmarks, even for a modern driver that does not hold it in
>> queuecommand?  (Certainly it will become more important as the
>> virtio-scsi queuecommand becomes thinner and thinner).
> 
> This is exactly why it would make such a good vehicle to re-architect
> SCSI core.  I'm thinking it can be the first sw LLD we attempt to get
> running on an (currently) future scsi-mq prototype.

Agreed.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/5] virtio: introduce an API to set affinity for a virtqueue
  2012-08-28 11:54 ` [PATCH 2/5] virtio: introduce an API to set affinity for a virtqueue Paolo Bonzini
@ 2012-09-05 23:32   ` Rusty Russell
  0 siblings, 0 replies; 34+ messages in thread
From: Rusty Russell @ 2012-09-05 23:32 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel
  Cc: linux-scsi, kvm, jasowang, mst, virtualization

Paolo Bonzini <pbonzini@redhat.com> writes:

> From: Jason Wang <jasowang@redhat.com>
>
> Sometimes, a virtio device needs to configure the irq affinity hint to
> maximize performance. Instead of just exposing the irq of a virtqueue,
> this patch introduces an API to set the affinity for a virtqueue.
>
> The API is best-effort; the affinity hint may not be set as expected due
> to platform support, irq sharing or irq type. Currently, only the pci
> method is implemented, and we set the affinity according to:
>
> - if the device uses INTX, we just ignore the request
> - if the device has a per-vq vector, we force the affinity hint
> - if the virtqueues share an MSI vector, make the affinity hint the OR
>   of all the affinities requested
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
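
For concreteness, the three cases above might look roughly like the
sketch below on the PCI transport.  The structures and the helper name
are illustrative only, not the actual virtio_pci code; among the calls,
only irq_set_affinity_hint() and the cpumask helpers are real kernel
APIs:

#include <linux/interrupt.h>
#include <linux/cpumask.h>

struct sketch_pci_dev {
	bool msix_enabled;
	int shared_irq;			/* vector shared by all vqs */
	struct cpumask *shared_mask;	/* accumulated affinity requests */
};

struct sketch_vq {
	struct sketch_pci_dev *dev;
	bool per_vq_vector;
	int irq;
	struct cpumask *mask;
};

static int sketch_set_vq_affinity(struct sketch_vq *vq, int cpu)
{
	if (!vq->dev->msix_enabled)
		return 0;		/* INTX: just ignore the request */

	if (vq->per_vq_vector) {
		/* Dedicated MSI-X vector: force the hint to this CPU. */
		cpumask_clear(vq->mask);
		cpumask_set_cpu(cpu, vq->mask);
		return irq_set_affinity_hint(vq->irq, vq->mask);
	}

	/* Shared vector: OR the new CPU into the accumulated mask. */
	cpumask_set_cpu(cpu, vq->dev->shared_mask);
	return irq_set_affinity_hint(vq->dev->shared_irq,
				     vq->dev->shared_mask);
}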

Applied, thanks.

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Cheers,
Rusty.


* Re: [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue
  2012-08-28 11:54 ` [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue Paolo Bonzini
  2012-08-29  7:54   ` Jason Wang
@ 2012-09-05 23:32   ` Rusty Russell
  1 sibling, 0 replies; 34+ messages in thread
From: Rusty Russell @ 2012-09-05 23:32 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel
  Cc: linux-scsi, kvm, jasowang, mst, virtualization

Paolo Bonzini <pbonzini@redhat.com> writes:

> From: Jason Wang <jasowang@redhat.com>
>
> Instead of storing the queue index in transport-specific virtio structs,
> this patch moves it to vring_virtqueue and introduces a helper to get
> the value.  This lets drivers simplify their management and tracing of
> virtqueues.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Sorry for the delay, I was at Kernel Summit and am only now actually
reading (vs skimming) my backlog.

Putting it in vring_virtqueue rather than virtqueue seems weird, though.
But I've applied it as-is; we can clean that up later if we want (probably
by merging the two structures, I'll have to think harder on that).
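
As a simplified illustration of that arrangement: the index lives in
the ring-specific structure, and callers reach it through the embedded
struct virtqueue.  The names below are sketches, not the code from
drivers/virtio/virtio_ring.c:

#include <linux/kernel.h>	/* container_of() */
#include <linux/virtio.h>	/* struct virtqueue */

struct sketch_vring_virtqueue {
	struct virtqueue vq;		/* embedded generic part */
	unsigned int queue_index;	/* now stored once, here */
	/* ... ring state ... */
};

static unsigned int sketch_get_queue_index(struct virtqueue *_vq)
{
	return container_of(_vq, struct sketch_vring_virtqueue,
			    vq)->queue_index;
}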

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Cheers,
Rusty.


end of thread

Thread overview: 34+ messages
2012-08-28 11:54 [PATCH 0/5] Multiqueue virtio-scsi Paolo Bonzini
2012-08-28 11:54 ` [PATCH 1/5] virtio-ring: move queue_index to vring_virtqueue Paolo Bonzini
2012-08-29  7:54   ` Jason Wang
2012-09-05 23:32   ` Rusty Russell
2012-08-28 11:54 ` [PATCH 2/5] virtio: introduce an API to set affinity for a virtqueue Paolo Bonzini
2012-09-05 23:32   ` Rusty Russell
2012-08-28 11:54 ` [PATCH 3/5] virtio-scsi: allocate target pointers in a separate memory block Paolo Bonzini
2012-08-28 14:07   ` Sasha Levin
2012-08-28 14:25     ` Paolo Bonzini
2012-08-28 11:54 ` [PATCH 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function Paolo Bonzini
2012-08-28 11:54 ` [PATCH 5/5] virtio-scsi: introduce multiqueue support Paolo Bonzini
2012-09-04  2:21   ` Nicholas A. Bellinger
2012-09-04  6:46     ` Paolo Bonzini
2012-09-04  8:46       ` Michael S. Tsirkin
2012-09-04 10:25         ` Paolo Bonzini
2012-09-04 11:09           ` Michael S. Tsirkin
2012-09-04 11:18             ` Paolo Bonzini
2012-09-04 13:35               ` Michael S. Tsirkin
2012-09-04 13:45                 ` Paolo Bonzini
2012-09-04 14:19                   ` Michael S. Tsirkin
2012-09-04 14:25                     ` Paolo Bonzini
2012-09-04 20:11       ` Nicholas A. Bellinger
2012-09-05  7:03         ` Paolo Bonzini
2012-09-04 12:48   ` Michael S. Tsirkin
2012-09-04 13:49     ` Paolo Bonzini
2012-09-04 14:21       ` Michael S. Tsirkin
2012-09-04 14:30         ` Paolo Bonzini
2012-09-04 14:41           ` Michael S. Tsirkin
2012-09-04 14:47   ` Michael S. Tsirkin
2012-09-04 14:55     ` Paolo Bonzini
2012-09-04 15:03       ` Michael S. Tsirkin
2012-08-30  7:13 ` [PATCH 0/5] Multiqueue virtio-scsi Stefan Hajnoczi
2012-08-30 14:53 ` Michael S. Tsirkin
2012-08-30 15:45   ` Paolo Bonzini
