Target-devel archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
@ 2020-11-12 23:18 Mike Christie
  2020-11-12 23:19 ` [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support Mike Christie
                   ` (11 more replies)
  0 siblings, 12 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:18 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

The following kernel patches were made over Michael's vhost branch:

https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost

and the vhost-scsi bug fix patchset:

https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/#t

And the qemu patch was made over the qemu master branch.

vhost-scsi currently supports multiple queues with the num_queues
setting, but we end up with a setup where the guest's scsi/block
layer can do a queue per vCPU and the layers below vhost can do
a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
but all IO gets set on and completed on a single vhost-scsi thread.
After 2 - 4 vqs this becomes a bottleneck.

This patchset allows us to create a worker thread per IO vq, so we
can better utilize multiple CPUs with the multiple queues. It
implments Jason's suggestion to create the initial worker like
normal, then create the extra workers for IO vqs with the
VHOST_SET_VRING_ENABLE ioctl command added in this patchset.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 11:53   ` Stefan Hajnoczi
  2020-12-02  9:59   ` Michael S. Tsirkin
  2020-11-12 23:19 ` [PATCH 01/10] vhost: remove work arg from vhost_work_flush Mike Christie
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

This patch made over the master branch allows the vhost-scsi
driver to call into the kernel and tell it to enable/disable
a virtqueue.

The kernel patches included with this set, will create
a worker per IO vq when multiple IO queues have been setup.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 hw/scsi/vhost-scsi.c        |  6 ++++++
 hw/virtio/vhost-backend.c   | 30 ++++++++++++++++++++++++++++++
 linux-headers/linux/vhost.h |  1 +
 3 files changed, 37 insertions(+)

diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 4d70fa0..bbb2ba3 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -93,6 +93,12 @@ static int vhost_scsi_start(VHostSCSI *s)
         return ret;
     }
 
+    ret = vsc->dev.vhost_ops->vhost_set_vring_enable(&vsc->dev, 1);
+    if (ret) {
+        error_report("Error enabling vhost-scsi vqs %d", ret);
+        vhost_scsi_common_stop(vsc);
+    }
+
     ret = vhost_scsi_set_endpoint(s);
     if (ret < 0) {
         error_report("Error setting vhost-scsi endpoint");
diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 88c8ecc..e190c8e 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -102,6 +102,35 @@ static int vhost_kernel_set_mem_table(struct vhost_dev *dev,
     return vhost_kernel_call(dev, VHOST_SET_MEM_TABLE, mem);
 }
 
+static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+    struct vhost_vring_state s;
+    int i, ret;
+
+    s.num = 1;
+    for (i = 0; i < dev->nvqs; ++i) {
+        s.index = i;
+
+        ret = vhost_kernel_call(dev, VHOST_SET_VRING_ENABLE, &s);
+        /* Ignore kernels that do not support the cmd */
+        if (ret = -EPERM)
+            return 0;
+        if (ret)
+            goto disable_vrings;
+    }
+
+    return 0;
+
+disable_vrings:
+    s.num = 0;
+    for (i--; i < dev->nvqs; ++i) {
+        s.index = i;
+        vhost_kernel_call(dev, VHOST_SET_VRING_ENABLE, &s);
+    }
+
+    return ret;
+}
+
 static int vhost_kernel_set_vring_addr(struct vhost_dev *dev,
                                        struct vhost_vring_addr *addr)
 {
@@ -302,6 +331,7 @@ static const VhostOps kernel_ops = {
         .vhost_scsi_get_abi_version = vhost_kernel_scsi_get_abi_version,
         .vhost_set_log_base = vhost_kernel_set_log_base,
         .vhost_set_mem_table = vhost_kernel_set_mem_table,
+        .vhost_set_vring_enable = vhost_kernel_set_vring_enable,
         .vhost_set_vring_addr = vhost_kernel_set_vring_addr,
         .vhost_set_vring_endian = vhost_kernel_set_vring_endian,
         .vhost_set_vring_num = vhost_kernel_set_vring_num,
diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
index 7523218..98dd919 100644
--- a/linux-headers/linux/vhost.h
+++ b/linux-headers/linux/vhost.h
@@ -70,6 +70,7 @@
 #define VHOST_VRING_BIG_ENDIAN 1
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+#define VHOST_SET_VRING_ENABLE _IOW(VHOST_VIRTIO, 0x15, struct vhost_vring_state)
 
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 01/10] vhost: remove work arg from vhost_work_flush
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
  2020-11-12 23:19 ` [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 13:04   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 02/10] vhost scsi: remove extra flushes Mike Christie
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

vhost_work_flush doesn't do anything with the work arg. This patch drops
it and then renames vhost_work_flush to vhost_work_dev_flush to reflect
that the function flushes all the works in the dev and not just a
specific queue or work item.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 drivers/vhost/scsi.c  | 4 ++--
 drivers/vhost/vhost.c | 8 ++++----
 drivers/vhost/vhost.h | 2 +-
 drivers/vhost/vsock.c | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index f22fce5..8795fd3 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1468,8 +1468,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 	/* Flush both the vhost poll and vhost work */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
 		vhost_scsi_flush_vq(vs, i);
-	vhost_work_flush(&vs->dev, &vs->vs_completion_work);
-	vhost_work_flush(&vs->dev, &vs->vs_event_work);
+	vhost_work_dev_flush(&vs->dev);
+	vhost_work_dev_flush(&vs->dev);
 
 	/* Wait for all reqs issued before the flush to be finished */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a262e12..78d9535 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,7 +231,7 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+void vhost_work_dev_flush(struct vhost_dev *dev)
 {
 	struct vhost_flush_struct flush;
 
@@ -243,13 +243,13 @@ void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
 		wait_for_completion(&flush.wait_event);
 	}
 }
-EXPORT_SYMBOL_GPL(vhost_work_flush);
+EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
 /* Flush any work that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_dev_flush(poll->dev);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
@@ -538,7 +538,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 	attach.owner = current;
 	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
 	vhost_work_queue(dev, &attach.work);
-	vhost_work_flush(dev, &attach.work);
+	vhost_work_dev_flush(dev);
 	return attach.ret;
 }
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b063324..1ba8e81 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -46,7 +46,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work);
+void vhost_work_dev_flush(struct vhost_dev *dev);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
 
 struct vhost_log {
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a483cec..f40205f 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -652,7 +652,7 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++)
 		if (vsock->vqs[i].handle_kick)
 			vhost_poll_flush(&vsock->vqs[i].poll);
-	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
+	vhost_work_dev_flush(&vsock->dev);
 }
 
 static void vhost_vsock_reset_orphans(struct sock *sk)
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 02/10] vhost scsi: remove extra flushes
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
  2020-11-12 23:19 ` [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support Mike Christie
  2020-11-12 23:19 ` [PATCH 01/10] vhost: remove work arg from vhost_work_flush Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 13:07   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 03/10] vhost poll: fix coding style Mike Christie
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

The vhost work flush function was flushing the entire work queue, so
there is no need for the double vhost_work_dev_flush calls in
vhost_scsi_flush.

And we do not need to call vhost_poll_flush for each poller because
that call also ends up flushing the same work queue thread the
vhost_work_dev_flush call flushed.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 8795fd3..4725a08 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1443,11 +1443,6 @@ static void vhost_scsi_handle_kick(struct vhost_work *work)
 	vhost_scsi_handle_vq(vs, vq);
 }
 
-static void vhost_scsi_flush_vq(struct vhost_scsi *vs, int index)
-{
-	vhost_poll_flush(&vs->vqs[index].vq.poll);
-}
-
 /* Callers must hold dev mutex */
 static void vhost_scsi_flush(struct vhost_scsi *vs)
 {
@@ -1466,9 +1461,6 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 		kref_put(&old_inflight[i]->kref, vhost_scsi_done_inflight);
 
 	/* Flush both the vhost poll and vhost work */
-	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
-		vhost_scsi_flush_vq(vs, i);
-	vhost_work_dev_flush(&vs->dev);
 	vhost_work_dev_flush(&vs->dev);
 
 	/* Wait for all reqs issued before the flush to be finished */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 03/10] vhost poll: fix coding style
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (2 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 02/10] vhost scsi: remove extra flushes Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 13:07   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 04/10] vhost: support multiple worker threads Mike Christie
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

We use like 3 coding styles in this struct. Switch to just tabs.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 drivers/vhost/vhost.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 1ba8e81..575c818 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -28,12 +28,12 @@ struct vhost_work {
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
-	poll_table                table;
-	wait_queue_head_t        *wqh;
-	wait_queue_entry_t              wait;
-	struct vhost_work	  work;
-	__poll_t		  mask;
-	struct vhost_dev	 *dev;
+	poll_table		table;
+	wait_queue_head_t	*wqh;
+	wait_queue_entry_t	wait;
+	struct vhost_work	work;
+	__poll_t		mask;
+	struct vhost_dev	*dev;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 04/10] vhost: support multiple worker threads
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (3 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 03/10] vhost poll: fix coding style Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-12 23:19 ` [PATCH 05/10] vhost: poll support support multiple workers Mike Christie
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

This is a prep patch to support multiple vhost worker threads per vhost
dev. This patch converts the code that had assumed a single worker
thread by:

1. Moving worker related fields to a new struct vhost_worker.
2. Converting vhost.c code to use the new struct and assume we will
have an array of workers.
3. It also exports 2 helper functions that will be used in the last
patch when vhost-scsi is converted to use this new functionality.

Why do we need multiple worker threads?

The admin can set_num_queues > 1 and the guest OS will run in
multiqueue mode where, depending on the num_queues, you might get
a queue per CPU. The layers below vhost-scsi are also doing
multiqueue. So, while vhost-scsi will create num_queue vqs every
IO on every CPU we are using has to be sent from and complete
on this one thread on one CPU and can't fully utlize the multiple
queues above and below us.

With the null_blk driver we max out at 360K IOPs when doing a random
workload like:

fio --direct=1 --rw=randrw --bs=4k --ioengine=libaio \
--iodepth=VQ_QUEUE_DEPTH --numjobs=NUM_VQS --filename  /dev/sdXYZ

where NUM_VQS gets up to 8 (number of cores per numa node on my system)
and VQ_QUEUE_DEPTH can be anywhere from 32 to 128.

With the patches in this set and the patches to remove the sess_cmd_lock
and execution_lock from lio's IO path in the SCSI tree for 5.11, we are
able to get IOPs from a single LUN up to 700K.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 260 +++++++++++++++++++++++++++++++++++++++-----------
 drivers/vhost/vhost.h |  14 ++-
 2 files changed, 217 insertions(+), 57 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 78d9535..d229515 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,16 +231,47 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_work_dev_flush(struct vhost_dev *dev)
+static void vhost_work_queue_on(struct vhost_dev *dev, struct vhost_work *work,
+				int worker_id)
+{
+	if (!dev->num_workers)
+		return;
+
+	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
+		/* We can only add the work to the list after we're
+		 * sure it was not in the list.
+		 * test_and_set_bit() implies a memory barrier.
+		 */
+		llist_add(&work->node, &dev->workers[worker_id]->work_list);
+		wake_up_process(dev->workers[worker_id]->task);
+	}
+}
+
+void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+{
+	vhost_work_queue_on(dev, work, 0);
+}
+EXPORT_SYMBOL_GPL(vhost_work_queue);
+
+static void vhost_work_flush_on(struct vhost_dev *dev, int worker_id)
 {
 	struct vhost_flush_struct flush;
 
-	if (dev->worker) {
-		init_completion(&flush.wait_event);
-		vhost_work_init(&flush.work, vhost_flush_work);
+	init_completion(&flush.wait_event);
+	vhost_work_init(&flush.work, vhost_flush_work);
+
+	vhost_work_queue_on(dev, &flush.work, worker_id);
+	wait_for_completion(&flush.wait_event);
+}
+
+void vhost_work_dev_flush(struct vhost_dev *dev)
+{
+	int i;
 
-		vhost_work_queue(dev, &flush.work);
-		wait_for_completion(&flush.wait_event);
+	for (i = 0; i < dev->num_workers; i++) {
+		if (!dev->workers[i])
+			continue;
+		vhost_work_flush_on(dev, i);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
@@ -253,26 +284,18 @@ void vhost_poll_flush(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
 {
-	if (!dev->worker)
-		return;
+	int i;
 
-	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
-		/* We can only add the work to the list after we're
-		 * sure it was not in the list.
-		 * test_and_set_bit() implies a memory barrier.
-		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+	for (i = 0; i < dev->num_workers; i++) {
+		if (dev->workers[i] &&
+		    !llist_empty(&dev->workers[i]->work_list))
+			return true;
 	}
-}
-EXPORT_SYMBOL_GPL(vhost_work_queue);
 
-/* A lockless hint for busy polling code to exit the loop */
-bool vhost_has_work(struct vhost_dev *dev)
-{
-	return !llist_empty(&dev->work_list);
+	return false;
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +366,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -357,8 +381,7 @@ static int vhost_worker(void *data)
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
-
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -481,13 +504,13 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->umem = NULL;
 	dev->iotlb = NULL;
 	dev->mm = NULL;
-	dev->worker = NULL;
+	dev->workers = NULL;
+	dev->num_workers = 0;
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -500,6 +523,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->indirect = NULL;
 		vq->heads = NULL;
 		vq->dev = dev;
+		vq->worker_id = 0;
 		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
@@ -531,14 +555,14 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups(struct vhost_dev *dev)
+static int vhost_attach_cgroups_on(struct vhost_dev *dev, int worker_id)
 {
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
 	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_dev_flush(dev);
+	vhost_work_queue_on(dev, &attach.work, worker_id);
+	vhost_work_flush_on(dev, worker_id);
 	return attach.ret;
 }
 
@@ -579,10 +603,153 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev, int worker_id)
+{
+	struct vhost_worker *worker;
+
+	worker = dev->workers[worker_id];
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+
+	dev->workers[worker_id] = NULL;
+}
+
+void vhost_vq_worker_remove(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+	/*
+	 * vqs may share a worker and so this might have been removed already.
+	 */
+	if (!dev->workers[vq->worker_id])
+		return;
+
+	vhost_worker_free(dev, vq->worker_id);
+	dev->num_workers--;
+
+	vq->worker_id = 0;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_worker_remove);
+
+static void vhost_workers_free(struct vhost_dev *dev)
+{
+	int i;
+
+	if (!dev->workers)
+		return;
+
+	for (i = 0; i < dev->nvqs; i++) {
+		if (!dev->num_workers)
+			break;
+		vhost_vq_worker_remove(dev, dev->vqs[i]);
+	}
+
+	kfree(dev->workers);
+	dev->workers = NULL;
+}
+
+static int vhost_worker_create(struct vhost_dev *dev, int worker_id)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	if (!worker)
+		return -ENOMEM;
+
+	init_llist_head(&worker->work_list);
+	worker->dev = dev;
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	dev->workers[worker_id] = worker;
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+	return 0;
+
+free_worker:
+	kfree(worker);
+	return ret;
+}
+
+/**
+ * vhost_vq_worker_add - create a new worker and add it to workers[]
+ * @dev: vhost device
+ * @vq: optional virtqueue to bind worker to.
+ *
+ * Caller must have the device mutex and have stopped operations that
+ * can access the workers array.
+ */
+int vhost_vq_worker_add(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+	struct mm_struct *mm;
+	bool owner_match = true;
+	int err, worker_id;
+
+	if (vq && vq->worker_id)
+		return -EINVAL;
+
+	if (vhost_dev_has_owner(dev)) {
+		mm = get_task_mm(current);
+		if (mm != dev->mm)
+			owner_match = false;
+		mmput(mm);
+		if (!owner_match)
+			return -EBUSY;
+	}
+
+	worker_id = dev->num_workers;
+	err = vhost_worker_create(dev, worker_id);
+	if (err)
+		return -ENOMEM;
+	dev->num_workers++;
+
+	err = vhost_attach_cgroups_on(dev, worker_id);
+	if (err)
+		goto free_worker;
+
+	if (vq)
+		vq->worker_id = worker_id;
+	return 0;
+
+free_worker:
+	dev->num_workers--;
+	vhost_worker_free(dev, worker_id);
+	return err;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_worker_add);
+
+static int vhost_workers_create(struct vhost_dev *dev)
+{
+	int err;
+
+	dev->workers = kcalloc(dev->nvqs, sizeof(struct vhost_worker *),
+			       GFP_KERNEL);
+	if (!dev->workers)
+		return -ENOMEM;
+	/*
+	 * All drivers that set use_worker=true use at least one worker that
+	 * may be bound to multiple vqs. Drivers like vhost-scsi may override
+	 * this later.
+	 */
+	err = vhost_vq_worker_add(dev, NULL);
+	if (err)
+		goto free_workers;
+	return 0;
+
+free_workers:
+	kfree(dev->workers);
+	dev->workers = NULL;
+	return err;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -595,31 +762,18 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_workers_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_workers_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
 	dev->kcov_handle = 0;
@@ -712,12 +866,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_workers_free(dev);
+	dev->kcov_handle = 0;
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 575c818..f334e90 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -16,6 +16,7 @@
 #include <linux/irqbypass.h>
 
 struct vhost_work;
+struct vhost_virtqueue;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 
 #define VHOST_WORK_QUEUED 1
@@ -25,6 +26,12 @@ struct vhost_work {
 	unsigned long		  flags;
 };
 
+struct vhost_worker {
+	struct task_struct *task;
+	struct llist_head work_list;
+	struct vhost_dev *dev;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -39,6 +46,8 @@ struct vhost_poll {
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+int vhost_vq_worker_add(struct vhost_dev *dev, struct vhost_virtqueue *vq);
+void vhost_vq_worker_remove(struct vhost_dev *dev, struct vhost_virtqueue *vq);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     __poll_t mask, struct vhost_dev *dev);
@@ -84,6 +93,7 @@ struct vhost_virtqueue {
 
 	struct vhost_poll poll;
 
+	int worker_id;
 	/* The routine to call when the Guest pings us, or timeout. */
 	vhost_work_fn_t handle_kick;
 
@@ -149,8 +159,8 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker **workers;
+	int num_workers;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 05/10] vhost: poll support support multiple workers
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (4 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 04/10] vhost: support multiple worker threads Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 15:32   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 06/10] vhost scsi: make SCSI cmd completion per vq Mike Christie
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

The final patches are going to have vhost scsi create a vhost worker
per IO vq. This patch converts the poll code to poll and queue work on
the worker that is tied to the vq (in this patch we maintain the old
behavior where all vqs use a single worker).

For drivers that do not convert over to the multiple worker support
or for the case where the user just does not want to allocate the
resources then we maintain support for the single worker case.

Note: This adds a new function vhost_vq_work_queue. It's used by
this patch and also the next one, so I exported it here.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/net.c   |  6 ++++--
 drivers/vhost/vhost.c | 14 +++++++++++---
 drivers/vhost/vhost.h |  6 ++++--
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 531a00d..6a27fe6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1330,8 +1330,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
 		       NULL);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 	n->page_frag.page = NULL;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d229515..9eeb8c7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -187,13 +187,15 @@ void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev)
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
 	poll->dev = dev;
 	poll->wqh = NULL;
+	poll->vq = vq;
 
 	vhost_work_init(&poll->work, fn);
 }
@@ -284,6 +286,12 @@ void vhost_poll_flush(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+	vhost_work_queue_on(vq->dev, work, vq->worker_id);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
@@ -301,7 +309,7 @@ bool vhost_has_work(struct vhost_dev *dev)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_vq_work_queue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
@@ -528,7 +536,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					EPOLLIN, dev);
+					EPOLLIN, dev, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index f334e90..232c5f9 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -33,7 +33,6 @@ struct vhost_worker {
 };
 
 /* Poll a file (eventfd or socket) */
-/* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
 	poll_table		table;
 	wait_queue_head_t	*wqh;
@@ -41,16 +40,19 @@ struct vhost_poll {
 	struct vhost_work	work;
 	__poll_t		mask;
 	struct vhost_dev	*dev;
+	struct vhost_virtqueue	*vq;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 int vhost_vq_worker_add(struct vhost_dev *dev, struct vhost_virtqueue *vq);
 void vhost_vq_worker_remove(struct vhost_dev *dev, struct vhost_virtqueue *vq);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev);
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 06/10] vhost scsi: make SCSI cmd completion per vq
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (5 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 05/10] vhost: poll support support multiple workers Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 16:04   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 07/10] vhost, vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

In the last patches we are going to have a worker thread per IO vq.
This patch separates the scsi cmd completion code paths so we can
complete cmds based on their vq instead of having all cmds complete
on the same worker thread.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 48 +++++++++++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4725a08..2bbe1a8 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -178,6 +178,7 @@ enum {
 
 struct vhost_scsi_virtqueue {
 	struct vhost_virtqueue vq;
+	struct vhost_scsi *vs;
 	/*
 	 * Reference counting for inflight reqs, used for flush operation. At
 	 * each time, one reference tracks new commands submitted, while we
@@ -192,6 +193,9 @@ struct vhost_scsi_virtqueue {
 	struct vhost_scsi_cmd *scsi_cmds;
 	struct sbitmap scsi_tags;
 	int max_cmds;
+
+	struct vhost_work completion_work;
+	struct llist_head completion_list;
 };
 
 struct vhost_scsi {
@@ -202,9 +206,6 @@ struct vhost_scsi {
 	struct vhost_dev dev;
 	struct vhost_scsi_virtqueue vqs[VHOST_SCSI_MAX_VQ];
 
-	struct vhost_work vs_completion_work; /* cmd completion work item */
-	struct llist_head vs_completion_list; /* cmd completion queue */
-
 	struct vhost_work vs_event_work; /* evt injection work item */
 	struct llist_head vs_event_list; /* evt injection queue */
 
@@ -380,10 +381,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 	} else {
 		struct vhost_scsi_cmd *cmd = container_of(se_cmd,
 					struct vhost_scsi_cmd, tvc_se_cmd);
-		struct vhost_scsi *vs = cmd->tvc_vhost;
+		struct vhost_scsi_virtqueue *svq =  container_of(cmd->tvc_vq,
+					struct vhost_scsi_virtqueue, vq);
 
-		llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-		vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+		llist_add(&cmd->tvc_completion_list, &svq->completion_list);
+		vhost_vq_work_queue(&svq->vq, &svq->completion_work);
 	}
 }
 
@@ -545,18 +547,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
  */
 static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
-	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-					vs_completion_work);
-	DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
+	struct vhost_scsi_virtqueue *svq = container_of(work,
+				struct vhost_scsi_virtqueue, completion_work);
 	struct virtio_scsi_cmd_resp v_rsp;
 	struct vhost_scsi_cmd *cmd, *t;
 	struct llist_node *llnode;
 	struct se_cmd *se_cmd;
 	struct iov_iter iov_iter;
-	int ret, vq;
+	bool signal = false;
+	int ret;
 
-	bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
-	llnode = llist_del_all(&vs->vs_completion_list);
+	llnode = llist_del_all(&svq->completion_list);
 	llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
 		se_cmd = &cmd->tvc_se_cmd;
 
@@ -576,21 +577,16 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 			      cmd->tvc_in_iovs, sizeof(v_rsp));
 		ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
 		if (likely(ret = sizeof(v_rsp))) {
-			struct vhost_scsi_virtqueue *q;
+			signal = true;
 			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
-			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
-			vq = q - vs->vqs;
-			__set_bit(vq, signal);
 		} else
 			pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
 		vhost_scsi_release_cmd_res(se_cmd);
 	}
 
-	vq = -1;
-	while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
-		< VHOST_SCSI_MAX_VQ)
-		vhost_signal(&vs->dev, &vs->vqs[vq].vq);
+	if (signal)
+		vhost_signal(&svq->vs->dev, &svq->vq);
 }
 
 static struct vhost_scsi_cmd *
@@ -1799,6 +1795,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
+	struct vhost_scsi_virtqueue *svq;
 	struct vhost_scsi *vs;
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
@@ -1814,7 +1811,6 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
 	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
@@ -1825,8 +1821,14 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	vs->vqs[VHOST_SCSI_VQ_CTL].vq.handle_kick = vhost_scsi_ctl_handle_kick;
 	vs->vqs[VHOST_SCSI_VQ_EVT].vq.handle_kick = vhost_scsi_evt_handle_kick;
 	for (i = VHOST_SCSI_VQ_IO; i < VHOST_SCSI_MAX_VQ; i++) {
-		vqs[i] = &vs->vqs[i].vq;
-		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
+		svq = &vs->vqs[i];
+
+		vqs[i] = &svq->vq;
+		svq->vs = vs;
+		init_llist_head(&svq->completion_list);
+		vhost_work_init(&svq->completion_work,
+				vhost_scsi_complete_cmd_work);
+		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
 	vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
 		       VHOST_SCSI_WEIGHT, 0, true, NULL);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 07/10] vhost, vhost-scsi: flush IO vqs then send TMF rsp
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (6 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 06/10] vhost scsi: make SCSI cmd completion per vq Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 16:05   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 08/10] vhost: move msg_handler to new ops struct Mike Christie
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

With one worker we will always send the scsi cmd responses then send the
TMF rsp, because LIO will always complete the scsi cmds first which
calls vhost_scsi_release_cmd to add them to the work queue.

When the next patch adds multiple worker support, the worker threads
could still be sending their responses when the tmf's work is run.
So this patch has vhost-scsi flush the IO vqs on other worker threads
before we send the tmf response.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c  | 16 ++++++++++++++--
 drivers/vhost/vhost.c |  6 ++++++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 2bbe1a8..612359d 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1178,11 +1178,23 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work)
 	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
 						  vwork);
 	int resp_code;
+	int i;
+
+	if (tmf->se_cmd.se_tmr_req->response = TMR_FUNCTION_COMPLETE) {
+		/*
+		 * When processing a TMF, lio completes the cmds then the
+		 * TMF, so with one worker the TMF always completes after
+		 * cmds. For multiple worker support, we must flush the
+		 * IO vqs that do not share a worker with the ctl vq (vqs
+		 * 3 and up) to make sure they have completed their cmds.
+		 */
+		for (i = 1; i < tmf->vhost->dev.num_workers; i++)
+			vhost_vq_work_flush(&tmf->vhost->vqs[i + VHOST_SCSI_VQ_IO].vq);
 
-	if (tmf->se_cmd.se_tmr_req->response = TMR_FUNCTION_COMPLETE)
 		resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-	else
+	} else {
 		resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+	}
 
 	vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 				 tmf->vq_desc, &tmf->resp_iov, resp_code);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9eeb8c7..6a6abfc 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -278,6 +278,12 @@ void vhost_work_dev_flush(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
+void vhost_vq_work_flush(struct vhost_virtqueue *vq)
+{
+	vhost_work_flush_on(vq->dev, vq->worker_id);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_flush);
+
 /* Flush any work that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 232c5f9..0837133 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -46,6 +46,7 @@ struct vhost_poll {
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+void vhost_vq_work_flush(struct vhost_virtqueue *vq);
 void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 int vhost_vq_worker_add(struct vhost_dev *dev, struct vhost_virtqueue *vq);
 void vhost_vq_worker_remove(struct vhost_dev *dev, struct vhost_virtqueue *vq);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 08/10] vhost: move msg_handler to new ops struct
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (7 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 07/10] vhost, vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 16:08   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 09/10] vhost: add VHOST_SET_VRING_ENABLE support Mike Christie
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

The next patch adds a callout so drivers can perform some action when we
get a VHOST_SET_VRING_ENABLE, so this patch moves the msg_handler callout
to a new vhost_dev_ops struct just to keep all the callouts better
organized.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vdpa.c  |  7 +++++--
 drivers/vhost/vhost.c | 10 ++++------
 drivers/vhost/vhost.h | 11 ++++++-----
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 2754f30..f271f42 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -802,6 +802,10 @@ static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v)
 	}
 }
 
+static struct vhost_dev_ops vdpa_dev_ops = {
+	.msg_handler	= vhost_vdpa_process_iotlb_msg,
+};
+
 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
 {
 	struct vhost_vdpa *v;
@@ -829,8 +833,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
 		vqs[i] = &v->vqs[i];
 		vqs[i]->handle_kick = handle_vq_kick;
 	}
-	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
-		       vhost_vdpa_process_iotlb_msg);
+	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, &vdpa_dev_ops);
 
 	dev->iotlb = vhost_iotlb_alloc(0, 0);
 	if (!dev->iotlb) {
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6a6abfc..2f98b81 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -504,9 +504,7 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq,
 void vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue **vqs, int nvqs,
 		    int iov_limit, int weight, int byte_weight,
-		    bool use_worker,
-		    int (*msg_handler)(struct vhost_dev *dev,
-				       struct vhost_iotlb_msg *msg))
+		    bool use_worker, struct vhost_dev_ops *ops)
 {
 	struct vhost_virtqueue *vq;
 	int i;
@@ -524,7 +522,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
-	dev->msg_handler = msg_handler;
+	dev->ops = ops;
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -1328,8 +1326,8 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 		goto done;
 	}
 
-	if (dev->msg_handler)
-		ret = dev->msg_handler(dev, &msg);
+	if (dev->ops && dev->ops->msg_handler)
+		ret = dev->ops->msg_handler(dev, &msg);
 	else
 		ret = vhost_process_iotlb_msg(dev, &msg);
 	if (ret) {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0837133..a293f48 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -156,6 +156,10 @@ struct vhost_msg_node {
   struct list_head node;
 };
 
+struct vhost_dev_ops {
+	int (*msg_handler)(struct vhost_dev *dev, struct vhost_iotlb_msg *msg);
+};
+
 struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
@@ -175,16 +179,13 @@ struct vhost_dev {
 	int byte_weight;
 	u64 kcov_handle;
 	bool use_worker;
-	int (*msg_handler)(struct vhost_dev *dev,
-			   struct vhost_iotlb_msg *msg);
+	struct vhost_dev_ops *ops;
 };
 
 bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
 		    int nvqs, int iov_limit, int weight, int byte_weight,
-		    bool use_worker,
-		    int (*msg_handler)(struct vhost_dev *dev,
-				       struct vhost_iotlb_msg *msg));
+		    bool use_worker, struct vhost_dev_ops *ops);
 long vhost_dev_set_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 09/10] vhost: add VHOST_SET_VRING_ENABLE support
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (8 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 08/10] vhost: move msg_handler to new ops struct Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 16:14   ` Stefan Hajnoczi
  2020-11-12 23:19 ` [PATCH 10/10] vhost-scsi: create a woker per IO vq Mike Christie
  2020-11-17 16:40 ` [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Stefan Hajnoczi
  11 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

This adds a new ioctl VHOST_SET_VRING_ENABLE that the vhost drivers can
implement a callout for and execute an operation when the vq is
enabled/disabled.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c      | 25 +++++++++++++++++++++++++
 drivers/vhost/vhost.h      |  1 +
 include/uapi/linux/vhost.h |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2f98b81..e953031 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1736,6 +1736,28 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 
 	return r;
 }
+
+static long vhost_vring_set_enable(struct vhost_dev *d,
+				   struct vhost_virtqueue *vq,
+				   void __user *argp)
+{
+	struct vhost_vring_state s;
+	int ret = 0;
+
+	if (vq->private_data)
+		return -EBUSY;
+
+	if (copy_from_user(&s, argp, sizeof s))
+		return -EFAULT;
+
+	if (s.num != 1 && s.num != 0)
+		return -EINVAL;
+
+	if (d->ops && d->ops->enable_vring)
+		ret = d->ops->enable_vring(vq, s.num);
+	return ret;
+}
+
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
 	struct file *eventfp, *filep = NULL;
@@ -1765,6 +1787,9 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 	mutex_lock(&vq->mutex);
 
 	switch (ioctl) {
+	case VHOST_SET_VRING_ENABLE:
+		r = vhost_vring_set_enable(d, vq, argp);
+		break;
 	case VHOST_SET_VRING_BASE:
 		/* Moving base with an active backend?
 		 * You don't want to do that. */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a293f48..1279c09 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -158,6 +158,7 @@ struct vhost_msg_node {
 
 struct vhost_dev_ops {
 	int (*msg_handler)(struct vhost_dev *dev, struct vhost_iotlb_msg *msg);
+	int (*enable_vring)(struct vhost_virtqueue *vq, bool enable);
 };
 
 struct vhost_dev {
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c998860..3ffd133 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -70,6 +70,7 @@
 #define VHOST_VRING_BIG_ENDIAN 1
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+#define VHOST_SET_VRING_ENABLE _IOW(VHOST_VIRTIO, 0x15, struct vhost_vring_state)
 
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 10/10] vhost-scsi: create a woker per IO vq
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (9 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 09/10] vhost: add VHOST_SET_VRING_ENABLE support Mike Christie
@ 2020-11-12 23:19 ` Mike Christie
  2020-11-17 16:40 ` [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Stefan Hajnoczi
  11 siblings, 0 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-12 23:19 UTC (permalink / raw)
  To: stefanha, qemu-devel, fam, linux-scsi, target-devel, mst,
	jasowang, pbonzini, virtualization

This patch has vhost-scsi create a worker thread per IO vq.
It also adds a modparam to enable the feature, because I was thinking
existing setups might not be expecting the extra threading use, so the
default is to use the old single thread multiple vq behavior.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 612359d..3fb147f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -62,6 +62,12 @@
  */
 #define VHOST_SCSI_WEIGHT 256
 
+static bool vhost_scsi_worker_per_io_vq;
+module_param_named(thread_per_io_virtqueue, vhost_scsi_worker_per_io_vq, bool,
+		   0644);
+MODULE_PARM_DESC(thread_per_io_virtqueue,
+		 "Create a worker thread per IO virtqueue. Set to true to turn on. Default is false where all virtqueues share a thread.");
+
 struct vhost_scsi_inflight {
 	/* Wait for the flush operation to finish */
 	struct completion comp;
@@ -1805,6 +1811,36 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 	return 0;
 }
 
+static int vhost_scsi_enable_vring(struct vhost_virtqueue *vq, bool enable)
+{
+	struct vhost_scsi *vs = container_of(vq->dev, struct vhost_scsi, dev);
+	/*
+	 * For compat, we have the evt, ctl and first IO vq share worker0 like
+	 * is setup by default. Additional vqs get their own worker.
+	 */
+	if (vq = &vs->vqs[VHOST_SCSI_VQ_CTL].vq ||
+	    vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq ||
+	    vq = &vs->vqs[VHOST_SCSI_VQ_IO].vq)
+		return 0;
+
+	if (enable) {
+		if (!vhost_scsi_worker_per_io_vq)
+			return 0;
+		if (vq->worker_id != 0)
+			return 0;
+		return vhost_vq_worker_add(vq->dev, vq);
+	} else {
+		if (vq->worker_id = 0)
+			return 0;
+		vhost_vq_worker_remove(vq->dev, vq);
+		return 0;
+	}
+}
+
+static struct vhost_dev_ops vhost_scsi_dev_ops = {
+	.enable_vring = vhost_scsi_enable_vring,
+};
+
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
 	struct vhost_scsi_virtqueue *svq;
@@ -1843,7 +1879,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
 	vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
-		       VHOST_SCSI_WEIGHT, 0, true, NULL);
+		       VHOST_SCSI_WEIGHT, 0, true, &vhost_scsi_dev_ops);
 
 	vhost_scsi_init_inflight(vs, NULL);
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support
  2020-11-12 23:19 ` [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support Mike Christie
@ 2020-11-17 11:53   ` Stefan Hajnoczi
  2020-12-02  9:59   ` Michael S. Tsirkin
  1 sibling, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 11:53 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 854 bytes --]

On Thu, Nov 12, 2020 at 05:19:00PM -0600, Mike Christie wrote:
> +static int vhost_kernel_set_vring_enable(struct vhost_dev *dev, int enable)
> +{
> +    struct vhost_vring_state s;
> +    int i, ret;
> +
> +    s.num = 1;
> +    for (i = 0; i < dev->nvqs; ++i) {
> +        s.index = i;
> +
> +        ret = vhost_kernel_call(dev, VHOST_SET_VRING_ENABLE, &s);
> +        /* Ignore kernels that do not support the cmd */
> +        if (ret == -EPERM)
> +            return 0;
> +        if (ret)
> +            goto disable_vrings;
> +    }

The 'enable' argument is ignored and this function acts on all
virtqueues, while the ioctl acts on a single virtqueue only.

This function's behavior is actually "vhost_kernel_enable_vrings()"
(plural), not "vhost_kernel_set_vring_enable()" (singular).

Please rename this function and drop the enable argument.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/10] vhost: remove work arg from vhost_work_flush
  2020-11-12 23:19 ` [PATCH 01/10] vhost: remove work arg from vhost_work_flush Mike Christie
@ 2020-11-17 13:04   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 13:04 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 654 bytes --]

On Thu, Nov 12, 2020 at 05:19:01PM -0600, Mike Christie wrote:
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index f22fce5..8795fd3 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1468,8 +1468,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
>  	/* Flush both the vhost poll and vhost work */
>  	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
>  		vhost_scsi_flush_vq(vs, i);
> -	vhost_work_flush(&vs->dev, &vs->vs_completion_work);
> -	vhost_work_flush(&vs->dev, &vs->vs_event_work);
> +	vhost_work_dev_flush(&vs->dev);
> +	vhost_work_dev_flush(&vs->dev);

These two calls can be combined into a single call now.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] vhost scsi: remove extra flushes
  2020-11-12 23:19 ` [PATCH 02/10] vhost scsi: remove extra flushes Mike Christie
@ 2020-11-17 13:07   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 13:07 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 642 bytes --]

On Thu, Nov 12, 2020 at 05:19:02PM -0600, Mike Christie wrote:
> The vhost work flush function was flushing the entire work queue, so
> there is no need for the double vhost_work_dev_flush calls in
> vhost_scsi_flush.
> 
> And we do not need to call vhost_poll_flush for each poller because
> that call also ends up flushing the same work queue thread the
> vhost_work_dev_flush call flushed.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/scsi.c | 8 --------
>  1 file changed, 8 deletions(-)

Ah, this was done as a separate step:

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] vhost poll: fix coding style
  2020-11-12 23:19 ` [PATCH 03/10] vhost poll: fix coding style Mike Christie
@ 2020-11-17 13:07   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 13:07 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 419 bytes --]

On Thu, Nov 12, 2020 at 05:19:03PM -0600, Mike Christie wrote:
> We use like 3 coding styles in this struct. Switch to just tabs.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> ---
>  drivers/vhost/vhost.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/10] vhost: poll support support multiple workers
  2020-11-12 23:19 ` [PATCH 05/10] vhost: poll support support multiple workers Mike Christie
@ 2020-11-17 15:32   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 15:32 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 1071 bytes --]

On Thu, Nov 12, 2020 at 05:19:05PM -0600, Mike Christie wrote:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index d229515..9eeb8c7 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -187,13 +187,15 @@ void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
>  
>  /* Init poll structure */
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     __poll_t mask, struct vhost_dev *dev)
> +		     __poll_t mask, struct vhost_dev *dev,
> +		     struct vhost_virtqueue *vq)
>  {
>  	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
>  	init_poll_funcptr(&poll->table, vhost_poll_func);
>  	poll->mask = mask;
>  	poll->dev = dev;
>  	poll->wqh = NULL;
> +	poll->vq = vq;
>  
>  	vhost_work_init(&poll->work, fn);
>  }

Tying the poll mechanism to vqs rather than directly to vhost_worker
seems okay for now, but it might be necessary to change this later if
drivers want more flexibility about poll something that's not tied to a
vq or that uses worker 0.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/10] vhost scsi: make SCSI cmd completion per vq
  2020-11-12 23:19 ` [PATCH 06/10] vhost scsi: make SCSI cmd completion per vq Mike Christie
@ 2020-11-17 16:04   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 16:04 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 565 bytes --]

On Thu, Nov 12, 2020 at 05:19:06PM -0600, Mike Christie wrote:
> In the last patches we are going to have a worker thread per IO vq.
> This patch separates the scsi cmd completion code paths so we can
> complete cmds based on their vq instead of having all cmds complete
> on the same worker thread.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/scsi.c | 48 +++++++++++++++++++++++++-----------------------
>  1 file changed, 25 insertions(+), 23 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/10] vhost, vhost-scsi: flush IO vqs then send TMF rsp
  2020-11-12 23:19 ` [PATCH 07/10] vhost, vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
@ 2020-11-17 16:05   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 16:05 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 829 bytes --]

On Thu, Nov 12, 2020 at 05:19:07PM -0600, Mike Christie wrote:
> With one worker we will always send the scsi cmd responses then send the
> TMF rsp, because LIO will always complete the scsi cmds first which
> calls vhost_scsi_release_cmd to add them to the work queue.
> 
> When the next patch adds multiple worker support, the worker threads
> could still be sending their responses when the tmf's work is run.
> So this patch has vhost-scsi flush the IO vqs on other worker threads
> before we send the tmf response.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/scsi.c  | 16 ++++++++++++++--
>  drivers/vhost/vhost.c |  6 ++++++
>  drivers/vhost/vhost.h |  1 +
>  3 files changed, 21 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/10] vhost: move msg_handler to new ops struct
  2020-11-12 23:19 ` [PATCH 08/10] vhost: move msg_handler to new ops struct Mike Christie
@ 2020-11-17 16:08   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 16:08 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 607 bytes --]

On Thu, Nov 12, 2020 at 05:19:08PM -0600, Mike Christie wrote:
> The next patch adds a callout so drivers can perform some action when we
> get a VHOST_SET_VRING_ENABLE, so this patch moves the msg_handler callout
> to a new vhost_dev_ops struct just to keep all the callouts better
> organized.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/vdpa.c  |  7 +++++--
>  drivers/vhost/vhost.c | 10 ++++------
>  drivers/vhost/vhost.h | 11 ++++++-----
>  3 files changed, 15 insertions(+), 13 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost: add VHOST_SET_VRING_ENABLE support
  2020-11-12 23:19 ` [PATCH 09/10] vhost: add VHOST_SET_VRING_ENABLE support Mike Christie
@ 2020-11-17 16:14   ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 16:14 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

On Thu, Nov 12, 2020 at 05:19:09PM -0600, Mike Christie wrote:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2f98b81..e953031 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1736,6 +1736,28 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
>  
>  	return r;
>  }
> +
> +static long vhost_vring_set_enable(struct vhost_dev *d,
> +				   struct vhost_virtqueue *vq,
> +				   void __user *argp)
> +{
> +	struct vhost_vring_state s;
> +	int ret = 0;
> +
> +	if (vq->private_data)
> +		return -EBUSY;
> +
> +	if (copy_from_user(&s, argp, sizeof s))
> +		return -EFAULT;
> +
> +	if (s.num != 1 && s.num != 0)
> +		return -EINVAL;
> +
> +	if (d->ops && d->ops->enable_vring)
> +		ret = d->ops->enable_vring(vq, s.num);
> +	return ret;
> +}

Silently ignoring this ioctl on drivers that don't implement
d->ops->enable_vring() could be a problem. Userspace expects to be able
to enable/disable vqs, we can't just return 0 because the vq won't be
enabled/disabled as requested.

> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index a293f48..1279c09 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -158,6 +158,7 @@ struct vhost_msg_node {
>  
>  struct vhost_dev_ops {
>  	int (*msg_handler)(struct vhost_dev *dev, struct vhost_iotlb_msg *msg);
> +	int (*enable_vring)(struct vhost_virtqueue *vq, bool enable);

Please add doc comments explaining what this callback needs to do and
the environment in which it is executed (locks that are held, etc).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
                   ` (10 preceding siblings ...)
  2020-11-12 23:19 ` [PATCH 10/10] vhost-scsi: create a woker per IO vq Mike Christie
@ 2020-11-17 16:40 ` Stefan Hajnoczi
  2020-11-17 19:13   ` Mike Christie
  2020-11-18  5:17   ` Jason Wang
  11 siblings, 2 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-17 16:40 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 2458 bytes --]

On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
> The following kernel patches were made over Michael's vhost branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost
> 
> and the vhost-scsi bug fix patchset:
> 
> https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/#t
> 
> And the qemu patch was made over the qemu master branch.
> 
> vhost-scsi currently supports multiple queues with the num_queues
> setting, but we end up with a setup where the guest's scsi/block
> layer can do a queue per vCPU and the layers below vhost can do
> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
> but all IO gets set on and completed on a single vhost-scsi thread.
> After 2 - 4 vqs this becomes a bottleneck.
> 
> This patchset allows us to create a worker thread per IO vq, so we
> can better utilize multiple CPUs with the multiple queues. It
> implments Jason's suggestion to create the initial worker like
> normal, then create the extra workers for IO vqs with the
> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.

How does userspace find out the tids and set their CPU affinity?

What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
really "enable" or "disable" the vq, requests are processed regardless.

The purpose of the ioctl isn't clear to me because the kernel could
automatically create 1 thread per vq without a new ioctl. On the other
hand, if userspace is supposed to control worker threads then a
different interface would be more powerful:

  struct vhost_vq_worker_info {
      /*
       * The pid of an existing vhost worker that this vq will be
       * assigned to. When pid is 0 the virtqueue is assigned to the
       * default vhost worker. When pid is -1 a new worker thread is
       * created for this virtqueue. When pid is -2 the virtqueue's
       * worker thread is unchanged.
       *
       * If a vhost worker no longer has any virtqueues assigned to it
       * then it will terminate.
       *
       * The pid of the vhost worker is stored to this field when the
       * ioctl completes successfully. Use pid -2 to query the current
       * vhost worker pid.
       */
      __kernel_pid_t pid;  /* in/out */

      /* The virtqueue index*/
      unsigned int vq_idx; /* in */
  };

  ioctl(vhost_fd, VHOST_SET_VQ_WORKER, &info);

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-17 16:40 ` [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Stefan Hajnoczi
@ 2020-11-17 19:13   ` Mike Christie
  2020-11-18  9:54     ` Michael S. Tsirkin
  2020-11-18 11:31     ` Stefan Hajnoczi
  2020-11-18  5:17   ` Jason Wang
  1 sibling, 2 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-17 19:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

On 11/17/20 10:40 AM, Stefan Hajnoczi wrote:
> On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
>> The following kernel patches were made over Michael's vhost branch:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost
>>
>> and the vhost-scsi bug fix patchset:
>>
>> https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/#t
>>
>> And the qemu patch was made over the qemu master branch.
>>
>> vhost-scsi currently supports multiple queues with the num_queues
>> setting, but we end up with a setup where the guest's scsi/block
>> layer can do a queue per vCPU and the layers below vhost can do
>> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
>> but all IO gets set on and completed on a single vhost-scsi thread.
>> After 2 - 4 vqs this becomes a bottleneck.
>>
>> This patchset allows us to create a worker thread per IO vq, so we
>> can better utilize multiple CPUs with the multiple queues. It
>> implments Jason's suggestion to create the initial worker like
>> normal, then create the extra workers for IO vqs with the
>> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
> 
> How does userspace find out the tids and set their CPU affinity?
> 

When we create the worker thread we add it to the device owner's cgroup,
so we end up inheriting those settings like affinity.

However, are you more asking about finer control like if the guest is
doing mq, and the mq hw queue is bound to cpu0, it would perform
better if we could bind vhost vq's worker thread to cpu0? I think the
problem might is if you are in the cgroup then we can't set a specific
threads CPU affinity to just one specific CPU. So you can either do
cgroups or not.


> What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
> really "enable" or "disable" the vq, requests are processed regardless.
> 

Yeah, I agree. The problem I've mentioned before is:

1. For net and vsock, it's not useful because the vqs are hard coded in
the kernel and userspace, so you can't disable a vq and you never need
to enable one.

2. vdpa has it's own enable ioctl.

3. For scsi, because we already are doing multiple vqs based on the
num_queues value, we have to have some sort of compat support and
code to detect if userspace is even going to send the new ioctl.
In this patchset, compat just meant enable/disable the extra functionality
of extra worker threads for a vq. We will still use the vq if
userspace set it up.


> The purpose of the ioctl isn't clear to me because the kernel could
> automatically create 1 thread per vq without a new ioctl. On the other
> hand, if userspace is supposed to control worker threads then a
> different interface would be more powerful:
> 

My preference has been:

1. If we were to ditch cgroups, then add a new interface that would allow
us to bind threads to a specific CPU, so that it lines up with the guest's
mq to CPU mapping.

2. If we continue with cgroups then I think just creating the worker
threads from vhost_scsi_set_endpoint is best, because that is the point
we do the other final vq setup ops vhost_vq_set_backend and
vhost_vq_init_access.

For option number 2 it would be simple. Instead of the vring enable patches:

[PATCH 08/10] vhost: move msg_handler to new ops struct
[PATCH 09/10] vhost: add VHOST_SET_VRING_ENABLE support
[PATCH 10/10] vhost-scsi: create a woker per IO vq
and
[PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support


we could do this patch like I had done in previous versions:


From bcc4c29c28daf04679ce6566d06845b9e1b31eb4 Mon Sep 17 00:00:00 2001
From: Mike Christie <michael.christie@oracle.com>
Date: Wed, 11 Nov 2020 22:50:56 -0600
Subject:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   vhost scsi: multiple worker support

This patch creates a worker per IO vq to fix an issue where after
2 vqs and/or multple luns the single worker thread becomes a
bottleneck due to the multiple queues/luns trying to execute/
complete their IO on the same thread/CPU. This patch allows us
to better match the guest and lower levels multiqueue setups.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 41 ++++++++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 44c108a..2c119d3 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1640,9 +1640,18 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 			vq = &vs->vqs[i].vq;
 			if (!vhost_vq_is_setup(vq))
 				continue;
+			/*
+			 * For compat, we have the evt, ctl and first IO vq
+			 * share worker0 like is setup by default. Additional
+			 * vqs get their own worker.
+			 */
+			if (i > VHOST_SCSI_VQ_IO) {
+				if (vhost_vq_worker_add(&vs->dev, vq))
+					goto cleanup_vq;
+			}
 
 			if (vhost_scsi_setup_vq_cmds(vq, vq->num))
-				goto destroy_vq_cmds;
+				goto cleanup_vq;
 		}
 
 		for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) {
@@ -1666,10 +1675,14 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	vs->vs_tpg = vs_tpg;
 	goto out;
 
-destroy_vq_cmds:
-	for (i--; i >= VHOST_SCSI_VQ_IO; i--) {
-		if (!vhost_vq_get_backend(&vs->vqs[i].vq))
-			vhost_scsi_destroy_vq_cmds(&vs->vqs[i].vq);
+cleanup_vq:
+	for (; i >= VHOST_SCSI_VQ_IO; i--) {
+		if (vhost_vq_get_backend(&vs->vqs[i].vq))
+			continue;
+
+		if (i > VHOST_SCSI_VQ_IO)
+			vhost_vq_worker_remove(&vs->dev, &vs->vqs[i].vq);
+		vhost_scsi_destroy_vq_cmds(&vs->vqs[i].vq);
 	}
 undepend:
 	for (i = 0; i < VHOST_SCSI_MAX_TARGET; i++) {
@@ -1752,14 +1765,24 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 			mutex_lock(&vq->mutex);
 			vhost_vq_set_backend(vq, NULL);
 			mutex_unlock(&vq->mutex);
+		}
+		vhost_scsi_flush(vs);
+
+		for (i = VHOST_SCSI_VQ_IO; i < VHOST_SCSI_MAX_VQ; i++) {
+			vq = &vs->vqs[i].vq;
+			if (!vhost_vq_is_setup(vq))
+				continue;
 			/*
-			 * Make sure cmds are not running before tearing them
-			 * down.
-			 */
-			vhost_scsi_flush(vs);
+			 * We only remove the extra workers we created in case
+			 * this is for a reboot. The default worker will be
+			 * removed at dev cleanup.
+			 */ 
+			if (i > VHOST_SCSI_VQ_IO)
+				vhost_vq_worker_remove(&vs->dev, vq);
 			vhost_scsi_destroy_vq_cmds(vq);
 		}
 	}
+
 	/*
 	 * Act as synchronize_rcu to make sure access to
 	 * old vs->vs_tpg is finished.
-- 
1.8.3.1






^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-17 16:40 ` [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Stefan Hajnoczi
  2020-11-17 19:13   ` Mike Christie
@ 2020-11-18  5:17   ` Jason Wang
  2020-11-18  6:57     ` Mike Christie
  1 sibling, 1 reply; 47+ messages in thread
From: Jason Wang @ 2020-11-18  5:17 UTC (permalink / raw)
  To: Stefan Hajnoczi, Mike Christie
  Cc: fam, linux-scsi, mst, qemu-devel, virtualization, target-devel, pbonzini


On 2020/11/18 上午12:40, Stefan Hajnoczi wrote:
> On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
>> The following kernel patches were made over Michael's vhost branch:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost
>>
>> and the vhost-scsi bug fix patchset:
>>
>> https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/#t
>>
>> And the qemu patch was made over the qemu master branch.
>>
>> vhost-scsi currently supports multiple queues with the num_queues
>> setting, but we end up with a setup where the guest's scsi/block
>> layer can do a queue per vCPU and the layers below vhost can do
>> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
>> but all IO gets set on and completed on a single vhost-scsi thread.
>> After 2 - 4 vqs this becomes a bottleneck.
>>
>> This patchset allows us to create a worker thread per IO vq, so we
>> can better utilize multiple CPUs with the multiple queues. It
>> implments Jason's suggestion to create the initial worker like
>> normal, then create the extra workers for IO vqs with the
>> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
> How does userspace find out the tids and set their CPU affinity?
>
> What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
> really "enable" or "disable" the vq, requests are processed regardless.


Actually I think it should do the real "enable/disable" that tries to 
follow the virtio spec.

(E.g both PCI and MMIO have something similar).


>
> The purpose of the ioctl isn't clear to me because the kernel could
> automatically create 1 thread per vq without a new ioctl.


It's not necessarily to create or destroy kthread according to 
VRING_ENABLE but could be a hint.


>   On the other
> hand, if userspace is supposed to control worker threads then a
> different interface would be more powerful:
>
>    struct vhost_vq_worker_info {
>        /*
>         * The pid of an existing vhost worker that this vq will be
>         * assigned to. When pid is 0 the virtqueue is assigned to the
>         * default vhost worker. When pid is -1 a new worker thread is
>         * created for this virtqueue. When pid is -2 the virtqueue's
>         * worker thread is unchanged.
>         *
>         * If a vhost worker no longer has any virtqueues assigned to it
>         * then it will terminate.
>         *
>         * The pid of the vhost worker is stored to this field when the
>         * ioctl completes successfully. Use pid -2 to query the current
>         * vhost worker pid.
>         */
>        __kernel_pid_t pid;  /* in/out */
>
>        /* The virtqueue index*/
>        unsigned int vq_idx; /* in */
>    };
>
>    ioctl(vhost_fd, VHOST_SET_VQ_WORKER, &info);


This seems to leave the question to userspace which I'm not sure it's 
good since it tries to introduce another scheduling layer.

Per vq worker seems be good enough to start with.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-18  5:17   ` Jason Wang
@ 2020-11-18  6:57     ` Mike Christie
  2020-11-18  7:19       ` Mike Christie
  2020-11-18  7:54       ` Jason Wang
  0 siblings, 2 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-18  6:57 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: fam, linux-scsi, mst, qemu-devel, virtualization, target-devel, pbonzini

On 11/17/20 11:17 PM, Jason Wang wrote:
> 
> On 2020/11/18 上午12:40, Stefan Hajnoczi wrote:
>> On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
>>> The following kernel patches were made over Michael's vhost branch:
>>>
>>> https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost__;!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NoRys_hU$
>>> and the vhost-scsi bug fix patchset:
>>>
>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/*t__;Iw!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NmuPE_m8$
>>> And the qemu patch was made over the qemu master branch.
>>>
>>> vhost-scsi currently supports multiple queues with the num_queues
>>> setting, but we end up with a setup where the guest's scsi/block
>>> layer can do a queue per vCPU and the layers below vhost can do
>>> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
>>> but all IO gets set on and completed on a single vhost-scsi thread.
>>> After 2 - 4 vqs this becomes a bottleneck.
>>>
>>> This patchset allows us to create a worker thread per IO vq, so we
>>> can better utilize multiple CPUs with the multiple queues. It
>>> implments Jason's suggestion to create the initial worker like
>>> normal, then create the extra workers for IO vqs with the
>>> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
>> How does userspace find out the tids and set their CPU affinity?
>>
>> What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
>> really "enable" or "disable" the vq, requests are processed regardless.
> 
> 
> Actually I think it should do the real "enable/disable" that tries to follow the virtio spec.
> 

What does real mean here? For the vdpa enable call for example, would it be like
ifcvf_vdpa_set_vq_ready where it sets the ready bit or more like mlx5_vdpa_set_vq_ready
where it can do some more work in the disable case?

For net and something like ifcvf_vdpa_set_vq_ready's design would we have
vhost_ring_ioctl() set some vhost_virtqueue enable bit. We then have some helper
vhost_vq_is_enabled() and some code to detect if userspace supports the new ioctl.
And then in vhost_net_set_backend do we call vhost_vq_is_enabled()? What is done
for disable then? It doesn't seem to buy a lot of new functionality. Is it just
so we follow the spec?

Or do you want it work more like mlx5_vdpa_set_vq_ready? For this in vhost_ring_ioctl
when we get the new ioctl we would call into the drivers and have it start queues
and stop queues? For enable, what we you do for net for this case? For disable,
would you do something like vhost_net_stop_vq (we don't free up anything allocated
in vhost_vring_ioctl calls, but we can stop what we setup in the net driver)?
Is this useful for the current net mq design or is this for something like where
you would do one vhost net device with multiple vqs?

My issue/convern is that in general these calls seems useful, but we don't really
need them for scsi because vhost scsi is already stuck creating vqs like how it does
due to existing users. If we do the ifcvf_vdpa_set_vq_ready type of design where
we just set some bit, then the new ioctl does not give us a lot. It's just an extra
check and extra code.

And for the mlx5_vdpa_set_vq_ready type of design, it doesn't seem like it's going
to happen a lot where the admin is going to want to remove vqs from a running device.
And for both addition/removal for scsi we would need code in virtio scsi to handle
hot plug removal/addition of a queue and then redoing the multiqueue mappings which
would be difficult to add with no one requesting it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-18  6:57     ` Mike Christie
@ 2020-11-18  7:19       ` Mike Christie
  2020-11-18  7:54       ` Jason Wang
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-18  7:19 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: fam, linux-scsi, mst, qemu-devel, virtualization, target-devel, pbonzini

On 11/18/20 12:57 AM, Mike Christie wrote:
> On 11/17/20 11:17 PM, Jason Wang wrote:
>>
>> On 2020/11/18 上午12:40, Stefan Hajnoczi wrote:
>>> On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
>>>> The following kernel patches were made over Michael's vhost branch:
>>>>
>>>> https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost__;!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NoRys_hU$
>>>> and the vhost-scsi bug fix patchset:
>>>>
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/*t__;Iw!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NmuPE_m8$
>>>> And the qemu patch was made over the qemu master branch.
>>>>
>>>> vhost-scsi currently supports multiple queues with the num_queues
>>>> setting, but we end up with a setup where the guest's scsi/block
>>>> layer can do a queue per vCPU and the layers below vhost can do
>>>> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
>>>> but all IO gets set on and completed on a single vhost-scsi thread.
>>>> After 2 - 4 vqs this becomes a bottleneck.
>>>>
>>>> This patchset allows us to create a worker thread per IO vq, so we
>>>> can better utilize multiple CPUs with the multiple queues. It
>>>> implments Jason's suggestion to create the initial worker like
>>>> normal, then create the extra workers for IO vqs with the
>>>> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
>>> How does userspace find out the tids and set their CPU affinity?
>>>
>>> What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
>>> really "enable" or "disable" the vq, requests are processed regardless.
>>
>>
>> Actually I think it should do the real "enable/disable" that tries to follow the virtio spec.
>>
> 
> What does real mean here? For the vdpa enable call for example, would it be like
> ifcvf_vdpa_set_vq_ready where it sets the ready bit or more like mlx5_vdpa_set_vq_ready
> where it can do some more work in the disable case?
> 
> For net and something like ifcvf_vdpa_set_vq_ready's design would we have
> vhost_ring_ioctl() set some vhost_virtqueue enable bit. We then have some helper
> vhost_vq_is_enabled() and some code to detect if userspace supports the new ioctl.
> And then in vhost_net_set_backend do we call vhost_vq_is_enabled()? What is done
> for disable then? It doesn't seem to buy a lot of new functionality. Is it just
> so we follow the spec?
> 
> Or do you want it work more like mlx5_vdpa_set_vq_ready? For this in vhost_ring_ioctl
> when we get the new ioctl we would call into the drivers and have it start queues
> and stop queues? For enable, what we you do for net for this case? For disable,
> would you do something like vhost_net_stop_vq (we don't free up anything allocated
> in vhost_vring_ioctl calls, but we can stop what we setup in the net driver)?
> Is this useful for the current net mq design or is this for something like where
> you would do one vhost net device with multiple vqs?
> 
> My issue/convern is that in general these calls seems useful, but we don't really
> need them for scsi because vhost scsi is already stuck creating vqs like how it does
> due to existing users. If we do the ifcvf_vdpa_set_vq_ready type of design where
> we just set some bit, then the new ioctl does not give us a lot. It's just an extra
> check and extra code.
> 
> And for the mlx5_vdpa_set_vq_ready type of design, it doesn't seem like it's going
> to happen a lot where the admin is going to want to remove vqs from a running device.
> And for both addition/removal for scsi we would need code in virtio scsi to handle
> hot plug removal/addition of a queue and then redoing the multiqueue mappings which
> would be difficult to add with no one requesting it.

Actually I want to half take this last chunk back. When I said in general these calls
seem useful, I meant for the mlx5_vdpa_set_vq_ready type design. For example, if a
user was going to remove/add vCPUs then this functionality where we are completely
adding/removing virtqueues would be useful. We would need a lot more than just
the new ioctl though, because we would want to completely create/setup a new
virtqueue

I do not have any of our users asking for this. You guys work on this more so
you know better.

Another option is to kick it down the road again since I'm not sure my patches
here have a lot to do with this. We could also just do the kernel only approach
(no new ioctl) and then add some new design when we have users asking for it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-18  6:57     ` Mike Christie
  2020-11-18  7:19       ` Mike Christie
@ 2020-11-18  7:54       ` Jason Wang
  2020-11-18 20:06         ` Mike Christie
  1 sibling, 1 reply; 47+ messages in thread
From: Jason Wang @ 2020-11-18  7:54 UTC (permalink / raw)
  To: Mike Christie, Stefan Hajnoczi
  Cc: fam, linux-scsi, mst, qemu-devel, virtualization, target-devel, pbonzini


On 2020/11/18 下午2:57, Mike Christie wrote:
> On 11/17/20 11:17 PM, Jason Wang wrote:
>> On 2020/11/18 上午12:40, Stefan Hajnoczi wrote:
>>> On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
>>>> The following kernel patches were made over Michael's vhost branch:
>>>>
>>>> https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost__;!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NoRys_hU$
>>>> and the vhost-scsi bug fix patchset:
>>>>
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/*t__;Iw!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NmuPE_m8$
>>>> And the qemu patch was made over the qemu master branch.
>>>>
>>>> vhost-scsi currently supports multiple queues with the num_queues
>>>> setting, but we end up with a setup where the guest's scsi/block
>>>> layer can do a queue per vCPU and the layers below vhost can do
>>>> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
>>>> but all IO gets set on and completed on a single vhost-scsi thread.
>>>> After 2 - 4 vqs this becomes a bottleneck.
>>>>
>>>> This patchset allows us to create a worker thread per IO vq, so we
>>>> can better utilize multiple CPUs with the multiple queues. It
>>>> implments Jason's suggestion to create the initial worker like
>>>> normal, then create the extra workers for IO vqs with the
>>>> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
>>> How does userspace find out the tids and set their CPU affinity?
>>>
>>> What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
>>> really "enable" or "disable" the vq, requests are processed regardless.
>>
>> Actually I think it should do the real "enable/disable" that tries to follow the virtio spec.
>>
> What does real mean here?


I think it means when a vq is disabled, vhost won't process any request 
from that virtqueue.


> For the vdpa enable call for example, would it be like
> ifcvf_vdpa_set_vq_ready where it sets the ready bit or more like mlx5_vdpa_set_vq_ready
> where it can do some more work in the disable case?


For vDPA, it would be more complicated.

E.g for IFCVF, it just delay the setting of queue_enable when it get 
DRIVER_OK. Technically it can passthrough the queue_enable to the 
hardware as what mlx5e did.


>
> For net and something like ifcvf_vdpa_set_vq_ready's design would we have
> vhost_ring_ioctl() set some vhost_virtqueue enable bit. We then have some helper
> vhost_vq_is_enabled() and some code to detect if userspace supports the new ioctl.


Yes, vhost support backend capability. When userspace negotiate the new 
capability, we should depend on SET_VRING_ENABLE, if not we can do 
vhost_vq_is_enable().


> And then in vhost_net_set_backend do we call vhost_vq_is_enabled()? What is done
> for disable then?


It needs more thought, but the question is not specific to 
SET_VRING_ENABLE. Consider guest may zero ring address as well.

For disabling, we can simply flush the work and disable all the polls.


> It doesn't seem to buy a lot of new functionality. Is it just
> so we follow the spec?


My understanding is that, since spec defines queue_enable, we should 
support it in vhost. And we can piggyback the delayed vq creation with 
this feature. Otherwise we will duplicate the function if we want to 
support queue_enable.


>
> Or do you want it work more like mlx5_vdpa_set_vq_ready? For this in vhost_ring_ioctl
> when we get the new ioctl we would call into the drivers and have it start queues
> and stop queues? For enable, what we you do for net for this case?


Net is something different, we can simply use SET_BACKEND to disable a 
specific virtqueue without introducing new ioctls. Notice that, net mq 
is kind of different with scsi which have a per queue pair vhost device, 
and the API allows us to set backend for a specific virtqueue.


> For disable,
> would you do something like vhost_net_stop_vq (we don't free up anything allocated
> in vhost_vring_ioctl calls, but we can stop what we setup in the net driver)?


It's up to you, if you think you should free the resources you can do that.


> Is this useful for the current net mq design or is this for something like where
> you would do one vhost net device with multiple vqs?


I think SET_VRING_ENABLE is more useful for SCSI since it have a model 
of multiple vqs per vhost device.


>
> My issue/convern is that in general these calls seems useful, but we don't really
> need them for scsi because vhost scsi is already stuck creating vqs like how it does
> due to existing users. If we do the ifcvf_vdpa_set_vq_ready type of design where
> we just set some bit, then the new ioctl does not give us a lot. It's just an extra
> check and extra code.
>
> And for the mlx5_vdpa_set_vq_ready type of design, it doesn't seem like it's going
> to happen a lot where the admin is going to want to remove vqs from a running device.


In this case, qemu may just disable the queues of vhost-scsi via 
SET_VRING_ENABLE and then we can free resources?


> And for both addition/removal for scsi we would need code in virtio scsi to handle
> hot plug removal/addition of a queue and then redoing the multiqueue mappings which
> would be difficult to add with no one requesting it.


Thanks


>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-17 19:13   ` Mike Christie
@ 2020-11-18  9:54     ` Michael S. Tsirkin
  2020-11-19 14:00       ` Stefan Hajnoczi
  2020-11-18 11:31     ` Stefan Hajnoczi
  1 sibling, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2020-11-18  9:54 UTC (permalink / raw)
  To: Mike Christie
  Cc: Stefan Hajnoczi, qemu-devel, fam, linux-scsi, target-devel,
	jasowang, pbonzini, virtualization

On Tue, Nov 17, 2020 at 01:13:14PM -0600, Mike Christie wrote:
> On 11/17/20 10:40 AM, Stefan Hajnoczi wrote:
> > On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
> >> The following kernel patches were made over Michael's vhost branch:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost
> >>
> >> and the vhost-scsi bug fix patchset:
> >>
> >> https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/#t
> >>
> >> And the qemu patch was made over the qemu master branch.
> >>
> >> vhost-scsi currently supports multiple queues with the num_queues
> >> setting, but we end up with a setup where the guest's scsi/block
> >> layer can do a queue per vCPU and the layers below vhost can do
> >> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
> >> but all IO gets set on and completed on a single vhost-scsi thread.
> >> After 2 - 4 vqs this becomes a bottleneck.
> >>
> >> This patchset allows us to create a worker thread per IO vq, so we
> >> can better utilize multiple CPUs with the multiple queues. It
> >> implments Jason's suggestion to create the initial worker like
> >> normal, then create the extra workers for IO vqs with the
> >> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
> > 
> > How does userspace find out the tids and set their CPU affinity?
> > 
> 
> When we create the worker thread we add it to the device owner's cgroup,
> so we end up inheriting those settings like affinity.
> 
> However, are you more asking about finer control like if the guest is
> doing mq, and the mq hw queue is bound to cpu0, it would perform
> better if we could bind vhost vq's worker thread to cpu0? I think the
> problem might is if you are in the cgroup then we can't set a specific
> threads CPU affinity to just one specific CPU. So you can either do
> cgroups or not.

Something we wanted to try for a while is to allow userspace
to create threads for us, then specify which vqs it processes.

That would address this set of concerns ...

-- 
MST


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-17 19:13   ` Mike Christie
  2020-11-18  9:54     ` Michael S. Tsirkin
@ 2020-11-18 11:31     ` Stefan Hajnoczi
  2020-11-19 14:46       ` Michael S. Tsirkin
  1 sibling, 1 reply; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-18 11:31 UTC (permalink / raw)
  To: Mike Christie
  Cc: qemu-devel, fam, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 3848 bytes --]

On Tue, Nov 17, 2020 at 01:13:14PM -0600, Mike Christie wrote:
> On 11/17/20 10:40 AM, Stefan Hajnoczi wrote:
> > On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
> >> The following kernel patches were made over Michael's vhost branch:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost
> >>
> >> and the vhost-scsi bug fix patchset:
> >>
> >> https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/#t
> >>
> >> And the qemu patch was made over the qemu master branch.
> >>
> >> vhost-scsi currently supports multiple queues with the num_queues
> >> setting, but we end up with a setup where the guest's scsi/block
> >> layer can do a queue per vCPU and the layers below vhost can do
> >> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
> >> but all IO gets set on and completed on a single vhost-scsi thread.
> >> After 2 - 4 vqs this becomes a bottleneck.
> >>
> >> This patchset allows us to create a worker thread per IO vq, so we
> >> can better utilize multiple CPUs with the multiple queues. It
> >> implments Jason's suggestion to create the initial worker like
> >> normal, then create the extra workers for IO vqs with the
> >> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
> > 
> > How does userspace find out the tids and set their CPU affinity?
> > 
> 
> When we create the worker thread we add it to the device owner's cgroup,
> so we end up inheriting those settings like affinity.
> 
> However, are you more asking about finer control like if the guest is
> doing mq, and the mq hw queue is bound to cpu0, it would perform
> better if we could bind vhost vq's worker thread to cpu0? I think the
> problem might is if you are in the cgroup then we can't set a specific
> threads CPU affinity to just one specific CPU. So you can either do
> cgroups or not.
> 
> 
> > What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
> > really "enable" or "disable" the vq, requests are processed regardless.
> > 
> 
> Yeah, I agree. The problem I've mentioned before is:
> 
> 1. For net and vsock, it's not useful because the vqs are hard coded in
> the kernel and userspace, so you can't disable a vq and you never need
> to enable one.
> 
> 2. vdpa has it's own enable ioctl.
> 
> 3. For scsi, because we already are doing multiple vqs based on the
> num_queues value, we have to have some sort of compat support and
> code to detect if userspace is even going to send the new ioctl.
> In this patchset, compat just meant enable/disable the extra functionality
> of extra worker threads for a vq. We will still use the vq if
> userspace set it up.
> 
> 
> > The purpose of the ioctl isn't clear to me because the kernel could
> > automatically create 1 thread per vq without a new ioctl. On the other
> > hand, if userspace is supposed to control worker threads then a
> > different interface would be more powerful:
> > 

The main request I have is to clearly define the meaning of the
VHOST_SET_VRING_ENABLE ioctl. If you want to keep it as-is for now and
the vhost maintainers are happy with then, that's okay. It should just
be documented so that userspace and other vhost driver authors
understand what it's supposed to do.

> My preference has been:
> 
> 1. If we were to ditch cgroups, then add a new interface that would allow
> us to bind threads to a specific CPU, so that it lines up with the guest's
> mq to CPU mapping.

A 1:1 vCPU/vq->CPU mapping isn't desirable in all cases.

The CPU affinity is a userspace policy decision. The host kernel should
provide a mechanism but not the policy. That way userspace can decide
which workers are shared by multiple vqs and on which physical CPUs they
should run.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-18  7:54       ` Jason Wang
@ 2020-11-18 20:06         ` Mike Christie
  2020-11-19  4:35           ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-18 20:06 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: fam, linux-scsi, mst, qemu-devel, virtualization, target-devel, pbonzini

On 11/18/20 1:54 AM, Jason Wang wrote:
> 
> On 2020/11/18 下午2:57, Mike Christie wrote:
>> On 11/17/20 11:17 PM, Jason Wang wrote:
>>> On 2020/11/18 上午12:40, Stefan Hajnoczi wrote:
>>>> On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
>>>>> The following kernel patches were made over Michael's vhost branch:
>>>>>
>>>>> https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost__;!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NoRys_hU$ 
>>>>>
>>>>> and the vhost-scsi bug fix patchset:
>>>>>
>>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/*t__;Iw!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NmuPE_m8$ 
>>>>>
>>>>> And the qemu patch was made over the qemu master branch.
>>>>>
>>>>> vhost-scsi currently supports multiple queues with the num_queues
>>>>> setting, but we end up with a setup where the guest's scsi/block
>>>>> layer can do a queue per vCPU and the layers below vhost can do
>>>>> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
>>>>> but all IO gets set on and completed on a single vhost-scsi thread.
>>>>> After 2 - 4 vqs this becomes a bottleneck.
>>>>>
>>>>> This patchset allows us to create a worker thread per IO vq, so we
>>>>> can better utilize multiple CPUs with the multiple queues. It
>>>>> implments Jason's suggestion to create the initial worker like
>>>>> normal, then create the extra workers for IO vqs with the
>>>>> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
>>>> How does userspace find out the tids and set their CPU affinity?
>>>>
>>>> What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It doesn't
>>>> really "enable" or "disable" the vq, requests are processed regardless.
>>>
>>> Actually I think it should do the real "enable/disable" that tries to 
>>> follow the virtio spec.
>>>
>> What does real mean here?
> 
> 
> I think it means when a vq is disabled, vhost won't process any request 
> from that virtqueue.
> 
> 
>> For the vdpa enable call for example, would it be like
>> ifcvf_vdpa_set_vq_ready where it sets the ready bit or more like 
>> mlx5_vdpa_set_vq_ready
>> where it can do some more work in the disable case?
> 
> 
> For vDPA, it would be more complicated.
> 
> E.g for IFCVF, it just delay the setting of queue_enable when it get 
> DRIVER_OK. Technically it can passthrough the queue_enable to the 
> hardware as what mlx5e did.
> 
> 
>>
>> For net and something like ifcvf_vdpa_set_vq_ready's design would we have
>> vhost_ring_ioctl() set some vhost_virtqueue enable bit. We then have 
>> some helper
>> vhost_vq_is_enabled() and some code to detect if userspace supports 
>> the new ioctl.
> 
> 
> Yes, vhost support backend capability. When userspace negotiate the new 
> capability, we should depend on SET_VRING_ENABLE, if not we can do 
> vhost_vq_is_enable().
> 
> 
>> And then in vhost_net_set_backend do we call vhost_vq_is_enabled()? 
>> What is done
>> for disable then?
> 
> 
> It needs more thought, but the question is not specific to 
> SET_VRING_ENABLE. Consider guest may zero ring address as well.
> 
> For disabling, we can simply flush the work and disable all the polls.
> 
> 
>> It doesn't seem to buy a lot of new functionality. Is it just
>> so we follow the spec?
> 
> 
> My understanding is that, since spec defines queue_enable, we should 
> support it in vhost. And we can piggyback the delayed vq creation with 
> this feature. Otherwise we will duplicate the function if we want to 
> support queue_enable.


I had actually given up on the delayed vq creation goal. I'm still not 
sure how it's related to ENABLE and I think it gets pretty gross.

1. If we started from a semi-clean slate, and used the ENABLE ioctl more 
like a CREATE ioctl, and did the ENABLE after vhost dev open() but 
before any other ioctls, we can allocate the vq when we get the ENABLE 
ioctl. This fixes the issue where vhost scsi is allocating 128 vqs at 
open() time. We can then allocate metadata like the iovecs at ENABLE 
time or when we get a setup ioctl that is related to the metadata, so it 
fixes that too.

That makes sense how ENABLE is related to delayed vq allocation and why 
we would want it.

If we now need to support old tools though, then you lose me. To try and 
keep the code paths using the same code, then at vhost dev open() time 
do we start vhost_dev_init with zero vqs like with the allocate at 
ENABLE time case? Then when we get the first vring or dev ioctl, do we 
allocate the vq and related metadata? If so, the ENABLE does not buy us 
a lot since we get the delayed allocation from the compat code. Also 
this compat case gets really messy when we are delaying the actual vq 
and not just the metadata.

If for the compat case, we keep the code that before/during 
vhost_dev_init allocates all the vqs and does the initialization, then 
we end up with 2 very very different code paths. And we also need a new 
modparam or something to tell the drivers to do the old or new open() 
behavior.

2. If we do an approach that is less invasive to the kernel for the 
compat case, and do the ENABLE ioctl after other vring ioctl calls then 
that would not work for the delayed vq allocation goal since the ENABLE 
call is too late.


> 
> 
>>
>> Or do you want it work more like mlx5_vdpa_set_vq_ready? For this in 
>> vhost_ring_ioctl
>> when we get the new ioctl we would call into the drivers and have it 
>> start queues
>> and stop queues? For enable, what we you do for net for this case?
> 
> 
> Net is something different, we can simply use SET_BACKEND to disable a 
> specific virtqueue without introducing new ioctls. Notice that, net mq 
> is kind of different with scsi which have a per queue pair vhost device, 
> and the API allows us to set backend for a specific virtqueue.


That's one of the things I am trying to understand. It sounds like 
ENABLE is not useful to net. Will net even use/implement the ENABLE 
ioctl or just use the SET_BACKEND? What about vsock?

For net it sounds like it's just going to add an extra code path if you 
support it.


> 
> 
>> For disable,
>> would you do something like vhost_net_stop_vq (we don't free up 
>> anything allocated
>> in vhost_vring_ioctl calls, but we can stop what we setup in the net 
>> driver)?
> 
> 
> It's up to you, if you think you should free the resources you can do that.
> 
> 
>> Is this useful for the current net mq design or is this for something 
>> like where
>> you would do one vhost net device with multiple vqs?
> 
> 
> I think SET_VRING_ENABLE is more useful for SCSI since it have a model 
> of multiple vqs per vhost device.

That is why I was asking about if you were going to change net.

It would have been useful for scsi if we had it when mq support was 
added and we don't have to support old tools. But now, if enable=true, 
is only going to be something where we set some bit so later when 
VHOST_SCSI_SET_ENDPOINT is run it we can do what we are already doing 
its just extra code. This patch:
https://www.spinics.net/lists/linux-scsi/msg150151.html
would work without the ENABLE ioctl I mean.


And if you guys want to do the completely new interface, then none of 
this matters I guess :)

For disable see below.

> 
> 
>>
>> My issue/convern is that in general these calls seems useful, but we 
>> don't really
>> need them for scsi because vhost scsi is already stuck creating vqs 
>> like how it does
>> due to existing users. If we do the ifcvf_vdpa_set_vq_ready type of 
>> design where
>> we just set some bit, then the new ioctl does not give us a lot. It's 
>> just an extra
>> check and extra code.
>>
>> And for the mlx5_vdpa_set_vq_ready type of design, it doesn't seem 
>> like it's going
>> to happen a lot where the admin is going to want to remove vqs from a 
>> running device.
> 
> 
> In this case, qemu may just disable the queues of vhost-scsi via 
> SET_VRING_ENABLE and then we can free resources?


Some SCSI background in case it doesn't work like net:
-------
When the user sets up mq for vhost-scsi/virtio-scsi, for max perf and no 
cares about mem use they would normally set num_queues based on the 
number of vCPUs and MSI-x vectors. I think the default in qemu now is to 
try and detect that value.

When the virtio_scsi driver is loaded into the guest kernel, it takes 
the num_queues value and tells the scsi/block mq layer to create 
num_queues multiqueue hw queues.

------

I was trying to say in the previous email that is if all we do is set 
some bits to indicate the queue is disabled, free its resources, stop 
polling/queueing in the scsi/target layer, flush etc, it does not seem 
useful. I was trying to ask when would a user only want this behavior?

I think we need an extra piece where the guests needs to be modified to 
handle the queue removal or the block/scsi layers would still send IO 
and we would get IO errors. Without this it seems like some extra code 
that we will not use.

And then if we are going to make disable useful like this, what about 
enable? We would want to the reverse where we add the queue and the 
guest remaps the mq to hw queue layout. To do this, enable has to do 
more than just set some bits. There is also an issue with how it would 
need to interact with the SET_BACKEND 
(VHOST_SCSI_SET_ENDPOINT/VHOST_SCSI_CLEAR_ENDPOINT for scsi) calls.

I think if we wanted the ENABLE ioctl to work like this then that is not 
related to my patches and I like I've written before I think my patches 
do not need the ENABLE ioctl in general. We could add the patch where we 
create the workers threads from VHOST_SCSI_SET_ENDPOINT. And if we ever 
add this queue hotplug type of code, then the worker thread would just 
get moved/rearranged with the other vq modification code in 
vhost_scsi_set_endpoint/vhost_scsi_clear_endpoint.

We could also go the new threading interface route, and also do the 
ENABLE ioctl separately.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-18 20:06         ` Mike Christie
@ 2020-11-19  4:35           ` Jason Wang
  2020-11-19 15:49             ` Mike Christie
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2020-11-19  4:35 UTC (permalink / raw)
  To: Mike Christie, Stefan Hajnoczi
  Cc: fam, linux-scsi, mst, qemu-devel, virtualization, target-devel, pbonzini


On 2020/11/19 上午4:06, Mike Christie wrote:
> On 11/18/20 1:54 AM, Jason Wang wrote:
>>
>> On 2020/11/18 下午2:57, Mike Christie wrote:
>>> On 11/17/20 11:17 PM, Jason Wang wrote:
>>>> On 2020/11/18 上午12:40, Stefan Hajnoczi wrote:
>>>>> On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
>>>>>> The following kernel patches were made over Michael's vhost branch:
>>>>>>
>>>>>> https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost__;!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NoRys_hU$ 
>>>>>>
>>>>>> and the vhost-scsi bug fix patchset:
>>>>>>
>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/*t__;Iw!!GqivPVa7Brio!MzCv3wdRfz5dltunazRWGCeUkMg91pPEOLpIivsebLX9vhYDSi_E1V36e9H0NmuPE_m8$ 
>>>>>>
>>>>>> And the qemu patch was made over the qemu master branch.
>>>>>>
>>>>>> vhost-scsi currently supports multiple queues with the num_queues
>>>>>> setting, but we end up with a setup where the guest's scsi/block
>>>>>> layer can do a queue per vCPU and the layers below vhost can do
>>>>>> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
>>>>>> but all IO gets set on and completed on a single vhost-scsi thread.
>>>>>> After 2 - 4 vqs this becomes a bottleneck.
>>>>>>
>>>>>> This patchset allows us to create a worker thread per IO vq, so we
>>>>>> can better utilize multiple CPUs with the multiple queues. It
>>>>>> implments Jason's suggestion to create the initial worker like
>>>>>> normal, then create the extra workers for IO vqs with the
>>>>>> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
>>>>> How does userspace find out the tids and set their CPU affinity?
>>>>>
>>>>> What is the meaning of the new VHOST_SET_VRING_ENABLE ioctl? It 
>>>>> doesn't
>>>>> really "enable" or "disable" the vq, requests are processed 
>>>>> regardless.
>>>>
>>>> Actually I think it should do the real "enable/disable" that tries 
>>>> to follow the virtio spec.
>>>>
>>> What does real mean here?
>>
>>
>> I think it means when a vq is disabled, vhost won't process any 
>> request from that virtqueue.
>>
>>
>>> For the vdpa enable call for example, would it be like
>>> ifcvf_vdpa_set_vq_ready where it sets the ready bit or more like 
>>> mlx5_vdpa_set_vq_ready
>>> where it can do some more work in the disable case?
>>
>>
>> For vDPA, it would be more complicated.
>>
>> E.g for IFCVF, it just delay the setting of queue_enable when it get 
>> DRIVER_OK. Technically it can passthrough the queue_enable to the 
>> hardware as what mlx5e did.
>>
>>
>>>
>>> For net and something like ifcvf_vdpa_set_vq_ready's design would we 
>>> have
>>> vhost_ring_ioctl() set some vhost_virtqueue enable bit. We then have 
>>> some helper
>>> vhost_vq_is_enabled() and some code to detect if userspace supports 
>>> the new ioctl.
>>
>>
>> Yes, vhost support backend capability. When userspace negotiate the 
>> new capability, we should depend on SET_VRING_ENABLE, if not we can 
>> do vhost_vq_is_enable().
>>
>>
>>> And then in vhost_net_set_backend do we call vhost_vq_is_enabled()? 
>>> What is done
>>> for disable then?
>>
>>
>> It needs more thought, but the question is not specific to 
>> SET_VRING_ENABLE. Consider guest may zero ring address as well.
>>
>> For disabling, we can simply flush the work and disable all the polls.
>>
>>
>>> It doesn't seem to buy a lot of new functionality. Is it just
>>> so we follow the spec?
>>
>>
>> My understanding is that, since spec defines queue_enable, we should 
>> support it in vhost. And we can piggyback the delayed vq creation 
>> with this feature. Otherwise we will duplicate the function if we 
>> want to support queue_enable.
>
>
> I had actually given up on the delayed vq creation goal. I'm still not 
> sure how it's related to ENABLE and I think it gets pretty gross.
>
> 1. If we started from a semi-clean slate, and used the ENABLE ioctl 
> more like a CREATE ioctl, and did the ENABLE after vhost dev open() 
> but before any other ioctls, we can allocate the vq when we get the 
> ENABLE ioctl. This fixes the issue where vhost scsi is allocating 128 
> vqs at open() time. We can then allocate metadata like the iovecs at 
> ENABLE time or when we get a setup ioctl that is related to the 
> metadata, so it fixes that too.
>
> That makes sense how ENABLE is related to delayed vq allocation and 
> why we would want it.
>
> If we now need to support old tools though, then you lose me. To try 
> and keep the code paths using the same code, then at vhost dev open() 
> time do we start vhost_dev_init with zero vqs like with the allocate 
> at ENABLE time case? Then when we get the first vring or dev ioctl, do 
> we allocate the vq and related metadata? If so, the ENABLE does not 
> buy us a lot since we get the delayed allocation from the compat code. 
> Also this compat case gets really messy when we are delaying the 
> actual vq and not just the metadata.
>
> If for the compat case, we keep the code that before/during 
> vhost_dev_init allocates all the vqs and does the initialization, then 
> we end up with 2 very very different code paths. And we also need a 
> new modparam or something to tell the drivers to do the old or new 
> open() behavior.


Right, so I think maybe we can take a step back. Instead of depending on 
explicit new ioctl which may cause a lot of issues, can we do something 
similar to vhost_vq_is_setup().

That means, let's create/destory new workers on SET_VRING_ADDR?


>
> 2. If we do an approach that is less invasive to the kernel for the 
> compat case, and do the ENABLE ioctl after other vring ioctl calls 
> then that would not work for the delayed vq allocation goal since the 
> ENABLE call is too late.
>
>
>>
>>
>>>
>>> Or do you want it work more like mlx5_vdpa_set_vq_ready? For this in 
>>> vhost_ring_ioctl
>>> when we get the new ioctl we would call into the drivers and have it 
>>> start queues
>>> and stop queues? For enable, what we you do for net for this case?
>>
>>
>> Net is something different, we can simply use SET_BACKEND to disable 
>> a specific virtqueue without introducing new ioctls. Notice that, net 
>> mq is kind of different with scsi which have a per queue pair vhost 
>> device, and the API allows us to set backend for a specific virtqueue.
>
>
> That's one of the things I am trying to understand. It sounds like 
> ENABLE is not useful to net. Will net even use/implement the ENABLE 
> ioctl or just use the SET_BACKEND?


I think SET_BACKEND is sufficient for net.


> What about vsock?


For vsock (and scsi as well), their backend is per virtqueue, but the 
actual issue is there's no uAPI to configure it per vq. The current uAPI 
is per device.


>
> For net it sounds like it's just going to add an extra code path if 
> you support it.


Yes, so if we really want one w(which is still questionable during our 
discussion). We can start from a SCSI specific one (or an alias of vDPA 
one).


>
>
>>
>>
>>> For disable,
>>> would you do something like vhost_net_stop_vq (we don't free up 
>>> anything allocated
>>> in vhost_vring_ioctl calls, but we can stop what we setup in the net 
>>> driver)?
>>
>>
>> It's up to you, if you think you should free the resources you can do 
>> that.
>>
>>
>>> Is this useful for the current net mq design or is this for 
>>> something like where
>>> you would do one vhost net device with multiple vqs?
>>
>>
>> I think SET_VRING_ENABLE is more useful for SCSI since it have a 
>> model of multiple vqs per vhost device.
>
> That is why I was asking about if you were going to change net.
>
> It would have been useful for scsi if we had it when mq support was 
> added and we don't have to support old tools. But now, if enable=true, 
> is only going to be something where we set some bit so later when 
> VHOST_SCSI_SET_ENDPOINT is run it we can do what we are already doing 
> its just extra code. This patch:
> https://www.spinics.net/lists/linux-scsi/msg150151.html
> would work without the ENABLE ioctl I mean.


That seems to pre-allocate all workers. If we don't care the resources 
(127 workers) consumption it could be fine.


>
>
> And if you guys want to do the completely new interface, then none of 
> this matters I guess :)
>
> For disable see below.
>
>>
>>
>>>
>>> My issue/convern is that in general these calls seems useful, but we 
>>> don't really
>>> need them for scsi because vhost scsi is already stuck creating vqs 
>>> like how it does
>>> due to existing users. If we do the ifcvf_vdpa_set_vq_ready type of 
>>> design where
>>> we just set some bit, then the new ioctl does not give us a lot. 
>>> It's just an extra
>>> check and extra code.
>>>
>>> And for the mlx5_vdpa_set_vq_ready type of design, it doesn't seem 
>>> like it's going
>>> to happen a lot where the admin is going to want to remove vqs from 
>>> a running device.
>>
>>
>> In this case, qemu may just disable the queues of vhost-scsi via 
>> SET_VRING_ENABLE and then we can free resources?
>
>
> Some SCSI background in case it doesn't work like net:
> -------
> When the user sets up mq for vhost-scsi/virtio-scsi, for max perf and 
> no cares about mem use they would normally set num_queues based on the 
> number of vCPUs and MSI-x vectors. I think the default in qemu now is 
> to try and detect that value.
>
> When the virtio_scsi driver is loaded into the guest kernel, it takes 
> the num_queues value and tells the scsi/block mq layer to create 
> num_queues multiqueue hw queues.


If I read the code correctly, for modern device, guest will set 
queue_enable for the queues that it wants to use. So in this ideal case, 
qemu can forward them to VRING_ENABLE and reset VRING_ENABLE during 
device reset.

But it would be complicated to support legacy device and qemu.


>
> ------
>
> I was trying to say in the previous email that is if all we do is set 
> some bits to indicate the queue is disabled, free its resources, stop 
> polling/queueing in the scsi/target layer, flush etc, it does not seem 
> useful. I was trying to ask when would a user only want this behavior?


I think it's device reset, the semantic is that unless the queue is 
enabled, we should treat it as disabled.


>
> I think we need an extra piece where the guests needs to be modified 
> to handle the queue removal or the block/scsi layers would still send 
> IO and we would get IO errors. Without this it seems like some extra 
> code that we will not use.
>
> And then if we are going to make disable useful like this, what about 
> enable? We would want to the reverse where we add the queue and the 
> guest remaps the mq to hw queue layout. To do this, enable has to do 
> more than just set some bits. There is also an issue with how it would 
> need to interact with the SET_BACKEND 
> (VHOST_SCSI_SET_ENDPOINT/VHOST_SCSI_CLEAR_ENDPOINT for scsi) calls.
>
> I think if we wanted the ENABLE ioctl to work like this then that is 
> not related to my patches and I like I've written before I think my 
> patches do not need the ENABLE ioctl in general. We could add the 
> patch where we create the workers threads from 
> VHOST_SCSI_SET_ENDPOINT. And if we ever add this queue hotplug type of 
> code, then the worker thread would just get moved/rearranged with the 
> other vq modification code in 
> vhost_scsi_set_endpoint/vhost_scsi_clear_endpoint.
>
> We could also go the new threading interface route, and also do the 
> ENABLE ioctl separately.


Right, my original idea is to try to make queue_enable (in the spec) 
work for SCSI and we can use that for any delayed stuffs (vq, or workers).

But it looks not as easy as I imaged.

Thanks





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-18  9:54     ` Michael S. Tsirkin
@ 2020-11-19 14:00       ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-19 14:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Mike Christie, qemu-devel, fam, linux-scsi, target-devel,
	jasowang, pbonzini, virtualization


[-- Attachment #1: Type: text/plain, Size: 2747 bytes --]

On Wed, Nov 18, 2020 at 04:54:07AM -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 17, 2020 at 01:13:14PM -0600, Mike Christie wrote:
> > On 11/17/20 10:40 AM, Stefan Hajnoczi wrote:
> > > On Thu, Nov 12, 2020 at 05:18:59PM -0600, Mike Christie wrote:
> > >> The following kernel patches were made over Michael's vhost branch:
> > >>
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=vhost
> > >>
> > >> and the vhost-scsi bug fix patchset:
> > >>
> > >> https://lore.kernel.org/linux-scsi/20201112170008.GB1555653@stefanha-x1.localdomain/T/#t
> > >>
> > >> And the qemu patch was made over the qemu master branch.
> > >>
> > >> vhost-scsi currently supports multiple queues with the num_queues
> > >> setting, but we end up with a setup where the guest's scsi/block
> > >> layer can do a queue per vCPU and the layers below vhost can do
> > >> a queue per CPU. vhost-scsi will then do a num_queue virtqueues,
> > >> but all IO gets set on and completed on a single vhost-scsi thread.
> > >> After 2 - 4 vqs this becomes a bottleneck.
> > >>
> > >> This patchset allows us to create a worker thread per IO vq, so we
> > >> can better utilize multiple CPUs with the multiple queues. It
> > >> implments Jason's suggestion to create the initial worker like
> > >> normal, then create the extra workers for IO vqs with the
> > >> VHOST_SET_VRING_ENABLE ioctl command added in this patchset.
> > > 
> > > How does userspace find out the tids and set their CPU affinity?
> > > 
> > 
> > When we create the worker thread we add it to the device owner's cgroup,
> > so we end up inheriting those settings like affinity.
> > 
> > However, are you more asking about finer control like if the guest is
> > doing mq, and the mq hw queue is bound to cpu0, it would perform
> > better if we could bind vhost vq's worker thread to cpu0? I think the
> > problem might is if you are in the cgroup then we can't set a specific
> > threads CPU affinity to just one specific CPU. So you can either do
> > cgroups or not.
> 
> Something we wanted to try for a while is to allow userspace
> to create threads for us, then specify which vqs it processes.

Do you mean an interface like a blocking ioctl(vhost_fd,
VHOST_WORKER_RUN) where the vhost processing is done in the context of
the caller's userspace thread?

What is neat about this is that it removes thread configuration from the
kernel vhost code. On the other hand, userspace still needs an interface
indicating which vqs should be processed. Maybe it would even require an
int worker_fd = ioctl(vhost_fd, VHOST_WORKER_CREATE) and then
ioctl(worker_fd, VHOST_WORKER_BIND_VQ, vq_idx)? So then it becomes
complex again...

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-18 11:31     ` Stefan Hajnoczi
@ 2020-11-19 14:46       ` Michael S. Tsirkin
  2020-11-19 16:11         ` Mike Christie
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2020-11-19 14:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Mike Christie, qemu-devel, fam, linux-scsi, target-devel,
	jasowang, pbonzini, virtualization

On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
> > My preference has been:
> > 
> > 1. If we were to ditch cgroups, then add a new interface that would allow
> > us to bind threads to a specific CPU, so that it lines up with the guest's
> > mq to CPU mapping.
> 
> A 1:1 vCPU/vq->CPU mapping isn't desirable in all cases.
> 
> The CPU affinity is a userspace policy decision. The host kernel should
> provide a mechanism but not the policy. That way userspace can decide
> which workers are shared by multiple vqs and on which physical CPUs they
> should run.

So if we let userspace dictate the threading policy then I think binding
vqs to userspace threads and running there makes the most sense,
no need to create the threads.

-- 
MST


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-19  4:35           ` Jason Wang
@ 2020-11-19 15:49             ` Mike Christie
  0 siblings, 0 replies; 47+ messages in thread
From: Mike Christie @ 2020-11-19 15:49 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: fam, linux-scsi, mst, qemu-devel, virtualization, target-devel, pbonzini

On 11/18/20 10:35 PM, Jason Wang wrote:
>> its just extra code. This patch:
>> https://urldefense.com/v3/__https://www.spinics.net/lists/linux-scsi/msg150151.html__;!!GqivPVa7Brio!MJS-iYeBuOljoz2xerETyn4c1N9i0XnOE8oNhz4ebbzCMNeQIP_Iie8zH18L7cY7_hur$ 
>> would work without the ENABLE ioctl I mean.
> 
> 
> That seems to pre-allocate all workers. If we don't care the resources 
> (127 workers) consumption it could be fine.

It only makes what the user requested via num_queues.

That patch will:
1. For the default case of num_queues=1 we use the single worker created 
from the SET_OWNER ioctl.
2. If num_queues > 1, then it creates a worker thread per num_queue > 1.


> 
> 
>>
>>
>> And if you guys want to do the completely new interface, then none of 
>> this matters I guess :)
>>
>> For disable see below.
>>
>>>
>>>
>>>>
>>>> My issue/convern is that in general these calls seems useful, but we 
>>>> don't really
>>>> need them for scsi because vhost scsi is already stuck creating vqs 
>>>> like how it does
>>>> due to existing users. If we do the ifcvf_vdpa_set_vq_ready type of 
>>>> design where
>>>> we just set some bit, then the new ioctl does not give us a lot. 
>>>> It's just an extra
>>>> check and extra code.
>>>>
>>>> And for the mlx5_vdpa_set_vq_ready type of design, it doesn't seem 
>>>> like it's going
>>>> to happen a lot where the admin is going to want to remove vqs from 
>>>> a running device.
>>>
>>>
>>> In this case, qemu may just disable the queues of vhost-scsi via 
>>> SET_VRING_ENABLE and then we can free resources?
>>
>>
>> Some SCSI background in case it doesn't work like net:
>> -------
>> When the user sets up mq for vhost-scsi/virtio-scsi, for max perf and 
>> no cares about mem use they would normally set num_queues based on the 
>> number of vCPUs and MSI-x vectors. I think the default in qemu now is 
>> to try and detect that value.
>>
>> When the virtio_scsi driver is loaded into the guest kernel, it takes 
>> the num_queues value and tells the scsi/block mq layer to create 
>> num_queues multiqueue hw queues.
> 
> 
> If I read the code correctly, for modern device, guest will set 
> queue_enable for the queues that it wants to use. So in this ideal case, 
> qemu can forward them to VRING_ENABLE and reset VRING_ENABLE during 
> device reset.

I was thinking more you want an event like when a device/LUN is 
added/removed to a host. Instead of kicking off a device scan, you could 
call the block helper to remap queues. It would then not be too invasive 
to running IO. I'll look into reset some more.


> 
> But it would be complicated to support legacy device and qemu.
> 
> 
>>
>> ------
>>
>> I was trying to say in the previous email that is if all we do is set 
>> some bits to indicate the queue is disabled, free its resources, stop 
>> polling/queueing in the scsi/target layer, flush etc, it does not seem 
>> useful. I was trying to ask when would a user only want this behavior?
> 
> 
> I think it's device reset, the semantic is that unless the queue is 
> enabled, we should treat it as disabled.
> 

Ah ok. I I'll look into that some more. A funny thing is that I was 
trying to test that a while ago, but it wasn't helpful. I'm guessing it 
didn't work because it didn't implement what you wanted for disable 
right now :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-19 14:46       ` Michael S. Tsirkin
@ 2020-11-19 16:11         ` Mike Christie
  2020-11-19 16:24           ` Stefan Hajnoczi
  0 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-19 16:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stefan Hajnoczi
  Cc: qemu-devel, fam, linux-scsi, target-devel, jasowang, pbonzini,
	virtualization

On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
>>> My preference has been:
>>>
>>> 1. If we were to ditch cgroups, then add a new interface that would allow
>>> us to bind threads to a specific CPU, so that it lines up with the guest's
>>> mq to CPU mapping.
>>
>> A 1:1 vCPU/vq->CPU mapping isn't desirable in all cases.
>>
>> The CPU affinity is a userspace policy decision. The host kernel should
>> provide a mechanism but not the policy. That way userspace can decide
>> which workers are shared by multiple vqs and on which physical CPUs they
>> should run.
> 
> So if we let userspace dictate the threading policy then I think binding
> vqs to userspace threads and running there makes the most sense,
> no need to create the threads.
> 

Just to make sure I am on the same page, in one of the first postings of 
this set at the bottom of the mail:

https://www.spinics.net/lists/linux-scsi/msg148322.html

I asked about a new interface and had done something more like what 
Stefan posted:

   struct vhost_vq_worker_info {
       /*
        * The pid of an existing vhost worker that this vq will be
        * assigned to. When pid is 0 the virtqueue is assigned to the
        * default vhost worker. When pid is -1 a new worker thread is
        * created for this virtqueue. When pid is -2 the virtqueue's
        * worker thread is unchanged.
        *
        * If a vhost worker no longer has any virtqueues assigned to it
        * then it will terminate.
        *
        * The pid of the vhost worker is stored to this field when the
        * ioctl completes successfully. Use pid -2 to query the current
        * vhost worker pid.
        */
       __kernel_pid_t pid;  /* in/out */

       /* The virtqueue index*/
       unsigned int vq_idx; /* in */
   };

This approach is simple and it allowed me to have userspace map queues 
and threads optimally for our setups.

Note: Stefan, in response to your previous comment, I am just using my 
1:1 mapping as an example and would make it configurable from userspace.

In the email above are you guys suggesting to execute the SCSI/vhost 
requests in userspace? We should not do that because:

1. It negates part of what makes vhost fast where we do not have to kick 
out to userspace then back to the kernel.

2. It's not doable or becomes a crazy mess because vhost-scsi is tied to 
the scsi/target layer in the kernel. You can't process the scsi command 
in userspace since the scsi state machine and all its configuration info 
is in the kernel's scsi/target layer.

For example, I was just the maintainer of the target_core_user module 
that hooks into LIO/target on the backend (vhost-scsi hooks in on the 
front end) and passes commands to userspace and there we have a 
semi-shadow state machine. It gets nasty to try and maintain/sync state 
between lio/target core in the kernel and in userspace. We also see the 
perf loss I mentioned in #1.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-19 16:11         ` Mike Christie
@ 2020-11-19 16:24           ` Stefan Hajnoczi
  2020-11-19 16:43             ` Mike Christie
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-19 16:24 UTC (permalink / raw)
  To: Mike Christie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, fam, linux-scsi, Jason Wang,
	qemu-devel, Linux Virtualization, target-devel, Paolo Bonzini

On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
> > On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
> >>> My preference has been:
> >>>
> >>> 1. If we were to ditch cgroups, then add a new interface that would allow
> >>> us to bind threads to a specific CPU, so that it lines up with the guest's
> >>> mq to CPU mapping.
> >>
> >> A 1:1 vCPU/vq->CPU mapping isn't desirable in all cases.
> >>
> >> The CPU affinity is a userspace policy decision. The host kernel should
> >> provide a mechanism but not the policy. That way userspace can decide
> >> which workers are shared by multiple vqs and on which physical CPUs they
> >> should run.
> >
> > So if we let userspace dictate the threading policy then I think binding
> > vqs to userspace threads and running there makes the most sense,
> > no need to create the threads.
> >
>
> Just to make sure I am on the same page, in one of the first postings of
> this set at the bottom of the mail:
>
> https://www.spinics.net/lists/linux-scsi/msg148322.html
>
> I asked about a new interface and had done something more like what
> Stefan posted:
>
>    struct vhost_vq_worker_info {
>        /*
>         * The pid of an existing vhost worker that this vq will be
>         * assigned to. When pid is 0 the virtqueue is assigned to the
>         * default vhost worker. When pid is -1 a new worker thread is
>         * created for this virtqueue. When pid is -2 the virtqueue's
>         * worker thread is unchanged.
>         *
>         * If a vhost worker no longer has any virtqueues assigned to it
>         * then it will terminate.
>         *
>         * The pid of the vhost worker is stored to this field when the
>         * ioctl completes successfully. Use pid -2 to query the current
>         * vhost worker pid.
>         */
>        __kernel_pid_t pid;  /* in/out */
>
>        /* The virtqueue index*/
>        unsigned int vq_idx; /* in */
>    };
>
> This approach is simple and it allowed me to have userspace map queues
> and threads optimally for our setups.
>
> Note: Stefan, in response to your previous comment, I am just using my
> 1:1 mapping as an example and would make it configurable from userspace.
>
> In the email above are you guys suggesting to execute the SCSI/vhost
> requests in userspace? We should not do that because:
>
> 1. It negates part of what makes vhost fast where we do not have to kick
> out to userspace then back to the kernel.
>
> 2. It's not doable or becomes a crazy mess because vhost-scsi is tied to
> the scsi/target layer in the kernel. You can't process the scsi command
> in userspace since the scsi state machine and all its configuration info
> is in the kernel's scsi/target layer.
>
> For example, I was just the maintainer of the target_core_user module
> that hooks into LIO/target on the backend (vhost-scsi hooks in on the
> front end) and passes commands to userspace and there we have a
> semi-shadow state machine. It gets nasty to try and maintain/sync state
> between lio/target core in the kernel and in userspace. We also see the
> perf loss I mentioned in #1.

No, if I understand Michael correctly he has suggested a different approach.

My suggestion was that the kernel continues to manage the worker
threads but an ioctl allows userspace to control the policy.

I think Michael is saying that the kernel shouldn't manage/create
threads. Userspace should create threads and then invoke an ioctl from
those threads.

The ioctl will call into the vhost driver where it will execute
something similar to vhost_worker(). So this ioctl will block while
the kernel is using the thread to process vqs.

What isn't clear to me is how to tell the kernel which vqs are
processed by a thread. We could try to pass that information into the
ioctl. I'm not sure what the cleanest solution is here.

Maybe something like:

struct vhost_run_worker_info {
    struct timespec *timeout;
    sigset_t *sigmask;

    /* List of virtqueues to process */
    unsigned nvqs;
    unsigned vqs[];
};

/* This blocks until the timeout is reached, a signal is received, or
the vhost device is destroyed */
int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);

As you can see, userspace isn't involved with dealing with the
requests. It just acts as a thread donor to the vhost driver.

We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
penalty of switching into the kernel, copying in the arguments, etc.

Michael: is this the kind of thing you were thinking of?

Stefan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-19 16:24           ` Stefan Hajnoczi
@ 2020-11-19 16:43             ` Mike Christie
  2020-11-19 17:08               ` Stefan Hajnoczi
  0 siblings, 1 reply; 47+ messages in thread
From: Mike Christie @ 2020-11-19 16:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, fam, linux-scsi, Jason Wang,
	qemu-devel, Linux Virtualization, target-devel, Paolo Bonzini

On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
> On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
>>>>> My preference has been:
>>>>>
>>>>> 1. If we were to ditch cgroups, then add a new interface that would allow
>>>>> us to bind threads to a specific CPU, so that it lines up with the guest's
>>>>> mq to CPU mapping.
>>>>
>>>> A 1:1 vCPU/vq->CPU mapping isn't desirable in all cases.
>>>>
>>>> The CPU affinity is a userspace policy decision. The host kernel should
>>>> provide a mechanism but not the policy. That way userspace can decide
>>>> which workers are shared by multiple vqs and on which physical CPUs they
>>>> should run.
>>>
>>> So if we let userspace dictate the threading policy then I think binding
>>> vqs to userspace threads and running there makes the most sense,
>>> no need to create the threads.
>>>
>>
>> Just to make sure I am on the same page, in one of the first postings of
>> this set at the bottom of the mail:
>>
>> https://urldefense.com/v3/__https://www.spinics.net/lists/linux-scsi/msg148322.html__;!!GqivPVa7Brio!PdGIFdzqcAb6DW8twtjX3r7xOcM7XbTh7Ndkhxhb-1fV1VNB4lXjFzwFVE1zczUIE2Mp$
>>
>> I asked about a new interface and had done something more like what
>> Stefan posted:
>>
>>     struct vhost_vq_worker_info {
>>         /*
>>          * The pid of an existing vhost worker that this vq will be
>>          * assigned to. When pid is 0 the virtqueue is assigned to the
>>          * default vhost worker. When pid is -1 a new worker thread is
>>          * created for this virtqueue. When pid is -2 the virtqueue's
>>          * worker thread is unchanged.
>>          *
>>          * If a vhost worker no longer has any virtqueues assigned to it
>>          * then it will terminate.
>>          *
>>          * The pid of the vhost worker is stored to this field when the
>>          * ioctl completes successfully. Use pid -2 to query the current
>>          * vhost worker pid.
>>          */
>>         __kernel_pid_t pid;  /* in/out */
>>
>>         /* The virtqueue index*/
>>         unsigned int vq_idx; /* in */
>>     };
>>
>> This approach is simple and it allowed me to have userspace map queues
>> and threads optimally for our setups.
>>
>> Note: Stefan, in response to your previous comment, I am just using my
>> 1:1 mapping as an example and would make it configurable from userspace.
>>
>> In the email above are you guys suggesting to execute the SCSI/vhost
>> requests in userspace? We should not do that because:
>>
>> 1. It negates part of what makes vhost fast where we do not have to kick
>> out to userspace then back to the kernel.
>>
>> 2. It's not doable or becomes a crazy mess because vhost-scsi is tied to
>> the scsi/target layer in the kernel. You can't process the scsi command
>> in userspace since the scsi state machine and all its configuration info
>> is in the kernel's scsi/target layer.
>>
>> For example, I was just the maintainer of the target_core_user module
>> that hooks into LIO/target on the backend (vhost-scsi hooks in on the
>> front end) and passes commands to userspace and there we have a
>> semi-shadow state machine. It gets nasty to try and maintain/sync state
>> between lio/target core in the kernel and in userspace. We also see the
>> perf loss I mentioned in #1.
> 
> No, if I understand Michael correctly he has suggested a different approach.
> 
> My suggestion was that the kernel continues to manage the worker
> threads but an ioctl allows userspace to control the policy.
> 
> I think Michael is saying that the kernel shouldn't manage/create
> threads. Userspace should create threads and then invoke an ioctl from
> those threads.
> 
> The ioctl will call into the vhost driver where it will execute
> something similar to vhost_worker(). So this ioctl will block while
> the kernel is using the thread to process vqs.
> 
> What isn't clear to me is how to tell the kernel which vqs are
> processed by a thread. We could try to pass that information into the
> ioctl. I'm not sure what the cleanest solution is here.
> 
> Maybe something like:
> 
> struct vhost_run_worker_info {
>      struct timespec *timeout;
>      sigset_t *sigmask;
> 
>      /* List of virtqueues to process */
>      unsigned nvqs;
>      unsigned vqs[];
> };
> 
> /* This blocks until the timeout is reached, a signal is received, or
> the vhost device is destroyed */
> int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
> 
> As you can see, userspace isn't involved with dealing with the
> requests. It just acts as a thread donor to the vhost driver.
> 
> We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
> penalty of switching into the kernel, copying in the arguments, etc.

I didn't get this part. Why have the timeout? When the timeout expires, 
does userspace just call right back down to the kernel or does it do 
some sort of processing/operation?

You could have your worker function run from that ioctl wait for a 
signal or a wake up call from the vhost_work/poll functions.


> 
> Michael: is this the kind of thing you were thinking of?
> 
> Stefan
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-19 16:43             ` Mike Christie
@ 2020-11-19 17:08               ` Stefan Hajnoczi
  2020-11-20  8:45                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-19 17:08 UTC (permalink / raw)
  To: Mike Christie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, fam, linux-scsi, Jason Wang,
	qemu-devel, Linux Virtualization, target-devel, Paolo Bonzini

On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
> > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
> > <michael.christie@oracle.com> wrote:
> >>
> >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
> >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
> > struct vhost_run_worker_info {
> >      struct timespec *timeout;
> >      sigset_t *sigmask;
> >
> >      /* List of virtqueues to process */
> >      unsigned nvqs;
> >      unsigned vqs[];
> > };
> >
> > /* This blocks until the timeout is reached, a signal is received, or
> > the vhost device is destroyed */
> > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
> >
> > As you can see, userspace isn't involved with dealing with the
> > requests. It just acts as a thread donor to the vhost driver.
> >
> > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
> > penalty of switching into the kernel, copying in the arguments, etc.
>
> I didn't get this part. Why have the timeout? When the timeout expires,
> does userspace just call right back down to the kernel or does it do
> some sort of processing/operation?
>
> You could have your worker function run from that ioctl wait for a
> signal or a wake up call from the vhost_work/poll functions.

An optional timeout argument is common in blocking interfaces like
poll(2), recvmmsg(2), etc.

Although something can send a signal to the thread instead,
implementing that in an application is more awkward than passing a
struct timespec.

Compared to other blocking calls we don't expect
ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
rarely be used and can be dropped from the interface.

BTW the code I posted wasn't a carefully thought out proposal :). The
details still need to be considered and I'm going to be offline for
the next week so maybe someone else can think it through in the
meantime.

Stefan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-19 17:08               ` Stefan Hajnoczi
@ 2020-11-20  8:45                 ` Stefan Hajnoczi
  2020-11-20 12:31                   ` Michael S. Tsirkin
  2020-11-23 15:17                   ` Stefano Garzarella
  0 siblings, 2 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-11-20  8:45 UTC (permalink / raw)
  To: Mike Christie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, fam, linux-scsi, Jason Wang,
	qemu-devel, Linux Virtualization, target-devel, Paolo Bonzini,
	Stefano Garzarella

On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
> <michael.christie@oracle.com> wrote:
> >
> > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
> > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
> > > <michael.christie@oracle.com> wrote:
> > >>
> > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
> > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
> > > struct vhost_run_worker_info {
> > >      struct timespec *timeout;
> > >      sigset_t *sigmask;
> > >
> > >      /* List of virtqueues to process */
> > >      unsigned nvqs;
> > >      unsigned vqs[];
> > > };
> > >
> > > /* This blocks until the timeout is reached, a signal is received, or
> > > the vhost device is destroyed */
> > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
> > >
> > > As you can see, userspace isn't involved with dealing with the
> > > requests. It just acts as a thread donor to the vhost driver.
> > >
> > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
> > > penalty of switching into the kernel, copying in the arguments, etc.
> >
> > I didn't get this part. Why have the timeout? When the timeout expires,
> > does userspace just call right back down to the kernel or does it do
> > some sort of processing/operation?
> >
> > You could have your worker function run from that ioctl wait for a
> > signal or a wake up call from the vhost_work/poll functions.
>
> An optional timeout argument is common in blocking interfaces like
> poll(2), recvmmsg(2), etc.
>
> Although something can send a signal to the thread instead,
> implementing that in an application is more awkward than passing a
> struct timespec.
>
> Compared to other blocking calls we don't expect
> ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
> rarely be used and can be dropped from the interface.
>
> BTW the code I posted wasn't a carefully thought out proposal :). The
> details still need to be considered and I'm going to be offline for
> the next week so maybe someone else can think it through in the
> meantime.

One final thought before I'm offline for a week. If
ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance
then it's hard to support poll-mode (busy waiting) workers because
each device instance consumes a whole CPU. If we stick to an interface
where the kernel manages the worker threads then it's easier to share
workers between devices for polling.

I have CCed Stefano Garzarella, who is looking at similar designs for
vDPA software device implementations.

Stefan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-20  8:45                 ` Stefan Hajnoczi
@ 2020-11-20 12:31                   ` Michael S. Tsirkin
  2020-12-01 12:59                     ` Stefan Hajnoczi
  2020-11-23 15:17                   ` Stefano Garzarella
  1 sibling, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2020-11-20 12:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Mike Christie, Stefan Hajnoczi, fam, linux-scsi, Jason Wang,
	qemu-devel, Linux Virtualization, target-devel, Paolo Bonzini,
	Stefano Garzarella

On Fri, Nov 20, 2020 at 08:45:49AM +0000, Stefan Hajnoczi wrote:
> On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
> > <michael.christie@oracle.com> wrote:
> > >
> > > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
> > > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
> > > > <michael.christie@oracle.com> wrote:
> > > >>
> > > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
> > > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
> > > > struct vhost_run_worker_info {
> > > >      struct timespec *timeout;
> > > >      sigset_t *sigmask;
> > > >
> > > >      /* List of virtqueues to process */
> > > >      unsigned nvqs;
> > > >      unsigned vqs[];
> > > > };
> > > >
> > > > /* This blocks until the timeout is reached, a signal is received, or
> > > > the vhost device is destroyed */
> > > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
> > > >
> > > > As you can see, userspace isn't involved with dealing with the
> > > > requests. It just acts as a thread donor to the vhost driver.
> > > >
> > > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
> > > > penalty of switching into the kernel, copying in the arguments, etc.
> > >
> > > I didn't get this part. Why have the timeout? When the timeout expires,
> > > does userspace just call right back down to the kernel or does it do
> > > some sort of processing/operation?
> > >
> > > You could have your worker function run from that ioctl wait for a
> > > signal or a wake up call from the vhost_work/poll functions.
> >
> > An optional timeout argument is common in blocking interfaces like
> > poll(2), recvmmsg(2), etc.
> >
> > Although something can send a signal to the thread instead,
> > implementing that in an application is more awkward than passing a
> > struct timespec.
> >
> > Compared to other blocking calls we don't expect
> > ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
> > rarely be used and can be dropped from the interface.
> >
> > BTW the code I posted wasn't a carefully thought out proposal :). The
> > details still need to be considered and I'm going to be offline for
> > the next week so maybe someone else can think it through in the
> > meantime.
> 
> One final thought before I'm offline for a week. If
> ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance
> then it's hard to support poll-mode (busy waiting) workers because
> each device instance consumes a whole CPU. If we stick to an interface
> where the kernel manages the worker threads then it's easier to share
> workers between devices for polling.


Yes that is the reason vhost did its own reason in the first place.


I am vaguely thinking about poll(2) or a similar interface,
which can wait for an event on multiple FDs.


> I have CCed Stefano Garzarella, who is looking at similar designs for
> vDPA software device implementations.
> 
> Stefan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-20  8:45                 ` Stefan Hajnoczi
  2020-11-20 12:31                   ` Michael S. Tsirkin
@ 2020-11-23 15:17                   ` Stefano Garzarella
  1 sibling, 0 replies; 47+ messages in thread
From: Stefano Garzarella @ 2020-11-23 15:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Mike Christie, Michael S. Tsirkin, Stefan Hajnoczi, fam,
	linux-scsi, Jason Wang, qemu-devel, Linux Virtualization,
	target-devel, Paolo Bonzini

On Fri, Nov 20, 2020 at 08:45:49AM +0000, Stefan Hajnoczi wrote:
>On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
>> <michael.christie@oracle.com> wrote:
>> >
>> > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
>> > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
>> > > <michael.christie@oracle.com> wrote:
>> > >>
>> > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
>> > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
>> > > struct vhost_run_worker_info {
>> > >      struct timespec *timeout;
>> > >      sigset_t *sigmask;
>> > >
>> > >      /* List of virtqueues to process */
>> > >      unsigned nvqs;
>> > >      unsigned vqs[];
>> > > };
>> > >
>> > > /* This blocks until the timeout is reached, a signal is received, or
>> > > the vhost device is destroyed */
>> > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
>> > >
>> > > As you can see, userspace isn't involved with dealing with the
>> > > requests. It just acts as a thread donor to the vhost driver.
>> > >
>> > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
>> > > penalty of switching into the kernel, copying in the arguments, etc.
>> >
>> > I didn't get this part. Why have the timeout? When the timeout expires,
>> > does userspace just call right back down to the kernel or does it do
>> > some sort of processing/operation?
>> >
>> > You could have your worker function run from that ioctl wait for a
>> > signal or a wake up call from the vhost_work/poll functions.
>>
>> An optional timeout argument is common in blocking interfaces like
>> poll(2), recvmmsg(2), etc.
>>
>> Although something can send a signal to the thread instead,
>> implementing that in an application is more awkward than passing a
>> struct timespec.
>>
>> Compared to other blocking calls we don't expect
>> ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
>> rarely be used and can be dropped from the interface.
>>
>> BTW the code I posted wasn't a carefully thought out proposal :). The
>> details still need to be considered and I'm going to be offline for
>> the next week so maybe someone else can think it through in the
>> meantime.
>
>One final thought before I'm offline for a week. If
>ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance
>then it's hard to support poll-mode (busy waiting) workers because
>each device instance consumes a whole CPU. If we stick to an interface
>where the kernel manages the worker threads then it's easier to share
>workers between devices for polling.

Agree, ioctl(VHOST_RUN_WORKER) is interesting and perhaps simplifies 
thread management (pinning, etc.), but with kthread would be easier to 
implement polling sharing worker with multiple devices.

>
>I have CCed Stefano Garzarella, who is looking at similar designs for
>vDPA software device implementations.

Thanks, Mike please can you keep me in CC for this work?

It's really interesting since I'll have similar issues to solve with 
vDPA software device.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-11-20 12:31                   ` Michael S. Tsirkin
@ 2020-12-01 12:59                     ` Stefan Hajnoczi
  2020-12-01 13:45                       ` Stefano Garzarella
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-12-01 12:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Mike Christie, fam, linux-scsi, Jason Wang,
	qemu-devel, Linux Virtualization, target-devel, Paolo Bonzini,
	Stefano Garzarella


[-- Attachment #1: Type: text/plain, Size: 4713 bytes --]

On Fri, Nov 20, 2020 at 07:31:08AM -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 20, 2020 at 08:45:49AM +0000, Stefan Hajnoczi wrote:
> > On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
> > > <michael.christie@oracle.com> wrote:
> > > >
> > > > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
> > > > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
> > > > > <michael.christie@oracle.com> wrote:
> > > > >>
> > > > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
> > > > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
> > > > > struct vhost_run_worker_info {
> > > > >      struct timespec *timeout;
> > > > >      sigset_t *sigmask;
> > > > >
> > > > >      /* List of virtqueues to process */
> > > > >      unsigned nvqs;
> > > > >      unsigned vqs[];
> > > > > };
> > > > >
> > > > > /* This blocks until the timeout is reached, a signal is received, or
> > > > > the vhost device is destroyed */
> > > > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
> > > > >
> > > > > As you can see, userspace isn't involved with dealing with the
> > > > > requests. It just acts as a thread donor to the vhost driver.
> > > > >
> > > > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
> > > > > penalty of switching into the kernel, copying in the arguments, etc.
> > > >
> > > > I didn't get this part. Why have the timeout? When the timeout expires,
> > > > does userspace just call right back down to the kernel or does it do
> > > > some sort of processing/operation?
> > > >
> > > > You could have your worker function run from that ioctl wait for a
> > > > signal or a wake up call from the vhost_work/poll functions.
> > >
> > > An optional timeout argument is common in blocking interfaces like
> > > poll(2), recvmmsg(2), etc.
> > >
> > > Although something can send a signal to the thread instead,
> > > implementing that in an application is more awkward than passing a
> > > struct timespec.
> > >
> > > Compared to other blocking calls we don't expect
> > > ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
> > > rarely be used and can be dropped from the interface.
> > >
> > > BTW the code I posted wasn't a carefully thought out proposal :). The
> > > details still need to be considered and I'm going to be offline for
> > > the next week so maybe someone else can think it through in the
> > > meantime.
> > 
> > One final thought before I'm offline for a week. If
> > ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance
> > then it's hard to support poll-mode (busy waiting) workers because
> > each device instance consumes a whole CPU. If we stick to an interface
> > where the kernel manages the worker threads then it's easier to share
> > workers between devices for polling.
> 
> 
> Yes that is the reason vhost did its own reason in the first place.
> 
> 
> I am vaguely thinking about poll(2) or a similar interface,
> which can wait for an event on multiple FDs.

I can imagine how using poll(2) would work from a userspace perspective,
but on the kernel side I don't think it can be implemented cleanly.
poll(2) is tied to the file_operations->poll() callback and
read/write/error events. Not to mention there isn't a way to substitue
the vhost worker thread function instead of scheduling out the current
thread while waiting for poll fd events.

But maybe ioctl(VHOST_WORKER_RUN) can do it:

  struct vhost_run_worker_dev {
      int vhostfd;      /* /dev/vhost-TYPE fd */
      unsigned nvqs;    /* number of virtqueues in vqs[] */
      unsigned vqs[];   /* virtqueues to process */
  };

  struct vhost_run_worker_info {
       struct timespec *timeout;
       sigset_t *sigmask;

       unsigned ndevices;
       struct vhost_run_worker_dev *devices[];
  };

In the simple case userspace sets ndevices to 1 and we just handle
virtqueues for the current device.

In the fancier shared worker thread case the userspace process has the
vhost fds of all the devices it is processing and passes them to
ioctl(VHOST_WORKER_RUN) via struct vhost_run_worker_dev elements.

From a security perspective it means the userspace thread has access to
all vhost devices (because it has their fds).

I'm not sure how the mm is supposed to work. The devices might be
associated with different userspace processes (guests) and therefore
have different virtual memory.

Just wanted to push this discussion along a little further. I'm buried
under emails and probably wont be very active over the next few days.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-12-01 12:59                     ` Stefan Hajnoczi
@ 2020-12-01 13:45                       ` Stefano Garzarella
  2020-12-01 17:43                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 47+ messages in thread
From: Stefano Garzarella @ 2020-12-01 13:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Mike Christie, fam,
	linux-scsi, Jason Wang, qemu-devel, Linux Virtualization,
	target-devel, Paolo Bonzini

On Tue, Dec 01, 2020 at 12:59:43PM +0000, Stefan Hajnoczi wrote:
>On Fri, Nov 20, 2020 at 07:31:08AM -0500, Michael S. Tsirkin wrote:
>> On Fri, Nov 20, 2020 at 08:45:49AM +0000, Stefan Hajnoczi wrote:
>> > On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > >
>> > > On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
>> > > <michael.christie@oracle.com> wrote:
>> > > >
>> > > > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
>> > > > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
>> > > > > <michael.christie@oracle.com> wrote:
>> > > > >>
>> > > > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
>> > > > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
>> > > > > struct vhost_run_worker_info {
>> > > > >      struct timespec *timeout;
>> > > > >      sigset_t *sigmask;
>> > > > >
>> > > > >      /* List of virtqueues to process */
>> > > > >      unsigned nvqs;
>> > > > >      unsigned vqs[];
>> > > > > };
>> > > > >
>> > > > > /* This blocks until the timeout is reached, a signal is received, or
>> > > > > the vhost device is destroyed */
>> > > > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
>> > > > >
>> > > > > As you can see, userspace isn't involved with dealing with the
>> > > > > requests. It just acts as a thread donor to the vhost driver.
>> > > > >
>> > > > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
>> > > > > penalty of switching into the kernel, copying in the arguments, etc.
>> > > >
>> > > > I didn't get this part. Why have the timeout? When the timeout expires,
>> > > > does userspace just call right back down to the kernel or does it do
>> > > > some sort of processing/operation?
>> > > >
>> > > > You could have your worker function run from that ioctl wait for a
>> > > > signal or a wake up call from the vhost_work/poll functions.
>> > >
>> > > An optional timeout argument is common in blocking interfaces like
>> > > poll(2), recvmmsg(2), etc.
>> > >
>> > > Although something can send a signal to the thread instead,
>> > > implementing that in an application is more awkward than passing a
>> > > struct timespec.
>> > >
>> > > Compared to other blocking calls we don't expect
>> > > ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
>> > > rarely be used and can be dropped from the interface.
>> > >
>> > > BTW the code I posted wasn't a carefully thought out proposal :). The
>> > > details still need to be considered and I'm going to be offline for
>> > > the next week so maybe someone else can think it through in the
>> > > meantime.
>> >
>> > One final thought before I'm offline for a week. If
>> > ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance
>> > then it's hard to support poll-mode (busy waiting) workers because
>> > each device instance consumes a whole CPU. If we stick to an interface
>> > where the kernel manages the worker threads then it's easier to share
>> > workers between devices for polling.
>>
>>
>> Yes that is the reason vhost did its own reason in the first place.
>>
>>
>> I am vaguely thinking about poll(2) or a similar interface,
>> which can wait for an event on multiple FDs.
>
>I can imagine how using poll(2) would work from a userspace perspective,
>but on the kernel side I don't think it can be implemented cleanly.
>poll(2) is tied to the file_operations->poll() callback and
>read/write/error events. Not to mention there isn't a way to substitue
>the vhost worker thread function instead of scheduling out the current
>thread while waiting for poll fd events.
>
>But maybe ioctl(VHOST_WORKER_RUN) can do it:
>
>  struct vhost_run_worker_dev {
>      int vhostfd;      /* /dev/vhost-TYPE fd */
>      unsigned nvqs;    /* number of virtqueues in vqs[] */
>      unsigned vqs[];   /* virtqueues to process */
>  };
>
>  struct vhost_run_worker_info {
>       struct timespec *timeout;
>       sigset_t *sigmask;
>
>       unsigned ndevices;
>       struct vhost_run_worker_dev *devices[];
>  };
>
>In the simple case userspace sets ndevices to 1 and we just handle
>virtqueues for the current device.
>
>In the fancier shared worker thread case the userspace process has the
>vhost fds of all the devices it is processing and passes them to
>ioctl(VHOST_WORKER_RUN) via struct vhost_run_worker_dev elements.

Which fd will be used for this IOCTL? One of the 'vhostfd' or we should 
create a new /dev/vhost-workers (or something similar)?

Maybe the new device will be cleaner and can be reused also for other 
stuff (I'm thinking about vDPA software devices).

>
>From a security perspective it means the userspace thread has access to
>all vhost devices (because it has their fds).
>
>I'm not sure how the mm is supposed to work. The devices might be
>associated with different userspace processes (guests) and therefore
>have different virtual memory.

Maybe in this case we should do something similar to io_uring SQPOLL 
kthread where kthread_use_mm()/kthread_unuse_mm() is used to switch 
virtual memory spaces.

After writing, I saw that we already do it this in the vhost_worker() in 
drivers/vhost/vhost.c

>
>Just wanted to push this discussion along a little further. I'm buried
>under emails and probably wont be very active over the next few days.
>

I think ioctl(VHOST_WORKER_RUN) might be the right way and also maybe 
the least difficult one.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-12-01 13:45                       ` Stefano Garzarella
@ 2020-12-01 17:43                         ` Stefan Hajnoczi
  2020-12-02 10:35                           ` Stefano Garzarella
  0 siblings, 1 reply; 47+ messages in thread
From: Stefan Hajnoczi @ 2020-12-01 17:43 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Mike Christie, fam,
	linux-scsi, Jason Wang, qemu-devel, Linux Virtualization,
	target-devel, Paolo Bonzini


[-- Attachment #1: Type: text/plain, Size: 6140 bytes --]

On Tue, Dec 01, 2020 at 02:45:18PM +0100, Stefano Garzarella wrote:
> On Tue, Dec 01, 2020 at 12:59:43PM +0000, Stefan Hajnoczi wrote:
> > On Fri, Nov 20, 2020 at 07:31:08AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Nov 20, 2020 at 08:45:49AM +0000, Stefan Hajnoczi wrote:
> > > > On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > >
> > > > > On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
> > > > > <michael.christie@oracle.com> wrote:
> > > > > >
> > > > > > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
> > > > > > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
> > > > > > > <michael.christie@oracle.com> wrote:
> > > > > > >>
> > > > > > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
> > > > > > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
> > > > > > > struct vhost_run_worker_info {
> > > > > > >      struct timespec *timeout;
> > > > > > >      sigset_t *sigmask;
> > > > > > >
> > > > > > >      /* List of virtqueues to process */
> > > > > > >      unsigned nvqs;
> > > > > > >      unsigned vqs[];
> > > > > > > };
> > > > > > >
> > > > > > > /* This blocks until the timeout is reached, a signal is received, or
> > > > > > > the vhost device is destroyed */
> > > > > > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
> > > > > > >
> > > > > > > As you can see, userspace isn't involved with dealing with the
> > > > > > > requests. It just acts as a thread donor to the vhost driver.
> > > > > > >
> > > > > > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
> > > > > > > penalty of switching into the kernel, copying in the arguments, etc.
> > > > > >
> > > > > > I didn't get this part. Why have the timeout? When the timeout expires,
> > > > > > does userspace just call right back down to the kernel or does it do
> > > > > > some sort of processing/operation?
> > > > > >
> > > > > > You could have your worker function run from that ioctl wait for a
> > > > > > signal or a wake up call from the vhost_work/poll functions.
> > > > >
> > > > > An optional timeout argument is common in blocking interfaces like
> > > > > poll(2), recvmmsg(2), etc.
> > > > >
> > > > > Although something can send a signal to the thread instead,
> > > > > implementing that in an application is more awkward than passing a
> > > > > struct timespec.
> > > > >
> > > > > Compared to other blocking calls we don't expect
> > > > > ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
> > > > > rarely be used and can be dropped from the interface.
> > > > >
> > > > > BTW the code I posted wasn't a carefully thought out proposal :). The
> > > > > details still need to be considered and I'm going to be offline for
> > > > > the next week so maybe someone else can think it through in the
> > > > > meantime.
> > > >
> > > > One final thought before I'm offline for a week. If
> > > > ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance
> > > > then it's hard to support poll-mode (busy waiting) workers because
> > > > each device instance consumes a whole CPU. If we stick to an interface
> > > > where the kernel manages the worker threads then it's easier to share
> > > > workers between devices for polling.
> > > 
> > > 
> > > Yes that is the reason vhost did its own reason in the first place.
> > > 
> > > 
> > > I am vaguely thinking about poll(2) or a similar interface,
> > > which can wait for an event on multiple FDs.
> > 
> > I can imagine how using poll(2) would work from a userspace perspective,
> > but on the kernel side I don't think it can be implemented cleanly.
> > poll(2) is tied to the file_operations->poll() callback and
> > read/write/error events. Not to mention there isn't a way to substitue
> > the vhost worker thread function instead of scheduling out the current
> > thread while waiting for poll fd events.
> > 
> > But maybe ioctl(VHOST_WORKER_RUN) can do it:
> > 
> >  struct vhost_run_worker_dev {
> >      int vhostfd;      /* /dev/vhost-TYPE fd */
> >      unsigned nvqs;    /* number of virtqueues in vqs[] */
> >      unsigned vqs[];   /* virtqueues to process */
> >  };
> > 
> >  struct vhost_run_worker_info {
> >       struct timespec *timeout;
> >       sigset_t *sigmask;
> > 
> >       unsigned ndevices;
> >       struct vhost_run_worker_dev *devices[];
> >  };
> > 
> > In the simple case userspace sets ndevices to 1 and we just handle
> > virtqueues for the current device.
> > 
> > In the fancier shared worker thread case the userspace process has the
> > vhost fds of all the devices it is processing and passes them to
> > ioctl(VHOST_WORKER_RUN) via struct vhost_run_worker_dev elements.
> 
> Which fd will be used for this IOCTL? One of the 'vhostfd' or we should
> create a new /dev/vhost-workers (or something similar)?
> 
> Maybe the new device will be cleaner and can be reused also for other stuff
> (I'm thinking about vDPA software devices).
> 
> > 
> > From a security perspective it means the userspace thread has access to
> > all vhost devices (because it has their fds).
> > 
> > I'm not sure how the mm is supposed to work. The devices might be
> > associated with different userspace processes (guests) and therefore
> > have different virtual memory.
> 
> Maybe in this case we should do something similar to io_uring SQPOLL kthread
> where kthread_use_mm()/kthread_unuse_mm() is used to switch virtual memory
> spaces.
> 
> After writing, I saw that we already do it this in the vhost_worker() in
> drivers/vhost/vhost.c
> 
> > 
> > Just wanted to push this discussion along a little further. I'm buried
> > under emails and probably wont be very active over the next few days.
> > 
> 
> I think ioctl(VHOST_WORKER_RUN) might be the right way and also maybe the
> least difficult one.

Sending an ioctl API proposal email could help progress this discussion.

Interesting questions:
1. How to specify which virtqueues to process (Mike's use case)?
2. How to process multiple devices?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support
  2020-11-12 23:19 ` [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support Mike Christie
  2020-11-17 11:53   ` Stefan Hajnoczi
@ 2020-12-02  9:59   ` Michael S. Tsirkin
  2020-12-02 16:05     ` Michael Christie
  1 sibling, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02  9:59 UTC (permalink / raw)
  To: Mike Christie
  Cc: stefanha, qemu-devel, fam, linux-scsi, target-devel, jasowang,
	pbonzini, virtualization

On Thu, Nov 12, 2020 at 05:19:00PM -0600, Mike Christie wrote:
> diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
> index 7523218..98dd919 100644
> --- a/linux-headers/linux/vhost.h
> +++ b/linux-headers/linux/vhost.h
> @@ -70,6 +70,7 @@
>  #define VHOST_VRING_BIG_ENDIAN 1
>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +#define VHOST_SET_VRING_ENABLE _IOW(VHOST_VIRTIO, 0x15, struct vhost_vring_state)

OK so first we need the kernel patches, then update the header, then
we can apply the qemu patch.

>  /* The following ioctls use eventfd file descriptors to signal and poll
>   * for events. */
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] vhost/qemu: thread per IO SCSI vq
  2020-12-01 17:43                         ` Stefan Hajnoczi
@ 2020-12-02 10:35                           ` Stefano Garzarella
  0 siblings, 0 replies; 47+ messages in thread
From: Stefano Garzarella @ 2020-12-02 10:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Mike Christie, fam,
	linux-scsi, Jason Wang, qemu-devel, Linux Virtualization,
	target-devel, Paolo Bonzini

On Tue, Dec 01, 2020 at 05:43:38PM +0000, Stefan Hajnoczi wrote:
>On Tue, Dec 01, 2020 at 02:45:18PM +0100, Stefano Garzarella wrote:
>> On Tue, Dec 01, 2020 at 12:59:43PM +0000, Stefan Hajnoczi wrote:
>> > On Fri, Nov 20, 2020 at 07:31:08AM -0500, Michael S. Tsirkin wrote:
>> > > On Fri, Nov 20, 2020 at 08:45:49AM +0000, Stefan Hajnoczi wrote:
>> > > > On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > > > >
>> > > > > On Thu, Nov 19, 2020 at 4:43 PM Mike Christie
>> > > > > <michael.christie@oracle.com> wrote:
>> > > > > >
>> > > > > > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote:
>> > > > > > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie
>> > > > > > > <michael.christie@oracle.com> wrote:
>> > > > > > >>
>> > > > > > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote:
>> > > > > > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote:
>> > > > > > > struct vhost_run_worker_info {
>> > > > > > >      struct timespec *timeout;
>> > > > > > >      sigset_t *sigmask;
>> > > > > > >
>> > > > > > >      /* List of virtqueues to process */
>> > > > > > >      unsigned nvqs;
>> > > > > > >      unsigned vqs[];
>> > > > > > > };
>> > > > > > >
>> > > > > > > /* This blocks until the timeout is reached, a signal is received, or
>> > > > > > > the vhost device is destroyed */
>> > > > > > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info);
>> > > > > > >
>> > > > > > > As you can see, userspace isn't involved with dealing with the
>> > > > > > > requests. It just acts as a thread donor to the vhost driver.
>> > > > > > >
>> > > > > > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the
>> > > > > > > penalty of switching into the kernel, copying in the arguments, etc.
>> > > > > >
>> > > > > > I didn't get this part. Why have the timeout? When the timeout expires,
>> > > > > > does userspace just call right back down to the kernel or does it do
>> > > > > > some sort of processing/operation?
>> > > > > >
>> > > > > > You could have your worker function run from that ioctl wait for a
>> > > > > > signal or a wake up call from the vhost_work/poll functions.
>> > > > >
>> > > > > An optional timeout argument is common in blocking interfaces like
>> > > > > poll(2), recvmmsg(2), etc.
>> > > > >
>> > > > > Although something can send a signal to the thread instead,
>> > > > > implementing that in an application is more awkward than passing a
>> > > > > struct timespec.
>> > > > >
>> > > > > Compared to other blocking calls we don't expect
>> > > > > ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will
>> > > > > rarely be used and can be dropped from the interface.
>> > > > >
>> > > > > BTW the code I posted wasn't a carefully thought out proposal 
>> > > > > :). The
>> > > > > details still need to be considered and I'm going to be offline for
>> > > > > the next week so maybe someone else can think it through in the
>> > > > > meantime.
>> > > >
>> > > > One final thought before I'm offline for a week. If
>> > > > ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance
>> > > > then it's hard to support poll-mode (busy waiting) workers because
>> > > > each device instance consumes a whole CPU. If we stick to an interface
>> > > > where the kernel manages the worker threads then it's easier to 
>> > > > share
>> > > > workers between devices for polling.
>> > >
>> > >
>> > > Yes that is the reason vhost did its own reason in the first place.
>> > >
>> > >
>> > > I am vaguely thinking about poll(2) or a similar interface,
>> > > which can wait for an event on multiple FDs.
>> >
>> > I can imagine how using poll(2) would work from a userspace perspective,
>> > but on the kernel side I don't think it can be implemented cleanly.
>> > poll(2) is tied to the file_operations->poll() callback and
>> > read/write/error events. Not to mention there isn't a way to substitue
>> > the vhost worker thread function instead of scheduling out the current
>> > thread while waiting for poll fd events.
>> >
>> > But maybe ioctl(VHOST_WORKER_RUN) can do it:
>> >
>> >  struct vhost_run_worker_dev {
>> >      int vhostfd;      /* /dev/vhost-TYPE fd */
>> >      unsigned nvqs;    /* number of virtqueues in vqs[] */
>> >      unsigned vqs[];   /* virtqueues to process */
>> >  };
>> >
>> >  struct vhost_run_worker_info {
>> >       struct timespec *timeout;
>> >       sigset_t *sigmask;
>> >
>> >       unsigned ndevices;
>> >       struct vhost_run_worker_dev *devices[];
>> >  };
>> >
>> > In the simple case userspace sets ndevices to 1 and we just handle
>> > virtqueues for the current device.
>> >
>> > In the fancier shared worker thread case the userspace process has the
>> > vhost fds of all the devices it is processing and passes them to
>> > ioctl(VHOST_WORKER_RUN) via struct vhost_run_worker_dev elements.
>>
>> Which fd will be used for this IOCTL? One of the 'vhostfd' or we should
>> create a new /dev/vhost-workers (or something similar)?
>>
>> Maybe the new device will be cleaner and can be reused also for other stuff
>> (I'm thinking about vDPA software devices).
>>
>> >
>> > From a security perspective it means the userspace thread has access to
>> > all vhost devices (because it has their fds).
>> >
>> > I'm not sure how the mm is supposed to work. The devices might be
>> > associated with different userspace processes (guests) and therefore
>> > have different virtual memory.
>>
>> Maybe in this case we should do something similar to io_uring SQPOLL kthread
>> where kthread_use_mm()/kthread_unuse_mm() is used to switch virtual memory
>> spaces.
>>
>> After writing, I saw that we already do it this in the vhost_worker() in
>> drivers/vhost/vhost.c
>>
>> >
>> > Just wanted to push this discussion along a little further. I'm buried
>> > under emails and probably wont be very active over the next few days.
>> >
>>
>> I think ioctl(VHOST_WORKER_RUN) might be the right way and also maybe the
>> least difficult one.
>
>Sending an ioctl API proposal email could help progress this 
>discussion.
>
>Interesting questions:
>1. How to specify which virtqueues to process (Mike's use case)?
>2. How to process multiple devices?
>

Okay, I'll try to prepare a tentative proposal next week with that 
questions in mind :-)

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support
  2020-12-02  9:59   ` Michael S. Tsirkin
@ 2020-12-02 16:05     ` Michael Christie
  0 siblings, 0 replies; 47+ messages in thread
From: Michael Christie @ 2020-12-02 16:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: stefanha, qemu-devel, fam, linux-scsi, target-devel, jasowang,
	pbonzini, virtualization



> On Dec 2, 2020, at 3:59 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Nov 12, 2020 at 05:19:00PM -0600, Mike Christie wrote:
>> diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
>> index 7523218..98dd919 100644
>> --- a/linux-headers/linux/vhost.h
>> +++ b/linux-headers/linux/vhost.h
>> @@ -70,6 +70,7 @@
>> #define VHOST_VRING_BIG_ENDIAN 1
>> #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>> #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
>> +#define VHOST_SET_VRING_ENABLE _IOW(VHOST_VIRTIO, 0x15, struct vhost_vring_state)
> 
> OK so first we need the kernel patches, then update the header, then
> we can apply the qemu patch.

Hey Michael,

Don’t waste any more of your time on this patch and the kernel related ones.

I’m working on the userspace initiated threading approach discussed the other week.

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, back to index

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-12 23:18 [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Mike Christie
2020-11-12 23:19 ` [PATCH 1/1] qemu vhost scsi: add VHOST_SET_VRING_ENABLE support Mike Christie
2020-11-17 11:53   ` Stefan Hajnoczi
2020-12-02  9:59   ` Michael S. Tsirkin
2020-12-02 16:05     ` Michael Christie
2020-11-12 23:19 ` [PATCH 01/10] vhost: remove work arg from vhost_work_flush Mike Christie
2020-11-17 13:04   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 02/10] vhost scsi: remove extra flushes Mike Christie
2020-11-17 13:07   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 03/10] vhost poll: fix coding style Mike Christie
2020-11-17 13:07   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 04/10] vhost: support multiple worker threads Mike Christie
2020-11-12 23:19 ` [PATCH 05/10] vhost: poll support support multiple workers Mike Christie
2020-11-17 15:32   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 06/10] vhost scsi: make SCSI cmd completion per vq Mike Christie
2020-11-17 16:04   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 07/10] vhost, vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
2020-11-17 16:05   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 08/10] vhost: move msg_handler to new ops struct Mike Christie
2020-11-17 16:08   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 09/10] vhost: add VHOST_SET_VRING_ENABLE support Mike Christie
2020-11-17 16:14   ` Stefan Hajnoczi
2020-11-12 23:19 ` [PATCH 10/10] vhost-scsi: create a woker per IO vq Mike Christie
2020-11-17 16:40 ` [PATCH 00/10] vhost/qemu: thread per IO SCSI vq Stefan Hajnoczi
2020-11-17 19:13   ` Mike Christie
2020-11-18  9:54     ` Michael S. Tsirkin
2020-11-19 14:00       ` Stefan Hajnoczi
2020-11-18 11:31     ` Stefan Hajnoczi
2020-11-19 14:46       ` Michael S. Tsirkin
2020-11-19 16:11         ` Mike Christie
2020-11-19 16:24           ` Stefan Hajnoczi
2020-11-19 16:43             ` Mike Christie
2020-11-19 17:08               ` Stefan Hajnoczi
2020-11-20  8:45                 ` Stefan Hajnoczi
2020-11-20 12:31                   ` Michael S. Tsirkin
2020-12-01 12:59                     ` Stefan Hajnoczi
2020-12-01 13:45                       ` Stefano Garzarella
2020-12-01 17:43                         ` Stefan Hajnoczi
2020-12-02 10:35                           ` Stefano Garzarella
2020-11-23 15:17                   ` Stefano Garzarella
2020-11-18  5:17   ` Jason Wang
2020-11-18  6:57     ` Mike Christie
2020-11-18  7:19       ` Mike Christie
2020-11-18  7:54       ` Jason Wang
2020-11-18 20:06         ` Mike Christie
2020-11-19  4:35           ` Jason Wang
2020-11-19 15:49             ` Mike Christie

Target-devel archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/target-devel/0 target-devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 target-devel target-devel/ https://lore.kernel.org/target-devel \
		target-devel@vger.kernel.org
	public-inbox-index target-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.target-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git