* vhost: multiple worker support
@ 2021-05-25 18:05 Mike Christie
  2021-05-25 18:05 ` [PATCH 1/9] vhost: move worker thread fields to new struct Mike Christie
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization

The following patches apply over Linus's tree or mst's vhost branch
and my cleanup patchset:

https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html

These patches allow us to support multiple vhost workers per device. I
ended up just doing Stefan's original idea where userspace has the
kernel create a worker and we pass back the pid. Compared to the
workqueue and userspace-thread approaches, this has the benefit that
there is only one-ish code path in the kernel during setup to detect
old tools. The main IO paths and the device/vq setup/teardown paths all
use common code.

The kernel patches here then allow us to run N workers per device and
also to share workers across devices.

I've also included a patch for qemu so you can get an idea of how it
works. If we are ok with the kernel code then I'll break the qemu patch
up into a patchset and send it to qemu-devel.
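
For reference, here is a rough sketch of the userspace flow with the
new ioctl from patch 7. This is not the actual qemu code; it assumes
<linux/vhost.h> and <sys/ioctl.h> are included and that vhost_fd is an
already-open vhost device fd:

    struct vhost_vring_worker w = {
        .index = 0,                     /* first virtqueue */
        .pid = VHOST_VRING_NEW_WORKER,  /* ask the kernel to create a worker */
    };

    /* after VHOST_SET_OWNER and before the vq's kick fd is set up */
    if (ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w) == -1)
        return -errno;
    /* w.pid now holds the pid of the new vhost worker thread */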

Results:
--------
When running with the null_blk driver and vhost-scsi I can get 1.2
million IOPs by just running a simple fio job:

fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k --ioengine=libaio
--iodepth=128  --numjobs=8 --time_based --group_reporting --name=iops
--runtime=60 --eta-newline=1

The VM has 8 vCPUs and sda has 8 virtqueues, and we can do a total of
1024 cmds per device. To get 1.2 million IOPs I did have to tune the
setup and run the virsh emulatorpin command so the vhost threads were
running on different CPUs than the VM. If the vhost threads share CPUs
with the VM then I get around 800K.
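
For example, the pinning was done with something along these lines
(the domain name and CPU ranges are placeholders and depend on the
host topology):

    # keep the emulator/vhost threads off the CPUs the vCPUs use
    virsh emulatorpin <domain> 8-15 --live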

For a more realistic device that is also a CPU hog, like iscsi, I can
still get 1 million IOPs using 1 dm-multipath device over 8 iscsi paths
(natively it gets 1.1 million IOPs).

Results/TODO Note:

- I ported the vdpa sim code to support multiple workers and, as-is, it
made perf much worse. If I increase vdpa_sim_blk's num queues to 4-8 I
get 700K IOPs with the fio command above, but with the multiple worker
support it drops to 400K. The problem is the vdpa_sim lock and the
iommu_lock. If I hack around them (for example, comment out the locks
and ignore data corruption or crashes) then I can get around 1.2M -
1.6M IOPs with 8 queues and the fio command above.

So these patches could help other drivers, but it will just take more
work to remove those types of locks. I was hoping the 2 items could be
done independently since this helps vhost-scsi immediately.


TODO:
- Stefano has 2 open questions: whether there are security issues with
passing the pid back to userspace, and whether we should add a feature
bit. We are waiting to hear back from the list.

v2:
- changed the loop in which we take a refcount on the worker
- replaced pid == -1 with a define.
- fixed tabbing/spacing coding style issues.
- use a hash instead of a list to look up workers.
- dropped the patch that added an ioctl cmd to get a vq's worker's pid,
since we might do a generic netlink interface instead.





* [PATCH 1/9] vhost: move worker thread fields to new struct
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 10:16   ` Stefan Hajnoczi
  2021-05-25 18:05 ` [PATCH 2/9] vhost: move vhost worker creation to kick setup Mike Christie
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

This is just a prep patch. It moves the worker-related fields to a new
vhost_worker struct and moves code around to create some helpers that
will be used in the next patches.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.h |  9 ++++-
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b9e853e6094d..0cd19b1a832e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,59 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -595,31 +644,18 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
 	dev->kcov_handle = 0;
@@ -712,13 +748,9 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
+	dev->kcov_handle = 0;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 7d5306d1229d..bfc4563e612f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,12 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+struct vhost_worker {
+	struct task_struct	*task;
+	struct llist_head	work_list;
+	struct vhost_dev	*dev;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -149,8 +155,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
-- 
2.25.1



* [PATCH 2/9] vhost: move vhost worker creation to kick setup
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
  2021-05-25 18:05 ` [PATCH 1/9] vhost: move worker thread fields to new struct Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 10:28   ` Stefan Hajnoczi
  2021-05-25 18:05 ` [PATCH 3/9] vhost: modify internal functions to take a vhost_worker Mike Christie
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

The next patch will add new ioctls that allow userspace to create workers
and bind them to devs and vqs after VHOST_SET_OWNER. To support older
tools, newer tools that want to go wild with worker threads, and newer
tools that want the old/default behavior, this patch moves the default
worker setup to the kick setup.

After the first vq's kick/poll setup is done we could start to get work
queued, so this is the point when we must have a worker set up. If we are
using older tools, or the newer tools just want the default single vhost
thread per dev behavior, then it will automatically be done here. If the
tools are using the newer ioctls and have already created the workers,
then we detect that here and do nothing.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0cd19b1a832e..a291cde95c43 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -629,6 +629,15 @@ static int vhost_worker_create(struct vhost_dev *dev)
 	return ret;
 }
 
+/* Caller must have device mutex */
+static int vhost_worker_try_create_def(struct vhost_dev *dev)
+{
+	if (!dev->use_worker || dev->worker)
+		return 0;
+
+	return vhost_worker_create(dev);
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -643,11 +652,6 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	vhost_attach_mm(dev);
 
 	dev->kcov_handle = kcov_common_handle();
-	if (dev->use_worker) {
-		err = vhost_worker_create(dev);
-		if (err)
-			goto err_worker;
-	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
@@ -655,8 +659,6 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	return 0;
 err_iovecs:
-	vhost_worker_free(dev);
-err_worker:
 	vhost_detach_mm(dev);
 	dev->kcov_handle = 0;
 err_mm:
@@ -1665,6 +1667,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 			r = -EFAULT;
 			break;
 		}
+
+		if (f.fd != VHOST_FILE_UNBIND) {
+			r = vhost_worker_try_create_def(d);
+			if (r)
+				break;
+		}
+
 		eventfp = f.fd == VHOST_FILE_UNBIND ? NULL : eventfd_fget(f.fd);
 		if (IS_ERR(eventfp)) {
 			r = PTR_ERR(eventfp);
-- 
2.25.1



* [PATCH 3/9] vhost: modify internal functions to take a vhost_worker
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
  2021-05-25 18:05 ` [PATCH 1/9] vhost: move worker thread fields to new struct Mike Christie
  2021-05-25 18:05 ` [PATCH 2/9] vhost: move vhost worker creation to kick setup Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 10:45   ` Stefan Hajnoczi
  2021-05-25 18:05 ` [PATCH 4/9] vhost: allow vhost_polls to use different vhost_workers Mike Christie
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

The final patches will allow us to have multiple vhost_workers per device
and to share them across devices. To prepare for that, this patch allows
our internal work queueing, flush, and cgroup attach functions to take a
vhost_worker as an argument.

The poll code required a change to the drivers, so the next patch will
convert that code.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 136 ++++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.h |   4 +-
 2 files changed, 94 insertions(+), 46 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a291cde95c43..4bfa9a7a51bb 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,20 +231,6 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_work_dev_flush(struct vhost_dev *dev)
-{
-	struct vhost_flush_struct flush;
-
-	if (dev->worker) {
-		init_completion(&flush.wait_event);
-		vhost_work_init(&flush.work, vhost_flush_work);
-
-		vhost_work_queue(dev, &flush.work);
-		wait_for_completion(&flush.wait_event);
-	}
-}
-EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
-
 /* Flush any work that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
@@ -253,26 +239,62 @@ void vhost_poll_flush(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+static void vhost_work_queue_on(struct vhost_work *work,
+				struct vhost_worker *worker)
 {
-	if (!dev->worker)
-		return;
-
 	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
 		/* We can only add the work to the list after we're
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->worker->work_list);
-		wake_up_process(dev->worker->task);
+		llist_add(&work->node, &worker->work_list);
+		wake_up_process(worker->task);
 	}
 }
+
+void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+{
+	if (!dev->workers)
+		return;
+	/*
+	 * devs with use_worker=true created by tools that do not support the
+	 * worker creation ioctl will always have at least one worker.
+	 */
+	vhost_work_queue_on(work, dev->workers[0]);
+}
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+static void vhost_work_dev_flush_on(struct vhost_worker *worker)
+{
+	struct vhost_flush_struct flush;
+
+	init_completion(&flush.wait_event);
+	vhost_work_init(&flush.work, vhost_flush_work);
+
+	vhost_work_queue_on(&flush.work, worker);
+	wait_for_completion(&flush.wait_event);
+}
+
+void vhost_work_dev_flush(struct vhost_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->num_workers; i++)
+		vhost_work_dev_flush_on(dev->workers[i]);
+}
+EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return dev->worker && !llist_empty(&dev->worker->work_list);
+	int i;
+
+	for (i = 0; i < dev->num_workers; i++) {
+		if (!llist_empty(&dev->workers[i]->work_list))
+			return true;
+	}
+
+	return false;
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -482,7 +504,8 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->umem = NULL;
 	dev->iotlb = NULL;
 	dev->mm = NULL;
-	dev->worker = NULL;
+	dev->workers = NULL;
+	dev->num_workers = 0;
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
@@ -531,14 +554,14 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups(struct vhost_dev *dev)
+static int vhost_attach_cgroups_on(struct vhost_worker *worker)
 {
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
 	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_dev_flush(dev);
+	vhost_work_queue_on(&attach.work, worker);
+	vhost_work_dev_flush_on(worker);
 	return attach.ret;
 }
 
@@ -579,20 +602,29 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
-static void vhost_worker_free(struct vhost_dev *dev)
+static void vhost_worker_free(struct vhost_worker *worker)
 {
-	struct vhost_worker *worker = dev->worker;
-
-	if (!worker)
-		return;
-
-	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
 	kthread_stop(worker->task);
 	kfree(worker);
 }
 
-static int vhost_worker_create(struct vhost_dev *dev)
+static void vhost_workers_free(struct vhost_dev *dev)
+{
+	int i;
+
+	if (!dev->workers)
+		return;
+
+	for (i = 0; i < dev->num_workers; i++)
+		vhost_worker_free(dev->workers[i]);
+
+	kfree(dev->workers);
+	dev->num_workers = 0;
+	dev->workers = NULL;
+}
+
+static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
 	struct task_struct *task;
@@ -600,42 +632,56 @@ static int vhost_worker_create(struct vhost_dev *dev)
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
 	if (!worker)
-		return -ENOMEM;
+		return NULL;
 
-	dev->worker = worker;
+	worker->id = dev->num_workers;
 	worker->dev = dev;
 	init_llist_head(&worker->work_list);
 
 	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
-	if (IS_ERR(task)) {
-		ret = PTR_ERR(task);
+	if (IS_ERR(task))
 		goto free_worker;
-	}
 
 	worker->task = task;
 	wake_up_process(task); /* avoid contributing to loadavg */
 
-	ret = vhost_attach_cgroups(dev);
+	ret = vhost_attach_cgroups_on(worker);
 	if (ret)
 		goto stop_worker;
 
-	return 0;
+	return worker;
 
 stop_worker:
 	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
-	dev->worker = NULL;
-	return ret;
+	return NULL;
 }
 
 /* Caller must have device mutex */
 static int vhost_worker_try_create_def(struct vhost_dev *dev)
 {
-	if (!dev->use_worker || dev->worker)
+	struct vhost_worker *worker;
+
+	if (!dev->use_worker || dev->workers)
 		return 0;
 
-	return vhost_worker_create(dev);
+	dev->workers = kcalloc(1, sizeof(struct vhost_worker *), GFP_KERNEL);
+	if (!dev->workers)
+		return -ENOMEM;
+
+	worker = vhost_worker_create(dev);
+	if (!worker)
+		goto free_workers;
+
+	dev->workers[worker->id] = worker;
+	dev->num_workers++;
+	return 0;
+
+free_workers:
+	kfree(dev->workers);
+	dev->workers = NULL;
+	return -ENOMEM;
 }
 
 /* Caller should have device mutex */
@@ -750,7 +796,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	vhost_worker_free(dev);
+	vhost_workers_free(dev);
 	vhost_detach_mm(dev);
 	dev->kcov_handle = 0;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index bfc4563e612f..fa1af4106755 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,6 +29,7 @@ struct vhost_worker {
 	struct task_struct	*task;
 	struct llist_head	work_list;
 	struct vhost_dev	*dev;
+	int			id;
 };
 
 /* Poll a file (eventfd or socket) */
@@ -155,7 +156,8 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct vhost_worker *worker;
+	struct vhost_worker **workers;
+	int num_workers;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
-- 
2.25.1



* [PATCH 4/9] vhost: allow vhost_polls to use different vhost_workers
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (2 preceding siblings ...)
  2021-05-25 18:05 ` [PATCH 3/9] vhost: modify internal functions to take a vhost_worker Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 13:51   ` Stefan Hajnoczi
  2021-05-25 18:05 ` [PATCH 5/9] vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

This allows a vhost_poll to use the worker of the vq the poll is
associated with.

This also exports a helper, vhost_vq_work_queue, which is used here
internally and will be used in the vhost-scsi patches.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/net.c   |  6 ++++--
 drivers/vhost/vhost.c | 19 ++++++++++++++++---
 drivers/vhost/vhost.h |  6 +++++-
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df82b124170e..948bc3d361ab 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1334,8 +1334,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
 		       NULL);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 	n->page_frag.page = NULL;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4bfa9a7a51bb..3cc1196d465b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -187,13 +187,15 @@ EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev)
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
 	poll->dev = dev;
 	poll->wqh = NULL;
+	poll->vq = vq;
 
 	vhost_work_init(&poll->work, fn);
 }
@@ -298,9 +300,15 @@ bool vhost_has_work(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+	vhost_work_queue_on(work, vq->worker);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_vq_work_queue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
@@ -359,6 +367,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->busyloop_timeout = 0;
 	vq->umem = NULL;
 	vq->iotlb = NULL;
+	vq->worker = NULL;
 	vhost_vring_call_reset(&vq->call_ctx);
 	__vhost_vq_meta_reset(vq);
 }
@@ -527,7 +536,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					EPOLLIN, dev);
+					EPOLLIN, dev, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -662,6 +671,7 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 static int vhost_worker_try_create_def(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
+	int i;
 
 	if (!dev->use_worker || dev->workers)
 		return 0;
@@ -674,6 +684,9 @@ static int vhost_worker_try_create_def(struct vhost_dev *dev)
 	if (!worker)
 		goto free_workers;
 
+	for (i = 0; i < dev->nvqs; i++)
+		dev->vqs[i]->worker = worker;
+
 	dev->workers[worker->id] = worker;
 	dev->num_workers++;
 	return 0;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index fa1af4106755..6880e7a29872 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -41,14 +41,17 @@ struct vhost_poll {
 	struct vhost_work	work;
 	__poll_t		mask;
 	struct vhost_dev	*dev;
+	struct vhost_virtqueue	*vq;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev);
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -76,6 +79,7 @@ struct vhost_vring_call {
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
+	struct vhost_worker *worker;
 
 	/* The actual ring of buffers. */
 	struct mutex mutex;
-- 
2.25.1



* [PATCH 5/9] vhost-scsi: flush IO vqs then send TMF rsp
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (3 preceding siblings ...)
  2021-05-25 18:05 ` [PATCH 4/9] vhost: allow vhost_polls to use different vhost_workers Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 13:54   ` Stefan Hajnoczi
  2021-05-25 18:05 ` [PATCH 6/9] vhost-scsi: make SCSI cmd completion per vq Mike Christie
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

With one worker we will always send the scsi cmd responses before the
TMF rsp, because LIO always completes the scsi cmds first, which calls
vhost_scsi_release_cmd to add them to the work queue.

When the next patches add multiple worker support, the worker threads
could still be sending their responses when the TMF's work is run.
So this patch has vhost-scsi flush the IO vqs on the other worker
threads before we send the TMF response.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c  | 17 ++++++++++++++---
 drivers/vhost/vhost.c |  6 ++++++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 46f897e41217..e585f2180457 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1168,12 +1168,23 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work)
 {
 	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
 						  vwork);
-	int resp_code;
+	int resp_code, i;
+
+	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
+		/*
+		 * When processing a TMF lio completes the cmds then the TMF,
+		 * so with one worker the TMF always completes after cmds. For
+		 * multiple worker support we must flush the IO vqs to make
+		 * sure if they have different workers then the cmds have
+		 * completed before we send the TMF response.
+		 */
+		for (i = 1; i < tmf->vhost->dev.num_workers; i++)
+			vhost_vq_work_flush(&tmf->vhost->vqs[i + VHOST_SCSI_VQ_IO].vq);
 
-	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE)
 		resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-	else
+	} else {
 		resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+	}
 
 	vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 				 tmf->vq_desc, &tmf->resp_iov, resp_code);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3cc1196d465b..345ade0af133 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -300,6 +300,12 @@ bool vhost_has_work(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
+void vhost_vq_work_flush(struct vhost_virtqueue *vq)
+{
+	vhost_work_dev_flush_on(vq->worker);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_flush);
+
 void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 {
 	vhost_work_queue_on(work, vq->worker);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6880e7a29872..0a252dd45101 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -47,6 +47,7 @@ struct vhost_poll {
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+void vhost_vq_work_flush(struct vhost_virtqueue *vq);
 void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-- 
2.25.1



* [PATCH 6/9] vhost-scsi: make SCSI cmd completion per vq
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (4 preceding siblings ...)
  2021-05-25 18:05 ` [PATCH 5/9] vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 13:57   ` Stefan Hajnoczi
  2021-05-25 18:05 ` [PATCH 7/9] vhost: allow userspace to create workers Mike Christie
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

This patch separates the scsi cmd completion code paths so we can complete
cmds based on their vq instead of having all cmds complete on the same
worker/CPU. This will be useful with the next patch that allows us to
create multiple worker threads and bind them to different vqs, so we can
have completions running on different threads/CPUs.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 48 +++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index e585f2180457..b607bff41074 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -176,6 +176,7 @@ enum {
 
 struct vhost_scsi_virtqueue {
 	struct vhost_virtqueue vq;
+	struct vhost_scsi *vs;
 	/*
 	 * Reference counting for inflight reqs, used for flush operation. At
 	 * each time, one reference tracks new commands submitted, while we
@@ -190,6 +191,9 @@ struct vhost_scsi_virtqueue {
 	struct vhost_scsi_cmd *scsi_cmds;
 	struct sbitmap scsi_tags;
 	int max_cmds;
+
+	struct vhost_work completion_work;
+	struct llist_head completion_list;
 };
 
 struct vhost_scsi {
@@ -200,9 +204,6 @@ struct vhost_scsi {
 	struct vhost_dev dev;
 	struct vhost_scsi_virtqueue vqs[VHOST_SCSI_MAX_VQ];
 
-	struct vhost_work vs_completion_work; /* cmd completion work item */
-	struct llist_head vs_completion_list; /* cmd completion queue */
-
 	struct vhost_work vs_event_work; /* evt injection work item */
 	struct llist_head vs_event_list; /* evt injection queue */
 
@@ -377,10 +378,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 	} else {
 		struct vhost_scsi_cmd *cmd = container_of(se_cmd,
 					struct vhost_scsi_cmd, tvc_se_cmd);
-		struct vhost_scsi *vs = cmd->tvc_vhost;
+		struct vhost_scsi_virtqueue *svq =  container_of(cmd->tvc_vq,
+					struct vhost_scsi_virtqueue, vq);
 
-		llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-		vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+		llist_add(&cmd->tvc_completion_list, &svq->completion_list);
+		vhost_vq_work_queue(&svq->vq, &svq->completion_work);
 	}
 }
 
@@ -543,18 +545,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
  */
 static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
-	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-					vs_completion_work);
-	DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
+	struct vhost_scsi_virtqueue *svq = container_of(work,
+				struct vhost_scsi_virtqueue, completion_work);
 	struct virtio_scsi_cmd_resp v_rsp;
 	struct vhost_scsi_cmd *cmd, *t;
 	struct llist_node *llnode;
 	struct se_cmd *se_cmd;
 	struct iov_iter iov_iter;
-	int ret, vq;
+	bool signal = false;
+	int ret;
 
-	bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
-	llnode = llist_del_all(&vs->vs_completion_list);
+	llnode = llist_del_all(&svq->completion_list);
 	llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
 		se_cmd = &cmd->tvc_se_cmd;
 
@@ -574,21 +575,16 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 			      cmd->tvc_in_iovs, sizeof(v_rsp));
 		ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
 		if (likely(ret == sizeof(v_rsp))) {
-			struct vhost_scsi_virtqueue *q;
+			signal = true;
 			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
-			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
-			vq = q - vs->vqs;
-			__set_bit(vq, signal);
 		} else
 			pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
 		vhost_scsi_release_cmd_res(se_cmd);
 	}
 
-	vq = -1;
-	while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
-		< VHOST_SCSI_MAX_VQ)
-		vhost_signal(&vs->dev, &vs->vqs[vq].vq);
+	if (signal)
+		vhost_signal(&svq->vs->dev, &svq->vq);
 }
 
 static struct vhost_scsi_cmd *
@@ -1799,6 +1795,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
+	struct vhost_scsi_virtqueue *svq;
 	struct vhost_scsi *vs;
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
@@ -1811,7 +1808,6 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
 	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
@@ -1822,8 +1818,14 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	vs->vqs[VHOST_SCSI_VQ_CTL].vq.handle_kick = vhost_scsi_ctl_handle_kick;
 	vs->vqs[VHOST_SCSI_VQ_EVT].vq.handle_kick = vhost_scsi_evt_handle_kick;
 	for (i = VHOST_SCSI_VQ_IO; i < VHOST_SCSI_MAX_VQ; i++) {
-		vqs[i] = &vs->vqs[i].vq;
-		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
+		svq = &vs->vqs[i];
+
+		vqs[i] = &svq->vq;
+		svq->vs = vs;
+		init_llist_head(&svq->completion_list);
+		vhost_work_init(&svq->completion_work,
+				vhost_scsi_complete_cmd_work);
+		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
 	vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
 		       VHOST_SCSI_WEIGHT, 0, true, NULL);
-- 
2.25.1



* [PATCH 7/9] vhost: allow userspace to create workers
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (5 preceding siblings ...)
  2021-05-25 18:05 ` [PATCH 6/9] vhost-scsi: make SCSI cmd completion per vq Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 14:30   ` Stefan Hajnoczi
  2021-05-25 18:05 ` [PATCH 8/9] vhost: add vhost_dev pointer to vhost_work Mike Christie
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

This patch allows userspace to create workers and bind them to vqs, so
you can have N workers per dev and also share N workers across M vqs.
The next patch will allow sharing across devices.
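
For illustration, here is a rough userspace sketch of binding two vqs to
one new worker. The vhost fd, includes, and error handling are assumed;
this is not part of the patch itself:

    struct vhost_vring_worker w0 = {
        .index = 0,
        .pid = VHOST_VRING_NEW_WORKER,
    };
    struct vhost_vring_worker w1;

    /* create a worker and bind vq 0 to it; its pid comes back in w0.pid */
    ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w0);

    /* bind vq 1 to the same worker by passing that pid back in */
    w1.index = 1;
    w1.pid = w0.pid;
    ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w1);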

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c            | 94 +++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h            |  3 +
 include/uapi/linux/vhost.h       |  6 ++
 include/uapi/linux/vhost_types.h | 12 ++++
 4 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 345ade0af133..981e9bac7a31 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,6 +30,7 @@
 #include <linux/interval_tree_generic.h>
 #include <linux/nospec.h>
 #include <linux/kcov.h>
+#include <linux/hashtable.h>
 
 #include "vhost.h"
 
@@ -42,6 +43,9 @@ module_param(max_iotlb_entries, int, 0444);
 MODULE_PARM_DESC(max_iotlb_entries,
 	"Maximum number of iotlb entries. (default: 2048)");
 
+static DEFINE_HASHTABLE(vhost_workers, 5);
+static DEFINE_SPINLOCK(vhost_workers_lock);
+
 enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
@@ -617,8 +621,17 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
-static void vhost_worker_free(struct vhost_worker *worker)
+static void vhost_worker_put(struct vhost_worker *worker)
 {
+	spin_lock(&vhost_workers_lock);
+	if (!refcount_dec_and_test(&worker->refcount)) {
+		spin_unlock(&vhost_workers_lock);
+		return;
+	}
+
+	hash_del(&worker->h_node);
+	spin_unlock(&vhost_workers_lock);
+
 	WARN_ON(!llist_empty(&worker->work_list));
 	kthread_stop(worker->task);
 	kfree(worker);
@@ -632,7 +645,7 @@ static void vhost_workers_free(struct vhost_dev *dev)
 		return;
 
 	for (i = 0; i < dev->num_workers; i++)
-		vhost_worker_free(dev->workers[i]);
+		vhost_worker_put(dev->workers[i]);
 
 	kfree(dev->workers);
 	dev->num_workers = 0;
@@ -652,6 +665,8 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	worker->id = dev->num_workers;
 	worker->dev = dev;
 	init_llist_head(&worker->work_list);
+	INIT_HLIST_NODE(&worker->h_node);
+	refcount_set(&worker->refcount, 1);
 
 	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
 	if (IS_ERR(task))
@@ -664,6 +679,9 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	if (ret)
 		goto stop_worker;
 
+	spin_lock(&vhost_workers_lock);
+	hash_add(vhost_workers, &worker->h_node, worker->task->pid);
+	spin_unlock(&vhost_workers_lock);
 	return worker;
 
 stop_worker:
@@ -673,6 +691,67 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	return NULL;
 }
 
+static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
+{
+	struct vhost_worker *worker, *found_worker = NULL;
+
+	spin_lock(&vhost_workers_lock);
+	hash_for_each_possible(vhost_workers, worker, h_node, pid) {
+		if (worker->task->pid == pid) {
+			/* tmp - next patch allows sharing across devs */
+			if (worker->dev != dev)
+				break;
+
+			found_worker = worker;
+			refcount_inc(&worker->refcount);
+			break;
+		}
+	}
+	spin_unlock(&vhost_workers_lock);
+	return found_worker;
+}
+
+/* Caller must have device mutex */
+static int vhost_vq_set_worker(struct vhost_virtqueue *vq,
+			       struct vhost_vring_worker *info)
+{
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_worker *worker;
+
+	if (vq->worker) {
+		/* TODO - support changing while works are running */
+		return -EBUSY;
+	}
+
+	if (info->pid == VHOST_VRING_NEW_WORKER) {
+		worker = vhost_worker_create(dev);
+		if (!worker)
+			return -ENOMEM;
+
+		info->pid = worker->task->pid;
+	} else {
+		worker = vhost_worker_find(dev, info->pid);
+		if (!worker)
+			return -ENODEV;
+	}
+
+	if (!dev->workers) {
+		dev->workers = kcalloc(vq->dev->nvqs,
+				       sizeof(struct vhost_worker *),
+				       GFP_KERNEL);
+		if (!dev->workers) {
+			vhost_worker_put(worker);
+			return -ENOMEM;
+		}
+	}
+
+	vq->worker = worker;
+
+	dev->workers[dev->num_workers] = worker;
+	dev->num_workers++;
+	return 0;
+}
+
 /* Caller must have device mutex */
 static int vhost_worker_try_create_def(struct vhost_dev *dev)
 {
@@ -1680,6 +1759,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 	struct eventfd_ctx *ctx = NULL;
 	u32 __user *idxp = argp;
 	struct vhost_virtqueue *vq;
+	struct vhost_vring_worker w;
 	struct vhost_vring_state s;
 	struct vhost_vring_file f;
 	u32 idx;
@@ -1794,6 +1874,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 		if (copy_to_user(argp, &s, sizeof(s)))
 			r = -EFAULT;
 		break;
+	case VHOST_SET_VRING_WORKER:
+		if (copy_from_user(&w, argp, sizeof(w))) {
+			r = -EFAULT;
+			break;
+		}
+		r = vhost_vq_set_worker(vq, &w);
+		if (!r && copy_to_user(argp, &w, sizeof(w)))
+			r = -EFAULT;
+		break;
 	default:
 		r = -ENOIOCTLCMD;
 	}
@@ -2726,6 +2815,7 @@ EXPORT_SYMBOL_GPL(vhost_set_backend_features);
 
 static int __init vhost_init(void)
 {
+	hash_init(vhost_workers);
 	return 0;
 }
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0a252dd45101..75b884ad1f17 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -14,6 +14,7 @@
 #include <linux/atomic.h>
 #include <linux/vhost_iotlb.h>
 #include <linux/irqbypass.h>
+#include <linux/refcount.h>
 
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -28,6 +29,8 @@ struct vhost_work {
 struct vhost_worker {
 	struct task_struct	*task;
 	struct llist_head	work_list;
+	struct hlist_node	h_node;
+	refcount_t		refcount;
 	struct vhost_dev	*dev;
 	int			id;
 };
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c998860d7bbc..ce32119cb139 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -70,6 +70,12 @@
 #define VHOST_VRING_BIG_ENDIAN 1
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+/* Create/bind a vhost worker to a virtqueue. If pid > 0 and matches an existing
+ * vhost_worker thread it will be bound to the vq. If pid is
+ * VHOST_VRING_NEW_WORKER, then a new worker will be created and bound to the
+ * vq.
+ */
+#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
 
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index f7f6a3a28977..5113baa8bc3e 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -47,6 +47,18 @@ struct vhost_vring_addr {
 	__u64 log_guest_addr;
 };
 
+#define VHOST_VRING_NEW_WORKER -1
+
+struct vhost_vring_worker {
+	unsigned int index;
+	/*
+	 * The pid of the vhost worker that the vq will be bound to. If
+	 * pid is VHOST_VRING_NEW_WORKER a new worker will be created and its
+	 * pid will be returned in pid.
+	 */
+	__kernel_pid_t pid;
+};
+
 /* no alignment requirement */
 struct vhost_iotlb_msg {
 	__u64 iova;
-- 
2.25.1



* [PATCH 8/9] vhost: add vhost_dev pointer to vhost_work
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (6 preceding siblings ...)
  2021-05-25 18:05 ` [PATCH 7/9] vhost: allow userspace to create workers Mike Christie
@ 2021-05-25 18:05 ` Mike Christie
  2021-06-03 14:31   ` Stefan Hajnoczi
  2021-05-25 18:06 ` [PATCH 9/9] vhost: support sharing workers across devs Mike Christie
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:05 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

The next patch allows a vhost_worker to handle different devices. To
prepare for that, this patch adds a pointer to the device on the work so
we can get to the different mms in the vhost_worker thread.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c  |  7 ++++---
 drivers/vhost/vhost.c | 24 ++++++++++++++----------
 drivers/vhost/vhost.h |  4 +++-
 drivers/vhost/vsock.c |  3 ++-
 4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index b607bff41074..073b20bca257 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1808,7 +1808,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
+	vhost_work_init(&vs->dev, &vs->vs_event_work, vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
@@ -1823,7 +1823,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 		vqs[i] = &svq->vq;
 		svq->vs = vs;
 		init_llist_head(&svq->completion_list);
-		vhost_work_init(&svq->completion_work,
+		vhost_work_init(&vs->dev, &svq->completion_work,
 				vhost_scsi_complete_cmd_work);
 		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
@@ -2017,7 +2017,8 @@ static int vhost_scsi_port_link(struct se_portal_group *se_tpg,
 	if (!tmf)
 		return -ENOMEM;
 	INIT_LIST_HEAD(&tmf->queue_entry);
-	vhost_work_init(&tmf->vwork, vhost_scsi_tmf_resp_work);
+	vhost_work_init(&tpg->vhost_scsi->dev, &tmf->vwork,
+			vhost_scsi_tmf_resp_work);
 
 	mutex_lock(&vhost_scsi_mutex);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 981e9bac7a31..eb16eb2bbee0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -182,10 +182,12 @@ static int vhost_poll_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_dev *dev, struct vhost_work *work,
+		     vhost_work_fn_t fn)
 {
 	clear_bit(VHOST_WORK_QUEUED, &work->flags);
 	work->fn = fn;
+	work->dev = dev;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
@@ -201,7 +203,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 	poll->wqh = NULL;
 	poll->vq = vq;
 
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(dev, &poll->work, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -270,12 +272,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
-static void vhost_work_dev_flush_on(struct vhost_worker *worker)
+static void vhost_work_dev_flush_on(struct vhost_dev *dev,
+				    struct vhost_worker *worker)
 {
 	struct vhost_flush_struct flush;
 
 	init_completion(&flush.wait_event);
-	vhost_work_init(&flush.work, vhost_flush_work);
+	vhost_work_init(dev, &flush.work, vhost_flush_work);
 
 	vhost_work_queue_on(&flush.work, worker);
 	wait_for_completion(&flush.wait_event);
@@ -286,7 +289,7 @@ void vhost_work_dev_flush(struct vhost_dev *dev)
 	int i;
 
 	for (i = 0; i < dev->num_workers; i++)
-		vhost_work_dev_flush_on(dev->workers[i]);
+		vhost_work_dev_flush_on(dev, dev->workers[i]);
 }
 EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
@@ -306,7 +309,7 @@ EXPORT_SYMBOL_GPL(vhost_has_work);
 
 void vhost_vq_work_flush(struct vhost_virtqueue *vq)
 {
-	vhost_work_dev_flush_on(vq->worker);
+	vhost_work_dev_flush_on(vq->dev, vq->worker);
 }
 EXPORT_SYMBOL_GPL(vhost_vq_work_flush);
 
@@ -573,14 +576,15 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups_on(struct vhost_worker *worker)
+static int vhost_attach_cgroups_on(struct vhost_dev *dev,
+				   struct vhost_worker *worker)
 {
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_init(dev, &attach.work, vhost_attach_cgroups_work);
 	vhost_work_queue_on(&attach.work, worker);
-	vhost_work_dev_flush_on(worker);
+	vhost_work_dev_flush_on(dev, worker);
 	return attach.ret;
 }
 
@@ -675,7 +679,7 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	worker->task = task;
 	wake_up_process(task); /* avoid contributing to loadavg */
 
-	ret = vhost_attach_cgroups_on(worker);
+	ret = vhost_attach_cgroups_on(dev, worker);
 	if (ret)
 		goto stop_worker;
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 75b884ad1f17..75ad3aa5adca 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -24,6 +24,7 @@ struct vhost_work {
 	struct llist_node	node;
 	vhost_work_fn_t		fn;
 	unsigned long		flags;
+	struct vhost_dev	*dev;
 };
 
 struct vhost_worker {
@@ -47,7 +48,8 @@ struct vhost_poll {
 	struct vhost_virtqueue	*vq;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
+void vhost_work_init(struct vhost_dev *dev, struct vhost_work *work,
+		     vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
 void vhost_vq_work_flush(struct vhost_virtqueue *vq);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index f954f4d29c95..302415b6460b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -648,7 +648,8 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	file->private_data = vsock;
 	spin_lock_init(&vsock->send_pkt_list_lock);
 	INIT_LIST_HEAD(&vsock->send_pkt_list);
-	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
+	vhost_work_init(&vsock->dev, &vsock->send_pkt_work,
+			vhost_transport_send_pkt_work);
 	return 0;
 
 out:
-- 
2.25.1



* [PATCH 9/9] vhost: support sharing workers across devs
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (7 preceding siblings ...)
  2021-05-25 18:05 ` [PATCH 8/9] vhost: add vhost_dev pointer to vhost_work Mike Christie
@ 2021-05-25 18:06 ` Mike Christie
  2021-06-03 14:32   ` Stefan Hajnoczi
  2021-06-03 10:13 ` vhost: multiple worker support Stefan Hajnoczi
  2021-06-03 14:37 ` Stefan Hajnoczi
  10 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-05-25 18:06 UTC (permalink / raw)
  To: target-devel, linux-scsi, stefanha, pbonzini, jasowang, mst,
	sgarzare, virtualization
  Cc: Mike Christie

This allows a worker to handle multiple devices' vqs.

TODO:
- The worker is attached to the cgroup of the device that created it. With
this patch you can share workers between devices with different owners,
which could be in different cgroups. Do we want to restrict worker sharing
to devices that have the same owner (dev->mm value)?
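
For reference, a minimal sketch of what such a restriction could look
like in vhost_worker_find(), assuming the worker recorded the mm of the
device that created it in a hypothetical owner_mm field (not part of
this patch):

	/* in vhost_worker_create(): worker->owner_mm = dev->mm; */

	hash_for_each_possible(vhost_workers, worker, h_node, pid) {
		if (worker->task->pid == pid) {
			if (worker->owner_mm != dev->mm)
				break;	/* different owner, refuse to share */
			found_worker = worker;
			refcount_inc(&worker->refcount);
			break;
		}
	}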

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 16 +++++++---------
 drivers/vhost/vhost.h |  1 -
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eb16eb2bbee0..c32f72b1901c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -388,12 +388,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
+	struct vhost_dev *dev;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -412,15 +410,20 @@ static int vhost_worker(void *data)
 		smp_wmb();
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
+			dev = work->dev;
+
+			kthread_use_mm(dev->mm);
+
 			__set_current_state(TASK_RUNNING);
 			kcov_remote_start_common(dev->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
 				schedule();
+
+			kthread_unuse_mm(dev->mm);
 		}
 	}
-	kthread_unuse_mm(dev->mm);
 	return 0;
 }
 
@@ -667,7 +670,6 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 		return NULL;
 
 	worker->id = dev->num_workers;
-	worker->dev = dev;
 	init_llist_head(&worker->work_list);
 	INIT_HLIST_NODE(&worker->h_node);
 	refcount_set(&worker->refcount, 1);
@@ -702,10 +704,6 @@ static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
 	spin_lock(&vhost_workers_lock);
 	hash_for_each_possible(vhost_workers, worker, h_node, pid) {
 		if (worker->task->pid == pid) {
-			/* tmp - next patch allows sharing across devs */
-			if (worker->dev != dev)
-				break;
-
 			found_worker = worker;
 			refcount_inc(&worker->refcount);
 			break;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 75ad3aa5adca..40c400172a84 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -32,7 +32,6 @@ struct vhost_worker {
 	struct llist_head	work_list;
 	struct hlist_node	h_node;
 	refcount_t		refcount;
-	struct vhost_dev	*dev;
 	int			id;
 };
 
-- 
2.25.1



* Re: vhost: multiple worker support
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (8 preceding siblings ...)
  2021-05-25 18:06 ` [PATCH 9/9] vhost: support sharing workers across devs Mike Christie
@ 2021-06-03 10:13 ` Stefan Hajnoczi
  2021-06-03 18:45   ` Mike Christie
  2021-06-03 14:37 ` Stefan Hajnoczi
  10 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 10:13 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization


On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
> Results:
> --------
> When running with the null_blk driver and vhost-scsi I can get 1.2
> million IOPs by just running a simple
> 
> fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k --ioengine=libaio
> --iodepth=128  --numjobs=8 --time_based --group_reporting --name=iops
> --runtime=60 --eta-newline=1
> 
> The VM has 8 vCPUs and sda has 8 virtqueues and we can do a total of
> 1024 cmds per devices. To get 1.2 million IOPs I did have to tune and
> ran the virsh emulatorpin command so the vhost threads were running
> on different CPUs than the VM. If the vhost threads share CPUs then I
> get around 800K.
> 
> For a more real device that are also CPU hogs like iscsi, I can still
> get 1 million IOPs using 1 dm-multipath device over 8 iscsi paths
> (natively it gets 1.1 million IOPs).

There is no comparison against a baseline, but I guess it would be the
same 8 vCPU guest with single queue vhost-scsi?

> 
> Results/TODO Note:
> 
> - I ported the vdpa sim code to support multiple workers and as-is now
> it made perf much worse. If I increase vdpa_sim_blk's num queues to 4-8
> I get 700K IOPs with the fio command above. However with the multiple
> worker support it drops to 400K. The problem is the vdpa_sim lock
> and the iommu_lock. If I hack (like comment out locks or not worry about
> data corruption or crashes) then I can get around 1.2M - 1.6M IOPs with
> 8 queues and fio command above.
> 
> So these patches could help other drivers, but it will just take more
> work to remove those types of locks. I was hoping the 2 items could be
> done indepentently since it helps vhost-scsi immediately.

Cool, thank you for taking a look. That's useful info for Stefano. vDPA
and vhost can be handled independently, though in the long term hopefully
they can share code.



* Re: [PATCH 1/9] vhost: move worker thread fields to new struct
  2021-05-25 18:05 ` [PATCH 1/9] vhost: move worker thread fields to new struct Mike Christie
@ 2021-06-03 10:16   ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 10:16 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization


On Tue, May 25, 2021 at 01:05:52PM -0500, Mike Christie wrote:
> This is just a prep patch. It moves the worker related fields to a new
> vhost_worker struct and moves the code around to create some helpers that
> will be used in the next patches.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++--------------
>  drivers/vhost/vhost.h |  9 ++++-
>  2 files changed, 70 insertions(+), 33 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>



* Re: [PATCH 2/9] vhost: move vhost worker creation to kick setup
  2021-05-25 18:05 ` [PATCH 2/9] vhost: move vhost worker creation to kick setup Mike Christie
@ 2021-06-03 10:28   ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 10:28 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 1061 bytes --]

On Tue, May 25, 2021 at 01:05:53PM -0500, Mike Christie wrote:
> The next patch will add new ioctls that allows userspace to create workers
> and bind them to devs and vqs after VHOST_SET_OWNER. To support older
> tools, newer tools that want to go wild with worker threads, and newer
> tools that want the old/default behavior this patch moves the default
> worker setup to the kick setup.
> 
> After the first vq's kick/poll setup is done we could start to get works
> queued, so this is the point when we must have a worker setup. If we are
> using older tools or the newer tools just want the default single vhost
> thread per dev behavior then it will automatically be done here. If the
> tools are using the newer ioctls that have already created the workers
> then we also detect that here and do nothing.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/vhost.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] vhost: modify internal functions to take a vhost_worker
  2021-05-25 18:05 ` [PATCH 3/9] vhost: modify internal functions to take a vhost_worker Mike Christie
@ 2021-06-03 10:45   ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 10:45 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 2311 bytes --]

On Tue, May 25, 2021 at 01:05:54PM -0500, Mike Christie wrote:
> -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> +static void vhost_work_queue_on(struct vhost_work *work,
> +				struct vhost_worker *worker)
>  {
> -	if (!dev->worker)
> -		return;
> -
>  	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
>  		/* We can only add the work to the list after we're
>  		 * sure it was not in the list.
>  		 * test_and_set_bit() implies a memory barrier.
>  		 */
> -		llist_add(&work->node, &dev->worker->work_list);
> -		wake_up_process(dev->worker->task);
> +		llist_add(&work->node, &worker->work_list);
> +		wake_up_process(worker->task);
>  	}
>  }
> +
> +void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)

When should this function still be used? A doc comment contrasting it to
vhost_work_queue_on() would be helpful. I would expect callers to switch
to that instead of queuing work on dev->workers[0].
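
Something along these lines, maybe (just a sketch of the wording):

/**
 * vhost_work_queue - queue work on the device's default worker
 *
 * Kept for callers that predate per-vq workers; it simply queues the
 * work on dev->workers[0]. New code should prefer vhost_work_queue_on()
 * or vhost_vq_work_queue() so the work runs on the worker bound to the
 * relevant vq.
 */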

>  /* A lockless hint for busy polling code to exit the loop */
>  bool vhost_has_work(struct vhost_dev *dev)
>  {
> -	return dev->worker && !llist_empty(&dev->worker->work_list);
> +	int i;
> +
> +	for (i = 0; i < dev->num_workers; i++) {
> +		if (!llist_empty(&dev->workers[i]->work_list))
> +			return true;
> +	}
> +
> +	return false;
>  }
>  EXPORT_SYMBOL_GPL(vhost_has_work);

It's probably not necessary to poll all workers:

drivers/vhost/net.c calls vhost_has_work() to busy poll a specific
virtqueue. If the vq:worker mapping is 1:1 or N:1 then vhost_has_work()
should be extended to include the struct vhost_virtqueue so we can poll
just that vq worker's work_list.
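
Something like the following could work (a sketch only, relying on the
vq->worker pointer added later in the series):

bool vhost_vq_has_work(struct vhost_virtqueue *vq)
{
	return vq->worker && !llist_empty(&vq->worker->work_list);
}

net.c's busy poll loop could then pass its rx/tx vq instead of the dev.
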
>  /* Caller must have device mutex */
>  static int vhost_worker_try_create_def(struct vhost_dev *dev)
>  {
> -	if (!dev->use_worker || dev->worker)
> +	struct vhost_worker *worker;
> +
> +	if (!dev->use_worker || dev->workers)
>  		return 0;
>  
> -	return vhost_worker_create(dev);
> +	dev->workers = kcalloc(1, sizeof(struct vhost_worker *), GFP_KERNEL);

GFP_KERNEL_ACCOUNT so that vhost memory associated with a process (the
one that invoked the ioctl) is accounted? This may get trickier if the
workers are shared between processes.

The same applies for struct vhost_worker in vhost_worker_create().
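
i.e. something like (sketch; assuming the worker struct itself comes from
a plain kzalloc() in vhost_worker_create()):

	/* in vhost_worker_try_create_def() */
	dev->workers = kcalloc(1, sizeof(struct vhost_worker *),
			       GFP_KERNEL_ACCOUNT);

	/* in vhost_worker_create() */
	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);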

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/9] vhost: allow vhost_polls to use different vhost_workers
  2021-05-25 18:05 ` [PATCH 4/9] vhost: allow vhost_polls to use different vhost_workers Mike Christie
@ 2021-06-03 13:51   ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 13:51 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 588 bytes --]

On Tue, May 25, 2021 at 01:05:55PM -0500, Mike Christie wrote:
> This allows vhost_polls to use the worker the vq the poll is associated
> with.
> 
> This also exports a helper vhost_vq_work_queue which is used here
> internally, and will be used in the vhost-scsi patches.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/net.c   |  6 ++++--
>  drivers/vhost/vhost.c | 19 ++++++++++++++++---
>  drivers/vhost/vhost.h |  6 +++++-
>  3 files changed, 25 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/9] vhost-scsi: flush IO vqs then send TMF rsp
  2021-05-25 18:05 ` [PATCH 5/9] vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
@ 2021-06-03 13:54   ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 13:54 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 1594 bytes --]

On Tue, May 25, 2021 at 01:05:56PM -0500, Mike Christie wrote:
> With one worker we will always send the scsi cmd responses then send the
> TMF rsp, because LIO will always complete the scsi cmds first which
> calls vhost_scsi_release_cmd to add them to the work queue.
> 
> When the next patches adds multiple worker support, the worker threads
> could still be sending their responses when the tmf's work is run.
> So this patch has vhost-scsi flush the IO vqs on other worker threads
> before we send the tmf response.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/scsi.c  | 17 ++++++++++++++---
>  drivers/vhost/vhost.c |  6 ++++++
>  drivers/vhost/vhost.h |  1 +
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 46f897e41217..e585f2180457 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1168,12 +1168,23 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work)
>  {
>  	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
>  						  vwork);
> -	int resp_code;
> +	int resp_code, i;
> +
> +	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
> +		/*
> +		 * When processing a TMF lio completes the cmds then the TMF,
> +		 * so with one worker the TMF always completes after cmds. For
> +		 * multiple worker support we must flush the IO vqs to make
> +		 * sure if they have differrent workers then the cmds have

s/differrent/different/

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/9] vhost-scsi: make SCSI cmd completion per vq
  2021-05-25 18:05 ` [PATCH 6/9] vhost-scsi: make SCSI cmd completion per vq Mike Christie
@ 2021-06-03 13:57   ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 13:57 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 676 bytes --]

On Tue, May 25, 2021 at 01:05:57PM -0500, Mike Christie wrote:
> This patch separates the scsi cmd completion code paths so we can complete
> cmds based on their vq instead of having all cmds complete on the same
> worker/CPU. This will be useful with the next patch that allows us to
> create mulitple worker threads and bind them to different vqs, so we can
> have completions running on different threads/CPUs.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/scsi.c | 48 +++++++++++++++++++++++---------------------
>  1 file changed, 25 insertions(+), 23 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] vhost: allow userspace to create workers
  2021-05-25 18:05 ` [PATCH 7/9] vhost: allow userspace to create workers Mike Christie
@ 2021-06-03 14:30   ` Stefan Hajnoczi
  2021-06-05 23:53     ` michael.christie
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 14:30 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 8691 bytes --]

On Tue, May 25, 2021 at 01:05:58PM -0500, Mike Christie wrote:
> This patch allows userspace to create workers and bind them to vqs, so you
> can have N workers per dev and also share N workers with M vqs. The next
> patch will allow sharing across devices.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/vhost.c            | 94 +++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.h            |  3 +
>  include/uapi/linux/vhost.h       |  6 ++
>  include/uapi/linux/vhost_types.h | 12 ++++
>  4 files changed, 113 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 345ade0af133..981e9bac7a31 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -30,6 +30,7 @@
>  #include <linux/interval_tree_generic.h>
>  #include <linux/nospec.h>
>  #include <linux/kcov.h>
> +#include <linux/hashtable.h>
>  
>  #include "vhost.h"
>  
> @@ -42,6 +43,9 @@ module_param(max_iotlb_entries, int, 0444);
>  MODULE_PARM_DESC(max_iotlb_entries,
>  	"Maximum number of iotlb entries. (default: 2048)");
>  
> +static DEFINE_HASHTABLE(vhost_workers, 5);
> +static DEFINE_SPINLOCK(vhost_workers_lock);
> +
>  enum {
>  	VHOST_MEMORY_F_LOG = 0x1,
>  };
> @@ -617,8 +621,17 @@ static void vhost_detach_mm(struct vhost_dev *dev)
>  	dev->mm = NULL;
>  }
>  
> -static void vhost_worker_free(struct vhost_worker *worker)
> +static void vhost_worker_put(struct vhost_worker *worker)
>  {
> +	spin_lock(&vhost_workers_lock);
> +	if (!refcount_dec_and_test(&worker->refcount)) {
> +		spin_unlock(&vhost_workers_lock);
> +		return;
> +	}
> +
> +	hash_del(&worker->h_node);
> +	spin_unlock(&vhost_workers_lock);
> +
>  	WARN_ON(!llist_empty(&worker->work_list));
>  	kthread_stop(worker->task);
>  	kfree(worker);
> @@ -632,7 +645,7 @@ static void vhost_workers_free(struct vhost_dev *dev)
>  		return;
>  
>  	for (i = 0; i < dev->num_workers; i++)
> -		vhost_worker_free(dev->workers[i]);
> +		vhost_worker_put(dev->workers[i]);
>  
>  	kfree(dev->workers);
>  	dev->num_workers = 0;
> @@ -652,6 +665,8 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>  	worker->id = dev->num_workers;
>  	worker->dev = dev;
>  	init_llist_head(&worker->work_list);
> +	INIT_HLIST_NODE(&worker->h_node);
> +	refcount_set(&worker->refcount, 1);
>  
>  	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
>  	if (IS_ERR(task))
> @@ -664,6 +679,9 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>  	if (ret)
>  		goto stop_worker;
>  
> +	spin_lock(&vhost_workers_lock);
> +	hash_add(vhost_workers, &worker->h_node, worker->task->pid);
> +	spin_unlock(&vhost_workers_lock);
>  	return worker;
>  
>  stop_worker:
> @@ -673,6 +691,67 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>  	return NULL;
>  }
>  
> +static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
> +{
> +	struct vhost_worker *worker, *found_worker = NULL;
> +
> +	spin_lock(&vhost_workers_lock);
> +	hash_for_each_possible(vhost_workers, worker, h_node, pid) {
> +		if (worker->task->pid == pid) {
> +			/* tmp - next patch allows sharing across devs */
> +			if (worker->dev != dev)
> +				break;
> +
> +			found_worker = worker;
> +			refcount_inc(&worker->refcount);
> +			break;
> +		}
> +	}
> +	spin_unlock(&vhost_workers_lock);
> +	return found_worker;
> +}
> +
> +/* Caller must have device mutex */
> +static int vhost_vq_set_worker(struct vhost_virtqueue *vq,
> +			       struct vhost_vring_worker *info)
> +{
> +	struct vhost_dev *dev = vq->dev;
> +	struct vhost_worker *worker;
> +
> +	if (vq->worker) {
> +		/* TODO - support changing while works are running */
> +		return -EBUSY;
> +	}
> +
> +	if (info->pid == VHOST_VRING_NEW_WORKER) {
> +		worker = vhost_worker_create(dev);

The maximum number of kthreads created is limited by
vhost_dev_init(nvqs)? For example VHOST_SCSI_MAX_VQ 128.

IIUC kthread_create is not limited by or accounted against the current
task, so I'm a little worried that a process can create a lot of
kthreads.

I haven't investigated other kthread_create() users reachable from
userspace applications to see how they bound the number of threads
effectively.

Any thoughts?

> +		if (!worker)
> +			return -ENOMEM;
> +
> +		info->pid = worker->task->pid;
> +	} else {
> +		worker = vhost_worker_find(dev, info->pid);
> +		if (!worker)
> +			return -ENODEV;
> +	}
> +
> +	if (!dev->workers) {
> +		dev->workers = kcalloc(vq->dev->nvqs,
> +				       sizeof(struct vhost_worker *),
> +				       GFP_KERNEL);

Another candidate for GFP_KERNEL_ACCOUNT.

> +		if (!dev->workers) {
> +			vhost_worker_put(worker);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	vq->worker = worker;
> +
> +	dev->workers[dev->num_workers] = worker;
> +	dev->num_workers++;

Hmm...should we really append to workers[] in the vhost_worker_find()
case?

> +	return 0;
> +}
> +
>  /* Caller must have device mutex */
>  static int vhost_worker_try_create_def(struct vhost_dev *dev)
>  {
> @@ -1680,6 +1759,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  	struct eventfd_ctx *ctx = NULL;
>  	u32 __user *idxp = argp;
>  	struct vhost_virtqueue *vq;
> +	struct vhost_vring_worker w;
>  	struct vhost_vring_state s;
>  	struct vhost_vring_file f;
>  	u32 idx;
> @@ -1794,6 +1874,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  		if (copy_to_user(argp, &s, sizeof(s)))
>  			r = -EFAULT;
>  		break;
> +	case VHOST_SET_VRING_WORKER:
> +		if (copy_from_user(&w, argp, sizeof(w))) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		r = vhost_vq_set_worker(vq, &w);
> +		if (!r && copy_to_user(argp, &w, sizeof(w)))
> +			r = -EFAULT;
> +		break;
>  	default:
>  		r = -ENOIOCTLCMD;
>  	}
> @@ -2726,6 +2815,7 @@ EXPORT_SYMBOL_GPL(vhost_set_backend_features);
>  
>  static int __init vhost_init(void)
>  {
> +	hash_init(vhost_workers);
>  	return 0;
>  }
>  
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 0a252dd45101..75b884ad1f17 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -14,6 +14,7 @@
>  #include <linux/atomic.h>
>  #include <linux/vhost_iotlb.h>
>  #include <linux/irqbypass.h>
> +#include <linux/refcount.h>
>  
>  struct vhost_work;
>  typedef void (*vhost_work_fn_t)(struct vhost_work *work);
> @@ -28,6 +29,8 @@ struct vhost_work {
>  struct vhost_worker {
>  	struct task_struct	*task;
>  	struct llist_head	work_list;
> +	struct hlist_node	h_node;

h_node is a generic name. If you're willing to use a longer name then
vhost_workers_node would make it clear which hlist this is associated
with.

> +	refcount_t		refcount;
>  	struct vhost_dev	*dev;
>  	int			id;
>  };
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index c998860d7bbc..ce32119cb139 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -70,6 +70,12 @@
>  #define VHOST_VRING_BIG_ENDIAN 1
>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* Create/bind a vhost worker to a virtqueue. If pid > 0 and matches an existing
> + * vhost_worker thread it will be bound to the vq. If pid is
> + * VHOST_VRING_NEW_WORKER, then a new worker will be created and bound to the
> + * vq.
> + */

Please document when this ioctl must be called (before kick is set up).
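
For illustration, the flow I'd expect userspace to follow is roughly the
below (a sketch against the uapi added in this patch, so it only builds
with this series' headers; vhost_fd/vq_index are placeholders and error
handling is elided):

#include <sys/ioctl.h>
#include <linux/vhost.h>

static pid_t bind_new_worker(int vhost_fd, unsigned int vq_index)
{
	struct vhost_vring_worker w = {
		.index = vq_index,
		/* or pass an existing worker's pid here to share it */
		.pid = VHOST_VRING_NEW_WORKER,
	};

	/* after VHOST_SET_OWNER, before this vq's kick is set up */
	if (ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w) < 0)
		return -1;

	/* on success the kernel filled in w.pid with the worker's pid */
	return w.pid;
}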

> +#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
>  
>  /* The following ioctls use eventfd file descriptors to signal and poll
>   * for events. */
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index f7f6a3a28977..5113baa8bc3e 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -47,6 +47,18 @@ struct vhost_vring_addr {
>  	__u64 log_guest_addr;
>  };
>  
> +#define VHOST_VRING_NEW_WORKER -1
> +
> +struct vhost_vring_worker {
> +	unsigned int index;
> +	/*
> +	 * The pid of the vhost worker that the vq will be bound to. If
> +	 * pid is VHOST_VRING_NEW_WORKER a new worker will be created and it's

s/it's/its/

> +	 * pid will be returned in pid.
> +	 */
> +	__kernel_pid_t pid;
> +};
> +
>  /* no alignment requirement */
>  struct vhost_iotlb_msg {
>  	__u64 iova;
> -- 
> 2.25.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/9] vhost: add vhost_dev pointer to vhost_work
  2021-05-25 18:05 ` [PATCH 8/9] vhost: add vhost_dev pointer to vhost_work Mike Christie
@ 2021-06-03 14:31   ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 14:31 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 623 bytes --]

On Tue, May 25, 2021 at 01:05:59PM -0500, Mike Christie wrote:
> The next patch allows a vhost_worker to handle different devices. To
> prepare for that, this patch adds a pointer to the device on the work so
> we can get to the different mms in the vhost_worker thread.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/scsi.c  |  7 ++++---
>  drivers/vhost/vhost.c | 24 ++++++++++++++----------
>  drivers/vhost/vhost.h |  4 +++-
>  drivers/vhost/vsock.c |  3 ++-
>  4 files changed, 23 insertions(+), 15 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 9/9] vhost: support sharing workers across devs
  2021-05-25 18:06 ` [PATCH 9/9] vhost: support sharing workers across devs Mike Christie
@ 2021-06-03 14:32   ` Stefan Hajnoczi
  2021-06-07  2:18     ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 14:32 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 451 bytes --]

On Tue, May 25, 2021 at 01:06:00PM -0500, Mike Christie wrote:
> This allows a worker to handle multiple device's vqs.
> 
> TODO:
> - The worker is attached to the cgroup of the device that created it. In
> this patch you can share workers with devices with different owners which
> could be in different cgroups. Do we want to restict sharing workers with
> devices that have the same owner (dev->mm value)?

Question for Michael or Jason.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: vhost: multiple worker support
  2021-05-25 18:05 vhost: multiple worker support Mike Christie
                   ` (9 preceding siblings ...)
  2021-06-03 10:13 ` vhost: multiple worker support Stefan Hajnoczi
@ 2021-06-03 14:37 ` Stefan Hajnoczi
  2021-06-03 22:16   ` Mike Christie
  10 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 14:37 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 1309 bytes --]

On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
> The following patches apply over linus's tree or mst's vhost branch
> and my cleanup patchset:
> 
> https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html
> 
> These patches allow us to support multiple vhost workers per device. I
> ended up just doing Stefan's original idea where userspace has the
> kernel create a worker and we pass back the pid. This has the benefit
> over the workqueue and userspace thread approach where we only have
> one'ish code path in the kernel during setup to detect old tools. The
> main IO paths and device/vq setup/teardown paths all use common code.
> 
> The kernel patches here allow us to then do N workers device and also
> share workers across devices.
> 
> I've also included a patch for qemu so you can get an idea of how it
> works. If we are ok with the kernel code then I'll break that up into
> a patchset and send to qemu-devel.

It seems risky to allow userspace process A to "share" a vhost worker
thread with userspace process B based on a matching pid alone. Should
they have ptrace_may_access() or similar?

For example, two competing users could sabotage each other by running
lots of work items on their competitor's vhost worker thread.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: vhost: multiple worker support
  2021-06-03 10:13 ` vhost: multiple worker support Stefan Hajnoczi
@ 2021-06-03 18:45   ` Mike Christie
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2021-06-03 18:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

On 6/3/21 5:13 AM, Stefan Hajnoczi wrote:
> On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
>> Results:
>> --------
>> When running with the null_blk driver and vhost-scsi I can get 1.2
>> million IOPs by just running a simple
>>
>> fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k --ioengine=libaio
>> --iodepth=128  --numjobs=8 --time_based --group_reporting --name=iops
>> --runtime=60 --eta-newline=1
>>
>> The VM has 8 vCPUs and sda has 8 virtqueues and we can do a total of
>> 1024 cmds per devices. To get 1.2 million IOPs I did have to tune and
>> ran the virsh emulatorpin command so the vhost threads were running
>> on different CPUs than the VM. If the vhost threads share CPUs then I
>> get around 800K.
>>
>> For a more real device that are also CPU hogs like iscsi, I can still
>> get 1 million IOPs using 1 dm-multipath device over 8 iscsi paths
>> (natively it gets 1.1 million IOPs).
> 
> There is no comparison against a baseline, but I guess it would be the
> same 8 vCPU guest with single queue vhost-scsi?
> 

For the iscsi device the max for the single thread case was around
380K IOPs.

Here are the results with null_blk as the backend device with a 16
vCPU guest to give you a better picture.

fio
numjobs 1        2        4        8        12       16
--------------------------------------------------------

Current upstream (single thread per vhost-scsi device).
After 8 jobs there was no perf diff.
********************************************************
VQs
1       130k     338k     390k     404k     -        -
2       146k     440k     448k     478k     -        -
4       146k     456k     448k     482k     -        -
8       154k     464k     500k     490k     -        -
12      160k     454k     486k     490k     -        -
16      162k     460k     484k     486k     -        -

thread per VQ:
After 16 jobs there was no perf diff even if I increased
the number of guest vCPUs.
*********************************************************
1	same as above
2       166k     320k     542k     664k     558k     658k
4       156k     310k     660k     986k     860k     890k
8       156k     328k     652k     988k     972k     1074k
12      162k     336k     660k     1172k    1190k    1324k
16      162k     332k     664k     1398k    850k     1426k

Note:
- For numjobs > 8, I lowered iodepth so we had a total of 1024
cmds over all jobs.
- virtqueue_size/cmd_per_lun=1024 was used for all tests.
- If I modify vhost-scsi so vhost_scsi_handle_vq queues the
response immediately so we never enter the LIO/block/scsi layers
then I can get around 1.6-1.8M IOPs as the max.
- There are some device wide locks in the LIO main IO path that
we are hitting in these results. We are working on removing them.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: vhost: multiple worker support
  2021-06-03 14:37 ` Stefan Hajnoczi
@ 2021-06-03 22:16   ` Mike Christie
  2021-06-05 22:40     ` michael.christie
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-06-03 22:16 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

On 6/3/21 9:37 AM, Stefan Hajnoczi wrote:
> On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
>> The following patches apply over linus's tree or mst's vhost branch
>> and my cleanup patchset:
>>
>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html
>>
>> These patches allow us to support multiple vhost workers per device. I
>> ended up just doing Stefan's original idea where userspace has the
>> kernel create a worker and we pass back the pid. This has the benefit
>> over the workqueue and userspace thread approach where we only have
>> one'ish code path in the kernel during setup to detect old tools. The
>> main IO paths and device/vq setup/teardown paths all use common code.
>>
>> The kernel patches here allow us to then do N workers device and also
>> share workers across devices.
>>
>> I've also included a patch for qemu so you can get an idea of how it
>> works. If we are ok with the kernel code then I'll break that up into
>> a patchset and send to qemu-devel.
> 
> It seems risky to allow userspace process A to "share" a vhost worker
> thread with userspace process B based on a matching pid alone. Should
> they have ptrace_may_access() or similar?
> 

I'm not sure. I already made it a little restrictive in this posting, but
it may not be enough depending on what's possible and what we want to allow.

Right now to share a worker the userspace process doing the
VHOST_SET_VRING_WORKER ioctl has to be the owner. Before we do a
VHOST_SET_VRING_WORKER, vhost_dev_ioctl calls vhost_dev_check_owner,
so we will fail if 2 random processes try to share.

So we can share a worker across a vhost-dev's N virtqueues or share a
worker with multiple vhost devs and their virtqueues, but the devs have
to be managed by the same VM/qemu process. If that's all we want to support,
then is the owner check enough?

If we want to share workers across VMs then I think we definitely want
something like ptrace_may_access.
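
Something in this direction, maybe (rough sketch; worker->owner_task is
made up here -- we would have to stash a reference to the creating task
when the worker is created, which the current patches do not do):

static bool vhost_worker_may_share(struct vhost_worker *worker)
{
	/* can the caller "see" the process that created this worker? */
	return ptrace_may_access(worker->owner_task,
				 PTRACE_MODE_READ_REALCREDS);
}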

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: vhost: multiple worker support
  2021-06-03 22:16   ` Mike Christie
@ 2021-06-05 22:40     ` michael.christie
  2021-06-07 15:23       ` Stefan Hajnoczi
  0 siblings, 1 reply; 32+ messages in thread
From: michael.christie @ 2021-06-05 22:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

On 6/3/21 5:16 PM, Mike Christie wrote:
> On 6/3/21 9:37 AM, Stefan Hajnoczi wrote:
>> On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
>>> The following patches apply over linus's tree or mst's vhost branch
>>> and my cleanup patchset:
>>>
>>> https://urldefense.com/v3/__https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html__;!!GqivPVa7Brio!P55eslnMW_iZkoTUZckwhnSw_9Z35JBScgtSYRh4CMFTRkSsWwKOYqY0huUfBfIPM8BV$ 
>>>
>>> These patches allow us to support multiple vhost workers per device. I
>>> ended up just doing Stefan's original idea where userspace has the
>>> kernel create a worker and we pass back the pid. This has the benefit
>>> over the workqueue and userspace thread approach where we only have
>>> one'ish code path in the kernel during setup to detect old tools. The
>>> main IO paths and device/vq setup/teardown paths all use common code.
>>>
>>> The kernel patches here allow us to then do N workers device and also
>>> share workers across devices.
>>>
>>> I've also included a patch for qemu so you can get an idea of how it
>>> works. If we are ok with the kernel code then I'll break that up into
>>> a patchset and send to qemu-devel.
>>
>> It seems risky to allow userspace process A to "share" a vhost worker
>> thread with userspace process B based on a matching pid alone. Should
>> they have ptrace_may_access() or similar?
>>
> 
> I'm not sure. I already made it a little restrictive in this posting, but

Made a mistake here. In this posting I did not make it restrictive and
I was allowing any old 2 processes to share. So we would need something
like ptrace_may_access if we go this route.

If we restrict sharing workers to devices with the same owner, then I'm not
sure if we need anything.
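
The owner check could then just be a comparison of the mm that was set at
VHOST_SET_OWNER time, roughly (sketch only):

static bool vhost_devs_same_owner(struct vhost_dev *a, struct vhost_dev *b)
{
	return a->mm && a->mm == b->mm;
}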

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] vhost: allow userspace to create workers
  2021-06-03 14:30   ` Stefan Hajnoczi
@ 2021-06-05 23:53     ` michael.christie
  2021-06-07 15:19       ` Stefan Hajnoczi
  0 siblings, 1 reply; 32+ messages in thread
From: michael.christie @ 2021-06-05 23:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

On 6/3/21 9:30 AM, Stefan Hajnoczi wrote:
>> +	if (info->pid == VHOST_VRING_NEW_WORKER) {
>> +		worker = vhost_worker_create(dev);
> 
> The maximum number of kthreads created is limited by
> vhost_dev_init(nvqs)? For example VHOST_SCSI_MAX_VQ 128.
> 
> IIUC kthread_create is not limited by or accounted against the current
> task, so I'm a little worried that a process can create a lot of
> kthreads.
> 
> I haven't investigated other kthread_create() users reachable from
> userspace applications to see how they bound the number of threads
> effectively.

Do we want something like io_uring's copy_process use? It's what fork uses,
so we get checks like RLIMIT_NPROC and max_threads.

I know I didn't look at everything, but it looks like for some software
drivers we just allow the user to run wild. For example for nbd, when we
create the device we do alloc_workqueue and use the default max_active
value (256). We then don't have a limit on connections, so we could end
up with 256 workqueue threads per device. And then there is no limit on
devices a user can make.


> 
> Any thoughts?
>

Is the concern a bad VM could create N devs each with 128 vqs/threads
and it would slow down other VMs? How do we handle the case where
some VM makes M * N devs and that is equal to N * 128 so we would end
up with the same number of threads either way? Is there a limit to the
number of vhost devices a VM can make and can I just stick in a similar
check for workers?

For vhost-scsi specifically, the 128 limit does not make a lot of sense.
I think we want the max to be the number of vCPUs the VM has so we can
add checks for that. Then we would assume someone making a VM with lots of
CPUs is going to have the resources to support them.

Note: It does make sense from the point of view that we don't know the
number of vCPUs when vhost-scsi calls vhost_dev_init, so I get we had to
select an initial limit.



>> +		if (!dev->workers) {
>> +			vhost_worker_put(worker);
>> +			return -ENOMEM;
>> +		}
>> +	}
>> +
>> +	vq->worker = worker;
>> +
>> +	dev->workers[dev->num_workers] = worker;
>> +	dev->num_workers++;
> 
> Hmm...should we really append to workers[] in the vhost_worker_find()
> case?


As it's coded now, yes. Every successful vhost_worker_find call does a
get on the worker's refcount. Later when we delete the device, we loop
over the workers array and for every entry we do a put.

I can add in some code to first check if the worker is already in the
dev's worker list. If so then skip the refcount and skip adding to the
workers array. If not in the dev's worker list then do a vhost_worker_find.

I thought it might be nicer how it is now with the single path. It's less
code at least. Later if we add support to change a vq's worker then we also
don't have to worry about refcounts as much. We just always drop the count
taken from when it was added.
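
For reference, that check would just be something like (sketch):

static bool vhost_dev_has_worker(struct vhost_dev *dev,
				 struct vhost_worker *worker)
{
	int i;

	for (i = 0; i < dev->num_workers; i++) {
		if (dev->workers[i] == worker)
			return true;
	}
	return false;
}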

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 9/9] vhost: support sharing workers across devs
  2021-06-03 14:32   ` Stefan Hajnoczi
@ 2021-06-07  2:18     ` Jason Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Wang @ 2021-06-07  2:18 UTC (permalink / raw)
  To: Stefan Hajnoczi, Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, mst, sgarzare, virtualization


在 2021/6/3 下午10:32, Stefan Hajnoczi 写道:
> On Tue, May 25, 2021 at 01:06:00PM -0500, Mike Christie wrote:
>> This allows a worker to handle multiple device's vqs.
>>
>> TODO:
>> - The worker is attached to the cgroup of the device that created it. In
>> this patch you can share workers with devices with different owners which
>> could be in different cgroups. Do we want to restict sharing workers with
>> devices that have the same owner (dev->mm value)?
> Question for Michael or Jason.


I think sharing workers within a cgroup should be fine.

The difference is that if we restrict the workers to devices with the same
owner, it may only work in the case where a VM has multiple vhost devices.

Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] vhost: allow userspace to create workers
  2021-06-05 23:53     ` michael.christie
@ 2021-06-07 15:19       ` Stefan Hajnoczi
  2021-06-09 21:03         ` Mike Christie
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-07 15:19 UTC (permalink / raw)
  To: michael.christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 3439 bytes --]

On Sat, Jun 05, 2021 at 06:53:58PM -0500, michael.christie@oracle.com wrote:
> On 6/3/21 9:30 AM, Stefan Hajnoczi wrote:
> >> +	if (info->pid == VHOST_VRING_NEW_WORKER) {
> >> +		worker = vhost_worker_create(dev);
> > 
> > The maximum number of kthreads created is limited by
> > vhost_dev_init(nvqs)? For example VHOST_SCSI_MAX_VQ 128.
> > 
> > IIUC kthread_create is not limited by or accounted against the current
> > task, so I'm a little worried that a process can create a lot of
> > kthreads.
> > 
> > I haven't investigated other kthread_create() users reachable from
> > userspace applications to see how they bound the number of threads
> > effectively.
> 
> Do we want something like io_uring's copy_process use? It's what fork uses,
> so we get checks like RLIMIT_NPROC and max_threads.
> 
> I know I didn't look at everything, but it looks like for some software
> drivers we just allow the user to run wild. For example for nbd, when we
> create the device to do alloc_workqueue and use the default max_active
> value (256). We then don't have a limit on connections, so we could end
> up with 256 workqueue threads per device. And then there is no limit on
> devices a user can make.
> 
> 
> > 
> > Any thoughts?
> >
> 
> Is the concern a bad VM could create N devs each with 128 vqs/threads
> and it would slow down other VMs? How do we handle the case where
> some VM makes M * N devs and that is equal to N * 128 so we would end
> up with the same number of threads either way? Is there a limit to the
> number of vhost devices a VM can make and can I just stick in a similar
> check for workers?
> 
> For vhost-scsi specifically, the 128 limit does not make a lot of sense.
> I think we want the max to be the number of vCPUs the VM has so we can
> add checks for that. Then we would assume someone making a VM with lots of
> CPUs is going to have the resources to support them.
> 
> Note: It does make sense from the point of view that we don't know the
> number of vCPUs when vhost-scsi calls vhost_dev_init, so I get we had to
> select an initial limit.

My concern is that threads should probably be accounted against
RLIMIT_NPROC and max_threads rather than something indirect like 128 *
RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
vhost-user file descriptors open).

> >> +		if (!dev->workers) {
> >> +			vhost_worker_put(worker);
> >> +			return -ENOMEM;
> >> +		}
> >> +	}
> >> +
> >> +	vq->worker = worker;
> >> +
> >> +	dev->workers[dev->num_workers] = worker;
> >> +	dev->num_workers++;
> > 
> > Hmm...should we really append to workers[] in the vhost_worker_find()
> > case?
> 
> 
> As it's coded now, yes. Every successful vhost_worker_find call does a
> get on the worker's refcount. Later when we delete the device, we loop
> over the workers array and for every entry we do a put.
> 
> I can add in some code to first check if the worker is already in the
> dev's worker list. If so then skip the refcount and skip adding to the
> workers array. If not in the dev's worker list then do a vhost_worker_find.
> 
> I thought it might be nicer how it is now with the single path. It's less
> code at least. Later if we add support to change a vq's worker then we also
> don't have to worry about refcounts as much. We just always drop the count
> taken from when it was added.

Thanks for explaining.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: vhost: multiple worker support
  2021-06-05 22:40     ` michael.christie
@ 2021-06-07 15:23       ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-07 15:23 UTC (permalink / raw)
  To: michael.christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 2193 bytes --]

On Sat, Jun 05, 2021 at 05:40:17PM -0500, michael.christie@oracle.com wrote:
> On 6/3/21 5:16 PM, Mike Christie wrote:
> > On 6/3/21 9:37 AM, Stefan Hajnoczi wrote:
> >> On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
> >>> The following patches apply over linus's tree or mst's vhost branch
> >>> and my cleanup patchset:
> >>>
> >>> https://urldefense.com/v3/__https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html__;!!GqivPVa7Brio!P55eslnMW_iZkoTUZckwhnSw_9Z35JBScgtSYRh4CMFTRkSsWwKOYqY0huUfBfIPM8BV$ 
> >>>
> >>> These patches allow us to support multiple vhost workers per device. I
> >>> ended up just doing Stefan's original idea where userspace has the
> >>> kernel create a worker and we pass back the pid. This has the benefit
> >>> over the workqueue and userspace thread approach where we only have
> >>> one'ish code path in the kernel during setup to detect old tools. The
> >>> main IO paths and device/vq setup/teardown paths all use common code.
> >>>
> >>> The kernel patches here allow us to then do N workers device and also
> >>> share workers across devices.
> >>>
> >>> I've also included a patch for qemu so you can get an idea of how it
> >>> works. If we are ok with the kernel code then I'll break that up into
> >>> a patchset and send to qemu-devel.
> >>
> >> It seems risky to allow userspace process A to "share" a vhost worker
> >> thread with userspace process B based on a matching pid alone. Should
> >> they have ptrace_may_access() or similar?
> >>
> > 
> > I'm not sure. I already made it a little restrictive in this posting, but
> 
> Made a mistake here. In this posting I did not make it restrictive and
> I was allowing any old 2 processes to share. So we would need something
> like ptrace_may_access if go this route.
> 
> If we restrict sharing workers with the same owner, then I'm not sure if
> need anything.

Agreed.

Sharing between processes becomes most interesting when there is busy
polling (because it consumes CPU and we should consolidate polling onto
as few CPUs as possible). Without polling we can just share the threads
within a process.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] vhost: allow userspace to create workers
  2021-06-07 15:19       ` Stefan Hajnoczi
@ 2021-06-09 21:03         ` Mike Christie
  2021-06-10  8:06           ` Stefan Hajnoczi
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-06-09 21:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

On 6/7/21 10:19 AM, Stefan Hajnoczi wrote:
> My concern is that threads should probably accounted against
> RLIMIT_NPROC and max_threads rather than something indirect like 128 *
> RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
> vhost-user file descriptors open).
> 

Ah ok, I see what you want I think.

Ok, I think the options are:

0. Nothing. Just use existing indirect/RLIMIT_NOFILE.

1. Do something like io_uring's create_io_thread/copy_process. If we call
copy_process from the vhost ioctl context, then the userspace process that
did the ioctl will have it's processes count incremented and checked against
its rlimit.

The drawbacks:
- This gets a little more complicated than just calling copy_process though.
We end up duplicating a lot of the kthread API.
- We have to deal with new error cases like the parent exiting early.
- I think all devs sharing a worker have to have the same owner. kthread_use_mm
and kthread_unuse_mm to switch between mm's for differrent owner's devs seem to
be causing lots of errors. I'm still looking into this one though.

2.  It's not really what you want, but for unbound work io_uring has a check for
RLIMIT_NPROC in the io_uring code. It does:

wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
					task_rlimit(current, RLIMIT_NPROC);

then does:

if (!ret && acct->nr_workers < acct->max_workers) {

Drawbacks:
In vhost.c, we could do something similar. It would make sure that vhost.c does
not create more worker threads than the rlimit value, but we wouldn't be
incrementing the userspace process's process count. The userspace process could
then create RLIMIT_NPROC threads and vhost.c could also create RLIMIT_NPROC
threads, so we end up with 2 * RLIMIT_NPROC threads.

3. Change the kthread and copy_process code so we can pass in the thread
(or its creds or some struct that has the values that need to be checked) that
needs to be checked and updated.

Drawback:
This might be considered too ugly for how special-case vhost is. For example, we
need checks/code like the io_thread/PF_IO_WORKER code in copy_process for io_uring.
I can see how that was added for io_uring because it affects so many users, but I can
see how vhost is not special enough.
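
To make option 2 above concrete, the vhost.c side would be roughly (sketch
only, reusing dev->num_workers from this patchset):

static bool vhost_dev_below_worker_limit(struct vhost_dev *dev)
{
	/*
	 * Caps the workers we create at RLIMIT_NPROC but does not charge
	 * the caller's real process count, hence the 2 * RLIMIT_NPROC
	 * issue mentioned above.
	 */
	return dev->num_workers < task_rlimit(current, RLIMIT_NPROC);
}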

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] vhost: allow userspace to create workers
  2021-06-09 21:03         ` Mike Christie
@ 2021-06-10  8:06           ` Stefan Hajnoczi
  2021-06-18  2:49             ` Mike Christie
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-10  8:06 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 2703 bytes --]

On Wed, Jun 09, 2021 at 04:03:55PM -0500, Mike Christie wrote:
> On 6/7/21 10:19 AM, Stefan Hajnoczi wrote:
> > My concern is that threads should probably accounted against
> > RLIMIT_NPROC and max_threads rather than something indirect like 128 *
> > RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
> > vhost-user file descriptors open).
> > 
> 
> Ah ok, I see what you want I think.
> 
> Ok, I think the options are:
> 
> 0. Nothing. Just use existing indirect/RLIMIT_NOFILE.
> 
> 1. Do something like io_uring's create_io_thread/copy_process. If we call
> copy_process from the vhost ioctl context, then the userspace process that
> did the ioctl will have it's processes count incremented and checked against
> its rlimit.
> 
> The drawbacks:
> - This gets a little more complicated than just calling copy_process though.
> We end up duplicating a lot of the kthread API.
> - We have to deal with new error cases like the parent exiting early.
> - I think all devs sharing a worker have to have the same owner. kthread_use_mm
> and kthread_unuse_mm to switch between mm's for differrent owner's devs seem to
> be causing lots of errors. I'm still looking into this one though.
> 
> 2.  It's not really what you want, but for unbound work io_uring has a check for
> RLIMIT_NPROC in the io_uring code. It does:
> 
> wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
> 					task_rlimit(current, RLIMIT_NPROC);
> 
> then does:
> 
> if (!ret && acct->nr_workers < acct->max_workers) {
> 
> Drawbacks:
> In vhost.c, we could do something similar. It would make sure that vhost.c does
> not create more worker threads than the rlimit value, but we wouldn't be
> incrementing the userspace process's process count. The userspace process could
> then create RLIMIT_NPROC threads and vhost.c could also create RLIMIT_NPROC
> threads, so we end up with 2 * RLIMIT_NPROC threads.

Yes, in that case we might as well go with Option 0, so I think this
option can be eliminated.

> 3. Change the kthread and copy_process code so we can pass in the thread
> (or it's creds or some struct that has the values that need to be check) that
> needs to be checked and updated.
> 
> Drawback:
> This might be considered too ugly for how special case vhost is. For example, we
> need checks/code like the io_thread/PF_IO_WORKER code in copy_process for io_uring.
> I can see how added that for io_uring because it affects so many users, but I can
> see how vhost is not special enough.

This seems like the most general solution. If you try it and get
negative feedback then maybe the maintainers can help suggest how to
solve this problem :).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] vhost: allow userspace to create workers
  2021-06-10  8:06           ` Stefan Hajnoczi
@ 2021-06-18  2:49             ` Mike Christie
  2021-06-21 13:41               ` Stefan Hajnoczi
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2021-06-18  2:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

On 6/10/21 3:06 AM, Stefan Hajnoczi wrote:
> On Wed, Jun 09, 2021 at 04:03:55PM -0500, Mike Christie wrote:
>> On 6/7/21 10:19 AM, Stefan Hajnoczi wrote:
>>> My concern is that threads should probably accounted against
>>> RLIMIT_NPROC and max_threads rather than something indirect like 128 *
>>> RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
>>> vhost-user file descriptors open).
>>>
>>
>> Ah ok, I see what you want I think.
>>
>> Ok, I think the options are:
>>
>> 0. Nothing. Just use existing indirect/RLIMIT_NOFILE.
>>
>> 1. Do something like io_uring's create_io_thread/copy_process. If we call
>> copy_process from the vhost ioctl context, then the userspace process that
>> did the ioctl will have it's processes count incremented and checked against
>> its rlimit.
>>
>> The drawbacks:
>> - This gets a little more complicated than just calling copy_process though.
>> We end up duplicating a lot of the kthread API.
>> - We have to deal with new error cases like the parent exiting early.
>> - I think all devs sharing a worker have to have the same owner. kthread_use_mm
>> and kthread_unuse_mm to switch between mm's for differrent owner's devs seem to
>> be causing lots of errors. I'm still looking into this one though.
>>
>> 2.  It's not really what you want, but for unbound work io_uring has a check for
>> RLIMIT_NPROC in the io_uring code. It does:
>>
>> wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
>> 					task_rlimit(current, RLIMIT_NPROC);
>>
>> then does:
>>
>> if (!ret && acct->nr_workers < acct->max_workers) {
>>
>> Drawbacks:
>> In vhost.c, we could do something similar. It would make sure that vhost.c does
>> not create more worker threads than the rlimit value, but we wouldn't be
>> incrementing the userspace process's process count. The userspace process could
>> then create RLIMIT_NPROC threads and vhost.c could also create RLIMIT_NPROC
>> threads, so we end up with 2 * RLIMIT_NPROC threads.
> 
> Yes, in that case we might as well go with Option 0, so I think this
> option can be eliminated.
> 
>> 3. Change the kthread and copy_process code so we can pass in the thread
>> (or it's creds or some struct that has the values that need to be check) that
>> needs to be checked and updated.
>>
>> Drawback:
>> This might be considered too ugly for how special case vhost is. For example, we
>> need checks/code like the io_thread/PF_IO_WORKER code in copy_process for io_uring.
>> I can see how added that for io_uring because it affects so many users, but I can
>> see how vhost is not special enough.
> 
> This seems like the most general solution. If you try it and get
> negative feedback then maybe the maintainers can help suggest how to
> solve this problem :).

Hey, I implemented #3 here:

https://github.com/mikechristie/linux/commit/76f7a555a85147420a22d0163c15259e01e02193

in this patchset:

https://github.com/mikechristie/linux/commits/kthread-node-user

but before I post I wanted to bring up an option 4 someone mentioned to me
offlist.

Again it's io_uring. Check out fs/io_uring.c:__io_account_mem(). For RLIMIT_MEMLOCK
it just does the check and increments the user's counter itself. It's simple like
option 2, and it handles the issue where the process doing the ioctl wasn't having
its RLIMIT_NPROC checked/updated.
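
Roughly what that would look like for us (sketch; where the per-user count
lives is hand-waved here -- assume some atomic_t reachable from the caller's
creds/user_struct):

static int vhost_account_worker(atomic_t *nr_user_workers)
{
	int cur = atomic_read(nr_user_workers);

	do {
		if (cur >= task_rlimit(current, RLIMIT_NPROC))
			return -EAGAIN;
	} while (!atomic_try_cmpxchg(nr_user_workers, &cur, cur + 1));

	/* decremented again when the worker is destroyed */
	return 0;
}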

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] vhost: allow userspace to create workers
  2021-06-18  2:49             ` Mike Christie
@ 2021-06-21 13:41               ` Stefan Hajnoczi
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2021-06-21 13:41 UTC (permalink / raw)
  To: Mike Christie
  Cc: target-devel, linux-scsi, pbonzini, jasowang, mst, sgarzare,
	virtualization

[-- Attachment #1: Type: text/plain, Size: 513 bytes --]

On Thu, Jun 17, 2021 at 09:49:07PM -0500, Mike Christie wrote:
> Again it's io_uring. Check out fs/io_uring.c:__io_account_mem(). For RLIMIT_MEMLOCK
> it just does the check and increments the user's counter itself. It's simple like
> option 2, and it handles the issue where the process doing the ioctl wasn't having
> its RLIMIT_NPROC checked/updated.

This can work too. It doesn't cover cases where code called indirectly
acquires resources, but that's probably fine for the vhost worker thread
case.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-06-21 13:41 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-25 18:05 vhost: multiple worker support Mike Christie
2021-05-25 18:05 ` [PATCH 1/9] vhost: move worker thread fields to new struct Mike Christie
2021-06-03 10:16   ` Stefan Hajnoczi
2021-05-25 18:05 ` [PATCH 2/9] vhost: move vhost worker creation to kick setup Mike Christie
2021-06-03 10:28   ` Stefan Hajnoczi
2021-05-25 18:05 ` [PATCH 3/9] vhost: modify internal functions to take a vhost_worker Mike Christie
2021-06-03 10:45   ` Stefan Hajnoczi
2021-05-25 18:05 ` [PATCH 4/9] vhost: allow vhost_polls to use different vhost_workers Mike Christie
2021-06-03 13:51   ` Stefan Hajnoczi
2021-05-25 18:05 ` [PATCH 5/9] vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
2021-06-03 13:54   ` Stefan Hajnoczi
2021-05-25 18:05 ` [PATCH 6/9] vhost-scsi: make SCSI cmd completion per vq Mike Christie
2021-06-03 13:57   ` Stefan Hajnoczi
2021-05-25 18:05 ` [PATCH 7/9] vhost: allow userspace to create workers Mike Christie
2021-06-03 14:30   ` Stefan Hajnoczi
2021-06-05 23:53     ` michael.christie
2021-06-07 15:19       ` Stefan Hajnoczi
2021-06-09 21:03         ` Mike Christie
2021-06-10  8:06           ` Stefan Hajnoczi
2021-06-18  2:49             ` Mike Christie
2021-06-21 13:41               ` Stefan Hajnoczi
2021-05-25 18:05 ` [PATCH 8/9] vhost: add vhost_dev pointer to vhost_work Mike Christie
2021-06-03 14:31   ` Stefan Hajnoczi
2021-05-25 18:06 ` [PATCH 9/9] vhost: support sharing workers across devs Mike Christie
2021-06-03 14:32   ` Stefan Hajnoczi
2021-06-07  2:18     ` Jason Wang
2021-06-03 10:13 ` vhost: multiple worker support Stefan Hajnoczi
2021-06-03 18:45   ` Mike Christie
2021-06-03 14:37 ` Stefan Hajnoczi
2021-06-03 22:16   ` Mike Christie
2021-06-05 22:40     ` michael.christie
2021-06-07 15:23       ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).