* [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity
@ 2020-12-04  7:56 Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 1/8] vhost: remove work arg from vhost_work_flush Mike Christie
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

The following patches, made over mst's vhost branch, allow userspace
to set each vq's cpu affinity. Currently, with cgroups the worker thread
inherits the affinity settings, but we are at the mercy of the CPU
scheduler for where each vq's IO gets executed. This can result in the
scheduler sometimes hammering a couple of queues on the host instead of
spreading the work out the way the guest's app might have intended if it
was mq aware.

This version of the patches is not what you guys were initially
discussing, i.e. an interface similar to nbd's old (3.x kernel days)
NBD_DO_IT ioctl, where userspace calls down to the kernel and we run
from that context. These patches instead just allow userspace to tell
the kernel which CPU a vq should run on. We then use the kernel's
workqueue code to handle the thread management.
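
For reference, a minimal userspace sketch of the proposed interface
(patch 5 adds VHOST_SET_VRING_CPU and reuses struct vhost_vring_state;
the helper name below is made up and error handling is elided):

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Pin the given vq of an already-open vhost device fd to a CPU.
 * Passing cpu == -1 keeps the vq on the default worker thread. */
static int vhost_bind_vq_to_cpu(int vhost_fd, unsigned int vq_index, int cpu)
{
	struct vhost_vring_state state = {
		.index = vq_index,
		.num = cpu,
	};

	return ioctl(vhost_fd, VHOST_SET_VRING_CPU, &state);
}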

I wanted to post this version first because it is flexible: userspace
can set things up so devs/vqs share threads/CPUs, and we don't have to
replicate a bunch of features that the workqueue code already has, like
dynamic thread creation, blocked work detection, idle thread detection
and thread reaping. The workqueue code also has an interface to control
how many threads can be created and which CPUs work can run on, if we
want to further restrict that from userspace.
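
As a rough example of that last point: patch 8 creates the vhost-scsi
workqueue with WQ_SYSFS, so its attributes appear under
/sys/devices/virtual/workqueue/. A sketch of capping its max_active
knob from C (the path assumes the "vhost_scsi" workqueue name used in
patch 8; error handling is minimal):

#include <stdio.h>

/* Limit how many work items the vhost_scsi workqueue may run
 * concurrently. Only valid for workqueues created with WQ_SYSFS. */
static int vhost_scsi_wq_set_max_active(int max_active)
{
	FILE *f = fopen("/sys/devices/virtual/workqueue/vhost_scsi/max_active", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", max_active);
	return fclose(f);
}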

Note that these patches have only been lightly tested. I mainly wanted
to get comments on the overall approach, because I know it's not really
what you were thinking about. But while I worked on being able to share
threads across multiple devices, I kept coming back to the existing
workqueue code and thinking I would just end up copying and pasting it.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/8] vhost: remove work arg from vhost_work_flush
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 2/8] vhost-scsi: remove extra flushes Mike Christie
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

vhost_work_flush doesn't do anything with the work arg. This patch drops
it and then renames vhost_work_flush to vhost_work_dev_flush to reflect
that the function flushes all the work items in the dev and not just a
specific queue or work item.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
 drivers/vhost/scsi.c  | 4 ++--
 drivers/vhost/vhost.c | 8 ++++----
 drivers/vhost/vhost.h | 2 +-
 drivers/vhost/vsock.c | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 6ff8a5096..93d5836 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1470,8 +1470,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 	/* Flush both the vhost poll and vhost work */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
 		vhost_scsi_flush_vq(vs, i);
-	vhost_work_flush(&vs->dev, &vs->vs_completion_work);
-	vhost_work_flush(&vs->dev, &vs->vs_event_work);
+	vhost_work_dev_flush(&vs->dev);
+	vhost_work_dev_flush(&vs->dev);
 
 	/* Wait for all reqs issued before the flush to be finished */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a262e12..78d9535 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,7 +231,7 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+void vhost_work_dev_flush(struct vhost_dev *dev)
 {
 	struct vhost_flush_struct flush;
 
@@ -243,13 +243,13 @@ void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
 		wait_for_completion(&flush.wait_event);
 	}
 }
-EXPORT_SYMBOL_GPL(vhost_work_flush);
+EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
 /* Flush any work that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_dev_flush(poll->dev);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
@@ -538,7 +538,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 	attach.owner = current;
 	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
 	vhost_work_queue(dev, &attach.work);
-	vhost_work_flush(dev, &attach.work);
+	vhost_work_dev_flush(dev);
 	return attach.ret;
 }
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b063324..1ba8e81 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -46,7 +46,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work);
+void vhost_work_dev_flush(struct vhost_dev *dev);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
 
 struct vhost_log {
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a483cec..f40205f 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -652,7 +652,7 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++)
 		if (vsock->vqs[i].handle_kick)
 			vhost_poll_flush(&vsock->vqs[i].poll);
-	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
+	vhost_work_dev_flush(&vsock->dev);
 }
 
 static void vhost_vsock_reset_orphans(struct sock *sk)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 2/8] vhost-scsi: remove extra flushes
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 1/8] vhost: remove work arg from vhost_work_flush Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 3/8] vhost poll: fix coding style Mike Christie
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

The vhost work flush function was flushing the entire work queue, so
there is no need for the double vhost_work_dev_flush calls in
vhost_scsi_flush.

And we do not need to call vhost_poll_flush for each poller because
that call also ends up flushing the same work queue thread the
vhost_work_dev_flush call flushed.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/vhost/scsi.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 93d5836..2ba1c19 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1445,11 +1445,6 @@ static void vhost_scsi_handle_kick(struct vhost_work *work)
 	vhost_scsi_handle_vq(vs, vq);
 }
 
-static void vhost_scsi_flush_vq(struct vhost_scsi *vs, int index)
-{
-	vhost_poll_flush(&vs->vqs[index].vq.poll);
-}
-
 /* Callers must hold dev mutex */
 static void vhost_scsi_flush(struct vhost_scsi *vs)
 {
@@ -1468,9 +1463,6 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 		kref_put(&old_inflight[i]->kref, vhost_scsi_done_inflight);
 
 	/* Flush both the vhost poll and vhost work */
-	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
-		vhost_scsi_flush_vq(vs, i);
-	vhost_work_dev_flush(&vs->dev);
 	vhost_work_dev_flush(&vs->dev);
 
 	/* Wait for all reqs issued before the flush to be finished */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 3/8] vhost poll: fix coding style
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 1/8] vhost: remove work arg from vhost_work_flush Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 2/8] vhost-scsi: remove extra flushes Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 4/8] vhost: move msg_handler to new ops struct Mike Christie
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

We currently use three different coding styles in this struct. Switch to
just tabs.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/vhost/vhost.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 1ba8e81..575c818 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -28,12 +28,12 @@ struct vhost_work {
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
-	poll_table                table;
-	wait_queue_head_t        *wqh;
-	wait_queue_entry_t              wait;
-	struct vhost_work	  work;
-	__poll_t		  mask;
-	struct vhost_dev	 *dev;
+	poll_table		table;
+	wait_queue_head_t	*wqh;
+	wait_queue_entry_t	wait;
+	struct vhost_work	work;
+	__poll_t		mask;
+	struct vhost_dev	*dev;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 4/8] vhost: move msg_handler to new ops struct
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
                   ` (2 preceding siblings ...)
  2020-12-04  7:56 ` [RFC PATCH 3/8] vhost poll: fix coding style Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs Mike Christie
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

The next patch adds a callout so drivers can perform some action when we
bind a vq to a cpu, so this patch moves the msg_handler callout to a new
vhost_dev_ops struct just to keep all the callouts better organized.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vdpa.c  |  7 +++++--
 drivers/vhost/vhost.c | 10 ++++------
 drivers/vhost/vhost.h | 11 ++++++-----
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ef688c8..1c8c2bf 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -844,6 +844,10 @@ static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v)
 	}
 }
 
+static struct vhost_dev_ops vdpa_dev_ops = {
+	.msg_handler	= vhost_vdpa_process_iotlb_msg,
+};
+
 static int vhost_vdpa_open(struct inode *inode, struct file *filep)
 {
 	struct vhost_vdpa *v;
@@ -871,8 +875,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
 		vqs[i] = &v->vqs[i];
 		vqs[i]->handle_kick = handle_vq_kick;
 	}
-	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
-		       vhost_vdpa_process_iotlb_msg);
+	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, &vdpa_dev_ops);
 
 	dev->iotlb = vhost_iotlb_alloc(0, 0);
 	if (!dev->iotlb) {
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 78d9535..ee2551c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -467,9 +467,7 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq,
 void vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue **vqs, int nvqs,
 		    int iov_limit, int weight, int byte_weight,
-		    bool use_worker,
-		    int (*msg_handler)(struct vhost_dev *dev,
-				       struct vhost_iotlb_msg *msg))
+		    bool use_worker, struct vhost_dev_ops *ops)
 {
 	struct vhost_virtqueue *vq;
 	int i;
@@ -486,7 +484,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
-	dev->msg_handler = msg_handler;
+	dev->ops = ops;
 	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
@@ -1164,8 +1162,8 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 		goto done;
 	}
 
-	if (dev->msg_handler)
-		ret = dev->msg_handler(dev, &msg);
+	if (dev->ops && dev->ops->msg_handler)
+		ret = dev->ops->msg_handler(dev, &msg);
 	else
 		ret = vhost_process_iotlb_msg(dev, &msg);
 	if (ret) {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 575c818..64fa638 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -143,6 +143,10 @@ struct vhost_msg_node {
   struct list_head node;
 };
 
+struct vhost_dev_ops {
+	int (*msg_handler)(struct vhost_dev *dev, struct vhost_iotlb_msg *msg);
+};
+
 struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
@@ -162,16 +166,13 @@ struct vhost_dev {
 	int byte_weight;
 	u64 kcov_handle;
 	bool use_worker;
-	int (*msg_handler)(struct vhost_dev *dev,
-			   struct vhost_iotlb_msg *msg);
+	struct vhost_dev_ops *ops;
 };
 
 bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
 		    int nvqs, int iov_limit, int weight, int byte_weight,
-		    bool use_worker,
-		    int (*msg_handler)(struct vhost_dev *dev,
-				       struct vhost_iotlb_msg *msg));
+		    bool use_worker, struct vhost_dev_ops *ops);
 long vhost_dev_set_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
                   ` (3 preceding siblings ...)
  2020-12-04  7:56 ` [RFC PATCH 4/8] vhost: move msg_handler to new ops struct Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04  8:09     ` Jason Wang
  2020-12-04  7:56 ` [RFC PATCH 6/8] vhost-scsi: make SCSI cmd completion per vq Mike Christie
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

This patch allows userspace to bind vqs to specific CPUs. Since
cgroups will not be supported, this just uses the normal kernel
workqueues. I know you guys were thinking about userspace
initiating the threading and doing something like nbd's DO_IT
ioctl, or something like shared thread pools, etc. But workqueues
have the following benefits:

1. You can share the same thread with different devs and/or vqs
and there's almost no extra code in vhost.c.

2. If a work item blocks for too long, the workqueue code can
create new threads on demand for us. This ends up being helpful
for the scsi case, where we can block waiting for IO completions
when the queues get too full.

3. We get the workqueue's dynamic thread creation and destruction.
We don't have to add our own reaping code when the system becomes
less busy.

Some TODOs:

1. What about the default worker? I left this setup for now.
For vhost-scsi, we can have multiple queues, and then also multiple
LUNs under the same vhost-scsi device. So we might want some
lower perf LUNs to use the default worker and the existing
cgroup settings.

2. I added the get_workqueue callout so drivers could pass in
their own workqueue in case some wanted to do a per-device one.
We could also just use one vhost workqueue. vhost-scsi can do a
lot of work that might block in its work struct, so I didn't want
it to interfere with the other devs.

Some results:

With this patch and the patches for 5.11's target/lio layer that
fix up/remove the locking in the main IO path, I can get 920K IOPs
with the lio ram disk and 800K with the nullblk driver and iblock
backend. This is with 8 vqs, virtqueue_size/cmd_per_lun=1024 (in
some guests I had to set both really high to get the guest's
scsi_host can_queue high enough to allow IO from all vqs), and
a fio command like:

fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k --ioengine=libaio
--iodepth=128  --numjobs=8 --time_based --group_reporting --name=iops
--runtime=60 --eta-newline=1

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/net.c        |   6 ++-
 drivers/vhost/vhost.c      | 130 +++++++++++++++++++++++++++++++++++++--------
 drivers/vhost/vhost.h      |  15 +++++-
 include/uapi/linux/vhost.h |   5 ++
 4 files changed, 129 insertions(+), 27 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 531a00d..6a27fe6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1330,8 +1330,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
 		       NULL);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 	n->page_frag.page = NULL;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ee2551c..f425d0f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -187,13 +187,15 @@ void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev)
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
 	poll->dev = dev;
 	poll->wqh = NULL;
+	poll->vq = vq;
 
 	vhost_work_init(&poll->work, fn);
 }
@@ -242,6 +244,9 @@ void vhost_work_dev_flush(struct vhost_dev *dev)
 		vhost_work_queue(dev, &flush.work);
 		wait_for_completion(&flush.wait_event);
 	}
+
+	if (dev->wq)
+		flush_workqueue(dev->wq);
 }
 EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
@@ -253,7 +258,46 @@ void vhost_poll_flush(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+static bool vhost_run_work_list(struct vhost_dev *dev,
+				struct llist_head *work_list)
+{
+	struct vhost_work *work, *work_next;
+	struct llist_node *node;
+
+	node = llist_del_all(work_list);
+	if (!node)
+		return false;
+
+	node = llist_reverse_order(node);
+	/* make sure flag is seen after deletion */
+	smp_wmb();
+	llist_for_each_entry_safe(work, work_next, node, node) {
+		clear_bit(VHOST_WORK_QUEUED, &work->flags);
+		__set_current_state(TASK_RUNNING);
+		kcov_remote_start_common(dev->kcov_handle);
+		work->fn(work);
+		kcov_remote_stop();
+		if (need_resched())
+			schedule();
+	}
+
+	return true;
+}
+
+static void vhost_vq_work(struct work_struct *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  work);
+	struct vhost_dev *dev = vq->dev;
+
+	kthread_use_mm(dev->mm);
+	vhost_run_work_list(dev, &vq->work_list);
+	kthread_unuse_mm(dev->mm);
+}
+
+static void __vhost_work_queue(struct vhost_dev *dev,
+			       struct vhost_virtqueue *vq,
+			       struct vhost_work *work)
 {
 	if (!dev->worker)
 		return;
@@ -263,12 +307,28 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		if (!vq || vq->cpu == -1) {
+			llist_add(&work->node, &dev->work_list);
+			wake_up_process(dev->worker);
+		} else {
+			llist_add(&work->node, &vq->work_list);
+			queue_work_on(vq->cpu, dev->wq, &vq->work);
+		}
 	}
 }
+
+void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+{
+	__vhost_work_queue(dev, NULL, work);
+}
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+	__vhost_work_queue(vq->dev, vq, work);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
@@ -278,7 +338,7 @@ bool vhost_has_work(struct vhost_dev *dev)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_vq_work_queue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
@@ -344,8 +404,6 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_dev *dev = data;
-	struct vhost_work *work, *work_next;
-	struct llist_node *node;
 
 	kthread_use_mm(dev->mm);
 
@@ -358,22 +416,8 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
-		if (!node)
+		if (!vhost_run_work_list(dev, &dev->work_list))
 			schedule();
-
-		node = llist_reverse_order(node);
-		/* make sure flag is seen after deletion */
-		smp_wmb();
-		llist_for_each_entry_safe(work, work_next, node, node) {
-			clear_bit(VHOST_WORK_QUEUED, &work->flags);
-			__set_current_state(TASK_RUNNING);
-			kcov_remote_start_common(dev->kcov_handle);
-			work->fn(work);
-			kcov_remote_stop();
-			if (need_resched())
-				schedule();
-		}
 	}
 	kthread_unuse_mm(dev->mm);
 	return 0;
@@ -480,6 +524,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->iotlb = NULL;
 	dev->mm = NULL;
 	dev->worker = NULL;
+	dev->wq = NULL;
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
@@ -497,12 +542,15 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->log = NULL;
 		vq->indirect = NULL;
 		vq->heads = NULL;
+		vq->cpu = -1;
 		vq->dev = dev;
+		init_llist_head(&vq->work_list);
+		INIT_WORK(&vq->work, vhost_vq_work);
 		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					EPOLLIN, dev);
+					EPOLLIN, dev, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -1572,6 +1620,39 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 
 	return r;
 }
+
+static long vhost_vring_set_cpu(struct vhost_dev *d, struct vhost_virtqueue *vq,
+				void __user *argp)
+{
+	struct vhost_vring_state s;
+	int ret = 0;
+
+	if (vq->private_data)
+		return -EBUSY;
+
+	if (copy_from_user(&s, argp, sizeof s))
+		return -EFAULT;
+
+	if (s.num == -1) {
+		vq->cpu = s.num;
+		return 0;
+	}
+
+	if (s.num >= nr_cpu_ids)
+		return -EINVAL;
+
+	if (!d->ops || !d->ops->get_workqueue)
+		return -EINVAL;
+
+	if (!d->wq)
+		d->wq = d->ops->get_workqueue();
+	if (!d->wq)
+		return -EINVAL;
+
+	vq->cpu = s.num;
+	return ret;
+}
+
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
 	struct file *eventfp, *filep = NULL;
@@ -1601,6 +1682,9 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 	mutex_lock(&vq->mutex);
 
 	switch (ioctl) {
+	case VHOST_SET_VRING_CPU:
+		r = vhost_vring_set_cpu(d, vq, argp);
+		break;
 	case VHOST_SET_VRING_BASE:
 		/* Moving base with an active backend?
 		 * You don't want to do that. */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 64fa638..28ff4a2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -26,7 +26,6 @@ struct vhost_work {
 };
 
 /* Poll a file (eventfd or socket) */
-/* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
 	poll_table		table;
 	wait_queue_head_t	*wqh;
@@ -34,14 +33,17 @@ struct vhost_poll {
 	struct vhost_work	work;
 	__poll_t		mask;
 	struct vhost_dev	*dev;
+	struct vhost_virtqueue	*vq;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev);
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -82,6 +84,9 @@ struct vhost_virtqueue {
 	struct eventfd_ctx *error_ctx;
 	struct eventfd_ctx *log_ctx;
 
+	unsigned int cpu;
+	struct work_struct work;
+	struct llist_head work_list;
 	struct vhost_poll poll;
 
 	/* The routine to call when the Guest pings us, or timeout. */
@@ -145,6 +150,11 @@ struct vhost_msg_node {
 
 struct vhost_dev_ops {
 	int (*msg_handler)(struct vhost_dev *dev, struct vhost_iotlb_msg *msg);
+	/*
+	 * If the driver supports the VHOST_SET_VRING_CPU ioctl this must
+	 * return the workqueue to use.
+	 */ 
+	struct workqueue_struct *(*get_workqueue)(void);
 };
 
 struct vhost_dev {
@@ -166,6 +176,7 @@ struct vhost_dev {
 	int byte_weight;
 	u64 kcov_handle;
 	bool use_worker;
+	struct workqueue_struct *wq;
 	struct vhost_dev_ops *ops;
 };
 
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c998860..78b31d8 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -70,6 +70,11 @@
 #define VHOST_VRING_BIG_ENDIAN 1
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+/* Bind a vring to a CPU. If vhost_vring_state.num is -1, the default worker
+ * and its cgroup will be used. If vhost_vring_state.num != -1, the vring will
+ * be bound to the CPU in vhost_vring_state.num.
+ */
+#define VHOST_SET_VRING_CPU _IOW(VHOST_VIRTIO, 0x15, struct vhost_vring_state)
 
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 6/8] vhost-scsi: make SCSI cmd completion per vq
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
                   ` (4 preceding siblings ...)
  2020-12-04  7:56 ` [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 7/8] vhost, vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

Later patches in this series will allow userspace to bind a vq
to specific CPUs. This patch separates the scsi cmd completion code
paths so we can complete cmds based on their vq instead of having all
cmds complete on the same CPU.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 48 +++++++++++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 2ba1c19..08bc513 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -178,6 +178,7 @@ enum {
 
 struct vhost_scsi_virtqueue {
 	struct vhost_virtqueue vq;
+	struct vhost_scsi *vs;
 	/*
 	 * Reference counting for inflight reqs, used for flush operation. At
 	 * each time, one reference tracks new commands submitted, while we
@@ -192,6 +193,9 @@ struct vhost_scsi_virtqueue {
 	struct vhost_scsi_cmd *scsi_cmds;
 	struct sbitmap scsi_tags;
 	int max_cmds;
+
+	struct vhost_work completion_work;
+	struct llist_head completion_list;
 };
 
 struct vhost_scsi {
@@ -202,9 +206,6 @@ struct vhost_scsi {
 	struct vhost_dev dev;
 	struct vhost_scsi_virtqueue vqs[VHOST_SCSI_MAX_VQ];
 
-	struct vhost_work vs_completion_work; /* cmd completion work item */
-	struct llist_head vs_completion_list; /* cmd completion queue */
-
 	struct vhost_work vs_event_work; /* evt injection work item */
 	struct llist_head vs_event_list; /* evt injection queue */
 
@@ -381,10 +382,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 	} else {
 		struct vhost_scsi_cmd *cmd = container_of(se_cmd,
 					struct vhost_scsi_cmd, tvc_se_cmd);
-		struct vhost_scsi *vs = cmd->tvc_vhost;
+		struct vhost_scsi_virtqueue *svq =  container_of(cmd->tvc_vq,
+					struct vhost_scsi_virtqueue, vq);
 
-		llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-		vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+		llist_add(&cmd->tvc_completion_list, &svq->completion_list);
+		vhost_vq_work_queue(&svq->vq, &svq->completion_work);
 	}
 }
 
@@ -547,18 +549,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
  */
 static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
-	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-					vs_completion_work);
-	DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
+	struct vhost_scsi_virtqueue *svq = container_of(work,
+				struct vhost_scsi_virtqueue, completion_work);
 	struct virtio_scsi_cmd_resp v_rsp;
 	struct vhost_scsi_cmd *cmd, *t;
 	struct llist_node *llnode;
 	struct se_cmd *se_cmd;
 	struct iov_iter iov_iter;
-	int ret, vq;
+	bool signal = false;
+	int ret;
 
-	bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
-	llnode = llist_del_all(&vs->vs_completion_list);
+	llnode = llist_del_all(&svq->completion_list);
 	llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
 		se_cmd = &cmd->tvc_se_cmd;
 
@@ -578,21 +579,16 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 			      cmd->tvc_in_iovs, sizeof(v_rsp));
 		ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
 		if (likely(ret == sizeof(v_rsp))) {
-			struct vhost_scsi_virtqueue *q;
+			signal = true;
 			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
-			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
-			vq = q - vs->vqs;
-			__set_bit(vq, signal);
 		} else
 			pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
 		vhost_scsi_release_cmd_res(se_cmd);
 	}
 
-	vq = -1;
-	while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
-		< VHOST_SCSI_MAX_VQ)
-		vhost_signal(&vs->dev, &vs->vqs[vq].vq);
+	if (signal)
+		vhost_signal(&svq->vs->dev, &svq->vq);
 }
 
 static struct vhost_scsi_cmd *
@@ -1801,6 +1797,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
+	struct vhost_scsi_virtqueue *svq;
 	struct vhost_scsi *vs;
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
@@ -1816,7 +1813,6 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
 	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
@@ -1827,8 +1823,14 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	vs->vqs[VHOST_SCSI_VQ_CTL].vq.handle_kick = vhost_scsi_ctl_handle_kick;
 	vs->vqs[VHOST_SCSI_VQ_EVT].vq.handle_kick = vhost_scsi_evt_handle_kick;
 	for (i = VHOST_SCSI_VQ_IO; i < VHOST_SCSI_MAX_VQ; i++) {
-		vqs[i] = &vs->vqs[i].vq;
-		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
+		svq = &vs->vqs[i];
+
+		vqs[i] = &svq->vq;
+		svq->vs = vs;
+		init_llist_head(&svq->completion_list);
+		vhost_work_init(&svq->completion_work,
+				vhost_scsi_complete_cmd_work);
+		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
 	vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
 		       VHOST_SCSI_WEIGHT, 0, true, NULL);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 7/8] vhost, vhost-scsi: flush IO vqs then send TMF rsp
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
                   ` (5 preceding siblings ...)
  2020-12-04  7:56 ` [RFC PATCH 6/8] vhost-scsi: make SCSI cmd completion per vq Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04  7:56 ` [RFC PATCH 8/8] vhost-scsi: hook vhost-scsi into vring set cpu support Mike Christie
  2020-12-04 16:06   ` Stefano Garzarella
  8 siblings, 0 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

With one worker we will always send the scsi cmd responses and then send
the TMF rsp, because LIO will always complete the scsi cmds first, which
calls vhost_scsi_release_cmd to add them to the work queue.

When the next patch adds multiple worker support, the worker threads
could still be sending their responses when the tmf's work is run.
So this patch has vhost-scsi flush the IO vqs on other worker threads
before we send the tmf response.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c  | 20 ++++++++++++++++++--
 drivers/vhost/vhost.c |  9 +++++++++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 08bc513..8005a7f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1179,12 +1179,28 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work)
 {
 	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
 						  vwork);
+	struct vhost_virtqueue *vq;
+	unsigned int cpu;
 	int resp_code;
+	int i;
 
-	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE)
+	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
+		/*
+		 * When processing a TMF, lio completes the cmds then the
+		 * TMF, so with one worker the TMF always completes after
+		 * cmds. For multiple worker support, we must flush every
+		 * worker that runs on a different cpu than the EVT vq.
+		 */
+		cpu = tmf->vhost->vqs[VHOST_SCSI_VQ_CTL].vq.cpu;
+		for (i = VHOST_SCSI_VQ_IO; i < tmf->vhost->dev.nvqs; i++) {
+			vq = &tmf->vhost->vqs[i].vq;
+			if (cpu != vq->cpu)
+				vhost_vq_work_flush(vq);
+		}
 		resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-	else
+	} else {
 		resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+	}
 
 	vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 				 tmf->vq_desc, &tmf->resp_iov, resp_code);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f425d0f..4aae504 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -250,6 +250,15 @@ void vhost_work_dev_flush(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
+void vhost_vq_work_flush(struct vhost_virtqueue *vq)
+{
+	if (vq->cpu != -1)
+		flush_work(&vq->work);
+	else
+		vhost_work_dev_flush(vq->dev);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_flush);
+
 /* Flush any work that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 28ff4a2..2d306f8 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -40,6 +40,7 @@ struct vhost_poll {
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
 void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
+void vhost_vq_work_flush(struct vhost_virtqueue *vq);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     __poll_t mask, struct vhost_dev *dev,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 8/8] vhost-scsi: hook vhost-scsi into vring set cpu support
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
                   ` (6 preceding siblings ...)
  2020-12-04  7:56 ` [RFC PATCH 7/8] vhost, vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
@ 2020-12-04  7:56 ` Mike Christie
  2020-12-04 16:06   ` Stefano Garzarella
  8 siblings, 0 replies; 23+ messages in thread
From: Mike Christie @ 2020-12-04  7:56 UTC (permalink / raw)
  To: sgarzare, stefanha, linux-scsi, target-devel, mst, jasowang,
	pbonzini, virtualization

This hooks vhost-scsi into the vring set cpu ioctl support by
just replacing its existing workqueue use with the vhost
support.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 8005a7f..29b139c 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -101,8 +101,6 @@ struct vhost_scsi_cmd {
 	struct vhost_scsi_nexus *tvc_nexus;
 	/* The TCM I/O descriptor that is accessed via container_of() */
 	struct se_cmd tvc_se_cmd;
-	/* work item used for cmwq dispatch to vhost_scsi_submission_work() */
-	struct work_struct work;
 	/* Copy of the incoming SCSI command descriptor block (CDB) */
 	unsigned char tvc_cdb[VHOST_SCSI_MAX_CDB_SIZE];
 	/* Sense buffer that will be mapped into outgoing status */
@@ -778,10 +776,8 @@ static int vhost_scsi_to_tcm_attr(int attr)
 	return TCM_SIMPLE_TAG;
 }
 
-static void vhost_scsi_submission_work(struct work_struct *work)
+static void vhost_scsi_target_submit(struct vhost_scsi_cmd *cmd)
 {
-	struct vhost_scsi_cmd *cmd =
-		container_of(work, struct vhost_scsi_cmd, work);
 	struct vhost_scsi_nexus *tv_nexus;
 	struct se_cmd *se_cmd = &cmd->tvc_se_cmd;
 	struct scatterlist *sg_ptr, *sg_prot_ptr = NULL;
@@ -1128,14 +1124,7 @@ static u16 vhost_buf_to_lun(u8 *lun_buf)
 		 * vhost_scsi_queue_data_in() and vhost_scsi_queue_status()
 		 */
 		cmd->tvc_vq_desc = vc.head;
-		/*
-		 * Dispatch cmd descriptor for cmwq execution in process
-		 * context provided by vhost_scsi_workqueue.  This also ensures
-		 * cmd is executed on the same kworker CPU as this vhost
-		 * thread to gain positive L2 cache locality effects.
-		 */
-		INIT_WORK(&cmd->work, vhost_scsi_submission_work);
-		queue_work(vhost_scsi_workqueue, &cmd->work);
+		vhost_scsi_target_submit(cmd);
 		ret = 0;
 err:
 		/*
@@ -1811,6 +1800,15 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 	return 0;
 }
 
+static struct workqueue_struct *vhost_scsi_get_workqueue(void)
+{
+	return vhost_scsi_workqueue;
+}
+
+static struct vhost_dev_ops vhost_scsi_dev_ops = {
+	.get_workqueue = vhost_scsi_get_workqueue,	
+};
+
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
 	struct vhost_scsi_virtqueue *svq;
@@ -1849,7 +1847,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
 	vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
-		       VHOST_SCSI_WEIGHT, 0, true, NULL);
+		       VHOST_SCSI_WEIGHT, 0, true, &vhost_scsi_dev_ops);
 
 	vhost_scsi_init_inflight(vs, NULL);
 
@@ -2502,7 +2500,7 @@ static int __init vhost_scsi_init(void)
 	 * Use our own dedicated workqueue for submitting I/O into
 	 * target core to avoid contention within system_wq.
 	 */
-	vhost_scsi_workqueue = alloc_workqueue("vhost_scsi", 0, 0);
+	vhost_scsi_workqueue = alloc_workqueue("vhost_scsi", WQ_SYSFS, 0);
 	if (!vhost_scsi_workqueue)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs
  2020-12-04  7:56 ` [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs Mike Christie
@ 2020-12-04  8:09     ` Jason Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Wang @ 2020-12-04  8:09 UTC (permalink / raw)
  To: Mike Christie, sgarzare, stefanha, linux-scsi, target-devel, mst,
	pbonzini, virtualization


On 2020/12/4 下午3:56, Mike Christie wrote:
> +static long vhost_vring_set_cpu(struct vhost_dev *d, struct vhost_virtqueue *vq,
> +				void __user *argp)
> +{
> +	struct vhost_vring_state s;
> +	int ret = 0;
> +
> +	if (vq->private_data)
> +		return -EBUSY;
> +
> +	if (copy_from_user(&s, argp, sizeof s))
> +		return -EFAULT;
> +
> +	if (s.num == -1) {
> +		vq->cpu = s.num;
> +		return 0;
> +	}
> +
> +	if (s.num >= nr_cpu_ids)
> +		return -EINVAL;
> +
> +	if (!d->ops || !d->ops->get_workqueue)
> +		return -EINVAL;
> +
> +	if (!d->wq)
> +		d->wq = d->ops->get_workqueue();
> +	if (!d->wq)
> +		return -EINVAL;
> +
> +	vq->cpu = s.num;
> +	return ret;
> +}


So one question here. Who is in charge of doing this set_cpu? Note 
that sched_setaffinity(2) requires CAP_SYS_NICE to work, so I wonder 
whether or not it's legal for unprivileged Qemu to do this.

Thanks



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity
  2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
@ 2020-12-04 16:06   ` Stefano Garzarella
  2020-12-04  7:56 ` [RFC PATCH 2/8] vhost-scsi: remove extra flushes Mike Christie
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Stefano Garzarella @ 2020-12-04 16:06 UTC (permalink / raw)
  To: Mike Christie
  Cc: stefanha, linux-scsi, target-devel, mst, jasowang, pbonzini,
	virtualization

Hi Mike,

On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>These patches were made over mst's vhost branch.
>
>The following patches, made over mst's vhost branch, allow userspace
>to set each vq's cpu affinity. Currently, with cgroups the worker thread
>inherits the affinity settings, but we are at the mercy of the CPU
>scheduler for where the vq's IO will be executed on. This can result in
>the scheduler sometimes hammering a couple queues on the host instead of
>spreading it out like how the guest's app might have intended if it was
>mq aware.
>
>This version of the patches is not what you guys were talking about
>initially like with the interface that was similar to nbd's old
>(3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>kernel and we run from that context. These patches instead just
>allow userspace to tell the kernel which CPU a vq should run on.
>We then use the kernel's workqueue code to handle the thread
>management.

I agree that reusing the kernel's workqueue code would be a good strategy.

One concern is how easy it is to implement an adaptive polling strategy 
using workqueues. From what I've seen, adding some polling of both 
backend and virtqueue helps to eliminate interrupts and reduce latency.

Anyway, I'll take a closer look at your patches next week. :-)

Thanks,
Stefano

>
>I wanted to post this version first, because it is flexible
>in that userspace can set things up so devs/vqs share threads/CPUs
>and we don't have to worry about replicating a bunch of features
>that the workqueue code already has like dynamic thread creation,
>blocked work detection, idle thread detection and thread reaping,
>and it also has an interface to control how many threads can be
>created and which CPUs work can run on if we want to further restrict
>that from userspace.
>
>Note that these patches have been lightly tested. I more wanted
>to get comments on the overall approach, because I know it's
>not really what you were thinking about. But while I worked
>on being able to share threads with multiple devices, I kept
>coming back to the existing workqueue code and thinking I'll
>just copy and paste that.
>
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs
  2020-12-04  8:09     ` Jason Wang
  (?)
@ 2020-12-04 16:32     ` Mike Christie
  2020-12-07  4:27         ` Jason Wang
  -1 siblings, 1 reply; 23+ messages in thread
From: Mike Christie @ 2020-12-04 16:32 UTC (permalink / raw)
  To: Jason Wang, sgarzare, stefanha, linux-scsi, target-devel, mst,
	pbonzini, virtualization

On 12/4/20 2:09 AM, Jason Wang wrote:
> 
> On 2020/12/4 下午3:56, Mike Christie wrote:
>> +static long vhost_vring_set_cpu(struct vhost_dev *d, struct 
>> vhost_virtqueue *vq,
>> +                void __user *argp)
>> +{
>> +    struct vhost_vring_state s;
>> +    int ret = 0;
>> +
>> +    if (vq->private_data)
>> +        return -EBUSY;
>> +
>> +    if (copy_from_user(&s, argp, sizeof s))
>> +        return -EFAULT;
>> +
>> +    if (s.num == -1) {
>> +        vq->cpu = s.num;
>> +        return 0;
>> +    }
>> +
>> +    if (s.num >= nr_cpu_ids)
>> +        return -EINVAL;
>> +
>> +    if (!d->ops || !d->ops->get_workqueue)
>> +        return -EINVAL;
>> +
>> +    if (!d->wq)
>> +        d->wq = d->ops->get_workqueue();
>> +    if (!d->wq)
>> +        return -EINVAL;
>> +
>> +    vq->cpu = s.num;
>> +    return ret;
>> +}
> 
> 
> So one question here. Who is in charge of doing this set_cpu? Note 
> that sched_setaffinity(2) requires CAP_SYS_NICE to work, so I wonder 
> whether or not it's legal for unprivileged Qemu to do this.


I was having qemu do it when it sets up the vqs, since it already has
the info there.

Is it normally the tool that makes calls into qemu that does the
operations requiring CAP_SYS_NICE? If so, then I see the interface
needs to be changed.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity
  2020-12-04 16:06   ` Stefano Garzarella
  (?)
@ 2020-12-04 17:10   ` Mike Christie
  2020-12-04 17:33     ` Mike Christie
  -1 siblings, 1 reply; 23+ messages in thread
From: Mike Christie @ 2020-12-04 17:10 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: stefanha, linux-scsi, target-devel, mst, jasowang, pbonzini,
	virtualization

On 12/4/20 10:06 AM, Stefano Garzarella wrote:
> Hi Mike,
> 
> On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>> These patches were made over mst's vhost branch.
>>
>> The following patches, made over mst's vhost branch, allow userspace
>> to set each vq's cpu affinity. Currently, with cgroups the worker thread
>> inherits the affinity settings, but we are at the mercy of the CPU
>> scheduler for where the vq's IO will be executed on. This can result in
>> the scheduler sometimes hammering a couple queues on the host instead of
>> spreading it out like how the guest's app might have intended if it was
>> mq aware.
>>
>> This version of the patches is not what you guys were talking about
>> initially like with the interface that was similar to nbd's old
>> (3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>> kernel and we run from that context. These patches instead just
>> allow userspace to tell the kernel which CPU a vq should run on.
>> We then use the kernel's workqueue code to handle the thread
>> management.
> 
> I agree that reusing kernel's workqueue code would be a good strategy.
> 
> One concern is how easy it is to implement an adaptive polling strategy 
> using workqueues. From what I've seen, adding some polling of both 
> backend and virtqueue helps to eliminate interrupts and reduce latency.
> 
Would the polling you need to do be similar to the vhost net poll code,
like in vhost_net_busy_poll (different algorithm though)? But we want
to be able to poll multiple devs/vqs from the same CPU, right? Something
like:

retry:

for each poller on CPU N
	if poller has work
		driver->run work fn

if (poll limit hit)
	return
else
	cpu_relax();
goto retry;

?

If so, I had an idea for it. Let me send an additional patch on top of 
this set.
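
In C-ish form, a minimal sketch of that loop (struct vhost_poller and
its callbacks are made-up names just for illustration, not anything
from this patchset):

struct vhost_poller {
	struct list_head node;
	bool (*has_work)(struct vhost_poller *p);
	void (*run_work)(struct vhost_poller *p);
};

/* Poll every poller bound to this CPU until the poll budget runs out. */
static void vhost_poll_cpu(struct list_head *cpu_pollers, int poll_limit)
{
	struct vhost_poller *p;
	int polls = 0;

retry:
	list_for_each_entry(p, cpu_pollers, node) {
		if (p->has_work(p))
			p->run_work(p);
	}

	if (++polls >= poll_limit)
		return;

	cpu_relax();
	goto retry;
}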

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity
  2020-12-04 17:10   ` Mike Christie
@ 2020-12-04 17:33     ` Mike Christie
  2020-12-09 15:58         ` Stefano Garzarella
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Christie @ 2020-12-04 17:33 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: stefanha, linux-scsi, target-devel, mst, jasowang, pbonzini,
	virtualization

On 12/4/20 11:10 AM, Mike Christie wrote:
> On 12/4/20 10:06 AM, Stefano Garzarella wrote:
>> Hi Mike,
>>
>> On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>>> These patches were made over mst's vhost branch.
>>>
>>> The following patches, made over mst's vhost branch, allow userspace
>>> to set each vq's cpu affinity. Currently, with cgroups the worker thread
>>> inherits the affinity settings, but we are at the mercy of the CPU
>>> scheduler for where the vq's IO will be executed on. This can result in
>>> the scheduler sometimes hammering a couple queues on the host instead of
>>> spreading it out like how the guest's app might have intended if it was
>>> mq aware.
>>>
>>> This version of the patches is not what you guys were talking about
>>> initially like with the interface that was similar to nbd's old
>>> (3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>>> kernel and we run from that context. These patches instead just
>>> allow userspace to tell the kernel which CPU a vq should run on.
>>> We then use the kernel's workqueue code to handle the thread
>>> management.
>>
>> I agree that reusing kernel's workqueue code would be a good strategy.
>>
>> One concern is how easy it is to implement an adaptive polling 
>> strategy using workqueues. From what I've seen, adding some polling of 
>> both backend and virtqueue helps to eliminate interrupts and reduce 
>> latency.
>>
> Would the polling you need to do be similar to the vhost net poll code 
> like in vhost_net_busy_poll (different algorithm though)? But, we want 
> to be able to poll multiple devs/vqs from the same CPU right? Something 
> like:
> 
> retry:
> 
> for each poller on CPU N
>      if poller has work
>          driver->run work fn
> 
> if (poll limit hit)
>      return
> else
>      cpu_relax();
> goto retry:
> 
> ?
> 
> If so, I had an idea for it. Let me send an additional patch on top of 
> this set.

Oh yeah, just to make sure I am on the same page for vdpa, because scsi
and net work so differently.

Were you thinking that you would initially run from

vhost_poll_wakeup -> work->fn

then in the vdpa work->fn you would still do the kick_vq, but then also
kick off a group backend/vq poller? This would then poll the vqs/devs
that were bound to that CPU from the worker/wq thread.

So I was thinking you want something similar to the network stack's
NAPI. Here our work->fn is the hard irq, and then the worker is like
their softirq that we poll from.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs
  2020-12-04 16:32     ` Mike Christie
@ 2020-12-07  4:27         ` Jason Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Wang @ 2020-12-07  4:27 UTC (permalink / raw)
  To: Mike Christie, sgarzare, stefanha, linux-scsi, target-devel, mst,
	pbonzini, virtualization


On 2020/12/5 上午12:32, Mike Christie wrote:
> On 12/4/20 2:09 AM, Jason Wang wrote:
>>
>> On 2020/12/4 下午3:56, Mike Christie wrote:
>>> +static long vhost_vring_set_cpu(struct vhost_dev *d, struct 
>>> vhost_virtqueue *vq,
>>> +                void __user *argp)
>>> +{
>>> +    struct vhost_vring_state s;
>>> +    int ret = 0;
>>> +
>>> +    if (vq->private_data)
>>> +        return -EBUSY;
>>> +
>>> +    if (copy_from_user(&s, argp, sizeof s))
>>> +        return -EFAULT;
>>> +
>>> +    if (s.num == -1) {
>>> +        vq->cpu = s.num;
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (s.num >= nr_cpu_ids)
>>> +        return -EINVAL;
>>> +
>>> +    if (!d->ops || !d->ops->get_workqueue)
>>> +        return -EINVAL;
>>> +
>>> +    if (!d->wq)
>>> +        d->wq = d->ops->get_workqueue();
>>> +    if (!d->wq)
>>> +        return -EINVAL;
>>> +
>>> +    vq->cpu = s.num;
>>> +    return ret;
>>> +}
>>
>>
>> So one question here. Who is in charge of doing this set_cpu? Note 
>> that sched_setaffinity(2) requires CAP_SYS_NICE to work, so I wonder 
>> whether or not it's legal for unprivileged Qemu to do this.
>
>
> I was having qemu do it when it's setting up the vqs since it had the 
> info there already.
>
> Is it normally the tool that makes calls into qemu that does the 
> operations that require CAP_SYS_NICE? 


My understanding is that it only matters for scheduling. And this patch
wants to change the affinity, which should check that capability.
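
If we keep the ioctl, one option would be to gate it the same way
sched_setaffinity(2) is gated. A sketch, not part of the posted patch:

	static long vhost_vring_set_cpu(struct vhost_dev *d,
					struct vhost_virtqueue *vq,
					void __user *argp)
	{
		/* mirror the sched_setaffinity(2) policy: only a
		 * privileged caller may pin work to specific CPUs */
		if (!capable(CAP_SYS_NICE))
			return -EPERM;

		/* ... rest as in the posted patch ... */
	}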


> If so, then I see the interface needs to be changed.


Actually, if I read this patch correctly, it requires e.g. qemu to make
the decision instead of the management layer. This may bring some
trouble for e.g. the libvirt emulatorpin[1] implementation.

Thanks

[1] 
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-numa_and_libvirt


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs
  2020-12-07  4:27         ` Jason Wang
  (?)
@ 2020-12-07 18:31         ` Mike Christie
  2020-12-08  2:30             ` Jason Wang
  -1 siblings, 1 reply; 23+ messages in thread
From: Mike Christie @ 2020-12-07 18:31 UTC (permalink / raw)
  To: Jason Wang, sgarzare, stefanha, linux-scsi, target-devel, mst,
	pbonzini, virtualization

On 12/6/20 10:27 PM, Jason Wang wrote:
> 
> On 2020/12/5 上午12:32, Mike Christie wrote:
>> On 12/4/20 2:09 AM, Jason Wang wrote:
>>>
>>> On 2020/12/4 下午3:56, Mike Christie wrote:
>>>> +static long vhost_vring_set_cpu(struct vhost_dev *d, struct 
>>>> vhost_virtqueue *vq,
>>>> +                void __user *argp)
>>>> +{
>>>> +    struct vhost_vring_state s;
>>>> +    int ret = 0;
>>>> +
>>>> +    if (vq->private_data)
>>>> +        return -EBUSY;
>>>> +
>>>> +    if (copy_from_user(&s, argp, sizeof s))
>>>> +        return -EFAULT;
>>>> +
>>>> +    if (s.num == -1) {
>>>> +        vq->cpu = s.num;
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    if (s.num >= nr_cpu_ids)
>>>> +        return -EINVAL;
>>>> +
>>>> +    if (!d->ops || !d->ops->get_workqueue)
>>>> +        return -EINVAL;
>>>> +
>>>> +    if (!d->wq)
>>>> +        d->wq = d->ops->get_workqueue();
>>>> +    if (!d->wq)
>>>> +        return -EINVAL;
>>>> +
>>>> +    vq->cpu = s.num;
>>>> +    return ret;
>>>> +}
>>>
>>>
>>> So one question here. Who is in charge of doing this set_cpu? Note 
>>> that sched_setaffinity(2) requires CAP_SYS_NICE to work, so I wonder 
>>> whether or not it's legal for unprivileged Qemu to do this.
>>
>>
>> I was having qemu do it when it's setting up the vqs since it had the 
>> info there already.
>>
>> Is it normally the tool that makes calls into qemu that does the 
>> operations that require CAP_SYS_NICE? 
> 
> 
> My understanding is that it only matters for scheduling. And this patch
> wants to change the affinity, which should check that capability.
> 
> 
>> If so, then I see the interface needs to be changed.
> 
> 
> Actually, if I read this patch correctly, it requires e.g. qemu to make
> the decision instead of the management layer. This may bring some
> trouble for e.g. the libvirt emulatorpin[1] implementation.
> 

Let me make sure I understood you.

I thought qemu would just have a new property, and users would pass that 
in like they do for the number of queues setting. Then qemu would pass 
that to the kernel. The primary user I have to support at work does not 
use libvirt based tools so I thought that was a common point that would 
work for everyone.
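
I.e., something like this hypothetical qemu-side call, where
VHOST_SET_VRING_CPU is an assumed ioctl name and vhost_fd/vq_index/cpu
come from qemu's vq setup:

	struct vhost_vring_state s = { .index = vq_index, .num = cpu };

	if (ioctl(vhost_fd, VHOST_SET_VRING_CPU, &s) < 0)
		error_report("binding vq %u to cpu %u failed",
			     vq_index, cpu);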

For my work requirement, given your emulatorpin and CAP_SYS_NICE
comments, that means we want an interface that something other than qemu
can use, right? So the tools would call directly into the kernel and not
go through qemu, right?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs
  2020-12-07 18:31         ` Mike Christie
@ 2020-12-08  2:30             ` Jason Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Wang @ 2020-12-08  2:30 UTC (permalink / raw)
  To: Mike Christie, sgarzare, stefanha, linux-scsi, target-devel, mst,
	pbonzini, virtualization


On 2020/12/8 上午2:31, Mike Christie wrote:
> On 12/6/20 10:27 PM, Jason Wang wrote:
>>
>> On 2020/12/5 上午12:32, Mike Christie wrote:
>>> On 12/4/20 2:09 AM, Jason Wang wrote:
>>>>
>>>> On 2020/12/4 下午3:56, Mike Christie wrote:
>>>>> +static long vhost_vring_set_cpu(struct vhost_dev *d, struct 
>>>>> vhost_virtqueue *vq,
>>>>> +                void __user *argp)
>>>>> +{
>>>>> +    struct vhost_vring_state s;
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    if (vq->private_data)
>>>>> +        return -EBUSY;
>>>>> +
>>>>> +    if (copy_from_user(&s, argp, sizeof s))
>>>>> +        return -EFAULT;
>>>>> +
>>>>> +    if (s.num == -1) {
>>>>> +        vq->cpu = s.num;
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    if (s.num >= nr_cpu_ids)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (!d->ops || !d->ops->get_workqueue)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (!d->wq)
>>>>> +        d->wq = d->ops->get_workqueue();
>>>>> +    if (!d->wq)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    vq->cpu = s.num;
>>>>> +    return ret;
>>>>> +}
>>>>
>>>>
>>>> So one question here. Who is in charge of doing this set_cpu? Note 
>>>> that sched_setaffinity(2) requires CAP_SYS_NICE to work, so I 
>>>> wonder whether or not it's legal for unprivileged Qemu to do this.
>>>
>>>
>>> I was having qemu do it when it's setting up the vqs since it had 
>>> the info there already.
>>>
>>> Is it normally the tool that makes calls into qemu that does the 
>>> operations that require CAP_SYS_NICE? 
>>
>>
>> My understanding is that it only matters for scheduling. And this
>> patch wants to change the affinity, which should check that capability.
>>
>>
>>> If so, then I see the interface needs to be changed.
>>
>>
>> Actually, if I read this patch correctly, it requires e.g. qemu to
>> make the decision instead of the management layer. This may bring some
>> trouble for e.g. the libvirt emulatorpin[1] implementation.
>>
>
> Let me make sure I understood you.
>
> I thought qemu would just have a new property, and users would pass 
> that in like they do for the number of queues setting. Then qemu would 
> pass that to the kernel. The primary user I have to support at work 
> does not use libvirt based tools so I thought that was a common point 
> that would work for everyone.


I think we need to talk with the libvirt guys to see if it works for
them. My understanding is that scheduling should be their charge, not
qemu's.


>
> For my work requirement, given your emulatorpin and CAP_SYS_NICE
> comments, that means we want an interface that something other than
> qemu can use, right? So the tools would call directly into the kernel
> and not go through qemu, right?


Yes, usually qemu runs without any privileges. So could it be e.g. a
sysfs interface or something similar?
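
Something like a per-vq "cpu" attribute, say. A rough sketch, where the
wiring and names are just illustrative, not from the patchset:

	/* a privileged management layer (e.g. libvirt) writes the CPU
	 * number, or -1 to clear the binding, without going through
	 * qemu */
	static ssize_t cpu_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
	{
		struct vhost_virtqueue *vq = dev_get_drvdata(dev);
		int cpu, ret;

		ret = kstrtoint(buf, 10, &cpu);
		if (ret)
			return ret;
		if (cpu != -1 && cpu >= nr_cpu_ids)
			return -EINVAL;

		vq->cpu = cpu;
		return count;
	}
	static DEVICE_ATTR_WO(cpu);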

Thanks




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity
  2020-12-04 17:33     ` Mike Christie
@ 2020-12-09 15:58         ` Stefano Garzarella
  0 siblings, 0 replies; 23+ messages in thread
From: Stefano Garzarella @ 2020-12-09 15:58 UTC (permalink / raw)
  To: Mike Christie
  Cc: stefanha, linux-scsi, target-devel, mst, jasowang, pbonzini,
	virtualization

Hi Mike,
sorry for the delay, but there were holidays.

On Fri, Dec 04, 2020 at 11:33:11AM -0600, Mike Christie wrote:
>On 12/4/20 11:10 AM, Mike Christie wrote:
>>On 12/4/20 10:06 AM, Stefano Garzarella wrote:
>>>Hi Mike,
>>>
>>>On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>>>>These patches were made over mst's vhost branch.
>>>>
>>>>The following patches, made over mst's vhost branch, allow userspace
>>>>to set each vq's cpu affinity. Currently, with cgroups the worker thread
>>>>inherits the affinity settings, but we are at the mercy of the CPU
>>>>scheduler for where the vq's IO will be executed on. This can result in
>>>>the scheduler sometimes hammering a couple queues on the host instead of
>>>>spreading it out like how the guest's app might have intended if it was
>>>>mq aware.
>>>>
>>>>This version of the patches is not what you guys were talking about
>>>>initially like with the interface that was similar to nbd's old
>>>>(3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>>>>kernel and we run from that context. These patches instead just
>>>>allow userspace to tell the kernel which CPU a vq should run on.
>>>>We then use the kernel's workqueue code to handle the thread
>>>>management.
>>>
>>>I agree that reusing kernel's workqueue code would be a good strategy.
>>>
>>>One concern is how easy it is to implement an adaptive polling 
>>>strategy using workqueues. From what I've seen, adding some 
>>>polling of both backend and virtqueue helps to eliminate 
>>>interrupts and reduce latency.
>>>
>>Would the polling you need to do be similar to the vhost net poll 
>>code like in vhost_net_busy_poll (different algorithm though)? But, 
>>we want to be able to poll multiple devs/vqs from the same CPU 
>>right? Something like:
>>
>>retry:
>>
>>for each poller on CPU N
>>     if poller has work
>>         driver->run work fn
>>
>>if (poll limit hit)
>>     return
>>else
>>     cpu_relax();
>>goto retry:
>>
>>?

Yeah, something similar. IIUC vhost_net_busy_poll() polls both vring and 
backend (socket).

Maybe we need to limit the amount of work done per work->fn invocation
to avoid starvation.
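
Something like a NAPI-style budget could do it. A rough sketch, where
struct vhost_poller, its ->poll() callback, and the cpu/wq/work fields
are assumed names, not from the patchset:

#define VHOST_POLL_BUDGET	64

static void vhost_poller_run(struct vhost_poller *p)
{
	/* each pass may complete at most VHOST_POLL_BUDGET items so
	 * one busy vq cannot starve its siblings on this CPU */
	int done = p->poll(p, VHOST_POLL_BUDGET);

	if (done == VHOST_POLL_BUDGET)
		/* budget exhausted: requeue instead of looping so
		 * the other pollers bound here get a turn */
		queue_work_on(p->cpu, p->wq, &p->work);
}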

>>
>>If so, I had an idea for it. Let me send an additional patch on top 
>>of this set.

Sure :-)

>
>Oh yeah, just to make sure I am on the same page for vdpa, because
>scsi and net work so differently.
>
>Were you thinking that you would initially run from
>
>vhost_poll_wakeup -> work->fn
>
>then in the vdpa work->fn you would still do the kick_vq, but then
>also kick off a group backend/vq poller? This would then poll the
>vqs/devs that were bound to that CPU from the worker/wq thread.

Yes, this seems reasonable!

>
>So I was thinking you want something similar to network's NAPI. Here 

I don't know NAPI very well, but IIUC the goal is the same: try to
avoid notifications (IRQs from the device, vm-exits from the guest) by
doing adaptive polling.

>our work->fn is the hard irq, and then the worker is like their softirq 
>we poll from.
>

I'm a little lost here...

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-12-09 16:00 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-04  7:56 [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Mike Christie
2020-12-04  7:56 ` [RFC PATCH 1/8] vhost: remove work arg from vhost_work_flush Mike Christie
2020-12-04  7:56 ` [RFC PATCH 2/8] vhost-scsi: remove extra flushes Mike Christie
2020-12-04  7:56 ` [RFC PATCH 3/8] vhost poll: fix coding style Mike Christie
2020-12-04  7:56 ` [RFC PATCH 4/8] vhost: move msg_handler to new ops struct Mike Christie
2020-12-04  7:56 ` [RFC PATCH 5/8] vhost: allow userspace to bind vqs to CPUs Mike Christie
2020-12-04  8:09   ` Jason Wang
2020-12-04 16:32     ` Mike Christie
2020-12-07  4:27       ` Jason Wang
2020-12-07 18:31         ` Mike Christie
2020-12-08  2:30           ` Jason Wang
2020-12-04  7:56 ` [RFC PATCH 6/8] vhost-scsi: make SCSI cmd completion per vq Mike Christie
2020-12-04  7:56 ` [RFC PATCH 7/8] vhost, vhost-scsi: flush IO vqs then send TMF rsp Mike Christie
2020-12-04  7:56 ` [RFC PATCH 8/8] vhost-scsi: hook vhost-scsi into vring set cpu support Mike Christie
2020-12-04 16:06 ` [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity Stefano Garzarella
2020-12-04 17:10   ` Mike Christie
2020-12-04 17:33     ` Mike Christie
2020-12-09 15:58       ` Stefano Garzarella
