* [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests
@ 2022-10-13 15:18 Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 01/10] drivers/vhost: vhost-blk " Andrey Zhadchenko
                   ` (10 more replies)
  0 siblings, 11 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

As there is some interest from QEMU userspace, I am sending the second
version of this patchset.

The main addition is a few patches on vhost multithreading so that
vhost-blk can scale. The idea itself is not new: attach workers to the
virtqueues and do the work on them.

I have seen several previous attempts, such as cgroup-aware worker pools
or userspace threads, but they seem very complicated and involve a lot of
subsystems. Probably just spawning a few more vhost threads can do a good
job.
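
For illustration, here is a minimal sketch of how userspace could opt in
via the VHOST_SET_NWORKERS ioctl added later in this series (vhost_fd is
a placeholder for an already-owned vhost device fd, and the patched uapi
headers are assumed):

	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	/* Sketch: ask the device for more worker threads. The module
	 * truncates the count to min(nvqs, VHOST_MAX_WORKERS) and only
	 * allows increases.
	 */
	static int request_workers(int vhost_fd, int nworkers)
	{
		return ioctl(vhost_fd, VHOST_SET_NWORKERS, &nworkers);
	}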

As this is an RFC, I did not convert any vhost users except vhost_blk. If
anyone is interested in this for other modules, please tell me.
I can test it to see if it is beneficial and maybe send the multithreading
part separately.
The multithreading part may also eventually help with vdpa-blk.

---

Although QEMU virtio-blk is quite fast, there is still some room for
improvement. Disk latency can be reduced if we handle virtio-blk requests
in the host kernel, avoiding a lot of syscalls and context switches.
The idea is quite simple: QEMU gives us a block device, and we translate
any incoming virtio request into bios and submit them to the bdev.
The biggest disadvantage of this vhost-blk flavor is that it only supports
the raw format.
Luckily, Kirill Thai has proposed a device mapper driver for the QCOW2
format to attach files as block devices: https://www.spinics.net/lists/kernel/msg4292965.html

By handling requests in a kernel module we can also bypass the iothread
limitation and finally scale block request handling with the number of
CPUs for high-performance devices.
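
To make the intended flow concrete, below is a rough userspace sketch of
attaching a backend. It assumes the patched headers; per-vq setup
(VHOST_SET_VRING_NUM/ADDR/KICK/CALL) is elided, and vhost_fd/disk_fd are
placeholders:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	/* Sketch: vhost_fd is an open /dev/vhost-blk fd, disk_fd refers
	 * to the backend block device. The module ignores .index and
	 * uses the fd for all virtqueues.
	 */
	static int vhost_blk_attach(int vhost_fd, int disk_fd)
	{
		struct vhost_vring_file backend = { .index = 0, .fd = disk_fd };
		uint64_t features;

		if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0)
			return -1;
		if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0)
			return -1;
		if (ioctl(vhost_fd, VHOST_SET_FEATURES, &features) < 0)
			return -1;
		/* per-vq VHOST_SET_VRING_* setup elided */
		return ioctl(vhost_fd, VHOST_BLK_SET_BACKEND, &backend);
	}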


There have already been several attempts to write vhost-blk:

Asias' version: https://lkml.org/lkml/2012/12/1/174
Badari's version: https://lwn.net/Articles/379864/
Vitaly's https://lwn.net/Articles/770965/

The main difference between them is the API used to access the backend
file. The fastest one is Asias's version, which uses the bio API. It is
also the most reviewed and has the most features, so this vhost_blk module
is partially based on it. Multiple virtqueue support was added, some
places were reworked, and support for several vhost workers was added.

Test setup and results:
fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
QEMU drive options: cache=none
filesystem: xfs

SSD:
               | randread, IOPS | randwrite, IOPS |
Host           |      95.8k     |      85.3k      |
QEMU virtio    |      61.5k     |      79.9k      |
QEMU vhost-blk |      95.6k     |      84.3k      |

RAMDISK (vq == vcpu == numjobs):
                 | randread, IOPS | randwrite, IOPS |
virtio, 1vcpu    |      133k      |      133k       |
virtio, 2vcpu    |      305k      |      306k       |
virtio, 4vcpu    |      310k      |      298k       |
virtio, 8vcpu    |      271k      |      252k       |
vhost-blk, 1vcpu |      110k      |      113k       |
vhost-blk, 2vcpu |      247k      |      252k       |
vhost-blk, 4vcpu |      558k      |      556k       |
vhost-blk, 8vcpu |      576k      |      575k       | *single kernel thread
vhost-blk, 8vcpu |      803k      |      779k       | *two kernel threads

v2:
Re-measured virtio performance with aio=threads and iothread on latest QEMU

vhost-blk changes:
 - removed unused VHOST_BLK_VQ
 - reworked bio handling a bit: now all pages from a single iov are added
to a bio until it is full, instead of allocating one bio per page
 - changed how the sector increment is calculated
 - check move_iovec() in vhost_blk_req_handle()
 - removed the snprintf check and properly check the return value of
copy_to_iter for VIRTIO_BLK_ID_BYTES requests
 - discard the vq request if vhost_blk_req_handle() returned a negative code
 - forbid changing a non-NULL backend in vhost_blk_set_backend(). First of
all, QEMU sets the backend only once. Also, changing the backend while
requests are already running would require much more care in
vhost_blk_handle_guest_kick(), as it does not take any references. If
userspace wants to change the backend that badly, it can always reset the
device.
 - removed EXPERIMENTAL from Kconfig

Andrey Zhadchenko (10):
  drivers/vhost: vhost-blk accelerator for virtio-blk guests
  drivers/vhost: use array to store workers
  drivers/vhost: adjust vhost to flush all workers
  drivers/vhost: rework cgroups attachment to be worker aware
  drivers/vhost: rework worker creation
  drivers/vhost: add ioctl to increase the number of workers
  drivers/vhost: assign workers to virtqueues
  drivers/vhost: add API to queue work at virtqueue's worker
  drivers/vhost: allow polls to be bound to workers via vqs
  drivers/vhost: queue vhost_blk works at vq workers

 drivers/vhost/Kconfig      |  12 +
 drivers/vhost/Makefile     |   3 +
 drivers/vhost/blk.c        | 819 +++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.c      | 263 +++++++++---
 drivers/vhost/vhost.h      |  21 +-
 include/uapi/linux/vhost.h |  13 +
 6 files changed, 1064 insertions(+), 67 deletions(-)
 create mode 100644 drivers/vhost/blk.c

-- 
2.31.1



* [RFC PATCH v2 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 02/10] drivers/vhost: use array to store workers Andrey Zhadchenko
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Although QEMU virtio is quite fast, there is still some room for
improvement. Disk latency can be reduced if we handle virtio-blk requests
in the host kernel instead of passing them to QEMU. This patch adds a
vhost-blk kernel module to do so.

Some test setups:
fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
QEMU drive options: cache=none
filesystem: xfs

SSD:
               | randread, IOPS | randwrite, IOPS |
Host           |      95.8k     |      85.3k      |
QEMU virtio    |      57.5k     |      79.4k      |
QEMU vhost-blk |      95.6k     |      84.3k      |

RAMDISK (vq == vcpu == numjobs):
                 | randread, IOPS | randwrite, IOPS |
virtio, 1vcpu    |      133k      |      133k       |
virtio, 2vcpu    |      305k      |      306k       |
virtio, 4vcpu    |      310k      |      298k       |
vhost-blk, 1vcpu |      110k      |      113k       |
vhost-blk, 2vcpu |      247k      |      252k       |
vhost-blk, 4vcpu |      558k      |      556k       |

v2:
 - removed unused VHOST_BLK_VQ
 - reworked bio handling a bit: now all pages from a single iov are added
to a bio until it is full, instead of allocating one bio per page
 - changed how the sector increment is calculated
 - check move_iovec() in vhost_blk_req_handle()
 - removed the snprintf check and properly check the return value of
copy_to_iter for VIRTIO_BLK_ID_BYTES requests
 - discard the vq request if vhost_blk_req_handle() returned a negative code
 - forbid changing a non-NULL backend in vhost_blk_set_backend(). First of
all, QEMU sets the backend only once. Also, changing the backend while
requests are already running would require much more care in
vhost_blk_handle_guest_kick(), as it does not take any references. If
userspace wants to change the backend that badly, it can always reset the
device.
 - removed EXPERIMENTAL from Kconfig

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/Kconfig      |  12 +
 drivers/vhost/Makefile     |   3 +
 drivers/vhost/blk.c        | 820 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vhost.h |   4 +
 4 files changed, 839 insertions(+)
 create mode 100644 drivers/vhost/blk.c

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 587fbae06182..e1389bf0c10b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -89,4 +89,16 @@ config VHOST_CROSS_ENDIAN_LEGACY
 
 	  If unsure, say "N".
 
+config VHOST_BLK
+	tristate "Host kernel accelerator for virtio-blk"
+	depends on BLOCK && EVENTFD
+	select VHOST
+	default n
+	help
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest vm with virtio-blk driver.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_blk.
+
 endif
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index f3e1897cce85..c76cc4f5fcd8 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -17,3 +17,6 @@ obj-$(CONFIG_VHOST)	+= vhost.o
 
 obj-$(CONFIG_VHOST_IOTLB) += vhost_iotlb.o
 vhost_iotlb-y := iotlb.o
+
+obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
+vhost_blk-y := blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 000000000000..42acdef452b3
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,820 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan <tailai.ly@taobao.com>
+ *
+ * Copyright (C) 2012 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *
+ * Copyright (c) 2022 Virtuozzo International GmbH.
+ * Author: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
+ *
+ * virtio-blk host kernel accelerator.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/vhost.h>
+#include <linux/virtio_blk.h>
+#include <linux/mutex.h>
+#include <linux/file.h>
+#include <linux/kthread.h>
+#include <linux/blkdev.h>
+#include <linux/llist.h>
+
+#include "vhost.h"
+
+enum {
+	VHOST_BLK_FEATURES = VHOST_FEATURES |
+			     (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+			     (1ULL << VIRTIO_RING_F_EVENT_IDX) |
+			     (1ULL << VIRTIO_BLK_F_MQ) |
+			     (1ULL << VIRTIO_BLK_F_FLUSH),
+};
+
+/*
+ * Max number of bytes transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others.
+ */
+#define VHOST_DEV_WEIGHT 0x80000
+
+/*
+ * Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others with
+ * pkts.
+ */
+#define VHOST_DEV_PKT_WEIGHT 256
+
+#define VHOST_BLK_VQ_MAX 16
+
+#define VHOST_MAX_METADATA_IOV 1
+
+#define VHOST_BLK_SECTOR_BITS 9
+#define VHOST_BLK_SECTOR_SIZE (1 << VHOST_BLK_SECTOR_BITS)
+#define VHOST_BLK_SECTOR_MASK (VHOST_BLK_SECTOR_SIZE - 1)
+
+struct req_page_list {
+	struct page **pages;
+	int pages_nr;
+};
+
+#define NR_INLINE 16
+
+struct vhost_blk_req {
+	struct req_page_list inline_pl[NR_INLINE];
+	struct page *inline_page[NR_INLINE];
+	struct bio *inline_bio[NR_INLINE];
+	struct req_page_list *pl;
+	int during_flush;
+	bool use_inline;
+
+	struct llist_node llnode;
+
+	struct vhost_blk *blk;
+
+	struct iovec *iov;
+	int iov_nr;
+
+	struct bio **bio;
+	atomic_t bio_nr;
+
+	struct iovec status[VHOST_MAX_METADATA_IOV];
+
+	sector_t sector;
+	int bi_opf;
+	u16 head;
+	long len;
+	int bio_err;
+
+	struct vhost_blk_vq *blk_vq;
+};
+
+struct vhost_blk_vq {
+	struct vhost_virtqueue vq;
+	struct vhost_blk_req *req;
+	struct iovec iov[UIO_MAXIOV];
+	struct llist_head llhead;
+	struct vhost_work work;
+};
+
+struct vhost_blk {
+	wait_queue_head_t flush_wait;
+	struct vhost_blk_vq vqs[VHOST_BLK_VQ_MAX];
+	atomic_t req_inflight[2];
+	spinlock_t flush_lock;
+	struct vhost_dev dev;
+	int during_flush;
+	struct file *backend;
+	int index;
+};
+
+static int gen;
+
+static int move_iovec(struct iovec *from, struct iovec *to,
+		      size_t len, int iov_count_from, int iov_count_to)
+{
+	int moved_seg = 0, spent_seg = 0;
+	size_t size;
+
+	while (len && spent_seg < iov_count_from && moved_seg < iov_count_to) {
+		if (from->iov_len == 0) {
+			++from;
+			++spent_seg;
+			continue;
+		}
+		size = min(from->iov_len, len);
+		to->iov_base = from->iov_base;
+		to->iov_len = size;
+		from->iov_len -= size;
+		from->iov_base += size;
+		len -= size;
+		++from;
+		++to;
+		++moved_seg;
+		++spent_seg;
+	}
+
+	return len ? -1 : moved_seg;
+}
+
+static inline int iov_num_pages(struct iovec *iov)
+{
+	return (PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
+	       ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
+}
+
+static inline int vhost_blk_set_status(struct vhost_blk_req *req, u8 status)
+{
+	struct iov_iter iter;
+	int ret;
+
+	iov_iter_init(&iter, WRITE, req->status, ARRAY_SIZE(req->status), sizeof(status));
+	ret = copy_to_iter(&status, sizeof(status), &iter);
+	if (ret != sizeof(status)) {
+		vq_err(&req->blk_vq->vq, "Failed to write status\n");
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static void vhost_blk_req_done(struct bio *bio)
+{
+	struct vhost_blk_req *req = bio->bi_private;
+	struct vhost_blk *blk = req->blk;
+
+	req->bio_err = blk_status_to_errno(bio->bi_status);
+
+	if (atomic_dec_and_test(&req->bio_nr)) {
+		llist_add(&req->llnode, &req->blk_vq->llhead);
+		vhost_work_queue(&blk->dev, &req->blk_vq->work);
+	}
+
+	bio_put(bio);
+}
+
+static void vhost_blk_req_umap(struct vhost_blk_req *req)
+{
+	struct req_page_list *pl;
+	int i, j;
+
+	if (req->pl) {
+		for (i = 0; i < req->iov_nr; i++) {
+			pl = &req->pl[i];
+
+			for (j = 0; j < pl->pages_nr; j++) {
+				if (!req->bi_opf)
+					set_page_dirty_lock(pl->pages[j]);
+				put_page(pl->pages[j]);
+			}
+		}
+	}
+
+	if (!req->use_inline)
+		kfree(req->pl);
+}
+
+static int vhost_blk_bio_make_simple(struct vhost_blk_req *req,
+				     struct block_device *bdev)
+{
+	struct bio *bio;
+
+	req->use_inline = true;
+	req->pl = NULL;
+	req->bio = req->inline_bio;
+
+	bio = bio_alloc(bdev, 1, req->bi_opf, GFP_KERNEL);
+	if (!bio)
+		return -ENOMEM;
+
+	bio->bi_iter.bi_sector = req->sector;
+	bio->bi_private = req;
+	bio->bi_end_io  = vhost_blk_req_done;
+	req->bio[0] = bio;
+
+	atomic_set(&req->bio_nr, 1);
+
+	return 0;
+}
+
+static struct page **vhost_blk_prepare_req(struct vhost_blk_req *req,
+				 int total_pages, int iov_nr)
+{
+	int pl_len, page_len, bio_len;
+	void *buf;
+
+	req->use_inline = false;
+	pl_len = iov_nr * sizeof(req->pl[0]);
+	page_len = total_pages * sizeof(struct page *);
+	bio_len = total_pages * sizeof(struct bio *);
+
+	buf = kmalloc(pl_len + page_len + bio_len, GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	req->pl	= buf;
+	req->bio = buf + pl_len + page_len;
+
+	return buf + pl_len;
+}
+
+static int vhost_blk_bio_make(struct vhost_blk_req *req,
+			      struct block_device *bdev)
+{
+	int pages_nr_total, i, j, ret;
+	struct iovec *iov = req->iov;
+	int iov_nr = req->iov_nr;
+	struct page **pages, *page;
+	struct bio *bio = NULL;
+	int bio_nr = 0;
+
+	if (unlikely(req->bi_opf == REQ_OP_FLUSH))
+		return vhost_blk_bio_make_simple(req, bdev);
+
+	pages_nr_total = 0;
+	for (i = 0; i < iov_nr; i++)
+		pages_nr_total += iov_num_pages(&iov[i]);
+
+	if (pages_nr_total > NR_INLINE) {
+		pages = vhost_blk_prepare_req(req, pages_nr_total, iov_nr);
+		if (!pages)
+			return -ENOMEM;
+	} else {
+		req->use_inline = true;
+		req->pl = req->inline_pl;
+		pages = req->inline_page;
+		req->bio = req->inline_bio;
+	}
+
+	req->iov_nr = 0;
+	for (i = 0; i < iov_nr; i++) {
+		int pages_nr = iov_num_pages(&iov[i]);
+		unsigned long iov_base, iov_len;
+		struct req_page_list *pl;
+
+		iov_base = (unsigned long)iov[i].iov_base;
+		iov_len  = (unsigned long)iov[i].iov_len;
+
+		ret = get_user_pages_fast(iov_base, pages_nr,
+					  !req->bi_opf, pages);
+		if (ret != pages_nr)
+			goto fail;
+
+		req->iov_nr++;
+		pl = &req->pl[i];
+		pl->pages_nr = pages_nr;
+		pl->pages = pages;
+
+		for (j = 0; j < pages_nr; j++) {
+			unsigned int off, len, pos;
+
+			page = pages[j];
+			off = iov_base & ~PAGE_MASK;
+			len = PAGE_SIZE - off;
+			if (len > iov_len)
+				len = iov_len;
+
+			while (!bio || !bio_add_page(bio, page, len, off)) {
+				bio = bio_alloc(bdev, pages_nr, req->bi_opf, GFP_KERNEL);
+				if (!bio)
+					goto fail;
+				bio->bi_iter.bi_sector  = req->sector;
+				bio->bi_private = req;
+				bio->bi_end_io  = vhost_blk_req_done;
+				req->bio[bio_nr++] = bio;
+			}
+
+			iov_base	+= len;
+			iov_len		-= len;
+
+			pos = (iov_base & VHOST_BLK_SECTOR_MASK) + iov_len;
+			req->sector += pos >> VHOST_BLK_SECTOR_BITS;
+		}
+
+		pages += pages_nr;
+	}
+	atomic_set(&req->bio_nr, bio_nr);
+	return 0;
+
+fail:
+	for (i = 0; i < bio_nr; i++)
+		bio_put(req->bio[i]);
+	vhost_blk_req_umap(req);
+	return -ENOMEM;
+}
+
+static inline void vhost_blk_bio_send(struct vhost_blk_req *req)
+{
+	struct blk_plug plug;
+	int i, bio_nr;
+
+	bio_nr = atomic_read(&req->bio_nr);
+	blk_start_plug(&plug);
+	for (i = 0; i < bio_nr; i++)
+		submit_bio(req->bio[i]);
+
+	blk_finish_plug(&plug);
+}
+
+static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file *file)
+{
+
+	struct inode *inode = file->f_mapping->host;
+	struct block_device *bdev = I_BDEV(inode);
+	int ret;
+
+	ret = vhost_blk_bio_make(req, bdev);
+	if (ret < 0)
+		return ret;
+
+	vhost_blk_bio_send(req);
+
+	spin_lock(&req->blk->flush_lock);
+	req->during_flush = req->blk->during_flush;
+	atomic_inc(&req->blk->req_inflight[req->during_flush]);
+	spin_unlock(&req->blk->flush_lock);
+
+	return ret;
+}
+
+static int vhost_blk_req_handle(struct vhost_virtqueue *vq,
+				struct virtio_blk_outhdr *hdr,
+				u16 head, u16 total_iov_nr,
+				struct file *file)
+{
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	struct vhost_blk_vq *blk_vq = container_of(vq, struct vhost_blk_vq, vq);
+	unsigned char id[VIRTIO_BLK_ID_BYTES];
+	struct vhost_blk_req *req;
+	struct iov_iter iter;
+	int ret, len;
+	u8 status;
+
+	req		= &blk_vq->req[head];
+	req->blk_vq	= blk_vq;
+	req->head	= head;
+	req->blk	= blk;
+	req->sector	= hdr->sector;
+	req->iov	= blk_vq->iov;
+
+	req->len	= iov_length(vq->iov, total_iov_nr) - sizeof(status);
+	req->iov_nr	= move_iovec(vq->iov, req->iov, req->len, total_iov_nr,
+				     ARRAY_SIZE(blk_vq->iov));
+
+	ret = move_iovec(vq->iov, req->status, sizeof(status), total_iov_nr,
+			 ARRAY_SIZE(req->status));
+	if (ret < 0 || req->iov_nr < 0)
+		return -EINVAL;
+
+	switch (hdr->type) {
+	case VIRTIO_BLK_T_OUT:
+		req->bi_opf = REQ_OP_WRITE;
+		ret = vhost_blk_req_submit(req, file);
+		break;
+	case VIRTIO_BLK_T_IN:
+		req->bi_opf = REQ_OP_READ;
+		ret = vhost_blk_req_submit(req, file);
+		break;
+	case VIRTIO_BLK_T_FLUSH:
+		req->bi_opf = REQ_OP_FLUSH;
+		ret = vhost_blk_req_submit(req, file);
+		break;
+	case VIRTIO_BLK_T_GET_ID:
+		len = snprintf(id, VIRTIO_BLK_ID_BYTES, "vhost-blk%d", blk->index);
+		iov_iter_init(&iter, WRITE, req->iov, req->iov_nr, req->len);
+		ret = copy_to_iter(id, len, &iter);
+		status = ret != len ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+		ret = vhost_blk_set_status(req, status);
+		if (ret)
+			break;
+		vhost_add_used_and_signal(&blk->dev, vq, head, len);
+		break;
+	default:
+		vq_err(vq, "Unsupported request type %d\n", hdr->type);
+		status = VIRTIO_BLK_S_UNSUPP;
+		ret = vhost_blk_set_status(req, status);
+		if (ret)
+			break;
+		vhost_add_used_and_signal(&blk->dev, vq, head, 0);
+	}
+
+	return ret;
+}
+
+static void vhost_blk_handle_guest_kick(struct vhost_work *work)
+{
+	struct virtio_blk_outhdr hdr;
+	struct vhost_blk_vq *blk_vq;
+	struct vhost_virtqueue *vq;
+	struct iovec hdr_iovec[VHOST_MAX_METADATA_IOV];
+	struct vhost_blk *blk;
+	struct iov_iter iter;
+	int in, out, ret;
+	struct file *f;
+	int head;
+
+	vq = container_of(work, struct vhost_virtqueue, poll.work);
+	blk = container_of(vq->dev, struct vhost_blk, dev);
+	blk_vq = container_of(vq, struct vhost_blk_vq, vq);
+
+	f = vhost_vq_get_backend(vq);
+	if (!f)
+		return;
+
+	vhost_disable_notify(&blk->dev, vq);
+	for (;;) {
+		head = vhost_get_vq_desc(vq, vq->iov,
+					 ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (unlikely(head < 0))
+			break;
+
+		if (unlikely(head == vq->num)) {
+			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
+				vhost_disable_notify(&blk->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		ret = move_iovec(vq->iov, hdr_iovec, sizeof(hdr), in + out, ARRAY_SIZE(hdr_iovec));
+		if (ret < 0) {
+			vq_err(vq, "virtio_blk_hdr is too split!");
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		iov_iter_init(&iter, READ, hdr_iovec, ARRAY_SIZE(hdr_iovec), sizeof(hdr));
+		ret = copy_from_iter(&hdr, sizeof(hdr), &iter);
+		if (ret != sizeof(hdr)) {
+			vq_err(vq, "Failed to get block header: read %d bytes instead of %zu!\n",
+			       ret, sizeof(hdr));
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (vhost_blk_req_handle(vq, &hdr, head, out + in, f) < 0) {
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (!llist_empty(&blk_vq->llhead)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void vhost_blk_handle_host_kick(struct vhost_work *work)
+{
+	struct vhost_blk_vq *blk_vq;
+	struct vhost_virtqueue *vq;
+	struct vhost_blk_req *req;
+	struct llist_node *llnode;
+	struct vhost_blk *blk = NULL;
+	bool added, zero;
+	u8 status;
+	int ret;
+
+	blk_vq = container_of(work, struct vhost_blk_vq, work);
+	vq = &blk_vq->vq;
+	llnode = llist_del_all(&blk_vq->llhead);
+	added = false;
+	while (llnode) {
+		req = llist_entry(llnode, struct vhost_blk_req, llnode);
+		llnode = llist_next(llnode);
+
+		if (!blk)
+			blk = req->blk;
+
+		vhost_blk_req_umap(req);
+
+		status = req->bio_err == 0 ?  VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		ret = vhost_blk_set_status(req, status);
+		if (unlikely(ret))
+			continue;
+
+		vhost_add_used(vq, req->head, req->len);
+		added = true;
+
+		spin_lock(&req->blk->flush_lock);
+		zero = atomic_dec_and_test(
+				&req->blk->req_inflight[req->during_flush]);
+		if (zero && !req->during_flush)
+			wake_up(&blk->flush_wait);
+		spin_unlock(&req->blk->flush_lock);
+
+	}
+
+	if (likely(added))
+		vhost_signal(&blk->dev, vq);
+}
+
+static void vhost_blk_flush(struct vhost_blk *blk)
+{
+	spin_lock(&blk->flush_lock);
+	blk->during_flush = 1;
+	spin_unlock(&blk->flush_lock);
+
+	vhost_dev_flush(&blk->dev);
+	/*
+	 * Wait until requests fired before the flush to be finished
+	 * req_inflight[0] is used to track the requests fired before the flush
+	 * req_inflight[1] is used to track the requests fired during the flush
+	 */
+	wait_event(blk->flush_wait, !atomic_read(&blk->req_inflight[0]));
+
+	spin_lock(&blk->flush_lock);
+	blk->during_flush = 0;
+	spin_unlock(&blk->flush_lock);
+}
+
+static inline void vhost_blk_drop_backends(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq;
+	int i;
+
+	for (i = 0; i < VHOST_BLK_VQ_MAX; i++) {
+		vq = &blk->vqs[i].vq;
+
+		mutex_lock(&vq->mutex);
+		vhost_vq_set_backend(vq, NULL);
+		mutex_unlock(&vq->mutex);
+	}
+}
+
+static int vhost_blk_open(struct inode *inode, struct file *file)
+{
+	struct vhost_blk *blk;
+	struct vhost_virtqueue **vqs;
+	int ret = 0, i = 0;
+
+	blk = kvzalloc(sizeof(*blk), GFP_KERNEL);
+	if (!blk) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	vqs = kcalloc(VHOST_BLK_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		ret = -ENOMEM;
+		goto out_blk;
+	}
+
+	for (i = 0; i < VHOST_BLK_VQ_MAX; i++) {
+		blk->vqs[i].vq.handle_kick = vhost_blk_handle_guest_kick;
+		vqs[i] = &blk->vqs[i].vq;
+	}
+
+	blk->index = gen++;
+
+	atomic_set(&blk->req_inflight[0], 0);
+	atomic_set(&blk->req_inflight[1], 0);
+	blk->during_flush = 0;
+	spin_lock_init(&blk->flush_lock);
+	init_waitqueue_head(&blk->flush_wait);
+
+	vhost_dev_init(&blk->dev, vqs, VHOST_BLK_VQ_MAX, UIO_MAXIOV,
+		       VHOST_DEV_WEIGHT, VHOST_DEV_PKT_WEIGHT, true, NULL);
+	file->private_data = blk;
+
+	for (i = 0; i < VHOST_BLK_VQ_MAX; i++)
+		vhost_work_init(&blk->vqs[i].work, vhost_blk_handle_host_kick);
+
+	return ret;
+out_blk:
+	kvfree(blk);
+out:
+	return ret;
+}
+
+static int vhost_blk_release(struct inode *inode, struct file *f)
+{
+	struct vhost_blk *blk = f->private_data;
+	int i;
+
+	vhost_blk_drop_backends(blk);
+	vhost_blk_flush(blk);
+	vhost_dev_stop(&blk->dev);
+	if (blk->backend)
+		fput(blk->backend);
+	vhost_dev_cleanup(&blk->dev);
+	for (i = 0; i < VHOST_BLK_VQ_MAX; i++)
+		kvfree(blk->vqs[i].req);
+	kfree(blk->dev.vqs);
+	kvfree(blk);
+
+	return 0;
+}
+
+static int vhost_blk_set_features(struct vhost_blk *blk, u64 features)
+{
+	struct vhost_virtqueue *vq;
+	int i;
+
+	mutex_lock(&blk->dev.mutex);
+	if ((features & (1 << VHOST_F_LOG_ALL)) &&
+	    !vhost_log_access_ok(&blk->dev)) {
+		mutex_unlock(&blk->dev.mutex);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < VHOST_BLK_VQ_MAX; i++) {
+		vq = &blk->vqs[i].vq;
+		mutex_lock(&vq->mutex);
+		vq->acked_features = features & (VHOST_BLK_FEATURES);
+		mutex_unlock(&vq->mutex);
+	}
+
+	vhost_blk_flush(blk);
+	mutex_unlock(&blk->dev.mutex);
+
+	return 0;
+}
+
+static long vhost_blk_set_backend(struct vhost_blk *blk, int fd)
+{
+	struct vhost_virtqueue *vq;
+	struct file *file;
+	struct inode *inode;
+	int ret, i;
+
+	mutex_lock(&blk->dev.mutex);
+	ret = vhost_dev_check_owner(&blk->dev);
+	if (ret)
+		goto out_dev;
+
+	if (blk->backend) {
+		ret = -EBUSY;
+		goto out_dev;
+	}
+
+	file = fget(fd);
+	if (!file) {
+		ret = -EBADF;
+		goto out_dev;
+	}
+
+	inode = file->f_mapping->host;
+	if (!S_ISBLK(inode->i_mode)) {
+		ret = -EFAULT;
+		goto out_file;
+	}
+
+	for (i = 0; i < VHOST_BLK_VQ_MAX; i++) {
+		vq = &blk->vqs[i].vq;
+		if (!vhost_vq_access_ok(vq)) {
+			ret = -EFAULT;
+			goto out_drop;
+		}
+
+		mutex_lock(&vq->mutex);
+		vhost_vq_set_backend(vq, file);
+		ret = vhost_vq_init_access(vq);
+		mutex_unlock(&vq->mutex);
+	}
+
+	blk->backend = file;
+
+	mutex_unlock(&blk->dev.mutex);
+	return 0;
+
+out_drop:
+	vhost_blk_drop_backends(blk);
+out_file:
+	fput(file);
+out_dev:
+	mutex_unlock(&blk->dev.mutex);
+	return ret;
+}
+
+static long vhost_blk_reset_owner(struct vhost_blk *blk)
+{
+	struct vhost_iotlb *umem;
+	int err, i;
+
+	mutex_lock(&blk->dev.mutex);
+	err = vhost_dev_check_owner(&blk->dev);
+	if (err)
+		goto done;
+	umem = vhost_dev_reset_owner_prepare();
+	if (!umem) {
+		err = -ENOMEM;
+		goto done;
+	}
+	vhost_blk_drop_backends(blk);
+	if (blk->backend) {
+		fput(blk->backend);
+		blk->backend = NULL;
+	}
+	vhost_blk_flush(blk);
+	vhost_dev_stop(&blk->dev);
+	vhost_dev_reset_owner(&blk->dev, umem);
+
+	for (i = 0; i < VHOST_BLK_VQ_MAX; i++) {
+		kvfree(blk->vqs[i].req);
+		blk->vqs[i].req = NULL;
+	}
+
+done:
+	mutex_unlock(&blk->dev.mutex);
+	return err;
+}
+
+static int vhost_blk_setup(struct vhost_blk *blk, void __user *argp)
+{
+	struct vhost_vring_state s;
+
+	if (copy_from_user(&s, argp, sizeof(s)))
+		return -EFAULT;
+
+	if (blk->vqs[s.index].req)
+		return 0;
+
+	blk->vqs[s.index].req = kvmalloc(sizeof(struct vhost_blk_req) * s.num, GFP_KERNEL);
+	if (!blk->vqs[s.index].req)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
+			    unsigned long arg)
+{
+	struct vhost_blk *blk = f->private_data;
+	void __user *argp = (void __user *)arg;
+	struct vhost_vring_file backend;
+	u64 __user *featurep = argp;
+	u64 features;
+	int ret;
+
+	switch (ioctl) {
+	case VHOST_BLK_SET_BACKEND:
+		if (copy_from_user(&backend, argp, sizeof(backend)))
+			return -EFAULT;
+		return vhost_blk_set_backend(blk, backend.fd);
+	case VHOST_GET_FEATURES:
+		features = VHOST_BLK_FEATURES;
+		if (copy_to_user(featurep, &features, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_FEATURES:
+		if (copy_from_user(&features, featurep, sizeof(features)))
+			return -EFAULT;
+		if (features & ~VHOST_BLK_FEATURES)
+			return -EOPNOTSUPP;
+		return vhost_blk_set_features(blk, features);
+	case VHOST_RESET_OWNER:
+		return vhost_blk_reset_owner(blk);
+	default:
+		mutex_lock(&blk->dev.mutex);
+		ret = vhost_dev_ioctl(&blk->dev, ioctl, argp);
+		if (ret == -ENOIOCTLCMD)
+			ret = vhost_vring_ioctl(&blk->dev, ioctl, argp);
+		if (!ret && ioctl == VHOST_SET_VRING_NUM)
+			ret = vhost_blk_setup(blk, argp);
+		vhost_blk_flush(blk);
+		mutex_unlock(&blk->dev.mutex);
+		return ret;
+	}
+}
+
+static const struct file_operations vhost_blk_fops = {
+	.owner          = THIS_MODULE,
+	.open           = vhost_blk_open,
+	.release        = vhost_blk_release,
+	.llseek		= noop_llseek,
+	.unlocked_ioctl = vhost_blk_ioctl,
+};
+
+static struct miscdevice vhost_blk_misc = {
+	MISC_DYNAMIC_MINOR,
+	"vhost-blk",
+	&vhost_blk_fops,
+};
+module_misc_device(vhost_blk_misc);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Andrey Zhadchenko");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index f9f115a7c75b..56a709e31709 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -180,4 +180,8 @@
  */
 #define VHOST_VDPA_SUSPEND		_IO(VHOST_VIRTIO, 0x7D)
 
+/* VHOST_BLK specific defines */
+#define VHOST_BLK_SET_BACKEND		_IOW(VHOST_VIRTIO, 0xFF, \
+					     struct vhost_vring_file)
+
 #endif
-- 
2.31.1



* [RFC PATCH v2 02/10] drivers/vhost: use array to store workers
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 01/10] drivers/vhost: vhost-blk " Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 03/10] drivers/vhost: adjust vhost to flush all workers Andrey Zhadchenko
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

We want to support several vhost workers. The first step is to rework
vhost to use an array of workers rather than a single pointer.
Update the creation and cleanup routines.

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c | 75 +++++++++++++++++++++++++++++++------------
 drivers/vhost/vhost.h | 10 +++++-
 2 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 40097826cff0..40cc50c33c66 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,11 +231,24 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
+static void vhost_work_queue_at_worker(struct vhost_worker *w,
+				       struct vhost_work *work)
+{
+	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
+		/* We can only add the work to the list after we're
+		 * sure it was not in the list.
+		 * test_and_set_bit() implies a memory barrier.
+		 */
+		llist_add(&work->node, &w->work_list);
+		wake_up_process(w->worker);
+	}
+}
+
 void vhost_dev_flush(struct vhost_dev *dev)
 {
 	struct vhost_flush_struct flush;
 
-	if (dev->worker) {
+	if (dev->workers[0].worker) {
 		init_completion(&flush.wait_event);
 		vhost_work_init(&flush.work, vhost_flush_work);
 
@@ -247,17 +260,12 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-	if (!dev->worker)
+	struct vhost_worker *w = &dev->workers[0];
+
+	if (!w->worker)
 		return;
 
-	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
-		/* We can only add the work to the list after we're
-		 * sure it was not in the list.
-		 * test_and_set_bit() implies a memory barrier.
-		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
-	}
+	vhost_work_queue_at_worker(w, work);
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
@@ -333,9 +341,29 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	__vhost_vq_meta_reset(vq);
 }
 
+static void vhost_worker_reset(struct vhost_worker *w)
+{
+	init_llist_head(&w->work_list);
+	w->worker = NULL;
+}
+
+void vhost_cleanup_workers(struct vhost_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->nworkers; ++i) {
+		WARN_ON(!llist_empty(&dev->workers[i].work_list));
+		kthread_stop(dev->workers[i].worker);
+		vhost_worker_reset(&dev->workers[i]);
+	}
+
+	dev->nworkers = 0;
+}
+
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *w = data;
+	struct vhost_dev *dev = w->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -350,7 +378,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&w->work_list);
 		if (!node)
 			schedule();
 
@@ -473,7 +501,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->umem = NULL;
 	dev->iotlb = NULL;
 	dev->mm = NULL;
-	dev->worker = NULL;
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
@@ -485,6 +512,11 @@ void vhost_dev_init(struct vhost_dev *dev,
 	INIT_LIST_HEAD(&dev->pending_list);
 	spin_lock_init(&dev->iotlb_lock);
 
+	dev->nworkers = 0;
+	for (i = 0; i < VHOST_MAX_WORKERS; ++i) {
+		dev->workers[i].dev = dev;
+		vhost_worker_reset(&dev->workers[i]);
+	}
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
@@ -594,7 +626,8 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 			goto err_worker;
 		}
 
-		dev->worker = worker;
+		dev->workers[0].worker = worker;
+		dev->nworkers = 1;
 		wake_up_process(worker); /* avoid contributing to loadavg */
 
 		err = vhost_attach_cgroups(dev);
@@ -608,9 +641,10 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	return 0;
 err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
+	dev->nworkers = 0;
+	if (dev->workers[0].worker) {
+		kthread_stop(dev->workers[0].worker);
+		dev->workers[0].worker = NULL;
 	}
 err_worker:
 	vhost_detach_mm(dev);
@@ -693,6 +727,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 			eventfd_ctx_put(dev->vqs[i]->call_ctx.ctx);
 		vhost_vq_reset(dev, dev->vqs[i]);
 	}
+
 	vhost_dev_free_iovecs(dev);
 	if (dev->log_ctx)
 		eventfd_ctx_put(dev->log_ctx);
@@ -704,10 +739,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
+	if (dev->use_worker) {
+		vhost_cleanup_workers(dev);
 		dev->kcov_handle = 0;
 	}
 	vhost_detach_mm(dev);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d9109107af08..fc365b08fc38 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+#define VHOST_MAX_WORKERS 4
+struct vhost_worker {
+	struct task_struct *worker;
+	struct llist_head work_list;
+	struct vhost_dev *dev;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -148,7 +155,8 @@ struct vhost_dev {
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
 	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker workers[VHOST_MAX_WORKERS];
+	int nworkers;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
-- 
2.31.1



* [RFC PATCH v2 03/10] drivers/vhost: adjust vhost to flush all workers
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 01/10] drivers/vhost: vhost-blk " Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 02/10] drivers/vhost: use array to store workers Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 04/10] drivers/vhost: rework cgroups attachment to be worker aware Andrey Zhadchenko
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Make vhost_dev_flush support several workers and flush them
simultaneously.

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 40cc50c33c66..5dae8e8f8163 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -246,15 +246,19 @@ static void vhost_work_queue_at_worker(struct vhost_worker *w,
 
 void vhost_dev_flush(struct vhost_dev *dev)
 {
-	struct vhost_flush_struct flush;
+	struct vhost_flush_struct flush[VHOST_MAX_WORKERS];
+	int i, nworkers;
 
-	if (dev->workers[0].worker) {
-		init_completion(&flush.wait_event);
-		vhost_work_init(&flush.work, vhost_flush_work);
+	nworkers = READ_ONCE(dev->nworkers);
 
-		vhost_work_queue(dev, &flush.work);
-		wait_for_completion(&flush.wait_event);
+	for (i = 0; i < nworkers; i++) {
+		init_completion(&flush[i].wait_event);
+		vhost_work_init(&flush[i].work, vhost_flush_work);
+		vhost_work_queue_at_worker(&dev->workers[i], &flush[i].work);
 	}
+
+	for (i = 0; i < nworkers; i++)
+		wait_for_completion(&flush[i].wait_event);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
-- 
2.31.1



* [RFC PATCH v2 04/10] drivers/vhost: rework cgroups attachment to be worker aware
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (2 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 03/10] drivers/vhost: adjust vhost to flush all workers Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 05/10] drivers/vhost: rework worker creation Andrey Zhadchenko
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Rework vhost_attach_cgroups to operate on the specified worker.
Implement vhost_worker_flush, as we need to flush a specific worker.

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5dae8e8f8163..e6bcb2605ee0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -262,6 +262,16 @@ void vhost_dev_flush(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
+static void vhost_worker_flush(struct vhost_worker *w)
+{
+	struct vhost_flush_struct flush;
+
+	init_completion(&flush.wait_event);
+	vhost_work_init(&flush.work, vhost_flush_work);
+	vhost_work_queue_at_worker(w, &flush.work);
+	wait_for_completion(&flush.wait_event);
+}
+
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
 	struct vhost_worker *w = &dev->workers[0];
@@ -559,14 +569,14 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups(struct vhost_dev *dev)
+static int vhost_worker_attach_cgroups(struct vhost_worker *w)
 {
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
 	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_dev_flush(dev);
+	vhost_work_queue_at_worker(w, &attach.work);
+	vhost_worker_flush(w);
 	return attach.ret;
 }
 
@@ -634,7 +644,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 		dev->nworkers = 1;
 		wake_up_process(worker); /* avoid contributing to loadavg */
 
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_attach_cgroups(&dev->workers[0]);
 		if (err)
 			goto err_cgroup;
 	}
-- 
2.31.1



* [RFC PATCH v2 05/10] drivers/vhost: rework worker creation
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (3 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 04/10] drivers/vhost: rework cgroups attachment to be worker aware Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 06/10] drivers/vhost: add ioctl to increase the number of workers Andrey Zhadchenko
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Add a function to create a vhost worker and add it to the device.
Rework vhost_dev_set_owner.

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c | 68 +++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e6bcb2605ee0..219ad453ca27 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -617,53 +617,65 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
-/* Caller should have device mutex */
-long vhost_dev_set_owner(struct vhost_dev *dev)
+static int vhost_add_worker(struct vhost_dev *dev)
 {
+	struct vhost_worker *w = &dev->workers[dev->nworkers];
 	struct task_struct *worker;
 	int err;
 
+	if (dev->nworkers == VHOST_MAX_WORKERS)
+		return -E2BIG;
+
+	worker = kthread_create(vhost_worker, w,
+				"vhost-%d-%d", current->pid, dev->nworkers);
+	if (IS_ERR(worker))
+		return PTR_ERR(worker);
+
+	w->worker = worker;
+	wake_up_process(worker); /* avoid contributing to loadavg */
+
+	err = vhost_worker_attach_cgroups(w);
+	if (err)
+		goto cleanup;
+
+	dev->nworkers++;
+	return 0;
+
+cleanup:
+	kthread_stop(worker);
+	w->worker = NULL;
+
+	return err;
+}
+
+/* Caller should have device mutex */
+long vhost_dev_set_owner(struct vhost_dev *dev)
+{
+	int err;
+
 	/* Is there an owner already? */
-	if (vhost_dev_has_owner(dev)) {
-		err = -EBUSY;
-		goto err_mm;
-	}
+	if (vhost_dev_has_owner(dev))
+		return -EBUSY;
 
 	vhost_attach_mm(dev);
 
 	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->workers[0].worker = worker;
-		dev->nworkers = 1;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_worker_attach_cgroups(&dev->workers[0]);
+		err = vhost_add_worker(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_mm;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_worker;
 
 	return 0;
-err_cgroup:
-	dev->nworkers = 0;
-	if (dev->workers[0].worker) {
-		kthread_stop(dev->workers[0].worker);
-		dev->workers[0].worker = NULL;
-	}
 err_worker:
-	vhost_detach_mm(dev);
-	dev->kcov_handle = 0;
+	vhost_cleanup_workers(dev);
 err_mm:
+	vhost_detach_mm(dev);
+	dev->kcov_handle = 0;
 	return err;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
-- 
2.31.1



* [RFC PATCH v2 06/10] drivers/vhost: add ioctl to increase the number of workers
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (4 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 05/10] drivers/vhost: rework worker creation Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 07/10] drivers/vhost: assign workers to virtqueues Andrey Zhadchenko
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Finally, add an ioctl to allow userspace to create additional workers.
For now, the number of workers can only be increased.
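
Usage is expected to look roughly like this (sketch; vhost_fd is assumed
to be a vhost device fd whose owner has already been set):

	int n = 2;

	/* on failure, some workers may already have been created */
	if (ioctl(vhost_fd, VHOST_SET_NWORKERS, &n) < 0)
		return -1;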

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c      | 32 +++++++++++++++++++++++++++++++-
 include/uapi/linux/vhost.h |  9 +++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 219ad453ca27..09e3c4b1bd05 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -648,6 +648,25 @@ static int vhost_add_worker(struct vhost_dev *dev)
 	return err;
 }
 
+static int vhost_set_workers(struct vhost_dev *dev, int n)
+{
+	int ret = 0;
+
+	if (n > dev->nvqs)
+		n = dev->nvqs;
+
+	if (n > VHOST_MAX_WORKERS)
+		n = VHOST_MAX_WORKERS;
+
+	while (dev->nworkers < n) {
+		ret = vhost_add_worker(dev);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -1821,7 +1840,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 	struct eventfd_ctx *ctx;
 	u64 p;
 	long r;
-	int i, fd;
+	int i, fd, n;
 
 	/* If you are not the owner, you can become one */
 	if (ioctl == VHOST_SET_OWNER) {
@@ -1878,6 +1897,17 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 		if (ctx)
 			eventfd_ctx_put(ctx);
 		break;
+	case VHOST_SET_NWORKERS:
+		r = get_user(n, (int __user *)argp);
+		if (r < 0)
+			break;
+		if (n < d->nworkers) {
+			r = -EINVAL;
+			break;
+		}
+
+		r = vhost_set_workers(d, n);
+		break;
 	default:
 		r = -ENOIOCTLCMD;
 		break;
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 56a709e31709..f78cec410460 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -71,6 +71,15 @@
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
 
+/* Set the number of vhost workers.
+ * Currently the number of vhost workers can only be increased.
+ * All workers are freed upon reset.
+ * If the value is too big, it is silently truncated to the maximum
+ * number of supported vhost workers.
+ * Even if an error is returned, some workers may have been created.
+ */
+#define VHOST_SET_NWORKERS _IOW(VHOST_VIRTIO, 0x1F, int)
+
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
 
-- 
2.31.1



* [RFC PATCH v2 07/10] drivers/vhost: assign workers to virtqueues
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (5 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 06/10] drivers/vhost: add ioctl to increase the number of workers Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 08/10] drivers/vhost: add API to queue work at virtqueue's worker Andrey Zhadchenko
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Add a worker pointer to every virtqueue. Add a routine to assign
workers to virtqueues and call it after any worker creation.

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c | 14 ++++++++++++++
 drivers/vhost/vhost.h |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 09e3c4b1bd05..d80831d21fba 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -353,6 +353,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->iotlb = NULL;
 	vhost_vring_call_reset(&vq->call_ctx);
 	__vhost_vq_meta_reset(vq);
+	vq->worker = NULL;
 }
 
 static void vhost_worker_reset(struct vhost_worker *w)
@@ -667,6 +668,17 @@ static int vhost_set_workers(struct vhost_dev *dev, int n)
 	return ret;
 }
 
+static void vhost_assign_workers(struct vhost_dev *dev)
+{
+	int i, j = 0;
+
+	for (i = 0; i < dev->nvqs; i++) {
+		dev->vqs[i]->worker = &dev->workers[j];
+		if (++j == dev->nworkers)
+			j = 0;
+	}
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -689,6 +701,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	if (err)
 		goto err_worker;
 
+	vhost_assign_workers(dev);
 	return 0;
 err_worker:
 	vhost_cleanup_workers(dev);
@@ -1907,6 +1920,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 		}
 
 		r = vhost_set_workers(d, n);
+		vhost_assign_workers(d);
 		break;
 	default:
 		r = -ENOIOCTLCMD;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index fc365b08fc38..b9a10c85b298 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -137,6 +137,8 @@ struct vhost_virtqueue {
 	bool user_be;
 #endif
 	u32 busyloop_timeout;
+
+	struct vhost_worker *worker;
 };
 
 struct vhost_msg_node {
-- 
2.31.1



* [RFC PATCH v2 08/10] drivers/vhost: add API to queue work at virtqueue's worker
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (6 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 07/10] drivers/vhost: assign workers to virtqueues Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 09/10] drivers/vhost: allow polls to be bound to workers via vqs Andrey Zhadchenko
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Add routines to queue and flush work on a virtqueue's assigned worker.

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c | 22 ++++++++++++++++++++++
 drivers/vhost/vhost.h |  5 +++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d80831d21fba..fbf5fae1a6bf 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -272,6 +272,17 @@ static void vhost_worker_flush(struct vhost_worker *w)
        wait_for_completion(&flush.wait_event);
 }
 
+void vhost_work_flush_vq(struct vhost_virtqueue *vq)
+{
+	struct vhost_worker *w = READ_ONCE(vq->worker);
+
+	if (!w)
+		return;
+
+	vhost_worker_flush(w);
+}
+EXPORT_SYMBOL_GPL(vhost_work_flush_vq);
+
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
 	struct vhost_worker *w = &dev->workers[0];
@@ -283,6 +294,17 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+void vhost_work_vqueue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+	struct vhost_worker *w = READ_ONCE(vq->worker);
+
+	if (!w)
+		return;
+
+	vhost_work_queue_at_worker(w, work);
+}
+EXPORT_SYMBOL_GPL(vhost_work_vqueue);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b9a10c85b298..bedf8b9d99de 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -141,6 +141,11 @@ struct vhost_virtqueue {
 	struct vhost_worker *worker;
 };
 
+/* Queue the work on virtqueue assigned worker */
+void vhost_work_vqueue(struct vhost_virtqueue *vq, struct vhost_work *work);
+/* Flush virtqueue assigned worker */
+void vhost_work_flush_vq(struct vhost_virtqueue *vq);
+
 struct vhost_msg_node {
   union {
 	  struct vhost_msg msg;
-- 
2.31.1



* [RFC PATCH v2 09/10] drivers/vhost: allow polls to be bound to workers via vqs
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (7 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 08/10] drivers/vhost: add API to queue work at virtqueue's worker Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 15:18 ` [RFC PATCH v2 10/10] drivers/vhost: queue vhost_blk works at vq workers Andrey Zhadchenko
  2022-10-13 17:33 ` [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Allow vhost polls to be associated with vqs so we can queue them on the
assigned workers.
If a poll is not associated with a specific vq, queue it on the first
virtqueue.

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/vhost.c | 24 ++++++++++++++++--------
 drivers/vhost/vhost.h |  4 +++-
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index fbf5fae1a6bf..9c35908ebc4f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -170,7 +170,7 @@ static int vhost_poll_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (!(key_to_poll(key) & poll->mask))
 		return 0;
 
-	if (!poll->dev->use_worker)
+	if (!poll->vq->dev->use_worker)
 		work->fn(work);
 	else
 		vhost_poll_queue(poll);
@@ -185,19 +185,27 @@ void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
-/* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     __poll_t mask, struct vhost_dev *dev)
+{
+	vhost_poll_init_vq(poll, fn, mask, dev->vqs[0]);
+}
+EXPORT_SYMBOL_GPL(vhost_poll_init);
+
+
+/* Init poll structure */
+void vhost_poll_init_vq(struct vhost_poll *poll, vhost_work_fn_t fn,
+			__poll_t mask, struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->vq = vq;
 	poll->wqh = NULL;
 
 	vhost_work_init(&poll->work, fn);
 }
-EXPORT_SYMBOL_GPL(vhost_poll_init);
+EXPORT_SYMBOL_GPL(vhost_poll_init_vq);
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
  * keep a reference to a file until after vhost_poll_stop is called. */
@@ -314,7 +322,7 @@ EXPORT_SYMBOL_GPL(vhost_has_work);
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_work_vqueue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
@@ -564,8 +572,8 @@ void vhost_dev_init(struct vhost_dev *dev,
 		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
-			vhost_poll_init(&vq->poll, vq->handle_kick,
-					EPOLLIN, dev);
+			vhost_poll_init_vq(&vq->poll, vq->handle_kick,
+					EPOLLIN, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -1837,7 +1845,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 	mutex_unlock(&vq->mutex);
 
 	if (pollstop && vq->handle_kick)
-		vhost_dev_flush(vq->poll.dev);
+		vhost_dev_flush(d);
 	return r;
 }
 EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index bedf8b9d99de..fd0943de4d4f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -40,7 +40,7 @@ struct vhost_poll {
 	wait_queue_entry_t	wait;
 	struct vhost_work	work;
 	__poll_t		mask;
-	struct vhost_dev	*dev;
+	struct vhost_virtqueue	*vq;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
@@ -49,6 +49,8 @@ bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     __poll_t mask, struct vhost_dev *dev);
+void vhost_poll_init_vq(struct vhost_poll *poll, vhost_work_fn_t fn,
+		     __poll_t mask, struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
-- 
2.31.1



* [RFC PATCH v2 10/10] drivers/vhost: queue vhost_blk works at vq workers
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (8 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 09/10] drivers/vhost: allow polls to be bound to workers via vqs Andrey Zhadchenko
@ 2022-10-13 15:18 ` Andrey Zhadchenko
  2022-10-13 17:33 ` [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
  10 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 15:18 UTC (permalink / raw)
  To: virtualization
  Cc: andrey.zhadchenko, mst, jasonwang, kvm, stefanha, sgarzare, den,
	ptikhomirov

Update vhost_blk to queue works on virtqueue workers. Together with
the previous changes this allows us to split virtio-blk requests
across several threads.

                        | randread, IOPS | randwrite, IOPS |
8vcpu, 1 kernel worker  |      576k      |      575k       |
8vcpu, 2 kernel workers |      803k      |      779k       |

Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
---
 drivers/vhost/blk.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
index 42acdef452b3..f7db891ee985 100644
--- a/drivers/vhost/blk.c
+++ b/drivers/vhost/blk.c
@@ -161,13 +161,12 @@ static inline int vhost_blk_set_status(struct vhost_blk_req *req, u8 status)
 static void vhost_blk_req_done(struct bio *bio)
 {
 	struct vhost_blk_req *req = bio->bi_private;
-	struct vhost_blk *blk = req->blk;
 
 	req->bio_err = blk_status_to_errno(bio->bi_status);
 
 	if (atomic_dec_and_test(&req->bio_nr)) {
 		llist_add(&req->llnode, &req->blk_vq->llhead);
-		vhost_work_queue(&blk->dev, &req->blk_vq->work);
+		vhost_work_vqueue(&req->blk_vq->vq, &req->blk_vq->work);
 	}
 
 	bio_put(bio);
-- 
2.31.1

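On the QEMU side, using the extra threads would presumably come down to
one more ioctl after the usual vhost setup. Patch 06 (the ioctl to
increase the number of workers) is not shown in this excerpt, so the
VHOST_SET_NWORKERS name, its request number, and the /dev/vhost-blk
node below are placeholders rather than the series' actual API:

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	#ifndef VHOST_SET_NWORKERS	/* placeholder, not mainline uapi */
	#define VHOST_SET_NWORKERS _IOW(VHOST_VIRTIO, 0x7f, int)
	#endif

	/* Open the vhost-blk device and ask for extra kernel workers. */
	int vhost_blk_open_with_workers(int nworkers)
	{
		int fd = open("/dev/vhost-blk", O_RDWR);

		if (fd < 0)
			return -1;

		if (ioctl(fd, VHOST_SET_OWNER, NULL) < 0 ||
		    ioctl(fd, VHOST_SET_NWORKERS, &nworkers) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}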


* Re: [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests
  2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
                   ` (9 preceding siblings ...)
  2022-10-13 15:18 ` [RFC PATCH v2 10/10] drivers/vhost: queue vhost_blk works at vq workers Andrey Zhadchenko
@ 2022-10-13 17:33 ` Andrey Zhadchenko
  2024-05-17  5:34   ` [PATCH] Fix issue: when the VirtIO backend provides the VIRTIO_BLK_F_MQ feature, the front-end OS fails to mount its file system Lynch
  10 siblings, 1 reply; 14+ messages in thread
From: Andrey Zhadchenko @ 2022-10-13 17:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization

Sorry, I misspelled your email address while sending the patchset.
Adding you back to Cc, as originally intended.

Kind regards, Andrey


* [PATCH] Fix issue: when the VirtIO backend provides the VIRTIO_BLK_F_MQ feature, the front-end OS fails to mount its file system.
  2022-10-13 17:33 ` [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
@ 2024-05-17  5:34   ` Lynch
  2024-05-17  9:09     ` Andrey Zhadchenko
  0 siblings, 1 reply; 14+ messages in thread
From: Lynch @ 2024-05-17  5:34 UTC (permalink / raw)
  To: andrey.zhadchenko; +Cc: jasowang, kvm, virtualization, Lynch

Judging from the diff: vhost_blk_bio_make() advanced req->sector in
place while splitting a request into bios, so the request lost its own
start sector during construction; with VIRTIO_BLK_F_MQ enabled this
apparently corrupted enough requests that the guest could not mount
its file system. Advance a local copy (sector_tmp) instead and leave
req->sector untouched.

---
 drivers/vhost/blk.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
index 44fbf253e773..0e946d9dfc33 100644
--- a/drivers/vhost/blk.c
+++ b/drivers/vhost/blk.c
@@ -251,6 +251,7 @@ static int vhost_blk_bio_make(struct vhost_blk_req *req,
 	struct page **pages, *page;
 	struct bio *bio = NULL;
 	int bio_nr = 0;
+	sector_t sector_tmp;
 
 	if (unlikely(req->bi_opf == REQ_OP_FLUSH))
 		return vhost_blk_bio_make_simple(req, bdev);
@@ -270,6 +271,7 @@ static int vhost_blk_bio_make(struct vhost_blk_req *req,
 		req->bio = req->inline_bio;
 	}
 
+	sector_tmp = req->sector;
 	req->iov_nr = 0;
 	for (i = 0; i < iov_nr; i++) {
 		int pages_nr = iov_num_pages(&iov[i]);
@@ -302,7 +304,7 @@ static int vhost_blk_bio_make(struct vhost_blk_req *req,
 				bio = bio_alloc(GFP_KERNEL, pages_nr_total);
 				if (!bio)
 					goto fail;
-				bio->bi_iter.bi_sector  = req->sector;
+				bio->bi_iter.bi_sector  = sector_tmp;
 				bio_set_dev(bio, bdev);
 				bio->bi_private = req;
 				bio->bi_end_io  = vhost_blk_req_done;
@@ -314,7 +316,7 @@ static int vhost_blk_bio_make(struct vhost_blk_req *req,
 			iov_len		-= len;
 
 			pos = (iov_base & VHOST_BLK_SECTOR_MASK) + iov_len;
-			req->sector += pos >> VHOST_BLK_SECTOR_BITS;
+			sector_tmp += pos >> VHOST_BLK_SECTOR_BITS;
 		}
 
 		pages += pages_nr;
-- 
2.25.1

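The shape of the fix is the usual "walk with a local cursor" pattern:
the request keeps its original start sector while bio construction
advances a private copy. A self-contained toy model (plain userspace C;
all names are illustrative, not the module's real types):

	#include <stdio.h>

	struct toy_req { unsigned long sector; unsigned long nr_sectors; };

	/* Split a request into fixed-size "bios", advancing only a local
	 * cursor, so req->sector survives bio construction intact. */
	static void toy_make_bios(struct toy_req *req, unsigned long per_bio)
	{
		unsigned long cursor = req->sector;	/* the sector_tmp */
		unsigned long left = req->nr_sectors;

		while (left) {
			unsigned long chunk = left < per_bio ? left : per_bio;

			printf("bio: start=%lu len=%lu\n", cursor, chunk);
			cursor += chunk;
			left -= chunk;
		}
	}

	int main(void)
	{
		struct toy_req req = { .sector = 2048, .nr_sectors = 24 };

		toy_make_bios(&req, 8);
		/* still 2048, as code after bio_make presumably expects */
		printf("req.sector after: %lu\n", req.sector);
		return 0;
	}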


* Re: [PATCH] Fix issue: when the VirtIO backend provides the VIRTIO_BLK_F_MQ feature, the front-end OS fails to mount its file system.
  2024-05-17  5:34   ` [PATCH] Fix issue: when the VirtIO backend provides the VIRTIO_BLK_F_MQ feature, the front-end OS fails to mount its file system Lynch
@ 2024-05-17  9:09     ` Andrey Zhadchenko
  0 siblings, 0 replies; 14+ messages in thread
From: Andrey Zhadchenko @ 2024-05-17  9:09 UTC (permalink / raw)
  To: Lynch, vzdevel; +Cc: jasowang, kvm, virtualization, Konstantin Khorenko

Hi

Thank you for the patch.
vhost-blk didn't spark enough interest to be reviewed and merged
upstream, so the code is not present there.
I have forwarded your patch to the relevant OpenVZ kernel mailing list.


end of thread

Thread overview: 14+ messages
2022-10-13 15:18 [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 01/10] drivers/vhost: vhost-blk " Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 02/10] drivers/vhost: use array to store workers Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 03/10] drivers/vhost: adjust vhost to flush all workers Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 04/10] drivers/vhost: rework cgroups attachment to be worker aware Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 05/10] drivers/vhost: rework worker creation Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 06/10] drivers/vhost: add ioctl to increase the number of workers Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 07/10] drivers/vhost: assign workers to virtqueues Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 08/10] drivers/vhost: add API to queue work at virtqueue's worker Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 09/10] drivers/vhost: allow polls to be bound to workers via vqs Andrey Zhadchenko
2022-10-13 15:18 ` [RFC PATCH v2 10/10] drivers/vhost: queue vhost_blk works at vq workers Andrey Zhadchenko
2022-10-13 17:33 ` [RFC PATCH v2 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests Andrey Zhadchenko
2024-05-17  5:34   ` [PATCH] Fix issue: when the VirtIO backend provides the VIRTIO_BLK_F_MQ feature, the front-end OS fails to mount its file system Lynch
2024-05-17  9:09     ` Andrey Zhadchenko
