[RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device


* [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
@ 2011-07-28 14:29 Liu Yuan
  2011-07-28 14:29 ` [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk Liu Yuan
                   ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-28 14:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, Rusty Russell, Avi Kivity; +Cc: kvm, linux-kernel

[design idea]

        vhost-blk uses two kernel threads to handle guest requests. One submits them via the Linux kernel's internal AIO structures, and the other signals the completion of the I/O requests to the guest.

        qemu-kvm's current native AIO in user mode actually uses just one I/O thread for both submitting and signalling. One more nuance is that qemu-kvm AIO signals request completions one by one.

        Like vhost-net, the in-kernel vhost-blk module reduces the number of system calls made during request handling, and its code path is shorter than that of the qemu-kvm implementation.
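
        The hand-off between the two threads can be illustrated in user space. This is only a sketch, assuming an eventfd is the notification channel (the patch uses an eventfd_ctx inside the kernel); the function names completion_fd_create, completion_fd_signal and completion_fd_drain are made up for illustration and do not appear in the patch. It shows why completions can be handled in batches: the eventfd counter accumulates signals until the completion side reads it.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the counter that is bumped once per completed request. */
int completion_fd_create(void)
{
	return eventfd(0, 0);
}

/* Called once per finished I/O: adds 1 to the eventfd counter. */
int completion_fd_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Called by the completion side: blocks until the counter is non-zero,
 * then returns the number of completions accumulated since the last
 * read. The read resets the counter to zero.
 */
long completion_fd_drain(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (long)count;
}
```

        If three requests finish before the completion side wakes up, a single completion_fd_drain() call reports all three, so the guest can be notified once for the whole batch.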

[performance]

        Currently, the fio benchmark numbers are rather promising. Sequential read throughput improves by as much as 16% and latency drops by up to 14%. For sequential write, the figures are 13.5% and 13% respectively.

sequential read:
+-------------+-------------+---------------+---------------+
| iodepth     | 1           |   2           |   3           |
+-------------+-------------+---------------+---------------+
| virtio-blk  | 4116(214)   |   7814(222)   |   8867(306)   |
+-------------+-------------+---------------+---------------+
| vhost-blk   | 4755(183)   |   8645(202)   |   10084(266)  |
+-------------+-------------+---------------+---------------+

4116(214) means 4116 IOPS with a completion latency of 214 us.

sequential write:
+-------------+-------------+----------------+--------------+
| iodepth     |  1          |    2           |  3           |
+-------------+-------------+----------------+--------------+
| virtio-blk  | 3848(228)   |   6505(275)    |  9335(291)   |
+-------------+-------------+----------------+--------------+
| vhost-blk   | 4370(198)   |   7009(249)    |  9938(264)   |
+-------------+-------------+----------------+--------------+
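
The relative gains can be recomputed from the two tables. The sketch below is not part of the patch; the struct and function names are chosen here for illustration, and the numbers are copied from the tables above.

```c
/* One table column: IOPS and completion latency (us) for both backends. */
struct sample {
	int iodepth;
	double virtio_iops, virtio_us;
	double vhost_iops, vhost_us;
};

static const struct sample seq_read[] = {
	{ 1, 4116, 214,  4755, 183 },
	{ 2, 7814, 222,  8645, 202 },
	{ 3, 8867, 306, 10084, 266 },
};

static const struct sample seq_write[] = {
	{ 1, 3848, 228, 4370, 198 },
	{ 2, 6505, 275, 7009, 249 },
	{ 3, 9335, 291, 9938, 264 },
};

/* Largest relative IOPS gain of vhost-blk over virtio-blk, in percent. */
double best_iops_gain_pct(const struct sample *s, int n)
{
	double best = 0.0;

	for (int i = 0; i < n; i++) {
		double gain = (s[i].vhost_iops - s[i].virtio_iops) /
			      s[i].virtio_iops * 100.0;
		if (gain > best)
			best = gain;
	}
	return best;
}

/* Largest relative completion-latency reduction, in percent. */
double best_latency_drop_pct(const struct sample *s, int n)
{
	double best = 0.0;

	for (int i = 0; i < n; i++) {
		double drop = (s[i].virtio_us - s[i].vhost_us) /
			      s[i].virtio_us * 100.0;
		if (drop > best)
			best = drop;
	}
	return best;
}
```

For both workloads the best case is iodepth 1: roughly 15.5% more IOPS and 14.5% lower latency for sequential read, and roughly 13.6% and 13.2% for sequential write, which matches the rounded figures quoted in the summary.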

the fio command for sequential read:

sudo fio -name iops -readonly -rw=read -runtime=120 -iodepth 1 -filename /dev/vda -ioengine libaio -direct=1 -bs=512

and the config file for sequential write is:

dev@taobao:~$ cat rw.fio
-------------------------
[test]

rw=rw
size=200M
directory=/home/dev/data
ioengine=libaio
iodepth=1
direct=1
bs=512
-------------------------

        These numbers were collected on my laptop with an Intel Core i5 CPU at 2.67GHz and a 7200 RPM SATA hard disk. Both guest and host run a Linux 3.0-rc6 kernel with the ext4 filesystem.

        I set up the guest with:

        sudo x86_64-softmmu/qemu-system-x86_64 -cpu host -m 512 -drive file=/dev/sda6,if=virtio,cache=none,aio=native -nographic


        The patchset is very primitive and needs much further improvement in both functionality and performance.

        Inputs and suggestions are more than welcome.

Yuan
--
 drivers/vhost/Makefile |    3 +
 drivers/vhost/blk.c    |  568 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h  |   11 +
 fs/aio.c               |   44 ++---
 fs/eventfd.c           |    1 +
 include/linux/aio.h    |   31 +++
 6 files changed, 631 insertions(+), 27 deletions(-)
--
 Makefile.target |    2 +-
 hw/vhost_blk.c  |   84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vhost_blk.h  |   44 ++++++++++++++++++++++++++++
 hw/virtio-blk.c |   74 ++++++++++++++++++++++++++++++++++++++----------
 hw/virtio-blk.h |   15 ++++++++++
 hw/virtio-pci.c |   12 ++++++-
 6 files changed, 213 insertions(+), 18 deletions(-)
                                                                                                                      


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 14:29 [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Liu Yuan
@ 2011-07-28 14:29 ` Liu Yuan
  2011-07-28 14:47   ` Christoph Hellwig
                     ` (2 more replies)
  2011-07-28 14:29 ` [RFC PATCH] vhost: Enable vhost-blk support Liu Yuan
  2011-07-28 15:44 ` [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Stefan Hajnoczi
  2 siblings, 3 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-28 14:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, Rusty Russell, Avi Kivity; +Cc: kvm, linux-kernel

From: Liu Yuan <tailai.ly@taobao.com>

Vhost-blk driver is an in-kernel accelerator, intercepting the
IO requests from KVM virtio-capable guests. It is based on the
vhost infrastructure.

This is supposed to be a module over latest kernel tree, but it
needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
So currently, after applying the patch, you need to *recompile*
the kernel.

Usage:
$kernel-src: make M=drivers/vhost
$kernel-src: sudo insmod drivers/vhost/vhost_blk.ko

After insmod, you'll see /dev/vhost-blk created. done!
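
A quick way to check that the device node works is to query its feature mask from user space. This is a minimal sketch, not part of the patch: the helper name vhost_query_features is made up here, and it assumes the standard VHOST_GET_FEATURES ioctl from <linux/vhost.h>, which the driver's ioctl handler below answers with VHOST_BLK_FEATURES.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

/*
 * Open a vhost character device and read its feature mask.
 * Returns 0 on success or a negative errno value on failure.
 */
int vhost_query_features(const char *path, uint64_t *features)
{
	int fd, ret = 0;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, VHOST_GET_FEATURES, features) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Calling vhost_query_features("/dev/vhost-blk", &features) before the module is loaded should fail with a negative errno value, since the node does not exist yet.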

Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
---
 drivers/vhost/Makefile |    3 +
 drivers/vhost/blk.c    |  568 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h  |   11 +
 fs/aio.c               |   44 ++---
 fs/eventfd.c           |    1 +
 include/linux/aio.h    |   31 +++
 6 files changed, 631 insertions(+), 27 deletions(-)
 create mode 100644 drivers/vhost/blk.c

diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..31f8b2e 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,5 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
+obj-m += vhost_blk.o
+
 vhost_net-y := vhost.o net.o
+vhost_blk-y := vhost.o blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 0000000..f3462be
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,568 @@
+/* Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan <tailai.ly@taobao.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Vhost-blk driver is an in-kernel accelerator, intercepting the
+ * IO requests from KVM virtio-capable guests. It is based on the
+ * vhost infrastructure.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+#include <linux/eventfd.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/virtio_blk.h>
+#include <linux/file.h>
+#include <linux/mmu_context.h>
+#include <linux/kthread.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/blkdev.h>
+
+#include "vhost.h"
+
+#define DEBUG 0
+
+#if DEBUG > 0
+#define dprintk         printk
+#else
+#define dprintk(x...)   do { ; } while (0)
+#endif
+
+enum {
+	virtqueue_max = 1,
+};
+
+#define MAX_EVENTS 128
+
+struct vhost_blk {
+	struct vhost_virtqueue vq;
+	struct vhost_dev dev;
+	int should_stop;
+	struct kioctx *ioctx;
+	struct eventfd_ctx *ectx;
+	struct file *efile;
+	struct task_struct *worker;
+};
+
+struct used_info {
+	void *status;
+	int head;
+	int len;
+};
+
+static struct io_event events[MAX_EVENTS];
+
+static void blk_flush(struct vhost_blk *blk)
+{
+       vhost_poll_flush(&blk->vq.poll);
+}
+
+static long blk_set_features(struct vhost_blk *blk, u64 features)
+{
+	blk->dev.acked_features = features;
+	return 0;
+}
+
+static void blk_stop(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *f;
+
+	mutex_lock(&vq->mutex);
+	f = rcu_dereference_protected(vq->private_data,
+					lockdep_is_held(&vq->mutex));
+	rcu_assign_pointer(vq->private_data, NULL);
+	mutex_unlock(&vq->mutex);
+
+	if (f)
+		fput(f);
+}
+
+static long blk_set_backend(struct vhost_blk *blk, struct vhost_vring_file *backend)
+{
+	int idx = backend->index;
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *file, *oldfile;
+	int ret;
+
+	mutex_lock(&blk->dev.mutex);
+	ret = vhost_dev_check_owner(&blk->dev);
+	if (ret)
+		goto err_dev;
+	if (idx >= virtqueue_max) {
+		ret = -ENOBUFS;
+		goto err_dev;
+	}
+
+	mutex_lock(&vq->mutex);
+
+	if (!vhost_vq_access_ok(vq)) {
+		ret = -EFAULT;
+		goto err_vq;
+	}
+
+	file = fget(backend->fd);
+	if (!file) {
+		ret = -EBADF;
+		goto err_vq;
+	}
+
+	oldfile = rcu_dereference_protected(vq->private_data,
+						lockdep_is_held(&vq->mutex));
+	if (file != oldfile)
+		rcu_assign_pointer(vq->private_data, file);
+
+	mutex_unlock(&vq->mutex);
+
+	if (oldfile) {
+		blk_flush(blk);
+		fput(oldfile);
+	}
+
+	mutex_unlock(&blk->dev.mutex);
+	return 0;
+err_vq:
+	mutex_unlock(&vq->mutex);
+err_dev:
+	mutex_unlock(&blk->dev.mutex);
+	return ret;
+}
+
+static long blk_reset_owner(struct vhost_blk *b)
+{
+	int ret;
+
+        mutex_lock(&b->dev.mutex);
+        ret = vhost_dev_check_owner(&b->dev);
+        if (ret)
+                goto err;
+        blk_stop(b);
+        blk_flush(b);
+        ret = vhost_dev_reset_owner(&b->dev);
+	if (b->worker) {
+		b->should_stop = 1;
+		smp_mb();
+		eventfd_signal(b->ectx, 1);
+	}
+err:
+        mutex_unlock(&b->dev.mutex);
+        return ret;
+}
+
+static int kernel_io_setup(unsigned nr_events, struct kioctx **ioctx)
+{
+	int ret = 0;
+	*ioctx = ioctx_alloc(nr_events);
+	if (IS_ERR(*ioctx))
+		ret = PTR_ERR(*ioctx);
+	return ret;
+}
+
+static inline int kernel_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event *event,
+			struct timespec *ts)
+{
+        mm_segment_t old_fs;
+        int ret;
+
+        old_fs = get_fs();
+        set_fs(get_ds());
+	ret = read_events(ctx, min_nr, nr, event, ts);
+        set_fs(old_fs);
+
+	return ret;
+}
+
+static inline ssize_t io_event_ret(struct io_event *ev)
+{
+    return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
+}
+
+static inline void aio_prep_req(struct kiocb *iocb, struct eventfd_ctx *ectx, struct file *file,
+		struct iovec *iov, int nvecs, u64 offset, int opcode, struct used_info *ui)
+{
+	iocb->ki_filp = file;
+	iocb->ki_eventfd = ectx;
+	iocb->ki_pos = offset;
+	iocb->ki_buf = (void *)iov;
+	iocb->ki_left = iocb->ki_nbytes = nvecs;
+	iocb->ki_opcode = opcode;
+	iocb->ki_obj.user = ui;
+}
+
+static inline int kernel_io_submit(struct vhost_blk *blk, struct iovec *iov, u64 nvecs, loff_t pos, int opcode, int head, int len)
+{
+	int ret = -EAGAIN;
+	struct kiocb *req;
+	struct kioctx *ioctx = blk->ioctx;
+	struct used_info *ui = kzalloc(sizeof *ui, GFP_KERNEL);
+	struct file *f = blk->vq.private_data;
+
+	try_get_ioctx(ioctx);
+	atomic_long_inc_not_zero(&f->f_count);
+	eventfd_ctx_get(blk->ectx);
+
+
+	req = aio_get_req(ioctx); /* return 2 refs of req*/
+	if (unlikely(!req))
+		goto out;
+
+	ui->head = head;
+	ui->status = blk->vq.iov[nvecs + 1].iov_base;
+	ui->len = len;
+	aio_prep_req(req, blk->ectx, f, iov, nvecs, pos, opcode, ui);
+
+	ret = aio_setup_iocb(req, 0);
+	if (unlikely(ret))
+		goto out_put_req;
+
+	spin_lock_irq(&ioctx->ctx_lock);
+	if (unlikely(ioctx->dead)) {
+		spin_unlock_irq(&ioctx->ctx_lock);
+		ret = -EINVAL;
+		goto out_put_req;
+	}
+
+	aio_run_iocb(req);
+	if (!list_empty(&ioctx->run_list)) {
+		while (__aio_run_iocbs(ioctx))
+			;
+	}
+	spin_unlock_irq(&ioctx->ctx_lock);
+
+	aio_put_req(req);
+	put_ioctx(blk->ioctx);
+
+	return ret;
+
+out_put_req:
+	aio_put_req(req);
+	aio_put_req(req);
+out:
+	put_ioctx(blk->ioctx);
+	return ret;
+}
+
+static int blk_completion_worker(void *priv)
+{
+	struct vhost_blk *blk = priv;
+	u64 count;
+	int ret;
+
+	use_mm(blk->dev.mm);
+	for (;;) {
+		struct timespec ts = { 0 };
+		int i, nr;
+
+		do {
+		ret = eventfd_ctx_read(blk->ectx, 0, &count);
+		} while (unlikely(ret == -ERESTARTSYS));
+
+		if (unlikely(blk->should_stop))
+			break;
+
+		do {
+		nr = kernel_read_events(blk->ioctx, count, MAX_EVENTS, events, &ts);
+		} while (unlikely(nr == -EINTR));
+		dprintk("%s, count %llu, nr %d\n", __func__, count, nr);
+
+		if (unlikely(nr < 0))
+			continue;
+
+		for (i = 0; i < nr; i++) {
+			struct used_info *u = (struct used_info *)events[i].obj;
+			int len, status;
+
+			dprintk("%s, head %d complete in %d\n", __func__, u->head, i);
+			len = io_event_ret(&events[i]);
+			//status = u->len == len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+			status = len > 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+			if (copy_to_user(u->status, &status, sizeof status)) {
+				vq_err(&blk->vq, "%s failed to write status\n", __func__);
+				BUG(); /* FIXME: maybe a bit radical? */
+			}
+			vhost_add_used(&blk->vq, u->head, u->len);
+			kfree(u);
+		}
+
+		vhost_signal(&blk->dev, &blk->vq);
+	}
+	unuse_mm(blk->dev.mm);
+	return 0;
+}
+
+static int completion_thread_setup(struct vhost_blk *blk)
+{
+	int ret = 0;
+	struct task_struct *worker;
+	worker = kthread_create(blk_completion_worker, blk, "vhost-blk-%d", current->pid);
+	if (IS_ERR(worker)) {
+		ret = PTR_ERR(worker);
+		goto err;
+	}
+	blk->worker = worker;
+	blk->should_stop = 0;
+	smp_mb();
+	wake_up_process(worker);
+err:
+	return ret;
+}
+
+static void completion_thread_destory(struct vhost_blk *blk)
+{
+	if (blk->worker) {
+		blk->should_stop = 1;
+		smp_mb();
+		eventfd_signal(blk->ectx, 1);
+	}
+}
+
+
+static long blk_set_owner(struct vhost_blk *blk)
+{
+	return completion_thread_setup(blk);
+}
+
+static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
+		unsigned long arg)
+{
+	struct vhost_blk *blk = f->private_data;
+	struct vhost_vring_file backend;
+	u64 features = VHOST_BLK_FEATURES;
+	int ret = -EFAULT;
+
+	switch (ioctl) {
+		case VHOST_NET_SET_BACKEND:
+			if(copy_from_user(&backend, (void __user *)arg, sizeof backend))
+				break;
+			ret = blk_set_backend(blk, &backend);
+			break;
+		case VHOST_GET_FEATURES:
+			features = VHOST_BLK_FEATURES;
+			if (copy_to_user((void __user *)arg , &features, sizeof features))
+				break;
+			ret = 0;
+			break;
+		case VHOST_SET_FEATURES:
+			if (copy_from_user(&features, (void __user *)arg, sizeof features))
+				break;
+			if (features & ~VHOST_BLK_FEATURES) {
+				ret = -EOPNOTSUPP;
+				break;
+			}
+			ret = blk_set_features(blk, features);
+			break;
+		case VHOST_RESET_OWNER:
+			ret = blk_reset_owner(blk);
+			break;
+		default:
+			mutex_lock(&blk->dev.mutex);
+			ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
+			if (!ret && ioctl == VHOST_SET_OWNER)
+				ret = blk_set_owner(blk);
+			blk_flush(blk);
+			mutex_unlock(&blk->dev.mutex);
+			break;
+	}
+	return ret;
+}
+
+#define BLK_HDR 0
+#define BLK_HDR_LEN 16
+
+static inline int do_request(struct vhost_virtqueue *vq, struct virtio_blk_outhdr *hdr,
+		u64 nr_vecs, int head)
+{
+	struct file *f = vq->private_data;
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	struct iovec *iov = &vq->iov[BLK_HDR + 1];
+	loff_t pos = hdr->sector << 9;
+	int ret = 0, len = 0, status;
+//	int i;
+
+	dprintk("sector %llu, num %lu, type %d\n", hdr->sector, iov->iov_len / 512, hdr->type);
+	//Guest virtio-blk driver doesn't use len currently.
+	//for (i = 0; i < nr_vecs; i++) {
+	//	len += iov[i].iov_len;
+	//}
+	switch (hdr->type) {
+	case VIRTIO_BLK_T_OUT:
+		kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PWRITEV, head, len);
+		break;
+	case VIRTIO_BLK_T_IN:
+		kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PREADV, head, len);
+		break;
+	case VIRTIO_BLK_T_FLUSH:
+		ret = vfs_fsync(f, 1);
+		/* fall through */
+	case VIRTIO_BLK_T_GET_ID:
+		status = ret < 0 ? VIRTIO_BLK_S_IOERR :VIRTIO_BLK_S_OK;
+		if ((vq->iov[nr_vecs + 1].iov_len != 1))
+			BUG();
+
+		if (copy_to_user(vq->iov[nr_vecs + 1].iov_base, &status, sizeof status)) {
+				vq_err(vq, "%s failed to write status!\n", __func__);
+				vhost_discard_vq_desc(vq, 1);
+				ret = -EFAULT;
+				break;
+			}
+
+		vhost_add_used_and_signal(&blk->dev, vq, head, ret);
+		break;
+	default:
+		pr_info("%s, unsupported request type %d\n", __func__, hdr->type);
+		vhost_discard_vq_desc(vq, 1);
+		ret = -EFAULT;
+		break;
+	}
+	return ret;
+}
+
+static inline void handle_kick(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct virtio_blk_outhdr hdr;
+	u64 nr_vecs;
+	int in, out, head;
+	struct blk_plug plug;
+
+	mutex_lock(&vq->mutex);
+	vhost_disable_notify(&blk->dev, vq);
+
+	blk_start_plug(&plug);
+	for (;;) {
+		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+				ARRAY_SIZE(vq->iov),
+				&out, &in, NULL, NULL);
+		/* No available descriptors from Guest? */
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
+				vhost_disable_notify(&blk->dev, vq);
+				continue;
+			}
+			break;
+		}
+		if (unlikely(head < 0))
+			break;
+
+		dprintk("head %d, in %d, out %d\n", head, in, out);
+		if(unlikely(vq->iov[BLK_HDR].iov_len != BLK_HDR_LEN)) {
+			vq_err(vq, "%s bad block header length!\n", __func__);
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (copy_from_user(&hdr, vq->iov[BLK_HDR].iov_base, sizeof hdr)) {
+			vq_err(vq, "%s failed to get block header!\n", __func__);
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (hdr.type == VIRTIO_BLK_T_IN || hdr.type == VIRTIO_BLK_T_GET_ID)
+			nr_vecs = in - 1;
+		else
+			nr_vecs = out - 1;
+
+		if (do_request(vq, &hdr, nr_vecs, head) < 0)
+			break;
+	}
+	blk_finish_plug(&plug);
+	mutex_unlock(&vq->mutex);
+}
+
+static void handle_guest_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, poll.work);
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	handle_kick(blk);
+}
+
+static void eventfd_setup(struct vhost_blk *blk)
+{
+	blk->efile = eventfd_file_create(0, 0);
+	blk->ectx = eventfd_ctx_fileget(blk->efile);
+}
+
+static int vhost_blk_open(struct inode *inode, struct file *f)
+{
+	int ret = -ENOMEM;
+	struct vhost_blk *blk = kmalloc(sizeof *blk, GFP_KERNEL);
+	if (!blk)
+		goto err;
+
+	blk->vq.handle_kick = handle_guest_kick;
+	ret = vhost_dev_init(&blk->dev, &blk->vq, virtqueue_max);
+	if (ret < 0)
+		goto err_init;
+
+	ret = kernel_io_setup(MAX_EVENTS, &blk->ioctx);
+	if (ret < 0)
+		goto err_io_setup;
+
+	eventfd_setup(blk);
+	f->private_data = blk;
+	return ret;
+err_init:
+err_io_setup:
+	kfree(blk);
+err:
+	return ret;
+}
+
+static void eventfd_destroy(struct vhost_blk *blk)
+{
+	eventfd_ctx_put(blk->ectx);
+	fput(blk->efile);
+}
+
+static int vhost_blk_release(struct inode *inode, struct file *f)
+{
+	struct vhost_blk *blk = f->private_data;
+
+	blk_stop(blk);
+	blk_flush(blk);
+	vhost_dev_cleanup(&blk->dev);
+	/* Yet another flush? See comments in vhost_net_release() */
+	blk_flush(blk);
+	completion_thread_destory(blk);
+	eventfd_destroy(blk);
+	kfree(blk);
+
+	return 0;
+}
+
+static const struct file_operations vhost_blk_fops = {
+	.owner          = THIS_MODULE,
+	.release        = vhost_blk_release,
+	.open           = vhost_blk_open,
+	.unlocked_ioctl = vhost_blk_ioctl,
+	.llseek		= noop_llseek,
+};
+
+
+static struct miscdevice vhost_blk_misc = {
+	234,
+	"vhost-blk",
+	&vhost_blk_fops,
+};
+
+int vhost_blk_init(void)
+{
+	return misc_register(&vhost_blk_misc);
+}
+void vhost_blk_exit(void)
+{
+	misc_deregister(&vhost_blk_misc);
+}
+
+module_init(vhost_blk_init);
+module_exit(vhost_blk_exit);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Liu Yuan");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8e03379..9e17152 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,6 +12,7 @@
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <asm/atomic.h>
+#include <linux/virtio_blk.h>
 
 struct vhost_device;
 
@@ -174,6 +175,16 @@ enum {
 			 (1ULL << VHOST_F_LOG_ALL) |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+
+	VHOST_BLK_FEATURES =	(1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
+				(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+				(1ULL << VIRTIO_RING_F_EVENT_IDX) |
+				(1ULL << VIRTIO_BLK_F_SEG_MAX) |
+				(1ULL << VIRTIO_BLK_F_GEOMETRY) |
+				(1ULL << VIRTIO_BLK_F_TOPOLOGY) |
+				(1ULL << VIRTIO_BLK_F_SCSI) |
+				(1ULL << VIRTIO_BLK_F_BLK_SIZE),
+
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
diff --git a/fs/aio.c b/fs/aio.c
index e29ec48..534d396 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -215,7 +215,7 @@ static void ctx_rcu_free(struct rcu_head *head)
  *	Called when the last user of an aio context has gone away,
  *	and the struct needs to be freed.
  */
-static void __put_ioctx(struct kioctx *ctx)
+void __put_ioctx(struct kioctx *ctx)
 {
 	BUG_ON(ctx->reqs_active);
 
@@ -227,29 +227,12 @@ static void __put_ioctx(struct kioctx *ctx)
 	pr_debug("__put_ioctx: freeing %p\n", ctx);
 	call_rcu(&ctx->rcu_head, ctx_rcu_free);
 }
-
-static inline void get_ioctx(struct kioctx *kioctx)
-{
-	BUG_ON(atomic_read(&kioctx->users) <= 0);
-	atomic_inc(&kioctx->users);
-}
-
-static inline int try_get_ioctx(struct kioctx *kioctx)
-{
-	return atomic_inc_not_zero(&kioctx->users);
-}
-
-static inline void put_ioctx(struct kioctx *kioctx)
-{
-	BUG_ON(atomic_read(&kioctx->users) <= 0);
-	if (unlikely(atomic_dec_and_test(&kioctx->users)))
-		__put_ioctx(kioctx);
-}
+EXPORT_SYMBOL(__put_ioctx);
 
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+struct kioctx *ioctx_alloc(unsigned nr_events)
 {
 	struct mm_struct *mm;
 	struct kioctx *ctx;
@@ -327,6 +310,7 @@ out_freectx:
 	dprintk("aio: error allocating ioctx %p\n", ctx);
 	return ctx;
 }
+EXPORT_SYMBOL(ioctx_alloc);
 
 /* aio_cancel_all
  *	Cancels all outstanding aio requests on an aio context.  Used 
@@ -437,7 +421,7 @@ void exit_aio(struct mm_struct *mm)
  * This prevents races between the aio code path referencing the
  * req (after submitting it) and aio_complete() freeing the req.
  */
-static struct kiocb *__aio_get_req(struct kioctx *ctx)
+struct kiocb *__aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req = NULL;
 	struct aio_ring *ring;
@@ -480,7 +464,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
 	return req;
 }
 
-static inline struct kiocb *aio_get_req(struct kioctx *ctx)
+struct kiocb *aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req;
 	/* Handle a potential starvation case -- should be exceedingly rare as 
@@ -494,6 +478,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 	}
 	return req;
 }
+EXPORT_SYMBOL(aio_get_req);
 
 static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
 {
@@ -659,7 +644,7 @@ static inline int __queue_kicked_iocb(struct kiocb *iocb)
  * simplifies the coding of individual aio operations as
  * it avoids various potential races.
  */
-static ssize_t aio_run_iocb(struct kiocb *iocb)
+ssize_t aio_run_iocb(struct kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
 	ssize_t (*retry)(struct kiocb *);
@@ -753,6 +738,7 @@ out:
 	}
 	return ret;
 }
+EXPORT_SYMBOL(aio_run_iocb);
 
 /*
  * __aio_run_iocbs:
@@ -761,7 +747,7 @@ out:
  * Assumes it is operating within the aio issuer's mm
  * context.
  */
-static int __aio_run_iocbs(struct kioctx *ctx)
+int __aio_run_iocbs(struct kioctx *ctx)
 {
 	struct kiocb *iocb;
 	struct list_head run_list;
@@ -784,6 +770,7 @@ static int __aio_run_iocbs(struct kioctx *ctx)
 		return 1;
 	return 0;
 }
+EXPORT_SYMBOL(__aio_run_iocbs);
 
 static void aio_queue_work(struct kioctx * ctx)
 {
@@ -1074,7 +1061,7 @@ static inline void clear_timeout(struct aio_timeout *to)
 	del_singleshot_timer_sync(&to->timer);
 }
 
-static int read_events(struct kioctx *ctx,
+int read_events(struct kioctx *ctx,
 			long min_nr, long nr,
 			struct io_event __user *event,
 			struct timespec __user *timeout)
@@ -1190,11 +1177,12 @@ out:
 	destroy_timer_on_stack(&to.timer);
 	return i ? i : ret;
 }
+EXPORT_SYMBOL(read_events);
 
 /* Take an ioctx and remove it from the list of ioctx's.  Protects 
  * against races with itself via ->dead.
  */
-static void io_destroy(struct kioctx *ioctx)
+void io_destroy(struct kioctx *ioctx)
 {
 	struct mm_struct *mm = current->mm;
 	int was_dead;
@@ -1221,6 +1209,7 @@ static void io_destroy(struct kioctx *ioctx)
 	wake_up_all(&ioctx->wait);
 	put_ioctx(ioctx);	/* once for the lookup */
 }
+EXPORT_SYMBOL(io_destroy);
 
 /* sys_io_setup:
  *	Create an aio_context capable of receiving at least nr_events.
@@ -1423,7 +1412,7 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
  *	Performs the initial checks and aio retry method
  *	setup for the kiocb at the time of io submission.
  */
-static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
+ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 {
 	struct file *file = kiocb->ki_filp;
 	ssize_t ret = 0;
@@ -1513,6 +1502,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 
 	return 0;
 }
+EXPORT_SYMBOL(aio_setup_iocb);
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index d9a5917..6343bc9 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
 
 	return file;
 }
+EXPORT_SYMBOL_GPL(eventfd_file_create);
 
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7a8db41..d63bc04 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -214,6 +214,37 @@ struct mm_struct;
 extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
+extern void __put_ioctx(struct kioctx *ctx);
+extern struct kioctx *ioctx_alloc(unsigned nr_events);
+extern struct kiocb *aio_get_req(struct kioctx *ctx);
+extern ssize_t aio_run_iocb(struct kiocb *iocb);
+extern int __aio_run_iocbs(struct kioctx *ctx);
+extern int read_events(struct kioctx *ctx,
+                        long min_nr, long nr,
+                        struct io_event __user *event,
+                        struct timespec __user *timeout);
+extern void io_destroy(struct kioctx *ioctx);
+extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
+extern void __put_ioctx(struct kioctx *ctx);
+
+static inline void get_ioctx(struct kioctx *kioctx)
+{
+        BUG_ON(atomic_read(&kioctx->users) <= 0);
+        atomic_inc(&kioctx->users);
+}
+
+static inline int try_get_ioctx(struct kioctx *kioctx)
+{
+        return atomic_inc_not_zero(&kioctx->users);
+}
+
+static inline void put_ioctx(struct kioctx *kioctx)
+{
+        BUG_ON(atomic_read(&kioctx->users) <= 0);
+        if (unlikely(atomic_dec_and_test(&kioctx->users)))
+                __put_ioctx(kioctx);
+}
+
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline int aio_put_req(struct kiocb *iocb) { return 0; }
-- 
1.7.5.1
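
The request decoding done by handle_kick() and do_request() in the patch above can be tried out in user space. The following is only a sketch under stated assumptions: struct blk_req_hdr mirrors the field layout of struct virtio_blk_outhdr, the REQ_T_* values are the virtio-blk request type numbers, and the names blk_req_hdr, SECTOR_SHIFT, req_offset and data_vecs are invented for illustration.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9		/* virtio-blk sectors are 512 bytes */

/* Same field layout as struct virtio_blk_outhdr: 16 bytes in total. */
struct blk_req_hdr {
	uint32_t type;
	uint32_t ioprio;
	uint64_t sector;
};

/* Request types handled by the patch (virtio-blk ABI values). */
enum {
	REQ_T_IN     = 0,	/* read */
	REQ_T_OUT    = 1,	/* write */
	REQ_T_FLUSH  = 4,
	REQ_T_GET_ID = 8,
};

/* Byte offset into the backing file, as in `hdr->sector << 9`. */
uint64_t req_offset(const struct blk_req_hdr *hdr)
{
	return hdr->sector << SECTOR_SHIFT;
}

/*
 * Number of data iovecs in a request.
 * For reads, the "in" descriptors are the data buffers plus the status
 * byte; for writes, the "out" descriptors are the header plus the data
 * buffers. Either way, one descriptor is not data.
 */
unsigned data_vecs(uint32_t type, unsigned out, unsigned in)
{
	if (type == REQ_T_IN || type == REQ_T_GET_ID)
		return in - 1;
	return out - 1;
}
```

For example, a read of sector 8 with one header descriptor, two data buffers and one status byte (out = 1, in = 3) maps to byte offset 4096 and two data iovecs.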



* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 14:29 ` [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk Liu Yuan
@ 2011-07-28 14:47   ` Christoph Hellwig
  2011-07-29 11:19     ` Liu Yuan
  2011-07-28 15:18   ` Stefan Hajnoczi
  2011-07-28 15:22   ` Michael S. Tsirkin
  2 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2011-07-28 14:47 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel

On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:
> From: Liu Yuan <tailai.ly@taobao.com>
> 
> Vhost-blk driver is an in-kernel accelerator, intercepting the
> IO requests from KVM virtio-capable guests. It is based on the
> vhost infrastructure.
> 
> This is supposed to be a module over latest kernel tree, but it
> needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
> So currently, after applying the patch, you need to *recompile*
> the kernel.
> 
> Usage:
> $kernel-src: make M=drivers/vhost
> $kernel-src: sudo insmod drivers/vhost/vhost_blk.ko
> 
> After insmod, you'll see /dev/vhost-blk created. done!

You'll need to send the changes for existing code separately.

If you're going mostly for raw blockdevice access just calling
submit_bio will shave even more overhead off, and simplify the
code a lot.


* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 14:47   ` Christoph Hellwig
@ 2011-07-29 11:19     ` Liu Yuan
  0 siblings, 0 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-29 11:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel

On 07/28/2011 10:47 PM, Christoph Hellwig wrote:
> On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:
>> From: Liu Yuan<tailai.ly@taobao.com>
>>
>> Vhost-blk driver is an in-kernel accelerator, intercepting the
>> IO requests from KVM virtio-capable guests. It is based on the
>> vhost infrastructure.
>>
>> This is supposed to be a module over latest kernel tree, but it
>> needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
>> So currently, after applying the patch, you need to *recompile*
>> the kernel.
>>
>> Usage:
>> $kernel-src: make M=drivers/vhost
>> $kernel-src: sudo insmod drivers/vhost/vhost_blk.ko
>>
>> After insmod, you'll see /dev/vhost-blk created. done!
> You'll need to send the changes for existing code separately.
>
Thanks for the reminder.

> If you're going mostly for raw blockdevice access just calling
> submit_bio will shave even more overhead off, and simplify the
> code a lot.
Yes, sounds cool, I'll give it a try.

Yuan


* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 14:29 ` [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk Liu Yuan
  2011-07-28 14:47   ` Christoph Hellwig
@ 2011-07-28 15:18   ` Stefan Hajnoczi
  2011-07-28 15:22   ` Michael S. Tsirkin
  2 siblings, 0 replies; 54+ messages in thread
From: Stefan Hajnoczi @ 2011-07-28 15:18 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel

On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan <namei.unix@gmail.com> wrote:
> +static int blk_completion_worker(void *priv)
> +{

Do you really need this?  How about using the vhost poll helper to
observe the eventfd?  Then you can drop your own worker thread code
and simply have a work function to handle all pending completion
events.

Stefan


* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 14:29 ` [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk Liu Yuan
  2011-07-28 14:47   ` Christoph Hellwig
  2011-07-28 15:18   ` Stefan Hajnoczi
@ 2011-07-28 15:22   ` Michael S. Tsirkin
  2011-07-29 15:09     ` Liu Yuan
                       ` (2 more replies)
  2 siblings, 3 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2011-07-28 15:22 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Rusty Russell, Avi Kivity, kvm, linux-kernel

On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:
> From: Liu Yuan <tailai.ly@taobao.com>
> 
> Vhost-blk driver is an in-kernel accelerator, intercepting the
> IO requests from KVM virtio-capable guests. It is based on the
> vhost infrastructure.
> 
> This is supposed to be a module over latest kernel tree, but it
> needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
> So currently, after applying the patch, you need to *recompile*
> the kernel.
> 
> Usage:
> $kernel-src: make M=drivers/vhost
> $kernel-src: sudo insmod drivers/vhost/vhost_blk.ko
> 
> After insmod, you'll see /dev/vhost-blk created. done!
> 
> Signed-off-by: Liu Yuan <tailai.ly@taobao.com>

Thanks, this is an interesting patch.

There are some coding style issues in this patch; could you please
change the code to match the kernel coding style?

In particular, please prefix functions, macros, etc. with vhost_blk to
avoid confusion.

scripts/checkpatch.pl can find some, but not all, issues.

> ---
>  drivers/vhost/Makefile |    3 +
>  drivers/vhost/blk.c    |  568 ++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h  |   11 +
>  fs/aio.c               |   44 ++---
>  fs/eventfd.c           |    1 +
>  include/linux/aio.h    |   31 +++

As others said, core changes need to be split out
and get acks from relevant people.

Use scripts/get_maintainer.pl to get a list.


>  6 files changed, 631 insertions(+), 27 deletions(-)
>  create mode 100644 drivers/vhost/blk.c
> 
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> index 72dd020..31f8b2e 100644
> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -1,2 +1,5 @@
>  obj-$(CONFIG_VHOST_NET) += vhost_net.o
> +obj-m += vhost_blk.o
> +
>  vhost_net-y := vhost.o net.o
> +vhost_blk-y := vhost.o blk.o
> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
> new file mode 100644
> index 0000000..f3462be
> --- /dev/null
> +++ b/drivers/vhost/blk.c
> @@ -0,0 +1,568 @@
> +/* Copyright (C) 2011 Taobao, Inc.
> + * Author: Liu Yuan <tailai.ly@taobao.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * Vhost-blk driver is an in-kernel accelerator, intercepting the
> + * IO requests from KVM virtio-capable guests. It is based on the
> + * vhost infrastructure.
> + */
> +
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/virtio_net.h>
> +#include <linux/vhost.h>
> +#include <linux/eventfd.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/virtio_blk.h>
> +#include <linux/file.h>
> +#include <linux/mmu_context.h>
> +#include <linux/kthread.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/syscalls.h>
> +#include <linux/blkdev.h>
> +
> +#include "vhost.h"
> +
> +#define DEBUG 0
> +
> +#if DEBUG > 0
> +#define dprintk         printk
> +#else
> +#define dprintk(x...)   do { ; } while (0)
> +#endif

There are standard macros for these.

> +
> +enum {
> +	virtqueue_max = 1,
> +};
> +
> +#define MAX_EVENTS 128
> +
> +struct vhost_blk {
> +	struct vhost_virtqueue vq;
> +	struct vhost_dev dev;
> +	int should_stop;
> +	struct kioctx *ioctx;
> +	struct eventfd_ctx *ectx;
> +	struct file *efile;
> +	struct task_struct *worker;
> +};
> +
> +struct used_info {
> +	void *status;
> +	int head;
> +	int len;
> +};
> +
> +static struct io_event events[MAX_EVENTS];
> +
> +static void blk_flush(struct vhost_blk *blk)
> +{
> +       vhost_poll_flush(&blk->vq.poll);
> +}
> +
> +static long blk_set_features(struct vhost_blk *blk, u64 features)
> +{
> +	blk->dev.acked_features = features;
> +	return 0;
> +}
> +
> +static void blk_stop(struct vhost_blk *blk)
> +{
> +	struct vhost_virtqueue *vq = &blk->vq;
> +	struct file *f;
> +
> +	mutex_lock(&vq->mutex);
> +	f = rcu_dereference_protected(vq->private_data,
> +					lockdep_is_held(&vq->mutex));
> +	rcu_assign_pointer(vq->private_data, NULL);
> +	mutex_unlock(&vq->mutex);
> +
> +	if (f)
> +		fput(f);
> +}
> +
> +static long blk_set_backend(struct vhost_blk *blk, struct vhost_vring_file *backend)
> +{
> +	int idx = backend->index;
> +	struct vhost_virtqueue *vq = &blk->vq;
> +	struct file *file, *oldfile;
> +	int ret;
> +
> +	mutex_lock(&blk->dev.mutex);
> +	ret = vhost_dev_check_owner(&blk->dev);
> +	if (ret)
> +		goto err_dev;
> +	if (idx >= virtqueue_max) {
> +		ret = -ENOBUFS;
> +		goto err_dev;
> +	}
> +
> +	mutex_lock(&vq->mutex);
> +
> +	if (!vhost_vq_access_ok(vq)) {
> +		ret = -EFAULT;
> +		goto err_vq;
> +	}

vhost-net uses a backend fd of -1 to remove a backend.
I think that's a good idea: it makes the operation reversible.

> +
> +	file = fget(backend->fd);

We need to verify that the file type passed makes sense.
For example, it's possible to create reference loops
by passing the vhost-blk fd.
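
To make the two points above concrete, here is a rough userspace model, assuming a backend is acceptable only if it is a readable regular file or block device. The names backend_slot, backend_fd_check(), backend_set() and BACKEND_NONE are made up for this sketch; in the driver the equivalent checks would be made on the struct file, not with fstat(2).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

#define BACKEND_NONE (-1)	/* hypothetical: "no backend attached" */

struct backend_slot {
	int fd;			/* BACKEND_NONE or an owned descriptor */
};

/* Accept only readable regular files and block devices. */
int backend_fd_check(int fd)
{
	struct stat st;
	int flags;

	if (fstat(fd, &st) < 0)
		return -errno;
	if (!S_ISREG(st.st_mode) && !S_ISBLK(st.st_mode))
		return -EINVAL;	/* rejects pipes, sockets, char devices */
	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -errno;
	if ((flags & O_ACCMODE) == O_WRONLY)
		return -EINVAL;	/* a disk backend must be readable */
	return 0;
}

/*
 * Install new_fd, or detach when new_fd is BACKEND_NONE.  On success
 * the previous descriptor is closed; on failure the slot is unchanged.
 */
int backend_set(struct backend_slot *slot, int new_fd)
{
	int ret;

	assert(slot != NULL);
	if (new_fd != BACKEND_NONE) {
		ret = backend_fd_check(new_fd);
		if (ret < 0)
			return ret;
	}
	if (slot->fd != BACKEND_NONE && slot->fd != new_fd)
		close(slot->fd);
	slot->fd = new_fd;
	return 0;
}
```

Validating before touching the slot keeps a rejected request from disturbing a working backend, and the BACKEND_NONE path gives userspace a way to undo an earlier attach.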


> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto err_vq;
> +	}
> +
> +	oldfile = rcu_dereference_protected(vq->private_data,
> +						lockdep_is_held(&vq->mutex));
> +	if (file != oldfile)
> +		rcu_assign_pointer(vq->private_data, file);
> +
> +	mutex_unlock(&vq->mutex);
> +
> +	if (oldfile) {
> +		blk_flush(blk);
> +		fput(oldfile);
> +	}
> +
> +	mutex_unlock(&blk->dev.mutex);
> +	return 0;
> +err_vq:
> +	mutex_unlock(&vq->mutex);
> +err_dev:
> +	mutex_unlock(&blk->dev.mutex);
> +	return ret;
> +}
> +
> +static long blk_reset_owner(struct vhost_blk *b)
> +{
> +	int ret;
> +
> +        mutex_lock(&b->dev.mutex);
> +        ret = vhost_dev_check_owner(&b->dev);
> +        if (ret)
> +                goto err;
> +        blk_stop(b);
> +        blk_flush(b);
> +        ret = vhost_dev_reset_owner(&b->dev);
> +	if (b->worker) {
> +		b->should_stop = 1;
> +		smp_mb();
> +		eventfd_signal(b->ectx, 1);
> +	}
> +err:
> +        mutex_unlock(&b->dev.mutex);
> +        return ret;
> +}
> +
> +static int kernel_io_setup(unsigned nr_events, struct kioctx **ioctx)
> +{
> +	int ret = 0;
> +	*ioctx = ioctx_alloc(nr_events);
> +	if (IS_ERR(ioctx))
> +		ret = PTR_ERR(ioctx);
> +	return ret;
> +}
> +
> +static inline int kernel_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event *event,
> +			struct timespec *ts)
> +{
> +        mm_segment_t old_fs;
> +        int ret;
> +
> +        old_fs = get_fs();
> +        set_fs(get_ds());
> +	ret = read_events(ctx, min_nr, nr, event, ts);
> +        set_fs(old_fs);
> +
> +	return ret;
> +}
> +
> +static inline ssize_t io_event_ret(struct io_event *ev)
> +{
> +    return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
> +}
> +
> +static inline void aio_prep_req(struct kiocb *iocb, struct eventfd_ctx *ectx, struct file *file,
> +		struct iovec *iov, int nvecs, u64 offset, int opcode, struct used_info *ui)
> +{
> +	iocb->ki_filp = file;
> +	iocb->ki_eventfd = ectx;
> +	iocb->ki_pos = offset;
> +	iocb->ki_buf = (void *)iov;
> +	iocb->ki_left = iocb->ki_nbytes = nvecs;
> +	iocb->ki_opcode = opcode;
> +	iocb->ki_obj.user = ui;
> +}
> +
> +static inline int kernel_io_submit(struct vhost_blk *blk, struct iovec *iov, u64 nvecs, loff_t pos, int opcode, int head, int len)
> +{
> +	int ret = -EAGAIN;
> +	struct kiocb *req;
> +	struct kioctx *ioctx = blk->ioctx;
> +	struct used_info *ui = kzalloc(sizeof *ui, GFP_KERNEL);
> +	struct file *f = blk->vq.private_data;
> +
> +	try_get_ioctx(ioctx);
> +	atomic_long_inc_not_zero(&f->f_count);
> +	eventfd_ctx_get(blk->ectx);
> +
> +
> +	req = aio_get_req(ioctx); /* return 2 refs of req*/
> +	if (unlikely(!req))
> +		goto out;
> +
> +	ui->head = head;
> +	ui->status = blk->vq.iov[nvecs + 1].iov_base;
> +	ui->len = len;
> +	aio_prep_req(req, blk->ectx, f, iov, nvecs, pos, opcode, ui);
> +
> +	ret = aio_setup_iocb(req, 0);
> +	if (unlikely(ret))
> +		goto out_put_req;
> +
> +	spin_lock_irq(&ioctx->ctx_lock);
> +	if (unlikely(ioctx->dead)) {
> +		spin_unlock_irq(&ioctx->ctx_lock);
> +		ret = -EINVAL;
> +		goto out_put_req;
> +	}
> +
> +	aio_run_iocb(req);
> +	if (!list_empty(&ioctx->run_list)) {
> +		while (__aio_run_iocbs(ioctx))
> +			;
> +	}
> +	spin_unlock_irq(&ioctx->ctx_lock);
> +
> +	aio_put_req(req);
> +	put_ioctx(blk->ioctx);
> +
> +	return ret;
> +
> +out_put_req:
> +	aio_put_req(req);
> +	aio_put_req(req);
> +out:
> +	put_ioctx(blk->ioctx);
> +	return ret;
> +}
> +
> +static int blk_completion_worker(void *priv)
> +{
> +	struct vhost_blk *blk = priv;
> +	u64 count;
> +	int ret;
> +
> +	use_mm(blk->dev.mm);
> +	for (;;) {

It would be nicer to reuse the worker infrastructure
from vhost.c. In particular this one ignores cgroups that
the owner belongs to if any.
Does this one do anything that vhost.c doesn't?

> +		struct timespec ts = { 0 };
> +		int i, nr;
> +
> +		do {
> +		ret = eventfd_ctx_read(blk->ectx, 0, &count);
> +		} while (unlikely(ret == -ERESTARTSYS));
> +
> +		if (unlikely(blk->should_stop))
> +			break;
> +
> +		do {
> +		nr = kernel_read_events(blk->ioctx, count, MAX_EVENTS, events, &ts);
> +		} while (unlikely(nr == -EINTR));
> +		dprintk("%s, count %llu, nr %d\n", __func__, count, nr);
> +
> +		if (unlikely(nr < 0))
> +			continue;
> +
> +		for (i = 0; i < nr; i++) {
> +			struct used_info *u = (struct used_info *)events[i].obj;
> +			int len, status;
> +
> +			dprintk("%s, head %d complete in %d\n", __func__, u->head, i);
> +			len = io_event_ret(&events[i]);
> +			//status = u->len == len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
> +			status = len > 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
> +			if (copy_to_user(u->status, &status, sizeof status)) {
> +				vq_err(&blk->vq, "%s failed to write status\n", __func__);
> +				BUG(); /* FIXME: maybe a bit radical? */

On an invalid userspace address?
You may very well say so.

> +			}
> +			vhost_add_used(&blk->vq, u->head, u->len);
> +			kfree(u);
> +		}
> +
> +		vhost_signal(&blk->dev, &blk->vq);
> +	}
> +	unuse_mm(blk->dev.mm);
> +	return 0;
> +}
> +
> +static int completion_thread_setup(struct vhost_blk *blk)
> +{
> +	int ret = 0;
> +	struct task_struct *worker;
> +	worker = kthread_create(blk_completion_worker, blk, "vhost-blk-%d", current->pid);
> +	if (IS_ERR(worker)) {
> +		ret = PTR_ERR(worker);
> +		goto err;
> +	}
> +	blk->worker = worker;
> +	blk->should_stop = 0;
> +	smp_mb();
> +	wake_up_process(worker);
> +err:
> +	return ret;
> +}
> +
> +static void completion_thread_destory(struct vhost_blk *blk)
> +{
> +	if (blk->worker) {
> +		blk->should_stop = 1;
> +		smp_mb();
> +		eventfd_signal(blk->ectx, 1);
> +	}
> +}
> +
> +
> +static long blk_set_owner(struct vhost_blk *blk)
> +{
> +	return completion_thread_setup(blk);
> +}
> +
> +static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
> +		unsigned long arg)
> +{
> +	struct vhost_blk *blk = f->private_data;
> +	struct vhost_vring_file backend;
> +	u64 features = VHOST_BLK_FEATURES;
> +	int ret = -EFAULT;
> +
> +	switch (ioctl) {
> +		case VHOST_NET_SET_BACKEND:
> +			if(copy_from_user(&backend, (void __user *)arg, sizeof backend))
> +				break;
> +			ret = blk_set_backend(blk, &backend);
> +			break;

Please create your own ioctl for this one.

> +		case VHOST_GET_FEATURES:
> +			features = VHOST_BLK_FEATURES;
> +			if (copy_to_user((void __user *)arg , &features, sizeof features))
> +				break;
> +			ret = 0;
> +			break;
> +		case VHOST_SET_FEATURES:
> +			if (copy_from_user(&features, (void __user *)arg, sizeof features))
> +				break;
> +			if (features & ~VHOST_BLK_FEATURES) {
> +				ret = -EOPNOTSUPP;
> +				break;
> +			}
> +			ret = blk_set_features(blk, features);
> +			break;
> +		case VHOST_RESET_OWNER:
> +			ret = blk_reset_owner(blk);
> +			break;
> +		default:
> +			mutex_lock(&blk->dev.mutex);
> +			ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
> +			if (!ret && ioctl == VHOST_SET_OWNER)
> +				ret = blk_set_owner(blk);
> +			blk_flush(blk);
> +			mutex_unlock(&blk->dev.mutex);
> +			break;
> +	}
> +	return ret;
> +}
> +
> +#define BLK_HDR 0
> +#define BLK_HDR_LEN 16
> +
> +static inline int do_request(struct vhost_virtqueue *vq, struct virtio_blk_outhdr *hdr,
> +		u64 nr_vecs, int head)
> +{
> +	struct file *f = vq->private_data;
> +	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
> +	struct iovec *iov = &vq->iov[BLK_HDR + 1];
> +	loff_t pos = hdr->sector << 9;
> +	int ret = 0, len = 0, status;
> +//	int i;
> +
> +	dprintk("sector %llu, num %lu, type %d\n", hdr->sector, iov->iov_len / 512, hdr->type);
> +	//Guest virtio-blk driver doesn't use len currently.
> +	//for (i = 0; i < nr_vecs; i++) {
> +	//	len += iov[i].iov_len;
> +	//}
> +	switch (hdr->type) {
> +	case VIRTIO_BLK_T_OUT:
> +		kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PWRITEV, head, len);
> +		break;
> +	case VIRTIO_BLK_T_IN:
> +		kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PREADV, head, len);
> +		break;
> +	case VIRTIO_BLK_T_FLUSH:
> +		ret = vfs_fsync(f, 1);
> +		/* fall through */
> +	case VIRTIO_BLK_T_GET_ID:
> +		status = ret < 0 ? VIRTIO_BLK_S_IOERR :VIRTIO_BLK_S_OK;
> +		if ((vq->iov[nr_vecs + 1].iov_len != 1))
> +			BUG();

Why is this one a bug?
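
A descriptor length supplied by the guest is untrusted input, so a bad value is better reported as an error than turned into a host crash. The sketch below is a userspace illustration under that assumption: put_status() and the BLK_S_* names are hypothetical, with values chosen to mirror VIRTIO_BLK_S_OK (0) and VIRTIO_BLK_S_IOERR (1). It also writes exactly one byte, since the virtio-blk status field is a single byte, whereas the quoted code appears to copy sizeof(int) bytes into it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

#define BLK_S_OK	0	/* mirrors VIRTIO_BLK_S_OK */
#define BLK_S_IOERR	1	/* mirrors VIRTIO_BLK_S_IOERR */

/*
 * Write the one-byte request status into the guest-provided buffer.
 * Returns 0 on success or -EINVAL when the descriptor is malformed;
 * the caller can then fail the request instead of crashing.
 */
int put_status(const struct iovec *status_iov, int io_ret)
{
	uint8_t status = io_ret < 0 ? BLK_S_IOERR : BLK_S_OK;

	assert(status_iov != NULL);
	if (status_iov->iov_base == NULL ||
	    status_iov->iov_len != sizeof(status))
		return -EINVAL;
	memcpy(status_iov->iov_base, &status, sizeof(status));
	return 0;
}
```

In the driver the error return would typically be paired with vq_err() and discarding the descriptor, as the default case of do_request() already does.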


> +
> +		if (copy_to_user(vq->iov[nr_vecs + 1].iov_base, &status, sizeof status)) {
> +				vq_err(vq, "%s failed to write status!\n", __func__);
> +				vhost_discard_vq_desc(vq, 1);
> +				ret = -EFAULT;
> +				break;
> +			}
> +
> +		vhost_add_used_and_signal(&blk->dev, vq, head, ret);
> +		break;
> +	default:
> +		pr_info("%s, unsupported request type %d\n", __func__, hdr->type);
> +		vhost_discard_vq_desc(vq, 1);
> +		ret = -EFAULT;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static inline void handle_kick(struct vhost_blk *blk)
> +{
> +	struct vhost_virtqueue *vq = &blk->vq;
> +	struct virtio_blk_outhdr hdr;
> +	u64 nr_vecs;
> +	int in, out, head;
> +	struct blk_plug plug;
> +
> +	mutex_lock(&vq->mutex);
> +	vhost_disable_notify(&blk->dev, vq);
> +
> +	blk_start_plug(&plug);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
> +				ARRAY_SIZE(vq->iov),
> +				&out, &in, NULL, NULL);
> +		/* No available descriptors from Guest? */
> +		if (head == vq->num) {
> +			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
> +				vhost_disable_notify(&blk->dev, vq);
> +				continue;
> +			}
> +			break;
> +		}
> +		if (unlikely(head < 0))
> +			break;
> +
> +		dprintk("head %d, in %d, out %d\n", head, in, out);
> +		if(unlikely(vq->iov[BLK_HDR].iov_len != BLK_HDR_LEN)) {
> +			vq_err(vq, "%s bad block header lengh!\n", __func__);
> +			vhost_discard_vq_desc(vq, 1);
> +			break;
> +		}
> +
> +		if (copy_from_user(&hdr, vq->iov[BLK_HDR].iov_base, sizeof hdr)) {
> +			vq_err(vq, "%s failed to get block header!\n", __func__);
> +			vhost_discard_vq_desc(vq, 1);
> +			break;
> +		}
> +
> +		if (hdr.type == VIRTIO_BLK_T_IN || hdr.type == VIRTIO_BLK_T_GET_ID)
> +			nr_vecs = in - 1;
> +		else
> +			nr_vecs = out - 1;
> +
> +		if (do_request(vq, &hdr, nr_vecs, head) < 0)
> +			break;
> +	}
> +	blk_finish_plug(&plug);
> +	mutex_unlock(&vq->mutex);
> +}
> +
> +static void handle_guest_kick(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, poll.work);
> +	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
> +	handle_kick(blk);
> +}
> +
> +static void eventfd_setup(struct vhost_blk *blk)
> +{
> +	blk->efile = eventfd_file_create(0, 0);
> +	blk->ectx = eventfd_ctx_fileget(blk->efile);
> +}
> +
> +static int vhost_blk_open(struct inode *inode, struct file *f)
> +{
> +	int ret = -ENOMEM;
> +	struct vhost_blk *blk = kmalloc(sizeof *blk, GFP_KERNEL);
> +	if (!blk)
> +		goto err;
> +
> +	blk->vq.handle_kick = handle_guest_kick;
> +	ret = vhost_dev_init(&blk->dev, &blk->vq, virtqueue_max);
> +	if (ret < 0)
> +		goto err_init;
> +
> +	ret = kernel_io_setup(MAX_EVENTS, &blk->ioctx);
> +	if (ret < 0)
> +		goto err_io_setup;
> +
> +	eventfd_setup(blk);
> +	f->private_data = blk;
> +	return ret;
> +err_init:
> +err_io_setup:
> +	kfree(blk);
> +err:
> +	return ret;
> +}
> +
> +static void eventfd_destroy(struct vhost_blk *blk)
> +{
> +	eventfd_ctx_put(blk->ectx);
> +	fput(blk->efile);
> +}
> +
> +static int vhost_blk_release(struct inode *inode, struct file *f)
> +{
> +	struct vhost_blk *blk = f->private_data;
> +
> +	blk_stop(blk);
> +	blk_flush(blk);
> +	vhost_dev_cleanup(&blk->dev);
> +	/* Yet another flush? See comments in vhost_net_release() */
> +	blk_flush(blk);
> +	completion_thread_destory(blk);
> +	eventfd_destroy(blk);
> +	kfree(blk);
> +
> +	return 0;
> +}
> +
> +const static struct file_operations vhost_blk_fops = {
> +	.owner          = THIS_MODULE,
> +	.release        = vhost_blk_release,
> +	.open           = vhost_blk_open,
> +	.unlocked_ioctl = vhost_blk_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +
> +static struct miscdevice vhost_blk_misc = {
> +	234,

Don't get a major unless you really must.

> +	"vhost-blk",
> +	&vhost_blk_fops,

And use C99 initializers.
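
For illustration, this is roughly what the designated-initializer form looks like. The sketch uses userspace stand-in types (demo_fops, demo_miscdevice) so that it compiles on its own; DEMO_DYNAMIC_MINOR is assumed to mirror the kernel's MISC_DYNAMIC_MINOR, which asks the misc core for any free minor instead of hard-coding 234.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors the kernel's MISC_DYNAMIC_MINOR (255). */
#define DEMO_DYNAMIC_MINOR 255

/* Hypothetical userspace stand-ins for file_operations and miscdevice. */
struct demo_fops {
	int (*open)(void);
};

struct demo_miscdevice {
	int minor;
	const char *name;
	const struct demo_fops *fops;
};

static int demo_open(void)
{
	return 0;
}

static const struct demo_fops demo_blk_fops = {
	.open = demo_open,
};

/* Each member is named, so reordering the struct cannot misassign a value. */
struct demo_miscdevice demo_blk_misc = {
	.minor = DEMO_DYNAMIC_MINOR,	/* ask for any free minor */
	.name  = "vhost-blk",
	.fops  = &demo_blk_fops,
};

/* True when the device asks the core to pick its minor number. */
int demo_wants_dynamic_minor(const struct demo_miscdevice *dev)
{
	assert(dev != NULL);
	return dev->minor == DEMO_DYNAMIC_MINOR;
}
```

With the real struct miscdevice the same three members would be set as .minor, .name and .fops; members left unnamed are zero-initialized.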

> +};
> +
> +int vhost_blk_init(void)
> +{
> +	return misc_register(&vhost_blk_misc);
> +}
> +void vhost_blk_exit(void)
> +{
> +	misc_deregister(&vhost_blk_misc);
> +}
> +
> +module_init(vhost_blk_init);
> +module_exit(vhost_blk_exit);
> +
> +MODULE_VERSION("0.0.1");
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Liu Yuan");
> +MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 8e03379..9e17152 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -12,6 +12,7 @@
>  #include <linux/virtio_config.h>
>  #include <linux/virtio_ring.h>
>  #include <asm/atomic.h>
> +#include <linux/virtio_blk.h>
>  
>  struct vhost_device;
>  
> @@ -174,6 +175,16 @@ enum {
>  			 (1ULL << VHOST_F_LOG_ALL) |
>  			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
>  			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
> +
> +	VHOST_BLK_FEATURES =	(1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
> +				(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
> +				(1ULL << VIRTIO_RING_F_EVENT_IDX) |
> +				(1ULL << VIRTIO_BLK_F_SEG_MAX) |
> +				(1ULL << VIRTIO_BLK_F_GEOMETRY) |
> +				(1ULL << VIRTIO_BLK_F_TOPOLOGY) |
> +				(1ULL << VIRTIO_BLK_F_SCSI) |
> +				(1ULL << VIRTIO_BLK_F_BLK_SIZE),
> +
>  };
>  
>  static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> diff --git a/fs/aio.c b/fs/aio.c
> index e29ec48..534d396 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -215,7 +215,7 @@ static void ctx_rcu_free(struct rcu_head *head)
>   *	Called when the last user of an aio context has gone away,
>   *	and the struct needs to be freed.
>   */
> -static void __put_ioctx(struct kioctx *ctx)
> +void __put_ioctx(struct kioctx *ctx)
>  {
>  	BUG_ON(ctx->reqs_active);
>  
> @@ -227,29 +227,12 @@ static void __put_ioctx(struct kioctx *ctx)
>  	pr_debug("__put_ioctx: freeing %p\n", ctx);
>  	call_rcu(&ctx->rcu_head, ctx_rcu_free);
>  }
> -
> -static inline void get_ioctx(struct kioctx *kioctx)
> -{
> -	BUG_ON(atomic_read(&kioctx->users) <= 0);
> -	atomic_inc(&kioctx->users);
> -}
> -
> -static inline int try_get_ioctx(struct kioctx *kioctx)
> -{
> -	return atomic_inc_not_zero(&kioctx->users);
> -}
> -
> -static inline void put_ioctx(struct kioctx *kioctx)
> -{
> -	BUG_ON(atomic_read(&kioctx->users) <= 0);
> -	if (unlikely(atomic_dec_and_test(&kioctx->users)))
> -		__put_ioctx(kioctx);
> -}
> +EXPORT_SYMBOL(__put_ioctx);
>  
>  /* ioctx_alloc
>   *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
>   */
> -static struct kioctx *ioctx_alloc(unsigned nr_events)
> +struct kioctx *ioctx_alloc(unsigned nr_events)
>  {
>  	struct mm_struct *mm;
>  	struct kioctx *ctx;
> @@ -327,6 +310,7 @@ out_freectx:
>  	dprintk("aio: error allocating ioctx %p\n", ctx);
>  	return ctx;
>  }
> +EXPORT_SYMBOL(ioctx_alloc);
>  
>  /* aio_cancel_all
>   *	Cancels all outstanding aio requests on an aio context.  Used 
> @@ -437,7 +421,7 @@ void exit_aio(struct mm_struct *mm)
>   * This prevents races between the aio code path referencing the
>   * req (after submitting it) and aio_complete() freeing the req.
>   */
> -static struct kiocb *__aio_get_req(struct kioctx *ctx)
> +struct kiocb *__aio_get_req(struct kioctx *ctx)
>  {
>  	struct kiocb *req = NULL;
>  	struct aio_ring *ring;
> @@ -480,7 +464,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
>  	return req;
>  }
>  
> -static inline struct kiocb *aio_get_req(struct kioctx *ctx)
> +struct kiocb *aio_get_req(struct kioctx *ctx)
>  {
>  	struct kiocb *req;
>  	/* Handle a potential starvation case -- should be exceedingly rare as 
> @@ -494,6 +478,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
>  	}
>  	return req;
>  }
> +EXPORT_SYMBOL(aio_get_req);
>  
>  static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
>  {
> @@ -659,7 +644,7 @@ static inline int __queue_kicked_iocb(struct kiocb *iocb)
>   * simplifies the coding of individual aio operations as
>   * it avoids various potential races.
>   */
> -static ssize_t aio_run_iocb(struct kiocb *iocb)
> +ssize_t aio_run_iocb(struct kiocb *iocb)
>  {
>  	struct kioctx	*ctx = iocb->ki_ctx;
>  	ssize_t (*retry)(struct kiocb *);
> @@ -753,6 +738,7 @@ out:
>  	}
>  	return ret;
>  }
> +EXPORT_SYMBOL(aio_run_iocb);
>  
>  /*
>   * __aio_run_iocbs:
> @@ -761,7 +747,7 @@ out:
>   * Assumes it is operating within the aio issuer's mm
>   * context.
>   */
> -static int __aio_run_iocbs(struct kioctx *ctx)
> +int __aio_run_iocbs(struct kioctx *ctx)
>  {
>  	struct kiocb *iocb;
>  	struct list_head run_list;
> @@ -784,6 +770,7 @@ static int __aio_run_iocbs(struct kioctx *ctx)
>  		return 1;
>  	return 0;
>  }
> +EXPORT_SYMBOL(__aio_run_iocbs);
>  
>  static void aio_queue_work(struct kioctx * ctx)
>  {
> @@ -1074,7 +1061,7 @@ static inline void clear_timeout(struct aio_timeout *to)
>  	del_singleshot_timer_sync(&to->timer);
>  }
>  
> -static int read_events(struct kioctx *ctx,
> +int read_events(struct kioctx *ctx,
>  			long min_nr, long nr,
>  			struct io_event __user *event,
>  			struct timespec __user *timeout)
> @@ -1190,11 +1177,12 @@ out:
>  	destroy_timer_on_stack(&to.timer);
>  	return i ? i : ret;
>  }
> +EXPORT_SYMBOL(read_events);
>  
>  /* Take an ioctx and remove it from the list of ioctx's.  Protects 
>   * against races with itself via ->dead.
>   */
> -static void io_destroy(struct kioctx *ioctx)
> +void io_destroy(struct kioctx *ioctx)
>  {
>  	struct mm_struct *mm = current->mm;
>  	int was_dead;
> @@ -1221,6 +1209,7 @@ static void io_destroy(struct kioctx *ioctx)
>  	wake_up_all(&ioctx->wait);
>  	put_ioctx(ioctx);	/* once for the lookup */
>  }
> +EXPORT_SYMBOL(io_destroy);
>  
>  /* sys_io_setup:
>   *	Create an aio_context capable of receiving at least nr_events.
> @@ -1423,7 +1412,7 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
>   *	Performs the initial checks and aio retry method
>   *	setup for the kiocb at the time of io submission.
>   */
> -static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
> +ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
>  {
>  	struct file *file = kiocb->ki_filp;
>  	ssize_t ret = 0;
> @@ -1513,6 +1502,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL(aio_setup_iocb);
>  
>  static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>  			 struct iocb *iocb, bool compat)
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index d9a5917..6343bc9 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>  
>  	return file;
>  }
> +EXPORT_SYMBOL_GPL(eventfd_file_create);

You can avoid the need for this export if you pass
the eventfd in from userspace.

>  
>  SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
>  {
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index 7a8db41..d63bc04 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -214,6 +214,37 @@ struct mm_struct;
>  extern void exit_aio(struct mm_struct *mm);
>  extern long do_io_submit(aio_context_t ctx_id, long nr,
>  			 struct iocb __user *__user *iocbpp, bool compat);
> +extern void __put_ioctx(struct kioctx *ctx);
> +extern struct kioctx *ioctx_alloc(unsigned nr_events);
> +extern struct kiocb *aio_get_req(struct kioctx *ctx);
> +extern ssize_t aio_run_iocb(struct kiocb *iocb);
> +extern int __aio_run_iocbs(struct kioctx *ctx);
> +extern int read_events(struct kioctx *ctx,
> +                        long min_nr, long nr,
> +                        struct io_event __user *event,
> +                        struct timespec __user *timeout);
> +extern void io_destroy(struct kioctx *ioctx);
> +extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
> +extern void __put_ioctx(struct kioctx *ctx);
> +
> +static inline void get_ioctx(struct kioctx *kioctx)
> +{
> +        BUG_ON(atomic_read(&kioctx->users) <= 0);
> +        atomic_inc(&kioctx->users);
> +}
> +
> +static inline int try_get_ioctx(struct kioctx *kioctx)
> +{
> +        return atomic_inc_not_zero(&kioctx->users);
> +}
> +
> +static inline void put_ioctx(struct kioctx *kioctx)
> +{
> +        BUG_ON(atomic_read(&kioctx->users) <= 0);
> +        if (unlikely(atomic_dec_and_test(&kioctx->users)))
> +                __put_ioctx(kioctx);
> +}
> +
>  #else
>  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>  static inline int aio_put_req(struct kiocb *iocb) { return 0; }
> -- 
> 1.7.5.1


* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 15:22   ` Michael S. Tsirkin
@ 2011-07-29 15:09     ` Liu Yuan
  2011-08-01  6:25     ` Liu Yuan
  2011-08-11 19:59     ` Dongsu Park
  2 siblings, 0 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-29 15:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Rusty Russell, Avi Kivity, kvm, linux-kernel

On 07/28/2011 11:22 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:
>> From: Liu Yuan<tailai.ly@taobao.com>
>>
>> Vhost-blk driver is an in-kernel accelerator, intercepting the
>> IO requests from KVM virtio-capable guests. It is based on the
>> vhost infrastructure.
>>
>> This is supposed to be a module over latest kernel tree, but it
>> needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
>> So currently, after applying the patch, you need to *recompile*
>> the kernel.
>>
>> Usage:
>> $kernel-src: make M=drivers/vhost
>> $kernel-src: sudo insmod drivers/vhost/vhost_blk.ko
>>
>> After insmod, you'll see /dev/vhost-blk created. done!
>>
>> Signed-off-by: Liu Yuan<tailai.ly@taobao.com>
> Thanks, this is an interesting patch.
>
> There are some coding style issues in this patch; could you please
> change the code to match the kernel coding style?
>
> In particular, please prefix functions, macros, etc. with vhost_blk to
> avoid confusion.
>
> scripts/checkpatch.pl can find some, but not all, issues.
>
>> ---
>>   drivers/vhost/Makefile |    3 +
>>   drivers/vhost/blk.c    |  568 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   drivers/vhost/vhost.h  |   11 +
>>   fs/aio.c               |   44 ++---
>>   fs/eventfd.c           |    1 +
>>   include/linux/aio.h    |   31 +++
> As others said, core changes need to be split out
> and get acks from relevant people.
>
> Use scripts/get_maintainer.pl to get a list.
>
>
>>   6 files changed, 631 insertions(+), 27 deletions(-)
>>   create mode 100644 drivers/vhost/blk.c
>>
>> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
>> index 72dd020..31f8b2e 100644
>> --- a/drivers/vhost/Makefile
>> +++ b/drivers/vhost/Makefile
>> @@ -1,2 +1,5 @@
>>   obj-$(CONFIG_VHOST_NET) += vhost_net.o
>> +obj-m += vhost_blk.o
>> +
>>   vhost_net-y := vhost.o net.o
>> +vhost_blk-y := vhost.o blk.o
>> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
>> new file mode 100644
>> index 0000000..f3462be
>> --- /dev/null
>> +++ b/drivers/vhost/blk.c
>> @@ -0,0 +1,568 @@
>> +/* Copyright (C) 2011 Taobao, Inc.
>> + * Author: Liu Yuan<tailai.ly@taobao.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + *
>> + * Vhost-blk driver is an in-kernel accelerator, intercepting the
>> + * IO requests from KVM virtio-capable guests. It is based on the
>> + * vhost infrastructure.
>> + */
>> +
>> +#include<linux/miscdevice.h>
>> +#include<linux/module.h>
>> +#include<linux/virtio_net.h>
>> +#include<linux/vhost.h>
>> +#include<linux/eventfd.h>
>> +#include<linux/mutex.h>
>> +#include<linux/workqueue.h>
>> +#include<linux/virtio_blk.h>
>> +#include<linux/file.h>
>> +#include<linux/mmu_context.h>
>> +#include<linux/kthread.h>
>> +#include<linux/anon_inodes.h>
>> +#include<linux/syscalls.h>
>> +#include<linux/blkdev.h>
>> +
>> +#include "vhost.h"
>> +
>> +#define DEBUG 0
>> +
>> +#if DEBUG>  0
>> +#define dprintk         printk
>> +#else
>> +#define dprintk(x...)   do { ; } while (0)
>> +#endif
> There are standard macros for these.
>
>> +
>> +enum {
>> +	virtqueue_max = 1,
>> +};
>> +
>> +#define MAX_EVENTS 128
>> +
>> +struct vhost_blk {
>> +	struct vhost_virtqueue vq;
>> +	struct vhost_dev dev;
>> +	int should_stop;
>> +	struct kioctx *ioctx;
>> +	struct eventfd_ctx *ectx;
>> +	struct file *efile;
>> +	struct task_struct *worker;
>> +};
>> +
>> +struct used_info {
>> +	void *status;
>> +	int head;
>> +	int len;
>> +};
>> +
>> +static struct io_event events[MAX_EVENTS];
>> +
>> +static void blk_flush(struct vhost_blk *blk)
>> +{
>> +       vhost_poll_flush(&blk->vq.poll);
>> +}
>> +
>> +static long blk_set_features(struct vhost_blk *blk, u64 features)
>> +{
>> +	blk->dev.acked_features = features;
>> +	return 0;
>> +}
>> +
>> +static void blk_stop(struct vhost_blk *blk)
>> +{
>> +	struct vhost_virtqueue *vq =&blk->vq;
>> +	struct file *f;
>> +
>> +	mutex_lock(&vq->mutex);
>> +	f = rcu_dereference_protected(vq->private_data,
>> +					lockdep_is_held(&vq->mutex));
>> +	rcu_assign_pointer(vq->private_data, NULL);
>> +	mutex_unlock(&vq->mutex);
>> +
>> +	if (f)
>> +		fput(f);
>> +}
>> +
>> +static long blk_set_backend(struct vhost_blk *blk, struct vhost_vring_file *backend)
>> +{
>> +	int idx = backend->index;
>> +	struct vhost_virtqueue *vq =&blk->vq;
>> +	struct file *file, *oldfile;
>> +	int ret;
>> +
>> +	mutex_lock(&blk->dev.mutex);
>> +	ret = vhost_dev_check_owner(&blk->dev);
>> +	if (ret)
>> +		goto err_dev;
>> +	if (idx>= virtqueue_max) {
>> +		ret = -ENOBUFS;
>> +		goto err_dev;
>> +	}
>> +
>> +	mutex_lock(&vq->mutex);
>> +
>> +	if (!vhost_vq_access_ok(vq)) {
>> +		ret = -EFAULT;
>> +		goto err_vq;
>> +	}
> vhost-net uses a backend fd of -1 to remove a backend.
> I think that's a good idea: it makes the operation reversible.
>
>> +
>> +	file = fget(backend->fd);
> We need to verify that the file type passed makes sense.
> For example, it's possible to create reference loops
> by passing the vhost-blk fd.
>
>
>> +	if (IS_ERR(file)) {
>> +		ret = PTR_ERR(file);
>> +		goto err_vq;
>> +	}
>> +
>> +	oldfile = rcu_dereference_protected(vq->private_data,
>> +						lockdep_is_held(&vq->mutex));
>> +	if (file != oldfile)
>> +		rcu_assign_pointer(vq->private_data, file);
>> +
>> +	mutex_unlock(&vq->mutex);
>> +
>> +	if (oldfile) {
>> +		blk_flush(blk);
>> +		fput(oldfile);
>> +	}
>> +
>> +	mutex_unlock(&blk->dev.mutex);
>> +	return 0;
>> +err_vq:
>> +	mutex_unlock(&vq->mutex);
>> +err_dev:
>> +	mutex_unlock(&blk->dev.mutex);
>> +	return ret;
>> +}
>> +
>> +static long blk_reset_owner(struct vhost_blk *b)
>> +{
>> +	int ret;
>> +
>> +        mutex_lock(&b->dev.mutex);
>> +        ret = vhost_dev_check_owner(&b->dev);
>> +        if (ret)
>> +                goto err;
>> +        blk_stop(b);
>> +        blk_flush(b);
>> +        ret = vhost_dev_reset_owner(&b->dev);
>> +	if (b->worker) {
>> +		b->should_stop = 1;
>> +		smp_mb();
>> +		eventfd_signal(b->ectx, 1);
>> +	}
>> +err:
>> +        mutex_unlock(&b->dev.mutex);
>> +        return ret;
>> +}
>> +
>> +static int kernel_io_setup(unsigned nr_events, struct kioctx **ioctx)
>> +{
>> +	int ret = 0;
>> +	*ioctx = ioctx_alloc(nr_events);
>> +	if (IS_ERR(ioctx))
>> +		ret = PTR_ERR(ioctx);
>> +	return ret;
>> +}
>> +
>> +static inline int kernel_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event *event,
>> +			struct timespec *ts)
>> +{
>> +        mm_segment_t old_fs;
>> +        int ret;
>> +
>> +        old_fs = get_fs();
>> +        set_fs(get_ds());
>> +	ret = read_events(ctx, min_nr, nr, event, ts);
>> +        set_fs(old_fs);
>> +
>> +	return ret;
>> +}
>> +
>> +static inline ssize_t io_event_ret(struct io_event *ev)
>> +{
>> +    return (ssize_t)(((uint64_t)ev->res2<<  32) | ev->res);
>> +}
>> +
>> +static inline void aio_prep_req(struct kiocb *iocb, struct eventfd_ctx *ectx, struct file *file,
>> +		struct iovec *iov, int nvecs, u64 offset, int opcode, struct used_info *ui)
>> +{
>> +	iocb->ki_filp = file;
>> +	iocb->ki_eventfd = ectx;
>> +	iocb->ki_pos = offset;
>> +	iocb->ki_buf = (void *)iov;
>> +	iocb->ki_left = iocb->ki_nbytes = nvecs;
>> +	iocb->ki_opcode = opcode;
>> +	iocb->ki_obj.user = ui;
>> +}
>> +
>> +static inline int kernel_io_submit(struct vhost_blk *blk, struct iovec *iov, u64 nvecs, loff_t pos, int opcode, int head, int len)
>> +{
>> +	int ret = -EAGAIN;
>> +	struct kiocb *req;
>> +	struct kioctx *ioctx = blk->ioctx;
>> +	struct used_info *ui = kzalloc(sizeof *ui, GFP_KERNEL);
>> +	struct file *f = blk->vq.private_data;
>> +
>> +	try_get_ioctx(ioctx);
>> +	atomic_long_inc_not_zero(&f->f_count);
>> +	eventfd_ctx_get(blk->ectx);
>> +
>> +
>> +	req = aio_get_req(ioctx); /* return 2 refs of req*/
>> +	if (unlikely(!req))
>> +		goto out;
>> +
>> +	ui->head = head;
>> +	ui->status = blk->vq.iov[nvecs + 1].iov_base;
>> +	ui->len = len;
>> +	aio_prep_req(req, blk->ectx, f, iov, nvecs, pos, opcode, ui);
>> +
>> +	ret = aio_setup_iocb(req, 0);
>> +	if (unlikely(ret))
>> +		goto out_put_req;
>> +
>> +	spin_lock_irq(&ioctx->ctx_lock);
>> +	if (unlikely(ioctx->dead)) {
>> +		spin_unlock_irq(&ioctx->ctx_lock);
>> +		ret = -EINVAL;
>> +		goto out_put_req;
>> +	}
>> +
>> +	aio_run_iocb(req);
>> +	if (!list_empty(&ioctx->run_list)) {
>> +		while (__aio_run_iocbs(ioctx))
>> +			;
>> +	}
>> +	spin_unlock_irq(&ioctx->ctx_lock);
>> +
>> +	aio_put_req(req);
>> +	put_ioctx(blk->ioctx);
>> +
>> +	return ret;
>> +
>> +out_put_req:
>> +	aio_put_req(req);
>> +	aio_put_req(req);
>> +out:
>> +	put_ioctx(blk->ioctx);
>> +	return ret;
>> +}
>> +
>> +static int blk_completion_worker(void *priv)
>> +{
>> +	struct vhost_blk *blk = priv;
>> +	u64 count;
>> +	int ret;
>> +
>> +	use_mm(blk->dev.mm);
>> +	for (;;) {
> It would be nicer to reuse the worker infrastructure
> from vhost.c. In particular this one ignores cgroups that
> the owner belongs to if any.
> Does this one do anything that vhost.c doesn't?
>
>> +		struct timespec ts = { 0 };
>> +		int i, nr;
>> +
>> +		do {
>> +		ret = eventfd_ctx_read(blk->ectx, 0,&count);
>> +		} while (unlikely(ret == -ERESTARTSYS));
>> +
>> +		if (unlikely(blk->should_stop))
>> +			break;
>> +
>> +		do {
>> +		nr = kernel_read_events(blk->ioctx, count, MAX_EVENTS, events,&ts);
>> +		} while (unlikely(nr == -EINTR));
>> +		dprintk("%s, count %llu, nr %d\n", __func__, count, nr);
>> +
>> +		if (unlikely(nr<  0))
>> +			continue;
>> +
>> +		for (i = 0; i<  nr; i++) {
>> +			struct used_info *u = (struct used_info *)events[i].obj;
>> +			int len, status;
>> +
>> +			dprintk("%s, head %d complete in %d\n", __func__, u->head, i);
>> +			len = io_event_ret(&events[i]);
>> +			//status = u->len == len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
>> +			status = len>  0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
>> +			if (copy_to_user(u->status,&status, sizeof status)) {
>> +				vq_err(&blk->vq, "%s failed to write status\n", __func__);
>> +				BUG(); /* FIXME: maybe a bit radical? */
> On an invalid userspace address?
> You may very well say so.
>
>> +			}
>> +			vhost_add_used(&blk->vq, u->head, u->len);
>> +			kfree(u);
>> +		}
>> +
>> +		vhost_signal(&blk->dev,&blk->vq);
>> +	}
>> +	unuse_mm(blk->dev.mm);
>> +	return 0;
>> +}
>> +
>> +static int completion_thread_setup(struct vhost_blk *blk)
>> +{
>> +	int ret = 0;
>> +	struct task_struct *worker;
>> +	worker = kthread_create(blk_completion_worker, blk, "vhost-blk-%d", current->pid);
>> +	if (IS_ERR(worker)) {
>> +		ret = PTR_ERR(worker);
>> +		goto err;
>> +	}
>> +	blk->worker = worker;
>> +	blk->should_stop = 0;
>> +	smp_mb();
>> +	wake_up_process(worker);
>> +err:
>> +	return ret;
>> +}
>> +
>> +static void completion_thread_destory(struct vhost_blk *blk)
>> +{
>> +	if (blk->worker) {
>> +		blk->should_stop = 1;
>> +		smp_mb();
>> +		eventfd_signal(blk->ectx, 1);
>> +	}
>> +}
>> +
>> +
>> +static long blk_set_owner(struct vhost_blk *blk)
>> +{
>> +	return completion_thread_setup(blk);
>> +}
>> +
>> +static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
>> +		unsigned long arg)
>> +{
>> +	struct vhost_blk *blk = f->private_data;
>> +	struct vhost_vring_file backend;
>> +	u64 features = VHOST_BLK_FEATURES;
>> +	int ret = -EFAULT;
>> +
>> +	switch (ioctl) {
>> +		case VHOST_NET_SET_BACKEND:
>> +			if(copy_from_user(&backend, (void __user *)arg, sizeof backend))
>> +				break;
>> +			ret = blk_set_backend(blk,&backend);
>> +			break;
> Please create your own ioctl for this one.
>
>> +		case VHOST_GET_FEATURES:
>> +			features = VHOST_BLK_FEATURES;
>> +			if (copy_to_user((void __user *)arg ,&features, sizeof features))
>> +				break;
>> +			ret = 0;
>> +			break;
>> +		case VHOST_SET_FEATURES:
>> +			if (copy_from_user(&features, (void __user *)arg, sizeof features))
>> +				break;
>> +			if (features&  ~VHOST_BLK_FEATURES) {
>> +				ret = -EOPNOTSUPP;
>> +				break;
>> +			}
>> +			ret = blk_set_features(blk, features);
>> +			break;
>> +		case VHOST_RESET_OWNER:
>> +			ret = blk_reset_owner(blk);
>> +			break;
>> +		default:
>> +			mutex_lock(&blk->dev.mutex);
>> +			ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
>> +			if (!ret&&  ioctl == VHOST_SET_OWNER)
>> +				ret = blk_set_owner(blk);
>> +			blk_flush(blk);
>> +			mutex_unlock(&blk->dev.mutex);
>> +			break;
>> +	}
>> +	return ret;
>> +}
>> +
>> +#define BLK_HDR 0
>> +#define BLK_HDR_LEN 16
>> +
>> +static inline int do_request(struct vhost_virtqueue *vq, struct virtio_blk_outhdr *hdr,
>> +		u64 nr_vecs, int head)
>> +{
>> +	struct file *f = vq->private_data;
>> +	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
>> +	struct iovec *iov =&vq->iov[BLK_HDR + 1];
>> +	loff_t pos = hdr->sector<<  9;
>> +	int ret = 0, len = 0, status;
>> +//	int i;
>> +
>> +	dprintk("sector %llu, num %lu, type %d\n", hdr->sector, iov->iov_len / 512, hdr->type);
>> +	//Guest virtio-blk driver dosen't use len currently.
>> +	//for (i = 0; i<  nr_vecs; i++) {
>> +	//	len += iov[i].iov_len;
>> +	//}
>> +	switch (hdr->type) {
>> +	case VIRTIO_BLK_T_OUT:
>> +		kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PWRITEV, head, len);
>> +		break;
>> +	case VIRTIO_BLK_T_IN:
>> +		kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PREADV, head, len);
>> +		break;
>> +	case VIRTIO_BLK_T_FLUSH:
>> +		ret = vfs_fsync(f, 1);
>> +		/* fall through */
>> +	case VIRTIO_BLK_T_GET_ID:
>> +		status = ret<  0 ? VIRTIO_BLK_S_IOERR :VIRTIO_BLK_S_OK;
>> +		if ((vq->iov[nr_vecs + 1].iov_len != 1))
>> +			BUG();
> Why is this one a bug?
>
>
>> +
>> +		if (copy_to_user(vq->iov[nr_vecs + 1].iov_base,&status, sizeof status)) {
>> +				vq_err(vq, "%s failed to write status!\n", __func__);
>> +				vhost_discard_vq_desc(vq, 1);
>> +				ret = -EFAULT;
>> +				break;
>> +			}
>> +
>> +		vhost_add_used_and_signal(&blk->dev, vq, head, ret);
>> +		break;
>> +	default:
>> +		pr_info("%s, unsupported request type %d\n", __func__, hdr->type);
>> +		vhost_discard_vq_desc(vq, 1);
>> +		ret = -EFAULT;
>> +		break;
>> +	}
>> +	return ret;
>> +}
>> +
>> +static inline void handle_kick(struct vhost_blk *blk)
>> +{
>> +	struct vhost_virtqueue *vq =&blk->vq;
>> +	struct virtio_blk_outhdr hdr;
>> +	u64 nr_vecs;
>> +	int in, out, head;
>> +	struct blk_plug plug;
>> +
>> +	mutex_lock(&vq->mutex);
>> +	vhost_disable_notify(&blk->dev, vq);
>> +
>> +	blk_start_plug(&plug);
>> +	for (;;) {
>> +		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
>> +				ARRAY_SIZE(vq->iov),
>> +				&out,&in, NULL, NULL);
>> +		/* No awailable descriptors from Guest? */
>> +		if (head == vq->num) {
>> +			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
>> +				vhost_disable_notify(&blk->dev, vq);
>> +				continue;
>> +			}
>> +			break;
>> +		}
>> +		if (unlikely(head<  0))
>> +			break;
>> +
>> +		dprintk("head %d, in %d, out %d\n", head, in, out);
>> +		if(unlikely(vq->iov[BLK_HDR].iov_len != BLK_HDR_LEN)) {
>> +			vq_err(vq, "%s bad block header lengh!\n", __func__);
>> +			vhost_discard_vq_desc(vq, 1);
>> +			break;
>> +		}
>> +
>> +		if (copy_from_user(&hdr, vq->iov[BLK_HDR].iov_base, sizeof hdr)) {
>> +			vq_err(vq, "%s failed to get block header!\n", __func__);
>> +			vhost_discard_vq_desc(vq, 1);
>> +			break;
>> +		}
>> +
>> +		if (hdr.type == VIRTIO_BLK_T_IN || hdr.type == VIRTIO_BLK_T_GET_ID)
>> +			nr_vecs = in - 1;
>> +		else
>> +			nr_vecs = out - 1;
>> +
>> +		if (do_request(vq,&hdr, nr_vecs, head)<  0)
>> +			break;
>> +	}
>> +	blk_finish_plug(&plug);
>> +	mutex_unlock(&vq->mutex);
>> +}
>> +
>> +static void handle_guest_kick(struct vhost_work *work)
>> +{
>> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, poll.work);
>> +	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
>> +	handle_kick(blk);
>> +}
>> +
>> +static void eventfd_setup(struct vhost_blk *blk)
>> +{
>> +	blk->efile = eventfd_file_create(0, 0);
>> +	blk->ectx = eventfd_ctx_fileget(blk->efile);
>> +}
>> +
>> +static int vhost_blk_open(struct inode *inode, struct file *f)
>> +{
>> +	int ret = -ENOMEM;
>> +	struct vhost_blk *blk = kmalloc(sizeof *blk, GFP_KERNEL);
>> +	if (!blk)
>> +		goto err;
>> +
>> +	blk->vq.handle_kick = handle_guest_kick;
>> +	ret = vhost_dev_init(&blk->dev,&blk->vq, virtqueue_max);
>> +	if (ret<  0)
>> +		goto err_init;
>> +
>> +	ret = kernel_io_setup(MAX_EVENTS,&blk->ioctx);
>> +	if (ret<  0)
>> +		goto err_io_setup;
>> +
>> +	eventfd_setup(blk);
>> +	f->private_data = blk;
>> +	return ret;
>> +err_init:
>> +err_io_setup:
>> +	kfree(blk);
>> +err:
>> +	return ret;
>> +}
>> +
>> +static void eventfd_destroy(struct vhost_blk *blk)
>> +{
>> +	eventfd_ctx_put(blk->ectx);
>> +	fput(blk->efile);
>> +}
>> +
>> +static int vhost_blk_release(struct inode *inode, struct file *f)
>> +{
>> +	struct vhost_blk *blk = f->private_data;
>> +
>> +	blk_stop(blk);
>> +	blk_flush(blk);
>> +	vhost_dev_cleanup(&blk->dev);
>> +	/* Yet another flush? See comments in vhost_net_release() */
>> +	blk_flush(blk);
>> +	completion_thread_destory(blk);
>> +	eventfd_destroy(blk);
>> +	kfree(blk);
>> +
>> +	return 0;
>> +}
>> +
>> +const static struct file_operations vhost_blk_fops = {
>> +	.owner          = THIS_MODULE,
>> +	.release        = vhost_blk_release,
>> +	.open           = vhost_blk_open,
>> +	.unlocked_ioctl = vhost_blk_ioctl,
>> +	.llseek		= noop_llseek,
>> +};
>> +
>> +
>> +static struct miscdevice vhost_blk_misc = {
>> +	234,
> Don't get a major unless you really must.
>
>> +	"vhost-blk",
>> +	&vhost_blk_fops,
> And use C99 initializers.
>
>> +};
>> +
>> +int vhost_blk_init(void)
>> +{
>> +	return misc_register(&vhost_blk_misc);
>> +}
>> +void vhost_blk_exit(void)
>> +{
>> +	misc_deregister(&vhost_blk_misc);
>> +}
>> +
>> +module_init(vhost_blk_init);
>> +module_exit(vhost_blk_exit);
>> +
>> +MODULE_VERSION("0.0.1");
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_AUTHOR("Liu Yuan");
>> +MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 8e03379..9e17152 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -12,6 +12,7 @@
>>   #include<linux/virtio_config.h>
>>   #include<linux/virtio_ring.h>
>>   #include<asm/atomic.h>
>> +#include<linux/virtio_blk.h>
>>
>>   struct vhost_device;
>>
>> @@ -174,6 +175,16 @@ enum {
>>   			 (1ULL<<  VHOST_F_LOG_ALL) |
>>   			 (1ULL<<  VHOST_NET_F_VIRTIO_NET_HDR) |
>>   			 (1ULL<<  VIRTIO_NET_F_MRG_RXBUF),
>> +
>> +	VHOST_BLK_FEATURES =	(1ULL<<  VIRTIO_F_NOTIFY_ON_EMPTY) |
>> +				(1ULL<<  VIRTIO_RING_F_INDIRECT_DESC) |
>> +				(1ULL<<  VIRTIO_RING_F_EVENT_IDX) |
>> +				(1ULL<<  VIRTIO_BLK_F_SEG_MAX) |
>> +				(1ULL<<  VIRTIO_BLK_F_GEOMETRY) |
>> +				(1ULL<<  VIRTIO_BLK_F_TOPOLOGY) |
>> +				(1ULL<<  VIRTIO_BLK_F_SCSI) |
>> +				(1ULL<<  VIRTIO_BLK_F_BLK_SIZE),
>> +
>>   };
>>
>>   static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
>> diff --git a/fs/aio.c b/fs/aio.c
>> index e29ec48..534d396 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -215,7 +215,7 @@ static void ctx_rcu_free(struct rcu_head *head)
>>    *	Called when the last user of an aio context has gone away,
>>    *	and the struct needs to be freed.
>>    */
>> -static void __put_ioctx(struct kioctx *ctx)
>> +void __put_ioctx(struct kioctx *ctx)
>>   {
>>   	BUG_ON(ctx->reqs_active);
>>
>> @@ -227,29 +227,12 @@ static void __put_ioctx(struct kioctx *ctx)
>>   	pr_debug("__put_ioctx: freeing %p\n", ctx);
>>   	call_rcu(&ctx->rcu_head, ctx_rcu_free);
>>   }
>> -
>> -static inline void get_ioctx(struct kioctx *kioctx)
>> -{
>> -	BUG_ON(atomic_read(&kioctx->users)<= 0);
>> -	atomic_inc(&kioctx->users);
>> -}
>> -
>> -static inline int try_get_ioctx(struct kioctx *kioctx)
>> -{
>> -	return atomic_inc_not_zero(&kioctx->users);
>> -}
>> -
>> -static inline void put_ioctx(struct kioctx *kioctx)
>> -{
>> -	BUG_ON(atomic_read(&kioctx->users)<= 0);
>> -	if (unlikely(atomic_dec_and_test(&kioctx->users)))
>> -		__put_ioctx(kioctx);
>> -}
>> +EXPORT_SYMBOL(__put_ioctx);
>>
>>   /* ioctx_alloc
>>    *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
>>    */
>> -static struct kioctx *ioctx_alloc(unsigned nr_events)
>> +struct kioctx *ioctx_alloc(unsigned nr_events)
>>   {
>>   	struct mm_struct *mm;
>>   	struct kioctx *ctx;
>> @@ -327,6 +310,7 @@ out_freectx:
>>   	dprintk("aio: error allocating ioctx %p\n", ctx);
>>   	return ctx;
>>   }
>> +EXPORT_SYMBOL(ioctx_alloc);
>>
>>   /* aio_cancel_all
>>    *	Cancels all outstanding aio requests on an aio context.  Used
>> @@ -437,7 +421,7 @@ void exit_aio(struct mm_struct *mm)
>>    * This prevents races between the aio code path referencing the
>>    * req (after submitting it) and aio_complete() freeing the req.
>>    */
>> -static struct kiocb *__aio_get_req(struct kioctx *ctx)
>> +struct kiocb *__aio_get_req(struct kioctx *ctx)
>>   {
>>   	struct kiocb *req = NULL;
>>   	struct aio_ring *ring;
>> @@ -480,7 +464,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
>>   	return req;
>>   }
>>
>> -static inline struct kiocb *aio_get_req(struct kioctx *ctx)
>> +struct kiocb *aio_get_req(struct kioctx *ctx)
>>   {
>>   	struct kiocb *req;
>>   	/* Handle a potential starvation case -- should be exceedingly rare as
>> @@ -494,6 +478,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
>>   	}
>>   	return req;
>>   }
>> +EXPORT_SYMBOL(aio_get_req);
>>
>>   static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
>>   {
>> @@ -659,7 +644,7 @@ static inline int __queue_kicked_iocb(struct kiocb *iocb)
>>    * simplifies the coding of individual aio operations as
>>    * it avoids various potential races.
>>    */
>> -static ssize_t aio_run_iocb(struct kiocb *iocb)
>> +ssize_t aio_run_iocb(struct kiocb *iocb)
>>   {
>>   	struct kioctx	*ctx = iocb->ki_ctx;
>>   	ssize_t (*retry)(struct kiocb *);
>> @@ -753,6 +738,7 @@ out:
>>   	}
>>   	return ret;
>>   }
>> +EXPORT_SYMBOL(aio_run_iocb);
>>
>>   /*
>>    * __aio_run_iocbs:
>> @@ -761,7 +747,7 @@ out:
>>    * Assumes it is operating within the aio issuer's mm
>>    * context.
>>    */
>> -static int __aio_run_iocbs(struct kioctx *ctx)
>> +int __aio_run_iocbs(struct kioctx *ctx)
>>   {
>>   	struct kiocb *iocb;
>>   	struct list_head run_list;
>> @@ -784,6 +770,7 @@ static int __aio_run_iocbs(struct kioctx *ctx)
>>   		return 1;
>>   	return 0;
>>   }
>> +EXPORT_SYMBOL(__aio_run_iocbs);
>>
>>   static void aio_queue_work(struct kioctx * ctx)
>>   {
>> @@ -1074,7 +1061,7 @@ static inline void clear_timeout(struct aio_timeout *to)
>>   	del_singleshot_timer_sync(&to->timer);
>>   }
>>
>> -static int read_events(struct kioctx *ctx,
>> +int read_events(struct kioctx *ctx,
>>   			long min_nr, long nr,
>>   			struct io_event __user *event,
>>   			struct timespec __user *timeout)
>> @@ -1190,11 +1177,12 @@ out:
>>   	destroy_timer_on_stack(&to.timer);
>>   	return i ? i : ret;
>>   }
>> +EXPORT_SYMBOL(read_events);
>>
>>   /* Take an ioctx and remove it from the list of ioctx's.  Protects
>>    * against races with itself via ->dead.
>>    */
>> -static void io_destroy(struct kioctx *ioctx)
>> +void io_destroy(struct kioctx *ioctx)
>>   {
>>   	struct mm_struct *mm = current->mm;
>>   	int was_dead;
>> @@ -1221,6 +1209,7 @@ static void io_destroy(struct kioctx *ioctx)
>>   	wake_up_all(&ioctx->wait);
>>   	put_ioctx(ioctx);	/* once for the lookup */
>>   }
>> +EXPORT_SYMBOL(io_destroy);
>>
>>   /* sys_io_setup:
>>    *	Create an aio_context capable of receiving at least nr_events.
>> @@ -1423,7 +1412,7 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
>>    *	Performs the initial checks and aio retry method
>>    *	setup for the kiocb at the time of io submission.
>>    */
>> -static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
>> +ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
>>   {
>>   	struct file *file = kiocb->ki_filp;
>>   	ssize_t ret = 0;
>> @@ -1513,6 +1502,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
>>
>>   	return 0;
>>   }
>> +EXPORT_SYMBOL(aio_setup_iocb);
>>
>>   static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>>   			 struct iocb *iocb, bool compat)
>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>> index d9a5917..6343bc9 100644
>> --- a/fs/eventfd.c
>> +++ b/fs/eventfd.c
>> @@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>>
>>   	return file;
>>   }
>> +EXPORT_SYMBOL_GPL(eventfd_file_create);
> You can avoid the need for this export if you pass
> the eventfd in from userspace.
>
>>
>>   SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
>>   {
>> diff --git a/include/linux/aio.h b/include/linux/aio.h
>> index 7a8db41..d63bc04 100644
>> --- a/include/linux/aio.h
>> +++ b/include/linux/aio.h
>> @@ -214,6 +214,37 @@ struct mm_struct;
>>   extern void exit_aio(struct mm_struct *mm);
>>   extern long do_io_submit(aio_context_t ctx_id, long nr,
>>   			 struct iocb __user *__user *iocbpp, bool compat);
>> +extern void __put_ioctx(struct kioctx *ctx);
>> +extern struct kioctx *ioctx_alloc(unsigned nr_events);
>> +extern struct kiocb *aio_get_req(struct kioctx *ctx);
>> +extern ssize_t aio_run_iocb(struct kiocb *iocb);
>> +extern int __aio_run_iocbs(struct kioctx *ctx);
>> +extern int read_events(struct kioctx *ctx,
>> +                        long min_nr, long nr,
>> +                        struct io_event __user *event,
>> +                        struct timespec __user *timeout);
>> +extern void io_destroy(struct kioctx *ioctx);
>> +extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
>> +extern void __put_ioctx(struct kioctx *ctx);
>> +
>> +static inline void get_ioctx(struct kioctx *kioctx)
>> +{
>> +        BUG_ON(atomic_read(&kioctx->users)<= 0);
>> +        atomic_inc(&kioctx->users);
>> +}
>> +
>> +static inline int try_get_ioctx(struct kioctx *kioctx)
>> +{
>> +        return atomic_inc_not_zero(&kioctx->users);
>> +}
>> +
>> +static inline void put_ioctx(struct kioctx *kioctx)
>> +{
>> +        BUG_ON(atomic_read(&kioctx->users)<= 0);
>> +        if (unlikely(atomic_dec_and_test(&kioctx->users)))
>> +                __put_ioctx(kioctx);
>> +}
>> +
>>   #else
>>   static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>>   static inline int aio_put_req(struct kiocb *iocb) { return 0; }
>> -- 
>> 1.7.5.1
Thanks, I'll split the patch and prepare a v2 to address your comments.

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 15:22   ` Michael S. Tsirkin
  2011-07-29 15:09     ` Liu Yuan
@ 2011-08-01  6:25     ` Liu Yuan
  2011-08-01  8:12       ` Michael S. Tsirkin
  2011-08-11 19:59     ` Dongsu Park
  2 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-01  6:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Rusty Russell, Avi Kivity, kvm, linux-kernel

On 07/28/2011 11:22 PM, Michael S. Tsirkin wrote:
>
> It would be nicer to reuse the worker infrastructure
> from vhost.c. In particular this one ignores cgroups that
> the owner belongs to if any.
> Does this one do anything that vhost.c doesn't?
>

The main reason I use a separate thread to handle completion is to
decouple request handling from completion signalling. This might allow
better scalability in an IO-intensive scenario, since I noted that the
virtio driver allows submitting as many as 128 requests in one go.

Hmm, I have tried turning the signalling thread into a function that is
called as a vhost worker's work item. I didn't see any regression on my
laptop with iodepth equal to 1, 2 and 3. But request handling and
completion signalling may race at a high request submission rate.
Anyway, I'll adapt it to work as a vhost worker function in v2.

>
>> +	switch (ioctl) {
>> +		case VHOST_NET_SET_BACKEND:
>> +			if(copy_from_user(&backend, (void __user *)arg, sizeof backend))
>> +				break;
>> +			ret = blk_set_backend(blk,&backend);
>> +			break;
> Please create your own ioctl for this one.
>

How about changing VHOST_NET_SET_BACKEND into VHOST_SET_BACKEND?

>>
>>   static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>>   			 struct iocb *iocb, bool compat)
>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>> index d9a5917..6343bc9 100644
>> --- a/fs/eventfd.c
>> +++ b/fs/eventfd.c
>> @@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>>
>>   	return file;
>>   }
>> +EXPORT_SYMBOL_GPL(eventfd_file_create);
> You can avoid the need for this export if you pass
> the eventfd in from userspace.
>

Since the eventfd used by the completion code is internal, hiding it from
hw/vhost_blk.c would simplify the configuration. I think this export is
necessary, and it gets rid of unnecessary FD management in vhost-blk.c.

>>
>>   SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
>>   {
>> diff --git a/include/linux/aio.h b/include/linux/aio.h
>> index 7a8db41..d63bc04 100644
>> --- a/include/linux/aio.h
>> +++ b/include/linux/aio.h
>> @@ -214,6 +214,37 @@ struct mm_struct;
>>   extern void exit_aio(struct mm_struct *mm);
>>   extern long do_io_submit(aio_context_t ctx_id, long nr,
>>   			 struct iocb __user *__user *iocbpp, bool compat);
>> +extern void __put_ioctx(struct kioctx *ctx);
>> +extern struct kioctx *ioctx_alloc(unsigned nr_events);
>> +extern struct kiocb *aio_get_req(struct kioctx *ctx);
>> +extern ssize_t aio_run_iocb(struct kiocb *iocb);
>> +extern int __aio_run_iocbs(struct kioctx *ctx);
>> +extern int read_events(struct kioctx *ctx,
>> +                        long min_nr, long nr,
>> +                        struct io_event __user *event,
>> +                        struct timespec __user *timeout);
>> +extern void io_destroy(struct kioctx *ioctx);
>> +extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
>> +extern void __put_ioctx(struct kioctx *ctx);
>> +
>> +static inline void get_ioctx(struct kioctx *kioctx)
>> +{
>> +        BUG_ON(atomic_read(&kioctx->users)<= 0);
>> +        atomic_inc(&kioctx->users);
>> +}
>> +
>> +static inline int try_get_ioctx(struct kioctx *kioctx)
>> +{
>> +        return atomic_inc_not_zero(&kioctx->users);
>> +}
>> +
>> +static inline void put_ioctx(struct kioctx *kioctx)
>> +{
>> +        BUG_ON(atomic_read(&kioctx->users)<= 0);
>> +        if (unlikely(atomic_dec_and_test(&kioctx->users)))
>> +                __put_ioctx(kioctx);
>> +}
>> +
>>   #else
>>   static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>>   static inline int aio_put_req(struct kiocb *iocb) { return 0; }
>> -- 
>> 1.7.5.1

Other comments will be addressed in V2. Thanks

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-08-01  6:25     ` Liu Yuan
@ 2011-08-01  8:12       ` Michael S. Tsirkin
  2011-08-01  8:55         ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2011-08-01  8:12 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Rusty Russell, Avi Kivity, kvm, linux-kernel

On Mon, Aug 01, 2011 at 02:25:36PM +0800, Liu Yuan wrote:
> On 07/28/2011 11:22 PM, Michael S. Tsirkin wrote:
> >
> >It would be nicer to reuse the worker infrastructure
> >from vhost.c. In particular this one ignores cgroups that
> >the owner belongs to if any.
> >Does this one do anything that vhost.c doesn't?
> >
> 
> The main reason I use a separate thread to handle completion is to
> decouple request handling from completion signalling. This might allow
> better scalability in an IO-intensive scenario,

The code seems to have the vq mutex though, isn't that right?
If so, it can't execute in parallel so it's a bit
hard to see how this would help scalability.

> since I noted that the virtio driver allows submitting as many as 128
> requests in one go.
> 
> Hmm, I have tried turning the signalling thread into a function that is
> called as a vhost worker's work item.
> I didn't see any regression on my laptop with iodepth equal to 1, 2 and 3.
> But request handling and completion signalling may race at a high
> request submission rate. Anyway, I'll adapt it to work as a vhost
> worker function in v2.
> 
> >
> >>+	switch (ioctl) {
> >>+		case VHOST_NET_SET_BACKEND:
> >>+			if(copy_from_user(&backend, (void __user *)arg, sizeof backend))
> >>+				break;
> >>+			ret = blk_set_backend(blk,&backend);
> >>+			break;
> >Please create your own ioctl for this one.
> >
> 
> How about changing VHOST_NET_SET_BACKEND into VHOST_SET_BACKEND?

I had a feeling other devices might want some other structure
(not an fd) as a backend. Maybe that was wrong ...
If so, pls do that, and #define VHOST_NET_SET_BACKEND VHOST_SET_BACKEND
for compatibility.
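
A minimal sketch of what such an alias could look like. The definitions below are local copies for illustration only: the magic 0xAF, the request number 0x30 and the layout of struct vhost_vring_file are taken from linux/vhost.h as I remember it and should be checked against the tree, and VHOST_SET_BACKEND is only the name proposed in this thread, not an existing ioctl. backend_ioctl_is_compatible() is a hypothetical helper added just to make the point checkable.

```c
#include <assert.h>
#include <sys/ioctl.h>	/* _IOW */

/* Local copies for illustration; the real ones live in linux/vhost.h. */
#define VHOST_VIRTIO 0xAF

struct vhost_vring_file {
	unsigned int index;
	int fd;			/* pass -1 to unbind the backend */
};

/* Proposed generic name (hypothetical until something like it is merged). */
#define VHOST_SET_BACKEND	_IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)

/* Old name kept as an alias so existing vhost-net userspace keeps building. */
#define VHOST_NET_SET_BACKEND	VHOST_SET_BACKEND

/* Both names must encode the same ioctl number for binary compatibility. */
static int backend_ioctl_is_compatible(void)
{
	return VHOST_NET_SET_BACKEND == VHOST_SET_BACKEND;
}
```

Since the alias expands to the same _IOW() encoding, an existing vhost-net binary would issue exactly the same request number as before.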

> >>
> >>  static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
> >>  			 struct iocb *iocb, bool compat)
> >>diff --git a/fs/eventfd.c b/fs/eventfd.c
> >>index d9a5917..6343bc9 100644
> >>--- a/fs/eventfd.c
> >>+++ b/fs/eventfd.c
> >>@@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
> >>
> >>  	return file;
> >>  }
> >>+EXPORT_SYMBOL_GPL(eventfd_file_create);
> >You can avoid the need for this export if you pass
> >the eventfd in from userspace.
> >
> 
> Since the eventfd used by the completion code is internal, hiding it from
> hw/vhost_blk.c would simplify the configuration. I think this export is
> necessary, and it gets rid of unnecessary FD management in vhost-blk.c.

Well this is a new kernel interface duplicating the functionality of the
old one.  You'll have a hard time selling this idea upstream, I suspect.
And I doubt it simplifies the code significantly.
Further, you have a single vq for block, but net has two and
we do want the flexibility of using a single eventfd for both,
I think.
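
To illustrate the userspace side of that suggestion, here is a rough sketch, assuming userspace (qemu, say) creates the eventfd itself and hands the fd to the kernel in an ioctl argument. The ioctl is hypothetical and not shown; on the kernel side the fd could be resolved with the already-exported eventfd_ctx_fdget(), so no new export would be needed. eventfd_roundtrip() is an illustrative helper name that only demonstrates that a count written to the fd is seen by a reader.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Create an eventfd in userspace, signal it with count n and read the
 * count back. In a real setup the fd would be passed to the vhost device
 * (hypothetical ioctl, not shown) instead of being read here.
 * Returns the count read, or 0 on any failure.
 */
static uint64_t eventfd_roundtrip(uint64_t n)
{
	uint64_t got = 0;
	int fd = eventfd(0, 0);

	if (fd < 0)
		return 0;
	if (write(fd, &n, sizeof(n)) == (ssize_t)sizeof(n)) {
		if (read(fd, &got, sizeof(got)) != (ssize_t)sizeof(got))
			got = 0;
	}
	close(fd);
	return got;
}
```

Because userspace owns the fd, the same eventfd could also be shared between several virtqueues, which is the flexibility mentioned above.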

> >>
> >>  SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
> >>  {
> >>diff --git a/include/linux/aio.h b/include/linux/aio.h
> >>index 7a8db41..d63bc04 100644
> >>--- a/include/linux/aio.h
> >>+++ b/include/linux/aio.h
> >>@@ -214,6 +214,37 @@ struct mm_struct;
> >>  extern void exit_aio(struct mm_struct *mm);
> >>  extern long do_io_submit(aio_context_t ctx_id, long nr,
> >>  			 struct iocb __user *__user *iocbpp, bool compat);
> >>+extern void __put_ioctx(struct kioctx *ctx);
> >>+extern struct kioctx *ioctx_alloc(unsigned nr_events);
> >>+extern struct kiocb *aio_get_req(struct kioctx *ctx);
> >>+extern ssize_t aio_run_iocb(struct kiocb *iocb);
> >>+extern int __aio_run_iocbs(struct kioctx *ctx);
> >>+extern int read_events(struct kioctx *ctx,
> >>+                        long min_nr, long nr,
> >>+                        struct io_event __user *event,
> >>+                        struct timespec __user *timeout);
> >>+extern void io_destroy(struct kioctx *ioctx);
> >>+extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
> >>+extern void __put_ioctx(struct kioctx *ctx);
> >>+
> >>+static inline void get_ioctx(struct kioctx *kioctx)
> >>+{
> >>+        BUG_ON(atomic_read(&kioctx->users)<= 0);
> >>+        atomic_inc(&kioctx->users);
> >>+}
> >>+
> >>+static inline int try_get_ioctx(struct kioctx *kioctx)
> >>+{
> >>+        return atomic_inc_not_zero(&kioctx->users);
> >>+}
> >>+
> >>+static inline void put_ioctx(struct kioctx *kioctx)
> >>+{
> >>+        BUG_ON(atomic_read(&kioctx->users)<= 0);
> >>+        if (unlikely(atomic_dec_and_test(&kioctx->users)))
> >>+                __put_ioctx(kioctx);
> >>+}
> >>+
> >>  #else
> >>  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
> >>  static inline int aio_put_req(struct kiocb *iocb) { return 0; }
> >>-- 
> >>1.7.5.1
> 
> Other comments will be addressed in V2. Thanks
> 
> Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-08-01  8:12       ` Michael S. Tsirkin
@ 2011-08-01  8:55         ` Liu Yuan
  2011-08-01 10:26           ` Michael S. Tsirkin
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-01  8:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Rusty Russell, Avi Kivity, kvm, linux-kernel

On 08/01/2011 04:12 PM, Michael S. Tsirkin wrote:
> On Mon, Aug 01, 2011 at 02:25:36PM +0800, Liu Yuan wrote:
>> On 07/28/2011 11:22 PM, Michael S. Tsirkin wrote:
>>> It would be nicer to reuse the worker infrastructure
>>> from vhost.c. In particular this one ignores cgroups that
>>> the owner belongs to if any.
>>> Does this one do anything that vhost.c doesn't?
>>>
>> The main reason I use a separate thread to handle completion is to
>> decouple request handling from completion signalling. This might allow
>> better scalability in an IO-intensive scenario,
> The code seems to have the vq mutex though, isn't that right?
> If so, it can't execute in parallel so it's a bit
> hard to see how this would help scalability.
>

Nope, the V1 completion thread doesn't take any mutex, so it can run in
parallel with the vhost worker. Anyway, I'll move the completion code
into the vhost worker, since it deals with other stuff like cgroups.

> I had a feeling other devices might want some other structure
> (not an fd) as a backend. Maybe that was wrong ...
> If so, pls do that, and #define VHOST_NET_SET_BACKEND VHOST_SET_BACKEND
> for compatibility.
>

Okay.

>>>>   static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>>>>   			 struct iocb *iocb, bool compat)
>>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>>> index d9a5917..6343bc9 100644
>>>> --- a/fs/eventfd.c
>>>> +++ b/fs/eventfd.c
>>>> @@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>>>>
>>>>   	return file;
>>>>   }
>>>> +EXPORT_SYMBOL_GPL(eventfd_file_create);
>>> You can avoid the need for this export if you pass
>>> the eventfd in from userspace.
>>>
>> Since the eventfd used by the completion code is internal, hiding it from
>> hw/vhost_blk.c would simplify the configuration. I think this export is
>> necessary, and it gets rid of unnecessary FD management in vhost-blk.c.
> Well this is a new kernel interface duplicating the functionality of the
> old one.  You'll have a hard time selling this idea upstream, I suspect.
> And I doubt it simplifies the code significantly.
> Further, you have a single vq for block, but net has two and
> we do want the flexibility of using a single eventfd for both,
> I think.
>
Point taken.

>>>>   SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
>>>>   {
>>>> diff --git a/include/linux/aio.h b/include/linux/aio.h
>>>> index 7a8db41..d63bc04 100644
>>>> --- a/include/linux/aio.h
>>>> +++ b/include/linux/aio.h
>>>> @@ -214,6 +214,37 @@ struct mm_struct;
>>>>   extern void exit_aio(struct mm_struct *mm);
>>>>   extern long do_io_submit(aio_context_t ctx_id, long nr,
>>>>   			 struct iocb __user *__user *iocbpp, bool compat);
>>>> +extern void __put_ioctx(struct kioctx *ctx);
>>>> +extern struct kioctx *ioctx_alloc(unsigned nr_events);
>>>> +extern struct kiocb *aio_get_req(struct kioctx *ctx);
>>>> +extern ssize_t aio_run_iocb(struct kiocb *iocb);
>>>> +extern int __aio_run_iocbs(struct kioctx *ctx);
>>>> +extern int read_events(struct kioctx *ctx,
>>>> +                        long min_nr, long nr,
>>>> +                        struct io_event __user *event,
>>>> +                        struct timespec __user *timeout);
>>>> +extern void io_destroy(struct kioctx *ioctx);
>>>> +extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
>>>> +extern void __put_ioctx(struct kioctx *ctx);
>>>> +
>>>> +static inline void get_ioctx(struct kioctx *kioctx)
>>>> +{
>>>> +        BUG_ON(atomic_read(&kioctx->users)<= 0);
>>>> +        atomic_inc(&kioctx->users);
>>>> +}
>>>> +
>>>> +static inline int try_get_ioctx(struct kioctx *kioctx)
>>>> +{
>>>> +        return atomic_inc_not_zero(&kioctx->users);
>>>> +}
>>>> +
>>>> +static inline void put_ioctx(struct kioctx *kioctx)
>>>> +{
>>>> +        BUG_ON(atomic_read(&kioctx->users)<= 0);
>>>> +        if (unlikely(atomic_dec_and_test(&kioctx->users)))
>>>> +                __put_ioctx(kioctx);
>>>> +}
>>>> +
>>>>   #else
>>>>   static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>>>>   static inline int aio_put_req(struct kiocb *iocb) { return 0; }
>>>> -- 
>>>> 1.7.5.1
>> Other comments will be addressed in V2. Thanks
>>
>> Yuan


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-08-01  8:55         ` Liu Yuan
@ 2011-08-01 10:26           ` Michael S. Tsirkin
  0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2011-08-01 10:26 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Rusty Russell, Avi Kivity, kvm, linux-kernel

On Mon, Aug 01, 2011 at 04:55:54PM +0800, Liu Yuan wrote:
> Nope, the V1 completion thread doesn't have any mutex, thus can run
> in parallel with the vhost worker.

It's true, it doesn't, but I think it's a bug :)
It calls vhost_add_used which definitely needs the
vq mutex.
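
To illustrate the concern: if two threads can both append to the used
ring, the update has to be serialized. The sketch below is a userspace toy
built on pthreads. struct toy_vq, toy_add_used() and run_two_producers()
are hypothetical stand-ins for illustration only, not the kernel's
vhost_virtqueue or vhost_add_used().

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define RING_SIZE  1024
#define PER_THREAD 400

/* Toy stand-in for a virtqueue's used ring (hypothetical, userspace only). */
struct toy_vq {
    pthread_mutex_t mutex;
    unsigned int used_idx;
    unsigned int used[RING_SIZE];
};

/* Append one completed descriptor head; the mutex makes the
 * "store entry, then bump index" pair atomic with respect to other callers. */
static void toy_add_used(struct toy_vq *vq, unsigned int head)
{
    pthread_mutex_lock(&vq->mutex);
    vq->used[vq->used_idx % RING_SIZE] = head;
    vq->used_idx++;
    pthread_mutex_unlock(&vq->mutex);
}

/* Each thread reports PER_THREAD completions. */
static void *completion_thread(void *arg)
{
    struct toy_vq *vq = arg;
    unsigned int i;

    for (i = 0; i < PER_THREAD; i++)
        toy_add_used(vq, i);
    return NULL;
}

/* Run two concurrent producers and return the final used index. */
static unsigned int run_two_producers(void)
{
    struct toy_vq vq;
    pthread_t a, b;
    int ra, rb;

    memset(&vq, 0, sizeof(vq));
    pthread_mutex_init(&vq.mutex, NULL);
    ra = pthread_create(&a, NULL, completion_thread, &vq);
    rb = pthread_create(&b, NULL, completion_thread, &vq);
    assert(ra == 0 && rb == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_mutex_destroy(&vq.mutex);
    return vq.used_idx;
}
```

With the lock held around the update, run_two_producers() always ends with
used_idx equal to 2 * PER_THREAD. Without it, the two threads could
overwrite each other's entries or lose index increments.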

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-07-28 15:22   ` Michael S. Tsirkin
  2011-07-29 15:09     ` Liu Yuan
  2011-08-01  6:25     ` Liu Yuan
@ 2011-08-11 19:59     ` Dongsu Park
  2011-08-12  8:56       ` Alan Cox
  2 siblings, 1 reply; 54+ messages in thread
From: Dongsu Park @ 2011-08-11 19:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Liu Yuan, Rusty Russell, Avi Kivity, kvm, linux-kernel

Hi,

On 07/28/2011 05:22 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:
>> +static struct miscdevice vhost_blk_misc = {
>> +	234,
>
> Don't get a major unless you really must.

The minor number 234 conflicts with that of BTRFS, in kernel 3.0 at least.
Therefore you cannot load vhost_blk.ko if btrfs.ko is already loaded.
That number should probably be something else that does not conflict
with any minor number in include/linux/miscdevice.h.

-- 
Dongsu Park
Email: dongsu.park@profitbricks.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
  2011-08-11 19:59     ` Dongsu Park
@ 2011-08-12  8:56       ` Alan Cox
  0 siblings, 0 replies; 54+ messages in thread
From: Alan Cox @ 2011-08-12  8:56 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Michael S. Tsirkin, Liu Yuan, Rusty Russell, Avi Kivity, kvm,
	linux-kernel

On Thu, 11 Aug 2011 21:59:25 +0200
Dongsu Park <dongsu.park@profitbricks.com> wrote:

> Hi,
> 
> On 07/28/2011 05:22 PM, Michael S. Tsirkin wrote:
> > On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:
> >> +static struct miscdevice vhost_blk_misc = {
> >> +	234,
> >
> > Don't get a major unless you really must.
> 
> The minor number 234 conflicts with that of BTRFS, in kernel 3.0 at least.
> Therefore you cannot load vhost_blk.ko if btrfs.ko is already loaded.
> That number should probably be something else that does not conflict
> with any minor number in include/linux/miscdevice.h.

It should be zero or it should be officially reserved in devices.txt via
lanana@kernel.org, who (for it happens to be me currently) will turn it
down and tell you to use 0 to get a dynamic allocation unless you can
provide a very good reason that isn't suitable.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH] vhost: Enable vhost-blk support
  2011-07-28 14:29 [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Liu Yuan
  2011-07-28 14:29 ` [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk Liu Yuan
@ 2011-07-28 14:29 ` Liu Yuan
  2011-07-28 15:44 ` [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Stefan Hajnoczi
  2 siblings, 0 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-28 14:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, Rusty Russell, Avi Kivity; +Cc: kvm, linux-kernel

From: Liu Yuan <tailai.ly@taobao.com>

vhost-blk is an in-kernel accelerator for the virtio-blk
device. This patch is the QEMU-side counterpart of the
vhost-blk module in the kernel. It sets up vhost-blk and
passes the virtio buffer information to the kernel via
/dev/vhost-blk.

Usage:
$ qemu -drive file=path/to/image,if=virtio,aio=native...

Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
---
 Makefile.target |    2 +-
 hw/vhost_blk.c  |   84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vhost_blk.h  |   44 ++++++++++++++++++++++++++++
 hw/virtio-blk.c |   74 ++++++++++++++++++++++++++++++++++++++----------
 hw/virtio-blk.h |   15 ++++++++++
 hw/virtio-pci.c |   12 ++++++-
 6 files changed, 213 insertions(+), 18 deletions(-)
 create mode 100644 hw/vhost_blk.c
 create mode 100644 hw/vhost_blk.h

diff --git a/Makefile.target b/Makefile.target
index c511010..0f62d7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -198,7 +198,7 @@ obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o vl.o balloon.o
 obj-$(CONFIG_NO_PCI) += pci-stub.o
 obj-$(CONFIG_PCI) += pci.o
 obj-$(CONFIG_VIRTIO) += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o
-obj-y += vhost_net.o
+obj-y += vhost_net.o vhost_blk.o
 obj-$(CONFIG_VHOST_NET) += vhost.o
 obj-$(CONFIG_REALLY_VIRTFS) += 9pfs/virtio-9p-device.o
 obj-y += rwhandler.o
diff --git a/hw/vhost_blk.c b/hw/vhost_blk.c
new file mode 100644
index 0000000..31fb11f
--- /dev/null
+++ b/hw/vhost_blk.c
@@ -0,0 +1,84 @@
+#if 1
+#include <linux/vhost.h>
+#include <linux/kvm.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/virtio_ring.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "vhost.h"
+#include "vhost_blk.h"
+
+struct vhost_blk * vhost_blk_init(void)
+{
+	struct vhost_blk *blk = qemu_mallocz(sizeof *blk);
+	int err;
+
+	err = open("/dev/vhost-blk", O_RDWR);
+	if (err < 0)
+		goto err_open;
+	blk->fd = err;
+	err = vhost_dev_init(&blk->dev, err, 1);
+	if (err < 0)
+		goto err_init;
+
+	blk->dev.vqs = blk->vqs;
+	blk->dev.nvqs = blk_vq_max;
+	return blk;
+err_init:
+	close(blk->fd);
+err_open:
+	perror("vhost_blk_init");
+	qemu_free(blk);
+	return NULL;
+}
+
+typedef struct BDRVRawState {
+    int fd;
+    int type;
+    int open_flags;
+#if defined(__linux__)
+    /* linux floppy specific */
+    int64_t fd_open_time;
+    int64_t fd_error_time;
+    int fd_got_error;
+    int fd_media_changed;
+#endif
+#ifdef CONFIG_LINUX_AIO
+    int use_aio;
+    void *aio_ctx;
+#endif
+    uint8_t *aligned_buf;
+    unsigned aligned_buf_size;
+#ifdef CONFIG_XFS
+    bool is_xfs : 1;
+#endif
+} BDRVRawState;
+
+int vhost_blk_start(struct vhost_blk *blk, VirtIODevice *device)
+{
+	VirtIOBlock *iob = (VirtIOBlock *)device;
+	BDRVRawState *raw = iob->bs->file->opaque;
+	struct vhost_vring_file f = {blk_vq_idx, raw->fd};
+	static int i = 0;
+	int ret;
+
+	ret = vhost_dev_start(&blk->dev, device);
+	if (ret < 0)
+		goto err_start;
+
+	ret = ioctl(blk->fd, VHOST_NET_SET_BACKEND, &f);
+	if (ret <0)
+		goto err_ioctl;
+
+	printf("%s: vhost-blk get started successfully (%d)\n", __func__, i++);
+	return ret;
+
+err_ioctl:
+	vhost_dev_stop(&blk->dev, device);
+err_start:
+	return ret;
+}
+#endif
diff --git a/hw/vhost_blk.h b/hw/vhost_blk.h
new file mode 100644
index 0000000..f437af5
--- /dev/null
+++ b/hw/vhost_blk.h
@@ -0,0 +1,44 @@
+#ifndef VHOST_BLK_H
+#define VHOST_BLK_H
+
+#include <errno.h>
+
+#include "virtio-blk.h"
+#include "vhost.h"
+
+enum {
+        blk_vq_idx = 0,
+        blk_vq_max = 1,
+};
+
+struct vhost_blk {
+        struct vhost_dev dev;
+        struct vhost_virtqueue vqs[blk_vq_max];
+	int fd;
+};
+
+# if 1
+extern struct vhost_blk * vhost_blk_init(void);
+extern int vhost_blk_start(struct vhost_blk *blk, VirtIODevice *device);
+static inline struct vhost_blk * to_vhost_blk(VirtIODevice *device)
+{
+	VirtIOBlock * iob = (VirtIOBlock *)device;
+	return iob->vblk;
+}
+# else
+static inline struct vhost_blk * vhost_blk_init(void)
+{
+	return NULL;
+}
+
+static inline int vhost_blk_start(struct vhost_blk *vblk, VirtIODevice *device)
+{
+	return -1;
+}
+
+static inline struct vhost_blk * to_vhost_blk(VirtIODevice *device)
+{
+	return NULL;
+}
+#endif
+#endif /* VHOST_BLK_H */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 6471ac8..a5f3a27 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -16,23 +16,32 @@
 #include "trace.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
+#include "vhost_blk.h"
 #ifdef __linux__
 # include <scsi/sg.h>
 #endif
 
-typedef struct VirtIOBlock
-{
-    VirtIODevice vdev;
-    BlockDriverState *bs;
-    VirtQueue *vq;
-    void *rq;
-    QEMUBH *bh;
-    BlockConf *conf;
-    char *serial;
-    unsigned short sector_mask;
-    DeviceState *qdev;
-} VirtIOBlock;
-
+typedef struct BDRVRawState {
+    int fd;
+    int type;
+    int open_flags;
+#if defined(__linux__)
+    /* linux floppy specific */
+    int64_t fd_open_time;
+    int64_t fd_error_time;
+    int fd_got_error;
+    int fd_media_changed;
+#endif
+#ifdef CONFIG_LINUX_AIO
+    int use_aio;
+    void *aio_ctx;
+#endif
+    uint8_t *aligned_buf;
+    unsigned aligned_buf_size;
+#ifdef CONFIG_XFS
+    bool is_xfs : 1;
+#endif
+} BDRVRawState;
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
 {
     return (VirtIOBlock *)vdev;
@@ -436,6 +445,29 @@ static void virtio_blk_dma_restart_cb(void *opaque, int running, int reason)
     }
 }
 
+#include <sys/ioctl.h>
+#include <linux/vhost.h>
+static void vhost_blk_reset(VirtIODevice *device)
+{
+	//int err;
+	struct vhost_blk *vblk = to_vhost_blk(device);
+
+	if (!vblk)
+		return;
+
+	if (!vblk->dev.started)
+		return;
+
+	vhost_dev_stop(&vblk->dev, device);
+	if (!ioctl(vblk->fd, VHOST_RESET_OWNER, NULL) &&
+	   !ioctl(vblk->fd, VHOST_SET_OWNER, NULL))
+		vblk->dev.acked_features = 0;
+	else
+		printf("%s %d fd %d\n", __func__, -errno, vblk->fd);
+
+	return;
+}
+
 static void virtio_blk_reset(VirtIODevice *vdev)
 {
     /*
@@ -443,6 +475,7 @@ static void virtio_blk_reset(VirtIODevice *vdev)
      * are per-device request lists.
      */
     qemu_aio_flush();
+    vhost_blk_reset(vdev);
 }
 
 /* coalesce internal state, copy to pci i/o region 0
@@ -482,20 +515,29 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
 
     if (bdrv_enable_write_cache(s->bs))
         features |= (1 << VIRTIO_BLK_F_WCACHE);
-    
+
     if (bdrv_is_read_only(s->bs))
         features |= 1 << VIRTIO_BLK_F_RO;
 
     return features;
 }
 
+static void virtio_blk_set_features(VirtIODevice *vdev, uint32_t val)
+{
+	VirtIOBlock *s = to_virtio_blk(vdev);
+	if (s->vblk) {
+		val &= ~(1 << VIRTIO_BLK_F_WCACHE);
+		s->vblk->dev.acked_features = val;
+	}
+}
+
 static void virtio_blk_save(QEMUFile *f, void *opaque)
 {
     VirtIOBlock *s = opaque;
     VirtIOBlockReq *req = s->rq;
 
     virtio_save(&s->vdev, f);
-    
+
     while (req) {
         qemu_put_sbyte(f, 1);
         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
@@ -567,6 +609,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
 
     s->vdev.get_config = virtio_blk_update_config;
     s->vdev.get_features = virtio_blk_get_features;
+    s->vdev.set_features = virtio_blk_set_features;
     s->vdev.reset = virtio_blk_reset;
     s->bs = conf->bs;
     s->conf = conf;
@@ -587,6 +630,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
 
     add_boot_device_path(conf->bootindex, dev, "/disk@0,0");
 
+    s->vblk = vhost_blk_init();
     return &s->vdev;
 }
 
diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
index 5645d2b..cdaa0ef 100644
--- a/hw/virtio-blk.h
+++ b/hw/virtio-blk.h
@@ -16,6 +16,7 @@
 
 #include "virtio.h"
 #include "block.h"
+#include "blockdev.h"
 
 /* from Linux's linux/virtio_blk.h */
 
@@ -97,6 +98,20 @@ struct virtio_scsi_inhdr
     uint32_t residual;
 };
 
+typedef struct VirtIOBlock
+{
+    VirtIODevice vdev;
+    BlockDriverState *bs;
+    VirtQueue *vq;
+    void *rq;
+    QEMUBH *bh;
+    BlockConf *conf;
+    char *serial;
+    unsigned short sector_mask;
+    DeviceState *qdev;
+    struct vhost_blk *vblk;
+} VirtIOBlock;
+
 #ifdef __linux__
 #define DEFINE_VIRTIO_BLK_FEATURES(_state, _field) \
         DEFINE_VIRTIO_COMMON_FEATURES(_state, _field), \
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index c5bfb62..f653014 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -27,6 +27,8 @@
 #include "kvm.h"
 #include "blockdev.h"
 #include "virtio-pci.h"
+#include "vhost_blk.h"
+#include "vhost.h"
 
 /* from Linux's linux/virtio_pci.h */
 
@@ -162,6 +164,7 @@ static int virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
     VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
     EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
     int r;
+
     if (assign) {
         r = event_notifier_init(notifier, 1);
         if (r < 0) {
@@ -190,7 +193,7 @@ static int virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
         /* Handle the race condition where the guest kicked and we deassigned
          * before we got around to handling the kick.
          */
-        if (event_notifier_test_and_clear(notifier)) {
+        if (proxy->ioeventfd_started && event_notifier_test_and_clear(notifier)) {
             virtio_queue_notify_vq(vq);
         }
 
@@ -337,7 +340,12 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         virtio_set_status(vdev, val & 0xFF);
 
         if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
-            virtio_pci_start_ioeventfd(proxy);
+	    struct vhost_blk *vblk = to_vhost_blk(vdev);
+	    if (vblk) {
+		    if (!vblk->dev.started)
+			vhost_blk_start(to_vhost_blk(vdev), vdev);
+	    } else
+		    virtio_pci_start_ioeventfd(proxy);
         }
 
         if (vdev->status == 0) {
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-28 14:29 [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Liu Yuan
  2011-07-28 14:29 ` [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk Liu Yuan
  2011-07-28 14:29 ` [RFC PATCH] vhost: Enable vhost-blk support Liu Yuan
@ 2011-07-28 15:44 ` Stefan Hajnoczi
  2011-07-29  4:48   ` Stefan Hajnoczi
  2011-07-29  7:22   ` Liu Yuan
  2 siblings, 2 replies; 54+ messages in thread
From: Stefan Hajnoczi @ 2011-07-28 15:44 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty

On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan <namei.unix@gmail.com> wrote:

Did you investigate userspace virtio-blk performance?  If so, what
issues did you find?

I have a hacked up world here that basically implements vhost-blk in userspace:
http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c

 * A dedicated virtqueue thread sleeps on ioeventfd
 * Guest memory is pre-mapped and accessed directly (not using QEMU's
usual memory access functions)
 * Linux AIO is used, the QEMU block layer is bypassed
 * Completion interrupts are injected from the virtqueue thread using ioctl
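
The first item, a worker that blocks on an eventfd which the guest kicks,
can be sketched as below. This is a minimal userspace illustration and not
code from the virtio-blk-data-plane tree: kick(), wait_for_kicks() and
demo_coalesced_kicks() are hypothetical names, and the eventfd here is
created locally rather than registered with KVM as an ioeventfd.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: signal one "kick" on the notifier. */
static int kick(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Hypothetical helper: block until at least one kick is pending, then
 * return how many kicks were folded into this wake-up (0 on error).
 * Reading a default-mode eventfd returns the counter and resets it. */
static uint64_t wait_for_kicks(int efd)
{
    uint64_t pending = 0;

    if (read(efd, &pending, sizeof(pending)) != (ssize_t)sizeof(pending))
        return 0;
    return pending;
}

/* Two kicks arrive before the worker gets to run; the worker then
 * sees a single wake-up that accounts for both of them. */
static uint64_t demo_coalesced_kicks(void)
{
    int efd = eventfd(0, 0);
    uint64_t n;

    assert(efd >= 0);
    if (kick(efd) < 0 || kick(efd) < 0) {
        close(efd);
        return 0;
    }
    n = wait_for_kicks(efd);
    close(efd);
    return n;
}
```

Because an eventfd in its default (non-semaphore) mode accumulates writes
into one counter, several kicks can collapse into a single wake-up. A
worker built this way would therefore drain the whole virtqueue each time
it wakes, rather than assume one request per kick.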

I will try to rebase onto qemu-kvm.git/master (this work is several
months old).  Then we can compare to see how much of the benefit can
be gotten in userspace.

> [performance]
>
>        Currently, the fio benchmarking number is rather promising. The seq read is improved by as much as 16% for throughput and the latency is reduced by up to 14%. For seq write, 13.5% and 13% respectively.
>
> sequential read:
> +-------------+-------------+---------------+---------------+
> | iodepth     | 1           |   2           |   3           |
> +-------------+-------------+---------------+---------------+
> | virtio-blk  | 4116(214)   |   7814(222)   |   8867(306)   |
> +-------------+-------------+---------------+---------------+
> | vhost-blk   | 4755(183)   |   8645(202)   |   10084(266)  |
> +-------------+-------------+---------------+---------------+
>
> 4116(214) means 4116 IOPS with a completion latency of 214 us.
>
> sequential write:
> +-------------+-------------+----------------+--------------+
> | iodepth     |  1          |    2           |  3           |
> +-------------+-------------+----------------+--------------+
> | virtio-blk  | 3848(228)   |   6505(275)    |  9335(291)   |
> +-------------+-------------+----------------+--------------+
> | vhost-blk   | 4370(198)   |   7009(249)    |  9938(264)   |
> +-------------+-------------+----------------+--------------+
>
> the fio command for sequential read:
>
> sudo fio -name iops -readonly -rw=read -runtime=120 -iodepth 1 -filename /dev/vda -ioengine libaio -direct=1 -bs=512
>
> and config file for sequential write is:
>
> dev@taobao:~$ cat rw.fio
> -------------------------
> [test]
>
> rw=rw
> size=200M
> directory=/home/dev/data
> ioengine=libaio
> iodepth=1
> direct=1
> bs=512
> -------------------------

512 byte blocksize is very small, given that you can expect a file
system to have 4 KB or so block sizes.  It would be interesting to
measure a wider range of block sizes: 4 KB, 64 KB, and 128 KB for
example.

Stefan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-28 15:44 ` [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Stefan Hajnoczi
@ 2011-07-29  4:48   ` Stefan Hajnoczi
  2011-07-29  7:59     ` Liu Yuan
  2011-07-29  7:22   ` Liu Yuan
  1 sibling, 1 reply; 54+ messages in thread
From: Stefan Hajnoczi @ 2011-07-29  4:48 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty

On Thu, Jul 28, 2011 at 4:44 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan <namei.unix@gmail.com> wrote:
>
> Did you investigate userspace virtio-blk performance?  If so, what
> issues did you find?
>
> I have a hacked up world here that basically implements vhost-blk in userspace:
> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>
>  * A dedicated virtqueue thread sleeps on ioeventfd
>  * Guest memory is pre-mapped and accessed directly (not using QEMU's
> usual memory access functions)
>  * Linux AIO is used, the QEMU block layer is bypassed
>  * Completion interrupts are injected from the virtqueue thread using ioctl
>
> I will try to rebase onto qemu-kvm.git/master (this work is several
> months old).  Then we can compare to see how much of the benefit can
> be gotten in userspace.

Here is the rebased virtio-blk-data-plane tree:
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/virtio-blk-data-plane

When I run it on my laptop with an Intel X-25M G2 SSD I see a latency
reduction compared to mainline userspace virtio-blk.  I'm not posting
results because I did quick fio runs without ensuring a quiet
benchmarking environment.

There are a couple of things that could be modified:
 * I/O request merging is done to mimic bdrv_aio_multiwrite() - but
vhost-blk does not do this.  Try turning it off?
 * epoll(2) is used but perhaps select(2)/poll(2) have lower latency
for this use case.  Try another event mechanism.

Let's see how it compares to vhost-blk first.  I can tweak it if we
want to investigate further.

Yuan: Do you want to try the virtio-blk-data-plane tree?  You don't
need to change the qemu-kvm command-line options.

Stefan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29  4:48   ` Stefan Hajnoczi
@ 2011-07-29  7:59     ` Liu Yuan
  2011-07-29 10:55       ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-07-29  7:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty

Hi
On 07/29/2011 12:48 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 28, 2011 at 4:44 PM, Stefan Hajnoczi<stefanha@gmail.com>  wrote:
>> On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan<namei.unix@gmail.com>  wrote:
>>
>> Did you investigate userspace virtio-blk performance?  If so, what
>> issues did you find?
>>
>> I have a hacked up world here that basically implements vhost-blk in userspace:
>> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>>
>>   * A dedicated virtqueue thread sleeps on ioeventfd
>>   * Guest memory is pre-mapped and accessed directly (not using QEMU's
>> usual memory access functions)
>>   * Linux AIO is used, the QEMU block layer is bypassed
>>   * Completion interrupts are injected from the virtqueue thread using ioctl
>>
>> I will try to rebase onto qemu-kvm.git/master (this work is several
>> months old).  Then we can compare to see how much of the benefit can
>> be gotten in userspace.
> Here is the rebased virtio-blk-data-plane tree:
> http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/virtio-blk-data-plane
>
> When I run it on my laptop with an Intel X-25M G2 SSD I see a latency
> reduction compared to mainline userspace virtio-blk.  I'm not posting
> results because I did quick fio runs without ensuring a quiet
> benchmarking environment.
>
> There are a couple of things that could be modified:
>   * I/O request merging is done to mimic bdrv_aio_multiwrite() - but
> vhost-blk does not do this.  Try turning it off?

I noticed that bdrv_aio_multiwrite() does the merging, but I am not sure
this trick is really needed, since the I/O scheduler further down the path
is in a much better position to merge requests. I think the duplicate,
*premature* merging in bdrv_aio_multiwrite() is a result of laio_submit()
not submitting requests in batches. As io_submit() in fs/aio.c shows, every
call to laio_submit() puts that single request into the driver's request
queue, which is run at blk_finish_plug(). IMHO, you can simply batch the
io_submit() requests instead of using this trick if you already bypass the
QEMU block layer.
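
As a rough illustration of the batching suggested above, several requests
can be handed to the kernel in one io_submit() call. The sketch below uses
the raw Linux AIO system calls. batched_write_demo(), the sys_* wrappers,
NR_REQS and REQ_SIZE are made-up names for this example, and it writes to a
temporary file with buffered I/O rather than to a guest image with O_DIRECT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/aio_abi.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define NR_REQS  4
#define REQ_SIZE 512

/* Thin wrappers around the raw Linux AIO system calls (no libaio needed). */
static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
    return syscall(SYS_io_submit, ctx, nr, iocbpp);
}

static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *ts)
{
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, ts);
}

static long sys_io_destroy(aio_context_t ctx)
{
    return syscall(SYS_io_destroy, ctx);
}

/* Queue NR_REQS writes with a single io_submit() call, reap them, and
 * return how many completed with a full-size write (-1 on setup failure). */
static int batched_write_demo(void)
{
    char path[] = "/tmp/aio-batch-XXXXXX";
    static char buf[NR_REQS][REQ_SIZE];
    struct iocb cbs[NR_REQS];
    struct iocb *cbp[NR_REQS];
    struct io_event events[NR_REQS];
    aio_context_t ctx = 0;
    long submitted, done = 0;
    int fd, i, ok = 0;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (sys_io_setup(NR_REQS, &ctx) < 0) {
        close(fd);
        return -1;
    }

    /* Prepare all control blocks first ... */
    for (i = 0; i < NR_REQS; i++) {
        memset(buf[i], 'a' + i, REQ_SIZE);
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_lio_opcode = IOCB_CMD_PWRITE;
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = (__u64)(unsigned long)buf[i];
        cbs[i].aio_nbytes = REQ_SIZE;
        cbs[i].aio_offset = (long long)i * REQ_SIZE;
        cbs[i].aio_data = i;
        cbp[i] = &cbs[i];
    }

    /* ... then hand the whole batch to the kernel in one system call. */
    submitted = sys_io_submit(ctx, NR_REQS, cbp);
    if (submitted < 0)
        submitted = 0;

    while (done < submitted) {
        long n = sys_io_getevents(ctx, 1, submitted - done,
                                  events + done, NULL);
        if (n <= 0)
            break;
        done += n;
    }
    assert(done <= NR_REQS);

    for (i = 0; i < done; i++)
        if (events[i].res == REQ_SIZE)
            ok++;

    sys_io_destroy(ctx);
    close(fd);
    return ok;
}
```

The point of the design is that the cost of entering the kernel is paid
once per batch instead of once per request; whether that also replaces the
merging done in bdrv_aio_multiwrite() would have to be measured.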

>   * epoll(2) is used but perhaps select(2)/poll(2) have lower latency
> for this use case.  Try another event mechanism.
>
> Let's see how it compares to vhost-blk first.  I can tweak it if we
> want to investigate further.
>
> Yuan: Do you want to try the virtio-blk-data-plane tree?  You don't
> need to change the qemu-kvm command-line options.
>
> Stefan
Yes, please, it sounds interesting. BTW, I think user space would achieve
the same performance gain as the current vhost-blk implementation, which
uses Linux AIO, if you bypass the QEMU I/O layer all the way down to the
system calls in the request handling cycle. But hey, I would go further and
optimise vhost-blk with the block layer and other kernel resources in
mind. ;) And I don't add complexity to the current QEMU I/O layer.

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29  7:59     ` Liu Yuan
@ 2011-07-29 10:55       ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2011-07-29 10:55 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh, Badari Pulavarty

On Fri, Jul 29, 2011 at 03:59:53PM +0800, Liu Yuan wrote:
> I noticed that bdrv_aio_multiwrite() does the merging, but I am not sure

Just like I/O schedulers it's actually fairly harmful on high IOPS,
low latency devices.  I've just started doing a lot of qemu benchmarks,
and disabling that multiwrite mess alone gives fairly nice speedups.

The major issue seems to be additional memory allocations and cache
lines - a problem that actually is fairly inherent all over the qemu
code.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-28 15:44 ` [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Stefan Hajnoczi
  2011-07-29  4:48   ` Stefan Hajnoczi
@ 2011-07-29  7:22   ` Liu Yuan
  2011-07-29  9:06     ` Stefan Hajnoczi
  2011-07-29 18:12     ` Badari Pulavarty
  1 sibling, 2 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-29  7:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty

Hi Stefan
On 07/28/2011 11:44 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan<namei.unix@gmail.com>  wrote:
>
> Did you investigate userspace virtio-blk performance?  If so, what
> issues did you find?
>

Yes, in the performance table I presented, virtio-blk in user space lags
behind vhost-blk in the kernel (although this prototype is a very primitive
implementation) by about 15%.
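
For reference, the percentages can be recomputed from the sequential read
table quoted further down. The sketch below is only a worked calculation:
gain_pct(), drop_pct() and print_seq_read_summary() are names made up for
this example, and the numbers are copied from that table. At iodepth 1 it
gives about 15.5% more IOPS and about 14.5% lower latency, with smaller
gains at iodepth 2 and 3.

```c
#include <assert.h>
#include <stdio.h>

/* Relative gain in percent for a "higher is better" metric such as IOPS. */
static double gain_pct(double base, double now)
{
    assert(base > 0);
    return (now - base) / base * 100.0;
}

/* Relative reduction in percent for a "lower is better" metric such as
 * completion latency. */
static double drop_pct(double base, double now)
{
    assert(base > 0);
    return (base - now) / base * 100.0;
}

/* One column of the sequential read table: IOPS and latency in us. */
struct sample {
    int iodepth;
    double virtio_iops, virtio_lat_us;
    double vhost_iops, vhost_lat_us;
};

static const struct sample seq_read[] = {
    { 1, 4116, 214,  4755, 183 },
    { 2, 7814, 222,  8645, 202 },
    { 3, 8867, 306, 10084, 266 },
};

/* Print the vhost-blk improvement over virtio-blk for each iodepth. */
static void print_seq_read_summary(void)
{
    size_t i;

    for (i = 0; i < sizeof(seq_read) / sizeof(seq_read[0]); i++) {
        const struct sample *s = &seq_read[i];

        printf("iodepth %d: IOPS +%.1f%%, latency -%.1f%%\n",
               s->iodepth,
               gain_pct(s->virtio_iops, s->vhost_iops),
               drop_pct(s->virtio_lat_us, s->vhost_lat_us));
    }
}
```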

Actually, the motivation for starting vhost-blk is that, in our observation,
KVM (virtio enabled) on RHEL 6 is worse than Xen (PV) on RHEL in terms of
disk I/O, especially for sequential read/write (a gap of around 20%).

We'll deploy a large number of KVM-based systems as the infrastructure
of some service, and this gap is really unpleasant.

By design, IMHO, virtio performance is supposed to be comparable to the
paravirtualization solution, if not better, because for KVM the guest and
the backend driver can sit in the same address space via mmap. This
reduces the overhead of page table modifications and thus speeds up
buffer management and transfer a lot compared with Xen PV.

I am not really qualified to talk about QEMU, but I think the surprising
performance improvement from this very primitive vhost-blk simply shows
that the internal structure of the QEMU I/O path is way too bloated. I say
*surprising* because vhost basically just reduces the number of system
calls, which chip manufacturers have tuned heavily for years. So I guess
the performance gain of vhost-blk can mainly be attributed to its *shorter
and simpler* code path.

Anyway, IMHO, compared with the user space approach, the in-kernel one
allows more flexibility and better integration with the kernel I/O stack,
since we don't need two I/O stacks for the guest OS.

> I have a hacked up world here that basically implements vhost-blk in userspace:
> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>
>   * A dedicated virtqueue thread sleeps on ioeventfd
>   * Guest memory is pre-mapped and accessed directly (not using QEMU's
> usual memory access functions)
>   * Linux AIO is used, the QEMU block layer is bypassed
>   * Completion interrupts are injected from the virtqueue thread using ioctl
>
> I will try to rebase onto qemu-kvm.git/master (this work is several
> months old).  Then we can compare to see how much of the benefit can
> be gotten in userspace.
>
I don't quite follow what you mean by vhost-blk in user space, since the
vhost infrastructure itself means an accelerator implemented in the
kernel. I guess what you mean is a rewrite of virtio-blk in user space
with a dedicated thread handling requests and a shorter code path,
similar to vhost-blk.

>> [performance]
>>
>>         Currently, the fio benchmarking number is rather promising. The seq read is improved by as much as 16% for throughput and the latency is reduced by up to 14%. For seq write, 13.5% and 13% respectively.
>>
>> sequential read:
>> +-------------+-------------+---------------+---------------+
>> | iodepth     | 1           |   2           |   3           |
>> +-------------+-------------+---------------+---------------+
>> | virtio-blk  | 4116(214)   |   7814(222)   |   8867(306)   |
>> +-------------+-------------+---------------+---------------+
>> | vhost-blk   | 4755(183)   |   8645(202)   |   10084(266)  |
>> +-------------+-------------+---------------+---------------+
>>
>> 4116(214) means 4116 IOPS with a completion latency of 214 us.
>>
>> sequential write:
>> +-------------+-------------+----------------+--------------+
>> | iodepth     |  1          |    2           |  3           |
>> +-------------+-------------+----------------+--------------+
>> | virtio-blk  | 3848(228)   |   6505(275)    |  9335(291)   |
>> +-------------+-------------+----------------+--------------+
>> | vhost-blk   | 4370(198)   |   7009(249)    |  9938(264)   |
>> +-------------+-------------+----------------+--------------+
>>
>> the fio command for sequential read:
>>
>> sudo fio -name iops -readonly -rw=read -runtime=120 -iodepth 1 -filename /dev/vda -ioengine libaio -direct=1 -bs=512
>>
>> and config file for sequential write is:
>>
>> dev@taobao:~$ cat rw.fio
>> -------------------------
>> [test]
>>
>> rw=rw
>> size=200M
>> directory=/home/dev/data
>> ioengine=libaio
>> iodepth=1
>> direct=1
>> bs=512
>> -------------------------
> 512 byte blocksize is very small, given that you can expect a file
> system to have 4 KB or so block sizes.  It would be interesting to
> measure a wider range of block sizes: 4 KB, 64 KB, and 128 KB for
> example.
>
> Stefan
Actually, I have tested 4KB, and it shows the same improvement. What I care
about more is iodepth, since batched AIO would help there. But my laptop's
SATA disk doesn't behave as well as it advertises: it says its NCQ queue
depth is 32, and the kernel tells me it supports 31 requests in one go.
When I increase iodepth in the test up to 4, the IOPS of both the host and
the guest drop drastically.

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29  7:22   ` Liu Yuan
@ 2011-07-29  9:06     ` Stefan Hajnoczi
  2011-07-29 12:01       ` Liu Yuan
  2011-07-29 18:12     ` Badari Pulavarty
  1 sibling, 1 reply; 54+ messages in thread
From: Stefan Hajnoczi @ 2011-07-29  9:06 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty

On Fri, Jul 29, 2011 at 8:22 AM, Liu Yuan <namei.unix@gmail.com> wrote:
> Hi Stefan
> On 07/28/2011 11:44 PM, Stefan Hajnoczi wrote:
>>
>> On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan<namei.unix@gmail.com>  wrote:
>>
>> Did you investigate userspace virtio-blk performance?  If so, what
>> issues did you find?
>>
>
> Yes, in the performance table I presented, virtio-blk in user space lags
> behind vhost-blk in the kernel (although this prototype is a very primitive
> implementation) by about 15%.

I mean did you investigate *why* userspace virtio-blk has higher
latency?  Did you profile it and drill down on its performance?

It's important to understand what is going on before replacing it with
another mechanism.  What I'm saying is, if I have a buggy program I
can sometimes rewrite it from scratch correctly but that doesn't tell
me what the bug was.

Perhaps the inefficiencies in userspace virtio-blk can be solved by
adjusting the code (removing inefficient notification mechanisms,
introducing a dedicated thread outside of the QEMU iothread model,
etc).  Then we'd get the performance benefit for non-raw images and
perhaps non-virtio and non-Linux host platforms too.

> Actually, the motivation to start vhost-blk is that, in our observation,
> KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO
> perspective, especially for sequential read/write (around 20% gap).
>
> We'll deploy a large number of KVM-based systems as the infrastructure of
> some service and this gap is really unpleasant.
>
> By the design, IMHO, virtio performance is supposed to be comparable to the
> para-vulgarization solution if not better, because for KVM, guest and
> backend driver could sit in the same address space via mmaping. This would
> reduce the overhead involved in page table modification, thus speed up the
> buffer management and transfer a lot compared with Xen PV.

Yes, guest memory is just a region of QEMU userspace memory.  So it's
easy to reach inside and there are no page table tricks or copying
involved.

> I am not in a qualified  position to talk about QEMU , but I think the
> surprised performance improvement by this very primitive vhost-blk simply
> manifest that, the internal structure for qemu io is the way bloated. I say
> it *surprised* because basically vhost just reduces the number of system
> calls, which is heavily tuned by chip manufacture for years. So, I guess the
> performance number vhost-blk gains mainly could possibly be contributed to
> *shorter and simpler* code path.

First we need to understand exactly what the latency overhead is.  If
we discover that it's simply not possible to do this equally well in
userspace, then it makes perfect sense to use vhost-blk.

So let's gather evidence and learn what the overheads really are.
Last year I spent time looking at virtio-blk latency:
http://www.linux-kvm.org/page/Virtio/Block/Latency

See especially this diagram:
http://www.linux-kvm.org/page/Image:Threads.png

The goal wasn't specifically to reduce synchronous sequential I/O,
instead the aim was to reduce overheads for a variety of scenarios,
especially multithreaded workloads.

In most cases it was helpful to move I/O submission out of the vcpu
thread by using the ioeventfd model just like vhost.  Ioeventfd for
userspace virtio-blk is now on by default in qemu-kvm.

Try running the userspace virtio-blk benchmark with -drive
if=none,id=drive0,file=... -device
virtio-blk-pci,drive=drive0,ioeventfd=off.  This causes QEMU to do I/O
submission in the vcpu thread, which might reduce latency at the cost
of stealing guest time.
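
To run both variants back to back, the two argument sets can be generated from one helper. This is a minimal sketch: `virtio_blk_args` and the image path are hypothetical names used for illustration, and only the -drive/-device options quoted above come from this thread.

```python
def virtio_blk_args(image_path, ioeventfd=True, drive_id="drive0"):
    """Build the -drive/-device arguments for a userspace virtio-blk disk.

    `virtio_blk_args` and `image_path` are illustrative names, not part of
    QEMU; the option strings mirror the command line quoted above.
    """
    state = "on" if ioeventfd else "off"
    return [
        "-drive", f"if=none,id={drive_id},file={image_path}",
        "-device", f"virtio-blk-pci,drive={drive_id},ioeventfd={state}",
    ]


if __name__ == "__main__":
    # Print both configurations so they can be pasted into a qemu-kvm command.
    for flag in (True, False):
        print(" ".join(virtio_blk_args("/path/to/disk.img", ioeventfd=flag)))
```

The rest of the qemu-kvm command line (memory, CPU, other devices) is left out on purpose and has to be kept identical between the two runs.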

> Anyway, IMHO, compared with user space approach, the in-kernel one would
> allow more flexibility and better integration with the kernel IO stack,
> since we don't need two IO stacks for guest OS.

I agree that there may be advantages to integrating with in-kernel I/O
mechanisms.  An interesting step would be to implement the
submit_bio() approach that Christoph suggested and see if that
improves things further.

Push virtio-blk as far as you can and let's see what the performance is!

>> I have a hacked up world here that basically implements vhost-blk in
>> userspace:
>>
>> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>>
>>  * A dedicated virtqueue thread sleeps on ioeventfd
>>  * Guest memory is pre-mapped and accessed directly (not using QEMU's
>> usually memory access functions)
>>  * Linux AIO is used, the QEMU block layer is bypassed
>>  * Completion interrupts are injected from the virtqueue thread using
>> ioctl
>>
>> I will try to rebase onto qemu-kvm.git/master (this work is several
>> months old).  Then we can compare to see how much of the benefit can
>> be gotten in userspace.
>>
> I don't really get you about vhost-blk in user space since vhost
> infrastructure itself means an in-kernel accelerator that implemented in
> kernel . I guess what you meant is somewhat a re-write of virtio-blk in user
> space with a dedicated thread handling requests, and shorter code path
> similar to vhost-blk.

Right - it's the same model as vhost: a dedicated thread listening for
ioeventfd virtqueue kicks and processing them out-of-line with the
guest and userspace QEMU's traditional vcpu and iothread.
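
The model can be illustrated in a few lines. This is only a sketch of the threading pattern, not of the data-plane code: a pipe stands in for ioeventfd (an assumption made for portability), and `KickedWorker` and its methods are hypothetical names.

```python
import os
import threading
from collections import deque


class KickedWorker:
    """Dedicated thread that sleeps on a notification fd, like a virtqueue thread.

    A pipe stands in for ioeventfd here (an assumption made for portability);
    the class and method names are hypothetical.
    """

    def __init__(self, handler):
        self._rfd, self._wfd = os.pipe()
        self._pending = deque()
        self._lock = threading.Lock()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def kick(self, request):
        """Queue a request and wake the worker, as a guest kick would."""
        with self._lock:
            self._pending.append(request)
        os.write(self._wfd, b"k")

    def stop(self):
        """Ask the worker to exit after draining, then wait for it."""
        os.write(self._wfd, b"q")
        self._thread.join()
        os.close(self._rfd)
        os.close(self._wfd)

    def _run(self):
        while True:
            data = os.read(self._rfd, 64)   # blocks until at least one kick
            with self._lock:
                batch = list(self._pending)
                self._pending.clear()
            for request in batch:           # handled off the caller's thread
                self._handler(request)
            if b"q" in data:
                return


if __name__ == "__main__":
    done = []
    worker = KickedWorker(done.append)
    for sector in (0, 8, 16):
        worker.kick(sector)
    worker.stop()
    print(done)
```

Several kicks that arrive while the worker is busy are drained in one batch on the next wakeup, which is the property that keeps the submitting thread from waiting on request processing.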

>>> [performance]
>>>
>>>        Currently, the fio benchmarking number is rather promising. The
>>> seq read is imporved as much as 16% for throughput and the latency is
>>> dropped up to 14%. For seq write, 13.5% and 13% respectively.
>>>
>>> sequential read:
>>> +-------------+-------------+---------------+---------------+
>>> | iodepth     | 1           |   2           |   3           |
>>> +-------------+-------------+---------------+----------------
>>> | virtio-blk  | 4116(214)   |   7814(222)   |   8867(306)   |
>>> +-------------+-------------+---------------+---------------+
>>> | vhost-blk   | 4755(183)   |   8645(202)   |   10084(266)  |
>>> +-------------+-------------+---------------+---------------+
>>>
>>> 4116(214) means 4116 IOPS/s, the it is completion latency is 214 us.
>>>
>>> seqeuential write:
>>> +-------------+-------------+----------------+--------------+
>>> | iodepth     |  1          |    2           |  3           |
>>> +-------------+-------------+----------------+--------------+
>>> | virtio-blk  | 3848(228)   |   6505(275)    |  9335(291)   |
>>> +-------------+-------------+----------------+--------------+
>>> | vhost-blk   | 4370(198)   |   7009(249)    |  9938(264)   |
>>> +-------------+-------------+----------------+--------------+
>>>
>>> the fio command for sequential read:
>>>
>>> sudo fio -name iops -readonly -rw=read -runtime=120 -iodepth 1 -filename
>>> /dev/vda -ioengine libaio -direct=1 -bs=512
>>>
>>> and config file for sequential write is:
>>>
>>> dev@taobao:~$ cat rw.fio
>>> -------------------------
>>> [test]
>>>
>>> rw=rw
>>> size=200M
>>> directory=/home/dev/data
>>> ioengine=libaio
>>> iodepth=1
>>> direct=1
>>> bs=512
>>> -------------------------
>>
>> 512 byte blocksize is very small, given that you can expect a file
>> system to have 4 KB or so block sizes.  It would be interesting to
>> measure a wider range of block sizes: 4 KB, 64 KB, and 128 KB for
>> example.
>>
>> Stefan
>
> Actually, I have tested 4KB, it shows the same improvement. What I care more
> is iodepth, since batched AIO would benefit it.But my laptop SATA doesn't
> behave well as it advertises: it says its NCQ queue depth is 32 and kernel
> tells me it support 31 requests in one go. When increase iodepth in the test
> up to 4, both the host and guest' IOPS drops drastically.

When you say "IOPS drops drastically" do you mean that it gets worse
than with queue-depth=1?

I hope that others are interested in running the benchmarks on their
systems so we can try out a range of storage devices.
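
For anyone willing to try, the sequential read runs can be enumerated with a small script. A minimal sketch, assuming the guest disk is /dev/vda as in the original command; `fio_seq_read_cmds` is a hypothetical helper, and the block sizes and queue depths are the ones mentioned in this thread.

```python
from itertools import product


def fio_seq_read_cmds(device="/dev/vda",
                      block_sizes=("512", "4k", "64k", "128k"),
                      iodepths=(1, 2, 3, 4),
                      runtime=120):
    """Yield one fio argument list per (block size, iodepth) combination."""
    for bs, depth in product(block_sizes, iodepths):
        yield [
            "fio", "--name=iops", "--readonly", "--rw=read",
            f"--runtime={runtime}", f"--iodepth={depth}",
            f"--filename={device}", "--ioengine=libaio",
            "--direct=1", f"--bs={bs}",
        ]


if __name__ == "__main__":
    # Print the commands; run them by hand or via subprocess as root.
    for cmd in fio_seq_read_cmds():
        print(" ".join(cmd))
```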

Stefan


* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29  9:06     ` Stefan Hajnoczi
@ 2011-07-29 12:01       ` Liu Yuan
  2011-07-29 12:29         ` Stefan Hajnoczi
  2011-07-29 15:25         ` Sasha Levin
  0 siblings, 2 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-29 12:01 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty, Christoph Hellwig

On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote:
> I mean did you investigate *why* userspace virtio-blk has higher
> latency?  Did you profile it and drill down on its performance?
>
> It's important to understand what is going on before replacing it with
> another mechanism.  What I'm saying is, if I have a buggy program I
> can sometimes rewrite it from scratch correctly but that doesn't tell
> me what the bug was.
>
> Perhaps the inefficiencies in userspace virtio-blk can be solved by
> adjusting the code (removing inefficient notification mechanisms,
> introducing a dedicated thread outside of the QEMU iothread model,
> etc).  Then we'd get the performance benefit for non-raw images and
> perhaps non-virtio and non-Linux host platforms too.
>

As Christoph mentioned, the unnecessary memory allocation and too many
cache-line-unfriendly function pointers might be the culprit. For
example, the read request code path for Linux AIO would be

    qemu_iohandler_poll->virtio_pci_host_notifier_read
    ->virtio_queue_notify_vq->virtio_blk_handle_output
    ->virtio_blk_handle_read->bdrv_aio_read->raw_aio_readv
    ->bdrv_aio_readv (yes, called again, nested!)
    ->raw_aio_readv->laio_submit->io_submit...

Looking at this long list, most are function pointers that cannot be
inlined, and the internal data structures used by these functions number
in the dozens. Leaving aside code complexity, this long code path really
needs a retrofit. As Christoph simply put it, this kind of mess is
inherent all over the qemu code. So I am afraid the 'retrofit' would end
up being a rewrite of the entire (sub)system. I have to admit that I am
inclined toward MST's vhost approach, that is, writing a new subsystem
rather than tedious profiling and fixing, which would possibly go as far
as actually rewriting it.

>> Actually, the motivation to start vhost-blk is that, in our observation,
>> KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO
>> perspective, especially for sequential read/write (around 20% gap).
>>
>> We'll deploy a large number of KVM-based systems as the infrastructure of
>> some service and this gap is really unpleasant.
>>
>> By the design, IMHO, virtio performance is supposed to be comparable to the
>> para-vulgarization solution if not better, because for KVM, guest and
>> backend driver could sit in the same address space via mmaping. This would
>> reduce the overhead involved in page table modification, thus speed up the
>> buffer management and transfer a lot compared with Xen PV.
> Yes, guest memory is just a region of QEMU userspace memory.  So it's
> easy to reach inside and there are no page table tricks or copying
> involved.
>
>> I am not in a qualified  position to talk about QEMU , but I think the
>> surprised performance improvement by this very primitive vhost-blk simply
>> manifest that, the internal structure for qemu io is the way bloated. I say
>> it *surprised* because basically vhost just reduces the number of system
>> calls, which is heavily tuned by chip manufacture for years. So, I guess the
>> performance number vhost-blk gains mainly could possibly be contributed to
>> *shorter and simpler* code path.
> First we need to understand exactly what the latency overhead is.  If
> we discover that it's simply not possible to do this equally well in
> userspace, then it makes perfect sense to use vhost-blk.
>
> So let's gather evidence and learn what the overheads really are.
> Last year I spent time looking at virtio-blk latency:
> http://www.linux-kvm.org/page/Virtio/Block/Latency
>

Nice stuff.

> See especially this diagram:
> http://www.linux-kvm.org/page/Image:Threads.png
>
> The goal wasn't specifically to reduce synchronous sequential I/O,
> instead the aim was to reduce overheads for a variety of scenarios,
> especially multithreaded workloads.
>
> In most cases it was helpful to move I/O submission out of the vcpu
> thread by using the ioeventfd model just like vhost.  Ioeventfd for
> userspace virtio-blk is now on by default in qemu-kvm.
>
> Try running the userspace virtio-blk benchmark with -drive
> if=none,id=drive0,file=... -device
> virtio-blk-pci,drive=drive0,ioeventfd=off.  This causes QEMU to do I/O
> submission in the vcpu thread, which might reduce latency at the cost
> of stealing guest time.
>
>> Anyway, IMHO, compared with user space approach, the in-kernel one would
>> allow more flexibility and better integration with the kernel IO stack,
>> since we don't need two IO stacks for guest OS.
> I agree that there may be advantages to integrating with in-kernel I/O
> mechanisms.  An interesting step would be to implement the
> submit_bio() approach that Christoph suggested and seeing if that
> improves things further.
>
> Push virtio-blk as far as you can and let's see what the performance is!
>
>>> I have a hacked up world here that basically implements vhost-blk in
>>> userspace:
>>>
>>> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>>>
>>>   * A dedicated virtqueue thread sleeps on ioeventfd
>>>   * Guest memory is pre-mapped and accessed directly (not using QEMU's
>>> usually memory access functions)
>>>   * Linux AIO is used, the QEMU block layer is bypassed
>>>   * Completion interrupts are injected from the virtqueue thread using
>>> ioctl
>>>
>>> I will try to rebase onto qemu-kvm.git/master (this work is several
>>> months old).  Then we can compare to see how much of the benefit can
>>> be gotten in userspace.
>>>
>> I don't really get you about vhost-blk in user space since vhost
>> infrastructure itself means an in-kernel accelerator that implemented in
>> kernel . I guess what you meant is somewhat a re-write of virtio-blk in user
>> space with a dedicated thread handling requests, and shorter code path
>> similar to vhost-blk.
> Right - it's the same model as vhost: a dedicated thread listening for
> ioeventfd virtqueue kicks and processing them out-of-line with the
> guest and userspace QEMU's traditional vcpu and iothread.
>
> When you say "IOPS drops drastically" do you mean that it gets worse
> than with queue-depth=1?
>

Yes, on my laptop, when iodepth = 3, IOPS on my host drops from 13K to
about 3,500! The same happens at iodepth = 4 in my guest during the fio
seq read test. This should never happen.

I think something is wrong with the SATA disk on my laptop that I cannot
explain. If not, the only cause I can imagine is that the NCQ depth is 2
on my disk, and when the kernel submits more requests than that, it
causes severe scheduling overhead. Anyway, this is unrelated to vhost-blk
and would not be seen by others.

> I hope that others are interested in running the benchmarks on their
> systems so we can try out a range of storage devices.
>
> Stefan

Yuan


* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29 12:01       ` Liu Yuan
@ 2011-07-29 12:29         ` Stefan Hajnoczi
  2011-07-29 12:50           ` Stefan Hajnoczi
  2011-07-29 15:25         ` Sasha Levin
  1 sibling, 1 reply; 54+ messages in thread
From: Stefan Hajnoczi @ 2011-07-29 12:29 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty, Christoph Hellwig

On Fri, Jul 29, 2011 at 1:01 PM, Liu Yuan <namei.unix@gmail.com> wrote:
> On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote:
>>
>> I mean did you investigate *why* userspace virtio-blk has higher
>> latency?  Did you profile it and drill down on its performance?
>>
>> It's important to understand what is going on before replacing it with
>> another mechanism.  What I'm saying is, if I have a buggy program I
>> can sometimes rewrite it from scratch correctly but that doesn't tell
>> me what the bug was.
>>
>> Perhaps the inefficiencies in userspace virtio-blk can be solved by
>> adjusting the code (removing inefficient notification mechanisms,
>> introducing a dedicated thread outside of the QEMU iothread model,
>> etc).  Then we'd get the performance benefit for non-raw images and
>> perhaps non-virtio and non-Linux host platforms too.
>>
>
> As Christoph mentioned, the unnecessary memory allocation and too much cache
> line unfriendly
> function pointers might be culprit. For example, the read quests code path
> for linux aio would be
>
>
>  qemu_iohandler_poll->virtio_pci_host_notifier_read->virtio_queue_notify_vq->virtio_blk_handle_output
> ->virtio_blk_handle_read->bdrv_aio_read->raw_aio_readv->bdrv_aio_readv(Yes
> again nested called!)->raw_aio_readv->laio_submit->io_submit...
>
> Looking at this long list,most are function pointers that can not be
> inlined, and the internal data structures used by these functions are
> dozons. Leave aside code complexity, this long code path would really need
> retrofit. As Christoph simply put, this kind of mess is inherent all over
> the qemu code. So I am afraid, the 'retrofit'  would end up to be a re-write
> the entire (sub)system. I have to admit that, I am inclined to the MST's
> vhost approach, that write a new subsystem other than tedious profiling and
> fixing, that would possibly goes as far as actually re-writing it.

I'm totally for vhost-blk if there are unique benefits that make it
worth maintaining.  But better benchmark results are not a cause, they
are an effect.  So the thing to do is to drill down on both vhost-blk
and userspace virtio-blk to understand what causes overheads.
Evidence showing that userspace can never compete is needed to justify
vhost-blk IMO.

>>> Actually, the motivation to start vhost-blk is that, in our observation,
>>> KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO
>>> perspective, especially for sequential read/write (around 20% gap).
>>>
>>> We'll deploy a large number of KVM-based systems as the infrastructure of
>>> some service and this gap is really unpleasant.
>>>
>>> By the design, IMHO, virtio performance is supposed to be comparable to
>>> the
>>> para-vulgarization solution if not better, because for KVM, guest and
>>> backend driver could sit in the same address space via mmaping. This
>>> would
>>> reduce the overhead involved in page table modification, thus speed up
>>> the
>>> buffer management and transfer a lot compared with Xen PV.
>>
>> Yes, guest memory is just a region of QEMU userspace memory.  So it's
>> easy to reach inside and there are no page table tricks or copying
>> involved.
>>
>>> I am not in a qualified  position to talk about QEMU , but I think the
>>> surprised performance improvement by this very primitive vhost-blk simply
>>> manifest that, the internal structure for qemu io is the way bloated. I
>>> say
>>> it *surprised* because basically vhost just reduces the number of system
>>> calls, which is heavily tuned by chip manufacture for years. So, I guess
>>> the
>>> performance number vhost-blk gains mainly could possibly be contributed
>>> to
>>> *shorter and simpler* code path.
>>
>> First we need to understand exactly what the latency overhead is.  If
>> we discover that it's simply not possible to do this equally well in
>> userspace, then it makes perfect sense to use vhost-blk.
>>
>> So let's gather evidence and learn what the overheads really are.
>> Last year I spent time looking at virtio-blk latency:
>> http://www.linux-kvm.org/page/Virtio/Block/Latency
>>
>
> Nice stuff.
>
>> See especially this diagram:
>> http://www.linux-kvm.org/page/Image:Threads.png
>>
>> The goal wasn't specifically to reduce synchronous sequential I/O,
>> instead the aim was to reduce overheads for a variety of scenarios,
>> especially multithreaded workloads.
>>
>> In most cases it was helpful to move I/O submission out of the vcpu
>> thread by using the ioeventfd model just like vhost.  Ioeventfd for
>> userspace virtio-blk is now on by default in qemu-kvm.
>>
>> Try running the userspace virtio-blk benchmark with -drive
>> if=none,id=drive0,file=... -device
>> virtio-blk-pci,drive=drive0,ioeventfd=off.  This causes QEMU to do I/O
>> submission in the vcpu thread, which might reduce latency at the cost
>> of stealing guest time.
>>
>>> Anyway, IMHO, compared with user space approach, the in-kernel one would
>>> allow more flexibility and better integration with the kernel IO stack,
>>> since we don't need two IO stacks for guest OS.
>>
>> I agree that there may be advantages to integrating with in-kernel I/O
>> mechanisms.  An interesting step would be to implement the
>> submit_bio() approach that Christoph suggested and seeing if that
>> improves things further.
>>
>> Push virtio-blk as far as you can and let's see what the performance is!
>>
>>>> I have a hacked up world here that basically implements vhost-blk in
>>>> userspace:
>>>>
>>>>
>>>> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>>>>
>>>>  * A dedicated virtqueue thread sleeps on ioeventfd
>>>>  * Guest memory is pre-mapped and accessed directly (not using QEMU's
>>>> usually memory access functions)
>>>>  * Linux AIO is used, the QEMU block layer is bypassed
>>>>  * Completion interrupts are injected from the virtqueue thread using
>>>> ioctl
>>>>
>>>> I will try to rebase onto qemu-kvm.git/master (this work is several
>>>> months old).  Then we can compare to see how much of the benefit can
>>>> be gotten in userspace.
>>>>
>>> I don't really get you about vhost-blk in user space since vhost
>>> infrastructure itself means an in-kernel accelerator that implemented in
>>> kernel . I guess what you meant is somewhat a re-write of virtio-blk in
>>> user
>>> space with a dedicated thread handling requests, and shorter code path
>>> similar to vhost-blk.
>>
>> Right - it's the same model as vhost: a dedicated thread listening for
>> ioeventfd virtqueue kicks and processing them out-of-line with the
>> guest and userspace QEMU's traditional vcpu and iothread.
>>
>> When you say "IOPS drops drastically" do you mean that it gets worse
>> than with queue-depth=1?
>>
>
> Yes, on my laptop, when iodepth = 3, IOPS in my host drops to about 3,500
> from 13K! and so is iodepth = 4 in my guest during FIO seq read test. This
> should never happen.

Yes, that doesn't make sense to me unless the I/O scheduler is doing
something weird.  Have you tried switching between cfq, deadline, and
noop?
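
The active scheduler is the bracketed entry in /sys/block/<dev>/queue/scheduler. A minimal sketch for checking it before each run; the function names and the default device "sda" are assumptions for illustration.

```python
from pathlib import Path


def parse_schedulers(line):
    """Split a sysfs scheduler line such as 'noop deadline [cfq]'.

    Returns (available, active); active is None if nothing is bracketed.
    """
    available, active = [], None
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


def read_schedulers(device="sda", sysfs_root="/sys/block"):
    """Read the scheduler list for `device`; the default names are examples."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())


if __name__ == "__main__":
    print(parse_schedulers("noop deadline [cfq]"))
```

Switching is done by writing one of the listed names to the same file as root.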

Stefan


* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29 12:29         ` Stefan Hajnoczi
@ 2011-07-29 12:50           ` Stefan Hajnoczi
  2011-07-29 14:45             ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Stefan Hajnoczi @ 2011-07-29 12:50 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty, Christoph Hellwig

I hit a weirdness yesterday, just want to mention it in case you notice it too.

When running vanilla qemu-kvm I forgot to use aio=native.  When I
compared the results against virtio-blk-data-plane (which *always*
uses Linux AIO) I was surprised to find average 4k read latency was
lower and the standard deviation was also lower.

So from now on I will run tests both with and without aio=native.
aio=native should be faster and if I can reproduce the reverse I'll
try to figure out why.

Stefan


* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29 12:50           ` Stefan Hajnoczi
@ 2011-07-29 14:45             ` Liu Yuan
  2011-07-29 14:50               ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-07-29 14:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty, Christoph Hellwig

On 07/29/2011 08:50 PM, Stefan Hajnoczi wrote:
> I hit a weirdness yesterday, just want to mention it in case you notice it too.
>
> When running vanilla qemu-kvm I forgot to use aio=native.  When I
> compared the results against virtio-blk-data-plane (which *always*
> uses Linux AIO) I was surprised to find average 4k read latency was
> lower and the standard deviation was also lower.
>
> So from now on I will run tests both with and without aio=native.
> aio=native should be faster and if I can reproduce the reverse I'll
> try to figure out why.
>
> Stefan
On my laptop, I don't see this weirdness. The emulated POSIX AIO is much
worse than Linux AIO, as expected. As iodepth goes deeper, the gap gets
wider.

If aio=native is not set, qemu uses the emulated POSIX AIO interface to
do the IO. I peeked at posix-aio-compat.c: it uses a thread pool and sync
preadv/pwritev to emulate the AIO behaviour. The sync IO interface would
cause even poorer performance for random rw, since the io-scheduler
would possibly never get a chance to merge the request stream.
(blk_finish_plug->queue_unplugged->__blk_run_queue)
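
The thread pool technique itself is easy to show in isolation. The following is a sketch of the general idea only: it does not mirror the structure of posix-aio-compat.c, it uses pread() rather than preadv() for brevity, and `EmulatedAio` is a hypothetical name.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class EmulatedAio:
    """Asynchronous reads built from a thread pool plus blocking pread().

    An illustration of the general technique only; all names here are
    hypothetical.
    """

    def __init__(self, fd, workers=4):
        self._fd = fd
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit_read(self, offset, length, on_complete):
        """Return at once; a worker blocks in pread() and then calls back."""
        def job():
            data = os.pread(self._fd, length, offset)
            on_complete(offset, data)
            return data
        return self._pool.submit(job)

    def close(self):
        """Wait for all outstanding reads to finish."""
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(bytes(range(256)) * 4)
        f.flush()
        aio = EmulatedAio(f.fileno())
        for off in (0, 256, 512):
            aio.submit_read(off, 16, lambda o, d: print(o, len(d)))
        aio.close()
```

Each worker issues one blocking request at a time, so the kernel sees a set of independent synchronous reads rather than one batched submission.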

Yuan


* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29 14:45             ` Liu Yuan
@ 2011-07-29 14:50               ` Liu Yuan
  0 siblings, 0 replies; 54+ messages in thread
From: Liu Yuan @ 2011-07-29 14:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Rusty Russell, Avi Kivity, kvm, linux-kernel,
	Khoa Huynh, Badari Pulavarty, Christoph Hellwig

On 07/29/2011 10:45 PM, Liu Yuan wrote:
> On 07/29/2011 08:50 PM, Stefan Hajnoczi wrote:
>> I hit a weirdness yesterday, just want to mention it in case you 
>> notice it too.
>>
>> When running vanilla qemu-kvm I forgot to use aio=native.  When I
>> compared the results against virtio-blk-data-plane (which *always*
>> uses Linux AIO) I was surprised to find average 4k read latency was
>> lower and the standard deviation was also lower.
>>
>> So from now on I will run tests both with and without aio=native.
>> aio=native should be faster and if I can reproduce the reverse I'll
>> try to figure out why.
>>
>> Stefan
> On my laptop, I don't meet this weirdo. the emulated POSIX AIO is much 
> worse than the Linux AIO as expected. If iodepth goes deeper, the gap 
> gets wider.
>
> If not set aio=none, qemu uses emulated posix aio interface to do the 
> IO. I peek at the posix-aio-compat.c,it uses thread pool and sync 
> preadv/pwritev to emulate the AIO behaviour. The sync IO interface 
> would even cause much poorer performance for random rw, since 
> io-scheduler would possibly never get a chance to merge the requests 
> stream. (blk_finish_plug->queue_unplugged->__blk_run_queue)
>
> Yuan
Typo: I meant *sort* the reqs, not merge.


* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29 12:01       ` Liu Yuan
  2011-07-29 12:29         ` Stefan Hajnoczi
@ 2011-07-29 15:25         ` Sasha Levin
  2011-08-01  8:17           ` Avi Kivity
  1 sibling, 1 reply; 54+ messages in thread
From: Sasha Levin @ 2011-07-29 15:25 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh, Badari Pulavarty,
	Christoph Hellwig

On Fri, 2011-07-29 at 20:01 +0800, Liu Yuan wrote:
> Looking at this long list,most are function pointers that can not be 
> inlined, and the internal data structures used by these functions are 
> dozons. Leave aside code complexity, this long code path would really 
> need retrofit. As Christoph simply put, this kind of mess is inherent 
> all over the qemu code. So I am afraid, the 'retrofit'  would end up to 
> be a re-write the entire (sub)system. I have to admit that, I am 
> inclined to the MST's vhost approach, that write a new subsystem other 
> than tedious profiling and fixing, that would possibly goes as far as 
> actually re-writing it.

I don't think the fix for problematic userspace is to write more kernel
code.

vhost-net improved throughput and latency by several factors, allowing
it to achieve much more than was possible in userspace alone.

With vhost-blk we see an improvement of ~15% - which I assume by your
and Christoph's comments can be mostly attributed to QEMU. Merging a
module which won't improve performance dramatically compared to what is
possible to achieve in userspace (even if it would require a code
rewrite) sounds a bit wrong to me.
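
The ~15% figure can be checked against the tables posted earlier in the thread. A minimal sketch; the numbers are copied from those tables, while the dictionary layout and the `improvements` helper are illustrative.

```python
# (IOPS, completion latency in us) for iodepth 1, 2, 3, copied from the
# tables posted earlier in this thread.
RESULTS = {
    "seq read": {
        "virtio-blk": [(4116, 214), (7814, 222), (8867, 306)],
        "vhost-blk": [(4755, 183), (8645, 202), (10084, 266)],
    },
    "seq write": {
        "virtio-blk": [(3848, 228), (6505, 275), (9335, 291)],
        "vhost-blk": [(4370, 198), (7009, 249), (9938, 264)],
    },
}


def improvements(workload):
    """Return [(iops_gain_pct, latency_drop_pct), ...] for iodepth 1, 2, 3."""
    base = RESULTS[workload]["virtio-blk"]
    new = RESULTS[workload]["vhost-blk"]
    return [
        (100.0 * (n_iops - b_iops) / b_iops, 100.0 * (b_lat - n_lat) / b_lat)
        for (b_iops, b_lat), (n_iops, n_lat) in zip(base, new)
    ]


if __name__ == "__main__":
    for workload in RESULTS:
        for depth, (iops, lat) in enumerate(improvements(workload), start=1):
            print(f"{workload} iodepth={depth}: "
                  f"IOPS +{iops:.1f}%, latency -{lat:.1f}%")
```

By this calculation the IOPS gain is largest at iodepth 1 for both workloads and smaller at the deeper queue depths.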

-- 

Sasha.



* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29 15:25         ` Sasha Levin
@ 2011-08-01  8:17           ` Avi Kivity
  2011-08-01  9:18             ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Avi Kivity @ 2011-08-01  8:17 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Liu Yuan, Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell,
	kvm, linux-kernel, Khoa Huynh, Badari Pulavarty,
	Christoph Hellwig

On 07/29/2011 06:25 PM, Sasha Levin wrote:
> On Fri, 2011-07-29 at 20:01 +0800, Liu Yuan wrote:
> >  Looking at this long list,most are function pointers that can not be
> >  inlined, and the internal data structures used by these functions are
> >  dozons. Leave aside code complexity, this long code path would really
> >  need retrofit. As Christoph simply put, this kind of mess is inherent
> >  all over the qemu code. So I am afraid, the 'retrofit'  would end up to
> >  be a re-write the entire (sub)system. I have to admit that, I am
> >  inclined to the MST's vhost approach, that write a new subsystem other
> >  than tedious profiling and fixing, that would possibly goes as far as
> >  actually re-writing it.
>
> I don't think the fix for problematic userspace is to write more kernel
> code.
>
> vhost-net improved throughput and latency by several factors, allowing
> to achieve much more than was possible at userspace alone.
>
> With vhost-blk we see an improvement of ~15% - which I assume by your
> and Christoph's comments can be mostly attributed to QEMU. Merging a
> module which won't improve performance dramatically compared to what is
> possible to achieve in userspace (even if it would require a code
> rewrite) sounds a bit wrong to me

Agreed.  vhost-net works around the lack of an async zero-copy networking
interface.  Block I/O, on the other hand, does have such an interface,
and in addition transaction rates are usually lower.  All we're saving is
the syscall overhead.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-01  8:17           ` Avi Kivity
@ 2011-08-01  9:18             ` Liu Yuan
  2011-08-01  9:37               ` Avi Kivity
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-01  9:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Sasha Levin, Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell,
	kvm, linux-kernel, Khoa Huynh, Badari Pulavarty,
	Christoph Hellwig

On 08/01/2011 04:17 PM, Avi Kivity wrote:
> On 07/29/2011 06:25 PM, Sasha Levin wrote:
>> On Fri, 2011-07-29 at 20:01 +0800, Liu Yuan wrote:
>> >  Looking at this long list,most are function pointers that can not be
>> >  inlined, and the internal data structures used by these functions are
>> >  dozons. Leave aside code complexity, this long code path would really
>> >  need retrofit. As Christoph simply put, this kind of mess is inherent
>> >  all over the qemu code. So I am afraid, the 'retrofit'  would end 
>> up to
>> >  be a re-write the entire (sub)system. I have to admit that, I am
>> >  inclined to the MST's vhost approach, that write a new subsystem 
>> other
>> >  than tedious profiling and fixing, that would possibly goes as far as
>> >  actually re-writing it.
>>
>> I don't think the fix for problematic userspace is to write more kernel
>> code.
>>
>> vhost-net improved throughput and latency by several factors, allowing
>> to achieve much more than was possible at userspace alone.
>>
>> With vhost-blk we see an improvement of ~15% - which I assume by your
>> and Christoph's comments can be mostly attributed to QEMU. Merging a
>> module which won't improve performance dramatically compared to what is
>> possible to achieve in userspace (even if it would require a code
>> rewrite) sounds a bit wrong to me
>
> Agree.  vhost-net works around the lack of async zero copy networking 
> interface.  Block I/O on the other hand does have such an interface, 
> and in addition transaction rates are usually lower.  All we're saving 
> is the syscall overhead.
>
Personally, I agree with Sasha Levin too. But vhost-blk is a first quick
prototype, meant to serve as a code base for further optimisation. I plan
to use kernel-internal facilities such as the BIO layer, which cannot be
accessed from user space, to maximize performance for raw-disk-based
block I/O.

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-01  9:18             ` Liu Yuan
@ 2011-08-01  9:37               ` Avi Kivity
  0 siblings, 0 replies; 54+ messages in thread
From: Avi Kivity @ 2011-08-01  9:37 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Sasha Levin, Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell,
	kvm, linux-kernel, Khoa Huynh, Badari Pulavarty,
	Christoph Hellwig

On 08/01/2011 12:18 PM, Liu Yuan wrote:
>> Agree.  vhost-net works around the lack of async zero copy networking 
>> interface.  Block I/O on the other hand does have such an interface, 
>> and in addition transaction rates are usually lower.  All we're 
>> saving is the syscall overhead.
>>
>
> Personally I too agree with Sasha Levin. But vhost-blk is the first 
> fast prototype that is supposed to act as a code base to do further 
> optimisation, which I plan to utilize  kernel's internal stuff like 
> BIO layer,  that can not be accessed from user space, to maximize the 
> performance for raw disk based block IO.

Is there anything in the bio layer which is not exposed by linux-aio?  
Or is linux-aio slow in translating from vfs ops to bio ops?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29  7:22   ` Liu Yuan
  2011-07-29  9:06     ` Stefan Hajnoczi
@ 2011-07-29 18:12     ` Badari Pulavarty
  2011-08-01  5:46       ` Liu Yuan
  1 sibling, 1 reply; 54+ messages in thread
From: Badari Pulavarty @ 2011-07-29 18:12 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

Hi Liu Yuan,

I am glad to see that you have started looking at vhost-blk. I made an
attempt a year ago to improve block performance using the vhost-blk
approach.

http://lwn.net/Articles/379864/
http://lwn.net/Articles/382543/

I will take a closer look at your patchset to find differences and 
similarities.

- I focused on using vfs interfaces in the kernel, so that I can use it 
for file-backed devices.
Our use-case scenario is mostly file-backed images.

- In a few cases, virtio-blk did outperform vhost-blk, which was
counterintuitive, but I couldn't nail down exactly why.

- I had to implement my own threads for parallelism. I see that you are
using the aio infrastructure to get around it.

- In our high-scale performance testing, we found that block-backed
device performance is pretty close to bare metal (91% of bare metal),
and vhost-blk didn't add any major benefit there. I am curious about
your performance analysis and data: where do you see the gains, and why?

Hence I prioritized my work low :(

Now that you are interested in driving this, I am happy to work with you
and see what vhost-blk brings to the table (even if it only helps us
improve virtio-blk).

Thanks,
Badari



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-07-29 18:12     ` Badari Pulavarty
@ 2011-08-01  5:46       ` Liu Yuan
  2011-08-01  8:12         ` Christoph Hellwig
  2011-08-04 21:58         ` Badari Pulavarty
  0 siblings, 2 replies; 54+ messages in thread
From: Liu Yuan @ 2011-08-01  5:46 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

On 07/30/2011 02:12 AM, Badari Pulavarty wrote:
> Hi Liu Yuan,
>
> I am glad to see that you started looking at vhost-blk.   I did an 
> attempt year ago to improve block
> performance using vhost-blk approach.
>
> http://lwn.net/Articles/379864/
> http://lwn.net/Articles/382543/
>
> I will take a closer look at your patchset to find differences and 
> similarities.
>
> - I focused on using vfs interfaces in the kernel, so that I can use 
> it for file-backed devices.
> Our use-case scenario is mostly file-backed images.
>
vhost-blk, which uses Linux AIO, also supports file-backed images.
Actually, I have run guests on both raw partitions and raw file images.

> - In few cases, virtio-blk did outperform vhost-blk -- which was 
> counter intuitive - but
> couldn't exactly nail down. why ?
>
> - I had to implement my own threads for parellism. I see that you are 
> using aio infrastructure
> to get around it.
>
> - In our high scale performance testing, what we found is block-backed 
> device performance is
> pretty close to bare-metal (91% of bare-metal). vhost-blk didn't add 
> any major benefits to it.
> I am curious on your performance analysis & data on where you see the 
> gains and why ?
>
Possibly, bypassing the VFS layer and translating the sg lists from the
virtio buffers directly into BIOs would benefit block-backed devices.
I'll give it a try later.

> Hence I prioritized my work low :(
>
> Now that you are interested in driving this, I am happy to work with 
> you and see what
> vhost-blk brings to the tables. (even if helps us improve virtio-blk).
>
> Thanks,
> Badari
>
>

That's great.

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-01  5:46       ` Liu Yuan
@ 2011-08-01  8:12         ` Christoph Hellwig
  2011-08-04 21:58         ` Badari Pulavarty
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2011-08-01  8:12 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Badari Pulavarty, Stefan Hajnoczi, Michael S. Tsirkin,
	Rusty Russell, Avi Kivity, kvm, linux-kernel, Khoa Huynh

On Mon, Aug 01, 2011 at 01:46:33PM +0800, Liu Yuan wrote:
> >- I focused on using vfs interfaces in the kernel, so that I can
> >use it for file-backed devices.
> >Our use-case scenario is mostly file-backed images.
> >
> vhost-blk's that uses Linux AIO also support file-backed images.
> Actually, I have run Guests both on raw partition and raw file
> images.

Note that it will only perform very well for preallocated images (just
like aio=native in qemu) - if you use sparse images it will have to
call into the allocator, which blocks for metadata I/O even in aio mode.
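
To make the preallocated-versus-sparse distinction concrete, here is a minimal userspace sketch of creating both kinds of raw image file. The function names are made up for this illustration and are not part of any patch in this thread; the only real interfaces used are posix_fallocate(3), ftruncate(2) and stat(2). Whether a given filesystem really reserves the blocks is up to that filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an image whose blocks are reserved up
 * front, so later writes should not need to allocate new extents.
 * Returns 0 on success, -1 on failure. */
int create_preallocated_image(const char *path, off_t size)
{
	int fd, err;

	assert(size > 0);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	/* posix_fallocate() returns an error number directly (not errno). */
	err = posix_fallocate(fd, 0, size);
	close(fd);
	return err ? -1 : 0;
}

/* Hypothetical helper: create a sparse image of the same apparent size.
 * Only the file length is set; blocks get allocated on first write. */
int create_sparse_image(const char *path, off_t size)
{
	int fd, err;

	assert(size > 0);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	err = ftruncate(fd, size);
	close(fd);
	return err ? -1 : 0;
}

/* Bytes currently backed by allocated blocks, or -1 on error.
 * st_blocks is counted in 512-byte units. */
long long allocated_bytes(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	return (long long)st.st_blocks * 512;
}
```

Comparing allocated_bytes() for the two files shows how much of each image is actually backed by disk blocks; on a sparse image the gap is what the filesystem allocator has to fill in at write time.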


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-01  5:46       ` Liu Yuan
  2011-08-01  8:12         ` Christoph Hellwig
@ 2011-08-04 21:58         ` Badari Pulavarty
  2011-08-05  7:56           ` Liu Yuan
  2011-08-05 11:04           ` Liu Yuan
  1 sibling, 2 replies; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-04 21:58 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

Hi Liu Yuan,

I started testing your patches. I applied your kernel patch to 3.0
and the QEMU patch to the latest git.

I passed 6 block devices from the host to the guest (4 vcpu, 4GB RAM).
I ran simple "dd" read tests from the guest on all block devices
(with various block sizes, iflag=direct).

Unfortunately, the system doesn't stay up: the host panics almost
immediately. I haven't had time to debug the problem. Have you seen
this issue before, or do you have a new patchset to try?

Let me know.

Thanks,
Badari

------------[ cut here ]------------
kernel BUG at mm/slab.c:3059!
invalid opcode: 0000 [#1] SMP 
CPU 7 
Modules linked in: vhost_blk ebtable_nat ebtables xt_CHECKSUM bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf cachefiles fscache ipt_REJECT ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm cdc_ether usbnet mii microcode serio_raw pcspkr i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca i7core_edac edac_core bnx2 sg ext4 mbcache jbd2 sd_mod crc_t10dif qla2xxx scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: nf_defrag_ipv4]

Pid: 2744, comm: vhost-2698 Not tainted 3.0.0 #2 IBM  -[7870AC1]-/46M0761     
RIP: 0010:[<ffffffff8114932c>]  [<ffffffff8114932c>] cache_alloc_refill+0x22c/0x250
RSP: 0018:ffff880258c87d00  EFLAGS: 00010046
RAX: 0000000000000002 RBX: ffff88027f800040 RCX: dead000000200200
RDX: ffff880271128000 RSI: 0000000000000070 RDI: ffff88026eb6c000
RBP: ffff880258c87d50 R08: ffff880271128000 R09: 0000000000000003
R10: 000000021fffffff R11: ffff88026b5790c0 R12: ffff880272cd8c00
R13: ffff88027f822440 R14: 0000000000000002 R15: ffff88026eb6c000
FS:  0000000000000000(0000) GS:ffff88027fce0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000ecb100 CR3: 0000000270bfe000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process vhost-2698 (pid: 2744, threadinfo ffff880258c86000, task ffff8802703154c0)
Stack:
 ffff880200000002 000492d000000000 ffff88027f822460 ffff88027f822450
 ffff880258c87d60 ffff88027f800040 0000000000000000 00000000000080d0
 00000000000080d0 0000000000000246 ffff880258c87da0 ffffffff81149c82
Call Trace:
 [<ffffffff81149c82>] kmem_cache_alloc_trace+0x182/0x190
 [<ffffffffa0252f52>] handle_guest_kick+0x162/0x799 [vhost_blk]
 [<ffffffffa02514ab>] vhost_worker+0xcb/0x150 [vhost_blk]
 [<ffffffffa02513e0>] ? vhost_dev_set_owner+0x190/0x190 [vhost_blk]
 [<ffffffffa02513e0>] ? vhost_dev_set_owner+0x190/0x190 [vhost_blk]
 [<ffffffff81084c66>] kthread+0x96/0xa0
 [<ffffffff814d2f84>] kernel_thread_helper+0x4/0x10
 [<ffffffff81084bd0>] ? kthread_worker_fn+0x1a0/0x1a0
 [<ffffffff814d2f80>] ? gs_change+0x13/0x13
Code: 48 89 df e8 07 fb ff ff 65 8b 14 25 58 dc 00 00 85 c0 48 63 d2 4c 8b 24 d3 74 16 41 83 3c 24 00 0f 84 fc fd ff ff e9 75 ff ff ff <0f> 0b 66 90 eb fc 31 c0 41 83 3c 24 00 0f 85 62 ff ff ff 90 e9 
RIP  [<ffffffff8114932c>] cache_alloc_refill+0x22c/0x250
 RSP <ffff880258c87d00>
---[ end trace e286566e512cba7b ]---








^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-04 21:58         ` Badari Pulavarty
@ 2011-08-05  7:56           ` Liu Yuan
  2011-08-05 11:04           ` Liu Yuan
  1 sibling, 0 replies; 54+ messages in thread
From: Liu Yuan @ 2011-08-05  7:56 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
> Hi Liu Yuan,
>
> I started testing your patches. I applied your kernel patch to 3.0
> and applied QEMU to latest git.
>
> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
> I ran simple "dd" read tests from the guest on all block devices
> (with various blocksizes, iflag=direct).
>
> Unfortunately, system doesn't stay up. I immediately get into
> panic on the host. I didn't get time to debug the problem. Wondering
> if you have seen this issue before and/or you have new patchset
> to try ?
>
> Let me know.
>

This patch set doesn't support multiple devices currently, since the
vhost-blk code for qemu passes just *one* backend to the vhost_blk
module in the kernel.

If you really need to test it with multiple block devices, you have to
tweak the vhost-blk part for qemu.

I'll take a look at this issue, but I can't promise a patch right away.

Yuan

> Thanks,
> Badari
>
> ------------[ cut here ]------------
> kernel BUG at mm/slab.c:3059!
> invalid opcode: 0000 [#1] SMP
> CPU 7
> Modules linked in: vhost_blk ebtable_nat ebtables xt_CHECKSUM bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf cachefiles fscache ipt_REJECT ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm cdc_ether usbnet mii microcode serio_raw pcspkr i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca i7core_edac edac_core bnx2 sg ext4 mbcache jbd2 sd_mod crc_t10dif qla2xxx scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: nf_defrag_ipv4]
>
> Pid: 2744, comm: vhost-2698 Not tainted 3.0.0 #2 IBM  -[7870AC1]-/46M0761
> RIP: 0010:[<ffffffff8114932c>]  [<ffffffff8114932c>] cache_alloc_refill+0x22c/0x250
> RSP: 0018:ffff880258c87d00  EFLAGS: 00010046
> RAX: 0000000000000002 RBX: ffff88027f800040 RCX: dead000000200200
> RDX: ffff880271128000 RSI: 0000000000000070 RDI: ffff88026eb6c000
> RBP: ffff880258c87d50 R08: ffff880271128000 R09: 0000000000000003
> R10: 000000021fffffff R11: ffff88026b5790c0 R12: ffff880272cd8c00
> R13: ffff88027f822440 R14: 0000000000000002 R15: ffff88026eb6c000
> FS:  0000000000000000(0000) GS:ffff88027fce0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000ecb100 CR3: 0000000270bfe000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process vhost-2698 (pid: 2744, threadinfo ffff880258c86000, task ffff8802703154c0)
> Stack:
>   ffff880200000002 000492d000000000 ffff88027f822460 ffff88027f822450
>   ffff880258c87d60 ffff88027f800040 0000000000000000 00000000000080d0
>   00000000000080d0 0000000000000246 ffff880258c87da0 ffffffff81149c82
> Call Trace:
>   [<ffffffff81149c82>] kmem_cache_alloc_trace+0x182/0x190
>   [<ffffffffa0252f52>] handle_guest_kick+0x162/0x799 [vhost_blk]
>   [<ffffffffa02514ab>] vhost_worker+0xcb/0x150 [vhost_blk]
>   [<ffffffffa02513e0>] ? vhost_dev_set_owner+0x190/0x190 [vhost_blk]
>   [<ffffffffa02513e0>] ? vhost_dev_set_owner+0x190/0x190 [vhost_blk]
>   [<ffffffff81084c66>] kthread+0x96/0xa0
>   [<ffffffff814d2f84>] kernel_thread_helper+0x4/0x10
>   [<ffffffff81084bd0>] ? kthread_worker_fn+0x1a0/0x1a0
>   [<ffffffff814d2f80>] ? gs_change+0x13/0x13
> Code: 48 89 df e8 07 fb ff ff 65 8b 14 25 58 dc 00 00 85 c0 48 63 d2 4c 8b 24 d3 74 16 41 83 3c 24 00 0f 84 fc fd ff ff e9 75 ff ff ff<0f>  0b 66 90 eb fc 31 c0 41 83 3c 24 00 0f 85 62 ff ff ff 90 e9
> RIP  [<ffffffff8114932c>] cache_alloc_refill+0x22c/0x250
>   RSP<ffff880258c87d00>
> ---[ end trace e286566e512cba7b ]---
>
>
>
>
>
>
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-04 21:58         ` Badari Pulavarty
  2011-08-05  7:56           ` Liu Yuan
@ 2011-08-05 11:04           ` Liu Yuan
  2011-08-05 18:02             ` Badari Pulavarty
  1 sibling, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-05 11:04 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

[-- Attachment #1: Type: text/plain, Size: 993 bytes --]

On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
> Hi Liu Yuan,
>
> I started testing your patches. I applied your kernel patch to 3.0
> and applied QEMU to latest git.
>
> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
> I ran simple "dd" read tests from the guest on all block devices
> (with various blocksizes, iflag=direct).
>
> Unfortunately, system doesn't stay up. I immediately get into
> panic on the host. I didn't get time to debug the problem. Wondering
> if you have seen this issue before and/or you have new patchset
> to try ?
>
> Let me know.
>
> Thanks,
> Badari
>

Okay, this is actually a bug that MST pointed out in the other thread:
the completion thread needs a mutex.

Now, would you please try this attachment? This patch applies only to
the kernel part, on top of the v1 kernel patch.

This patch mainly moves the completion thread into the vhost thread as a
function. As a result, both request submission and completion
signalling happen in the same thread.
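
The single-thread idea can be sketched in userspace as follows. This is only an analogy and not the kernel code from the attachment: the struct and function names are invented for illustration, and the real patch hooks the eventfd into vhost_poll rather than calling poll(2). The only real interfaces used here are eventfd(2), poll(2), read(2) and write(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-device state. */
struct sketch_dev {
	int efd;            /* eventfd bumped once per finished request */
	unsigned submitted; /* requests handed to the backend */
	unsigned completed; /* requests retired by the handler */
};

int sketch_dev_init(struct sketch_dev *d)
{
	d->efd = eventfd(0, EFD_NONBLOCK);
	d->submitted = 0;
	d->completed = 0;
	return d->efd < 0 ? -1 : 0;
}

/* "Submit" one request. Real I/O would finish later and signal the
 * eventfd from the completion path; here it finishes immediately. */
int sketch_submit(struct sketch_dev *d)
{
	uint64_t one = 1;

	d->submitted++;
	if (write(d->efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
		return -1;
	return 0;
}

/* Completion handler, called from the same thread that submits.
 * Returns the number of requests retired, 0 on timeout, -1 on error. */
int sketch_handle_completions(struct sketch_dev *d, int timeout_ms)
{
	struct pollfd pfd = { .fd = d->efd, .events = POLLIN };
	uint64_t count = 0;
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;
	/* A read returns the accumulated counter and resets it to zero. */
	if (read(d->efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	d->completed += (unsigned)count;
	return (int)count;
}

void sketch_dev_destroy(struct sketch_dev *d)
{
	assert(d->efd >= 0);
	close(d->efd);
	d->efd = -1;
}
```

Because submission and completion run in one thread, the counters need no locking in this sketch, which is the same reason the mutex problem goes away once the separate completion thread is removed.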

Yuan

[-- Attachment #2: vblk-for-kernel-2.patch --]
[-- Type: text/x-patch, Size: 3505 bytes --]

diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
index ecaf6fe..5cba543 100644
--- a/drivers/vhost/blk.c
+++ b/drivers/vhost/blk.c
@@ -47,6 +47,7 @@ struct vhost_blk {
 	struct eventfd_ctx *ectx;
 	struct file *efile;
 	struct task_struct *worker;
+	struct vhost_poll poll;
 };
 
 struct used_info {
@@ -62,6 +63,7 @@ static struct kmem_cache *used_info_cachep;
 static void blk_flush(struct vhost_blk *blk)
 {
        vhost_poll_flush(&blk->vq.poll);
+       vhost_poll_flush(&blk->poll);
 }
 
 static long blk_set_features(struct vhost_blk *blk, u64 features)
@@ -146,11 +148,11 @@ static long blk_reset_owner(struct vhost_blk *b)
         blk_stop(b);
         blk_flush(b);
         ret = vhost_dev_reset_owner(&b->dev);
-	if (b->worker) {
-		b->should_stop = 1;
-		smp_mb();
-		eventfd_signal(b->ectx, 1);
-	}
+//	if (b->worker) {
+//		b->should_stop = 1;
+//		smp_mb();
+//		eventfd_signal(b->ectx, 1);
+//	}
 err:
         mutex_unlock(&b->dev.mutex);
         return ret;
@@ -323,6 +325,7 @@ static void completion_thread_destory(struct vhost_blk *blk)
 
 static long blk_set_owner(struct vhost_blk *blk)
 {
+	eventfd_signal(blk->ectx, 1);
 	return completion_thread_setup(blk);
 }
 
@@ -361,8 +364,8 @@ static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
 		default:
 			mutex_lock(&blk->dev.mutex);
 			ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
-			if (!ret && ioctl == VHOST_SET_OWNER)
-				ret = blk_set_owner(blk);
+//			if (!ret && ioctl == VHOST_SET_OWNER)
+//				ret = blk_set_owner(blk);
 			blk_flush(blk);
 			mutex_unlock(&blk->dev.mutex);
 			break;
@@ -480,10 +483,51 @@ static void handle_guest_kick(struct vhost_work *work)
 	handle_kick(blk);
 }
 
+static void handle_completetion(struct vhost_work* work)
+{
+	struct vhost_blk *blk = container_of(work, struct vhost_blk, poll.work);
+	struct timespec ts = { 0 };
+	int ret, i, nr;
+	u64 count;
+
+	do {
+		ret = eventfd_ctx_read(blk->ectx, 1, &count);
+	} while (unlikely(ret == -ERESTARTSYS));
+
+	do {
+		nr = kernel_read_events(blk->ioctx, count, MAX_EVENTS, events, &ts);
+	} while (unlikely(nr == -EINTR));
+	dprintk("%s, count %llu, nr %d\n", __func__, count, nr);
+
+	if (unlikely(nr <= 0))
+		return;
+
+	for (i = 0; i < nr; i++) {
+		struct used_info *u = (struct used_info *)events[i].obj;
+		int len, status;
+
+		dprintk("%s, head %d complete in %d\n", __func__, u->head, i);
+		len = io_event_ret(&events[i]);
+		//status = u->len == len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		status = len > 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		if (copy_to_user(u->status, &status, sizeof status)) {
+			vq_err(&blk->vq, "%s failed to write status\n", __func__);
+			BUG(); /* FIXME: maybe a bit radical? */
+		}
+		vhost_add_used(&blk->vq, u->head, u->len);
+		kfree(u);
+	}
+
+	vhost_signal(&blk->dev, &blk->vq);
+}
+
 static void eventfd_setup(struct vhost_blk *blk)
 {
+	//struct vhost_virtqueue *vq = &blk->vq;
 	blk->efile = eventfd_file_create(0, 0);
 	blk->ectx = eventfd_ctx_fileget(blk->efile);
+	vhost_poll_init(&blk->poll, handle_completetion, POLLIN, &blk->dev);
+	vhost_poll_start(&blk->poll, blk->efile);
 }
 
 static int vhost_blk_open(struct inode *inode, struct file *f)
@@ -528,7 +572,7 @@ static int vhost_blk_release(struct inode *inode, struct file *f)
 	vhost_dev_cleanup(&blk->dev);
 	/* Yet another flush? See comments in vhost_net_release() */
 	blk_flush(blk);
-	completion_thread_destory(blk);
+//	completion_thread_destory(blk);
 	eventfd_destroy(blk);
 	kfree(blk);
 

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-05 11:04           ` Liu Yuan
@ 2011-08-05 18:02             ` Badari Pulavarty
  2011-08-08  1:35               ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-05 18:02 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

On 8/5/2011 4:04 AM, Liu Yuan wrote:
> On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
>> Hi Liu Yuan,
>>
>> I started testing your patches. I applied your kernel patch to 3.0
>> and applied QEMU to latest git.
>>
>> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
>> I ran simple "dd" read tests from the guest on all block devices
>> (with various blocksizes, iflag=direct).
>>
>> Unfortunately, system doesn't stay up. I immediately get into
>> panic on the host. I didn't get time to debug the problem. Wondering
>> if you have seen this issue before and/or you have new patchset
>> to try ?
>>
>> Let me know.
>>
>> Thanks,
>> Badari
>>
>
> Okay, it is actually a bug pointed out by MST on the other thread, 
> that it needs a mutex for completion thread.
>
> Now would you please this attachment?This patch only applies to kernel 
> part, on top of v1 kernel patch.
>
> This patch mainly moves completion thread into vhost thread as a 
> function. As a result, both requests submitting and completion 
> signalling is in the same thread.
>
> Yuan

Unfortunately, the "dd" tests (4 out of 6) in the guest hung. I see the
following messages:

virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio3: requests: id 1 is not a head !
virtio_blk virtio5: requests: id 1 is not a head !
virtio_blk virtio1: requests: id 1 is not a head !

I still see host panics. I will collect the host panic output and see
whether it is still the same.

Thanks,
Badari



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-05 18:02             ` Badari Pulavarty
@ 2011-08-08  1:35               ` Liu Yuan
  2011-08-08  5:04                 ` Badari Pulavarty
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-08  1:35 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

On 08/06/2011 02:02 AM, Badari Pulavarty wrote:
> On 8/5/2011 4:04 AM, Liu Yuan wrote:
>> On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
>>> Hi Liu Yuan,
>>>
>>> I started testing your patches. I applied your kernel patch to 3.0
>>> and applied QEMU to latest git.
>>>
>>> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
>>> I ran simple "dd" read tests from the guest on all block devices
>>> (with various blocksizes, iflag=direct).
>>>
>>> Unfortunately, system doesn't stay up. I immediately get into
>>> panic on the host. I didn't get time to debug the problem. Wondering
>>> if you have seen this issue before and/or you have new patchset
>>> to try ?
>>>
>>> Let me know.
>>>
>>> Thanks,
>>> Badari
>>>
>>
>> Okay, it is actually a bug pointed out by MST on the other thread, 
>> that it needs a mutex for completion thread.
>>
>> Now would you please this attachment?This patch only applies to 
>> kernel part, on top of v1 kernel patch.
>>
>> This patch mainly moves completion thread into vhost thread as a 
>> function. As a result, both requests submitting and completion 
>> signalling is in the same thread.
>>
>> Yuan
>
> Unfortunately, "dd" tests (4 out of 6) in the guest hung. I see 
> following messages
>
> virtio_blk virtio2: requests: id 0 is not a head !
> virtio_blk virtio3: requests: id 1 is not a head !
> virtio_blk virtio5: requests: id 1 is not a head !
> virtio_blk virtio1: requests: id 1 is not a head !
>
> I still see host panics. I will collect the host panic and see if its 
> still same or not.
>
> Thanks,
> Badari
>
>
Would you please show me how to reproduce it step by step? I tried dd
with two block devices attached, but got neither a hang nor a panic.

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-08  1:35               ` Liu Yuan
@ 2011-08-08  5:04                 ` Badari Pulavarty
  2011-08-08  7:31                   ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-08  5:04 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

On 8/7/2011 6:35 PM, Liu Yuan wrote:
> On 08/06/2011 02:02 AM, Badari Pulavarty wrote:
>> On 8/5/2011 4:04 AM, Liu Yuan wrote:
>>> On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
>>>> Hi Liu Yuan,
>>>>
>>>> I started testing your patches. I applied your kernel patch to 3.0
>>>> and applied QEMU to latest git.
>>>>
>>>> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
>>>> I ran simple "dd" read tests from the guest on all block devices
>>>> (with various blocksizes, iflag=direct).
>>>>
>>>> Unfortunately, system doesn't stay up. I immediately get into
>>>> panic on the host. I didn't get time to debug the problem. Wondering
>>>> if you have seen this issue before and/or you have new patchset
>>>> to try ?
>>>>
>>>> Let me know.
>>>>
>>>> Thanks,
>>>> Badari
>>>>
>>>
>>> Okay, it is actually a bug pointed out by MST on the other thread, 
>>> that it needs a mutex for completion thread.
>>>
>>> Now would you please this attachment?This patch only applies to 
>>> kernel part, on top of v1 kernel patch.
>>>
>>> This patch mainly moves completion thread into vhost thread as a 
>>> function. As a result, both requests submitting and completion 
>>> signalling is in the same thread.
>>>
>>> Yuan
>>
>> Unfortunately, "dd" tests (4 out of 6) in the guest hung. I see 
>> following messages
>>
>> virtio_blk virtio2: requests: id 0 is not a head !
>> virtio_blk virtio3: requests: id 1 is not a head !
>> virtio_blk virtio5: requests: id 1 is not a head !
>> virtio_blk virtio1: requests: id 1 is not a head !
>>
>> I still see host panics. I will collect the host panic and see if its 
>> still same or not.
>>
>> Thanks,
>> Badari
>>
>>
> Would you please show me how to reproduce it step by step? I tried dd 
> with two block device attached, but didn't get hung nor panic.
>
> Yuan

I ran 6 "dd"s on 6 block devices:

dd if=/dev/vdb of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdc of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdd of=/dev/null bs=1M iflag=direct &
dd if=/dev/vde of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdf of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdg of=/dev/null bs=1M iflag=direct &

I can reproduce the problem within 3 minutes :(

Thanks,
Badari



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-08  5:04                 ` Badari Pulavarty
@ 2011-08-08  7:31                   ` Liu Yuan
  2011-08-08 17:16                     ` Badari Pulavarty
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-08  7:31 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

[-- Attachment #1: Type: text/plain, Size: 2541 bytes --]

On 08/08/2011 01:04 PM, Badari Pulavarty wrote:
> On 8/7/2011 6:35 PM, Liu Yuan wrote:
>> On 08/06/2011 02:02 AM, Badari Pulavarty wrote:
>>> On 8/5/2011 4:04 AM, Liu Yuan wrote:
>>>> On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
>>>>> Hi Liu Yuan,
>>>>>
>>>>> I started testing your patches. I applied your kernel patch to 3.0
>>>>> and applied QEMU to latest git.
>>>>>
>>>>> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
>>>>> I ran simple "dd" read tests from the guest on all block devices
>>>>> (with various blocksizes, iflag=direct).
>>>>>
>>>>> Unfortunately, system doesn't stay up. I immediately get into
>>>>> panic on the host. I didn't get time to debug the problem. Wondering
>>>>> if you have seen this issue before and/or you have new patchset
>>>>> to try ?
>>>>>
>>>>> Let me know.
>>>>>
>>>>> Thanks,
>>>>> Badari
>>>>>
>>>>
>>>> Okay, it is actually a bug pointed out by MST on the other thread, 
>>>> that it needs a mutex for completion thread.
>>>>
>>>> Now would you please this attachment?This patch only applies to 
>>>> kernel part, on top of v1 kernel patch.
>>>>
>>>> This patch mainly moves completion thread into vhost thread as a 
>>>> function. As a result, both requests submitting and completion 
>>>> signalling is in the same thread.
>>>>
>>>> Yuan
>>>
>>> Unfortunately, "dd" tests (4 out of 6) in the guest hung. I see 
>>> following messages
>>>
>>> virtio_blk virtio2: requests: id 0 is not a head !
>>> virtio_blk virtio3: requests: id 1 is not a head !
>>> virtio_blk virtio5: requests: id 1 is not a head !
>>> virtio_blk virtio1: requests: id 1 is not a head !
>>>
>>> I still see host panics. I will collect the host panic and see if 
>>> its still same or not.
>>>
>>> Thanks,
>>> Badari
>>>
>>>
>> Would you please show me how to reproduce it step by step? I tried dd 
>> with two block device attached, but didn't get hung nor panic.
>>
>> Yuan
>
> I did 6 "dd"s on 6 block devices..
>
> dd if=/dev/vdb of=/dev/null bs=1M iflag=direct &
> dd if=/dev/vdc of=/dev/null bs=1M iflag=direct &
> dd if=/dev/vdd of=/dev/null bs=1M iflag=direct &
> dd if=/dev/vde of=/dev/null bs=1M iflag=direct &
> dd if=/dev/vdf of=/dev/null bs=1M iflag=direct &
> dd if=/dev/vdg of=/dev/null bs=1M iflag=direct &
>
> I can reproduce the problem with in 3 minutes :(
>
> Thanks,
> Badari
>
>
Ah... I made an embarrassing mistake: I tried to kfree() a kmem_cache
object.

Would you please revert the vblk-for-kernel-2 patch and apply the new
one attached to this mail?
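
For readers unfamiliar with the rule behind this fix (an object has to go back to the allocator it came from), here is a small userspace analogy. It is not the kernel slab allocator and none of these names exist in the patch; it is a made-up fixed-size pool whose free routine rejects pointers it did not hand out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_OBJS 4

/* Hypothetical stand-in for struct used_info. */
struct used_info_sketch {
	int head;
	int len;
};

/* A toy fixed-size object pool, loosely playing the role of a cache. */
struct pool {
	struct used_info_sketch slots[POOL_OBJS];
	unsigned char in_use[POOL_OBJS];
};

void pool_init(struct pool *p)
{
	memset(p, 0, sizeof(*p));
}

/* Returns a free slot, or NULL when the pool is exhausted. */
struct used_info_sketch *pool_alloc(struct pool *p)
{
	size_t i;

	for (i = 0; i < POOL_OBJS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = 1;
			return &p->slots[i];
		}
	}
	return NULL;
}

/* Returns 0 on success, -1 if obj was not allocated from this pool
 * (foreign pointer, misaligned pointer, or double free). */
int pool_free(struct pool *p, struct used_info_sketch *obj)
{
	uintptr_t base = (uintptr_t)p->slots;
	uintptr_t addr = (uintptr_t)obj;
	size_t idx;

	if (addr < base || addr >= base + sizeof(p->slots))
		return -1;
	if ((addr - base) % sizeof(p->slots[0]) != 0)
		return -1;
	idx = (addr - base) / sizeof(p->slots[0]);
	if (!p->in_use[idx])
		return -1;
	p->in_use[idx] = 0;
	return 0;
}
```

The toy pool can detect and refuse a mismatched free. The kernel gives no such guarantee, which is why an object obtained from a kmem_cache is released with kmem_cache_free() on that same cache, as the new attachment does.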

Thanks,
Yuan

[-- Attachment #2: vblk-for-kernel-2.patch --]
[-- Type: text/x-patch, Size: 3278 bytes --]

diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
index ecaf6fe..7a24aba 100644
--- a/drivers/vhost/blk.c
+++ b/drivers/vhost/blk.c
@@ -47,6 +47,7 @@ struct vhost_blk {
 	struct eventfd_ctx *ectx;
 	struct file *efile;
 	struct task_struct *worker;
+	struct vhost_poll poll;
 };
 
 struct used_info {
@@ -62,6 +63,7 @@ static struct kmem_cache *used_info_cachep;
 static void blk_flush(struct vhost_blk *blk)
 {
        vhost_poll_flush(&blk->vq.poll);
+       vhost_poll_flush(&blk->poll);
 }
 
 static long blk_set_features(struct vhost_blk *blk, u64 features)
@@ -146,11 +148,11 @@ static long blk_reset_owner(struct vhost_blk *b)
         blk_stop(b);
         blk_flush(b);
         ret = vhost_dev_reset_owner(&b->dev);
-	if (b->worker) {
-		b->should_stop = 1;
-		smp_mb();
-		eventfd_signal(b->ectx, 1);
-	}
+//	if (b->worker) {
+//		b->should_stop = 1;
+//		smp_mb();
+//		eventfd_signal(b->ectx, 1);
+//	}
 err:
         mutex_unlock(&b->dev.mutex);
         return ret;
@@ -361,8 +363,8 @@ static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
 		default:
 			mutex_lock(&blk->dev.mutex);
 			ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
-			if (!ret && ioctl == VHOST_SET_OWNER)
-				ret = blk_set_owner(blk);
+//			if (!ret && ioctl == VHOST_SET_OWNER)
+//				ret = blk_set_owner(blk);
 			blk_flush(blk);
 			mutex_unlock(&blk->dev.mutex);
 			break;
@@ -480,10 +482,50 @@ static void handle_guest_kick(struct vhost_work *work)
 	handle_kick(blk);
 }
 
+static void handle_completetion(struct vhost_work* work)
+{
+	struct vhost_blk *blk = container_of(work, struct vhost_blk, poll.work);
+	struct timespec ts = { 0 };
+	int ret, i, nr;
+	u64 count;
+
+	do {
+		ret = eventfd_ctx_read(blk->ectx, 1, &count);
+	} while (unlikely(ret == -ERESTARTSYS));
+
+	do {
+		nr = kernel_read_events(blk->ioctx, count, MAX_EVENTS, events, &ts);
+	} while (unlikely(nr == -EINTR));
+	dprintk("%s, count %llu, nr %d\n", __func__, count, nr);
+
+	if (unlikely(nr <= 0))
+		return;
+
+	for (i = 0; i < nr; i++) {
+		struct used_info *u = (struct used_info *)events[i].obj;
+		int len, status;
+
+		dprintk("%s, head %d complete in %d\n", __func__, u->head, i);
+		len = io_event_ret(&events[i]);
+		//status = u->len == len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		status = len > 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		if (copy_to_user(u->status, &status, sizeof status)) {
+			vq_err(&blk->vq, "%s failed to write status\n", __func__);
+			BUG(); /* FIXME: maybe a bit radical? */
+		}
+		vhost_add_used(&blk->vq, u->head, u->len);
+		kmem_cache_free(used_info_cachep, u);
+	}
+
+	vhost_signal(&blk->dev, &blk->vq);
+}
+
 static void eventfd_setup(struct vhost_blk *blk)
 {
 	blk->efile = eventfd_file_create(0, 0);
 	blk->ectx = eventfd_ctx_fileget(blk->efile);
+	vhost_poll_init(&blk->poll, handle_completetion, POLLIN, &blk->dev);
+	vhost_poll_start(&blk->poll, blk->efile);
 }
 
 static int vhost_blk_open(struct inode *inode, struct file *f)
@@ -528,7 +570,7 @@ static int vhost_blk_release(struct inode *inode, struct file *f)
 	vhost_dev_cleanup(&blk->dev);
 	/* Yet another flush? See comments in vhost_net_release() */
 	blk_flush(blk);
-	completion_thread_destory(blk);
+//	completion_thread_destory(blk);
 	eventfd_destroy(blk);
 	kfree(blk);
 

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-08  7:31                   ` Liu Yuan
@ 2011-08-08 17:16                     ` Badari Pulavarty
  2011-08-10  2:19                       ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-08 17:16 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

On 8/8/2011 12:31 AM, Liu Yuan wrote:
> On 08/08/2011 01:04 PM, Badari Pulavarty wrote:
>> On 8/7/2011 6:35 PM, Liu Yuan wrote:
>>> On 08/06/2011 02:02 AM, Badari Pulavarty wrote:
>>>> On 8/5/2011 4:04 AM, Liu Yuan wrote:
>>>>> On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
>>>>>> Hi Liu Yuan,
>>>>>>
>>>>>> I started testing your patches. I applied your kernel patch to 3.0
>>>>>> and applied QEMU to latest git.
>>>>>>
>>>>>> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
>>>>>> I ran simple "dd" read tests from the guest on all block devices
>>>>>> (with various blocksizes, iflag=direct).
>>>>>>
>>>>>> Unfortunately, system doesn't stay up. I immediately get into
>>>>>> panic on the host. I didn't get time to debug the problem. Wondering
>>>>>> if you have seen this issue before and/or you have new patchset
>>>>>> to try ?
>>>>>>
>>>>>> Let me know.
>>>>>>
>>>>>> Thanks,
>>>>>> Badari
>>>>>>
>>>>>
>>>>> Okay, it is actually a bug pointed out by MST on the other thread, 
>>>>> that it needs a mutex for completion thread.
>>>>>
>>>>> Now would you please try this attachment? This patch only applies to
>>>>> the kernel part, on top of the v1 kernel patch.
>>>>>
>>>>> This patch mainly moves completion thread into vhost thread as a 
>>>>> function. As a result, both requests submitting and completion 
>>>>> signalling is in the same thread.
>>>>>
>>>>> Yuan
>>>>
>>>> Unfortunately, "dd" tests (4 out of 6) in the guest hung. I see 
>>>> following messages
>>>>
>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>> virtio_blk virtio3: requests: id 1 is not a head !
>>>> virtio_blk virtio5: requests: id 1 is not a head !
>>>> virtio_blk virtio1: requests: id 1 is not a head !
>>>>
>>>> I still see host panics. I will collect the host panic and see if 
>>>> its still same or not.
>>>>
>>>> Thanks,
>>>> Badari
>>>>
>>>>
>>> Would you please show me how to reproduce it step by step? I tried 
>>> dd with two block device attached, but didn't get hung nor panic.
>>>
>>> Yuan
>>
>> I did 6 "dd"s on 6 block devices..
>>
>> dd if=/dev/vdb of=/dev/null bs=1M iflag=direct &
>> dd if=/dev/vdc of=/dev/null bs=1M iflag=direct &
>> dd if=/dev/vdd of=/dev/null bs=1M iflag=direct &
>> dd if=/dev/vde of=/dev/null bs=1M iflag=direct &
>> dd if=/dev/vdf of=/dev/null bs=1M iflag=direct &
>> dd if=/dev/vdg of=/dev/null bs=1M iflag=direct &
>>
>> I can reproduce the problem within 3 minutes :(
>>
>> Thanks,
>> Badari
>>
>>
> Ah...I made an embarrassing mistake that I tried to 'free()' an 
> kmem_cache object.
>
> Would you please revert the vblk-for-kernel-2 patch and apply the new 
> one attached in this letter?
>
Hmm.. My version of the code seems to have kzalloc() for used_info. I
don't have a version that is using kmem_cache_alloc(). Would it be
possible for you to send out a complete patch (with all the fixes
applied) for me to try? This will avoid all the confusion.

Thanks,
Badari
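
The allocator mismatch discussed above (an object obtained from a
dedicated cache being released through a different free routine) can be
modeled in userspace. The following is a minimal sketch under stated
assumptions, not the driver code: `obj_cache`, `cache_alloc`,
`cache_owns`, `cache_free`, and `used_info_model` are hypothetical
stand-ins for a slab cache and its matching alloc/free pair, and the
ownership check exists only to make the pairing rule visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model of a fixed-size object cache. */
#define CACHE_SLOTS 8

struct used_info_model {
	void *status;
	int head;
	int len;
};

struct obj_cache {
	struct used_info_model slots[CACHE_SLOTS];
	unsigned char in_use[CACHE_SLOTS];
};

/* Hand out a zeroed object from the cache, or NULL when it is exhausted. */
static struct used_info_model *cache_alloc(struct obj_cache *c)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		if (!c->in_use[i]) {
			c->in_use[i] = 1;
			memset(&c->slots[i], 0, sizeof(c->slots[i]));
			return &c->slots[i];
		}
	}
	return NULL;
}

/* Return 1 if obj was handed out by this cache and is still live. */
static int cache_owns(const struct obj_cache *c,
		      const struct used_info_model *obj)
{
	uintptr_t p = (uintptr_t)obj;
	uintptr_t base = (uintptr_t)&c->slots[0];
	uintptr_t end = (uintptr_t)&c->slots[CACHE_SLOTS];

	if (p < base || p >= end)
		return 0;
	if ((p - base) % sizeof(c->slots[0]) != 0)
		return 0;
	return c->in_use[(p - base) / sizeof(c->slots[0])];
}

/* Release obj back to its cache; returns 0 on success, -1 on a mismatch. */
static int cache_free(struct obj_cache *c, struct used_info_model *obj)
{
	if (!cache_owns(c, obj))
		return -1;
	c->in_use[obj - c->slots] = 0;
	return 0;
}
```

In this model a mismatched or repeated release is reported as -1. Real
allocators generally offer no such soft failure, so the alloc/free
pairing has to be right by construction.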



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-08 17:16                     ` Badari Pulavarty
@ 2011-08-10  2:19                       ` Liu Yuan
  2011-08-10 20:37                         ` Badari Pulavarty
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-10  2:19 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

[-- Attachment #1: Type: text/plain, Size: 3167 bytes --]

On 08/09/2011 01:16 AM, Badari Pulavarty wrote:
> On 8/8/2011 12:31 AM, Liu Yuan wrote:
>> On 08/08/2011 01:04 PM, Badari Pulavarty wrote:
>>> On 8/7/2011 6:35 PM, Liu Yuan wrote:
>>>> On 08/06/2011 02:02 AM, Badari Pulavarty wrote:
>>>>> On 8/5/2011 4:04 AM, Liu Yuan wrote:
>>>>>> On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
>>>>>>> Hi Liu Yuan,
>>>>>>>
>>>>>>> I started testing your patches. I applied your kernel patch to 3.0
>>>>>>> and applied QEMU to latest git.
>>>>>>>
>>>>>>> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
>>>>>>> I ran simple "dd" read tests from the guest on all block devices
>>>>>>> (with various blocksizes, iflag=direct).
>>>>>>>
>>>>>>> Unfortunately, system doesn't stay up. I immediately get into
>>>>>>> panic on the host. I didn't get time to debug the problem. 
>>>>>>> Wondering
>>>>>>> if you have seen this issue before and/or you have new patchset
>>>>>>> to try ?
>>>>>>>
>>>>>>> Let me know.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Badari
>>>>>>>
>>>>>>
>>>>>> Okay, it is actually a bug pointed out by MST on the other 
>>>>>> thread, that it needs a mutex for completion thread.
>>>>>>
>>>>>> Now would you please try this attachment? This patch only applies to
>>>>>> the kernel part, on top of the v1 kernel patch.
>>>>>>
>>>>>> This patch mainly moves completion thread into vhost thread as a 
>>>>>> function. As a result, both requests submitting and completion 
>>>>>> signalling is in the same thread.
>>>>>>
>>>>>> Yuan
>>>>>
>>>>> Unfortunately, "dd" tests (4 out of 6) in the guest hung. I see 
>>>>> following messages
>>>>>
>>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>>> virtio_blk virtio3: requests: id 1 is not a head !
>>>>> virtio_blk virtio5: requests: id 1 is not a head !
>>>>> virtio_blk virtio1: requests: id 1 is not a head !
>>>>>
>>>>> I still see host panics. I will collect the host panic and see if 
>>>>> its still same or not.
>>>>>
>>>>> Thanks,
>>>>> Badari
>>>>>
>>>>>
>>>> Would you please show me how to reproduce it step by step? I tried 
>>>> dd with two block device attached, but didn't get hung nor panic.
>>>>
>>>> Yuan
>>>
>>> I did 6 "dd"s on 6 block devices..
>>>
>>> dd if=/dev/vdb of=/dev/null bs=1M iflag=direct &
>>> dd if=/dev/vdc of=/dev/null bs=1M iflag=direct &
>>> dd if=/dev/vdd of=/dev/null bs=1M iflag=direct &
>>> dd if=/dev/vde of=/dev/null bs=1M iflag=direct &
>>> dd if=/dev/vdf of=/dev/null bs=1M iflag=direct &
>>> dd if=/dev/vdg of=/dev/null bs=1M iflag=direct &
>>>
>>> I can reproduce the problem within 3 minutes :(
>>>
>>> Thanks,
>>> Badari
>>>
>>>
>> Ah...I made an embarrassing mistake that I tried to 'free()' an 
>> kmem_cache object.
>>
>> Would you please revert the vblk-for-kernel-2 patch and apply the new 
>> one attached in this letter?
>>
> Hmm.. My version of the code seems to have kzalloc() for used_info. I
> don't have a version that is using kmem_cache_alloc(). Would it be
> possible for you to send out complete patch (with all the fixes
> applied) for me to try ? This will avoid all the confusion ..
>
> Thanks,
> Badari
>
>
Okay, please apply the attached patch to the vanilla kernel. :)

Thanks,
Yuan

[-- Attachment #2: vhost-blk-for-kernel.patch --]
[-- Type: text/x-patch, Size: 20763 bytes --]

diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..31f8b2e 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,5 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
+obj-m += vhost_blk.o
+
 vhost_net-y := vhost.o net.o
+vhost_blk-y := vhost.o blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 0000000..c372011
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,530 @@
+/* Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan <tailai.ly@taobao.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Vhost-blk driver is an in-kernel accelerator, intercepting the
+ * IO requests from KVM virtio-capable guests. It is based on the
+ * vhost infrastructure.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+#include <linux/eventfd.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/virtio_blk.h>
+#include <linux/file.h>
+#include <linux/mmu_context.h>
+#include <linux/kthread.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/blkdev.h>
+
+#include "vhost.h"
+
+#define DEBUG 0
+
+#if DEBUG > 0
+#define dprintk         printk
+#else
+#define dprintk(x...)   do { ; } while (0)
+#endif
+
+enum {
+	virtqueue_max = 1,
+};
+
+#define MAX_EVENTS 128
+
+struct vhost_blk {
+	struct vhost_virtqueue vq;
+	struct vhost_dev dev;
+	int should_stop;
+	struct kioctx *ioctx;
+	struct eventfd_ctx *ectx;
+	struct file *efile;
+	struct task_struct *worker;
+	struct vhost_poll poll;
+};
+
+struct used_info {
+	void *status;
+	int head;
+	int len;
+};
+
+static struct io_event events[MAX_EVENTS];
+
+static struct kmem_cache *used_info_cachep;
+
+static void blk_flush(struct vhost_blk *blk)
+{
+	vhost_poll_flush(&blk->vq.poll);
+	vhost_poll_flush(&blk->poll);
+}
+
+static long blk_set_features(struct vhost_blk *blk, u64 features)
+{
+	blk->dev.acked_features = features;
+	return 0;
+}
+
+static void blk_stop(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *f;
+
+	mutex_lock(&vq->mutex);
+	f = rcu_dereference_protected(vq->private_data,
+			lockdep_is_held(&vq->mutex));
+	rcu_assign_pointer(vq->private_data, NULL);
+	mutex_unlock(&vq->mutex);
+
+	if (f)
+		fput(f);
+}
+
+static long blk_set_backend(struct vhost_blk *blk, struct vhost_vring_file *backend)
+{
+	int idx = backend->index;
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *file, *oldfile;
+	int ret;
+
+	mutex_lock(&blk->dev.mutex);
+	ret = vhost_dev_check_owner(&blk->dev);
+	if (ret)
+		goto err_dev;
+	if (idx >= virtqueue_max) {
+		ret = -ENOBUFS;
+		goto err_dev;
+	}
+
+	mutex_lock(&vq->mutex);
+
+	if (!vhost_vq_access_ok(vq)) {
+		ret = -EFAULT;
+		goto err_vq;
+	}
+
+	file = fget(backend->fd);
+	if (!file) {	/* fget() returns NULL on failure, not an ERR_PTR */
+		ret = -EBADF;
+		goto err_vq;
+	}
+
+	oldfile = rcu_dereference_protected(vq->private_data,
+			lockdep_is_held(&vq->mutex));
+	if (file != oldfile)
+		rcu_assign_pointer(vq->private_data, file);
+
+	mutex_unlock(&vq->mutex);
+
+	if (oldfile) {
+		blk_flush(blk);
+		fput(oldfile);
+	}
+
+	mutex_unlock(&blk->dev.mutex);
+	return 0;
+err_vq:
+	mutex_unlock(&vq->mutex);
+err_dev:
+	mutex_unlock(&blk->dev.mutex);
+	return ret;
+}
+
+static long blk_reset_owner(struct vhost_blk *b)
+{
+	int ret;
+
+	mutex_lock(&b->dev.mutex);
+	ret = vhost_dev_check_owner(&b->dev);
+	if (ret)
+		goto err;
+	blk_stop(b);
+	blk_flush(b);
+	ret = vhost_dev_reset_owner(&b->dev);
+err:
+	mutex_unlock(&b->dev.mutex);
+	return ret;
+}
+
+static int kernel_io_setup(unsigned nr_events, struct kioctx **ioctx)
+{
+	int ret = 0;
+	*ioctx = ioctx_alloc(nr_events);
+	if (IS_ERR(*ioctx))	/* check the allocated context, not the out-pointer */
+		ret = PTR_ERR(*ioctx);
+	return ret;
+}
+
+static inline int kernel_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event *event,
+		struct timespec *ts)
+{
+	mm_segment_t old_fs;
+	int ret;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	ret = read_events(ctx, min_nr, nr, event, ts);
+	set_fs(old_fs);
+
+	return ret;
+}
+
+static inline ssize_t io_event_ret(struct io_event *ev)
+{
+	return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
+}
+
+static inline void aio_prep_req(struct kiocb *iocb, struct eventfd_ctx *ectx, struct file *file,
+		struct iovec *iov, int nvecs, u64 offset, int opcode, struct used_info *ui)
+{
+	iocb->ki_filp = file;
+	iocb->ki_eventfd = ectx;
+	iocb->ki_pos = offset;
+	iocb->ki_buf = (void *)iov;
+	iocb->ki_left = iocb->ki_nbytes = nvecs;
+	iocb->ki_opcode = opcode;
+	iocb->ki_obj.user = ui;
+}
+
+static inline int kernel_io_submit(struct vhost_blk *blk, struct iovec *iov, u64 nvecs, loff_t pos, int opcode, int head, int len)
+{
+	int ret = -EAGAIN;
+	struct kiocb *req;
+	struct kioctx *ioctx = blk->ioctx;
+	struct used_info *ui = kmem_cache_zalloc(used_info_cachep, GFP_KERNEL);
+	struct file *f = blk->vq.private_data;
+
+	try_get_ioctx(ioctx);
+	atomic_long_inc_not_zero(&f->f_count);
+	eventfd_ctx_get(blk->ectx);
+
+
+	req = aio_get_req(ioctx); /* returns with 2 references held on req */
+	if (unlikely(!req))
+		goto out;
+
+	ui->head = head;
+	ui->status = blk->vq.iov[nvecs + 1].iov_base;
+	ui->len = len;
+	aio_prep_req(req, blk->ectx, f, iov, nvecs, pos, opcode, ui);
+
+	ret = aio_setup_iocb(req, 0);
+	if (unlikely(ret))
+		goto out_put_req;
+
+	spin_lock_irq(&ioctx->ctx_lock);
+	if (unlikely(ioctx->dead)) {
+		spin_unlock_irq(&ioctx->ctx_lock);
+		ret = -EINVAL;
+		goto out_put_req;
+	}
+
+	aio_run_iocb(req);
+	if (!list_empty(&ioctx->run_list)) {
+		while (__aio_run_iocbs(ioctx))
+			;
+	}
+	spin_unlock_irq(&ioctx->ctx_lock);
+
+	aio_put_req(req);
+	put_ioctx(blk->ioctx);
+
+	return ret;
+
+out_put_req:
+	aio_put_req(req);
+	aio_put_req(req);
+out:
+	put_ioctx(blk->ioctx);
+	return ret;
+}
+
+static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
+		unsigned long arg)
+{
+	struct vhost_blk *blk = f->private_data;
+	struct vhost_vring_file backend;
+	u64 features = VHOST_BLK_FEATURES;
+	int ret = -EFAULT;
+
+	switch (ioctl) {
+		case VHOST_NET_SET_BACKEND:
+			if (copy_from_user(&backend, (void __user *)arg, sizeof backend))
+				break;
+			ret = blk_set_backend(blk, &backend);
+			break;
+		case VHOST_GET_FEATURES:
+			features = VHOST_BLK_FEATURES;
+			if (copy_to_user((void __user *)arg , &features, sizeof features))
+				break;
+			ret = 0;
+			break;
+		case VHOST_SET_FEATURES:
+			if (copy_from_user(&features, (void __user *)arg, sizeof features))
+				break;
+			if (features & ~VHOST_BLK_FEATURES) {
+				ret = -EOPNOTSUPP;
+				break;
+			}
+			ret = blk_set_features(blk, features);
+			break;
+		case VHOST_RESET_OWNER:
+			ret = blk_reset_owner(blk);
+			break;
+		default:
+			mutex_lock(&blk->dev.mutex);
+			ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
+			blk_flush(blk);
+			mutex_unlock(&blk->dev.mutex);
+			break;
+	}
+	return ret;
+}
+
+#define BLK_HDR 0
+#define BLK_HDR_LEN 16
+
+static inline int do_request(struct vhost_virtqueue *vq, struct virtio_blk_outhdr *hdr,
+		u64 nr_vecs, int head)
+{
+	struct file *f = vq->private_data;
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	struct iovec *iov = &vq->iov[BLK_HDR + 1];
+	loff_t pos = hdr->sector << 9;
+	int ret = 0, len = 0, status;
+	//	int i;
+
+	dprintk("sector %llu, num %lu, type %d\n", hdr->sector, iov->iov_len / 512, hdr->type);
+	// Guest virtio-blk driver doesn't use len currently.
+	//for (i = 0; i < nr_vecs; i++) {
+	//	len += iov[i].iov_len;
+	//}
+	switch (hdr->type) {
+		case VIRTIO_BLK_T_OUT:
+			kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PWRITEV, head, len);
+			break;
+		case VIRTIO_BLK_T_IN:
+			kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PREADV, head, len);
+			break;
+		case VIRTIO_BLK_T_FLUSH:
+			ret = vfs_fsync(f, 1);
+			/* fall through */
+		case VIRTIO_BLK_T_GET_ID:
+			status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+			if ((vq->iov[nr_vecs + 1].iov_len != 1))
+				BUG();
+
+			if (copy_to_user(vq->iov[nr_vecs + 1].iov_base, &status, sizeof status)) {
+				vq_err(vq, "%s failed to write status!\n", __func__);
+				vhost_discard_vq_desc(vq, 1);
+				ret = -EFAULT;
+				break;
+			}
+
+			vhost_add_used_and_signal(&blk->dev, vq, head, ret);
+			break;
+		default:
+			pr_info("%s, unsupported request type %d\n", __func__, hdr->type);
+			vhost_discard_vq_desc(vq, 1);
+			ret = -EFAULT;
+			break;
+	}
+	return ret;
+}
+
+static inline void handle_kick(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct virtio_blk_outhdr hdr;
+	u64 nr_vecs;
+	int in, out, head;
+	struct blk_plug plug;
+
+	mutex_lock(&vq->mutex);
+	vhost_disable_notify(&blk->dev, vq);
+
+	blk_start_plug(&plug);
+	for (;;) {
+		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+				ARRAY_SIZE(vq->iov),
+				&out, &in, NULL, NULL);
+		/* No available descriptors from Guest? */
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
+				vhost_disable_notify(&blk->dev, vq);
+				continue;
+			}
+			break;
+		}
+		if (unlikely(head < 0))
+			break;
+
+		dprintk("head %d, in %d, out %d\n", head, in, out);
+		if (unlikely(vq->iov[BLK_HDR].iov_len != BLK_HDR_LEN)) {
+			vq_err(vq, "%s bad block header length!\n", __func__);
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (copy_from_user(&hdr, vq->iov[BLK_HDR].iov_base, sizeof hdr)) {
+			vq_err(vq, "%s failed to get block header!\n", __func__);
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (hdr.type == VIRTIO_BLK_T_IN || hdr.type == VIRTIO_BLK_T_GET_ID)
+			nr_vecs = in - 1;
+		else
+			nr_vecs = out - 1;
+
+		if (do_request(vq, &hdr, nr_vecs, head) < 0)
+			break;
+	}
+	blk_finish_plug(&plug);
+	mutex_unlock(&vq->mutex);
+}
+
+static void handle_guest_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, poll.work);
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	handle_kick(blk);
+}
+
+static void handle_completion(struct vhost_work* work)
+{
+	struct vhost_blk *blk = container_of(work, struct vhost_blk, poll.work);
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct timespec ts = { 0 };
+	int ret, i, nr;
+	u64 count;
+
+	mutex_lock(&vq->mutex);
+	do {
+		ret = eventfd_ctx_read(blk->ectx, 1, &count);
+	} while (unlikely(ret == -ERESTARTSYS));
+
+	do {
+		nr = kernel_read_events(blk->ioctx, count, MAX_EVENTS, events, &ts);
+	} while (unlikely(nr == -EINTR));
+	dprintk("%s, count %llu, nr %d\n", __func__, count, nr);
+
+	if (unlikely(nr <= 0)) {
+		mutex_unlock(&vq->mutex);
+		return;
+	}
+
+	for (i = 0; i < nr; i++) {
+		struct used_info *u = (struct used_info *)events[i].obj;
+		int len, status;
+
+		dprintk("%s, head %d complete in %d\n", __func__, u->head, i);
+		len = io_event_ret(&events[i]);
+		//status = u->len == len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		status = len > 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		if (copy_to_user(u->status, &status, sizeof status)) {
+			vq_err(&blk->vq, "%s failed to write status\n", __func__);
+			BUG(); /* FIXME: maybe a bit radical? */
+		}
+		vhost_add_used(&blk->vq, u->head, u->len);
+		kmem_cache_free(used_info_cachep, u);
+	}
+
+	vhost_signal(&blk->dev, &blk->vq);
+	mutex_unlock(&vq->mutex);
+}
+
+static void eventfd_setup(struct vhost_blk *blk)
+{
+	blk->efile = eventfd_file_create(0, 0);
+	blk->ectx = eventfd_ctx_fileget(blk->efile);
+	vhost_poll_init(&blk->poll, handle_completion, POLLIN, &blk->dev);
+	vhost_poll_start(&blk->poll, blk->efile);
+}
+
+static int vhost_blk_open(struct inode *inode, struct file *f)
+{
+	int ret = -ENOMEM;
+	struct vhost_blk *blk = kmalloc(sizeof *blk, GFP_KERNEL);
+	if (!blk)
+		goto err;
+
+	blk->vq.handle_kick = handle_guest_kick;
+	ret = vhost_dev_init(&blk->dev, &blk->vq, virtqueue_max);
+	if (ret < 0)
+		goto err_init;
+
+	ret = kernel_io_setup(MAX_EVENTS, &blk->ioctx);
+	if (ret < 0)
+		goto err_io_setup;
+
+	eventfd_setup(blk);
+	f->private_data = blk;
+	used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+	return ret;
+err_init:
+err_io_setup:
+	kfree(blk);
+err:
+	return ret;
+}
+
+static void eventfd_destroy(struct vhost_blk *blk)
+{
+	eventfd_ctx_put(blk->ectx);
+	fput(blk->efile);
+}
+
+static int vhost_blk_release(struct inode *inode, struct file *f)
+{
+	struct vhost_blk *blk = f->private_data;
+
+	blk_stop(blk);
+	blk_flush(blk);
+	vhost_dev_cleanup(&blk->dev);
+	/* Yet another flush? See comments in vhost_net_release() */
+	blk_flush(blk);
+	eventfd_destroy(blk);
+	kfree(blk);
+
+	return 0;
+}
+
+static const struct file_operations vhost_blk_fops = {
+	.owner          = THIS_MODULE,
+	.release        = vhost_blk_release,
+	.open           = vhost_blk_open,
+	.unlocked_ioctl = vhost_blk_ioctl,
+	.llseek		= noop_llseek,
+};
+
+
+static struct miscdevice vhost_blk_misc = {
+	234,
+	"vhost-blk",
+	&vhost_blk_fops,
+};
+
+int vhost_blk_init(void)
+{
+	return misc_register(&vhost_blk_misc);
+}
+void vhost_blk_exit(void)
+{
+	misc_deregister(&vhost_blk_misc);
+}
+
+module_init(vhost_blk_init);
+module_exit(vhost_blk_exit);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Liu Yuan");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8e03379..9e17152 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,6 +12,7 @@
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <asm/atomic.h>
+#include <linux/virtio_blk.h>
 
 struct vhost_device;
 
@@ -174,6 +175,16 @@ enum {
 			 (1ULL << VHOST_F_LOG_ALL) |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+
+	VHOST_BLK_FEATURES =	(1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
+				(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+				(1ULL << VIRTIO_RING_F_EVENT_IDX) |
+				(1ULL << VIRTIO_BLK_F_SEG_MAX) |
+				(1ULL << VIRTIO_BLK_F_GEOMETRY) |
+				(1ULL << VIRTIO_BLK_F_TOPOLOGY) |
+				(1ULL << VIRTIO_BLK_F_SCSI) |
+				(1ULL << VIRTIO_BLK_F_BLK_SIZE),
+
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
diff --git a/fs/aio.c b/fs/aio.c
index e29ec48..534d396 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -215,7 +215,7 @@ static void ctx_rcu_free(struct rcu_head *head)
  *	Called when the last user of an aio context has gone away,
  *	and the struct needs to be freed.
  */
-static void __put_ioctx(struct kioctx *ctx)
+void __put_ioctx(struct kioctx *ctx)
 {
 	BUG_ON(ctx->reqs_active);
 
@@ -227,29 +227,12 @@ static void __put_ioctx(struct kioctx *ctx)
 	pr_debug("__put_ioctx: freeing %p\n", ctx);
 	call_rcu(&ctx->rcu_head, ctx_rcu_free);
 }
-
-static inline void get_ioctx(struct kioctx *kioctx)
-{
-	BUG_ON(atomic_read(&kioctx->users) <= 0);
-	atomic_inc(&kioctx->users);
-}
-
-static inline int try_get_ioctx(struct kioctx *kioctx)
-{
-	return atomic_inc_not_zero(&kioctx->users);
-}
-
-static inline void put_ioctx(struct kioctx *kioctx)
-{
-	BUG_ON(atomic_read(&kioctx->users) <= 0);
-	if (unlikely(atomic_dec_and_test(&kioctx->users)))
-		__put_ioctx(kioctx);
-}
+EXPORT_SYMBOL(__put_ioctx);
 
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+struct kioctx *ioctx_alloc(unsigned nr_events)
 {
 	struct mm_struct *mm;
 	struct kioctx *ctx;
@@ -327,6 +310,7 @@ out_freectx:
 	dprintk("aio: error allocating ioctx %p\n", ctx);
 	return ctx;
 }
+EXPORT_SYMBOL(ioctx_alloc);
 
 /* aio_cancel_all
  *	Cancels all outstanding aio requests on an aio context.  Used 
@@ -437,7 +421,7 @@ void exit_aio(struct mm_struct *mm)
  * This prevents races between the aio code path referencing the
  * req (after submitting it) and aio_complete() freeing the req.
  */
-static struct kiocb *__aio_get_req(struct kioctx *ctx)
+struct kiocb *__aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req = NULL;
 	struct aio_ring *ring;
@@ -480,7 +464,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
 	return req;
 }
 
-static inline struct kiocb *aio_get_req(struct kioctx *ctx)
+struct kiocb *aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req;
 	/* Handle a potential starvation case -- should be exceedingly rare as 
@@ -494,6 +478,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 	}
 	return req;
 }
+EXPORT_SYMBOL(aio_get_req);
 
 static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
 {
@@ -659,7 +644,7 @@ static inline int __queue_kicked_iocb(struct kiocb *iocb)
  * simplifies the coding of individual aio operations as
  * it avoids various potential races.
  */
-static ssize_t aio_run_iocb(struct kiocb *iocb)
+ssize_t aio_run_iocb(struct kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
 	ssize_t (*retry)(struct kiocb *);
@@ -753,6 +738,7 @@ out:
 	}
 	return ret;
 }
+EXPORT_SYMBOL(aio_run_iocb);
 
 /*
  * __aio_run_iocbs:
@@ -761,7 +747,7 @@ out:
  * Assumes it is operating within the aio issuer's mm
  * context.
  */
-static int __aio_run_iocbs(struct kioctx *ctx)
+int __aio_run_iocbs(struct kioctx *ctx)
 {
 	struct kiocb *iocb;
 	struct list_head run_list;
@@ -784,6 +770,7 @@ static int __aio_run_iocbs(struct kioctx *ctx)
 		return 1;
 	return 0;
 }
+EXPORT_SYMBOL(__aio_run_iocbs);
 
 static void aio_queue_work(struct kioctx * ctx)
 {
@@ -1074,7 +1061,7 @@ static inline void clear_timeout(struct aio_timeout *to)
 	del_singleshot_timer_sync(&to->timer);
 }
 
-static int read_events(struct kioctx *ctx,
+int read_events(struct kioctx *ctx,
 			long min_nr, long nr,
 			struct io_event __user *event,
 			struct timespec __user *timeout)
@@ -1190,11 +1177,12 @@ out:
 	destroy_timer_on_stack(&to.timer);
 	return i ? i : ret;
 }
+EXPORT_SYMBOL(read_events);
 
 /* Take an ioctx and remove it from the list of ioctx's.  Protects 
  * against races with itself via ->dead.
  */
-static void io_destroy(struct kioctx *ioctx)
+void io_destroy(struct kioctx *ioctx)
 {
 	struct mm_struct *mm = current->mm;
 	int was_dead;
@@ -1221,6 +1209,7 @@ static void io_destroy(struct kioctx *ioctx)
 	wake_up_all(&ioctx->wait);
 	put_ioctx(ioctx);	/* once for the lookup */
 }
+EXPORT_SYMBOL(io_destroy);
 
 /* sys_io_setup:
  *	Create an aio_context capable of receiving at least nr_events.
@@ -1423,7 +1412,7 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
  *	Performs the initial checks and aio retry method
  *	setup for the kiocb at the time of io submission.
  */
-static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
+ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 {
 	struct file *file = kiocb->ki_filp;
 	ssize_t ret = 0;
@@ -1513,6 +1502,7 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 
 	return 0;
 }
+EXPORT_SYMBOL(aio_setup_iocb);
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index d9a5917..6343bc9 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
 
 	return file;
 }
+EXPORT_SYMBOL_GPL(eventfd_file_create);
 
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7a8db41..d63bc04 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -214,6 +214,37 @@ struct mm_struct;
 extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
+extern void __put_ioctx(struct kioctx *ctx);
+extern struct kioctx *ioctx_alloc(unsigned nr_events);
+extern struct kiocb *aio_get_req(struct kioctx *ctx);
+extern ssize_t aio_run_iocb(struct kiocb *iocb);
+extern int __aio_run_iocbs(struct kioctx *ctx);
+extern int read_events(struct kioctx *ctx,
+                        long min_nr, long nr,
+                        struct io_event __user *event,
+                        struct timespec __user *timeout);
+extern void io_destroy(struct kioctx *ioctx);
+extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
+extern void __put_ioctx(struct kioctx *ctx);
+
+static inline void get_ioctx(struct kioctx *kioctx)
+{
+        BUG_ON(atomic_read(&kioctx->users) <= 0);
+        atomic_inc(&kioctx->users);
+}
+
+static inline int try_get_ioctx(struct kioctx *kioctx)
+{
+        return atomic_inc_not_zero(&kioctx->users);
+}
+
+static inline void put_ioctx(struct kioctx *kioctx)
+{
+        BUG_ON(atomic_read(&kioctx->users) <= 0);
+        if (unlikely(atomic_dec_and_test(&kioctx->users)))
+                __put_ioctx(kioctx);
+}
+
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline int aio_put_req(struct kiocb *iocb) { return 0; }

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-10  2:19                       ` Liu Yuan
@ 2011-08-10 20:37                         ` Badari Pulavarty
  2011-08-11  3:01                           ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-10 20:37 UTC (permalink / raw)
  To: Liu Yuan
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell, Avi Kivity,
	kvm, linux-kernel, Khoa Huynh

On Wed, 2011-08-10 at 10:19 +0800, Liu Yuan wrote:
> On 08/09/2011 01:16 AM, Badari Pulavarty wrote:
> > On 8/8/2011 12:31 AM, Liu Yuan wrote:
> >> On 08/08/2011 01:04 PM, Badari Pulavarty wrote:
> >>> On 8/7/2011 6:35 PM, Liu Yuan wrote:
> >>>> On 08/06/2011 02:02 AM, Badari Pulavarty wrote:
> >>>>> On 8/5/2011 4:04 AM, Liu Yuan wrote:
> >>>>>> On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
> >>>>>>> Hi Liu Yuan,
> >>>>>>>
> >>>>>>> I started testing your patches. I applied your kernel patch to 3.0
> >>>>>>> and applied QEMU to latest git.
> >>>>>>>
> >>>>>>> I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
> >>>>>>> I ran simple "dd" read tests from the guest on all block devices
> >>>>>>> (with various blocksizes, iflag=direct).
> >>>>>>>
> >>>>>>> Unfortunately, system doesn't stay up. I immediately get into
> >>>>>>> panic on the host. I didn't get time to debug the problem. 
> >>>>>>> Wondering
> >>>>>>> if you have seen this issue before and/or you have new patchset
> >>>>>>> to try ?
> >>>>>>>
> >>>>>>> Let me know.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Badari
> >>>>>>>
> >>>>>>
> >>>>>> Okay, it is actually a bug pointed out by MST on the other 
> >>>>>> thread, that it needs a mutex for completion thread.
> >>>>>>
> >>>>>> Now would you please try this attachment? This patch only applies to
> >>>>>> the kernel part, on top of the v1 kernel patch.
> >>>>>>
> >>>>>> This patch mainly moves completion thread into vhost thread as a 
> >>>>>> function. As a result, both requests submitting and completion 
> >>>>>> signalling is in the same thread.
> >>>>>>
> >>>>>> Yuan
> >>>>>
> >>>>> Unfortunately, "dd" tests (4 out of 6) in the guest hung. I see 
> >>>>> following messages
> >>>>>
> >>>>> virtio_blk virtio2: requests: id 0 is not a head !
> >>>>> virtio_blk virtio3: requests: id 1 is not a head !
> >>>>> virtio_blk virtio5: requests: id 1 is not a head !
> >>>>> virtio_blk virtio1: requests: id 1 is not a head !
> >>>>>
> >>>>> I still see host panics. I will collect the host panic and see if 
> >>>>> its still same or not.
> >>>>>
> >>>>> Thanks,
> >>>>> Badari
> >>>>>
> >>>>>
> >>>> Would you please show me how to reproduce it step by step? I tried 
> >>>> dd with two block device attached, but didn't get hung nor panic.
> >>>>
> >>>> Yuan
> >>>
> >>> I did 6 "dd"s on 6 block devices..
> >>>
> >>> dd if=/dev/vdb of=/dev/null bs=1M iflag=direct &
> >>> dd if=/dev/vdc of=/dev/null bs=1M iflag=direct &
> >>> dd if=/dev/vdd of=/dev/null bs=1M iflag=direct &
> >>> dd if=/dev/vde of=/dev/null bs=1M iflag=direct &
> >>> dd if=/dev/vdf of=/dev/null bs=1M iflag=direct &
> >>> dd if=/dev/vdg of=/dev/null bs=1M iflag=direct &
> >>>
> >>> I can reproduce the problem with in 3 minutes :(
> >>>
> >>> Thanks,
> >>> Badari
> >>>
> >>>
> >> Ah...I made an embarrassing mistake that I tried to 'free()' an 
> >> kmem_cache object.
> >>
> >> Would you please revert the vblk-for-kernel-2 patch and apply the new 
> >> one attached in this letter?
> >>
> > Hmm.. My version of the code seems to have kzalloc() for used_info. I 
> > don't have a version
> > that is using kmem_cache_alloc(). Would it be possible for you to send 
> > out complete patch
> > (with all the fixes applied) for me to try ? This will avoid all the 
> > confusion ..
> >
> > Thanks,
> > Badari
> >
>
> Okay, please apply the attached patch to the vanilla kernel. :)


It looks like the patch wouldn't work for testing multiple devices.

vhost_blk_open() does
+       used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
SLAB_PANIC);

When opening the second device, we get a panic since used_info_cachep
has already been created. Just to make progress I moved this call to
vhost_blk_init().
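
The create-once point can be sketched in plain userspace C. This is only
an illustration: fake_cache, fake_module_init(), fake_device_open() and
fake_module_exit() are hypothetical stand-ins, not kernel APIs, and
malloc()/calloc() stand in for the slab allocator. The shared cache is
created in the init analogue, which runs once, while the open analogue,
which runs once per device, only allocates from it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel slab cache; not the real API. */
struct fake_cache {
	size_t obj_size;
};

static struct fake_cache *shared_cache; /* one cache shared by all devices */
static int cache_creations;             /* how many times it was created */

/* Analogue of module init: runs once and creates the shared cache. */
static int fake_module_init(size_t obj_size)
{
	shared_cache = malloc(sizeof(*shared_cache));
	if (!shared_cache)
		return -1;
	shared_cache->obj_size = obj_size;
	cache_creations++;
	return 0;
}

/* Analogue of open(): runs once per device and only uses the cache. */
static void *fake_device_open(void)
{
	assert(shared_cache != NULL); /* init must have run first */
	return calloc(1, shared_cache->obj_size);
}

/* Analogue of module exit: tears the shared cache down again. */
static void fake_module_exit(void)
{
	free(shared_cache);
	shared_cache = NULL;
}
```

Calling fake_module_init() once and fake_device_open() three times
leaves cache_creations at 1, which is the property the move to
vhost_blk_init() is after.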

I don't see any host panics now. With a single block device (dd),
it seems to work fine. But when I start testing multiple block
devices I quickly run into hangs in the guest. I see the following
messages in the guest from virtio_ring.c:

virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio1: requests: id 0 is not a head !
virtio_blk virtio4: requests: id 1 is not a head !
virtio_blk virtio3: requests: id 39 is not a head !

Thanks,
Badari




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-10 20:37                         ` Badari Pulavarty
@ 2011-08-11  3:01                           ` Liu Yuan
  2011-08-11  3:19                             ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-11  3:01 UTC (permalink / raw)
  To: Badari Pulavarty, kvm


> It looks like the patch wouldn't work for testing multiple devices.
>
> vhost_blk_open() does
> +       used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
> SLAB_PANIC);
>

This is weird. How do you open multiple devices? I just opened the
devices with the following command:

-drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
file=~/data0.img,if=virtio,cache=none,aio=native -drive 
file=~/data1.img,if=virtio,cache=none,aio=native

And I didn't hit any problem.

This tells qemu to open three devices and pass three FDs to three
instances of the vhost_blk module.
So KMEM_CACHE() is okay in vhost_blk_open().

> When opening second device, we get panic since used_info_cachep is
> already created. Just to make progress I moved this call to
> vhost_blk_init().
>
> I don't see any host panics now. With single block device (dd),
> it seems to work fine. But when I start testing multiple block
> devices I quickly run into hangs in the guest. I see following
> messages in the guest from virtio_ring.c:
>
> virtio_blk virtio2: requests: id 0 is not a head !
> virtio_blk virtio1: requests: id 0 is not a head !
> virtio_blk virtio4: requests: id 1 is not a head !
> virtio_blk virtio3: requests: id 39 is not a head !
>
> Thanks,
> Badari
>
>

vq->data[] is initialized by the guest virtio-blk driver and vhost_blk
is unaware of it. It looks like the used ID passed over by vhost_blk to
the guest virtio_blk is wrong, but that should not happen. :|

And I can't reproduce this on my laptop. :(

> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-11  3:01                           ` Liu Yuan
@ 2011-08-11  3:19                             ` Liu Yuan
  2011-08-11 23:51                               ` Badari Pulavarty
  2011-08-12  4:50                               ` Badari Pulavarty
  0 siblings, 2 replies; 54+ messages in thread
From: Liu Yuan @ 2011-08-11  3:19 UTC (permalink / raw)
  To: Badari Pulavarty, kvm

On 08/11/2011 11:01 AM, Liu Yuan wrote:
>
>> It looks like the patch wouldn't work for testing multiple devices.
>>
>> vhost_blk_open() does
>> +       used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
>> SLAB_PANIC);
>>
>
> This is weird. how do you open multiple device?I just opened the 
> device with following command:
>
> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
> file=~/data1.img,if=virtio,cache=none,aio=native
>
> And I didn't meet any problem.
>
> this would tell qemu to open three devices, and pass three FDs to 
> three instances of vhost_blk module.
> So KMEM_CACHE() is okay in vhost_blk_open().
>

Oh, you are right. KMEM_CACHE() is in the wrong place. It is three
instances of the vhost worker thread that get created, not three
instances of the module. Hmmm, but I didn't hit any problem when opening
it and running it. So strange. I'll go and figure it out.

>> When opening second device, we get panic since used_info_cachep is
>> already created. Just to make progress I moved this call to
>> vhost_blk_init().
>>
>> I don't see any host panics now. With single block device (dd),
>> it seems to work fine. But when I start testing multiple block
>> devices I quickly run into hangs in the guest. I see following
>> messages in the guest from virtio_ring.c:
>>
>> virtio_blk virtio2: requests: id 0 is not a head !
>> virtio_blk virtio1: requests: id 0 is not a head !
>> virtio_blk virtio4: requests: id 1 is not a head !
>> virtio_blk virtio3: requests: id 39 is not a head !
>>
>> Thanks,
>> Badari
>>
>>
>
> vq->data[] is initialized by guest virtio-blk driver and vhost_blk is 
> unware of it. it looks like used ID passed
> over by vhost_blk to guest virtio_blk is wrong, but, it should not 
> happen. :|
>
> And I can't reproduce this on my laptop. :(
>
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-11  3:19                             ` Liu Yuan
@ 2011-08-11 23:51                               ` Badari Pulavarty
  2011-08-12  4:50                               ` Badari Pulavarty
  1 sibling, 0 replies; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-11 23:51 UTC (permalink / raw)
  To: Liu Yuan; +Cc: kvm

On 8/10/2011 8:19 PM, Liu Yuan wrote:
> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>
>>> It looks like the patch wouldn't work for testing multiple devices.
>>>
>>> vhost_blk_open() does
>>> +       used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
>>> SLAB_PANIC);
>>>
>>
>> This is weird. how do you open multiple device?I just opened the 
>> device with following command:
>>
>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>> file=~/data1.img,if=virtio,cache=none,aio=native
>>
>> And I didn't meet any problem.
>>
>> this would tell qemu to open three devices, and pass three FDs to 
>> three instances of vhost_blk module.
>> So KMEM_CACHE() is okay in vhost_blk_open().
>>
>
> Oh, you are right. KMEM_CACHE() is in the wrong place. it is three 
> instances vhost worker threads created. Hmmm, but I didn't meet any 
> problem when opening it and running it. So strange. I'll go to figure 
> it out.
>
>>> When opening second device, we get panic since used_info_cachep is
>>> already created. Just to make progress I moved this call to
>>> vhost_blk_init().
>>>
>>> I don't see any host panics now. With single block device (dd),
>>> it seems to work fine. But when I start testing multiple block
>>> devices I quickly run into hangs in the guest. I see following
>>> messages in the guest from virtio_ring.c:
>>>
>>> virtio_blk virtio2: requests: id 0 is not a head !
>>> virtio_blk virtio1: requests: id 0 is not a head !
>>> virtio_blk virtio4: requests: id 1 is not a head !
>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>
>>> Thanks,
>>> Badari
>>>
>>>
>>
>> vq->data[] is initialized by guest virtio-blk driver and vhost_blk is 
>> unware of it. it looks like used ID passed
>> over by vhost_blk to guest virtio_blk is wrong, but, it should not 
>> happen. :|
>>
>> And I can't reproduce this on my laptop. :(

I spent a lot of time looking at the code to see how we can pass the
wrong ID and corrupt vq->data[]. I can't seem to spot the bug :(

I hacked vhost_blk to return success immediately, without doing any IO,
to see if it's a generic problem. With the hack (of not doing any IO), I
can't reproduce the problem. So it's something in the IO completion
handling code causing this. I will keep looking..

Thanks,
Badari





^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-11  3:19                             ` Liu Yuan
  2011-08-11 23:51                               ` Badari Pulavarty
@ 2011-08-12  4:50                               ` Badari Pulavarty
  2011-08-12  6:46                                 ` Dongsu Park
  2011-08-12  8:27                                 ` Liu Yuan
  1 sibling, 2 replies; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-12  4:50 UTC (permalink / raw)
  To: Liu Yuan; +Cc: kvm

On 8/10/2011 8:19 PM, Liu Yuan wrote:
> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>
>>> It looks like the patch wouldn't work for testing multiple devices.
>>>
>>> vhost_blk_open() does
>>> +       used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
>>> SLAB_PANIC);
>>>
>>
>> This is weird. how do you open multiple device?I just opened the 
>> device with following command:
>>
>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>> file=~/data1.img,if=virtio,cache=none,aio=native
>>
>> And I didn't meet any problem.
>>
>> this would tell qemu to open three devices, and pass three FDs to 
>> three instances of vhost_blk module.
>> So KMEM_CACHE() is okay in vhost_blk_open().
>>
>
> Oh, you are right. KMEM_CACHE() is in the wrong place. it is three 
> instances vhost worker threads created. Hmmm, but I didn't meet any 
> problem when opening it and running it. So strange. I'll go to figure 
> it out.
>
>>> When opening second device, we get panic since used_info_cachep is
>>> already created. Just to make progress I moved this call to
>>> vhost_blk_init().
>>>
>>> I don't see any host panics now. With single block device (dd),
>>> it seems to work fine. But when I start testing multiple block
>>> devices I quickly run into hangs in the guest. I see following
>>> messages in the guest from virtio_ring.c:
>>>
>>> virtio_blk virtio2: requests: id 0 is not a head !
>>> virtio_blk virtio1: requests: id 0 is not a head !
>>> virtio_blk virtio4: requests: id 1 is not a head !
>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>
>>> Thanks,
>>> Badari
>>>
>>>
>>
>> vq->data[] is initialized by guest virtio-blk driver and vhost_blk is 
>> unware of it. it looks like used ID passed
>> over by vhost_blk to guest virtio_blk is wrong, but, it should not 
>> happen. :|
>>
>> And I can't reproduce this on my laptop. :(
>>
Finally, found the issue  :)

Culprit is:

+static struct io_event events[MAX_EVENTS];

With multiple devices, multiple threads can be executing
handle_completion() (one for each fd) at the same time, but the "events"
array is global :( It needs to be one per device/fd.

For testing, I changed MAX_EVENTS to 32 and made the "events" array a
local (stack) variable in handle_completion(). Tests are running fine.

Your laptop must have a single processor, so only one thread is
executing handle_completion() at any time..
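
To make the failure mode concrete, here is a minimal userspace sketch.
It is not the vhost-blk code: fake_event, fake_dev and the helper names
are hypothetical, and instead of real threads it replays one fixed
interleaving (handler A fills its buffer, handler B fills its buffer,
then both process) so the outcome is deterministic. With one shared
array, A ends up walking B's completions; with one array per device,
nothing is mixed up.

```c
#include <stddef.h>

#define NR_EVENTS 4 /* hypothetical small size for the sketch */

struct fake_event {
	int owner; /* which device produced this completion */
};

struct fake_dev {
	int id;
	struct fake_event *events; /* buffer completions are read into */
};

/* Step 1 of completion handling: read completions into the buffer. */
static void fill_events(struct fake_dev *dev)
{
	size_t i;

	for (i = 0; i < NR_EVENTS; i++)
		dev->events[i].owner = dev->id;
}

/* Step 2: walk the buffer and count entries from another device. */
static int count_foreign(const struct fake_dev *dev)
{
	size_t i;
	int bad = 0;

	for (i = 0; i < NR_EVENTS; i++)
		if (dev->events[i].owner != dev->id)
			bad++;
	return bad;
}

/* Replay the interleaving: A fills, B fills, then A and B process. */
static int replay_interleaving(struct fake_dev *a, struct fake_dev *b)
{
	fill_events(a);
	fill_events(b);
	return count_foreign(a) + count_foreign(b);
}

/* Both devices point at one shared array, like a global "events". */
static int run_shared(void)
{
	struct fake_event shared[NR_EVENTS];
	struct fake_dev a = { 1, shared };
	struct fake_dev b = { 2, shared };

	return replay_interleaving(&a, &b);
}

/* Each device owns its array, i.e. one buffer per device/fd. */
static int run_per_device(void)
{
	struct fake_event ev_a[NR_EVENTS], ev_b[NR_EVENTS];
	struct fake_dev a = { 1, ev_a };
	struct fake_dev b = { 2, ev_b };

	return replay_interleaving(&a, &b);
}
```

run_shared() reports NR_EVENTS foreign entries for this interleaving,
while run_per_device() reports none.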

Thanks,
Badari



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-12  4:50                               ` Badari Pulavarty
@ 2011-08-12  6:46                                 ` Dongsu Park
  2011-08-12  8:27                                 ` Liu Yuan
  1 sibling, 0 replies; 54+ messages in thread
From: Dongsu Park @ 2011-08-12  6:46 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Liu Yuan, kvm

Hi Badari,

On 12/08/11 06:50, Badari Pulavarty wrote:
> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>>>> When opening second device, we get panic since used_info_cachep is
>>>> already created. Just to make progress I moved this call to
>>>> vhost_blk_init().
>>>>
>>>> I don't see any host panics now. With single block device (dd),
>>>> it seems to work fine. But when I start testing multiple block
>>>> devices I quickly run into hangs in the guest. I see following
>>>> messages in the guest from virtio_ring.c:
>>>>
>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>
>>>> Thanks,
>>>> Badari
>>>>
>>>>
>>>
>>> vq->data[] is initialized by guest virtio-blk driver and vhost_blk is
>>> unware of it. it looks like used ID passed
>>> over by vhost_blk to guest virtio_blk is wrong, but, it should not
>>> happen. :|
>>>
>>> And I can't reproduce this on my laptop. :(
>>>
> Finally, found the issue :)
>
> Culprit is:
>
> +static struct io_event events[MAX_EVENTS];
>
> With multiple devices, multiple threads could be executing
> handle_completion() (one for
> each fd) at the same time. "events" array is global :( Need to make it
> one per device/fd.
>
> For test, I changed MAX_EVENTS to 32 and moved "events" array to be
> local (stack)
> to handle_completion(). Tests are running fine.
>
> Your laptop must have single processor, hence you have only one thread
> executing handle_completion()
> at any time..

Can you please post your code, or send it to me via email?
I'm also trying to get it running on a multi-processor system.

Thanks in advance,

>
> Thanks,
> Badari


-- 
Dongsu Park
Email: dongsu.park@profitbricks.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-12  4:50                               ` Badari Pulavarty
  2011-08-12  6:46                                 ` Dongsu Park
@ 2011-08-12  8:27                                 ` Liu Yuan
  2011-08-12 11:40                                   ` Liu Yuan
  1 sibling, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-12  8:27 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: kvm

On 08/12/2011 12:50 PM, Badari Pulavarty wrote:
> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>>
>>>> It looks like the patch wouldn't work for testing multiple devices.
>>>>
>>>> vhost_blk_open() does
>>>> +       used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
>>>> SLAB_PANIC);
>>>>
>>>
>>> This is weird. how do you open multiple device?I just opened the 
>>> device with following command:
>>>
>>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>>> file=~/data1.img,if=virtio,cache=none,aio=native
>>>
>>> And I didn't meet any problem.
>>>
>>> this would tell qemu to open three devices, and pass three FDs to 
>>> three instances of vhost_blk module.
>>> So KMEM_CACHE() is okay in vhost_blk_open().
>>>
>>
>> Oh, you are right. KMEM_CACHE() is in the wrong place. it is three 
>> instances vhost worker threads created. Hmmm, but I didn't meet any 
>> problem when opening it and running it. So strange. I'll go to figure 
>> it out.
>>
>>>> When opening second device, we get panic since used_info_cachep is
>>>> already created. Just to make progress I moved this call to
>>>> vhost_blk_init().
>>>>
>>>> I don't see any host panics now. With single block device (dd),
>>>> it seems to work fine. But when I start testing multiple block
>>>> devices I quickly run into hangs in the guest. I see following
>>>> messages in the guest from virtio_ring.c:
>>>>
>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>
>>>> Thanks,
>>>> Badari
>>>>
>>>>
>>>
>>> vq->data[] is initialized by guest virtio-blk driver and vhost_blk 
>>> is unware of it. it looks like used ID passed
>>> over by vhost_blk to guest virtio_blk is wrong, but, it should not 
>>> happen. :|
>>>
>>> And I can't reproduce this on my laptop. :(
>>>
> Finally, found the issue  :)
>
> Culprit is:
>
> +static struct io_event events[MAX_EVENTS];
>
> With multiple devices, multiple threads could be executing 
> handle_completion() (one for
> each fd) at the same time. "events" array is global :( Need to make it 
> one per device/fd.
>
> For test, I changed MAX_EVENTS to 32 and moved "events" array to be 
> local (stack)
> to handle_completion(). Tests are running fine.
>
> Your laptop must have single processor, hence you have only one thread 
> executing handle_completion()
> at any time..
>
> Thanks,
> Badari
>
>
Good catch, this is rather cool! Yup, I develop it mostly in a nested
KVM environment, and the L2 host only runs a single processor :(

Thanks,
Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-12  8:27                                 ` Liu Yuan
@ 2011-08-12 11:40                                   ` Liu Yuan
  2011-08-12 16:12                                     ` Badari Pulavarty
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-12 11:40 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: kvm

On 08/12/2011 04:27 PM, Liu Yuan wrote:
> On 08/12/2011 12:50 PM, Badari Pulavarty wrote:
>> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>>> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>>>
>>>>> It looks like the patch wouldn't work for testing multiple devices.
>>>>>
>>>>> vhost_blk_open() does
>>>>> +       used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
>>>>> SLAB_PANIC);
>>>>>
>>>>
>>>> This is weird. how do you open multiple device?I just opened the 
>>>> device with following command:
>>>>
>>>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>>>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>>>> file=~/data1.img,if=virtio,cache=none,aio=native
>>>>
>>>> And I didn't meet any problem.
>>>>
>>>> this would tell qemu to open three devices, and pass three FDs to 
>>>> three instances of vhost_blk module.
>>>> So KMEM_CACHE() is okay in vhost_blk_open().
>>>>
>>>
>>> Oh, you are right. KMEM_CACHE() is in the wrong place. it is three 
>>> instances vhost worker threads created. Hmmm, but I didn't meet any 
>>> problem when opening it and running it. So strange. I'll go to 
>>> figure it out.
>>>
>>>>> When opening second device, we get panic since used_info_cachep is
>>>>> already created. Just to make progress I moved this call to
>>>>> vhost_blk_init().
>>>>>
>>>>> I don't see any host panics now. With single block device (dd),
>>>>> it seems to work fine. But when I start testing multiple block
>>>>> devices I quickly run into hangs in the guest. I see following
>>>>> messages in the guest from virtio_ring.c:
>>>>>
>>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>>
>>>>> Thanks,
>>>>> Badari
>>>>>
>>>>>
>>>>
>>>> vq->data[] is initialized by guest virtio-blk driver and vhost_blk 
>>>> is unware of it. it looks like used ID passed
>>>> over by vhost_blk to guest virtio_blk is wrong, but, it should not 
>>>> happen. :|
>>>>
>>>> And I can't reproduce this on my laptop. :(
>>>>
>> Finally, found the issue  :)
>>
>> Culprit is:
>>
>> +static struct io_event events[MAX_EVENTS];
>>
>> With multiple devices, multiple threads could be executing 
>> handle_completion() (one for
>> each fd) at the same time. "events" array is global :( Need to make 
>> it one per device/fd.
>>
>> For test, I changed MAX_EVENTS to 32 and moved "events" array to be 
>> local (stack)
>> to handle_completion(). Tests are running fine.
>>
>> Your laptop must have single processor, hence you have only one 
>> thread executing handle_completion()
>> at any time..
>>
>> Thanks,
>> Badari
>>
>>
> Good catch, this is rather cool!....Yup, I develop it mostly in a 
> nested KVM environment. and the L2 host  only runs single processor :(
>
> Thanks,
> Yuan
By the way, MAX_EVENTS should be 128, as many as the guest virtio_blk
driver can batch-submit; a smaller value causes an array overflow.
I turned on debugging and saw more than 100 requests batched from the
guest OS.
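
Sizing the array to the largest batch is one way to avoid the overflow.
A complementary guard, sketched below in userspace C, is to never ask
for more events per pass than the buffer can hold and to loop until
everything is drained. This is an assumption about how one could
structure it, not what the patch does: read_batch() and drain_all() are
hypothetical helpers, and EVENT_CAP simply mirrors the 128 figure above.

```c
#include <stddef.h>

#define EVENT_CAP 128 /* buffer capacity, mirroring the value discussed */

/* Hypothetical reader: copies at most cap pending events into out. */
static size_t read_batch(size_t *pending, int *out, size_t cap)
{
	size_t n = *pending < cap ? *pending : cap;
	size_t i;

	for (i = 0; i < n; i++)
		out[i] = 1; /* placeholder payload */
	*pending -= n;
	return n;
}

/* Drain everything in capacity-sized passes; returns the pass count. */
static int drain_all(size_t pending, size_t *largest_batch)
{
	int buf[EVENT_CAP];
	int passes = 0;

	*largest_batch = 0;
	while (pending > 0) {
		size_t n = read_batch(&pending, buf, EVENT_CAP);

		if (n > *largest_batch)
			*largest_batch = n;
		passes++;
	}
	return passes;
}
```

With 300 pending events, drain_all() takes three passes and no pass
writes more than EVENT_CAP entries into the buffer.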

Thanks,
Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-12 11:40                                   ` Liu Yuan
@ 2011-08-12 16:12                                     ` Badari Pulavarty
  2011-08-15  3:20                                       ` Liu Yuan
  0 siblings, 1 reply; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-12 16:12 UTC (permalink / raw)
  To: Liu Yuan; +Cc: kvm, Dongsu Park

[-- Attachment #1: Type: text/plain, Size: 3500 bytes --]

On 8/12/2011 4:40 AM, Liu Yuan wrote:
> On 08/12/2011 04:27 PM, Liu Yuan wrote:
>> On 08/12/2011 12:50 PM, Badari Pulavarty wrote:
>>> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>>>> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>>>>
>>>>>> It looks like the patch wouldn't work for testing multiple devices.
>>>>>>
>>>>>> vhost_blk_open() does
>>>>>> +       used_info_cachep = KMEM_CACHE(used_info, 
>>>>>> SLAB_HWCACHE_ALIGN |
>>>>>> SLAB_PANIC);
>>>>>>
>>>>>
>>>>> This is weird. how do you open multiple device?I just opened the 
>>>>> device with following command:
>>>>>
>>>>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>>>>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>>>>> file=~/data1.img,if=virtio,cache=none,aio=native
>>>>>
>>>>> And I didn't meet any problem.
>>>>>
>>>>> this would tell qemu to open three devices, and pass three FDs to 
>>>>> three instances of vhost_blk module.
>>>>> So KMEM_CACHE() is okay in vhost_blk_open().
>>>>>
>>>>
>>>> Oh, you are right. KMEM_CACHE() is in the wrong place. it is three 
>>>> instances vhost worker threads created. Hmmm, but I didn't meet any 
>>>> problem when opening it and running it. So strange. I'll go to 
>>>> figure it out.
>>>>
>>>>>> When opening second device, we get panic since used_info_cachep is
>>>>>> already created. Just to make progress I moved this call to
>>>>>> vhost_blk_init().
>>>>>>
>>>>>> I don't see any host panics now. With single block device (dd),
>>>>>> it seems to work fine. But when I start testing multiple block
>>>>>> devices I quickly run into hangs in the guest. I see following
>>>>>> messages in the guest from virtio_ring.c:
>>>>>>
>>>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>>>
>>>>>> Thanks,
>>>>>> Badari
>>>>>>
>>>>>>
>>>>>
>>>>> vq->data[] is initialized by guest virtio-blk driver and vhost_blk 
>>>>> is unware of it. it looks like used ID passed
>>>>> over by vhost_blk to guest virtio_blk is wrong, but, it should not 
>>>>> happen. :|
>>>>>
>>>>> And I can't reproduce this on my laptop. :(
>>>>>
>>> Finally, found the issue  :)
>>>
>>> Culprit is:
>>>
>>> +static struct io_event events[MAX_EVENTS];
>>>
>>> With multiple devices, multiple threads could be executing 
>>> handle_completion() (one for
>>> each fd) at the same time. "events" array is global :( Need to make 
>>> it one per device/fd.
>>>
>>> For test, I changed MAX_EVENTS to 32 and moved "events" array to be 
>>> local (stack)
>>> to handle_completion(). Tests are running fine.
>>>
>>> Your laptop must have single processor, hence you have only one 
>>> thread executing handle_completion()
>>> at any time..
>>>
>>> Thanks,
>>> Badari
>>>
>>>
>> Good catch, this is rather cool!....Yup, I develop it mostly in a 
>> nested KVM environment. and the L2 host  only runs single processor :(
>>
>> Thanks,
>> Yuan
> By the way, MAX_EVENTS should be 128, as much as guest virtio_blk 
> driver can batch-submit,
> causing array overflow.
> I have had turned on the debug, and had seen as much as over 100 
> requests batched from guest OS.
>

Hmm.. I am not sure why you see over 100 outstanding events per fd. Max
events could be as high as the number of outstanding IOs.

Anyway, instead of putting it on the stack, I kmalloc it now.

Dongsu Park, here is the complete patch.

Thanks
Badari



[-- Attachment #2: vhost-blk-kernel.patch --]
[-- Type: text/plain, Size: 21813 bytes --]

---
 drivers/vhost/Makefile |    3 
 drivers/vhost/blk.c    |  536 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h  |   11 +
 fs/aio.c               |   44 +---
 fs/eventfd.c           |    1 
 include/linux/aio.h    |   31 ++
 6 files changed, 599 insertions(+), 27 deletions(-)

Index: linux-3.0/drivers/vhost/Makefile
===================================================================
--- linux-3.0.orig/drivers/vhost/Makefile	2011-08-10 12:54:26.833639379 -0400
+++ linux-3.0/drivers/vhost/Makefile	2011-08-10 12:56:29.686641851 -0400
@@ -1,2 +1,5 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
+obj-m += vhost_blk.o
+
 vhost_net-y := vhost.o net.o
+vhost_blk-y := vhost.o blk.o
Index: linux-3.0/drivers/vhost/blk.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.0/drivers/vhost/blk.c	2011-08-12 11:56:41.041471082 -0400
@@ -0,0 +1,536 @@
+/* Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan <tailai.ly@taobao.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Vhost-blk driver is an in-kernel accelerator, intercepting the
+ * IO requests from KVM virtio-capable guests. It is based on the
+ * vhost infrastructure.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+#include <linux/eventfd.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/virtio_blk.h>
+#include <linux/file.h>
+#include <linux/mmu_context.h>
+#include <linux/kthread.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/blkdev.h>
+
+#include "vhost.h"
+
+#define DEBUG 0
+
+#if DEBUG > 0
+#define dprintk         printk
+#else
+#define dprintk(x...)   do { ; } while (0)
+#endif
+
+enum {
+	virtqueue_max = 1,
+};
+
+#define MAX_EVENTS 128
+
+struct vhost_blk {
+	struct vhost_virtqueue vq;
+	struct vhost_dev dev;
+	int should_stop;
+	struct kioctx *ioctx;
+	struct eventfd_ctx *ectx;
+	struct file *efile;
+	struct task_struct *worker;
+	struct vhost_poll poll;
+	struct io_event *events;
+};
+
+struct used_info {
+	void *status;
+	int head;
+	int len;
+};
+
+static struct kmem_cache *used_info_cachep;
+
+static void blk_flush(struct vhost_blk *blk)
+{
+	vhost_poll_flush(&blk->vq.poll);
+	vhost_poll_flush(&blk->poll);
+}
+
+static long blk_set_features(struct vhost_blk *blk, u64 features)
+{
+	blk->dev.acked_features = features;
+	return 0;
+}
+
+static void blk_stop(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *f;
+
+	mutex_lock(&vq->mutex);
+	f = rcu_dereference_protected(vq->private_data,
+			lockdep_is_held(&vq->mutex));
+	rcu_assign_pointer(vq->private_data, NULL);
+	mutex_unlock(&vq->mutex);
+
+	if (f)
+		fput(f);
+}
+
+static long blk_set_backend(struct vhost_blk *blk, struct vhost_vring_file *backend)
+{
+	int idx = backend->index;
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *file, *oldfile;
+	int ret;
+
+	mutex_lock(&blk->dev.mutex);
+	ret = vhost_dev_check_owner(&blk->dev);
+	if (ret)
+		goto err_dev;
+	if (idx >= virtqueue_max) {
+		ret = -ENOBUFS;
+		goto err_dev;
+	}
+
+	mutex_lock(&vq->mutex);
+
+	if (!vhost_vq_access_ok(vq)) {
+		ret = -EFAULT;
+		goto err_vq;
+	}
+
+	file = fget(backend->fd);
+	if (!file) {	/* fget() returns NULL on failure, not an ERR_PTR */
+		ret = -EBADF;
+		goto err_vq;
+	}
+
+	oldfile = rcu_dereference_protected(vq->private_data,
+			lockdep_is_held(&vq->mutex));
+	if (file != oldfile)
+		rcu_assign_pointer(vq->private_data, file);
+
+	mutex_unlock(&vq->mutex);
+
+	if (oldfile) {
+		blk_flush(blk);
+		fput(oldfile);
+	}
+
+	mutex_unlock(&blk->dev.mutex);
+	return 0;
+err_vq:
+	mutex_unlock(&vq->mutex);
+err_dev:
+	mutex_unlock(&blk->dev.mutex);
+	return ret;
+}
+
+static long blk_reset_owner(struct vhost_blk *b)
+{
+	int ret;
+
+	mutex_lock(&b->dev.mutex);
+	ret = vhost_dev_check_owner(&b->dev);
+	if (ret)
+		goto err;
+	blk_stop(b);
+	blk_flush(b);
+	ret = vhost_dev_reset_owner(&b->dev);
+err:
+	mutex_unlock(&b->dev.mutex);
+	return ret;
+}
+
+static int kernel_io_setup(unsigned nr_events, struct kioctx **ioctx)
+{
+	int ret = 0;
+	*ioctx = ioctx_alloc(nr_events);
+	if (IS_ERR(*ioctx))	/* test the returned context, not the out-pointer */
+		ret = PTR_ERR(*ioctx);
+	return ret;
+}
+
+static inline int kernel_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event *event,
+		struct timespec *ts)
+{
+	mm_segment_t old_fs;
+	int ret;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	ret = read_events(ctx, min_nr, nr, event, ts);
+	set_fs(old_fs);
+
+	return ret;
+}
+
+static inline ssize_t io_event_ret(struct io_event *ev)
+{
+	return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
+}
+
+static inline void aio_prep_req(struct kiocb *iocb, struct eventfd_ctx *ectx, struct file *file,
+		struct iovec *iov, int nvecs, u64 offset, int opcode, struct used_info *ui)
+{
+	iocb->ki_filp = file;
+	iocb->ki_eventfd = ectx;
+	iocb->ki_pos = offset;
+	iocb->ki_buf = (void *)iov;
+	iocb->ki_left = iocb->ki_nbytes = nvecs;
+	iocb->ki_opcode = opcode;
+	iocb->ki_obj.user = ui;
+}
+
+static inline int kernel_io_submit(struct vhost_blk *blk, struct iovec *iov, u64 nvecs, loff_t pos, int opcode, int head, int len)
+{
+	int ret = -EAGAIN;
+	struct kiocb *req;
+	struct kioctx *ioctx = blk->ioctx;
+	struct used_info *ui = kmem_cache_zalloc(used_info_cachep, GFP_KERNEL);
+	struct file *f = blk->vq.private_data;
+
+	try_get_ioctx(ioctx);
+	atomic_long_inc_not_zero(&f->f_count);
+	eventfd_ctx_get(blk->ectx);
+
+
+	req = aio_get_req(ioctx); /* returns with 2 refs held on req */
+	if (unlikely(!req))
+		goto out;
+
+	ui->head = head;
+	ui->status = blk->vq.iov[nvecs + 1].iov_base;
+	BUG_ON(blk->vq.iov[nvecs + 1].iov_len != 1);
+	ui->len = len;
+	aio_prep_req(req, blk->ectx, f, iov, nvecs, pos, opcode, ui);
+
+	ret = aio_setup_iocb(req, 0);
+	if (unlikely(ret))
+		goto out_put_req;
+
+	spin_lock_irq(&ioctx->ctx_lock);
+	if (unlikely(ioctx->dead)) {
+		spin_unlock_irq(&ioctx->ctx_lock);
+		ret = -EINVAL;
+		goto out_put_req;
+	}
+
+	aio_run_iocb(req);
+	if (!list_empty(&ioctx->run_list)) {
+		while (__aio_run_iocbs(ioctx))
+			;
+	}
+	spin_unlock_irq(&ioctx->ctx_lock);
+
+	aio_put_req(req);
+	put_ioctx(blk->ioctx);
+
+	return ret;
+
+out_put_req:
+	aio_put_req(req);
+	aio_put_req(req);
+out:
+	put_ioctx(blk->ioctx);
+	return ret;
+}
+
+static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
+		unsigned long arg)
+{
+	struct vhost_blk *blk = f->private_data;
+	struct vhost_vring_file backend;
+	u64 features = VHOST_BLK_FEATURES;
+	int ret = -EFAULT;
+
+	switch (ioctl) {
+		case VHOST_NET_SET_BACKEND:
+			if (copy_from_user(&backend, (void __user *)arg, sizeof backend))
+				break;
+			ret = blk_set_backend(blk, &backend);
+			break;
+		case VHOST_GET_FEATURES:
+			features = VHOST_BLK_FEATURES;
+			if (copy_to_user((void __user *)arg , &features, sizeof features))
+				break;
+			ret = 0;
+			break;
+		case VHOST_SET_FEATURES:
+			if (copy_from_user(&features, (void __user *)arg, sizeof features))
+				break;
+			if (features & ~VHOST_BLK_FEATURES) {
+				ret = -EOPNOTSUPP;
+				break;
+			}
+			ret = blk_set_features(blk, features);
+			break;
+		case VHOST_RESET_OWNER:
+			ret = blk_reset_owner(blk);
+			break;
+		default:
+			mutex_lock(&blk->dev.mutex);
+			ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
+			blk_flush(blk);
+			mutex_unlock(&blk->dev.mutex);
+			break;
+	}
+	return ret;
+}
+
+#define BLK_HDR 0
+#define BLK_HDR_LEN 16
+
+static inline int do_request(struct vhost_virtqueue *vq, struct virtio_blk_outhdr *hdr,
+		u64 nr_vecs, int head)
+{
+	struct file *f = vq->private_data;
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	struct iovec *iov = &vq->iov[BLK_HDR + 1];
+	loff_t pos = hdr->sector << 9;
+	int ret = 0, len = 0, status;
+	//	int i;
+
+	dprintk("sector %llu, num %lu, type %d\n", hdr->sector, iov->iov_len / 512, hdr->type);
+	//Guest virtio-blk driver doesn't use len currently.
+	//for (i = 0; i < nr_vecs; i++) {
+	//	len += iov[i].iov_len;
+	//}
+	switch (hdr->type) {
+		case VIRTIO_BLK_T_OUT:
+			kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PWRITEV, head, len);
+			break;
+		case VIRTIO_BLK_T_IN:
+			kernel_io_submit(blk, iov, nr_vecs, pos, IOCB_CMD_PREADV, head, len);
+			break;
+		case VIRTIO_BLK_T_FLUSH:
+			ret = vfs_fsync(f, 1);
+			/* fall through */
+		case VIRTIO_BLK_T_GET_ID:
+			status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+			if ((vq->iov[nr_vecs + 1].iov_len != 1))
+				BUG();
+
+			if (copy_to_user(vq->iov[nr_vecs + 1].iov_base, &status, sizeof status)) {
+				vq_err(vq, "%s failed to write status!\n", __func__);
+				vhost_discard_vq_desc(vq, 1);
+				ret = -EFAULT;
+				break;
+			}
+
+			vhost_add_used_and_signal(&blk->dev, vq, head, ret);
+			break;
+		default:
+			pr_info("%s, unsupported request type %d\n", __func__, hdr->type);
+			vhost_discard_vq_desc(vq, 1);
+			ret = -EFAULT;
+			break;
+	}
+	return ret;
+}
+
+static inline void handle_kick(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct virtio_blk_outhdr hdr;
+	u64 nr_vecs;
+	int in, out, head;
+	struct blk_plug plug;
+
+	mutex_lock(&vq->mutex);
+	vhost_disable_notify(&blk->dev, vq);
+
+	blk_start_plug(&plug);
+	for (;;) {
+		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+				ARRAY_SIZE(vq->iov),
+				&out, &in, NULL, NULL);
+		/* No available descriptors from Guest? */
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
+				vhost_disable_notify(&blk->dev, vq);
+				continue;
+			}
+			break;
+		}
+		if (unlikely(head < 0))
+			break;
+
+		dprintk("head %d, in %d, out %d\n", head, in, out);
+		if (unlikely(vq->iov[BLK_HDR].iov_len != BLK_HDR_LEN)) {
+			vq_err(vq, "%s bad block header length!\n", __func__);
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (copy_from_user(&hdr, vq->iov[BLK_HDR].iov_base, sizeof hdr)) {
+			vq_err(vq, "%s failed to get block header!\n", __func__);
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (hdr.type == VIRTIO_BLK_T_IN || hdr.type == VIRTIO_BLK_T_GET_ID)
+			nr_vecs = in - 1;
+		else
+			nr_vecs = out - 1;
+
+		if (do_request(vq, &hdr, nr_vecs, head) < 0)
+			break;
+	}
+	blk_finish_plug(&plug);
+	mutex_unlock(&vq->mutex);
+}
+
+static void handle_guest_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, poll.work);
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	handle_kick(blk);
+}
+
+static void handle_completion(struct vhost_work* work)
+{
+	struct vhost_blk *blk = container_of(work, struct vhost_blk, poll.work);
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct timespec ts = { 0 };
+	int ret, i, nr;
+	u64 count;
+
+	mutex_lock(&vq->mutex);
+	do {
+		ret = eventfd_ctx_read(blk->ectx, 1, &count);
+	} while (unlikely(ret == -ERESTARTSYS));
+
+	do {
+		nr = kernel_read_events(blk->ioctx, count, MAX_EVENTS, blk->events, &ts);
+	} while (unlikely(nr == -EINTR));
+	dprintk("%s, count %llu, nr %d\n", __func__, count, nr);
+
+	if (unlikely(nr <= 0)) {
+		mutex_unlock(&vq->mutex);
+		return;
+	}
+
+	for (i = 0; i < nr; i++) {
+		struct used_info *u = (struct used_info *)blk->events[i].obj;
+		int len, status;
+
+		dprintk("%s, head %d complete in %d\n", __func__, u->head, i);
+		len = io_event_ret(&blk->events[i]);
+		//status = u->len == len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		status = len > 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		if (copy_to_user(u->status, &status, sizeof status)) {
+			vq_err(&blk->vq, "%s failed to write status\n", __func__);
+			BUG(); /* FIXME: maybe a bit radical? */
+		}
+		vhost_add_used(&blk->vq, u->head, u->len);
+		kmem_cache_free(used_info_cachep, u);
+	}
+
+	vhost_signal(&blk->dev, &blk->vq);
+	mutex_unlock(&vq->mutex);
+}
+
+static void eventfd_setup(struct vhost_blk *blk)
+{
+	blk->efile = eventfd_file_create(0, 0);
+	blk->ectx = eventfd_ctx_fileget(blk->efile);
+	vhost_poll_init(&blk->poll, handle_completion, POLLIN, &blk->dev);
+	vhost_poll_start(&blk->poll, blk->efile);
+}
+
+static int vhost_blk_open(struct inode *inode, struct file *f)
+{
+	int ret = -ENOMEM;
+	struct vhost_blk *blk = kmalloc(sizeof *blk, GFP_KERNEL);
+	if (!blk)
+		goto err;
+
+	blk->vq.handle_kick = handle_guest_kick;
+	ret = vhost_dev_init(&blk->dev, &blk->vq, virtqueue_max);
+	if (ret < 0)
+		goto err_init;
+
+	ret = kernel_io_setup(MAX_EVENTS, &blk->ioctx);
+	if (ret < 0)
+		goto err_io_setup;
+
+	blk->events = kmalloc(MAX_EVENTS * sizeof(struct io_event), GFP_KERNEL);
+	if (blk->events == NULL)
+		goto err_io_setup;
+
+	eventfd_setup(blk);
+	f->private_data = blk;
+	return ret;
+err_init:
+err_io_setup:
+	kfree(blk);
+err:
+	return ret;
+}
+
+static void eventfd_destroy(struct vhost_blk *blk)
+{
+	eventfd_ctx_put(blk->ectx);
+	fput(blk->efile);
+}
+
+static int vhost_blk_release(struct inode *inode, struct file *f)
+{
+	struct vhost_blk *blk = f->private_data;
+
+	blk_stop(blk);
+	blk_flush(blk);
+	vhost_dev_cleanup(&blk->dev);
+	/* Yet another flush? See comments in vhost_net_release() */
+	blk_flush(blk);
+	eventfd_destroy(blk);
+	kfree(blk->events);
+	kfree(blk);
+
+	return 0;
+}
+
+static const struct file_operations vhost_blk_fops = {
+	.owner          = THIS_MODULE,
+	.release        = vhost_blk_release,
+	.open           = vhost_blk_open,
+	.unlocked_ioctl = vhost_blk_ioctl,
+	.llseek		= noop_llseek,
+};
+
+
+static struct miscdevice vhost_blk_misc = {
+	234,
+	"vhost-blk",
+	&vhost_blk_fops,
+};
+
+int vhost_blk_init(void)
+{
+	used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+	return misc_register(&vhost_blk_misc);
+}
+void vhost_blk_exit(void)
+{
+	misc_deregister(&vhost_blk_misc);
+	kmem_cache_destroy(used_info_cachep);
+}
+
+module_init(vhost_blk_init);
+module_exit(vhost_blk_exit);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Liu Yuan");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
Index: linux-3.0/drivers/vhost/vhost.h
===================================================================
--- linux-3.0.orig/drivers/vhost/vhost.h	2011-08-10 12:54:26.828639379 -0400
+++ linux-3.0/drivers/vhost/vhost.h	2011-08-10 12:56:29.703641851 -0400
@@ -12,6 +12,7 @@
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <asm/atomic.h>
+#include <linux/virtio_blk.h>
 
 struct vhost_device;
 
@@ -174,6 +175,16 @@ enum {
 			 (1ULL << VHOST_F_LOG_ALL) |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+
+	VHOST_BLK_FEATURES =	(1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
+				(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+				(1ULL << VIRTIO_RING_F_EVENT_IDX) |
+				(1ULL << VIRTIO_BLK_F_SEG_MAX) |
+				(1ULL << VIRTIO_BLK_F_GEOMETRY) |
+				(1ULL << VIRTIO_BLK_F_TOPOLOGY) |
+				(1ULL << VIRTIO_BLK_F_SCSI) |
+				(1ULL << VIRTIO_BLK_F_BLK_SIZE),
+
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
Index: linux-3.0/fs/aio.c
===================================================================
--- linux-3.0.orig/fs/aio.c	2011-08-10 12:54:26.847639379 -0400
+++ linux-3.0/fs/aio.c	2011-08-10 12:56:29.709641851 -0400
@@ -215,7 +215,7 @@ static void ctx_rcu_free(struct rcu_head
  *	Called when the last user of an aio context has gone away,
  *	and the struct needs to be freed.
  */
-static void __put_ioctx(struct kioctx *ctx)
+void __put_ioctx(struct kioctx *ctx)
 {
 	BUG_ON(ctx->reqs_active);
 
@@ -227,29 +227,12 @@ static void __put_ioctx(struct kioctx *c
 	pr_debug("__put_ioctx: freeing %p\n", ctx);
 	call_rcu(&ctx->rcu_head, ctx_rcu_free);
 }
-
-static inline void get_ioctx(struct kioctx *kioctx)
-{
-	BUG_ON(atomic_read(&kioctx->users) <= 0);
-	atomic_inc(&kioctx->users);
-}
-
-static inline int try_get_ioctx(struct kioctx *kioctx)
-{
-	return atomic_inc_not_zero(&kioctx->users);
-}
-
-static inline void put_ioctx(struct kioctx *kioctx)
-{
-	BUG_ON(atomic_read(&kioctx->users) <= 0);
-	if (unlikely(atomic_dec_and_test(&kioctx->users)))
-		__put_ioctx(kioctx);
-}
+EXPORT_SYMBOL(__put_ioctx);
 
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+struct kioctx *ioctx_alloc(unsigned nr_events)
 {
 	struct mm_struct *mm;
 	struct kioctx *ctx;
@@ -327,6 +310,7 @@ out_freectx:
 	dprintk("aio: error allocating ioctx %p\n", ctx);
 	return ctx;
 }
+EXPORT_SYMBOL(ioctx_alloc);
 
 /* aio_cancel_all
  *	Cancels all outstanding aio requests on an aio context.  Used 
@@ -437,7 +421,7 @@ void exit_aio(struct mm_struct *mm)
  * This prevents races between the aio code path referencing the
  * req (after submitting it) and aio_complete() freeing the req.
  */
-static struct kiocb *__aio_get_req(struct kioctx *ctx)
+struct kiocb *__aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req = NULL;
 	struct aio_ring *ring;
@@ -480,7 +464,7 @@ static struct kiocb *__aio_get_req(struc
 	return req;
 }
 
-static inline struct kiocb *aio_get_req(struct kioctx *ctx)
+struct kiocb *aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req;
 	/* Handle a potential starvation case -- should be exceedingly rare as 
@@ -494,6 +478,7 @@ static inline struct kiocb *aio_get_req(
 	}
 	return req;
 }
+EXPORT_SYMBOL(aio_get_req);
 
 static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
 {
@@ -659,7 +644,7 @@ static inline int __queue_kicked_iocb(st
  * simplifies the coding of individual aio operations as
  * it avoids various potential races.
  */
-static ssize_t aio_run_iocb(struct kiocb *iocb)
+ssize_t aio_run_iocb(struct kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
 	ssize_t (*retry)(struct kiocb *);
@@ -753,6 +738,7 @@ out:
 	}
 	return ret;
 }
+EXPORT_SYMBOL(aio_run_iocb);
 
 /*
  * __aio_run_iocbs:
@@ -761,7 +747,7 @@ out:
  * Assumes it is operating within the aio issuer's mm
  * context.
  */
-static int __aio_run_iocbs(struct kioctx *ctx)
+int __aio_run_iocbs(struct kioctx *ctx)
 {
 	struct kiocb *iocb;
 	struct list_head run_list;
@@ -784,6 +770,7 @@ static int __aio_run_iocbs(struct kioctx
 		return 1;
 	return 0;
 }
+EXPORT_SYMBOL(__aio_run_iocbs);
 
 static void aio_queue_work(struct kioctx * ctx)
 {
@@ -1074,7 +1061,7 @@ static inline void clear_timeout(struct 
 	del_singleshot_timer_sync(&to->timer);
 }
 
-static int read_events(struct kioctx *ctx,
+int read_events(struct kioctx *ctx,
 			long min_nr, long nr,
 			struct io_event __user *event,
 			struct timespec __user *timeout)
@@ -1190,11 +1177,12 @@ out:
 	destroy_timer_on_stack(&to.timer);
 	return i ? i : ret;
 }
+EXPORT_SYMBOL(read_events);
 
 /* Take an ioctx and remove it from the list of ioctx's.  Protects 
  * against races with itself via ->dead.
  */
-static void io_destroy(struct kioctx *ioctx)
+void io_destroy(struct kioctx *ioctx)
 {
 	struct mm_struct *mm = current->mm;
 	int was_dead;
@@ -1221,6 +1209,7 @@ static void io_destroy(struct kioctx *io
 	wake_up_all(&ioctx->wait);
 	put_ioctx(ioctx);	/* once for the lookup */
 }
+EXPORT_SYMBOL(io_destroy);
 
 /* sys_io_setup:
  *	Create an aio_context capable of receiving at least nr_events.
@@ -1423,7 +1412,7 @@ static ssize_t aio_setup_single_vector(s
  *	Performs the initial checks and aio retry method
  *	setup for the kiocb at the time of io submission.
  */
-static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
+ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 {
 	struct file *file = kiocb->ki_filp;
 	ssize_t ret = 0;
@@ -1513,6 +1502,7 @@ static ssize_t aio_setup_iocb(struct kio
 
 	return 0;
 }
+EXPORT_SYMBOL(aio_setup_iocb);
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
Index: linux-3.0/fs/eventfd.c
===================================================================
--- linux-3.0.orig/fs/eventfd.c	2011-08-10 12:54:26.847639379 -0400
+++ linux-3.0/fs/eventfd.c	2011-08-10 12:56:29.715641851 -0400
@@ -406,6 +406,7 @@ struct file *eventfd_file_create(unsigne
 
 	return file;
 }
+EXPORT_SYMBOL_GPL(eventfd_file_create);
 
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
Index: linux-3.0/include/linux/aio.h
===================================================================
--- linux-3.0.orig/include/linux/aio.h	2011-08-10 12:54:26.839639379 -0400
+++ linux-3.0/include/linux/aio.h	2011-08-10 12:56:29.744641852 -0400
@@ -214,6 +214,37 @@ struct mm_struct;
 extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
+extern void __put_ioctx(struct kioctx *ctx);
+extern struct kioctx *ioctx_alloc(unsigned nr_events);
+extern struct kiocb *aio_get_req(struct kioctx *ctx);
+extern ssize_t aio_run_iocb(struct kiocb *iocb);
+extern int __aio_run_iocbs(struct kioctx *ctx);
+extern int read_events(struct kioctx *ctx,
+                        long min_nr, long nr,
+                        struct io_event __user *event,
+                        struct timespec __user *timeout);
+extern void io_destroy(struct kioctx *ioctx);
+extern ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat);
+extern void __put_ioctx(struct kioctx *ctx);
+
+static inline void get_ioctx(struct kioctx *kioctx)
+{
+        BUG_ON(atomic_read(&kioctx->users) <= 0);
+        atomic_inc(&kioctx->users);
+}
+
+static inline int try_get_ioctx(struct kioctx *kioctx)
+{
+        return atomic_inc_not_zero(&kioctx->users);
+}
+
+static inline void put_ioctx(struct kioctx *kioctx)
+{
+        BUG_ON(atomic_read(&kioctx->users) <= 0);
+        if (unlikely(atomic_dec_and_test(&kioctx->users)))
+                __put_ioctx(kioctx);
+}
+
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline int aio_put_req(struct kiocb *iocb) { return 0; }

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-12 16:12                                     ` Badari Pulavarty
@ 2011-08-15  3:20                                       ` Liu Yuan
  2011-08-15  4:17                                         ` Badari Pulavarty
  0 siblings, 1 reply; 54+ messages in thread
From: Liu Yuan @ 2011-08-15  3:20 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: kvm, Dongsu Park

On 08/13/2011 12:12 AM, Badari Pulavarty wrote:
> On 8/12/2011 4:40 AM, Liu Yuan wrote:
>> On 08/12/2011 04:27 PM, Liu Yuan wrote:
>>> On 08/12/2011 12:50 PM, Badari Pulavarty wrote:
>>>> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>>>>> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>>>>>
>>>>>>> It looks like the patch wouldn't work for testing multiple devices.
>>>>>>>
>>>>>>> vhost_blk_open() does
>>>>>>> +       used_info_cachep = KMEM_CACHE(used_info, 
>>>>>>> SLAB_HWCACHE_ALIGN |
>>>>>>> SLAB_PANIC);
>>>>>>>
>>>>>>
>>>>>> This is weird. how do you open multiple device?I just opened the 
>>>>>> device with following command:
>>>>>>
>>>>>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>>>>>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>>>>>> file=~/data1.img,if=virtio,cache=none,aio=native
>>>>>>
>>>>>> And I didn't meet any problem.
>>>>>>
>>>>>> this would tell qemu to open three devices, and pass three FDs to 
>>>>>> three instances of vhost_blk module.
>>>>>> So KMEM_CACHE() is okay in vhost_blk_open().
>>>>>>
>>>>>
>>>>> Oh, you are right. KMEM_CACHE() is in the wrong place. it is three 
>>>>> instances vhost worker threads created. Hmmm, but I didn't meet 
>>>>> any problem when opening it and running it. So strange. I'll go to 
>>>>> figure it out.
>>>>>
>>>>>>> When opening second device, we get panic since used_info_cachep is
>>>>>>> already created. Just to make progress I moved this call to
>>>>>>> vhost_blk_init().
>>>>>>>
>>>>>>> I don't see any host panics now. With single block device (dd),
>>>>>>> it seems to work fine. But when I start testing multiple block
>>>>>>> devices I quickly run into hangs in the guest. I see following
>>>>>>> messages in the guest from virtio_ring.c:
>>>>>>>
>>>>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Badari
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> vq->data[] is initialized by guest virtio-blk driver and 
>>>>>> vhost_blk is unaware of it. it looks like used ID passed
>>>>>> over by vhost_blk to guest virtio_blk is wrong, but, it should 
>>>>>> not happen. :|
>>>>>>
>>>>>> And I can't reproduce this on my laptop. :(
>>>>>>
>>>> Finally, found the issue  :)
>>>>
>>>> Culprit is:
>>>>
>>>> +static struct io_event events[MAX_EVENTS];
>>>>
>>>> With multiple devices, multiple threads could be executing 
>>>> handle_completion() (one for
>>>> each fd) at the same time. "events" array is global :( Need to make 
>>>> it one per device/fd.
>>>>
>>>> For test, I changed MAX_EVENTS to 32 and moved "events" array to be 
>>>> local (stack)
>>>> to handle_completion(). Tests are running fine.
>>>>
>>>> Your laptop must have single processor, hence you have only one 
>>>> thread executing handle_completion()
>>>> at any time..
>>>>
>>>> Thanks,
>>>> Badari
>>>>
>>>>
>>> Good catch, this is rather cool!....Yup, I develop it mostly in a 
>>> nested KVM environment. and the L2 host  only runs single processor :(
>>>
>>> Thanks,
>>> Yuan
>> By the way, MAX_EVENTS should be 128, as much as guest virtio_blk 
>> driver can batch-submit,
>> causing array overflow.
>> I have had turned on the debug, and had seen as much as over 100 
>> requests batched from guest OS.
>>
>
> Hmm.. I am not sure why you see over 100 outstanding events per fd.  
> Max events could be as high as
> number of outstanding IOs.
>
> Anyway, instead of putting it on stack, I kmalloced it now.
>
> Dongsu Park, Here is the complete patch.
>
> Thanks
> Badari
>
>
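The race described in the quoted discussion above can be sketched in user space. This is a minimal illustration, not the patch's code: it assumes simplified stand-ins for the kernel structures, and the names `demo_blk`, `demo_event`, `demo_blk_init`, `demo_blk_destroy`, and `DEMO_MAX_EVENTS` are hypothetical. The point is only that each device owns its completion buffer, so completion handlers running concurrently for different devices never write into the same array.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_EVENTS 128	/* hypothetical; mirrors the value suggested in the thread */

/* Hypothetical stand-in for struct io_event: only the field this sketch needs. */
struct demo_event {
	int head;
};

/* Hypothetical stand-in for struct vhost_blk. */
struct demo_blk {
	struct demo_event *events;	/* per-device buffer, not a file-scope global */
};

/* Allocate the buffer when the device is opened; returns 0 on success, -1 on failure. */
static int demo_blk_init(struct demo_blk *blk)
{
	assert(blk != NULL);
	blk->events = calloc(DEMO_MAX_EVENTS, sizeof(*blk->events));
	return blk->events ? 0 : -1;
}

/* Release the buffer when the device is closed. */
static void demo_blk_destroy(struct demo_blk *blk)
{
	free(blk->events);
	blk->events = NULL;
}
```

With a single shared array, a head index written on behalf of one device could be read back by another device's handler, which would be consistent with the "id N is not a head" messages reported from the guest.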
On a physical machine, the block device driver posts a queue depth that
limits the number of pending requests; normally it is 31. But the virtio
driver doesn't post one in the guest OS, so nothing prevents the guest
from batch-submitting more than 31 requests.

I have seen over 100 pending requests during guest OS initialization,
and it is reproducible.
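Since the guest can have more requests pending than a fixed-size event buffer holds, one way to stay within the buffer is to drain completions in bounded batches. The sketch below is a user-space illustration of that alternative, not what the patch does (the thread settles on enlarging MAX_EVENTS instead). The names `demo_queue`, `demo_read_events`, and `demo_drain` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical queue of completed request ids waiting to be collected. */
struct demo_queue {
	const int *ids;		/* backing array of completed ids */
	size_t pending;		/* how many ids are still uncollected */
	size_t next;		/* index of the next id to hand out */
};

/* Copy at most `cap` pending ids into `out`; returns the number copied. */
static size_t demo_read_events(struct demo_queue *q, int *out, size_t cap)
{
	size_t n = q->pending < cap ? q->pending : cap;
	size_t i;

	for (i = 0; i < n; i++)
		out[i] = q->ids[q->next + i];
	q->next += n;
	q->pending -= n;
	return n;
}

/* Drain the queue in batches no larger than the buffer; returns the total handled. */
static size_t demo_drain(struct demo_queue *q, int *buf, size_t cap)
{
	size_t total = 0, n;

	while ((n = demo_read_events(q, buf, cap)) > 0)
		total += n;
	return total;
}
```

Clamping each read to the buffer capacity means a burst of 100 or more completions cannot overflow the array, at the cost of extra loop iterations per wakeup.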

BTW, what are the perf numbers for vhost-blk in your environment?

Thanks,
Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-15  3:20                                       ` Liu Yuan
@ 2011-08-15  4:17                                         ` Badari Pulavarty
  2011-08-16  5:44                                           ` Liu Yuan
  2011-09-07 13:36                                           ` Liu Yuan
  0 siblings, 2 replies; 54+ messages in thread
From: Badari Pulavarty @ 2011-08-15  4:17 UTC (permalink / raw)
  To: Liu Yuan; +Cc: kvm, Dongsu Park

On 8/14/2011 8:20 PM, Liu Yuan wrote:
> On 08/13/2011 12:12 AM, Badari Pulavarty wrote:
>> On 8/12/2011 4:40 AM, Liu Yuan wrote:
>>> On 08/12/2011 04:27 PM, Liu Yuan wrote:
>>>> On 08/12/2011 12:50 PM, Badari Pulavarty wrote:
>>>>> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>>>>>> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>>>>>>
>>>>>>>> It looks like the patch wouldn't work for testing multiple 
>>>>>>>> devices.
>>>>>>>>
>>>>>>>> vhost_blk_open() does
>>>>>>>> +       used_info_cachep = KMEM_CACHE(used_info, 
>>>>>>>> SLAB_HWCACHE_ALIGN |
>>>>>>>> SLAB_PANIC);
>>>>>>>>
>>>>>>>
>>>>>>> This is weird. how do you open multiple device?I just opened the 
>>>>>>> device with following command:
>>>>>>>
>>>>>>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>>>>>>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>>>>>>> file=~/data1.img,if=virtio,cache=none,aio=native
>>>>>>>
>>>>>>> And I didn't meet any problem.
>>>>>>>
>>>>>>> this would tell qemu to open three devices, and pass three FDs 
>>>>>>> to three instances of vhost_blk module.
>>>>>>> So KMEM_CACHE() is okay in vhost_blk_open().
>>>>>>>
>>>>>>
>>>>>> Oh, you are right. KMEM_CACHE() is in the wrong place. it is 
>>>>>> three instances vhost worker threads created. Hmmm, but I didn't 
>>>>>> meet any problem when opening it and running it. So strange. I'll 
>>>>>> go to figure it out.
>>>>>>
>>>>>>>> When opening second device, we get panic since used_info_cachep is
>>>>>>>> already created. Just to make progress I moved this call to
>>>>>>>> vhost_blk_init().
>>>>>>>>
>>>>>>>> I don't see any host panics now. With single block device (dd),
>>>>>>>> it seems to work fine. But when I start testing multiple block
>>>>>>>> devices I quickly run into hangs in the guest. I see following
>>>>>>>> messages in the guest from virtio_ring.c:
>>>>>>>>
>>>>>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>>>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>>>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>>>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Badari
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> vq->data[] is initialized by guest virtio-blk driver and 
>>>>>>> vhost_blk is unaware of it. it looks like used ID passed
>>>>>>> over by vhost_blk to guest virtio_blk is wrong, but, it should 
>>>>>>> not happen. :|
>>>>>>>
>>>>>>> And I can't reproduce this on my laptop. :(
>>>>>>>
>>>>> Finally, found the issue  :)
>>>>>
>>>>> Culprit is:
>>>>>
>>>>> +static struct io_event events[MAX_EVENTS];
>>>>>
>>>>> With multiple devices, multiple threads could be executing 
>>>>> handle_completion() (one for
>>>>> each fd) at the same time. "events" array is global :( Need to 
>>>>> make it one per device/fd.
>>>>>
>>>>> For test, I changed MAX_EVENTS to 32 and moved "events" array to 
>>>>> be local (stack)
>>>>> to handle_completion(). Tests are running fine.
>>>>>
>>>>> Your laptop must have single processor, hence you have only one 
>>>>> thread executing handle_completion()
>>>>> at any time..
>>>>>
>>>>> Thanks,
>>>>> Badari
>>>>>
>>>>>
>>>> Good catch, this is rather cool!....Yup, I develop it mostly in a 
>>>> nested KVM environment. and the L2 host  only runs single processor :(
>>>>
>>>> Thanks,
>>>> Yuan
>>> By the way, MAX_EVENTS should be 128, as much as guest virtio_blk 
>>> driver can batch-submit,
>>> causing array overflow.
>>> I have had turned on the debug, and had seen as much as over 100 
>>> requests batched from guest OS.
>>>
>>
>> Hmm.. I am not sure why you see over 100 outstanding events per fd.  
>> Max events could be as high as
>> number of outstanding IOs.
>>
>> Anyway, instead of putting it on stack, I kmalloced it now.
>>
>> Dongsu Park, Here is the complete patch.
>>
>> Thanks
>> Badari
>>
>>
> In the physical machine, there is a queue depth posted by block device 
> driver to limit the
> pending requests number, normally it is 31. But virtio driver doesn't 
> post it in the guest OS.
> So nothing prevents OS batch-submitting requests more than 31.
>
> I have noticed over 100 pending requests during guest OS initialization 
> and it is reproducible.
>
> BTW, how is perf number for vhost-blk in your environment?

Right now I am doing "dd" tests to test out the functionality and stability.

I plan to collect FFSB benchmark results across 6-virtio-blk/vhost-blk 
disks with
all profiles - seq read, seq write, random read, random write with 
blocksizes varying
from 4k to 1MB.

I will start the test tomorrow. It will take a few days to run through all
the scenarios.
I don't have an easy way to collect host CPU consumption, but for now let's
focus on throughput and latency. I will share the results in a few days.

Thanks
Badari



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-15  4:17                                         ` Badari Pulavarty
@ 2011-08-16  5:44                                           ` Liu Yuan
  2011-09-07 13:36                                           ` Liu Yuan
  1 sibling, 0 replies; 54+ messages in thread
From: Liu Yuan @ 2011-08-16  5:44 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: kvm, Dongsu Park

On 08/15/2011 12:17 PM, Badari Pulavarty wrote:
> On 8/14/2011 8:20 PM, Liu Yuan wrote:
>> On 08/13/2011 12:12 AM, Badari Pulavarty wrote:
>>> On 8/12/2011 4:40 AM, Liu Yuan wrote:
>>>> On 08/12/2011 04:27 PM, Liu Yuan wrote:
>>>>> On 08/12/2011 12:50 PM, Badari Pulavarty wrote:
>>>>>> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>>>>>>> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>>>>>>>
>>>>>>>>> It looks like the patch wouldn't work for testing multiple 
>>>>>>>>> devices.
>>>>>>>>>
>>>>>>>>> vhost_blk_open() does
>>>>>>>>> +       used_info_cachep = KMEM_CACHE(used_info, 
>>>>>>>>> SLAB_HWCACHE_ALIGN |
>>>>>>>>> SLAB_PANIC);
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is weird. how do you open multiple device?I just opened 
>>>>>>>> the device with following command:
>>>>>>>>
>>>>>>>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>>>>>>>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>>>>>>>> file=~/data1.img,if=virtio,cache=none,aio=native
>>>>>>>>
>>>>>>>> And I didn't meet any problem.
>>>>>>>>
>>>>>>>> this would tell qemu to open three devices, and pass three FDs 
>>>>>>>> to three instances of vhost_blk module.
>>>>>>>> So KMEM_CACHE() is okay in vhost_blk_open().
>>>>>>>>
>>>>>>>
>>>>>>> Oh, you are right. KMEM_CACHE() is in the wrong place. it is 
>>>>>>> three instances vhost worker threads created. Hmmm, but I didn't 
>>>>>>> meet any problem when opening it and running it. So strange. 
>>>>>>> I'll go to figure it out.
>>>>>>>
>>>>>>>>> When opening second device, we get panic since 
>>>>>>>>> used_info_cachep is
>>>>>>>>> already created. Just to make progress I moved this call to
>>>>>>>>> vhost_blk_init().
>>>>>>>>>
>>>>>>>>> I don't see any host panics now. With single block device (dd),
>>>>>>>>> it seems to work fine. But when I start testing multiple block
>>>>>>>>> devices I quickly run into hangs in the guest. I see following
>>>>>>>>> messages in the guest from virtio_ring.c:
>>>>>>>>>
>>>>>>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>>>>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>>>>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>>>>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Badari
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> vq->data[] is initialized by guest virtio-blk driver and 
>>>>>>>> vhost_blk is unaware of it. it looks like used ID passed
>>>>>>>> over by vhost_blk to guest virtio_blk is wrong, but, it should 
>>>>>>>> not happen. :|
>>>>>>>>
>>>>>>>> And I can't reproduce this on my laptop. :(
>>>>>>>>
>>>>>> Finally, found the issue  :)
>>>>>>
>>>>>> Culprit is:
>>>>>>
>>>>>> +static struct io_event events[MAX_EVENTS];
>>>>>>
>>>>>> With multiple devices, multiple threads could be executing 
>>>>>> handle_completion() (one for
>>>>>> each fd) at the same time. "events" array is global :( Need to 
>>>>>> make it one per device/fd.
>>>>>>
>>>>>> For test, I changed MAX_EVENTS to 32 and moved "events" array to 
>>>>>> be local (stack)
>>>>>> to handle_completion(). Tests are running fine.
>>>>>>
>>>>>> Your laptop must have single processor, hence you have only one 
>>>>>> thread executing handle_completion()
>>>>>> at any time..
>>>>>>
>>>>>> Thanks,
>>>>>> Badari
>>>>>>
>>>>>>
>>>>> Good catch, this is rather cool!....Yup, I develop it mostly in a 
>>>>> nested KVM environment. and the L2 host  only runs single 
>>>>> processor :(
>>>>>
>>>>> Thanks,
>>>>> Yuan
>>>> By the way, MAX_EVENTS should be 128, as much as guest virtio_blk 
>>>> driver can batch-submit,
>>>> causing array overflow.
>>>> I have had turned on the debug, and had seen as much as over 100 
>>>> requests batched from guest OS.
>>>>
>>>
>>> Hmm.. I am not sure why you see over 100 outstanding events per fd.  
>>> Max events could be as high as
>>> number of outstanding IOs.
>>>
>>> Anyway, instead of putting it on stack, I kmalloced it now.
>>>
>>> Dongsu Park, Here is the complete patch.
>>>
>>> Thanks
>>> Badari
>>>
>>>
>> In the physical machine, there is a queue depth posted by block 
>> device driver to limit the
>> pending requests number, normally it is 31. But virtio driver doesn't 
>> post it in the guest OS.
>> So nothing prevents OS batch-submitting requests more than 31.
>>
>> I have noticed over 100 pending requests during guest OS 
>> initialization and it is reproducible.
>>
>> BTW, how is perf number for vhost-blk in your environment?
>
> Right now I am doing "dd" tests to test out the functionality and 
> stability.
>
> I plan to collect FFSB benchmark results across 6-virtio-blk/vhost-blk 
> disks with
> all profiles - seq read, seq write, random read, random write with 
> blocksizes varying
> from 4k to 1MB.
>
> I will start the test tomorrow. It will take few days to run thru all 
> the scenarios.
> I don't have an easy way to collect host CPU consumption - but for now 
> lets
> focus on throughput and latency. I will share the results in few days.
>
> Thanks
> Badari
>
>

Awesome! Thanks for your work and data.

Yuan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
  2011-08-15  4:17                                         ` Badari Pulavarty
  2011-08-16  5:44                                           ` Liu Yuan
@ 2011-09-07 13:36                                           ` Liu Yuan
  1 sibling, 0 replies; 54+ messages in thread
From: Liu Yuan @ 2011-09-07 13:36 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: kvm, Dongsu Park

On 08/15/2011 12:17 PM, Badari Pulavarty wrote:
> On 8/14/2011 8:20 PM, Liu Yuan wrote:
>> On 08/13/2011 12:12 AM, Badari Pulavarty wrote:
>>> On 8/12/2011 4:40 AM, Liu Yuan wrote:
>>>> On 08/12/2011 04:27 PM, Liu Yuan wrote:
>>>>> On 08/12/2011 12:50 PM, Badari Pulavarty wrote:
>>>>>> On 8/10/2011 8:19 PM, Liu Yuan wrote:
>>>>>>> On 08/11/2011 11:01 AM, Liu Yuan wrote:
>>>>>>>>
>>>>>>>>> It looks like the patch wouldn't work for testing multiple 
>>>>>>>>> devices.
>>>>>>>>>
>>>>>>>>> vhost_blk_open() does
>>>>>>>>> +       used_info_cachep = KMEM_CACHE(used_info, 
>>>>>>>>> SLAB_HWCACHE_ALIGN |
>>>>>>>>> SLAB_PANIC);
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is weird. How do you open multiple devices? I just opened 
>>>>>>>> the devices with the following command:
>>>>>>>>
>>>>>>>> -drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
>>>>>>>> file=~/data0.img,if=virtio,cache=none,aio=native -drive 
>>>>>>>> file=~/data1.img,if=virtio,cache=none,aio=native
>>>>>>>>
>>>>>>>> And I didn't meet any problem.
>>>>>>>>
>>>>>>>> this would tell qemu to open three devices, and pass three FDs 
>>>>>>>> to three instances of vhost_blk module.
>>>>>>>> So KMEM_CACHE() is okay in vhost_blk_open().
>>>>>>>>
>>>>>>>
>>>>>>> Oh, you are right. KMEM_CACHE() is in the wrong place. Three 
>>>>>>> instances of vhost worker threads are created. Hmmm, but I didn't 
>>>>>>> meet any problem when opening it and running it. So strange. 
>>>>>>> I'll go figure it out.
>>>>>>>
>>>>>>>>> When opening second device, we get panic since 
>>>>>>>>> used_info_cachep is
>>>>>>>>> already created. Just to make progress I moved this call to
>>>>>>>>> vhost_blk_init().
>>>>>>>>>
>>>>>>>>> I don't see any host panics now. With single block device (dd),
>>>>>>>>> it seems to work fine. But when I start testing multiple block
>>>>>>>>> devices I quickly run into hangs in the guest. I see following
>>>>>>>>> messages in the guest from virtio_ring.c:
>>>>>>>>>
>>>>>>>>> virtio_blk virtio2: requests: id 0 is not a head !
>>>>>>>>> virtio_blk virtio1: requests: id 0 is not a head !
>>>>>>>>> virtio_blk virtio4: requests: id 1 is not a head !
>>>>>>>>> virtio_blk virtio3: requests: id 39 is not a head !
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Badari
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> vq->data[] is initialized by the guest virtio-blk driver and 
>>>>>>>> vhost_blk is unaware of it. It looks like the used ID passed
>>>>>>>> by vhost_blk to the guest virtio_blk is wrong, but that should 
>>>>>>>> not happen. :|
>>>>>>>>
>>>>>>>> And I can't reproduce this on my laptop. :(
>>>>>>>>
>>>>>> Finally, found the issue  :)
>>>>>>
>>>>>> Culprit is:
>>>>>>
>>>>>> +static struct io_event events[MAX_EVENTS];
>>>>>>
>>>>>> With multiple devices, multiple threads could be executing 
>>>>>> handle_completion() (one for
>>>>>> each fd) at the same time. "events" array is global :( Need to 
>>>>>> make it one per device/fd.
>>>>>>
>>>>>> For test, I changed MAX_EVENTS to 32 and moved "events" array to 
>>>>>> be local (stack)
>>>>>> to handle_completion(). Tests are running fine.
>>>>>>
>>>>>> Your laptop must have single processor, hence you have only one 
>>>>>> thread executing handle_completion()
>>>>>> at any time..
>>>>>>
>>>>>> Thanks,
>>>>>> Badari
>>>>>>
>>>>>>
>>>>> Good catch, this is rather cool!....Yup, I develop it mostly in a 
>>>>> nested KVM environment. and the L2 host  only runs single 
>>>>> processor :(
>>>>>
>>>>> Thanks,
>>>>> Yuan
>>>> By the way, MAX_EVENTS should be 128, as much as guest virtio_blk 
>>>> driver can batch-submit,
>>>> causing array overflow.
>>>> I turned on debugging and saw more than 100 
>>>> requests batched from the guest OS.
>>>>
>>>
>>> Hmm.. I am not sure why you see over 100 outstanding events per fd.  
>>> Max events could be as high as
>>> number of outstanding IOs.
>>>
>>> Anyway, instead of putting it on stack, I kmalloced it now.
>>>
>>> Dongsu Park, Here is the complete patch.
>>>
>>> Thanks
>>> Badari
>>>
>>>
>> In the physical machine, there is a queue depth posted by block 
>> device driver to limit the
>> pending requests number, normally it is 31. But virtio driver doesn't 
>> post it in the guest OS.
>> So nothing prevents the OS from batch-submitting more than 31 requests.
>>
>> I have noticed over 100 pending requests during guest OS 
>> initialization and it is reproducible.
>>
>> BTW, how is perf number for vhost-blk in your environment?
>
> Right now I am doing "dd" tests to test out the functionality and 
> stability.
>
> I plan to collect FFSB benchmark results across 6-virtio-blk/vhost-blk 
> disks with
> all profiles - seq read, seq write, random read, random write with 
> blocksizes varying
> from 4k to 1MB.
>
> I will start the test tomorrow. It will take few days to run thru all 
> the scenarios.
> I don't have an easy way to collect host CPU consumption - but for now 
> lets
> focus on throughput and latency. I will share the results in few days.
>
> Thanks
> Badari
>
>
Hi Badari,
     how is the test going?

Thanks,
Yuan
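
The race Badari describes in the quoted text (a single global "events" array shared by every handle_completion() thread) goes away once each opened device owns its own buffer. Below is a minimal userspace sketch of that layout, with malloc/calloc standing in for the kernel allocator. The names demo_event, demo_blk, demo_blk_open, demo_blk_release and demo_handle_completion are hypothetical and are not taken from the patch; MAX_EVENTS uses the 128 suggested in the thread.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_EVENTS 128  /* value suggested in the thread */

/* Hypothetical stand-in for the kernel's struct io_event. */
struct demo_event {
    unsigned long id;  /* descriptor head id to report back */
    long res;          /* completion result */
};

/* Hypothetical per-device state: one instance per opened fd. */
struct demo_blk {
    struct demo_event *events;  /* private completion buffer */
    size_t nr_events;           /* number of valid entries */
};

/* Allocate device state with its own buffer; returns NULL on failure. */
struct demo_blk *demo_blk_open(void)
{
    struct demo_blk *blk = malloc(sizeof(*blk));

    if (!blk)
        return NULL;
    blk->events = calloc(MAX_EVENTS, sizeof(*blk->events));
    if (!blk->events) {
        free(blk);
        return NULL;
    }
    blk->nr_events = 0;
    return blk;
}

/* Free the buffer together with the device state. */
void demo_blk_release(struct demo_blk *blk)
{
    if (!blk)
        return;
    free(blk->events);
    free(blk);
}

/* Record up to MAX_EVENTS completions in this device's buffer only. */
size_t demo_handle_completion(struct demo_blk *blk,
                              unsigned long first_id, size_t n)
{
    assert(blk != NULL && blk->events != NULL);
    if (n > MAX_EVENTS)
        n = MAX_EVENTS;  /* clamp instead of overflowing the buffer */
    for (size_t i = 0; i < n; i++) {
        blk->events[i].id = first_id + i;
        blk->events[i].res = 0;
    }
    blk->nr_events = n;
    return n;
}
```

Because no state is shared between devices, two handlers running concurrently on different devices cannot overwrite each other's completion ids, which is the failure that surfaced in the guest as "id N is not a head".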


end of thread, other threads:[~2011-09-07 16:20 UTC | newest]

Thread overview: 54+ messages
2011-07-28 14:29 [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Liu Yuan
2011-07-28 14:29 ` [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk Liu Yuan
2011-07-28 14:47   ` Christoph Hellwig
2011-07-29 11:19     ` Liu Yuan
2011-07-28 15:18   ` Stefan Hajnoczi
2011-07-28 15:22   ` Michael S. Tsirkin
2011-07-29 15:09     ` Liu Yuan
2011-08-01  6:25     ` Liu Yuan
2011-08-01  8:12       ` Michael S. Tsirkin
2011-08-01  8:55         ` Liu Yuan
2011-08-01 10:26           ` Michael S. Tsirkin
2011-08-11 19:59     ` Dongsu Park
2011-08-12  8:56       ` Alan Cox
2011-07-28 14:29 ` [RFC PATCH] vhost: Enable vhost-blk support Liu Yuan
2011-07-28 15:44 ` [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device Stefan Hajnoczi
2011-07-29  4:48   ` Stefan Hajnoczi
2011-07-29  7:59     ` Liu Yuan
2011-07-29 10:55       ` Christoph Hellwig
2011-07-29  7:22   ` Liu Yuan
2011-07-29  9:06     ` Stefan Hajnoczi
2011-07-29 12:01       ` Liu Yuan
2011-07-29 12:29         ` Stefan Hajnoczi
2011-07-29 12:50           ` Stefan Hajnoczi
2011-07-29 14:45             ` Liu Yuan
2011-07-29 14:50               ` Liu Yuan
2011-07-29 15:25         ` Sasha Levin
2011-08-01  8:17           ` Avi Kivity
2011-08-01  9:18             ` Liu Yuan
2011-08-01  9:37               ` Avi Kivity
2011-07-29 18:12     ` Badari Pulavarty
2011-08-01  5:46       ` Liu Yuan
2011-08-01  8:12         ` Christoph Hellwig
2011-08-04 21:58         ` Badari Pulavarty
2011-08-05  7:56           ` Liu Yuan
2011-08-05 11:04           ` Liu Yuan
2011-08-05 18:02             ` Badari Pulavarty
2011-08-08  1:35               ` Liu Yuan
2011-08-08  5:04                 ` Badari Pulavarty
2011-08-08  7:31                   ` Liu Yuan
2011-08-08 17:16                     ` Badari Pulavarty
2011-08-10  2:19                       ` Liu Yuan
2011-08-10 20:37                         ` Badari Pulavarty
2011-08-11  3:01                           ` Liu Yuan
2011-08-11  3:19                             ` Liu Yuan
2011-08-11 23:51                               ` Badari Pulavarty
2011-08-12  4:50                               ` Badari Pulavarty
2011-08-12  6:46                                 ` Dongsu Park
2011-08-12  8:27                                 ` Liu Yuan
2011-08-12 11:40                                   ` Liu Yuan
2011-08-12 16:12                                     ` Badari Pulavarty
2011-08-15  3:20                                       ` Liu Yuan
2011-08-15  4:17                                         ` Badari Pulavarty
2011-08-16  5:44                                           ` Liu Yuan
2011-09-07 13:36                                           ` Liu Yuan
