* [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
@ 2012-11-15 15:18 Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

This series adds the -device virtio-blk-pci,x-data-plane=on property that
enables a high performance I/O codepath.  A dedicated thread is used to process
virtio-blk requests outside the global mutex and without going through the QEMU
block layer.

Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
IOPS for a single VM using virtio-blk-data-plane in July:

  http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580

The virtio-blk-data-plane approach was originally presented at Linux Plumbers
Conference 2010.  The following slides contain a brief overview:

  http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf

The basic approach is:
1. Each virtio-blk device has a thread dedicated to handling ioeventfd
   signalling when the guest kicks the virtqueue.
2. Requests are processed without going through the QEMU block layer using
   Linux AIO directly.
3. Completion interrupts are injected via irqfd from the dedicated thread.
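
In code terms (patches 3-6 contain the real implementation), the per-request
flow inside the dedicated thread boils down to roughly the following.  This
is only a condensed sketch: declarations are omitted, as are notification
suppression and error handling.

    /* guest kick arrives via ioeventfd -> handle_notify() */
    while ((head = vring_pop(vdev, &vring, iov, iov_end,
                             &out_num, &in_num)) >= 0) {
        ioq_rdwr(&ioqueue, is_read, &iov[1], count, offset); /* queue AIO */
    }
    ioq_submit(&ioqueue);                /* one io_submit(2) for the batch */

    /* Linux AIO completion arrives via eventfd -> handle_io() */
    ioq_run_completion(&ioqueue, complete_request, s);
    /* complete_request() calls vring_push(); then the guest is notified: */
    if (vring_should_notify(vdev, &vring)) {
        event_notifier_set(guest_notifier);      /* injected via irqfd */
    }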

To try it out:

  qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
       -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on

Limitations:
 * Only format=raw is supported
 * Live migration is not supported
 * Block jobs, hot unplug, and other operations fail with -EBUSY
 * I/O throttling limits are ignored
 * Only Linux hosts are supported due to Linux AIO usage

The code has reached a stage where I feel it is ready to merge.  Users have
been playing with it for some time and want the significant performance boost.

We are refactoring QEMU to get rid of the global mutex.  I believe that
virtio-blk-data-plane can eventually become the default mode of operation.

Instead of waiting for global mutex removal efforts to finish, I want to use
virtio-blk-data-plane as an example device for AioContext and threaded hw
dispatch refactoring.  This means:

1. When the block layer can bind to an AioContext and execute I/O outside the
   global mutex, virtio-blk-data-plane can use this (and gain image format
   support).

2. When hw dispatch no longer needs the global mutex we can use hw/virtio.c
   again and perhaps run a pool of iothreads instead of dedicated data plane
   threads.

But in the meantime, I have cleaned up the virtio-blk-data-plane code so that
it can be merged as an experimental feature.

Changes from the RFC v9:
 * Add x-data-plane=on|off option and coexist with regular virtio-blk code
 * Create thread from BH so it inherits iothread cpusets
 * Drain requests on vm_stop() so stopped guest does not access image file
 * Add migration blocker
 * Add bdrv_in_use() to prevent block jobs and other operations that can interfere
 * Drop IOQueue request merging for simplicity
 * Drop ioctl interrupt injection and always use irqfd for simplicity
 * Major cleanup to split up source files
 * Rebase from qemu-kvm.git onto qemu.git
 * Address Michael Tsirkin's review comments

Stefan Hajnoczi (7):
  raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
  configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
  dataplane: add virtqueue vring code
  dataplane: add event loop
  dataplane: add Linux AIO request queue
  dataplane: add virtio-blk data plane code
  virtio-blk: add x-data-plane=on|off performance feature

 block.h                    |   9 +
 block/raw-posix.c          |  34 ++++
 configure                  |  21 +++
 hw/Makefile.objs           |   2 +-
 hw/dataplane/Makefile.objs |   3 +
 hw/dataplane/event-poll.c  | 109 ++++++++++++
 hw/dataplane/event-poll.h  |  40 +++++
 hw/dataplane/ioq.c         | 118 +++++++++++++
 hw/dataplane/ioq.h         |  57 +++++++
 hw/dataplane/virtio-blk.c  | 414 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/virtio-blk.h  |  41 +++++
 hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++
 hw/dataplane/vring.h       |  54 ++++++
 hw/virtio-blk.c            |  59 ++++++-
 hw/virtio-blk.h            |   1 +
 hw/virtio-pci.c            |   3 +
 trace-events               |   9 +
 17 files changed, 1293 insertions(+), 2 deletions(-)
 create mode 100644 hw/dataplane/Makefile.objs
 create mode 100644 hw/dataplane/event-poll.c
 create mode 100644 hw/dataplane/event-poll.h
 create mode 100644 hw/dataplane/ioq.c
 create mode 100644 hw/dataplane/ioq.h
 create mode 100644 hw/dataplane/virtio-blk.c
 create mode 100644 hw/dataplane/virtio-blk.h
 create mode 100644 hw/dataplane/vring.c
 create mode 100644 hw/dataplane/vring.h

-- 
1.8.0


* [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
@ 2012-11-15 15:19 ` Stefan Hajnoczi
  2012-11-15 20:03   ` Anthony Liguori
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 2/7] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE Stefan Hajnoczi
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The raw_get_aio_fd() function allows virtio-blk-data-plane to get the
file descriptor of a raw image file with Linux AIO enabled.  This
interface is really a layering violation that can be resolved once the
block layer is able to run outside the global mutex - at that point
virtio-blk-data-plane will switch from custom Linux AIO code to using
the block layer.
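
For context, here is a rough, hypothetical sketch of how a caller (such as
the virtio-blk code in patch 7) might use this function; the error message
and the s->dataplane field are illustrative only:

    int fd = raw_get_aio_fd(bs);
    if (fd < 0) {
        /* not a raw image opened with aio=native, or no Linux AIO support */
        error_report("drive is incompatible with x-data-plane, "
                     "use format=raw,cache=none,aio=native");
        return;
    }
    /* hand the raw file descriptor to the data plane for direct Linux AIO */
    s->dataplane = virtio_blk_data_plane_create(vdev, fd);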

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.h           |  9 +++++++++
 block/raw-posix.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/block.h b/block.h
index 722c620..2dc6aaf 100644
--- a/block.h
+++ b/block.h
@@ -365,6 +365,15 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
 void bdrv_set_in_use(BlockDriverState *bs, int in_use);
 int bdrv_in_use(BlockDriverState *bs);
 
+#ifdef CONFIG_LINUX_AIO
+int raw_get_aio_fd(BlockDriverState *bs);
+#else
+static inline int raw_get_aio_fd(BlockDriverState *bs)
+{
+    return -ENOTSUP;
+}
+#endif
+
 enum BlockAcctType {
     BDRV_ACCT_READ,
     BDRV_ACCT_WRITE,
diff --git a/block/raw-posix.c b/block/raw-posix.c
index f2f0404..fc04981 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1768,6 +1768,40 @@ static BlockDriver bdrv_host_cdrom = {
 };
 #endif /* __FreeBSD__ */
 
+#ifdef CONFIG_LINUX_AIO
+/**
+ * Return the file descriptor for Linux AIO
+ *
+ * This function is a layering violation and should be removed when it becomes
+ * possible to call the block layer outside the global mutex.  It allows the
+ * caller to hijack the file descriptor so I/O can be performed outside the
+ * block layer.
+ */
+int raw_get_aio_fd(BlockDriverState *bs)
+{
+    BDRVRawState *s;
+
+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
+
+    if (bs->drv == bdrv_find_format("raw")) {
+        bs = bs->file;
+    }
+
+    /* raw-posix has several protocols so just check for raw_aio_readv */
+    if (bs->drv->bdrv_aio_readv != raw_aio_readv) {
+        return -ENOTSUP;
+    }
+
+    s = bs->opaque;
+    if (!s->use_aio) {
+        return -ENOTSUP;
+    }
+    return s->fd;
+}
+#endif /* CONFIG_LINUX_AIO */
+
 static void bdrv_file_init(void)
 {
     /*
-- 
1.8.0


* [Qemu-devel] [PATCH 2/7] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
@ 2012-11-15 15:19 ` Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code Stefan Hajnoczi
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The virtio-blk-data-plane feature only works with Linux AIO.  Therefore
add a ./configure option and necessary checks to implement this
dependency.
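
For example, assuming libaio development headers are installed, the feature
can be enabled explicitly at configure time (by default it simply follows
the Linux AIO setting, as the check below implements):

  ./configure --enable-linux-aio --enable-virtio-blk-data-plane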

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 configure | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/configure b/configure
index f847ee2..0460589 100755
--- a/configure
+++ b/configure
@@ -223,6 +223,7 @@ libiscsi=""
 coroutine=""
 seccomp=""
 glusterfs=""
+virtio_blk_data_plane=""
 
 # parse CC options first
 for opt do
@@ -871,6 +872,10 @@ for opt do
   ;;
   --enable-glusterfs) glusterfs="yes"
   ;;
+  --disable-virtio-blk-data-plane) virtio_blk_data_plane="no"
+  ;;
+  --enable-virtio-blk-data-plane) virtio_blk_data_plane="yes"
+  ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
   ;;
   esac
@@ -2233,6 +2238,17 @@ EOF
 fi
 
 ##########################################
+# adjust virtio-blk-data-plane based on linux-aio
+
+if test "$virtio_blk_data_plane" = "yes" -a \
+	"$linux_aio" != "yes" ; then
+  echo "Error: virtio-blk-data-plane requires Linux AIO, please try --enable-linux-aio"
+  exit 1
+elif test -z "$virtio_blk_data_plane" ; then
+  virtio_blk_data_plane=$linux_aio
+fi
+
+##########################################
 # attr probe
 
 if test "$attr" != "no" ; then
@@ -3235,6 +3251,7 @@ echo "build guest agent $guest_agent"
 echo "seccomp support   $seccomp"
 echo "coroutine backend $coroutine_backend"
 echo "GlusterFS support $glusterfs"
+echo "virtio-blk-data-plane $virtio_blk_data_plane"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -3581,6 +3598,10 @@ if test "$glusterfs" = "yes" ; then
   echo "CONFIG_GLUSTERFS=y" >> $config_host_mak
 fi
 
+if test "$virtio_blk_data_plane" = "yes" ; then
+  echo "CONFIG_VIRTIO_BLK_DATA_PLANE=y" >> $config_host_mak
+fi
+
 # USB host support
 case "$usb" in
 linux)
-- 
1.8.0


* [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 2/7] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE Stefan Hajnoczi
@ 2012-11-15 15:19 ` Stefan Hajnoczi
  2012-11-15 15:37   ` Paolo Bonzini
                     ` (3 more replies)
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 4/7] dataplane: add event loop Stefan Hajnoczi
                   ` (4 subsequent siblings)
  7 siblings, 4 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The virtio-blk-data-plane code cannot access guest memory using the usual
QEMU functions since it executes outside the global mutex and the memory
APIs are not thread-safe at this time.

This patch introduces a virtqueue module based on the kernel's vhost
vring code.  The trick is that we map guest memory ahead of time and
access it cheaply outside the global mutex.

Once the hardware emulation code can execute outside the global mutex it
will be possible to drop this code.
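
For reference, the data plane code in patch 6 drains the ring roughly as
follows.  This is a simplified sketch: declarations are omitted, the iovec
cursor is not advanced, and memory barriers are left out.

    /* disable guest->host notifies to avoid unnecessary vmexits */
    vring_set_notification(vdev, &vring, false);

    while ((head = vring_pop(vdev, &vring, iov, iov_end,
                             &out_num, &in_num)) >= 0) {
        /* translate iov[] into an I/O request ... */
    }

    /* re-enable notifies, then re-check in case the guest raced with us */
    vring_set_notification(vdev, &vring, true);
    if (vring_more_avail(&vring)) {
        /* more descriptors have appeared: drain the ring again */
    }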

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/Makefile.objs           |   2 +-
 hw/dataplane/Makefile.objs |   3 +
 hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/vring.h       |  54 ++++++++
 trace-events               |   3 +
 5 files changed, 382 insertions(+), 1 deletion(-)
 create mode 100644 hw/dataplane/Makefile.objs
 create mode 100644 hw/dataplane/vring.c
 create mode 100644 hw/dataplane/vring.h

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index af4ab0c..da8ef0c 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -1,4 +1,4 @@
-common-obj-y = usb/ ide/
+common-obj-y = usb/ ide/ dataplane/
 common-obj-y += loader.o
 common-obj-$(CONFIG_VIRTIO) += virtio-console.o
 common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
new file mode 100644
index 0000000..b58544f
--- /dev/null
+++ b/hw/dataplane/Makefile.objs
@@ -0,0 +1,3 @@
+ifeq ($(CONFIG_VIRTIO), y)
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o
+endif
diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
new file mode 100644
index 0000000..6aacce8
--- /dev/null
+++ b/hw/dataplane/vring.c
@@ -0,0 +1,321 @@
+/* Copyright 2012 Red Hat, Inc.
+ * Copyright IBM, Corp. 2012
+ *
+ * Based on Linux vhost code:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * Inspiration, some code, and most witty comments come from
+ * Documentation/virtual/lguest/lguest.c, by Rusty Russell
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include "trace.h"
+#include "hw/dataplane/vring.h"
+
+/* Map target physical address to host address
+ */
+static inline void *phys_to_host(Vring *vring, hwaddr phys)
+{
+    /* Adjust for 3.6-4 GB PCI memory range */
+    if (phys >= 0x100000000) {
+        phys -= 0x100000000 - 0xe0000000;
+    } else if (phys >= 0xe0000000) {
+        fprintf(stderr, "phys_to_host bad physical address in "
+                "PCI range %#lx\n", phys);
+        exit(1);
+    }
+    return vring->phys_mem_zero_host_ptr + phys;
+}
+
+/* Setup for cheap target physical to host address conversion
+ *
+ * This is a hack for direct access to guest memory, we're not really allowed
+ * to do this.
+ */
+static void setup_phys_to_host(Vring *vring)
+{
+    hwaddr len = 4096; /* RAM is really much larger but we cheat */
+    vring->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
+    if (!vring->phys_mem_zero_host_ptr) {
+        fprintf(stderr, "setup_phys_to_host failed\n");
+        exit(1);
+    }
+}
+
+/* Map the guest's vring to host memory
+ *
+ * This is not allowed but we know the ring won't move.
+ */
+void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
+{
+    setup_phys_to_host(vring);
+
+    vring_init(&vring->vr, virtio_queue_get_num(vdev, n),
+               phys_to_host(vring, virtio_queue_get_ring_addr(vdev, n)), 4096);
+
+    vring->last_avail_idx = 0;
+    vring->last_used_idx = 0;
+    vring->signalled_used = 0;
+    vring->signalled_used_valid = false;
+
+    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
+                      vring->vr.desc, vring->vr.avail, vring->vr.used);
+}
+
+/* Toggle guest->host notifies */
+void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable)
+{
+    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
+        if (enable) {
+            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
+        }
+    } else if (enable) {
+        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
+    } else {
+        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
+    }
+}
+
+/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
+bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
+{
+    uint16_t old, new;
+    bool v;
+    /* Flush out used index updates. This is paired
+     * with the barrier that the Guest executes when enabling
+     * interrupts. */
+    smp_mb();
+
+    if ((vdev->guest_features & VIRTIO_F_NOTIFY_ON_EMPTY) &&
+        unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
+        return true;
+    }
+
+    if (!(vdev->guest_features & VIRTIO_RING_F_EVENT_IDX)) {
+        return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
+    }
+    old = vring->signalled_used;
+    v = vring->signalled_used_valid;
+    new = vring->signalled_used = vring->last_used_idx;
+    vring->signalled_used_valid = true;
+
+    if (unlikely(!v)) {
+        return true;
+    }
+
+    return vring_need_event(vring_used_event(&vring->vr), new, old);
+}
+
+/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
+static bool get_indirect(Vring *vring,
+                         struct iovec iov[], struct iovec *iov_end,
+                         unsigned int *out_num, unsigned int *in_num,
+                         struct vring_desc *indirect)
+{
+    struct vring_desc desc;
+    unsigned int i = 0, count, found = 0;
+
+    /* Sanity check */
+    if (unlikely(indirect->len % sizeof desc)) {
+        fprintf(stderr, "Invalid length in indirect descriptor: "
+               "len 0x%llx not multiple of 0x%zx\n",
+               (unsigned long long)indirect->len,
+               sizeof desc);
+        exit(1);
+    }
+
+    count = indirect->len / sizeof desc;
+    /* Buffers are chained via a 16 bit next field, so
+     * we can have at most 2^16 of these. */
+    if (unlikely(count > USHRT_MAX + 1)) {
+        fprintf(stderr, "Indirect buffer length too big: %d\n",
+               indirect->len);
+        exit(1);
+    }
+
+    /* Point to translate indirect desc chain */
+    indirect = phys_to_host(vring, indirect->addr);
+
+    /* We will use the result as an address to read from, so most
+     * architectures only need a compiler barrier here. */
+    barrier(); /* read_barrier_depends(); */
+
+    do {
+        if (unlikely(++found > count)) {
+            fprintf(stderr, "Loop detected: last one at %u "
+                   "indirect size %u\n",
+                   i, count);
+            exit(1);
+        }
+
+        desc = *indirect++;
+        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
+            fprintf(stderr, "Nested indirect descriptor\n");
+            exit(1);
+        }
+
+        /* Stop for now if there are not enough iovecs available. */
+        if (iov >= iov_end) {
+            return false;
+        }
+
+        iov->iov_base = phys_to_host(vring, desc.addr);
+        iov->iov_len  = desc.len;
+        iov++;
+
+        /* If this is an input descriptor, increment that count. */
+        if (desc.flags & VRING_DESC_F_WRITE) {
+            *in_num += 1;
+        } else {
+            /* If it's an output descriptor, they're all supposed
+             * to come before any input descriptors. */
+            if (unlikely(*in_num)) {
+                fprintf(stderr, "Indirect descriptor "
+                       "has out after in: idx %d\n", i);
+                exit(1);
+            }
+            *out_num += 1;
+        }
+        i = desc.next;
+    } while (desc.flags & VRING_DESC_F_NEXT);
+    return true;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found.  A negative code is
+ * returned on error.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+int vring_pop(VirtIODevice *vdev, Vring *vring,
+              struct iovec iov[], struct iovec *iov_end,
+              unsigned int *out_num, unsigned int *in_num)
+{
+    struct vring_desc desc;
+    unsigned int i, head, found = 0, num = vring->vr.num;
+    __u16 avail_idx, last_avail_idx;
+
+    /* Check it isn't doing very strange things with descriptor numbers. */
+    last_avail_idx = vring->last_avail_idx;
+    avail_idx = vring->vr.avail->idx;
+
+    if (unlikely((__u16)(avail_idx - last_avail_idx) > num)) {
+        fprintf(stderr, "Guest moved used index from %u to %u\n",
+                last_avail_idx, avail_idx);
+        exit(1);
+    }
+
+    /* If there's nothing new since last we looked. */
+    if (avail_idx == last_avail_idx) {
+        return -EAGAIN;
+    }
+
+    /* Only get avail ring entries after they have been exposed by guest. */
+    smp_rmb();
+
+    /* Grab the next descriptor number they're advertising, and increment
+     * the index we've seen. */
+    head = vring->vr.avail->ring[last_avail_idx % num];
+
+    /* If their number is silly, that's an error. */
+    if (unlikely(head >= num)) {
+        fprintf(stderr, "Guest says index %u > %u is available\n",
+                head, num);
+        exit(1);
+    }
+
+    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
+        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
+    }
+
+    /* When we start there are none of either input nor output. */
+    *out_num = *in_num = 0;
+
+    i = head;
+    do {
+        if (unlikely(i >= num)) {
+            fprintf(stderr, "Desc index is %u > %u, head = %u\n",
+                    i, num, head);
+            exit(1);
+        }
+        if (unlikely(++found > num)) {
+            fprintf(stderr, "Loop detected: last one at %u "
+                    "vq size %u head %u\n",
+                    i, num, head);
+            exit(1);
+        }
+        desc = vring->vr.desc[i];
+        if (desc.flags & VRING_DESC_F_INDIRECT) {
+            if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
+                return -ENOBUFS; /* not enough iovecs, stop for now */
+            }
+            continue;
+        }
+
+        /* If there are not enough iovecs left, stop for now.  The caller
+         * should check if there are more descs available once they have dealt
+         * with the current set.
+         */
+        if (iov >= iov_end) {
+            return -ENOBUFS;
+        }
+
+        iov->iov_base = phys_to_host(vring, desc.addr);
+        iov->iov_len  = desc.len;
+        iov++;
+
+        if (desc.flags & VRING_DESC_F_WRITE) {
+            /* If this is an input descriptor,
+             * increment that count. */
+            *in_num += 1;
+        } else {
+            /* If it's an output descriptor, they're all supposed
+             * to come before any input descriptors. */
+            if (unlikely(*in_num)) {
+                fprintf(stderr, "Descriptor has out after in: "
+                        "idx %d\n", i);
+                exit(1);
+            }
+            *out_num += 1;
+        }
+        i = desc.next;
+    } while (desc.flags & VRING_DESC_F_NEXT);
+
+    /* On success, increment avail index. */
+    vring->last_avail_idx++;
+    return head;
+}
+
+/* After we've used one of their buffers, we tell them about it.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+void vring_push(Vring *vring, unsigned int head, int len)
+{
+    struct vring_used_elem *used;
+    uint16_t new;
+
+    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
+     * next entry in that used ring. */
+    used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
+    used->id = head;
+    used->len = len;
+
+    /* Make sure buffer is written before we update index. */
+    smp_wmb();
+
+    new = vring->vr.used->idx = ++vring->last_used_idx;
+    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
+        vring->signalled_used_valid = false;
+    }
+}
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
new file mode 100644
index 0000000..42c2f0a
--- /dev/null
+++ b/hw/dataplane/vring.h
@@ -0,0 +1,54 @@
+/* Copyright 2012 Red Hat, Inc. and/or its affiliates
+ * Copyright IBM, Corp. 2012
+ *
+ * Based on Linux vhost code:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * Inspiration, some code, and most witty comments come from
+ * Documentation/virtual/lguest/lguest.c, by Rusty Russell
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#ifndef VRING_H
+#define VRING_H
+
+#include <linux/virtio_ring.h>
+#include "qemu-common.h"
+#include "qemu-barrier.h"
+#include "memory.h"
+#include "hw/virtio.h"
+
+typedef struct {
+    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
+    struct vring vr;                /* virtqueue vring mapped to host memory */
+    __u16 last_avail_idx;           /* last processed avail ring index */
+    __u16 last_used_idx;            /* last processed used ring index */
+    uint16_t signalled_used;        /* EVENT_IDX state */
+    bool signalled_used_valid;
+} Vring;
+
+static inline unsigned int vring_get_num(Vring *vring)
+{
+    return vring->vr.num;
+}
+
+/* Are there more descriptors available? */
+static inline bool vring_more_avail(Vring *vring)
+{
+    return vring->vr.avail->idx != vring->last_avail_idx;
+}
+
+void vring_setup(Vring *vring, VirtIODevice *vdev, int n);
+void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable);
+bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
+int vring_pop(VirtIODevice *vdev, Vring *vring,
+              struct iovec iov[], struct iovec *iov_end,
+              unsigned int *out_num, unsigned int *in_num);
+void vring_push(Vring *vring, unsigned int head, int len);
+
+#endif /* VRING_H */
diff --git a/trace-events b/trace-events
index e1a37cc..8eeab34 100644
--- a/trace-events
+++ b/trace-events
@@ -98,6 +98,9 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
 virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 
+# hw/dataplane/vring.c
+vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
+
 # thread-pool.c
 thread_pool_submit(void *req, void *opaque) "req %p opaque %p"
 thread_pool_complete(void *req, void *opaque, int ret) "req %p opaque %p ret %d"
-- 
1.8.0


* [Qemu-devel] [PATCH 4/7] dataplane: add event loop
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code Stefan Hajnoczi
@ 2012-11-15 15:19 ` Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 5/7] dataplane: add Linux AIO request queue Stefan Hajnoczi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

Outside the safety of the global mutex we need to poll on file
descriptors.  I found epoll(2) is a convenient way to do that, although
other options could replace this module in the future (such as an
AioContext-based loop or glib's GMainLoop).

One important feature of this small event loop implementation is that
the loop can be terminated in a thread-safe way.  This allows QEMU to
stop the data plane thread cleanly.
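
The resulting API is intentionally small.  A minimal usage sketch follows;
notifier and callback stand for an EventNotifier and an EventCallback
supplied by the caller (patch 6 wires these up for virtqueue kicks and
Linux AIO completions):

    EventPoll poll;
    EventHandler handler;

    event_poll_init(&poll);
    event_poll_add(&poll, &handler, notifier, callback);

    /* event_poll_run() blocks in epoll_wait() and dispatches callbacks; it
     * returns once a callback returns false, which is what another thread
     * triggers by calling event_poll_stop(&poll). */
    event_poll_run(&poll);

    event_poll_cleanup(&poll);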

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/dataplane/Makefile.objs |   2 +-
 hw/dataplane/event-poll.c  | 109 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/event-poll.h  |  40 +++++++++++++++++
 3 files changed, 150 insertions(+), 1 deletion(-)
 create mode 100644 hw/dataplane/event-poll.c
 create mode 100644 hw/dataplane/event-poll.h

diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
index b58544f..a01ad05 100644
--- a/hw/dataplane/Makefile.objs
+++ b/hw/dataplane/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_VIRTIO), y)
-common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o event-poll.o
 endif
diff --git a/hw/dataplane/event-poll.c b/hw/dataplane/event-poll.c
new file mode 100644
index 0000000..4a53d48
--- /dev/null
+++ b/hw/dataplane/event-poll.c
@@ -0,0 +1,109 @@
+/*
+ * Event loop with file descriptor polling
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <sys/epoll.h>
+#include "hw/dataplane/event-poll.h"
+
+/* Add an event notifier and its callback for polling */
+void event_poll_add(EventPoll *poll, EventHandler *handler,
+                    EventNotifier *notifier, EventCallback *callback)
+{
+    struct epoll_event event = {
+        .events = EPOLLIN,
+        .data.ptr = handler,
+    };
+    handler->notifier = notifier;
+    handler->callback = callback;
+    if (epoll_ctl(poll->epoll_fd, EPOLL_CTL_ADD,
+                  event_notifier_get_fd(notifier), &event) != 0) {
+        fprintf(stderr, "failed to add event handler to epoll: %m\n");
+        exit(1);
+    }
+}
+
+/* Event callback for stopping the event_poll_run() loop */
+static bool handle_stop(EventHandler *handler)
+{
+    return false; /* stop event loop */
+}
+
+void event_poll_init(EventPoll *poll)
+{
+    /* Create epoll file descriptor */
+    poll->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+    if (poll->epoll_fd < 0) {
+        fprintf(stderr, "epoll_create1 failed: %m\n");
+        exit(1);
+    }
+
+    /* Set up stop notifier */
+    if (event_notifier_init(&poll->stop_notifier, 0) < 0) {
+        fprintf(stderr, "failed to init stop notifier\n");
+        exit(1);
+    }
+    event_poll_add(poll, &poll->stop_handler,
+                   &poll->stop_notifier, handle_stop);
+}
+
+void event_poll_cleanup(EventPoll *poll)
+{
+    event_notifier_cleanup(&poll->stop_notifier);
+    close(poll->epoll_fd);
+    poll->epoll_fd = -1;
+}
+
+/* Block until the next event and invoke its callback
+ *
+ * Signals must be masked, EINTR should never happen.  This is true for QEMU
+ * threads.
+ */
+static bool event_poll(EventPoll *poll)
+{
+    EventHandler *handler;
+    struct epoll_event event;
+    int nevents;
+
+    /* Wait for the next event.  Only do one event per call to keep the
+     * function simple, this could be changed later. */
+    nevents = epoll_wait(poll->epoll_fd, &event, 1, -1);
+    if (unlikely(nevents != 1)) {
+        fprintf(stderr, "epoll_wait failed: %m\n");
+        exit(1); /* should never happen */
+    }
+
+    /* Find out which event handler has become active */
+    handler = event.data.ptr;
+
+    /* Clear the eventfd */
+    event_notifier_test_and_clear(handler->notifier);
+
+    /* Handle the event */
+    return handler->callback(handler);
+}
+
+void event_poll_run(EventPoll *poll)
+{
+    while (event_poll(poll)) {
+        /* do nothing */
+    }
+}
+
+/* Stop the event_poll_run() loop
+ *
+ * This function can be used from another thread.
+ */
+void event_poll_stop(EventPoll *poll)
+{
+    event_notifier_set(&poll->stop_notifier);
+}
diff --git a/hw/dataplane/event-poll.h b/hw/dataplane/event-poll.h
new file mode 100644
index 0000000..5e1771f
--- /dev/null
+++ b/hw/dataplane/event-poll.h
@@ -0,0 +1,40 @@
+/*
+ * Event loop with file descriptor polling
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef EVENT_POLL_H
+#define EVENT_POLL_H
+
+#include "event_notifier.h"
+
+typedef struct EventHandler EventHandler;
+typedef bool EventCallback(EventHandler *handler);
+struct EventHandler {
+    EventNotifier *notifier;        /* eventfd */
+    EventCallback *callback;        /* callback function */
+};
+
+typedef struct {
+    int epoll_fd;                   /* epoll(2) file descriptor */
+    EventNotifier stop_notifier;    /* stop poll notifier */
+    EventHandler stop_handler;      /* stop poll handler */
+} EventPoll;
+
+void event_poll_add(EventPoll *poll, EventHandler *handler,
+                    EventNotifier *notifier, EventCallback *callback);
+void event_poll_init(EventPoll *poll);
+void event_poll_cleanup(EventPoll *poll);
+void event_poll_run(EventPoll *poll);
+void event_poll_stop(EventPoll *poll);
+
+#endif /* EVENT_POLL_H */
-- 
1.8.0


* [Qemu-devel] [PATCH 5/7] dataplane: add Linux AIO request queue
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 4/7] dataplane: add event loop Stefan Hajnoczi
@ 2012-11-15 15:19 ` Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 6/7] dataplane: add virtio-blk data plane code Stefan Hajnoczi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The IOQueue has a pool of iocb structs and a function to add new
read/write requests.  Multiple requests can be added before calling the
submit function to actually tell the host kernel to begin I/O.  This
allows callers to batch requests and submit them in one go.

The actual I/O is performed using Linux AIO.
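
A condensed usage sketch (this mirrors how the data plane code in patch 6
drives the queue; declarations and error checking are omitted):

    ioq_init(&ioq, fd, max_reqs);
    event_poll_add(&poll, &io_handler, ioq_get_notifier(&ioq), handle_io);

    /* batch up several requests ... */
    ioq_rdwr(&ioq, true,  iov, iov_cnt, offset);   /* preadv  */
    ioq_rdwr(&ioq, false, iov, iov_cnt, offset);   /* pwritev */
    ioq_submit(&ioq);                              /* one io_submit(2) call */

    /* later, when the ioq eventfd fires: */
    ioq_run_completion(&ioq, completion_cb, opaque);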

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/dataplane/Makefile.objs |   2 +-
 hw/dataplane/ioq.c         | 118 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/ioq.h         |  57 ++++++++++++++++++++++
 3 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 hw/dataplane/ioq.c
 create mode 100644 hw/dataplane/ioq.h

diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
index a01ad05..5f3e371 100644
--- a/hw/dataplane/Makefile.objs
+++ b/hw/dataplane/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_VIRTIO), y)
-common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o event-poll.o
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o event-poll.o ioq.o
 endif
diff --git a/hw/dataplane/ioq.c b/hw/dataplane/ioq.c
new file mode 100644
index 0000000..7adeb5d
--- /dev/null
+++ b/hw/dataplane/ioq.c
@@ -0,0 +1,118 @@
+/*
+ * Linux AIO request queue
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "hw/dataplane/ioq.h"
+
+void ioq_init(IOQueue *ioq, int fd, unsigned int max_reqs)
+{
+    int rc;
+
+    ioq->fd = fd;
+    ioq->max_reqs = max_reqs;
+
+    memset(&ioq->io_ctx, 0, sizeof ioq->io_ctx);
+    rc = io_setup(max_reqs, &ioq->io_ctx);
+    if (rc != 0) {
+        fprintf(stderr, "ioq io_setup failed %d\n", rc);
+        exit(1);
+    }
+
+    rc = event_notifier_init(&ioq->io_notifier, 0);
+    if (rc != 0) {
+        fprintf(stderr, "ioq io event notifier creation failed %d\n", rc);
+        exit(1);
+    }
+
+    ioq->freelist = g_malloc0(sizeof ioq->freelist[0] * max_reqs);
+    ioq->freelist_idx = 0;
+
+    ioq->queue = g_malloc0(sizeof ioq->queue[0] * max_reqs);
+    ioq->queue_idx = 0;
+}
+
+void ioq_cleanup(IOQueue *ioq)
+{
+    g_free(ioq->freelist);
+    g_free(ioq->queue);
+
+    event_notifier_cleanup(&ioq->io_notifier);
+    io_destroy(ioq->io_ctx);
+}
+
+EventNotifier *ioq_get_notifier(IOQueue *ioq)
+{
+    return &ioq->io_notifier;
+}
+
+struct iocb *ioq_get_iocb(IOQueue *ioq)
+{
+    if (unlikely(ioq->freelist_idx == 0)) {
+        fprintf(stderr, "ioq underflow\n");
+        exit(1);
+    }
+    struct iocb *iocb = ioq->freelist[--ioq->freelist_idx];
+    ioq->queue[ioq->queue_idx++] = iocb;
+    return iocb;
+}
+
+void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb)
+{
+    if (unlikely(ioq->freelist_idx == ioq->max_reqs)) {
+        fprintf(stderr, "ioq overflow\n");
+        exit(1);
+    }
+    ioq->freelist[ioq->freelist_idx++] = iocb;
+}
+
+struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov,
+                      unsigned int count, long long offset)
+{
+    struct iocb *iocb = ioq_get_iocb(ioq);
+
+    if (read) {
+        io_prep_preadv(iocb, ioq->fd, iov, count, offset);
+    } else {
+        io_prep_pwritev(iocb, ioq->fd, iov, count, offset);
+    }
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
+    return iocb;
+}
+
+int ioq_submit(IOQueue *ioq)
+{
+    int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
+    ioq->queue_idx = 0; /* reset */
+    return rc;
+}
+
+int ioq_run_completion(IOQueue *ioq, IOQueueCompletion *completion,
+                       void *opaque)
+{
+    struct io_event events[ioq->max_reqs];
+    int nevents, i;
+
+    nevents = io_getevents(ioq->io_ctx, 0, ioq->max_reqs, events, NULL);
+    if (unlikely(nevents < 0)) {
+        fprintf(stderr, "io_getevents failed %d\n", nevents);
+        exit(1);
+    }
+
+    for (i = 0; i < nevents; i++) {
+        ssize_t ret = ((uint64_t)events[i].res2 << 32) | events[i].res;
+
+        completion(events[i].obj, ret, opaque);
+        ioq_put_iocb(ioq, events[i].obj);
+    }
+    return nevents;
+}
diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
new file mode 100644
index 0000000..890db22
--- /dev/null
+++ b/hw/dataplane/ioq.h
@@ -0,0 +1,57 @@
+/*
+ * Linux AIO request queue
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef IOQ_H
+#define IOQ_H
+
+#include <libaio.h>
+#include "event_notifier.h"
+
+typedef struct {
+    int fd;                         /* file descriptor */
+    unsigned int max_reqs;          /* max length of freelist and queue */
+
+    io_context_t io_ctx;            /* Linux AIO context */
+    EventNotifier io_notifier;      /* Linux AIO eventfd */
+
+    /* Requests can complete in any order so a free list is necessary to manage
+     * available iocbs.
+     */
+    struct iocb **freelist;         /* free iocbs */
+    unsigned int freelist_idx;
+
+    /* Multiple requests are queued up before submitting them all in one go */
+    struct iocb **queue;            /* queued iocbs */
+    unsigned int queue_idx;
+} IOQueue;
+
+void ioq_init(IOQueue *ioq, int fd, unsigned int max_reqs);
+void ioq_cleanup(IOQueue *ioq);
+EventNotifier *ioq_get_notifier(IOQueue *ioq);
+struct iocb *ioq_get_iocb(IOQueue *ioq);
+void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb);
+struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov,
+                      unsigned int count, long long offset);
+int ioq_submit(IOQueue *ioq);
+
+static inline unsigned int ioq_num_queued(IOQueue *ioq)
+{
+    return ioq->queue_idx;
+}
+
+typedef void IOQueueCompletion(struct iocb *iocb, ssize_t ret, void *opaque);
+int ioq_run_completion(IOQueue *ioq, IOQueueCompletion *completion,
+                       void *opaque);
+
+#endif /* IOQ_H */
-- 
1.8.0


* [Qemu-devel] [PATCH 6/7] dataplane: add virtio-blk data plane code
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 5/7] dataplane: add Linux AIO request queue Stefan Hajnoczi
@ 2012-11-15 15:19 ` Stefan Hajnoczi
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
  2012-11-20  9:02 ` [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Asias He
  7 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

virtio-blk-data-plane is a subset implementation of virtio-blk.  It only
handles read, write, and flush requests.  It does this using a dedicated
thread that executes an epoll(2)-based event loop and processes I/O
using Linux AIO.

This approach performs very well but can be used for raw image files
only.  The number of IOPS achieved has been reported to be several times
higher than with the existing virtio-blk implementation.

Eventually it should be possible to unify virtio-blk-data-plane with the
main body of QEMU code once the block layer and hardware emulation are
able to run outside the global mutex.
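
The interface exposed to hw/virtio-blk.c (wired up in patch 7) is a simple
create/start/stop/destroy lifecycle, roughly:

    VirtIOBlockDataPlane *dp;

    dp = virtio_blk_data_plane_create(vdev, fd); /* fd from raw_get_aio_fd() */
    virtio_blk_data_plane_start(dp);  /* map vring, set notifiers, spawn thread */

    /* on device reset or vm stop: */
    virtio_blk_data_plane_stop(dp);   /* drain in-flight requests, join thread */
    virtio_blk_data_plane_destroy(dp);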

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/dataplane/Makefile.objs |   2 +-
 hw/dataplane/virtio-blk.c  | 414 +++++++++++++++++++++++++++++++++++++++++++++
 hw/dataplane/virtio-blk.h  |  41 +++++
 trace-events               |   6 +
 4 files changed, 462 insertions(+), 1 deletion(-)
 create mode 100644 hw/dataplane/virtio-blk.c
 create mode 100644 hw/dataplane/virtio-blk.h

diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
index 5f3e371..d56796f 100644
--- a/hw/dataplane/Makefile.objs
+++ b/hw/dataplane/Makefile.objs
@@ -1,3 +1,3 @@
 ifeq ($(CONFIG_VIRTIO), y)
-common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o event-poll.o ioq.o
+common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o event-poll.o ioq.o virtio-blk.o
 endif
diff --git a/hw/dataplane/virtio-blk.c b/hw/dataplane/virtio-blk.c
new file mode 100644
index 0000000..e18fcb8
--- /dev/null
+++ b/hw/dataplane/virtio-blk.c
@@ -0,0 +1,414 @@
+/*
+ * Dedicated thread for virtio-blk I/O processing
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "trace.h"
+#include "event-poll.h"
+#include "qemu-thread.h"
+#include "vring.h"
+#include "ioq.h"
+#include "hw/virtio-blk.h"
+#include "hw/dataplane/virtio-blk.h"
+
+enum {
+    SEG_MAX = 126,                  /* maximum number of I/O segments */
+    VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
+    REQ_MAX = VRING_MAX,            /* maximum number of requests in the vring,
+                                     * is VRING_MAX / 2 with traditional and
+                                     * VRING_MAX with indirect descriptors */
+};
+
+typedef struct {
+    struct iocb iocb;               /* Linux AIO control block */
+    unsigned char *status;          /* virtio block status code */
+    unsigned int head;              /* vring descriptor index */
+} VirtIOBlockRequest;
+
+struct VirtIOBlockDataPlane {
+    bool started;
+    QEMUBH *start_bh;
+    QemuThread thread;
+
+    int fd;                         /* image file descriptor */
+
+    VirtIODevice *vdev;
+    Vring vring;                    /* virtqueue vring */
+    EventNotifier *guest_notifier;  /* irq */
+
+    EventPoll event_poll;           /* event poller */
+    EventHandler io_handler;        /* Linux AIO completion handler */
+    EventHandler notify_handler;    /* virtqueue notify handler */
+
+    IOQueue ioqueue;                /* Linux AIO queue (should really be per
+                                       dataplane thread) */
+    VirtIOBlockRequest requests[REQ_MAX]; /* pool of requests, managed by the
+                                             queue */
+
+    unsigned int num_reqs;
+    QemuMutex num_reqs_lock;
+    QemuCond no_reqs_cond;
+};
+
+/* Raise an interrupt to signal guest, if necessary */
+static void notify_guest(VirtIOBlockDataPlane *s)
+{
+    if (!vring_should_notify(s->vdev, &s->vring)) {
+        return;
+    }
+
+    event_notifier_set(s->guest_notifier);
+}
+
+static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    int len;
+
+    if (likely(ret >= 0)) {
+        *req->status = VIRTIO_BLK_S_OK;
+        len = ret;
+    } else {
+        *req->status = VIRTIO_BLK_S_IOERR;
+        len = 0;
+    }
+
+    trace_virtio_blk_data_plane_complete_request(s, req->head, ret);
+
+    /* According to the virtio specification len should be the number of bytes
+     * written to, but for virtio-blk it seems to be the number of bytes
+     * transferred plus the status bytes.
+     */
+    vring_push(&s->vring, req->head, len + sizeof(*req->status));
+
+    qemu_mutex_lock(&s->num_reqs_lock);
+    if (--s->num_reqs == 0) {
+        qemu_cond_broadcast(&s->no_reqs_cond);
+    }
+    qemu_mutex_unlock(&s->num_reqs_lock);
+}
+
+static void process_request(IOQueue *ioq, struct iovec iov[],
+                            unsigned int out_num, unsigned int in_num,
+                            unsigned int head)
+{
+    /* Virtio block requests look like this: */
+    struct virtio_blk_outhdr *outhdr; /* iov[0] */
+    /* data[]                            ... */
+    struct virtio_blk_inhdr *inhdr;   /* iov[out_num + in_num - 1] */
+
+    if (unlikely(out_num == 0 || in_num == 0 ||
+                iov[0].iov_len != sizeof *outhdr ||
+                iov[out_num + in_num - 1].iov_len != sizeof *inhdr)) {
+        fprintf(stderr, "virtio-blk invalid request\n");
+        exit(1);
+    }
+
+    outhdr = iov[0].iov_base;
+    inhdr = iov[out_num + in_num - 1].iov_base;
+
+    /* TODO Linux sets the barrier bit even when not advertised! */
+    uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
+    struct iocb *iocb;
+    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD |
+                    VIRTIO_BLK_T_FLUSH)) {
+    case VIRTIO_BLK_T_IN:
+        if (unlikely(out_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid read request\n");
+            exit(1);
+        }
+        iocb = ioq_rdwr(ioq, true, &iov[1], in_num - 1, outhdr->sector * 512);
+        break;
+
+    case VIRTIO_BLK_T_OUT:
+        if (unlikely(in_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid write request\n");
+            exit(1);
+        }
+        iocb = ioq_rdwr(ioq, false, &iov[1], out_num - 1, outhdr->sector * 512);
+        break;
+
+    case VIRTIO_BLK_T_SCSI_CMD:
+        if (unlikely(in_num == 0)) {
+            fprintf(stderr, "virtio-blk invalid SCSI command request\n");
+            exit(1);
+        }
+
+        /* TODO support SCSI commands */
+        {
+            VirtIOBlockDataPlane *s = container_of(ioq, VirtIOBlockDataPlane,
+                                                   ioqueue);
+            inhdr->status = VIRTIO_BLK_S_UNSUPP;
+            vring_push(&s->vring, head, sizeof *inhdr);
+            notify_guest(s);
+        }
+        return;
+
+    case VIRTIO_BLK_T_FLUSH:
+        if (unlikely(in_num != 1 || out_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid flush request\n");
+            exit(1);
+        }
+
+        /* TODO fdsync not supported by Linux AIO, do it synchronously here! */
+        {
+            VirtIOBlockDataPlane *s = container_of(ioq, VirtIOBlockDataPlane,
+                                                   ioqueue);
+            fdatasync(s->fd);
+            inhdr->status = VIRTIO_BLK_S_OK;
+            vring_push(&s->vring, head, sizeof *inhdr);
+            notify_guest(s);
+        }
+        return;
+
+    default:
+        fprintf(stderr, "virtio-blk unsupported request type %#x\n",
+                outhdr->type);
+        exit(1);
+    }
+
+    /* Fill in virtio block metadata needed for completion */
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    req->head = head;
+    req->status = &inhdr->status;
+}
+
+static bool handle_notify(EventHandler *handler)
+{
+    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
+                                           notify_handler);
+
+    /* There is one array of iovecs into which all new requests are extracted
+     * from the vring.  Requests are read from the vring and the translated
+     * descriptors are written to the iovecs array.  The iovecs do not have to
+     * persist across handle_notify() calls because the kernel copies the
+     * iovecs on io_submit().
+     *
+     * Handling io_submit() EAGAIN may require storing the requests across
+     * handle_notify() calls until the kernel has sufficient resources to
+     * accept more I/O.  This is not implemented yet.
+     */
+    struct iovec iovec[VRING_MAX];
+    struct iovec *end = &iovec[VRING_MAX];
+    struct iovec *iov = iovec;
+
+    /* When a request is read from the vring, the index of the first descriptor
+     * (aka head) is returned so that the completed request can be pushed onto
+     * the vring later.
+     *
+     * The number of hypervisor read-only iovecs is out_num.  The number of
+     * hypervisor write-only iovecs is in_num.
+     */
+    int head;
+    unsigned int out_num = 0, in_num = 0;
+    unsigned int num_queued;
+
+    for (;;) {
+        /* Disable guest->host notifies to avoid unnecessary vmexits */
+        vring_set_notification(s->vdev, &s->vring, false);
+
+        for (;;) {
+            head = vring_pop(s->vdev, &s->vring, iov, end, &out_num, &in_num);
+            if (head < 0) {
+                break; /* no more requests */
+            }
+
+            trace_virtio_blk_data_plane_process_request(s, out_num, in_num,
+                                                        head);
+
+            process_request(&s->ioqueue, iov, out_num, in_num, head);
+            iov += out_num + in_num;
+        }
+
+        if (likely(head == -EAGAIN)) { /* vring emptied */
+            /* Re-enable guest->host notifies and stop processing the vring.
+             * But if the guest has snuck in more descriptors, keep processing.
+             */
+            vring_set_notification(s->vdev, &s->vring, true);
+            smp_mb();
+            if (!vring_more_avail(&s->vring)) {
+                break;
+            }
+        } else { /* head == -ENOBUFS, iovecs[] is depleted */
+            /* Since there are no iovecs[] left, stop processing for now.  Do
+             * not re-enable guest->host notifies since the I/O completion
+             * handler knows to check for more vring descriptors anyway.
+             */
+            break;
+        }
+    }
+
+    num_queued = ioq_num_queued(&s->ioqueue);
+    if (num_queued > 0) {
+        qemu_mutex_lock(&s->num_reqs_lock);
+        s->num_reqs += num_queued;
+        qemu_mutex_unlock(&s->num_reqs_lock);
+
+        int rc = ioq_submit(&s->ioqueue);
+        if (unlikely(rc < 0)) {
+            fprintf(stderr, "ioq_submit failed %d\n", rc);
+            exit(1);
+        }
+    }
+    return true;
+}
+
+static bool handle_io(EventHandler *handler)
+{
+    VirtIOBlockDataPlane *s = container_of(handler, VirtIOBlockDataPlane,
+                                           io_handler);
+
+    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
+        notify_guest(s);
+    }
+
+    /* If there were more requests than iovecs, the vring will not be empty yet
+     * so check again.  There should now be enough resources to process more
+     * requests.
+     */
+    if (unlikely(vring_more_avail(&s->vring))) {
+        return handle_notify(&s->notify_handler);
+    }
+
+    return true;
+}
+
+static void *data_plane_thread(void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+    event_poll_run(&s->event_poll);
+    return NULL;
+}
+
+static void start_data_plane_bh(void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+
+    qemu_bh_delete(s->start_bh);
+    s->start_bh = NULL;
+    qemu_thread_create(&s->thread, data_plane_thread,
+                       s, QEMU_THREAD_JOINABLE);
+}
+
+VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd)
+{
+    VirtIOBlockDataPlane *s;
+
+    s = g_new0(VirtIOBlockDataPlane, 1);
+    s->vdev = vdev;
+    s->fd = fd;
+    return s;
+}
+
+void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
+{
+    if (!s) {
+        return;
+    }
+    virtio_blk_data_plane_stop(s);
+    g_free(s);
+}
+
+/* Block until pending requests have completed
+ *
+ * The vring continues to be serviced so ensure no new requests will be added
+ * to avoid races.
+ */
+void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s)
+{
+    qemu_mutex_lock(&s->num_reqs_lock);
+    while (s->num_reqs > 0) {
+        qemu_cond_wait(&s->no_reqs_cond, &s->num_reqs_lock);
+    }
+    qemu_mutex_unlock(&s->num_reqs_lock);
+}
+
+void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
+{
+    VirtQueue *vq;
+    int i;
+
+    if (s->started) {
+        return;
+    }
+
+    vq = virtio_get_queue(s->vdev, 0);
+    vring_setup(&s->vring, s->vdev, 0);
+
+    event_poll_init(&s->event_poll);
+
+    /* Set up guest notifier (irq) */
+    if (s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque,
+                                              true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set guest notifier, "
+                "ensure -enable-kvm is set\n");
+        exit(1);
+    }
+    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
+
+    /* Set up virtqueue notify */
+    if (s->vdev->binding->set_host_notifier(s->vdev->binding_opaque,
+                                            0, true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set host notifier\n");
+        exit(1);
+    }
+    event_poll_add(&s->event_poll, &s->notify_handler,
+                   virtio_queue_get_host_notifier(vq),
+                   handle_notify);
+
+    /* Set up ioqueue */
+    ioq_init(&s->ioqueue, s->fd, REQ_MAX);
+    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
+        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
+    }
+    event_poll_add(&s->event_poll, &s->io_handler,
+                   ioq_get_notifier(&s->ioqueue), handle_io);
+
+    s->started = true;
+    trace_virtio_blk_data_plane_start(s);
+
+    /* Kick right away to begin processing requests already in vring */
+    event_notifier_set(virtio_queue_get_host_notifier(vq));
+
+    /* Spawn thread in BH so it inherits iothread cpusets */
+    s->start_bh = qemu_bh_new(start_data_plane_bh, s);
+    qemu_bh_schedule(s->start_bh);
+}
+
+void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
+{
+    if (!s->started) {
+        return;
+    }
+    s->started = false;
+    trace_virtio_blk_data_plane_stop(s);
+
+    /* Stop thread or cancel pending thread creation BH */
+    if (s->start_bh) {
+        qemu_bh_delete(s->start_bh);
+        s->start_bh = NULL;
+    } else {
+        virtio_blk_data_plane_drain(s);
+        event_poll_stop(&s->event_poll);
+        qemu_thread_join(&s->thread);
+    }
+
+    ioq_cleanup(&s->ioqueue);
+
+    s->vdev->binding->set_host_notifier(s->vdev->binding_opaque, 0, false);
+
+    event_poll_cleanup(&s->event_poll);
+
+    /* Clean up guest notifier (irq) */
+    s->vdev->binding->set_guest_notifiers(s->vdev->binding_opaque, false);
+}
diff --git a/hw/dataplane/virtio-blk.h b/hw/dataplane/virtio-blk.h
new file mode 100644
index 0000000..ddf1115
--- /dev/null
+++ b/hw/dataplane/virtio-blk.h
@@ -0,0 +1,41 @@
+/*
+ * Dedicated thread for virtio-blk I/O processing
+ *
+ * Copyright 2012 IBM, Corp.
+ * Copyright 2012 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef HW_DATAPLANE_VIRTIO_BLK_H
+#define HW_DATAPLANE_VIRTIO_BLK_H
+
+#include "hw/virtio.h"
+
+typedef struct VirtIOBlockDataPlane VirtIOBlockDataPlane;
+
+#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
+VirtIOBlockDataPlane *virtio_blk_data_plane_create(VirtIODevice *vdev, int fd);
+void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
+#else
+static inline VirtIOBlockDataPlane *virtio_blk_data_plane_create(
+        VirtIODevice *vdev, int fd)
+{
+    return NULL;
+}
+
+static inline void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) {}
+static inline void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s) {}
+static inline void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s) {}
+static inline void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s) {}
+#endif
+
+#endif /* HW_DATAPLANE_VIRTIO_BLK_H */
diff --git a/trace-events b/trace-events
index 8eeab34..76d92bb 100644
--- a/trace-events
+++ b/trace-events
@@ -98,6 +98,12 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
 virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
 
+# hw/dataplane/virtio-blk.c
+virtio_blk_data_plane_start(void *s) "dataplane %p"
+virtio_blk_data_plane_stop(void *s) "dataplane %p"
+virtio_blk_data_plane_process_request(void *s, unsigned int out_num, unsigned int in_num, unsigned int head) "dataplane %p out_num %u in_num %u head %u"
+virtio_blk_data_plane_complete_request(void *s, unsigned int head, int ret) "dataplane %p head %u ret %d"
+
 # hw/dataplane/vring.c
 vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
 
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 6/7] dataplane: add virtio-blk data plane code Stefan Hajnoczi
@ 2012-11-15 15:19 ` Stefan Hajnoczi
  2012-11-15 18:48   ` Michael S. Tsirkin
  2012-11-15 21:08   ` Anthony Liguori
  2012-11-20  9:02 ` [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Asias He
  7 siblings, 2 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-15 15:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, khoa,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

The virtio-blk-data-plane feature is easy to integrate into
hw/virtio-blk.c.  The data plane can be started and stopped similar to
vhost-net.

Users can take advantage of the virtio-blk-data-plane feature using the
new -device virtio-blk-pci,x-data-plane=on property.

The x-data-plane name was chosen because at this stage the feature is
experimental and likely to see changes in the future.

If the VM configuration does not support virtio-blk-data-plane an error
message is printed.  Although we could fall back to regular virtio-blk,
I prefer the explicit approach since it prompts the user to fix their
configuration if they want the performance benefit of
virtio-blk-data-plane.

Limitations:
 * Only format=raw is supported
 * Live migration is not supported
 * Block jobs, hot unplug, and other operations fail with -EBUSY
 * I/O throttling limits are ignored
 * Only Linux hosts are supported due to Linux AIO usage

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/virtio-blk.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/virtio-blk.h |  1 +
 hw/virtio-pci.c |  3 +++
 3 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index e25cc96..7f6004e 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -17,6 +17,8 @@
 #include "hw/block-common.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
+#include "hw/dataplane/virtio-blk.h"
+#include "migration.h"
 #include "scsi-defs.h"
 #ifdef __linux__
 # include <scsi/sg.h>
@@ -33,6 +35,8 @@ typedef struct VirtIOBlock
     VirtIOBlkConf *blk;
     unsigned short sector_mask;
     DeviceState *qdev;
+    VirtIOBlockDataPlane *dataplane;
+    Error *migration_blocker;
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -407,6 +411,14 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         .num_writes = 0,
     };
 
+    /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
+     * dataplane here instead of waiting for .set_status().
+     */
+    if (s->dataplane) {
+        virtio_blk_data_plane_start(s->dataplane);
+        return;
+    }
+
     while ((req = virtio_blk_get_request(s))) {
         virtio_blk_handle_request(req, &mrb);
     }
@@ -446,8 +458,13 @@ static void virtio_blk_dma_restart_cb(void *opaque, int running,
 {
     VirtIOBlock *s = opaque;
 
-    if (!running)
+    if (!running) {
+        /* qemu_drain_all() doesn't know about data plane, quiesce here */
+        if (s->dataplane) {
+            virtio_blk_data_plane_drain(s->dataplane);
+        }
         return;
+    }
 
     if (!s->bh) {
         s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
@@ -538,6 +555,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
     VirtIOBlock *s = to_virtio_blk(vdev);
     uint32_t features;
 
+    if (s->dataplane && !(status & VIRTIO_CONFIG_S_DRIVER)) {
+        virtio_blk_data_plane_stop(s->dataplane);
+    }
+
     if (!(status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return;
     }
@@ -604,6 +625,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
 {
     VirtIOBlock *s;
     static int virtio_blk_id;
+    int fd = -1;
 
     if (!blk->conf.bs) {
         error_report("drive property not set");
@@ -619,6 +641,21 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
         return NULL;
     }
 
+    if (blk->data_plane) {
+        if (blk->scsi) {
+            error_report("device is incompatible with x-data-plane, "
+                         "use scsi=off");
+            return NULL;
+        }
+
+        fd = raw_get_aio_fd(blk->conf.bs);
+        if (fd < 0) {
+            error_report("drive is incompatible with x-data-plane, "
+                         "use format=raw,cache=none,aio=native");
+            return NULL;
+        }
+    }
+
     s = (VirtIOBlock *)virtio_common_init("virtio-blk", VIRTIO_ID_BLOCK,
                                           sizeof(struct virtio_blk_config),
                                           sizeof(VirtIOBlock));
@@ -636,6 +673,17 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
 
     s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
 
+    if (fd >= 0) {
+        s->dataplane = virtio_blk_data_plane_create(&s->vdev, fd);
+
+        /* Prevent block operations that conflict with data plane thread */
+        bdrv_set_in_use(s->bs, 1);
+
+        error_setg(&s->migration_blocker,
+                   "x-data-plane does not support migration");
+        migrate_add_blocker(s->migration_blocker);
+    }
+
     qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
     s->qdev = dev;
     register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
@@ -652,6 +700,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
 void virtio_blk_exit(VirtIODevice *vdev)
 {
     VirtIOBlock *s = to_virtio_blk(vdev);
+
+    if (s->dataplane) {
+        migrate_del_blocker(s->migration_blocker);
+        error_free(s->migration_blocker);
+        bdrv_set_in_use(s->bs, 0);
+        virtio_blk_data_plane_destroy(s->dataplane);
+        s->dataplane = NULL;
+    }
+
     unregister_savevm(s->qdev, "virtio-blk", s);
     blockdev_mark_auto_del(s->bs);
     virtio_cleanup(vdev);
diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
index f0740d0..53d7971 100644
--- a/hw/virtio-blk.h
+++ b/hw/virtio-blk.h
@@ -105,6 +105,7 @@ struct VirtIOBlkConf
     char *serial;
     uint32_t scsi;
     uint32_t config_wce;
+    uint32_t data_plane;
 };
 
 #define DEFINE_VIRTIO_BLK_FEATURES(_state, _field) \
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 9603150..401f5ea 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -862,6 +862,9 @@ static Property virtio_blk_properties[] = {
 #endif
     DEFINE_PROP_BIT("config-wce", VirtIOPCIProxy, blk.config_wce, 0, true),
     DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
+#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
+    DEFINE_PROP_BIT("x-data-plane", VirtIOPCIProxy, blk.data_plane, 0, false),
+#endif
     DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
     DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
     DEFINE_PROP_END_OF_LIST(),
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code Stefan Hajnoczi
@ 2012-11-15 15:37   ` Paolo Bonzini
  2012-11-15 20:09   ` Anthony Liguori
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-11-15 15:37 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	khoa, Asias He

On 15/11/2012 16:19, Stefan Hajnoczi wrote:
> +/* Map target physical address to host address
> + */
> +static inline void *phys_to_host(Vring *vring, hwaddr phys)
> +{
> +    /* Adjust for 3.6-4 GB PCI memory range */
> +    if (phys >= 0x100000000) {
> +        phys -= 0x100000000 - 0xe0000000;
> +    } else if (phys >= 0xe0000000) {
> +        fprintf(stderr, "phys_to_host bad physical address in "
> +                "PCI range %#lx\n", phys);
> +        exit(1);
> +    }
> +    return vring->phys_mem_zero_host_ptr + phys;
> +}
> +

Hmm, perhaps *this* is not quite ready. :)

What we want is lockless address_space_map.  We're not far from it, but
not there either.

Can you add, at least for now, a weak function that does a 1:1 mapping,
and override it with the above code in hw/pc.c?  The prototype then would be

static inline void *dataplane_phys_to_host(void *base, hwaddr phys)
{
}

or something like that.
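
For illustration only, the weak 1:1 default could be as small as this
(file placement and naming are just a sketch; the static inline would
have to be dropped so the symbol can actually be overridden at link
time):

    /* hw/dataplane/phys-to-host.c -- illustrative file name */
    #include <stdint.h>
    #include "memory.h"   /* for hwaddr, as in hw/dataplane/vring.h */

    /* Weak 1:1 default; hw/pc.c would override it with a strong
     * definition that adjusts for the 3.6-4 GB PCI memory hole as in
     * the code quoted above. */
    void * __attribute__((weak)) dataplane_phys_to_host(void *base,
                                                        hwaddr phys)
    {
        return (uint8_t *)base + phys;
    }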

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
@ 2012-11-15 18:48   ` Michael S. Tsirkin
  2012-11-15 19:34     ` Khoa Huynh
  2012-11-15 21:08   ` Anthony Liguori
  1 sibling, 1 reply; 38+ messages in thread
From: Michael S. Tsirkin @ 2012-11-15 18:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, khoa, Paolo Bonzini, Asias He

On Thu, Nov 15, 2012 at 04:19:06PM +0100, Stefan Hajnoczi wrote:
> The virtio-blk-data-plane feature is easy to integrate into
> hw/virtio-blk.c.  The data plane can be started and stopped similar to
> vhost-net.
> 
> Users can take advantage of the virtio-blk-data-plane feature using the
> new -device virtio-blk-pci,x-data-plane=on property.
> 
> The x-data-plane name was chosen because at this stage the feature is
> experimental and likely to see changes in the future.
> 
> If the VM configuration does not support virtio-blk-data-plane an error
> message is printed.  Although we could fall back to regular virtio-blk,
> I prefer the explicit approach since it prompts the user to fix their
> configuration if they want the performance benefit of
> virtio-blk-data-plane.
> 
> Limitations:
>  * Only format=raw is supported
>  * Live migration is not supported
>  * Block jobs, hot unplug, and other operations fail with -EBUSY
>  * I/O throttling limits are ignored
>  * Only Linux hosts are supported due to Linux AIO usage
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>


Would be very interested in learning about the performance
impact of this. How does this compare to the current model and
to vhost-blk?


> ---
>  hw/virtio-blk.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/virtio-blk.h |  1 +
>  hw/virtio-pci.c |  3 +++
>  3 files changed, 62 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index e25cc96..7f6004e 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -17,6 +17,8 @@
>  #include "hw/block-common.h"
>  #include "blockdev.h"
>  #include "virtio-blk.h"
> +#include "hw/dataplane/virtio-blk.h"
> +#include "migration.h"
>  #include "scsi-defs.h"
>  #ifdef __linux__
>  # include <scsi/sg.h>
> @@ -33,6 +35,8 @@ typedef struct VirtIOBlock
>      VirtIOBlkConf *blk;
>      unsigned short sector_mask;
>      DeviceState *qdev;
> +    VirtIOBlockDataPlane *dataplane;
> +    Error *migration_blocker;
>  } VirtIOBlock;
>  
>  static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
> @@ -407,6 +411,14 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          .num_writes = 0,
>      };
>  
> +    /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
> +     * dataplane here instead of waiting for .set_status().
> +     */
> +    if (s->dataplane) {
> +        virtio_blk_data_plane_start(s->dataplane);
> +        return;
> +    }
> +
>      while ((req = virtio_blk_get_request(s))) {
>          virtio_blk_handle_request(req, &mrb);
>      }
> @@ -446,8 +458,13 @@ static void virtio_blk_dma_restart_cb(void *opaque, int running,
>  {
>      VirtIOBlock *s = opaque;
>  
> -    if (!running)
> +    if (!running) {
> +        /* qemu_drain_all() doesn't know about data plane, quiesce here */
> +        if (s->dataplane) {
> +            virtio_blk_data_plane_drain(s->dataplane);
> +        }
>          return;
> +    }
>  
>      if (!s->bh) {
>          s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
> @@ -538,6 +555,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
>      VirtIOBlock *s = to_virtio_blk(vdev);
>      uint32_t features;
>  
> +    if (s->dataplane && !(status & VIRTIO_CONFIG_S_DRIVER)) {
> +        virtio_blk_data_plane_stop(s->dataplane);
> +    }
> +
>      if (!(status & VIRTIO_CONFIG_S_DRIVER_OK)) {
>          return;
>      }
> @@ -604,6 +625,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  {
>      VirtIOBlock *s;
>      static int virtio_blk_id;
> +    int fd = -1;
>  
>      if (!blk->conf.bs) {
>          error_report("drive property not set");
> @@ -619,6 +641,21 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>          return NULL;
>      }
>  
> +    if (blk->data_plane) {
> +        if (blk->scsi) {
> +            error_report("device is incompatible with x-data-plane, "
> +                         "use scsi=off");
> +            return NULL;
> +        }
> +
> +        fd = raw_get_aio_fd(blk->conf.bs);
> +        if (fd < 0) {
> +            error_report("drive is incompatible with x-data-plane, "
> +                         "use format=raw,cache=none,aio=native");
> +            return NULL;
> +        }
> +    }
> +
>      s = (VirtIOBlock *)virtio_common_init("virtio-blk", VIRTIO_ID_BLOCK,
>                                            sizeof(struct virtio_blk_config),
>                                            sizeof(VirtIOBlock));
> @@ -636,6 +673,17 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  
>      s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
>  
> +    if (fd >= 0) {
> +        s->dataplane = virtio_blk_data_plane_create(&s->vdev, fd);
> +
> +        /* Prevent block operations that conflict with data plane thread */
> +        bdrv_set_in_use(s->bs, 1);
> +
> +        error_setg(&s->migration_blocker,
> +                   "x-data-plane does not support migration");
> +        migrate_add_blocker(s->migration_blocker);
> +    }
> +
>      qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
>      s->qdev = dev;
>      register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
> @@ -652,6 +700,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  void virtio_blk_exit(VirtIODevice *vdev)
>  {
>      VirtIOBlock *s = to_virtio_blk(vdev);
> +
> +    if (s->dataplane) {
> +        migrate_del_blocker(s->migration_blocker);
> +        error_free(s->migration_blocker);
> +        bdrv_set_in_use(s->bs, 0);
> +        virtio_blk_data_plane_destroy(s->dataplane);
> +        s->dataplane = NULL;
> +    }
> +
>      unregister_savevm(s->qdev, "virtio-blk", s);
>      blockdev_mark_auto_del(s->bs);
>      virtio_cleanup(vdev);
> diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
> index f0740d0..53d7971 100644
> --- a/hw/virtio-blk.h
> +++ b/hw/virtio-blk.h
> @@ -105,6 +105,7 @@ struct VirtIOBlkConf
>      char *serial;
>      uint32_t scsi;
>      uint32_t config_wce;
> +    uint32_t data_plane;
>  };
>  
>  #define DEFINE_VIRTIO_BLK_FEATURES(_state, _field) \
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index 9603150..401f5ea 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -862,6 +862,9 @@ static Property virtio_blk_properties[] = {
>  #endif
>      DEFINE_PROP_BIT("config-wce", VirtIOPCIProxy, blk.config_wce, 0, true),
>      DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
> +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> +    DEFINE_PROP_BIT("x-data-plane", VirtIOPCIProxy, blk.data_plane, 0, false),
> +#endif
>      DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
>      DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
>      DEFINE_PROP_END_OF_LIST(),
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-15 18:48   ` Michael S. Tsirkin
@ 2012-11-15 19:34     ` Khoa Huynh
  2012-11-15 21:11       ` Anthony Liguori
  0 siblings, 1 reply; 38+ messages in thread
From: Khoa Huynh @ 2012-11-15 19:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Asias He


"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/15/2012 12:48:49 PM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Stefan Hajnoczi <stefanha@redhat.com>,
> Cc: qemu-devel@nongnu.org, Anthony Liguori/Austin/IBM@IBMUS, Paolo
> Bonzini <pbonzini@redhat.com>, Kevin Wolf <kwolf@redhat.com>, Asias
> He <asias@redhat.com>, Khoa Huynh/Austin/IBM@IBMUS
> Date: 11/15/2012 12:46 PM
> Subject: Re: [PATCH 7/7] virtio-blk: add x-data-plane=on|off
> performance feature
>
> On Thu, Nov 15, 2012 at 04:19:06PM +0100, Stefan Hajnoczi wrote:
> > The virtio-blk-data-plane feature is easy to integrate into
> > hw/virtio-blk.c.  The data plane can be started and stopped similar to
> > vhost-net.
> >
> > Users can take advantage of the virtio-blk-data-plane feature using the
> > new -device virtio-blk-pci,x-data-plane=on property.
> >
> > The x-data-plane name was chosen because at this stage the feature is
> > experimental and likely to see changes in the future.
> >
> > If the VM configuration does not support virtio-blk-data-plane an error
> > message is printed.  Although we could fall back to regular virtio-blk,
> > I prefer the explicit approach since it prompts the user to fix their
> > configuration if they want the performance benefit of
> > virtio-blk-data-plane.
> >
> > Limitations:
> >  * Only format=raw is supported
> >  * Live migration is not supported
> >  * Block jobs, hot unplug, and other operations fail with -EBUSY
> >  * I/O throttling limits are ignored
> >  * Only Linux hosts are supported due to Linux AIO usage
> >
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>
>
> Would be very interested in learning about the performance
> impact of this. How does this compare to the current model and
> to vhost-blk?

I plan to do a complete evaluation of this patchset in the coming days.
However, I have done quite a bit of performance testing on earlier versions
of the data-plane and vhost-blk code bits. Here's what I found:

1) The existing kvm/qemu code can only handle up to about 150,000 IOPS for
a single KVM guest.  The bottleneck here is the global qemu mutex.

2) With performance tuning, I was able to achieve 1.33 million IOPS for a
single KVM guest with data-plane. This is very close to the 1.4-million-IOPS
limit of my storage setup.

3) Similarly, I was able to get up to 1.34 million IOPS for a single KVM
guest with an earlier prototype of vhost-blk (using Linux AIO, from Liu
Yuan).

This shows that bypassing the global qemu mutex would allow us to achieve
much higher I/O rates.  In fact, these I/O rates would handily beat claims
from all other competing hypervisors.

I am currently evaluating the latest vhost-blk code bits from Asias He and
will report results soon. I have already got past 1 million IOPS per guest
with Asias' code and am trying to see if I can get even higher I/O rates.
And as I mentioned earlier, I'll also evaluate this latest data-plane code
from Stefan real soon.  Please stay tuned...

-Khoa

>
>
> > ---
> >  hw/virtio-blk.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  hw/virtio-blk.h |  1 +
> >  hw/virtio-pci.c |  3 +++
> >  3 files changed, 62 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> > index e25cc96..7f6004e 100644
> > --- a/hw/virtio-blk.c
> > +++ b/hw/virtio-blk.c
> > @@ -17,6 +17,8 @@
> >  #include "hw/block-common.h"
> >  #include "blockdev.h"
> >  #include "virtio-blk.h"
> > +#include "hw/dataplane/virtio-blk.h"
> > +#include "migration.h"
> >  #include "scsi-defs.h"
> >  #ifdef __linux__
> >  # include <scsi/sg.h>
> > @@ -33,6 +35,8 @@ typedef struct VirtIOBlock
> >      VirtIOBlkConf *blk;
> >      unsigned short sector_mask;
> >      DeviceState *qdev;
> > +    VirtIOBlockDataPlane *dataplane;
> > +    Error *migration_blocker;
> >  } VirtIOBlock;
> >
> >  static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
> > @@ -407,6 +411,14 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >          .num_writes = 0,
> >      };
> >
> > +    /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
> > +     * dataplane here instead of waiting for .set_status().
> > +     */
> > +    if (s->dataplane) {
> > +        virtio_blk_data_plane_start(s->dataplane);
> > +        return;
> > +    }
> > +
> >      while ((req = virtio_blk_get_request(s))) {
> >          virtio_blk_handle_request(req, &mrb);
> >      }
> > @@ -446,8 +458,13 @@ static void virtio_blk_dma_restart_cb(void *opaque, int running,
> >  {
> >      VirtIOBlock *s = opaque;
> >
> > -    if (!running)
> > +    if (!running) {
> > +        /* qemu_drain_all() doesn't know about data plane, quiesce here */
> > +        if (s->dataplane) {
> > +            virtio_blk_data_plane_drain(s->dataplane);
> > +        }
> >          return;
> > +    }
> >
> >      if (!s->bh) {
> >          s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
> > @@ -538,6 +555,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
> >      VirtIOBlock *s = to_virtio_blk(vdev);
> >      uint32_t features;
> >
> > +    if (s->dataplane && !(status & VIRTIO_CONFIG_S_DRIVER)) {
> > +        virtio_blk_data_plane_stop(s->dataplane);
> > +    }
> > +
> >      if (!(status & VIRTIO_CONFIG_S_DRIVER_OK)) {
> >          return;
> >      }
> > @@ -604,6 +625,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
> >  {
> >      VirtIOBlock *s;
> >      static int virtio_blk_id;
> > +    int fd = -1;
> >
> >      if (!blk->conf.bs) {
> >          error_report("drive property not set");
> > @@ -619,6 +641,21 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
> >          return NULL;
> >      }
> >
> > +    if (blk->data_plane) {
> > +        if (blk->scsi) {
> > +            error_report("device is incompatible with x-data-plane, "
> > +                         "use scsi=off");
> > +            return NULL;
> > +        }
> > +
> > +        fd = raw_get_aio_fd(blk->conf.bs);
> > +        if (fd < 0) {
> > +            error_report("drive is incompatible with x-data-plane, "
> > +                         "use format=raw,cache=none,aio=native");
> > +            return NULL;
> > +        }
> > +    }
> > +
> >      s = (VirtIOBlock *)virtio_common_init("virtio-blk", VIRTIO_ID_BLOCK,
> >                                            sizeof(struct virtio_blk_config),
> >                                            sizeof(VirtIOBlock));
> > @@ -636,6 +673,17 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
> >
> >      s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
> >
> > +    if (fd >= 0) {
> > +        s->dataplane = virtio_blk_data_plane_create(&s->vdev, fd);
> > +
> > +        /* Prevent block operations that conflict with data plane thread */
> > +        bdrv_set_in_use(s->bs, 1);
> > +
> > +        error_setg(&s->migration_blocker,
> > +                   "x-data-plane does not support migration");
> > +        migrate_add_blocker(s->migration_blocker);
> > +    }
> > +
> >      qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
> >      s->qdev = dev;
> >      register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
> > @@ -652,6 +700,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
> >  void virtio_blk_exit(VirtIODevice *vdev)
> >  {
> >      VirtIOBlock *s = to_virtio_blk(vdev);
> > +
> > +    if (s->dataplane) {
> > +        migrate_del_blocker(s->migration_blocker);
> > +        error_free(s->migration_blocker);
> > +        bdrv_set_in_use(s->bs, 0);
> > +        virtio_blk_data_plane_destroy(s->dataplane);
> > +        s->dataplane = NULL;
> > +    }
> > +
> >      unregister_savevm(s->qdev, "virtio-blk", s);
> >      blockdev_mark_auto_del(s->bs);
> >      virtio_cleanup(vdev);
> > diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
> > index f0740d0..53d7971 100644
> > --- a/hw/virtio-blk.h
> > +++ b/hw/virtio-blk.h
> > @@ -105,6 +105,7 @@ struct VirtIOBlkConf
> >      char *serial;
> >      uint32_t scsi;
> >      uint32_t config_wce;
> > +    uint32_t data_plane;
> >  };
> >
> >  #define DEFINE_VIRTIO_BLK_FEATURES(_state, _field) \
> > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> > index 9603150..401f5ea 100644
> > --- a/hw/virtio-pci.c
> > +++ b/hw/virtio-pci.c
> > @@ -862,6 +862,9 @@ static Property virtio_blk_properties[] = {
> >  #endif
> >      DEFINE_PROP_BIT("config-wce", VirtIOPCIProxy, blk.config_wce,0,
true),
> >      DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
> VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
> > +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> > +    DEFINE_PROP_BIT("x-data-plane", VirtIOPCIProxy,
> blk.data_plane, 0, false),
> > +#endif
> >      DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
> >      DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
> >      DEFINE_PROP_END_OF_LIST(),
> > --
> > 1.8.0
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
@ 2012-11-15 20:03   ` Anthony Liguori
  2012-11-16  6:15     ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Anthony Liguori @ 2012-11-15 20:03 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, Michael S. Tsirkin, Khoa Huynh, Paolo Bonzini, Asias He

Stefan Hajnoczi <stefanha@redhat.com> writes:

> The raw_get_aio_fd() function allows virtio-blk-data-plane to get the
> file descriptor of a raw image file with Linux AIO enabled.  This
> interface is really a layering violation that can be resolved once the
> block layer is able to run outside the global mutex - at that point
> virtio-blk-data-plane will switch from custom Linux AIO code to using
> the block layer.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

I think this creates user confusion because virtio-blk-data-plane can't
actually take a BDS.

So why not just make a string 'filename' property and open it directly
in virtio-blk-data-plane?  Then it's at least clear to the user and
management tools what the device is capable of doing.
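
For illustration, the filename-based alternative might look roughly like
this (the property name, helper, and open flags below are hypothetical,
not code from this series):

    /* a string property, e.g. DEFINE_PROP_STRING("x-data-plane-file", ...),
     * would feed a helper along these lines: */
    #include <fcntl.h>

    static int data_plane_open_file(const char *filename)
    {
        int fd = qemu_open(filename, O_RDWR | O_DIRECT);
        if (fd < 0) {
            error_report("could not open '%s' for x-data-plane", filename);
            return -1;
        }
        return fd;
    }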

Regards,

Anthony Liguori

> ---
>  block.h           |  9 +++++++++
>  block/raw-posix.c | 34 ++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
>
> diff --git a/block.h b/block.h
> index 722c620..2dc6aaf 100644
> --- a/block.h
> +++ b/block.h
> @@ -365,6 +365,15 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
>  void bdrv_set_in_use(BlockDriverState *bs, int in_use);
>  int bdrv_in_use(BlockDriverState *bs);
>  
> +#ifdef CONFIG_LINUX_AIO
> +int raw_get_aio_fd(BlockDriverState *bs);
> +#else
> +static inline int raw_get_aio_fd(BlockDriverState *bs)
> +{
> +    return -ENOTSUP;
> +}
> +#endif
> +
>  enum BlockAcctType {
>      BDRV_ACCT_READ,
>      BDRV_ACCT_WRITE,
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index f2f0404..fc04981 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -1768,6 +1768,40 @@ static BlockDriver bdrv_host_cdrom = {
>  };
>  #endif /* __FreeBSD__ */
>  
> +#ifdef CONFIG_LINUX_AIO
> +/**
> + * Return the file descriptor for Linux AIO
> + *
> + * This function is a layering violation and should be removed when it becomes
> + * possible to call the block layer outside the global mutex.  It allows the
> + * caller to hijack the file descriptor so I/O can be performed outside the
> + * block layer.
> + */
> +int raw_get_aio_fd(BlockDriverState *bs)
> +{
> +    BDRVRawState *s;
> +
> +    if (!bs->drv) {
> +        return -ENOMEDIUM;
> +    }
> +
> +    if (bs->drv == bdrv_find_format("raw")) {
> +        bs = bs->file;
> +    }
> +
> +    /* raw-posix has several protocols so just check for raw_aio_readv */
> +    if (bs->drv->bdrv_aio_readv != raw_aio_readv) {
> +        return -ENOTSUP;
> +    }
> +
> +    s = bs->opaque;
> +    if (!s->use_aio) {
> +        return -ENOTSUP;
> +    }
> +    return s->fd;
> +}
> +#endif /* CONFIG_LINUX_AIO */
> +
>  static void bdrv_file_init(void)
>  {
>      /*
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code Stefan Hajnoczi
  2012-11-15 15:37   ` Paolo Bonzini
@ 2012-11-15 20:09   ` Anthony Liguori
  2012-11-16  6:24     ` Stefan Hajnoczi
  2012-11-16  7:48   ` Christian Borntraeger
  2012-11-17 16:15   ` Blue Swirl
  3 siblings, 1 reply; 38+ messages in thread
From: Anthony Liguori @ 2012-11-15 20:09 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, Michael S. Tsirkin, khoa, Paolo Bonzini, Asias He

Stefan Hajnoczi <stefanha@redhat.com> writes:

> The virtio-blk-data-plane cannot access memory using the usual QEMU
> functions since it executes outside the global mutex and the memory APIs
> are not thread-safe at this time.
>
> This patch introduces a virtqueue module based on the kernel's vhost
> vring code.  The trick is that we map guest memory ahead of time and
> access it cheaply outside the global mutex.
>
> Once the hardware emulation code can execute outside the global mutex it
> will be possible to drop this code.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/Makefile.objs           |   2 +-
>  hw/dataplane/Makefile.objs |   3 +
>  hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/vring.h       |  54 ++++++++
>  trace-events               |   3 +
>  5 files changed, 382 insertions(+), 1 deletion(-)
>  create mode 100644 hw/dataplane/Makefile.objs
>  create mode 100644 hw/dataplane/vring.c
>  create mode 100644 hw/dataplane/vring.h
>
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index af4ab0c..da8ef0c 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -1,4 +1,4 @@
> -common-obj-y = usb/ ide/
> +common-obj-y = usb/ ide/ dataplane/
>  common-obj-y += loader.o
>  common-obj-$(CONFIG_VIRTIO) += virtio-console.o
>  common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> new file mode 100644
> index 0000000..b58544f
> --- /dev/null
> +++ b/hw/dataplane/Makefile.objs
> @@ -0,0 +1,3 @@
> +ifeq ($(CONFIG_VIRTIO), y)
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o
> +endif
> diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
> new file mode 100644
> index 0000000..6aacce8
> --- /dev/null
> +++ b/hw/dataplane/vring.c
> @@ -0,0 +1,321 @@
> +/* Copyright 2012 Red Hat, Inc.
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#include "trace.h"
> +#include "hw/dataplane/vring.h"
> +
> +/* Map target physical address to host address
> + */
> +static inline void *phys_to_host(Vring *vring, hwaddr phys)
> +{
> +    /* Adjust for 3.6-4 GB PCI memory range */
> +    if (phys >= 0x100000000) {
> +        phys -= 0x100000000 - 0xe0000000;
> +    } else if (phys >= 0xe0000000) {
> +        fprintf(stderr, "phys_to_host bad physical address in "
> +                "PCI range %#lx\n", phys);
> +        exit(1);
> +    }
> +    return vring->phys_mem_zero_host_ptr + phys;
> +}

This is too dirty to merge.  This code should use a slot mechanism
similar to what vhost uses.  Then it has a plausible chance of working
on other architectures.

> +
> +/* Setup for cheap target physical to host address conversion
> + *
> + * This is a hack for direct access to guest memory, we're not really allowed
> + * to do this.
> + */
> +static void setup_phys_to_host(Vring *vring)
> +{
> +    hwaddr len = 4096; /* RAM is really much larger but we cheat */
> +    vring->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
> +    if (!vring->phys_mem_zero_host_ptr) {
> +        fprintf(stderr, "setup_phys_to_host failed\n");
> +        exit(1);
> +    }
> +}

You should be able to use a MemoryListener to keep the memory slot table
up to date.  You'll need to use a mutex to protect it but again since
it's mostly uncontended, that shouldn't be a problem. 
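
For illustration only, the listener-maintained slot table might look
roughly like this (struct and function names are made up, the exact
MemoryListener callback signatures are from memory, and region_del plus
teardown handling are omitted):

    #include <stdint.h>
    #include "memory.h"        /* MemoryListener, MemoryRegionSection */
    #include "qemu-thread.h"   /* QemuMutex */

    #define MAX_SLOTS 64

    typedef struct {
        hwaddr   guest_addr;   /* offset within guest physical memory */
        uint64_t size;
        void    *host_ptr;     /* host pointer to start of the region */
    } HostMemSlot;

    static QemuMutex slot_lock;          /* qemu_mutex_init() at setup */
    static HostMemSlot slots[MAX_SLOTS];
    static unsigned int nslots;

    static void hostmem_region_add(MemoryListener *listener,
                                   MemoryRegionSection *section)
    {
        if (!memory_region_is_ram(section->mr)) {
            return;
        }
        qemu_mutex_lock(&slot_lock);
        if (nslots < MAX_SLOTS) {
            slots[nslots].guest_addr = section->offset_within_address_space;
            slots[nslots].size       = section->size;
            slots[nslots].host_ptr   =
                (uint8_t *)memory_region_get_ram_ptr(section->mr) +
                section->offset_within_region;
            nslots++;
        }
        qemu_mutex_unlock(&slot_lock);
    }

    /* registered during setup with memory_listener_register(); a
     * matching .region_del callback would remove the slot again */
    static MemoryListener hostmem_listener = {
        .region_add = hostmem_region_add,
    };

    /* phys_to_host() then becomes a (mostly uncontended) lookup: */
    static void *slot_phys_to_host(hwaddr phys)
    {
        void *ret = NULL;
        unsigned int i;

        qemu_mutex_lock(&slot_lock);
        for (i = 0; i < nslots; i++) {
            if (phys >= slots[i].guest_addr &&
                phys - slots[i].guest_addr < slots[i].size) {
                ret = (uint8_t *)slots[i].host_ptr +
                      (phys - slots[i].guest_addr);
                break;
            }
        }
        qemu_mutex_unlock(&slot_lock);
        return ret;
    }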

Regards,

Anthony Liguori

> +
> +/* Map the guest's vring to host memory
> + *
> + * This is not allowed but we know the ring won't move.
> + */
> +void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
> +{
> +    setup_phys_to_host(vring);
> +
> +    vring_init(&vring->vr, virtio_queue_get_num(vdev, n),
> +               phys_to_host(vring, virtio_queue_get_ring_addr(vdev, n)), 4096);
> +
> +    vring->last_avail_idx = 0;
> +    vring->last_used_idx = 0;
> +    vring->signalled_used = 0;
> +    vring->signalled_used_valid = false;
> +
> +    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
> +                      vring->vr.desc, vring->vr.avail, vring->vr.used);
> +}
> +
> +/* Toggle guest->host notifies */
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable)
> +{
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        if (enable) {
> +            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> +        }
> +    } else if (enable) {
> +        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> +    } else {
> +        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
> +    }
> +}
> +
> +/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
> +{
> +    uint16_t old, new;
> +    bool v;
> +    /* Flush out used index updates. This is paired
> +     * with the barrier that the Guest executes when enabling
> +     * interrupts. */
> +    smp_mb();
> +
> +    if ((vdev->guest_features & VIRTIO_F_NOTIFY_ON_EMPTY) &&
> +        unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
> +        return true;
> +    }
> +
> +    if (!(vdev->guest_features & VIRTIO_RING_F_EVENT_IDX)) {
> +        return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
> +    }
> +    old = vring->signalled_used;
> +    v = vring->signalled_used_valid;
> +    new = vring->signalled_used = vring->last_used_idx;
> +    vring->signalled_used_valid = true;
> +
> +    if (unlikely(!v)) {
> +        return true;
> +    }
> +
> +    return vring_need_event(vring_used_event(&vring->vr), new, old);
> +}
> +
> +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
> +static bool get_indirect(Vring *vring,
> +                         struct iovec iov[], struct iovec *iov_end,
> +                         unsigned int *out_num, unsigned int *in_num,
> +                         struct vring_desc *indirect)
> +{
> +    struct vring_desc desc;
> +    unsigned int i = 0, count, found = 0;
> +
> +    /* Sanity check */
> +    if (unlikely(indirect->len % sizeof desc)) {
> +        fprintf(stderr, "Invalid length in indirect descriptor: "
> +               "len 0x%llx not multiple of 0x%zx\n",
> +               (unsigned long long)indirect->len,
> +               sizeof desc);
> +        exit(1);
> +    }
> +
> +    count = indirect->len / sizeof desc;
> +    /* Buffers are chained via a 16 bit next field, so
> +     * we can have at most 2^16 of these. */
> +    if (unlikely(count > USHRT_MAX + 1)) {
> +        fprintf(stderr, "Indirect buffer length too big: %d\n",
> +               indirect->len);
> +        exit(1);
> +    }
> +
> +    /* Point to translate indirect desc chain */
> +    indirect = phys_to_host(vring, indirect->addr);
> +
> +    /* We will use the result as an address to read from, so most
> +     * architectures only need a compiler barrier here. */
> +    barrier(); /* read_barrier_depends(); */
> +
> +    do {
> +        if (unlikely(++found > count)) {
> +            fprintf(stderr, "Loop detected: last one at %u "
> +                   "indirect size %u\n",
> +                   i, count);
> +            exit(1);
> +        }
> +
> +        desc = *indirect++;
> +        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
> +            fprintf(stderr, "Nested indirect descriptor\n");
> +            exit(1);
> +        }
> +
> +        /* Stop for now if there are not enough iovecs available. */
> +        if (iov >= iov_end) {
> +            return false;
> +        }
> +
> +        iov->iov_base = phys_to_host(vring, desc.addr);
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +        /* If this is an input descriptor, increment that count. */
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                fprintf(stderr, "Indirect descriptor "
> +                       "has out after in: idx %d\n", i);
> +                exit(1);
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +    return true;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which is
> + * never a valid descriptor number) if none was found.  A negative code is
> + * returned on error.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num)
> +{
> +    struct vring_desc desc;
> +    unsigned int i, head, found = 0, num = vring->vr.num;
> +    __u16 avail_idx, last_avail_idx;
> +
> +    /* Check it isn't doing very strange things with descriptor numbers. */
> +    last_avail_idx = vring->last_avail_idx;
> +    avail_idx = vring->vr.avail->idx;
> +
> +    if (unlikely((__u16)(avail_idx - last_avail_idx) > num)) {
> +        fprintf(stderr, "Guest moved used index from %u to %u\n",
> +                last_avail_idx, avail_idx);
> +        exit(1);
> +    }
> +
> +    /* If there's nothing new since last we looked. */
> +    if (avail_idx == last_avail_idx) {
> +        return -EAGAIN;
> +    }
> +
> +    /* Only get avail ring entries after they have been exposed by guest. */
> +    smp_rmb();
> +
> +    /* Grab the next descriptor number they're advertising, and increment
> +     * the index we've seen. */
> +    head = vring->vr.avail->ring[last_avail_idx % num];
> +
> +    /* If their number is silly, that's an error. */
> +    if (unlikely(head >= num)) {
> +        fprintf(stderr, "Guest says index %u > %u is available\n",
> +                head, num);
> +        exit(1);
> +    }
> +
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> +    }
> +
> +    /* When we start there are none of either input nor output. */
> +    *out_num = *in_num = 0;
> +
> +    i = head;
> +    do {
> +        if (unlikely(i >= num)) {
> +            fprintf(stderr, "Desc index is %u > %u, head = %u\n",
> +                    i, num, head);
> +            exit(1);
> +        }
> +        if (unlikely(++found > num)) {
> +            fprintf(stderr, "Loop detected: last one at %u "
> +                    "vq size %u head %u\n",
> +                    i, num, head);
> +            exit(1);
> +        }
> +        desc = vring->vr.desc[i];
> +        if (desc.flags & VRING_DESC_F_INDIRECT) {
> +            if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
> +                return -ENOBUFS; /* not enough iovecs, stop for now */
> +            }
> +            continue;
> +        }
> +
> +        /* If there are not enough iovecs left, stop for now.  The caller
> +         * should check if there are more descs available once they have dealt
> +         * with the current set.
> +         */
> +        if (iov >= iov_end) {
> +            return -ENOBUFS;
> +        }
> +
> +        iov->iov_base = phys_to_host(vring, desc.addr);
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            /* If this is an input descriptor,
> +             * increment that count. */
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                fprintf(stderr, "Descriptor has out after in: "
> +                        "idx %d\n", i);
> +                exit(1);
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +
> +    /* On success, increment avail index. */
> +    vring->last_avail_idx++;
> +    return head;
> +}
> +
> +/* After we've used one of their buffers, we tell them about it.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +void vring_push(Vring *vring, unsigned int head, int len)
> +{
> +    struct vring_used_elem *used;
> +    uint16_t new;
> +
> +    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
> +     * next entry in that used ring. */
> +    used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
> +    used->id = head;
> +    used->len = len;
> +
> +    /* Make sure buffer is written before we update index. */
> +    smp_wmb();
> +
> +    new = vring->vr.used->idx = ++vring->last_used_idx;
> +    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
> +        vring->signalled_used_valid = false;
> +    }
> +}
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> new file mode 100644
> index 0000000..42c2f0a
> --- /dev/null
> +++ b/hw/dataplane/vring.h
> @@ -0,0 +1,54 @@
> +/* Copyright 2012 Red Hat, Inc. and/or its affiliates
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#ifndef VRING_H
> +#define VRING_H
> +
> +#include <linux/virtio_ring.h>
> +#include "qemu-common.h"
> +#include "qemu-barrier.h"
> +#include "memory.h"
> +#include "hw/virtio.h"
> +
> +typedef struct {
> +    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
> +    struct vring vr;                /* virtqueue vring mapped to host memory */
> +    __u16 last_avail_idx;           /* last processed avail ring index */
> +    __u16 last_used_idx;            /* last processed used ring index */
> +    uint16_t signalled_used;        /* EVENT_IDX state */
> +    bool signalled_used_valid;
> +} Vring;
> +
> +static inline unsigned int vring_get_num(Vring *vring)
> +{
> +    return vring->vr.num;
> +}
> +
> +/* Are there more descriptors available? */
> +static inline bool vring_more_avail(Vring *vring)
> +{
> +    return vring->vr.avail->idx != vring->last_avail_idx;
> +}
> +
> +void vring_setup(Vring *vring, VirtIODevice *vdev, int n);
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable);
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num);
> +void vring_push(Vring *vring, unsigned int head, int len);
> +
> +#endif /* VRING_H */
> diff --git a/trace-events b/trace-events
> index e1a37cc..8eeab34 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -98,6 +98,9 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
>  virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  
> +# hw/dataplane/vring.c
> +vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
> +
>  # thread-pool.c
>  thread_pool_submit(void *req, void *opaque) "req %p opaque %p"
>  thread_pool_complete(void *req, void *opaque, int ret) "req %p opaque %p ret %d"
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
  2012-11-15 18:48   ` Michael S. Tsirkin
@ 2012-11-15 21:08   ` Anthony Liguori
  2012-11-16  6:22     ` Stefan Hajnoczi
  2012-11-16  7:40     ` Paolo Bonzini
  1 sibling, 2 replies; 38+ messages in thread
From: Anthony Liguori @ 2012-11-15 21:08 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, Michael S. Tsirkin, Khoa Huynh, Paolo Bonzini, Asias He

Stefan Hajnoczi <stefanha@redhat.com> writes:

> The virtio-blk-data-plane feature is easy to integrate into
> hw/virtio-blk.c.  The data plane can be started and stopped similar to
> vhost-net.
>
> Users can take advantage of the virtio-blk-data-plane feature using the
> new -device virtio-blk-pci,x-data-plane=on property.

I don't think this should be a property of virtio-blk-pci but rather a
separate device.

It's an alternative interface with a slightly different guest view.  

Regards,

Anthony Liguori

>
> The x-data-plane name was chosen because at this stage the feature is
> experimental and likely to see changes in the future.
>
> If the VM configuration does not support virtio-blk-data-plane an error
> message is printed.  Although we could fall back to regular virtio-blk,
> I prefer the explicit approach since it prompts the user to fix their
> configuration if they want the performance benefit of
> virtio-blk-data-plane.
>
> Limitations:
>  * Only format=raw is supported
>  * Live migration is not supported
>  * Block jobs, hot unplug, and other operations fail with -EBUSY
>  * I/O throttling limits are ignored
>  * Only Linux hosts are supported due to Linux AIO usage
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/virtio-blk.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/virtio-blk.h |  1 +
>  hw/virtio-pci.c |  3 +++
>  3 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index e25cc96..7f6004e 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -17,6 +17,8 @@
>  #include "hw/block-common.h"
>  #include "blockdev.h"
>  #include "virtio-blk.h"
> +#include "hw/dataplane/virtio-blk.h"
> +#include "migration.h"
>  #include "scsi-defs.h"
>  #ifdef __linux__
>  # include <scsi/sg.h>
> @@ -33,6 +35,8 @@ typedef struct VirtIOBlock
>      VirtIOBlkConf *blk;
>      unsigned short sector_mask;
>      DeviceState *qdev;
> +    VirtIOBlockDataPlane *dataplane;
> +    Error *migration_blocker;
>  } VirtIOBlock;
>  
>  static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
> @@ -407,6 +411,14 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          .num_writes = 0,
>      };
>  
> +    /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
> +     * dataplane here instead of waiting for .set_status().
> +     */
> +    if (s->dataplane) {
> +        virtio_blk_data_plane_start(s->dataplane);
> +        return;
> +    }
> +
>      while ((req = virtio_blk_get_request(s))) {
>          virtio_blk_handle_request(req, &mrb);
>      }
> @@ -446,8 +458,13 @@ static void virtio_blk_dma_restart_cb(void *opaque, int running,
>  {
>      VirtIOBlock *s = opaque;
>  
> -    if (!running)
> +    if (!running) {
> +        /* qemu_drain_all() doesn't know about data plane, quiesce here */
> +        if (s->dataplane) {
> +            virtio_blk_data_plane_drain(s->dataplane);
> +        }
>          return;
> +    }
>  
>      if (!s->bh) {
>          s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
> @@ -538,6 +555,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
>      VirtIOBlock *s = to_virtio_blk(vdev);
>      uint32_t features;
>  
> +    if (s->dataplane && !(status & VIRTIO_CONFIG_S_DRIVER)) {
> +        virtio_blk_data_plane_stop(s->dataplane);
> +    }
> +
>      if (!(status & VIRTIO_CONFIG_S_DRIVER_OK)) {
>          return;
>      }
> @@ -604,6 +625,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  {
>      VirtIOBlock *s;
>      static int virtio_blk_id;
> +    int fd = -1;
>  
>      if (!blk->conf.bs) {
>          error_report("drive property not set");
> @@ -619,6 +641,21 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>          return NULL;
>      }
>  
> +    if (blk->data_plane) {
> +        if (blk->scsi) {
> +            error_report("device is incompatible with x-data-plane, "
> +                         "use scsi=off");
> +            return NULL;
> +        }
> +
> +        fd = raw_get_aio_fd(blk->conf.bs);
> +        if (fd < 0) {
> +            error_report("drive is incompatible with x-data-plane, "
> +                         "use format=raw,cache=none,aio=native");
> +            return NULL;
> +        }
> +    }
> +
>      s = (VirtIOBlock *)virtio_common_init("virtio-blk", VIRTIO_ID_BLOCK,
>                                            sizeof(struct virtio_blk_config),
>                                            sizeof(VirtIOBlock));
> @@ -636,6 +673,17 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  
>      s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
>  
> +    if (fd >= 0) {
> +        s->dataplane = virtio_blk_data_plane_create(&s->vdev, fd);
> +
> +        /* Prevent block operations that conflict with data plane thread */
> +        bdrv_set_in_use(s->bs, 1);
> +
> +        error_setg(&s->migration_blocker,
> +                   "x-data-plane does not support migration");
> +        migrate_add_blocker(s->migration_blocker);
> +    }
> +
>      qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
>      s->qdev = dev;
>      register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
> @@ -652,6 +700,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, VirtIOBlkConf *blk)
>  void virtio_blk_exit(VirtIODevice *vdev)
>  {
>      VirtIOBlock *s = to_virtio_blk(vdev);
> +
> +    if (s->dataplane) {
> +        migrate_del_blocker(s->migration_blocker);
> +        error_free(s->migration_blocker);
> +        bdrv_set_in_use(s->bs, 0);
> +        virtio_blk_data_plane_destroy(s->dataplane);
> +        s->dataplane = NULL;
> +    }
> +
>      unregister_savevm(s->qdev, "virtio-blk", s);
>      blockdev_mark_auto_del(s->bs);
>      virtio_cleanup(vdev);
> diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
> index f0740d0..53d7971 100644
> --- a/hw/virtio-blk.h
> +++ b/hw/virtio-blk.h
> @@ -105,6 +105,7 @@ struct VirtIOBlkConf
>      char *serial;
>      uint32_t scsi;
>      uint32_t config_wce;
> +    uint32_t data_plane;
>  };
>  
>  #define DEFINE_VIRTIO_BLK_FEATURES(_state, _field) \
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index 9603150..401f5ea 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -862,6 +862,9 @@ static Property virtio_blk_properties[] = {
>  #endif
>      DEFINE_PROP_BIT("config-wce", VirtIOPCIProxy, blk.config_wce, 0, true),
>      DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
> +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> +    DEFINE_PROP_BIT("x-data-plane", VirtIOPCIProxy, blk.data_plane, 0, false),
> +#endif
>      DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
>      DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
>      DEFINE_PROP_END_OF_LIST(),
> -- 
> 1.8.0

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-15 19:34     ` Khoa Huynh
@ 2012-11-15 21:11       ` Anthony Liguori
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Liguori @ 2012-11-15 21:11 UTC (permalink / raw)
  To: Khoa Huynh, Michael S. Tsirkin
  Cc: Kevin Wolf, Paolo Bonzini, Asias He, qemu-devel, Stefan Hajnoczi

Khoa Huynh <khoa@us.ibm.com> writes:

> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/15/2012 12:48:49 PM:
>
>> From: "Michael S. Tsirkin" <mst@redhat.com>
>> To: Stefan Hajnoczi <stefanha@redhat.com>,
>> Cc: qemu-devel@nongnu.org, Anthony Liguori/Austin/IBM@IBMUS, Paolo
>> Bonzini <pbonzini@redhat.com>, Kevin Wolf <kwolf@redhat.com>, Asias
>> He <asias@redhat.com>, Khoa Huynh/Austin/IBM@IBMUS
>> Date: 11/15/2012 12:46 PM
>> Subject: Re: [PATCH 7/7] virtio-blk: add x-data-plane=on|off
>> performance feature
>>
>> On Thu, Nov 15, 2012 at 04:19:06PM +0100, Stefan Hajnoczi wrote:
>> > The virtio-blk-data-plane feature is easy to integrate into
>> > hw/virtio-blk.c.  The data plane can be started and stopped similar to
>> > vhost-net.
>> >
>> > Users can take advantage of the virtio-blk-data-plane feature using the
>> > new -device virtio-blk-pci,x-data-plane=on property.
>> >
>> > The x-data-plane name was chosen because at this stage the feature is
>> > experimental and likely to see changes in the future.
>> >
>> > If the VM configuration does not support virtio-blk-data-plane an error
>> > message is printed.  Although we could fall back to regular virtio-blk,
>> > I prefer the explicit approach since it prompts the user to fix their
>> > configuration if they want the performance benefit of
>> > virtio-blk-data-plane.
>> >
>> > Limitations:
>> >  * Only format=raw is supported
>> >  * Live migration is not supported
>> >  * Block jobs, hot unplug, and other operations fail with -EBUSY
>> >  * I/O throttling limits are ignored
>> >  * Only Linux hosts are supported due to Linux AIO usage
>> >
>> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>
>>
>> Would be very interested in learning about the performance
>> impact of this. How does this compare to current model and
>> to vhost-blk?
>
> I plan to do a complete evaluation of this patchset in the coming days.
> However, I have done quite a bit of performance testing on earlier versions
> of the data-plane and vhost-blk code bits. Here's what I found:
>
> 1) The existing kvm/qemu code can only handle up to about 150,000 IOPS for
> a single KVM guest.  The bottleneck here is the global qemu mutex.
>
> 2) With performance tuning, I was able to achieve 1.33 million IOPS for a
> single KVM guest with data-plane. This is very close to the
> 1.4-million-IOPS limit of my storage setup.

From my POV, if we can get this close to bare metal with
virtio-blk-dataplane, there's absolutely no reason to merge vhost-blk
support.

We simply lose too much with a kernel-based solution.

I'm sure there's more we can do to improve the userspace implementation
too, like a hypercall-based notify and adaptive polling.
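
As a rough sketch of the adaptive polling idea (vring_more_avail() is
from this series; the 50 us budget and the blocking eventfd read() are
illustrative assumptions, not anything in these patches):

#include <stdint.h>
#include <unistd.h>
#include "qemu-timer.h"          /* get_clock(), nanosecond clock */
#include "hw/dataplane/vring.h"  /* vring_more_avail() */

/* Spin on the vring for a short window before falling back to a blocking
 * eventfd read.  This trades a little CPU for fewer guest->host wakeups. */
static void wait_for_vring_kick(int ioeventfd, Vring *vring)
{
    const int64_t poll_ns = 50 * 1000;   /* 50 us polling budget (assumed) */
    int64_t start = get_clock();
    uint64_t count;

    while (!vring_more_avail(vring)) {
        if (get_clock() - start > poll_ns) {
            /* Budget exhausted: block until the guest kicks the virtqueue. */
            if (read(ioeventfd, &count, sizeof(count)) != sizeof(count)) {
                /* Real code would handle EAGAIN/EINTR here. */
            }
            break;
        }
    }
}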

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
  2012-11-15 20:03   ` Anthony Liguori
@ 2012-11-16  6:15     ` Stefan Hajnoczi
  2012-11-16  8:22       ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-16  6:15 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Michael S. Tsirkin, qemu-devel, Khoa Huynh,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 15, 2012 at 9:03 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
>
>> The raw_get_aio_fd() function allows virtio-blk-data-plane to get the
>> file descriptor of a raw image file with Linux AIO enabled.  This
>> interface is really a layering violation that can be resolved once the
>> block layer is able to run outside the global mutex - at that point
>> virtio-blk-data-plane will switch from custom Linux AIO code to using
>> the block layer.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>
> I think this creates user confusion because virtio-blk-data-plane can't
> actually take a BDS.
>
> So why not just make a string 'filename' property and open it directly
> in virtio-blk-data-plane?  Then it's at least clear to the user and
> management tools what the device is capable of doing.

There are some benefits to raw_get_aio_fd():

1. virtio-blk-data-plane is only a subset of the virtio-blk
implementation; it still needs a regular virtio-blk-pci device (with a
BDS) in order to run.  If we used a filename, the user would have to
specify it twice.

2. Fetching the file descriptor in this way ensures that the image
file is format=raw.

3. virtio-blk-data-plane uses Linux AIO and raw-posix.c has checks
which I don't want to duplicate - we can simply check s->use_aio in
raw_get_aio_fd() to confirm that Linux AIO can be used.

If we open a file directly then we lose these benefits.  Do you still
think we should open a filename?
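
For anyone reading along, a rough sketch of the shape such a helper
could take, based only on the three points above (the checks in the
actual patch may differ):

/* Return the raw-posix file descriptor for a drive opened with format=raw
 * and aio=native, or a negative errno.  Sketch only, not the actual patch. */
int raw_get_aio_fd(BlockDriverState *bs)
{
    BDRVRawState *s;

    if (!bs->drv) {
        return -ENOMEDIUM;
    }

    /* Peel off the "raw" format driver so we reach the raw-posix protocol. */
    if (bs->drv == bdrv_find_format("raw")) {
        bs = bs->file;
    }

    s = bs->opaque;
    if (!s->use_aio) {
        return -ENOTSUP;   /* Linux AIO (aio=native) was not enabled */
    }
    return s->fd;
}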

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-15 21:08   ` Anthony Liguori
@ 2012-11-16  6:22     ` Stefan Hajnoczi
  2012-11-19 10:38       ` Kevin Wolf
  2012-11-16  7:40     ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-16  6:22 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Michael S. Tsirkin, qemu-devel, Khoa Huynh,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 15, 2012 at 10:08 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
>
>> The virtio-blk-data-plane feature is easy to integrate into
>> hw/virtio-blk.c.  The data plane can be started and stopped similar to
>> vhost-net.
>>
>> Users can take advantage of the virtio-blk-data-plane feature using the
>> new -device virtio-blk-pci,x-data-plane=on property.
>
> I don't think this should be a property of virtio-blk-pci but rather a
> separate device.

The hw/virtio-blk.c code still needs to be used since
hw/dataplane/virtio-blk.c is only a subset of virtio-blk.

So we're talking about adding a new virtio-blk-data-plane-pci device
type to hw/virtio-pci.c?

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-15 20:09   ` Anthony Liguori
@ 2012-11-16  6:24     ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-16  6:24 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Michael S. Tsirkin, qemu-devel, Khoa Huynh,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

On Thu, Nov 15, 2012 at 9:09 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
>
>> The virtio-blk-data-plane cannot access memory using the usual QEMU
>> functions since it executes outside the global mutex and the memory APIs
>> are not thread-safe at this time.
>>
>> This patch introduces a virtqueue module based on the kernel's vhost
>> vring code.  The trick is that we map guest memory ahead of time and
>> access it cheaply outside the global mutex.
>>
>> Once the hardware emulation code can execute outside the global mutex it
>> will be possible to drop this code.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>  hw/Makefile.objs           |   2 +-
>>  hw/dataplane/Makefile.objs |   3 +
>>  hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++++++++++++
>>  hw/dataplane/vring.h       |  54 ++++++++
>>  trace-events               |   3 +
>>  5 files changed, 382 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/dataplane/Makefile.objs
>>  create mode 100644 hw/dataplane/vring.c
>>  create mode 100644 hw/dataplane/vring.h
>>
>> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
>> index af4ab0c..da8ef0c 100644
>> --- a/hw/Makefile.objs
>> +++ b/hw/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -common-obj-y = usb/ ide/
>> +common-obj-y = usb/ ide/ dataplane/
>>  common-obj-y += loader.o
>>  common-obj-$(CONFIG_VIRTIO) += virtio-console.o
>>  common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
>> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
>> new file mode 100644
>> index 0000000..b58544f
>> --- /dev/null
>> +++ b/hw/dataplane/Makefile.objs
>> @@ -0,0 +1,3 @@
>> +ifeq ($(CONFIG_VIRTIO), y)
>> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o
>> +endif
>> diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
>> new file mode 100644
>> index 0000000..6aacce8
>> --- /dev/null
>> +++ b/hw/dataplane/vring.c
>> @@ -0,0 +1,321 @@
>> +/* Copyright 2012 Red Hat, Inc.
>> + * Copyright IBM, Corp. 2012
>> + *
>> + * Based on Linux vhost code:
>> + * Copyright (C) 2009 Red Hat, Inc.
>> + * Copyright (C) 2006 Rusty Russell IBM Corporation
>> + *
>> + * Author: Michael S. Tsirkin <mst@redhat.com>
>> + *         Stefan Hajnoczi <stefanha@redhat.com>
>> + *
>> + * Inspiration, some code, and most witty comments come from
>> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + */
>> +
>> +#include "trace.h"
>> +#include "hw/dataplane/vring.h"
>> +
>> +/* Map target physical address to host address
>> + */
>> +static inline void *phys_to_host(Vring *vring, hwaddr phys)
>> +{
>> +    /* Adjust for 3.6-4 GB PCI memory range */
>> +    if (phys >= 0x100000000) {
>> +        phys -= 0x100000000 - 0xe0000000;
>> +    } else if (phys >= 0xe0000000) {
>> +        fprintf(stderr, "phys_to_host bad physical address in "
>> +                "PCI range %#lx\n", phys);
>> +        exit(1);
>> +    }
>> +    return vring->phys_mem_zero_host_ptr + phys;
>> +}
>
> This is too dirty to merge.  This code should use a slot mechanism
> similar to what vhost uses.  Then it has a plausible chance of working
> on other architectures.
[...]
> You should be able to use a MemoryListener to keep the memory slot table
> up to date.  You'll need to use a mutex to protect it but again since
> it's mostly uncontended, that shouldn't be a problem.

Anthony, Paolo,
You're right, we should use MemoryListener.  Will fix for v2.
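
As a rough illustration of that direction: MemoryListener, QemuMutex,
and hwaddr are existing QEMU types, but the HostMem/HostMemSlot
structures and hostmem_translate() below are assumptions made up for
this sketch, not the v2 code:

#include "memory.h"
#include "qemu-thread.h"

typedef struct {
    hwaddr guest_addr;          /* guest physical start of the region */
    uint64_t size;              /* region length in bytes */
    void *host_addr;            /* host pointer for guest_addr */
} HostMemSlot;

typedef struct {
    HostMemSlot slots[64];      /* vhost-style fixed table */
    unsigned int num_slots;
    QemuMutex lock;             /* mostly uncontended, taken per lookup */
    MemoryListener listener;    /* region_add/region_del keep slots[] fresh */
} HostMem;

/* Translate a guest physical range to a host pointer, or NULL if the
 * range is not covered by a single slot. */
static void *hostmem_translate(HostMem *hostmem, hwaddr phys, uint64_t len)
{
    void *host = NULL;
    unsigned int i;

    qemu_mutex_lock(&hostmem->lock);
    for (i = 0; i < hostmem->num_slots; i++) {
        HostMemSlot *slot = &hostmem->slots[i];

        if (phys >= slot->guest_addr &&
            phys + len <= slot->guest_addr + slot->size) {
            host = slot->host_addr + (phys - slot->guest_addr);
            break;
        }
    }
    qemu_mutex_unlock(&hostmem->lock);
    return host;
}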

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-15 21:08   ` Anthony Liguori
  2012-11-16  6:22     ` Stefan Hajnoczi
@ 2012-11-16  7:40     ` Paolo Bonzini
  1 sibling, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-11-16  7:40 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Michael S. Tsirkin, qemu-devel, Khoa Huynh,
	Stefan Hajnoczi, Asias He

Il 15/11/2012 22:08, Anthony Liguori ha scritto:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
>> The virtio-blk-data-plane feature is easy to integrate into
>> hw/virtio-blk.c.  The data plane can be started and stopped similar to
>> vhost-net.
>>
>> Users can take advantage of the virtio-blk-data-plane feature using the
>> new -device virtio-blk-pci,x-data-plane=on property.
> 
> I don't think this should be a property of virtio-blk-pci but rather a
> separate device.

Since this is all temporary and we want it to become the default
(perhaps keeping the old code for ioeventfd=false only), I don't think
it really matters.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code Stefan Hajnoczi
  2012-11-15 15:37   ` Paolo Bonzini
  2012-11-15 20:09   ` Anthony Liguori
@ 2012-11-16  7:48   ` Christian Borntraeger
  2012-11-16  8:13     ` Stefan Hajnoczi
  2012-11-17 16:15   ` Blue Swirl
  3 siblings, 1 reply; 38+ messages in thread
From: Christian Borntraeger @ 2012-11-16  7:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	khoa, Paolo Bonzini, Asias He

On 15/11/12 16:19, Stefan Hajnoczi wrote:

> +#include "trace.h"
> +#include "hw/dataplane/vring.h"
> +
> +/* Map target physical address to host address
> + */
> +static inline void *phys_to_host(Vring *vring, hwaddr phys)
> +{
> +    /* Adjust for 3.6-4 GB PCI memory range */
> +    if (phys >= 0x100000000) {
> +        phys -= 0x100000000 - 0xe0000000;
> +    } else if (phys >= 0xe0000000) {
> +        fprintf(stderr, "phys_to_host bad physical address in "
> +                "PCI range %#lx\n", phys);
> +        exit(1);
> +    }

I think non-pci virtio also wants to use dataplane. Any chance to move such pci
specific things out of the main code?

Christian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-16  7:48   ` Christian Borntraeger
@ 2012-11-16  8:13     ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-16  8:13 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini, Asias He

On Fri, Nov 16, 2012 at 8:48 AM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 15/11/12 16:19, Stefan Hajnoczi wrote:
>
>> +#include "trace.h"
>> +#include "hw/dataplane/vring.h"
>> +
>> +/* Map target physical address to host address
>> + */
>> +static inline void *phys_to_host(Vring *vring, hwaddr phys)
>> +{
>> +    /* Adjust for 3.6-4 GB PCI memory range */
>> +    if (phys >= 0x100000000) {
>> +        phys -= 0x100000000 - 0xe0000000;
>> +    } else if (phys >= 0xe0000000) {
>> +        fprintf(stderr, "phys_to_host bad physical address in "
>> +                "PCI range %#lx\n", phys);
>> +        exit(1);
>> +    }
>
> I think non-pci virtio also wants to use dataplane. Any chance to move such pci
> specific things out of the main code?

Yes, using MemoryListener takes care of this.  Actually enabling
virtio-blk-mmio or virtio-blk-ccw should work as long as those
transports provide ioeventfd, irqfd, and use the vring memory layout.

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
  2012-11-16  6:15     ` Stefan Hajnoczi
@ 2012-11-16  8:22       ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-11-16  8:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Asias He

Il 16/11/2012 07:15, Stefan Hajnoczi ha scritto:
>> >
>> > So why not just make a string 'filename' property and open it directly
>> > in virtio-blk-data-plane?  Then it's at least clear to the user and
>> > management tools what the device is capable of doing.
> There are some benefits to raw_get_aio_fd():
> 
> 1. virtio-blk-data-plane is only a subset of the virtio-blk
> implementation; it still needs a regular virtio-blk-pci device (with a
> BDS) in order to run.  If we used a filename, the user would have to
> specify it twice.
> 
> 2. Fetching the file descriptor in this way ensures that the image
> file is format=raw.
> 
> 3. virtio-blk-data-plane uses Linux AIO and raw-posix.c has checks
> which I don't want to duplicate - we can simply check s->use_aio in
> raw_get_aio_fd() to confirm that Linux AIO can be used.

Agreed.  This is not vhost-blk, for which I agree that opening the file
would make more sense (so you have no BDS at all).  It's just a stopgap
measure for something that should become the standard implementation.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code Stefan Hajnoczi
                     ` (2 preceding siblings ...)
  2012-11-16  7:48   ` Christian Borntraeger
@ 2012-11-17 16:15   ` Blue Swirl
  2012-11-18  9:27     ` Stefan Hajnoczi
  3 siblings, 1 reply; 38+ messages in thread
From: Blue Swirl @ 2012-11-17 16:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	khoa, Paolo Bonzini, Asias He

On Thu, Nov 15, 2012 at 3:19 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> The virtio-blk-data-plane cannot access memory using the usual QEMU
> functions since it executes outside the global mutex and the memory APIs
> are not thread-safe at this time.
>
> This patch introduces a virtqueue module based on the kernel's vhost
> vring code.  The trick is that we map guest memory ahead of time and
> access it cheaply outside the global mutex.
>
> Once the hardware emulation code can execute outside the global mutex it
> will be possible to drop this code.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/Makefile.objs           |   2 +-
>  hw/dataplane/Makefile.objs |   3 +
>  hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/vring.h       |  54 ++++++++
>  trace-events               |   3 +
>  5 files changed, 382 insertions(+), 1 deletion(-)
>  create mode 100644 hw/dataplane/Makefile.objs
>  create mode 100644 hw/dataplane/vring.c
>  create mode 100644 hw/dataplane/vring.h
>
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index af4ab0c..da8ef0c 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -1,4 +1,4 @@
> -common-obj-y = usb/ ide/
> +common-obj-y = usb/ ide/ dataplane/
>  common-obj-y += loader.o
>  common-obj-$(CONFIG_VIRTIO) += virtio-console.o
>  common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
> new file mode 100644
> index 0000000..b58544f
> --- /dev/null
> +++ b/hw/dataplane/Makefile.objs
> @@ -0,0 +1,3 @@
> +ifeq ($(CONFIG_VIRTIO), y)
> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o
> +endif
> diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
> new file mode 100644
> index 0000000..6aacce8
> --- /dev/null
> +++ b/hw/dataplane/vring.c
> @@ -0,0 +1,321 @@
> +/* Copyright 2012 Red Hat, Inc.
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#include "trace.h"
> +#include "hw/dataplane/vring.h"
> +
> +/* Map target physical address to host address
> + */
> +static inline void *phys_to_host(Vring *vring, hwaddr phys)
> +{
> +    /* Adjust for 3.6-4 GB PCI memory range */
> +    if (phys >= 0x100000000) {
> +        phys -= 0x100000000 - 0xe0000000;
> +    } else if (phys >= 0xe0000000) {
> +        fprintf(stderr, "phys_to_host bad physical address in "
> +                "PCI range %#lx\n", phys);
> +        exit(1);

Exiting is rather drastic. Is this guest's error or QEMU's?

> +    }
> +    return vring->phys_mem_zero_host_ptr + phys;
> +}
> +
> +/* Setup for cheap target physical to host address conversion
> + *
> + * This is a hack for direct access to guest memory, we're not really allowed
> + * to do this.
> + */
> +static void setup_phys_to_host(Vring *vring)
> +{
> +    hwaddr len = 4096; /* RAM is really much larger but we cheat */
> +    vring->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
> +    if (!vring->phys_mem_zero_host_ptr) {
> +        fprintf(stderr, "setup_phys_to_host failed\n");
> +        exit(1);
> +    }
> +}
> +
> +/* Map the guest's vring to host memory
> + *
> + * This is not allowed but we know the ring won't move.
> + */
> +void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
> +{
> +    setup_phys_to_host(vring);
> +
> +    vring_init(&vring->vr, virtio_queue_get_num(vdev, n),
> +               phys_to_host(vring, virtio_queue_get_ring_addr(vdev, n)), 4096);
> +
> +    vring->last_avail_idx = 0;
> +    vring->last_used_idx = 0;
> +    vring->signalled_used = 0;
> +    vring->signalled_used_valid = false;
> +
> +    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
> +                      vring->vr.desc, vring->vr.avail, vring->vr.used);
> +}
> +
> +/* Toggle guest->host notifies */
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable)
> +{
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        if (enable) {
> +            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> +        }
> +    } else if (enable) {
> +        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> +    } else {
> +        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
> +    }
> +}
> +
> +/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
> +{
> +    uint16_t old, new;
> +    bool v;
> +    /* Flush out used index updates. This is paired
> +     * with the barrier that the Guest executes when enabling
> +     * interrupts. */
> +    smp_mb();
> +
> +    if ((vdev->guest_features & VIRTIO_F_NOTIFY_ON_EMPTY) &&
> +        unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
> +        return true;
> +    }
> +
> +    if (!(vdev->guest_features & VIRTIO_RING_F_EVENT_IDX)) {
> +        return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
> +    }
> +    old = vring->signalled_used;
> +    v = vring->signalled_used_valid;
> +    new = vring->signalled_used = vring->last_used_idx;
> +    vring->signalled_used_valid = true;
> +
> +    if (unlikely(!v)) {
> +        return true;
> +    }
> +
> +    return vring_need_event(vring_used_event(&vring->vr), new, old);
> +}
> +
> +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
> +static bool get_indirect(Vring *vring,
> +                         struct iovec iov[], struct iovec *iov_end,
> +                         unsigned int *out_num, unsigned int *in_num,
> +                         struct vring_desc *indirect)
> +{
> +    struct vring_desc desc;
> +    unsigned int i = 0, count, found = 0;
> +
> +    /* Sanity check */
> +    if (unlikely(indirect->len % sizeof desc)) {
> +        fprintf(stderr, "Invalid length in indirect descriptor: "
> +               "len 0x%llx not multiple of 0x%zx\n",
> +               (unsigned long long)indirect->len,
> +               sizeof desc);
> +        exit(1);
> +    }
> +
> +    count = indirect->len / sizeof desc;
> +    /* Buffers are chained via a 16 bit next field, so
> +     * we can have at most 2^16 of these. */
> +    if (unlikely(count > USHRT_MAX + 1)) {
> +        fprintf(stderr, "Indirect buffer length too big: %d\n",
> +               indirect->len);
> +        exit(1);
> +    }
> +
> +    /* Point to translate indirect desc chain */
> +    indirect = phys_to_host(vring, indirect->addr);
> +
> +    /* We will use the result as an address to read from, so most
> +     * architectures only need a compiler barrier here. */
> +    barrier(); /* read_barrier_depends(); */
> +
> +    do {
> +        if (unlikely(++found > count)) {
> +            fprintf(stderr, "Loop detected: last one at %u "
> +                   "indirect size %u\n",
> +                   i, count);
> +            exit(1);
> +        }
> +
> +        desc = *indirect++;
> +        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
> +            fprintf(stderr, "Nested indirect descriptor\n");
> +            exit(1);
> +        }
> +
> +        /* Stop for now if there are not enough iovecs available. */
> +        if (iov >= iov_end) {
> +            return false;
> +        }
> +
> +        iov->iov_base = phys_to_host(vring, desc.addr);
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +        /* If this is an input descriptor, increment that count. */
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                fprintf(stderr, "Indirect descriptor "
> +                       "has out after in: idx %d\n", i);
> +                exit(1);
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +    return true;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which is
> + * never a valid descriptor number) if none was found.  A negative code is
> + * returned on error.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num)
> +{
> +    struct vring_desc desc;
> +    unsigned int i, head, found = 0, num = vring->vr.num;
> +    __u16 avail_idx, last_avail_idx;

Please use uint16_t in QEMU code.

> +
> +    /* Check it isn't doing very strange things with descriptor numbers. */
> +    last_avail_idx = vring->last_avail_idx;
> +    avail_idx = vring->vr.avail->idx;
> +
> +    if (unlikely((__u16)(avail_idx - last_avail_idx) > num)) {
> +        fprintf(stderr, "Guest moved used index from %u to %u\n",
> +                last_avail_idx, avail_idx);
> +        exit(1);
> +    }
> +
> +    /* If there's nothing new since last we looked. */
> +    if (avail_idx == last_avail_idx) {
> +        return -EAGAIN;
> +    }
> +
> +    /* Only get avail ring entries after they have been exposed by guest. */
> +    smp_rmb();
> +
> +    /* Grab the next descriptor number they're advertising, and increment
> +     * the index we've seen. */
> +    head = vring->vr.avail->ring[last_avail_idx % num];
> +
> +    /* If their number is silly, that's an error. */
> +    if (unlikely(head >= num)) {
> +        fprintf(stderr, "Guest says index %u > %u is available\n",
> +                head, num);
> +        exit(1);
> +    }
> +
> +    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
> +        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
> +    }
> +
> +    /* When we start there are none of either input nor output. */
> +    *out_num = *in_num = 0;
> +
> +    i = head;
> +    do {
> +        if (unlikely(i >= num)) {
> +            fprintf(stderr, "Desc index is %u > %u, head = %u\n",
> +                    i, num, head);
> +            exit(1);
> +        }
> +        if (unlikely(++found > num)) {
> +            fprintf(stderr, "Loop detected: last one at %u "
> +                    "vq size %u head %u\n",
> +                    i, num, head);
> +            exit(1);
> +        }
> +        desc = vring->vr.desc[i];
> +        if (desc.flags & VRING_DESC_F_INDIRECT) {
> +            if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
> +                return -ENOBUFS; /* not enough iovecs, stop for now */
> +            }
> +            continue;
> +        }
> +
> +        /* If there are not enough iovecs left, stop for now.  The caller
> +         * should check if there are more descs available once they have dealt
> +         * with the current set.
> +         */
> +        if (iov >= iov_end) {
> +            return -ENOBUFS;
> +        }
> +
> +        iov->iov_base = phys_to_host(vring, desc.addr);
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +        if (desc.flags & VRING_DESC_F_WRITE) {
> +            /* If this is an input descriptor,
> +             * increment that count. */
> +            *in_num += 1;
> +        } else {
> +            /* If it's an output descriptor, they're all supposed
> +             * to come before any input descriptors. */
> +            if (unlikely(*in_num)) {
> +                fprintf(stderr, "Descriptor has out after in: "
> +                        "idx %d\n", i);
> +                exit(1);
> +            }
> +            *out_num += 1;
> +        }
> +        i = desc.next;
> +    } while (desc.flags & VRING_DESC_F_NEXT);
> +
> +    /* On success, increment avail index. */
> +    vring->last_avail_idx++;
> +    return head;
> +}
> +
> +/* After we've used one of their buffers, we tell them about it.
> + *
> + * Stolen from linux-2.6/drivers/vhost/vhost.c.
> + */
> +void vring_push(Vring *vring, unsigned int head, int len)
> +{
> +    struct vring_used_elem *used;
> +    uint16_t new;
> +
> +    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
> +     * next entry in that used ring. */
> +    used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
> +    used->id = head;
> +    used->len = len;
> +
> +    /* Make sure buffer is written before we update index. */
> +    smp_wmb();
> +
> +    new = vring->vr.used->idx = ++vring->last_used_idx;
> +    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
> +        vring->signalled_used_valid = false;
> +    }
> +}
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> new file mode 100644
> index 0000000..42c2f0a
> --- /dev/null
> +++ b/hw/dataplane/vring.h
> @@ -0,0 +1,54 @@
> +/* Copyright 2012 Red Hat, Inc. and/or its affiliates
> + * Copyright IBM, Corp. 2012
> + *
> + * Based on Linux vhost code:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *         Stefan Hajnoczi <stefanha@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +
> +#ifndef VRING_H
> +#define VRING_H
> +
> +#include <linux/virtio_ring.h>
> +#include "qemu-common.h"
> +#include "qemu-barrier.h"
> +#include "memory.h"
> +#include "hw/virtio.h"
> +
> +typedef struct {
> +    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
> +    struct vring vr;                /* virtqueue vring mapped to host memory */
> +    __u16 last_avail_idx;           /* last processed avail ring index */
> +    __u16 last_used_idx;            /* last processed used ring index */
> +    uint16_t signalled_used;        /* EVENT_IDX state */
> +    bool signalled_used_valid;
> +} Vring;
> +
> +static inline unsigned int vring_get_num(Vring *vring)
> +{
> +    return vring->vr.num;
> +}
> +
> +/* Are there more descriptors available? */
> +static inline bool vring_more_avail(Vring *vring)
> +{
> +    return vring->vr.avail->idx != vring->last_avail_idx;
> +}
> +
> +void vring_setup(Vring *vring, VirtIODevice *vdev, int n);
> +void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable);
> +bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
> +int vring_pop(VirtIODevice *vdev, Vring *vring,
> +              struct iovec iov[], struct iovec *iov_end,
> +              unsigned int *out_num, unsigned int *in_num);
> +void vring_push(Vring *vring, unsigned int head, int len);
> +
> +#endif /* VRING_H */
> diff --git a/trace-events b/trace-events
> index e1a37cc..8eeab34 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -98,6 +98,9 @@ virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
>  virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>  virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
>
> +# hw/dataplane/vring.c
> +vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
> +
>  # thread-pool.c
>  thread_pool_submit(void *req, void *opaque) "req %p opaque %p"
>  thread_pool_complete(void *req, void *opaque, int ret) "req %p opaque %p ret %d"
> --
> 1.8.0
>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code
  2012-11-17 16:15   ` Blue Swirl
@ 2012-11-18  9:27     ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-18  9:27 UTC (permalink / raw)
  To: Blue Swirl
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini, Asias He

On Sat, Nov 17, 2012 at 5:15 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
> On Thu, Nov 15, 2012 at 3:19 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> The virtio-blk-data-plane cannot access memory using the usual QEMU
>> functions since it executes outside the global mutex and the memory APIs
>> are not thread-safe at this time.
>>
>> This patch introduces a virtqueue module based on the kernel's vhost
>> vring code.  The trick is that we map guest memory ahead of time and
>> access it cheaply outside the global mutex.
>>
>> Once the hardware emulation code can execute outside the global mutex it
>> will be possible to drop this code.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>  hw/Makefile.objs           |   2 +-
>>  hw/dataplane/Makefile.objs |   3 +
>>  hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++++++++++++
>>  hw/dataplane/vring.h       |  54 ++++++++
>>  trace-events               |   3 +
>>  5 files changed, 382 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/dataplane/Makefile.objs
>>  create mode 100644 hw/dataplane/vring.c
>>  create mode 100644 hw/dataplane/vring.h
>>
>> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
>> index af4ab0c..da8ef0c 100644
>> --- a/hw/Makefile.objs
>> +++ b/hw/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -common-obj-y = usb/ ide/
>> +common-obj-y = usb/ ide/ dataplane/
>>  common-obj-y += loader.o
>>  common-obj-$(CONFIG_VIRTIO) += virtio-console.o
>>  common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
>> diff --git a/hw/dataplane/Makefile.objs b/hw/dataplane/Makefile.objs
>> new file mode 100644
>> index 0000000..b58544f
>> --- /dev/null
>> +++ b/hw/dataplane/Makefile.objs
>> @@ -0,0 +1,3 @@
>> +ifeq ($(CONFIG_VIRTIO), y)
>> +common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += vring.o
>> +endif
>> diff --git a/hw/dataplane/vring.c b/hw/dataplane/vring.c
>> new file mode 100644
>> index 0000000..6aacce8
>> --- /dev/null
>> +++ b/hw/dataplane/vring.c
>> @@ -0,0 +1,321 @@
>> +/* Copyright 2012 Red Hat, Inc.
>> + * Copyright IBM, Corp. 2012
>> + *
>> + * Based on Linux vhost code:
>> + * Copyright (C) 2009 Red Hat, Inc.
>> + * Copyright (C) 2006 Rusty Russell IBM Corporation
>> + *
>> + * Author: Michael S. Tsirkin <mst@redhat.com>
>> + *         Stefan Hajnoczi <stefanha@redhat.com>
>> + *
>> + * Inspiration, some code, and most witty comments come from
>> + * Documentation/virtual/lguest/lguest.c, by Rusty Russell
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + */
>> +
>> +#include "trace.h"
>> +#include "hw/dataplane/vring.h"
>> +
>> +/* Map target physical address to host address
>> + */
>> +static inline void *phys_to_host(Vring *vring, hwaddr phys)
>> +{
>> +    /* Adjust for 3.6-4 GB PCI memory range */
>> +    if (phys >= 0x100000000) {
>> +        phys -= 0x100000000 - 0xe0000000;
>> +    } else if (phys >= 0xe0000000) {
>> +        fprintf(stderr, "phys_to_host bad physical address in "
>> +                "PCI range %#lx\n", phys);
>> +        exit(1);
>
> Exiting is rather drastic. Is this guest's error or QEMU's?

This can be triggered by the guest - the guest could pass an invalid
physical memory address.  I agree it's better to have an error state
but hw/virtio.c doesn't do this any better today either.  I think the
safest option is to enter an error state where we ignore the vring
until the next (guest-initiated) reset.
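
To sketch what such an error state could look like (the 'broken' field
and vring_fail() helper below are hypothetical, not part of this
series):

#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

/* Report a guest error, mark the vring unusable, and return -EFAULT so the
 * caller drops the request instead of exiting.  'broken' is a hypothetical
 * new Vring field that vring_setup() would clear on device reset. */
static int vring_fail(Vring *vring, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);

    vring->broken = true;
    return -EFAULT;
}

/* Example use inside vring_pop(), replacing fprintf()+exit(1):
 *
 *     if (unlikely(vring->broken)) {
 *         return -EFAULT;
 *     }
 *     ...
 *     if (unlikely(head >= num)) {
 *         return vring_fail(vring, "Guest says index %u > %u is available\n",
 *                           head, num);
 *     }
 */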

>> +int vring_pop(VirtIODevice *vdev, Vring *vring,
>> +              struct iovec iov[], struct iovec *iov_end,
>> +              unsigned int *out_num, unsigned int *in_num)
>> +{
>> +    struct vring_desc desc;
>> +    unsigned int i, head, found = 0, num = vring->vr.num;
>> +    __u16 avail_idx, last_avail_idx;
>
> Please use uint16_t in QEMU code.

Sure, will convert to QEMU types.

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-16  6:22     ` Stefan Hajnoczi
@ 2012-11-19 10:38       ` Kevin Wolf
  2012-11-19 10:51         ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Kevin Wolf @ 2012-11-19 10:38 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Michael S. Tsirkin, qemu-devel, Khoa Huynh,
	Stefan Hajnoczi, Paolo Bonzini, Asias He

Am 16.11.2012 07:22, schrieb Stefan Hajnoczi:
> On Thu, Nov 15, 2012 at 10:08 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>
>>> The virtio-blk-data-plane feature is easy to integrate into
>>> hw/virtio-blk.c.  The data plane can be started and stopped similar to
>>> vhost-net.
>>>
>>> Users can take advantage of the virtio-blk-data-plane feature using the
>>> new -device virtio-blk-pci,x-data-plane=on property.
>>
>> I don't think this should be a property of virtio-blk-pci but rather a
>> separate device.
> 
> The hw/virtio-blk.c code still needs to be used since
> hw/dataplane/virtio-blk.c is only a subset of virtio-blk.
> 
> So we're talking about adding a new virtio-blk-data-plane-pci device
> type to hw/virtio-pci.c?

A new device sounds wrong to me; it's the very same thing from a guest
perspective. That makes me wonder whether, in the final version, it
shouldn't be a -blockdev option rather than a -device one...

Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature
  2012-11-19 10:38       ` Kevin Wolf
@ 2012-11-19 10:51         ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-11-19 10:51 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, Michael S. Tsirkin, Stefan Hajnoczi, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Asias He

Il 19/11/2012 11:38, Kevin Wolf ha scritto:
> Am 16.11.2012 07:22, schrieb Stefan Hajnoczi:
>> On Thu, Nov 15, 2012 at 10:08 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>
>>>> The virtio-blk-data-plane feature is easy to integrate into
>>>> hw/virtio-blk.c.  The data plane can be started and stopped similar to
>>>> vhost-net.
>>>>
>>>> Users can take advantage of the virtio-blk-data-plane feature using the
>>>> new -device virtio-blk-pci,x-data-plane=on property.
>>>
>>> I don't think this should be a property of virtio-blk-pci but rather a
>>> separate device.
>>
>> The hw/virtio-blk.c code still needs to be used since
>> hw/dataplane/virtio-blk.c is only a subset of virtio-blk.
>>
>> So we're talking about adding a new virtio-blk-data-plane-pci device
>> type to hw/virtio-pci.c?
> 
> A new device sounds wrong to me; it's the very same thing from a guest
> perspective. That makes me wonder whether, in the final version, it
> shouldn't be a -blockdev option rather than a -device one...

In the final version it shouldn't be an option at all. :)

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2012-11-15 15:19 ` [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
@ 2012-11-20  9:02 ` Asias He
  2012-11-20 12:21   ` Stefan Hajnoczi
  2012-11-20 15:03   ` Khoa Huynh
  7 siblings, 2 replies; 38+ messages in thread
From: Asias He @ 2012-11-20  9:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	khoa, Paolo Bonzini

Hello Stefan,

On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
> This series adds the -device virtio-blk-pci,x-data-plane=on property that
> enables a high performance I/O codepath.  A dedicated thread is used to process
> virtio-blk requests outside the global mutex and without going through the QEMU
> block layer.
> 
> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
> IOPS for a single VM using virtio-blk-data-plane in July:
> 
>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
> 
> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
> Conference 2010.  The following slides contain a brief overview:
> 
>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
> 
> The basic approach is:
> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>    signalling when the guest kicks the virtqueue.
> 2. Requests are processed without going through the QEMU block layer using
>    Linux AIO directly.
> 3. Completion interrupts are injected via irqfd from the dedicated thread.
> 
> To try it out:
> 
>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on


Is this the latest dataplane bits:
(git://github.com/stefanha/qemu.git virtio-blk-data-plane)

commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
Author: Stefan Hajnoczi <stefanha@redhat.com>
Date:   Wed Nov 14 15:45:38 2012 +0100

    virtio-blk: add x-data-plane=on|off performance feature


With this commit on a ramdisk based box, I am seeing about 10K IOPS with
x-data-plane on and 90K IOPS with x-data-plane off.

Any ideas?

Command line I used:

IMG=/dev/ram0
x86_64-softmmu/qemu-system-x86_64 \
-drive file=/root/img/sid.img,if=ide \
-drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
-kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
-L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
2048 -smp 4 -cpu qemu64,+x2apic -M pc


> 
> Limitations:
>  * Only format=raw is supported
>  * Live migration is not supported
>  * Block jobs, hot unplug, and other operations fail with -EBUSY
>  * I/O throttling limits are ignored
>  * Only Linux hosts are supported due to Linux AIO usage
> 
> The code has reached a stage where I feel it is ready to merge.  Users have
> been playing with it for some time and want the significant performance boost.
> 
> We are refactoring QEMU to get rid of the global mutex.  I believe that
> virtio-blk-data-plane can eventually become the default mode of operation.
> 
> Instead of waiting for global mutex removal efforts to finish, I want to use
> virtio-blk-data-plane as an example device for AioContext and threaded hw
> dispatch refactoring.  This means:
> 
> 1. When the block layer can bind to an AioContext and execute I/O outside the
>    global mutex, virtio-blk-data-plane can use this (and gain image format
>    support).
> 
> 2. When hw dispatch no longer needs the global mutex we can use hw/virtio.c
>    again and perhaps run a pool of iothreads instead of dedicated data plane
>    threads.
> 
> But in the meantime, I have cleaned up the virtio-blk-data-plane code so that
> it can be merged as an experimental feature.
> 
> Changes from the RFC v9:
>  * Add x-data-plane=on|off option and coexist with regular virtio-blk code
>  * Create thread from BH so it inherits iothread cpusets
>  * Drain requests on vm_stop() so stopped guest does not access image file
>  * Add migration blocker
>  * Add bdrv_in_use() to prevent block jobs and other operations that can interfere
>  * Drop IOQueue request merging for simplicity
>  * Drop ioctl interrupt injection and always use irqfd for simplicity
>  * Major cleanup to split up source files
>  * Rebase from qemu-kvm.git onto qemu.git
>  * Address Michael Tsirkin's review comments
> 
> Stefan Hajnoczi (7):
>   raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
>   configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
>   dataplane: add virtqueue vring code
>   dataplane: add event loop
>   dataplane: add Linux AIO request queue
>   dataplane: add virtio-blk data plane code
>   virtio-blk: add x-data-plane=on|off performance feature
> 
>  block.h                    |   9 +
>  block/raw-posix.c          |  34 ++++
>  configure                  |  21 +++
>  hw/Makefile.objs           |   2 +-
>  hw/dataplane/Makefile.objs |   3 +
>  hw/dataplane/event-poll.c  | 109 ++++++++++++
>  hw/dataplane/event-poll.h  |  40 +++++
>  hw/dataplane/ioq.c         | 118 +++++++++++++
>  hw/dataplane/ioq.h         |  57 +++++++
>  hw/dataplane/virtio-blk.c  | 414 +++++++++++++++++++++++++++++++++++++++++++++
>  hw/dataplane/virtio-blk.h  |  41 +++++
>  hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++
>  hw/dataplane/vring.h       |  54 ++++++
>  hw/virtio-blk.c            |  59 ++++++-
>  hw/virtio-blk.h            |   1 +
>  hw/virtio-pci.c            |   3 +
>  trace-events               |   9 +
>  17 files changed, 1293 insertions(+), 2 deletions(-)
>  create mode 100644 hw/dataplane/Makefile.objs
>  create mode 100644 hw/dataplane/event-poll.c
>  create mode 100644 hw/dataplane/event-poll.h
>  create mode 100644 hw/dataplane/ioq.c
>  create mode 100644 hw/dataplane/ioq.h
>  create mode 100644 hw/dataplane/virtio-blk.c
>  create mode 100644 hw/dataplane/virtio-blk.h
>  create mode 100644 hw/dataplane/vring.c
>  create mode 100644 hw/dataplane/vring.h
> 


-- 
Asias

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-20  9:02 ` [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Asias He
@ 2012-11-20 12:21   ` Stefan Hajnoczi
  2012-11-20 12:25     ` Stefan Hajnoczi
  2012-11-21  5:22     ` Asias He
  2012-11-20 15:03   ` Khoa Huynh
  1 sibling, 2 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-20 12:21 UTC (permalink / raw)
  To: Asias He
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
> Hello Stefan,
>
> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>> enables a high performance I/O codepath.  A dedicated thread is used to process
>> virtio-blk requests outside the global mutex and without going through the QEMU
>> block layer.
>>
>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
>> IOPS for a single VM using virtio-blk-data-plane in July:
>>
>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>
>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
>> Conference 2010.  The following slides contain a brief overview:
>>
>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>
>> The basic approach is:
>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>    signalling when the guest kicks the virtqueue.
>> 2. Requests are processed without going through the QEMU block layer using
>>    Linux AIO directly.
>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>
>> To try it out:
>>
>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>
>
> Is this the latest dataplane bits:
> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>
> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
> Author: Stefan Hajnoczi <stefanha@redhat.com>
> Date:   Wed Nov 14 15:45:38 2012 +0100
>
>     virtio-blk: add x-data-plane=on|off performance feature
>
>
> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
> x-data-plane on and 90K IOPS with x-data-plane off.
>
> Any ideas?
>
> Command line I used:
>
> IMG=/dev/ram0
> x86_64-softmmu/qemu-system-x86_64 \
> -drive file=/root/img/sid.img,if=ide \
> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
> 2048 -smp 4 -cpu qemu64,+x2apic -M pc

Was just about to send out the latest patch series which addresses
review comments, so I have tested the latest code
(61b70fef489ce51ecd18d69afb9622c110b9315c).

I was unable to reproduce a ramdisk performance regression on Linux
3.6.6-3.fc18.x86_64 with Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz with
8 GB RAM.

The ramdisk is 4 GB and I used your QEMU command-line with a RHEL 6.3 guest.

Summary results:
x-data-plane-on: iops=132856 aggrb=1039.1MB/s
x-data-plane-off: iops=126236 aggrb=988.40MB/s

virtio-blk-data-plane is ~5% faster in this benchmark.

fio jobfile:
[global]
filename=/dev/vda
blocksize=8k
ioengine=libaio
direct=1
iodepth=8
runtime=120
time_based=1

[reads]
readwrite=randread
numjobs=4

Perf top (data-plane-on):
  3.71%  [kvm]               [k] kvm_arch_vcpu_ioctl_run
  3.27%  [kernel]            [k] memset    <--- ramdisk
  2.98%  [kernel]            [k] do_blockdev_direct_IO
  2.82%  [kvm_intel]         [k] vmx_vcpu_run
  2.66%  [kernel]            [k] _raw_spin_lock_irqsave
  2.06%  [kernel]            [k] put_compound_page
  2.06%  [kernel]            [k] __get_page_tail
  1.83%  [i915]              [k] __gen6_gt_force_wake_mt_get
  1.75%  [kernel]            [k] _raw_spin_unlock_irqrestore
  1.33%  qemu-system-x86_64  [.] vring_pop <--- virtio-blk-data-plane
  1.19%  [kernel]            [k] compound_unlock_irqrestore
  1.13%  [kernel]            [k] gup_huge_pmd
  1.11%  [kernel]            [k] __audit_syscall_exit
  1.07%  [kernel]            [k] put_page_testzero
  1.01%  [kernel]            [k] fget
  1.01%  [kernel]            [k] do_io_submit

Since the ramdisk (memset and page-related functions) is so prominent
in perf top, I also tried a 1-job 8k dd sequential write test on a
Samsung 830 Series SSD where virtio-blk-data-plane was 9% faster than
virtio-blk.  Optimizing against ramdisk isn't a good idea IMO because
it acts very differently from real hardware where the driver relies on
mmio, DMA, and interrupts (vs synchronous memcpy/memset).

Full results:
$ cat data-plane-off
reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
fio 1.57
Starting 4 processes

reads: (groupid=0, jobs=1): err= 0: pid=1851
  read : io=29408MB, bw=250945KB/s, iops=31368 , runt=120001msec
    slat (usec): min=2 , max=27829 , avg=11.06, stdev=78.05
    clat (usec): min=1 , max=28028 , avg=241.41, stdev=388.47
     lat (usec): min=33 , max=28035 , avg=253.17, stdev=396.66
    bw (KB/s) : min=197141, max=335365, per=24.78%, avg=250797.02,
stdev=29376.35
  cpu          : usr=6.55%, sys=31.34%, ctx=310932, majf=0, minf=41
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3764202/0/0, short=0/0/0
     lat (usec): 2=0.01%, 4=0.01%, 20=0.01%, 50=1.78%, 100=27.11%
     lat (usec): 250=38.97%, 500=27.11%, 750=2.09%, 1000=0.71%
     lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%, 50=0.01%
reads: (groupid=0, jobs=1): err= 0: pid=1852
  read : io=29742MB, bw=253798KB/s, iops=31724 , runt=120001msec
    slat (usec): min=2 , max=17007 , avg=10.61, stdev=67.51
    clat (usec): min=1 , max=41531 , avg=239.00, stdev=379.03
     lat (usec): min=32 , max=41547 , avg=250.33, stdev=385.21
    bw (KB/s) : min=194336, max=347497, per=25.02%, avg=253204.25,
stdev=31172.37
  cpu          : usr=6.66%, sys=32.58%, ctx=327250, majf=0, minf=41
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3806999/0/0, short=0/0/0
     lat (usec): 2=0.01%, 20=0.01%, 50=1.54%, 100=26.45%, 250=40.04%
     lat (usec): 500=27.15%, 750=1.95%, 1000=0.71%
     lat (msec): 2=1.29%, 4=0.68%, 10=0.18%, 20=0.01%, 50=0.01%
reads: (groupid=0, jobs=1): err= 0: pid=1853
  read : io=29859MB, bw=254797KB/s, iops=31849 , runt=120001msec
    slat (usec): min=2 , max=16821 , avg=11.35, stdev=76.54
    clat (usec): min=1 , max=17659 , avg=237.25, stdev=375.31
     lat (usec): min=31 , max=17673 , avg=249.27, stdev=383.62
    bw (KB/s) : min=194864, max=345280, per=25.15%, avg=254534.63,
stdev=30549.32
  cpu          : usr=6.52%, sys=31.84%, ctx=303763, majf=0, minf=39
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3821989/0/0, short=0/0/0
     lat (usec): 2=0.01%, 10=0.01%, 20=0.01%, 50=2.09%, 100=29.19%
     lat (usec): 250=37.31%, 500=26.41%, 750=2.08%, 1000=0.71%
     lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%
reads: (groupid=0, jobs=1): err= 0: pid=1854
  read : io=29598MB, bw=252565KB/s, iops=31570 , runt=120001msec
    slat (usec): min=2 , max=26413 , avg=11.21, stdev=78.32
    clat (usec): min=16 , max=27993 , avg=239.56, stdev=381.67
     lat (usec): min=34 , max=28006 , avg=251.49, stdev=390.13
    bw (KB/s) : min=194256, max=369424, per=24.94%, avg=252462.86,
stdev=29420.58
  cpu          : usr=6.57%, sys=31.33%, ctx=305623, majf=0, minf=41
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3788507/0/0, short=0/0/0
     lat (usec): 20=0.01%, 50=2.13%, 100=28.30%, 250=37.74%, 500=26.66%
     lat (usec): 750=2.17%, 1000=0.75%
     lat (msec): 2=1.35%, 4=0.70%, 10=0.19%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=118607MB, aggrb=988.40MB/s, minb=256967KB/s,
maxb=260912KB/s, mint=120001msec, maxt=120001msec

Disk stats (read/write):
  vda: ios=15148328/0, merge=0/0, ticks=1550570/0, in_queue=1536232, util=96.56%

$ cat data-plane-on
reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
...
reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
fio 1.57
Starting 4 processes

reads: (groupid=0, jobs=1): err= 0: pid=1796
  read : io=32081MB, bw=273759KB/s, iops=34219 , runt=120001msec
    slat (usec): min=1 , max=20404 , avg=21.08, stdev=125.49
    clat (usec): min=10 , max=135743 , avg=207.62, stdev=532.90
     lat (usec): min=21 , max=136055 , avg=229.60, stdev=556.82
    bw (KB/s) : min=56480, max=951952, per=25.49%, avg=271488.81,
stdev=149773.57
  cpu          : usr=7.01%, sys=43.26%, ctx=336854, majf=0, minf=41
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=4106413/0/0, short=0/0/0
     lat (usec): 20=0.01%, 50=2.46%, 100=61.13%, 250=21.58%, 500=3.11%
     lat (usec): 750=3.04%, 1000=3.88%
     lat (msec): 2=4.50%, 4=0.13%, 10=0.11%, 20=0.06%, 50=0.01%
     lat (msec): 250=0.01%
reads: (groupid=0, jobs=1): err= 0: pid=1797
  read : io=30104MB, bw=256888KB/s, iops=32110 , runt=120001msec
    slat (usec): min=1 , max=17595 , avg=22.20, stdev=120.29
    clat (usec): min=13 , max=136264 , avg=221.21, stdev=528.19
     lat (usec): min=22 , max=136280 , avg=244.35, stdev=551.73
    bw (KB/s) : min=57312, max=838880, per=23.93%, avg=254798.51,
stdev=139546.57
  cpu          : usr=6.82%, sys=41.87%, ctx=360348, majf=0, minf=41
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3853351/0/0, short=0/0/0
     lat (usec): 20=0.01%, 50=2.10%, 100=58.47%, 250=22.38%, 500=3.68%
     lat (usec): 750=3.69%, 1000=4.52%
     lat (msec): 2=4.87%, 4=0.14%, 10=0.11%, 20=0.05%, 250=0.01%
reads: (groupid=0, jobs=1): err= 0: pid=1798
  read : io=31698MB, bw=270487KB/s, iops=33810 , runt=120001msec
    slat (usec): min=1 , max=17457 , avg=20.93, stdev=125.33
    clat (usec): min=16 , max=134663 , avg=210.19, stdev=535.77
     lat (usec): min=21 , max=134671 , avg=232.02, stdev=559.27
    bw (KB/s) : min=57248, max=841952, per=25.29%, avg=269330.21,
stdev=148661.08
  cpu          : usr=6.92%, sys=42.81%, ctx=337799, majf=0, minf=39
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=4057340/0/0, short=0/0/0
     lat (usec): 20=0.01%, 50=1.98%, 100=62.00%, 250=20.70%, 500=3.22%
     lat (usec): 750=3.23%, 1000=4.16%
     lat (msec): 2=4.41%, 4=0.13%, 10=0.10%, 20=0.06%, 250=0.01%
reads: (groupid=0, jobs=1): err= 0: pid=1799
  read : io=30913MB, bw=263789KB/s, iops=32973 , runt=120000msec
    slat (usec): min=1 , max=17565 , avg=21.52, stdev=120.17
    clat (usec): min=15 , max=136064 , avg=215.53, stdev=529.56
     lat (usec): min=27 , max=136070 , avg=237.99, stdev=552.50
    bw (KB/s) : min=57632, max=900896, per=24.74%, avg=263431.57,
stdev=148379.15
  cpu          : usr=6.90%, sys=42.56%, ctx=348217, majf=0, minf=41
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=3956830/0/0, short=0/0/0
     lat (usec): 20=0.01%, 50=1.76%, 100=59.96%, 250=22.21%, 500=3.45%
     lat (usec): 750=3.35%, 1000=4.33%
     lat (msec): 2=4.65%, 4=0.13%, 10=0.11%, 20=0.05%, 250=0.01%

Run status group 0 (all jobs):
   READ: io=124796MB, aggrb=1039.1MB/s, minb=263053KB/s, maxb=280328KB/s, mint=120000msec, maxt=120001msec

Disk stats (read/write):
  vda: ios=15942789/0, merge=0/0, ticks=336240/0, in_queue=317832, util=97.47%


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-20 12:21   ` Stefan Hajnoczi
@ 2012-11-20 12:25     ` Stefan Hajnoczi
  2012-11-21  5:39       ` Asias He
  2012-11-21  5:22     ` Asias He
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-20 12:25 UTC (permalink / raw)
  To: Asias He
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On Tue, Nov 20, 2012 at 1:21 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
>> Hello Stefan,
>>
>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>>> enables a high performance I/O codepath.  A dedicated thread is used to process
>>> virtio-blk requests outside the global mutex and without going through the QEMU
>>> block layer.
>>>
>>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
>>> IOPS for a single VM using virtio-blk-data-plane in July:
>>>
>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>
>>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
>>> Conference 2010.  The following slides contain a brief overview:
>>>
>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>
>>> The basic approach is:
>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>    signalling when the guest kicks the virtqueue.
>>> 2. Requests are processed without going through the QEMU block layer using
>>>    Linux AIO directly.
>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>>
>>> To try it out:
>>>
>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>
>>
>> Is this the latest dataplane bits:
>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>
>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>
>>     virtio-blk: add x-data-plane=on|off performance feature
>>
>>
>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>> x-data-plane on and 90K IOPS with x-data-plane off.
>>
>> Any ideas?
>>
>> Command line I used:
>>
>> IMG=/dev/ram0
>> x86_64-softmmu/qemu-system-x86_64 \
>> -drive file=/root/img/sid.img,if=ide \
>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>
> Was just about to send out the latest patch series which addresses
> review comments, so I have tested the latest code
> (61b70fef489ce51ecd18d69afb9622c110b9315c).

Rebased onto qemu.git/master before sending out.  The commit ID is now:
cf6ed6406543ecc43895012a9ac9665e3753d5e8

https://github.com/stefanha/qemu/commits/virtio-blk-data-plane

Stefan


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-20  9:02 ` [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Asias He
  2012-11-20 12:21   ` Stefan Hajnoczi
@ 2012-11-20 15:03   ` Khoa Huynh
  2012-11-21  5:22     ` Asias He
  1 sibling, 1 reply; 38+ messages in thread
From: Khoa Huynh @ 2012-11-20 15:03 UTC (permalink / raw)
  To: Asias He
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini


Asias He <asias@redhat.com> wrote on 11/20/2012 03:02:07 AM:

> From: Asias He <asias@redhat.com>
> To: Stefan Hajnoczi <stefanha@redhat.com>,
> Cc: qemu-devel@nongnu.org, Anthony Liguori/Austin/IBM@IBMUS, Paolo
> Bonzini <pbonzini@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
> "Michael S. Tsirkin" <mst@redhat.com>, Khoa Huynh/Austin/IBM@IBMUS
> Date: 11/20/2012 03:01 AM
> Subject: Re: [PATCH 0/7] virtio: virtio-blk data plane
>
> Hello Stefan,
>
> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
> > This series adds the -device virtio-blk-pci,x-data-plane=on property
that
> > enables a high performance I/O codepath.  A dedicated thread is
> used to process
> > virtio-blk requests outside the global mutex and without going
> through the QEMU
> > block layer.
> >
> > Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000
> IOPS to 600,000
> > IOPS for a single VM using virtio-blk-data-plane in July:
> >
> >   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
> >
> > The virtio-blk-data-plane approach was originally presented at
> Linux Plumbers
> > Conference 2010.  The following slides contain a brief overview:
> >
> >   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/
> original/Optimizing_the_QEMU_Storage_Stack.pdf
> >
> > The basic approach is:
> > 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
> >    signalling when the guest kicks the virtqueue.
> > 2. Requests are processed without going through the QEMU block layer
using
> >    Linux AIO directly.
> > 3. Completion interrupts are injected via irqfd from the dedicated
thread.
> >
> > To try it out:
> >
> >   qemu -drive
if=none,id=drive0,cache=none,aio=native,format=raw,file=...
> >        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>
>
> Is this the latest dataplane bits:
> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>
> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
> Author: Stefan Hajnoczi <stefanha@redhat.com>
> Date:   Wed Nov 14 15:45:38 2012 +0100
>
>     virtio-blk: add x-data-plane=on|off performance feature
>
>
> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
> x-data-plane on and 90K IOPS with x-data-plane off.
>
> Any ideas?

Hi Asias, I'll try this latest patchset next week and will report results
as soon as they are available.  I am currently on vacation this week due
to the Thanksgiving holiday in the US. (Previously, I was able to get
up to 1.33 million IOPS for a single KVM guest using Stefan's previous
code.)

Thanks,
-Khoa

>
> Command line I used:
>
> IMG=/dev/ram0
> x86_64-softmmu/qemu-system-x86_64 \
> -drive file=/root/img/sid.img,if=ide \
> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>
>
> >
> > Limitations:
> >  * Only format=raw is supported
> >  * Live migration is not supported
> >  * Block jobs, hot unplug, and other operations fail with -EBUSY
> >  * I/O throttling limits are ignored
> >  * Only Linux hosts are supported due to Linux AIO usage
> >
> > The code has reached a stage where I feel it is ready to merge.  Users
have
> > been playing with it for some time and want the significant
> performance boost.
> >
> > We are refactoring QEMU to get rid of the global mutex.  I believe that
> > virtio-blk-data-plane can eventually become the default mode of
operation.
> >
> > Instead of waiting for global mutex removal efforts to finish, I want
to use
> > virtio-blk-data-plane as an example device for AioContext and threaded
hw
> > dispatch refactoring.  This means:
> >
> > 1. When the block layer can bind to an AioContext and execute I/O
> outside the
> >    global mutex, virtio-blk-data-plane can use this (and gain image
format
> >    support).
> >
> > 2. When hw dispatch no longer needs the global mutex we can use
hw/virtio.c
> >    again and perhaps run a pool of iothreads instead of dedicated data
plane
> >    threads.
> >
> > But in the meantime, I have cleaned up the virtio-blk-data-plane
> code so that
> > it can be merged as an experimental feature.
> >
> > Changes from the RFC v9:
> >  * Add x-data-plane=on|off option and coexist with regular virtio-blk
code
> >  * Create thread from BH so it inherits iothread cpusets
> >  * Drain requests on vm_stop() so stopped guest does not access image
file
> >  * Add migration blocker
> >  * Add bdrv_in_use() to prevent block jobs and other operations
> that can interfere
> >  * Drop IOQueue request merging for simplicity
> >  * Drop ioctl interrupt injection and always use irqfd for simplicity
> >  * Major cleanup to split up source files
> >  * Rebase from qemu-kvm.git onto qemu.git
> >  * Address Michael Tsirkin's review comments
> >
> > Stefan Hajnoczi (7):
> >   raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
> >   configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
> >   dataplane: add virtqueue vring code
> >   dataplane: add event loop
> >   dataplane: add Linux AIO request queue
> >   dataplane: add virtio-blk data plane code
> >   virtio-blk: add x-data-plane=on|off performance feature
> >
> >  block.h                    |   9 +
> >  block/raw-posix.c          |  34 ++++
> >  configure                  |  21 +++
> >  hw/Makefile.objs           |   2 +-
> >  hw/dataplane/Makefile.objs |   3 +
> >  hw/dataplane/event-poll.c  | 109 ++++++++++++
> >  hw/dataplane/event-poll.h  |  40 +++++
> >  hw/dataplane/ioq.c         | 118 +++++++++++++
> >  hw/dataplane/ioq.h         |  57 +++++++
> >  hw/dataplane/virtio-blk.c  | 414 ++++++++++++++++++++++++++++++++
> +++++++++++++
> >  hw/dataplane/virtio-blk.h  |  41 +++++
> >  hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++
> >  hw/dataplane/vring.h       |  54 ++++++
> >  hw/virtio-blk.c            |  59 ++++++-
> >  hw/virtio-blk.h            |   1 +
> >  hw/virtio-pci.c            |   3 +
> >  trace-events               |   9 +
> >  17 files changed, 1293 insertions(+), 2 deletions(-)
> >  create mode 100644 hw/dataplane/Makefile.objs
> >  create mode 100644 hw/dataplane/event-poll.c
> >  create mode 100644 hw/dataplane/event-poll.h
> >  create mode 100644 hw/dataplane/ioq.c
> >  create mode 100644 hw/dataplane/ioq.h
> >  create mode 100644 hw/dataplane/virtio-blk.c
> >  create mode 100644 hw/dataplane/virtio-blk.h
> >  create mode 100644 hw/dataplane/vring.c
> >  create mode 100644 hw/dataplane/vring.h
> >
>
>
> --
> Asias
>



* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-20 12:21   ` Stefan Hajnoczi
  2012-11-20 12:25     ` Stefan Hajnoczi
@ 2012-11-21  5:22     ` Asias He
  2012-11-22 12:16       ` Stefan Hajnoczi
  1 sibling, 1 reply; 38+ messages in thread
From: Asias He @ 2012-11-21  5:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On 11/20/2012 08:21 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
>> Hello Stefan,
>>
>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>>> enables a high performance I/O codepath.  A dedicated thread is used to process
>>> virtio-blk requests outside the global mutex and without going through the QEMU
>>> block layer.
>>>
>>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
>>> IOPS for a single VM using virtio-blk-data-plane in July:
>>>
>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>
>>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
>>> Conference 2010.  The following slides contain a brief overview:
>>>
>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>
>>> The basic approach is:
>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>    signalling when the guest kicks the virtqueue.
>>> 2. Requests are processed without going through the QEMU block layer using
>>>    Linux AIO directly.
>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>>
>>> To try it out:
>>>
>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>
>>
>> Is this the latest dataplane bits:
>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>
>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>
>>     virtio-blk: add x-data-plane=on|off performance feature
>>
>>
>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>> x-data-plane on and 90K IOPS with x-data-plane off.
>>
>> Any ideas?
>>
>> Command line I used:
>>
>> IMG=/dev/ram0
>> x86_64-softmmu/qemu-system-x86_64 \
>> -drive file=/root/img/sid.img,if=ide \
>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
> 
> Was just about to send out the latest patch series which addresses
> review comments, so I have tested the latest code
> (61b70fef489ce51ecd18d69afb9622c110b9315c).
> 
> I was unable to reproduce a ramdisk performance regression on Linux
> 3.6.6-3.fc18.x86_64 with Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz with
> 8 GB RAM.

I am using the latest upstream kernel.

> The ramdisk is 4 GB and I used your QEMU command-line with a RHEL 6.3 guest.
> 
> Summary results:
> x-data-plane-on: iops=132856 aggrb=1039.1MB/s
> x-data-plane-off: iops=126236 aggrb=988.40MB/s
> 
> virtio-blk-data-plane is ~5% faster in this benchmark.
> 
> fio jobfile:
> [global]
> filename=/dev/vda
> blocksize=8k
> ioengine=libaio
> direct=1
> iodepth=8
> runtime=120
> time_based=1
> 
> [reads]
> readwrite=randread
> numjobs=4
> 
> Perf top (data-plane-on):
>   3.71%  [kvm]               [k] kvm_arch_vcpu_ioctl_run
>   3.27%  [kernel]            [k] memset    <--- ramdisk
>   2.98%  [kernel]            [k] do_blockdev_direct_IO
>   2.82%  [kvm_intel]         [k] vmx_vcpu_run
>   2.66%  [kernel]            [k] _raw_spin_lock_irqsave
>   2.06%  [kernel]            [k] put_compound_page
>   2.06%  [kernel]            [k] __get_page_tail
>   1.83%  [i915]              [k] __gen6_gt_force_wake_mt_get
>   1.75%  [kernel]            [k] _raw_spin_unlock_irqrestore
>   1.33%  qemu-system-x86_64  [.] vring_pop <--- virtio-blk-data-plane
>   1.19%  [kernel]            [k] compound_unlock_irqrestore
>   1.13%  [kernel]            [k] gup_huge_pmd
>   1.11%  [kernel]            [k] __audit_syscall_exit
>   1.07%  [kernel]            [k] put_page_testzero
>   1.01%  [kernel]            [k] fget
>   1.01%  [kernel]            [k] do_io_submit
> 
> Since the ramdisk (memset and page-related functions) is so prominent
> in perf top, I also tried a 1-job 8k dd sequential write test on a
> Samsung 830 Series SSD where virtio-blk-data-plane was 9% faster than
> virtio-blk.  Optimizing against ramdisk isn't a good idea IMO because
> it acts very differently from real hardware where the driver relies on
> mmio, DMA, and interrupts (vs synchronous memcpy/memset).

For the memset in the ramdisk, you can simply patch drivers/block/brd.c
to turn that memset into a no-op for testing.
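
One way to rebuild and reload just that driver after such an edit, as a
rough sketch (it assumes an in-tree kernel build with brd configured as
a module; the source path and the 4 GB size are only examples):

  cd /path/to/linux-src                  # kernel source tree (example path)
  make drivers/block/brd.ko              # single-target kbuild rebuild
  rmmod brd
  insmod drivers/block/brd.ko rd_nr=1 rd_size=4194304   # recreate a 4 GB /dev/ram0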

Yes, if you have a fast SSD (sometimes you need several, which I do
not have), it makes more sense to test on real hardware. However, a
ramdisk test is still useful: it gives rough performance numbers, and
if A and B are both tested against the same ramdisk, the difference
between A and B is still meaningful.


> Full results:
> $ cat data-plane-off
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> ...
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> fio 1.57
> Starting 4 processes
> 
> reads: (groupid=0, jobs=1): err= 0: pid=1851
>   read : io=29408MB, bw=250945KB/s, iops=31368 , runt=120001msec
>     slat (usec): min=2 , max=27829 , avg=11.06, stdev=78.05
>     clat (usec): min=1 , max=28028 , avg=241.41, stdev=388.47
>      lat (usec): min=33 , max=28035 , avg=253.17, stdev=396.66
>     bw (KB/s) : min=197141, max=335365, per=24.78%, avg=250797.02,
> stdev=29376.35
>   cpu          : usr=6.55%, sys=31.34%, ctx=310932, majf=0, minf=41
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3764202/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 4=0.01%, 20=0.01%, 50=1.78%, 100=27.11%
>      lat (usec): 250=38.97%, 500=27.11%, 750=2.09%, 1000=0.71%
>      lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%, 50=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1852
>   read : io=29742MB, bw=253798KB/s, iops=31724 , runt=120001msec
>     slat (usec): min=2 , max=17007 , avg=10.61, stdev=67.51
>     clat (usec): min=1 , max=41531 , avg=239.00, stdev=379.03
>      lat (usec): min=32 , max=41547 , avg=250.33, stdev=385.21
>     bw (KB/s) : min=194336, max=347497, per=25.02%, avg=253204.25,
> stdev=31172.37
>   cpu          : usr=6.66%, sys=32.58%, ctx=327250, majf=0, minf=41
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3806999/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 20=0.01%, 50=1.54%, 100=26.45%, 250=40.04%
>      lat (usec): 500=27.15%, 750=1.95%, 1000=0.71%
>      lat (msec): 2=1.29%, 4=0.68%, 10=0.18%, 20=0.01%, 50=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1853
>   read : io=29859MB, bw=254797KB/s, iops=31849 , runt=120001msec
>     slat (usec): min=2 , max=16821 , avg=11.35, stdev=76.54
>     clat (usec): min=1 , max=17659 , avg=237.25, stdev=375.31
>      lat (usec): min=31 , max=17673 , avg=249.27, stdev=383.62
>     bw (KB/s) : min=194864, max=345280, per=25.15%, avg=254534.63,
> stdev=30549.32
>   cpu          : usr=6.52%, sys=31.84%, ctx=303763, majf=0, minf=39
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3821989/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 10=0.01%, 20=0.01%, 50=2.09%, 100=29.19%
>      lat (usec): 250=37.31%, 500=26.41%, 750=2.08%, 1000=0.71%
>      lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1854
>   read : io=29598MB, bw=252565KB/s, iops=31570 , runt=120001msec
>     slat (usec): min=2 , max=26413 , avg=11.21, stdev=78.32
>     clat (usec): min=16 , max=27993 , avg=239.56, stdev=381.67
>      lat (usec): min=34 , max=28006 , avg=251.49, stdev=390.13
>     bw (KB/s) : min=194256, max=369424, per=24.94%, avg=252462.86,
> stdev=29420.58
>   cpu          : usr=6.57%, sys=31.33%, ctx=305623, majf=0, minf=41
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3788507/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.13%, 100=28.30%, 250=37.74%, 500=26.66%
>      lat (usec): 750=2.17%, 1000=0.75%
>      lat (msec): 2=1.35%, 4=0.70%, 10=0.19%, 20=0.01%, 50=0.01%
> 
> Run status group 0 (all jobs):
>    READ: io=118607MB, aggrb=988.40MB/s, minb=256967KB/s,
> maxb=260912KB/s, mint=120001msec, maxt=120001msec
> 
> Disk stats (read/write):
>   vda: ios=15148328/0, merge=0/0, ticks=1550570/0, in_queue=1536232, util=96.56%
> 
> $ cat data-plane-on
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> ...
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> fio 1.57
> Starting 4 processes
> 
> reads: (groupid=0, jobs=1): err= 0: pid=1796
>   read : io=32081MB, bw=273759KB/s, iops=34219 , runt=120001msec
>     slat (usec): min=1 , max=20404 , avg=21.08, stdev=125.49
>     clat (usec): min=10 , max=135743 , avg=207.62, stdev=532.90
>      lat (usec): min=21 , max=136055 , avg=229.60, stdev=556.82
>     bw (KB/s) : min=56480, max=951952, per=25.49%, avg=271488.81,
> stdev=149773.57
>   cpu          : usr=7.01%, sys=43.26%, ctx=336854, majf=0, minf=41
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=4106413/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.46%, 100=61.13%, 250=21.58%, 500=3.11%
>      lat (usec): 750=3.04%, 1000=3.88%
>      lat (msec): 2=4.50%, 4=0.13%, 10=0.11%, 20=0.06%, 50=0.01%
>      lat (msec): 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1797
>   read : io=30104MB, bw=256888KB/s, iops=32110 , runt=120001msec
>     slat (usec): min=1 , max=17595 , avg=22.20, stdev=120.29
>     clat (usec): min=13 , max=136264 , avg=221.21, stdev=528.19
>      lat (usec): min=22 , max=136280 , avg=244.35, stdev=551.73
>     bw (KB/s) : min=57312, max=838880, per=23.93%, avg=254798.51,
> stdev=139546.57
>   cpu          : usr=6.82%, sys=41.87%, ctx=360348, majf=0, minf=41
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3853351/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.10%, 100=58.47%, 250=22.38%, 500=3.68%
>      lat (usec): 750=3.69%, 1000=4.52%
>      lat (msec): 2=4.87%, 4=0.14%, 10=0.11%, 20=0.05%, 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1798
>   read : io=31698MB, bw=270487KB/s, iops=33810 , runt=120001msec
>     slat (usec): min=1 , max=17457 , avg=20.93, stdev=125.33
>     clat (usec): min=16 , max=134663 , avg=210.19, stdev=535.77
>      lat (usec): min=21 , max=134671 , avg=232.02, stdev=559.27
>     bw (KB/s) : min=57248, max=841952, per=25.29%, avg=269330.21,
> stdev=148661.08
>   cpu          : usr=6.92%, sys=42.81%, ctx=337799, majf=0, minf=39
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=4057340/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=1.98%, 100=62.00%, 250=20.70%, 500=3.22%
>      lat (usec): 750=3.23%, 1000=4.16%
>      lat (msec): 2=4.41%, 4=0.13%, 10=0.10%, 20=0.06%, 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1799
>   read : io=30913MB, bw=263789KB/s, iops=32973 , runt=120000msec
>     slat (usec): min=1 , max=17565 , avg=21.52, stdev=120.17
>     clat (usec): min=15 , max=136064 , avg=215.53, stdev=529.56
>      lat (usec): min=27 , max=136070 , avg=237.99, stdev=552.50
>     bw (KB/s) : min=57632, max=900896, per=24.74%, avg=263431.57,
> stdev=148379.15
>   cpu          : usr=6.90%, sys=42.56%, ctx=348217, majf=0, minf=41
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3956830/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=1.76%, 100=59.96%, 250=22.21%, 500=3.45%
>      lat (usec): 750=3.35%, 1000=4.33%
>      lat (msec): 2=4.65%, 4=0.13%, 10=0.11%, 20=0.05%, 250=0.01%
> 
> Run status group 0 (all jobs):
>    READ: io=124796MB, aggrb=1039.1MB/s, minb=263053KB/s,
> maxb=280328KB/s, mint=120000msec, maxt=120001msec
> 
> Disk stats (read/write):
>   vda: ios=15942789/0, merge=0/0, ticks=336240/0, in_queue=317832, util=97.47%
> 


-- 
Asias


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-20 15:03   ` Khoa Huynh
@ 2012-11-21  5:22     ` Asias He
  0 siblings, 0 replies; 38+ messages in thread
From: Asias He @ 2012-11-21  5:22 UTC (permalink / raw)
  To: Khoa Huynh
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On 11/20/2012 11:03 PM, Khoa Huynh wrote:
> Asias He <asias@redhat.com> wrote on 11/20/2012 03:02:07 AM:
> 
>> From: Asias He <asias@redhat.com>
>> To: Stefan Hajnoczi <stefanha@redhat.com>,
>> Cc: qemu-devel@nongnu.org, Anthony Liguori/Austin/IBM@IBMUS, Paolo
>> Bonzini <pbonzini@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
>> "Michael S. Tsirkin" <mst@redhat.com>, Khoa Huynh/Austin/IBM@IBMUS
>> Date: 11/20/2012 03:01 AM
>> Subject: Re: [PATCH 0/7] virtio: virtio-blk data plane
>>
>> Hello Stefan,
>>
>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>> > This series adds the -device virtio-blk-pci,x-data-plane=on property
> that
>> > enables a high performance I/O codepath.  A dedicated thread is
>> used to process
>> > virtio-blk requests outside the global mutex and without going
>> through the QEMU
>> > block layer.
>> >
>> > Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000
>> IOPS to 600,000
>> > IOPS for a single VM using virtio-blk-data-plane in July:
>> >
>> >   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>> >
>> > The virtio-blk-data-plane approach was originally presented at
>> Linux Plumbers
>> > Conference 2010.  The following slides contain a brief overview:
>> >
>> >   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/
>> original/Optimizing_the_QEMU_Storage_Stack.pdf
>> >
>> > The basic approach is:
>> > 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>> >    signalling when the guest kicks the virtqueue.
>> > 2. Requests are processed without going through the QEMU block layer
> using
>> >    Linux AIO directly.
>> > 3. Completion interrupts are injected via irqfd from the dedicated
> thread.
>> >
>> > To try it out:
>> >
>> >   qemu -drive
> if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>> >        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>
>>
>> Is this the latest dataplane bits:
>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>
>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>
>>     virtio-blk: add x-data-plane=on|off performance feature
>>
>>
>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>> x-data-plane on and 90K IOPS with x-data-plane off.
>>
>> Any ideas?
> 
> Hi Asias, I'll try this latest patchset next week and will report results
> as soon as they are available.  I am currently on vacation this week due
> to the Thanksgiving holiday in the US. (Previously, I was able to get
> up to 1.33 million IOPS for a single KVM guest using Stefan's previous
> code.)

Hi Khoa, thanks! Yes, I saw that number in your slides.

> 
> Thanks,
> -Khoa
> 
>>
>> Command line I used:
>>
>> IMG=/dev/ram0
>> x86_64-softmmu/qemu-system-x86_64 \
>> -drive file=/root/img/sid.img,if=ide \
>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>>
>>
>> >
>> > Limitations:
>> >  * Only format=raw is supported
>> >  * Live migration is not supported
>> >  * Block jobs, hot unplug, and other operations fail with -EBUSY
>> >  * I/O throttling limits are ignored
>> >  * Only Linux hosts are supported due to Linux AIO usage
>> >
>> > The code has reached a stage where I feel it is ready to merge.
>  Users have
>> > been playing with it for some time and want the significant
>> performance boost.
>> >
>> > We are refactoring QEMU to get rid of the global mutex.  I believe that
>> > virtio-blk-data-plane can eventually become the default mode of
> operation.
>> >
>> > Instead of waiting for global mutex removal efforts to finish, I
> want to use
>> > virtio-blk-data-plane as an example device for AioContext and
> threaded hw
>> > dispatch refactoring.  This means:
>> >
>> > 1. When the block layer can bind to an AioContext and execute I/O
>> outside the
>> >    global mutex, virtio-blk-data-plane can use this (and gain image
> format
>> >    support).
>> >
>> > 2. When hw dispatch no longer needs the global mutex we can use
> hw/virtio.c
>> >    again and perhaps run a pool of iothreads instead of dedicated
> data plane
>> >    threads.
>> >
>> > But in the meantime, I have cleaned up the virtio-blk-data-plane
>> code so that
>> > it can be merged as an experimental feature.
>> >
>> > Changes from the RFC v9:
>> >  * Add x-data-plane=on|off option and coexist with regular
> virtio-blk code
>> >  * Create thread from BH so it inherits iothread cpusets
>> >  * Drain requests on vm_stop() so stopped guest does not access
> image file
>> >  * Add migration blocker
>> >  * Add bdrv_in_use() to prevent block jobs and other operations
>> that can interfere
>> >  * Drop IOQueue request merging for simplicity
>> >  * Drop ioctl interrupt injection and always use irqfd for simplicity
>> >  * Major cleanup to split up source files
>> >  * Rebase from qemu-kvm.git onto qemu.git
>> >  * Address Michael Tsirkin's review comments
>> >
>> > Stefan Hajnoczi (7):
>> >   raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane
>> >   configure: add CONFIG_VIRTIO_BLK_DATA_PLANE
>> >   dataplane: add virtqueue vring code
>> >   dataplane: add event loop
>> >   dataplane: add Linux AIO request queue
>> >   dataplane: add virtio-blk data plane code
>> >   virtio-blk: add x-data-plane=on|off performance feature
>> >
>> >  block.h                    |   9 +
>> >  block/raw-posix.c          |  34 ++++
>> >  configure                  |  21 +++
>> >  hw/Makefile.objs           |   2 +-
>> >  hw/dataplane/Makefile.objs |   3 +
>> >  hw/dataplane/event-poll.c  | 109 ++++++++++++
>> >  hw/dataplane/event-poll.h  |  40 +++++
>> >  hw/dataplane/ioq.c         | 118 +++++++++++++
>> >  hw/dataplane/ioq.h         |  57 +++++++
>> >  hw/dataplane/virtio-blk.c  | 414 ++++++++++++++++++++++++++++++++
>> +++++++++++++
>> >  hw/dataplane/virtio-blk.h  |  41 +++++
>> >  hw/dataplane/vring.c       | 321 +++++++++++++++++++++++++++++++++++
>> >  hw/dataplane/vring.h       |  54 ++++++
>> >  hw/virtio-blk.c            |  59 ++++++-
>> >  hw/virtio-blk.h            |   1 +
>> >  hw/virtio-pci.c            |   3 +
>> >  trace-events               |   9 +
>> >  17 files changed, 1293 insertions(+), 2 deletions(-)
>> >  create mode 100644 hw/dataplane/Makefile.objs
>> >  create mode 100644 hw/dataplane/event-poll.c
>> >  create mode 100644 hw/dataplane/event-poll.h
>> >  create mode 100644 hw/dataplane/ioq.c
>> >  create mode 100644 hw/dataplane/ioq.h
>> >  create mode 100644 hw/dataplane/virtio-blk.c
>> >  create mode 100644 hw/dataplane/virtio-blk.h
>> >  create mode 100644 hw/dataplane/vring.c
>> >  create mode 100644 hw/dataplane/vring.h
>> >
>>
>>
>> --
>> Asias
>>
> 


-- 
Asias


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-20 12:25     ` Stefan Hajnoczi
@ 2012-11-21  5:39       ` Asias He
  2012-11-21  6:42         ` Asias He
  0 siblings, 1 reply; 38+ messages in thread
From: Asias He @ 2012-11-21  5:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On 11/20/2012 08:25 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 20, 2012 at 1:21 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
>>> Hello Stefan,
>>>
>>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>>>> enables a high performance I/O codepath.  A dedicated thread is used to process
>>>> virtio-blk requests outside the global mutex and without going through the QEMU
>>>> block layer.
>>>>
>>>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
>>>> IOPS for a single VM using virtio-blk-data-plane in July:
>>>>
>>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>>
>>>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
>>>> Conference 2010.  The following slides contain a brief overview:
>>>>
>>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>>
>>>> The basic approach is:
>>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>>    signalling when the guest kicks the virtqueue.
>>>> 2. Requests are processed without going through the QEMU block layer using
>>>>    Linux AIO directly.
>>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>>>
>>>> To try it out:
>>>>
>>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>>
>>>
>>> Is this the latest dataplane bits:
>>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>>
>>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>>
>>>     virtio-blk: add x-data-plane=on|off performance feature
>>>
>>>
>>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>>> x-data-plane on and 90K IOPS with x-data-plane off.
>>>
>>> Any ideas?
>>>
>>> Command line I used:
>>>
>>> IMG=/dev/ram0
>>> x86_64-softmmu/qemu-system-x86_64 \
>>> -drive file=/root/img/sid.img,if=ide \
>>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
>>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
>>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>>
>> Was just about to send out the latest patch series which addresses
>> review comments, so I have tested the latest code
>> (61b70fef489ce51ecd18d69afb9622c110b9315c).
> 
> Rebased onto qemu.git/master before sending out.  The commit ID is now:
> cf6ed6406543ecc43895012a9ac9665e3753d5e8
> 
> https://github.com/stefanha/qemu/commits/virtio-blk-data-plane
> 
> Stefan

Ok, thanks. /me trying

-- 
Asias


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-21  5:39       ` Asias He
@ 2012-11-21  6:42         ` Asias He
  2012-11-21  6:44           ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Asias He @ 2012-11-21  6:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On 11/21/2012 01:39 PM, Asias He wrote:
> On 11/20/2012 08:25 PM, Stefan Hajnoczi wrote:
>> On Tue, Nov 20, 2012 at 1:21 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
>>>> Hello Stefan,
>>>>
>>>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>>>>> enables a high performance I/O codepath.  A dedicated thread is used to process
>>>>> virtio-blk requests outside the global mutex and without going through the QEMU
>>>>> block layer.
>>>>>
>>>>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
>>>>> IOPS for a single VM using virtio-blk-data-plane in July:
>>>>>
>>>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>>>
>>>>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
>>>>> Conference 2010.  The following slides contain a brief overview:
>>>>>
>>>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>>>
>>>>> The basic approach is:
>>>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>>>    signalling when the guest kicks the virtqueue.
>>>>> 2. Requests are processed without going through the QEMU block layer using
>>>>>    Linux AIO directly.
>>>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>>>>
>>>>> To try it out:
>>>>>
>>>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>>>
>>>>
>>>> Is this the latest dataplane bits:
>>>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>>>
>>>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>>>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>>>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>>>
>>>>     virtio-blk: add x-data-plane=on|off performance feature
>>>>
>>>>
>>>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>>>> x-data-plane on and 90K IOPS with x-data-plane off.
>>>>
>>>> Any ideas?
>>>>
>>>> Command line I used:
>>>>
>>>> IMG=/dev/ram0
>>>> x86_64-softmmu/qemu-system-x86_64 \
>>>> -drive file=/root/img/sid.img,if=ide \
>>>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
>>>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>>>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>>>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
>>>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>>>
>>> Was just about to send out the latest patch series which addresses
>>> review comments, so I have tested the latest code
>>> (61b70fef489ce51ecd18d69afb9622c110b9315c).
>>
>> Rebased onto qemu.git/master before sending out.  The commit ID is now:
>> cf6ed6406543ecc43895012a9ac9665e3753d5e8
>>
>> https://github.com/stefanha/qemu/commits/virtio-blk-data-plane
>>
>> Stefan
> 
> Ok, thanks. /me trying

Hi Stefan,

If I enable merging in the guest, the IOPS for seq read/write goes up
to ~400K/300K. If I disable merging in the guest, the IOPS drops to
~17K/24K for seq read/write (which is similar to the result I posted
yesterday, with merging disabled). Could you please also share the
numbers for rand read and write in your setup?


1.(With merge enabled in guest + dataplane on)
echo noop > /sys/block/vda/queue/scheduler
echo 0 > /sys/block/vda/queue/nomerges
-------------------------------------
  read : io=0 B, bw=1575.2MB/s, iops=403453 , runt= 10396msec
  write: io=0 B, bw=1224.1MB/s, iops=313592 , runt= 13375msec
  read : io=0 B, bw=99534KB/s, iops=24883 , runt=168558msec
  write: io=0 B, bw=197695KB/s, iops=49423 , runt= 84864msec
    clat (usec): min=102 , max=395042 , avg=2167.18, stdev=4307.91
    clat (usec): min=89 , max=636874 , avg=2728.93, stdev=6777.71
    clat (usec): min=285 , max=4737.1K, avg=39939.42, stdev=87137.41
    clat (usec): min=74 , max=1408.1K, avg=18575.50, stdev=47752.13
  cpu          : usr=79.16%, sys=261.43%, ctx=1837824, majf=0, minf=58
  cpu          : usr=73.61%, sys=251.80%, ctx=1585892, majf=0, minf=7
  cpu          : usr=17.03%, sys=71.03%, ctx=6427788, majf=0, minf=6
  cpu          : usr=25.51%, sys=88.46%, ctx=5117624, majf=0, minf=1
  vda: ios=5037311/5303500, merge=3351489/3084584, ticks=30674372/15110818, in_queue=45816533, util=97.50%
 41:      40099      40163      40110      40157   PCI-MSI-edge virtio0-requests
 41: interrupt in total: 160529
fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting \
    --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16 \
    --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0 \
    --filename=/dev/vda --name=seq-read --stonewall --rw=read \
    --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall \
    --rw=randread --name=rnd-write --stonewall --rw=randwrite

2.(With merge disabled in guest + dataplane on)
echo noop > /sys/block/vda/queue/scheduler
echo 2 > /sys/block/vda/queue/nomerges
-------------------------------------
  read : io=0 B, bw=69185KB/s, iops=17296 , runt=242497msec
  write: io=0 B, bw=96219KB/s, iops=24054 , runt=174365msec
  read : io=0 B, bw=90866KB/s, iops=22716 , runt=184637msec
  write: io=0 B, bw=202018KB/s, iops=50504 , runt= 83048msec
    clat (usec): min=98 , max=1719.7K, avg=57623.32, stdev=84730.86
    clat (usec): min=0 , max=1372.8K, avg=41286.47, stdev=73252.45
    clat (usec): min=82 , max=1308.8K, avg=43828.82, stdev=73483.73
    clat (usec): min=0 , max=1239.6K, avg=18445.77, stdev=49099.15
  cpu          : usr=10.12%, sys=72.64%, ctx=7942850, majf=0, minf=62
  cpu          : usr=14.72%, sys=78.99%, ctx=7358973, majf=0, minf=3
  cpu          : usr=16.34%, sys=72.60%, ctx=7394674, majf=0, minf=5
  cpu          : usr=29.69%, sys=83.41%, ctx=5262809, majf=0, minf=4
  vda: ios=8389288/8388552, merge=0/0, ticks=76013872/44425860, in_queue=120493774, util=99.58%
 41:      89414      89456      89504      89534   PCI-MSI-edge virtio0-requests
 41: interrupt in total: 357908
fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting \
    --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16 \
    --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0 \
    --filename=/dev/vda --name=seq-read --stonewall --rw=read \
    --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall \
    --rw=randread --name=rnd-write --stonewall --rw=randwrite

3.(With merge enabled in guest + dataplane off)
-------------------------------------
  read : io=0 B, bw=810220KB/s, iops=202554 , runt= 20707msec
  write: io=0 B, bw=999.97MB/s, iops=255984 , runt= 16385msec
  read : io=0 B, bw=338066KB/s, iops=84516 , runt= 49627msec
  write: io=0 B, bw=455420KB/s, iops=113854 , runt= 36839msec
    clat (usec): min=0 , max=26340 , avg=5019.36, stdev=2185.04
    clat (usec): min=58 , max=21572 , avg=3972.33, stdev=1708.00
    clat (usec): min=34 , max=90185 , avg=11879.72, stdev=8054.58
    clat (usec): min=189 , max=122825 , avg=8822.87, stdev=4608.65
  cpu          : usr=44.89%, sys=141.08%, ctx=1611401, majf=0, minf=54
  cpu          : usr=58.46%, sys=177.50%, ctx=1582260, majf=0, minf=0
  cpu          : usr=21.09%, sys=63.61%, ctx=7609871, majf=0, minf=2
  cpu          : usr=28.88%, sys=73.51%, ctx=8140689, majf=0, minf=10
  vda: ios=5222932/5209798, merge=3164880/3163834, ticks=10133305/7618081, in_queue=17773509, util=99.46%
 41:     286378     284870     285478     285759   PCI-MSI-edge virtio0-requests
 41: interrupt in total: 1142485
fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting \
    --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16 \
    --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0 \
    --filename=/dev/vda --name=seq-read --stonewall --rw=read \
    --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall \
    --rw=randread --name=rnd-write --stonewall --rw=randwrite


4.(With merge disabled in guest + dataplane off)
-------------------------------------
  read : io=0 B, bw=331147KB/s, iops=82786 , runt= 50664msec
  write: io=0 B, bw=431802KB/s, iops=107950 , runt= 38854msec
  read : io=0 B, bw=376424KB/s, iops=94105 , runt= 44570msec
  write: io=0 B, bw=373200KB/s, iops=93300 , runt= 44955msec
    clat (usec): min=84 , max=99635 , avg=12075.27, stdev=7899.12
    clat (usec): min=97 , max=84882 , avg=9289.49, stdev=5688.96
    clat (usec): min=56 , max=76176 , avg=10662.84, stdev=6375.63
    clat (usec): min=158 , max=112216 , avg=10762.62, stdev=6412.89
  cpu          : usr=15.16%, sys=59.67%, ctx=8007217, majf=0, minf=62
  cpu          : usr=26.45%, sys=99.42%, ctx=7477108, majf=0, minf=8
  cpu          : usr=29.52%, sys=85.40%, ctx=7889672, majf=0, minf=2
  cpu          : usr=30.71%, sys=86.54%, ctx=7893782, majf=0, minf=2
  vda: ios=8389195/8374171, merge=0/0, ticks=15202863/12550941, in_queue=27840167, util=99.63%
 41:     179902     179182     179232     179539   PCI-MSI-edge virtio0-requests
 41: interrupt in total: 717855
fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting \
    --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16 \
    --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0 \
    --filename=/dev/vda --name=seq-read --stonewall --rw=read \
    --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall \
    --rw=randread --name=rnd-write --stonewall --rw=randwrite


-- 
Asias


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-21  6:42         ` Asias He
@ 2012-11-21  6:44           ` Stefan Hajnoczi
  2012-11-21  7:00             ` Asias He
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-21  6:44 UTC (permalink / raw)
  To: Asias He
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On Wed, Nov 21, 2012 at 7:42 AM, Asias He <asias@redhat.com> wrote:
> On 11/21/2012 01:39 PM, Asias He wrote:
>> On 11/20/2012 08:25 PM, Stefan Hajnoczi wrote:
>>> On Tue, Nov 20, 2012 at 1:21 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
>>>>> Hello Stefan,
>>>>>
>>>>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>>>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>>>>>> enables a high performance I/O codepath.  A dedicated thread is used to process
>>>>>> virtio-blk requests outside the global mutex and without going through the QEMU
>>>>>> block layer.
>>>>>>
>>>>>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
>>>>>> IOPS for a single VM using virtio-blk-data-plane in July:
>>>>>>
>>>>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>>>>
>>>>>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
>>>>>> Conference 2010.  The following slides contain a brief overview:
>>>>>>
>>>>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>>>>
>>>>>> The basic approach is:
>>>>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>>>>    signalling when the guest kicks the virtqueue.
>>>>>> 2. Requests are processed without going through the QEMU block layer using
>>>>>>    Linux AIO directly.
>>>>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>>>>>
>>>>>> To try it out:
>>>>>>
>>>>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>>>>
>>>>>
>>>>> Is this the latest dataplane bits:
>>>>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>>>>
>>>>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>>>>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>>>>
>>>>>     virtio-blk: add x-data-plane=on|off performance feature
>>>>>
>>>>>
>>>>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>>>>> x-data-plane on and 90K IOPS with x-data-plane off.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Command line I used:
>>>>>
>>>>> IMG=/dev/ram0
>>>>> x86_64-softmmu/qemu-system-x86_64 \
>>>>> -drive file=/root/img/sid.img,if=ide \
>>>>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
>>>>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>>>>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>>>>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
>>>>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>>>>
>>>> Was just about to send out the latest patch series which addresses
>>>> review comments, so I have tested the latest code
>>>> (61b70fef489ce51ecd18d69afb9622c110b9315c).
>>>
>>> Rebased onto qemu.git/master before sending out.  The commit ID is now:
>>> cf6ed6406543ecc43895012a9ac9665e3753d5e8
>>>
>>> https://github.com/stefanha/qemu/commits/virtio-blk-data-plane
>>>
>>> Stefan
>>
>> Ok, thanks. /me trying
>
> Hi Stefan,
>
> If I enable merging in the guest, the IOPS for seq read/write goes up
> to ~400K/300K. If I disable merging in the guest, the IOPS drops to
> ~17K/24K for seq read/write (which is similar to the result I posted
> yesterday, with merging disabled). Could you please also share the
> numbers for rand read and write in your setup?

Thanks for running the test.  Please send your rand read/write fio
jobfile so I can run the exact same test.

BTW I was running the default F18 (host) and RHEL 6.3 (guest) I/O
schedulers in my test yesterday.
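
For reference, a quick way to check and pin the scheduler from inside
the guest (the vda device name is just an example for the virtio disk):

  cat /sys/block/vda/queue/scheduler           # the active scheduler is shown in [brackets]
  echo noop > /sys/block/vda/queue/scheduler   # pin noop, as the test runs in this thread do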

Stefan


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-21  6:44           ` Stefan Hajnoczi
@ 2012-11-21  7:00             ` Asias He
  2012-11-22 12:12               ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Asias He @ 2012-11-21  7:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, qemu-devel,
	Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On 11/21/2012 02:44 PM, Stefan Hajnoczi wrote:
> On Wed, Nov 21, 2012 at 7:42 AM, Asias He <asias@redhat.com> wrote:
>> On 11/21/2012 01:39 PM, Asias He wrote:
>>> On 11/20/2012 08:25 PM, Stefan Hajnoczi wrote:
>>>> On Tue, Nov 20, 2012 at 1:21 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
>>>>>> Hello Stefan,
>>>>>>
>>>>>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>>>>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>>>>>>> enables a high performance I/O codepath.  A dedicated thread is used to process
>>>>>>> virtio-blk requests outside the global mutex and without going through the QEMU
>>>>>>> block layer.
>>>>>>>
>>>>>>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
>>>>>>> IOPS for a single VM using virtio-blk-data-plane in July:
>>>>>>>
>>>>>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>>>>>
>>>>>>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
>>>>>>> Conference 2010.  The following slides contain a brief overview:
>>>>>>>
>>>>>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>>>>>
>>>>>>> The basic approach is:
>>>>>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>>>>>    signalling when the guest kicks the virtqueue.
>>>>>>> 2. Requests are processed without going through the QEMU block layer using
>>>>>>>    Linux AIO directly.
>>>>>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>>>>>>
>>>>>>> To try it out:
>>>>>>>
>>>>>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>>>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>>>>>
>>>>>>
>>>>>> Is this the latest dataplane bits:
>>>>>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>>>>>
>>>>>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>>>>>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>>>>>
>>>>>>     virtio-blk: add x-data-plane=on|off performance feature
>>>>>>
>>>>>>
>>>>>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>>>>>> x-data-plane on and 90K IOPS with x-data-plane off.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Command line I used:
>>>>>>
>>>>>> IMG=/dev/ram0
>>>>>> x86_64-softmmu/qemu-system-x86_64 \
>>>>>> -drive file=/root/img/sid.img,if=ide \
>>>>>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
>>>>>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>>>>>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>>>>>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
>>>>>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>>>>>
>>>>> Was just about to send out the latest patch series which addresses
>>>>> review comments, so I have tested the latest code
>>>>> (61b70fef489ce51ecd18d69afb9622c110b9315c).
>>>>
>>>> Rebased onto qemu.git/master before sending out.  The commit ID is now:
>>>> cf6ed6406543ecc43895012a9ac9665e3753d5e8
>>>>
>>>> https://github.com/stefanha/qemu/commits/virtio-blk-data-plane
>>>>
>>>> Stefan
>>>
>>> Ok, thanks. /me trying
>>
>> Hi Stefan,
>>
>> If I enable merging in the guest, the IOPS for seq read/write goes up
>> to ~400K/300K. If I disable merging in the guest, the IOPS drops to
>> ~17K/24K for seq read/write (which is similar to the result I posted
>> yesterday, with merging disabled). Could you please also share the
>> numbers for rand read and write in your setup?
> 
> Thanks for running the test.  Please send your rand read/write fio
> jobfile so I can run the exact same test.
> 
> BTW I was running the default F18 (host) and RHEL 6.3 (guest) I/O
> schedulers in my test yesterday.

Sure, this is the script I used to run the test in the guest.

# cat run.sh
#!/bin/sh

cat > all.fio <<EOF
[global]
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
group_reporting
ioscheduler=noop
thread
bs=4k
size=512MB
direct=1
filename=/dev/vda
numjobs=16
ioengine=libaio
iodepth=64
loops=3
ramp_time=0
[seq-read]
stonewall
rw=read
[seq-write]
stonewall
rw=write
[rnd-read]
stonewall
rw=randread
[rnd-write]
stonewall
rw=randwrite
EOF

echo noop > /sys/block/vda/queue/scheduler
echo 2 > /sys/block/vda/queue/nomerges
echo 0 > /sys/block/vda/queue/nomerges
fio all.fio --output=f.log
echo
echo -------------------------------------
cat f.log|grep iops
cat f.log|grep clat |grep avg
cat f.log|grep cpu
cat f.log|grep ios |grep in_queue
cat /proc/interrupts |grep requests |grep virtio |grep 41\:
cat /proc/stat |grep ^intr | awk '{print " 41: interrupt in total: "$44}'
fio all.fio --showcmd
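
Roughly, it is run like this inside the guest (assuming /dev/vda is the
data-plane disk and fio with libaio support is installed in the guest):

  sh run.sh        # full fio output goes to f.log; the summary lines are printed at the end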



-- 
Asias


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-21  7:00             ` Asias He
@ 2012-11-22 12:12               ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 12:12 UTC (permalink / raw)
  To: Asias He
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Khoa Huynh, Paolo Bonzini

On Wed, Nov 21, 2012 at 03:00:25PM +0800, Asias He wrote:
> On 11/21/2012 02:44 PM, Stefan Hajnoczi wrote:
> > On Wed, Nov 21, 2012 at 7:42 AM, Asias He <asias@redhat.com> wrote:
> >> On 11/21/2012 01:39 PM, Asias He wrote:
> >>> On 11/20/2012 08:25 PM, Stefan Hajnoczi wrote:
> >>>> On Tue, Nov 20, 2012 at 1:21 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
> >>>>>> Hello Stefan,
> >>>>>>
> >>>>>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
> >>>>>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
> >>>>>>> enables a high performance I/O codepath.  A dedicated thread is used to process
> >>>>>>> virtio-blk requests outside the global mutex and without going through the QEMU
> >>>>>>> block layer.
> >>>>>>>
> >>>>>>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
> >>>>>>> IOPS for a single VM using virtio-blk-data-plane in July:
> >>>>>>>
> >>>>>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
> >>>>>>>
> >>>>>>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
> >>>>>>> Conference 2010.  The following slides contain a brief overview:
> >>>>>>>
> >>>>>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
> >>>>>>>
> >>>>>>> The basic approach is:
> >>>>>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
> >>>>>>>    signalling when the guest kicks the virtqueue.
> >>>>>>> 2. Requests are processed without going through the QEMU block layer using
> >>>>>>>    Linux AIO directly.
> >>>>>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
> >>>>>>>
> >>>>>>> To try it out:
> >>>>>>>
> >>>>>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
> >>>>>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
> >>>>>>
> >>>>>>
> >>>>>> Are these the latest dataplane bits:
> >>>>>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
> >>>>>>
> >>>>>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
> >>>>>> Author: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>> Date:   Wed Nov 14 15:45:38 2012 +0100
> >>>>>>
> >>>>>>     virtio-blk: add x-data-plane=on|off performance feature
> >>>>>>
> >>>>>>
> >>>>>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
> >>>>>> x-data-plane on and 90K IOPS with x-data-plane off.
> >>>>>>
> >>>>>> Any ideas?
> >>>>>>
> >>>>>> Command line I used:
> >>>>>>
> >>>>>> IMG=/dev/ram0
> >>>>>> x86_64-softmmu/qemu-system-x86_64 \
> >>>>>> -drive file=/root/img/sid.img,if=ide \
> >>>>>> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
> >>>>>> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
> >>>>>> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
> >>>>>> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
> >>>>>> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
> >>>>>
> >>>>> Was just about to send out the latest patch series which addresses
> >>>>> review comments, so I have tested the latest code
> >>>>> (61b70fef489ce51ecd18d69afb9622c110b9315c).
> >>>>
> >>>> Rebased onto qemu.git/master before sending out.  The commit ID is now:
> >>>> cf6ed6406543ecc43895012a9ac9665e3753d5e8
> >>>>
> >>>> https://github.com/stefanha/qemu/commits/virtio-blk-data-plane
> >>>>
> >>>> Stefan
> >>>
> >>> Ok, thanks. /me trying
> >>
> >> Hi Stefan,
> >>
> >> If I enable merging in the guest, the IOPS for seq read/write goes up
> >> to ~400K/300K. If I disable merging in the guest, the IOPS drops to
> >> ~17K/24K for seq read/write (which is similar to the result I posted
> >> yesterday, with merging disabled). Could you please also share the
> >> numbers for rand read and write in your setup?
> > 
> > Thanks for running the test.  Please send your rand read/write fio
> > jobfile so I can run the exact same test.
> > 
> > BTW I was running the default F18 (host) and RHEL 6.3 (guest) I/O
> > schedulers in my test yesterday.
> 
> Sure, this is the script I used to run the test in the guest.

Thanks, will give this a try!

Stefan


* Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
  2012-11-21  5:22     ` Asias He
@ 2012-11-22 12:16       ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 12:16 UTC (permalink / raw)
  To: Asias He
  Cc: Kevin Wolf, Anthony Liguori, Michael S. Tsirkin, Stefan Hajnoczi,
	qemu-devel, Khoa Huynh, Paolo Bonzini

On Wed, Nov 21, 2012 at 01:22:22PM +0800, Asias He wrote:
> On 11/20/2012 08:21 PM, Stefan Hajnoczi wrote:
> > On Tue, Nov 20, 2012 at 10:02 AM, Asias He <asias@redhat.com> wrote:
> >> Hello Stefan,
> >>
> >> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
> >>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
> >>> enables a high performance I/O codepath.  A dedicated thread is used to process
> >>> virtio-blk requests outside the global mutex and without going through the QEMU
> >>> block layer.
> >>>
> >>> Khoa Huynh <khoa@us.ibm.com> reported an increase from 140,000 IOPS to 600,000
> >>> IOPS for a single VM using virtio-blk-data-plane in July:
> >>>
> >>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
> >>>
> >>> The virtio-blk-data-plane approach was originally presented at Linux Plumbers
> >>> Conference 2010.  The following slides contain a brief overview:
> >>>
> >>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
> >>>
> >>> The basic approach is:
> >>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
> >>>    signalling when the guest kicks the virtqueue.
> >>> 2. Requests are processed without going through the QEMU block layer using
> >>>    Linux AIO directly.
> >>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
> >>>
> >>> To try it out:
> >>>
> >>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
> >>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
> >>
> >>
> >> Are these the latest dataplane bits:
> >> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
> >>
> >> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
> >> Author: Stefan Hajnoczi <stefanha@redhat.com>
> >> Date:   Wed Nov 14 15:45:38 2012 +0100
> >>
> >>     virtio-blk: add x-data-plane=on|off performance feature
> >>
> >>
> >> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
> >> x-data-plane on and 90K IOPS with x-data-plane off.
> >>
> >> Any ideas?
> >>
> >> Command line I used:
> >>
> >> IMG=/dev/ram0
> >> x86_64-softmmu/qemu-system-x86_64 \
> >> -drive file=/root/img/sid.img,if=ide \
> >> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device
> >> virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
> >> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
> >> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m
> >> 2048 -smp 4 -cpu qemu64,+x2apic -M pc
> > 
> > Was just about to send out the latest patch series which addresses
> > review comments, so I have tested the latest code
> > (61b70fef489ce51ecd18d69afb9622c110b9315c).
> > 
> > I was unable to reproduce a ramdisk performance regression on Linux
> > 3.6.6-3.fc18.x86_64 with Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz with
> > 8 GB RAM.
> 
> I am using the latest upstream kernel.
> 
> > The ramdisk is 4 GB and I used your QEMU command-line with a RHEL 6.3 guest.
> > 
> > Summary results:
> > x-data-plane-on: iops=132856 aggrb=1039.1MB/s
> > x-data-plane-off: iops=126236 aggrb=988.40MB/s
> > 
> > virtio-blk-data-plane is ~5% faster in this benchmark.
> > 
> > fio jobfile:
> > [global]
> > filename=/dev/vda
> > blocksize=8k
> > ioengine=libaio
> > direct=1
> > iodepth=8
> > runtime=120
> > time_based=1
> > 
> > [reads]
> > readwrite=randread
> > numjobs=4
> > 
> > Perf top (data-plane-on):
> >   3.71%  [kvm]               [k] kvm_arch_vcpu_ioctl_run
> >   3.27%  [kernel]            [k] memset    <--- ramdisk
> >   2.98%  [kernel]            [k] do_blockdev_direct_IO
> >   2.82%  [kvm_intel]         [k] vmx_vcpu_run
> >   2.66%  [kernel]            [k] _raw_spin_lock_irqsave
> >   2.06%  [kernel]            [k] put_compound_page
> >   2.06%  [kernel]            [k] __get_page_tail
> >   1.83%  [i915]              [k] __gen6_gt_force_wake_mt_get
> >   1.75%  [kernel]            [k] _raw_spin_unlock_irqrestore
> >   1.33%  qemu-system-x86_64  [.] vring_pop <--- virtio-blk-data-plane
> >   1.19%  [kernel]            [k] compound_unlock_irqrestore
> >   1.13%  [kernel]            [k] gup_huge_pmd
> >   1.11%  [kernel]            [k] __audit_syscall_exit
> >   1.07%  [kernel]            [k] put_page_testzero
> >   1.01%  [kernel]            [k] fget
> >   1.01%  [kernel]            [k] do_io_submit
> > 
> > Since the ramdisk (memset and page-related functions) is so prominent
> > in perf top, I also tried a 1-job 8k dd sequential write test on a
> > Samsung 830 Series SSD where virtio-blk-data-plane was 9% faster than
> > virtio-blk.  Optimizing against ramdisk isn't a good idea IMO because
> > it acts very differently from real hardware where the driver relies on
> > mmio, DMA, and interrupts (vs synchronous memcpy/memset).
> 
> For the memset in the ramdisk, you can simply patch drivers/block/brd.c
> to do a nop instead of the memset for testing.
> 
> Yes, if you have a fast SSD device (sometimes you need several, which I
> do not have), it makes more sense to test on real hardware. However, a
> ramdisk test is still useful: it gives rough performance numbers. If A
> and B are both tested against the ramdisk, the difference between A and
> B is still useful.

Optimizing the difference between A and B on ramdisk is only guaranteed
to optimize the ramdisk case.  On real hardware the bottleneck might be
elsewhere and we'd be chasing the wrong lead.

I don't think it's a waste of time, but to stay healthy we need to focus
on real disks and SSDs most of the time.

Stefan


end of thread

Thread overview: 38+ messages
2012-11-15 15:18 [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Stefan Hajnoczi
2012-11-15 15:19 ` [Qemu-devel] [PATCH 1/7] raw-posix: add raw_get_aio_fd() for virtio-blk-data-plane Stefan Hajnoczi
2012-11-15 20:03   ` Anthony Liguori
2012-11-16  6:15     ` Stefan Hajnoczi
2012-11-16  8:22       ` Paolo Bonzini
2012-11-15 15:19 ` [Qemu-devel] [PATCH 2/7] configure: add CONFIG_VIRTIO_BLK_DATA_PLANE Stefan Hajnoczi
2012-11-15 15:19 ` [Qemu-devel] [PATCH 3/7] dataplane: add virtqueue vring code Stefan Hajnoczi
2012-11-15 15:37   ` Paolo Bonzini
2012-11-15 20:09   ` Anthony Liguori
2012-11-16  6:24     ` Stefan Hajnoczi
2012-11-16  7:48   ` Christian Borntraeger
2012-11-16  8:13     ` Stefan Hajnoczi
2012-11-17 16:15   ` Blue Swirl
2012-11-18  9:27     ` Stefan Hajnoczi
2012-11-15 15:19 ` [Qemu-devel] [PATCH 4/7] dataplane: add event loop Stefan Hajnoczi
2012-11-15 15:19 ` [Qemu-devel] [PATCH 5/7] dataplane: add Linux AIO request queue Stefan Hajnoczi
2012-11-15 15:19 ` [Qemu-devel] [PATCH 6/7] dataplane: add virtio-blk data plane code Stefan Hajnoczi
2012-11-15 15:19 ` [Qemu-devel] [PATCH 7/7] virtio-blk: add x-data-plane=on|off performance feature Stefan Hajnoczi
2012-11-15 18:48   ` Michael S. Tsirkin
2012-11-15 19:34     ` Khoa Huynh
2012-11-15 21:11       ` Anthony Liguori
2012-11-15 21:08   ` Anthony Liguori
2012-11-16  6:22     ` Stefan Hajnoczi
2012-11-19 10:38       ` Kevin Wolf
2012-11-19 10:51         ` Paolo Bonzini
2012-11-16  7:40     ` Paolo Bonzini
2012-11-20  9:02 ` [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane Asias He
2012-11-20 12:21   ` Stefan Hajnoczi
2012-11-20 12:25     ` Stefan Hajnoczi
2012-11-21  5:39       ` Asias He
2012-11-21  6:42         ` Asias He
2012-11-21  6:44           ` Stefan Hajnoczi
2012-11-21  7:00             ` Asias He
2012-11-22 12:12               ` Stefan Hajnoczi
2012-11-21  5:22     ` Asias He
2012-11-22 12:16       ` Stefan Hajnoczi
2012-11-20 15:03   ` Khoa Huynh
2012-11-21  5:22     ` Asias He
