* [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
@ 2014-08-05  3:33 Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool Ming Lei
                   ` (18 more replies)
  0 siblings, 19 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Hi,

These patches bring in the following four changes:
        - introduce an object allocation pool and apply it to
        virtio-blk dataplane to improve its performance

        - introduce a selective coroutine bypass mechanism
        to improve the performance of virtio-blk dataplane
        with raw format images

        - linux-aio changes: fix the -EAGAIN and partial
        completion cases, increase max events to 256, and remove
        one unused field in 'struct qemu_laiocb'

        - support multi virtqueue for virtio-blk

The virtio-blk multi virtqueue feature will be added to virtio spec 1.1[1],
and the 3.17 Linux kernel[2] will support the feature in the virtio-blk
driver. For those who want to try it, the kernel-side patches can be found
in either Jens's block tree[3] or linux-next[4].

The following fio script, run from inside the VM, is used to test the improvement brought by these patches:

        [global]
        direct=1
        size=128G
        bsrange=4k-4k
        timeout=120
        numjobs=${JOBS}
        ioengine=libaio
        iodepth=64
        filename=/dev/vdc
        group_reporting=1

        [f]
        rw=randread

One quad-core VM (8G RAM) is created on the host below to run the above fio test:

        - server (16 cores: 8 physical cores, 2 threads per physical core)

Below is the test result on throughput improvement (IOPS) with
this patchset (4 virtqueues per virtio-blk device, 4 JOBS) against
QEMU 2.1.0: a 53% throughput improvement can be observed, and
scalability for parallel I/O improves even more (>100% throughput
improvement is observed in the 4 JOBS case).

From the above results, we can see that both scalability and
performance improve a lot.

After commit 580b6b2aa2 (dataplane: use the QEMU block
layer for I/O), the average time for submitting a single
request increased a lot: according to my traces, it has
doubled, even though the block plug & unplug mechanism was
introduced to ease the effect. That is why this patchset
first introduces the selective coroutine bypass mechanism
and the object allocation pool to win that time back. Based
on QEMU 2.0, the single virtio-blk dataplane multi virtqueue
patch alone achieved a bigger improvement than the current
result[5].

V1:
	- bypass co: add a check for making the bypass decision, to help
	remove the hint from the device in the future
	- bypass co: run acb->cb() via BH, as pointed out by Paolo and Stefan
	- virtio: drop the patch for decreasing the size of VirtQueueElement,
	which would break migration between different QEMU versions;
	another standalone patchset might do that
	- linux-aio: retry io_submit in the following completion cb for
	-EAGAIN, as suggested by Paolo
	- linux-aio: handle -EAGAIN for the non-plugged case, as suggested
	by Paolo
	- mq conversion: support multi virtqueue for non-dataplane too, as
	required by Paolo

TODO:
	- optimize the block layer for linux-aio, so that more time
	can be saved when submitting requests
	- support more than one AioContext to further improve
	virtio-blk performance

[1], http://marc.info/?l=linux-api&m=140486843317107&w=2
[2], http://marc.info/?l=linux-api&m=140418368421229&w=2
[3], http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/ #for-3.17/drivers
[4], https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/
[5], http://marc.info/?l=linux-api&m=140377573830230&w=2

 block.c                         |  233 ++++++++++++++++++++++++++++++++++-----
 block/linux-aio.c               |  124 ++++++++++++++++-----
 block/raw-posix.c               |   34 ++++++
 hw/block/dataplane/virtio-blk.c |  221 ++++++++++++++++++++++++++++---------
 hw/block/virtio-blk.c           |   39 +++++--
 include/block/block.h           |   12 ++
 include/block/block_int.h       |    3 +
 include/block/coroutine.h       |    8 ++
 include/block/coroutine_int.h   |    5 +
 include/hw/virtio/virtio-blk.h  |   14 ++-
 include/qemu/gc.h               |   56 ++++++++++
 include/qemu/obj_pool.h         |   64 +++++++++++
 qemu-coroutine-lock.c           |    4 +-
 qemu-coroutine.c                |   33 ++++++
 14 files changed, 734 insertions(+), 116 deletions(-)



Thanks,


* [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05 11:55   ` Eric Blake
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 02/17] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch introduces an object allocation pool for speeding up
object allocation in the fast path.
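
As a usage illustration (not part of the patch), here is a minimal
sketch; the request type, REQ_CNT, and the helper names are all
hypothetical (g_new()/g_free() come from glib, as usual in QEMU):

    #include "qemu/obj_pool.h"

    #define REQ_CNT 128

    typedef struct { int id; } Req;        /* hypothetical request type */

    static Req reqs[REQ_CNT];              /* backing storage for the pool */
    static void *free_objs[REQ_CNT];       /* free-list slots, one per object */
    static ObjPool pool;

    static void req_pool_setup(void)
    {
        obj_pool_init(&pool, reqs, free_objs, sizeof(Req), REQ_CNT);
    }

    static Req *req_alloc(void)
    {
        Req *r = obj_pool_get(&pool);      /* O(1) pop; NULL when exhausted */
        return r ? r : g_new(Req, 1);      /* fall back to the heap */
    }

    static void req_free(Req *r)
    {
        if (obj_pool_has_obj(&pool, r)) {  /* address inside the pool? */
            obj_pool_put(&pool, r);        /* O(1) push back */
        } else {
            g_free(r);
        }
    }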

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/qemu/obj_pool.h |   64 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 include/qemu/obj_pool.h

diff --git a/include/qemu/obj_pool.h b/include/qemu/obj_pool.h
new file mode 100644
index 0000000..94b5f49
--- /dev/null
+++ b/include/qemu/obj_pool.h
@@ -0,0 +1,64 @@
+#ifndef QEMU_OBJ_POOL_HEAD
+#define QEMU_OBJ_POOL_HEAD
+
+typedef struct {
+    unsigned int size;
+    unsigned int cnt;
+
+    void **free_obj;
+    int free_idx;
+
+    char *objs;
+} ObjPool;
+
+static inline void obj_pool_init(ObjPool *op, void *objs_buf, void **free_objs,
+                                 unsigned int obj_size, unsigned cnt)
+{
+    int i;
+
+    op->objs = (char *)objs_buf;
+    op->free_obj = free_objs;
+    op->size = obj_size;
+    op->cnt = cnt;
+
+    for (i = 0; i < op->cnt; i++) {
+        op->free_obj[i] = (void *)&op->objs[i * op->size];
+    }
+    op->free_idx = op->cnt;
+}
+
+static inline void *obj_pool_get(ObjPool *op)
+{
+    void *obj;
+
+    if (!op) {
+        return NULL;
+    }
+
+    if (op->free_idx <= 0) {
+        return NULL;
+    }
+
+    obj = op->free_obj[--op->free_idx];
+    return obj;
+}
+
+static inline bool obj_pool_has_obj(ObjPool *op, void *obj)
+{
+    return op && (unsigned long)obj >= (unsigned long)&op->objs[0] &&
+           (unsigned long)obj <=
+           (unsigned long)&op->objs[(op->cnt - 1) * op->size];
+}
+
+static inline void obj_pool_put(ObjPool *op, void *obj)
+{
+    if (!op || !obj_pool_has_obj(op, obj)) {
+        return;
+    }
+
+    assert(op->free_idx < op->cnt);
+
+    op->free_obj[op->free_idx++] = obj;
+}
+
+#endif
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 02/17] dataplane: use object pool to speed up allocation for virtio blk request
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05 12:30   ` Eric Blake
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 03/17] qemu coroutine: support bypass mode Ming Lei
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

g_slice_new(VirtIOBlockReq), its free pair, and accessing the instance
are a bit slow since sizeof(VirtIOBlockReq) is more than 40KB, so use
the object pool to speed up its allocation and release.

With this patch, a ~5%-10% throughput improvement is observed in the VM
running on the server described in the cover letter.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/dataplane/virtio-blk.c |   12 ++++++++++++
 hw/block/virtio-blk.c           |   13 +++++++++++--
 include/hw/virtio/virtio-blk.h  |    2 ++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index d6ba65c..c9a8cc2 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -24,6 +24,8 @@
 #include "hw/virtio/virtio-bus.h"
 #include "qom/object_interfaces.h"
 
+#define REQ_POOL_SZ 128
+
 struct VirtIOBlockDataPlane {
     bool started;
     bool starting;
@@ -50,6 +52,10 @@ struct VirtIOBlockDataPlane {
     Error *blocker;
     void (*saved_complete_request)(struct VirtIOBlockReq *req,
                                    unsigned char status);
+
+    VirtIOBlockReq  reqs[REQ_POOL_SZ];
+    void *free_reqs[REQ_POOL_SZ];
+    ObjPool  req_pool;
 };
 
 /* Raise an interrupt to signal guest, if necessary */
@@ -235,6 +241,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
         return;
     }
 
+    vblk->obj_pool = &s->req_pool;
+    obj_pool_init(vblk->obj_pool, s->reqs, s->free_reqs,
+                  sizeof(VirtIOBlockReq), REQ_POOL_SZ);
+
     /* Set up guest notifier (irq) */
     if (k->set_guest_notifiers(qbus->parent, 1, true) != 0) {
         fprintf(stderr, "virtio-blk failed to set guest notifier, "
@@ -291,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
 
     aio_context_release(s->ctx);
 
+    vblk->obj_pool = NULL;
+
     /* Sync vring state back to virtqueue so that non-dataplane request
      * processing can continue when we disable the host notifier below.
      */
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index c241c50..2a11bc4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -31,7 +31,11 @@
 
 VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
 {
-    VirtIOBlockReq *req = g_slice_new(VirtIOBlockReq);
+    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
+
+    if (!req) {
+        req = g_slice_new(VirtIOBlockReq);
+    }
     req->dev = s;
     req->qiov.size = 0;
     req->next = NULL;
@@ -41,7 +45,11 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
 void virtio_blk_free_request(VirtIOBlockReq *req)
 {
     if (req) {
-        g_slice_free(VirtIOBlockReq, req);
+        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
+            obj_pool_put(req->dev->obj_pool, req);
+        } else {
+            g_slice_free(VirtIOBlockReq, req);
+        }
     }
 }
 
@@ -801,6 +809,7 @@ static void virtio_blk_instance_init(Object *obj)
 {
     VirtIOBlock *s = VIRTIO_BLK(obj);
 
+    s->obj_pool = NULL;
     object_property_add_link(obj, "iothread", TYPE_IOTHREAD,
                              (Object **)&s->blk.iothread,
                              qdev_prop_allow_set_link_before_realize,
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index afb7b8d..49ac234 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -18,6 +18,7 @@
 #include "hw/block/block.h"
 #include "sysemu/iothread.h"
 #include "block/block.h"
+#include "qemu/obj_pool.h"
 
 #define TYPE_VIRTIO_BLK "virtio-blk-device"
 #define VIRTIO_BLK(obj) \
@@ -135,6 +136,7 @@ typedef struct VirtIOBlock {
     Notifier migration_state_notifier;
     struct VirtIOBlockDataPlane *dataplane;
 #endif
+    ObjPool *obj_pool;
 } VirtIOBlock;
 
 typedef struct MultiReqBuffer {
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 03/17] qemu coroutine: support bypass mode
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 02/17] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 04/17] block: prepare for supporting selective bypass coroutine Ming Lei
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch introduces several APIs for supporting bypass of the qemu
coroutine in cases where it isn't necessary, for performance's sake.
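
A sketch of how a caller drives these APIs; this mirrors what the
block-layer patch later in this series does ('do_rw' and 'acb' are
placeholders):

    /* caller side: run a coroutine_fn directly, without creating a
     * coroutine, while the bypass flag is set */
    qemu_coroutine_set_bypass(true);
    do_rw(acb);                          /* completes or queues the I/O */
    qemu_coroutine_set_bypass(false);

    /* callee side: code shared with the coroutine path can test the
     * flag and stash per-request state instead of yielding */
    if (qemu_coroutine_self_bypassed()) {
        qemu_coroutine_set_var(acb);     /* make acb reachable downstream */
    }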

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/block/coroutine.h     |    8 ++++++++
 include/block/coroutine_int.h |    5 +++++
 qemu-coroutine-lock.c         |    4 ++--
 qemu-coroutine.c              |   33 +++++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index b408f96..46d2642 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -223,4 +223,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  * Note that this function clobbers the handlers for the file descriptor.
  */
 void coroutine_fn yield_until_fd_readable(int fd);
+
+/* qemu coroutine bypass APIs */
+void qemu_coroutine_set_bypass(bool bypass);
+bool qemu_coroutine_bypassed(Coroutine *self);
+bool qemu_coroutine_self_bypassed(void);
+void qemu_coroutine_set_var(void *var);
+void *qemu_coroutine_get_var(void);
+
 #endif /* QEMU_COROUTINE_H */
diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
index f133d65..106d0b2 100644
--- a/include/block/coroutine_int.h
+++ b/include/block/coroutine_int.h
@@ -39,6 +39,11 @@ struct Coroutine {
     Coroutine *caller;
     QSLIST_ENTRY(Coroutine) pool_next;
 
+    bool bypass;
+
+    /* only used in bypass mode */
+    void *opaque;
+
     /* Coroutines that should be woken up when we yield or terminate */
     QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
     QTAILQ_ENTRY(Coroutine) co_queue_next;
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index e4860ae..7c69ff6 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -82,13 +82,13 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
 
 bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     return qemu_co_queue_do_restart(queue, true);
 }
 
 void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     qemu_co_queue_do_restart(queue, false);
 }
 
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 4708521..0597ed9 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -137,3 +137,36 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
+
+void qemu_coroutine_set_bypass(bool bypass)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->bypass = bypass;
+}
+
+bool qemu_coroutine_bypassed(Coroutine *self)
+{
+    return self->bypass;
+}
+
+bool qemu_coroutine_self_bypassed(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return qemu_coroutine_bypassed(self);
+}
+
+void qemu_coroutine_set_var(void *var)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->opaque = var;
+}
+
+void *qemu_coroutine_get_var(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return self->opaque;
+}
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 04/17] block: prepare for supporting selective bypass coroutine
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (2 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 03/17] qemu coroutine: support bypass mode Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 05/17] garbage collector: introduced for support of " Ming Lei
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

If a device decides that it isn't necessary to apply a coroutine
in its performance-sensitive path, it can call
bdrv_set_bypass_co(bs, true) to bypass the coroutine
and just call the function directly in the aio read/write path.

One example is virtio-blk dataplane.
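
For illustration, a device would toggle the hint in its start/stop
paths, as the dataplane patch later in this series does; a sketch:

    bdrv_set_bypass_co(bs, true);     /* start: requests may skip coroutines */
    /* ... device running ... */
    bdrv_set_bypass_co(bs, false);    /* stop: restore the coroutine path */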

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block.c                   |   10 ++++++++++
 include/block/block.h     |    3 +++
 include/block/block_int.h |    3 +++
 3 files changed, 16 insertions(+)

diff --git a/block.c b/block.c
index 8cf519b..ac184ef 100644
--- a/block.c
+++ b/block.c
@@ -5840,3 +5840,13 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
         bdrv_flush_io_queue(bs->file);
     }
 }
+
+void bdrv_set_bypass_co(BlockDriverState *bs, bool bypass)
+{
+    bs->bypass_co = bypass;
+}
+
+bool bdrv_get_bypass_co(BlockDriverState *bs)
+{
+    return bs->bypass_co;
+}
diff --git a/include/block/block.h b/include/block/block.h
index f08471d..92f2f3a 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -588,4 +588,7 @@ void bdrv_io_plug(BlockDriverState *bs);
 void bdrv_io_unplug(BlockDriverState *bs);
 void bdrv_flush_io_queue(BlockDriverState *bs);
 
+void bdrv_set_bypass_co(BlockDriverState *bs, bool bypass);
+bool bdrv_get_bypass_co(BlockDriverState *bs);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 7b541a0..9fa2f4c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -354,6 +354,9 @@ struct BlockDriverState {
     /* Whether produces zeros when read beyond eof */
     bool zero_beyond_eof;
 
+    /* Whether bypasses coroutine when doing aio read & write */
+    bool bypass_co;
+
     /* Alignment requirement for offset/length of I/O requests */
     unsigned int request_alignment;
 
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 05/17] garbage collector: introduced for support of bypass coroutine
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (3 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 04/17] block: prepare for supporting selective bypass coroutine Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05 12:43   ` Eric Blake
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 06/17] block: introduce bdrv_co_can_bypass_co Ming Lei
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

In the case of bypassing the coroutine, some buffers on the stack
have to be converted (moved to the heap) so that they survive the
whole I/O submit & completion cycle.

A garbage collector is one of the best data structures for this
purpose, as I see it.
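
A usage sketch of the helpers below (the buffer names and the custom
destructor are placeholders):

    SimpleGC gc;

    simple_gc_init(&gc);

    /* submit path: register heap buffers that must outlive this stack frame */
    simple_gc_add(&gc, head_buf, NULL);         /* NULL: freed via qemu_vfree() */
    simple_gc_add(&gc, local_qiov, free_qiov);  /* or via a custom destructor */

    /* completion path: release everything in one go */
    simple_gc_free_all(&gc);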

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/qemu/gc.h |   56 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)
 create mode 100644 include/qemu/gc.h

diff --git a/include/qemu/gc.h b/include/qemu/gc.h
new file mode 100644
index 0000000..b9a3f6e
--- /dev/null
+++ b/include/qemu/gc.h
@@ -0,0 +1,56 @@
+#ifndef QEMU_GC_HEADER
+#define QEMU_GC_HEADER
+
+#include "qemu/queue.h"
+
+/* simple garbage collector implementation for bypass coroutine */
+
+/* internal type and helper */
+typedef struct SimpleGCNode SimpleGCNode;
+struct SimpleGCNode {
+    void *addr;
+    void (*free)(void *data);
+    QLIST_ENTRY(SimpleGCNode) node;
+};
+
+static inline void simple_gc_free_one(SimpleGCNode *node)
+{
+    if (node->free) {
+        node->free(node->addr);
+    } else {
+        qemu_vfree(node->addr);
+    }
+
+    g_free(node);
+}
+
+/* public type and helpers */
+typedef struct {
+    QLIST_HEAD(, SimpleGCNode) head;
+} SimpleGC;
+
+static inline void simple_gc_init(SimpleGC *gc)
+{
+    QLIST_INIT(&gc->head);
+}
+
+static inline void simple_gc_add(SimpleGC *gc, void *addr,
+                                 void (*free)(void *data))
+{
+    SimpleGCNode *node = g_malloc0(sizeof(*node));
+
+    node->addr = addr;
+    node->free = free;
+    QLIST_INSERT_HEAD(&gc->head, node, node);
+}
+
+static inline void simple_gc_free_all(SimpleGC *gc)
+{
+    SimpleGCNode *curr, *next;
+
+    QLIST_FOREACH_SAFE(curr, &gc->head, node, next) {
+        QLIST_REMOVE(curr, node);
+        simple_gc_free_one(curr);
+    }
+}
+#endif
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 06/17] block: introduce bdrv_co_can_bypass_co
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (4 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 05/17] garbage collector: introduced for support of " Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 07/17] block: support to bypass qemu coroutine Ming Lei
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This function is introduced to check whether the current block
I/O can be allowed to run without a coroutine, for the sake of
performance.
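
For intuition, the alignment check used below can be restated as this
sketch (the helper name is a placeholder):

    static bool rw_aligned(uint64_t align, int64_t offset, int bytes)
    {
        /* both the start and the end of the request must fall on
         * 'align'-byte boundaries, align = MAX(512, request_alignment) */
        return !(offset & (align - 1)) && !((offset + bytes) & (align - 1));
    }

    /* rw_aligned(512, 4096, 4096) -> true
     * rw_aligned(512, 4095, 4096) -> false (unaligned start)
     * rw_aligned(512, 4096, 100)  -> false (unaligned end) */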

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/block.c b/block.c
index ac184ef..2326dab 100644
--- a/block.c
+++ b/block.c
@@ -4718,6 +4718,44 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
     qemu_bh_schedule(acb->bh);
 }
 
+static bool bdrv_rw_aligned(BlockDriverState *bs,
+                            int64_t offset,
+                            int bytes)
+{
+    uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
+
+    if ((offset & (align - 1)) || ((offset + bytes) & (align - 1))) {
+        return false;
+    } else {
+        return true;
+    }
+}
+
+static bool bdrv_co_can_bypass_co(BlockDriverState *bs,
+                                  int64_t sector_num,
+                                  int nb_sectors,
+                                  BdrvRequestFlags flags,
+                                  bool is_write)
+{
+    if (flags || bs->copy_on_read || bs->io_limits_enabled) {
+        return false;
+    }
+
+    /* unaligned read is safe */
+    if (!is_write) {
+        return true;
+    }
+
+    if (!bs->enable_write_cache ||
+        bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF ||
+        !QLIST_EMPTY(&bs->before_write_notifiers.notifiers)) {
+        return false;
+    } else {
+        return bdrv_rw_aligned(bs, sector_num << BDRV_SECTOR_BITS,
+                               nb_sectors << BDRV_SECTOR_BITS);
+    }
+}
+
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 07/17] block: support to bypass qemu coroutine
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (5 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 06/17] block: introduce bdrv_co_can_bypass_co Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 08/17] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used" Ming Lei
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch adds support for bypassing the coroutine
in bdrv_co_aio_rw_vector(), which is on the fast path of
block devices, especially for virtio-blk dataplane.
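
The pointer arithmetic in bdrv_get_co_io_comp()/bdrv_get_aio_co()
below relies on the AIOCB layout set up by
bdrv_em_co_bypass_aiocb_info; as a sketch:

    /*
     * Layout of one bypass AIOCB allocation (aiocb_size =
     * sizeof(BlockDriverAIOCBCoroutine) + sizeof(CoroutineIOCompletion)):
     *
     *   +----------------------------+ <- acb, bdrv_get_aio_co(co)
     *   | BlockDriverAIOCBCoroutine  |
     *   +----------------------------+ <- bdrv_get_co_io_comp(acb)
     *   | CoroutineIOCompletion      |
     *   +----------------------------+
     */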

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block.c |  185 +++++++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 157 insertions(+), 28 deletions(-)

diff --git a/block.c b/block.c
index 2326dab..e1812a7 100644
--- a/block.c
+++ b/block.c
@@ -35,6 +35,7 @@
 #include "qmp-commands.h"
 #include "qemu/timer.h"
 #include "qapi-event.h"
+#include "qemu/gc.h"
 
 #ifdef CONFIG_BSD
 #include <sys/types.h>
@@ -55,6 +56,21 @@ struct BdrvDirtyBitmap {
     QLIST_ENTRY(BdrvDirtyBitmap) list;
 };
 
+typedef struct CoroutineIOCompletion {
+    Coroutine *coroutine;
+    int ret;
+    bool bypass;
+    SimpleGC gc;
+} CoroutineIOCompletion;
+
+typedef struct BlockDriverAIOCBCoroutine {
+    BlockDriverAIOCB common;
+    BlockRequest req;
+    bool is_write;
+    bool *done;
+    QEMUBH *bh;
+} BlockDriverAIOCBCoroutine;
+
 #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */
 
 static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load);
@@ -120,6 +136,48 @@ int is_windows_drive(const char *filename)
 }
 #endif
 
+static CoroutineIOCompletion *bdrv_get_co_io_comp(void *acb)
+{
+    return (CoroutineIOCompletion *)(acb +
+               sizeof(BlockDriverAIOCBCoroutine));
+}
+
+static BlockDriverAIOCBCoroutine *bdrv_get_aio_co(void *co)
+{
+    assert(((CoroutineIOCompletion *)co)->bypass);
+
+    return (BlockDriverAIOCBCoroutine *)(co -
+               sizeof(BlockDriverAIOCBCoroutine));
+}
+
+static void bdrv_init_io_comp(CoroutineIOCompletion *co)
+{
+    co->coroutine = NULL;
+    co->bypass = false;
+    co->ret = 0;
+    simple_gc_init(&co->gc);
+}
+
+static void bdrv_free_qiov(void *addr)
+{
+    qemu_iovec_destroy((QEMUIOVector *)addr);
+    g_free(addr);
+}
+
+static void bdrv_gc_add_qiov(CoroutineIOCompletion *co,
+                             QEMUIOVector *qiov)
+{
+    QEMUIOVector *iov = g_malloc(sizeof(QEMUIOVector));
+
+    *iov = *qiov;
+    simple_gc_add(&co->gc, iov, bdrv_free_qiov);
+}
+
+static void bdrv_gc_add_buf(CoroutineIOCompletion *co, void *addr)
+{
+    simple_gc_add(&co->gc, addr, NULL);
+}
+
 /* throttling disk I/O limits */
 void bdrv_set_io_limits(BlockDriverState *bs,
                         ThrottleConfig *cfg)
@@ -3081,7 +3139,16 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
             ret = drv->bdrv_co_readv(bs, sector_num, local_sectors,
                                      &local_qiov);
 
-            qemu_iovec_destroy(&local_qiov);
+
+            if (qemu_coroutine_self_bypassed()) {
+                CoroutineIOCompletion *pco = bdrv_get_co_io_comp(
+                                             qemu_coroutine_get_var());
+
+                /* GC will destroy the local iov after IO is completed */
+                bdrv_gc_add_qiov(pco, &local_qiov);
+            } else {
+                qemu_iovec_destroy(&local_qiov);
+            }
         } else {
             ret = 0;
         }
@@ -3165,9 +3232,19 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
     tracked_request_end(&req);
 
     if (use_local_qiov) {
-        qemu_iovec_destroy(&local_qiov);
-        qemu_vfree(head_buf);
-        qemu_vfree(tail_buf);
+        if (!qemu_coroutine_self_bypassed()) {
+            qemu_iovec_destroy(&local_qiov);
+            qemu_vfree(head_buf);
+            qemu_vfree(tail_buf);
+        } else {
+            CoroutineIOCompletion *pco = bdrv_get_co_io_comp(
+                                         qemu_coroutine_get_var());
+
+            /* GC will release resources after IO is completed */
+            bdrv_gc_add_qiov(pco, &local_qiov);
+            if (head_buf) { bdrv_gc_add_buf(pco, head_buf); }
+            if (tail_buf) { bdrv_gc_add_buf(pco, tail_buf); }
+        }
     }
 
     return ret;
@@ -4659,15 +4736,6 @@ static BlockDriverAIOCB *bdrv_aio_writev_em(BlockDriverState *bs,
     return bdrv_aio_rw_vector(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
 }
 
-
-typedef struct BlockDriverAIOCBCoroutine {
-    BlockDriverAIOCB common;
-    BlockRequest req;
-    bool is_write;
-    bool *done;
-    QEMUBH* bh;
-} BlockDriverAIOCBCoroutine;
-
 static void bdrv_aio_co_cancel_em(BlockDriverAIOCB *blockacb)
 {
     AioContext *aio_context = bdrv_get_aio_context(blockacb->bs);
@@ -4686,6 +4754,12 @@ static const AIOCBInfo bdrv_em_co_aiocb_info = {
     .cancel             = bdrv_aio_co_cancel_em,
 };
 
+static const AIOCBInfo bdrv_em_co_bypass_aiocb_info = {
+    .aiocb_size         = sizeof(BlockDriverAIOCBCoroutine) +
+                          sizeof(CoroutineIOCompletion),
+    .cancel             = bdrv_aio_co_cancel_em,
+};
+
 static void bdrv_co_em_bh(void *opaque)
 {
     BlockDriverAIOCBCoroutine *acb = opaque;
@@ -4705,6 +4779,13 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
 {
     BlockDriverAIOCBCoroutine *acb = opaque;
     BlockDriverState *bs = acb->common.bs;
+    bool bypass = qemu_coroutine_self_bypassed();
+    CoroutineIOCompletion *co = bdrv_get_co_io_comp(acb);
+
+    if (bypass) {
+        bdrv_init_io_comp(bdrv_get_co_io_comp(acb));
+        qemu_coroutine_set_var(acb);
+    }
 
     if (!acb->is_write) {
         acb->req.error = bdrv_co_do_readv(bs, acb->req.sector,
@@ -4714,8 +4795,11 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
             acb->req.nb_sectors, acb->req.qiov, acb->req.flags);
     }
 
-    acb->bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_em_bh, acb);
-    qemu_bh_schedule(acb->bh);
+    /* co->bypass is used for detecting early completion */
+    if (!bypass || !co->bypass) {
+        acb->bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_em_bh, acb);
+        qemu_bh_schedule(acb->bh);
+    }
 }
 
 static bool bdrv_rw_aligned(BlockDriverState *bs,
@@ -4767,8 +4851,27 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
 {
     Coroutine *co;
     BlockDriverAIOCBCoroutine *acb;
+    const AIOCBInfo *aiocb_info;
+    bool bypass;
 
-    acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
+    /*
+     * In the long term, coroutine creation should be pushed much further
+     * down, making a fast path for cases of unnecessary coroutine usage.
+     *
+     * Also, once the bypass mechanism is mature, the 'bypass_co' hint
+     * set by the device can be moved into the block layer so that
+     * bypass can be enabled automatically.
+     */
+    if (bs->bypass_co &&
+        bdrv_co_can_bypass_co(bs, sector_num, nb_sectors, flags, is_write)) {
+        aiocb_info = &bdrv_em_co_bypass_aiocb_info;
+        bypass = true;
+    } else {
+        aiocb_info = &bdrv_em_co_aiocb_info;
+        bypass = false;
+    }
+
+    acb = qemu_aio_get(aiocb_info, bs, cb, opaque);
     acb->req.sector = sector_num;
     acb->req.nb_sectors = nb_sectors;
     acb->req.qiov = qiov;
@@ -4776,8 +4879,14 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
     acb->is_write = is_write;
     acb->done = NULL;
 
-    co = qemu_coroutine_create(bdrv_co_do_rw);
-    qemu_coroutine_enter(co, acb);
+    if (!bypass) {
+        co = qemu_coroutine_create(bdrv_co_do_rw);
+        qemu_coroutine_enter(co, acb);
+    } else {
+        qemu_coroutine_set_bypass(true);
+        bdrv_co_do_rw(acb);
+        qemu_coroutine_set_bypass(false);
+    }
 
     return &acb->common;
 }
@@ -4871,17 +4980,23 @@ void qemu_aio_release(void *p)
 /**************************************************************/
 /* Coroutine block device emulation */
 
-typedef struct CoroutineIOCompletion {
-    Coroutine *coroutine;
-    int ret;
-} CoroutineIOCompletion;
-
 static void bdrv_co_io_em_complete(void *opaque, int ret)
 {
     CoroutineIOCompletion *co = opaque;
 
-    co->ret = ret;
-    qemu_coroutine_enter(co->coroutine, NULL);
+    if (!co->bypass) {
+        co->ret = ret;
+        qemu_coroutine_enter(co->coroutine, NULL);
+    } else {
+        BlockDriverAIOCBCoroutine *acb = bdrv_get_aio_co(co);
+
+        simple_gc_free_all(&co->gc);
+
+        acb->req.error = ret;
+        acb->bh = aio_bh_new(bdrv_get_aio_context(acb->common.bs),
+                             bdrv_co_em_bh, acb);
+        qemu_bh_schedule(acb->bh);
+    }
 }
 
 static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
@@ -4891,21 +5006,35 @@ static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
     CoroutineIOCompletion co = {
         .coroutine = qemu_coroutine_self(),
     };
+    CoroutineIOCompletion *pco = &co;
     BlockDriverAIOCB *acb;
 
+    if (qemu_coroutine_bypassed(pco->coroutine)) {
+        pco = bdrv_get_co_io_comp(qemu_coroutine_get_var());
+        pco->bypass = true;
+    }
+
     if (is_write) {
         acb = bs->drv->bdrv_aio_writev(bs, sector_num, iov, nb_sectors,
-                                       bdrv_co_io_em_complete, &co);
+                                       bdrv_co_io_em_complete, pco);
     } else {
         acb = bs->drv->bdrv_aio_readv(bs, sector_num, iov, nb_sectors,
-                                      bdrv_co_io_em_complete, &co);
+                                      bdrv_co_io_em_complete, pco);
     }
 
     trace_bdrv_co_io_em(bs, sector_num, nb_sectors, is_write, acb);
     if (!acb) {
+        /*
+         * no completion callback for failure case, let bdrv_co_do_rw
+         * handle completion.
+         */
+        pco->bypass = false;
         return -EIO;
     }
-    qemu_coroutine_yield();
+
+    if (!pco->bypass) {
+        qemu_coroutine_yield();
+    }
 
     return co.ret;
 }
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 08/17] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used"
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (6 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 07/17] block: support to bypass qemu coroutine Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 09/17] dataplane: enable selective bypassing coroutine Ming Lei
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This reverts commit 76ef2cf5493a215efc351f48ae7094d6c183fcac.

Reintroduce the raw_get_aio_fd() helper to enable coroutine
bypass mode in the case of raw images.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/raw-posix.c     |   34 ++++++++++++++++++++++++++++++++++
 include/block/block.h |    9 +++++++++
 2 files changed, 43 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 8e9758e..88715c8 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -2436,6 +2436,40 @@ static BlockDriver bdrv_host_cdrom = {
 };
 #endif /* __FreeBSD__ */
 
+#ifdef CONFIG_LINUX_AIO
+/**
+ * Return the file descriptor for Linux AIO
+ *
+ * This function is a layering violation and should be removed when it becomes
+ * possible to call the block layer outside the global mutex.  It allows the
+ * caller to hijack the file descriptor so I/O can be performed outside the
+ * block layer.
+ */
+int raw_get_aio_fd(BlockDriverState *bs)
+{
+    BDRVRawState *s;
+
+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
+
+    if (bs->drv == bdrv_find_format("raw")) {
+        bs = bs->file;
+    }
+
+    /* raw-posix has several protocols so just check for raw_aio_readv */
+    if (bs->drv->bdrv_aio_readv != raw_aio_readv) {
+        return -ENOTSUP;
+    }
+
+    s = bs->opaque;
+    if (!s->use_aio) {
+        return -ENOTSUP;
+    }
+    return s->fd;
+}
+#endif /* CONFIG_LINUX_AIO */
+
 static void bdrv_file_init(void)
 {
     /*
diff --git a/include/block/block.h b/include/block/block.h
index 92f2f3a..4450d26 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -482,6 +482,15 @@ void bdrv_op_block_all(BlockDriverState *bs, Error *reason);
 void bdrv_op_unblock_all(BlockDriverState *bs, Error *reason);
 bool bdrv_op_blocker_is_empty(BlockDriverState *bs);
 
+#ifdef CONFIG_LINUX_AIO
+int raw_get_aio_fd(BlockDriverState *bs);
+#else
+static inline int raw_get_aio_fd(BlockDriverState *bs)
+{
+    return -ENOTSUP;
+}
+#endif
+
 enum BlockAcctType {
     BDRV_ACCT_READ,
     BDRV_ACCT_WRITE,
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 09/17] dataplane: enable selective bypassing coroutine
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (7 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 08/17] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used" Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 10/17] linux-aio: fix submit aio as a batch Ming Lei
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch enables selective bypassing of the
coroutine in bdrv_co_aio_rw_vector() if the image
format is raw.

With this patch, a ~10% throughput improvement for raw images is
observed in the VM running on the server described in the cover letter.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/dataplane/virtio-blk.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index c9a8cc2..a0732e3 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -30,6 +30,7 @@ struct VirtIOBlockDataPlane {
     bool started;
     bool starting;
     bool stopping;
+    bool raw_format;
 
     VirtIOBlkConf *blk;
 
@@ -199,6 +200,8 @@ void virtio_blk_data_plane_create(VirtIODevice *vdev, VirtIOBlkConf *blk,
     error_setg(&s->blocker, "block device is in use by data plane");
     bdrv_op_block_all(blk->conf.bs, s->blocker);
 
+    s->raw_format = (raw_get_aio_fd(blk->conf.bs) >= 0);
+
     *dataplane = s;
 }
 
@@ -272,6 +275,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     /* Kick right away to begin processing requests already in vring */
     event_notifier_set(virtio_queue_get_host_notifier(vq));
 
+    if (s->raw_format) {
+        bdrv_set_bypass_co(s->blk->conf.bs, true);
+    }
+
     /* Get this show started by hooking up our callbacks */
     aio_context_acquire(s->ctx);
     aio_set_event_notifier(s->ctx, &s->host_notifier, handle_notify);
@@ -303,6 +310,9 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
 
     vblk->obj_pool = NULL;
 
+    if (s->raw_format) {
+        bdrv_set_bypass_co(s->blk->conf.bs, false);
+    }
     /* Sync vring state back to virtqueue so that non-dataplane request
      * processing can continue when we disable the host notifier below.
      */
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 10/17] linux-aio: fix submit aio as a batch
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (8 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 09/17] dataplane: enable selective bypassing coroutine Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 11/17] linux-aio: handling -EAGAIN for !s->io_q.plugged case Ming Lei
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

In the enqueue path, we can't complete requests, otherwise
"Co-routine re-entered recursively" may be triggered, so this
patch fixes the issue with the ideas below (see the io_submit()
sketch after the list):

	- for -EAGAIN or partial completion, retry the submission by
	scheduling a BH in the following completion cb
	- for partial completion, also update the io queue
	- for other failures, return the failure if in the enqueue path,
	otherwise abort all queued I/O
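
The io_submit() return-value convention these rules build on, as a
sketch (not part of the patch):

    int ret = io_submit(ctx, len, iocbs);

    if (ret == len) {
        /* all consumed: the queue becomes empty */
    } else if (ret >= 0) {
        /* partial: the first 'ret' iocbs were consumed; shift the
         * remaining len - ret entries to the front and retry later */
    } else if (ret == -EAGAIN) {
        /* nothing consumed: keep the queue and retry from the
         * completion BH */
    } else {
        /* hard error: fail or abort the queued requests */
    }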

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/linux-aio.c |   99 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 77 insertions(+), 22 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 7ac7e8c..4cdf507 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -38,11 +38,19 @@ struct qemu_laiocb {
     QLIST_ENTRY(qemu_laiocb) node;
 };
 
+/*
+ * TODO: support batching I/O from multiple bs in the same
+ * AioContext; one important use case is multi-lun SCSI,
+ * so in the future the IO queue should be per AioContext.
+ */
 typedef struct {
     struct iocb *iocbs[MAX_QUEUED_IO];
     int plugged;
     unsigned int size;
     unsigned int idx;
+
+    /* handle -EAGAIN and partial completion */
+    QEMUBH *retry;
 } LaioQueue;
 
 struct qemu_laio_state {
@@ -86,6 +94,12 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
     qemu_aio_release(laiocb);
 }
 
+static void qemu_laio_start_retry(struct qemu_laio_state *s)
+{
+    if (s->io_q.idx) {
+        qemu_bh_schedule(s->io_q.retry);
+    }
+}
+
 static void qemu_laio_completion_cb(EventNotifier *e)
 {
     struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
@@ -108,6 +122,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
             qemu_laio_process_completion(s, laiocb);
         }
     }
+    qemu_laio_start_retry(s);
 }
 
 static void laio_cancel(BlockDriverAIOCB *blockacb)
@@ -127,6 +142,7 @@ static void laio_cancel(BlockDriverAIOCB *blockacb)
     ret = io_cancel(laiocb->ctx->ctx, &laiocb->iocb, &event);
     if (ret == 0) {
         laiocb->ret = -ECANCELED;
+        qemu_laio_start_retry(laiocb->ctx);
         return;
     }
 
@@ -154,45 +170,80 @@ static void ioq_init(LaioQueue *io_q)
     io_q->plugged = 0;
 }
 
-static int ioq_submit(struct qemu_laio_state *s)
+static void abort_queue(struct qemu_laio_state *s)
+{
+    int i;
+    for (i = 0; i < s->io_q.idx; i++) {
+        struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+                                                  struct qemu_laiocb,
+                                                  iocb);
+        laiocb->ret = -EIO;
+        qemu_laio_process_completion(s, laiocb);
+    }
+}
+
+static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
 {
     int ret, i = 0;
     int len = s->io_q.idx;
+    int j = 0;
 
-    do {
-        ret = io_submit(s->ctx, len, s->io_q.iocbs);
-    } while (i++ < 3 && ret == -EAGAIN);
+    if (!len) {
+        return 0;
+    }
+
+    ret = io_submit(s->ctx, len, s->io_q.iocbs);
+    if (ret == -EAGAIN) { /* retry in following completion cb */
+        return 0;
+    } else if (ret < 0) {
+        if (enqueue) {
+            return ret;
+        }
 
-    /* empty io queue */
-    s->io_q.idx = 0;
+        /* in non-queue path, all IOs have to be completed */
+        abort_queue(s);
+        ret = len;
+    } else if (ret == 0) {
+        goto out;
+    }
 
-    if (ret < 0) {
-        i = 0;
-    } else {
-        i = ret;
+    for (i = ret; i < len; i++) {
+        s->io_q.iocbs[j++] = s->io_q.iocbs[i];
     }
 
-    for (; i < len; i++) {
-        struct qemu_laiocb *laiocb =
-            container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
+ out:
+    /*
+     * Update the io queue; for partial completion, the retry will be
+     * started automatically in the following completion cb.
+     */
+    s->io_q.idx -= ret;
 
-        laiocb->ret = (ret < 0) ? ret : -EIO;
-        qemu_laio_process_completion(s, laiocb);
-    }
     return ret;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static void ioq_submit_retry(void *opaque)
+{
+    struct qemu_laio_state *s = opaque;
+    ioq_submit(s, false);
+}
+
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
     unsigned int idx = s->io_q.idx;
 
+    if (unlikely(idx == s->io_q.size)) {
+        return -1;
+    }
+
     s->io_q.iocbs[idx++] = iocb;
     s->io_q.idx = idx;
 
-    /* submit immediately if queue is full */
-    if (idx == s->io_q.size) {
-        ioq_submit(s);
+    /* submit immediately if queue depth is above 2/3 */
+    if (idx > s->io_q.size * 2 / 3) {
+        return ioq_submit(s, true);
     }
+
+    return 0;
 }
 
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
@@ -214,7 +265,7 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
     }
 
     if (s->io_q.idx > 0) {
-        ret = ioq_submit(s);
+        ret = ioq_submit(s, false);
     }
 
     return ret;
@@ -258,7 +309,9 @@ BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
             goto out_free_aiocb;
         }
     } else {
-        ioq_enqueue(s, iocbs);
+        if (ioq_enqueue(s, iocbs) < 0) {
+            goto out_free_aiocb;
+        }
     }
     return &laiocb->common;
 
@@ -272,6 +325,7 @@ void laio_detach_aio_context(void *s_, AioContext *old_context)
     struct qemu_laio_state *s = s_;
 
     aio_set_event_notifier(old_context, &s->e, NULL);
+    qemu_bh_delete(s->io_q.retry);
 }
 
 void laio_attach_aio_context(void *s_, AioContext *new_context)
@@ -279,6 +333,7 @@ void laio_attach_aio_context(void *s_, AioContext *new_context)
     struct qemu_laio_state *s = s_;
 
     aio_set_event_notifier(new_context, &s->e, qemu_laio_completion_cb);
+    s->io_q.retry = aio_bh_new(new_context, ioq_submit_retry, s);
 }
 
 void *laio_init(void)
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 11/17] linux-aio: handling -EAGAIN for !s->io_q.plugged case
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (9 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 10/17] linux-aio: fix submit aio as a batch Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 12/17] linux-aio: increase max event to 256 Ming Lei
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

Previously -EAGAIN was simply ignored in the !s->io_q.plugged case,
which can easily cause -EIO to be reported to the VM, for example
with NVMe devices.

This patch handles -EAGAIN via the io queue in the !s->io_q.plugged
case; the request will be retried in the following aio completion cb.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/linux-aio.c |   22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 4cdf507..0e21f76 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -238,6 +238,11 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
     s->io_q.iocbs[idx++] = iocb;
     s->io_q.idx = idx;
 
+    /* in the non-plugged case, defer submission to the next completion (-EAGAIN) */
+    if (unlikely(!s->io_q.plugged)) {
+        return 0;
+    }
+
     /* submit immediately if queue depth is above 2/3 */
     if (idx > s->io_q.size * 2 / 3) {
         return ioq_submit(s, true);
@@ -305,10 +310,25 @@ BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
     io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
     if (!s->io_q.plugged) {
-        if (io_submit(s->ctx, 1, &iocbs) < 0) {
+        int ret;
+
+        if (!s->io_q.idx) {
+            ret = io_submit(s->ctx, 1, &iocbs);
+        } else {
+            ret = -EAGAIN;
+        }
+        /*
+         * Switch to queue mode until -EAGAIN is handled: we assume there
+         * is always uncompleted I/O in flight, so enqueue the request;
+         * it will be submitted again in the following aio completion cb.
+         */
+        if (ret == -EAGAIN) {
+            goto enqueue;
+        } else if (ret < 0) {
             goto out_free_aiocb;
         }
     } else {
+ enqueue:
         if (ioq_enqueue(s, iocbs) < 0) {
             goto out_free_aiocb;
         }
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 12/17] linux-aio: increase max event to 256
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (10 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 11/17] linux-aio: handling -EAGAIN for !s->io_q.plugged case Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 13/17] linux-aio: remove 'node' from 'struct qemu_laiocb' Ming Lei
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch increases the max event count to 256 for the coming
virtio-blk multi virtqueue support.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/linux-aio.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 0e21f76..bf94ae9 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -23,7 +23,7 @@
  *      than this we will get EAGAIN from io_submit which is communicated to
  *      the guest as an I/O error.
  */
-#define MAX_EVENTS 128
+#define MAX_EVENTS 256
 
 #define MAX_QUEUED_IO  128
 
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 13/17] linux-aio: remove 'node' from 'struct qemu_laiocb'
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (11 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 12/17] linux-aio: increase max event to 256 Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 14/17] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ Ming Lei
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

No one uses the 'node' field any more, so remove it from
'struct qemu_laiocb'; this saves 16 bytes in the struct on
64-bit arches.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/linux-aio.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index bf94ae9..da50ea5 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
     size_t nbytes;
     QEMUIOVector *qiov;
     bool is_read;
-    QLIST_ENTRY(qemu_laiocb) node;
 };
 
 /*
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 14/17] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (12 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 13/17] linux-aio: remove 'node' from 'struct qemu_laiocb' Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 15/17] virtio-blk: support multi queue for non-dataplane Ming Lei
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

Prepare for supporting multiple virtqueues per virtio-blk device.
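
As context, a guest driver would consume the new feature bit and
config field roughly like this (a sketch; read_config_le16() is a
placeholder accessor):

    uint16_t num_queues = 1;   /* default when VIRTIO_BLK_F_MQ is absent */

    if (features & (1u << VIRTIO_BLK_F_MQ)) {
        /* num_queues is only valid when the host offers the feature */
        num_queues = read_config_le16(
            offsetof(struct virtio_blk_config, num_queues));
    }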

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/hw/virtio/virtio-blk.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 49ac234..5b0fb91 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -42,6 +42,12 @@
 #define VIRTIO_BLK_F_TOPOLOGY   10      /* Topology information is available */
 #define VIRTIO_BLK_F_CONFIG_WCE 11      /* write cache configurable */
 
+/*
+ * support multi vqs, and virtio_blk_config.num_queues is only
+ * available when this feature is enabled
+ */
+#define VIRTIO_BLK_F_MQ		12
+
 #define VIRTIO_BLK_ID_BYTES     20      /* ID string length */
 
 struct virtio_blk_config
@@ -58,6 +64,8 @@ struct virtio_blk_config
     uint16_t min_io_size;
     uint32_t opt_io_size;
     uint8_t wce;
+    uint8_t unused;
+    uint16_t num_queues;	/* must be at the end */
 } QEMU_PACKED;
 
 /* These two define direction. */
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 15/17] virtio-blk: support multi queue for non-dataplane
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (13 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 14/17] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 16/17] virtio-blk: dataplane: support multi virtqueue Ming Lei
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch introduces support for multiple virtqueues in the
non-dataplane path; the conversion is fairly straightforward.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/virtio-blk.c          |   25 +++++++++++++++++++------
 include/hw/virtio/virtio-blk.h |    4 +++-
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 2a11bc4..baec8f8 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -58,12 +58,13 @@ static void virtio_blk_complete_request(VirtIOBlockReq *req,
 {
     VirtIOBlock *s = req->dev;
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
+    unsigned qid = req->qid;
 
     trace_virtio_blk_req_complete(req, status);
 
     stb_p(&req->in->status, status);
-    virtqueue_push(s->vq, &req->elem, req->qiov.size + sizeof(*req->in));
-    virtio_notify(vdev, s->vq);
+    virtqueue_push(s->vqs[qid], &req->elem, req->qiov.size + sizeof(*req->in));
+    virtio_notify(vdev, s->vqs[qid]);
 }
 
 static void virtio_blk_req_complete(VirtIOBlockReq *req, unsigned char status)
@@ -123,11 +124,12 @@ static void virtio_blk_flush_complete(void *opaque, int ret)
     virtio_blk_free_request(req);
 }
 
-static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s)
+static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s, unsigned qid)
 {
     VirtIOBlockReq *req = virtio_blk_alloc_request(s);
 
-    if (!virtqueue_pop(s->vq, &req->elem)) {
+    req->qid = qid;
+    if (!virtqueue_pop(s->vqs[qid], &req->elem)) {
         virtio_blk_free_request(req);
         return NULL;
     }
@@ -439,6 +441,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     MultiReqBuffer mrb = {
         .num_writes = 0,
     };
+    unsigned qid = virtio_get_queue_index(vq);
 
 #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
     /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
@@ -450,7 +453,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     }
 #endif
 
-    while ((req = virtio_blk_get_request(s))) {
+    while ((req = virtio_blk_get_request(s, qid))) {
         virtio_blk_handle_request(req, &mrb);
     }
 
@@ -556,6 +559,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
     blkcfg.physical_block_exp = get_physical_block_exp(s->conf);
     blkcfg.alignment_offset = 0;
     blkcfg.wce = bdrv_enable_write_cache(s->bs);
+    stw_p(&blkcfg.num_queues, s->blk.num_queues);
     memcpy(config, &blkcfg, sizeof(struct virtio_blk_config));
 }
 
@@ -590,6 +594,10 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
     if (bdrv_is_read_only(s->bs))
         features |= 1 << VIRTIO_BLK_F_RO;
 
+    if (s->blk.num_queues > 1) {
+        features |= 1 << VIRTIO_BLK_F_MQ;
+    }
+
     return features;
 }
 
@@ -739,6 +747,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
     Error *err = NULL;
 #endif
+    int i;
     static int virtio_blk_id;
 
     if (!blk->conf.bs) {
@@ -765,7 +774,9 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     s->rq = NULL;
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
 
-    s->vq = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
+    s->vqs = g_malloc0(sizeof(VirtQueue *) * blk->num_queues);
+    for (i = 0; i < blk->num_queues; i++)
+        s->vqs[i] = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
     s->complete_request = virtio_blk_complete_request;
 #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
     virtio_blk_data_plane_create(vdev, blk, &s->dataplane, &err);
@@ -802,6 +813,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev, Error **errp)
     qemu_del_vm_change_state_handler(s->change);
     unregister_savevm(dev, "virtio-blk", s);
     blockdev_mark_auto_del(s->bs);
+    g_free(s->vqs);
     virtio_cleanup(vdev);
 }
 
@@ -809,6 +821,7 @@ static void virtio_blk_instance_init(Object *obj)
 {
     VirtIOBlock *s = VIRTIO_BLK(obj);
 
+    s->blk.num_queues = 1;    /* number of queues has to be at least 1 */
     s->obj_pool = NULL;
     object_property_add_link(obj, "iothread", TYPE_IOTHREAD,
                              (Object **)&s->blk.iothread,
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 5b0fb91..79c3017 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -122,6 +122,7 @@ struct VirtIOBlkConf
     uint32_t scsi;
     uint32_t config_wce;
     uint32_t data_plane;
+    uint32_t num_queues;
 };
 
 struct VirtIOBlockDataPlane;
@@ -130,7 +131,7 @@ struct VirtIOBlockReq;
 typedef struct VirtIOBlock {
     VirtIODevice parent_obj;
     BlockDriverState *bs;
-    VirtQueue *vq;
+    VirtQueue **vqs;
     void *rq;
     QEMUBH *bh;
     BlockConf *conf;
@@ -160,6 +161,7 @@ typedef struct VirtIOBlockReq {
     QEMUIOVector qiov;
     struct VirtIOBlockReq *next;
     BlockAcctCookie acct;
+    unsigned qid;
 } VirtIOBlockReq;
 
 VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s);
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 16/17] virtio-blk: dataplane: support multi virtqueue
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (14 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 15/17] virtio-blk: support multi queue for non-dataplane Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 17/17] hw/virtio-pci: introduce num_queues property Ming Lei
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch adds support for handling host notifies from multiple
virtqueues, while still processing and submitting I/O in a single
iothread.

One BH is introduced to process I/O from all virtqueues, so that
requests can be submitted to the kernel in batches as far as
possible.
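
In short, each host notifier handler only marks its virtqueue as
pending and schedules one BH; the BH then drains all pending
virtqueues in a single batch. Roughly (a distilled sketch of the
code in the patch below; 1UL avoids shifting a 32-bit constant):

    while ((qid = ffsl(pending))) {
        qid--;                       /* ffsl() returns a 1-based index */
        process_vq_notify(s, qid);
        pending &= ~(1UL << qid);
    }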

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/dataplane/virtio-blk.c |  211 ++++++++++++++++++++++++++++-----------
 1 file changed, 153 insertions(+), 58 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index a0732e3..d61a920 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -26,6 +26,11 @@
 
 #define REQ_POOL_SZ 128
 
+typedef struct {
+    EventNotifier notifier;
+    VirtIOBlockDataPlane *s;
+} VirtIOBlockNotifier;
+
 struct VirtIOBlockDataPlane {
     bool started;
     bool starting;
@@ -35,9 +40,10 @@ struct VirtIOBlockDataPlane {
     VirtIOBlkConf *blk;
 
     VirtIODevice *vdev;
-    Vring vring;                    /* virtqueue vring */
-    EventNotifier *guest_notifier;  /* irq */
-    QEMUBH *bh;                     /* bh for guest notification */
+    Vring *vring;                    /* virtqueue vring */
+    EventNotifier **guest_notifier;  /* irq */
+    uint64_t   pending_guest_notifier;  /* pending guest notifiers, one bit per vq */
+    QEMUBH *bh;                      /* bh for guest notification */
 
     /* Note that these EventNotifiers are assigned by value.  This is
      * fine as long as you do not call event_notifier_cleanup on them
@@ -47,7 +53,9 @@ struct VirtIOBlockDataPlane {
     IOThread *iothread;
     IOThread internal_iothread_obj;
     AioContext *ctx;
-    EventNotifier host_notifier;    /* doorbell */
+    VirtIOBlockNotifier *host_notifier; /* doorbell */
+    uint64_t   pending_host_notifier;   /* pending host notifiers, one bit per vq */
+    QEMUBH *host_notifier_bh;           /* BH for handling host notifiers */
 
     /* Operation blocker on BDS */
     Error *blocker;
@@ -60,20 +68,26 @@ struct VirtIOBlockDataPlane {
 };
 
 /* Raise an interrupt to signal guest, if necessary */
-static void notify_guest(VirtIOBlockDataPlane *s)
+static void notify_guest(VirtIOBlockDataPlane *s, unsigned int qid)
 {
-    if (!vring_should_notify(s->vdev, &s->vring)) {
-        return;
+    if (vring_should_notify(s->vdev, &s->vring[qid])) {
+        event_notifier_set(s->guest_notifier[qid]);
     }
-
-    event_notifier_set(s->guest_notifier);
 }
 
 static void notify_guest_bh(void *opaque)
 {
     VirtIOBlockDataPlane *s = opaque;
+    unsigned int qid;
+    uint64_t pending = s->pending_guest_notifier;
+
+    s->pending_guest_notifier = 0;
 
-    notify_guest(s);
+    while ((qid = ffsl(pending))) {
+        qid--;
+        notify_guest(s, qid);
+        pending &= ~(1 << qid);
+    }
 }
 
 static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
@@ -81,7 +95,7 @@ static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
     VirtIOBlockDataPlane *s = req->dev->dataplane;
     stb_p(&req->in->status, status);
 
-    vring_push(&req->dev->dataplane->vring, &req->elem,
+    vring_push(&s->vring[req->qid], &req->elem,
                req->qiov.size + sizeof(*req->in));
 
     /* Suppress notification to guest by BH and its scheduled
@@ -90,17 +104,15 @@ static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
      * executed in dataplane aio context even after it is
      * stopped, so needn't worry about notification loss with BH.
      */
+    assert(req->qid < 64);
+    s->pending_guest_notifier |= (1 << req->qid);
     qemu_bh_schedule(s->bh);
 }
 
-static void handle_notify(EventNotifier *e)
+static void process_vq_notify(VirtIOBlockDataPlane *s, unsigned short qid)
 {
-    VirtIOBlockDataPlane *s = container_of(e, VirtIOBlockDataPlane,
-                                           host_notifier);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
 
-    event_notifier_test_and_clear(&s->host_notifier);
-    bdrv_io_plug(s->blk->conf.bs);
     for (;;) {
         MultiReqBuffer mrb = {
             .num_writes = 0,
@@ -108,12 +120,13 @@ static void handle_notify(EventNotifier *e)
         int ret;
 
         /* Disable guest->host notifies to avoid unnecessary vmexits */
-        vring_disable_notification(s->vdev, &s->vring);
+        vring_disable_notification(s->vdev, &s->vring[qid]);
 
         for (;;) {
             VirtIOBlockReq *req = virtio_blk_alloc_request(vblk);
 
-            ret = vring_pop(s->vdev, &s->vring, &req->elem);
+            req->qid = qid;
+            ret = vring_pop(s->vdev, &s->vring[qid], &req->elem);
             if (ret < 0) {
                 virtio_blk_free_request(req);
                 break; /* no more requests */
@@ -132,16 +145,48 @@ static void handle_notify(EventNotifier *e)
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
-            if (vring_enable_notification(s->vdev, &s->vring)) {
+            if (vring_enable_notification(s->vdev, &s->vring[qid])) {
                 break;
             }
         } else { /* fatal error */
             break;
         }
     }
+}
+
+static void process_notify(void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+    unsigned int qid;
+    uint64_t pending = s->pending_host_notifier;
+
+    s->pending_host_notifier = 0;
+
+    bdrv_io_plug(s->blk->conf.bs);
+    while ((qid = ffsl(pending))) {
+        qid--;
+        process_vq_notify(s, qid);
+        pending &= ~(1 << qid);
+    }
     bdrv_io_unplug(s->blk->conf.bs);
 }
 
+/* TODO: handle requests from other vqs together */
+static void handle_notify(EventNotifier *e)
+{
+    VirtIOBlockNotifier *n = container_of(e, VirtIOBlockNotifier,
+		                          notifier);
+    VirtIOBlockDataPlane *s = n->s;
+    unsigned int qid = n - &s->host_notifier[0];
+
+    assert(qid < 64);
+
+    event_notifier_test_and_clear(e);
+
+    s->pending_host_notifier |= (1 << qid);
+    qemu_bh_schedule(s->host_notifier_bh);
+}
+
 /* Context: QEMU global mutex held */
 void virtio_blk_data_plane_create(VirtIODevice *vdev, VirtIOBlkConf *blk,
                                   VirtIOBlockDataPlane **dataplane,
@@ -197,6 +242,11 @@ void virtio_blk_data_plane_create(VirtIODevice *vdev, VirtIOBlkConf *blk,
     s->ctx = iothread_get_aio_context(s->iothread);
     s->bh = aio_bh_new(s->ctx, notify_guest_bh, s);
 
+    s->vring = g_new0(Vring, blk->num_queues);
+    s->guest_notifier = g_new(EventNotifier *, blk->num_queues);
+    s->host_notifier = g_new(VirtIOBlockNotifier, blk->num_queues);
+    s->host_notifier_bh = aio_bh_new(s->ctx, process_notify, s);
+
     error_setg(&s->blocker, "block device is in use by data plane");
     bdrv_op_block_all(blk->conf.bs, s->blocker);
 
@@ -217,16 +267,83 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
     error_free(s->blocker);
     object_unref(OBJECT(s->iothread));
     qemu_bh_delete(s->bh);
+    qemu_bh_delete(s->host_notifier_bh);
+    g_free(s->vring);
+    g_free(s->guest_notifier);
+    g_free(s->host_notifier);
     g_free(s);
 }
 
+static int pre_start_vq(VirtIOBlockDataPlane *s, BusState *qbus,
+                        VirtioBusClass *k)
+{
+    int i;
+    int num = s->blk->num_queues;
+    VirtQueue *vq[num];
+
+    for (i = 0; i < num; i++) {
+        vq[i] = virtio_get_queue(s->vdev, i);
+        if (!vring_setup(&s->vring[i], s->vdev, i)) {
+            return -1;
+        }
+    }
+
+    /* Set up guest notifier (irq) */
+    if (k->set_guest_notifiers(qbus->parent, num, true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set guest notifier, "
+                "ensure -enable-kvm is set\n");
+        exit(1);
+    }
+
+    for (i = 0; i < num; i++)
+        s->guest_notifier[i] = virtio_queue_get_guest_notifier(vq[i]);
+    s->pending_guest_notifier = 0;
+
+    /* Set up virtqueue notify */
+    for (i = 0; i < num; i++) {
+        if (k->set_host_notifier(qbus->parent, i, true) != 0) {
+            fprintf(stderr, "virtio-blk failed to set host notifier\n");
+            exit(1);
+        }
+        s->host_notifier[i].notifier = *virtio_queue_get_host_notifier(vq[i]);
+        s->host_notifier[i].s = s;
+    }
+    s->pending_host_notifier = 0;
+
+    return 0;
+}
+
+static void post_start_vq(VirtIOBlockDataPlane *s)
+{
+    int i;
+    int num = s->blk->num_queues;
+
+    for (i = 0; i < num; i++) {
+        VirtQueue *vq;
+        vq = virtio_get_queue(s->vdev, i);
+
+        /* Kick right away to begin processing requests already in vring */
+        event_notifier_set(virtio_queue_get_host_notifier(vq));
+    }
+
+    if (s->raw_format) {
+        bdrv_set_bypass_co(s->blk->conf.bs, true);
+    }
+
+    /* Get this show started by hooking up our callbacks */
+    aio_context_acquire(s->ctx);
+    for (i = 0; i < num; i++)
+        aio_set_event_notifier(s->ctx, &s->host_notifier[i].notifier,
+                               handle_notify);
+    aio_context_release(s->ctx);
+}
+
 /* Context: QEMU global mutex held */
 void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
 {
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s->vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
-    VirtQueue *vq;
 
     if (s->started) {
         return;
@@ -238,51 +355,24 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
 
     s->starting = true;
 
-    vq = virtio_get_queue(s->vdev, 0);
-    if (!vring_setup(&s->vring, s->vdev, 0)) {
-        s->starting = false;
-        return;
-    }
-
     vblk->obj_pool = &s->req_pool;
     obj_pool_init(vblk->obj_pool, s->reqs, s->free_reqs,
                   sizeof(VirtIOBlockReq), REQ_POOL_SZ);
 
-    /* Set up guest notifier (irq) */
-    if (k->set_guest_notifiers(qbus->parent, 1, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set guest notifier, "
-                "ensure -enable-kvm is set\n");
-        exit(1);
-    }
-    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
-
-    /* Set up virtqueue notify */
-    if (k->set_host_notifier(qbus->parent, 0, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set host notifier\n");
-        exit(1);
-    }
-    s->host_notifier = *virtio_queue_get_host_notifier(vq);
-
     s->saved_complete_request = vblk->complete_request;
     vblk->complete_request = complete_request_vring;
 
+    if (pre_start_vq(s, qbus, k)) {
+        s->starting = false;
+        return;
+    }
+
     s->starting = false;
     s->started = true;
     trace_virtio_blk_data_plane_start(s);
 
     bdrv_set_aio_context(s->blk->conf.bs, s->ctx);
-
-    /* Kick right away to begin processing requests already in vring */
-    event_notifier_set(virtio_queue_get_host_notifier(vq));
-
-    if (s->raw_format) {
-        bdrv_set_bypass_co(s->ctx, true);
-    }
-
-    /* Get this show started by hooking up our callbacks */
-    aio_context_acquire(s->ctx);
-    aio_set_event_notifier(s->ctx, &s->host_notifier, handle_notify);
-    aio_context_release(s->ctx);
+    post_start_vq(s);
 }
 
 /* Context: QEMU global mutex held */
@@ -291,6 +381,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s->vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
+    int i;
+    int num = s->blk->num_queues;
     if (!s->started || s->stopping) {
         return;
     }
@@ -301,7 +393,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     aio_context_acquire(s->ctx);
 
     /* Stop notifications for new requests from guest */
-    aio_set_event_notifier(s->ctx, &s->host_notifier, NULL);
+    for (i = 0; i < num; i++)
+        aio_set_event_notifier(s->ctx, &s->host_notifier[i].notifier, NULL);
 
     /* Drain and switch bs back to the QEMU main loop */
     bdrv_set_aio_context(s->blk->conf.bs, qemu_get_aio_context());
@@ -311,17 +404,19 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     vblk->obj_pool = NULL;
 
     if (s->raw_format) {
-        bdrv_set_bypass_co(s->ctx, false);
+        bdrv_set_bypass_co(s->blk->conf.bs, false);
     }
     /* Sync vring state back to virtqueue so that non-dataplane request
      * processing can continue when we disable the host notifier below.
      */
-    vring_teardown(&s->vring, s->vdev, 0);
+    for (i = 0; i < num; i++)
+        vring_teardown(&s->vring[i], s->vdev, 0);
 
-    k->set_host_notifier(qbus->parent, 0, false);
+    for (i = 0; i < num; i++)
+        k->set_host_notifier(qbus->parent, i, false);
 
     /* Clean up guest notifier (irq) */
-    k->set_guest_notifiers(qbus->parent, 1, false);
+    k->set_guest_notifiers(qbus->parent, num, false);
 
     s->started = false;
     s->stopping = false;
-- 
1.7.9.5


* [Qemu-devel] [PATCH v1 17/17] hw/virtio-pci: introduce num_queues property
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (15 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 16/17] virtio-blk: dataplane: support multi virtqueue Ming Lei
@ 2014-08-05  3:33 ` Ming Lei
  2014-08-05  9:38 ` [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Stefan Hajnoczi
  2014-08-05  9:48 ` Kevin Wolf
  18 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  3:33 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch introduces a 'num_queues' parameter,
so that virtio-blk can support multiple virtqueues.

The virtio-blk multi virtqueue feature will be added to
virtio spec 1.1[1], and the 3.17 Linux kernel[2] will support
the feature in the virtio-blk driver. For those who want to try
it now, the kernel-side patches can be found in either
Jens's block tree[3] or linux-next[4].

In my fio test in a VM hosted on the server, with "num_queues"
set to 4 and the fio JOBS parameter set to 4, throughput improves
by 25% compared with a single virtqueue at any JOBS value.

Scalability is improved much more than raw throughput,
for example:
        ---------------------------------------------------
                | VM in server host, 4 virtqueues vs. 1 virtqueue
        ---------------------------------------------------
        JOBS=2  | +10%
        ---------------------------------------------------
        JOBS=4  | +78%
        ---------------------------------------------------

[1], http://marc.info/?l=linux-api&m=140486843317107&w=2
[2], http://marc.info/?l=linux-api&m=140418368421229&w=2
[3], http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/ #for-3.17/drivers
[4], https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/
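
As an illustration (not part of this patch; the drive options shown
are just the usual ones), a 4-queue disk could then be created with:

    qemu-system-x86_64 ... \
        -drive if=none,id=drive0,file=test.img,format=raw,cache=none,aio=native \
        -device virtio-blk-pci,drive=drive0,num_queues=4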

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/virtio-blk.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index baec8f8..58f8296 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -834,6 +834,7 @@ static Property virtio_blk_properties[] = {
     DEFINE_BLOCK_CHS_PROPERTIES(VirtIOBlock, blk.conf),
     DEFINE_PROP_STRING("serial", VirtIOBlock, blk.serial),
     DEFINE_PROP_BIT("config-wce", VirtIOBlock, blk.config_wce, 0, true),
+    DEFINE_PROP_UINT32("num_queues", VirtIOBlock, blk.num_queues, 1),
 #ifdef __linux__
     DEFINE_PROP_BIT("scsi", VirtIOBlock, blk.scsi, 0, true),
 #endif
-- 
1.7.9.5


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (16 preceding siblings ...)
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 17/17] hw/virtio-pci: introduce num_queues property Ming Lei
@ 2014-08-05  9:38 ` Stefan Hajnoczi
  2014-08-05  9:50   ` Ming Lei
  2014-08-05  9:48 ` Kevin Wolf
  18 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2014-08-05  9:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Tue, Aug 05, 2014 at 11:33:01AM +0800, Ming Lei wrote:
> These patches bring up below 4 changes:
>         - introduce object allocation pool and apply it to
>         virtio-blk dataplane for improving its performance
> 
>         - introduce selective coroutine bypass mechanism
>         for improving performance of virtio-blk dataplane with
>         raw format image
> 
>         - linux-aio changes: fixing for cases of -EAGAIN and partial
>         completion, increase max events to 256, and remove one unuseful
>         fields in 'struct qemu_laiocb'
> 
>         - support multi virtqueue for virtio-blk

Please split up this patch series into separate patch series.

These are independent changes and there is no reason to combine them.
You're doing yourself a disservice because changes that are ready to be
applied are getting held up by those that still need more discussion.

That will also make the performance discussions easier to follow since
each patch series should include performance results, making it easy to
understand how much improvement each change brings.

Stefan


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (17 preceding siblings ...)
  2014-08-05  9:38 ` [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Stefan Hajnoczi
@ 2014-08-05  9:48 ` Kevin Wolf
  2014-08-05 10:00   ` Ming Lei
  18 siblings, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-05  9:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
> Hi,
> 
> These patches bring up below 4 changes:
>         - introduce object allocation pool and apply it to
>         virtio-blk dataplane for improving its performance
> 
>         - introduce selective coroutine bypass mechanism
>         for improving performance of virtio-blk dataplane with
>         raw format image

Before applying any bypassing patches, I think we should understand in
detail where we are losing performance with coroutines enabled.

I also think that the device emulation has no business in deciding
whether the bypass is used (it depends solely on conditions outside of
the device) and that leaking the fd number out of raw-posix is wrong.
Both of them are layering violations that shouldn't be reintroduced.

>         - linux-aio changes: fixing for cases of -EAGAIN and partial
>         completion, increase max events to 256, and remove one unuseful
>         fields in 'struct qemu_laiocb'
> 
>         - support multi virtqueue for virtio-blk

Like Stefan said, the series should be split in four, one for each item
in your list, so that each optimisation can be evaluated on its own.

Kevin


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05  9:38 ` [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Stefan Hajnoczi
@ 2014-08-05  9:50   ` Ming Lei
  2014-08-05  9:56     ` Kevin Wolf
  2014-08-05 13:59     ` Stefan Hajnoczi
  0 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05  9:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Tue, Aug 5, 2014 at 5:38 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Tue, Aug 05, 2014 at 11:33:01AM +0800, Ming Lei wrote:
>> These patches bring up below 4 changes:
>>         - introduce object allocation pool and apply it to
>>         virtio-blk dataplane for improving its performance
>>
>>         - introduce selective coroutine bypass mechanism
>>         for improving performance of virtio-blk dataplane with
>>         raw format image
>>
>>         - linux-aio changes: fixing for cases of -EAGAIN and partial
>>         completion, increase max events to 256, and remove one unuseful
>>         fields in 'struct qemu_laiocb'
>>
>>         - support multi virtqueue for virtio-blk
>
> Please split up this patch series into separate patch series.
>
> These are independent changes and there is no reason to combine them.
> You're doing yourself a disservice because changes that are ready to be
> applied are getting held up by those that still need more discussion.

Without the previous optimization patches, the mq conversion can't
achieve as much improvement; that is why I put them together.

Also, the mq conversion depends on the linux-aio fix.

It also becomes difficult to test these patches if they are split up,
and describing the dependencies is a bit annoying too.

> That will also make the performance discussions easier to follow since
> each patch series should include performance results, making it easy to
> understand how much improvement each change brings.

The numbers can be found inside the patches: patch 02 has the
numbers for the object pool, patch 09 for bypassing coroutines,
and patch 17 for the mq conversion.

Thanks,


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05  9:50   ` Ming Lei
@ 2014-08-05  9:56     ` Kevin Wolf
  2014-08-05 10:50       ` Ming Lei
  2014-08-05 13:59     ` Stefan Hajnoczi
  1 sibling, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-05  9:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

Am 05.08.2014 um 11:50 hat Ming Lei geschrieben:
> On Tue, Aug 5, 2014 at 5:38 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Tue, Aug 05, 2014 at 11:33:01AM +0800, Ming Lei wrote:
> >> These patches bring up below 4 changes:
> >>         - introduce object allocation pool and apply it to
> >>         virtio-blk dataplane for improving its performance
> >>
> >>         - introduce selective coroutine bypass mechanism
> >>         for improving performance of virtio-blk dataplane with
> >>         raw format image
> >>
> >>         - linux-aio changes: fixing for cases of -EAGAIN and partial
> >>         completion, increase max events to 256, and remove one unuseful
> >>         fields in 'struct qemu_laiocb'
> >>
> >>         - support multi virtqueue for virtio-blk
> >
> > Please split up this patch series into separate patch series.
> >
> > These are independent changes and there is no reason to combine them.
> > You're doing yourself a disservice because changes that are ready to be
> > applied are getting held up by those that still need more discussion.
> 
> Without previous optimization patches, the mq conversion can't
> obtain so much improvement, that is why I put them together.
> 
> Also mq conversion depends on linux-aio fix too.
> 
> Also it becomes a difficult to test these patches if they are splitted,
> and describing the dependency is a bit annoying too.
> 
> > That will also make the performance discussions easier to follow since
> > each patch series should include performance results, making it easy to
> > understand how much improvement each change brings.
> 
> The number can be found inside patches, for example, patch 02 has
> the number for using obj pool, and patch 09 has the number for
> bypassing coroutine, and patch 17 has the number for mq conversion.

A claim like "~5%-10% throughput improvement" is neither hard numbers
nor a precise description of your benchmark setup.

Kevin


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05  9:48 ` Kevin Wolf
@ 2014-08-05 10:00   ` Ming Lei
  2014-08-05 11:44     ` Paolo Bonzini
  2014-08-05 13:48     ` Stefan Hajnoczi
  0 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05 10:00 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
>> Hi,
>>
>> These patches bring up below 4 changes:
>>         - introduce object allocation pool and apply it to
>>         virtio-blk dataplane for improving its performance
>>
>>         - introduce selective coroutine bypass mechanism
>>         for improving performance of virtio-blk dataplane with
>>         raw format image
>
> Before applying any bypassing patches, I think we should understand in
> detail where we are losing performance with coroutines enabled.

From the profiling data below, the CPU becomes slower at running
instructions with coroutines, and CPU dcache misses increase, so it
is very likely caused by frequent stack switching.

http://marc.info/?l=qemu-devel&m=140679721126306&w=2

http://pastebin.com/ae0vnQ6V
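
(For reference, numbers like these can be gathered with something
along the lines of the following; this is an illustration, not
necessarily the exact invocation used.)

    perf stat -e instructions,cycles,L1-dcache-load-misses \
        -p $(pidof qemu-system-x86_64) -- sleep 30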

>
> I also think that the device emulation has no business in deciding
> whether the bypass is used (it depends solely on conditions outside of
> the device) and that leaking the fd number out of raw-posix is wrong.
> Both of them are layering violations that shouldn't be reintroduced.

Yes, that is right, and I have added comments noting that the bypass
hint will be moved into the block layer completely in the future.

Thanks,


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05  9:56     ` Kevin Wolf
@ 2014-08-05 10:50       ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-05 10:50 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Tue, Aug 5, 2014 at 5:56 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 05.08.2014 um 11:50 hat Ming Lei geschrieben:
>> On Tue, Aug 5, 2014 at 5:38 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> > On Tue, Aug 05, 2014 at 11:33:01AM +0800, Ming Lei wrote:
>> >> These patches bring up below 4 changes:
>> >>         - introduce object allocation pool and apply it to
>> >>         virtio-blk dataplane for improving its performance
>> >>
>> >>         - introduce selective coroutine bypass mechanism
>> >>         for improving performance of virtio-blk dataplane with
>> >>         raw format image
>> >>
>> >>         - linux-aio changes: fixing for cases of -EAGAIN and partial
>> >>         completion, increase max events to 256, and remove one unuseful
>> >>         fields in 'struct qemu_laiocb'
>> >>
>> >>         - support multi virtqueue for virtio-blk
>> >
>> > Please split up this patch series into separate patch series.
>> >
>> > These are independent changes and there is no reason to combine them.
>> > You're doing yourself a disservice because changes that are ready to be
>> > applied are getting held up by those that still need more discussion.
>>
>> Without previous optimization patches, the mq conversion can't
>> obtain so much improvement, that is why I put them together.
>>
>> Also mq conversion depends on linux-aio fix too.
>>
>> Also it becomes a difficult to test these patches if they are splitted,
>> and describing the dependency is a bit annoying too.
>>
>> > That will also make the performance discussions easier to follow since
>> > each patch series should include performance results, making it easy to
>> > understand how much improvement each change brings.
>>
>> The number can be found inside patches, for example, patch 02 has
>> the number for using obj pool, and patch 09 has the number for
>> bypassing coroutine, and patch 17 has the number for mq conversion.
>
> A claim like "~5%-10% throughput improvement" isn't numbers nor a

Sorry, it is a range from 5% to 10% in my tests.

> precise description of your benchmark setup.

The benchmark setup can be found in the 0/17 cover letter; it is
basically a fio (randread, libaio, direct I/O, ...) test running
in a VM.

Sorry for not describing it clearly.

Thanks,


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05 10:00   ` Ming Lei
@ 2014-08-05 11:44     ` Paolo Bonzini
  2014-08-05 13:48     ` Stefan Hajnoczi
  1 sibling, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-05 11:44 UTC (permalink / raw)
  To: Ming Lei, Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, qemu-devel, Stefan Hajnoczi,
	Michael S. Tsirkin

Il 05/08/2014 12:00, Ming Lei ha scritto:
>> >
>> > I also think that the device emulation has no business in deciding
>> > whether the bypass is used (it depends solely on conditions outside of
>> > the device) and that leaking the fd number out of raw-posix is wrong.
>> > Both of them are layering violations that shouldn't be reintroduced.
> Yes, that is right, and I have added comments that the bypass hint will
> be moved to block layer completely in future.

Actually, it will never be accepted in the first place.

We have told you repeatedly that the bypass as you wrote it is buggy and
a layering violation.

Paolo


* Re: [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool Ming Lei
@ 2014-08-05 11:55   ` Eric Blake
  2014-08-05 12:05     ` Michael S. Tsirkin
  2014-08-06  2:35     ` Ming Lei
  0 siblings, 2 replies; 81+ messages in thread
From: Eric Blake @ 2014-08-05 11:55 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

On 08/04/2014 09:33 PM, Ming Lei wrote:
> This patch introduces object allocation pool for speeding up
> object allocation in fast path.
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  include/qemu/obj_pool.h |   64 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 64 insertions(+)
>  create mode 100644 include/qemu/obj_pool.h
> 
> diff --git a/include/qemu/obj_pool.h b/include/qemu/obj_pool.h
> new file mode 100644
> index 0000000..94b5f49
> --- /dev/null
> +++ b/include/qemu/obj_pool.h
> @@ -0,0 +1,64 @@
> +#ifndef QEMU_OBJ_POOL_HEAD
> +#define QEMU_OBJ_POOL_HEAD

Missing copyright boilerplate.  According to LICENSE, that makes this
file GPLv2+, but I'd much rather you make it explicit.
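
For instance, the usual QEMU-style header could be used (a sketch;
fill in the actual author):

    /*
     * Object allocation pool
     *
     * Copyright (C) 2014 <author>
     *
     * This work is licensed under the terms of the GNU GPL, version 2
     * or later.  See the COPYING file in the top-level directory.
     */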

> +
> +typedef struct {
> +    unsigned int size;
> +    unsigned int cnt;

size_t feels better for sizes.  int may be okay in this case, but
definitely consider if size_t is appropriate.

> +
> +    void **free_obj;
> +    int free_idx;
> +
> +    char *objs;
> +} ObjPool;
> +
> +static inline void obj_pool_init(ObjPool *op, void *objs_buf, void **free_objs,
> +                                 unsigned int obj_size, unsigned cnt)
> +{
> +    int i;
> +
> +    op->objs = (char *)objs_buf;

Why the cast? This is C, not C++.

> +    op->free_obj = free_objs;
> +    op->size = obj_size;
> +    op->cnt = cnt;
> +
> +    for (i = 0; i < op->cnt; i++) {
> +        op->free_obj[i] = (void *)&op->objs[i * op->size];

Again, why the cast?


> +static inline bool obj_pool_has_obj(ObjPool *op, void *obj)
> +{
> +    return op && (unsigned long)obj >= (unsigned long)&op->objs[0] &&
> +           (unsigned long)obj <=
> +           (unsigned long)&op->objs[(op->cnt - 1) * op->size];

uintptr_t, not unsigned long.  You are asking for problems on 64-bit
mingw, where unsigned long is 32 bits but uintptr_t is 64 bits.
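
Something like this instead (sketch):

    static inline bool obj_pool_has_obj(ObjPool *op, void *obj)
    {
        uintptr_t p = (uintptr_t)obj;

        return op && p >= (uintptr_t)&op->objs[0] &&
               p <= (uintptr_t)&op->objs[(op->cnt - 1) * op->size];
    }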

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool
  2014-08-05 11:55   ` Eric Blake
@ 2014-08-05 12:05     ` Michael S. Tsirkin
  2014-08-05 12:21       ` Eric Blake
  2014-08-06  2:35     ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Michael S. Tsirkin @ 2014-08-05 12:05 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Tue, Aug 05, 2014 at 05:55:49AM -0600, Eric Blake wrote:
> On 08/04/2014 09:33 PM, Ming Lei wrote:
> > This patch introduces object allocation pool for speeding up
> > object allocation in fast path.
> > 
> > Signed-off-by: Ming Lei <ming.lei@canonical.com>
> > ---
> >  include/qemu/obj_pool.h |   64 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 64 insertions(+)
> >  create mode 100644 include/qemu/obj_pool.h
> > 
> > diff --git a/include/qemu/obj_pool.h b/include/qemu/obj_pool.h
> > new file mode 100644
> > index 0000000..94b5f49
> > --- /dev/null
> > +++ b/include/qemu/obj_pool.h
> > @@ -0,0 +1,64 @@
> > +#ifndef QEMU_OBJ_POOL_HEAD
> > +#define QEMU_OBJ_POOL_HEAD
> 
> Missing copyright boilerplate.  According to LICENSE, that makes this
> file GPLv2+, but I'd much rather you make it explicit.
> 
> > +
> > +typedef struct {
> > +    unsigned int size;
> > +    unsigned int cnt;
> 
> size_t feels better for sizes.  int may be okay in this case, but
> definitely consider if size_t is appropriate.
> 
> > +
> > +    void **free_obj;
> > +    int free_idx;
> > +
> > +    char *objs;
> > +} ObjPool;
> > +
> > +static inline void obj_pool_init(ObjPool *op, void *objs_buf, void **free_objs,
> > +                                 unsigned int obj_size, unsigned cnt)
> > +{
> > +    int i;
> > +
> > +    op->objs = (char *)objs_buf;
> 
> Why the cast? This is C, not C++.

It's not needed in C++ either, right?

> > +    op->free_obj = free_objs;
> > +    op->size = obj_size;
> > +    op->cnt = cnt;
> > +
> > +    for (i = 0; i < op->cnt; i++) {
> > +        op->free_obj[i] = (void *)&op->objs[i * op->size];
> 
> Again, why the cast?
> 
> 
> > +static inline bool obj_pool_has_obj(ObjPool *op, void *obj)
> > +{
> > +    return op && (unsigned long)obj >= (unsigned long)&op->objs[0] &&
> > +           (unsigned long)obj <=
> > +           (unsigned long)&op->objs[(op->cnt - 1) * op->size];
> 
> uintptr_t, not unsigned long.  You are asking for problems on 64-bit
> mingw, where unsigned long is 32 bits but uintptr_t is 64 bits.
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


* Re: [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool
  2014-08-05 12:05     ` Michael S. Tsirkin
@ 2014-08-05 12:21       ` Eric Blake
  2014-08-05 12:51         ` Michael S. Tsirkin
  0 siblings, 1 reply; 81+ messages in thread
From: Eric Blake @ 2014-08-05 12:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On 08/05/2014 06:05 AM, Michael S. Tsirkin wrote:
> On Tue, Aug 05, 2014 at 05:55:49AM -0600, Eric Blake wrote:
>> On 08/04/2014 09:33 PM, Ming Lei wrote:
>>> This patch introduces object allocation pool for speeding up
>>> object allocation in fast path.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>>> ---
>>>  include/qemu/obj_pool.h |   64 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 64 insertions(+)
>>>  create mode 100644 include/qemu/obj_pool.h
>>>

>>> +
>>> +    char *objs;
>>> +} ObjPool;
>>> +
>>> +static inline void obj_pool_init(ObjPool *op, void *objs_buf, void **free_objs,
>>> +                                 unsigned int obj_size, unsigned cnt)
>>> +{
>>> +    int i;
>>> +
>>> +    op->objs = (char *)objs_buf;
>>
>> Why the cast? This is C, not C++.
> 
> It's not needed in C++ either, right?

In C++, going from void* to a typed pointer requires a cast (that's why
in C++ you see casts on malloc results).  In C, void* can implicitly be
converted to any other pointer (modulo const-/volatile-correctness).
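
A two-line illustration:

    void *p = malloc(16);
    char *s = p;    /* implicit conversion: fine in C, error in C++ */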

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH v1 02/17] dataplane: use object pool to speed up allocation for virtio blk request
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 02/17] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
@ 2014-08-05 12:30   ` Eric Blake
  2014-08-06  2:45     ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Eric Blake @ 2014-08-05 12:30 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

On 08/04/2014 09:33 PM, Ming Lei wrote:
> g_slice_new(VirtIOBlockReq), its free pair and access the instance

Took me a while to read this.  Maybe:

Calling g_slice_new(VirtIOBlockReq) and its free pair, and accessing the
instance, are a bit slow...

> is a bit slow since sizeof(VirtIOBlockReq) takes more than 40KB,
> so use object pool to speed up its allocation and release.
> 
> With this patch, ~5%-10% throughput improvement is observed in the VM
> based on server.
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  hw/block/dataplane/virtio-blk.c |   12 ++++++++++++
>  hw/block/virtio-blk.c           |   13 +++++++++++--
>  include/hw/virtio/virtio-blk.h  |    2 ++
>  3 files changed, 25 insertions(+), 2 deletions(-)

> @@ -50,6 +52,10 @@ struct VirtIOBlockDataPlane {
>      Error *blocker;
>      void (*saved_complete_request)(struct VirtIOBlockReq *req,
>                                     unsigned char status);
> +
> +    VirtIOBlockReq  reqs[REQ_POOL_SZ];
> +    void *free_reqs[REQ_POOL_SZ];
> +    ObjPool  req_pool;

Why two instances of double spaces?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH v1 05/17] garbage collector: introduced for support of bypass coroutine
  2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 05/17] garbage collector: introduced for support of " Ming Lei
@ 2014-08-05 12:43   ` Eric Blake
  0 siblings, 0 replies; 81+ messages in thread
From: Eric Blake @ 2014-08-05 12:43 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

On 08/04/2014 09:33 PM, Ming Lei wrote:
> In case of bypass coroutine, some buffers in stack have to convert
> to survive in the whole I/O submit & completion cycle.
> 
> Garbase collector is one of the best data structure for this purpose,

s/Garbase/A garbage/

> as I thought of.
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  include/qemu/gc.h |   56 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
>  create mode 100644 include/qemu/gc.h
> 
> diff --git a/include/qemu/gc.h b/include/qemu/gc.h
> new file mode 100644
> index 0000000..b9a3f6e
> --- /dev/null
> +++ b/include/qemu/gc.h
> @@ -0,0 +1,56 @@
> +#ifndef QEMU_GC_HEADER
> +#define QEMU_GC_HEADER

Missing copyright boilerplate.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool
  2014-08-05 12:21       ` Eric Blake
@ 2014-08-05 12:51         ` Michael S. Tsirkin
  0 siblings, 0 replies; 81+ messages in thread
From: Michael S. Tsirkin @ 2014-08-05 12:51 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Tue, Aug 05, 2014 at 06:21:55AM -0600, Eric Blake wrote:
> On 08/05/2014 06:05 AM, Michael S. Tsirkin wrote:
> > On Tue, Aug 05, 2014 at 05:55:49AM -0600, Eric Blake wrote:
> >> On 08/04/2014 09:33 PM, Ming Lei wrote:
> >>> This patch introduces object allocation pool for speeding up
> >>> object allocation in fast path.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> >>> ---
> >>>  include/qemu/obj_pool.h |   64 +++++++++++++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 64 insertions(+)
> >>>  create mode 100644 include/qemu/obj_pool.h
> >>>
> 
> >>> +
> >>> +    char *objs;
> >>> +} ObjPool;
> >>> +
> >>> +static inline void obj_pool_init(ObjPool *op, void *objs_buf, void **free_objs,
> >>> +                                 unsigned int obj_size, unsigned cnt)
> >>> +{
> >>> +    int i;
> >>> +
> >>> +    op->objs = (char *)objs_buf;
> >>
> >> Why the cast? This is C, not C++.
> > 
> > It's not needed in C++ either, right?
> 
> In C++, going from void* to a typed pointer requires a cast (that's why
> in C++ you see casts on malloc results).

Ah yes, I was confusing this with going from char * to void *.
You are right, thanks for the reminder.

>  In C, void* can implicitly be
> converted to any other pointer (modulo const-/volatile-correctness).

Yes: and const and volatile safety is exactly the reason
one *shouldn't* typically cast to/from void * explicitly.

> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05 10:00   ` Ming Lei
  2014-08-05 11:44     ` Paolo Bonzini
@ 2014-08-05 13:48     ` Stefan Hajnoczi
  2014-08-05 14:47       ` Kevin Wolf
  1 sibling, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2014-08-05 13:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
> On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
> >> Hi,
> >>
> >> These patches bring up below 4 changes:
> >>         - introduce object allocation pool and apply it to
> >>         virtio-blk dataplane for improving its performance
> >>
> >>         - introduce selective coroutine bypass mechanism
> >>         for improving performance of virtio-blk dataplane with
> >>         raw format image
> >
> > Before applying any bypassing patches, I think we should understand in
> > detail where we are losing performance with coroutines enabled.
> 
> From the below profiling data, CPU becomes slow to run instructions
> with coroutine, and CPU dcache miss is increased so it is very
> likely caused by switching stack frequently.
> 
> http://marc.info/?l=qemu-devel&m=140679721126306&w=2
> 
> http://pastebin.com/ae0vnQ6V

I have been wondering how to prove that the root cause is the ucontext
coroutine mechanism (stack switching).  Here is an idea:

Hack your "bypass" code path to run the request inside a coroutine.
That way you can compare "bypass without coroutine" against "bypass with
coroutine".

Right now I think there are doubts because the bypass code path is
indeed a different (and not 100% correct) code path.  So this approach
might prove that the coroutines are adding the overhead and not
something that you bypassed.

Stefan


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05  9:50   ` Ming Lei
  2014-08-05  9:56     ` Kevin Wolf
@ 2014-08-05 13:59     ` Stefan Hajnoczi
  1 sibling, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2014-08-05 13:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Tue, Aug 05, 2014 at 05:50:42PM +0800, Ming Lei wrote:
> On Tue, Aug 5, 2014 at 5:38 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Tue, Aug 05, 2014 at 11:33:01AM +0800, Ming Lei wrote:
> >> These patches bring up below 4 changes:
> >>         - introduce object allocation pool and apply it to
> >>         virtio-blk dataplane for improving its performance
> >>
> >>         - introduce selective coroutine bypass mechanism
> >>         for improving performance of virtio-blk dataplane with
> >>         raw format image
> >>
> >>         - linux-aio changes: fixing for cases of -EAGAIN and partial
> >>         completion, increase max events to 256, and remove one unuseful
> >>         fields in 'struct qemu_laiocb'
> >>
> >>         - support multi virtqueue for virtio-blk
> >
> > Please split up this patch series into separate patch series.
> >
> > These are independent changes and there is no reason to combine them.
> > You're doing yourself a disservice because changes that are ready to be
> > applied are getting held up by those that still need more discussion.
> 
> Without previous optimization patches, the mq conversion can't
> obtain so much improvement, that is why I put them together.

You can post two sets of numbers: "independent results" and "together
with series X, Y, and Z".

> Also mq conversion depends on linux-aio fix too.

No problem, just include a note in the cover letter for the mq series
stating that this series depends on the linux-aio fix.

Maintainers keep an eye out for that and make sure that the dependencies
are merged before applying.

> Also it becomes a difficult to test these patches if they are splitted,
> and describing the dependency is a bit annoying too.

I understand that you need to manage extra branches and sometimes rebase
between them.  But that's life.

The reason people are pushing back is that you are throwing a blob at
the mailing list and expecting reviewers to dissect it.  Reviewers have
to put more effort in than necessary.  As a result they are scrutinizing
your changes and are not comfortable with them in their current form.

If you want to get patches merged smoothly, split them up and justify
each series with a cover letter and performance results (if it's an
optimization).  That way you get reviewers on your side; they understand
and agree with the benefit of the series.  Make them *want* to merge the
patches.


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05 13:48     ` Stefan Hajnoczi
@ 2014-08-05 14:47       ` Kevin Wolf
  2014-08-06  5:33         ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-05 14:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, Ming Lei,
	qemu-devel, Paolo Bonzini

Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
> On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
> > On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > > Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
> > >> Hi,
> > >>
> > >> These patches bring up below 4 changes:
> > >>         - introduce object allocation pool and apply it to
> > >>         virtio-blk dataplane for improving its performance
> > >>
> > >>         - introduce selective coroutine bypass mechanism
> > >>         for improving performance of virtio-blk dataplane with
> > >>         raw format image
> > >
> > > Before applying any bypassing patches, I think we should understand in
> > > detail where we are losing performance with coroutines enabled.
> > 
> > From the below profiling data, CPU becomes slow to run instructions
> > with coroutine, and CPU dcache miss is increased so it is very
> > likely caused by switching stack frequently.
> > 
> > http://marc.info/?l=qemu-devel&m=140679721126306&w=2
> > 
> > http://pastebin.com/ae0vnQ6V
> 
> I have been wondering how to prove that the root cause is the ucontext
> coroutine mechanism (stack switching).  Here is an idea:
> 
> Hack your "bypass" code path to run the request inside a coroutine.
> That way you can compare "bypass without coroutine" against "bypass with
> coroutine".
> 
> Right now I think there are doubts because the bypass code path is
> indeed a different (and not 100% correct) code path.  So this approach
> might prove that the coroutines are adding the overhead and not
> something that you bypassed.

My doubts aren't only that the overhead might not come from the
coroutines, but also whether any coroutine-related overhead is really
unavoidable. If we can optimise coroutines, I'd strongly prefer to do
just that instead of introducing additional code paths.

Another thought I had was this: If the performance difference is indeed
only coroutines, then that is completely inside the block layer and we
don't actually need a VM to test it. We could instead have something
like a simple qemu-img based benchmark and should be observing the same.

I played a bit with the following; I hope it's not too naive. I couldn't
see a difference with your patches, but at least one reason for this is
probably that my laptop SSD isn't fast enough to make the CPU the
bottleneck. I haven't tried a ramdisk yet; that would probably be the
next thing. (I actually wrote the patch up just for some profiling on my
own, not for comparing throughput, but it should be usable for that as
well.)
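
(Invocation is something like "./qemu-img bench -t none -n test.img",
per the DEF() line in the patch.)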

Kevin


diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index d029609..ae64b3d 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -9,6 +9,12 @@ STEXI
 @table @option
 ETEXI
 
+DEF("bench", img_bench,
+    "bench [-q] [-f fmt] [-n] [-t cache] filename")
+STEXI
@item bench [-q] [-f @var{fmt}] [-n] [-t @var{cache}] filename
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt]  [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index d4518e7..92e9529 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2789,6 +2789,132 @@ out:
     return 0;
 }
 
+typedef struct BenchData {
+    BlockDriverState *bs;
+    int bufsize;
+    int nrreq;
+    int n;
+    uint8_t *buf;
+    QEMUIOVector *qiov;
+
+    int in_flight;
+    uint64_t sector;
+} BenchData;
+
+static void bench_cb(void *opaque, int ret)
+{
+    BenchData *b = opaque;
+    BlockDriverAIOCB *acb;
+
+    if (ret < 0) {
+        error_report("Failed request: %s\n", strerror(-ret));
+        exit(EXIT_FAILURE);
+    }
+    if (b->in_flight > 0) {
+        b->n--;
+        b->in_flight--;
+    }
+
+    while (b->n > b->in_flight && b->in_flight < b->nrreq) {
+        acb = bdrv_aio_readv(b->bs, b->sector, b->qiov,
+                             b->bufsize >> BDRV_SECTOR_BITS,
+                             bench_cb, b);
+        if (!acb) {
+            error_report("Failed to issue request");
+            exit(EXIT_FAILURE);
+        }
+        b->in_flight++;
+        b->sector += b->bufsize;
+        b->sector %= b->bs->total_sectors;
+    }
+}
+
+static int img_bench(int argc, char **argv)
+{
+    int c, ret = 0;
+    const char *fmt = NULL, *filename;
+    bool quiet = false;
+    BlockDriverState *bs = NULL;
+    int flags = BDRV_O_FLAGS;
+    int i;
+
+    for (;;) {
+        c = getopt(argc, argv, "hf:nqt:");
+        if (c == -1) {
+            break;
+        }
+
+        switch (c) {
+            case 'h':
+            case '?':
+                help();
+                break;
+            case 'f':
+                fmt = optarg;
+                break;
+            case 'n':
+                flags |= BDRV_O_NATIVE_AIO;
+                break;
+            case 'q':
+                quiet = true;
+                break;
+            case 't':
+                ret = bdrv_parse_cache_flags(optarg, &flags);
+                if (ret < 0) {
+                    error_report("Invalid cache mode");
+                    ret = -1;
+                    goto out;
+                }
+                break;
+        }
+    }
+
+    if (optind != argc - 1) {
+        error_exit("Expecting one image file name");
+    }
+    filename = argv[argc - 1];
+
+    bs = bdrv_new_open("image", filename, fmt, flags, true, quiet);
+    if (!bs) {
+        error_report("Could not open image '%s'", filename);
+        ret = -1;
+        goto out;
+    }
+
+    data = (BenchData) {
+        .bs = bs,
+        .bufsize = 0x1000,
+        .nrreq = 64,
+        .n = 75000,
+    };
+
+    data.buf = qemu_blockalign(bs, data.nrreq * data.bufsize);
+    data.qiov = g_new(QEMUIOVector, data.nrreq);
+    for (i = 0; i < data.nrreq; i++) {
+        qemu_iovec_init(&data.qiov[i], 1);
+        qemu_iovec_add(&data.qiov[i],
+                       data.buf + i * data.bufsize, data.bufsize);
+    }
+
+    bench_cb(&data, 0);
+
+    while (data.n > 0) {
+        main_loop_wait(false);
+    }
+
+out:
+    qemu_vfree(data.buf);
+    if (bs) {
+        bdrv_unref(bs);
+    }
+
+    if (ret) {
+        return 1;
+    }
+    return 0;
+}
+
+
 static const img_cmd_t img_cmds[] = {
 #define DEF(option, callback, arg_string)        \
     { option, callback },

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool
  2014-08-05 11:55   ` Eric Blake
  2014-08-05 12:05     ` Michael S. Tsirkin
@ 2014-08-06  2:35     ` Ming Lei
  1 sibling, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-06  2:35 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini

On Tue, Aug 5, 2014 at 7:55 PM, Eric Blake <eblake@redhat.com> wrote:
> On 08/04/2014 09:33 PM, Ming Lei wrote:
>> This patch introduces object allocation pool for speeding up
>> object allocation in fast path.
>>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>>  include/qemu/obj_pool.h |   64 +++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 64 insertions(+)
>>  create mode 100644 include/qemu/obj_pool.h
>>
>> diff --git a/include/qemu/obj_pool.h b/include/qemu/obj_pool.h
>> new file mode 100644
>> index 0000000..94b5f49
>> --- /dev/null
>> +++ b/include/qemu/obj_pool.h
>> @@ -0,0 +1,64 @@
>> +#ifndef QEMU_OBJ_POOL_HEAD
>> +#define QEMU_OBJ_POOL_HEAD
>
> Missing copyright boilerplate.  According to LICENSE, that makes this
> file GPLv2+, but I'd much rather you make it explicit.
>
>> +
>> +typedef struct {
>> +    unsigned int size;
>> +    unsigned int cnt;
>
> size_t feels better for sizes.  int may be okay in this case, but
> definitely consider if size_t is appropriate.

Sounds good.

>
>> +
>> +    void **free_obj;
>> +    int free_idx;
>> +
>> +    char *objs;
>> +} ObjPool;
>> +
>> +static inline void obj_pool_init(ObjPool *op, void *objs_buf, void **free_objs,
>> +                                 unsigned int obj_size, unsigned cnt)
>> +{
>> +    int i;
>> +
>> +    op->objs = (char *)objs_buf;
>
> Why the cast? This is C, not C++.

Right, the cast isn't needed.

>
>> +    op->free_obj = free_objs;
>> +    op->size = obj_size;
>> +    op->cnt = cnt;
>> +
>> +    for (i = 0; i < op->cnt; i++) {
>> +        op->free_obj[i] = (void *)&op->objs[i * op->size];
>
> Again, why the cast?

Right too.

>
>
>> +static inline bool obj_pool_has_obj(ObjPool *op, void *obj)
>> +{
>> +    return op && (unsigned long)obj >= (unsigned long)&op->objs[0] &&
>> +           (unsigned long)obj <=
>> +           (unsigned long)&op->objs[(op->cnt - 1) * op->size];
>
> uintptr_t, not unsigned long.  You are asking for problems on 64-bit
> mingw, where unsigned long is 32 bits but uintptr_t is 64 bits.

Good point, it's the first time I have run into the mingw long quirk.
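
For illustration, the check rewritten along those lines might look like
this (just a sketch of Eric's suggestion, not the final patch; uintptr_t
comes from <stdint.h>):

    static inline bool obj_pool_has_obj(ObjPool *op, void *obj)
    {
        uintptr_t p = (uintptr_t)obj;

        /* uintptr_t is guaranteed to hold a pointer value, unlike
         * unsigned long on 64-bit mingw */
        return op && p >= (uintptr_t)&op->objs[0] &&
               p <= (uintptr_t)&op->objs[(op->cnt - 1) * op->size];
    }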


Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 02/17] dataplane: use object pool to speed up allocation for virtio blk request
  2014-08-05 12:30   ` Eric Blake
@ 2014-08-06  2:45     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-06  2:45 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini

On Tue, Aug 5, 2014 at 8:30 PM, Eric Blake <eblake@redhat.com> wrote:
> On 08/04/2014 09:33 PM, Ming Lei wrote:
>> g_slice_new(VirtIOBlockReq), its free pair and access the instance
>
> Took me a while to read this.  Maybe:
>
> Calling g_slice_new(VirtIOBlockReq) and its free pair, and accessing the
> instance, are a bit slow...

One point is that VirtIOBlockReq is very big, so using libc's allocator
is slow since a lock has to be held for thread safety, and brk() may be
involved in big allocations too.

Another point is that the object pool can easily keep these frequently
accessed objects in RAM, and decrease page faults when accessing
these buffers.

>
>> is a bit slow since sizeof(VirtIOBlockReq) takes more than 40KB,
>> so use object pool to speed up its allocation and release.
>>
>> With this patch, ~5%-10% throughput improvement is observed in the VM
>> based on server.
>>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>>  hw/block/dataplane/virtio-blk.c |   12 ++++++++++++
>>  hw/block/virtio-blk.c           |   13 +++++++++++--
>>  include/hw/virtio/virtio-blk.h  |    2 ++
>>  3 files changed, 25 insertions(+), 2 deletions(-)
>
>> @@ -50,6 +52,10 @@ struct VirtIOBlockDataPlane {
>>      Error *blocker;
>>      void (*saved_complete_request)(struct VirtIOBlockReq *req,
>>                                     unsigned char status);
>> +
>> +    VirtIOBlockReq  reqs[REQ_POOL_SZ];
>> +    void *free_reqs[REQ_POOL_SZ];
>> +    ObjPool  req_pool;
>
> Why two instances of double spaces?

reqs is the real storage for the objects, and free_reqs is used to
implement allocation and release of the objects.
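
To make the relationship concrete, here is a minimal sketch of how the
two arrays cooperate (obj_pool_init() is in patch 01; the alloc/free
pair below is illustrative, since those functions are not quoted in
this thread):

    /* free_obj[0 .. free_idx-1] hold pointers into reqs[] that are
     * currently available: allocation pops, release pushes. */
    static inline void *obj_pool_alloc(ObjPool *op)
    {
        if (!op || op->free_idx == 0) {
            return NULL;    /* pool exhausted, caller falls back */
        }
        return op->free_obj[--op->free_idx];
    }

    static inline void obj_pool_free(ObjPool *op, void *obj)
    {
        if (!op || !obj_pool_has_obj(op, obj)) {
            return;         /* object did not come from this pool */
        }
        op->free_obj[op->free_idx++] = obj;
    }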

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-05 14:47       ` Kevin Wolf
@ 2014-08-06  5:33         ` Ming Lei
  2014-08-06  7:45           ` Paolo Bonzini
                             ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-06  5:33 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3877 bytes --]

Hi Kevin,

On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
>> On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
>> > On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > > On 05.08.2014 at 05:33, Ming Lei wrote:
>> > >> Hi,
>> > >>
>> > >> These patches bring up below 4 changes:
>> > >>         - introduce object allocation pool and apply it to
>> > >>         virtio-blk dataplane for improving its performance
>> > >>
>> > >>         - introduce selective coroutine bypass mechanism
>> > >>         for improving performance of virtio-blk dataplane with
>> > >>         raw format image
>> > >
>> > > Before applying any bypassing patches, I think we should understand in
>> > > detail where we are losing performance with coroutines enabled.
>> >
>> > From the below profiling data, CPU becomes slow to run instructions
>> > with coroutine, and CPU dcache miss is increased so it is very
>> > likely caused by switching stack frequently.
>> >
>> > http://marc.info/?l=qemu-devel&m=140679721126306&w=2
>> >
>> > http://pastebin.com/ae0vnQ6V
>>
>> I have been wondering how to prove that the root cause is the ucontext
>> coroutine mechanism (stack switching).  Here is an idea:
>>
>> Hack your "bypass" code path to run the request inside a coroutine.
>> That way you can compare "bypass without coroutine" against "bypass with
>> coroutine".
>>
>> Right now I think there are doubts because the bypass code path is
>> indeed a different (and not 100% correct) code path.  So this approach
>> might prove that the coroutines are adding the overhead and not
>> something that you bypassed.
>
> My doubts aren't only that the overhead might not come from the
> coroutines, but also whether any coroutine-related overhead is really
> unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> just that instead of introducing additional code paths.

OK, thank you for taking a look at the problem; I hope we can
figure out the root cause, :-)

>
> Another thought I had was this: If the performance difference is indeed
> only coroutines, then that is completely inside the block layer and we
> don't actually need a VM to test it. We could instead have something
> like a simple qemu-img based benchmark and should be observing the same.

It is even simpler to run a coroutine-only benchmark, so I just wrote a
rough one, and it looks like coroutines do decrease performance a lot;
please see the attached patch. Thanks for your template, which helped me
add the 'co_bench' command to qemu-img.

From the profiling data in the link below:

    http://pastebin.com/YwH2uwbq

With coroutines, the running time for the same load is increased by
~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
(693M vs. 512M), and insns per cycle are decreased by ~17%
(1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).

The bypass code in the benchmark is very similar to the approach
used in the bypass patch, since linux-aio with O_DIRECT seldom
blocks in the kernel I/O path.

Maybe the benchmark is a bit extreme, but given that modern storage
devices may reach millions of IOPS, it is very easy for coroutines
to slow down the I/O.

> I played a bit with the following, I hope it's not too naive. I couldn't
> see a difference with your patches, but at least one reason for this is
> probably that my laptop SSD isn't fast enough to make the CPU the
> bottleneck. Haven't tried ramdisk yet, that would probably be the next
> thing. (I actually wrote the patch up just for some profiling on my own,
> not for comparing throughput, but it should be usable for that as well.)

This might not be good for the test since it is basically a sequential
read test, which can be optimized a lot by the kernel. And I always use
a randread benchmark.


Thanks,

[-- Attachment #2: co_bench.patch --]
[-- Type: text/x-patch, Size: 3353 bytes --]

diff --git a/Makefile b/Makefile
index d6b9dc1..a59523c 100644
--- a/Makefile
+++ b/Makefile
@@ -211,7 +211,7 @@ util/module.o-cflags = -D'CONFIG_BLOCK_MODULES=$(block-modules)'
 
 qemu-img.o: qemu-img-cmds.h
 
-qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a
+qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a -lcrypt
 qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a
 
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ae64b3d..7601b9a 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -15,6 +15,12 @@ STEXI
 @item bench [-q] [-f @var{fmt}] [-n] [-t @var{cache}] filename
 ETEXI
 
+DEF("co_bench", co_bench,
+    "co_bench -c count -q -b")
+STEXI
+@item co_bench [-c] @var{count} [-b] [-q]
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt]  [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 92e9529..d73b171 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -366,6 +366,106 @@ static int add_old_style_options(const char *fmt, QemuOpts *opts,
     return 0;
 }
 
+struct crypt_data {
+    unsigned long sum;
+    bool bypass;
+};
+
+static unsigned long crypt_and_sum(const char *key, const char *salt)
+{
+    char *enc = crypt(key, salt);
+    int len = strlen(enc);
+    int i;
+    unsigned long sum = 0;
+
+    for (i = 0; i < len; i++)
+        sum += enc[i];
+
+    return sum;
+}
+
+static void gen_key(char *key, int len)
+{
+    char set[] = {
+        '0', '1', '2', '3', '4', '5', '6', '7', '8',
+        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
+        'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
+        's', 't', 'w', 'x', 'y', 'z', '_', '-', ' ',
+        '*', '$', '#', '%',
+    };
+    int cnt = sizeof(set) / sizeof(set[0]);
+    int i;
+
+    for (i = 0; i < len; i++) {
+        key[i] = set[rand() % cnt];
+    }
+}
+
+static void crypt_bench(void *opaque)
+{
+    struct crypt_data *data = opaque;
+    const int len = 8;
+    char key1[9] = "";   /* extra byte so crypt() gets NUL-terminated keys */
+    char key2[9] = "";
+    char salt[4] = "";
+
+    gen_key(key1, len);
+    gen_key(key2, len);
+    salt[0] = key1[0];
+    salt[1] = key2[7];
+    salt[2] = '0';
+
+    data->sum += crypt_and_sum(key1, salt);
+    data->sum += crypt_and_sum(key2, salt);
+
+    if (!data->bypass) {
+        qemu_coroutine_yield();
+    }
+}
+
+static int co_bench(int argc, char **argv)
+{
+    int c;
+    bool bypass = false;
+    unsigned long cnt = 1;
+    int num = 1;
+    unsigned long i;
+    struct crypt_data data = {
+        .sum = 0,
+        .bypass = bypass,
+    };
+
+    for (;;) {
+        c = getopt(argc, argv, "bc:q");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case 'b':
+            bypass = true;
+            break;
+        case 'c':
+            num = atoi(optarg);
+            break;
+        }
+    }
+
+    data.bypass = bypass;
+
+    srand((unsigned int)(uintptr_t)&i);
+    srand(rand());
+    for (i = 0; i < num * cnt; i++) {
+        Coroutine *co;
+        if (!data.bypass) {
+            co = qemu_coroutine_create(crypt_bench);
+            qemu_coroutine_enter(co, &data);
+        } else {
+            crypt_bench(&data);
+        }
+    }
+    return (int)data.sum;
+}
+
 static int img_create(int argc, char **argv)
 {
     int c;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  5:33         ` Ming Lei
@ 2014-08-06  7:45           ` Paolo Bonzini
  2014-08-06  8:38             ` Ming Lei
  2014-08-06  8:48           ` Kevin Wolf
  2014-08-06  9:37           ` Stefan Hajnoczi
  2 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-06  7:45 UTC (permalink / raw)
  To: Ming Lei, Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, qemu-devel, Stefan Hajnoczi,
	Michael S. Tsirkin

On 06/08/2014 at 07:33, Ming Lei wrote:
>> > I played a bit with the following, I hope it's not too naive. I couldn't
>> > see a difference with your patches, but at least one reason for this is
>> > probably that my laptop SSD isn't fast enough to make the CPU the
>> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> > thing. (I actually wrote the patch up just for some profiling on my own,
>> > not for comparing throughput, but it should be usable for that as well.)
> This might not be good for the test since it is basically a sequential
> read test, which can be optimized a lot by the kernel. And I always use
> a randread benchmark.

A microbenchmark already exists in tests/test-coroutine.c, and doesn't
really tell us much; it's obvious that coroutines execute more code, the
question is why it affects the iops performance.

The sequential read should be the right workload.  For fio, you want to
get as many iops as possible to QEMU and so you need randread.  But
qemu-img is not run in a guest and if the kernel optimizes sequential
reads then the bypass should have even more benefits because it makes
userspace proportionally more expensive.

In any case, the patches as written have no hope of being accepted.  If
you "invert" the logic from aio->co to co->aio, that would be much
better even if it's tedious.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  7:45           ` Paolo Bonzini
@ 2014-08-06  8:38             ` Ming Lei
  2014-08-06  8:50               ` Paolo Bonzini
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-06  8:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Aug 6, 2014 at 3:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 06/08/2014 at 07:33, Ming Lei wrote:
>>> > I played a bit with the following, I hope it's not too naive. I couldn't
>>> > see a difference with your patches, but at least one reason for this is
>>> > probably that my laptop SSD isn't fast enough to make the CPU the
>>> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>>> > thing. (I actually wrote the patch up just for some profiling on my own,
>>> > not for comparing throughput, but it should be usable for that as well.)
>> This might not be good for the test since it is basically a sequential
>> read test, which can be optimized a lot by the kernel. And I always use
>> a randread benchmark.
>
> A microbenchmark already exists in tests/test-coroutine.c, and doesn't
> really tell us much; it's obvious that coroutines execute more code, the
> question is why it affects the iops performance.

Could you take a look at the coroutine benchmark I wrote? The running
result shows that coroutines do decrease performance a lot compared
with bypassing them, as the patchset does.

>
> The sequential read should be the right workload.  For fio, you want to
> get as many iops as possible to QEMU and so you need randread.  But
> qemu-img is not run in a guest and if the kernel optimizes sequential
> reads then the bypass should have even more benefits because it makes
> userspace proportionally more expensive.
>
> In any case, the patches as written have no hope of being accepted.  If
> you "invert" the logic from aio->co to co->aio, that would be much
> better even if it's tedious.

Let's not talk about the bypass patch yet; let's first see whether
coroutines are really the cause of the performance drop.

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  5:33         ` Ming Lei
  2014-08-06  7:45           ` Paolo Bonzini
@ 2014-08-06  8:48           ` Kevin Wolf
  2014-08-06  9:37             ` Ming Lei
  2014-08-10  3:46             ` Ming Lei
  2014-08-06  9:37           ` Stefan Hajnoczi
  2 siblings, 2 replies; 81+ messages in thread
From: Kevin Wolf @ 2014-08-06  8:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On 06.08.2014 at 07:33, Ming Lei wrote:
> Hi Kevin,
> 
> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
> >> I have been wondering how to prove that the root cause is the ucontext
> >> coroutine mechanism (stack switching).  Here is an idea:
> >>
> >> Hack your "bypass" code path to run the request inside a coroutine.
> >> That way you can compare "bypass without coroutine" against "bypass with
> >> coroutine".
> >>
> >> Right now I think there are doubts because the bypass code path is
> >> indeed a different (and not 100% correct) code path.  So this approach
> >> might prove that the coroutines are adding the overhead and not
> >> something that you bypassed.
> >
> > My doubts aren't only that the overhead might not come from the
> > coroutines, but also whether any coroutine-related overhead is really
> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> > just that instead of introducing additional code paths.
> 
> OK, thank you for taking a look at the problem; I hope we can
> figure out the root cause, :-)
> 
> >
> > Another thought I had was this: If the performance difference is indeed
> > only coroutines, then that is completely inside the block layer and we
> > don't actually need a VM to test it. We could instead have something
> > like a simple qemu-img based benchmark and should be observing the same.
> 
> It is even simpler to run a coroutine-only benchmark, so I just wrote a
> rough one, and it looks like coroutines do decrease performance a lot;
> please see the attached patch. Thanks for your template, which helped me
> add the 'co_bench' command to qemu-img.

Yes, we can look at coroutine microbenchmarks in isolation. I actually
did do that yesterday with the yield test from tests/test-coroutine.c.
And in fact profiling immediately showed something to optimise:
pthread_getspecific() was quite high, replacing it by __thread on
systems where it works is more efficient and helped the numbers a bit.
Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
in qemu-img bench), maybe there's even something that can be done here.

However, I just wasn't sure whether a change on this level would be
relevant in a realistic environment. This is the reason why I wanted to
get a benchmark involving the block layer and some I/O.
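
To illustrate the __thread idea (a sketch of the direction, not the
actual patch; the names are simplified):

    /* One TLS load instead of a pthread_getspecific() call in the
     * hot path that looks up the currently running coroutine. */
    static __thread CoroutineUContext leader;
    static __thread Coroutine *current;

    static Coroutine *qemu_coroutine_self_tls(void)
    {
        if (!current) {
            current = &leader.base;
        }
        return current;
    }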

> From the profiling data in the link below:
>
>     http://pastebin.com/YwH2uwbq
>
> With coroutines, the running time for the same load is increased by
> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
> (693M vs. 512M), and insns per cycle are decreased by ~17%
> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
>
> The bypass code in the benchmark is very similar to the approach
> used in the bypass patch, since linux-aio with O_DIRECT seldom
> blocks in the kernel I/O path.
>
> Maybe the benchmark is a bit extreme, but given that modern storage
> devices may reach millions of IOPS, it is very easy for coroutines
> to slow down the I/O.

I think in order to optimise coroutines, such benchmarks are fair game.
It's just not guaranteed that the effects are exactly the same on real
workloads, so we should take the results with a grain of salt.

Anyhow, the coroutine version of your benchmark is buggy, it leaks all
coroutines instead of exiting them, so it can't make any use of the
coroutine pool. On my laptop, I get this (where fixed coroutine is a
version that simply removes the yield at the end):

                | bypass        | fixed coro    | buggy coro
----------------+---------------+---------------+--------------
time            | 1.09s         | 1.10s         | 1.62s
L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
insns per cycle | 2.39          | 2.39          | 1.90

Begs the question whether you see a similar effect on a real qemu and
the coroutine pool is still not big enough? With correct use of
coroutines, the difference seems to be barely measurable even without
any I/O involved.

> > I played a bit with the following, I hope it's not too naive. I couldn't
> > see a difference with your patches, but at least one reason for this is
> > probably that my laptop SSD isn't fast enough to make the CPU the
> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
> > thing. (I actually wrote the patch up just for some profiling on my own,
> > not for comparing throughput, but it should be usable for that as well.)
> 
> This might not be good for the test since it is basically a sequential
> read test, which can be optimized a lot by kernel. And I always use
> randread benchmark.

Yes, I shortly pondered whether I should implement random offsets
instead. But then I realised that a quicker kernel operation would only
help the benchmark because we want it to test the CPU consumption in
userspace. So the faster the kernel gets, the better for us, because it
should make the impact of coroutines bigger.

Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  8:38             ` Ming Lei
@ 2014-08-06  8:50               ` Paolo Bonzini
  2014-08-06 13:53                 ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-06  8:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On 06/08/2014 at 10:38, Ming Lei wrote:
> On Wed, Aug 6, 2014 at 3:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 06/08/2014 at 07:33, Ming Lei wrote:
>>>>> I played a bit with the following, I hope it's not too naive. I couldn't
>>>>> see a difference with your patches, but at least one reason for this is
>>>>> probably that my laptop SSD isn't fast enough to make the CPU the
>>>>> bottleneck. Haven't tried ramdisk yet, that would probably be the next
>>>>> thing. (I actually wrote the patch up just for some profiling on my own,
>>>>> not for comparing throughput, but it should be usable for that as well.)
>>> This might not be good for the test since it is basically a sequential
>>> read test, which can be optimized a lot by the kernel. And I always use
>>> a randread benchmark.
>>
>> A microbenchmark already exists in tests/test-coroutine.c, and doesn't
>> really tell us much; it's obvious that coroutines execute more code, the
>> question is why it affects the iops performance.
> 
> Could you take a look at the coroutine benchmark I wrote? The running
> result shows that coroutines do decrease performance a lot compared
> with bypassing them, as the patchset does.

Your benchmark is synchronous, while disk I/O is asynchronous.

Your benchmark doesn't add much compared to "time tests/test-coroutine
-m perf  -p /perf/yield".  It takes 8 seconds on my machine, and 10^8
function calls obviously take less than 8 seconds.  I've sent a patch to
add a "baseline" function call benchmark to test-coroutine.
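
For reference, such a baseline could be as simple as the following
sketch next to /perf/yield in tests/test-coroutine.c (illustrative,
not necessarily the actual patch):

    static __attribute__((noinline)) void empty_fn(void *opaque)
    {
    }

    static void perf_baseline(void)
    {
        unsigned int i, maxcycles = 100000000;
        double duration;

        g_test_timer_start();
        for (i = 0; i < maxcycles; i++) {
            empty_fn(NULL);     /* plain call, no coroutine involved */
        }
        duration = g_test_timer_elapsed();

        g_test_message("Function call %u iterations: %f s",
                       maxcycles, duration);
    }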

>> The sequential read should be the right workload.  For fio, you want to
>> get as many iops as possible to QEMU and so you need randread.  But
>> qemu-img is not run in a guest and if the kernel optimizes sequential
>> reads then the bypass should have even more benefits because it makes
>> userspace proportionally more expensive.

Do you agree with this?

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  8:48           ` Kevin Wolf
@ 2014-08-06  9:37             ` Ming Lei
  2014-08-06 10:09               ` Kevin Wolf
  2014-08-10  3:46             ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-06  9:37 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 06.08.2014 at 07:33, Ming Lei wrote:
>> Hi Kevin,
>>
>> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
>> >> I have been wondering how to prove that the root cause is the ucontext
>> >> coroutine mechanism (stack switching).  Here is an idea:
>> >>
>> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> That way you can compare "bypass without coroutine" against "bypass with
>> >> coroutine".
>> >>
>> >> Right now I think there are doubts because the bypass code path is
>> >> indeed a different (and not 100% correct) code path.  So this approach
>> >> might prove that the coroutines are adding the overhead and not
>> >> something that you bypassed.
>> >
>> > My doubts aren't only that the overhead might not come from the
>> > coroutines, but also whether any coroutine-related overhead is really
>> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
>> > just that instead of introducing additional code paths.
>>
>> OK, thank you for taking a look at the problem; I hope we can
>> figure out the root cause, :-)
>>
>> >
>> > Another thought I had was this: If the performance difference is indeed
>> > only coroutines, then that is completely inside the block layer and we
>> > don't actually need a VM to test it. We could instead have something
>> > like a simple qemu-img based benchmark and should be observing the same.
>>
>> It is even simpler to run a coroutine-only benchmark, so I just wrote a
>> rough one, and it looks like coroutines do decrease performance a lot;
>> please see the attached patch. Thanks for your template, which helped me
>> add the 'co_bench' command to qemu-img.
>
> Yes, we can look at coroutine microbenchmarks in isolation. I actually
> did do that yesterday with the yield test from tests/test-coroutine.c.
> And in fact profiling immediately showed something to optimise:
> pthread_getspecific() was quite high, replacing it by __thread on
> systems where it works is more efficient and helped the numbers a bit.
> Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> in qemu-img bench), maybe there's even something that can be done here.

The lock/unlock in dataplane is often from memory_region_find(), and Paolo
has already done lots of work on that.

>
> However, I just wasn't sure whether a change on this level would be
> relevant in a realistic environment. This is the reason why I wanted to
> get a benchmark involving the block layer and some I/O.
>
>> From the profiling data in the link below:
>>
>>     http://pastebin.com/YwH2uwbq
>>
>> With coroutines, the running time for the same load is increased by
>> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
>> (693M vs. 512M), and insns per cycle are decreased by ~17%
>> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
>>
>> The bypass code in the benchmark is very similar to the approach
>> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> blocks in the kernel I/O path.
>>
>> Maybe the benchmark is a bit extreme, but given that modern storage
>> devices may reach millions of IOPS, it is very easy for coroutines
>> to slow down the I/O.
>
> I think in order to optimise coroutines, such benchmarks are fair game.
> It's just not guaranteed that the effects are exactly the same on real
> workloads, so we should take the results with a grain of salt.
>
> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> coroutines instead of exiting them, so it can't make any use of the
> coroutine pool. On my laptop, I get this (where fixed coroutine is a
> version that simply removes the yield at the end):
>
>                 | bypass        | fixed coro    | buggy coro
> ----------------+---------------+---------------+--------------
> time            | 1.09s         | 1.10s         | 1.62s
> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> insns per cycle | 2.39          | 2.39          | 1.90
>
> Begs the question whether you see a similar effect on a real qemu and
> the coroutine pool is still not big enough? With correct use of
> coroutines, the difference seems to be barely measurable even without
> any I/O involved.

When I comment out qemu_coroutine_yield(), the results of bypass and
fixed coro look very similar, as in your test, and I am just wondering
whether the stack is always switched in qemu_coroutine_enter(), even
without calling qemu_coroutine_yield().

Without the yield, the benchmark can't emulate the coroutine usage in
the bdrv_aio_readv/writev() path any more, and the bypass in the
patchset skips two qemu_coroutine_enter() calls and one
qemu_coroutine_yield() for each bdrv_aio_readv/writev().

>
>> > I played a bit with the following, I hope it's not too naive. I couldn't
>> > see a difference with your patches, but at least one reason for this is
>> > probably that my laptop SSD isn't fast enough to make the CPU the
>> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> > thing. (I actually wrote the patch up just for some profiling on my own,
>> > not for comparing throughput, but it should be usable for that as well.)
>>
>> This might not be good for the test since it is basically a sequential
>> read test, which can be optimized a lot by the kernel. And I always use
>> a randread benchmark.
>
> Yes, I shortly pondered whether I should implement random offsets
> instead. But then I realised that a quicker kernel operation would only
> help the benchmark because we want it to test the CPU consumption in
> userspace. So the faster the kernel gets, the better for us, because it
> should make the impact of coroutines bigger.

OK, I will compare coroutine vs. bypass-co with the benchmark.


Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  5:33         ` Ming Lei
  2014-08-06  7:45           ` Paolo Bonzini
  2014-08-06  8:48           ` Kevin Wolf
@ 2014-08-06  9:37           ` Stefan Hajnoczi
  2 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2014-08-06  9:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

On Wed, Aug 06, 2014 at 01:33:36PM +0800, Ming Lei wrote:
> With coroutine, the running time for same loading is increased
> ~50%(1.325s vs. 0.903s), and dcache load events is increased

I agree with Paolo about microbenchmarks.  We need to do I/O to get a
realistic picture of performance, since there is little point in
optimizing something that is not a significant factor in overall
performance.

But I also wanted to say that these benchmark durations are so short
that they can be greatly affected by outliers (e.g. scheduler behavior,
system background activity, etc).  Run benchmarks for 2 minutes to
reduce variance and give the system time to "warm up".

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  9:37             ` Ming Lei
@ 2014-08-06 10:09               ` Kevin Wolf
  2014-08-06 11:28                 ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-06 10:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On 06.08.2014 at 11:37, Ming Lei wrote:
> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > On 06.08.2014 at 07:33, Ming Lei wrote:
> >> Hi Kevin,
> >>
> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
> >> >> I have been wondering how to prove that the root cause is the ucontext
> >> >> coroutine mechanism (stack switching).  Here is an idea:
> >> >>
> >> >> Hack your "bypass" code path to run the request inside a coroutine.
> >> >> That way you can compare "bypass without coroutine" against "bypass with
> >> >> coroutine".
> >> >>
> >> >> Right now I think there are doubts because the bypass code path is
> >> >> indeed a different (and not 100% correct) code path.  So this approach
> >> >> might prove that the coroutines are adding the overhead and not
> >> >> something that you bypassed.
> >> >
> >> > My doubts aren't only that the overhead might not come from the
> >> > coroutines, but also whether any coroutine-related overhead is really
> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> >> > just that instead of introducing additional code paths.
> >>
> >> OK, thank you for taking a look at the problem; I hope we can
> >> figure out the root cause, :-)
> >>
> >> >
> >> > Another thought I had was this: If the performance difference is indeed
> >> > only coroutines, then that is completely inside the block layer and we
> >> > don't actually need a VM to test it. We could instead have something
> >> > like a simple qemu-img based benchmark and should be observing the same.
> >>
> >> It is even simpler to run a coroutine-only benchmark, so I just wrote a
> >> rough one, and it looks like coroutines do decrease performance a lot;
> >> please see the attached patch. Thanks for your template, which helped me
> >> add the 'co_bench' command to qemu-img.
> >
> > Yes, we can look at coroutine microbenchmarks in isolation. I actually
> > did do that yesterday with the yield test from tests/test-coroutine.c.
> > And in fact profiling immediately showed something to optimise:
> > pthread_getspecific() was quite high, replacing it by __thread on
> > systems where it works is more efficient and helped the numbers a bit.
> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> > in qemu-img bench), maybe there's even something that can be done here.
> 
> The lock/unlock in dataplane is often from memory_region_find(), and Paolo
> has already done lots of work on that.
> 
> >
> > However, I just wasn't sure whether a change on this level would be
> > relevant in a realistic environment. This is the reason why I wanted to
> > get a benchmark involving the block layer and some I/O.
> >
> >> From the profiling data in the link below:
> >>
> >>     http://pastebin.com/YwH2uwbq
> >>
> >> With coroutines, the running time for the same load is increased by
> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
> >> (693M vs. 512M), and insns per cycle are decreased by ~17%
> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
> >>
> >> The bypass code in the benchmark is very similar to the approach
> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> blocks in the kernel I/O path.
> >>
> >> Maybe the benchmark is a bit extreme, but given that modern storage
> >> devices may reach millions of IOPS, it is very easy for coroutines
> >> to slow down the I/O.
> >
> > I think in order to optimise coroutines, such benchmarks are fair game.
> > It's just not guaranteed that the effects are exactly the same on real
> > workloads, so we should take the results with a grain of salt.
> >
> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> > coroutines instead of exiting them, so it can't make any use of the
> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> > version that simply removes the yield at the end):
> >
> >                 | bypass        | fixed coro    | buggy coro
> > ----------------+---------------+---------------+--------------
> > time            | 1.09s         | 1.10s         | 1.62s
> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> > insns per cycle | 2.39          | 2.39          | 1.90
> >
> > Begs the question whether you see a similar effect on a real qemu and
> > the coroutine pool is still not big enough? With correct use of
> > coroutines, the difference seems to be barely measurable even without
> > any I/O involved.
> 
> When I comment out qemu_coroutine_yield(), the results of bypass and
> fixed coro look very similar, as in your test, and I am just wondering
> whether the stack is always switched in qemu_coroutine_enter(), even
> without calling qemu_coroutine_yield().

Yes, definitely. qemu_coroutine_enter() always involves calling
qemu_coroutine_switch(), which is the stack switch.
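
Roughly (simplified from the 2014 ucontext backend, for illustration
only):

    void qemu_coroutine_enter(Coroutine *co, void *opaque)
    {
        Coroutine *self = qemu_coroutine_self();

        co->caller = self;
        co->entry_arg = opaque;
        /* unconditional stack switch, whether or not the coroutine
         * will ever yield */
        qemu_coroutine_switch(self, co, COROUTINE_ENTER);
    }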

> Without the yield, the benchmark can't emulate the coroutine usage in
> the bdrv_aio_readv/writev() path any more, and the bypass in the
> patchset skips two qemu_coroutine_enter() calls and one
> qemu_coroutine_yield() for each bdrv_aio_readv/writev().

It's not completely comparable anyway because you're not going through a
main loop and callbacks from there for your benchmark.

But fair enough: Keep the yield, but enter the coroutine twice then. You
get slightly worse results then, but that's more like doubling the very
small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
/ 2.37), not like the horrible performance of the buggy version.
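
Concretely, the "enter twice" variant of the benchmark loop would be
something like this (a sketch against the coroutine API of that time,
where qemu_coroutine_enter() still took an opaque argument):

    for (i = 0; i < num * cnt; i++) {
        Coroutine *co = qemu_coroutine_create(crypt_bench);

        qemu_coroutine_enter(co, &data);  /* runs up to the yield */
        qemu_coroutine_enter(co, NULL);   /* resumes it so it terminates
                                           * and returns to the pool */
    }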

Actually, that's within the error of measurement for time and
insns/cycle, so running it for a bit longer:

                | bypass    | coro      | + yield   | buggy coro
----------------+-----------+-----------+-----------+--------------
time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
insns per cycle | 2.42      | 2.40      | 2.41      | 1.75

> >> > I played a bit with the following, I hope it's not too naive. I couldn't
> >> > see a difference with your patches, but at least one reason for this is
> >> > probably that my laptop SSD isn't fast enough to make the CPU the
> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
> >> > thing. (I actually wrote the patch up just for some profiling on my own,
> >> > not for comparing throughput, but it should be usable for that as well.)
> >>
> >> This might not be good for the test since it is basically a sequential
> >> read test, which can be optimized a lot by the kernel. And I always use
> >> a randread benchmark.
> >
> > Yes, I shortly pondered whether I should implement random offsets
> > instead. But then I realised that a quicker kernel operation would only
> > help the benchmark because we want it to test the CPU consumption in
> > userspace. So the faster the kernel gets, the better for us, because it
> > should make the impact of coroutines bigger.
> 
> OK, I will compare coroutine vs. bypass-co with the benchmark.

Ok, thanks.

Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06 10:09               ` Kevin Wolf
@ 2014-08-06 11:28                 ` Ming Lei
  2014-08-06 11:44                   ` Ming Lei
  2014-08-06 15:40                   ` Kevin Wolf
  0 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-06 11:28 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 06.08.2014 at 11:37, Ming Lei wrote:
>> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > On 06.08.2014 at 07:33, Ming Lei wrote:
>> >> Hi Kevin,
>> >>
>> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
>> >> >> I have been wondering how to prove that the root cause is the ucontext
>> >> >> coroutine mechanism (stack switching).  Here is an idea:
>> >> >>
>> >> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> >> That way you can compare "bypass without coroutine" against "bypass with
>> >> >> coroutine".
>> >> >>
>> >> >> Right now I think there are doubts because the bypass code path is
>> >> >> indeed a different (and not 100% correct) code path.  So this approach
>> >> >> might prove that the coroutines are adding the overhead and not
>> >> >> something that you bypassed.
>> >> >
>> >> > My doubts aren't only that the overhead might not come from the
>> >> > coroutines, but also whether any coroutine-related overhead is really
>> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
>> >> > just that instead of introducing additional code paths.
>> >>
>> >> OK, thank you for taking a look at the problem; I hope we can
>> >> figure out the root cause, :-)
>> >>
>> >> >
>> >> > Another thought I had was this: If the performance difference is indeed
>> >> > only coroutines, then that is completely inside the block layer and we
>> >> > don't actually need a VM to test it. We could instead have something
>> >> > like a simple qemu-img based benchmark and should be observing the same.
>> >>
>> >> It is even simpler to run a coroutine-only benchmark, so I just wrote a
>> >> rough one, and it looks like coroutines do decrease performance a lot;
>> >> please see the attached patch. Thanks for your template, which helped me
>> >> add the 'co_bench' command to qemu-img.
>> >
>> > Yes, we can look at coroutine microbenchmarks in isolation. I actually
>> > did do that yesterday with the yield test from tests/test-coroutine.c.
>> > And in fact profiling immediately showed something to optimise:
>> > pthread_getspecific() was quite high, replacing it by __thread on
>> > systems where it works is more efficient and helped the numbers a bit.
>> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
>> > in qemu-img bench), maybe there's even something that can be done here.
>>
>> The lock/unlock in dataplane is often from memory_region_find(), and Paolo
>> has already done lots of work on that.
>>
>> >
>> > However, I just wasn't sure whether a change on this level would be
>> > relevant in a realistic environment. This is the reason why I wanted to
>> > get a benchmark involving the block layer and some I/O.
>> >
>> >> From the profiling data in the link below:
>> >>
>> >>     http://pastebin.com/YwH2uwbq
>> >>
>> >> With coroutines, the running time for the same load is increased by
>> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
>> >> (693M vs. 512M), and insns per cycle are decreased by ~17%
>> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
>> >>
>> >> The bypass code in the benchmark is very similar to the approach
>> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> blocks in the kernel I/O path.
>> >>
>> >> Maybe the benchmark is a bit extreme, but given that modern storage
>> >> devices may reach millions of IOPS, it is very easy for coroutines
>> >> to slow down the I/O.
>> >
>> > I think in order to optimise coroutines, such benchmarks are fair game.
>> > It's just not guaranteed that the effects are exactly the same on real
>> > workloads, so we should take the results with a grain of salt.
>> >
>> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> > coroutines instead of exiting them, so it can't make any use of the
>> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> > version that simply removes the yield at the end):
>> >
>> >                 | bypass        | fixed coro    | buggy coro
>> > ----------------+---------------+---------------+--------------
>> > time            | 1.09s         | 1.10s         | 1.62s
>> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> > insns per cycle | 2.39          | 2.39          | 1.90
>> >
>> > Begs the question whether you see a similar effect on a real qemu and
>> > the coroutine pool is still not big enough? With correct use of
>> > coroutines, the difference seems to be barely measurable even without
>> > any I/O involved.
>>
>> When I comment out qemu_coroutine_yield(), the results of bypass and
>> fixed coro look very similar, as in your test, and I am just wondering
>> whether the stack is always switched in qemu_coroutine_enter(), even
>> without calling qemu_coroutine_yield().
>
> Yes, definitely. qemu_coroutine_enter() always involves calling
> qemu_coroutine_switch(), which is the stack switch.
>
>> Without the yield, the benchmark can't emulate the coroutine usage in
>> the bdrv_aio_readv/writev() path any more, and the bypass in the
>> patchset skips two qemu_coroutine_enter() calls and one
>> qemu_coroutine_yield() for each bdrv_aio_readv/writev().
>
> It's not completely comparable anyway because you're not going through a
> main loop and callbacks from there for your benchmark.
>
> But fair enough: Keep the yield, but enter the coroutine twice then. You
> get slightly worse results then, but that's more like doubling the very
> small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
> / 2.37), not like the horrible performance of the buggy version.

Yes, I compared that too; it looks like there is no big difference.

>
> Actually, that's within the error of measurement for time and
> insns/cycle, so running it for a bit longer:
>
>                 | bypass    | coro      | + yield   | buggy coro
> ----------------+-----------+-----------+-----------+--------------
> time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
> L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
> insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>
>> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>> >> > see a difference with your patches, but at least one reason for this is
>> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>> >> > not for comparing throughput, but it should be usable for that as well.)
>> >>
>> >> This might not be good for the test since it is basically a sequential
>> >> read test, which can be optimized a lot by the kernel. And I always use
>> >> a randread benchmark.
>> >
>> > Yes, I shortly pondered whether I should implement random offsets
>> > instead. But then I realised that a quicker kernel operation would only
>> > help the benchmark because we want it to test the CPU consumption in
>> > userspace. So the faster the kernel gets, the better for us, because it
>> > should make the impact of coroutines bigger.
>>
>> OK, I will compare coroutine vs. bypass-co with the benchmark.

I used the /dev/nullb0 block device for the test, which is available in
Linux kernel 3.13+; the difference follows, and it looks not very big (< 10%):

I added two parameters to your img-bench patch:

      -c CNT  # passed to 'data.n'
      -b      # enable the coroutine bypass introduced in this patchset

Another difference is that dataplane uses its own thread, while this
bench runs in the main loop.

ming@:~/git/qemu$ sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./qemu-img bench -f raw -t off -n -c 10000000 -b /dev/nullb0
read time: 58024ms

 Performance counter stats for './qemu-img bench -f raw -t off -n -c 10000000 -b /dev/nullb0':

     34,874,462,357      L1-dcache-loads                                              [40.00%]
        714,018,039      L1-dcache-load-misses     #    2.05% of all L1-dcache hits   [40.00%]
    133,897,794,677      cpu-cycles                                                   [40.05%]
    116,714,230,004      instructions              #    0.87  insns per cycle         [50.02%]
     22,689,223,546      branch-instructions                                          [50.01%]
        391,673,952      branch-misses             #    1.73% of all branches         [50.00%]
     22,726,856,215      branch-loads                                                 [50.01%]
     18,570,766,783      branch-load-misses                                           [49.98%]
     34,944,839,907      dTLB-loads                                                   [39.99%]
         24,405,944      dTLB-load-misses          #    0.07% of all dTLB cache hits  [39.99%]

      58.040785989 seconds time elapsed


ming@:~/git/qemu$ sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./qemu-img bench -f raw -t off -n -c 10000000 /dev/nullb0
read time: 63369ms

 Performance counter stats for './qemu-img bench -f raw -t off -n -c 10000000 /dev/nullb0':

     35,751,490,462      L1-dcache-loads                                              [39.97%]
      1,111,352,581      L1-dcache-load-misses     #    3.11% of all L1-dcache hits   [40.01%]
    143,731,446,722      cpu-cycles                                                   [40.01%]
    118,754,926,871      instructions              #    0.83  insns per cycle         [50.04%]
     22,870,542,314      branch-instructions                                          [50.07%]
        524,893,216      branch-misses             #    2.30% of all branches         [50.05%]
     22,903,688,861      branch-loads                                                 [50.00%]
     20,179,726,291      branch-load-misses                                           [49.99%]
     35,829,927,679      dTLB-loads                                                   [39.96%]
         42,964,365      dTLB-load-misses          #    0.12% of all dTLB cache hits  [39.97%]

      63.392832844 seconds time elapsed


Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06 11:28                 ` Ming Lei
@ 2014-08-06 11:44                   ` Ming Lei
  2014-08-06 15:40                   ` Kevin Wolf
  1 sibling, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-06 11:44 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Wed, Aug 6, 2014 at 7:28 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:

>
> I used the /dev/nullb0 block device for the test, which is available in
> Linux kernel 3.13+; the difference follows, and it looks not very big (< 10%):
>
> I added two parameters to your img-bench patch:
>
>       -c CNT  # passed to 'data.n'
>       -b      # enable the coroutine bypass introduced in this patchset
>
> Another difference is that dataplane uses its own thread, while this
> bench runs in the main loop.
>
> ming@:~/git/qemu$ sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./qemu-img bench -f raw -t off -n -c 10000000 -b /dev/nullb0
> read time: 58024ms
>
>  Performance counter stats for './qemu-img bench -f raw -t off -n -c 10000000 -b /dev/nullb0':
>
>      34,874,462,357      L1-dcache-loads                                              [40.00%]
>         714,018,039      L1-dcache-load-misses     #    2.05% of all L1-dcache hits   [40.00%]
>     133,897,794,677      cpu-cycles                                                   [40.05%]
>     116,714,230,004      instructions              #    0.87  insns per cycle         [50.02%]
>      22,689,223,546      branch-instructions                                          [50.01%]
>         391,673,952      branch-misses             #    1.73% of all branches         [50.00%]
>      22,726,856,215      branch-loads                                                 [50.01%]
>      18,570,766,783      branch-load-misses                                           [49.98%]
>      34,944,839,907      dTLB-loads                                                   [39.99%]
>          24,405,944      dTLB-load-misses          #    0.07% of all dTLB cache hits  [39.99%]
>
>       58.040785989 seconds time elapsed
>
>
> ming@:~/git/qemu$ sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./qemu-img bench -f raw -t off -n -c 10000000 /dev/nullb0
> read time: 63369ms

BTW, Stefan's coroutine resize patch is applied in both
tests (qemu-img bench).

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  8:50               ` Paolo Bonzini
@ 2014-08-06 13:53                 ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-06 13:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Aug 6, 2014 at 4:50 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 06/08/2014 at 10:38, Ming Lei wrote:
>> On Wed, Aug 6, 2014 at 3:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 06/08/2014 at 07:33, Ming Lei wrote:
>>>>>> I played a bit with the following, I hope it's not too naive. I couldn't
>>>>>> see a difference with your patches, but at least one reason for this is
>>>>>> probably that my laptop SSD isn't fast enough to make the CPU the
>>>>>> bottleneck. Haven't tried ramdisk yet, that would probably be the next
>>>>>> thing. (I actually wrote the patch up just for some profiling on my own,
>>>>>> not for comparing throughput, but it should be usable for that as well.)
>>>> This might not be good for the test since it is basically a sequential
>>>> read test, which can be optimized a lot by the kernel. And I always use
>>>> a randread benchmark.
>>>
>>> A microbenchmark already exists in tests/test-coroutine.c, and doesn't
>>> really tell us much; it's obvious that coroutines execute more code, the
>>> question is why it affects the iops performance.
>>
>> Could you take a look at the coroutine benchmark I worte?  The running
>> result shows coroutine does decrease performance a lot compared with
>> bypass coroutine like the patchset is doing.
>
> Your benchmark is synchronous, while disk I/O is asynchronous.

It can be thought of as asynchronous too, since it doesn't sleep the way
synchronous I/O does.

Basically the I/O thread is CPU-bound in the linux-aio case, since
neither submission nor completion usually blocks, so my benchmark still
fits if we treat the completion as a nop.
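
For reference, a minimal sketch of the benchmark shape I mean, written
against QEMU's 2.1-era coroutine API (qemu_coroutine_create/enter/yield);
this is only an illustration, not the actual co_bench patch:

    #include "block/coroutine.h"

    static void coroutine_fn bench_co(void *opaque)
    {
        unsigned long *done = opaque;

        qemu_coroutine_yield();     /* stands in for "I/O submitted" */
        (*done)++;                  /* the completion is treated as a nop */
    }

    static void run_bench(unsigned long nr_requests)
    {
        unsigned long done = 0;
        unsigned long i;

        for (i = 0; i < nr_requests; i++) {
            Coroutine *co = qemu_coroutine_create(bench_co);
            qemu_coroutine_enter(co, &done);  /* runs up to the yield */
            qemu_coroutine_enter(co, NULL);   /* resumes past the yield;
                                                 the coroutine terminates
                                                 and returns to the pool */
        }
    }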

The current problem is that the single-coroutine benchmark suggests the
stack switch itself doesn't hurt performance, yet in Kevin's block AIO
benchmark, bypassing coroutines still obtains an observable improvement.

>
> Your benchmark doesn't add much compared to "time tests/test-coroutine
> -m perf  -p /perf/yield".  It takes 8 seconds on my machine, and 10^8
> function calls obviously take less than 8 seconds.  I've sent a patch to
> add a "baseline" function call benchmark to test-coroutine.
>
>>> The sequential read should be the right workload.  For fio, you want to
>>> get as many iops as possible to QEMU and so you need randread.  But
>>> qemu-img is not run in a guest and if the kernel optimizes sequential
>>> reads then the bypass should have even more benefits because it makes
>>> userspace proportionally more expensive.
>
> Do you agree with this?

Yes, I have posted the benchmark results, and they look basically
similar to my previous test on dataplane.

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06 11:28                 ` Ming Lei
  2014-08-06 11:44                   ` Ming Lei
@ 2014-08-06 15:40                   ` Kevin Wolf
  2014-08-07 10:27                     ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-06 15:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> > Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
> >> >> Hi Kevin,
> >> >>
> >> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> >> > Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
> >> >> >> I have been wondering how to prove that the root cause is the ucontext
> >> >> >> coroutine mechanism (stack switching).  Here is an idea:
> >> >> >>
> >> >> >> Hack your "bypass" code path to run the request inside a coroutine.
> >> >> >> That way you can compare "bypass without coroutine" against "bypass with
> >> >> >> coroutine".
> >> >> >>
> >> >> >> Right now I think there are doubts because the bypass code path is
> >> >> >> indeed a different (and not 100% correct) code path.  So this approach
> >> >> >> might prove that the coroutines are adding the overhead and not
> >> >> >> something that you bypassed.
> >> >> >
> >> >> > My doubts aren't only that the overhead might not come from the
> >> >> > coroutines, but also whether any coroutine-related overhead is really
> >> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> >> >> > just that instead of introducing additional code paths.
> >> >>
> >> >> OK, thank you for taking look at the problem, and hope we can
> >> >> figure out the root cause, :-)
> >> >>
> >> >> >
> >> >> > Another thought I had was this: If the performance difference is indeed
> >> >> > only coroutines, then that is completely inside the block layer and we
> >> >> > don't actually need a VM to test it. We could instead have something
> >> >> > like a simple qemu-img based benchmark and should be observing the same.
> >> >>
> >> >> Even it is simpler to run a coroutine-only benchmark, and I just
> >> >> wrote a raw one, and looks coroutine does decrease performance
> >> >> a lot, please see the attachment patch, and thanks for your template
> >> >> to help me add the 'co_bench' command in qemu-img.
> >> >
> >> > Yes, we can look at coroutines microbenchmarks in isolation. I actually
> >> > did do that yesterday with the yield test from tests/test-coroutine.c.
> >> > And in fact profiling immediately showed something to optimise:
> >> > pthread_getspecific() was quite high, replacing it by __thread on
> >> > systems where it works is more efficient and helped the numbers a bit.
> >> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> >> > in qemu-img bench), maybe there's even something that can be done here.
> >>
> >> The lock/unlock in dataplane is often from memory_region_find(), and Paolo
> >> should have done lots of work on that.

qemu-img bench doesn't run that code. We have a few more locks that are
taken, and one of them (the coroutine pool lock) is avoided by your
bypass patches.
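
For reference, the pool lock guards the global free list of coroutines.
A rough sketch of the fast path of qemu_coroutine_create(), simplified
from what qemu-coroutine.c does in this era:

    static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
    static QemuMutex pool_lock;

    Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
    {
        Coroutine *co;

        qemu_mutex_lock(&pool_lock);            /* taken per request */
        co = QSLIST_FIRST(&pool);
        if (co) {
            QSLIST_REMOVE_HEAD(&pool, pool_next);
        }
        qemu_mutex_unlock(&pool_lock);

        if (!co) {
            co = qemu_coroutine_new();          /* slow path: new stack */
        }
        co->entry = entry;
        return co;
    }

Every request that goes through a coroutine pays for that lock/unlock
pair (and again on release), which the bypass avoids entirely.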

> >> >
> >> > However, I just wasn't sure whether a change on this level would be
> >> > relevant in a realistic environment. This is the reason why I wanted to
> >> > get a benchmark involving the block layer and some I/O.
> >> >
> >> >> From the profiling data in below link:
> >> >>
> >> >>     http://pastebin.com/YwH2uwbq
> >> >>
> >> >> With coroutine, the running time for same loading is increased
> >> >> ~50%(1.325s vs. 0.903s), and dcache load events is increased
> >> >> ~35%(693M vs. 512M), insns per cycle is decreased by ~50%(
> >> >> 1.35 vs. 1.63), compared with bypassing coroutine(-b parameter).
> >> >>
> >> >> The bypass code in the benchmark is very similar with the approach
> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> >> blocks in the the kernel I/O path.
> >> >>
> >> >> Maybe the benchmark is a bit extremely, but given modern storage
> >> >> device may reach millions of IOPS, and it is very easy to slow down
> >> >> the I/O by coroutine.
> >> >
> >> > I think in order to optimise coroutines, such benchmarks are fair game.
> >> > It's just not guaranteed that the effects are exactly the same on real
> >> > workloads, so we should take the results with a grain of salt.
> >> >
> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> >> > coroutines instead of exiting them, so it can't make any use of the
> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> >> > version that simply removes the yield at the end):
> >> >
> >> >                 | bypass        | fixed coro    | buggy coro
> >> > ----------------+---------------+---------------+--------------
> >> > time            | 1.09s         | 1.10s         | 1.62s
> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> >> > insns per cycle | 2.39          | 2.39          | 1.90
> >> >
> >> > Begs the question whether you see a similar effect on a real qemu and
> >> > the coroutine pool is still not big enough? With correct use of
> >> > coroutines, the difference seems to be barely measurable even without
> >> > any I/O involved.
> >>
> >> When I comment qemu_coroutine_yield(), looks result of
> >> bypass and fixed coro is very similar as your test, and I am just
> >> wondering if stack is always switched in qemu_coroutine_enter()
> >> without calling qemu_coroutine_yield().
> >
> > Yes, definitely. qemu_coroutine_enter() always involves calling
> > qemu_coroutine_switch(), which is the stack switch.
> >
> >> Without the yield, the benchmark can't emulate coroutine usage in
> >> bdrv_aio_readv/writev() path any more, and bypass in the patchset
> >> skips two qemu_coroutine_enter() and one qemu_coroutine_yield()
> >> for each bdrv_aio_readv/writev().
> >
> > It's not completely comparable anyway because you're not going through a
> > main loop and callbacks from there for your benchmark.
> >
> > But fair enough: Keep the yield, but enter the coroutine twice then. You
> > get slightly worse results then, but that's more like doubling the very
> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
> > / 2.37), not like the horrible performance of the buggy version.
> 
> Yes, I compared that too, looks no big difference.
> 
> >
> > Actually, that's within the error of measurement for time and
> > insns/cycle, so running it for a bit longer:
> >
> >                 | bypass    | coro      | + yield   | buggy coro
> > ----------------+-----------+-----------+-----------+--------------
> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
> >
> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
> >> >> > see a difference with your patches, but at least one reason for this is
> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
> >> >> > not for comparing throughput, but it should be usable for that as well.)
> >> >>
> >> >> This might not be good for the test since it is basically a sequential
> >> >> read test, which can be optimized a lot by kernel. And I always use
> >> >> randread benchmark.
> >> >
> >> > Yes, I shortly pondered whether I should implement random offsets
> >> > instead. But then I realised that a quicker kernel operation would only
> >> > help the benchmark because we want it to test the CPU consumption in
> >> > userspace. So the faster the kernel gets, the better for us, because it
> >> > should make the impact of coroutines bigger.
> >>
> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
> 
> I use the /dev/nullb0 block device to test, which is available in linux kernel
> 3.13+, and follows the difference, which looks not very big(< 10%):

Sounds useful. I'm running on an older kernel, so I used a loop-mounted
file on tmpfs instead for my tests.

Anyway, at some point today I figured I should take a different approach
and not try to minimise the problems that coroutines introduce, but
rather make the most use of them when we have them. After all, the
raw-posix driver is still very callback-oriented and does things that
aren't really necessary with coroutines (such as AIOCB allocation).

The qemu-img bench time I ended up with looked quite nice. Maybe you
want to take a look if you can reproduce these results, both with
qemu-img bench and your real benchmark.


$ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
Sending 2000000 requests, 4096 bytes each, 64 in parallel

        bypass (base) | bypass (patch) | coro (base) | coro (patch)
----------------------+----------------+-------------+---------------
run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s


You can find my working tree at:

    git://repo.or.cz/qemu/kevin.git perf-bypass

Please note that I added an even worse and even wronger hack to keep the
bypass working so I can compare it (raw-posix exposes now both bdrv_aio*
and bdrv_co_*, and enabling the bypass also switches). Also, once the
AIO code that I kept for the bypass mode is gone, we can make the
coroutine path even nicer.
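
For the archives, the shape of that coroutine-native path is roughly the
following; the names here are hypothetical, not the code in the branch.
The request coroutine submits the AIO and yields, and the completion
handler re-enters it, so no AIOCB allocation or intermediate callback is
needed:

    typedef struct LaioRequest {
        Coroutine *co;      /* coroutine waiting for this request */
        int ret;            /* filled in by the completion handler */
    } LaioRequest;

    /* called from the aio event handler once io_getevents() reports
     * the request as done */
    static void laio_complete(LaioRequest *req, int ret)
    {
        req->ret = ret;
        qemu_coroutine_enter(req->co, NULL);    /* resume the waiter */
    }

    /* runs inside the request coroutine */
    static int coroutine_fn laio_co_preadv(int fd, uint64_t offset,
                                           QEMUIOVector *qiov)
    {
        LaioRequest req = { .co = qemu_coroutine_self() };

        laio_submit_req(&req, fd, offset, qiov);  /* wraps io_submit() */
        qemu_coroutine_yield();     /* woken up by laio_complete() */
        return req.ret;
    }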

Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06 15:40                   ` Kevin Wolf
@ 2014-08-07 10:27                     ` Ming Lei
  2014-08-07 10:52                       ` Ming Lei
  2014-08-07 13:51                       ` Kevin Wolf
  0 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-07 10:27 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
>> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
>> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> > Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
>> >> >> Hi Kevin,
>> >> >>
>> >> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> >> > Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
>> >> >> >> I have been wondering how to prove that the root cause is the ucontext
>> >> >> >> coroutine mechanism (stack switching).  Here is an idea:
>> >> >> >>
>> >> >> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> >> >> That way you can compare "bypass without coroutine" against "bypass with
>> >> >> >> coroutine".
>> >> >> >>
>> >> >> >> Right now I think there are doubts because the bypass code path is
>> >> >> >> indeed a different (and not 100% correct) code path.  So this approach
>> >> >> >> might prove that the coroutines are adding the overhead and not
>> >> >> >> something that you bypassed.
>> >> >> >
>> >> >> > My doubts aren't only that the overhead might not come from the
>> >> >> > coroutines, but also whether any coroutine-related overhead is really
>> >> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
>> >> >> > just that instead of introducing additional code paths.
>> >> >>
>> >> >> OK, thank you for taking look at the problem, and hope we can
>> >> >> figure out the root cause, :-)
>> >> >>
>> >> >> >
>> >> >> > Another thought I had was this: If the performance difference is indeed
>> >> >> > only coroutines, then that is completely inside the block layer and we
>> >> >> > don't actually need a VM to test it. We could instead have something
>> >> >> > like a simple qemu-img based benchmark and should be observing the same.
>> >> >>
>> >> >> Even it is simpler to run a coroutine-only benchmark, and I just
>> >> >> wrote a raw one, and looks coroutine does decrease performance
>> >> >> a lot, please see the attachment patch, and thanks for your template
>> >> >> to help me add the 'co_bench' command in qemu-img.
>> >> >
>> >> > Yes, we can look at coroutines microbenchmarks in isolation. I actually
>> >> > did do that yesterday with the yield test from tests/test-coroutine.c.
>> >> > And in fact profiling immediately showed something to optimise:
>> >> > pthread_getspecific() was quite high, replacing it by __thread on
>> >> > systems where it works is more efficient and helped the numbers a bit.
>> >> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
>> >> > in qemu-img bench), maybe there's even something that can be done here.
>> >>
>> >> The lock/unlock in dataplane is often from memory_region_find(), and Paolo
>> >> should have done lots of work on that.
>
> qemu-img bench doesn't run that code. We have a few more locks that are
> taken, and one of them (the coroutine pool lock) is avoided by your
> bypass patches.
>
>> >> >
>> >> > However, I just wasn't sure whether a change on this level would be
>> >> > relevant in a realistic environment. This is the reason why I wanted to
>> >> > get a benchmark involving the block layer and some I/O.
>> >> >
>> >> >> From the profiling data in below link:
>> >> >>
>> >> >>     http://pastebin.com/YwH2uwbq
>> >> >>
>> >> >> With coroutine, the running time for same loading is increased
>> >> >> ~50%(1.325s vs. 0.903s), and dcache load events is increased
>> >> >> ~35%(693M vs. 512M), insns per cycle is decreased by ~50%(
>> >> >> 1.35 vs. 1.63), compared with bypassing coroutine(-b parameter).
>> >> >>
>> >> >> The bypass code in the benchmark is very similar with the approach
>> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> >> blocks in the the kernel I/O path.
>> >> >>
>> >> >> Maybe the benchmark is a bit extremely, but given modern storage
>> >> >> device may reach millions of IOPS, and it is very easy to slow down
>> >> >> the I/O by coroutine.
>> >> >
>> >> > I think in order to optimise coroutines, such benchmarks are fair game.
>> >> > It's just not guaranteed that the effects are exactly the same on real
>> >> > workloads, so we should take the results with a grain of salt.
>> >> >
>> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> >> > coroutines instead of exiting them, so it can't make any use of the
>> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> > version that simply removes the yield at the end):
>> >> >
>> >> >                 | bypass        | fixed coro    | buggy coro
>> >> > ----------------+---------------+---------------+--------------
>> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >
>> >> > Begs the question whether you see a similar effect on a real qemu and
>> >> > the coroutine pool is still not big enough? With correct use of
>> >> > coroutines, the difference seems to be barely measurable even without
>> >> > any I/O involved.
>> >>
>> >> When I comment qemu_coroutine_yield(), looks result of
>> >> bypass and fixed coro is very similar as your test, and I am just
>> >> wondering if stack is always switched in qemu_coroutine_enter()
>> >> without calling qemu_coroutine_yield().
>> >
>> > Yes, definitely. qemu_coroutine_enter() always involves calling
>> > qemu_coroutine_switch(), which is the stack switch.
>> >
>> >> Without the yield, the benchmark can't emulate coroutine usage in
>> >> bdrv_aio_readv/writev() path any more, and bypass in the patchset
>> >> skips two qemu_coroutine_enter() and one qemu_coroutine_yield()
>> >> for each bdrv_aio_readv/writev().
>> >
>> > It's not completely comparable anyway because you're not going through a
>> > main loop and callbacks from there for your benchmark.
>> >
>> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>> > get slightly worse results then, but that's more like doubling the very
>> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>> > / 2.37), not like the horrible performance of the buggy version.
>>
>> Yes, I compared that too, looks no big difference.
>>
>> >
>> > Actually, that's within the error of measurement for time and
>> > insns/cycle, so running it for a bit longer:
>> >
>> >                 | bypass    | coro      | + yield   | buggy coro
>> > ----------------+-----------+-----------+-----------+--------------
>> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>> >
>> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>> >> >> > see a difference with your patches, but at least one reason for this is
>> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>> >> >> > not for comparing throughput, but it should be usable for that as well.)
>> >> >>
>> >> >> This might not be good for the test since it is basically a sequential
>> >> >> read test, which can be optimized a lot by kernel. And I always use
>> >> >> randread benchmark.
>> >> >
>> >> > Yes, I shortly pondered whether I should implement random offsets
>> >> > instead. But then I realised that a quicker kernel operation would only
>> >> > help the benchmark because we want it to test the CPU consumption in
>> >> > userspace. So the faster the kernel gets, the better for us, because it
>> >> > should make the impact of coroutines bigger.
>> >>
>> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>>
>> I use the /dev/nullb0 block device to test, which is available in linux kernel
>> 3.13+, and follows the difference, which looks not very big(< 10%):
>
> Sounds useful. I'm running on an older kernel, so I used a loop-mounted
> file on tmpfs instead for my tests.

Actually loop is a slow device; recently I used kernel AIO and blk-mq to
speed it up a lot.

>
> Anyway, at some point today I figured I should take a different approach
> and not try to minimise the problems that coroutines introduce, but
> rather make the most use of them when we have them. After all, the
> raw-posix driver is still very callback-oriented and does things that
> aren't really necessary with coroutines (such as AIOCB allocation).
>
> The qemu-img bench time I ended up with looked quite nice. Maybe you
> want to take a look if you can reproduce these results, both with
> qemu-img bench and your real benchmark.
>
>
> $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
> Sending 2000000 requests, 4096 bytes each, 64 in parallel
>
>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> ----------------------+----------------+-------------+---------------
> run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
> run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
> run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
> run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
> run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s

I suggest running the test a bit longer.

>
> You can find my working tree at:
>
>     git://repo.or.cz/qemu/kevin.git perf-bypass

I just tried your working tree, and qemu-img works well with your
linux-aio coro patches, but unfortunately there is little improvement
observed on my server; the result is basically the same as without
bypass. On my laptop the improvement can be observed, but it is still
at least 5% behind bypass.

Here is the result on my server:

ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 /dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
    read time: 38351ms, 166.000000K IOPS
ming@:~/git/qemu$
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b
/dev/nullb5
Sending 6400000 requests, 4096 bytes each, 64 in parallel
    read time: 35241ms, 181.000000K IOPS
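
For anyone reproducing this: the /dev/nullb* devices come from the
null_blk module (Linux 3.13+). A typical setup for this kind of
benchmark could be the line below; the parameter values are only an
example:

    # create /dev/nullb0../dev/nullb5, blk-mq mode, inline completions
    modprobe null_blk nr_devices=6 queue_mode=2 irqmode=0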

Also there are some problems with your patches; a VM can't boot in my
environment:

- __thread patch: it looks like no '__thread' is actually used, and the
patch basically makes bypass not workable.

- the bdrv_co_writev callback isn't set for raw-posix; it looks like my
rootfs needs to write during booting.

- another problem, which I am investigating: laio sometimes isn't
accessible in qemu_laio_process_completion().

Actually I do care about the performance boost from multi-queue, since
multi-queue can improve performance a lot against QEMU 2.0. Once I have
fixed these problems, I will run a VM to test mq performance with
linux-aio coroutines. Or could you give suggestions about these problems?

> Please note that I added an even worse and even wronger hack to keep the
> bypass working so I can compare it (raw-posix exposes now both bdrv_aio*
> and bdrv_co_*, and enabling the bypass also switches). Also, once the
> AIO code that I kept for the bypass mode is gone, we can make the
> coroutine path even nicer.

This approach looks nice since it saves the intermediate callback.

Basically the current bypass approach bypasses coroutines in the block
layer, while linux-aio takes a new coroutine, so they are two different
paths. And linux-aio's coroutine can still be bypassed easily too, :-)


Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-07 10:27                     ` Ming Lei
@ 2014-08-07 10:52                       ` Ming Lei
  2014-08-07 11:06                         ` Kevin Wolf
  2014-08-07 13:51                       ` Kevin Wolf
  1 sibling, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-07 10:52 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Thu, Aug 7, 2014 at 6:27 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:

> Also there are some problems with your patches which can't boot a
> VM in my environment:
>
> - __thread patch: looks there is no '__thread' used, and the patch
> basically makes bypass not workable.
>
> - bdrv_co_writev callback isn't set for raw-posix, looks my rootfs need to
> write during booting
>
> - another problem, I am investigating: laio isn't accessable
> in qemu_laio_process_completion() sometimes

This one should be caused by accessing 'laiocb' after cb().
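
In other words, the completion path must not touch the request after
invoking its callback, because the callback may release it. A minimal
illustration of the fix (a generic sketch, not the actual QEMU code):

    static void process_completion(struct laio_request *req)
    {
        /* copy out everything still needed *before* the callback runs */
        int ret = req->ret;
        void (*cb)(void *opaque, int ret) = req->cb;
        void *opaque = req->opaque;

        cb(opaque, ret);    /* may free req */
        /* req must not be dereferenced past this point */
    }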

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-07 10:52                       ` Ming Lei
@ 2014-08-07 11:06                         ` Kevin Wolf
  2014-08-07 13:03                           ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-07 11:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

Am 07.08.2014 um 12:52 hat Ming Lei geschrieben:
> On Thu, Aug 7, 2014 at 6:27 PM, Ming Lei <ming.lei@canonical.com> wrote:
> > On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> 
> > Also there are some problems with your patches which can't boot a
> > VM in my environment:
> >
> > - __thread patch: looks there is no '__thread' used, and the patch
> > basically makes bypass not workable.
> >
> > - bdrv_co_writev callback isn't set for raw-posix, looks my rootfs need to
> > write during booting
> >
> > - another problem, I am investigating: laio isn't accessable
> > in qemu_laio_process_completion() sometimes
> 
> This one should be caused by accessing 'laiocb' after cb().

I stumbled across the same problems this morning when I tried to
actually run VMs with it instead of just qemu-img bench. They should all
be fixed in my git repo now. (Haven't figured out yet why __thread
doesn't work, so I have reverted that part, probably at the cost of some
performance.)
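
For context, the optimisation in question swaps the pthread-key lookup
of the current coroutine for compiler-level TLS, roughly as below; this
is a sketch, not the actual patch, and the guard macro is made up:

    #include <pthread.h>

    #ifdef HAVE_TLS                        /* hypothetical guard */
    static __thread Coroutine *current;    /* direct TLS load */
    #define get_current()  (current)
    #else
    static pthread_key_t current_key;      /* function call per lookup */
    #define get_current() \
        ((Coroutine *)pthread_getspecific(current_key))
    #endif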

Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-07 11:06                         ` Kevin Wolf
@ 2014-08-07 13:03                           ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-07 13:03 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Thu, Aug 7, 2014 at 7:06 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 07.08.2014 um 12:52 hat Ming Lei geschrieben:
>> On Thu, Aug 7, 2014 at 6:27 PM, Ming Lei <ming.lei@canonical.com> wrote:
>> > On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>
>> > Also there are some problems with your patches which can't boot a
>> > VM in my environment:
>> >
>> > - __thread patch: looks there is no '__thread' used, and the patch
>> > basically makes bypass not workable.
>> >
>> > - bdrv_co_writev callback isn't set for raw-posix, looks my rootfs need to
>> > write during booting
>> >
>> > - another problem, I am investigating: laio isn't accessable
>> > in qemu_laio_process_completion() sometimes
>>
>> This one should be caused by accessing 'laiocb' after cb().
>
> I stumbled across the same problems this morning when I tried to
> actually run VMs with it instead of just qemu-img bench. They should all
> be fixed in my git repo now. (Haven't figured out yet why __thread
> doesn't work, so I have reverted that part, probably at the cost of some
> performance.)

In my test there was no obvious performance effect from that commit, or
from pthread_getspecific(), which should be fine in the fast path. I
also simply reverted it since __thread can't be added. Interestingly, my
other local change is basically the same as yours, :-)

Finally I implemented coroutine bypass on top of your linux-aio coro
patches, so the bypass effect can be compared easily; now both run
basically the same path except for the coroutine APIs:

       git://kernel.ubuntu.com/ming/qemu.git  v2.1.0-mq.1-kevin-perf

The above branch only holds three patches which are against the
latest 'perf-bypass' branch of your tree.

Then I run it in VM on my server and still use the same fio(linux aio,
direct, 4k bs, 120sec) to test virtio-blk dataplane performance, and the
virtio-blk is backed by the /dev/nullb0 block device too.

              | without bypass (linux-aio coro) | with bypass (linux-aio coro)
--------------+---------------------------------+------------------------------
 1 vq, 2 jobs |           101K IOPS             |          116K IOPS
--------------+---------------------------------+------------------------------
 4 vq, 4 jobs |           121K IOPS             |          142K IOPS

Looks like there is still some difference even with the linux-aio
coroutine patches applied.

Now I am a bit more confident that coroutines are the cause of the
performance difference...

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-07 10:27                     ` Ming Lei
  2014-08-07 10:52                       ` Ming Lei
@ 2014-08-07 13:51                       ` Kevin Wolf
  2014-08-08 10:32                         ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-07 13:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

Am 07.08.2014 um 12:27 hat Ming Lei geschrieben:
> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
> >> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
> >> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> >> > However, I just wasn't sure whether a change on this level would be
> >> >> > relevant in a realistic environment. This is the reason why I wanted to
> >> >> > get a benchmark involving the block layer and some I/O.
> >> >> >
> >> >> >> From the profiling data in below link:
> >> >> >>
> >> >> >>     http://pastebin.com/YwH2uwbq
> >> >> >>
> >> >> >> With coroutine, the running time for same loading is increased
> >> >> >> ~50%(1.325s vs. 0.903s), and dcache load events is increased
> >> >> >> ~35%(693M vs. 512M), insns per cycle is decreased by ~50%(
> >> >> >> 1.35 vs. 1.63), compared with bypassing coroutine(-b parameter).
> >> >> >>
> >> >> >> The bypass code in the benchmark is very similar with the approach
> >> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> >> >> blocks in the the kernel I/O path.
> >> >> >>
> >> >> >> Maybe the benchmark is a bit extremely, but given modern storage
> >> >> >> device may reach millions of IOPS, and it is very easy to slow down
> >> >> >> the I/O by coroutine.
> >> >> >
> >> >> > I think in order to optimise coroutines, such benchmarks are fair game.
> >> >> > It's just not guaranteed that the effects are exactly the same on real
> >> >> > workloads, so we should take the results with a grain of salt.
> >> >> >
> >> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> >> >> > coroutines instead of exiting them, so it can't make any use of the
> >> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> >> >> > version that simply removes the yield at the end):
> >> >> >
> >> >> >                 | bypass        | fixed coro    | buggy coro
> >> >> > ----------------+---------------+---------------+--------------
> >> >> > time            | 1.09s         | 1.10s         | 1.62s
> >> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> >> >> > insns per cycle | 2.39          | 2.39          | 1.90
> >> >> >
> >> >> > Begs the question whether you see a similar effect on a real qemu and
> >> >> > the coroutine pool is still not big enough? With correct use of
> >> >> > coroutines, the difference seems to be barely measurable even without
> >> >> > any I/O involved.
> >> >>
> >> >> When I comment qemu_coroutine_yield(), looks result of
> >> >> bypass and fixed coro is very similar as your test, and I am just
> >> >> wondering if stack is always switched in qemu_coroutine_enter()
> >> >> without calling qemu_coroutine_yield().
> >> >
> >> > Yes, definitely. qemu_coroutine_enter() always involves calling
> >> > qemu_coroutine_switch(), which is the stack switch.
> >> >
> >> >> Without the yield, the benchmark can't emulate coroutine usage in
> >> >> bdrv_aio_readv/writev() path any more, and bypass in the patchset
> >> >> skips two qemu_coroutine_enter() and one qemu_coroutine_yield()
> >> >> for each bdrv_aio_readv/writev().
> >> >
> >> > It's not completely comparable anyway because you're not going through a
> >> > main loop and callbacks from there for your benchmark.
> >> >
> >> > But fair enough: Keep the yield, but enter the coroutine twice then. You
> >> > get slightly worse results then, but that's more like doubling the very
> >> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
> >> > / 2.37), not like the horrible performance of the buggy version.
> >>
> >> Yes, I compared that too, looks no big difference.
> >>
> >> >
> >> > Actually, that's within the error of measurement for time and
> >> > insns/cycle, so running it for a bit longer:
> >> >
> >> >                 | bypass    | coro      | + yield   | buggy coro
> >> > ----------------+-----------+-----------+-----------+--------------
> >> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
> >> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
> >> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
> >> >
> >> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
> >> >> >> > see a difference with your patches, but at least one reason for this is
> >> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
> >> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
> >> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
> >> >> >> > not for comparing throughput, but it should be usable for that as well.)
> >> >> >>
> >> >> >> This might not be good for the test since it is basically a sequential
> >> >> >> read test, which can be optimized a lot by kernel. And I always use
> >> >> >> randread benchmark.
> >> >> >
> >> >> > Yes, I shortly pondered whether I should implement random offsets
> >> >> > instead. But then I realised that a quicker kernel operation would only
> >> >> > help the benchmark because we want it to test the CPU consumption in
> >> >> > userspace. So the faster the kernel gets, the better for us, because it
> >> >> > should make the impact of coroutines bigger.
> >> >>
> >> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
> >>
> >> I use the /dev/nullb0 block device to test, which is available in linux kernel
> >> 3.13+, and follows the difference, which looks not very big(< 10%):
> >
> > Sounds useful. I'm running on an older kernel, so I used a loop-mounted
> > file on tmpfs instead for my tests.
> 
> Actually loop is a slow device, and recently I used kernel aio and blk-mq
> to speedup it a lot.

Yes, I have no doubts that it's slower than a proper ramdisk, but it
should still be way faster than my normal disk.

> > Anyway, at some point today I figured I should take a different approach
> > and not try to minimise the problems that coroutines introduce, but
> > rather make the most use of them when we have them. After all, the
> > raw-posix driver is still very callback-oriented and does things that
> > aren't really necessary with coroutines (such as AIOCB allocation).
> >
> > The qemu-img bench time I ended up with looked quite nice. Maybe you
> > want to take a look if you can reproduce these results, both with
> > qemu-img bench and your real benchmark.
> >
> >
> > $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
> > Sending 2000000 requests, 4096 bytes each, 64 in parallel
> >
> >         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> > ----------------------+----------------+-------------+---------------
> > run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
> > run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
> > run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
> > run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
> > run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s
> 
> I suggest to run the test a bit long.

Okay, ran it again with -c 10000000 this time. I also used the updated
branch for the patched version. This means that the __thread patch is
not enabled; this is probably why the improvement for the bypass has
disappeared and the coroutine based version only approaches, but doesn't
beat it this time.

        bypass (base) | bypass (patch) | coro (base) | coro (patch)
----------------------+----------------+-------------+---------------
run 1   28.255s       |  28.615s       | 30.364s     | 28.318s
run 2   28.190s       |  28.926s       | 30.096s     | 28.437s
run 3   28.079s       |  29.603s       | 30.084s     | 28.567s
run 4   28.888s       |  28.581s       | 31.343s     | 28.605s
run 5   28.196s       |  28.924s       | 30.033s     | 27.935s

> > You can find my working tree at:
> >
> >     git://repo.or.cz/qemu/kevin.git perf-bypass
> 
> I just tried your work tree, and looks qemu-img can work well
> with your linux-aio coro patches, but unfortunately there is
> little improvement observed in my server, basically the result is
> same without bypass; in my laptop, the improvement can be
> observed but it is still at least 5% less than bypass.
> 
> Let's see the result in my server:
> 
> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 /dev/nullb5
> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>     read time: 38351ms, 166.000000K IOPS
> ming@:~/git/qemu$
> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b
> /dev/nullb5
> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>     read time: 35241ms, 181.000000K IOPS

Hm, interesting. Apparently our environments are different enough to
come to opposite conclusions.

I also tried running some fio benchmarks based on the configuration you
had in the cover letter (just a bit downsized to fit it in the ramdisk)
and came to completely different results: For me, git master is a lot
better than qemu 2.0. The optimisation branch showed small, but
measurable additional improvements, with coroutines consistently being a
bit ahead of the bypass mode.

> > Please note that I added an even worse and even wronger hack to keep the
> > bypass working so I can compare it (raw-posix exposes now both bdrv_aio*
> > and bdrv_co_*, and enabling the bypass also switches). Also, once the
> > AIO code that I kept for the bypass mode is gone, we can make the
> > coroutine path even nicer.
> 
> This approach looks nice since it saves the intermediate callback.
> 
> Basically current bypass approach is to bypass coroutine in block, but
> linux-aio takes a new coroutine, which are two different path. And
> linux-aio's coroutine still can be bypassed easily too , :-)

The patched linux-aio doesn't create a new coroutine, it simply stays
in the one coroutine that we have and in which we already are. Bypassing
it by making the yield conditional would still be possible, of course
(for testing anyway; I don't think anything like that can be merged
easily).

Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-07 13:51                       ` Kevin Wolf
@ 2014-08-08 10:32                         ` Ming Lei
  2014-08-08 11:26                           ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-08 10:32 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Thu, Aug 7, 2014 at 9:51 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 07.08.2014 um 12:27 hat Ming Lei geschrieben:
>> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
>> >> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
>> >> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> >> > However, I just wasn't sure whether a change on this level would be
>> >> >> > relevant in a realistic environment. This is the reason why I wanted to
>> >> >> > get a benchmark involving the block layer and some I/O.
>> >> >> >
>> >> >> >> From the profiling data in below link:
>> >> >> >>
>> >> >> >>     http://pastebin.com/YwH2uwbq
>> >> >> >>
>> >> >> >> With coroutine, the running time for same loading is increased
>> >> >> >> ~50%(1.325s vs. 0.903s), and dcache load events is increased
>> >> >> >> ~35%(693M vs. 512M), insns per cycle is decreased by ~50%(
>> >> >> >> 1.35 vs. 1.63), compared with bypassing coroutine(-b parameter).
>> >> >> >>
>> >> >> >> The bypass code in the benchmark is very similar with the approach
>> >> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> >> >> blocks in the the kernel I/O path.
>> >> >> >>
>> >> >> >> Maybe the benchmark is a bit extremely, but given modern storage
>> >> >> >> device may reach millions of IOPS, and it is very easy to slow down
>> >> >> >> the I/O by coroutine.
>> >> >> >
>> >> >> > I think in order to optimise coroutines, such benchmarks are fair game.
>> >> >> > It's just not guaranteed that the effects are exactly the same on real
>> >> >> > workloads, so we should take the results with a grain of salt.
>> >> >> >
>> >> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> >> >> > coroutines instead of exiting them, so it can't make any use of the
>> >> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> >> > version that simply removes the yield at the end):
>> >> >> >
>> >> >> >                 | bypass        | fixed coro    | buggy coro
>> >> >> > ----------------+---------------+---------------+--------------
>> >> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >> >
>> >> >> > Begs the question whether you see a similar effect on a real qemu and
>> >> >> > the coroutine pool is still not big enough? With correct use of
>> >> >> > coroutines, the difference seems to be barely measurable even without
>> >> >> > any I/O involved.
>> >> >>
>> >> >> When I comment qemu_coroutine_yield(), looks result of
>> >> >> bypass and fixed coro is very similar as your test, and I am just
>> >> >> wondering if stack is always switched in qemu_coroutine_enter()
>> >> >> without calling qemu_coroutine_yield().
>> >> >
>> >> > Yes, definitely. qemu_coroutine_enter() always involves calling
>> >> > qemu_coroutine_switch(), which is the stack switch.
>> >> >
>> >> >> Without the yield, the benchmark can't emulate coroutine usage in
>> >> >> bdrv_aio_readv/writev() path any more, and bypass in the patchset
>> >> >> skips two qemu_coroutine_enter() and one qemu_coroutine_yield()
>> >> >> for each bdrv_aio_readv/writev().
>> >> >
>> >> > It's not completely comparable anyway because you're not going through a
>> >> > main loop and callbacks from there for your benchmark.
>> >> >
>> >> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>> >> > get slightly worse results then, but that's more like doubling the very
>> >> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>> >> > / 2.37), not like the horrible performance of the buggy version.
>> >>
>> >> Yes, I compared that too, looks no big difference.
>> >>
>> >> >
>> >> > Actually, that's within the error of measurement for time and
>> >> > insns/cycle, so running it for a bit longer:
>> >> >
>> >> >                 | bypass    | coro      | + yield   | buggy coro
>> >> > ----------------+-----------+-----------+-----------+--------------
>> >> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>> >> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>> >> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>> >> >
>> >> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>> >> >> >> > see a difference with your patches, but at least one reason for this is
>> >> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>> >> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> >> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>> >> >> >> > not for comparing throughput, but it should be usable for that as well.)
>> >> >> >>
>> >> >> >> This might not be good for the test since it is basically a sequential
>> >> >> >> read test, which can be optimized a lot by kernel. And I always use
>> >> >> >> randread benchmark.
>> >> >> >
>> >> >> > Yes, I shortly pondered whether I should implement random offsets
>> >> >> > instead. But then I realised that a quicker kernel operation would only
>> >> >> > help the benchmark because we want it to test the CPU consumption in
>> >> >> > userspace. So the faster the kernel gets, the better for us, because it
>> >> >> > should make the impact of coroutines bigger.
>> >> >>
>> >> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>> >>
>> >> I use the /dev/nullb0 block device to test, which is available in linux kernel
>> >> 3.13+, and follows the difference, which looks not very big(< 10%):
>> >
>> > Sounds useful. I'm running on an older kernel, so I used a loop-mounted
>> > file on tmpfs instead for my tests.
>>
>> Actually loop is a slow device, and recently I used kernel aio and blk-mq
>> to speedup it a lot.
>
> Yes, I have no doubts that it's slower than a proper ramdisk, but it
> should still be way faster than my normal disk.
>
>> > Anyway, at some point today I figured I should take a different approach
>> > and not try to minimise the problems that coroutines introduce, but
>> > rather make the most use of them when we have them. After all, the
>> > raw-posix driver is still very callback-oriented and does things that
>> > aren't really necessary with coroutines (such as AIOCB allocation).
>> >
>> > The qemu-img bench time I ended up with looked quite nice. Maybe you
>> > want to take a look if you can reproduce these results, both with
>> > qemu-img bench and your real benchmark.
>> >
>> >
>> > $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
>> > Sending 2000000 requests, 4096 bytes each, 64 in parallel
>> >
>> >         bypass (base) | bypass (patch) | coro (base) | coro (patch)
>> > ----------------------+----------------+-------------+---------------
>> > run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
>> > run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
>> > run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
>> > run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
>> > run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s
>>
>> I suggest to run the test a bit long.
>
> Okay, ran it again with -c 10000000 this time. I also used the updated
> branch for the patched version. This means that the __thread patch is
> not enabled; this is probably why the improvement for the bypass has
> disappeared and the coroutine based version only approaches, but doesn't
> beat it this time.
>
>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> ----------------------+----------------+-------------+---------------
> run 1   28.255s       |  28.615s       | 30.364s     | 28.318s
> run 2   28.190s       |  28.926s       | 30.096s     | 28.437s
> run 3   28.079s       |  29.603s       | 30.084s     | 28.567s
> run 4   28.888s       |  28.581s       | 31.343s     | 28.605s
> run 5   28.196s       |  28.924s       | 30.033s     | 27.935s

Your result is quite good (>300K IOPS), much better than my result with
/dev/nullb0 (less than 200K). I also tried loop over a file in tmpfs,
which looks a bit quicker than /dev/nullb0 (still ~200K IOPS on my
server), so I guess your machine is very fast.

It is a bit similar to my observation:

- on my laptop (CPU: 2.6GHz), your coro patch improved things a lot, and
is only about 5% behind bypass
- on my server (CPU: 1.6GHz, same L1/L2 cache as the laptop, bigger L3
cache), your coro patch improved little, and is nearly 10% behind bypass

So does coroutine behave better on fast CPUs than on slow ones?

I would appreciate it if you could run the test in a VM, especially with
2 or 4 virtqueues and 2/4 jobs, to see what IOPS can be reached.
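
For concreteness, a hypothetical invocation along the lines of this
patchset (the num_queues property name is an assumption based on the mq
patch in this series; the iothread syntax follows QEMU 2.1 dataplane):

    qemu-system-x86_64 -enable-kvm -m 8192 -smp 4 \
        -object iothread,id=iot0 \
        -drive if=none,id=d0,file=/dev/nullb0,format=raw,cache=none,aio=native \
        -device virtio-blk-pci,drive=d0,iothread=iot0,num_queues=4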

>> > You can find my working tree at:
>> >
>> >     git://repo.or.cz/qemu/kevin.git perf-bypass
>>
>> I just tried your work tree, and looks qemu-img can work well
>> with your linux-aio coro patches, but unfortunately there is
>> little improvement observed in my server, basically the result is
>> same without bypass; in my laptop, the improvement can be
>> observed but it is still at least 5% less than bypass.
>>
>> Let's see the result in my server:
>>
>> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 /dev/nullb5
>> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>>     read time: 38351ms, 166.000000K IOPS
>> ming@:~/git/qemu$
>> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b
>> /dev/nullb5
>> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>>     read time: 35241ms, 181.000000K IOPS
>
> Hm, interesting. Apparently our environments are different enough to
> come to opposite conclusions.

Yes, it looks like coroutines behave better on a fast CPU than on a slow
one; as you can see, my result is much worse than yours.

ming@:~/git/qemu$ sudo losetup -a
/dev/loop0: [0014]:64892 (/run/shm/dd.img)
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -n -t off -c 2000000 -b
/dev/loop0
Sending 2000000 requests, 4096 bytes each, 64 in parallel
    read time: 9692ms, 206.000000K IOPS
ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -n -t off -c 2000000 /dev/loop0
Sending 2000000 requests, 4096 bytes each, 64 in parallel
    read time: 10683ms, 187.000000K IOPS
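
The loop device above sits on a file in tmpfs, set up with something
like the following (the image size here is a guess):

    dd if=/dev/zero of=/run/shm/dd.img bs=1M count=1024
    sudo losetup /dev/loop0 /run/shm/dd.img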

>
> I also tried running some fio benchmarks based on the configuration you
> had in the cover letter (just a bit downsized to fit it in the ramdisk)
> and came to completely different results: For me, git master is a lot
> better than qemu 2.0. The optimisation branch showed small, but
> measurable additional improvements, with coroutines consistently being a
> bit ahead of the bypass mode.
>
>> > Please note that I added an even worse and even wronger hack to keep the
>> > bypass working so I can compare it (raw-posix exposes now both bdrv_aio*
>> > and bdrv_co_*, and enabling the bypass also switches). Also, once the
>> > AIO code that I kept for the bypass mode is gone, we can make the
>> > coroutine path even nicer.
>>
>> This approach looks nice since it saves the intermediate callback.
>>
>> Basically current bypass approach is to bypass coroutine in block, but
>> linux-aio takes a new coroutine, which are two different path. And
>> linux-aio's coroutine still can be bypassed easily too , :-)
>
> The patched linux-aio doesn't create a new coroutine, it simply stays
> in the one coroutine that we have and in which we already are. Bypassing
> it by making the yield conditional would still be possible, of course
> (for testing anyway; I don't think anything like that can be merged
> easily).


Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-08 10:32                         ` Ming Lei
@ 2014-08-08 11:26                           ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-08 11:26 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Fri, Aug 8, 2014 at 6:32 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Thu, Aug 7, 2014 at 9:51 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 07.08.2014 um 12:27 hat Ming Lei geschrieben:
>>> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> > Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
>>> >> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> >> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
>>> >> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> >> >> > However, I just wasn't sure whether a change on this level would be
>>> >> >> > relevant in a realistic environment. This is the reason why I wanted to
>>> >> >> > get a benchmark involving the block layer and some I/O.
>>> >> >> >
>>> >> >> >> From the profiling data in below link:
>>> >> >> >>
>>> >> >> >>     http://pastebin.com/YwH2uwbq
>>> >> >> >>
>>> >> >> >> With coroutine, the running time for same loading is increased
>>> >> >> >> ~50%(1.325s vs. 0.903s), and dcache load events is increased
>>> >> >> >> ~35%(693M vs. 512M), insns per cycle is decreased by ~50%(
>>> >> >> >> 1.35 vs. 1.63), compared with bypassing coroutine(-b parameter).
>>> >> >> >>
>>> >> >> >> The bypass code in the benchmark is very similar with the approach
>>> >> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>>> >> >> >> blocks in the the kernel I/O path.
>>> >> >> >>
>>> >> >> >> Maybe the benchmark is a bit extremely, but given modern storage
>>> >> >> >> device may reach millions of IOPS, and it is very easy to slow down
>>> >> >> >> the I/O by coroutine.
>>> >> >> >
>>> >> >> > I think in order to optimise coroutines, such benchmarks are fair game.
>>> >> >> > It's just not guaranteed that the effects are exactly the same on real
>>> >> >> > workloads, so we should take the results with a grain of salt.
>>> >> >> >
>>> >> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>>> >> >> > coroutines instead of exiting them, so it can't make any use of the
>>> >> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>>> >> >> > version that simply removes the yield at the end):
>>> >> >> >
>>> >> >> >                 | bypass        | fixed coro    | buggy coro
>>> >> >> > ----------------+---------------+---------------+--------------
>>> >> >> > time            | 1.09s         | 1.10s         | 1.62s
>>> >> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>>> >> >> > insns per cycle | 2.39          | 2.39          | 1.90
>>> >> >> >
>>> >> >> > Begs the question whether you see a similar effect on a real qemu and
>>> >> >> > the coroutine pool is still not big enough? With correct use of
>>> >> >> > coroutines, the difference seems to be barely measurable even without
>>> >> >> > any I/O involved.
>>> >> >>
>>> >> >> When I comment qemu_coroutine_yield(), looks result of
>>> >> >> bypass and fixed coro is very similar as your test, and I am just
>>> >> >> wondering if stack is always switched in qemu_coroutine_enter()
>>> >> >> without calling qemu_coroutine_yield().
>>> >> >
>>> >> > Yes, definitely. qemu_coroutine_enter() always involves calling
>>> >> > qemu_coroutine_switch(), which is the stack switch.
>>> >> >
>>> >> >> Without the yield, the benchmark can't emulate coroutine usage in
>>> >> >> bdrv_aio_readv/writev() path any more, and bypass in the patchset
>>> >> >> skips two qemu_coroutine_enter() and one qemu_coroutine_yield()
>>> >> >> for each bdrv_aio_readv/writev().
>>> >> >
>>> >> > It's not completely comparable anyway because you're not going through a
>>> >> > main loop and callbacks from there for your benchmark.
>>> >> >
>>> >> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>>> >> > get slightly worse results then, but that's more like doubling the very
>>> >> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>>> >> > / 2.37), not like the horrible performance of the buggy version.
>>> >>
>>> >> Yes, I compared that too, and saw no big difference.
>>> >>
>>> >> >
>>> >> > Actually, that's within the error of measurement for time and
>>> >> > insns/cycle, so running it for a bit longer:
>>> >> >
>>> >> >                 | bypass    | coro      | + yield   | buggy coro
>>> >> > ----------------+-----------+-----------+-----------+--------------
>>> >> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>>> >> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>>> >> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>>> >> >
>>> >> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>>> >> >> >> > see a difference with your patches, but at least one reason for this is
>>> >> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>>> >> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>>> >> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>>> >> >> >> > not for comparing throughput, but it should be usable for that as well.)
>>> >> >> >>
>>> >> >> This might not be good for the test since it is basically a sequential
>>> >> >> read test, which can be optimized a lot by the kernel. And I always use
>>> >> >> a randread benchmark.
>>> >> >> >
>>> >> >> > Yes, I shortly pondered whether I should implement random offsets
>>> >> >> > instead. But then I realised that a quicker kernel operation would only
>>> >> >> > help the benchmark because we want it to test the CPU consumption in
>>> >> >> > userspace. So the faster the kernel gets, the better for us, because it
>>> >> >> > should make the impact of coroutines bigger.
>>> >> >>
>>> >> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>>> >>
>>> >> I used the /dev/nullb0 block device to test, which is available in linux
>>> >> kernel 3.13+; the difference follows, and it looks not very big (< 10%):
>>> >
>>> > Sounds useful. I'm running on an older kernel, so I used a loop-mounted
>>> > file on tmpfs instead for my tests.
>>>
>>> Actually loop is a slow device, and recently I used kernel aio and blk-mq
>>> to speed it up a lot.
>>
>> Yes, I have no doubts that it's slower than a proper ramdisk, but it
>> should still be way faster than my normal disk.
>>
>>> > Anyway, at some point today I figured I should take a different approach
>>> > and not try to minimise the problems that coroutines introduce, but
>>> > rather make the most use of them when we have them. After all, the
>>> > raw-posix driver is still very callback-oriented and does things that
>>> > aren't really necessary with coroutines (such as AIOCB allocation).
>>> >
>>> > The qemu-img bench time I ended up with looked quite nice. Maybe you
>>> > want to take a look if you can reproduce these results, both with
>>> > qemu-img bench and your real benchmark.
>>> >
>>> >
>>> > $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
>>> > Sending 2000000 requests, 4096 bytes each, 64 in parallel
>>> >
>>> >         bypass (base) | bypass (patch) | coro (base) | coro (patch)
>>> > ----------------------+----------------+-------------+---------------
>>> > run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
>>> > run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
>>> > run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
>>> > run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
>>> > run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s
>>>
>>> I suggest running the test a bit longer.
>>
>> Okay, ran it again with -c 10000000 this time. I also used the updated
>> branch for the patched version. This means that the __thread patch is
>> not enabled; this is probably why the improvement for the bypass has
>> disappeared and the coroutine based version only approaches, but doesn't
>> beat it this time.
>>
>>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
>> ----------------------+----------------+-------------+---------------
>> run 1   28.255s       |  28.615s       | 30.364s     | 28.318s
>> run 2   28.190s       |  28.926s       | 30.096s     | 28.437s
>> run 3   28.079s       |  29.603s       | 30.084s     | 28.567s
>> run 4   28.888s       |  28.581s       | 31.343s     | 28.605s
>> run 5   28.196s       |  28.924s       | 30.033s     | 27.935s
>
> Your result is quite good (>300K IOPS), much better than my result with
> /dev/nullb0 (less than 200K), and I also tried loop over a file in tmpfs,
> which looks a bit quicker than /dev/nullb0 (still ~200K IOPS on my server),
> so I guess your machine is very fast.
>
> It is a bit similar to my observation:
>
> - on my laptop (CPU: 2.6GHz), your coro patch improved things a lot, and
> is only less than 5% behind bypass
> - on my server (CPU: 1.6GHz, same L1/L2 cache as the laptop, bigger L3
> cache), your coro patch improved little, and it is about 10% behind bypass
>
> so it looks like coroutines behave better on fast CPUs than on slow ones?

I think it is true:

- using coroutines inevitably introduces some extra CPU load
(coroutine_swap, and the dcache misses introduced by switching stacks)

- the introduced load may not be a (big) deal for a fast CPU, but it makes
a difference for a slower CPU

- even for a fast CPU, 'perf stat' may still show some difference in
instructions per cycle, dcache loads and misses, branch misses, dTLB
misses...
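
For example, one way to compare the two runs is something along these lines
(a sketch; the exact perf event names depend on the CPU and kernel, and
co_bench is the benchmark command added by the patch later in this thread):

    perf stat -e instructions,cycles,L1-dcache-loads,branch-misses,dTLB-loads \
        ./qemu-img co_bench -c 10000000 -f somefile -s 512
    perf stat -e instructions,cycles,L1-dcache-loads,branch-misses,dTLB-loads \
        ./qemu-img co_bench -b -c 10000000 -f somefile -s 512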

BTW, Kevin, if we want to see the coroutine effect in the block I/O path,
it may be better to use the same path (bypassing the linux-aio coroutine
too, as in the tree below, or just qemu master with my patchset) to compare
results, since then the only difference is in using coroutines or not:

   git://kernel.ubuntu.com/ming/qemu.git  v2.1.0-mq.1-kevin-perf

In your perf-bypass branch, bypass and non-bypass run different paths, so
it is not suitable for doing the comparison and drawing conclusions.

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-06  8:48           ` Kevin Wolf
  2014-08-06  9:37             ` Ming Lei
@ 2014-08-10  3:46             ` Ming Lei
  2014-08-11 14:03               ` Kevin Wolf
  2014-08-11 19:37               ` Paolo Bonzini
  1 sibling, 2 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-10  3:46 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, tom.leiming,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini

Hi Kevin, Paolo, Stefan and all,


On Wed, 6 Aug 2014 10:48:55 +0200
Kevin Wolf <kwolf@redhat.com> wrote:

> On 06.08.2014 at 07:33, Ming Lei wrote:

> 
> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> coroutines instead of exiting them, so it can't make any use of the
> coroutine pool. On my laptop, I get this (where fixed coroutine is a
> version that simply removes the yield at the end):
> 
>                 | bypass        | fixed coro    | buggy coro
> ----------------+---------------+---------------+--------------
> time            | 1.09s         | 1.10s         | 1.62s
> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> insns per cycle | 2.39          | 2.39          | 1.90
> 
> Begs the question whether you see a similar effect on a real qemu and
> the coroutine pool is still not big enough? With correct use of
> coroutines, the difference seems to be barely measurable even without
> any I/O involved.

Now I have fixed the coroutine leak bug. The previous crypt bench put a rather
heavy load on the CPU, which kept operations per second very low (~40K/sec), so
I wrote a new and simple one which can generate hundreds of thousands of
operations per second; that number should match some fast storage devices, and
it does show a non-trivial effect from coroutines.

In the extreme case where just the getppid() syscall is run in each iteration,
only 3M operations/sec can be reached with coroutines, while without coroutines
the number can reach 16M/sec: more than a 4x difference!
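
As a rough sanity check of those numbers (my own arithmetic, not a separate
measurement): the per-iteration times work out as

    1 / 3M  ops/sec ~= 333 ns per iteration (coroutine)
    1 / 16M ops/sec ~=  62 ns per iteration (bypass)
    difference      ~= 271 ns per iteration

which is in the right ballpark for one coroutine create plus two enters and
one yield per iteration.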

From another file-read bench, which is the default one:

      just doing open(file), read(fd, buf on stack, 512), sum and close() in each iteration

without using coroutines, operations per second increase by ~20% compared
with using coroutines. When reading 1024 bytes each time, the number still
increases by ~10%. The operations-per-second level is between 200K and 400K,
which should match the IOPS in the dataplane test, and the tests were done
on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).

When reading 8192 or more bytes each time, the difference between using
coroutines and not can no longer be observed clearly.

Surely, the test result depends on how fast the machine is, but even
for a fast machine, I guess a similar result can still be observed by
decreasing the number of bytes read each time.


diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ae64b3d..78c3b60 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -15,6 +15,12 @@ STEXI
 @item bench [-q] [-f @var{fmt]} [-n] [-t @var{cache}] filename
 ETEXI
 
+DEF("co_bench", co_bench,
+    "co_bench -c count -f read_file_name -s read_size -q -b")
+STEXI
+@item co_bench [-c @var{count}] [-f @var{filename}] [-s @var{read_size}] [-b] [-q]
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt]  [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 3e1b7c4..c9c7ac3 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -366,6 +366,138 @@ static int add_old_style_options(const char *fmt, QemuOpts *opts,
     return 0;
 }
 
+struct co_data {
+    const char *file_name;
+    unsigned long sum;
+    int read_size;
+    bool bypass;
+};
+
+static unsigned long file_bench(struct co_data *co)
+{
+    const int size = co->read_size;
+    int fd = open(co->file_name, O_RDONLY);
+    char buf[size];
+    int len, i;
+    unsigned long sum = 0;
+
+    if (fd < 0) {
+        perror("open file failed\n");
+        exit(-1);
+    }
+
+    /* the 1st page should have been in page cache, needn't worry about blocking */
+    len = read(fd, buf, size);
+    if (len != size) {
+        perror("open file failed\n");
+        exit(-1);
+    }
+    close(fd);
+
+    for (i = 0; i < len; i++) {
+        sum += buf[i];
+    }
+
+    return sum;
+}
+
+static void syscall_bench(void *opaque)
+{
+    struct co_data *data = opaque;
+
+#if 0
+    /*
+     * Doing getppid() only shows operations per sec may increase 5
+     * times on my T410 notebook via bypassing coroutines!!!
+     */
+    data->sum += getppid();
+#else
+    /*
+     * open, read 1024 bytes, and close will show a ~10% increase on my
+     * T410 notebook via bypassing coroutines!!!
+     *
+     * open, read 512 bytes, and close will show a ~20% increase on my
+     * T410 notebook via bypassing coroutines!!!
+     *
+     * Below link provides 'perf stat' on several hw events:
+     *
+     *       http://pastebin.com/5s750m8C
+     *
+     * And with bypassing coroutine, dcache loads decreases, insns per
+     * cycle increased 0.7, branch-misses ratio decreases 0.4%, and
+     * dTLB-loads decreases too.
+     */
+    data->sum += file_bench(data);
+#endif
+
+    if (!data->bypass) {
+        qemu_coroutine_yield();
+    }
+}
+
+static int co_bench(int argc, char **argv)
+{
+    int c;
+    unsigned long cnt = 1;
+    int num = 1;
+    unsigned long i;
+    struct co_data data = {
+        .file_name = argv[-1],
+        .sum = 0,
+        .read_size = 1024,
+        .bypass = false,
+    };
+    Coroutine *co, *last_co = NULL;
+    struct timeval t1, t2;
+    unsigned long tv = 0;
+
+    for (;;) {
+        c = getopt(argc, argv, "bc:s:f:");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case 'b':
+            data.bypass = true;
+            break;
+        case 'c':
+            num = atoi(optarg);
+            break;
+        case 's':
+            data.read_size = atoi(optarg);
+            break;
+        case 'f':
+            data.file_name = optarg;
+            break;
+        }
+    }
+
+    printf("%s: iterations %d, bypass: %s, file %s, read_size: %d\n",
+           __func__, num,
+           data.bypass ? "yes" : "no",
+           data.file_name, data.read_size);
+    gettimeofday(&t1, NULL);
+    for (i = 0; i < num * cnt; i++) {
+        if (!data.bypass) {
+            if (last_co) {
+                qemu_coroutine_enter(last_co, NULL);
+            }
+            co = qemu_coroutine_create(syscall_bench);
+            last_co = co;
+            qemu_coroutine_enter(co, &data);
+        } else {
+            syscall_bench(&data);
+        }
+    }
+    gettimeofday(&t2, NULL);
+    tv = (t2.tv_sec - t1.tv_sec) * 1000000 +
+        (t2.tv_usec - t1.tv_usec);
+    printf("\ttotal time: %lums, %5.0fK ops per sec\n", tv / 1000,
+           (double)(cnt * num * 1000) / tv);
+
+    return (int)data.sum;
+}
+
 static int img_create(int argc, char **argv)
 {
     int c;


Thanks,
-- 
Ming Lei

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-10  3:46             ` Ming Lei
@ 2014-08-11 14:03               ` Kevin Wolf
  2014-08-12  7:53                 ` Ming Lei
  2014-08-11 19:37               ` Paolo Bonzini
  1 sibling, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-11 14:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, tom.leiming,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini

On 10.08.2014 at 05:46, Ming Lei wrote:
> Hi Kevin, Paolo, Stefan and all,
> 
> 
> On Wed, 6 Aug 2014 10:48:55 +0200
> Kevin Wolf <kwolf@redhat.com> wrote:
> 
> > On 06.08.2014 at 07:33, Ming Lei wrote:
> 
> > 
> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> > coroutines instead of exiting them, so it can't make any use of the
> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> > version that simply removes the yield at the end):
> > 
> >                 | bypass        | fixed coro    | buggy coro
> > ----------------+---------------+---------------+--------------
> > time            | 1.09s         | 1.10s         | 1.62s
> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> > insns per cycle | 2.39          | 2.39          | 1.90
> > 
> > Begs the question whether you see a similar effect on a real qemu and
> > the coroutine pool is still not big enough? With correct use of
> > coroutines, the difference seems to be barely measurable even without
> > any I/O involved.
> 
> Now I have fixed the coroutine leak bug. The previous crypt bench put a rather
> heavy load on the CPU, which kept operations per second very low (~40K/sec), so
> I wrote a new and simple one which can generate hundreds of thousands of
> operations per second; that number should match some fast storage devices, and
> it does show a non-trivial effect from coroutines.
> 
> In the extreme case where just the getppid() syscall is run in each iteration,
> only 3M operations/sec can be reached with coroutines, while without coroutines
> the number can reach 16M/sec: more than a 4x difference!

I see that you're measuring a lot of things, but the one thing that is
unclear to me is what question those benchmarks are supposed to answer.

Basically I see two different, useful types of benchmark:

1. Look at coroutines in isolation and try to get a directly coroutine-
   related function (like create/destroy or yield/reenter) faster. This
   is what tests/test-coroutine does.

   This is quite good at telling you what costs the coroutine functions
   have and where you need to optimise - without taking the practical
   benefits into account, so it's not suitable for comparison (see the
   sketch below).

2. Look at the whole thing in its realistic environment. This should
   probably involve at least some asynchronous I/O, but ideally use the
   whole block layer. qemu-img bench tries to do this. For being even
   closer to the real environment you'd have to use the virtio-blk code
   as well, which you currently only get with a full VM (perhaps qtest
   could do something interesting here in theory).

   This is good for telling how big the costs are in relation to the
   total workload (and code saved elsewhere) in practice. This is the
   set of tests that can meaningfully be compared to a callback-based
   solution.

Running arbitrary workloads like getppid() or open/read/close isn't as
useful as these. It doesn't isolate the coroutines as well as tests that
run literally nothing else than coroutine functions, and it is too
removed from the actual use case to get the relation between additional
costs, saving and total workload figured out for the real case.
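
For completeness, a type 1 measurement is essentially nothing but a loop of
enter/yield pairs; a sketch in the spirit of tests/test-coroutine (not the
actual test code) looks like this:

    static void coroutine_fn yield_loop(void *opaque)
    {
        unsigned int *counter = opaque;

        while ((*counter) > 0) {
            (*counter)--;
            qemu_coroutine_yield();
        }
    }

    static void perf_yield(void)
    {
        unsigned int i = 10000000;
        Coroutine *co = qemu_coroutine_create(yield_loop);

        /* each pass through this loop is exactly one reenter + one yield */
        while (i > 0) {
            qemu_coroutine_enter(co, &i);
        }
    }

Timing that loop gives you the per-enter/per-yield cost and nothing else.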

> From another file-read bench, which is the default one:
>
>       just doing open(file), read(fd, buf on stack, 512), sum and close() in each iteration
>
> without using coroutines, operations per second increase by ~20% compared
> with using coroutines. When reading 1024 bytes each time, the number still
> increases by ~10%. The operations-per-second level is between 200K and 400K,
> which should match the IOPS in the dataplane test, and the tests were done
> on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).
>
> When reading 8192 or more bytes each time, the difference between using
> coroutines and not can no longer be observed clearly.

All it tells you is that the variation of the workload can make the
coroutine cost disappear in the noise. It doesn't tell you much about
the real use case.

And you're comparing apples and oranges anyway: The real question in
qemu is whether you use coroutines or pass around heap-allocated state
between callbacks. Your benchmark doesn't have a single callback because
it hasn't got any asynchronous operations and doesn't need to allocate
and pass any state.

It does, however, have an unnecessary yield() for the coroutine case
because you felt that the real case is more complex and does yield
(which is true, but it's more complex for both coroutines and
callbacks).

> Surely, the test result should depend on how fast the machine is, but even
> for fast machine, I guess the similar result still can be observed by
> decreasing read bytes each time.

Yes, results looked similar on my laptop. (They just don't tell me
much.)


Let's have a look at some fio results from my laptop:

aggrb KB/s  | master    | coroutine | bypass
------------+-----------+-----------+------------
run 1       | 419934    | 449518    | 445823
run 2       | 444358    | 456365    | 448332
run 3       | 444076    | 455209    | 441552


And here from my lab test box:

aggrb KB/s  | master    | coroutine | bypass
------------+-----------+-----------+------------
run 1       | 25330     | 56378     | 53541
run 2       | 26041     | 55709     | 54136
run 3       | 25811     | 56829     | 49080

The improvement of the bypass patches is barely measurable on my laptop
(if it even exists), whereas it seems to be a pretty big thing for my
lab test box. In any case, the optimised coroutine code seems to beat
the bypass on both machines. (That is for random reads anyway. For
sequential, I get a much larger variation, and on my lab test box bypass
is ahead, whereas on my laptop both are roughly on the same level.)


Another thing I tried is creating the coroutine already in virtio-blk to
avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
the result of my benchmarks there, maybe you have an idea: For random
reads, I see a significant improvement, for sequential however a clear
degradation.

aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
------------+-----------+-----------+------------------------------
seq. read   | 738       | 738       | 694
random read | 442       | 459       | 475

I would appreciate any ideas about what's going on with sequential reads
here and how it can be fixed. Anyway, on my machines, coroutines don't
look like a lost case at all.
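
Concretely, the idea is roughly the following (a simplified sketch of the
approach, not the actual patch; the request fields are abbreviated):

    /* virtio-blk creates the coroutine itself and calls the coroutine
     * version of the block layer API directly, skipping the bdrv_aio_*
     * emulation */
    static void coroutine_fn virtio_blk_rw_co(void *opaque)
    {
        VirtIOBlockReq *req = opaque;
        int ret;

        ret = bdrv_co_readv(req->dev->bs, req->sector_num,
                            req->qiov.size / BDRV_SECTOR_SIZE, &req->qiov);
        virtio_blk_rw_complete(req, ret);
    }

    /* ... in the request handler ... */
    Coroutine *co = qemu_coroutine_create(virtio_blk_rw_co);
    qemu_coroutine_enter(co, req);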

Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-10  3:46             ` Ming Lei
  2014-08-11 14:03               ` Kevin Wolf
@ 2014-08-11 19:37               ` Paolo Bonzini
  2014-08-12  8:12                 ` Ming Lei
                                   ` (3 more replies)
  1 sibling, 4 replies; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-11 19:37 UTC (permalink / raw)
  To: Ming Lei, Kevin Wolf; +Cc: tom.leiming, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 10/08/2014 05:46, Ming Lei wrote:
> Hi Kevin, Paolo, Stefan and all,
> 
> 
> On Wed, 6 Aug 2014 10:48:55 +0200
> Kevin Wolf <kwolf@redhat.com> wrote:
> 
>> On 06.08.2014 at 07:33, Ming Lei wrote:
> 
>>
>> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> coroutines instead of exiting them, so it can't make any use of the
>> coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> version that simply removes the yield at the end):
>>
>>                 | bypass        | fixed coro    | buggy coro
>> ----------------+---------------+---------------+--------------
>> time            | 1.09s         | 1.10s         | 1.62s
>> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> insns per cycle | 2.39          | 2.39          | 1.90
>>
>> Begs the question whether you see a similar effect on a real qemu and
>> the coroutine pool is still not big enough? With correct use of
>> coroutines, the difference seems to be barely measurable even without
>> any I/O involved.
> 
> Now I have fixed the coroutine leak bug. The previous crypt bench put a rather
> heavy load on the CPU, which kept operations per second very low (~40K/sec), so
> I wrote a new and simple one which can generate hundreds of thousands of
> operations per second; that number should match some fast storage devices, and
> it does show a non-trivial effect from coroutines.
> 
> In the extreme case where just the getppid() syscall is run in each iteration,
> only 3M operations/sec can be reached with coroutines, while without coroutines
> the number can reach 16M/sec: more than a 4x difference!

I should be on vacation, but I'm following a couple of threads on the mailing
list and I'm a bit tired of hearing the same argument again and again...

The different characteristics of asynchronous I/O vs. any synchronous workload
are such that it is hard to be sure that microbenchmarks make sense.

The patch below is basically the minimal change to bypass coroutines.  Of course
the block.c part is not acceptable as is (the change to refresh_total_sectors
is broken, the others are just ugly), but it is a start.  Please run it with
your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
benchmark.

Paolo

diff --git a/block.c b/block.c
index 3e252a2..0b6e9cf 100644
--- a/block.c
+++ b/block.c
@@ -704,7 +704,7 @@ static int refresh_total_sectors(BlockDriverState *bs, int64_t hint)
         return 0;
 
     /* query actual device if possible, otherwise just trust the hint */
-    if (drv->bdrv_getlength) {
+    if (!hint && drv->bdrv_getlength) {
         int64_t length = drv->bdrv_getlength(bs);
         if (length < 0) {
             return length;
@@ -2651,9 +2651,6 @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
     if (!bdrv_is_inserted(bs))
         return -ENOMEDIUM;
 
-    if (bs->growable)
-        return 0;
-
     len = bdrv_getlength(bs);
 
     if (offset < 0)
@@ -3107,7 +3104,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
     if (!drv) {
         return -ENOMEDIUM;
     }
-    if (bdrv_check_byte_request(bs, offset, bytes)) {
+    if (!bs->growable && bdrv_check_byte_request(bs, offset, bytes)) {
         return -EIO;
     }
 
@@ -3347,7 +3344,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
     if (bs->read_only) {
         return -EACCES;
     }
-    if (bdrv_check_byte_request(bs, offset, bytes)) {
+    if (!bs->growable && bdrv_check_byte_request(bs, offset, bytes)) {
         return -EIO;
     }
 
@@ -4356,6 +4353,20 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
 {
     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
 
+    if (bs->drv && bs->drv->bdrv_aio_readv &&
+        bs->drv->bdrv_aio_readv != bdrv_aio_readv_em &&
+        nb_sectors >= 0 && nb_sectors <= (UINT_MAX >> BDRV_SECTOR_BITS) &&
+        !bdrv_check_byte_request(bs, sector_num << BDRV_SECTOR_BITS,
+                                 nb_sectors << BDRV_SECTOR_BITS) &&
+        !bs->copy_on_read && !bs->io_limits_enabled &&
+        bs->request_alignment <= BDRV_SECTOR_SIZE) {
+        BlockDriverAIOCB *acb =
+            bs->drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
+                                    cb, opaque);
+        assert(acb);
+        return acb;
+    }
+
     return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
                                  cb, opaque, false);
 }
@@ -4366,6 +4377,24 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
 {
     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
 
+    if (bs->drv && bs->drv->bdrv_aio_writev &&
+        bs->drv->bdrv_aio_writev != bdrv_aio_writev_em &&
+        nb_sectors >= 0 && nb_sectors <= (UINT_MAX >> BDRV_SECTOR_BITS) &&
+        !bdrv_check_byte_request(bs, sector_num << BDRV_SECTOR_BITS,
+                                 nb_sectors << BDRV_SECTOR_BITS) &&
+        !bs->read_only && !bs->io_limits_enabled &&
+        bs->request_alignment <= BDRV_SECTOR_SIZE &&
+        bs->enable_write_cache &&
+        QLIST_EMPTY(&bs->before_write_notifiers.notifiers) &&
+        bs->wr_highest_sector >= sector_num + nb_sectors - 1 &&
+        QLIST_EMPTY(&bs->dirty_bitmaps)) {
+        BlockDriverAIOCB *acb =
+            bs->drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
+                                     cb, opaque);
+        assert(acb);
+        return acb;
+    }
+
     return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
                                  cb, opaque, true);
 }
diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index 492f58d..b86f26b 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -48,6 +48,22 @@ static int raw_reopen_prepare(BDRVReopenState *reopen_state,
     return 0;
 }
 
+static BlockDriverAIOCB *raw_aio_readv(BlockDriverState *bs, int64_t sector_num,
+                                     QEMUIOVector *qiov, int nb_sectors,
+                                    BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
+    return bdrv_aio_readv(bs->file, sector_num, qiov, nb_sectors, cb, opaque);
+}
+
+static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs, int64_t sector_num,
+                                      QEMUIOVector *qiov, int nb_sectors,
+                                     BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
+    return bdrv_aio_writev(bs->file, sector_num, qiov, nb_sectors, cb, opaque);
+}
+
 static int coroutine_fn raw_co_readv(BlockDriverState *bs, int64_t sector_num,
                                      int nb_sectors, QEMUIOVector *qiov)
 {
@@ -181,6 +197,8 @@ static BlockDriver bdrv_raw = {
     .bdrv_open            = &raw_open,
     .bdrv_close           = &raw_close,
     .bdrv_create          = &raw_create,
+    .bdrv_aio_readv       = &raw_aio_readv,
+    .bdrv_aio_writev      = &raw_aio_writev,
     .bdrv_co_readv        = &raw_co_readv,
     .bdrv_co_writev       = &raw_co_writev,
     .bdrv_co_write_zeroes = &raw_co_write_zeroes,

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-11 14:03               ` Kevin Wolf
@ 2014-08-12  7:53                 ` Ming Lei
  2014-08-12 11:40                   ` Kevin Wolf
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-12  7:53 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Mon, Aug 11, 2014 at 10:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 10.08.2014 at 05:46, Ming Lei wrote:
>> Hi Kevin, Paolo, Stefan and all,
>>
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf <kwolf@redhat.com> wrote:
>>
>> > On 06.08.2014 at 07:33, Ming Lei wrote:
>>
>> >
>> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> > coroutines instead of exiting them, so it can't make any use of the
>> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> > version that simply removes the yield at the end):
>> >
>> >                 | bypass        | fixed coro    | buggy coro
>> > ----------------+---------------+---------------+--------------
>> > time            | 1.09s         | 1.10s         | 1.62s
>> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> > insns per cycle | 2.39          | 2.39          | 1.90
>> >
>> > Begs the question whether you see a similar effect on a real qemu and
>> > the coroutine pool is still not big enough? With correct use of
>> > coroutines, the difference seems to be barely measurable even without
>> > any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt bench put a rather
>> heavy load on the CPU, which kept operations per second very low (~40K/sec), so
>> I wrote a new and simple one which can generate hundreds of thousands of
>> operations per second; that number should match some fast storage devices, and
>> it does show a non-trivial effect from coroutines.
>>
>> In the extreme case where just the getppid() syscall is run in each iteration,
>> only 3M operations/sec can be reached with coroutines, while without coroutines
>> the number can reach 16M/sec: more than a 4x difference!
>
> I see that you're measuring a lot of things, but the one thing that is
> unclear to me is what question those benchmarks are supposed to answer.
>
> Basically I see two different, useful types of benchmark:
>
> 1. Look at coroutines in isolation and try to get a directly coroutine-
>    related function (like create/destroy or yield/reenter) faster. This
>    is what tests/test-coroutine does.

Actually tests/test-coroutine does tell us there is a non-trivial cost
introduced by using coroutines, as Paolo computed in his environment[1]:

    - one yield takes 83ns
    - one enter takes 97ns
    - this will introduce an 8.3% cost from using coroutines if the block
      device can reach 300K IOPS, like your case of loop over tmpfs (see
      the arithmetic spelled out below)
    - it may cause a 13.8% cost if the block device can reach 500K IOPS

The cost may show up in IOPS, in CPU utilization, or both, depending on
how fast the CPU is.

The above computation assumes every coroutine allocation hits the pool,
and does not consider the effect of switching stacks. If both are taken
into account, the cost surely becomes bigger.

[1], https://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg01544.html
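
For reference, the arithmetic behind those percentages (my own restatement,
assuming the per-request coroutine work is the two enters plus one yield
counted earlier in this thread):

    per-request overhead ~= 2 * 97ns + 83ns = 277ns
    at 300K IOPS: 277ns / 3333ns per request ~= 8.3%
    at 500K IOPS: 277ns / 2000ns per request ~= 13.8%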

>    This is quite good at telling you what costs the coroutine functions
>    have and where you need to optimise - without taking the practical
>    benefits into account, so it's not suitable for comparison.
>
> 2. Look at the whole thing in its realistic environment. This should
>    probably involve at least some asynchronous I/O, but ideally use the
>    whole block layer. qemu-img bench tries to do this. For being even
>    closer to the real environment you'd have to use the virtio-blk code
>    as well, which you currently only get with a full VM (perhaps qtest
>    could do something interesting here in theory).
>
>    This is good for telling how big the costs are in relation to the
>    total workload (and code saved elsewhere) in practice. This is the
>    set of tests that can meaningfully be compared to a callback-based
>    solution.
>
> Running arbitrary workloads like getppid() or open/read/close isn't as
> useful as these. It doesn't isolate the coroutines as well as tests that
> run literally nothing else than coroutine functions, and it is too
> removed from the actual use case to get the relation between additional
> costs, saving and total workload figured out for the real case.

If you think getppid() doesn't isolate the coroutine, you can just do a nop;
then you will find the cost may reach 90%.  Basically it has nothing to do
with what the load does, and everything to do with how fast the load can
run. The quicker the load, the more cost is introduced by using coroutines;
please see the computation in the link above.

Also, another reason I use getppid() is that:

     After I/O plug&unplug was introduced, bdrv_aio_readv/bdrv_aio_writev
     became much quicker, because most of the time they just queue the I/O
     request into the I/O queue, with no io_submit involved at all. Even
     though coroutine operations take little time (<100ns), they still may
     make a difference compared with the time for queuing the I/O only, at
     least for high-speed I/O, like > 300K IOPS in your case.
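
     The submission pattern there is roughly the following (a simplified
     sketch of the dataplane usage, not the actual code):

         bdrv_io_plug(bs);
         for (i = 0; i < num_reqs; i++) {
             /* usually just appends the request to the plugged queue */
             bdrv_aio_readv(bs, reqs[i].sector, &reqs[i].qiov,
                            reqs[i].nb_sectors, cb, &reqs[i]);
         }
         /* one batched io_submit for the whole queue */
         bdrv_io_unplug(bs);

     so what remains per request is mostly bookkeeping, and an extra couple
     of hundred nanoseconds of coroutine work becomes visible.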

>> From another file-read bench, which is the default one:
>>
>>       just doing open(file), read(fd, buf on stack, 512), sum and close() in each iteration
>>
>> without using coroutines, operations per second increase by ~20% compared
>> with using coroutines. When reading 1024 bytes each time, the number still
>> increases by ~10%. The operations-per-second level is between 200K and 400K,
>> which should match the IOPS in the dataplane test, and the tests were done
>> on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).
>>
>> When reading 8192 or more bytes each time, the difference between using
>> coroutines and not can no longer be observed clearly.
>
> All it tells you is that the variation of the workload can make the
> coroutine cost disappear in the noise. It doesn't tell you much about
> the real use case.

When the cost disappears, the IOPS has already become very small. That
again suggests the coroutine cost matters for the high-speed I/O case.

> And you're comparing apples and oranges anyway: The real question in
> qemu is whether you use coroutines or pass around heap-allocated state
> between callbacks. Your benchmark doesn't have a single callback because
> it hasn't got any asynchronous operations and doesn't need to allocate
> and pass any state.
>
> It does, however, have an unnecessary yield() for the coroutine case
> because you felt that the real case is more complex and does yield
> (which is true, but it's more complex for both coroutines and
> callbacks).
>
>> Surely, the test result should depend on how fast the machine is, but even
>> for fast machine, I guess the similar result still can be observed by
>> decreasing read bytes each time.
>
> Yes, results looked similar on my laptop. (They just don't tell me
> much.)
>
>
> Let's have a look at some fio results from my laptop:
>
> aggrb KB/s  | master    | coroutine | bypass
> ------------+-----------+-----------+------------
> run 1       | 419934    | 449518    | 445823
> run 2       | 444358    | 456365    | 448332
> run 3       | 444076    | 455209    | 441552
>
>
> And here from my lab test box:
>
> aggrb KB/s  | master    | coroutine | bypass
> ------------+-----------+-----------+------------
> run 1       | 25330     | 56378     | 53541
> run 2       | 26041     | 55709     | 54136
> run 3       | 25811     | 56829     | 49080
>
> The improvement of the bypass patches is barely measurable on my laptop
> (if it even exists), whereas it seems to be a pretty big thing for my
> lab test box. In any case, the optimised coroutine code seems to beat
> the bypass on both machines. (That is for random reads anyway. For
> sequential, I get a much larger variation, and on my lab test box bypass
> is ahead, whereas on my laptop both are roughly on the same level.)
>
> Another thing I tried is creating the coroutine already in virtio-blk to
> avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
> the result of my benchmarks there, maybe you have an idea: For random
> reads, I see a significant improvement, for sequential however a clear
> degradation.
>
> aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
> ------------+-----------+-----------+------------------------------
> seq. read   | 738       | 738       | 694
> random read | 442       | 459       | 475
>
> I would appreciate any ideas about what's going on with sequential reads
> here and how it can be fixed. Anyway, on my machines, coroutines don't
> look like a lost case at all.

Firstly I hope you can bypass the coroutine only to do the test, that said, use
same code path except for coroutine operation to observe effect from coroutine.

Secondly, maybe your machine is fast enough that we can't observe the
IOPS difference easily, but there should be a difference in CPU utilization,
since the above computation tells us the coroutine cost does exist. The
faster the block device, the bigger the cost.


Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-11 19:37               ` Paolo Bonzini
@ 2014-08-12  8:12                 ` Ming Lei
  2014-08-12 19:08                   ` Paolo Bonzini
  2014-08-13  8:55                 ` Stefan Hajnoczi
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-12  8:12 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

On Tue, Aug 12, 2014 at 3:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
>> Hi Kevin, Paolo, Stefan and all,
>>
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf <kwolf@redhat.com> wrote:
>>
>>> On 06.08.2014 at 07:33, Ming Lei wrote:
>>
>>>
>>> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>>> coroutines instead of exiting them, so it can't make any use of the
>>> coroutine pool. On my laptop, I get this (where fixed coroutine is a
>>> version that simply removes the yield at the end):
>>>
>>>                 | bypass        | fixed coro    | buggy coro
>>> ----------------+---------------+---------------+--------------
>>> time            | 1.09s         | 1.10s         | 1.62s
>>> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>>> insns per cycle | 2.39          | 2.39          | 1.90
>>>
>>> Begs the question whether you see a similar effect on a real qemu and
>>> the coroutine pool is still not big enough? With correct use of
>>> coroutines, the difference seems to be barely measurable even without
>>> any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt bench put a rather
>> heavy load on the CPU, which kept operations per second very low (~40K/sec), so
>> I wrote a new and simple one which can generate hundreds of thousands of
>> operations per second; that number should match some fast storage devices, and
>> it does show a non-trivial effect from coroutines.
>>
>> In the extreme case where just the getppid() syscall is run in each iteration,
>> only 3M operations/sec can be reached with coroutines, while without coroutines
>> the number can reach 16M/sec: more than a 4x difference!
>
> I should be on vacation, but I'm following a couple threads in the mailing list
> and I'm a bit tired to hear the same argument again and again...

I am sorry to interrupt your vacation and make you tired, but the discussion
isn't simply repeating itself; something new comes up every time, or at
least most of the time.

>
> The different characteristics of asynchronous I/O vs. any synchronous workload
> are such that it is hard to be sure that microbenchmarks make sense.

I don't think it is related to asynchronous vs. synchronous I/O; there
isn't any sleep (or wait for completion) at all, and we can treat it as
AIO by thinking of the completion as a nop in this case (AIO model:
submit and complete).

IMO the getppid() bench is a simple simulation of bdrv_aio_readv/writev()
with I/O plug/unplug, wrt. coroutine usage.

BTW, do you agree with the computation of the coroutine cost in my previous
mail? I don't think that computation is related to the I/O type.

>
> The below patch is basically the minimal change to bypass coroutines.  Of course
> the block.c part is not acceptable as is (the change to refresh_total_sectors
> is broken, the others are just ugly), but it is a start.  Please run it with
> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
> benchmark.

Could you explain why the new change is introduced?

I will hold off on it until we can agree on the coroutine cost computation,
because that is very important for the discussion.

Thank you again for taking the time for this discussion.

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-12  7:53                 ` Ming Lei
@ 2014-08-12 11:40                   ` Kevin Wolf
  2014-08-12 12:14                     ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-12 11:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On 12.08.2014 at 09:53, Ming Lei wrote:
> On Mon, Aug 11, 2014 at 10:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > On 10.08.2014 at 05:46, Ming Lei wrote:
> >> Hi Kevin, Paolo, Stefan and all,
> >>
> >>
> >> On Wed, 6 Aug 2014 10:48:55 +0200
> >> Kevin Wolf <kwolf@redhat.com> wrote:
> >>
> >> > On 06.08.2014 at 07:33, Ming Lei wrote:
> >>
> >> >
> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> >> > coroutines instead of exiting them, so it can't make any use of the
> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> >> > version that simply removes the yield at the end):
> >> >
> >> >                 | bypass        | fixed coro    | buggy coro
> >> > ----------------+---------------+---------------+--------------
> >> > time            | 1.09s         | 1.10s         | 1.62s
> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> >> > insns per cycle | 2.39          | 2.39          | 1.90
> >> >
> >> > Begs the question whether you see a similar effect on a real qemu and
> >> > the coroutine pool is still not big enough? With correct use of
> >> > coroutines, the difference seems to be barely measurable even without
> >> > any I/O involved.
> >>
> >> Now I have fixed the coroutine leak bug. The previous crypt bench put a rather
> >> heavy load on the CPU, which kept operations per second very low (~40K/sec), so
> >> I wrote a new and simple one which can generate hundreds of thousands of
> >> operations per second; that number should match some fast storage devices, and
> >> it does show a non-trivial effect from coroutines.
> >>
> >> In the extreme case where just the getppid() syscall is run in each iteration,
> >> only 3M operations/sec can be reached with coroutines, while without coroutines
> >> the number can reach 16M/sec: more than a 4x difference!
> >
> > I see that you're measuring a lot of things, but the one thing that is
> > unclear to me is what question those benchmarks are supposed to answer.
> >
> > Basically I see two different, useful types of benchmark:
> >
> > 1. Look at coroutines in isolation and try to get a directly coroutine-
> >    related function (like create/destroy or yield/reenter) faster. This
> >    is what tests/test-coroutine does.
> 
> Actually tests/test-coroutine does tell us there is a non-trivial cost
> introduced by using coroutines, as Paolo computed in his environment[1]:
> 
>     - one yield takes 83ns
>     - one enter takes 97ns

Okay so far (haven't checked the numbers, but I'll assume they are
right).

>>     - this will introduce an 8.3% cost from using coroutines if the block
>>       device can reach 300K IOPS, like your case of loop over tmpfs
>>     - it may cause a 13.8% cost if the block device can reach 500K IOPS

Here your argumentation goes downhill. I wrote "coroutines in isolation"
for a reason. Here you're starting to leave the isolation and draw
conclusions from the microbenchmark to the real environment. As if that
wasn't bad enough, you're comparing "using coroutines" to "doing
nothing, but magic happens and we get the right result anyway". This is
not a useful comparison.

> The cost may show up in IOPS, in CPU utilization, or both, depending on
> how fast the CPU is.
> 
> The above computation assumes every coroutine allocation hits the pool,
> and does not consider the effect of switching stacks. If both are taken
> into account, the cost surely becomes bigger.
> 
> [1], https://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg01544.html
> 
> >    This is quite good at telling you what costs the coroutine functions
> >    have and where you need to optimise - without taking the practical
> >    benefits into account, so it's not suitable for comparison.
> >
> > 2. Look at the whole thing in its realistic environment. This should
> >    probably involve at least some asynchronous I/O, but ideally use the
> >    whole block layer. qemu-img bench tries to do this. For being even
> >    closer to the real environment you'd have to use the virtio-blk code
> >    as well, which you currently only get with a full VM (perhaps qtest
> >    could do something interesting here in theory).
> >
> >    This is good for telling how big the costs are in relation to the
> >    total workload (and code saved elsewhere) in practice. This is the
> >    set of tests that can meaningfully be compared to a callback-based
> >    solution.
> >
> > Running arbitrary workloads like getppid() or open/read/close isn't as
> > useful as these. It doesn't isolate the coroutines as well as tests that
> > run literally nothing else than coroutine functions, and it is too
> > removed from the actual use case to get the relation between additional
> > costs, saving and total workload figured out for the real case.
> 
> If you think getppid() doesn't isolate the coroutine, you can just do a nop;
> then you will find the cost may reach 90%.  Basically it has nothing to do
> with what the load does, and everything to do with how fast the load can
> run. The quicker the load, the more cost is introduced by using coroutines;
> please see the computation in the link above.

Correct, the arbitrary load is only blurring the result. If you want to
measure purely the cost of coroutine functions in isolation (but then
treat it as such!), tests/test-coroutine is your tool.

> Also, another reason I use getppid() is that:
> 
>      After I/O plug&unplug was introduced, bdrv_aio_readv/bdrv_aio_writev
>      became much quicker, because most of the time they just queue the I/O
>      request into the I/O queue, with no io_submit involved at all. Even
>      though coroutine operations take little time (<100ns), they still may
>      make a difference compared with the time for queuing the I/O only, at
>      least for high-speed I/O, like > 300K IOPS in your case.

You still go the whole path through the block layer (and back when the
request is completed).

> >> From another file-read bench, which is the default one:
> >>
> >>       just doing open(file), read(fd, buf on stack, 512), sum and close() in each iteration
> >>
> >> without using coroutines, operations per second increase by ~20% compared
> >> with using coroutines. When reading 1024 bytes each time, the number still
> >> increases by ~10%. The operations-per-second level is between 200K and 400K,
> >> which should match the IOPS in the dataplane test, and the tests were done
> >> on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).
> >>
> >> When reading 8192 or more bytes each time, the difference between using
> >> coroutines and not can no longer be observed clearly.
> >
> > All it tells you is that the variation of the workload can make the
> > coroutine cost disappear in the noise. It doesn't tell you much about
> > the real use case.
> 
> When the cost disappears, the IOPS has already become very small. That
> again suggests the coroutine cost matters for the high-speed I/O case.
> 
> > And you're comparing apples and oranges anyway: The real question in
> > qemu is whether you use coroutines or pass around heap-allocated state
> > between callbacks. Your benchmark doesn't have a single callback because
> > it hasn't got any asynchronous operations and doesn't need to allocate
> > and pass any state.
> >
> > It does, however, have an unnecessary yield() for the coroutine case
> > because you felt that the real case is more complex and does yield
> > (which is true, but it's more complex for both coroutines and
> > callbacks).
> >
> >> Surely, the test result should depend on how fast the machine is, but even
> >> for fast machine, I guess the similar result still can be observed by
> >> decreasing read bytes each time.
> >
> > Yes, results looked similar on my laptop. (They just don't tell me
> > much.)
> >
> >
> > Let's have a look at some fio results from my laptop:
> >
> > aggrb KB/s  | master    | coroutine | bypass
> > ------------+-----------+-----------+------------
> > run 1       | 419934    | 449518    | 445823
> > run 2       | 444358    | 456365    | 448332
> > run 3       | 444076    | 455209    | 441552
> >
> >
> > And here from my lab test box:
> >
> > aggrb KB/s  | master    | coroutine | bypass
> > ------------+-----------+-----------+------------
> > run 1       | 25330     | 56378     | 53541
> > run 2       | 26041     | 55709     | 54136
> > run 3       | 25811     | 56829     | 49080
> >
> > The improvement of the bypass patches is barely measurable on my laptop
> > (if it even exists), whereas it seems to be a pretty big thing for my
> > lab test box. In any case, the optimised coroutine code seems to beat
> > the bypass on both machines. (That is for random reads anyway. For
> > sequential, I get a much larger variation, and on my lab test box bypass
> > is ahead, whereas on my laptop both are roughly on the same level.)
> >
> > Another thing I tried is creating the coroutine already in virtio-blk to
> > avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
> > the result of my benchmarks there, maybe you have an idea: For random
> > reads, I see a significant improvement, for sequential however a clear
> > degradation.
> >
> > aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
> > ------------+-----------+-----------+------------------------------
> > seq. read   | 738       | 738       | 694
> > random read | 442       | 459       | 475
> >
> > I would appreciate any ideas about what's going on with sequential reads
> > here and how it can be fixed. Anyway, on my machines, coroutines don't
> > look like a lost case at all.
> 
> Firstly I hope you can bypass the coroutine only to do the test, that said, use
> same code path except for coroutine operation to observe effect from coroutine.

Sorry, I'm having a hard time parsing this sentence.

> Secondly, maybe your machine is fast enough that we can't observe the
> IOPS difference easily, but there should be a difference in CPU utilization,
> since the above computation tells us the coroutine cost does exist. The
> faster the block device, the bigger the cost.

You are constantly ignoring the fact that the AIO callback-oriented
style doesn't have zero cost either. Please stop doing this, otherwise
any discussion is useless.

If you want to benchmark in a scenario that is closer to what actually
happens in qemu, but still doesn't involve I/O to slow things down, try
the following patch. The operation there isn't synchronous any more, but
an external event (like I/O completion) is emulated by scheduling a BH.
The AIO/coroutine usage patterns in the code are the same as you can
find in the real block layer.

For me, in this benchmark callbacks are still a little faster, but you
don't get absurd numbers like you claimed above (because suddenly it's
clear that the AIO path has its costs, too).

Kevin


diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index d029609..3c2659b 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -9,6 +9,12 @@ STEXI
 @table @option
 ETEXI
 
+DEF("co_bench", img_co_bench,
+    "co_bench [-c count]")
+STEXI
+@item co_bench [-c @var{count}]
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt]  [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index d4518e7..b2d9e56 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2789,6 +2789,173 @@ out:
     return 0;
 }
 
+static QEMUBH *external_event;
+
+typedef struct BenchACB {
+    BlockDriverAIOCB common;
+    QEMUBH *bh;
+    int result;
+    int i;
+    int counter;
+} BenchACB;
+
+static void external_event_aio_bh(void* opaque);
+
+static void bench_cb(void *opaque, int ret)
+{
+    BenchACB *acb = opaque;
+
+    acb->result += acb->i;
+
+    if (++acb->counter == 2) {
+        qemu_bh_schedule(acb->bh);
+    } else {
+        external_event = qemu_bh_new(external_event_aio_bh, acb);
+        qemu_bh_schedule(external_event);
+    }
+}
+
+static void external_event_aio_bh(void* opaque)
+{
+    qemu_bh_delete(external_event);
+    bench_cb(opaque, 0);
+}
+
+static void bdrv_aio_completion_bh(void* opaque)
+{
+    BenchACB *acb = opaque;
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+    acb->common.cb(acb->common.opaque, acb->result);
+    qemu_aio_release(acb);
+}
+
+static const AIOCBInfo bench_aiocb_info = {
+    .aiocb_size         = sizeof(BenchACB),
+    .cancel             = NULL,
+};
+
+static BlockDriverAIOCB *bench_aio(int i,
+                                   BlockDriverCompletionFunc *cb,
+                                   void *opaque)
+{
+    BenchACB *acb = qemu_aio_get(&bench_aiocb_info, NULL, cb, opaque);
+    acb->counter = 0;
+    acb->result = 0;
+    acb->i = i;
+    acb->bh = qemu_bh_new(bdrv_aio_completion_bh, acb);
+
+    bench_cb(acb, 0);
+    return &acb->common;
+}
+
+static void bench_aio_completed(void *opaque, int ret)
+{
+    int *result = opaque;
+    *result = ret;
+}
+
+typedef struct BenchCo {
+    Coroutine *co;
+    int i;
+    int *result;
+} BenchCo;
+
+static void external_event_co_bh(void* opaque)
+{
+    Coroutine *co = opaque;
+    qemu_coroutine_enter(co, NULL);
+}
+
+static void bench_co(void *opaque)
+{
+    BenchCo *b = opaque;
+    int result = 0;
+
+    result += b->i;
+
+    external_event = qemu_bh_new(external_event_co_bh, b->co);
+    qemu_bh_schedule(external_event);
+    qemu_coroutine_yield();
+    qemu_bh_delete(external_event);
+
+    result += b->i;
+
+    *b->result = result;
+}
+
+static int img_co_bench(int argc, char **argv)
+{
+    int c;
+    bool bypass = false;
+    int count = 10000000;
+    struct timeval t1, t2;
+    int i;
+
+    for (;;) {
+        c = getopt(argc, argv, "hbc:");
+        if (c == -1) {
+            break;
+        }
+
+        switch (c) {
+        case 'h':
+        case '?':
+            help();
+            break;
+        case 'b':
+            bypass = true;
+            break;
+        case 'c':
+        {
+            char *end;
+            errno = 0;
+            count = strtoul(optarg, &end, 0);
+            if (errno || *end || count > INT_MAX) {
+                error_report("Invalid iteration count specified");
+                return 1;
+            }
+            break;
+        }
+        }
+    }
+
+    printf("Doing %d iterations (%s)\n", count,
+           bypass ? "callbacks" : "coroutines");
+
+    gettimeofday(&t1, NULL);
+
+    for (i = 0; i < count; i++) {
+        int result = -EINPROGRESS;
+
+        if (bypass) {
+            bench_aio(i, bench_aio_completed, &result);
+        } else {
+            Coroutine *co = qemu_coroutine_create(bench_co);
+            BenchCo b = {
+                .co = co,
+                .i = i,
+                .result = &result,
+            };
+            qemu_coroutine_enter(co, &b);
+        }
+
+        while (result == -EINPROGRESS) {
+            main_loop_wait(false);
+        }
+
+        assert(result == i * 2);
+    }
+    gettimeofday(&t2, NULL);
+
+    printf("Run completed in %3.3f seconds.\n",
+           (t2.tv_sec - t1.tv_sec)
+           + ((double)(t2.tv_usec - t1.tv_usec) / 1000000));
+
+    return 0;
+}
+
+
 static const img_cmd_t img_cmds[] = {
 #define DEF(option, callback, arg_string)        \
     { option, callback },

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-12 11:40                   ` Kevin Wolf
@ 2014-08-12 12:14                     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-12 12:14 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Peter Maydell, Fam Zheng, Michael S. Tsirkin, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Tue, Aug 12, 2014 at 7:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 12.08.2014 um 09:53 hat Ming Lei geschrieben:
>> On Mon, Aug 11, 2014 at 10:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 10.08.2014 um 05:46 hat Ming Lei geschrieben:
>> >> Hi Kevin, Paolo, Stefan and all,
>> >>
>> >>
>> >> On Wed, 6 Aug 2014 10:48:55 +0200
>> >> Kevin Wolf <kwolf@redhat.com> wrote:
>> >>
>> >> > Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
>> >>
>> >> >
>> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> >> > coroutines instead of exiting them, so it can't make any use of the
>> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> > version that simply removes the yield at the end):
>> >> >
>> >> >                 | bypass        | fixed coro    | buggy coro
>> >> > ----------------+---------------+---------------+--------------
>> >> > time            | 1.09s         | 1.10s         | 1.62s
>> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> > insns per cycle | 2.39          | 2.39          | 1.90
>> >> >
>> >> > Begs the question whether you see a similar effect on a real qemu and
>> >> > the coroutine pool is still not big enough? With correct use of
>> >> > coroutines, the difference seems to be barely measurable even without
>> >> > any I/O involved.
>> >>
>> >> Now I have fixed the coroutine leak bug. The previous crypt bench put a
>> >> rather heavy load on the CPU and kept operations per second very low
>> >> (~40K/sec), so I wrote a new, simpler one that can generate hundreds of
>> >> thousands of operations per second; that number should match some fast
>> >> storage devices, and it does show a non-trivial cost from coroutines.
>> >>
>> >> In the extreme case where just a getppid() syscall is run in each
>> >> iteration, only 3M operations/sec can be reached with coroutines, while
>> >> without coroutines the number reaches 16M/sec, more than a 4x difference!
>> >
>> > I see that you're measuring a lot of things, but the one thing that is
>> > unclear to me is what question those benchmarks are supposed to answer.
>> >
>> > Basically I see two different, useful types of benchmark:
>> >
>> > 1. Look at coroutines in isolation and try to get a directly coroutine-
>> >    related function (like create/destroy or yield/reenter) faster. This
>> >    is what tests/test-coroutine does.
>>
>> Actually tests/test-coroutine does tell us there is a non-trivial cost
>> introduced by using coroutines, per Paolo's computation in his environment[1]:
>>
>>     - one yield takes 83ns
>>     - one enter takes 97ns
>
> Okay so far (haven't checked the numbers, but I'll assume they are
> right).
>
>>     - this will introduce 8.3% cost by using coroutines if the block
>>       device can reach 300K IOPS, like your case of loop over tmpfs
>>     - it may cause 13.8% cost if the block device can reach 500K IOPS
>
> Here your argumentation goes downhill. I wrote "coroutines in isolation"
> for a reason. Here you're starting to leave the isolation and draw
> conclusions from the microbenchmark to the real environment. As if that

The conclusion only depends on how much time yield() and enter() take.

Compared with the test environment, the time that yield() and enter()
take should not decrease in a real environment, should it?

That is why I drew the conclusion; if you think the reasoning isn't
correct, please point it out explicitly.

> wasn't bad enough, you're comparing "using coroutines" to "doing
> nothing, but magic happens and we get the right result anyway". This is
> not a useful comparison.

It isn't related to what coroutine->entry() does, and the conclusion
only depends on how much time yield() and enter() take, as I said.

>
>> The cost may show up in IOPS, or in CPU utilization, or both, depending
>> on how fast the CPU is.
>>
>> The above computation supposes all coroutine allocations hit the pool,
>> and does not consider the effect of switching stacks. If both are
>> considered, the cost surely becomes larger.
>>
>> [1], https://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg01544.html
>>
>> >    This is quite good at telling you what costs the coroutine functions
>> >    have and where you need to optimise - without taking the pratical
>> >    benefits into account, so it's not suitable for comparison.
>> >
>> > 2. Look at the whole thing in its realistic environment. This should
>> >    probably involve at least some asynchronous I/O, but ideally use the
>> >    whole block layer. qemu-img bench tries to do this. For being even
>> >    closer to the real environment you'd have to use the virtio-blk code
>> >    as well, which you currently only get with a full VM (perhaps qtest
>> >    could do something interesting here in theory).
>> >
>> >    This is good for telling how big the costs are in relation to the
>> >    total workload (and code saved elsewhere) in practice. This is the
>> >    set of tests that can meaningfully be compared to a callback-based
>> >    solution.
>> >
>> > Running arbitrary workloads like getppid() or open/read/close isn't as
>> > useful as these. It doesn't isolate the coroutines as well as tests that
>> > run literally nothing else than coroutine functions, and it is too
>> > removed from the actual use case to get the relation between additional
>> > costs, saving and total workload figured out for the real case.
>>
>> If you think getppid() doesn't isolate the coroutine, you can just do a
>> nop instead; then you will find the cost may reach 90%.  Basically it has
>> nothing to do with what the load does, and it is much more related to how
>> fast the load can run. The quicker the load, the more cost is introduced
>> by using coroutines; please see the computation in the link above.
>
> Correct, the arbitrary load is only blurring the result. If you want to
> measure purely the cost of coroutine functions in isolation (but then
> treat it as such!), tests/test-coroutine is your tool.

Yes, the tool tells us that yield() and enter() take about 83ns and
97ns respectively, and that is enough to draw the conclusion.

>
>> Another reason I use getppid() is that:
>>
>>      After I/O plug & unplug was introduced, bdrv_aio_readv/bdrv_aio_writev
>>      became much quicker, because most of the time they just put the I/O
>>      request into the I/O queue, with no io_submit involved at all. Even
>>      though coroutine operations take little time (<100ns), they may still
>>      make a difference compared with the time for queuing I/O only, at
>>      least for high-speed I/O like the > 300K IOPS in your case.
>
> You still go the whole path through the block layer (and back when the
> request is completed).

Yes, but that should have been not much computation, since what the
previous dataplane did is just write the request address into io queue,
and we should think it takes little time.

>
>> >> From another file-read bench, which is the default one:
>> >>
>> >>       just doing open(file), read(fd, buf on stack, 512), sum and
>> >>       close() in each iteration
>> >>
>> >> without using coroutines, operations per second can increase by ~20%
>> >> compared with using coroutines. When reading 1024 bytes each time, the
>> >> number can still increase by ~10%. The operations-per-second level is
>> >> between 200K and 400K per second, which should match the IOPS in the
>> >> dataplane test; the tests were done on my Lenovo T410 notebook (CPU:
>> >> 2.6GHz, dual core, four threads).
>> >>
>> >> When reading 8192 or more bytes each time, the difference between using
>> >> coroutines and not can't be clearly observed.
>> >
>> > All it tells you is that variation of the workload can make the
>> > coroutine cost disappear in the noise. It doesn't tell you much about
>> > the real use case.
>>
>> When the cost disappears, the IOPS has already become very small. That
>> also shows that coroutines may not fit the high-speed I/O case.
>>
>> > And you're comparing apples and oranges anyway: The real question in
>> > qemu is whether you use coroutines or pass around heap-allocated state
>> > between callbacks. Your benchmark doesn't have a single callback because
>> > it hasn't got any asynchronous operations and doesn't need to allocate
>> > and pass any state.
>> >
>> > It does, however, have an unnecessary yield() for the coroutine case
>> > because you felt that the real case is more complex and does yield
>> > (which is true, but it's more complex for both coroutines and
>> > callbacks).
>> >
>> >> Surely the test result depends on how fast the machine is, but even
>> >> on a fast machine I guess a similar result can still be observed by
>> >> decreasing the bytes read each time.
>> >
>> > Yes, results looked similar on my laptop. (They just don't tell me
>> > much.)
>> >
>> >
>> > Let's have a look at some fio results from my laptop:
>> >
>> > aggrb KB/s  | master    | coroutine | bypass
>> > ------------+-----------+-----------+------------
>> > run 1       | 419934    | 449518    | 445823
>> > run 2       | 444358    | 456365    | 448332
>> > run 3       | 444076    | 455209    | 441552
>> >
>> >
>> > And here from my lab test box:
>> >
>> > aggrb KB/s  | master    | coroutine | bypass
>> > ------------+-----------+-----------+------------
>> > run 1       | 25330     | 56378     | 53541
>> > run 2       | 26041     | 55709     | 54136
>> > run 3       | 25811     | 56829     | 49080
>> >
>> > The improvement of the bypass patches is barely measurable on my laptop
>> > (if it even exists), whereas it seems to be a pretty big thing for my
>> > lab test box. In any case, the optimised coroutine code seems to beat
>> > the bypass on both machines. (That is for random reads anyway. For
>> > sequential, I get a much larger variation, and on my lab test box bypass
>> > is ahead, whereas on my laptop both are roughly on the same level.)
>> >
>> > Another thing I tried is creating the coroutine already in virtio-blk to
>> > avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
>> > the result of my benchmarks there, maybe you have an idea: For random
>> > reads, I see a significant improvement, for sequential however a clear
>> > degradation.
>> >
>> > aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
>> > ------------+-----------+-----------+------------------------------
>> > seq. read   | 738       | 738       | 694
>> > random read | 442       | 459       | 475
>> >
>> > I would appreciate any ideas about what's going on with sequential reads
>> > here and how it can be fixed. Anyway, on my machines, coroutines don't
>> > look like a lost case at all.
>>
>> Firstly I hope you can bypass the coroutine only to do the test, that said, use
>> same code path except for coroutine operation to observe effect from coroutine.
>
> Sorry, I'm having a hard time parsing this sentence.

In your previous test, you added linux-aio bdrv_co_readv()/writev() in
your patch, and you compared it against the previous bypass patch,
which bypasses coroutines in another path.

>
>> Secondly, maybe your machine is fast enough that we can't observe the
>> IOPS difference easily, but there should be a difference in CPU
>> utilization, since the above computation tells us the coroutine cost does
>> exist. The faster the block device, the bigger the cost.
>
> You are constantly ignoring the fact that the AIO callback-oriented
> style doesn't have zero cost either. Please stop doing this, otherwise
> any discussion is useless.

No, I didn't ignore it; the average time for handling one I/O already
includes the completion time. That is why I said it takes about 3.333us
to handle one I/O if the device is capable of 300K IOPS; the time taken
by two enter() calls and one yield() does make a difference against
that 3.333us, since they are required for every I/O.

Also, bypassing coroutines does _not_ increase the time taken to run
the AIO callback, does it?
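
For concreteness, the arithmetic can be checked with a trivial
standalone program (just my own sketch, plugging in the 83ns yield /
97ns enter figures quoted from tests/test-coroutine earlier):

#include <stdio.h>

int main(void)
{
    /* two enter() calls plus one yield() per request, using the
     * 97ns/83ns measurements quoted above */
    const double overhead_ns = 2 * 97 + 83;
    const double iops[] = { 300000.0, 500000.0 };

    for (int i = 0; i < 2; i++) {
        double per_req_ns = 1e9 / iops[i];   /* ~3333ns at 300K IOPS */
        printf("%.0fK IOPS: %.0f ns/request, overhead %.1f%%\n",
               iops[i] / 1000, per_req_ns,
               100.0 * overhead_ns / per_req_ns);
    }
    return 0;
}

This reproduces the ~8.3% and ~13.8% figures computed earlier in the
thread (modulo rounding).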

>
> If you want to benchmark in a scenario that is closer to what actually
> happens in qemu, but still doesn't involve I/O to slow things down, try
> the following patch. The operation there isn't synchronous any more, but
> an external event (like I/O completion) is emulated by scheduling a BH.
> The AIO/coroutine usage patterns in the code are the same as you can
> find in the real block layer.
>
> For me, in this benchmark callbacks are still a little faster, but you
> don't get absurd numbers like you claimed above (because suddenly it's
> clear that the AIO path has its costs, too).
>
> Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-12  8:12                 ` Ming Lei
@ 2014-08-12 19:08                   ` Paolo Bonzini
  2014-08-13  9:54                     ` Kevin Wolf
  2014-08-13 10:19                     ` Ming Lei
  0 siblings, 2 replies; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-12 19:08 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

Il 12/08/2014 10:12, Ming Lei ha scritto:
>> > The below patch is basically the minimal change to bypass coroutines.  Of course
>> > the block.c part is not acceptable as is (the change to refresh_total_sectors
>> > is broken, the others are just ugly), but it is a start.  Please run it with
>> > your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>> > benchmark.
> Could you explain why the new change is introduced?

It provides a fast path for bdrv_aio_readv/writev whenever there is
nothing to do after the driver routine returns.  In this case there is
no need to wrap the AIOCB returned by the driver routine.

It doesn't go all the way, and in particular it doesn't reverse
completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
 But it is enough to provide something that is not dataplane-specific,
does not break various functionality that we need to add to dataplane
virtio-blk, does not mess up the semantics of the block layer, and lets
you run benchmarks.

> I will hold it until we can align to the coroutine cost computation,
> because it is very important for the discussion.

First of all, note that the coroutine cost is totally pointless in the
discussion unless you have 100% CPU time and the dataplane thread
becomes CPU bound.  You haven't said if this is the case.

Second, if the coroutine cost is relevant, the profile is really too
flat to do much about it.  The only solution (and here I *think* I
disagree slightly with Kevin) is to get rid of it, which is not even too
hard to do.

The problem is that your patches touch too much code and subtly
break too much stuff.  The one I wrote does have a little breakage
because I don't understand bs->growable 100% and I didn't really put
much effort into it (my deadline being basically "be done as soon as the
shower is free"), and it is ugly as hell, _but_ it should be compatible
with the way the block layer works.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-11 19:37               ` Paolo Bonzini
  2014-08-12  8:12                 ` Ming Lei
@ 2014-08-13  8:55                 ` Stefan Hajnoczi
  2014-08-13 11:43                 ` Ming Lei
  2014-08-14 10:46                 ` Kevin Wolf
  3 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2014-08-13  8:55 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, tom.leiming, Ming Lei, Fam Zheng, qemu-devel


On Mon, Aug 11, 2014 at 09:37:01PM +0200, Paolo Bonzini wrote:
> Il 10/08/2014 05:46, Ming Lei ha scritto:
> @@ -4356,6 +4353,20 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>  {
>      trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>  
> +    if (bs->drv && bs->drv->bdrv_aio_readv &&
> +        bs->drv->bdrv_aio_readv != bdrv_aio_readv_em &&
> +        nb_sectors >= 0 && nb_sectors <= (UINT_MAX >> BDRV_SECTOR_BITS) &&
> +        !bdrv_check_byte_request(bs, sector_num << BDRV_SECTOR_BITS,
> +                                 nb_sectors << BDRV_SECTOR_BITS) &&
> +        !bs->copy_on_read && !bs->io_limits_enabled &&
> +        bs->request_alignment <= BDRV_SECTOR_SIZE) {
> +        BlockDriverAIOCB *acb =
> +            bs->drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
> +                                    cb, opaque);
> +        assert(acb);

Minor issue:

block.h:bdrv_aio_readv() guarantees the return value is non-NULL but
BlockDriver->bdrv_aio_readv() does not.

The floppy disk (fd_open()) code path in raw-posix.c can return NULL so
we would need to return a BlockDriverAIOCB and set up a BH that will
complete with -EIO.
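
Roughly something like the following, modelled on the BH completion in
the benchmark patch earlier in the thread (untested sketch; the EIOACB
and eio_aiocb_info names are made up for illustration):

typedef struct EIOACB {
    BlockDriverAIOCB common;
    QEMUBH *bh;
} EIOACB;

static void bdrv_aio_eio_bh(void *opaque)
{
    EIOACB *acb = opaque;

    qemu_bh_delete(acb->bh);
    acb->common.cb(acb->common.opaque, -EIO);
    qemu_aio_release(acb);
}

static const AIOCBInfo eio_aiocb_info = {
    .aiocb_size = sizeof(EIOACB),
};

    /* in the fast path, after calling the driver: */
    acb = bs->drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                                  cb, opaque);
    if (acb == NULL) {
        EIOACB *eacb = qemu_aio_get(&eio_aiocb_info, bs, cb, opaque);

        eacb->bh = qemu_bh_new(bdrv_aio_eio_bh, eacb);
        qemu_bh_schedule(eacb->bh);
        acb = &eacb->common;
    }
    return acb;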

Stefan


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-12 19:08                   ` Paolo Bonzini
@ 2014-08-13  9:54                     ` Kevin Wolf
  2014-08-13 13:16                       ` Paolo Bonzini
  2014-08-13 10:19                     ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Kevin Wolf @ 2014-08-13  9:54 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi

Am 12.08.2014 um 21:08 hat Paolo Bonzini geschrieben:
> Il 12/08/2014 10:12, Ming Lei ha scritto:
> >> > The below patch is basically the minimal change to bypass coroutines.  Of course
> >> > the block.c part is not acceptable as is (the change to refresh_total_sectors
> >> > is broken, the others are just ugly), but it is a start.  Please run it with
> >> > your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
> >> > benchmark.
> > Could you explain why the new change is introduced?
> 
> It provides a fast path for bdrv_aio_readv/writev whenever there is
> nothing to do after the driver routine returns.  In this case there is
> no need to wrap the AIOCB returned by the driver routine.
> 
> It doesn't go all the way, and in particular it doesn't reverse
> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.

That's actually why I think it's an option. Remember that, like you say
below, we're optimising for an extreme case here, and I certainly don't
want to hurt the common case for it. I can't imagine a way of reversing
the roles without multiplying the cost for the coroutine path.

Or do you have a clever solution for how you'd go about it without
having an impact on the common case?

>  But it is enough to provide something that is not dataplane-specific,
> does not break various functionality that we need to add to dataplane
> virtio-blk, does not mess up the semantics of the block layer, and lets
> you run benchmarks.
> 
> > I will hold it until we can align to the coroutine cost computation,
> > because it is very important for the discussion.
> 
> First of all, note that the coroutine cost is totally pointless in the
> discussion unless you have 100% CPU time and the dataplane thread
> becomes CPU bound.  You haven't said if this is the case.

That's probably the implicit assumption. As I said, it's an extreme
case we're trying to look at. I'm not sure how realistic it is when you
don't work with ramdisks...

> Second, if the coroutine cost is relevant, the profile is really too
> flat to do much about it.  The only solution (and here I *think* I
> disagree slightly with Kevin) is to get rid of it, which is not even too
> hard to do.

I think we just need to make the best use of coroutines. I would really
love to show you numbers, but I'm having a hard time benchmarking all
this stuff. When I test only the block layer with 'qemu-img bench', I
clearly have working optimisations, but it doesn't translate yet into
clear improvements for actual guests. I think other things in the way
from the guest to qemu slow it down so that in the end the coroutine
part doesn't matter much any more.

By the way, I just noticed that sequential reads were significantly
faster (~25%) for me without dataplane than with it. I didn't expect to
gain anything with dataplane on this setup, but certainly not to lose
that much. There might be more to gain there than by optimising or
removing coroutines.

> The problem is that your patches touch too much code and subtly
> break too much stuff.  The one I wrote does have a little breakage
> because I don't understand bs->growable 100% and I didn't really put
> much effort into it (my deadline being basically "be done as soon as the
> shower is free"), and it is ugly as hell, _but_ it should be compatible
> with the way the block layer works.

Yes, your patch is definitely much more palatable than Ming's. The part
that I still don't like about it is that it would be stating "in the
common case, we're only doing the second best thing". I'm not yet
convinced that coroutines necessarily perform worse than state-passing
callbacks.

Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-12 19:08                   ` Paolo Bonzini
  2014-08-13  9:54                     ` Kevin Wolf
@ 2014-08-13 10:19                     ` Ming Lei
  2014-08-13 12:35                       ` Paolo Bonzini
  1 sibling, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-13 10:19 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

On Wed, Aug 13, 2014 at 3:08 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 12/08/2014 10:12, Ming Lei ha scritto:
>>> > The below patch is basically the minimal change to bypass coroutines.  Of course
>>> > the block.c part is not acceptable as is (the change to refresh_total_sectors
>>> > is broken, the others are just ugly), but it is a start.  Please run it with
>>> > your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>>> > benchmark.
>> Could you explain why the new change is introduced?
>
> It provides a fast path for bdrv_aio_readv/writev whenever there is
> nothing to do after the driver routine returns.  In this case there is
> no need to wrap the AIOCB returned by the driver routine.
>
> It doesn't go all the way, and in particular it doesn't reverse
> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
>  But it is enough to provide something that is not dataplane-specific,
> does not break various functionality that we need to add to dataplane
> virtio-blk, does not mess up the semantics of the block layer, and lets
> you run benchmarks.
>
>> I will hold it until we can align to the coroutine cost computation,
>> because it is very important for the discussion.
>
> First of all, note that the coroutine cost is totally pointless in the
> discussion unless you have 100% CPU time and the dataplane thread
> becomes CPU bound.  You haven't said if this is the case.

No, it does make sense, especially for high-speed block devices.

In my test, the CPU is close to 100%; otherwise block throughput
would not have been affected.

Bypassing coroutines can also decrease CPU utilization when the CPU
isn't at 100%.

It is also related to CPU speed: on a slow machine, running coroutines
may introduce noticeable load, especially for high-IOPS block devices.

>
> Second, if the coroutine cost is relevant, the profile is really too

I have written a patch to measure the coroutine cost, which shows it
clearly, and you should be in the Cc list.

> flat to do much about it.  The only solution (and here I *think* I
> disagree slightly with Kevin) is to get rid of it, which is not even too
> hard to do.

I agree.

But it depends on how the coroutines are used. If the function run by
the coroutine isn't called very frequently, the effect of the coroutine
can be ignored.

For block devices that can reach hundreds of thousands of IOPS, as far
as I can see, the only solution is to not use coroutines in this case.

That is why I wrote the bypass-coroutine patch.

>
> The problem is that your patches touch too much code and subtly
> break too much stuff.  The one I wrote does have a little breakage

Could you give a hint about which things are broken? Last time you
mentioned that virtio-scsi needs to keep the AIOCB alive after
returning; I have fixed that in V1.

> because I don't understand bs->growable 100% and I didn't really put
> much effort into it (my deadline being basically "be done as soon as the
> shower is free"), and it is ugly as hell, _but_ it should be compatible
> with the way the block layer works.

I will take a careful look at your patch later.

If the coroutine is still there, I think it can still slow down performance.

Thanks,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-11 19:37               ` Paolo Bonzini
  2014-08-12  8:12                 ` Ming Lei
  2014-08-13  8:55                 ` Stefan Hajnoczi
@ 2014-08-13 11:43                 ` Ming Lei
  2014-08-13 12:35                   ` Paolo Bonzini
  2014-08-14 10:46                 ` Kevin Wolf
  3 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-13 11:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

Hi Paolo,

On Tue, Aug 12, 2014 at 3:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 10/08/2014 05:46, Ming Lei ha scritto:
>> Hi Kevin, Paolo, Stefan and all,
>>
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf <kwolf@redhat.com> wrote:
>>
>>> Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
>>
>>>
>>> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>>> coroutines instead of exiting them, so it can't make any use of the
>>> coroutine pool. On my laptop, I get this (where fixed coroutine is a
>>> version that simply removes the yield at the end):
>>>
>>>                 | bypass        | fixed coro    | buggy coro
>>> ----------------+---------------+---------------+--------------
>>> time            | 1.09s         | 1.10s         | 1.62s
>>> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>>> insns per cycle | 2.39          | 2.39          | 1.90
>>>
>>> Begs the question whether you see a similar effect on a real qemu and
>>> the coroutine pool is still not big enough? With correct use of
>>> coroutines, the difference seems to be barely measurable even without
>>> any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt bench put a
>> rather heavy load on the CPU and kept operations per second very low
>> (~40K/sec), so I wrote a new, simpler one that can generate hundreds of
>> thousands of operations per second; that number should match some fast
>> storage devices, and it does show a non-trivial cost from coroutines.
>>
>> In the extreme case where just a getppid() syscall is run in each
>> iteration, only 3M operations/sec can be reached with coroutines, while
>> without coroutines the number reaches 16M/sec, more than a 4x difference!
>
> I should be on vacation, but I'm following a couple of threads on the mailing
> list and I'm a bit tired of hearing the same argument again and again...
>
> The different characteristics of asynchronous I/O vs. any synchronous workload
> are such that it is hard to be sure that microbenchmarks make sense.
>
> The below patch is basically the minimal change to bypass coroutines.  Of course
> the block.c part is not acceptable as is (the change to refresh_total_sectors
> is broken, the others are just ugly), but it is a start.  Please run it with
> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
> benchmark.

I have to say this approach is much cleverer, and better than mine; I
just ran a quick fio randread test in a VM, and IOPS improves by more
than 10% compared with the bypass-coroutine patch.

Hope it can be merged soon :-)

Great thanks, Paolo.

Thanks,

> Paolo
>
> diff --git a/block.c b/block.c
> index 3e252a2..0b6e9cf 100644
> --- a/block.c
> +++ b/block.c
> @@ -704,7 +704,7 @@ static int refresh_total_sectors(BlockDriverState *bs, int64_t hint)
>          return 0;
>
>      /* query actual device if possible, otherwise just trust the hint */
> -    if (drv->bdrv_getlength) {
> +    if (!hint && drv->bdrv_getlength) {
>          int64_t length = drv->bdrv_getlength(bs);
>          if (length < 0) {
>              return length;
> @@ -2651,9 +2651,6 @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
>      if (!bdrv_is_inserted(bs))
>          return -ENOMEDIUM;
>
> -    if (bs->growable)
> -        return 0;
> -
>      len = bdrv_getlength(bs);
>
>      if (offset < 0)
> @@ -3107,7 +3104,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>      if (!drv) {
>          return -ENOMEDIUM;
>      }
> -    if (bdrv_check_byte_request(bs, offset, bytes)) {
> +    if (!bs->growable && bdrv_check_byte_request(bs, offset, bytes)) {
>          return -EIO;
>      }
>
> @@ -3347,7 +3344,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>      if (bs->read_only) {
>          return -EACCES;
>      }
> -    if (bdrv_check_byte_request(bs, offset, bytes)) {
> +    if (!bs->growable && bdrv_check_byte_request(bs, offset, bytes)) {
>          return -EIO;
>      }
>
> @@ -4356,6 +4353,20 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>  {
>      trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>
> +    if (bs->drv && bs->drv->bdrv_aio_readv &&
> +        bs->drv->bdrv_aio_readv != bdrv_aio_readv_em &&
> +        nb_sectors >= 0 && nb_sectors <= (UINT_MAX >> BDRV_SECTOR_BITS) &&
> +        !bdrv_check_byte_request(bs, sector_num << BDRV_SECTOR_BITS,
> +                                 nb_sectors << BDRV_SECTOR_BITS) &&
> +        !bs->copy_on_read && !bs->io_limits_enabled &&
> +        bs->request_alignment <= BDRV_SECTOR_SIZE) {
> +        BlockDriverAIOCB *acb =
> +            bs->drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
> +                                    cb, opaque);
> +        assert(acb);
> +        return acb;
> +    }
> +
>      return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
>                                   cb, opaque, false);
>  }
> @@ -4366,6 +4377,24 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>  {
>      trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
>
> +    if (bs->drv && bs->drv->bdrv_aio_writev &&
> +        bs->drv->bdrv_aio_writev != bdrv_aio_writev_em &&
> +        nb_sectors >= 0 && nb_sectors <= (UINT_MAX >> BDRV_SECTOR_BITS) &&
> +        !bdrv_check_byte_request(bs, sector_num << BDRV_SECTOR_BITS,
> +                                 nb_sectors << BDRV_SECTOR_BITS) &&
> +        !bs->read_only && !bs->io_limits_enabled &&
> +        bs->request_alignment <= BDRV_SECTOR_SIZE &&
> +        bs->enable_write_cache &&
> +        QLIST_EMPTY(&bs->before_write_notifiers.notifiers) &&
> +        bs->wr_highest_sector >= sector_num + nb_sectors - 1 &&
> +        QLIST_EMPTY(&bs->dirty_bitmaps)) {
> +        BlockDriverAIOCB *acb =
> +            bs->drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
> +                                     cb, opaque);
> +        assert(acb);
> +        return acb;
> +    }
> +
>      return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
>                                   cb, opaque, true);
>  }
> diff --git a/block/raw_bsd.c b/block/raw_bsd.c
> index 492f58d..b86f26b 100644
> --- a/block/raw_bsd.c
> +++ b/block/raw_bsd.c
> @@ -48,6 +48,22 @@ static int raw_reopen_prepare(BDRVReopenState *reopen_state,
>      return 0;
>  }
>
> +static BlockDriverAIOCB *raw_aio_readv(BlockDriverState *bs, int64_t sector_num,
> +                                     QEMUIOVector *qiov, int nb_sectors,
> +                                    BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
> +    return bdrv_aio_readv(bs->file, sector_num, qiov, nb_sectors, cb, opaque);
> +}
> +
> +static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs, int64_t sector_num,
> +                                      QEMUIOVector *qiov, int nb_sectors,
> +                                     BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
> +    return bdrv_aio_writev(bs->file, sector_num, qiov, nb_sectors, cb, opaque);
> +}
> +
>  static int coroutine_fn raw_co_readv(BlockDriverState *bs, int64_t sector_num,
>                                       int nb_sectors, QEMUIOVector *qiov)
>  {
> @@ -181,6 +197,8 @@ static BlockDriver bdrv_raw = {
>      .bdrv_open            = &raw_open,
>      .bdrv_close           = &raw_close,
>      .bdrv_create          = &raw_create,
> +    .bdrv_aio_readv       = &raw_aio_readv,
> +    .bdrv_aio_writev      = &raw_aio_writev,
>      .bdrv_co_readv        = &raw_co_readv,
>      .bdrv_co_writev       = &raw_co_writev,
>      .bdrv_co_write_zeroes = &raw_co_write_zeroes,

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-13 10:19                     ` Ming Lei
@ 2014-08-13 12:35                       ` Paolo Bonzini
  0 siblings, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-13 12:35 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

Il 13/08/2014 12:19, Ming Lei ha scritto:
>> > The problem is that your patches touch too much code and subtly
>> > break too much stuff.  The one I wrote does have a little breakage
> Could you give a hint about which stuff are broken? Last time, you mention
> virtio-scsi need to keep AIOCB live after returning, I have fixed it in V1.

They are dataplane-specific, while there's no reason not to have the
same benefits elsewhere.  They are file-specific, while there's no
reason not to have the same benefits for e.g. iSCSI (though iSCSI now
uses coroutines instead of bdrv_aio_*).  They touch AioContext for no
reason, and introduce a bunch of layering violations everywhere.

They are simply the wrong API.

>> > because I don't understand bs->growable 100% and I didn't really put
>> > much effort into it (my deadline being basically "be done as soon as the
>> > shower is free"), and it is ugly as hell, _but_ it should be compatible
>> > with the way the block layer works.
> I will take a careful look to your patch later.
> 
> If coroutine is still there, I think it still can slow down performance.

No, it's not there.  Please try the patch.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-13 11:43                 ` Ming Lei
@ 2014-08-13 12:35                   ` Paolo Bonzini
  2014-08-13 13:07                     ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-13 12:35 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

Il 13/08/2014 13:43, Ming Lei ha scritto:
>> > The below patch is basically the minimal change to bypass coroutines.  Of course
>> > the block.c part is not acceptable as is (the change to refresh_total_sectors
>> > is broken, the others are just ugly), but it is a start.  Please run it with
>> > your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>> > benchmark.
> I have to say this approach is much cleaver, and better than mine, and
> I just run a quick fio randread test in VM, and IOPS can improve > 10%
> than bypass coroutine patch.

Great, do you have a profile without and with the patch?

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-13 12:35                   ` Paolo Bonzini
@ 2014-08-13 13:07                     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-13 13:07 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

On Wed, Aug 13, 2014 at 8:35 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 13/08/2014 13:43, Ming Lei ha scritto:
>>> > The below patch is basically the minimal change to bypass coroutines.  Of course
>>> > the block.c part is not acceptable as is (the change to refresh_total_sectors
>>> > is broken, the others are just ugly), but it is a start.  Please run it with
>>> > your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>>> > benchmark.
>> I have to say this approach is much cleaver, and better than mine, and
>> I just run a quick fio randread test in VM, and IOPS can improve > 10%
>> than bypass coroutine patch.
>
> Great, do you have a profile without and with the patch?

Please see the link below:

http://pastebin.com/0VKSMKxv

Ming

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-13  9:54                     ` Kevin Wolf
@ 2014-08-13 13:16                       ` Paolo Bonzini
  2014-08-13 13:49                         ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-13 13:16 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi

Il 13/08/2014 11:54, Kevin Wolf ha scritto:
> Am 12.08.2014 um 21:08 hat Paolo Bonzini geschrieben:
>> Il 12/08/2014 10:12, Ming Lei ha scritto:
>>>>> The below patch is basically the minimal change to bypass coroutines.  Of course
>>>>> the block.c part is not acceptable as is (the change to refresh_total_sectors
>>>>> is broken, the others are just ugly), but it is a start.  Please run it with
>>>>> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>>>>> benchmark.
>>> Could you explain why the new change is introduced?
>>
>> It provides a fast path for bdrv_aio_readv/writev whenever there is
>> nothing to do after the driver routine returns.  In this case there is
>> no need to wrap the AIOCB returned by the driver routine.
>>
>> It doesn't go all the way, and in particular it doesn't reverse
>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
> 
> That's actually why I think it's an option. Remember that, like you say
> below, we're optimising for an extreme case here, and I certainly don't
> want to hurt the common case for it. I can't imagine a way of reversing
> the roles without multiplying the cost for the coroutine path.

I'm not that worried about it.  Perhaps it's enough to add an
!qemu_in_coroutine() to the AIO fast path, and let the driver provide
optimized coroutine paths like in your patches that allocate AIOCBs on
the stack.
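
Roughly, i.e. (untested, just reusing the condition from the patch
quoted earlier in this thread):

    if (!qemu_in_coroutine() &&
        bs->drv && bs->drv->bdrv_aio_readv &&
        bs->drv->bdrv_aio_readv != bdrv_aio_readv_em &&
        nb_sectors >= 0 && nb_sectors <= (UINT_MAX >> BDRV_SECTOR_BITS) &&
        !bdrv_check_byte_request(bs, sector_num << BDRV_SECTOR_BITS,
                                 nb_sectors << BDRV_SECTOR_BITS) &&
        !bs->copy_on_read && !bs->io_limits_enabled &&
        bs->request_alignment <= BDRV_SECTOR_SIZE) {
        /* fast path: call the driver directly, no coroutine involved */
        return bs->drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                                       cb, opaque);
    }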

> Or do you have a clever solution for how you'd go about it without
> having an impact on the common case?

I don't really have any ace up my sleeve, but there are some things that
bother me in the block layer's AIO API and in block.c in general.

One is that block.c can do all the pre-processing it wants before
issuing AIO, but nothing before calling the callback.  This means that
my patches break bdrv_drain_all (they cannot call tracked_request_end).

Another is all the similar structs that we have (RwCo,
BdrvTrackedRequest, BlockRequest, etc.).

Perhaps it would help if we had a single "real" block request object,
which is an extension of the BlockDriverAIOCB and includes enough data
to subsume all these request structs.  That should help commonize
stuff between the coroutine and AIO paths, for the common case where a
single yield is enough.  I think the single-yield case is the one that
is really worth optimizing for.  If done properly, I think this can
simplify a lot of block.c code, but it is really difficult to get it
right, and unless the design is sound the code is going to come out
really ugly. :(
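
To make that slightly more concrete, something vaguely like this is
what I have in mind (all field names invented, nothing thought through):

typedef struct BdrvRequest {
    BlockDriverAIOCB common;   /* cb/opaque for the AIO path */
    Coroutine *co;             /* non-NULL when a coroutine is waiting */
    int64_t offset;
    unsigned int bytes;
    QEMUIOVector *qiov;
    bool is_write;
    int ret;
    /* tracked-request linkage, serialisation and throttling state
     * would go here, subsuming RwCo/BdrvTrackedRequest/BlockRequest */
} BdrvRequest;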

Another thing to evaluate is the performance gap (if there is any)
between aio=threads and aio=native.  The only advantage of aio=native,
AFAIU, is the batch submission of requests (plug/unplug).  But
aio=threads often ends up having better performance because the kernel
folks have optimized VFS a lot.  So, in the aio=threads case, we might
as well move the format code out of the iothread and into the worker
thread, and get rid of the coroutine cost simply by making everything
synchronous.  Looking within QEMU, this worked out very well for migration.

(We could do batch submission of requests to the thread pool if there
were a variant of sem_post that can add multiple signals to the same
semaphore, similar to ReleaseSemaphore on Windows).
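
For illustration only, such a batched post is easy to emulate with a
mutex/condvar pair; this is a plain-pthreads sketch, not QEMU code, and
the broadcast makes it less efficient than a native batched sem_post
would be, which is rather the point of the remark above:

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    unsigned count;
} BatchSem;

#define BATCH_SEM_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Wake up to n waiters with one lock acquisition and one broadcast,
 * instead of n separate sem_post() calls. */
static void batch_sem_post_n(BatchSem *s, unsigned n)
{
    pthread_mutex_lock(&s->lock);
    s->count += n;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

static void batch_sem_wait(BatchSem *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0) {
        pthread_cond_wait(&s->cond, &s->lock);
    }
    s->count--;
    pthread_mutex_unlock(&s->lock);
}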

>>> The problem is that your patches touch too much code and subtly
>> break too much stuff.  The one I wrote does have a little breakage
>> because I don't understand bs->growable 100% and I didn't really put
>> much effort into it (my deadline being basically "be done as soon as the
>> shower is free"), and it is ugly as hell, _but_ it should be compatible
>> with the way the block layer works.
> 
> Yes, your patch is definitely much more palatable than Ming's. The part
> that I still don't like about it is that it would be stating "in the
> common case, we're only doing the second best thing". I'm not yet
>> convinced that coroutines necessarily perform worse than state-passing
> callbacks.

Coroutines lump all the allocation costs together at the time you
allocate the stack, but have (much) more expensive context switching.
Your patches decrease the allocation costs by placing the AIOCB on the
stack.

Since you have to allocate an AIOCB anyway if the caller uses
bdrv_aio_readv/writev, coroutines do necessarily perform worse if the
drivers have little or no state to pass around.  This is the case for
both thread-pool and linux-aio AIOCBs, and it is also the case for the
generic block layer whenever the long "if" statements evaluate to true.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-13 13:16                       ` Paolo Bonzini
@ 2014-08-13 13:49                         ` Ming Lei
  2014-08-14  9:39                           ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2014-08-13 13:49 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

On Wed, Aug 13, 2014 at 9:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 13/08/2014 11:54, Kevin Wolf ha scritto:
>> Am 12.08.2014 um 21:08 hat Paolo Bonzini geschrieben:
>>> Il 12/08/2014 10:12, Ming Lei ha scritto:
>>>>>> The below patch is basically the minimal change to bypass coroutines.  Of course
>>>>>> the block.c part is not acceptable as is (the change to refresh_total_sectors
>>>>>> is broken, the others are just ugly), but it is a start.  Please run it with
>>>>>> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>>>>>> benchmark.
>>>> Could you explain why the new change is introduced?
>>>
>>> It provides a fast path for bdrv_aio_readv/writev whenever there is
>>> nothing to do after the driver routine returns.  In this case there is
>>> no need to wrap the AIOCB returned by the driver routine.
>>>
>>> It doesn't go all the way, and in particular it doesn't reverse
>>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
>>
>> That's actually why I think it's an option. Remember that, like you say
>> below, we're optimising for an extreme case here, and I certainly don't
>> want to hurt the common case for it. I can't imagine a way of reversing
>> the roles without multiplying the cost for the coroutine path.
>
> I'm not that worried about it.  Perhaps it's enough to add an
> !qemu_in_coroutine() to the AIO fast path, and let the driver provide
> optimized coroutine paths like in your patches that allocate AIOCBs on
> the stack.

IMO, it will not be an extreme case as SSDs and high-performance
storage become more popular; coroutines start to affect performance
once IOPS goes beyond 100K, per the previous computation.

>
>> Or do you have a clever solution for how you'd go about it without
>> having an impact on the common case?
>
> I don't really have any ace up my sleeve, but there are some things that
> bother me in the block layer's AIO API and in block.c in general.
>
> One is that block.c can do all the pre-processing it wants before
> issuing AIO, but nothing before calling the callback.  This means that
> my patches break bdrv_drain_all (they cannot call tracked_request_end).
>
> Another is all the similar structs that we have (RwCo,
> BdrvTrackedRequest, BlockRequest, etc.).
>
> Perhaps it would help if we had a single "real" block request object,
> which is an extension of the BlockDriverAIOCB and includes enough data
> to subsume all these request structs.  That should help commonizing
> stuff between the coroutine and AIO paths, for the common case where a
> single yield is enough.  I think the single-yield case is the one that
> is really worth optimizing for.  If done properly, I think this can
> simplify a lot of block.c code, but it is really difficult to get it
> right, and unless the design is sound the code is going to come up
> really ugly. :(
>
> Another thing to evaluate is the performance gap (if there is any)
> between aio=threads and aio=native.  The only advantage of aio=native,
> AFAIU, is the batch submission of requests (plug/unplug).  But
> aio=threads often ends up having better performance because the kernel
> folks have optimized VFS a lot.  So, in the aio=threads case, we might

From my tests, aio=native is much better than aio=threads, at least
for reads.

For aio=threads, it may be possible to use batch submission too, with
readv/writev to decrease syscalls.

> as well move the format code out of the iothread and into the worker
> thread, and get rid of the coroutine cost simply by making everything
> synchronous.  Looking within QEMU, this worked out very well for migration.
>
> (We could do batch submission of requests to the thread pool if there
> were a variant of sem_post that can add multiple signals to the same
> semaphore, similar to ReleaseSemaphore on Windows).
>
>>> The problem is that your patches touch too much code and subtly
>>> break too much stuff.  The one I wrote does have a little breakage
>>> because I don't understand bs->growable 100% and I didn't really put
>>> much effort into it (my deadline being basically "be done as soon as the
>>> shower is free"), and it is ugly as hell, _but_ it should be compatible
>>> with the way the block layer works.
>>
>> Yes, your patch is definitely much more palatable than Ming's. The part
>> that I still don't like about it is that it would be stating "in the
>> common case, we're only doing the second best thing". I'm not yet
>> convinced that coroutines necessarily perform worse than state-passing
>> callbacks.
>
> Coroutines lump all the allocation costs together at the time you
> allocate the stack, but have (much) more expensive context switching.

Yes, I agree.

In my tests, one malloc(128)/free(128) pair takes only 57ns, but two
enters and one yield take 240ns, not to mention the dcache reloads and
misses caused by switching stacks, and the allocation fallback.
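
FWIW the malloc/free figure is easy to reproduce with a quick
standalone test like the one below (numbers will of course vary with
the machine and allocator):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void * volatile sink;   /* keeps the pair from being optimized away */

int main(void)
{
    enum { N = 10 * 1000 * 1000 };
    struct timespec t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int i = 0; i < N; i++) {
        void *p = malloc(128);
        sink = p;
        free(p);
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);

    double ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
    printf("malloc(128)/free(128): %.1f ns per pair\n", ns / N);
    return 0;
}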


Ming

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-13 13:49                         ` Ming Lei
@ 2014-08-14  9:39                           ` Stefan Hajnoczi
  2014-08-14 10:12                             ` Ming Lei
  2014-08-15 20:16                             ` Paolo Bonzini
  0 siblings, 2 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2014-08-14  9:39 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel


On Wed, Aug 13, 2014 at 09:49:23PM +0800, Ming Lei wrote:
> On Wed, Aug 13, 2014 at 9:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il 13/08/2014 11:54, Kevin Wolf ha scritto:
> >> Am 12.08.2014 um 21:08 hat Paolo Bonzini geschrieben:
> >>> Il 12/08/2014 10:12, Ming Lei ha scritto:
> >>>>>> The below patch is basically the minimal change to bypass coroutines.  Of course
> >>>>>> the block.c part is not acceptable as is (the change to refresh_total_sectors
> >>>>>> is broken, the others are just ugly), but it is a start.  Please run it with
> >>>>>> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
> >>>>>> benchmark.
> >>>> Could you explain why the new change is introduced?
> >>>
> >>> It provides a fast path for bdrv_aio_readv/writev whenever there is
> >>> nothing to do after the driver routine returns.  In this case there is
> >>> no need to wrap the AIOCB returned by the driver routine.
> >>>
> >>> It doesn't go all the way, and in particular it doesn't reverse
> >>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
> >>
> >> That's actually why I think it's an option. Remember that, like you say
> >> below, we're optimising for an extreme case here, and I certainly don't
> >> want to hurt the common case for it. I can't imagine a way of reversing
> >> the roles without multiplying the cost for the coroutine path.
> >
> > I'm not that worried about it.  Perhaps it's enough to add an
> > !qemu_in_coroutine() to the AIO fast path, and let the driver provide
> > optimized coroutine paths like in your patches that allocate AIOCBs on
> > the stack.
> 
> IMO, it will not be an extreme case as SSDs and high-performance
> storage become more popular; coroutines start to affect performance
> once IOPS goes beyond 100K, per the previous computation.

The case you seem to care about is raw images on high IOPS devices.  You
mentioned 1M IOPS devices in another email.

You don't seem to want QEMU's block layer features; that is why you are
trying to bypass them instead of optimizing the block layer.

That raises the question of whether you should look at PCI passthrough
instead.

Stefan


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-14  9:39                           ` Stefan Hajnoczi
@ 2014-08-14 10:12                             ` Ming Lei
  2014-08-15 20:16                             ` Paolo Bonzini
  1 sibling, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-14 10:12 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel

On Thu, Aug 14, 2014 at 5:39 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Wed, Aug 13, 2014 at 09:49:23PM +0800, Ming Lei wrote:
>> On Wed, Aug 13, 2014 at 9:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > Il 13/08/2014 11:54, Kevin Wolf ha scritto:
>> >> Am 12.08.2014 um 21:08 hat Paolo Bonzini geschrieben:
>> >>> Il 12/08/2014 10:12, Ming Lei ha scritto:
>> >>>>>> The below patch is basically the minimal change to bypass coroutines.  Of course
>> >>>>>> the block.c part is not acceptable as is (the change to refresh_total_sectors
>> >>>>>> is broken, the others are just ugly), but it is a start.  Please run it with
>> >>>>>> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>> >>>>>> benchmark.
>> >>>> Could you explain why the new change is introduced?
>> >>>
>> >>> It provides a fast path for bdrv_aio_readv/writev whenever there is
>> >>> nothing to do after the driver routine returns.  In this case there is
>> >>> no need to wrap the AIOCB returned by the driver routine.
>> >>>
>> >>> It doesn't go all the way, and in particular it doesn't reverse
>> >>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
>> >>
>> >> That's actually why I think it's an option. Remember that, like you say
>> >> below, we're optimising for an extreme case here, and I certainly don't
>> >> want to hurt the common case for it. I can't imagine a way of reversing
>> >> the roles without multiplying the cost for the coroutine path.
>> >
>> > I'm not that worried about it.  Perhaps it's enough to add an
>> > !qemu_in_coroutine() check to the AIO fast path, and let the driver provide
>> > optimized coroutine paths like in your patches that allocate AIOCBs on
>> > the stack.
>>
>> IMO, it will not be an extreme case as SSDs and high-performance storage
>> become more popular; coroutines start to affect performance if IOPS
>> is more than 100K, as per the previous computation.
>
> The case you seem to care about is raw images on high IOPS devices.  You
> mentioned 1M IOPS devices in another email.

In reality, if someone cares about high IOPS, it looks like the raw format
has to be considered.

>
> You don't seem to want QEMU's block layer features; that is why you are
> trying to bypass them instead of optimizing the block layer.

I don't think bypassing coroutines is at odds with optimizing the
block layer.

As we know, coroutines always introduce some cost, which can't be
ignored on high-IOPS devices. If coroutines can be improved to fit
this case, I'd like to help do that, but I wonder whether that is doable.

I like rich features, and I like good performance too; the two
shouldn't be contradictory, and the block layer should be flexible
enough to support both.
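
Just to put a number on that cost: below is a minimal standalone sketch
(not QEMU code; it uses POSIX ucontext, which QEMU's ucontext coroutine
backend resembles, and the 1M iteration count and 64 KB stack are
arbitrary choices of mine) comparing a plain getppid() loop with one
user-space context switch per operation:

    #include <stdio.h>
    #include <time.h>
    #include <ucontext.h>
    #include <unistd.h>

    #define ITERS 1000000L

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];

    /* "coroutine" body: one syscall of work, then yield back to main */
    static void co_entry(void)
    {
        for (;;) {
            getppid();
            swapcontext(&co_ctx, &main_ctx);
        }
    }

    static double secs(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec t0, t1;
        long i;

        /* plain loop: one getppid() per iteration, no switching */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++)
            getppid();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("plain:    %.0f ops/sec\n", ITERS / secs(t0, t1));

        /* switched loop: two swapcontext() calls per iteration */
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = co_stack;
        co_ctx.uc_stack.ss_size = sizeof(co_stack);
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, co_entry, 0);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++)
            swapcontext(&main_ctx, &co_ctx);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("switched: %.0f ops/sec\n", ITERS / secs(t0, t1));

        return 0;
    }

Note that swapcontext() saves and restores the signal mask, so it is
heavier than QEMU's actual switch path; treat the gap as an upper bound
on switching cost, not a measurement of QEMU's coroutines.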

> That raises the question of whether you should look at PCI passthrough
> instead.

I am wondering why you raise this question: virtio-blk is said to be one of
the fastest block devices in the VM world, so it is worth optimizing. It
also supports live migration, which passthrough does not.

Thanks,


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-11 19:37               ` Paolo Bonzini
                                   ` (2 preceding siblings ...)
  2014-08-13 11:43                 ` Ming Lei
@ 2014-08-14 10:46                 ` Kevin Wolf
  2014-08-15 10:39                   ` Ming Lei
  2014-08-15 20:15                   ` Paolo Bonzini
  3 siblings, 2 replies; 81+ messages in thread
From: Kevin Wolf @ 2014-08-14 10:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: tom.leiming, Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 11.08.2014 at 21:37, Paolo Bonzini wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
> > Hi Kevin, Paolo, Stefan and all,
> > 
> > 
> > On Wed, 6 Aug 2014 10:48:55 +0200
> > Kevin Wolf <kwolf@redhat.com> wrote:
> > 
> >> On 06.08.2014 at 07:33, Ming Lei wrote:
> > 
> >>
> >> Anyhow, the coroutine version of your benchmark is buggy: it leaks all
> >> coroutines instead of exiting them, so it can't make any use of the
> >> coroutine pool. On my laptop, I get this (where fixed coroutine is a
> >> version that simply removes the yield at the end):
> >>
> >>                 | bypass        | fixed coro    | buggy coro
> >> ----------------+---------------+---------------+--------------
> >> time            | 1.09s         | 1.10s         | 1.62s
> >> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> >> insns per cycle | 2.39          | 2.39          | 1.90
> >>
> >> This raises the question of whether you see a similar effect on a real qemu,
> >> and whether the coroutine pool is still not big enough. With correct use of
> >> coroutines, the difference seems to be barely measurable even without
> >> any I/O involved.
> > 
> > Now I have fixed the coroutine leak bug. The previous crypt bench put a
> > fairly heavy load on each operation, which kept operations per second
> > quite low (~40K/sec), so I wrote a new and simpler one that can generate
> > hundreds of thousands of operations per second; that number should match
> > some fast storage devices, and it does show a non-trivial effect from
> > coroutines.
> > 
> > In the extreme case, where just a getppid() syscall is run in each
> > iteration, only 3M operations/sec can be reached with coroutines, while
> > without coroutines the number reaches 16M/sec: more than a 4x difference!
> 
> I should be on vacation, but I'm following a couple of threads on the
> mailing list, and I'm a bit tired of hearing the same argument again and
> again...
> 
> The different characteristics of asynchronous I/O vs. any synchronous workload
> are such that it is hard to be sure that microbenchmarks make sense.
> 
> The below patch is basically the minimal change to bypass coroutines.  Of course
> the block.c part is not acceptable as is (the change to refresh_total_sectors
> is broken, the others are just ugly), but it is a start.  Please run it with
> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
> benchmark.

So to finally reply with some numbers... I'm running fio tests based on
Ming's configuration on a loop-mounted tmpfs image using dataplane. I've
extended the tests to not only test random reads, but also sequential
reads. I have not yet tested writes, and ran almost no tests with block
sizes larger than 4k, so I'm not including those here.

The "base" case is with Ming's patches applied, but the set_bypass(true)
call commented out in the virtio-blk code. All other cases are patches
applied on top of this.

                | Random throughput | Sequential throughput
----------------+-------------------+-----------------------
master          | 442 MB/s          | 730 MB/s
base            | 453 MB/s          | 757 MB/s
bypass (Ming)   | 461 MB/s          | 734 MB/s
coroutine       | 468 MB/s          | 716 MB/s
bypass (Paolo)  | 476 MB/s          | 682 MB/s

So while your patches look pretty good in Ming's test case of random
reads, I think the sequential case is worrying. The same is true for my
latest coroutine optimisations, even though the degradation is smaller
there.

This needs some more investigation.

Kevin


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-14 10:46                 ` Kevin Wolf
@ 2014-08-15 10:39                   ` Ming Lei
  2014-08-15 20:15                   ` Paolo Bonzini
  1 sibling, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-15 10:39 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel, Stefan Hajnoczi

On Thu, Aug 14, 2014 at 6:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 11.08.2014 at 21:37, Paolo Bonzini wrote:
>> On 10/08/2014 05:46, Ming Lei wrote:
>> > Hi Kevin, Paolo, Stefan and all,
>> >
>> >
>> > On Wed, 6 Aug 2014 10:48:55 +0200
>> > Kevin Wolf <kwolf@redhat.com> wrote:
>> >
>> >> On 06.08.2014 at 07:33, Ming Lei wrote:
>> >
>> >>
>> >> Anyhow, the coroutine version of your benchmark is buggy: it leaks all
>> >> coroutines instead of exiting them, so it can't make any use of the
>> >> coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> >> version that simply removes the yield at the end):
>> >>
>> >>                 | bypass        | fixed coro    | buggy coro
>> >> ----------------+---------------+---------------+--------------
>> >> time            | 1.09s         | 1.10s         | 1.62s
>> >> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> >> insns per cycle | 2.39          | 2.39          | 1.90
>> >>
>> >> This raises the question of whether you see a similar effect on a real qemu,
>> >> and whether the coroutine pool is still not big enough. With correct use of
>> >> coroutines, the difference seems to be barely measurable even without
>> >> any I/O involved.
>> >
>> > Now I have fixed the coroutine leak bug. The previous crypt bench put a
>> > fairly heavy load on each operation, which kept operations per second
>> > quite low (~40K/sec), so I wrote a new and simpler one that can generate
>> > hundreds of thousands of operations per second; that number should match
>> > some fast storage devices, and it does show a non-trivial effect from
>> > coroutines.
>> >
>> > In the extreme case, where just a getppid() syscall is run in each
>> > iteration, only 3M operations/sec can be reached with coroutines, while
>> > without coroutines the number reaches 16M/sec: more than a 4x difference!
>>
>> I should be on vacation, but I'm following a couple of threads on the
>> mailing list, and I'm a bit tired of hearing the same argument again and
>> again...
>>
>> The different characteristics of asynchronous I/O vs. any synchronous workload
>> are such that it is hard to be sure that microbenchmarks make sense.
>>
>> The below patch is basically the minimal change to bypass coroutines.  Of course
>> the block.c part is not acceptable as is (the change to refresh_total_sectors
>> is broken, the others are just ugly), but it is a start.  Please run it with
>> your fio workloads, or write an aio-based version of a qemu-img/qemu-io *I/O*
>> benchmark.
>
> So to finally reply with some numbers... I'm running fio tests based on
> Ming's configuration on a loop-mounted tmpfs image using dataplane. I've
> extended the tests to not only test random reads, but also sequential
> reads. I have not yet tested writes, and ran almost no tests with block
> sizes larger than 4k, so I'm not including those here.
>
> The "base" case is with Ming's patches applied, but the set_bypass(true)
> call commented out in the virtio-blk code. All other cases are patches
> applied on top of this.
>
>                 | Random throughput | Sequential throughput
> ----------------+-------------------+-----------------------
> master          | 442 MB/s          | 730 MB/s
> base            | 453 MB/s          | 757 MB/s
> bypass (Ming)   | 461 MB/s          | 734 MB/s
> coroutine       | 468 MB/s          | 716 MB/s
> bypass (Paolo)  | 476 MB/s          | 682 MB/s

It looks like the difference between random read and sequential read is
quite big, which shouldn't be the case since the whole file is cached
in RAM.

>
> So while your patches look pretty good in Ming's test case of random
> reads, I think the sequential case is worrying. The same is true for my
> latest coroutine optimisations, even though the degradation is smaller
> there.

In my VM test, the random read and sequential read results are basically
the same, and the I/O thread's CPU utilization is more than 93% with
Paolo's patch, over both null_blk and loop over a file in tmpfs.

I am using a 3.16 kernel.

>
> This needs some more investigation.

Maybe it is caused by your test setup and environment, or by your VM
kernel; I'm not sure.


Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-14 10:46                 ` Kevin Wolf
  2014-08-15 10:39                   ` Ming Lei
@ 2014-08-15 20:15                   ` Paolo Bonzini
  2014-08-16  8:20                     ` Ming Lei
  2014-08-17  5:29                     ` Paolo Bonzini
  1 sibling, 2 replies; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-15 20:15 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: tom.leiming, Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 14/08/2014 12:46, Kevin Wolf wrote:
> So to finally reply with some numbers... I'm running fio tests based on
> Ming's configuration on a loop-mounted tmpfs image using dataplane.

I'm not sure tmpfs is a particularly useful comparison, since it doesn't
support O_DIRECT.  O_DIRECT over ramdisk ("modprobe brd rd_nr=1
rd_size=524288 max_part=1", either directly or via a filesystem) is
probably a better benchmark.

Also, I'm not sure how the I/O scheduler works over tmpfs.  A ramdisk
should just do the right thing.  (Are you using deadline or cfq?)
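
For anyone reproducing this, a quick standalone probe (a hypothetical
helper of mine, not part of any patch in this thread) of the O_DIRECT
point: open(2) with O_DIRECT fails with EINVAL on tmpfs, while it
succeeds on a brd ramdisk or a real disk:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <path>\n", argv[0]);
            return 1;
        }
        /* tmpfs rejects O_DIRECT at open time; brd accepts it */
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            printf("%s: no O_DIRECT support (%s)\n", argv[1], strerror(errno));
            return 1;
        }
        printf("%s: O_DIRECT open succeeded\n", argv[1]);
        close(fd);
        return 0;
    }

Running it against /dev/ram0 and against a file on tmpfs should show the
difference immediately.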

>                 | Random throughput | Sequential throughput
> ----------------+-------------------+-----------------------
> master          | 442 MB/s          | 730 MB/s
> base            | 453 MB/s          | 757 MB/s
> bypass (Ming)   | 461 MB/s          | 734 MB/s
> coroutine       | 468 MB/s          | 716 MB/s
> bypass (Paolo)  | 476 MB/s          | 682 MB/s

This is pretty large, but it really smells like either a setup problem
or a kernel bug...

Paolo


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-14  9:39                           ` Stefan Hajnoczi
  2014-08-14 10:12                             ` Ming Lei
@ 2014-08-15 20:16                             ` Paolo Bonzini
  1 sibling, 0 replies; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-15 20:16 UTC (permalink / raw)
  To: Stefan Hajnoczi, Ming Lei; +Cc: Kevin Wolf, Fam Zheng, qemu-devel

On 14/08/2014 11:39, Stefan Hajnoczi wrote:
> That raises the question of whether you should look at PCI passthrough
> instead.

Being able to use logical volumes, or to access multiple remote LUNs
through a single FC card in the host, is an obvious reason to avoid PCI
passthrough.

Paolo


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-15 20:15                   ` Paolo Bonzini
@ 2014-08-16  8:20                     ` Ming Lei
  2014-08-17  5:29                     ` Paolo Bonzini
  1 sibling, 0 replies; 81+ messages in thread
From: Ming Lei @ 2014-08-16  8:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 8/16/14, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 14/08/2014 12:46, Kevin Wolf wrote:
>> So to finally reply with some numbers... I'm running fio tests based on
>> Ming's configuration on a loop-mounted tmpfs image using dataplane.
>
> I'm not sure tmpfs is a particularly useful comparison, since it doesn't
> support O_DIRECT.  O_DIRECT over ramdisk ("modprobe brd rd_nr=1
> rd_size=524288 max_part=1", either directly or via a filesystem) is
> probably a better benchmark.

If the loop device is backed by a file on tmpfs, that is fine, since loop
itself supports O_DIRECT and the synchronous ->read()/->write() calls
inside the loop driver can return without blocking.

I have tested loop over a tmpfs file, and can't reproduce Kevin's issue.

>
> Also, I'm not sure how the I/O scheduler works over tmpfs.  A ramdisk
> should just do the right thing.  (Are you using deadline or cfq?)
>
>>                 | Random throughput | Sequential throughput
>> ----------------+-------------------+-----------------------
>> master          | 442 MB/s          | 730 MB/s
>> base            | 453 MB/s          | 757 MB/s
>> bypass (Ming)   | 461 MB/s          | 734 MB/s
>> coroutine       | 468 MB/s          | 716 MB/s
>> bypass (Paolo)  | 476 MB/s          | 682 MB/s
>
> This is pretty large, but it really smells like either a setup problem
> or a kernel bug...
>
> Paolo
>

Thanks,
-- 
Ming Lei


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-15 20:15                   ` Paolo Bonzini
  2014-08-16  8:20                     ` Ming Lei
@ 2014-08-17  5:29                     ` Paolo Bonzini
  2014-08-18  8:58                       ` Kevin Wolf
  1 sibling, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2014-08-17  5:29 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: tom.leiming, Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 15/08/2014 22:15, Paolo Bonzini wrote:
>> >                 | Random throughput | Sequential throughput
>> > ----------------+-------------------+-----------------------
>> > master          | 442 MB/s          | 730 MB/s
>> > base            | 453 MB/s          | 757 MB/s
>> > bypass (Ming)   | 461 MB/s          | 734 MB/s
>> > coroutine       | 468 MB/s          | 716 MB/s
>> > bypass (Paolo)  | 476 MB/s          | 682 MB/s
> This is pretty large, but it really smells like either a setup problem
> or a kernel bug...

Thinking more about the I/O scheduler, it could simply be that faster
I/O = less coalescing = more bios actually reaching the driver = less speed.

It should be possible to find out whether this is true using blktrace.

(The reason why sequential I/O is faster is coalescing in the I/O
scheduler.)
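
blktrace gives the full per-bio picture; as a cheaper first check, the
per-device merge counters in sysfs already show whether coalescing
dropped. A small sketch (a hypothetical helper of mine; the field layout
follows Documentation/block/stat.txt, whose first two fields are read
I/Os completed and read I/Os merged) to sample them before and after a
fio run:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[256];
        unsigned long long reads, merges;
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <disk, e.g. vdc>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/sys/block/%s/stat", argv[1]);
        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        /* fields 1 and 2: read I/Os completed, read I/Os merged */
        if (fscanf(f, "%llu %llu", &reads, &merges) != 2) {
            fprintf(stderr, "unexpected format in %s\n", path);
            fclose(f);
            return 1;
        }
        fclose(f);
        printf("%s: %llu reads completed, %llu reads merged\n",
               argv[1], reads, merges);
        return 0;
    }

A collapse of the merged/completed ratio in the faster configurations
would support this hypothesis.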

Paolo


* Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
  2014-08-17  5:29                     ` Paolo Bonzini
@ 2014-08-18  8:58                       ` Kevin Wolf
  0 siblings, 0 replies; 81+ messages in thread
From: Kevin Wolf @ 2014-08-18  8:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: tom.leiming, Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 17.08.2014 at 07:29, Paolo Bonzini wrote:
> On 15/08/2014 22:15, Paolo Bonzini wrote:
> >> >                 | Random throughput | Sequential throughput
> >> > ----------------+-------------------+-----------------------
> >> > master          | 442 MB/s          | 730 MB/s
> >> > base            | 453 MB/s          | 757 MB/s
> >> > bypass (Ming)   | 461 MB/s          | 734 MB/s
> >> > coroutine       | 468 MB/s          | 716 MB/s
> >> > bypass (Paolo)  | 476 MB/s          | 682 MB/s
> > This is pretty large, but it really smells like either a setup problem
> > or a kernel bug...
> 
> Thinking more about the I/O scheduler, it could simply be that faster
> I/O = less coalescing = more bios actually reaching the driver = less speed.
> 
> It should be possible to find out whether this is true using blktrace.
> 
> (The reason why sequential I/O is faster is coalescing in the I/O
> scheduler.)

Yes, sorry, I should have posted an update on Friday. This was cfq in the
guest (apparently the host doesn't use any scheduler with loop devices,
which makes some sense); with noop (or deadline) the numbers look much
better: sequential throughput is on a slightly higher level and increases
with the optimisations, and random is much closer to it.
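
For reference, the guest-side scheduler switch is just a sysfs write,
i.e. the equivalent of "echo noop > /sys/block/vdc/queue/scheduler"
(device name assumed); a minimal C sketch:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[256];
        FILE *f;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <disk, e.g. vdc> <noop|deadline|cfq>\n",
                    argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", argv[1]);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        /* the block layer switches this queue to the named scheduler */
        fprintf(f, "%s\n", argv[2]);
        if (fclose(f) == EOF) {
            perror("fclose");
            return 1;
        }
        return 0;
    }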

Kevin


Thread overview: 81+ messages
2014-08-05  3:33 [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 01/17] qemu/obj_pool.h: introduce object allocation pool Ming Lei
2014-08-05 11:55   ` Eric Blake
2014-08-05 12:05     ` Michael S. Tsirkin
2014-08-05 12:21       ` Eric Blake
2014-08-05 12:51         ` Michael S. Tsirkin
2014-08-06  2:35     ` Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 02/17] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
2014-08-05 12:30   ` Eric Blake
2014-08-06  2:45     ` Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 03/17] qemu coroutine: support bypass mode Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 04/17] block: prepare for supporting selective bypass coroutine Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 05/17] garbage collector: introduced for support of " Ming Lei
2014-08-05 12:43   ` Eric Blake
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 06/17] block: introduce bdrv_co_can_bypass_co Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 07/17] block: support to bypass qemu coroutinue Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 08/17] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used" Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 09/17] dataplane: enable selective bypassing coroutine Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 10/17] linux-aio: fix submit aio as a batch Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 11/17] linux-aio: handling -EAGAIN for !s->io_q.plugged case Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 12/17] linux-aio: increase max event to 256 Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 13/17] linux-aio: remove 'node' from 'struct qemu_laiocb' Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 14/17] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 15/17] virtio-blk: support multi queue for non-dataplane Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 16/17] virtio-blk: dataplane: support multi virtqueue Ming Lei
2014-08-05  3:33 ` [Qemu-devel] [PATCH v1 17/17] hw/virtio-pci: introduce num_queues property Ming Lei
2014-08-05  9:38 ` [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support Stefan Hajnoczi
2014-08-05  9:50   ` Ming Lei
2014-08-05  9:56     ` Kevin Wolf
2014-08-05 10:50       ` Ming Lei
2014-08-05 13:59     ` Stefan Hajnoczi
2014-08-05  9:48 ` Kevin Wolf
2014-08-05 10:00   ` Ming Lei
2014-08-05 11:44     ` Paolo Bonzini
2014-08-05 13:48     ` Stefan Hajnoczi
2014-08-05 14:47       ` Kevin Wolf
2014-08-06  5:33         ` Ming Lei
2014-08-06  7:45           ` Paolo Bonzini
2014-08-06  8:38             ` Ming Lei
2014-08-06  8:50               ` Paolo Bonzini
2014-08-06 13:53                 ` Ming Lei
2014-08-06  8:48           ` Kevin Wolf
2014-08-06  9:37             ` Ming Lei
2014-08-06 10:09               ` Kevin Wolf
2014-08-06 11:28                 ` Ming Lei
2014-08-06 11:44                   ` Ming Lei
2014-08-06 15:40                   ` Kevin Wolf
2014-08-07 10:27                     ` Ming Lei
2014-08-07 10:52                       ` Ming Lei
2014-08-07 11:06                         ` Kevin Wolf
2014-08-07 13:03                           ` Ming Lei
2014-08-07 13:51                       ` Kevin Wolf
2014-08-08 10:32                         ` Ming Lei
2014-08-08 11:26                           ` Ming Lei
2014-08-10  3:46             ` Ming Lei
2014-08-11 14:03               ` Kevin Wolf
2014-08-12  7:53                 ` Ming Lei
2014-08-12 11:40                   ` Kevin Wolf
2014-08-12 12:14                     ` Ming Lei
2014-08-11 19:37               ` Paolo Bonzini
2014-08-12  8:12                 ` Ming Lei
2014-08-12 19:08                   ` Paolo Bonzini
2014-08-13  9:54                     ` Kevin Wolf
2014-08-13 13:16                       ` Paolo Bonzini
2014-08-13 13:49                         ` Ming Lei
2014-08-14  9:39                           ` Stefan Hajnoczi
2014-08-14 10:12                             ` Ming Lei
2014-08-15 20:16                             ` Paolo Bonzini
2014-08-13 10:19                     ` Ming Lei
2014-08-13 12:35                       ` Paolo Bonzini
2014-08-13  8:55                 ` Stefan Hajnoczi
2014-08-13 11:43                 ` Ming Lei
2014-08-13 12:35                   ` Paolo Bonzini
2014-08-13 13:07                     ` Ming Lei
2014-08-14 10:46                 ` Kevin Wolf
2014-08-15 10:39                   ` Ming Lei
2014-08-15 20:15                   ` Paolo Bonzini
2014-08-16  8:20                     ` Ming Lei
2014-08-17  5:29                     ` Paolo Bonzini
2014-08-18  8:58                       ` Kevin Wolf
2014-08-06  9:37           ` Stefan Hajnoczi
