* [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support
@ 2014-07-30 11:39 Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode Ming Lei
                   ` (16 more replies)
  0 siblings, 17 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Hi,

This patch series introduces the following four changes:

        - introduce a selective coroutine bypass mechanism
        to improve the performance of virtio-blk dataplane with
        raw format images

        - introduce an object allocation pool and apply it to
        virtio-blk dataplane to improve its performance

        - linux-aio changes: fix the -EAGAIN and partial
        completion cases, increase max events to 256, and remove one
        unused field from 'struct qemu_laiocb'

        - support multi virtqueue for virtio-blk dataplane

The virtio-blk multi virtqueue feature will be added to virtio spec 1.1[1],
and the 3.17 Linux kernel[2] will support the feature in the virtio-blk driver.
For those who want to try it, the kernel-side patches can be found
in either Jens's block tree[3] or linux-next[4].

The following fio script, run from inside the VM, is used to measure the improvement from these patches:

        [global]
        direct=1
        size=128G
        bsrange=4k-4k
        timeout=120
        numjobs=${JOBS}
        ioengine=libaio
        iodepth=64
        filename=/dev/vdc
        group_reporting=1

        [f]
        rw=randread

A quad-core VM (8 GB RAM) is created on the host below to run the above fio test:

        - server (16 cores: 8 physical cores, 2 threads per physical core)

The throughput (IOPS) results for this patchset (4 virtqueues per
virtio-blk device) against QEMU 2.1.0-rc5 are as follows: a 30%
throughput improvement is observed, and scalability for parallel
I/O improves even more (an 80% throughput improvement is observed
with 4 jobs).

From the above results, we can see that both scalability and
performance improve significantly.

After commit 580b6b2aa2 (dataplane: use the QEMU block
layer for I/O), the average time for submitting a single
request increased a lot; according to my traces, it
roughly doubled, even though the block plug & unplug
mechanism was introduced to ease the effect. That is why
this patchset first introduces the selective coroutine
bypass mechanism and the object allocation pool to win
that time back. On top of QEMU 2.0, the virtio-blk
dataplane multi virtqueue patch alone achieved a bigger
improvement than the current result[5].

TODO:
        - optimize the block layer for Linux AIO, so that
        more time can be saved when submitting requests
        - support more than one AioContext to improve
        virtio-blk performance

 async.c                         |    1 +
 block.c                         |  129 ++++++++++++++++++-----
 block/linux-aio.c               |   93 +++++++++++-----
 block/raw-posix.c               |   34 ++++++
 hw/block/dataplane/virtio-blk.c |  221 ++++++++++++++++++++++++++++++---------
 hw/block/virtio-blk.c           |   32 +++++-
 hw/net/virtio-net.c             |    4 +-
 hw/virtio/dataplane/vring.c     |   23 +++-
 hw/virtio/virtio.c              |   23 ++--
 include/block/aio.h             |   13 +++
 include/block/block.h           |    9 ++
 include/block/coroutine.h       |    8 ++
 include/block/coroutine_int.h   |    5 +
 include/hw/virtio/virtio-blk.h  |   13 +++
 include/hw/virtio/virtio.h      |   13 ++-
 include/qemu/obj_pool.h         |   64 ++++++++++++
 qemu-coroutine-lock.c           |    4 +-
 qemu-coroutine.c                |   33 ++++++
 18 files changed, 600 insertions(+), 122 deletions(-)


[1], http://marc.info/?l=linux-api&m=140486843317107&w=2
[2], http://marc.info/?l=linux-api&m=140418368421229&w=2
[3], http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/ #for-3.17/drivers
[4], https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/
[5], http://marc.info/?l=linux-api&m=140377573830230&w=2

Thanks,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 13:45   ` Paolo Bonzini
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 02/15] qemu aio: prepare for supporting selective bypass coroutine Ming Lei
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch introduces several APIs that allow bypassing a QEMU
coroutine when it is not necessary, for performance's sake.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/block/coroutine.h     |    8 ++++++++
 include/block/coroutine_int.h |    5 +++++
 qemu-coroutine-lock.c         |    4 ++--
 qemu-coroutine.c              |   33 +++++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index b408f96..46d2642 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -223,4 +223,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  * Note that this function clobbers the handlers for the file descriptor.
  */
 void coroutine_fn yield_until_fd_readable(int fd);
+
+/* qemu coroutine bypass APIs */
+void qemu_coroutine_set_bypass(bool bypass);
+bool qemu_coroutine_bypassed(Coroutine *self);
+bool qemu_coroutine_self_bypassed(void);
+void qemu_coroutine_set_var(void *var);
+void *qemu_coroutine_get_var(void);
+
 #endif /* QEMU_COROUTINE_H */
diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
index f133d65..106d0b2 100644
--- a/include/block/coroutine_int.h
+++ b/include/block/coroutine_int.h
@@ -39,6 +39,11 @@ struct Coroutine {
     Coroutine *caller;
     QSLIST_ENTRY(Coroutine) pool_next;
 
+    bool bypass;
+
+    /* only used in bypass mode */
+    void *opaque;
+
     /* Coroutines that should be woken up when we yield or terminate */
     QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
     QTAILQ_ENTRY(Coroutine) co_queue_next;
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index e4860ae..7c69ff6 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -82,13 +82,13 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
 
 bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     return qemu_co_queue_do_restart(queue, true);
 }
 
 void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     qemu_co_queue_do_restart(queue, false);
 }
 
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 4708521..0597ed9 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -137,3 +137,36 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
+
+void qemu_coroutine_set_bypass(bool bypass)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->bypass = bypass;
+}
+
+bool qemu_coroutine_bypassed(Coroutine *self)
+{
+    return self->bypass;
+}
+
+bool qemu_coroutine_self_bypassed(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return qemu_coroutine_bypassed(self);
+}
+
+void qemu_coroutine_set_var(void *var)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->opaque = var;
+}
+
+void *qemu_coroutine_get_var(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return self->opaque;
+}
-- 
1.7.9.5


* [Qemu-devel] [PATCH 02/15] qemu aio: prepare for supporting selective bypass coroutine
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 03/15] block: support to bypass qemu coroutine Ming Lei
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

If a device decides that coroutines are not necessary in its
performance-sensitive path, it can call
qemu_aio_set_bypass_co(ctx, true) to bypass coroutines that
support bypass mode and call the function directly instead.

One example is virtio-blk dataplane.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 async.c             |    1 +
 include/block/aio.h |   13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/async.c b/async.c
index 34af0b2..251a074 100644
--- a/async.c
+++ b/async.c
@@ -293,6 +293,7 @@ AioContext *aio_context_new(void)
                            (EventNotifierHandler *)
                            event_notifier_test_and_clear);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
+    qemu_aio_set_bypass_co(ctx, false);
 
     return ctx;
 }
diff --git a/include/block/aio.h b/include/block/aio.h
index c23de3c..48d827e 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -87,6 +87,9 @@ struct AioContext {
 
     /* TimerLists for calling timers - one per clock type */
     QEMUTimerListGroup tlg;
+
+    /* support selective bypass coroutine */
+    bool bypass_co;
 };
 
 /* Used internally to synchronize aio_poll against qemu_bh_schedule.  */
@@ -303,4 +306,14 @@ static inline void aio_timer_init(AioContext *ctx,
     timer_init(ts, ctx->tlg.tl[type], scale, cb, opaque);
 }
 
+static inline void qemu_aio_set_bypass_co(AioContext *ctx, bool bypass)
+{
+    ctx->bypass_co = bypass;
+}
+
+static inline bool qemu_aio_get_bypass_co(AioContext *ctx)
+{
+    return ctx->bypass_co;
+}
+
 #endif
-- 
1.7.9.5


* [Qemu-devel] [PATCH 03/15] block: support to bypass qemu coroutine
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 02/15] qemu aio: prepare for supporting selective bypass coroutine Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 04/15] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used" Ming Lei
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch adds bypass mode support for the coroutine
in bdrv_co_aio_rw_vector(), which is on the fast path
of many block devices, especially virtio-blk dataplane.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block.c |  129 +++++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 105 insertions(+), 24 deletions(-)

diff --git a/block.c b/block.c
index 8cf519b..d84079a 100644
--- a/block.c
+++ b/block.c
@@ -55,6 +55,21 @@ struct BdrvDirtyBitmap {
     QLIST_ENTRY(BdrvDirtyBitmap) list;
 };
 
+typedef struct CoroutineIOCompletion {
+    Coroutine *coroutine;
+    int ret;
+    bool bypass;
+    QEMUIOVector *bounced_iov;
+} CoroutineIOCompletion;
+
+typedef struct BlockDriverAIOCBCoroutine {
+    BlockDriverAIOCB common;
+    BlockRequest req;
+    bool is_write;
+    bool *done;
+    QEMUBH *bh;
+} BlockDriverAIOCBCoroutine;
+
 #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */
 
 static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load);
@@ -120,6 +135,21 @@ int is_windows_drive(const char *filename)
 }
 #endif
 
+static CoroutineIOCompletion *bdrv_get_co_io_comp(BlockDriverAIOCBCoroutine
+                                                  *acb)
+{
+    return (CoroutineIOCompletion *)((void *)acb +
+               sizeof(BlockDriverAIOCBCoroutine));
+}
+
+static BlockDriverAIOCBCoroutine *bdrv_get_aio_co(CoroutineIOCompletion *co)
+{
+    assert(co->bypass);
+
+    return (BlockDriverAIOCBCoroutine *)((void *)co -
+               sizeof(BlockDriverAIOCBCoroutine));
+}
+
 /* throttling disk I/O limits */
 void bdrv_set_io_limits(BlockDriverState *bs,
                         ThrottleConfig *cfg)
@@ -3081,7 +3111,16 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
             ret = drv->bdrv_co_readv(bs, sector_num, local_sectors,
                                      &local_qiov);
 
-            qemu_iovec_destroy(&local_qiov);
+
+            if (qemu_coroutine_self_bypassed()) {
+                CoroutineIOCompletion *pco = bdrv_get_co_io_comp(
+                                             (BlockDriverAIOCBCoroutine *)
+                                             qemu_coroutine_get_var());
+                pco->bounced_iov = g_malloc(sizeof(QEMUIOVector));
+                *pco->bounced_iov = local_qiov;
+            } else {
+                qemu_iovec_destroy(&local_qiov);
+            }
         } else {
             ret = 0;
         }
@@ -4659,15 +4698,6 @@ static BlockDriverAIOCB *bdrv_aio_writev_em(BlockDriverState *bs,
     return bdrv_aio_rw_vector(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
 }
 
-
-typedef struct BlockDriverAIOCBCoroutine {
-    BlockDriverAIOCB common;
-    BlockRequest req;
-    bool is_write;
-    bool *done;
-    QEMUBH* bh;
-} BlockDriverAIOCBCoroutine;
-
 static void bdrv_aio_co_cancel_em(BlockDriverAIOCB *blockacb)
 {
     AioContext *aio_context = bdrv_get_aio_context(blockacb->bs);
@@ -4686,6 +4716,12 @@ static const AIOCBInfo bdrv_em_co_aiocb_info = {
     .cancel             = bdrv_aio_co_cancel_em,
 };
 
+static const AIOCBInfo bdrv_em_co_bypass_aiocb_info = {
+    .aiocb_size         = sizeof(BlockDriverAIOCBCoroutine) +
+                          sizeof(CoroutineIOCompletion),
+    .cancel             = bdrv_aio_co_cancel_em,
+};
+
 static void bdrv_co_em_bh(void *opaque)
 {
     BlockDriverAIOCBCoroutine *acb = opaque;
@@ -4705,6 +4741,12 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
 {
     BlockDriverAIOCBCoroutine *acb = opaque;
     BlockDriverState *bs = acb->common.bs;
+    bool bypass = qemu_coroutine_self_bypassed();
+
+    if (bypass) {
+        qemu_coroutine_set_var(acb);
+        memset(bdrv_get_co_io_comp(acb), 0, sizeof(CoroutineIOCompletion));
+    }
 
     if (!acb->is_write) {
         acb->req.error = bdrv_co_do_readv(bs, acb->req.sector,
@@ -4714,8 +4756,10 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
             acb->req.nb_sectors, acb->req.qiov, acb->req.flags);
     }
 
-    acb->bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_em_bh, acb);
-    qemu_bh_schedule(acb->bh);
+    if (!bypass) {
+        acb->bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_em_bh, acb);
+        qemu_bh_schedule(acb->bh);
+    }
 }
 
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
@@ -4729,8 +4773,18 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
 {
     Coroutine *co;
     BlockDriverAIOCBCoroutine *acb;
+    const AIOCBInfo *aiocb_info;
+    bool bypass;
 
-    acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
+    if (qemu_aio_get_bypass_co(bdrv_get_aio_context(bs))) {
+        aiocb_info = &bdrv_em_co_bypass_aiocb_info;
+        bypass = true;
+    } else {
+        aiocb_info = &bdrv_em_co_aiocb_info;
+        bypass = false;
+    }
+
+    acb = qemu_aio_get(aiocb_info, bs, cb, opaque);
     acb->req.sector = sector_num;
     acb->req.nb_sectors = nb_sectors;
     acb->req.qiov = qiov;
@@ -4738,8 +4792,14 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
     acb->is_write = is_write;
     acb->done = NULL;
 
-    co = qemu_coroutine_create(bdrv_co_do_rw);
-    qemu_coroutine_enter(co, acb);
+    if (!bypass) {
+        co = qemu_coroutine_create(bdrv_co_do_rw);
+        qemu_coroutine_enter(co, acb);
+    } else {
+        qemu_coroutine_set_bypass(true);
+        bdrv_co_do_rw(acb);
+        qemu_coroutine_set_bypass(false);
+    }
 
     return &acb->common;
 }
@@ -4833,17 +4893,28 @@ void qemu_aio_release(void *p)
 /**************************************************************/
 /* Coroutine block device emulation */
 
-typedef struct CoroutineIOCompletion {
-    Coroutine *coroutine;
-    int ret;
-} CoroutineIOCompletion;
-
 static void bdrv_co_io_em_complete(void *opaque, int ret)
 {
     CoroutineIOCompletion *co = opaque;
 
     co->ret = ret;
-    qemu_coroutine_enter(co->coroutine, NULL);
+
+    if (!co->bypass) {
+        qemu_coroutine_enter(co->coroutine, NULL);
+    } else {
+        BlockDriverAIOCBCoroutine *acb = bdrv_get_aio_co(co);
+
+        acb->common.cb(acb->common.opaque, ret);
+        if (acb->done) {
+            *acb->done = true;
+        }
+        qemu_aio_release(acb);
+
+        if (co->bounced_iov) {
+            qemu_iovec_destroy(co->bounced_iov);
+            g_free(co->bounced_iov);
+        }
+    }
 }
 
 static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
@@ -4853,21 +4924,31 @@ static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
     CoroutineIOCompletion co = {
         .coroutine = qemu_coroutine_self(),
     };
+    CoroutineIOCompletion *pco = &co;
     BlockDriverAIOCB *acb;
 
+    if (qemu_coroutine_bypassed(co.coroutine)) {
+        pco = bdrv_get_co_io_comp((BlockDriverAIOCBCoroutine *)
+                                   qemu_coroutine_get_var());
+        pco->bypass = true;
+    }
+
     if (is_write) {
         acb = bs->drv->bdrv_aio_writev(bs, sector_num, iov, nb_sectors,
-                                       bdrv_co_io_em_complete, &co);
+                                       bdrv_co_io_em_complete, pco);
     } else {
         acb = bs->drv->bdrv_aio_readv(bs, sector_num, iov, nb_sectors,
-                                      bdrv_co_io_em_complete, &co);
+                                      bdrv_co_io_em_complete, pco);
     }
 
     trace_bdrv_co_io_em(bs, sector_num, nb_sectors, is_write, acb);
     if (!acb) {
         return -EIO;
     }
-    qemu_coroutine_yield();
+
+    if (!pco->bypass) {
+        qemu_coroutine_yield();
+    }
 
     return co.ret;
 }
-- 
1.7.9.5


* [Qemu-devel] [PATCH 04/15] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used"
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (2 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 03/15] block: support to bypass qemu coroutine Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 05/15] dataplane: enable selective bypassing coroutine Ming Lei
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This reverts commit 76ef2cf5493a215efc351f48ae7094d6c183fcac.

Reintroduce the raw_get_aio_fd() helper to enable
coroutine bypass mode for raw images.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/raw-posix.c     |   34 ++++++++++++++++++++++++++++++++++
 include/block/block.h |    9 +++++++++
 2 files changed, 43 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 8e9758e..88715c8 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -2436,6 +2436,40 @@ static BlockDriver bdrv_host_cdrom = {
 };
 #endif /* __FreeBSD__ */
 
+#ifdef CONFIG_LINUX_AIO
+/**
+ * Return the file descriptor for Linux AIO
+ *
+ * This function is a layering violation and should be removed when it becomes
+ * possible to call the block layer outside the global mutex.  It allows the
+ * caller to hijack the file descriptor so I/O can be performed outside the
+ * block layer.
+ */
+int raw_get_aio_fd(BlockDriverState *bs)
+{
+    BDRVRawState *s;
+
+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
+
+    if (bs->drv == bdrv_find_format("raw")) {
+        bs = bs->file;
+    }
+
+    /* raw-posix has several protocols so just check for raw_aio_readv */
+    if (bs->drv->bdrv_aio_readv != raw_aio_readv) {
+        return -ENOTSUP;
+    }
+
+    s = bs->opaque;
+    if (!s->use_aio) {
+        return -ENOTSUP;
+    }
+    return s->fd;
+}
+#endif /* CONFIG_LINUX_AIO */
+
 static void bdrv_file_init(void)
 {
     /*
diff --git a/include/block/block.h b/include/block/block.h
index f08471d..4f35741 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -482,6 +482,15 @@ void bdrv_op_block_all(BlockDriverState *bs, Error *reason);
 void bdrv_op_unblock_all(BlockDriverState *bs, Error *reason);
 bool bdrv_op_blocker_is_empty(BlockDriverState *bs);
 
+#ifdef CONFIG_LINUX_AIO
+int raw_get_aio_fd(BlockDriverState *bs);
+#else
+static inline int raw_get_aio_fd(BlockDriverState *bs)
+{
+    return -ENOTSUP;
+}
+#endif
+
 enum BlockAcctType {
     BDRV_ACCT_READ,
     BDRV_ACCT_WRITE,
-- 
1.7.9.5


* [Qemu-devel] [PATCH 05/15] dataplane: enable selective bypassing coroutine
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (3 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 04/15] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used" Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 06/15] qemu/obj_pool.h: introduce object allocation pool Ming Lei
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch enables selectively bypassing the coroutine
in bdrv_co_aio_rw_vector() if the image format is raw.

With this patch, a ~7% throughput improvement for raw images is
observed in the VM on the test server.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/dataplane/virtio-blk.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index d6ba65c..2093e4a 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -28,6 +28,7 @@ struct VirtIOBlockDataPlane {
     bool started;
     bool starting;
     bool stopping;
+    bool raw_format;
 
     VirtIOBlkConf *blk;
 
@@ -193,6 +194,8 @@ void virtio_blk_data_plane_create(VirtIODevice *vdev, VirtIOBlkConf *blk,
     error_setg(&s->blocker, "block device is in use by data plane");
     bdrv_op_block_all(blk->conf.bs, s->blocker);
 
+    s->raw_format = (raw_get_aio_fd(blk->conf.bs) >= 0);
+
     *dataplane = s;
 }
 
@@ -262,6 +265,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     /* Kick right away to begin processing requests already in vring */
     event_notifier_set(virtio_queue_get_host_notifier(vq));
 
+    if (s->raw_format) {
+        qemu_aio_set_bypass_co(s->ctx, true);
+    }
+
     /* Get this show started by hooking up our callbacks */
     aio_context_acquire(s->ctx);
     aio_set_event_notifier(s->ctx, &s->host_notifier, handle_notify);
@@ -291,6 +298,9 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
 
     aio_context_release(s->ctx);
 
+    if (s->raw_format) {
+        qemu_aio_set_bypass_co(s->ctx, false);
+    }
     /* Sync vring state back to virtqueue so that non-dataplane request
      * processing can continue when we disable the host notifier below.
      */
-- 
1.7.9.5


* [Qemu-devel] [PATCH 06/15] qemu/obj_pool.h: introduce object allocation pool
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (4 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 05/15] dataplane: enable selective bypassing coroutine Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch introduces an object allocation pool to speed up
object allocation in the fast path.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/qemu/obj_pool.h |   64 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 include/qemu/obj_pool.h

diff --git a/include/qemu/obj_pool.h b/include/qemu/obj_pool.h
new file mode 100644
index 0000000..94b5f49
--- /dev/null
+++ b/include/qemu/obj_pool.h
@@ -0,0 +1,64 @@
+#ifndef QEMU_OBJ_POOL_HEAD
+#define QEMU_OBJ_POOL_HEAD
+
+typedef struct {
+    unsigned int size;
+    unsigned int cnt;
+
+    void **free_obj;
+    int free_idx;
+
+    char *objs;
+} ObjPool;
+
+static inline void obj_pool_init(ObjPool *op, void *objs_buf, void **free_objs,
+                                 unsigned int obj_size, unsigned cnt)
+{
+    int i;
+
+    op->objs = (char *)objs_buf;
+    op->free_obj = free_objs;
+    op->size = obj_size;
+    op->cnt = cnt;
+
+    for (i = 0; i < op->cnt; i++) {
+        op->free_obj[i] = (void *)&op->objs[i * op->size];
+    }
+    op->free_idx = op->cnt;
+}
+
+static inline void *obj_pool_get(ObjPool *op)
+{
+    void *obj;
+
+    if (!op) {
+        return NULL;
+    }
+
+    if (op->free_idx <= 0) {
+        return NULL;
+    }
+
+    obj = op->free_obj[--op->free_idx];
+    return obj;
+}
+
+static inline bool obj_pool_has_obj(ObjPool *op, void *obj)
+{
+    return op && (unsigned long)obj >= (unsigned long)&op->objs[0] &&
+           (unsigned long)obj <=
+           (unsigned long)&op->objs[(op->cnt - 1) * op->size];
+}
+
+static inline void obj_pool_put(ObjPool *op, void *obj)
+{
+    if (!op || !obj_pool_has_obj(op, obj)) {
+        return;
+    }
+
+    assert(op->free_idx < op->cnt);
+
+    op->free_obj[op->free_idx++] = obj;
+}
+
+#endif
-- 
1.7.9.5


* [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (5 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 06/15] qemu/obj_pool.h: introduce object allocation pool Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 14:14   ` Paolo Bonzini
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement Ming Lei
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

g_slice_new(VirtIOBlockReq), its matching free, and accessing the
instance are a bit slow since sizeof(VirtIOBlockReq) is more than
40KB, so use an object pool to speed up its allocation and release.

With this patch, a ~5% throughput improvement is observed in the VM
on the test server.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/dataplane/virtio-blk.c |   12 ++++++++++++
 hw/block/virtio-blk.c           |   13 +++++++++++--
 include/hw/virtio/virtio-blk.h  |    2 ++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 2093e4a..828fe99 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -24,6 +24,8 @@
 #include "hw/virtio/virtio-bus.h"
 #include "qom/object_interfaces.h"
 
+#define REQ_POOL_SZ 128
+
 struct VirtIOBlockDataPlane {
     bool started;
     bool starting;
@@ -51,6 +53,10 @@ struct VirtIOBlockDataPlane {
     Error *blocker;
     void (*saved_complete_request)(struct VirtIOBlockReq *req,
                                    unsigned char status);
+
+    VirtIOBlockReq  reqs[REQ_POOL_SZ];
+    void *free_reqs[REQ_POOL_SZ];
+    ObjPool  req_pool;
 };
 
 /* Raise an interrupt to signal guest, if necessary */
@@ -238,6 +244,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
         return;
     }
 
+    vblk->obj_pool = &s->req_pool;
+    obj_pool_init(vblk->obj_pool, s->reqs, s->free_reqs,
+                  sizeof(VirtIOBlockReq), REQ_POOL_SZ);
+
     /* Set up guest notifier (irq) */
     if (k->set_guest_notifiers(qbus->parent, 1, true) != 0) {
         fprintf(stderr, "virtio-blk failed to set guest notifier, "
@@ -298,6 +308,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
 
     aio_context_release(s->ctx);
 
+    vblk->obj_pool = NULL;
+
     if (s->raw_format) {
         qemu_aio_set_bypass_co(s->ctx, false);
     }
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index c241c50..2a11bc4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -31,7 +31,11 @@
 
 VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
 {
-    VirtIOBlockReq *req = g_slice_new(VirtIOBlockReq);
+    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
+
+    if (!req) {
+        req = g_slice_new(VirtIOBlockReq);
+    }
     req->dev = s;
     req->qiov.size = 0;
     req->next = NULL;
@@ -41,7 +45,11 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
 void virtio_blk_free_request(VirtIOBlockReq *req)
 {
     if (req) {
-        g_slice_free(VirtIOBlockReq, req);
+        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
+            obj_pool_put(req->dev->obj_pool, req);
+        } else {
+            g_slice_free(VirtIOBlockReq, req);
+        }
     }
 }
 
@@ -801,6 +809,7 @@ static void virtio_blk_instance_init(Object *obj)
 {
     VirtIOBlock *s = VIRTIO_BLK(obj);
 
+    s->obj_pool = NULL;
     object_property_add_link(obj, "iothread", TYPE_IOTHREAD,
                              (Object **)&s->blk.iothread,
                              qdev_prop_allow_set_link_before_realize,
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index afb7b8d..49ac234 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -18,6 +18,7 @@
 #include "hw/block/block.h"
 #include "sysemu/iothread.h"
 #include "block/block.h"
+#include "qemu/obj_pool.h"
 
 #define TYPE_VIRTIO_BLK "virtio-blk-device"
 #define VIRTIO_BLK(obj) \
@@ -135,6 +136,7 @@ typedef struct VirtIOBlock {
     Notifier migration_state_notifier;
     struct VirtIOBlockDataPlane *dataplane;
 #endif
+    ObjPool *obj_pool;
 } VirtIOBlock;
 
 typedef struct MultiReqBuffer {
-- 
1.7.9.5


* [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (6 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 13:51   ` Paolo Bonzini
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch Ming Lei
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

VirtQueueElement is used in a performance-sensitive path, so cut
its size roughly in half to speed up its allocation and decrease the
memory footprint of the dataplane I/O path.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/net/virtio-net.c         |    4 +++-
 hw/virtio/dataplane/vring.c |   23 ++++++++++++++++++-----
 hw/virtio/virtio.c          |   23 ++++++++++++++++-------
 include/hw/virtio/virtio.h  |   13 +++++++++----
 4 files changed, 46 insertions(+), 17 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 268eff9..c842697 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1010,7 +1010,7 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
     while (offset < size) {
         VirtQueueElement elem;
         int len, total;
-        const struct iovec *sg = elem.in_sg;
+        const struct iovec *sg;
 
         total = 0;
 
@@ -1025,6 +1025,8 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
             exit(1);
         }
 
+        sg = elem.in_sg;
+
         if (elem.in_num < 1) {
             error_report("virtio-net receive queue contains no in buffers");
             exit(1);
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 67cb2b8..502c999 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -164,12 +164,12 @@ static int get_desc(Vring *vring, VirtQueueElement *elem,
 
     if (desc->flags & VRING_DESC_F_WRITE) {
         num = &elem->in_num;
-        iov = &elem->in_sg[*num];
-        addr = &elem->in_addr[*num];
+        iov = &elem->sg[elem->num];
+        addr = &elem->addr[elem->num];
     } else {
         num = &elem->out_num;
-        iov = &elem->out_sg[*num];
-        addr = &elem->out_addr[*num];
+        iov = &elem->sg[elem->num];
+        addr = &elem->addr[elem->num];
 
         /* If it's an output descriptor, they're all supposed
          * to come before any input descriptors. */
@@ -198,6 +198,7 @@ static int get_desc(Vring *vring, VirtQueueElement *elem,
     iov->iov_len = desc->len;
     *addr = desc->addr;
     *num += 1;
+    elem->num++;
     return 0;
 }
 
@@ -289,6 +290,15 @@ static void vring_unmap_element(VirtQueueElement *elem)
     }
 }
 
+static void update_elem(VirtQueueElement *elem)
+{
+    elem->out_sg = &elem->sg[0];
+    elem->out_addr = &elem->addr[0];
+
+    elem->in_sg = &elem->sg[elem->out_num];
+    elem->in_addr = &elem->addr[elem->out_num];
+}
+
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
@@ -309,7 +319,7 @@ int vring_pop(VirtIODevice *vdev, Vring *vring,
     int ret;
 
     /* Initialize elem so it can be safely unmapped */
-    elem->in_num = elem->out_num = 0;
+    elem->num = elem->in_num = elem->out_num = 0;
 
     /* If there was a fatal error then refuse operation */
     if (vring->broken) {
@@ -389,11 +399,14 @@ int vring_pop(VirtIODevice *vdev, Vring *vring,
         i = desc.next;
     } while (desc.flags & VRING_DESC_F_NEXT);
 
+    update_elem(elem);
+
     /* On success, increment avail index. */
     vring->last_avail_idx++;
     return head;
 
 out:
+    update_elem(elem);
     assert(ret < 0);
     if (ret == -EFAULT) {
         vring->broken = true;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5c98180..db42cd0 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -463,7 +463,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
         return 0;
 
     /* When we start there are none of either input nor output. */
-    elem->out_num = elem->in_num = 0;
+    elem->num = elem->out_num = elem->in_num = 0;
 
     max = vq->vring.num;
 
@@ -489,21 +489,25 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
         struct iovec *sg;
 
         if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-            if (elem->in_num >= ARRAY_SIZE(elem->in_sg)) {
+            if (elem->in_num >= ARRAY_SIZE(elem->sg)) {
                 error_report("Too many write descriptors in indirect table");
                 exit(1);
             }
-            elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->in_sg[elem->in_num++];
+            elem->addr[elem->num] = vring_desc_addr(vdev, desc_pa, i);
+            sg = &elem->sg[elem->num];
+            elem->in_num++;
         } else {
-            if (elem->out_num >= ARRAY_SIZE(elem->out_sg)) {
+            if (elem->out_num >= ARRAY_SIZE(elem->sg)) {
                 error_report("Too many read descriptors in indirect table");
                 exit(1);
             }
-            elem->out_addr[elem->out_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->out_sg[elem->out_num++];
+            elem->addr[elem->num] = vring_desc_addr(vdev, desc_pa, i);
+            sg = &elem->sg[elem->num];
+            elem->out_num++;
         }
 
+        elem->num++;
+
         sg->iov_len = vring_desc_len(vdev, desc_pa, i);
 
         /* If we've got too many, that implies a descriptor loop. */
@@ -513,6 +517,11 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
         }
     } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
 
+    elem->out_sg = &elem->sg[0];
+    elem->out_addr = &elem->addr[0];
+    elem->in_sg = &elem->sg[elem->out_num];
+    elem->in_addr = &elem->addr[elem->out_num];
+
     /* Now map what we have collected */
     virtqueue_map_sg(elem->in_sg, elem->in_addr, elem->in_num, 1);
     virtqueue_map_sg(elem->out_sg, elem->out_addr, elem->out_num, 0);
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index a60104c..943e72f 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -84,12 +84,17 @@ typedef struct VirtQueue VirtQueue;
 typedef struct VirtQueueElement
 {
     unsigned int index;
+    unsigned int num;
     unsigned int out_num;
     unsigned int in_num;
-    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
-    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
-    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
+
+    hwaddr *in_addr;
+    hwaddr *out_addr;
+    struct iovec *in_sg;
+    struct iovec *out_sg;
+
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec sg[VIRTQUEUE_MAX_SIZE];
 } VirtQueueElement;
 
 #define VIRTIO_PCI_QUEUE_MAX 64
-- 
1.7.9.5


* [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (7 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 13:59   ` Paolo Bonzini
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256 Ming Lei
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

In the enqueue path, we can't complete requests, otherwise a
"Co-routine re-entered recursively" error may be triggered, so this
patch fixes the issue with the following ideas:

	- for -EAGAIN or partial completion, retry the submission from
	a newly introduced event handler
	- for partial completion, also update the io queue
	- for any other failure, return the failure if in the enqueue
	path; otherwise, abort all queued I/O

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/linux-aio.c |   90 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 68 insertions(+), 22 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 7ac7e8c..5eb9c92 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -51,6 +51,7 @@ struct qemu_laio_state {
 
     /* io queue for submit at batch */
     LaioQueue io_q;
+    EventNotifier retry;      /* handle -EAGAIN and partial completion */
 };
 
 static inline ssize_t io_event_ret(struct io_event *ev)
@@ -154,45 +155,80 @@ static void ioq_init(LaioQueue *io_q)
     io_q->plugged = 0;
 }
 
-static int ioq_submit(struct qemu_laio_state *s)
+static void abort_queue(struct qemu_laio_state *s)
+{
+    int i;
+    for (i = 0; i < s->io_q.idx; i++) {
+        struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+                                                  struct qemu_laiocb,
+                                                  iocb);
+        laiocb->ret = -EIO;
+        qemu_laio_process_completion(s, laiocb);
+    }
+}
+
+static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
 {
     int ret, i = 0;
     int len = s->io_q.idx;
+    int j = 0;
 
-    do {
-        ret = io_submit(s->ctx, len, s->io_q.iocbs);
-    } while (i++ < 3 && ret == -EAGAIN);
+    if (!len) {
+        return 0;
+    }
 
-    /* empty io queue */
-    s->io_q.idx = 0;
+    ret = io_submit(s->ctx, len, s->io_q.iocbs);
+    if (ret == -EAGAIN) {
+        event_notifier_set(&s->retry);
+        return 0;
+    } else if (ret < 0) {
+        if (enqueue) {
+            return ret;
+        }
 
-    if (ret < 0) {
-        i = 0;
-    } else {
-        i = ret;
+        /* in non-queue path, all IOs have to be completed */
+        abort_queue(s);
+        ret = len;
+    } else if (ret == 0) {
+        goto out;
     }
 
-    for (; i < len; i++) {
-        struct qemu_laiocb *laiocb =
-            container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
-
-        laiocb->ret = (ret < 0) ? ret : -EIO;
-        qemu_laio_process_completion(s, laiocb);
+    for (i = ret; i < len; i++) {
+        s->io_q.iocbs[j++] = s->io_q.iocbs[i];
     }
+
+ out:
+    /* update io queue */
+    s->io_q.idx -= ret;
+
     return ret;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static void ioq_submit_retry(EventNotifier *e)
+{
+    struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, retry);
+
+    event_notifier_test_and_clear(e);
+    ioq_submit(s, false);
+}
+
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
     unsigned int idx = s->io_q.idx;
 
+    if (unlikely(idx == s->io_q.size)) {
+        return -1;
+    }
+
     s->io_q.iocbs[idx++] = iocb;
     s->io_q.idx = idx;
 
-    /* submit immediately if queue is full */
-    if (idx == s->io_q.size) {
-        ioq_submit(s);
+    /* submit immediately if queue depth is above 2/3 */
+    if (idx > s->io_q.size * 2 / 3) {
+        return ioq_submit(s, true);
     }
+
+    return 0;
 }
 
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
@@ -214,7 +250,7 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
     }
 
     if (s->io_q.idx > 0) {
-        ret = ioq_submit(s);
+        ret = ioq_submit(s, false);
     }
 
     return ret;
@@ -258,7 +294,9 @@ BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
             goto out_free_aiocb;
         }
     } else {
-        ioq_enqueue(s, iocbs);
+        if (ioq_enqueue(s, iocbs) < 0) {
+            goto out_free_aiocb;
+        }
     }
     return &laiocb->common;
 
@@ -272,6 +310,7 @@ void laio_detach_aio_context(void *s_, AioContext *old_context)
     struct qemu_laio_state *s = s_;
 
     aio_set_event_notifier(old_context, &s->e, NULL);
+    aio_set_event_notifier(old_context, &s->retry, NULL);
 }
 
 void laio_attach_aio_context(void *s_, AioContext *new_context)
@@ -279,6 +318,7 @@ void laio_attach_aio_context(void *s_, AioContext *new_context)
     struct qemu_laio_state *s = s_;
 
     aio_set_event_notifier(new_context, &s->e, qemu_laio_completion_cb);
+    aio_set_event_notifier(new_context, &s->retry, ioq_submit_retry);
 }
 
 void *laio_init(void)
@@ -295,9 +335,14 @@ void *laio_init(void)
     }
 
     ioq_init(&s->io_q);
+    if (event_notifier_init(&s->retry, false) < 0) {
+        goto out_notifer_init;
+    }
 
     return s;
 
+out_notifer_init:
+    io_destroy(s->ctx);
 out_close_efd:
     event_notifier_cleanup(&s->e);
 out_free_state:
@@ -310,6 +355,7 @@ void laio_cleanup(void *s_)
     struct qemu_laio_state *s = s_;
 
     event_notifier_cleanup(&s->e);
+    event_notifier_cleanup(&s->retry);
 
     if (io_destroy(s->ctx) != 0) {
         fprintf(stderr, "%s: destroy AIO context %p failed\n",
-- 
1.7.9.5


* [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (8 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 12:15   ` Eric Blake
  2014-07-30 14:00   ` Paolo Bonzini
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 11/15] linux-aio: remove 'node' from 'struct qemu_laiocb' Ming Lei
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch increases the maximum number of events to 256 for the
coming virtio-blk multi virtqueue support.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/linux-aio.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 5eb9c92..c06a57d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -23,7 +23,7 @@
  *      than this we will get EAGAIN from io_submit which is communicated to
  *      the guest as an I/O error.
  */
-#define MAX_EVENTS 128
+#define MAX_EVENTS 256
 
 #define MAX_QUEUED_IO  128
 
-- 
1.7.9.5


* [Qemu-devel] [PATCH 11/15] linux-aio: remove 'node' from 'struct qemu_laiocb'
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (9 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256 Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 12/15] hw/virtio-pci: introduce num_queues property Ming Lei
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

No one uses the 'node' field any more, so remove it from
'struct qemu_laiocb'; this saves 16 bytes in the struct on a
64-bit arch.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/linux-aio.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index c06a57d..337f879 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
     size_t nbytes;
     QEMUIOVector *qiov;
     bool is_read;
-    QLIST_ENTRY(qemu_laiocb) node;
 };
 
 typedef struct {
-- 
1.7.9.5


* [Qemu-devel] [PATCH 12/15] hw/virtio-pci: introduce num_queues property
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (10 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 11/15] linux-aio: remove 'node' from 'struct qemu_laiocb' Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 13/15] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ Ming Lei
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch introduces the 'num_queues' parameter, in preparation
for supporting multiple virtqueues in virtio-blk.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/virtio-blk.c          |    1 +
 include/hw/virtio/virtio-blk.h |    1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 2a11bc4..ab99156 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -826,6 +826,7 @@ static Property virtio_blk_properties[] = {
 #endif
 #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
     DEFINE_PROP_BIT("x-data-plane", VirtIOBlock, blk.data_plane, 0, false),
+    DEFINE_PROP_UINT32("num_queues", VirtIOBlock, blk.num_queues, 1),
 #endif
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 49ac234..45f8894 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -114,6 +114,7 @@ struct VirtIOBlkConf
     uint32_t scsi;
     uint32_t config_wce;
     uint32_t data_plane;
+    uint32_t num_queues;
 };
 
 struct VirtIOBlockDataPlane;
-- 
1.7.9.5


* [Qemu-devel] [PATCH 13/15] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (11 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 12/15] hw/virtio-pci: introduce num_queues property Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled Ming Lei
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

Prepare for supporting multiple virtqueues per virtio-blk device.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/hw/virtio/virtio-blk.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 45f8894..ad70c9a 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -42,6 +42,12 @@
 #define VIRTIO_BLK_F_TOPOLOGY   10      /* Topology information is available */
 #define VIRTIO_BLK_F_CONFIG_WCE 11      /* write cache configurable */
 
+/*
+ * support multi vqs, and virtio_blk_config.num_queues is only
+ * available when this feature is enabled
+ */
+#define VIRTIO_BLK_F_MQ		12
+
 #define VIRTIO_BLK_ID_BYTES     20      /* ID string length */
 
 struct virtio_blk_config
@@ -58,6 +64,8 @@ struct virtio_blk_config
     uint16_t min_io_size;
     uint32_t opt_io_size;
     uint8_t wce;
+    uint8_t unused;
+    uint16_t num_queues;	/* must be at the end */
 } QEMU_PACKED;
 
 /* These two define direction. */
-- 
1.7.9.5


* [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (12 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 13/15] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 14:01   ` Paolo Bonzini
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 15/15] dataplane: virtio-blk: support multi virtqueue Ming Lei
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

For now, multiple virtqueues are only supported in the dataplane case.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/virtio-blk.c          |   18 +++++++++++++++++-
 include/hw/virtio/virtio-blk.h |    1 +
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index ab99156..44e9956 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -556,6 +556,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
     blkcfg.physical_block_exp = get_physical_block_exp(s->conf);
     blkcfg.alignment_offset = 0;
     blkcfg.wce = bdrv_enable_write_cache(s->bs);
+    stw_p(&blkcfg.num_queues, s->blk.num_queues);
     memcpy(config, &blkcfg, sizeof(struct virtio_blk_config));
 }
 
@@ -590,6 +591,12 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
     if (bdrv_is_read_only(s->bs))
         features |= 1 << VIRTIO_BLK_F_RO;
 
+#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
+    if (s->blk.num_queues > 1) {
+        features |= 1 << VIRTIO_BLK_F_MQ;
+    }
+#endif
+
     return features;
 }
 
@@ -739,8 +746,13 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
     Error *err = NULL;
 #endif
+    int i;
     static int virtio_blk_id;
 
+#ifndef CONFIG_VIRTIO_BLK_DATA_PLANE
+    blk->num_queues = 1;
+#endif
+
     if (!blk->conf.bs) {
         error_setg(errp, "drive property not set");
         return;
@@ -765,7 +777,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     s->rq = NULL;
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
 
-    s->vq = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
+    s->vqs = g_malloc0(sizeof(VirtQueue *) * blk->num_queues);
+    for (i = 0; i < blk->num_queues; i++)
+        s->vqs[i] = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
+    s->vq = s->vqs[0];
     s->complete_request = virtio_blk_complete_request;
 #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
     virtio_blk_data_plane_create(vdev, blk, &s->dataplane, &err);
@@ -802,6 +817,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev, Error **errp)
     qemu_del_vm_change_state_handler(s->change);
     unregister_savevm(dev, "virtio-blk", s);
     blockdev_mark_auto_del(s->bs);
+    g_free(s->vqs);
     virtio_cleanup(vdev);
 }
 
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index ad70c9a..b4609dd 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -132,6 +132,7 @@ typedef struct VirtIOBlock {
     VirtIODevice parent_obj;
     BlockDriverState *bs;
     VirtQueue *vq;
+    VirtQueue **vqs;
     void *rq;
     QEMUBH *bh;
     BlockConf *conf;
-- 
1.7.9.5


* [Qemu-devel] [PATCH 15/15] dataplane: virtio-blk: support multi virtqueue
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (13 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled Ming Lei
@ 2014-07-30 11:39 ` Ming Lei
  2014-07-30 12:42 ` [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Christian Borntraeger
  2014-08-04 10:16 ` Stefan Hajnoczi
  16 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-30 11:39 UTC (permalink / raw)
  To: qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Ming Lei, Fam Zheng, Michael S. Tsirkin

This patch supports handling host notifications from multiple
virtqueues, while still processing and submitting I/O in a
single iothread.

In my fio test in a VM hosted on the server, with "num_queues" set
to 4 and JOBS in the fio script set to 4, throughput improves by
30% compared with a single virtqueue at any JOBS value.

Compared with throughput, scalability improves even more,
for example:
	---------------------------------------------------
		| VM in server host, 4 virtqueues vs. 1	virtqueue
	---------------------------------------------------
	JOBS=2	| +14%
	---------------------------------------------------
	JOBS=4	| +83%
	---------------------------------------------------

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 hw/block/dataplane/virtio-blk.c |  209 ++++++++++++++++++++++++++++-----------
 include/hw/virtio/virtio-blk.h  |    1 +
 2 files changed, 153 insertions(+), 57 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 828fe99..bd66274 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -26,6 +26,11 @@
 
 #define REQ_POOL_SZ 128
 
+typedef struct {
+    EventNotifier notifier;
+    VirtIOBlockDataPlane *s;
+} VirtIOBlockNotifier;
+
 struct VirtIOBlockDataPlane {
     bool started;
     bool starting;
@@ -35,9 +40,10 @@ struct VirtIOBlockDataPlane {
     VirtIOBlkConf *blk;
 
     VirtIODevice *vdev;
-    Vring vring;                    /* virtqueue vring */
-    EventNotifier *guest_notifier;  /* irq */
-    QEMUBH *bh;                     /* bh for guest notification */
+    Vring *vring;                    /* virtqueue vring */
+    EventNotifier **guest_notifier;  /* irq */
+    uint64_t   pending_guest_notifier;  /* pending guest notifer for vq */
+    QEMUBH *bh;                      /* bh for guest notification */
 
     /* Note that these EventNotifiers are assigned by value.  This is
      * fine as long as you do not call event_notifier_cleanup on them
@@ -47,7 +53,9 @@ struct VirtIOBlockDataPlane {
     IOThread *iothread;
     IOThread internal_iothread_obj;
     AioContext *ctx;
-    EventNotifier host_notifier;    /* doorbell */
+    VirtIOBlockNotifier *host_notifier; /* doorbell */
+    uint64_t   pending_host_notifier;   /* pending host notifer for vq */
+    QEMUBH *host_notifier_bh;           /* for handle host notifier */
 
     /* Operation blocker on BDS */
     Error *blocker;
@@ -60,20 +68,26 @@ struct VirtIOBlockDataPlane {
 };
 
 /* Raise an interrupt to signal guest, if necessary */
-static void notify_guest(VirtIOBlockDataPlane *s)
+static void notify_guest(VirtIOBlockDataPlane *s, unsigned int qid)
 {
-    if (!vring_should_notify(s->vdev, &s->vring)) {
-        return;
+    if (vring_should_notify(s->vdev, &s->vring[qid])) {
+        event_notifier_set(s->guest_notifier[qid]);
     }
-
-    event_notifier_set(s->guest_notifier);
 }
 
 static void notify_guest_bh(void *opaque)
 {
     VirtIOBlockDataPlane *s = opaque;
+    unsigned int qid;
+    uint64_t pending = s->pending_guest_notifier;
+
+    s->pending_guest_notifier = 0;
 
-    notify_guest(s);
+    while ((qid = ffsl(pending))) {
+        qid--;
+        notify_guest(s, qid);
+        pending &= ~(1 << qid);
+    }
 }
 
 static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
@@ -81,7 +95,7 @@ static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
     VirtIOBlockDataPlane *s = req->dev->dataplane;
     stb_p(&req->in->status, status);
 
-    vring_push(&req->dev->dataplane->vring, &req->elem,
+    vring_push(&s->vring[req->qid], &req->elem,
                req->qiov.size + sizeof(*req->in));
 
     /* Suppress notification to guest by BH and its scheduled
@@ -90,17 +104,15 @@ static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
      * executed in dataplane aio context even after it is
      * stopped, so needn't worry about notification loss with BH.
      */
+    assert(req->qid < 64);
+    s->pending_guest_notifier |= (1 << req->qid);
     qemu_bh_schedule(s->bh);
 }
 
-static void handle_notify(EventNotifier *e)
+static void process_vq_notify(VirtIOBlockDataPlane *s, unsigned short qid)
 {
-    VirtIOBlockDataPlane *s = container_of(e, VirtIOBlockDataPlane,
-                                           host_notifier);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
 
-    event_notifier_test_and_clear(&s->host_notifier);
-    bdrv_io_plug(s->blk->conf.bs);
     for (;;) {
         MultiReqBuffer mrb = {
             .num_writes = 0,
@@ -108,12 +120,13 @@ static void handle_notify(EventNotifier *e)
         int ret;
 
         /* Disable guest->host notifies to avoid unnecessary vmexits */
-        vring_disable_notification(s->vdev, &s->vring);
+        vring_disable_notification(s->vdev, &s->vring[qid]);
 
         for (;;) {
             VirtIOBlockReq *req = virtio_blk_alloc_request(vblk);
 
-            ret = vring_pop(s->vdev, &s->vring, &req->elem);
+            req->qid = qid;
+            ret = vring_pop(s->vdev, &s->vring[qid], &req->elem);
             if (ret < 0) {
                 virtio_blk_free_request(req);
                 break; /* no more requests */
@@ -132,16 +145,48 @@ static void handle_notify(EventNotifier *e)
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
-            if (vring_enable_notification(s->vdev, &s->vring)) {
+            if (vring_enable_notification(s->vdev, &s->vring[qid])) {
                 break;
             }
         } else { /* fatal error */
             break;
         }
     }
+}
+
+static void process_notify(void *opaque)
+{
+    VirtIOBlockDataPlane *s = opaque;
+    unsigned int qid;
+    uint64_t pending = s->pending_host_notifier;
+
+    s->pending_host_notifier = 0;
+
+    bdrv_io_plug(s->blk->conf.bs);
+    while ((qid = ffsl(pending))) {
+        qid--;
+        process_vq_notify(s, qid);
+        pending &= ~(1 << qid);
+    }
     bdrv_io_unplug(s->blk->conf.bs);
 }
 
+/* TODO: handle requests from other vqs together */
+static void handle_notify(EventNotifier *e)
+{
+    VirtIOBlockNotifier *n = container_of(e, VirtIOBlockNotifier,
+		                          notifier);
+    VirtIOBlockDataPlane *s = n->s;
+    unsigned int qid = n - &s->host_notifier[0];
+
+    assert(qid < 64);
+
+    event_notifier_test_and_clear(e);
+
+    s->pending_host_notifier |= (1 << qid);
+    qemu_bh_schedule(s->host_notifier_bh);
+}
+
 /* Context: QEMU global mutex held */
 void virtio_blk_data_plane_create(VirtIODevice *vdev, VirtIOBlkConf *blk,
                                   VirtIOBlockDataPlane **dataplane,
@@ -197,6 +242,11 @@ void virtio_blk_data_plane_create(VirtIODevice *vdev, VirtIOBlkConf *blk,
     s->ctx = iothread_get_aio_context(s->iothread);
     s->bh = aio_bh_new(s->ctx, notify_guest_bh, s);
 
+    s->vring = g_new0(Vring, blk->num_queues);
+    s->guest_notifier = g_new(EventNotifier *, blk->num_queues);
+    s->host_notifier = g_new(VirtIOBlockNotifier, blk->num_queues);
+    s->host_notifier_bh = aio_bh_new(s->ctx, process_notify, s);
+
     error_setg(&s->blocker, "block device is in use by data plane");
     bdrv_op_block_all(blk->conf.bs, s->blocker);
 
@@ -217,16 +267,83 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
     error_free(s->blocker);
     object_unref(OBJECT(s->iothread));
     qemu_bh_delete(s->bh);
+    qemu_bh_delete(s->host_notifier_bh);
+    g_free(s->vring);
+    g_free(s->guest_notifier);
+    g_free(s->host_notifier);
     g_free(s);
 }
 
+static int pre_start_vq(VirtIOBlockDataPlane *s, BusState *qbus,
+                        VirtioBusClass *k)
+{
+    int i;
+    int num = s->blk->num_queues;
+    VirtQueue *vq[num];
+
+    for (i = 0; i < num; i++) {
+        vq[i] = virtio_get_queue(s->vdev, i);
+        if (!vring_setup(&s->vring[i], s->vdev, i)) {
+            return -1;
+        }
+    }
+
+    /* Set up guest notifier (irq) */
+    if (k->set_guest_notifiers(qbus->parent, num, true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set guest notifier, "
+                "ensure -enable-kvm is set\n");
+        exit(1);
+    }
+
+    for (i = 0; i < num; i++)
+        s->guest_notifier[i] = virtio_queue_get_guest_notifier(vq[i]);
+    s->pending_guest_notifier = 0;
+
+    /* Set up virtqueue notify */
+    for (i = 0; i < num; i++) {
+        if (k->set_host_notifier(qbus->parent, i, true) != 0) {
+            fprintf(stderr, "virtio-blk failed to set host notifier\n");
+            exit(1);
+        }
+        s->host_notifier[i].notifier = *virtio_queue_get_host_notifier(vq[i]);
+        s->host_notifier[i].s = s;
+    }
+    s->pending_host_notifier = 0;
+
+    return 0;
+}
+
+static void post_start_vq(VirtIOBlockDataPlane *s)
+{
+    int i;
+    int num = s->blk->num_queues;
+
+    for (i = 0; i < num; i++) {
+        VirtQueue *vq;
+        vq = virtio_get_queue(s->vdev, i);
+
+        /* Kick right away to begin processing requests already in vring */
+        event_notifier_set(virtio_queue_get_host_notifier(vq));
+    }
+
+    if (s->raw_format) {
+        qemu_aio_set_bypass_co(s->ctx, true);
+    }
+
+    /* Get this show started by hooking up our callbacks */
+    aio_context_acquire(s->ctx);
+    for (i = 0; i < num; i++)
+        aio_set_event_notifier(s->ctx, &s->host_notifier[i].notifier,
+                               handle_notify);
+    aio_context_release(s->ctx);
+}
+
 /* Context: QEMU global mutex held */
 void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
 {
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s->vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
-    VirtQueue *vq;
 
     if (s->started) {
         return;
@@ -238,51 +355,24 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
 
     s->starting = true;
 
-    vq = virtio_get_queue(s->vdev, 0);
-    if (!vring_setup(&s->vring, s->vdev, 0)) {
-        s->starting = false;
-        return;
-    }
-
     vblk->obj_pool = &s->req_pool;
     obj_pool_init(vblk->obj_pool, s->reqs, s->free_reqs,
                   sizeof(VirtIOBlockReq), REQ_POOL_SZ);
 
-    /* Set up guest notifier (irq) */
-    if (k->set_guest_notifiers(qbus->parent, 1, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set guest notifier, "
-                "ensure -enable-kvm is set\n");
-        exit(1);
-    }
-    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
-
-    /* Set up virtqueue notify */
-    if (k->set_host_notifier(qbus->parent, 0, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set host notifier\n");
-        exit(1);
-    }
-    s->host_notifier = *virtio_queue_get_host_notifier(vq);
-
     s->saved_complete_request = vblk->complete_request;
     vblk->complete_request = complete_request_vring;
 
+    if (pre_start_vq(s, qbus, k)) {
+        s->starting = false;
+        return;
+    }
+
     s->starting = false;
     s->started = true;
     trace_virtio_blk_data_plane_start(s);
 
     bdrv_set_aio_context(s->blk->conf.bs, s->ctx);
-
-    /* Kick right away to begin processing requests already in vring */
-    event_notifier_set(virtio_queue_get_host_notifier(vq));
-
-    if (s->raw_format) {
-        qemu_aio_set_bypass_co(s->ctx, true);
-    }
-
-    /* Get this show started by hooking up our callbacks */
-    aio_context_acquire(s->ctx);
-    aio_set_event_notifier(s->ctx, &s->host_notifier, handle_notify);
-    aio_context_release(s->ctx);
+    post_start_vq(s);
 }
 
 /* Context: QEMU global mutex held */
@@ -291,6 +381,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s->vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
+    int i;
+    int num = s->blk->num_queues;
     if (!s->started || s->stopping) {
         return;
     }
@@ -301,7 +393,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     aio_context_acquire(s->ctx);
 
     /* Stop notifications for new requests from guest */
-    aio_set_event_notifier(s->ctx, &s->host_notifier, NULL);
+    for (i = 0; i < num; i++)
+        aio_set_event_notifier(s->ctx, &s->host_notifier[i].notifier, NULL);
 
     /* Drain and switch bs back to the QEMU main loop */
     bdrv_set_aio_context(s->blk->conf.bs, qemu_get_aio_context());
@@ -316,12 +409,14 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     /* Sync vring state back to virtqueue so that non-dataplane request
      * processing can continue when we disable the host notifier below.
      */
-    vring_teardown(&s->vring, s->vdev, 0);
+    for (i = 0; i < num; i++)
+        vring_teardown(&s->vring[i], s->vdev, i);
 
-    k->set_host_notifier(qbus->parent, 0, false);
+    for (i = 0; i < num; i++)
+        k->set_host_notifier(qbus->parent, i, false);
 
     /* Clean up guest notifier (irq) */
-    k->set_guest_notifiers(qbus->parent, 1, false);
+    k->set_guest_notifiers(qbus->parent, num, false);
 
     s->started = false;
     s->stopping = false;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index b4609dd..80fc738 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -162,6 +162,7 @@ typedef struct VirtIOBlockReq {
     QEMUIOVector qiov;
     struct VirtIOBlockReq *next;
     BlockAcctCookie acct;
+    unsigned qid;
 } VirtIOBlockReq;
 
 VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256 Ming Lei
@ 2014-07-30 12:15   ` Eric Blake
  2014-07-30 14:00   ` Paolo Bonzini
  1 sibling, 0 replies; 71+ messages in thread
From: Eric Blake @ 2014-07-30 12:15 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

On 07/30/2014 05:39 AM, Ming Lei wrote:
> This patch increases max event to 256 for the comming

s/comming/coming/

> virtio-blk multi virtqueue support.
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  block/linux-aio.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index 5eb9c92..c06a57d 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -23,7 +23,7 @@
>   *      than this we will get EAGAIN from io_submit which is communicated to
>   *      the guest as an I/O error.
>   */
> -#define MAX_EVENTS 128
> +#define MAX_EVENTS 256
>  
>  #define MAX_QUEUED_IO  128
>  
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (14 preceding siblings ...)
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 15/15] dataplane: virtio-blk: support mutlti virtqueue Ming Lei
@ 2014-07-30 12:42 ` Christian Borntraeger
  2014-08-04 10:16 ` Stefan Hajnoczi
  16 siblings, 0 replies; 71+ messages in thread
From: Christian Borntraeger @ 2014-07-30 12:42 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Christian Ehrhardt, Fam Zheng, Michael S. Tsirkin

On 30/07/14 13:39, Ming Lei wrote:
> These patches bring up below 4 changes:
> 
>         - introduce selective coroutine bypass mechanism
>         for improving performance of virtio-blk dataplane with
>         raw format image
> 
>         - introduce object allocation pool and apply it to
>         virtio-blk dataplane for improving its performance
> 
>         - linux-aio changes: fixing for cases of -EAGAIN and partial
>         completion, increase max events to 256, and remove one unuseful
>         fields in 'struct qemu_laiocb'
> 
>         - support multi virtqueue for virtio-blk dataplane
> 
> The virtio-blk multi virtqueue feature will be added to virtio spec 1.1[1],
> and the 3.17 linux kernel[2] will support the feature in virtio-blk driver.
> For those who wants to play the stuff, the kernel side patche can be found
> in either Jens's block tree[3] or linux-next[4].
> 
> Below fio script running from VM is used for test improvement of these patches:
> 
>         [global]
>         direct=1
>         size=128G
>         bsrange=4k-4k
>         timeout=120
>         numjobs=${JOBS}
>         ioengine=libaio
>         iodepth=64
>         filename=/dev/vdc
>         group_reporting=1
> 
>         [f]
>         rw=randread
> 
> One quadcore VM(8G RAM) is created in below host to run above fio test:
> 
>         - server(16cores: 8 physical cores, 2 threads per physical core)
> 
> Follows the test result on throughput improvement(IOPS) with
> this patchset(4 virtqueues per virito-blk device) against QEMU
> 2.1.0-rc5: 30% throughput improvement can be observed, and
> scalability for parallel I/Os is improved more(80% throughput
> improvement is observed in case of 4 JOBS).
> 
> From above result, we can see both scalability and performance
> get improved a lot.
> 
> After commit 580b6b2aa2(dataplane: use the QEMU block
> layer for I/O), average time for submiting one single
> request has been increased a lot, as my trace, the average
> time taken for submiting one request has been doubled even
> though block plug&unplug mechanism is introduced to
> ease its effect. That is why this patchset introduces
> selective coroutine bypass mechanism and object allocation
> pool for saving the time first. Based on QEMU 2.0, only
> single virtio-blk dataplane multi virtqueue patch can get
> better improvement than current result[5].
> 
> TODO:
>         - optimize block layer for linux aio, so that
>         more time can be saved for submitting request
>         - support more than one aio-context for improving
>         virtio-blk performance
[...]
> 
> [1], http://marc.info/?l=linux-api&m=140486843317107&w=2
> [2], http://marc.info/?l=linux-api&m=140418368421229&w=2
> [3], http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/ #for-3.17/drivers
> [4], https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/
> [5], http://marc.info/?l=linux-api&m=140377573830230&w=2

FYI, I just tested with one virtqueue on s390 (3.15 as guest). It was just a quick sniff, but we are getting closer to the fio results that we had before commit 580b6b2aa2 (dataplane: use the QEMU block layer for I/O). I can't give proper numbers right now, as I am on a shared storage subsystem, but this looks like we are on the right track. I have not looked at the code, though.

Christian


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode Ming Lei
@ 2014-07-30 13:45   ` Paolo Bonzini
  2014-07-30 17:15     ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 13:45 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Il 30/07/2014 13:39, Ming Lei ha scritto:
> This patch introduces several APIs for supporting bypass qemu coroutine
> in case of being not necessary and for performance's sake.

No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
anything else is a step backwards.

If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
conditions that allow doing that and call the bdrv_aio_readv/writev
directly.

To begin with, have you benchmarked QEMU and can you provide a trace of
*where* the coroutine overhead lies?

Paolo

> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  include/block/coroutine.h     |    8 ++++++++
>  include/block/coroutine_int.h |    5 +++++
>  qemu-coroutine-lock.c         |    4 ++--
>  qemu-coroutine.c              |   33 +++++++++++++++++++++++++++++++++
>  4 files changed, 48 insertions(+), 2 deletions(-)
> 
> diff --git a/include/block/coroutine.h b/include/block/coroutine.h
> index b408f96..46d2642 100644
> --- a/include/block/coroutine.h
> +++ b/include/block/coroutine.h
> @@ -223,4 +223,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
>   * Note that this function clobbers the handlers for the file descriptor.
>   */
>  void coroutine_fn yield_until_fd_readable(int fd);
> +
> +/* qemu coroutine bypass APIs */
> +void qemu_coroutine_set_bypass(bool bypass);
> +bool qemu_coroutine_bypassed(Coroutine *self);
> +bool qemu_coroutine_self_bypassed(void);
> +void qemu_coroutine_set_var(void *var);
> +void *qemu_coroutine_get_var(void);
> +
>  #endif /* QEMU_COROUTINE_H */
> diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
> index f133d65..106d0b2 100644
> --- a/include/block/coroutine_int.h
> +++ b/include/block/coroutine_int.h
> @@ -39,6 +39,11 @@ struct Coroutine {
>      Coroutine *caller;
>      QSLIST_ENTRY(Coroutine) pool_next;
>  
> +    bool bypass;
> +
> +    /* only used in bypass mode */
> +    void *opaque;
> +
>      /* Coroutines that should be woken up when we yield or terminate */
>      QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
>      QTAILQ_ENTRY(Coroutine) co_queue_next;
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index e4860ae..7c69ff6 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -82,13 +82,13 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
>  
>  bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
>  {
> -    assert(qemu_in_coroutine());
> +    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
>      return qemu_co_queue_do_restart(queue, true);
>  }
>  
>  void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
>  {
> -    assert(qemu_in_coroutine());
> +    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
>      qemu_co_queue_do_restart(queue, false);
>  }
>  
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 4708521..0597ed9 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -137,3 +137,36 @@ void coroutine_fn qemu_coroutine_yield(void)
>      self->caller = NULL;
>      coroutine_swap(self, to);
>  }
> +
> +void qemu_coroutine_set_bypass(bool bypass)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    self->bypass = bypass;
> +}
> +
> +bool qemu_coroutine_bypassed(Coroutine *self)
> +{
> +    return self->bypass;
> +}
> +
> +bool qemu_coroutine_self_bypassed(void)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    return qemu_coroutine_bypassed(self);
> +}
> +
> +void qemu_coroutine_set_var(void *var)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    self->opaque = var;
> +}
> +
> +void *qemu_coroutine_get_var(void)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    return self->opaque;
> +}
> 


* Re: [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement Ming Lei
@ 2014-07-30 13:51   ` Paolo Bonzini
  2014-07-30 14:40     ` Michael S. Tsirkin
  2014-07-31  2:07     ` Ming Lei
  0 siblings, 2 replies; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 13:51 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Il 30/07/2014 13:39, Ming Lei ha scritto:
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index a60104c..943e72f 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -84,12 +84,17 @@ typedef struct VirtQueue VirtQueue;
>  typedef struct VirtQueueElement
>  {
>      unsigned int index;
> +    unsigned int num;
>      unsigned int out_num;
>      unsigned int in_num;
> -    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
> -    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> -    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> +
> +    hwaddr *in_addr;
> +    hwaddr *out_addr;
> +    struct iovec *in_sg;
> +    struct iovec *out_sg;
> +
> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec sg[VIRTQUEUE_MAX_SIZE];
>  } VirtQueueElement;
>  
>  #define VIRTIO_PCI_QUEUE_MAX 64
> -- 

since addr and sg aren't used directly, allocate them flexibly like

    char *p;
    VirtQueueElement *elem;
    total_size = ROUND_UP(sizeof(struct VirtQueueElement),
                          __alignof__(elem->addr[0]));
    addr_offset = total_size;
    total_size = ROUND_UP(total_size + num * sizeof(elem->addr[0]),
                          __alignof__(elem->sg[0]));
    sg_offset = total_size;
    total_size += num * sizeof(elem->sg[0]);

    elem = p = g_slice_alloc(total_size);
    elem->size = total_size;
    elem->in_addr = p + addr_offset;
    elem->out_addr = elem->in_addr + in_num;
    elem->in_sg = p + sg_offset;
    elem->out_sg = elem->in_sg + in_num;

...

    g_slice_free1(elem->size, elem);

The small size will help glib do slab-style allocation and should remove
the need for an object pool.

Paolo


* Re: [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch Ming Lei
@ 2014-07-30 13:59   ` Paolo Bonzini
  2014-07-30 17:32     ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 13:59 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Il 30/07/2014 13:39, Ming Lei ha scritto:
> In the enqueue path, we can't complete request, otherwise
> "Co-routine re-entered recursively" may be caused, so this
> patch fixes the issue with below ideas:
> 
> 	- for -EAGAIN or partial completion, retry the submission by
> 	an introduced event handler
> 	- for part of completion, also update the io queue
> 	- for other failure, return the failure if in enqueue path,
> 	otherwise, abort all queued I/O
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  block/linux-aio.c |   90 ++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 68 insertions(+), 22 deletions(-)
> 
> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index 7ac7e8c..5eb9c92 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -51,6 +51,7 @@ struct qemu_laio_state {
>  
>      /* io queue for submit at batch */
>      LaioQueue io_q;
> +    EventNotifier retry;      /* handle -EAGAIN and partial completion */
>  };
>  
>  static inline ssize_t io_event_ret(struct io_event *ev)
> @@ -154,45 +155,80 @@ static void ioq_init(LaioQueue *io_q)
>      io_q->plugged = 0;
>  }
>  
> -static int ioq_submit(struct qemu_laio_state *s)
> +static void abort_queue(struct qemu_laio_state *s)
> +{
> +    int i;
> +    for (i = 0; i < s->io_q.idx; i++) {
> +        struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
> +                                                  struct qemu_laiocb,
> +                                                  iocb);
> +        laiocb->ret = -EIO;
> +        qemu_laio_process_completion(s, laiocb);
> +    }
> +}
> +
> +static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
>  {
>      int ret, i = 0;
>      int len = s->io_q.idx;
> +    int j = 0;
>  
> -    do {
> -        ret = io_submit(s->ctx, len, s->io_q.iocbs);
> -    } while (i++ < 3 && ret == -EAGAIN);
> +    if (!len) {
> +        return 0;
> +    }
>  
> -    /* empty io queue */
> -    s->io_q.idx = 0;
> +    ret = io_submit(s->ctx, len, s->io_q.iocbs);
> +    if (ret == -EAGAIN) {
> +        event_notifier_set(&s->retry);

Retrying immediately (and just doing a couple of system calls to waste
time) is not an improvement.  The right place to retry is in
qemu_laio_completion_cb, after io_getevents has been called and
presumably the queue depth has decreased.

If !s->io_q.plugged but io_submit fails you can call ioq_enqueue and it
will just work.  Then you can only go to out_free_aiocb if the queue is
full (independent of the "plug" state).

Paolo

> +        return 0;
> +    } else if (ret < 0) {
> +        if (enqueue) {
> +            return ret;
> +        }
>  
> -    if (ret < 0) {
> -        i = 0;
> -    } else {
> -        i = ret;
> +        /* in non-queue path, all IOs have to be completed */
> +        abort_queue(s);
> +        ret = len;
> +    } else if (ret == 0) {
> +        goto out;
>      }
>  
> -    for (; i < len; i++) {
> -        struct qemu_laiocb *laiocb =
> -            container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
> -
> -        laiocb->ret = (ret < 0) ? ret : -EIO;
> -        qemu_laio_process_completion(s, laiocb);
> +    for (i = ret; i < len; i++) {
> +        s->io_q.iocbs[j++] = s->io_q.iocbs[i];
>      }
> +
> + out:
> +    /* update io queue */
> +    s->io_q.idx -= ret;
> +
>      return ret;
>  }
>  
> -static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
> +static void ioq_submit_retry(EventNotifier *e)
> +{
> +    struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, retry);
> +
> +    event_notifier_test_and_clear(e);
> +    ioq_submit(s, false);
> +}
> +
> +static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>  {
>      unsigned int idx = s->io_q.idx;
>  
> +    if (unlikely(idx == s->io_q.size)) {
> +        return -1;
> +    }
> +
>      s->io_q.iocbs[idx++] = iocb;
>      s->io_q.idx = idx;
>  
> -    /* submit immediately if queue is full */
> -    if (idx == s->io_q.size) {
> -        ioq_submit(s);
> +    /* submit immediately if queue depth is above 2/3 */
> +    if (idx > s->io_q.size * 2 / 3) {
> +        return ioq_submit(s, true);
>      }
> +
> +    return 0;
>  }
>  
>  void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
> @@ -214,7 +250,7 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
>      }
>  
>      if (s->io_q.idx > 0) {
> -        ret = ioq_submit(s);
> +        ret = ioq_submit(s, false);
>      }
>  
>      return ret;
> @@ -258,7 +294,9 @@ BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
>              goto out_free_aiocb;
>          }
>      } else {
> -        ioq_enqueue(s, iocbs);
> +        if (ioq_enqueue(s, iocbs) < 0) {
> +            goto out_free_aiocb;
> +        }
>      }
>      return &laiocb->common;
>  
> @@ -272,6 +310,7 @@ void laio_detach_aio_context(void *s_, AioContext *old_context)
>      struct qemu_laio_state *s = s_;
>  
>      aio_set_event_notifier(old_context, &s->e, NULL);
> +    aio_set_event_notifier(old_context, &s->retry, NULL);
>  }
>  
>  void laio_attach_aio_context(void *s_, AioContext *new_context)
> @@ -279,6 +318,7 @@ void laio_attach_aio_context(void *s_, AioContext *new_context)
>      struct qemu_laio_state *s = s_;
>  
>      aio_set_event_notifier(new_context, &s->e, qemu_laio_completion_cb);
> +    aio_set_event_notifier(new_context, &s->retry, ioq_submit_retry);
>  }
>  
>  void *laio_init(void)
> @@ -295,9 +335,14 @@ void *laio_init(void)
>      }
>  
>      ioq_init(&s->io_q);
> +    if (event_notifier_init(&s->retry, false) < 0) {
> +        goto out_notifer_init;
> +    }
>  
>      return s;
>  
> +out_notifer_init:
> +    io_destroy(s->ctx);
>  out_close_efd:
>      event_notifier_cleanup(&s->e);
>  out_free_state:
> @@ -310,6 +355,7 @@ void laio_cleanup(void *s_)
>      struct qemu_laio_state *s = s_;
>  
>      event_notifier_cleanup(&s->e);
> +    event_notifier_cleanup(&s->retry);
>  
>      if (io_destroy(s->ctx) != 0) {
>          fprintf(stderr, "%s: destroy AIO context %p failed\n",
> 


* Re: [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256 Ming Lei
  2014-07-30 12:15   ` Eric Blake
@ 2014-07-30 14:00   ` Paolo Bonzini
  2014-07-30 17:20     ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 14:00 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Il 30/07/2014 13:39, Ming Lei ha scritto:
> This patch increases max event to 256 for the comming
> virtio-blk multi virtqueue support.
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  block/linux-aio.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

What makes the new magic number less magic than the old one?

Paolo


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled Ming Lei
@ 2014-07-30 14:01   ` Paolo Bonzini
  2014-07-30 15:12     ` Michael S. Tsirkin
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 14:01 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Il 30/07/2014 13:39, Ming Lei ha scritto:
> Now we only support multi vqs for dataplane case.
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  hw/block/virtio-blk.c          |   18 +++++++++++++++++-
>  include/hw/virtio/virtio-blk.h |    1 +
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index ab99156..44e9956 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -556,6 +556,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>      blkcfg.physical_block_exp = get_physical_block_exp(s->conf);
>      blkcfg.alignment_offset = 0;
>      blkcfg.wce = bdrv_enable_write_cache(s->bs);
> +    stw_p(&blkcfg.num_queues, s->blk.num_queues);
>      memcpy(config, &blkcfg, sizeof(struct virtio_blk_config));
>  }
>  
> @@ -590,6 +591,12 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
>      if (bdrv_is_read_only(s->bs))
>          features |= 1 << VIRTIO_BLK_F_RO;
>  
> +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> +    if (s->blk.num_queues > 1) {
> +        features |= 1 << VIRTIO_BLK_F_MQ;
> +    }
> +#endif
> +
>      return features;
>  }
>  
> @@ -739,8 +746,13 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>  #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
>      Error *err = NULL;
>  #endif
> +    int i;
>      static int virtio_blk_id;
>  
> +#ifndef CONFIG_VIRTIO_BLK_DATA_PLANE
> +    blk->num_queues = 1;
> +#endif
> +
>      if (!blk->conf.bs) {
>          error_setg(errp, "drive property not set");
>          return;
> @@ -765,7 +777,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>      s->rq = NULL;
>      s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
>  
> -    s->vq = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
> +    s->vqs = g_malloc0(sizeof(VirtQueue *) * blk->num_queues);
> +    for (i = 0; i < blk->num_queues; i++)
> +        s->vqs[i] = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
> +    s->vq = s->vqs[0];
>      s->complete_request = virtio_blk_complete_request;
>  #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
>      virtio_blk_data_plane_create(vdev, blk, &s->dataplane, &err);
> @@ -802,6 +817,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev, Error **errp)
>      qemu_del_vm_change_state_handler(s->change);
>      unregister_savevm(dev, "virtio-blk", s);
>      blockdev_mark_auto_del(s->bs);
> +    g_free(s->vqs);
>      virtio_cleanup(vdev);
>  }
>  
> diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
> index ad70c9a..b4609dd 100644
> --- a/include/hw/virtio/virtio-blk.h
> +++ b/include/hw/virtio/virtio-blk.h
> @@ -132,6 +132,7 @@ typedef struct VirtIOBlock {
>      VirtIODevice parent_obj;
>      BlockDriverState *bs;
>      VirtQueue *vq;
> +    VirtQueue **vqs;
>      void *rq;
>      QEMUBH *bh;
>      BlockConf *conf;
> 

Dataplane must not be a change to the guest ABI.  If you implement this
feature you have to implement it for both dataplane and non-dataplane.

Paolo


* Re: [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-07-30 11:39 ` [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
@ 2014-07-30 14:14   ` Paolo Bonzini
  2014-07-30 15:09     ` Michael S. Tsirkin
  2014-07-31  3:22     ` Ming Lei
  0 siblings, 2 replies; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 14:14 UTC (permalink / raw)
  To: Ming Lei, qemu-devel, Peter Maydell, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Michael S. Tsirkin

Il 30/07/2014 13:39, Ming Lei ha scritto:
> g_slice_new(VirtIOBlockReq), its free pair and access the instance
> is a bit slow since sizeof(VirtIOBlockReq) takes more than 40KB,
> so use object pool to speed up its allocation and release.
> 
> With this patch, ~5% throughput improvement is observed in the VM
> based on server.
> 
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  hw/block/dataplane/virtio-blk.c |   12 ++++++++++++
>  hw/block/virtio-blk.c           |   13 +++++++++++--
>  include/hw/virtio/virtio-blk.h  |    2 ++
>  3 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 2093e4a..828fe99 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -24,6 +24,8 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "qom/object_interfaces.h"
>  
> +#define REQ_POOL_SZ 128
> +
>  struct VirtIOBlockDataPlane {
>      bool started;
>      bool starting;
> @@ -51,6 +53,10 @@ struct VirtIOBlockDataPlane {
>      Error *blocker;
>      void (*saved_complete_request)(struct VirtIOBlockReq *req,
>                                     unsigned char status);
> +
> +    VirtIOBlockReq  reqs[REQ_POOL_SZ];
> +    void *free_reqs[REQ_POOL_SZ];
> +    ObjPool  req_pool;
>  };
>  
>  /* Raise an interrupt to signal guest, if necessary */
> @@ -238,6 +244,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
>          return;
>      }
>  
> +    vblk->obj_pool = &s->req_pool;
> +    obj_pool_init(vblk->obj_pool, s->reqs, s->free_reqs,
> +                  sizeof(VirtIOBlockReq), REQ_POOL_SZ);
> +
>      /* Set up guest notifier (irq) */
>      if (k->set_guest_notifiers(qbus->parent, 1, true) != 0) {
>          fprintf(stderr, "virtio-blk failed to set guest notifier, "
> @@ -298,6 +308,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
>  
>      aio_context_release(s->ctx);
>  
> +    vblk->obj_pool = NULL;
> +
>      if (s->raw_format) {
>          qemu_aio_set_bypass_co(s->ctx, false);
>      }
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index c241c50..2a11bc4 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -31,7 +31,11 @@
>  
>  VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>  {
> -    VirtIOBlockReq *req = g_slice_new(VirtIOBlockReq);
> +    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
> +
> +    if (!req) {
> +        req = g_slice_new(VirtIOBlockReq);
> +    }
>      req->dev = s;
>      req->qiov.size = 0;
>      req->next = NULL;
> @@ -41,7 +45,11 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>  void virtio_blk_free_request(VirtIOBlockReq *req)
>  {
>      if (req) {
> -        g_slice_free(VirtIOBlockReq, req);
> +        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
> +            obj_pool_put(req->dev->obj_pool, req);
> +        } else {
> +            g_slice_free(VirtIOBlockReq, req);
> +        }
>      }
>  }
>  
> @@ -801,6 +809,7 @@ static void virtio_blk_instance_init(Object *obj)
>  {
>      VirtIOBlock *s = VIRTIO_BLK(obj);
>  
> +    s->obj_pool = NULL;
>      object_property_add_link(obj, "iothread", TYPE_IOTHREAD,
>                               (Object **)&s->blk.iothread,
>                               qdev_prop_allow_set_link_before_realize,
> diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
> index afb7b8d..49ac234 100644
> --- a/include/hw/virtio/virtio-blk.h
> +++ b/include/hw/virtio/virtio-blk.h
> @@ -18,6 +18,7 @@
>  #include "hw/block/block.h"
>  #include "sysemu/iothread.h"
>  #include "block/block.h"
> +#include "qemu/obj_pool.h"
>  
>  #define TYPE_VIRTIO_BLK "virtio-blk-device"
>  #define VIRTIO_BLK(obj) \
> @@ -135,6 +136,7 @@ typedef struct VirtIOBlock {
>      Notifier migration_state_notifier;
>      struct VirtIOBlockDataPlane *dataplane;
>  #endif
> +    ObjPool *obj_pool;
>  } VirtIOBlock;
>  
>  typedef struct MultiReqBuffer {
> 

The problem is that g_slice here is not using the slab-style allocator
because the object is larger than roughly 500 bytes.  One solution would
be to make virtqueue_pop/vring_pop allocate a VirtQueueElement of the
right size (and virtqueue_push/vring_push free it), as mentioned in the
review of patch 8.

However, I now remembered that VirtQueueElement is a mess because it's
serialized directly into the migration state. :(  So you basically
cannot change it without mucking with migration.  Please leave out patch
8 for now.

Let's use this object pool, however, please simplify the API, it should
be just:

void obj_pool_init(ObjPool *pool, unsigned obj_size, unsigned cnt);
void *obj_pool_alloc(ObjPool *pool);
void obj_pool_free(ObjPool *pool, void *obj);
void obj_pool_destroy(ObjPool *pool);

All allocations of the objs buffer, and all logic like

+    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
+
+    if (!req) {
+        req = g_slice_new(VirtIOBlockReq);
+    }

+        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
+            obj_pool_put(req->dev->obj_pool, req);
+        } else {
+            g_slice_free(VirtIOBlockReq, req);
+        }

should be part of the object pool, not its users.

Paolo
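Paolo's simplified four-function API can be fleshed out as a standalone sketch. This is not QEMU code: malloc/free stand in for g_slice_alloc/g_slice_free1, and the internal range check (named obj_pool_has_obj after Ming's patch) decides whether a pointer came from the preallocated slab or from the heap fallback, so callers never see that logic:

```c
#include <stdlib.h>

/* Hypothetical sketch of the simplified pool API.  Objects come from a
 * preallocated slab; when the pool is empty, obj_pool_alloc() falls back
 * to the heap (standing in for g_slice_alloc), and obj_pool_free() routes
 * each pointer back to wherever it came from. */
typedef struct ObjPool {
    char *objs;          /* slab of cnt * obj_size bytes */
    void **free_objs;    /* stack of free slots */
    unsigned obj_size;
    unsigned cnt;
    unsigned free_top;   /* number of entries on the free stack */
} ObjPool;

void obj_pool_init(ObjPool *pool, unsigned obj_size, unsigned cnt)
{
    unsigned i;
    pool->objs = malloc((size_t)obj_size * cnt);
    pool->free_objs = malloc(sizeof(void *) * cnt);
    pool->obj_size = obj_size;
    pool->cnt = cnt;
    pool->free_top = cnt;
    for (i = 0; i < cnt; i++) {
        pool->free_objs[i] = pool->objs + (size_t)i * obj_size;
    }
}

static int obj_pool_has_obj(ObjPool *pool, void *obj)
{
    char *p = obj;
    return p >= pool->objs &&
           p < pool->objs + (size_t)pool->obj_size * pool->cnt;
}

void *obj_pool_alloc(ObjPool *pool)
{
    if (pool->free_top > 0) {
        return pool->free_objs[--pool->free_top];
    }
    return malloc(pool->obj_size);   /* g_slice_alloc() in QEMU */
}

void obj_pool_free(ObjPool *pool, void *obj)
{
    if (obj_pool_has_obj(pool, obj)) {
        pool->free_objs[pool->free_top++] = obj;
    } else {
        free(obj);                   /* g_slice_free() in QEMU */
    }
}

void obj_pool_destroy(ObjPool *pool)
{
    free(pool->objs);
    free(pool->free_objs);
}
```

With this shape, virtio_blk_alloc_request/virtio_blk_free_request collapse to a single obj_pool_alloc/obj_pool_free call each.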

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-30 13:51   ` Paolo Bonzini
@ 2014-07-30 14:40     ` Michael S. Tsirkin
  2014-07-30 14:50       ` Paolo Bonzini
  2014-07-31  2:11       ` Ming Lei
  2014-07-31  2:07     ` Ming Lei
  1 sibling, 2 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2014-07-30 14:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi

On Wed, Jul 30, 2014 at 03:51:22PM +0200, Paolo Bonzini wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index a60104c..943e72f 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -84,12 +84,17 @@ typedef struct VirtQueue VirtQueue;
> >  typedef struct VirtQueueElement
> >  {
> >      unsigned int index;
> > +    unsigned int num;
> >      unsigned int out_num;
> >      unsigned int in_num;
> > -    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
> > -    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
> > -    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> > -    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> > +
> > +    hwaddr *in_addr;
> > +    hwaddr *out_addr;
> > +    struct iovec *in_sg;
> > +    struct iovec *out_sg;
> > +
> > +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> > +    struct iovec sg[VIRTQUEUE_MAX_SIZE];
> >  } VirtQueueElement;
> >  
> >  #define VIRTIO_PCI_QUEUE_MAX 64
> > -- 
> 
> since addr and sg aren't used directly, allocate them flexibly like
> 
>     char *p;
>     VirtQueueElement *elem;
>     total_size = ROUND_UP(sizeof(struct VirtQueueElement),
>                           __alignof__(elem->addr[0]));
>     addr_offset = total_size;
>     total_size = ROUND_UP(total_size + num * sizeof(elem->addr[0]),
>                           __alignof__(elem->sg[0]));
>     sg_offset = total_size;
>     total_size += num * sizeof(elem->sg[0]);
> 
>     elem = p = g_slice_alloc(total_size);
>     elem->size = total_size;
>     elem->in_addr = p + addr_offset;
>     elem->out_addr = elem->in_addr + in_num;
>     elem->in_sg = p + sg_offset;
>     elem->out_sg = elem->in_sg + in_num;
> 
> ...
> 
>     g_slice_free1(elem->size, elem);
> 
> The small size will help glib do slab-style allocation and should remove
> the need for an object pool.
> 
> Paolo

As long as you are relying on this, why not allocate in_sg/out_sg
separately?

In any case, a bunch of code needs to be audited to make sure
no one assumes we can always stick VIRTQUEUE_MAX_SIZE there.


But what really worries me is that this is an optimization patch
without any numbers accompanying it.
I would say split this out from your virtio-blk work
and work on it separately.

-- 
MST
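Paolo's single-allocation layout can be sketched as standalone, compilable C. Everything below is illustrative, not QEMU code: iovec_s and uint64_t stand in for struct iovec and hwaddr, calloc/free stand in for g_slice_alloc/g_slice_free1, and ROUND_UP mimics the QEMU macro:

```c
#include <stdint.h>
#include <stdlib.h>

#define ROUND_UP(n, d) (((n) + (d) - 1) / (d) * (d))

struct iovec_s { void *base; size_t len; };   /* stand-in for struct iovec */

typedef struct VQElem {
    unsigned int index;
    size_t size;                 /* total allocation size, kept for freeing */
    unsigned int in_num, out_num;
    uint64_t *in_addr, *out_addr;
    struct iovec_s *in_sg, *out_sg;
} VQElem;

/* One block holds the header, then (in_num + out_num) addresses, then the
 * same number of iovecs; the pointers are fixed up to point inside it. */
VQElem *vq_elem_alloc(unsigned in_num, unsigned out_num)
{
    unsigned num = in_num + out_num;
    size_t total = ROUND_UP(sizeof(VQElem), __alignof__(uint64_t));
    size_t addr_off = total;
    total = ROUND_UP(total + num * sizeof(uint64_t),
                     __alignof__(struct iovec_s));
    size_t sg_off = total;
    total += num * sizeof(struct iovec_s);

    char *p = calloc(1, total);              /* g_slice_alloc() in QEMU */
    VQElem *elem = (VQElem *)p;
    elem->size = total;
    elem->in_num = in_num;
    elem->out_num = out_num;
    elem->in_addr = (uint64_t *)(p + addr_off);
    elem->out_addr = elem->in_addr + in_num;
    elem->in_sg = (struct iovec_s *)(p + sg_off);
    elem->out_sg = elem->in_sg + in_num;
    return elem;
}

void vq_elem_free(VQElem *elem)
{
    free(elem);                  /* g_slice_free1(elem->size, elem) in QEMU */
}
```

For a small num the whole element fits well under glib's slab threshold, which is the point of the suggestion.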


* Re: [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-30 14:40     ` Michael S. Tsirkin
@ 2014-07-30 14:50       ` Paolo Bonzini
  2014-07-31  2:11       ` Ming Lei
  1 sibling, 0 replies; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 14:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi

Il 30/07/2014 16:40, Michael S. Tsirkin ha scritto:
> As long as you relying on this, why not allocate in_sg/out_sg
> separately?

Because allocations can be expensive.

> In any case, a bunch of code needs to be audited to make sure
> no one assumes we can always stick VIRTQUEUE_MAX_SIZE there.
> 
> But what really worries me is this is an optimization patch
> without any numbers accompanying it.

I agree.

> I would say split this out from your virtio blk work,
> work on this separately.

This, too.

Paolo


* Re: [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-07-30 14:14   ` Paolo Bonzini
@ 2014-07-30 15:09     ` Michael S. Tsirkin
  2014-07-31  3:22     ` Ming Lei
  1 sibling, 0 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2014-07-30 15:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi

On Wed, Jul 30, 2014 at 04:14:09PM +0200, Paolo Bonzini wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
> > g_slice_new(VirtIOBlockReq), its free pair and access the instance
> > is a bit slow since sizeof(VirtIOBlockReq) takes more than 40KB,
> > so use object pool to speed up its allocation and release.
> > 
> > With this patch, ~5% throughput improvement is observed in the VM
> > based on server.
> > 
> > Signed-off-by: Ming Lei <ming.lei@canonical.com>
> > ---
> >  hw/block/dataplane/virtio-blk.c |   12 ++++++++++++
> >  hw/block/virtio-blk.c           |   13 +++++++++++--
> >  include/hw/virtio/virtio-blk.h  |    2 ++
> >  3 files changed, 25 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> > index 2093e4a..828fe99 100644
> > --- a/hw/block/dataplane/virtio-blk.c
> > +++ b/hw/block/dataplane/virtio-blk.c
> > @@ -24,6 +24,8 @@
> >  #include "hw/virtio/virtio-bus.h"
> >  #include "qom/object_interfaces.h"
> >  
> > +#define REQ_POOL_SZ 128
> > +
> >  struct VirtIOBlockDataPlane {
> >      bool started;
> >      bool starting;
> > @@ -51,6 +53,10 @@ struct VirtIOBlockDataPlane {
> >      Error *blocker;
> >      void (*saved_complete_request)(struct VirtIOBlockReq *req,
> >                                     unsigned char status);
> > +
> > +    VirtIOBlockReq  reqs[REQ_POOL_SZ];
> > +    void *free_reqs[REQ_POOL_SZ];
> > +    ObjPool  req_pool;
> >  };
> >  
> >  /* Raise an interrupt to signal guest, if necessary */
> > @@ -238,6 +244,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
> >          return;
> >      }
> >  
> > +    vblk->obj_pool = &s->req_pool;
> > +    obj_pool_init(vblk->obj_pool, s->reqs, s->free_reqs,
> > +                  sizeof(VirtIOBlockReq), REQ_POOL_SZ);
> > +
> >      /* Set up guest notifier (irq) */
> >      if (k->set_guest_notifiers(qbus->parent, 1, true) != 0) {
> >          fprintf(stderr, "virtio-blk failed to set guest notifier, "
> > @@ -298,6 +308,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
> >  
> >      aio_context_release(s->ctx);
> >  
> > +    vblk->obj_pool = NULL;
> > +
> >      if (s->raw_format) {
> >          qemu_aio_set_bypass_co(s->ctx, false);
> >      }
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index c241c50..2a11bc4 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -31,7 +31,11 @@
> >  
> >  VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
> >  {
> > -    VirtIOBlockReq *req = g_slice_new(VirtIOBlockReq);
> > +    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
> > +
> > +    if (!req) {
> > +        req = g_slice_new(VirtIOBlockReq);
> > +    }
> >      req->dev = s;
> >      req->qiov.size = 0;
> >      req->next = NULL;
> > @@ -41,7 +45,11 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
> >  void virtio_blk_free_request(VirtIOBlockReq *req)
> >  {
> >      if (req) {
> > -        g_slice_free(VirtIOBlockReq, req);
> > +        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
> > +            obj_pool_put(req->dev->obj_pool, req);
> > +        } else {
> > +            g_slice_free(VirtIOBlockReq, req);
> > +        }
> >      }
> >  }
> >  
> > @@ -801,6 +809,7 @@ static void virtio_blk_instance_init(Object *obj)
> >  {
> >      VirtIOBlock *s = VIRTIO_BLK(obj);
> >  
> > +    s->obj_pool = NULL;
> >      object_property_add_link(obj, "iothread", TYPE_IOTHREAD,
> >                               (Object **)&s->blk.iothread,
> >                               qdev_prop_allow_set_link_before_realize,
> > diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
> > index afb7b8d..49ac234 100644
> > --- a/include/hw/virtio/virtio-blk.h
> > +++ b/include/hw/virtio/virtio-blk.h
> > @@ -18,6 +18,7 @@
> >  #include "hw/block/block.h"
> >  #include "sysemu/iothread.h"
> >  #include "block/block.h"
> > +#include "qemu/obj_pool.h"
> >  
> >  #define TYPE_VIRTIO_BLK "virtio-blk-device"
> >  #define VIRTIO_BLK(obj) \
> > @@ -135,6 +136,7 @@ typedef struct VirtIOBlock {
> >      Notifier migration_state_notifier;
> >      struct VirtIOBlockDataPlane *dataplane;
> >  #endif
> > +    ObjPool *obj_pool;
> >  } VirtIOBlock;
> >  
> >  typedef struct MultiReqBuffer {
> > 
> 
> The problem is that g_slice here is not using the slab-style allocator
> because the object is larger than roughly 500 bytes.  One solution would
> be to make virtqueue_pop/vring_pop allocate a VirtQueueElement of the
> right size (and virtqueue_push/vring_push free it), as mentioned in the
> review of patch 8.
> 
> However, I now remembered that VirtQueueElement is a mess because it's
> serialized directly into the migration state. :(

BTW, this needs to be fixed anyway.

> So you basically
> cannot change it without mucking with migration.  Please leave out patch
> 8 for now.
> 
> Let's use this object pool, however, please simplify the API, it should
> be just:
> 
> void obj_pool_init(ObjPool *pool, unsigned obj_size, unsigned cnt);
> void *obj_pool_alloc(ObjPool *pool);
> void obj_pool_free(ObjPool *pool, void *obj);
> void obj_pool_destroy(ObjPool *pool);
> 
> All allocations of the objs buffer, and all logic like
> 
> +    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
> +
> +    if (!req) {
> +        req = g_slice_new(VirtIOBlockReq);
> +    }
> 
> +        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
> +            obj_pool_put(req->dev->obj_pool, req);
> +        } else {
> +            g_slice_free(VirtIOBlockReq, req);
> +        }
> 
> should be part of the object pool, not its users.
> 
> Paolo


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-07-30 14:01   ` Paolo Bonzini
@ 2014-07-30 15:12     ` Michael S. Tsirkin
  2014-07-30 15:25       ` Paolo Bonzini
  0 siblings, 1 reply; 71+ messages in thread
From: Michael S. Tsirkin @ 2014-07-30 15:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi

On Wed, Jul 30, 2014 at 04:01:59PM +0200, Paolo Bonzini wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
> > Now we only support multi vqs for dataplane case.
> > 
> > Signed-off-by: Ming Lei <ming.lei@canonical.com>
> > ---
> >  hw/block/virtio-blk.c          |   18 +++++++++++++++++-
> >  include/hw/virtio/virtio-blk.h |    1 +
> >  2 files changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index ab99156..44e9956 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -556,6 +556,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
> >      blkcfg.physical_block_exp = get_physical_block_exp(s->conf);
> >      blkcfg.alignment_offset = 0;
> >      blkcfg.wce = bdrv_enable_write_cache(s->bs);
> > +    stw_p(&blkcfg.num_queues, s->blk.num_queues);
> >      memcpy(config, &blkcfg, sizeof(struct virtio_blk_config));
> >  }
> >  
> > @@ -590,6 +591,12 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
> >      if (bdrv_is_read_only(s->bs))
> >          features |= 1 << VIRTIO_BLK_F_RO;
> >  
> > +#ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> > +    if (s->blk.num_queues > 1) {
> > +        features |= 1 << VIRTIO_BLK_F_MQ;
> > +    }
> > +#endif
> > +
> >      return features;
> >  }
> >  
> > @@ -739,8 +746,13 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
> >  #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> >      Error *err = NULL;
> >  #endif
> > +    int i;
> >      static int virtio_blk_id;
> >  
> > +#ifndef CONFIG_VIRTIO_BLK_DATA_PLANE
> > +    blk->num_queues = 1;
> > +#endif
> > +
> >      if (!blk->conf.bs) {
> >          error_setg(errp, "drive property not set");
> >          return;
> > @@ -765,7 +777,10 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
> >      s->rq = NULL;
> >      s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
> >  
> > -    s->vq = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
> > +    s->vqs = g_malloc0(sizeof(VirtQueue *) * blk->num_queues);
> > +    for (i = 0; i < blk->num_queues; i++)
> > +        s->vqs[i] = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
> > +    s->vq = s->vqs[0];
> >      s->complete_request = virtio_blk_complete_request;
> >  #ifdef CONFIG_VIRTIO_BLK_DATA_PLANE
> >      virtio_blk_data_plane_create(vdev, blk, &s->dataplane, &err);
> > @@ -802,6 +817,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev, Error **errp)
> >      qemu_del_vm_change_state_handler(s->change);
> >      unregister_savevm(dev, "virtio-blk", s);
> >      blockdev_mark_auto_del(s->bs);
> > +    g_free(s->vqs);
> >      virtio_cleanup(vdev);
> >  }
> >  
> > diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
> > index ad70c9a..b4609dd 100644
> > --- a/include/hw/virtio/virtio-blk.h
> > +++ b/include/hw/virtio/virtio-blk.h
> > @@ -132,6 +132,7 @@ typedef struct VirtIOBlock {
> >      VirtIODevice parent_obj;
> >      BlockDriverState *bs;
> >      VirtQueue *vq;
> > +    VirtQueue **vqs;
> >      void *rq;
> >      QEMUBH *bh;
> >      BlockConf *conf;
> > 
> 
> Dataplane must not be a change to the guest ABI.  If you implement this
> feature you have to implement it for both dataplane and non-dataplne.
> 
> Paolo

Actually, I think different backends at runtime should be allowed to
change the guest-visible ABI.  For example, if you give qemu a read-only
backend, this is guest visible, right?

However, relying on ifdefs is wrong.
It should be a runtime thing.

-- 
MST
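MST's point can be made concrete: gate the feature bit on a runtime num_queues value instead of CONFIG_VIRTIO_BLK_DATA_PLANE. A minimal sketch, assuming VIRTIO_BLK_F_MQ is feature bit 12 as in the virtio spec drafts; the helper name is made up and not the actual QEMU function:

```c
#include <stdint.h>

#define VIRTIO_BLK_F_MQ 12   /* assumed value, per virtio spec drafts */

/* Runtime gating: whether the MQ feature is offered depends only on the
 * configured number of queues, not on how QEMU was compiled. */
static uint32_t blk_get_features(uint32_t features, unsigned num_queues)
{
    if (num_queues > 1) {
        features |= 1u << VIRTIO_BLK_F_MQ;
    }
    return features;
}
```

The same check works for dataplane and non-dataplane alike, which also addresses Paolo's ABI concern.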


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-07-30 15:12     ` Michael S. Tsirkin
@ 2014-07-30 15:25       ` Paolo Bonzini
  2014-07-31  3:47         ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 15:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Ming Lei, qemu-devel,
	Stefan Hajnoczi

Il 30/07/2014 17:12, Michael S. Tsirkin ha scritto:
>> > 
>> > Dataplane must not be a change to the guest ABI.  If you implement this
>> > feature you have to implement it for both dataplane and non-dataplne.
>> > 
>> > Paolo
> Actually, I think different backends at runtime should be allowed to
> change guest visible ABI.  For example if you give qemu a read only
> backend this is guest visible right?

Dataplane is not meant to be a different backend; it's just moving stuff
out to a separate thread.  It's only due to a QEMU limitation that it
doesn't share the vring code (and these patches include a lot of steps
backwards where dataplane becomes != non-dataplane).

Paolo


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-30 13:45   ` Paolo Bonzini
@ 2014-07-30 17:15     ` Ming Lei
  2014-07-30 23:37       ` Paolo Bonzini
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 17:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
>> This patch introduces several APIs for supporting bypass qemu coroutine
>> in case of being not necessary and for performance's sake.
>
> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
> anything else is a step backwards.

As we saw, coroutines have brought a performance regression to
dataplane, and it isn't necessary to use a coroutine in some cases, is it?

>
> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
> conditions that allow doing that and call the bdrv_aio_readv/writev
> directly.

That is easy to detect, please see the 5th patch.

>
> To begin with, have you benchmarked QEMU and can you provide a trace of
> *where* the coroutine overhead lies?

I guess it may be caused by the stack switch; at least, on one of
my boxes bypassing the coroutine can improve throughput by ~7%, and by
~15% on another box.

Thanks,


* Re: [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256
  2014-07-30 14:00   ` Paolo Bonzini
@ 2014-07-30 17:20     ` Ming Lei
  2014-08-04 10:26       ` Stefan Hajnoczi
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 17:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Jul 30, 2014 at 10:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
>> This patch increases max event to 256 for the comming
>> virtio-blk multi virtqueue support.
>>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>>  block/linux-aio.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> What makes the new magic number less magic than the old one?

Just for supporting the coming multi-virtqueue feature; otherwise it is
easy to trigger -EAGAIN.

Or do you have better idea to figure out a non-magic number?

Thanks


* Re: [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch
  2014-07-30 13:59   ` Paolo Bonzini
@ 2014-07-30 17:32     ` Ming Lei
  2014-07-30 23:41       ` Paolo Bonzini
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-30 17:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Jul 30, 2014 at 9:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
>> In the enqueue path, we can't complete request, otherwise
>> "Co-routine re-entered recursively" may be caused, so this
>> patch fixes the issue with below ideas:
>>
>>       - for -EAGAIN or partial completion, retry the submission by
>>       an introduced event handler
>>       - for part of completion, also update the io queue
>>       - for other failure, return the failure if in enqueue path,
>>       otherwise, abort all queued I/O
>>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>>  block/linux-aio.c |   90 ++++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 68 insertions(+), 22 deletions(-)
>>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index 7ac7e8c..5eb9c92 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -51,6 +51,7 @@ struct qemu_laio_state {
>>
>>      /* io queue for submit at batch */
>>      LaioQueue io_q;
>> +    EventNotifier retry;      /* handle -EAGAIN and partial completion */
>>  };
>>
>>  static inline ssize_t io_event_ret(struct io_event *ev)
>> @@ -154,45 +155,80 @@ static void ioq_init(LaioQueue *io_q)
>>      io_q->plugged = 0;
>>  }
>>
>> -static int ioq_submit(struct qemu_laio_state *s)
>> +static void abort_queue(struct qemu_laio_state *s)
>> +{
>> +    int i;
>> +    for (i = 0; i < s->io_q.idx; i++) {
>> +        struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
>> +                                                  struct qemu_laiocb,
>> +                                                  iocb);
>> +        laiocb->ret = -EIO;
>> +        qemu_laio_process_completion(s, laiocb);
>> +    }
>> +}
>> +
>> +static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
>>  {
>>      int ret, i = 0;
>>      int len = s->io_q.idx;
>> +    int j = 0;
>>
>> -    do {
>> -        ret = io_submit(s->ctx, len, s->io_q.iocbs);
>> -    } while (i++ < 3 && ret == -EAGAIN);
>> +    if (!len) {
>> +        return 0;
>> +    }
>>
>> -    /* empty io queue */
>> -    s->io_q.idx = 0;
>> +    ret = io_submit(s->ctx, len, s->io_q.iocbs);
>> +    if (ret == -EAGAIN) {
>> +        event_notifier_set(&s->retry);
>
> Retrying immediately (and just doing a couple of system calls to waste
> time) is not an improvement.  The right place to retry is in
> qemu_laio_completion_cb, after io_getevents has been called and
> presumably the queue depth has decreased.

Good point.

>
> If !s->io_q.plugged but io_submit fails you can call ioq_enqueue and it

When will the queued I/O be submitted? That will definitely introduce
extra complexity.

It is a change for the !s->io_q.plugged case, and it isn't good to do that
in this patch, IMO.

> will just work.  Then you can only go to out_free_aiocb if the queue is
> full (independent of the "plug" state).


Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-30 17:15     ` Ming Lei
@ 2014-07-30 23:37       ` Paolo Bonzini
  2014-07-31  3:55         ` Ming Lei
  2014-07-31  8:59         ` Ming Lei
  0 siblings, 2 replies; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 23:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 30/07/2014 19:15, Ming Lei ha scritto:
> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>> in case of being not necessary and for performance's sake.
>>
>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>> anything else is a step backwards.
> 
> As we saw, coroutine has brought up performance regression
> on dataplane, and it isn't necessary to use co in some cases, is it?

Yes, and it's not necessary on non-dataplane either.  It's not necessary
on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
either.

>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>> conditions that allow doing that and call the bdrv_aio_readv/writev
>> directly.
> 
> That is easy to detect, please see the 5th patch.

No, that's not enough.  Dataplane right now prevents block jobs, but
that's going to change and it could require coroutines even for raw devices.

>> To begin with, have you benchmarked QEMU and can you provide a trace of
>> *where* the coroutine overhead lies?
> 
> I guess it may be caused by the stack switch, at least in one of
> my box, bypassing co can improve throughput by ~7%, and by
> ~15% in another box.

No guesses please.  Actually that's also my guess, but since you are
submitting the patch you must do better and show profiles where stack
switching disappears after the patches.

Paolo


* Re: [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch
  2014-07-30 17:32     ` Ming Lei
@ 2014-07-30 23:41       ` Paolo Bonzini
  0 siblings, 0 replies; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-30 23:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 30/07/2014 19:32, Ming Lei ha scritto:
> On Wed, Jul 30, 2014 at 9:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>>> In the enqueue path, we can't complete request, otherwise
>>> "Co-routine re-entered recursively" may be caused, so this
>>> patch fixes the issue with below ideas:
>>>
>>>       - for -EAGAIN or partial completion, retry the submission by
>>>       an introduced event handler
>>>       - for part of completion, also update the io queue
>>>       - for other failure, return the failure if in enqueue path,
>>>       otherwise, abort all queued I/O
>>>
>>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>>> ---
>>>  block/linux-aio.c |   90 ++++++++++++++++++++++++++++++++++++++++-------------
>>>  1 file changed, 68 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>>> index 7ac7e8c..5eb9c92 100644
>>> --- a/block/linux-aio.c
>>> +++ b/block/linux-aio.c
>>> @@ -51,6 +51,7 @@ struct qemu_laio_state {
>>>
>>>      /* io queue for submit at batch */
>>>      LaioQueue io_q;
>>> +    EventNotifier retry;      /* handle -EAGAIN and partial completion */
>>>  };
>>>
>>>  static inline ssize_t io_event_ret(struct io_event *ev)
>>> @@ -154,45 +155,80 @@ static void ioq_init(LaioQueue *io_q)
>>>      io_q->plugged = 0;
>>>  }
>>>
>>> -static int ioq_submit(struct qemu_laio_state *s)
>>> +static void abort_queue(struct qemu_laio_state *s)
>>> +{
>>> +    int i;
>>> +    for (i = 0; i < s->io_q.idx; i++) {
>>> +        struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
>>> +                                                  struct qemu_laiocb,
>>> +                                                  iocb);
>>> +        laiocb->ret = -EIO;
>>> +        qemu_laio_process_completion(s, laiocb);
>>> +    }
>>> +}
>>> +
>>> +static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
>>>  {
>>>      int ret, i = 0;
>>>      int len = s->io_q.idx;
>>> +    int j = 0;
>>>
>>> -    do {
>>> -        ret = io_submit(s->ctx, len, s->io_q.iocbs);
>>> -    } while (i++ < 3 && ret == -EAGAIN);
>>> +    if (!len) {
>>> +        return 0;
>>> +    }
>>>
>>> -    /* empty io queue */
>>> -    s->io_q.idx = 0;
>>> +    ret = io_submit(s->ctx, len, s->io_q.iocbs);
>>> +    if (ret == -EAGAIN) {
>>> +        event_notifier_set(&s->retry);
>>
>> Retrying immediately (and just doing a couple of system calls to waste
>> time) is not an improvement.  The right place to retry is in
>> qemu_laio_completion_cb, after io_getevents has been called and
>> presumably the queue depth has decreased.
> 
> Good point.
> 
>>
>> If !s->io_q.plugged but io_submit fails you can call ioq_enqueue and it
> 
> When will the queued I/O be submitted? That will introduce extra
> complexity definitely.

It will be submitted when qemu_laio_completion_cb is called.

> It is a change for !s->io_q.plugged case, and it isn't good to do that in
> this patch, IMO.

I agree with you that this series is doing too many things at a single
time.  You can submit separate series for 1) no-coroutine fast path, 2)
full queue, 3) multiqueue.  If you do things properly you won't have a
single conflict, since they affect respectively block.c,
block/linux-aio.c and hw/block/.

Paolo
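The control flow Paolo describes, leave the queue alone on -EAGAIN and resubmit from the completion callback once io_getevents() has freed slots, can be modeled with a toy in-kernel queue. Everything below is a simulation with made-up names, not QEMU or libaio code:

```c
#include <string.h>

#define KERNEL_DEPTH 2       /* slots the "kernel" aio context holds */
#define QUEUE_SZ 8

static int kernel_inflight;  /* requests currently in the "kernel" */
static int pending[QUEUE_SZ];/* userspace io_q of queued iocbs */
static int pending_len;
static int completed;

/* Like io_submit(): takes what fits, returns -EAGAIN when full. */
static int toy_io_submit(int n)
{
    int room = KERNEL_DEPTH - kernel_inflight;
    if (room <= 0) {
        return -11;                      /* -EAGAIN */
    }
    int taken = n < room ? n : room;     /* partial submission is possible */
    kernel_inflight += taken;
    return taken;
}

static void ioq_submit_pending(void)
{
    if (!pending_len) {
        return;
    }
    int ret = toy_io_submit(pending_len);
    if (ret > 0) {
        /* drop the submitted prefix, keep the rest queued */
        memmove(pending, pending + ret, sizeof(int) * (pending_len - ret));
        pending_len -= ret;
    }
    /* on -EAGAIN: do nothing; the completion callback retries */
}

/* Analogue of qemu_laio_completion_cb after io_getevents() reaps one. */
static void completion_cb(void)
{
    kernel_inflight--;
    completed++;
    ioq_submit_pending();    /* retry now that the queue depth decreased */
}
```

No extra EventNotifier is needed in this scheme: progress on the pending queue is driven entirely by completions.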


* Re: [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-30 13:51   ` Paolo Bonzini
  2014-07-30 14:40     ` Michael S. Tsirkin
@ 2014-07-31  2:07     ` Ming Lei
  2014-07-31  9:38       ` Paolo Bonzini
  1 sibling, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-31  2:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Jul 30, 2014 at 9:51 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index a60104c..943e72f 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -84,12 +84,17 @@ typedef struct VirtQueue VirtQueue;
>>  typedef struct VirtQueueElement
>>  {
>>      unsigned int index;
>> +    unsigned int num;
>>      unsigned int out_num;
>>      unsigned int in_num;
>> -    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>> -    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>> -    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>> -    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
>> +
>> +    hwaddr *in_addr;
>> +    hwaddr *out_addr;
>> +    struct iovec *in_sg;
>> +    struct iovec *out_sg;
>> +
>> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
>> +    struct iovec sg[VIRTQUEUE_MAX_SIZE];
>>  } VirtQueueElement;
>>
>>  #define VIRTIO_PCI_QUEUE_MAX 64
>> --
>
> since addr and sg aren't used directly, allocate them flexibly like
>
>     char *p;
>     VirtQueueElement *elem;
>     total_size = ROUND_UP(sizeof(struct VirtQueueElement),
>                           __alignof__(elem->addr[0]));
>     addr_offset = total_size;
>     total_size = ROUND_UP(total_size + num * sizeof(elem->addr[0]),
>                           __alignof__(elem->sg[0]));
>     sg_offset = total_size;
>     total_size += num * sizeof(elem->sg[0]);
>
>     elem = p = g_slice_alloc(total_size);
>     elem->size = total_size;
>     elem->in_addr = p + addr_offset;
>     elem->out_addr = elem->in_addr + in_num;
>     elem->in_sg = p + sg_offset;
>     elem->out_sg = elem->in_sg + in_num;
>
> ...
>
>     g_slice_free1(elem->size, elem);
>
> The small size will help glib do slab-style allocation and should remove
> the need for an object pool.

Yes, that would be the correct way to do it, but it can't avoid big chunk
allocation completely because 'num' can be fairly big.

Also this kind of change requires almost all users of elem to be updated,
which is a lot of work.

That is why I chose the simple approach, which eases memory
preallocation for the object pool.
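To make the discussion concrete, here is Paolo's scheme written out as a
self-contained, compilable sketch. The types and the helper name are
stand-ins (calloc stands in for g_slice_alloc); this is not the exact
QEMU code:

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t hwaddr;                             /* stand-in for QEMU's hwaddr */
struct iovec_s { void *iov_base; size_t iov_len; };  /* stand-in for struct iovec */

#define ROUND_UP(n, a) (((n) + (a) - 1) & ~((size_t)(a) - 1))

typedef struct VirtQueueElement {
    unsigned int index;
    size_t size;                /* total allocation size, kept for freeing */
    unsigned int out_num;
    unsigned int in_num;
    hwaddr *in_addr;            /* all four arrays live in the same chunk */
    hwaddr *out_addr;
    struct iovec_s *in_sg;
    struct iovec_s *out_sg;
} VirtQueueElement;

/* Hypothetical helper: one allocation sized for exactly 'num' descriptors. */
VirtQueueElement *virtqueue_alloc_element(unsigned int in_num,
                                          unsigned int out_num)
{
    unsigned int num = in_num + out_num;
    size_t addr_offset, sg_offset, total_size;
    char *p;
    VirtQueueElement *elem;

    /* Lay out the element, then the addr array, then the sg array,
     * each rounded up to its alignment. */
    total_size = ROUND_UP(sizeof(VirtQueueElement), __alignof__(hwaddr));
    addr_offset = total_size;
    total_size = ROUND_UP(total_size + num * sizeof(hwaddr),
                          __alignof__(struct iovec_s));
    sg_offset = total_size;
    total_size += num * sizeof(struct iovec_s);

    p = calloc(1, total_size);  /* calloc stands in for g_slice_alloc */
    elem = (VirtQueueElement *)p;
    elem->size = total_size;
    elem->in_num = in_num;
    elem->out_num = out_num;
    elem->in_addr = (hwaddr *)(p + addr_offset);
    elem->out_addr = elem->in_addr + in_num;
    elem->in_sg = (struct iovec_s *)(p + sg_offset);
    elem->out_sg = elem->in_sg + in_num;
    return elem;                /* free with free(elem) in this sketch */
}
```

With num == 3 (a typical 4K request) the whole element is on the order of
128 bytes on a 64-bit host, well under the ~500-byte slab threshold
mentioned later in the thread.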

Thanks,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-30 14:40     ` Michael S. Tsirkin
  2014-07-30 14:50       ` Paolo Bonzini
@ 2014-07-31  2:11       ` Ming Lei
  1 sibling, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-31  2:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini

On Wed, Jul 30, 2014 at 10:40 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jul 30, 2014 at 03:51:22PM +0200, Paolo Bonzini wrote:
>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> > index a60104c..943e72f 100644
>> > --- a/include/hw/virtio/virtio.h
>> > +++ b/include/hw/virtio/virtio.h
>> > @@ -84,12 +84,17 @@ typedef struct VirtQueue VirtQueue;
>> >  typedef struct VirtQueueElement
>> >  {
>> >      unsigned int index;
>> > +    unsigned int num;
>> >      unsigned int out_num;
>> >      unsigned int in_num;
>> > -    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>> > -    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>> > -    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>> > -    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
>> > +
>> > +    hwaddr *in_addr;
>> > +    hwaddr *out_addr;
>> > +    struct iovec *in_sg;
>> > +    struct iovec *out_sg;
>> > +
>> > +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
>> > +    struct iovec sg[VIRTQUEUE_MAX_SIZE];
>> >  } VirtQueueElement;
>> >
>> >  #define VIRTIO_PCI_QUEUE_MAX 64
>> > --
>>
>> since addr and sg aren't used directly, allocate them flexibly like
>>
>>     char *p;
>>     VirtQueueElement *elem;
>>     total_size = ROUND_UP(sizeof(struct VirtQueueElement),
>>                           __alignof__(elem->addr[0]));
>>     addr_offset = total_size;
>>     total_size = ROUND_UP(total_size + num * sizeof(elem->addr[0]),
>>                           __alignof__(elem->sg[0]));
>>     sg_offset = total_size;
>>     total_size += num * sizeof(elem->sg[0]);
>>
>>     elem = p = g_slice_alloc(total_size);
>>     elem->size = total_size;
>>     elem->in_addr = p + addr_offset;
>>     elem->out_addr = elem->in_addr + in_num;
>>     elem->in_sg = p + sg_offset;
>>     elem->out_sg = elem->in_sg + in_num;
>>
>> ...
>>
>>     g_slice_free1(elem->size, elem);
>>
>> The small size will help glib do slab-style allocation and should remove
>> the need for an object pool.
>>
>> Paolo
>
> As long as you relying on this, why not allocate in_sg/out_sg
> separately?

That won't avoid the big allocation (including the writes that follow it),
which is always expensive.

Also, keeping a separate full-size array for each of in and out is
wasteful and inefficient.

> In any case, a bunch of code needs to be audited to make sure
> no one assumes we can always stick VIRTQUEUE_MAX_SIZE there.
>
>
> But what really worries me is this is an optimization patch
> without any numbers accompanying it.

One number: almost half of sizeof(elem) is saved. Isn't that
good enough?

> I would say split this out from your virtio blk work,
> work on this separately.

This patch eases elem preallocation for the object pool, so it is
related to the previous patches.

Thanks


* Re: [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-07-30 14:14   ` Paolo Bonzini
  2014-07-30 15:09     ` Michael S. Tsirkin
@ 2014-07-31  3:22     ` Ming Lei
  2014-07-31  9:18       ` Paolo Bonzini
  1 sibling, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-31  3:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Jul 30, 2014 at 10:14 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 13:39, Ming Lei ha scritto:
>> g_slice_new(VirtIOBlockReq), its free pair and access the instance
>> is a bit slow since sizeof(VirtIOBlockReq) takes more than 40KB,
>> so use object pool to speed up its allocation and release.
>>
>> With this patch, ~5% throughput improvement is observed in the VM
>> based on server.
>>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>>  hw/block/dataplane/virtio-blk.c |   12 ++++++++++++
>>  hw/block/virtio-blk.c           |   13 +++++++++++--
>>  include/hw/virtio/virtio-blk.h  |    2 ++
>>  3 files changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>> index 2093e4a..828fe99 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -24,6 +24,8 @@
>>  #include "hw/virtio/virtio-bus.h"
>>  #include "qom/object_interfaces.h"
>>
>> +#define REQ_POOL_SZ 128
>> +
>>  struct VirtIOBlockDataPlane {
>>      bool started;
>>      bool starting;
>> @@ -51,6 +53,10 @@ struct VirtIOBlockDataPlane {
>>      Error *blocker;
>>      void (*saved_complete_request)(struct VirtIOBlockReq *req,
>>                                     unsigned char status);
>> +
>> +    VirtIOBlockReq  reqs[REQ_POOL_SZ];
>> +    void *free_reqs[REQ_POOL_SZ];
>> +    ObjPool  req_pool;
>>  };
>>
>>  /* Raise an interrupt to signal guest, if necessary */
>> @@ -238,6 +244,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
>>          return;
>>      }
>>
>> +    vblk->obj_pool = &s->req_pool;
>> +    obj_pool_init(vblk->obj_pool, s->reqs, s->free_reqs,
>> +                  sizeof(VirtIOBlockReq), REQ_POOL_SZ);
>> +
>>      /* Set up guest notifier (irq) */
>>      if (k->set_guest_notifiers(qbus->parent, 1, true) != 0) {
>>          fprintf(stderr, "virtio-blk failed to set guest notifier, "
>> @@ -298,6 +308,8 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
>>
>>      aio_context_release(s->ctx);
>>
>> +    vblk->obj_pool = NULL;
>> +
>>      if (s->raw_format) {
>>          qemu_aio_set_bypass_co(s->ctx, false);
>>      }
>> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
>> index c241c50..2a11bc4 100644
>> --- a/hw/block/virtio-blk.c
>> +++ b/hw/block/virtio-blk.c
>> @@ -31,7 +31,11 @@
>>
>>  VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>>  {
>> -    VirtIOBlockReq *req = g_slice_new(VirtIOBlockReq);
>> +    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
>> +
>> +    if (!req) {
>> +        req = g_slice_new(VirtIOBlockReq);
>> +    }
>>      req->dev = s;
>>      req->qiov.size = 0;
>>      req->next = NULL;
>> @@ -41,7 +45,11 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>>  void virtio_blk_free_request(VirtIOBlockReq *req)
>>  {
>>      if (req) {
>> -        g_slice_free(VirtIOBlockReq, req);
>> +        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
>> +            obj_pool_put(req->dev->obj_pool, req);
>> +        } else {
>> +            g_slice_free(VirtIOBlockReq, req);
>> +        }
>>      }
>>  }
>>
>> @@ -801,6 +809,7 @@ static void virtio_blk_instance_init(Object *obj)
>>  {
>>      VirtIOBlock *s = VIRTIO_BLK(obj);
>>
>> +    s->obj_pool = NULL;
>>      object_property_add_link(obj, "iothread", TYPE_IOTHREAD,
>>                               (Object **)&s->blk.iothread,
>>                               qdev_prop_allow_set_link_before_realize,
>> diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
>> index afb7b8d..49ac234 100644
>> --- a/include/hw/virtio/virtio-blk.h
>> +++ b/include/hw/virtio/virtio-blk.h
>> @@ -18,6 +18,7 @@
>>  #include "hw/block/block.h"
>>  #include "sysemu/iothread.h"
>>  #include "block/block.h"
>> +#include "qemu/obj_pool.h"
>>
>>  #define TYPE_VIRTIO_BLK "virtio-blk-device"
>>  #define VIRTIO_BLK(obj) \
>> @@ -135,6 +136,7 @@ typedef struct VirtIOBlock {
>>      Notifier migration_state_notifier;
>>      struct VirtIOBlockDataPlane *dataplane;
>>  #endif
>> +    ObjPool *obj_pool;
>>  } VirtIOBlock;
>>
>>  typedef struct MultiReqBuffer {
>>
>
> The problem is that g_slice here is not using the slab-style allocator
> because the object is larger than roughly 500 bytes.  One solution would
> be to make virtqueue_pop/vring_pop allocate a VirtQueueElement of the
> right size (and virtqueue_push/vring_push free it), as mentioned in the
> review of patch 8.

Unluckily, the iovec and addr arrays can't fit into 500 bytes, :-(
Not to mention that all users of VirtQueueElement would need to be
changed too; I'd hate to pull that work into this patchset, :-)

>
> However, I now remembered that VirtQueueElement is a mess because it's
> serialized directly into the migration state. :(  So you basically
> cannot change it without mucking with migration.  Please leave out patch
> 8 for now.

The device-save code serializes elem this way:

 qemu_put_buffer(f, (unsigned char *)&req->elem,
                        sizeof(VirtQueueElement));

so I am wondering why this patch may break migration.

And in my tests, live migration survives with the patch.

>
> Let's use this object pool, however, please simplify the API, it should
> be just:
>
> void obj_pool_init(ObjPool *pool, unsigned obj_size, unsigned cnt);

Done that way, the object pool can't be backed by a static buffer, which
is sometimes very convenient since users don't need to worry about
releasing it.

But that isn't a big deal for the dataplane case.

> void *obj_pool_alloc(ObjPool *pool);
> void obj_pool_free(ObjPool *pool, void *obj);
> void obj_pool_destroy(ObjPool *pool);
>
> All allocations of the objs buffer, and all logic like
>
> +    VirtIOBlockReq *req = obj_pool_get(s->obj_pool);
> +
> +    if (!req) {
> +        req = g_slice_new(VirtIOBlockReq);
> +    }
>
> +        if (obj_pool_has_obj(req->dev->obj_pool, req)) {
> +            obj_pool_put(req->dev->obj_pool, req);
> +        } else {
> +            g_slice_free(VirtIOBlockReq, req);
> +        }
>
> should be part of the object pool, not its users.

Yeah, we can do that, but some callers may prefer malloc() while other
users may prefer calloc(); that is why I didn't put the fallback
allocation into the obj pool APIs.

It is still reasonably simple the current way.
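A minimal sketch of the simplified pool API Paolo proposes, with the heap
fallback folded into the pool itself. Names and layout here are
hypothetical, not the actual QEMU code:

```c
#include <stdlib.h>

/* Pool of 'cnt' fixed-size objects in one contiguous buffer, with a
 * free-list stack; allocations fall back to the heap when exhausted. */
typedef struct ObjPool {
    char *objs;          /* contiguous backing buffer */
    void **free_objs;    /* stack of currently free objects */
    unsigned obj_size;
    unsigned cnt;        /* total objects in the pool */
    unsigned free_cnt;   /* objects currently free */
} ObjPool;

void obj_pool_init(ObjPool *pool, unsigned obj_size, unsigned cnt)
{
    unsigned i;

    pool->objs = malloc((size_t)obj_size * cnt);
    pool->free_objs = malloc(cnt * sizeof(void *));
    pool->obj_size = obj_size;
    pool->cnt = pool->free_cnt = cnt;
    for (i = 0; i < cnt; i++) {
        pool->free_objs[i] = pool->objs + (size_t)i * obj_size;
    }
}

void *obj_pool_alloc(ObjPool *pool)
{
    if (pool->free_cnt > 0) {
        return pool->free_objs[--pool->free_cnt];
    }
    return malloc(pool->obj_size);   /* fallback when the pool is empty */
}

static int obj_in_pool(ObjPool *pool, void *obj)
{
    char *p = obj;
    return p >= pool->objs &&
           p < pool->objs + (size_t)pool->obj_size * pool->cnt;
}

void obj_pool_free(ObjPool *pool, void *obj)
{
    if (obj_in_pool(pool, obj)) {
        pool->free_objs[pool->free_cnt++] = obj;
    } else {
        free(obj);                   /* object came from the fallback path */
    }
}

void obj_pool_destroy(ObjPool *pool)
{
    free(pool->objs);
    free(pool->free_objs);
}
```

The address-range check is what hides the pool-vs-heap distinction from
callers, which is exactly the logic the review asks to move out of
virtio_blk_alloc_request/virtio_blk_free_request.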

Thanks,


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-07-30 15:25       ` Paolo Bonzini
@ 2014-07-31  3:47         ` Ming Lei
  2014-07-31  8:52           ` Paolo Bonzini
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-31  3:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Wed, Jul 30, 2014 at 11:25 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 17:12, Michael S. Tsirkin ha scritto:
>>> >
>>> > Dataplane must not be a change to the guest ABI.  If you implement this
>>> > feature you have to implement it for both dataplane and non-dataplne.
>>> >

IMO, whether or not the feature bit is set, both old and new VMs
handle it fine.

Per the virtio spec, only when the feature bit is set is the VM allowed
to access the 'num_queues' field, and I didn't see any problem from the
VM's point of view.

So could you explain why both dataplane and non-dataplane have
to support the feature?

I am doing it this way because there is no performance advantage in
taking multi-queue for non-dataplane.

>>> > Paolo
>> Actually, I think different backends at runtime should be allowed to
>> change guest visible ABI.  For example if you give qemu a read only
>> backend this is guest visible right?
>
> Dataplane is not meant to be a different backend, it's just moving stuff
> out to a separate thread.  It's only due to QEMU limitation that it
> doesn't share the vring code (and these patches include a lot of steps
> backwards where dataplane becomes != non-dataplane).

There won't be lots of backward steps, just the bypass-coroutine patch,
which only skips the coroutine when it is unnecessary (the raw image
case), and it is still one code path.

As I mentioned, all these steps are just to address the performance
regression from commit 580b6b2aa2 ("dataplane: use the
QEMU block layer for I/O").

Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-30 23:37       ` Paolo Bonzini
@ 2014-07-31  3:55         ` Ming Lei
  2014-07-31  7:37           ` Benoît Canet
  2014-07-31  8:59         ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-31  3:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 19:15, Ming Lei ha scritto:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.

It is good to know these cases, so they might benefit from this patch
in the future too, :-)

>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw devices.

Could you explain a bit why coroutines would be required for raw
devices in the future?

If the cases that do not require coroutines can be detected beforehand,
these patches are useful, IMO.

>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

OK, no problem, I will provide some profiling results.

Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31  3:55         ` Ming Lei
@ 2014-07-31  7:37           ` Benoît Canet
  2014-07-31  9:47             ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Benoît Canet @ 2014-07-31  7:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini

The Thursday 31 Jul 2014 à 11:55:28 (+0800), Ming Lei wrote :
> On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il 30/07/2014 19:15, Ming Lei ha scritto:
> >> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> Il 30/07/2014 13:39, Ming Lei ha scritto:
> >>>> This patch introduces several APIs for supporting bypass qemu coroutine
> >>>> in case of being not necessary and for performance's sake.
> >>>
> >>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
> >>> anything else is a step backwards.
> >>
> >> As we saw, coroutine has brought up performance regression
> >> on dataplane, and it isn't necessary to use co in some cases, is it?
> >
> > Yes, and it's not necessary on non-dataplane either.  It's not necessary
> > on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> > either.
> 
> It is good to know these cases, so they might benefit from this patch
> in future too, :-)
> 
> >>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
> >>> conditions that allow doing that and call the bdrv_aio_readv/writev
> >>> directly.
> >>
> >> That is easy to detect, please see the 5th patch.
> >
> > No, that's not enough.  Dataplane right now prevents block jobs, but
> > that's going to change and it could require coroutines even for raw devices.
> 
> Could you explain it a bit why co is required for raw devices in future?

Block mirroring of a device, for example, is done using coroutines.
Since block mirroring can be done on a raw device, you need coroutines
there too.

> 
> If some cases for not requiring co can be detected beforehand, these
> patches are useful, IMO.
> 
> >>> To begin with, have you benchmarked QEMU and can you provide a trace of
> >>> *where* the coroutine overhead lies?
> >>
> >> I guess it may be caused by the stack switch, at least in one of
> >> my box, bypassing co can improve throughput by ~7%, and by
> >> ~15% in another box.
> >
> > No guesses please.  Actually that's also my guess, but since you are
> > submitting the patch you must do better and show profiles where stack
> > switching disappears after the patches.
> 
> OK, no problem, I will provide some profile result.
> 
> Thanks,
> 


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-07-31  3:47         ` Ming Lei
@ 2014-07-31  8:52           ` Paolo Bonzini
  2014-08-01  3:09             ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-31  8:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 31/07/2014 05:47, Ming Lei ha scritto:
> On Wed, Jul 30, 2014 at 11:25 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 30/07/2014 17:12, Michael S. Tsirkin ha scritto:
>>>>>
>>>>> Dataplane must not be a change to the guest ABI.  If you implement this
>>>>> feature you have to implement it for both dataplane and non-dataplne.
>>>>>
> 
> IMO, no matter if the feature is set or not, both old and new VM
> can support it well.
> 
> Per virtio spec, only the feature is set, the VM can be allowed to
> access the 'num_queues' field, and I didn't see any problem from
> VM's view point.
> 
> So could you explain why both dataplane and non-dataplane have
> to support the feature.

Because otherwise you change the guest ABI.

> I am doing so because there isn't performance advantage to take mq
> for non-dataplane.

It doesn't matter, changing features between dataplane and non-dataplane
is not something that you can do.

>>> Actually, I think different backends at runtime should be allowed to
>>> change guest visible ABI.  For example if you give qemu a read only
>>> backend this is guest visible right?
>>
>> Dataplane is not meant to be a different backend, it's just moving stuff
>> out to a separate thread.  It's only due to QEMU limitation that it
>> doesn't share the vring code (and these patches include a lot of steps
>> backwards where dataplane becomes != non-dataplane).
> 
> There won't lots of backwards steps, just the bypass co patch, which
> is just bypassing co in case of being unnecessary for raw image case,
> but it is still one code path.

Using the object pool for dataplane only is a backwards step,
implementing multiqueue for dataplane only is a backwards step, and
bypassing coroutines for dataplane only is a backwards step.

Paolo


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-30 23:37       ` Paolo Bonzini
  2014-07-31  3:55         ` Ming Lei
@ 2014-07-31  8:59         ` Ming Lei
  2014-07-31  9:15           ` Paolo Bonzini
  1 sibling, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-31  8:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/07/2014 19:15, Ming Lei ha scritto:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.
>
>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw devices.
>
>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

Below are the hardware events reported by 'perf stat' when running the
fio randread benchmark for 2 minutes in the VM (single vq, 2 jobs):

sudo ~/bin/perf stat -e \
    L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses \
    ./nqemu-start-mq 4 1

1) without bypassing the coroutine (forcing 's->raw_format' to false,
see patch 5/15)

- throughput: 95K

  Performance counter stats for './nqemu-start-mq 4 1':

     69,231,035,842      L1-dcache-loads                                             [40.10%]
      1,909,978,930      L1-dcache-load-misses     #    2.76% of all L1-dcache hits  [39.98%]
    263,731,501,086      cpu-cycles                                                  [40.03%]
    232,564,905,115      instructions              #    0.88  insns per cycle        [50.23%]
     46,157,868,745      branch-instructions                                         [49.82%]
        785,618,591      branch-misses             #    1.70% of all branches        [49.99%]
     46,280,342,654      branch-loads                                                [49.95%]
     34,934,790,140      branch-load-misses                                          [50.02%]
     69,447,857,237      dTLB-loads                                                  [40.13%]
        169,617,374      dTLB-load-misses          #    0.24% of all dTLB cache hits [40.04%]

      161.991075781 seconds time elapsed


2) with bypassing the coroutine

- throughput: 115K

  Performance counter stats for './nqemu-start-mq 4 1':

     76,784,224,509      L1-dcache-loads                                             [39.93%]
      1,334,036,447      L1-dcache-load-misses     #    1.74% of all L1-dcache hits  [39.91%]
    262,697,428,470      cpu-cycles                                                  [40.03%]
    255,526,629,881      instructions              #    0.97  insns per cycle        [50.01%]
     50,160,082,611      branch-instructions                                         [49.97%]
        564,407,788      branch-misses             #    1.13% of all branches        [50.08%]
     50,331,510,702      branch-loads                                                [50.08%]
     35,760,766,459      branch-load-misses                                          [50.03%]
     76,706,000,951      dTLB-loads                                                  [40.00%]
        123,291,001      dTLB-load-misses          #    0.16% of all dTLB cache hits [40.02%]

      162.333465490 seconds time elapsed


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31  8:59         ` Ming Lei
@ 2014-07-31  9:15           ` Paolo Bonzini
  2014-07-31 10:06             ` Ming Lei
  2014-07-31 16:13             ` Ming Lei
  0 siblings, 2 replies; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-31  9:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 31/07/2014 10:59, Ming Lei ha scritto:
>> > No guesses please.  Actually that's also my guess, but since you are
>> > submitting the patch you must do better and show profiles where stack
>> > switching disappears after the patches.
> Follows the below hardware events reported by 'perf stat' when running
> fio randread benchmark for 2min in VM(single vq, 2 jobs):
> 
> sudo ~/bin/perf stat -e
> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
> ./nqemu-start-mq 4 1
> 
> 1), without bypassing coroutine via forcing to set 's->raw_format ' as
> false, see patch 5/15
> 
> - throughput: 95K
>    232,564,905,115      instructions
>      161.991075781 seconds time elapsed
> 
> 
> 2), with bypassing coroutinue
> - throughput: 115K
>    255,526,629,881      instructions
>      162.333465490 seconds time elapsed

Ok, so you are saving 10% instructions per iop: before 232G / 95K =
2.45M instructions/iop, 255G / 115K = 2.22M instructions/iop.

That's not small, and it's a good thing for CPU utilization even if you
were not increasing iops.  On top of this, can you provide the stack
traces to see the difference in the profiles?

Paolo


* Re: [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-07-31  3:22     ` Ming Lei
@ 2014-07-31  9:18       ` Paolo Bonzini
  2014-08-01  7:42         ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-31  9:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 31/07/2014 05:22, Ming Lei ha scritto:
>> >
>> > The problem is that g_slice here is not using the slab-style allocator
>> > because the object is larger than roughly 500 bytes.  One solution would
>> > be to make virtqueue_pop/vring_pop allocate a VirtQueueElement of the
>> > right size (and virtqueue_push/vring_push free it), as mentioned in the
>> > review of patch 8.
> Unluckily both iovec and addr array can't be fitted into 500 bytes, :-(
> Not mention all users of VirtQueueElement need to be changed too,
> I hate to make that work involved in this patchset, :-)

Well, the point of dataplane was not just to get maximum iops.  It was
also to provide guidance in the work necessary to improve the code and
get maximum iops without special-casing everything.  This can be a lot
of work indeed.

>> >
>> > However, I now remembered that VirtQueueElement is a mess because it's
>> > serialized directly into the migration state. :(  So you basically
>> > cannot change it without mucking with migration.  Please leave out patch
>> > 8 for now.
> save_device code serializes elem in this way:
> 
>  qemu_put_buffer(f, (unsigned char *)&req->elem,
>                         sizeof(VirtQueueElement));
> 
> so I am wondering why this patch may break migration.

Because you change the on-wire format and break migration from 2.1 to
2.2.  Sorry, I wasn't clear enough.
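Concretely: because qemu_put_buffer() writes the raw in-memory struct,
the struct layout *is* the wire format, and any layout change (such as
turning embedded arrays into pointers) changes what goes over the wire.
A toy illustration with hypothetical structs, not the real
VirtQueueElement:

```c
#include <stddef.h>

/* Old layout: the data is embedded, so the serialized bytes contain
 * the payload itself. */
typedef struct {
    unsigned int index;
    char addr[8];
} ElemV1;

/* New layout: a pointer replaces the array, so serializing the raw
 * struct would emit a meaningless host pointer, and the size of the
 * record on the wire changes too. */
typedef struct {
    unsigned int index;
    char *addr;
} ElemV2;
```

A 2.1 source writing sizeof(ElemV1) bytes and a 2.2 destination reading
sizeof(ElemV2) bytes would simply disagree on the stream format.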

Paolo

> And in my test live migration can survive with the patch.
> 


* Re: [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-31  2:07     ` Ming Lei
@ 2014-07-31  9:38       ` Paolo Bonzini
  2014-08-01  3:34         ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-31  9:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 31/07/2014 04:07, Ming Lei ha scritto:
> On Wed, Jul 30, 2014 at 9:51 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>>> index a60104c..943e72f 100644
>>> --- a/include/hw/virtio/virtio.h
>>> +++ b/include/hw/virtio/virtio.h
>>> @@ -84,12 +84,17 @@ typedef struct VirtQueue VirtQueue;
>>>  typedef struct VirtQueueElement
>>>  {
>>>      unsigned int index;
>>> +    unsigned int num;
>>>      unsigned int out_num;
>>>      unsigned int in_num;
>>> -    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>>> -    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>>> -    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>>> -    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
>>> +
>>> +    hwaddr *in_addr;
>>> +    hwaddr *out_addr;
>>> +    struct iovec *in_sg;
>>> +    struct iovec *out_sg;
>>> +
>>> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
>>> +    struct iovec sg[VIRTQUEUE_MAX_SIZE];
>>>  } VirtQueueElement;
>>>
>>>  #define VIRTIO_PCI_QUEUE_MAX 64
>>> --
>>
>> since addr and sg aren't used directly, allocate them flexibly like
>>
>>     char *p;
>>     VirtQueueElement *elem;
>>     total_size = ROUND_UP(sizeof(struct VirtQueueElement),
>>                           __alignof__(elem->addr[0]);
>>     addr_offset = total_size;
>>     total_size = ROUND_UP(total_size + num * sizeof(elem->addr[0]),
>>                           __alignof__(elem->sg[0]));
>>     sg_offset = total_size;
>>     total_size += num * sizeof(elem->sg[0]);
>>
>>     elem = p = g_slice_alloc(total_size);
>>     elem->size = total_size;
>>     elem->in_addr = p + addr_offset;
>>     elem->out_addr = elem->in_addr + in_num;
>>     elem->in_sg = p + sg_offset;
>>     elem->out_sg = elem->in_sg + in_num;
>>
>> ...
>>
>>     g_slice_free1(elem->size, elem);
>>
>> The small size will help glib do slab-style allocation and should remove
>> the need for an object pool.
> 
> Yes, that should be correct way to do, but can't avoid big chunk allocation
> completely because 'num' is a bit big.

For typical iops microbenchmarks, num will be 3 (4K I/O) to 18 (64K).

> Also this kind of change requires almost all users of elem to be changed,
> that need lots of work.
> 
> That is why I choose to take the simple approach to ease memory
> preallocation for obj pool.

Yeah, that's simpler.

Paolo


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31  7:37           ` Benoît Canet
@ 2014-07-31  9:47             ` Ming Lei
  2014-07-31 10:45               ` Paolo Bonzini
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-31  9:47 UTC (permalink / raw)
  To: Benoît Canet
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi, Paolo Bonzini

On Thu, Jul 31, 2014 at 3:37 PM, Benoît Canet <benoit.canet@irqsave.net> wrote:
> The Thursday 31 Jul 2014 à 11:55:28 (+0800), Ming Lei wrote :
>> On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > Il 30/07/2014 19:15, Ming Lei ha scritto:
>> >> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>> >>>> This patch introduces several APIs for supporting bypass qemu coroutine
>> >>>> in case of being not necessary and for performance's sake.
>> >>>
>> >>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>> >>> anything else is a step backwards.
>> >>
>> >> As we saw, coroutines have caused a performance regression
>> >> on dataplane, and it isn't necessary to use them in some cases, is it?
>> >
>> > Yes, and it's not necessary on non-dataplane either.  It's not necessary
>> > on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
>> > either.
>>
>> It is good to know these cases, so they might benefit from this patch
>> in future too, :-)
>>
>> >>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>> >>> conditions that allow doing that and call the bdrv_aio_readv/writev
>> >>> directly.
>> >>
>> >> That is easy to detect, please see the 5th patch.
>> >
>> > No, that's not enough.  Dataplane right now prevents block jobs, but
>> > that's going to change and it could require coroutines even for raw devices.
>>
>> Could you explain a bit why coroutines would be required for raw devices in the future?
>
> Block mirroring of a device for example is done using coroutines.
> As block mirroring can be done on a raw device you need coroutines.

If the block layer knows that mirroring is in progress, it can still
enable coroutines by ignoring the bypass, or just let the device
disable the bypass while mirroring; the current APIs are flexible
enough for that.

Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31  9:15           ` Paolo Bonzini
@ 2014-07-31 10:06             ` Ming Lei
  2014-07-31 16:13             ` Ming Lei
  1 sibling, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-07-31 10:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 5:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 31/07/2014 10:59, Ming Lei ha scritto:
>>> > No guesses please.  Actually that's also my guess, but since you are
>>> > submitting the patch you must do better and show profiles where stack
>>> > switching disappears after the patches.
>> Follows the below hardware events reported by 'perf stat' when running
>> fio randread benchmark for 2min in VM(single vq, 2 jobs):
>>
>> sudo ~/bin/perf stat -e
>> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
>> ./nqemu-start-mq 4 1
>>
>> 1), without bypassing coroutines, by forcing 's->raw_format' to
>> false (see patch 5/15)
>>
>> - throughput: 95K
>>    232,564,905,115      instructions
>>      161.991075781 seconds time elapsed
>>
>>
>> 2), with bypassing coroutines
>> - throughput: 115K
>>    255,526,629,881      instructions
>>      162.333465490 seconds time elapsed
>
> Ok, so you are saving 10% instructions per iop: before 232G / 95K =
> 2.45M instructions/iop, 255G / 115K = 2.22M instructions/iop.

I am wondering how meaningful that is, since IOPS is measured from the
VM's point of view while the instruction count comes from the host
(QEMU). If the QEMU dataplane handles I/O quickly enough, it can save
instructions through batching.

But I will collect the 'perf report' on the 'instructions' event for you.

The L1-dcache-load-miss ratio increases by 1%, and instructions per
cycle drop by about 10% (0.97 -> 0.88), which is significant too, and
means QEMU becomes much slower when running block I/O in the VM through
coroutines.

Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31  9:47             ` Ming Lei
@ 2014-07-31 10:45               ` Paolo Bonzini
  2014-08-01 13:38                 ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-31 10:45 UTC (permalink / raw)
  To: Ming Lei, Benoît Canet
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 31/07/2014 11:47, Ming Lei ha scritto:
>> Block mirroring of a device for example is done using coroutines.
>> As block mirroring can be done on a raw device you need coroutines.
> 
> If block layer knows the mirroring is in-progress, it still can enable
> coroutine by ignoring bypass coroutine, or just let device disable
> bypass coroutine in case of mirroring, and the current APIs are very
> flexible.

What matters is not whether you're mirroring.  What matters is whether
you're calling bdrv_aio_readv/writev or bdrv_co_readv/writev.  Under
some limitation, if you are calling bdrv_aio_readv/writev you can bypass
coroutines.  (In fact drive mirroring uses those APIs too so it doesn't
need coroutines and can benefit from the speedup too. :)  But
drive-backup does need them).

The limitations are essentially "would bdrv_co_do_preadv or
bdrv_co_do_pwritev do anything special?"  This means for example for
bdrv_co_do_pwritev:

- bs->drv must be non-NULL

- bs->read_only must be false

- bdrv_check_byte_request(bs, offset, bytes) must return false

- bs->io_limits_enabled must be false

- the request must be aligned

- (in bdrv_aligned_pwritev) the before_write_notifiers must be empty

- (in bdrv_aligned_pwritev) bs->detect_zeroes must be off

- (in bdrv_aligned_pwritev) the BDRV_REQ_ZERO_WRITE flag must be off

- (in bdrv_aligned_pwritev) bs->enable_write_cache must be false

and the hard part is organizing the code so that the code duplication is
as limited as possible.

Paolo


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31  9:15           ` Paolo Bonzini
  2014-07-31 10:06             ` Ming Lei
@ 2014-07-31 16:13             ` Ming Lei
  2014-07-31 16:30               ` Paolo Bonzini
  1 sibling, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-07-31 16:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 5:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 31/07/2014 10:59, Ming Lei ha scritto:
>>> > No guesses please.  Actually that's also my guess, but since you are
>>> > submitting the patch you must do better and show profiles where stack
>>> > switching disappears after the patches.
>> Follows the below hardware events reported by 'perf stat' when running
>> fio randread benchmark for 2min in VM(single vq, 2 jobs):
>>
>> sudo ~/bin/perf stat -e
>> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
>> ./nqemu-start-mq 4 1
>>
>> 1), without bypassing coroutines, by forcing 's->raw_format' to
>> false (see patch 5/15)
>>
>> - throughput: 95K
>>    232,564,905,115      instructions
>>      161.991075781 seconds time elapsed
>>
>>
>> 2), with bypassing coroutines
>> - throughput: 115K
>>    255,526,629,881      instructions
>>      162.333465490 seconds time elapsed
>
> Ok, so you are saving 10% instructions per iop: before 232G / 95K =
> 2.45M instructions/iop, 255G / 115K = 2.22M instructions/iop.
>
> That's not small, and it's a good thing for CPU utilization even if you
> were not increasing iops.  On top of this, can you provide the stack
> traces to see the difference in the profiles?

Follows 'perf report' result on cycles event for with/without bypass
coroutine:

    http://pastebin.com/ae0vnQ6V

From the profiling result, it looks like bdrv_co_do_preadv() is a bit
slow without bypassing coroutines.


Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31 16:13             ` Ming Lei
@ 2014-07-31 16:30               ` Paolo Bonzini
  2014-08-01  2:54                 ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-07-31 16:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 31/07/2014 18:13, Ming Lei ha scritto:
> Follows 'perf report' result on cycles event for with/without bypass
> coroutine:
> 
>     http://pastebin.com/ae0vnQ6V
> 
> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
> without bypass coroutine.

Yeah, I can count at least 3.3% time spent here:

0.87%          bdrv_co_do_preadv
0.79%          bdrv_aligned_preadv
0.71%          qemu_coroutine_switch
0.52%          tracked_request_begin
0.45%          coroutine_swap

Another ~3% wasted in malloc, etc.

I suggest that we discuss it on the phone at next Tuesday's KVM call.
I'm not denying that bypassing coroutines is a useful thing to do.  But
*everyone* should be doing that if it is so useful, and it's hard to do
it without causing code duplication.

Paolo


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31 16:30               ` Paolo Bonzini
@ 2014-08-01  2:54                 ` Ming Lei
  2014-08-01 13:13                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-08-01  2:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 31/07/2014 18:13, Ming Lei ha scritto:
>> Follows 'perf report' result on cycles event for with/without bypass
>> coroutine:
>>
>>     http://pastebin.com/ae0vnQ6V
>>
>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>> without bypass coroutine.
>
> Yeah, I can count at least 3.3% time spent here:
>
> 0.87%          bdrv_co_do_preadv
> 0.79%          bdrv_aligned_preadv
> 0.71%          qemu_coroutine_switch
> 0.52%          tracked_request_begin
> 0.45%          coroutine_swap
>
> Another ~3% wasted in malloc, etc.

That should be related to the coroutine and the BH in bdrv_co_do_rw().
In this post I didn't apply Stefan's coroutine resize patches, which
might reduce the use of malloc() for coroutines.

At least, coroutines aren't cheap, judging from the profile result.

>
> I suggest that we discuss it on the phone at next Tuesday's KVM call.

No problem, and we can continue to discuss it on mail list too.  If you
and someone else need more profiling data, please feel free to let me
know, and I am happy to provide that.

> I'm not denying that bypassing coroutines is a useful thing to do.  But
> *everyone* should be doing that if it is so useful, and it's hard to do
> it without causing code duplication.

Yes, I agree, and the patch is more or less designed that way.

If a device knows in advance that a coroutine isn't required, it just
calls qemu_aio_set_bypass_co() to tell the block layer not to use
coroutines; that is the approach taken in this patch.


Thanks,


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-07-31  8:52           ` Paolo Bonzini
@ 2014-08-01  3:09             ` Ming Lei
  2014-08-01  3:24               ` Ming Lei
  2014-08-01  6:10               ` Paolo Bonzini
  0 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-01  3:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 4:52 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 31/07/2014 05:47, Ming Lei ha scritto:
>> On Wed, Jul 30, 2014 at 11:25 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 30/07/2014 17:12, Michael S. Tsirkin ha scritto:
>>>>>>
>>>>>> Dataplane must not be a change to the guest ABI.  If you implement this
>>>>>> feature you have to implement it for both dataplane and non-dataplne.
>>>>>>
>>
>> IMO, no matter whether the feature is set or not, both old and new VMs
>> can support it well.
>>
>> Per the virtio spec, the VM is only allowed to access the 'num_queues'
>> field if the feature is set, and I didn't see any problem from the
>> VM's point of view.
>>
>> So could you explain why both dataplane and non-dataplane have
>> to support the feature?
>
> Because otherwise you change the guest ABI.

Sorry, I don't understand why the guest ABI is changed, since the VM
only parses 'num_queues' if the feature is set, and the DATAPLANE
macro is determined at compile time.

>
>> I am doing so because there is no performance advantage in taking
>> multiqueue for non-dataplane.
>
> It doesn't matter, changing features between dataplane and non-dataplane
> is not something that you can do.

Actually it is easy to support multiple virtqueues for non-dataplane;
removing two 'ifdef's should be enough.

>
>>>> Actually, I think different backends at runtime should be allowed to
>>>> change guest visible ABI.  For example if you give qemu a read only
>>>> backend this is guest visible right?
>>>
>>> Dataplane is not meant to be a different backend, it's just moving stuff
>>> out to a separate thread.  It's only due to QEMU limitation that it
>>> doesn't share the vring code (and these patches include a lot of steps
>>> backwards where dataplane becomes != non-dataplane).
>>
>> There won't be lots of backward steps, just the bypass coroutine patch,
>> which only skips coroutines where they are unnecessary, in the raw
>> image case, but it is still one code path.
>
> Using the object pool for dataplane only is a backwards step,
> implementing multique for dataplane only is a backwards step,

This one isn't, since multiqueue has never been tried before for virtio-blk, :-)

> bypassing
> coroutines for dataplane only is a backwards step.

OK, then all these backward steps are positive steps too, since they
do improve performance, :-)

The only side effect I can think of is that bypassing coroutines may
make the block layer a bit more complicated.

Thanks,


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-08-01  3:09             ` Ming Lei
@ 2014-08-01  3:24               ` Ming Lei
  2014-08-01  6:10               ` Paolo Bonzini
  1 sibling, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-01  3:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Fri, Aug 1, 2014 at 11:09 AM, Ming Lei <ming.lei@canonical.com> wrote:
> On Thu, Jul 31, 2014 at 4:52 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

>>
>> It doesn't matter, changing features between dataplane and non-dataplane
>> is not something that you can do.
>
> Actually it is easy to support multi virtqueue for non-dataplane,
> and it should be enough to remove two 'ifdef'.

Sorry, virtqueue_pop()/virtqueue_push() need changes too, but that
won't be difficult either, since 'qid' has already been introduced
inside VirtIOBlockReq.

Thanks,


* Re: [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement
  2014-07-31  9:38       ` Paolo Bonzini
@ 2014-08-01  3:34         ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-01  3:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 5:38 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 31/07/2014 04:07, Ming Lei ha scritto:
>> On Wed, Jul 30, 2014 at 9:51 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 30/07/2014 13:39, Ming Lei ha scritto:
>>>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>>>> index a60104c..943e72f 100644
>>>> --- a/include/hw/virtio/virtio.h
>>>> +++ b/include/hw/virtio/virtio.h
>>>> @@ -84,12 +84,17 @@ typedef struct VirtQueue VirtQueue;
>>>>  typedef struct VirtQueueElement
>>>>  {
>>>>      unsigned int index;
>>>> +    unsigned int num;
>>>>      unsigned int out_num;
>>>>      unsigned int in_num;
>>>> -    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>>>> -    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>>>> -    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>>>> -    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
>>>> +
>>>> +    hwaddr *in_addr;
>>>> +    hwaddr *out_addr;
>>>> +    struct iovec *in_sg;
>>>> +    struct iovec *out_sg;
>>>> +
>>>> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
>>>> +    struct iovec sg[VIRTQUEUE_MAX_SIZE];
>>>>  } VirtQueueElement;
>>>>
>>>>  #define VIRTIO_PCI_QUEUE_MAX 64
>>>> --
>>>
>>> since addr and sg aren't used directly, allocate them flexibly like
>>>
>>>     char *p;
>>>     VirtQueueElement *elem;
>>>     total_size = ROUND_UP(sizeof(struct VirtQueueElement),
>>>                           __alignof__(elem->addr[0]));
>>>     addr_offset = total_size;
>>>     total_size = ROUND_UP(total_size + num * sizeof(elem->addr[0]),
>>>                           __alignof__(elem->sg[0]));
>>>     sg_offset = total_size;
>>>     total_size += num * sizeof(elem->sg[0]);
>>>
>>>     elem = p = g_slice_alloc(total_size);
>>>     elem->size = total_size;
>>>     elem->in_addr = p + addr_offset;
>>>     elem->out_addr = elem->in_addr + in_num;
>>>     elem->in_sg = p + sg_offset;
>>>     elem->out_sg = elem->in_sg + in_num;
>>>
>>> ...
>>>
>>>     g_slice_free1(elem->size, elem);
>>>
>>> The small size will help glib do slab-style allocation and should remove
>>> the need for an object pool.
>>
>> Yes, that should be correct way to do, but can't avoid big chunk allocation
>> completely because 'num' is a bit big.
>
> For typical iops microbenchmarks, num will be 3 (4K I/O) to 18 (64K).

That is only in benchmarks; in reality, writes coming from the file
system are always large, and reads may be large too because of
readahead or I/O merging.

Also, you need to walk the virtqueue first to figure out how many I/O
vectors the current request has, and walking that list is a bit
expensive since it often generates lots of dcache misses.

Thanks,


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-08-01  3:09             ` Ming Lei
  2014-08-01  3:24               ` Ming Lei
@ 2014-08-01  6:10               ` Paolo Bonzini
  2014-08-01  7:35                 ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-08-01  6:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 01/08/2014 05:09, Ming Lei ha scritto:
>>> >> Per virtio spec, only the feature is set, the VM can be allowed to
>>> >> access the 'num_queues' field, and I didn't see any problem from
>>> >> VM's view point.
>>> >>
>>> >> So could you explain why both dataplane and non-dataplane have
>>> >> to support the feature.
>> >
>> > Because otherwise you change the guest ABI.
> Sorry, I don't understand why the guest ABI is changed since
> VM only parses 'num_queues' iff the feature is set, and the DATAPLANE
> macro is determined during compiling.

Even recompiling the same version of QEMU should not change the guest
ABI.  Recompilation may affect which devices are present, but then the
migration destination will not even start if something is broken.

And as you pointed out, migration from dataplane to non-dataplane will
break because you didn't convert callers of virtqueue_pop/push.

Paolo


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-08-01  6:10               ` Paolo Bonzini
@ 2014-08-01  7:35                 ` Ming Lei
  2014-08-01  7:46                   ` Paolo Bonzini
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-08-01  7:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Fri, Aug 1, 2014 at 2:10 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 01/08/2014 05:09, Ming Lei ha scritto:
>>>> >> Per virtio spec, only the feature is set, the VM can be allowed to
>>>> >> access the 'num_queues' field, and I didn't see any problem from
>>>> >> VM's view point.
>>>> >>
>>>> >> So could you explain why both dataplane and non-dataplane have
>>>> >> to support the feature.
>>> >
>>> > Because otherwise you change the guest ABI.
>> Sorry, I don't understand why the guest ABI is changed since
>> VM only parses 'num_queues' iff the feature is set, and the DATAPLANE
>> macro is determined during compiling.
>
> Even recompiling the same version of QEMU should not change the guest
> ABI.  Recompilation may affect which devices are present, but then the
> migration destination will not even start if something is broken.
>
> And as you pointed out, migration from dataplane to non-dataplane will
> break because you didn't convert callers of virtqueue_pop/push.

OK, I will convert non-dataplane to support multiple virtqueues in V1;
the conversion is not difficult and should be straightforward.

BTW, docs/migration.txt mentions that "QEMU has to be launched
with the same arguments the two times", so can I understand that
migration from dataplane to non-dataplane shouldn't have been
allowed?

Thanks,


* Re: [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-07-31  9:18       ` Paolo Bonzini
@ 2014-08-01  7:42         ` Ming Lei
  2014-08-04 10:21           ` Stefan Hajnoczi
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-08-01  7:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 5:18 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 31/07/2014 05:22, Ming Lei ha scritto:
>>> >
>>> > The problem is that g_slice here is not using the slab-style allocator
>>> > because the object is larger than roughly 500 bytes.  One solution would
>>> > be to make virtqueue_pop/vring_pop allocate a VirtQueueElement of the
>>> > right size (and virtqueue_push/vring_push free it), as mentioned in the
>>> > review of patch 8.
>> Unluckily, the iovec and addr arrays can't both be fitted into 500 bytes, :-(
>> Not to mention that all users of VirtQueueElement would need to be changed
>> too; I'd hate to pull that work into this patchset, :-)
>
> Well, the point of dataplane was not just to get maximum iops.  It was
> also to provide guidance in the work necessary to improve the code and
> get maximum iops without special-casing everything.  This can be a lot
> of work indeed.
>
>>> >
>>> > However, I now remembered that VirtQueueElement is a mess because it's
>>> > serialized directly into the migration state. :(  So you basically
>>> > cannot change it without mucking with migration.  Please leave out patch
>>> > 8 for now.
>> save_device code serializes elem in this way:
>>
>>  qemu_put_buffer(f, (unsigned char *)&req->elem,
>>                         sizeof(VirtQueueElement));
>>
>> so I am wondering why this patch may break migration.
>
> Because you change the on-wire format and break migration from 2.1 to
> 2.2.  Sorry, I wasn't clear enough.

That is really a mess, but in the future we may still convert
VirtQueueElement into a smarter structure and keep the original one
only for save/load, with a conversion between the two structures in
the save/load path.


Thanks,


* Re: [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled
  2014-08-01  7:35                 ` Ming Lei
@ 2014-08-01  7:46                   ` Paolo Bonzini
  0 siblings, 0 replies; 71+ messages in thread
From: Paolo Bonzini @ 2014-08-01  7:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

Il 01/08/2014 09:35, Ming Lei ha scritto:
> OK, I will convert non-dataplane to support multi virtqueues in V1,
> and the conversion is not difficult and straightforward.
> 
> BTW, docs/migration.txt mentions that "QEMU has to be launched
> with the same arguments the two times", so can I understand that
> migration from dataplane to non-dataplane shouldn't have been
> allowed?

You're right, that text gives a rule of thumb but it is not absolutely true.

It is for example obviously okay to have different paths for drives in
the two invocations.  It is also okay to use different formats (raw vs.
qcow2) if, for example, you are migrating disk contents (either with
"migrate -b" or with the embedded NBD server).

Dataplane is another such case where an option does not affect how the
device behaves.  It is a bit clearer if you think of it as just
"use an iothread different from the default one", which is what we're
aiming at.  Everything else is working around limitations in QEMU, and
will disappear once these limitations are lifted.

Paolo


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-08-01  2:54                 ` Ming Lei
@ 2014-08-01 13:13                   ` Stefan Hajnoczi
  2014-08-01 13:48                     ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Stefan Hajnoczi @ 2014-08-01 13:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il 31/07/2014 18:13, Ming Lei ha scritto:
> >> Follows 'perf report' result on cycles event for with/without bypass
> >> coroutine:
> >>
> >>     http://pastebin.com/ae0vnQ6V
> >>
> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
> >> without bypass coroutine.
> >
> > Yeah, I can count at least 3.3% time spent here:
> >
> > 0.87%          bdrv_co_do_preadv
> > 0.79%          bdrv_aligned_preadv
> > 0.71%          qemu_coroutine_switch
> > 0.52%          tracked_request_begin
> > 0.45%          coroutine_swap
> >
> > Another ~3% wasted in malloc, etc.
> 
> That should be related with coroutine and the BH in bdrv_co_do_rw().
> In this post I didn't apply Stephan's coroutine resize patch which might
> decrease usage of malloc() for coroutine.

Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
size".

> At least, coroutine isn't cheap from the profile result.

Instead of bypassing coroutines we should first understand the overhead
that they impose.  Is it due to the coroutine implementation (switching
stacks) or due to the bdrv_co_*() code that happens to use coroutines
but is slow for other reasons?


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-07-31 10:45               ` Paolo Bonzini
@ 2014-08-01 13:38                 ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-01 13:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Benoît Canet, Kevin Wolf, Fam Zheng, Michael S. Tsirkin,
	Peter Maydell, qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 6:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 31/07/2014 11:47, Ming Lei ha scritto:
>>> Block mirroring of a device for example is done using coroutines.
>>> As block mirroring can be done on a raw device you need coroutines.
>>
>> If block layer knows the mirroring is in-progress, it still can enable
>> coroutine by ignoring bypass coroutine, or just let device disable
>> bypass coroutine in case of mirroring, and the current APIs are very
>> flexible.
>
> What matters is not whether you're mirroring.  What matters is whether
> you're calling bdrv_aio_readv/writev or bdrv_co_readv/writev.  Under
> some limitation, if you are calling bdrv_aio_readv/writev you can bypass
> coroutines.  (In fact drive mirroring uses those APIs too so it doesn't
> need coroutines and can benefit from the speedup too. :)  But
> drive-backup does need them).
>
> The limitations are essentially "would bdrv_co_do_preadv or
> bdrv_co_do_pwritev do anything special?"  This means for example for
> bdrv_co_do_pwritev:
>
> - bs->drv must be non-NULL
>
> - bs->read_only must be false
>
> - bdrv_check_byte_request(bs, offset, bytes) must return false
>
> - bs->io_limits_enabled must be false
>
> - the request must be aligned
>
> - (in bdrv_aligned_pwritev) the before_write_notifiers must be empty
>
> - (in bdrv_aligned_pwritev) bs->detect_zeroes must be off
>
> - (in bdrv_aligned_pwritev) the BDRV_REQ_ZERO_WRITE flag must be off
>
> - (in bdrv_aligned_pwritev) bs->enable_write_cache must be false
>
> and the hard part is organizing the code so that the code duplication is
> as limited as possible.

In this patchset, only the implementation of bypassing coroutines is
introduced, and it is simply applied to raw images for dataplane; it
won't break drive mirror & backup, if I understand correctly. So I
suggest we focus on the implementation of bypassing coroutines.

All of the above is mostly about whether the bypass can be applied in
other cases, so this patchset shouldn't have caused code duplication.

I would very much appreciate it if the implementation of bypassing
coroutines could be reviewed in this discussion.

Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-08-01 13:13                   ` Stefan Hajnoczi
@ 2014-08-01 13:48                     ` Ming Lei
  2014-08-01 14:17                       ` Paolo Bonzini
  2014-08-01 14:52                       ` Ming Lei
  0 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-01 13:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
>> >> Follows 'perf report' result on cycles event for with/without bypass
>> >> coroutine:
>> >>
>> >>     http://pastebin.com/ae0vnQ6V
>> >>
>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>> >> without bypass coroutine.
>> >
>> > Yeah, I can count at least 3.3% time spent here:
>> >
>> > 0.87%          bdrv_co_do_preadv
>> > 0.79%          bdrv_aligned_preadv
>> > 0.71%          qemu_coroutine_switch
>> > 0.52%          tracked_request_begin
>> > 0.45%          coroutine_swap
>> >
>> > Another ~3% wasted in malloc, etc.
>>
>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>> In this post I didn't apply Stephan's coroutine resize patch which might
>> decrease usage of malloc() for coroutine.
>
> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
> size".

No problem, will do that. Actually my last RFC post of this patchset
was already made against your coroutine resize patches.

I will provide the profile data tomorrow.

>
>> At least, coroutine isn't cheap from the profile result.
>
> Instead of bypassing coroutines we should first understand the overhead
> that they impose.  Is it due to the coroutine implementation (switching
> stacks) or due to the bdrv_co_*() code that happens to use coroutines
> but slow for other reasons.

From the 3rd patch (block: support to bypass qemu coroutine)
and the 5th patch (dataplane: enable selective bypassing coroutine),
the change is only to bypass the coroutine and BH; the rest of the
bdrv code path is the same, so the overhead is due to the coroutine
implementation, IMO.

Thanks,


* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-08-01 13:48                     ` Ming Lei
@ 2014-08-01 14:17                       ` Paolo Bonzini
  2014-08-01 15:21                         ` Ming Lei
  2014-08-01 14:52                       ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Paolo Bonzini @ 2014-08-01 14:17 UTC (permalink / raw)
  To: Ming Lei, Stefan Hajnoczi
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, qemu-devel, Michael S. Tsirkin

Il 01/08/2014 15:48, Ming Lei ha scritto:
> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> Il 31/07/2014 18:13, Ming Lei ha scritto:
>>>>> Follows 'perf report' result on cycles event for with/without bypass
>>>>> coroutine:
>>>>>
>>>>>     http://pastebin.com/ae0vnQ6V
>>>>>
>>>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>>>> without bypass coroutine.
>>>>
>>>> Yeah, I can count at least 3.3% time spent here:
>>>>
>>>> 0.87%          bdrv_co_do_preadv
>>>> 0.79%          bdrv_aligned_preadv
>>>> 0.71%          qemu_coroutine_switch
>>>> 0.52%          tracked_request_begin
>>>> 0.45%          coroutine_swap
>>>>
>>>> Another ~3% wasted in malloc, etc.
>>>
>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>> In this post I didn't apply Stephan's coroutine resize patch which might
>>> decrease usage of malloc() for coroutine.
>>
>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> size".
> 
> No problem, will do that. Actually in my last post with rfc, this patchset
> was against your coroutine resize patches.
> 
> I will provide the profile data tomorrow.
> 
>>
>>> At least, coroutine isn't cheap from the profile result.
>>
>> Instead of bypassing coroutines we should first understand the overhead
>> that they impose.  Is it due to the coroutine implementation (switching
>> stacks) or due to the bdrv_co_*() code that happens to use coroutines
>> but slow for other reasons.
> 
> From the 3th patch(block: support to bypass qemu coroutinue)
> and the 5th patch(dataplane: enable selective bypassing coroutine),
> the change is to bypass coroutine and BH, and the other bdrv code
> path is same, so it is due to the coroutine implementation, IMO.

But your code breaks all sorts of invariants.  For example, the aiocb
must be valid when bdrv_aio_readv/writev return.  virtio-blk does not
use it, but virtio-scsi does.  If we apply your patches now, we will
have to redo it soon.

Basically we should be rewriting parts of block.c so that
bdrv_co_readv/writev calls bdrv_aio_readv/writev instead of vice versa.
Coroutine creation should be pushed down to
bdrv_aligned_preadv/bdrv_aligned_pwritev and, in the fast path, you can
simply call the driver's bdrv_aio_readv/writev.
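A minimal sketch of that dispatch idea, using stub types (the structs, fields, and function names here are illustrative stand-ins, not QEMU's real BlockDriver/BlockDriverState definitions): when the driver provides a native AIO callback and the request needs no alignment fix-ups, call it directly; otherwise fall back to a coroutine.

```c
#include <assert.h>
#include <stddef.h>

/* Stub types standing in for QEMU's BlockDriver/BlockDriverState. */
struct BlockDriverState;

typedef int (*rw_fn)(struct BlockDriverState *bs, long sector, int nb_sectors);

typedef struct BlockDriver {
    rw_fn bdrv_aio_readv;   /* native AIO entry point; may be NULL */
    rw_fn bdrv_co_readv;    /* coroutine-based entry point */
} BlockDriver;

typedef struct BlockDriverState {
    BlockDriver *drv;
    int coroutine_entries;  /* how often the slow path was taken */
} BlockDriverState;

/* Slow path: in QEMU this would create and enter a coroutine around
 * bdrv_aligned_preadv; here we only count the entry. */
static int readv_in_coroutine(BlockDriverState *bs, long sector, int n)
{
    bs->coroutine_entries++;
    return bs->drv->bdrv_co_readv(bs, sector, n);
}

/* Dispatch in the spirit of the suggestion: take the direct AIO fast
 * path when possible, the coroutine path otherwise. */
static int bdrv_readv_dispatch(BlockDriverState *bs, long sector, int n,
                               int needs_alignment)
{
    if (bs->drv->bdrv_aio_readv && !needs_alignment) {
        return bs->drv->bdrv_aio_readv(bs, sector, n);  /* fast path */
    }
    return readv_in_coroutine(bs, sector, n);
}

/* Demo driver callbacks returning distinct sentinel values. */
static int demo_aio_readv(struct BlockDriverState *bs, long s, int n)
{
    (void)bs; (void)s; (void)n;
    return 1;
}

static int demo_co_readv(struct BlockDriverState *bs, long s, int n)
{
    (void)bs; (void)s; (void)n;
    return 2;
}
```

The point of the sketch is only the shape of the control flow: coroutine creation happens behind one branch, so a raw image on the aligned path never pays for it.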

Paolo

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-08-01 13:48                     ` Ming Lei
  2014-08-01 14:17                       ` Paolo Bonzini
@ 2014-08-01 14:52                       ` Ming Lei
  2014-08-01 16:03                         ` Stefan Hajnoczi
  1 sibling, 1 reply; 71+ messages in thread
From: Ming Lei @ 2014-08-01 14:52 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
>>> >> Follows 'perf report' result on cycles event for with/without bypass
>>> >> coroutine:
>>> >>
>>> >>     http://pastebin.com/ae0vnQ6V
>>> >>
>>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>> >> without bypass coroutine.
>>> >
>>> > Yeah, I can count at least 3.3% time spent here:
>>> >
>>> > 0.87%          bdrv_co_do_preadv
>>> > 0.79%          bdrv_aligned_preadv
>>> > 0.71%          qemu_coroutine_switch
>>> > 0.52%          tracked_request_begin
>>> > 0.45%          coroutine_swap
>>> >
>>> > Another ~3% wasted in malloc, etc.
>>>
>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>> In this post I didn't apply Stephan's coroutine resize patch which might
>>> decrease usage of malloc() for coroutine.
>>
>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> size".
>
> No problem, will do that. Actually in my last post with rfc, this patchset
> was against your coroutine resize patches.
>
> I will provide the profile data tomorrow.

Please see the link below for the profile without bypass coroutine,
with your coroutine resize patches (v3) applied:

   http://pastebin.com/10y00sir

BTW, if anyone would like to play with this stuff, these patches can
be pulled easily from the tree below:

git://kernel.ubuntu.com/ming/qemu.git

branch:
         - v2.1.0-mq-co-resize:
              these 15 patches against 2.1.0-rc5 plus Stefan's coroutine
              resize patches(V3)
         - v2.1.0-mq:
              only these 15 patches against 2.1.0-rc5

Thanks,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-08-01 14:17                       ` Paolo Bonzini
@ 2014-08-01 15:21                         ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-01 15:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Stefan Hajnoczi

On Fri, Aug 1, 2014 at 10:17 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 01/08/2014 15:48, Ming Lei ha scritto:
>> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> Il 31/07/2014 18:13, Ming Lei ha scritto:
>>>>>> Follows 'perf report' result on cycles event for with/without bypass
>>>>>> coroutine:
>>>>>>
>>>>>>     http://pastebin.com/ae0vnQ6V
>>>>>>
>>>>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>>>>> without bypass coroutine.
>>>>>
>>>>> Yeah, I can count at least 3.3% time spent here:
>>>>>
>>>>> 0.87%          bdrv_co_do_preadv
>>>>> 0.79%          bdrv_aligned_preadv
>>>>> 0.71%          qemu_coroutine_switch
>>>>> 0.52%          tracked_request_begin
>>>>> 0.45%          coroutine_swap
>>>>>
>>>>> Another ~3% wasted in malloc, etc.
>>>>
>>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>>> In this post I didn't apply Stephan's coroutine resize patch which might
>>>> decrease usage of malloc() for coroutine.
>>>
>>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>>> size".
>>
>> No problem, will do that. Actually in my last post with rfc, this patchset
>> was against your coroutine resize patches.
>>
>> I will provide the profile data tomorrow.
>>
>>>
>>>> At least, coroutine isn't cheap from the profile result.
>>>
>>> Instead of bypassing coroutines we should first understand the overhead
>>> that they impose.  Is it due to the coroutine implementation (switching
>>> stacks) or due to the bdrv_co_*() code that happens to use coroutines
>>> but slow for other reasons.
>>
>> From the 3th patch(block: support to bypass qemu coroutinue)
>> and the 5th patch(dataplane: enable selective bypassing coroutine),
>> the change is to bypass coroutine and BH, and the other bdrv code
>> path is same, so it is due to the coroutine implementation, IMO.
>
> But your code breaks all sort of invariants.  For example, the aiocb
> must be valid when bdrv_aio_readv/writev return.  virtio-blk does not
> use it, but virtio-scsi does.  If we apply your patches now, we will
> have to redo it soon.

It can be supported by scheduling a BH in bdrv_co_io_em_complete() too.
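The reason a BH helps here can be shown with a miniature deferred-callback queue (a toy model, not QEMU's actual QEMUBH API): the completion callback is queued rather than invoked directly, so it only runs after the submitter's bdrv_aio_readv/writev call has returned and the aiocb is fully set up.

```c
#include <assert.h>
#include <stddef.h>

/* A miniature bottom-half queue: completions are queued instead of
 * being invoked directly, and run later from the event loop. */
typedef struct BH {
    void (*cb)(void *opaque, int ret);
    void *opaque;
    int ret;
    struct BH *next;
} BH;

static BH *bh_list;  /* pending bottom halves (LIFO for brevity) */

static void bh_schedule(BH *bh, void (*cb)(void *, int), void *opaque, int ret)
{
    bh->cb = cb;
    bh->opaque = opaque;
    bh->ret = ret;
    bh->next = bh_list;
    bh_list = bh;
}

/* Run all pending bottom halves; returns how many ran. */
static int bh_poll(void)
{
    int ran = 0;
    while (bh_list) {
        BH *bh = bh_list;
        bh_list = bh->next;
        bh->cb(bh->opaque, bh->ret);
        ran++;
    }
    return ran;
}

/* Demo completion callback: the submitter observes the result only
 * here, never from inside the submission call itself. */
static void demo_complete(void *opaque, int ret)
{
    *(int *)opaque = ret;
}
```

Because the callback cannot fire before the next `bh_poll()`, the caller is guaranteed a window in which the aiocb it just received is still valid, which is the invariant virtio-scsi relies on.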

>
> Basically we should be rewriting parts of block.c so that
> bdrv_co_readv/writev calls bdrv_aio_readv/writev instead of vice versa.
>  Coroutine creation should be pushed down to the
> bdrv_aligned_preadv/bdrv_aligned_pwritev and, in the fast path, you can
> simply call the driver's bdrv_aio_readv/writev.

This idea is very constructive, and I will investigate further to see
if it is easy to start.


Thanks,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-08-01 14:52                       ` Ming Lei
@ 2014-08-01 16:03                         ` Stefan Hajnoczi
  2014-08-02  2:42                           ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Stefan Hajnoczi @ 2014-08-01 16:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1876 bytes --]

On Fri, Aug 01, 2014 at 10:52:55PM +0800, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
> > On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
> >>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
> >>> >> Follows 'perf report' result on cycles event for with/without bypass
> >>> >> coroutine:
> >>> >>
> >>> >>     http://pastebin.com/ae0vnQ6V
> >>> >>
> >>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
> >>> >> without bypass coroutine.
> >>> >
> >>> > Yeah, I can count at least 3.3% time spent here:
> >>> >
> >>> > 0.87%          bdrv_co_do_preadv
> >>> > 0.79%          bdrv_aligned_preadv
> >>> > 0.71%          qemu_coroutine_switch
> >>> > 0.52%          tracked_request_begin
> >>> > 0.45%          coroutine_swap
> >>> >
> >>> > Another ~3% wasted in malloc, etc.
> >>>
> >>> That should be related with coroutine and the BH in bdrv_co_do_rw().
> >>> In this post I didn't apply Stephan's coroutine resize patch which might
> >>> decrease usage of malloc() for coroutine.
> >>
> >> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
> >> size".
> >
> > No problem, will do that. Actually in my last post with rfc, this patchset
> > was against your coroutine resize patches.
> >
> > I will provide the profile data tomorrow.
> 
> Please see below link for without bypass coroutine, and with
> your coroutine resize patches(V3):
> 
>    http://pastebin.com/10y00sir

Thanks for sharing!

Do you have the results (IOPS and perf report) for just the coroutine
bypass (but not the other changes in this patch series)?

Coroutine: 101k IOPS
Bypass: ? IOPS

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
  2014-08-01 16:03                         ` Stefan Hajnoczi
@ 2014-08-02  2:42                           ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-02  2:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Sat, Aug 2, 2014 at 12:03 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Aug 01, 2014 at 10:52:55PM +0800, Ming Lei wrote:
>> On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
>> > On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> >> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>> >>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
>> >>> >> Follows 'perf report' result on cycles event for with/without bypass
>> >>> >> coroutine:
>> >>> >>
>> >>> >>     http://pastebin.com/ae0vnQ6V
>> >>> >>
>> >>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>> >>> >> without bypass coroutine.
>> >>> >
>> >>> > Yeah, I can count at least 3.3% time spent here:
>> >>> >
>> >>> > 0.87%          bdrv_co_do_preadv
>> >>> > 0.79%          bdrv_aligned_preadv
>> >>> > 0.71%          qemu_coroutine_switch
>> >>> > 0.52%          tracked_request_begin
>> >>> > 0.45%          coroutine_swap
>> >>> >
>> >>> > Another ~3% wasted in malloc, etc.
>> >>>
>> >>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>> >>> In this post I didn't apply Stephan's coroutine resize patch which might
>> >>> decrease usage of malloc() for coroutine.
>> >>
>> >> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> >> size".
>> >
>> > No problem, will do that. Actually in my last post with rfc, this patchset
>> > was against your coroutine resize patches.
>> >
>> > I will provide the profile data tomorrow.
>>
>> Please see below link for without bypass coroutine, and with
>> your coroutine resize patches(V3):
>>
>>    http://pastebin.com/10y00sir
>
> Thanks for sharing!
>
> Do you have the results (IOPS and perf report) for just the coroutine
> bypass (but not the other changes in this patch series)?
>
> Coroutine: 101k IOPS
> Bypass: ? IOPS

Please see the link below; sorry for missing the data for bypass
coroutine, which was collected under the same conditions:

http://pastebin.com/JqrpF87G

Thanks,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support
  2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
                   ` (15 preceding siblings ...)
  2014-07-30 12:42 ` [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Christian Borntraeger
@ 2014-08-04 10:16 ` Stefan Hajnoczi
  2014-08-04 10:45   ` Ming Lei
  16 siblings, 1 reply; 71+ messages in thread
From: Stefan Hajnoczi @ 2014-08-04 10:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1143 bytes --]

On Wed, Jul 30, 2014 at 07:39:33PM +0800, Ming Lei wrote:
> These patches bring up below 4 changes:
> 
>         - introduce selective coroutine bypass mechanism
>         for improving performance of virtio-blk dataplane with
>         raw format image
> 
>         - introduce object allocation pool and apply it to
>         virtio-blk dataplane for improving its performance
> 
>         - linux-aio changes: fixing for cases of -EAGAIN and partial
>         completion, increase max events to 256, and remove one unuseful
>         fields in 'struct qemu_laiocb'
> 
>         - support multi virtqueue for virtio-blk dataplane

Please split this series into separate series.

Patches 1-5 - block layer coroutine bypass

  These patches are hacky and not correct yet (e.g. you cannot enable
  bypass on a per-AioContext basis since multiple devices can share an
  IOThread's AioContext and some of them might not be raw, you are
  invoking acb's cb() directly instead of via a BH so there may be
  reentrancy).

  It would be great if cleaning up the block.c aio->co emulation layers
  could improve performance.

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-08-01  7:42         ` Ming Lei
@ 2014-08-04 10:21           ` Stefan Hajnoczi
  2014-08-04 11:42             ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Stefan Hajnoczi @ 2014-08-04 10:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2026 bytes --]

On Fri, Aug 01, 2014 at 03:42:05PM +0800, Ming Lei wrote:
> On Thu, Jul 31, 2014 at 5:18 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il 31/07/2014 05:22, Ming Lei ha scritto:
> >>> >
> >>> > The problem is that g_slice here is not using the slab-style allocator
> >>> > because the object is larger than roughly 500 bytes.  One solution would
> >>> > be to make virtqueue_pop/vring_pop allocate a VirtQueueElement of the
> >>> > right size (and virtqueue_push/vring_push free it), as mentioned in the
> >>> > review of patch 8.
> >> Unluckily both iovec and addr array can't be fitted into 500 bytes, :-(
> >> Not mention all users of VirtQueueElement need to be changed too,
> >> I hate to make that work involved in this patchset, :-)
> >
> > Well, the point of dataplane was not just to get maximum iops.  It was
> > also to provide guidance in the work necessary to improve the code and
> > get maximum iops without special-casing everything.  This can be a lot
> > of work indeed.
> >
> >>> >
> >>> > However, I now remembered that VirtQueueElement is a mess because it's
> >>> > serialized directly into the migration state. :(  So you basically
> >>> > cannot change it without mucking with migration.  Please leave out patch
> >>> > 8 for now.
> >> save_device code serializes elem in this way:
> >>
> >>  qemu_put_buffer(f, (unsigned char *)&req->elem,
> >>                         sizeof(VirtQueueElement));
> >>
> >> so I am wondering why this patch may break migration.
> >
> > Because you change the on-wire format and break migration from 2.1 to
> > 2.2.  Sorry, I wasn't clear enough.
> 
> That is really a mess, but in future we still may convert VirtQueueElement
> into a smart one, and keep the original structure only for save/load, but
> a conversion between the two structures is required in save/load.

This is a good idea.  The VirtQueueElement structure isn't complicated,
it's just big.  Just use the current layout as the serialization format.

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256
  2014-07-30 17:20     ` Ming Lei
@ 2014-08-04 10:26       ` Stefan Hajnoczi
  0 siblings, 0 replies; 71+ messages in thread
From: Stefan Hajnoczi @ 2014-08-04 10:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

On Thu, Jul 31, 2014 at 01:20:22AM +0800, Ming Lei wrote:
> On Wed, Jul 30, 2014 at 10:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il 30/07/2014 13:39, Ming Lei ha scritto:
> >> This patch increases max event to 256 for the comming
> >> virtio-blk multi virtqueue support.
> >>
> >> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> >> ---
> >>  block/linux-aio.c |    2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > What makes the new magic number less magic than the old one?
> 
> Just for supporting the coming multi virtqueue, otherwise it is
> easy to trigger EAGAIN.
> 
> Or do you have better idea to figure out a non-magic number?

The virtio-blk device knows how many requests can be in-flight at a
time.  laio_init() could take a size parameter.  The messy thing is that
the block layer is between the virtio-blk device and Linux AIO, so we'd
need to pass that information down.
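The sizing itself is simple once the information reaches linux-aio; a sketch of the arithmetic a sized laio_init() could use (the function name and cap are hypothetical, not QEMU code):

```c
#include <assert.h>

#define MAX_EVENTS_CEILING 1024  /* illustrative safety cap */

/* Hypothetical helper for a sized laio_init(): the virtio-blk device
 * knows num_queues and the per-queue depth, so io_setup()'s nr_events
 * can be derived instead of hard-coding a magic 128 or 256. */
static int laio_max_events(int num_queues, int queue_depth)
{
    int nr = num_queues * queue_depth;
    if (nr < 1) {
        nr = 1;                   /* io_setup() needs at least one event */
    }
    if (nr > MAX_EVENTS_CEILING) {
        nr = MAX_EVENTS_CEILING;  /* avoid exhausting fs/aio-max-nr */
    }
    return nr;
}
```

With 4 virtqueues of depth 64 this yields 256, matching the value the patch picks by hand; the hard part, as noted above, is plumbing the device's numbers down through the block layer.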

Stefan

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support
  2014-08-04 10:16 ` Stefan Hajnoczi
@ 2014-08-04 10:45   ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-04 10:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Mon, Aug 4, 2014 at 6:16 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Wed, Jul 30, 2014 at 07:39:33PM +0800, Ming Lei wrote:
>> These patches bring up below 4 changes:
>>
>>         - introduce selective coroutine bypass mechanism
>>         for improving performance of virtio-blk dataplane with
>>         raw format image
>>
>>         - introduce object allocation pool and apply it to
>>         virtio-blk dataplane for improving its performance
>>
>>         - linux-aio changes: fixing for cases of -EAGAIN and partial
>>         completion, increase max events to 256, and remove one unuseful
>>         fields in 'struct qemu_laiocb'
>>
>>         - support multi virtqueue for virtio-blk dataplane
>
> Please split this series into separate series.
>
> Patch 1-5 - block layer coroutine bypass
>
>   These patches are hacky and not correct yet (e.g. you cannot enable
>   bypass on a per-AioContext basis since multiple devices can share an
>   IOThread's AioContext and some of them might not be raw, you are

Good point, so the 'bypass' info should be moved to BlockDriverState.

>   invoking acb's cb() directly instead of via a BH so there may be
>   reentrancy).

Yes, as Paolo pointed out too, I will change the code to run the acb's
cb() via a BH in the v1 patchset.

>
>   It would be great if cleaning up the block.c aio->co emulation layers
>   could improve performance.

That is a good idea too.

But from the current profiling, coroutines are surely one of the main
causes of the performance regression.

Thanks,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request
  2014-08-04 10:21           ` Stefan Hajnoczi
@ 2014-08-04 11:42             ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2014-08-04 11:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Peter Maydell, Fam Zheng, Michael S. Tsirkin,
	qemu-devel, Paolo Bonzini

On Mon, Aug 4, 2014 at 6:21 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Aug 01, 2014 at 03:42:05PM +0800, Ming Lei wrote:
>> On Thu, Jul 31, 2014 at 5:18 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > Il 31/07/2014 05:22, Ming Lei ha scritto:
>> >>> >
>> >>> > The problem is that g_slice here is not using the slab-style allocator
>> >>> > because the object is larger than roughly 500 bytes.  One solution would
>> >>> > be to make virtqueue_pop/vring_pop allocate a VirtQueueElement of the
>> >>> > right size (and virtqueue_push/vring_push free it), as mentioned in the
>> >>> > review of patch 8.
>> >> Unluckily both iovec and addr array can't be fitted into 500 bytes, :-(
>> >> Not mention all users of VirtQueueElement need to be changed too,
>> >> I hate to make that work involved in this patchset, :-)
>> >
>> > Well, the point of dataplane was not just to get maximum iops.  It was
>> > also to provide guidance in the work necessary to improve the code and
>> > get maximum iops without special-casing everything.  This can be a lot
>> > of work indeed.
>> >
>> >>> >
>> >>> > However, I now remembered that VirtQueueElement is a mess because it's
>> >>> > serialized directly into the migration state. :(  So you basically
>> >>> > cannot change it without mucking with migration.  Please leave out patch
>> >>> > 8 for now.
>> >> save_device code serializes elem in this way:
>> >>
>> >>  qemu_put_buffer(f, (unsigned char *)&req->elem,
>> >>                         sizeof(VirtQueueElement));
>> >>
>> >> so I am wondering why this patch may break migration.
>> >
>> > Because you change the on-wire format and break migration from 2.1 to
>> > 2.2.  Sorry, I wasn't clear enough.
>>
>> That is really a mess, but in future we still may convert VirtQueueElement
>> into a smart one, and keep the original structure only for save/load, but
>> a conversion between the two structures is required in save/load.
>
> This is a good idea.  The VirtQueueElement structure isn't complicated,
> it's just big.  Just use the current layout as the serialization format.

The patch shouldn't be too complicated either; the only annoying part
is finding each virtio device's load/save functions and adding the
conversion in both of them wherever VirtQueueElement is involved.

I suggest doing the conversion in a separate standalone patchset.

Thanks,

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2014-08-04 11:42 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-30 11:39 [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode Ming Lei
2014-07-30 13:45   ` Paolo Bonzini
2014-07-30 17:15     ` Ming Lei
2014-07-30 23:37       ` Paolo Bonzini
2014-07-31  3:55         ` Ming Lei
2014-07-31  7:37           ` Benoît Canet
2014-07-31  9:47             ` Ming Lei
2014-07-31 10:45               ` Paolo Bonzini
2014-08-01 13:38                 ` Ming Lei
2014-07-31  8:59         ` Ming Lei
2014-07-31  9:15           ` Paolo Bonzini
2014-07-31 10:06             ` Ming Lei
2014-07-31 16:13             ` Ming Lei
2014-07-31 16:30               ` Paolo Bonzini
2014-08-01  2:54                 ` Ming Lei
2014-08-01 13:13                   ` Stefan Hajnoczi
2014-08-01 13:48                     ` Ming Lei
2014-08-01 14:17                       ` Paolo Bonzini
2014-08-01 15:21                         ` Ming Lei
2014-08-01 14:52                       ` Ming Lei
2014-08-01 16:03                         ` Stefan Hajnoczi
2014-08-02  2:42                           ` Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 02/15] qemu aio: prepare for supporting selective bypass coroutine Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 03/15] block: support to bypass qemu coroutinue Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 04/15] Revert "raw-posix: drop raw_get_aio_fd() since it is no longer used" Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 05/15] dataplane: enable selective bypassing coroutine Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 06/15] qemu/obj_pool.h: introduce object allocation pool Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 07/15] dataplane: use object pool to speed up allocation for virtio blk request Ming Lei
2014-07-30 14:14   ` Paolo Bonzini
2014-07-30 15:09     ` Michael S. Tsirkin
2014-07-31  3:22     ` Ming Lei
2014-07-31  9:18       ` Paolo Bonzini
2014-08-01  7:42         ` Ming Lei
2014-08-04 10:21           ` Stefan Hajnoczi
2014-08-04 11:42             ` Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 08/15] virtio: decrease size of VirtQueueElement Ming Lei
2014-07-30 13:51   ` Paolo Bonzini
2014-07-30 14:40     ` Michael S. Tsirkin
2014-07-30 14:50       ` Paolo Bonzini
2014-07-31  2:11       ` Ming Lei
2014-07-31  2:07     ` Ming Lei
2014-07-31  9:38       ` Paolo Bonzini
2014-08-01  3:34         ` Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 09/15] linux-aio: fix submit aio as a batch Ming Lei
2014-07-30 13:59   ` Paolo Bonzini
2014-07-30 17:32     ` Ming Lei
2014-07-30 23:41       ` Paolo Bonzini
2014-07-30 11:39 ` [Qemu-devel] [PATCH 10/15] linux-aio: increase max event to 256 Ming Lei
2014-07-30 12:15   ` Eric Blake
2014-07-30 14:00   ` Paolo Bonzini
2014-07-30 17:20     ` Ming Lei
2014-08-04 10:26       ` Stefan Hajnoczi
2014-07-30 11:39 ` [Qemu-devel] [PATCH 11/15] linux-aio: remove 'node' from 'struct qemu_laiocb' Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 12/15] hw/virtio-pci: introduce num_queues property Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 13/15] hw/virtio/virtio-blk.h: introduce VIRTIO_BLK_F_MQ Ming Lei
2014-07-30 11:39 ` [Qemu-devel] [PATCH 14/15] hw/block/virtio-blk: create num_queues vqs if dataplane is enabled Ming Lei
2014-07-30 14:01   ` Paolo Bonzini
2014-07-30 15:12     ` Michael S. Tsirkin
2014-07-30 15:25       ` Paolo Bonzini
2014-07-31  3:47         ` Ming Lei
2014-07-31  8:52           ` Paolo Bonzini
2014-08-01  3:09             ` Ming Lei
2014-08-01  3:24               ` Ming Lei
2014-08-01  6:10               ` Paolo Bonzini
2014-08-01  7:35                 ` Ming Lei
2014-08-01  7:46                   ` Paolo Bonzini
2014-07-30 11:39 ` [Qemu-devel] [PATCH 15/15] dataplane: virtio-blk: support mutlti virtqueue Ming Lei
2014-07-30 12:42 ` [Qemu-devel] [PATCH 00/14] dataplane: optimization and multi virtqueue support Christian Borntraeger
2014-08-04 10:16 ` Stefan Hajnoczi
2014-08-04 10:45   ` Ming Lei
