qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
@ 2016-02-16 17:56 Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 01/16] block: make bdrv_start_throttled_reqs return void Paolo Bonzini
                   ` (18 more replies)
  0 siblings, 19 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

So the fine-grained locking series has grown from 2 parts to 3...

This first part stops where we remove RFifoLock.

In the previous submission, the bulk of the series took care of
associating an AioContext to a thread, so that aio_poll could run event
handlers only on that thread.  This was done to fix a race in bdrv_drain.
There were two disadvantages:

1) reporting progress from aio_poll to the main thread was Really Hard.
Despite being relatively sure of the correctness---also thanks to formal
models---it's not the kind of code that I'd lightly donate to posterity.

2) the strict dependency between bdrv_drain and a single AioContext
would have been bad for multiqueue.

Instead, this series does the same trick (do not run dataplane event handlers
from the main thread) but reports progress directly at the BlockDriverState
level.

To do so, the mechanism to track in-flight requests is changed to a
simple counter.  This is more flexible than BdrvTrackedRequest, and lets
the block/io.c code track precisely the initiation and completion of the
requests.  In particular, the counter remains nonzero when a bottom half
is required to "really" complete the request.  bdrv_drain doesn't rely
anymore on a non-blocking aio_poll to execute those bottom halves.

It is also more efficient; while BdrvTrackedRequest has to remain
for request serialisation, with this series we can in principle make
BdrvTrackedRequest optional and enable it only when something (automatic
RMW or copy-on-read) requires request serialisation.

In general this flows nicely, the only snag being patch 5.  The problem
here is that mirror is calling bdrv_drain from an AIO callback, which
used to be okay (because bdrv_drain more or less tried to guess when
all AIO callbacks were done) but now causes a deadlock (because bdrv_drain
really checks for AIO callbacks to be complete).  To add to the complication,
the bdrv_drain happens far away from the AIO callback, in the coroutine that
the AIO callback enters.  The situation here is admittedly underdefined,
and Stefan pointed out that the same solution is found in many other places
in the QEMU block layer.  Therefore I think it's acceptable.  I'm pointing
it out explicitly though, because (together with patch 8) it is an example
of the issues caused by the new, stricter definition of bdrv_drain.

Patches 1-7 organize bdrv_drain into separate phases, with a well-defined
order between the BDS and it children.  The phases are: 1) disable
throttling; 2) disable io_plug; 3) call bdrv_drain callbacks; 4) wait
for I/O to finish; 5) re-enable io_plug and throttling.  Previously,
throttling of throttling and io_plug were handled by bdrv_flush_io_queue,
which was repeatedly called as part of the I/O wait loop.  This part of
the series removes bdrv_flush_io_queue.

Patch 8-11 replace aio_poll with bdrv_drain so that the work done
so far applies to all former callers of aio_poll.

Patches 12-13 introduce the logic that lets the main thread wait remotely
for an iothread to drain the BDS.  Patches 14-16 then achieve the purpose
of this series, removing the long-running AioContext critical section
from iothread_run and removing RFifoLock.

The series passes check-block.sh with a few fixes applied for pre-existing
bugs:

    vl: fix migration from prelaunch state
    migration: fix incorrect memory_global_dirty_log_start outside BQL
    qed: fix bdrv_qed_drain

Paolo

Paolo Bonzini (16):
  block: make bdrv_start_throttled_reqs return void
  block: move restarting of throttled reqs to block/throttle-groups.c
  block: introduce bdrv_no_throttling_begin/end
  block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end
  mirror: use bottom half to re-enter coroutine
  block: add BDS field to count in-flight requests
  block: change drain to look only at one child at a time
  blockjob: introduce .drain callback for jobs
  block: wait for all pending I/O when doing synchronous requests
  nfs: replace aio_poll with bdrv_drain
  sheepdog: disable dataplane
  aio: introduce aio_context_in_iothread
  block: only call aio_poll from iothread
  iothread: release AioContext around aio_poll
  qemu-thread: introduce QemuRecMutex
  aio: convert from RFifoLock to QemuRecMutex

 async.c                         |  28 +---
 block.c                         |   4 +-
 block/backup.c                  |   7 +
 block/io.c                      | 281 +++++++++++++++++++++++++---------------
 block/linux-aio.c               |  13 +-
 block/mirror.c                  |  37 +++++-
 block/nfs.c                     |  50 ++++---
 block/qed-table.c               |  16 +--
 block/qed.c                     |   4 +-
 block/raw-aio.h                 |   2 +-
 block/raw-posix.c               |  16 +--
 block/sheepdog.c                |  19 +++
 block/throttle-groups.c         |  19 +++
 blockjob.c                      |  16 ++-
 docs/multiple-iothreads.txt     |  40 +++---
 include/block/aio.h             |  13 +-
 include/block/block.h           |   3 +-
 include/block/block_int.h       |  22 +++-
 include/block/blockjob.h        |   7 +
 include/block/throttle-groups.h |   1 +
 include/qemu/rfifolock.h        |  54 --------
 include/qemu/thread-posix.h     |   6 +
 include/qemu/thread-win32.h     |  10 ++
 include/qemu/thread.h           |   3 +
 iothread.c                      |  20 +--
 stubs/Makefile.objs             |   1 +
 stubs/iothread.c                |   8 ++
 tests/.gitignore                |   1 -
 tests/Makefile                  |   2 -
 tests/qemu-iotests/060          |   8 +-
 tests/qemu-iotests/060.out      |   4 +-
 tests/test-aio.c                |  22 ++--
 tests/test-rfifolock.c          |  91 -------------
 util/Makefile.objs              |   1 -
 util/qemu-thread-posix.c        |  13 ++
 util/qemu-thread-win32.c        |  25 ++++
 util/rfifolock.c                |  78 -----------
 37 files changed, 471 insertions(+), 474 deletions(-)
 delete mode 100644 include/qemu/rfifolock.h
 create mode 100644 stubs/iothread.c
 delete mode 100644 tests/test-rfifolock.c
 delete mode 100644 util/rfifolock.c

-- 
2.5.0

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 01/16] block: make bdrv_start_throttled_reqs return void
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 02/16] block: move restarting of throttled reqs to block/throttle-groups.c Paolo Bonzini
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

The return value is unused and I am not sure why it would be useful.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/io.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index a69bfc4..e58cfe2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -75,10 +75,8 @@ void bdrv_set_io_limits(BlockDriverState *bs,
     }
 }
 
-/* this function drain all the throttled IOs */
-static bool bdrv_start_throttled_reqs(BlockDriverState *bs)
+static void bdrv_start_throttled_reqs(BlockDriverState *bs)
 {
-    bool drained = false;
     bool enabled = bs->io_limits_enabled;
     int i;
 
@@ -86,13 +84,11 @@ static bool bdrv_start_throttled_reqs(BlockDriverState *bs)
 
     for (i = 0; i < 2; i++) {
         while (qemu_co_enter_next(&bs->throttled_reqs[i])) {
-            drained = true;
+            ;
         }
     }
 
     bs->io_limits_enabled = enabled;
-
-    return drained;
 }
 
 void bdrv_io_limits_disable(BlockDriverState *bs)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 02/16] block: move restarting of throttled reqs to block/throttle-groups.c
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 01/16] block: make bdrv_start_throttled_reqs return void Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-09  1:26   ` Fam Zheng
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 03/16] block: introduce bdrv_no_throttling_begin/end Paolo Bonzini
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

We want to remove throttled_reqs from block/io.c.  This is the easy
part---hide the handling of throttled_reqs during disable/enable of
throttling within throttle-groups.c.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/io.c                      | 15 +--------------
 block/throttle-groups.c         | 15 +++++++++++++++
 include/block/throttle-groups.h |  1 +
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/block/io.c b/block/io.c
index e58cfe2..5ee5032 100644
--- a/block/io.c
+++ b/block/io.c
@@ -66,28 +66,15 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 void bdrv_set_io_limits(BlockDriverState *bs,
                         ThrottleConfig *cfg)
 {
-    int i;
-
     throttle_group_config(bs, cfg);
-
-    for (i = 0; i < 2; i++) {
-        qemu_co_enter_next(&bs->throttled_reqs[i]);
-    }
 }
 
 static void bdrv_start_throttled_reqs(BlockDriverState *bs)
 {
     bool enabled = bs->io_limits_enabled;
-    int i;
 
     bs->io_limits_enabled = false;
-
-    for (i = 0; i < 2; i++) {
-        while (qemu_co_enter_next(&bs->throttled_reqs[i])) {
-            ;
-        }
-    }
-
+    throttle_group_restart_bs(bs);
     bs->io_limits_enabled = enabled;
 }
 
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index 4920e09..eccfc0d 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -313,6 +313,17 @@ void coroutine_fn throttle_group_co_io_limits_intercept(BlockDriverState *bs,
     qemu_mutex_unlock(&tg->lock);
 }
 
+void throttle_group_restart_bs(BlockDriverState *bs)
+{
+    int i;
+
+    for (i = 0; i < 2; i++) {
+        while (qemu_co_enter_next(&bs->throttled_reqs[i])) {
+            ;
+        }
+    }
+}
+
 /* Update the throttle configuration for a particular group. Similar
  * to throttle_config(), but guarantees atomicity within the
  * throttling group.
@@ -335,6 +346,10 @@ void throttle_group_config(BlockDriverState *bs, ThrottleConfig *cfg)
     }
     throttle_config(ts, tt, cfg);
     qemu_mutex_unlock(&tg->lock);
+
+    aio_context_acquire(bdrv_get_aio_context(bs));
+    throttle_group_restart_bs(bs);
+    aio_context_release(bdrv_get_aio_context(bs));
 }
 
 /* Get the throttle configuration from a particular group. Similar to
diff --git a/include/block/throttle-groups.h b/include/block/throttle-groups.h
index aba28f3..395f72d 100644
--- a/include/block/throttle-groups.h
+++ b/include/block/throttle-groups.h
@@ -38,6 +38,7 @@ void throttle_group_get_config(BlockDriverState *bs, ThrottleConfig *cfg);
 
 void throttle_group_register_bs(BlockDriverState *bs, const char *groupname);
 void throttle_group_unregister_bs(BlockDriverState *bs);
+void throttle_group_restart_bs(BlockDriverState *bs);
 
 void coroutine_fn throttle_group_co_io_limits_intercept(BlockDriverState *bs,
                                                         unsigned int bytes,
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 03/16] block: introduce bdrv_no_throttling_begin/end
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 01/16] block: make bdrv_start_throttled_reqs return void Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 02/16] block: move restarting of throttled reqs to block/throttle-groups.c Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-09  1:45   ` Fam Zheng
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 04/16] block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end Paolo Bonzini
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

Extract the handling of throttling from bdrv_flush_io_queue.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                   |  1 -
 block/io.c                | 56 +++++++++++++++++++++++++++++------------------
 block/throttle-groups.c   |  4 ++++
 include/block/block_int.h |  6 ++---
 4 files changed, 42 insertions(+), 25 deletions(-)

diff --git a/block.c b/block.c
index efc3c43..b4f328f 100644
--- a/block.c
+++ b/block.c
@@ -2314,7 +2314,6 @@ static void swap_feature_fields(BlockDriverState *bs_top,
 
     assert(!bs_new->throttle_state);
     if (bs_top->throttle_state) {
-        assert(bs_top->io_limits_enabled);
         bdrv_io_limits_enable(bs_new, throttle_group_get_name(bs_top));
         bdrv_io_limits_disable(bs_top);
     }
diff --git a/block/io.c b/block/io.c
index 5ee5032..272caac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -69,28 +69,43 @@ void bdrv_set_io_limits(BlockDriverState *bs,
     throttle_group_config(bs, cfg);
 }
 
-static void bdrv_start_throttled_reqs(BlockDriverState *bs)
+static void bdrv_no_throttling_begin(BlockDriverState *bs)
 {
-    bool enabled = bs->io_limits_enabled;
+    BdrvChild *child;
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_no_throttling_begin(child->bs);
+    }
 
-    bs->io_limits_enabled = false;
-    throttle_group_restart_bs(bs);
-    bs->io_limits_enabled = enabled;
+    if (bs->io_limits_disabled++ == 0) {
+        throttle_group_restart_bs(bs);
+    }
+}
+
+static void bdrv_no_throttling_end(BlockDriverState *bs)
+{
+    BdrvChild *child;
+
+    --bs->io_limits_disabled;
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_no_throttling_end(child->bs);
+    }
 }
 
 void bdrv_io_limits_disable(BlockDriverState *bs)
 {
-    bs->io_limits_enabled = false;
-    bdrv_start_throttled_reqs(bs);
+    assert(bs->throttle_state);
+    bdrv_no_throttling_begin(bs);
     throttle_group_unregister_bs(bs);
+    bdrv_no_throttling_end(bs);
 }
 
 /* should be called before bdrv_set_io_limits if a limit is set */
 void bdrv_io_limits_enable(BlockDriverState *bs, const char *group)
 {
-    assert(!bs->io_limits_enabled);
+    assert(!bs->throttle_state);
     throttle_group_register_bs(bs, group);
-    bs->io_limits_enabled = true;
 }
 
 void bdrv_io_limits_update_group(BlockDriverState *bs, const char *group)
@@ -255,6 +270,7 @@ void bdrv_drain(BlockDriverState *bs)
 {
     bool busy = true;
 
+    bdrv_no_throttling_begin(bs);
     bdrv_drain_recurse(bs);
     while (busy) {
         /* Keep iterating */
@@ -262,6 +278,7 @@ void bdrv_drain(BlockDriverState *bs)
          busy = bdrv_requests_pending(bs);
          busy |= aio_poll(bdrv_get_aio_context(bs), busy);
     }
+    bdrv_no_throttling_end(bs);
 }
 
 /*
@@ -284,6 +301,7 @@ void bdrv_drain_all(void)
         if (bs->job) {
             block_job_pause(bs->job);
         }
+        bdrv_no_throttling_begin(bs);
         bdrv_drain_recurse(bs);
         aio_context_release(aio_context);
 
@@ -325,6 +343,7 @@ void bdrv_drain_all(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
+        bdrv_no_throttling_end(bs);
         if (bs->job) {
             block_job_resume(bs->job);
         }
@@ -555,11 +574,7 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
      * will not fire; so the I/O throttling function has to be disabled here
      * if it has been enabled.
      */
-    if (bs->io_limits_enabled) {
-        fprintf(stderr, "Disabling I/O throttling on '%s' due "
-                        "to synchronous I/O.\n", bdrv_get_device_name(bs));
-        bdrv_io_limits_disable(bs);
-    }
+    bdrv_no_throttling_begin(bs);
 
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
@@ -573,6 +588,8 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
             aio_poll(aio_context, true);
         }
     }
+
+    bdrv_no_throttling_end(bs);
     return rwco.ret;
 }
 
@@ -608,13 +625,11 @@ int bdrv_read(BlockDriverState *bs, int64_t sector_num,
 int bdrv_read_unthrottled(BlockDriverState *bs, int64_t sector_num,
                           uint8_t *buf, int nb_sectors)
 {
-    bool enabled;
     int ret;
 
-    enabled = bs->io_limits_enabled;
-    bs->io_limits_enabled = false;
+    bdrv_no_throttling_begin(bs);
     ret = bdrv_read(bs, sector_num, buf, nb_sectors);
-    bs->io_limits_enabled = enabled;
+    bdrv_no_throttling_end(bs);
     return ret;
 }
 
@@ -952,7 +967,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
     }
 
     /* throttling disk I/O */
-    if (bs->io_limits_enabled) {
+    if (bs->throttle_state) {
         throttle_group_co_io_limits_intercept(bs, bytes, false);
     }
 
@@ -1294,7 +1309,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
     }
 
     /* throttling disk I/O */
-    if (bs->io_limits_enabled) {
+    if (bs->throttle_state) {
         throttle_group_co_io_limits_intercept(bs, bytes, true);
     }
 
@@ -2749,7 +2764,6 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
     } else if (bs->file) {
         bdrv_flush_io_queue(bs->file->bs);
     }
-    bdrv_start_throttled_reqs(bs);
 }
 
 void bdrv_drained_begin(BlockDriverState *bs)
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index eccfc0d..8fe0a4f 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -219,6 +219,10 @@ static bool throttle_group_schedule_timer(BlockDriverState *bs,
     ThrottleGroup *tg = container_of(ts, ThrottleGroup, ts);
     bool must_wait;
 
+    if (bs->io_limits_disabled) {
+        return false;
+    }
+
     /* Check if any of the timers in this group is already armed */
     if (tg->any_timer_armed[is_write]) {
         return true;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9ef823a..ed2e034 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -412,10 +412,10 @@ struct BlockDriverState {
 
     /* I/O throttling.
      * throttle_state tells us if this BDS has I/O limits configured.
-     * io_limits_enabled tells us if they are currently being
-     * enforced, but it can be temporarily set to false */
+     * io_limits_disabled tells us if they are currently being enforced */
     CoQueue      throttled_reqs[2];
-    bool         io_limits_enabled;
+    unsigned int io_limits_disabled;
+
     /* The following fields are protected by the ThrottleGroup lock.
      * See the ThrottleGroup documentation for details. */
     ThrottleState *throttle_state;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 04/16] block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (2 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 03/16] block: introduce bdrv_no_throttling_begin/end Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 05/16] mirror: use bottom half to re-enter coroutine Paolo Bonzini
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

Extract the handling of io_plug "depth" from linux-aio.c and let the
main bdrv_drain loop do nothing but wait on I/O.

Like the two newly introduced functions, bdrv_io_plug and bdrv_io_unplug
now operate on all children.  The visit order is now symmetrical between
plug and unplug, making it possible for formats to implement plug/unplug.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/io.c                | 72 +++++++++++++++++++++++++++++++++++------------
 block/linux-aio.c         | 13 ++++-----
 block/raw-aio.h           |  2 +-
 block/raw-posix.c         | 16 +----------
 include/block/block.h     |  3 +-
 include/block/block_int.h |  5 +++-
 6 files changed, 67 insertions(+), 44 deletions(-)

diff --git a/block/io.c b/block/io.c
index 272caac..d8d50b7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -271,13 +271,14 @@ void bdrv_drain(BlockDriverState *bs)
     bool busy = true;
 
     bdrv_no_throttling_begin(bs);
+    bdrv_io_unplugged_begin(bs);
     bdrv_drain_recurse(bs);
     while (busy) {
         /* Keep iterating */
-         bdrv_flush_io_queue(bs);
          busy = bdrv_requests_pending(bs);
          busy |= aio_poll(bdrv_get_aio_context(bs), busy);
     }
+    bdrv_io_unplugged_end(bs);
     bdrv_no_throttling_end(bs);
 }
 
@@ -302,6 +303,7 @@ void bdrv_drain_all(void)
             block_job_pause(bs->job);
         }
         bdrv_no_throttling_begin(bs);
+        bdrv_io_unplugged_begin(bs);
         bdrv_drain_recurse(bs);
         aio_context_release(aio_context);
 
@@ -326,7 +328,6 @@ void bdrv_drain_all(void)
             aio_context_acquire(aio_context);
             while ((bs = bdrv_next(bs))) {
                 if (aio_context == bdrv_get_aio_context(bs)) {
-                    bdrv_flush_io_queue(bs);
                     if (bdrv_requests_pending(bs)) {
                         busy = true;
                         aio_poll(aio_context, busy);
@@ -343,6 +344,7 @@ void bdrv_drain_all(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
+        bdrv_io_unplugged_end(bs);
         bdrv_no_throttling_end(bs);
         if (bs->job) {
             block_job_resume(bs->job);
@@ -2738,31 +2740,65 @@ void bdrv_add_before_write_notifier(BlockDriverState *bs,
 
 void bdrv_io_plug(BlockDriverState *bs)
 {
-    BlockDriver *drv = bs->drv;
-    if (drv && drv->bdrv_io_plug) {
-        drv->bdrv_io_plug(bs);
-    } else if (bs->file) {
-        bdrv_io_plug(bs->file->bs);
+    BdrvChild *child;
+
+    if (bs->io_plugged++ == 0 && bs->io_plug_disabled == 0) {
+        BlockDriver *drv = bs->drv;
+        if (drv && drv->bdrv_io_plug) {
+            drv->bdrv_io_plug(bs);
+        }
+    }
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_io_plug(child->bs);
     }
 }
 
 void bdrv_io_unplug(BlockDriverState *bs)
 {
-    BlockDriver *drv = bs->drv;
-    if (drv && drv->bdrv_io_unplug) {
-        drv->bdrv_io_unplug(bs);
-    } else if (bs->file) {
-        bdrv_io_unplug(bs->file->bs);
+    BdrvChild *child;
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_io_unplug(child->bs);
+    }
+
+    if (--bs->io_plugged == 0 && bs->io_plug_disabled == 0) {
+        BlockDriver *drv = bs->drv;
+        if (drv && drv->bdrv_io_unplug) {
+            drv->bdrv_io_unplug(bs);
+        }
     }
 }
 
-void bdrv_flush_io_queue(BlockDriverState *bs)
+void bdrv_io_unplugged_begin(BlockDriverState *bs)
 {
-    BlockDriver *drv = bs->drv;
-    if (drv && drv->bdrv_flush_io_queue) {
-        drv->bdrv_flush_io_queue(bs);
-    } else if (bs->file) {
-        bdrv_flush_io_queue(bs->file->bs);
+    BdrvChild *child;
+
+    if (bs->io_plug_disabled++ == 0 && bs->io_plugged > 0) {
+        BlockDriver *drv = bs->drv;
+        if (drv && drv->bdrv_io_unplug) {
+            drv->bdrv_io_unplug(bs);
+        }
+    }
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_io_unplugged_begin(child->bs);
+    }
+}
+
+void bdrv_io_unplugged_end(BlockDriverState *bs)
+{
+    BdrvChild *child;
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        bdrv_io_unplugged_end(child->bs);
+    }
+
+    if (--bs->io_plug_disabled == 0 && bs->io_plugged > 0) {
+        BlockDriver *drv = bs->drv;
+        if (drv && drv->bdrv_io_plug) {
+            drv->bdrv_io_plug(bs);
+        }
     }
 }
 
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 805757e..102bf92 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -220,19 +220,16 @@ void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
 {
     struct qemu_laio_state *s = aio_ctx;
 
-    s->io_q.plugged++;
+    assert(!s->io_q.plugged);
+    s->io_q.plugged = 1;
 }
 
-void laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
+void laio_io_unplug(BlockDriverState *bs, void *aio_ctx)
 {
     struct qemu_laio_state *s = aio_ctx;
 
-    assert(s->io_q.plugged > 0 || !unplug);
-
-    if (unplug && --s->io_q.plugged > 0) {
-        return;
-    }
-
+    assert(s->io_q.plugged);
+    s->io_q.plugged = 0;
     if (!s->io_q.blocked && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
diff --git a/block/raw-aio.h b/block/raw-aio.h
index 31d791f..7600060 100644
--- a/block/raw-aio.h
+++ b/block/raw-aio.h
@@ -41,7 +41,7 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
 void laio_detach_aio_context(void *s, AioContext *old_context);
 void laio_attach_aio_context(void *s, AioContext *new_context);
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx);
-void laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug);
+void laio_io_unplug(BlockDriverState *bs, void *aio_ctx);
 #endif
 
 #ifdef _WIN32
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 8866121..7ecbee3 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1343,17 +1343,7 @@ static void raw_aio_unplug(BlockDriverState *bs)
 #ifdef CONFIG_LINUX_AIO
     BDRVRawState *s = bs->opaque;
     if (s->use_aio) {
-        laio_io_unplug(bs, s->aio_ctx, true);
-    }
-#endif
-}
-
-static void raw_aio_flush_io_queue(BlockDriverState *bs)
-{
-#ifdef CONFIG_LINUX_AIO
-    BDRVRawState *s = bs->opaque;
-    if (s->use_aio) {
-        laio_io_unplug(bs, s->aio_ctx, false);
+        laio_io_unplug(bs, s->aio_ctx);
     }
 #endif
 }
@@ -1947,7 +1937,6 @@ BlockDriver bdrv_file = {
     .bdrv_refresh_limits = raw_refresh_limits,
     .bdrv_io_plug = raw_aio_plug,
     .bdrv_io_unplug = raw_aio_unplug,
-    .bdrv_flush_io_queue = raw_aio_flush_io_queue,
 
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
@@ -2310,7 +2299,6 @@ static BlockDriver bdrv_host_device = {
     .bdrv_refresh_limits = raw_refresh_limits,
     .bdrv_io_plug = raw_aio_plug,
     .bdrv_io_unplug = raw_aio_unplug,
-    .bdrv_flush_io_queue = raw_aio_flush_io_queue,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength	= raw_getlength,
@@ -2440,7 +2428,6 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_refresh_limits = raw_refresh_limits,
     .bdrv_io_plug = raw_aio_plug,
     .bdrv_io_unplug = raw_aio_unplug,
-    .bdrv_flush_io_queue = raw_aio_flush_io_queue,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
@@ -2576,7 +2563,6 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_refresh_limits = raw_refresh_limits,
     .bdrv_io_plug = raw_aio_plug,
     .bdrv_io_unplug = raw_aio_unplug,
-    .bdrv_flush_io_queue = raw_aio_flush_io_queue,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
diff --git a/include/block/block.h b/include/block/block.h
index 1c4f4d8..330c5ba 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -561,7 +561,8 @@ int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
 
 void bdrv_io_plug(BlockDriverState *bs);
 void bdrv_io_unplug(BlockDriverState *bs);
-void bdrv_flush_io_queue(BlockDriverState *bs);
+void bdrv_io_unplugged_begin(BlockDriverState *bs);
+void bdrv_io_unplugged_end(BlockDriverState *bs);
 
 /**
  * bdrv_drained_begin:
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ed2e034..2838814 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -282,7 +282,6 @@ struct BlockDriver {
     /* io queue for linux-aio */
     void (*bdrv_io_plug)(BlockDriverState *bs);
     void (*bdrv_io_unplug)(BlockDriverState *bs);
-    void (*bdrv_flush_io_queue)(BlockDriverState *bs);
 
     /**
      * Try to get @bs's logical and physical block size.
@@ -477,6 +476,10 @@ struct BlockDriverState {
     uint64_t write_threshold_offset;
     NotifierWithReturn write_threshold_notifier;
 
+    /* counters for nested bdrv_io_plug and bdrv_io_unplugged_begin */
+    unsigned io_plugged;
+    unsigned io_plug_disabled;
+
     int quiesce_counter;
 };
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 05/16] mirror: use bottom half to re-enter coroutine
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (3 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 04/16] block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-09  3:19   ` Fam Zheng
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests Paolo Bonzini
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

mirror is calling bdrv_drain from an AIO callback---more precisely,
the bdrv_drain happens far away from the AIO callback, in the coroutine that
the AIO callback enters.

This used to be okay because bdrv_drain more or less tried to guess
when all AIO callbacks were done; however it will cause a deadlock once
bdrv_drain really checks for AIO callbacks to be complete.  The situation
here is admittedly underdefined, and Stefan pointed out that the same
solution is found in many other places in the QEMU block layer, therefore
I think this workaround is acceptable.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/mirror.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 2c0edfa..793c20c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -71,6 +71,7 @@ typedef struct MirrorOp {
     QEMUIOVector qiov;
     int64_t sector_num;
     int nb_sectors;
+    QEMUBH *co_enter_bh;
 } MirrorOp;
 
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
@@ -86,6 +87,18 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
     }
 }
 
+static void mirror_bh_cb(void *opaque)
+{
+    MirrorOp *op = opaque;
+    MirrorBlockJob *s = op->s;
+
+    qemu_bh_delete(op->co_enter_bh);
+    g_free(op);
+    if (s->waiting_for_io) {
+        qemu_coroutine_enter(s->common.co, NULL);
+    }
+}
+
 static void mirror_iteration_done(MirrorOp *op, int ret)
 {
     MirrorBlockJob *s = op->s;
@@ -116,11 +129,13 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     }
 
     qemu_iovec_destroy(&op->qiov);
-    g_free(op);
 
-    if (s->waiting_for_io) {
-        qemu_coroutine_enter(s->common.co, NULL);
-    }
+    /* The I/O operation is not finished until the callback returns.
+     * If we call qemu_coroutine_enter here, there is the possibility
+     * of a deadlock when the coroutine calls bdrv_drained_begin.
+     */
+    op->co_enter_bh = qemu_bh_new(mirror_bh_cb, op);
+    qemu_bh_schedule(op->co_enter_bh);
 }
 
 static void mirror_write_complete(void *opaque, int ret)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (4 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 05/16] mirror: use bottom half to re-enter coroutine Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-09  3:35   ` Fam Zheng
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time Paolo Bonzini
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

Unlike tracked_requests, this field also counts throttled requests,
and remains non-zero if an AIO operation needs a BH to be "really"
completed.

With this change, it is no longer necessary to have a dummy
BdrvTrackedRequest for requests that are never serialising, and
it is no longer necessary to poll the AioContext once after
bdrv_requests_pending(bs) returns false.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/io.c                | 71 +++++++++++++++++++++++++++++++----------------
 block/mirror.c            |  2 ++
 include/block/block_int.h |  8 ++++--
 3 files changed, 54 insertions(+), 27 deletions(-)

diff --git a/block/io.c b/block/io.c
index d8d50b7..a9a23a6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -224,13 +224,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 {
     BdrvChild *child;
 
-    if (!QLIST_EMPTY(&bs->tracked_requests)) {
-        return true;
-    }
-    if (!qemu_co_queue_empty(&bs->throttled_reqs[0])) {
-        return true;
-    }
-    if (!qemu_co_queue_empty(&bs->throttled_reqs[1])) {
+    if (atomic_read(&bs->in_flight)) {
         return true;
     }
 
@@ -268,15 +262,12 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
  */
 void bdrv_drain(BlockDriverState *bs)
 {
-    bool busy = true;
-
     bdrv_no_throttling_begin(bs);
     bdrv_io_unplugged_begin(bs);
     bdrv_drain_recurse(bs);
-    while (busy) {
+    while (bdrv_requests_pending(bs)) {
         /* Keep iterating */
-         busy = bdrv_requests_pending(bs);
-         busy |= aio_poll(bdrv_get_aio_context(bs), busy);
+         aio_poll(bdrv_get_aio_context(bs), true);
     }
     bdrv_io_unplugged_end(bs);
     bdrv_no_throttling_end(bs);
@@ -291,7 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
 void bdrv_drain_all(void)
 {
     /* Always run first iteration so any pending completion BHs run */
-    bool busy = true;
+    bool waited = true;
     BlockDriverState *bs = NULL;
     GSList *aio_ctxs = NULL, *ctx;
 
@@ -318,8 +309,8 @@ void bdrv_drain_all(void)
      * request completion.  Therefore we must keep looping until there was no
      * more activity rather than simply draining each device independently.
      */
-    while (busy) {
-        busy = false;
+    while (waited) {
+        waited = false;
 
         for (ctx = aio_ctxs; ctx != NULL; ctx = ctx->next) {
             AioContext *aio_context = ctx->data;
@@ -329,12 +320,11 @@ void bdrv_drain_all(void)
             while ((bs = bdrv_next(bs))) {
                 if (aio_context == bdrv_get_aio_context(bs)) {
                     if (bdrv_requests_pending(bs)) {
-                        busy = true;
-                        aio_poll(aio_context, busy);
+                        aio_poll(aio_context, true);
+                        waited = true;
                     }
                 }
             }
-            busy |= aio_poll(aio_context, false);
             aio_context_release(aio_context);
         }
     }
@@ -457,6 +447,16 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
     return true;
 }
 
+void bdrv_inc_in_flight(BlockDriverState *bs)
+{
+    atomic_inc(&bs->in_flight);
+}
+
+void bdrv_dec_in_flight(BlockDriverState *bs)
+{
+    atomic_dec(&bs->in_flight);
+}
+
 static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
@@ -963,6 +963,8 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
         return ret;
     }
 
+    bdrv_inc_in_flight(bs);
+
     /* Don't do copy-on-read if we read data before write operation */
     if (bs->copy_on_read && !(flags & BDRV_REQ_NO_SERIALISING)) {
         flags |= BDRV_REQ_COPY_ON_READ;
@@ -1003,6 +1005,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
                               use_local_qiov ? &local_qiov : qiov,
                               flags);
     tracked_request_end(&req);
+    bdrv_dec_in_flight(bs);
 
     if (use_local_qiov) {
         qemu_iovec_destroy(&local_qiov);
@@ -1310,6 +1313,8 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         return ret;
     }
 
+    bdrv_inc_in_flight(bs);
+
     /* throttling disk I/O */
     if (bs->throttle_state) {
         throttle_group_co_io_limits_intercept(bs, bytes, true);
@@ -1408,6 +1413,7 @@ fail:
     qemu_vfree(tail_buf);
 out:
     tracked_request_end(&req);
+    bdrv_dec_in_flight(bs);
     return ret;
 }
 
@@ -2065,6 +2071,7 @@ static const AIOCBInfo bdrv_em_aiocb_info = {
 static void bdrv_aio_bh_cb(void *opaque)
 {
     BlockAIOCBSync *acb = opaque;
+    BlockDriverState *bs = acb->common.bs;
 
     if (!acb->is_write && acb->ret >= 0) {
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
@@ -2072,6 +2079,7 @@ static void bdrv_aio_bh_cb(void *opaque)
     qemu_vfree(acb->bounce);
     acb->common.cb(acb->common.opaque, acb->ret);
     qemu_bh_delete(acb->bh);
+    bdrv_dec_in_flight(bs);
     acb->bh = NULL;
     qemu_aio_unref(acb);
 }
@@ -2087,6 +2095,7 @@ static BlockAIOCB *bdrv_aio_rw_vector(BlockDriverState *bs,
 {
     BlockAIOCBSync *acb;
 
+    bdrv_inc_in_flight(bs);
     acb = qemu_aio_get(&bdrv_em_aiocb_info, bs, cb, opaque);
     acb->is_write = is_write;
     acb->qiov = qiov;
@@ -2139,6 +2148,7 @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
 {
     if (!acb->need_bh) {
         acb->common.cb(acb->common.opaque, acb->req.error);
+        bdrv_dec_in_flight(acb->common.bs);
         qemu_aio_unref(acb);
     }
 }
@@ -2192,6 +2202,9 @@ static BlockAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
     Coroutine *co;
     BlockAIOCBCoroutine *acb;
 
+    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
+    bdrv_inc_in_flight(bs);
+
     acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
     acb->need_bh = true;
     acb->req.error = -EINPROGRESS;
@@ -2225,6 +2238,9 @@ BlockAIOCB *bdrv_aio_flush(BlockDriverState *bs,
     Coroutine *co;
     BlockAIOCBCoroutine *acb;
 
+    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
+    bdrv_inc_in_flight(bs);
+
     acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
     acb->need_bh = true;
     acb->req.error = -EINPROGRESS;
@@ -2254,6 +2270,9 @@ BlockAIOCB *bdrv_aio_discard(BlockDriverState *bs,
 
     trace_bdrv_aio_discard(bs, sector_num, nb_sectors, opaque);
 
+    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
+    bdrv_inc_in_flight(bs);
+
     acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
     acb->need_bh = true;
     acb->req.error = -EINPROGRESS;
@@ -2361,14 +2380,14 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 {
     int ret;
-    BdrvTrackedRequest req;
 
     if (!bs || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs) ||
         bdrv_is_sg(bs)) {
         return 0;
     }
 
-    tracked_request_begin(&req, bs, 0, 0, BDRV_TRACKED_FLUSH);
+    bdrv_inc_in_flight(bs);
+
     /* Write back cached data to the OS even with cache=unsafe */
     BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
     if (bs->drv->bdrv_co_flush_to_os) {
@@ -2423,7 +2442,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 flush_parent:
     ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
 out:
-    tracked_request_end(&req);
+    bdrv_dec_in_flight(bs);
     return ret;
 }
 
@@ -2491,6 +2510,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
         return 0;
     }
 
+    bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, sector_num, nb_sectors,
                           BDRV_TRACKED_DISCARD);
     bdrv_set_dirty(bs, sector_num, nb_sectors);
@@ -2543,6 +2563,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
     ret = 0;
 out:
     tracked_request_end(&req);
+    bdrv_dec_in_flight(bs);
     return ret;
 }
 
@@ -2588,13 +2609,12 @@ static void bdrv_ioctl_bh_cb(void *opaque)
 static int bdrv_co_do_ioctl(BlockDriverState *bs, int req, void *buf)
 {
     BlockDriver *drv = bs->drv;
-    BdrvTrackedRequest tracked_req;
     CoroutineIOCompletion co = {
         .coroutine = qemu_coroutine_self(),
     };
     BlockAIOCB *acb;
 
-    tracked_request_begin(&tracked_req, bs, 0, 0, BDRV_TRACKED_IOCTL);
+    bdrv_inc_in_flight(bs);
     if (!drv || !drv->bdrv_aio_ioctl) {
         co.ret = -ENOTSUP;
         goto out;
@@ -2610,7 +2630,7 @@ static int bdrv_co_do_ioctl(BlockDriverState *bs, int req, void *buf)
     }
     qemu_coroutine_yield();
 out:
-    tracked_request_end(&tracked_req);
+    bdrv_dec_in_flight(bs);
     return co.ret;
 }
 
@@ -2667,6 +2687,9 @@ BlockAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
                                             bs, cb, opaque);
     Coroutine *co;
 
+    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
+    bdrv_inc_in_flight(bs);
+
     acb->need_bh = true;
     acb->req.error = -EINPROGRESS;
     acb->req.req = req;
diff --git a/block/mirror.c b/block/mirror.c
index 793c20c..3f163b8 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -94,6 +94,7 @@ static void mirror_bh_cb(void *opaque)
 
     qemu_bh_delete(op->co_enter_bh);
     g_free(op);
+    bdrv_dec_in_flight(s->common.bs);
     if (s->waiting_for_io) {
         qemu_coroutine_enter(s->common.co, NULL);
     }
@@ -134,6 +135,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
      * If we call qemu_coroutine_enter here, there is the possibility
      * of a deadlock when the coroutine calls bdrv_drained_begin.
      */
+    bdrv_inc_in_flight(s->common.bs);
     op->co_enter_bh = qemu_bh_new(mirror_bh_cb, op);
     qemu_bh_schedule(op->co_enter_bh);
 }
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 2838814..89c38c0 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -63,8 +63,6 @@
 enum BdrvTrackedRequestType {
     BDRV_TRACKED_READ,
     BDRV_TRACKED_WRITE,
-    BDRV_TRACKED_FLUSH,
-    BDRV_TRACKED_IOCTL,
     BDRV_TRACKED_DISCARD,
 };
 
@@ -406,7 +404,8 @@ struct BlockDriverState {
     /* Callback before write request is processed */
     NotifierWithReturnList before_write_notifiers;
 
-    /* number of in-flight serialising requests */
+    /* number of in-flight requests; overall and serialising */
+    unsigned int in_flight;
     unsigned int serialising_in_flight;
 
     /* I/O throttling.
@@ -713,6 +712,9 @@ bool bdrv_requests_pending(BlockDriverState *bs);
 void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
 void bdrv_undo_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *in);
 
+void bdrv_inc_in_flight(BlockDriverState *bs);
+void bdrv_dec_in_flight(BlockDriverState *bs);
+
 void blockdev_close_all_bdrv_states(void);
 
 #endif /* BLOCK_INT_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (5 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-09  3:41   ` Fam Zheng
  2016-03-16 16:39   ` Stefan Hajnoczi
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 08/16] blockjob: introduce .drain callback for jobs Paolo Bonzini
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

bdrv_requests_pending is checking children to also wait until internal
requests (such as metadata writes) have completed.  However, checking
children is in general overkill because, apart from this special case,
the parent's in_flight count will always be incremented by at least one
for every request in the child.

Since internal requests are only generated by the parent in the child,
instead visit the tree parent first, and then wait for internal I/O in
the children to complete.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/io.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/block/io.c b/block/io.c
index a9a23a6..e0c9215 100644
--- a/block/io.c
+++ b/block/io.c
@@ -249,6 +249,23 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
     }
 }
 
+static bool bdrv_drain_io_recurse(BlockDriverState *bs)
+{
+    BdrvChild *child;
+    bool waited = false;
+
+    while (atomic_read(&bs->in_flight) > 0) {
+        aio_poll(bdrv_get_aio_context(bs), true);
+        waited = true;
+    }
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        waited |= bdrv_drain_io_recurse(child->bs);
+    }
+
+    return waited;
+}
+
 /*
  * Wait for pending requests to complete on a single BlockDriverState subtree,
  * and suspend block driver's internal I/O until next request arrives.
@@ -265,10 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
     bdrv_no_throttling_begin(bs);
     bdrv_io_unplugged_begin(bs);
     bdrv_drain_recurse(bs);
-    while (bdrv_requests_pending(bs)) {
-        /* Keep iterating */
-         aio_poll(bdrv_get_aio_context(bs), true);
-    }
+    bdrv_drain_io_recurse(bs);
     bdrv_io_unplugged_end(bs);
     bdrv_no_throttling_end(bs);
 }
@@ -319,10 +333,7 @@ void bdrv_drain_all(void)
             aio_context_acquire(aio_context);
             while ((bs = bdrv_next(bs))) {
                 if (aio_context == bdrv_get_aio_context(bs)) {
-                    if (bdrv_requests_pending(bs)) {
-                        aio_poll(aio_context, true);
-                        waited = true;
-                    }
+                    waited |= bdrv_drain_io_recurse(bs);
                 }
             }
             aio_context_release(aio_context);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 08/16] blockjob: introduce .drain callback for jobs
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (6 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-16 17:56   ` Stefan Hajnoczi
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests Paolo Bonzini
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

This is required to decouple block jobs from running in an
AioContext.  With multiqueue block devices, a BlockDriverState
does not really belong to a single AioContext.

The solution is to first wait until all I/O operations are
complete; then loop in the main thread for the block job to
complete entirely.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/backup.c           |  7 +++++++
 block/mirror.c           | 12 ++++++++++--
 blockjob.c               | 16 +++++++++++++---
 include/block/blockjob.h |  7 +++++++
 4 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 00cafdb..2edb895 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -251,6 +251,12 @@ static void backup_abort(BlockJob *job)
     }
 }
 
+static void backup_drain(BlockJob *job)
+{
+    BackupBlockJob *s = container_of(job, BackupBlockJob, common);
+    bdrv_drain(s->target);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
@@ -258,6 +264,7 @@ static const BlockJobDriver backup_job_driver = {
     .iostatus_reset = backup_iostatus_reset,
     .commit         = backup_commit,
     .abort          = backup_abort,
+    .drain          = backup_drain,
 };
 
 static BlockErrorAction backup_error_action(BackupBlockJob *job,
diff --git a/block/mirror.c b/block/mirror.c
index 3f163b8..b473a1b 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -358,7 +358,7 @@ static void mirror_free_init(MirrorBlockJob *s)
     }
 }
 
-static void mirror_drain(MirrorBlockJob *s)
+static void mirror_wait_for_completion(MirrorBlockJob *s)
 {
     while (s->in_flight > 0) {
         s->waiting_for_io = true;
@@ -627,7 +627,7 @@ immediate_exit:
          * the target is a copy of the source.
          */
         assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
-        mirror_drain(s);
+        mirror_wait_for_completion(s);
     }
 
     assert(s->in_flight == 0);
@@ -708,12 +708,19 @@ static void mirror_complete(BlockJob *job, Error **errp)
     block_job_enter(&s->common);
 }
 
+static void mirror_drain(BlockJob *job)
+{
+    MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
+    bdrv_drain(s->target);
+}
+
 static const BlockJobDriver mirror_job_driver = {
     .instance_size = sizeof(MirrorBlockJob),
     .job_type      = BLOCK_JOB_TYPE_MIRROR,
     .set_speed     = mirror_set_speed,
     .iostatus_reset= mirror_iostatus_reset,
     .complete      = mirror_complete,
+    .drain         = mirror_drain,
 };
 
 static const BlockJobDriver commit_active_job_driver = {
@@ -723,6 +730,7 @@ static const BlockJobDriver commit_active_job_driver = {
     .iostatus_reset
                    = mirror_iostatus_reset,
     .complete      = mirror_complete,
+    .drain         = mirror_drain,
 };
 
 static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
diff --git a/blockjob.c b/blockjob.c
index 9fc37ca..5bb6f9b 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -289,16 +289,26 @@ static int block_job_finish_sync(BlockJob *job,
     assert(bs->job == job);
 
     block_job_ref(job);
+
+    /* finish will call block_job_enter (see e.g. block_job_cancel,
+     * or mirror_complete in block/mirror.c).  Barring bugs in the
+     * job coroutine, bdrv_drain should be enough to induce progress
+     * until the job completes or moves to the main thread.
+    */
     finish(job, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         block_job_unref(job);
         return -EBUSY;
     }
+    while (!job->deferred_to_main_loop && !job->completed) {
+        bdrv_drain(bs);
+        if (job->driver->drain) {
+            job->driver->drain(job);
+        }
+    }
     while (!job->completed) {
-        aio_poll(job->deferred_to_main_loop ? qemu_get_aio_context() :
-                                              bdrv_get_aio_context(bs),
-                 true);
+        aio_poll(qemu_get_aio_context(), true);
     }
     ret = (job->cancelled && job->ret == 0) ? -ECANCELED : job->ret;
     block_job_unref(job);
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 8bedc49..8e564df 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,13 @@ typedef struct BlockJobDriver {
      * never both.
      */
     void (*abort)(BlockJob *job);
+
+    /**
+     * If the callback is not NULL, it will be invoked when the job has to be
+     * synchronously cancelled or completed; it should drain BlockDriverStates
+     * as required to ensure progress.
+     */
+    void (*drain)(BlockJob *job);
 } BlockJobDriver;
 
 /**
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (7 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 08/16] blockjob: introduce .drain callback for jobs Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-09  8:13   ` Fam Zheng
  2016-03-16 18:04   ` Stefan Hajnoczi
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 10/16] nfs: replace aio_poll with bdrv_drain Paolo Bonzini
                   ` (9 subsequent siblings)
  18 siblings, 2 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

Synchronous I/O should in general happen either in the main thread (e.g.
for bdrv_open and bdrv_create) or between bdrv_drained_begin and
bdrv_drained_end.  Therefore, the simplest way to wait for it to finish
is to wait for _all_ pending I/O to complete.

In fact, there was one case in bdrv_close where we explicitly drained
after bdrv_flush; this is now unnecessary.  And we should probably have
called bdrv_drain_all after calls to bdrv_flush_all, which is now
unnecessary too.

On the other hand, there was a case where a test was relying on doing
a synchronous read after a breakpoint had suspended an asynchronous read.
This is easily fixed.

This decouples synchronous I/O from aio_poll.  When the request used
not to be tracked as part of bdrv_drain (e.g. bdrv_co_get_block_status)
we need to update the in_flight count.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                    |  1 -
 block/io.c                 | 39 ++++++++++++---------------------------
 block/qed-table.c          | 16 ++++------------
 block/qed.c                |  4 +++-
 tests/qemu-iotests/060     |  8 ++++++--
 tests/qemu-iotests/060.out |  4 +++-
 6 files changed, 28 insertions(+), 44 deletions(-)

diff --git a/block.c b/block.c
index b4f328f..fb02d7f 100644
--- a/block.c
+++ b/block.c
@@ -2152,7 +2152,6 @@ static void bdrv_close(BlockDriverState *bs)
 
     bdrv_drained_begin(bs); /* complete I/O */
     bdrv_flush(bs);
-    bdrv_drain(bs); /* in case flush left pending I/O */
 
     bdrv_release_named_dirty_bitmaps(bs);
     assert(QLIST_EMPTY(&bs->dirty_bitmaps));
diff --git a/block/io.c b/block/io.c
index e0c9215..04b52c8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -593,13 +593,9 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
         /* Fast-path if already in coroutine context */
         bdrv_rw_co_entry(&rwco);
     } else {
-        AioContext *aio_context = bdrv_get_aio_context(bs);
-
         co = qemu_coroutine_create(bdrv_rw_co_entry);
         qemu_coroutine_enter(co, &rwco);
-        while (rwco.ret == NOT_DONE) {
-            aio_poll(aio_context, true);
-        }
+        bdrv_drain(bs);
     }
 
     bdrv_no_throttling_end(bs);
@@ -1545,17 +1541,19 @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
     }
 
     *file = NULL;
+    bdrv_inc_in_flight(bs);
     ret = bs->drv->bdrv_co_get_block_status(bs, sector_num, nb_sectors, pnum,
                                             file);
     if (ret < 0) {
         *pnum = 0;
-        return ret;
+        goto out;
     }
 
     if (ret & BDRV_BLOCK_RAW) {
         assert(ret & BDRV_BLOCK_OFFSET_VALID);
-        return bdrv_get_block_status(bs->file->bs, ret >> BDRV_SECTOR_BITS,
-                                     *pnum, pnum, file);
+        ret = bdrv_get_block_status(bs->file->bs, ret >> BDRV_SECTOR_BITS,
+                                    *pnum, pnum, file);
+        goto out;
     }
 
     if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
@@ -1597,6 +1595,8 @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
         }
     }
 
+out:
+    bdrv_dec_in_flight(bs);
     return ret;
 }
 
@@ -1662,13 +1662,9 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
         /* Fast-path if already in coroutine context */
         bdrv_get_block_status_above_co_entry(&data);
     } else {
-        AioContext *aio_context = bdrv_get_aio_context(bs);
-
         co = qemu_coroutine_create(bdrv_get_block_status_above_co_entry);
         qemu_coroutine_enter(co, &data);
-        while (!data.done) {
-            aio_poll(aio_context, true);
-        }
+        bdrv_drain(bs);
     }
     return data.ret;
 }
@@ -2469,13 +2465,9 @@ int bdrv_flush(BlockDriverState *bs)
         /* Fast-path if already in coroutine context */
         bdrv_flush_co_entry(&rwco);
     } else {
-        AioContext *aio_context = bdrv_get_aio_context(bs);
-
         co = qemu_coroutine_create(bdrv_flush_co_entry);
         qemu_coroutine_enter(co, &rwco);
-        while (rwco.ret == NOT_DONE) {
-            aio_poll(aio_context, true);
-        }
+        bdrv_drain(bs);
     }
 
     return rwco.ret;
@@ -2592,13 +2584,9 @@ int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
         /* Fast-path if already in coroutine context */
         bdrv_discard_co_entry(&rwco);
     } else {
-        AioContext *aio_context = bdrv_get_aio_context(bs);
-
         co = qemu_coroutine_create(bdrv_discard_co_entry);
         qemu_coroutine_enter(co, &rwco);
-        while (rwco.ret == NOT_DONE) {
-            aio_poll(aio_context, true);
-        }
+        bdrv_drain(bs);
     }
 
     return rwco.ret;
@@ -2673,11 +2661,8 @@ int bdrv_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
         bdrv_co_ioctl_entry(&data);
     } else {
         Coroutine *co = qemu_coroutine_create(bdrv_co_ioctl_entry);
-
         qemu_coroutine_enter(co, &data);
-        while (data.ret == -EINPROGRESS) {
-            aio_poll(bdrv_get_aio_context(bs), true);
-        }
+        bdrv_drain(bs);
     }
     return data.ret;
 }
diff --git a/block/qed-table.c b/block/qed-table.c
index 802945f..3a8566f 100644
--- a/block/qed-table.c
+++ b/block/qed-table.c
@@ -173,9 +173,7 @@ int qed_read_l1_table_sync(BDRVQEDState *s)
 
     qed_read_table(s, s->header.l1_table_offset,
                    s->l1_table, qed_sync_cb, &ret);
-    while (ret == -EINPROGRESS) {
-        aio_poll(bdrv_get_aio_context(s->bs), true);
-    }
+    bdrv_drain(s->bs);
 
     return ret;
 }
@@ -194,9 +192,7 @@ int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
     int ret = -EINPROGRESS;
 
     qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
-    while (ret == -EINPROGRESS) {
-        aio_poll(bdrv_get_aio_context(s->bs), true);
-    }
+    bdrv_drain(s->bs);
 
     return ret;
 }
@@ -267,9 +263,7 @@ int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset
     int ret = -EINPROGRESS;
 
     qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
-    while (ret == -EINPROGRESS) {
-        aio_poll(bdrv_get_aio_context(s->bs), true);
-    }
+    bdrv_drain(s->bs);
 
     return ret;
 }
@@ -289,9 +283,7 @@ int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
     int ret = -EINPROGRESS;
 
     qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
-    while (ret == -EINPROGRESS) {
-        aio_poll(bdrv_get_aio_context(s->bs), true);
-    }
+    bdrv_drain(s->bs);
 
     return ret;
 }
diff --git a/block/qed.c b/block/qed.c
index ebba220..e90792f 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -352,7 +352,9 @@ static void qed_start_need_check_timer(BDRVQEDState *s)
 static void qed_cancel_need_check_timer(BDRVQEDState *s)
 {
     trace_qed_cancel_need_check_timer(s);
-    timer_del(s->need_check_timer);
+    if (s->need_check_timer) {
+        timer_del(s->need_check_timer);
+    }
 }
 
 static void bdrv_qed_detach_aio_context(BlockDriverState *bs)
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index c81319c..dffe35a 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -164,8 +164,12 @@ echo "open -o file.driver=blkdebug $TEST_IMG
 break cow_read 0
 aio_write 0k 1k
 wait_break 0
-write 64k 64k
-resume 0" | $QEMU_IO | _filter_qemu_io
+break pwritev_done 1
+aio_write 64k 64k
+wait_break 1
+resume 1
+resume 0
+aio_flush" | $QEMU_IO | _filter_qemu_io
 
 echo
 echo "=== Testing unallocated image header ==="
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index 5d40206..a0a9a11 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -105,7 +105,9 @@ discard 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with active L2 table); further corruption events will be suppressed
 blkdebug: Suspended request '0'
-write failed: Input/output error
+blkdebug: Suspended request '1'
+blkdebug: Resuming request '1'
+aio_write failed: Input/output error
 blkdebug: Resuming request '0'
 aio_write failed: No medium found
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 10/16] nfs: replace aio_poll with bdrv_drain
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (8 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 11/16] sheepdog: disable dataplane Paolo Bonzini
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/nfs.c | 50 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index 5eb8c13..082ef3a 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -47,6 +47,7 @@ typedef struct NFSClient {
 } NFSClient;
 
 typedef struct NFSRPC {
+    BlockDriverState *bs;
     int ret;
     int complete;
     QEMUIOVector *iov;
@@ -86,11 +87,12 @@ static void nfs_process_write(void *arg)
     nfs_set_events(client);
 }
 
-static void nfs_co_init_task(NFSClient *client, NFSRPC *task)
+static void nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
 {
     *task = (NFSRPC) {
         .co             = qemu_coroutine_self(),
-        .client         = client,
+        .bs             = bs,
+        .client         = bs->opaque,
     };
 }
 
@@ -108,6 +110,7 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
 {
     NFSRPC *task = private_data;
     task->ret = ret;
+    assert(!task->st);
     if (task->ret > 0 && task->iov) {
         if (task->ret <= task->iov->size) {
             qemu_iovec_from_buf(task->iov, 0, data, task->ret);
@@ -115,19 +118,12 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
             task->ret = -EIO;
         }
     }
-    if (task->ret == 0 && task->st) {
-        memcpy(task->st, data, sizeof(struct stat));
-    }
     if (task->ret < 0) {
         error_report("NFS Error: %s", nfs_get_error(nfs));
     }
-    if (task->co) {
-        task->bh = aio_bh_new(task->client->aio_context,
-                              nfs_co_generic_bh_cb, task);
-        qemu_bh_schedule(task->bh);
-    } else {
-        task->complete = 1;
-    }
+    task->bh = aio_bh_new(task->client->aio_context,
+                          nfs_co_generic_bh_cb, task);
+    qemu_bh_schedule(task->bh);
 }
 
 static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
@@ -137,7 +133,7 @@ static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
     NFSClient *client = bs->opaque;
     NFSRPC task;
 
-    nfs_co_init_task(client, &task);
+    nfs_co_init_task(bs, &task);
     task.iov = iov;
 
     if (nfs_pread_async(client->context, client->fh,
@@ -172,7 +168,7 @@ static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
     NFSRPC task;
     char *buf = NULL;
 
-    nfs_co_init_task(client, &task);
+    nfs_co_init_task(bs, &task);
 
     buf = g_try_malloc(nb_sectors * BDRV_SECTOR_SIZE);
     if (nb_sectors && buf == NULL) {
@@ -208,7 +204,7 @@ static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
     NFSClient *client = bs->opaque;
     NFSRPC task;
 
-    nfs_co_init_task(client, &task);
+    nfs_co_init_task(bs, &task);
 
     if (nfs_fsync_async(client->context, client->fh, nfs_co_generic_cb,
                         &task) != 0) {
@@ -457,6 +453,21 @@ static int nfs_has_zero_init(BlockDriverState *bs)
     return client->has_zero_init;
 }
 
+static void
+nfs_get_allocated_file_size_cb(int ret, struct nfs_context *nfs, void *data,
+                               void *private_data)
+{
+    NFSRPC *task = private_data;
+    task->ret = ret;
+    if (task->ret == 0) {
+        memcpy(task->st, data, sizeof(struct stat));
+    }
+    if (task->ret < 0) {
+        error_report("NFS Error: %s", nfs_get_error(nfs));
+    }
+    bdrv_dec_in_flight(task->bs);
+}
+
 static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
 {
     NFSClient *client = bs->opaque;
@@ -468,16 +479,15 @@ static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
         return client->st_blocks * 512;
     }
 
+    bdrv_inc_in_flight(bs);
+    task.bs = bs;
     task.st = &st;
-    if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
+    if (nfs_fstat_async(client->context, client->fh, nfs_get_allocated_file_size_cb,
                         &task) != 0) {
         return -ENOMEM;
     }
 
-    while (!task.complete) {
-        nfs_set_events(client);
-        aio_poll(client->aio_context, true);
-    }
+    bdrv_drain(bs);
 
     return (task.ret < 0 ? task.ret : st.st_blocks * 512);
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 11/16] sheepdog: disable dataplane
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (9 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 10/16] nfs: replace aio_poll with bdrv_drain Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 12/16] aio: introduce aio_context_in_iothread Paolo Bonzini
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

sheepdog has some calls to aio_poll that are hard to eliminate, for
example in sd_sheepdog_goto's call to do_req.  Since I don't have
means to test sheepdog well, disable dataplane altogether for this
driver.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/sheepdog.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index a0098c1..e2ab946 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -357,6 +357,7 @@ struct SheepdogAIOCB {
 typedef struct BDRVSheepdogState {
     BlockDriverState *bs;
     AioContext *aio_context;
+    Error *blocker;
 
     SheepdogInode inode;
 
@@ -1416,6 +1417,21 @@ static int sd_open(BlockDriverState *bs, QDict *options, int flags,
     Error *local_err = NULL;
     const char *filename;
 
+    /* sd_snapshot_goto does blocking operations that call aio_poll
+     * (through do_req).  This can cause races with iothread:
+     *
+     *       main thread                       I/O thread
+     *       -----------------                 ------------------
+     *       while(srco.finished == false)
+     *                                         aio_poll(..., true)
+     *                                            srco.finished = true
+     *         aio_poll(..., true)
+     *
+     * Now aio_poll potentially blocks forever.
+     */
+    error_setg(&s->blocker, "sheepdog does not support iothreads");
+    bdrv_op_block(bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+
     s->bs = bs;
     s->aio_context = bdrv_get_aio_context(bs);
 
@@ -1950,6 +1966,9 @@ static void sd_close(BlockDriverState *bs)
                        false, NULL, NULL, NULL);
     closesocket(s->fd);
     g_free(s->host_spec);
+
+    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+    error_free(s->blocker);
 }
 
 static int64_t sd_getlength(BlockDriverState *bs)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 12/16] aio: introduce aio_context_in_iothread
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (10 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 11/16] sheepdog: disable dataplane Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread Paolo Bonzini
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

This will be used by the synchronous I/O helpers, to choose between
aio_poll or QemuEvent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/block/aio.h | 7 +++++++
 iothread.c          | 9 +++++++++
 stubs/Makefile.objs | 1 +
 stubs/iothread.c    | 8 ++++++++
 4 files changed, 25 insertions(+)
 create mode 100644 stubs/iothread.c

diff --git a/include/block/aio.h b/include/block/aio.h
index e086e3b..9434665 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -435,6 +435,13 @@ static inline bool aio_node_check(AioContext *ctx, bool is_external)
 }
 
 /**
+ * @ctx: the aio context
+ *
+ * Return whether we are running in the I/O thread that manages @ctx.
+ */
+bool aio_context_in_iothread(AioContext *ctx);
+
+/**
  * aio_context_setup:
  * @ctx: the aio context
  *
diff --git a/iothread.c b/iothread.c
index f183d38..8d40bb0 100644
--- a/iothread.c
+++ b/iothread.c
@@ -20,6 +20,7 @@
 #include "qmp-commands.h"
 #include "qemu/error-report.h"
 #include "qemu/rcu.h"
+#include "qemu/main-loop.h"
 
 typedef ObjectClass IOThreadClass;
 
@@ -28,6 +29,13 @@ typedef ObjectClass IOThreadClass;
 #define IOTHREAD_CLASS(klass) \
    OBJECT_CLASS_CHECK(IOThreadClass, klass, TYPE_IOTHREAD)
 
+static __thread IOThread *my_iothread;
+
+bool aio_context_in_iothread(AioContext *ctx)
+{
+    return ctx == (my_iothread ? my_iothread->ctx : qemu_get_aio_context());
+}
+
 static void *iothread_run(void *opaque)
 {
     IOThread *iothread = opaque;
@@ -35,6 +43,7 @@ static void *iothread_run(void *opaque)
 
     rcu_register_thread();
 
+    my_iothread = iothread;
     qemu_mutex_lock(&iothread->init_done_lock);
     iothread->thread_id = qemu_get_thread_id();
     qemu_cond_signal(&iothread->init_done_cond);
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index e922de9..187cee1 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -13,6 +13,7 @@ stub-obj-y += gdbstub.o
 stub-obj-y += get-fd.o
 stub-obj-y += get-next-serial.o
 stub-obj-y += get-vm-name.o
+stub-obj-y += iothread.o
 stub-obj-y += iothread-lock.o
 stub-obj-y += is-daemonized.o
 stub-obj-y += machine-init-done.o
diff --git a/stubs/iothread.c b/stubs/iothread.c
new file mode 100644
index 0000000..6c02323
--- /dev/null
+++ b/stubs/iothread.c
@@ -0,0 +1,8 @@
+#include "qemu/osdep.h"
+#include "block/aio.h"
+#include "qemu/main-loop.h"
+
+bool aio_context_in_iothread(AioContext *ctx)
+{
+    return ctx == qemu_get_aio_context();
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (11 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 12/16] aio: introduce aio_context_in_iothread Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-09  8:30   ` Fam Zheng
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 14/16] iothread: release AioContext around aio_poll Paolo Bonzini
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

aio_poll is not thread safe; for example bdrv_drain can hang if
the last in-flight I/O operation is completed in the I/O thread after
the main thread has checked bs->in_flight.

The bug remains latent as long as all of it is called within
aio_context_acquire/aio_context_release, but this will change soon.

To fix this, if bdrv_drain is called from outside the I/O thread handle
it internally in the BDS, without involving AioContext and aio_poll.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                   |  2 ++
 block/io.c                | 21 ++++++++++++++++++---
 include/block/block_int.h |  5 ++++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index fb02d7f..601a73f 100644
--- a/block.c
+++ b/block.c
@@ -267,6 +267,7 @@ BlockDriverState *bdrv_new(void)
     qemu_co_queue_init(&bs->throttled_reqs[1]);
     bs->refcnt = 1;
     bs->aio_context = qemu_get_aio_context();
+    qemu_event_init(&bs->in_flight_event, true);
 
     QTAILQ_INSERT_TAIL(&all_bdrv_states, bs, bs_list);
 
@@ -2395,6 +2396,7 @@ static void bdrv_delete(BlockDriverState *bs)
     bdrv_make_anon(bs);
 
     QTAILQ_REMOVE(&all_bdrv_states, bs, bs_list);
+    qemu_event_destroy(&bs->in_flight_event);
 
     g_free(bs);
 }
diff --git a/block/io.c b/block/io.c
index 04b52c8..ea0546f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -251,11 +251,24 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
 
 static bool bdrv_drain_io_recurse(BlockDriverState *bs)
 {
-    BdrvChild *child;
+    AioContext *ctx = bdrv_get_aio_context(bs);
     bool waited = false;
+    BdrvChild *child;
 
     while (atomic_read(&bs->in_flight) > 0) {
-        aio_poll(bdrv_get_aio_context(bs), true);
+        if (aio_context_in_iothread(ctx)) {
+            /* This case should not occur at all, except for the
+             * main thread.
+             */
+            aio_poll(bdrv_get_aio_context(bs), true);
+        } else {
+            qemu_event_reset(&bs->in_flight_event);
+            if (atomic_read(&bs->in_flight) > 0) {
+                aio_context_release(bdrv_get_aio_context(bs));
+                qemu_event_wait(&bs->in_flight_event);
+                aio_context_acquire(bdrv_get_aio_context(bs));
+            }
+        }
         waited = true;
     }
 
@@ -465,7 +478,9 @@ void bdrv_inc_in_flight(BlockDriverState *bs)
 
 void bdrv_dec_in_flight(BlockDriverState *bs)
 {
-    atomic_dec(&bs->in_flight);
+    if (atomic_fetch_dec(&bs->in_flight) == 1) {
+        qemu_event_set(&bs->in_flight_event);
+    }
 }
 
 static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 89c38c0..9c96d5d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -404,9 +404,12 @@ struct BlockDriverState {
     /* Callback before write request is processed */
     NotifierWithReturnList before_write_notifiers;
 
-    /* number of in-flight requests; overall and serialising */
+    /* number of in-flight requests; overall and serialising.
+     * in_flight_event is set when in_flight becomes 0.
+     */
     unsigned int in_flight;
     unsigned int serialising_in_flight;
+    QemuEvent in_flight_event;
 
     /* I/O throttling.
      * throttle_state tells us if this BDS has I/O limits configured.
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 14/16] iothread: release AioContext around aio_poll
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (12 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 15/16] qemu-thread: introduce QemuRecMutex Paolo Bonzini
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

This is the first step towards having fine-grained critical sections in
dataplane threads, which will resolve lock ordering problems between
address_space_* functions (which need the BQL when doing MMIO, even
after we complete RCU-based dispatch) and the AioContext.

Because AioContext does not use contention callbacks anymore, the
unit test has to be changed.

Previously applied as a0710f7995f914e3044e5899bd8ff6c43c62f916 and
then reverted.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c                     | 22 +++-------------------
 docs/multiple-iothreads.txt | 40 +++++++++++++++++++++++-----------------
 include/block/aio.h         |  3 ---
 iothread.c                  | 11 ++---------
 tests/test-aio.c            | 22 ++++++++++++++--------
 5 files changed, 42 insertions(+), 56 deletions(-)

diff --git a/async.c b/async.c
index d4dd2cc..0084d57 100644
--- a/async.c
+++ b/async.c
@@ -85,8 +85,8 @@ int aio_bh_poll(AioContext *ctx)
          * aio_notify again if necessary.
          */
         if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
-            /* Idle BHs and the notify BH don't count as progress */
-            if (!bh->idle && bh != ctx->notify_dummy_bh) {
+            /* Idle BHs don't count as progress */
+            if (!bh->idle) {
                 ret = 1;
             }
             bh->idle = 0;
@@ -238,7 +238,6 @@ aio_ctx_finalize(GSource     *source)
 {
     AioContext *ctx = (AioContext *) source;
 
-    qemu_bh_delete(ctx->notify_dummy_bh);
     thread_pool_free(ctx->thread_pool);
 
     qemu_mutex_lock(&ctx->bh_lock);
@@ -305,19 +304,6 @@ static void aio_timerlist_notify(void *opaque)
     aio_notify(opaque);
 }
 
-static void aio_rfifolock_cb(void *opaque)
-{
-    AioContext *ctx = opaque;
-
-    /* Kick owner thread in case they are blocked in aio_poll() */
-    qemu_bh_schedule(ctx->notify_dummy_bh);
-}
-
-static void notify_dummy_bh(void *opaque)
-{
-    /* Do nothing, we were invoked just to force the event loop to iterate */
-}
-
 static void event_notifier_dummy_cb(EventNotifier *e)
 {
 }
@@ -346,11 +332,9 @@ AioContext *aio_context_new(Error **errp)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
     qemu_mutex_init(&ctx->bh_lock);
-    rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
+    rfifolock_init(&ctx->lock, NULL, NULL);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
-    ctx->notify_dummy_bh = aio_bh_new(ctx, notify_dummy_bh, NULL);
-
     return ctx;
 fail:
     g_source_destroy(&ctx->source);
diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 40b8419..0e7cdb2 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -105,13 +105,10 @@ a BH in the target AioContext beforehand and then call qemu_bh_schedule().  No
 acquire/release or locking is needed for the qemu_bh_schedule() call.  But be
 sure to acquire the AioContext for aio_bh_new() if necessary.
 
-The relationship between AioContext and the block layer
--------------------------------------------------------
-The AioContext originates from the QEMU block layer because it provides a
-scoped way of running event loop iterations until all work is done.  This
-feature is used to complete all in-flight block I/O requests (see
-bdrv_drain_all()).  Nowadays AioContext is a generic event loop that can be
-used by any QEMU subsystem.
+AioContext and the block layer
+------------------------------
+The AioContext originates from the QEMU block layer, even though nowadays
+AioContext is a generic event loop that can be used by any QEMU subsystem.
 
 The block layer has support for AioContext integrated.  Each BlockDriverState
 is associated with an AioContext using bdrv_set_aio_context() and
@@ -122,13 +119,22 @@ Block layer code must therefore expect to run in an IOThread and avoid using
 old APIs that implicitly use the main loop.  See the "How to program for
 IOThreads" above for information on how to do that.
 
-If main loop code such as a QMP function wishes to access a BlockDriverState it
-must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
-IOThread does not run in parallel.
-
-Long-running jobs (usually in the form of coroutines) are best scheduled in the
-BlockDriverState's AioContext to avoid the need to acquire/release around each
-bdrv_*() call.  Be aware that there is currently no mechanism to get notified
-when bdrv_set_aio_context() moves this BlockDriverState to a different
-AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
-may need to add this if you want to support long-running jobs.
+If main loop code such as a QMP function wishes to access a BlockDriverState
+it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
+that callbacks in the IOThread do not run in parallel.
+
+Code running in the monitor typically needs to ensure that past
+requests from the guest are completed.  When a block device is running
+in an IOThread, the IOThread can also process requests from the guest
+(via ioeventfd).  To achieve both objects, wrap the code between
+bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
+section".  The functions must be called between aio_context_acquire()
+and aio_context_release().  You can freely release and re-acquire the
+AioContext within a drained section.
+
+Long-running jobs (usually in the form of coroutines) are best scheduled in
+the BlockDriverState's AioContext to avoid the need to acquire/release around
+each bdrv_*() call.  The functions bdrv_add/remove_aio_context_notifier,
+or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
+can be used to get a notification whenever bdrv_set_aio_context() moves a
+BlockDriverState to a different AioContext.
diff --git a/include/block/aio.h b/include/block/aio.h
index 9434665..ba1ee81 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -114,9 +114,6 @@ struct AioContext {
     bool notified;
     EventNotifier notifier;
 
-    /* Scheduling this BH forces the event loop it iterate */
-    QEMUBH *notify_dummy_bh;
-
     /* Thread pool for performing work and receiving completion callbacks */
     struct ThreadPool *thread_pool;
 
diff --git a/iothread.c b/iothread.c
index 8d40bb0..f66ec95 100644
--- a/iothread.c
+++ b/iothread.c
@@ -39,7 +39,6 @@ bool aio_context_in_iothread(AioContext *ctx)
 static void *iothread_run(void *opaque)
 {
     IOThread *iothread = opaque;
-    bool blocking;
 
     rcu_register_thread();
 
@@ -49,14 +48,8 @@ static void *iothread_run(void *opaque)
     qemu_cond_signal(&iothread->init_done_cond);
     qemu_mutex_unlock(&iothread->init_done_lock);
 
-    while (!iothread->stopping) {
-        aio_context_acquire(iothread->ctx);
-        blocking = true;
-        while (!iothread->stopping && aio_poll(iothread->ctx, blocking)) {
-            /* Progress was made, keep going */
-            blocking = false;
-        }
-        aio_context_release(iothread->ctx);
+    while (!atomic_read(&iothread->stopping)) {
+        aio_poll(iothread->ctx, true);
     }
 
     rcu_unregister_thread();
diff --git a/tests/test-aio.c b/tests/test-aio.c
index 6ccea98..3fe27e7 100644
--- a/tests/test-aio.c
+++ b/tests/test-aio.c
@@ -99,6 +99,7 @@ static void event_ready_cb(EventNotifier *e)
 
 typedef struct {
     QemuMutex start_lock;
+    EventNotifier notifier;
     bool thread_acquired;
 } AcquireTestData;
 
@@ -110,6 +111,11 @@ static void *test_acquire_thread(void *opaque)
     qemu_mutex_lock(&data->start_lock);
     qemu_mutex_unlock(&data->start_lock);
 
+    /* event_notifier_set might be called either before or after
+     * the main thread's call to poll().  The test case's outcome
+     * should be the same in either case.
+     */
+    event_notifier_set(&data->notifier);
     aio_context_acquire(ctx);
     aio_context_release(ctx);
 
@@ -124,20 +130,19 @@ static void set_event_notifier(AioContext *ctx, EventNotifier *notifier,
     aio_set_event_notifier(ctx, notifier, false, handler);
 }
 
-static void dummy_notifier_read(EventNotifier *unused)
+static void dummy_notifier_read(EventNotifier *n)
 {
-    g_assert(false); /* should never be invoked */
+    event_notifier_test_and_clear(n);
 }
 
 static void test_acquire(void)
 {
     QemuThread thread;
-    EventNotifier notifier;
     AcquireTestData data;
 
     /* Dummy event notifier ensures aio_poll() will block */
-    event_notifier_init(&notifier, false);
-    set_event_notifier(ctx, &notifier, dummy_notifier_read);
+    event_notifier_init(&data.notifier, false);
+    set_event_notifier(ctx, &data.notifier, dummy_notifier_read);
     g_assert(!aio_poll(ctx, false)); /* consume aio_notify() */
 
     qemu_mutex_init(&data.start_lock);
@@ -151,12 +156,13 @@ static void test_acquire(void)
     /* Block in aio_poll(), let other thread kick us and acquire context */
     aio_context_acquire(ctx);
     qemu_mutex_unlock(&data.start_lock); /* let the thread run */
-    g_assert(!aio_poll(ctx, true));
+    g_assert(aio_poll(ctx, true));
+    g_assert(!data.thread_acquired);
     aio_context_release(ctx);
 
     qemu_thread_join(&thread);
-    set_event_notifier(ctx, &notifier, NULL);
-    event_notifier_cleanup(&notifier);
+    set_event_notifier(ctx, &data.notifier, NULL);
+    event_notifier_cleanup(&data.notifier);
 
     g_assert(data.thread_acquired);
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 15/16] qemu-thread: introduce QemuRecMutex
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (13 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 14/16] iothread: release AioContext around aio_poll Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 16/16] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

GRecMutex is new in glib 2.32, so we cannot use it.  Introduce
a recursive mutex in qemu-thread instead, which will be used
instead of RFifoLock.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/qemu/thread-posix.h |  6 ++++++
 include/qemu/thread-win32.h | 10 ++++++++++
 include/qemu/thread.h       |  3 +++
 util/qemu-thread-posix.c    | 13 +++++++++++++
 util/qemu-thread-win32.c    | 25 +++++++++++++++++++++++++
 5 files changed, 57 insertions(+)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index eb5c7a1..2fb6b90 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -3,6 +3,12 @@
 #include "pthread.h"
 #include <semaphore.h>
 
+typedef QemuMutex QemuRecMutex;
+#define qemu_rec_mutex_destroy qemu_mutex_destroy
+#define qemu_rec_mutex_lock qemu_mutex_lock
+#define qemu_rec_mutex_try_lock qemu_mutex_try_lock
+#define qemu_rec_mutex_unlock qemu_mutex_unlock
+
 struct QemuMutex {
     pthread_mutex_t lock;
 };
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index 385ff5f..92c1969 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -7,6 +7,16 @@ struct QemuMutex {
     LONG owner;
 };
 
+typedef struct QemuRecMutex QemuRecMutex;
+struct QemuRecMutex {
+    CRITICAL_SECTION lock;
+};
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex);
+void qemu_rec_mutex_lock(QemuRecMutex *mutex);
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex);
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex);
+
 struct QemuCond {
     LONG waiters, target;
     HANDLE sema;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 5114ec8..981f3dc 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -25,6 +25,9 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
+/* Prototypes for other functions are in thread-posix.h/thread-win32.h.  */
+void qemu_rec_mutex_init(QemuRecMutex *mutex);
+
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 74a3023..1aec83f 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -80,6 +80,19 @@ void qemu_mutex_unlock(QemuMutex *mutex)
         error_exit(err, __func__);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+    int err;
+    pthread_mutexattr_t attr;
+
+    pthread_mutexattr_init(&attr);
+    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
+    err = pthread_mutex_init(&mutex->lock, &attr);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     int err;
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 98a5ddf..171d83c 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -79,6 +79,31 @@ void qemu_mutex_unlock(QemuMutex *mutex)
     LeaveCriticalSection(&mutex->lock);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+    InitializeCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex)
+{
+    DeleteCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_lock(QemuRecMutex *mutex)
+{
+    EnterCriticalSection(&mutex->lock);
+}
+
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex)
+{
+    return !TryEnterCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex)
+{
+    LeaveCriticalSection(&mutex->lock);
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     memset(cond, 0, sizeof(*cond));
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [Qemu-devel] [PATCH 16/16] aio: convert from RFifoLock to QemuRecMutex
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (14 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 15/16] qemu-thread: introduce QemuRecMutex Paolo Bonzini
@ 2016-02-16 17:56 ` Paolo Bonzini
  2016-03-08 17:51 ` [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-02-16 17:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, stefanha

It is simpler and a bit faster, and QEMU does not need the contention
callbacks (and thus the fairness) anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c                  |  8 ++---
 include/block/aio.h      |  3 +-
 include/qemu/rfifolock.h | 54 ----------------------------
 tests/.gitignore         |  1 -
 tests/Makefile           |  2 --
 tests/test-rfifolock.c   | 91 ------------------------------------------------
 util/Makefile.objs       |  1 -
 util/rfifolock.c         | 78 -----------------------------------------
 8 files changed, 5 insertions(+), 233 deletions(-)
 delete mode 100644 include/qemu/rfifolock.h
 delete mode 100644 tests/test-rfifolock.c
 delete mode 100644 util/rfifolock.c

diff --git a/async.c b/async.c
index 0084d57..c9f2fc2 100644
--- a/async.c
+++ b/async.c
@@ -254,7 +254,7 @@ aio_ctx_finalize(GSource     *source)
 
     aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
     event_notifier_cleanup(&ctx->notifier);
-    rfifolock_destroy(&ctx->lock);
+    qemu_rec_mutex_destroy(&ctx->lock);
     qemu_mutex_destroy(&ctx->bh_lock);
     timerlistgroup_deinit(&ctx->tlg);
 }
@@ -332,7 +332,7 @@ AioContext *aio_context_new(Error **errp)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
     qemu_mutex_init(&ctx->bh_lock);
-    rfifolock_init(&ctx->lock, NULL, NULL);
+    qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
     return ctx;
@@ -353,10 +353,10 @@ void aio_context_unref(AioContext *ctx)
 
 void aio_context_acquire(AioContext *ctx)
 {
-    rfifolock_lock(&ctx->lock);
+    qemu_rec_mutex_lock(&ctx->lock);
 }
 
 void aio_context_release(AioContext *ctx)
 {
-    rfifolock_unlock(&ctx->lock);
+    qemu_rec_mutex_unlock(&ctx->lock);
 }
diff --git a/include/block/aio.h b/include/block/aio.h
index ba1ee81..d9ff265 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -19,7 +19,6 @@
 #include "qemu/queue.h"
 #include "qemu/event_notifier.h"
 #include "qemu/thread.h"
-#include "qemu/rfifolock.h"
 #include "qemu/timer.h"
 
 typedef struct BlockAIOCB BlockAIOCB;
@@ -52,7 +51,7 @@ struct AioContext {
     GSource source;
 
     /* Protects all fields from multi-threaded access */
-    RFifoLock lock;
+    QemuRecMutex lock;
 
     /* The list of registered AIO handlers */
     QLIST_HEAD(, AioHandler) aio_handlers;
diff --git a/include/qemu/rfifolock.h b/include/qemu/rfifolock.h
deleted file mode 100644
index b23ab53..0000000
--- a/include/qemu/rfifolock.h
+++ /dev/null
@@ -1,54 +0,0 @@
-/*
- * Recursive FIFO lock
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi   <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef QEMU_RFIFOLOCK_H
-#define QEMU_RFIFOLOCK_H
-
-#include "qemu/thread.h"
-
-/* Recursive FIFO lock
- *
- * This lock provides more features than a plain mutex:
- *
- * 1. Fairness - enforces FIFO order.
- * 2. Nesting - can be taken recursively.
- * 3. Contention callback - optional, called when thread must wait.
- *
- * The recursive FIFO lock is heavyweight so prefer other synchronization
- * primitives if you do not need its features.
- */
-typedef struct {
-    QemuMutex lock;             /* protects all fields */
-
-    /* FIFO order */
-    unsigned int head;          /* active ticket number */
-    unsigned int tail;          /* waiting ticket number */
-    QemuCond cond;              /* used to wait for our ticket number */
-
-    /* Nesting */
-    QemuThread owner_thread;    /* thread that currently has ownership */
-    unsigned int nesting;       /* amount of nesting levels */
-
-    /* Contention callback */
-    void (*cb)(void *);         /* called when thread must wait, with ->lock
-                                 * held so it may not recursively lock/unlock
-                                 */
-    void *cb_opaque;
-} RFifoLock;
-
-void rfifolock_init(RFifoLock *r, void (*cb)(void *), void *opaque);
-void rfifolock_destroy(RFifoLock *r);
-void rfifolock_lock(RFifoLock *r);
-void rfifolock_unlock(RFifoLock *r);
-
-#endif /* QEMU_RFIFOLOCK_H */
diff --git a/tests/.gitignore b/tests/.gitignore
index 787c95c..b3da3f1 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -51,7 +51,6 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
-test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
 test-thread-pool
diff --git a/tests/Makefile b/tests/Makefile
index 650e654..9da0fcf 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -39,7 +39,6 @@ check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
 gcov-files-test-iov-y = util/iov.c
 check-unit-y += tests/test-aio$(EXESUF)
-check-unit-$(CONFIG_POSIX) += tests/test-rfifolock$(EXESUF)
 check-unit-y += tests/test-throttle$(EXESUF)
 gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
 gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
@@ -403,7 +402,6 @@ tests/check-qom-interface$(EXESUF): tests/check-qom-interface.o $(test-qom-obj-y
 tests/check-qom-proplist$(EXESUF): tests/check-qom-proplist.o $(test-qom-obj-y)
 tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(test-block-obj-y)
 tests/test-aio$(EXESUF): tests/test-aio.o $(test-block-obj-y)
-tests/test-rfifolock$(EXESUF): tests/test-rfifolock.o $(test-util-obj-y)
 tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
 tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
diff --git a/tests/test-rfifolock.c b/tests/test-rfifolock.c
deleted file mode 100644
index 0572ebb..0000000
--- a/tests/test-rfifolock.c
+++ /dev/null
@@ -1,91 +0,0 @@
-/*
- * RFifoLock tests
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi    <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU LGPL, version 2 or later.
- * See the COPYING.LIB file in the top-level directory.
- */
-
-#include <glib.h>
-#include "qemu-common.h"
-#include "qemu/rfifolock.h"
-
-static void test_nesting(void)
-{
-    RFifoLock lock;
-
-    /* Trivial test, ensure the lock is recursive */
-    rfifolock_init(&lock, NULL, NULL);
-    rfifolock_lock(&lock);
-    rfifolock_lock(&lock);
-    rfifolock_lock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_destroy(&lock);
-}
-
-typedef struct {
-    RFifoLock lock;
-    int fd[2];
-} CallbackTestData;
-
-static void rfifolock_cb(void *opaque)
-{
-    CallbackTestData *data = opaque;
-    int ret;
-    char c = 0;
-
-    ret = write(data->fd[1], &c, sizeof(c));
-    g_assert(ret == 1);
-}
-
-static void *callback_thread(void *opaque)
-{
-    CallbackTestData *data = opaque;
-
-    /* The other thread holds the lock so the contention callback will be
-     * invoked...
-     */
-    rfifolock_lock(&data->lock);
-    rfifolock_unlock(&data->lock);
-    return NULL;
-}
-
-static void test_callback(void)
-{
-    CallbackTestData data;
-    QemuThread thread;
-    int ret;
-    char c;
-
-    rfifolock_init(&data.lock, rfifolock_cb, &data);
-    ret = qemu_pipe(data.fd);
-    g_assert(ret == 0);
-
-    /* Hold lock but allow the callback to kick us by writing to the pipe */
-    rfifolock_lock(&data.lock);
-    qemu_thread_create(&thread, "callback_thread",
-                       callback_thread, &data, QEMU_THREAD_JOINABLE);
-    ret = read(data.fd[0], &c, sizeof(c));
-    g_assert(ret == 1);
-    rfifolock_unlock(&data.lock);
-    /* If we got here then the callback was invoked, as expected */
-
-    qemu_thread_join(&thread);
-    close(data.fd[0]);
-    close(data.fd[1]);
-    rfifolock_destroy(&data.lock);
-}
-
-int main(int argc, char **argv)
-{
-    g_test_init(&argc, &argv, NULL);
-    g_test_add_func("/nesting", test_nesting);
-    g_test_add_func("/callback", test_callback);
-    return g_test_run();
-}
diff --git a/util/Makefile.objs b/util/Makefile.objs
index a8a777e..c0223c6 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -23,7 +23,6 @@ util-obj-y += crc32c.o
 util-obj-y += throttle.o
 util-obj-y += getauxval.o
 util-obj-y += readline.o
-util-obj-y += rfifolock.o
 util-obj-y += rcu.o
 util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
 util-obj-y += qemu-coroutine-sleep.o
diff --git a/util/rfifolock.c b/util/rfifolock.c
deleted file mode 100644
index c22f5fe..0000000
--- a/util/rfifolock.c
+++ /dev/null
@@ -1,78 +0,0 @@
-/*
- * Recursive FIFO lock
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi   <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU LGPL, version 2 or later.
- * See the COPYING.LIB file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "qemu/rfifolock.h"
-
-void rfifolock_init(RFifoLock *r, void (*cb)(void *), void *opaque)
-{
-    qemu_mutex_init(&r->lock);
-    r->head = 0;
-    r->tail = 0;
-    qemu_cond_init(&r->cond);
-    r->nesting = 0;
-    r->cb = cb;
-    r->cb_opaque = opaque;
-}
-
-void rfifolock_destroy(RFifoLock *r)
-{
-    qemu_cond_destroy(&r->cond);
-    qemu_mutex_destroy(&r->lock);
-}
-
-/*
- * Theory of operation:
- *
- * In order to ensure FIFO ordering, implement a ticketlock.  Threads acquiring
- * the lock enqueue themselves by incrementing the tail index.  When the lock
- * is unlocked, the head is incremented and waiting threads are notified.
- *
- * Recursive locking does not take a ticket since the head is only incremented
- * when the outermost recursive caller unlocks.
- */
-void rfifolock_lock(RFifoLock *r)
-{
-    qemu_mutex_lock(&r->lock);
-
-    /* Take a ticket */
-    unsigned int ticket = r->tail++;
-
-    if (r->nesting > 0 && qemu_thread_is_self(&r->owner_thread)) {
-        r->tail--; /* put ticket back, we're nesting */
-    } else {
-        while (ticket != r->head) {
-            /* Invoke optional contention callback */
-            if (r->cb) {
-                r->cb(r->cb_opaque);
-            }
-            qemu_cond_wait(&r->cond, &r->lock);
-        }
-    }
-
-    qemu_thread_get_self(&r->owner_thread);
-    r->nesting++;
-    qemu_mutex_unlock(&r->lock);
-}
-
-void rfifolock_unlock(RFifoLock *r)
-{
-    qemu_mutex_lock(&r->lock);
-    assert(r->nesting > 0);
-    assert(qemu_thread_is_self(&r->owner_thread));
-    if (--r->nesting == 0) {
-        r->head++;
-        qemu_cond_broadcast(&r->cond);
-    }
-    qemu_mutex_unlock(&r->lock);
-}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (15 preceding siblings ...)
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 16/16] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
@ 2016-03-08 17:51 ` Paolo Bonzini
  2016-03-09  8:46 ` Fam Zheng
  2016-03-16 18:18 ` Stefan Hajnoczi
  18 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-08 17:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: stefanha, qemu-block



On 16/02/2016 18:56, Paolo Bonzini wrote:
> So the fine-grained locking series has grown from 2 parts to 3...
> 
> This first part stops where we remove RFifoLock.
> 
> In the previous submission, the bulk of the series took care of
> associating an AioContext to a thread, so that aio_poll could run event
> handlers only on that thread.  This was done to fix a race in bdrv_drain.
> There were two disadvantages:
> 
> 1) reporting progress from aio_poll to the main thread was Really Hard.
> Despite being relatively sure of the correctness---also thanks to formal
> models---it's not the kind of code that I'd lightly donate to posterity.
> 
> 2) the strict dependency between bdrv_drain and a single AioContext
> would have been bad for multiqueue.
> 
> Instead, this series does the same trick (do not run dataplane event handlers
> from the main thread) but reports progress directly at the BlockDriverState
> level.
> 
> To do so, the mechanism to track in-flight requests is changed to a
> simple counter.  This is more flexible than BdrvTrackedRequest, and lets
> the block/io.c code track precisely the initiation and completion of the
> requests.  In particular, the counter remains nonzero when a bottom half
> is required to "really" complete the request.  bdrv_drain doesn't rely
> anymore on a non-blocking aio_poll to execute those bottom halves.
> 
> It is also more efficient; while BdrvTrackedRequest has to remain
> for request serialisation, with this series we can in principle make
> BdrvTrackedRequest optional and enable it only when something (automatic
> RMW or copy-on-read) requires request serialisation.
> 
> In general this flows nicely, the only snag being patch 5.  The problem
> here is that mirror is calling bdrv_drain from an AIO callback, which
> used to be okay (because bdrv_drain more or less tried to guess when
> all AIO callbacks were done) but now causes a deadlock (because bdrv_drain
> really checks for AIO callbacks to be complete).  To add to the complication,
> the bdrv_drain happens far away from the AIO callback, in the coroutine that
> the AIO callback enters.  The situation here is admittedly underdefined,
> and Stefan pointed out that the same solution is found in many other places
> in the QEMU block layer.  Therefore I think it's acceptable.  I'm pointing
> it out explicitly though, because (together with patch 8) it is an example
> of the issues caused by the new, stricter definition of bdrv_drain.
> 
> Patches 1-7 organize bdrv_drain into separate phases, with a well-defined
> order between the BDS and it children.  The phases are: 1) disable
> throttling; 2) disable io_plug; 3) call bdrv_drain callbacks; 4) wait
> for I/O to finish; 5) re-enable io_plug and throttling.  Previously,
> throttling of throttling and io_plug were handled by bdrv_flush_io_queue,
> which was repeatedly called as part of the I/O wait loop.  This part of
> the series removes bdrv_flush_io_queue.
> 
> Patch 8-11 replace aio_poll with bdrv_drain so that the work done
> so far applies to all former callers of aio_poll.
> 
> Patches 12-13 introduce the logic that lets the main thread wait remotely
> for an iothread to drain the BDS.  Patches 14-16 then achieve the purpose
> of this series, removing the long-running AioContext critical section
> from iothread_run and removing RFifoLock.
> 
> The series passes check-block.sh with a few fixes applied for pre-existing
> bugs:
> 
>     vl: fix migration from prelaunch state
>     migration: fix incorrect memory_global_dirty_log_start outside BQL
>     qed: fix bdrv_qed_drain
> 
> Paolo
> 
> Paolo Bonzini (16):
>   block: make bdrv_start_throttled_reqs return void
>   block: move restarting of throttled reqs to block/throttle-groups.c
>   block: introduce bdrv_no_throttling_begin/end
>   block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end
>   mirror: use bottom half to re-enter coroutine
>   block: add BDS field to count in-flight requests
>   block: change drain to look only at one child at a time
>   blockjob: introduce .drain callback for jobs
>   block: wait for all pending I/O when doing synchronous requests
>   nfs: replace aio_poll with bdrv_drain
>   sheepdog: disable dataplane
>   aio: introduce aio_context_in_iothread
>   block: only call aio_poll from iothread
>   iothread: release AioContext around aio_poll
>   qemu-thread: introduce QemuRecMutex
>   aio: convert from RFifoLock to QemuRecMutex
> 
>  async.c                         |  28 +---
>  block.c                         |   4 +-
>  block/backup.c                  |   7 +
>  block/io.c                      | 281 +++++++++++++++++++++++++---------------
>  block/linux-aio.c               |  13 +-
>  block/mirror.c                  |  37 +++++-
>  block/nfs.c                     |  50 ++++---
>  block/qed-table.c               |  16 +--
>  block/qed.c                     |   4 +-
>  block/raw-aio.h                 |   2 +-
>  block/raw-posix.c               |  16 +--
>  block/sheepdog.c                |  19 +++
>  block/throttle-groups.c         |  19 +++
>  blockjob.c                      |  16 ++-
>  docs/multiple-iothreads.txt     |  40 +++---
>  include/block/aio.h             |  13 +-
>  include/block/block.h           |   3 +-
>  include/block/block_int.h       |  22 +++-
>  include/block/blockjob.h        |   7 +
>  include/block/throttle-groups.h |   1 +
>  include/qemu/rfifolock.h        |  54 --------
>  include/qemu/thread-posix.h     |   6 +
>  include/qemu/thread-win32.h     |  10 ++
>  include/qemu/thread.h           |   3 +
>  iothread.c                      |  20 +--
>  stubs/Makefile.objs             |   1 +
>  stubs/iothread.c                |   8 ++
>  tests/.gitignore                |   1 -
>  tests/Makefile                  |   2 -
>  tests/qemu-iotests/060          |   8 +-
>  tests/qemu-iotests/060.out      |   4 +-
>  tests/test-aio.c                |  22 ++--
>  tests/test-rfifolock.c          |  91 -------------
>  util/Makefile.objs              |   1 -
>  util/qemu-thread-posix.c        |  13 ++
>  util/qemu-thread-win32.c        |  25 ++++
>  util/rfifolock.c                |  78 -----------
>  37 files changed, 471 insertions(+), 474 deletions(-)
>  delete mode 100644 include/qemu/rfifolock.h
>  create mode 100644 stubs/iothread.c
>  delete mode 100644 tests/test-rfifolock.c
>  delete mode 100644 util/rfifolock.c
> 

Ping?  It's been almost a month and I still have one series to send
before hard freeze...

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 02/16] block: move restarting of throttled reqs to block/throttle-groups.c
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 02/16] block: move restarting of throttled reqs to block/throttle-groups.c Paolo Bonzini
@ 2016-03-09  1:26   ` Fam Zheng
  2016-03-09  7:37     ` Paolo Bonzini
  0 siblings, 1 reply; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  1:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> We want to remove throttled_reqs from block/io.c.  This is the easy
> part---hide the handling of throttled_reqs during disable/enable of
> throttling within throttle-groups.c.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/io.c                      | 15 +--------------
>  block/throttle-groups.c         | 15 +++++++++++++++
>  include/block/throttle-groups.h |  1 +
>  3 files changed, 17 insertions(+), 14 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index e58cfe2..5ee5032 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -66,28 +66,15 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>  void bdrv_set_io_limits(BlockDriverState *bs,
>                          ThrottleConfig *cfg)
>  {
> -    int i;
> -
>      throttle_group_config(bs, cfg);
> -
> -    for (i = 0; i < 2; i++) {
> -        qemu_co_enter_next(&bs->throttled_reqs[i]);
> -    }
>  }
>  
>  static void bdrv_start_throttled_reqs(BlockDriverState *bs)
>  {
>      bool enabled = bs->io_limits_enabled;
> -    int i;
>  
>      bs->io_limits_enabled = false;
> -
> -    for (i = 0; i < 2; i++) {
> -        while (qemu_co_enter_next(&bs->throttled_reqs[i])) {
> -            ;
> -        }
> -    }
> -
> +    throttle_group_restart_bs(bs);
>      bs->io_limits_enabled = enabled;
>  }
>  
> diff --git a/block/throttle-groups.c b/block/throttle-groups.c
> index 4920e09..eccfc0d 100644
> --- a/block/throttle-groups.c
> +++ b/block/throttle-groups.c
> @@ -313,6 +313,17 @@ void coroutine_fn throttle_group_co_io_limits_intercept(BlockDriverState *bs,
>      qemu_mutex_unlock(&tg->lock);
>  }
>  
> +void throttle_group_restart_bs(BlockDriverState *bs)
> +{
> +    int i;
> +
> +    for (i = 0; i < 2; i++) {
> +        while (qemu_co_enter_next(&bs->throttled_reqs[i])) {
> +            ;
> +        }
> +    }
> +}
> +
>  /* Update the throttle configuration for a particular group. Similar
>   * to throttle_config(), but guarantees atomicity within the
>   * throttling group.
> @@ -335,6 +346,10 @@ void throttle_group_config(BlockDriverState *bs, ThrottleConfig *cfg)
>      }
>      throttle_config(ts, tt, cfg);
>      qemu_mutex_unlock(&tg->lock);
> +
> +    aio_context_acquire(bdrv_get_aio_context(bs));
> +    throttle_group_restart_bs(bs);
> +    aio_context_release(bdrv_get_aio_context(bs));

Could you explain why does this hunk belong to this patch?

Otherwise looks good.

Fam

>  }
>  
>  /* Get the throttle configuration from a particular group. Similar to
> diff --git a/include/block/throttle-groups.h b/include/block/throttle-groups.h
> index aba28f3..395f72d 100644
> --- a/include/block/throttle-groups.h
> +++ b/include/block/throttle-groups.h
> @@ -38,6 +38,7 @@ void throttle_group_get_config(BlockDriverState *bs, ThrottleConfig *cfg);
>  
>  void throttle_group_register_bs(BlockDriverState *bs, const char *groupname);
>  void throttle_group_unregister_bs(BlockDriverState *bs);
> +void throttle_group_restart_bs(BlockDriverState *bs);
>  
>  void coroutine_fn throttle_group_co_io_limits_intercept(BlockDriverState *bs,
>                                                          unsigned int bytes,
> -- 
> 2.5.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 03/16] block: introduce bdrv_no_throttling_begin/end
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 03/16] block: introduce bdrv_no_throttling_begin/end Paolo Bonzini
@ 2016-03-09  1:45   ` Fam Zheng
  2016-03-09  7:40     ` Paolo Bonzini
  0 siblings, 1 reply; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  1:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> Extract the handling of throttling from bdrv_flush_io_queue.

Looks good overall. Have two questions below.

> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block.c                   |  1 -
>  block/io.c                | 56 +++++++++++++++++++++++++++++------------------
>  block/throttle-groups.c   |  4 ++++
>  include/block/block_int.h |  6 ++---
>  4 files changed, 42 insertions(+), 25 deletions(-)
> 
> diff --git a/block.c b/block.c
> index efc3c43..b4f328f 100644
> --- a/block.c
> +++ b/block.c
> @@ -2314,7 +2314,6 @@ static void swap_feature_fields(BlockDriverState *bs_top,
>  
>      assert(!bs_new->throttle_state);
>      if (bs_top->throttle_state) {
> -        assert(bs_top->io_limits_enabled);
>          bdrv_io_limits_enable(bs_new, throttle_group_get_name(bs_top));
>          bdrv_io_limits_disable(bs_top);
>      }
> diff --git a/block/io.c b/block/io.c
> index 5ee5032..272caac 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -69,28 +69,43 @@ void bdrv_set_io_limits(BlockDriverState *bs,
>      throttle_group_config(bs, cfg);
>  }
>  
> -static void bdrv_start_throttled_reqs(BlockDriverState *bs)
> +static void bdrv_no_throttling_begin(BlockDriverState *bs)
>  {
> -    bool enabled = bs->io_limits_enabled;
> +    BdrvChild *child;
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        bdrv_no_throttling_begin(child->bs);
> +    }
>  
> -    bs->io_limits_enabled = false;
> -    throttle_group_restart_bs(bs);
> -    bs->io_limits_enabled = enabled;
> +    if (bs->io_limits_disabled++ == 0) {
> +        throttle_group_restart_bs(bs);
> +    }
> +}
> +
> +static void bdrv_no_throttling_end(BlockDriverState *bs)
> +{
> +    BdrvChild *child;
> +
> +    --bs->io_limits_disabled;
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        bdrv_no_throttling_end(child->bs);
> +    }
>  }
>  
>  void bdrv_io_limits_disable(BlockDriverState *bs)
>  {
> -    bs->io_limits_enabled = false;
> -    bdrv_start_throttled_reqs(bs);
> +    assert(bs->throttle_state);
> +    bdrv_no_throttling_begin(bs);
>      throttle_group_unregister_bs(bs);
> +    bdrv_no_throttling_end(bs);
>  }
>  
>  /* should be called before bdrv_set_io_limits if a limit is set */
>  void bdrv_io_limits_enable(BlockDriverState *bs, const char *group)
>  {
> -    assert(!bs->io_limits_enabled);
> +    assert(!bs->throttle_state);
>      throttle_group_register_bs(bs, group);
> -    bs->io_limits_enabled = true;
>  }
>  
>  void bdrv_io_limits_update_group(BlockDriverState *bs, const char *group)
> @@ -255,6 +270,7 @@ void bdrv_drain(BlockDriverState *bs)
>  {
>      bool busy = true;
>  
> +    bdrv_no_throttling_begin(bs);
>      bdrv_drain_recurse(bs);
>      while (busy) {
>          /* Keep iterating */
> @@ -262,6 +278,7 @@ void bdrv_drain(BlockDriverState *bs)
>           busy = bdrv_requests_pending(bs);
>           busy |= aio_poll(bdrv_get_aio_context(bs), busy);
>      }
> +    bdrv_no_throttling_end(bs);
>  }
>  
>  /*
> @@ -284,6 +301,7 @@ void bdrv_drain_all(void)
>          if (bs->job) {
>              block_job_pause(bs->job);
>          }
> +        bdrv_no_throttling_begin(bs);
>          bdrv_drain_recurse(bs);
>          aio_context_release(aio_context);
>  
> @@ -325,6 +343,7 @@ void bdrv_drain_all(void)
>          AioContext *aio_context = bdrv_get_aio_context(bs);
>  
>          aio_context_acquire(aio_context);
> +        bdrv_no_throttling_end(bs);
>          if (bs->job) {
>              block_job_resume(bs->job);
>          }
> @@ -555,11 +574,7 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
>       * will not fire; so the I/O throttling function has to be disabled here
>       * if it has been enabled.
>       */
> -    if (bs->io_limits_enabled) {
> -        fprintf(stderr, "Disabling I/O throttling on '%s' due "
> -                        "to synchronous I/O.\n", bdrv_get_device_name(bs));
> -        bdrv_io_limits_disable(bs);
> -    }
> +    bdrv_no_throttling_begin(bs);
>  
>      if (qemu_in_coroutine()) {
>          /* Fast-path if already in coroutine context */
> @@ -573,6 +588,8 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
>              aio_poll(aio_context, true);
>          }
>      }
> +
> +    bdrv_no_throttling_end(bs);

Does this change the behavior? There wasn't a bdrv_io_limits_enable() here, and
the throttle doesn't come back automatically. Just want to make sure it's
intended.

>      return rwco.ret;
>  }
>  
> @@ -608,13 +625,11 @@ int bdrv_read(BlockDriverState *bs, int64_t sector_num,
>  int bdrv_read_unthrottled(BlockDriverState *bs, int64_t sector_num,
>                            uint8_t *buf, int nb_sectors)
>  {
> -    bool enabled;
>      int ret;
>  
> -    enabled = bs->io_limits_enabled;
> -    bs->io_limits_enabled = false;
> +    bdrv_no_throttling_begin(bs);
>      ret = bdrv_read(bs, sector_num, buf, nb_sectors);
> -    bs->io_limits_enabled = enabled;
> +    bdrv_no_throttling_end(bs);
>      return ret;
>  }
>  
> @@ -952,7 +967,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>      }
>  
>      /* throttling disk I/O */
> -    if (bs->io_limits_enabled) {
> +    if (bs->throttle_state) {
>          throttle_group_co_io_limits_intercept(bs, bytes, false);
>      }
>  
> @@ -1294,7 +1309,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>      }
>  
>      /* throttling disk I/O */
> -    if (bs->io_limits_enabled) {
> +    if (bs->throttle_state) {
>          throttle_group_co_io_limits_intercept(bs, bytes, true);
>      }
>  
> @@ -2749,7 +2764,6 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
>      } else if (bs->file) {
>          bdrv_flush_io_queue(bs->file->bs);
>      }
> -    bdrv_start_throttled_reqs(bs);
>  }
>  
>  void bdrv_drained_begin(BlockDriverState *bs)
> diff --git a/block/throttle-groups.c b/block/throttle-groups.c
> index eccfc0d..8fe0a4f 100644
> --- a/block/throttle-groups.c
> +++ b/block/throttle-groups.c
> @@ -219,6 +219,10 @@ static bool throttle_group_schedule_timer(BlockDriverState *bs,
>      ThrottleGroup *tg = container_of(ts, ThrottleGroup, ts);
>      bool must_wait;
>  
> +    if (bs->io_limits_disabled) {
> +        return false;
> +    }
> +

Before, this function and its direct caller
throttle_group_co_io_limits_intercept are not called if !bs->io_limits_enabled,
so the accounting is not done. It's better, but just different.

Thanks,
Fam

>      /* Check if any of the timers in this group is already armed */
>      if (tg->any_timer_armed[is_write]) {
>          return true;
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 9ef823a..ed2e034 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -412,10 +412,10 @@ struct BlockDriverState {
>  
>      /* I/O throttling.
>       * throttle_state tells us if this BDS has I/O limits configured.
> -     * io_limits_enabled tells us if they are currently being
> -     * enforced, but it can be temporarily set to false */
> +     * io_limits_disabled tells us if they are currently being enforced */
>      CoQueue      throttled_reqs[2];
> -    bool         io_limits_enabled;
> +    unsigned int io_limits_disabled;
> +
>      /* The following fields are protected by the ThrottleGroup lock.
>       * See the ThrottleGroup documentation for details. */
>      ThrottleState *throttle_state;
> -- 
> 2.5.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 05/16] mirror: use bottom half to re-enter coroutine
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 05/16] mirror: use bottom half to re-enter coroutine Paolo Bonzini
@ 2016-03-09  3:19   ` Fam Zheng
  2016-03-09  7:41     ` Paolo Bonzini
  0 siblings, 1 reply; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  3:19 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> mirror is calling bdrv_drain from an AIO callback---more precisely,
> the bdrv_drain happens far away from the AIO callback, in the coroutine that
> the AIO callback enters.
> 
> This used to be okay because bdrv_drain more or less tried to guess
> when all AIO callbacks were done; however it will cause a deadlock once
> bdrv_drain really checks for AIO callbacks to be complete.  The situation
> here is admittedly underdefined, and Stefan pointed out that the same
> solution is found in many other places in the QEMU block layer, therefore
> I think this workaround is acceptable.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/mirror.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 2c0edfa..793c20c 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -71,6 +71,7 @@ typedef struct MirrorOp {
>      QEMUIOVector qiov;
>      int64_t sector_num;
>      int nb_sectors;
> +    QEMUBH *co_enter_bh;
>  } MirrorOp;
>  
>  static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
> @@ -86,6 +87,18 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
>      }
>  }
>  
> +static void mirror_bh_cb(void *opaque)
> +{
> +    MirrorOp *op = opaque;
> +    MirrorBlockJob *s = op->s;
> +
> +    qemu_bh_delete(op->co_enter_bh);
> +    g_free(op);
> +    if (s->waiting_for_io) {
> +        qemu_coroutine_enter(s->common.co, NULL);
> +    }
> +}
> +
>  static void mirror_iteration_done(MirrorOp *op, int ret)
>  {
>      MirrorBlockJob *s = op->s;
> @@ -116,11 +129,13 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>      }
>  
>      qemu_iovec_destroy(&op->qiov);
> -    g_free(op);
>  
> -    if (s->waiting_for_io) {
> -        qemu_coroutine_enter(s->common.co, NULL);
> -    }
> +    /* The I/O operation is not finished until the callback returns.
> +     * If we call qemu_coroutine_enter here, there is the possibility
> +     * of a deadlock when the coroutine calls bdrv_drained_begin.
> +     */
> +    op->co_enter_bh = qemu_bh_new(mirror_bh_cb, op);

Shouldn't this be aio_bh_new()?

Fam

> +    qemu_bh_schedule(op->co_enter_bh);
>  }
>  
>  static void mirror_write_complete(void *opaque, int ret)
> -- 
> 2.5.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests Paolo Bonzini
@ 2016-03-09  3:35   ` Fam Zheng
  2016-03-09  7:43     ` Paolo Bonzini
  0 siblings, 1 reply; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  3:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> Unlike tracked_requests, this field also counts throttled requests,
> and remains non-zero if an AIO operation needs a BH to be "really"
> completed.
> 
> With this change, it is no longer necessary to have a dummy
> BdrvTrackedRequest for requests that are never serialising, and
> it is no longer necessary to poll the AioContext once after
> bdrv_requests_pending(bs) returns false.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/io.c                | 71 +++++++++++++++++++++++++++++++----------------
>  block/mirror.c            |  2 ++
>  include/block/block_int.h |  8 ++++--
>  3 files changed, 54 insertions(+), 27 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index d8d50b7..a9a23a6 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -224,13 +224,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
>  {
>      BdrvChild *child;
>  
> -    if (!QLIST_EMPTY(&bs->tracked_requests)) {
> -        return true;
> -    }
> -    if (!qemu_co_queue_empty(&bs->throttled_reqs[0])) {
> -        return true;
> -    }
> -    if (!qemu_co_queue_empty(&bs->throttled_reqs[1])) {
> +    if (atomic_read(&bs->in_flight)) {
>          return true;
>      }
>  
> @@ -268,15 +262,12 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>   */
>  void bdrv_drain(BlockDriverState *bs)
>  {
> -    bool busy = true;
> -
>      bdrv_no_throttling_begin(bs);
>      bdrv_io_unplugged_begin(bs);
>      bdrv_drain_recurse(bs);
> -    while (busy) {
> +    while (bdrv_requests_pending(bs)) {
>          /* Keep iterating */
> -         busy = bdrv_requests_pending(bs);
> -         busy |= aio_poll(bdrv_get_aio_context(bs), busy);
> +         aio_poll(bdrv_get_aio_context(bs), true);
>      }
>      bdrv_io_unplugged_end(bs);
>      bdrv_no_throttling_end(bs);
> @@ -291,7 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
>  void bdrv_drain_all(void)
>  {
>      /* Always run first iteration so any pending completion BHs run */
> -    bool busy = true;
> +    bool waited = true;
>      BlockDriverState *bs = NULL;
>      GSList *aio_ctxs = NULL, *ctx;
>  
> @@ -318,8 +309,8 @@ void bdrv_drain_all(void)
>       * request completion.  Therefore we must keep looping until there was no
>       * more activity rather than simply draining each device independently.
>       */
> -    while (busy) {
> -        busy = false;
> +    while (waited) {
> +        waited = false;
>  
>          for (ctx = aio_ctxs; ctx != NULL; ctx = ctx->next) {
>              AioContext *aio_context = ctx->data;
> @@ -329,12 +320,11 @@ void bdrv_drain_all(void)
>              while ((bs = bdrv_next(bs))) {
>                  if (aio_context == bdrv_get_aio_context(bs)) {
>                      if (bdrv_requests_pending(bs)) {
> -                        busy = true;
> -                        aio_poll(aio_context, busy);
> +                        aio_poll(aio_context, true);
> +                        waited = true;
>                      }
>                  }
>              }
> -            busy |= aio_poll(aio_context, false);
>              aio_context_release(aio_context);
>          }
>      }
> @@ -457,6 +447,16 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
>      return true;
>  }
>  
> +void bdrv_inc_in_flight(BlockDriverState *bs)
> +{
> +    atomic_inc(&bs->in_flight);
> +}
> +
> +void bdrv_dec_in_flight(BlockDriverState *bs)
> +{
> +    atomic_dec(&bs->in_flight);
> +}
> +
>  static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>  {
>      BlockDriverState *bs = self->bs;
> @@ -963,6 +963,8 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>          return ret;
>      }
>  
> +    bdrv_inc_in_flight(bs);
> +
>      /* Don't do copy-on-read if we read data before write operation */
>      if (bs->copy_on_read && !(flags & BDRV_REQ_NO_SERIALISING)) {
>          flags |= BDRV_REQ_COPY_ON_READ;
> @@ -1003,6 +1005,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>                                use_local_qiov ? &local_qiov : qiov,
>                                flags);
>      tracked_request_end(&req);
> +    bdrv_dec_in_flight(bs);
>  
>      if (use_local_qiov) {
>          qemu_iovec_destroy(&local_qiov);
> @@ -1310,6 +1313,8 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>          return ret;
>      }
>  
> +    bdrv_inc_in_flight(bs);
> +
>      /* throttling disk I/O */
>      if (bs->throttle_state) {
>          throttle_group_co_io_limits_intercept(bs, bytes, true);
> @@ -1408,6 +1413,7 @@ fail:
>      qemu_vfree(tail_buf);
>  out:
>      tracked_request_end(&req);
> +    bdrv_dec_in_flight(bs);
>      return ret;
>  }
>  
> @@ -2065,6 +2071,7 @@ static const AIOCBInfo bdrv_em_aiocb_info = {
>  static void bdrv_aio_bh_cb(void *opaque)
>  {
>      BlockAIOCBSync *acb = opaque;
> +    BlockDriverState *bs = acb->common.bs;
>  
>      if (!acb->is_write && acb->ret >= 0) {
>          qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
> @@ -2072,6 +2079,7 @@ static void bdrv_aio_bh_cb(void *opaque)
>      qemu_vfree(acb->bounce);
>      acb->common.cb(acb->common.opaque, acb->ret);
>      qemu_bh_delete(acb->bh);
> +    bdrv_dec_in_flight(bs);
>      acb->bh = NULL;
>      qemu_aio_unref(acb);
>  }
> @@ -2087,6 +2095,7 @@ static BlockAIOCB *bdrv_aio_rw_vector(BlockDriverState *bs,
>  {
>      BlockAIOCBSync *acb;
>  
> +    bdrv_inc_in_flight(bs);
>      acb = qemu_aio_get(&bdrv_em_aiocb_info, bs, cb, opaque);
>      acb->is_write = is_write;
>      acb->qiov = qiov;
> @@ -2139,6 +2148,7 @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
>  {
>      if (!acb->need_bh) {
>          acb->common.cb(acb->common.opaque, acb->req.error);
> +        bdrv_dec_in_flight(acb->common.bs);
>          qemu_aio_unref(acb);
>      }
>  }
> @@ -2192,6 +2202,9 @@ static BlockAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>      Coroutine *co;
>      BlockAIOCBCoroutine *acb;
>  
> +    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
> +    bdrv_inc_in_flight(bs);
> +
>      acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
>      acb->need_bh = true;
>      acb->req.error = -EINPROGRESS;
> @@ -2225,6 +2238,9 @@ BlockAIOCB *bdrv_aio_flush(BlockDriverState *bs,
>      Coroutine *co;
>      BlockAIOCBCoroutine *acb;
>  
> +    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
> +    bdrv_inc_in_flight(bs);
> +
>      acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
>      acb->need_bh = true;
>      acb->req.error = -EINPROGRESS;
> @@ -2254,6 +2270,9 @@ BlockAIOCB *bdrv_aio_discard(BlockDriverState *bs,
>  
>      trace_bdrv_aio_discard(bs, sector_num, nb_sectors, opaque);
>  
> +    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
> +    bdrv_inc_in_flight(bs);
> +
>      acb = qemu_aio_get(&bdrv_em_co_aiocb_info, bs, cb, opaque);
>      acb->need_bh = true;
>      acb->req.error = -EINPROGRESS;
> @@ -2361,14 +2380,14 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
>  int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>  {
>      int ret;
> -    BdrvTrackedRequest req;
>  
>      if (!bs || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs) ||
>          bdrv_is_sg(bs)) {
>          return 0;
>      }
>  
> -    tracked_request_begin(&req, bs, 0, 0, BDRV_TRACKED_FLUSH);
> +    bdrv_inc_in_flight(bs);
> +
>      /* Write back cached data to the OS even with cache=unsafe */
>      BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
>      if (bs->drv->bdrv_co_flush_to_os) {
> @@ -2423,7 +2442,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>  flush_parent:
>      ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
>  out:
> -    tracked_request_end(&req);
> +    bdrv_dec_in_flight(bs);
>      return ret;
>  }
>  
> @@ -2491,6 +2510,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>          return 0;
>      }
>  
> +    bdrv_inc_in_flight(bs);
>      tracked_request_begin(&req, bs, sector_num, nb_sectors,
>                            BDRV_TRACKED_DISCARD);
>      bdrv_set_dirty(bs, sector_num, nb_sectors);
> @@ -2543,6 +2563,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>      ret = 0;
>  out:
>      tracked_request_end(&req);
> +    bdrv_dec_in_flight(bs);
>      return ret;
>  }
>  
> @@ -2588,13 +2609,12 @@ static void bdrv_ioctl_bh_cb(void *opaque)
>  static int bdrv_co_do_ioctl(BlockDriverState *bs, int req, void *buf)
>  {
>      BlockDriver *drv = bs->drv;
> -    BdrvTrackedRequest tracked_req;
>      CoroutineIOCompletion co = {
>          .coroutine = qemu_coroutine_self(),
>      };
>      BlockAIOCB *acb;
>  
> -    tracked_request_begin(&tracked_req, bs, 0, 0, BDRV_TRACKED_IOCTL);
> +    bdrv_inc_in_flight(bs);
>      if (!drv || !drv->bdrv_aio_ioctl) {
>          co.ret = -ENOTSUP;
>          goto out;
> @@ -2610,7 +2630,7 @@ static int bdrv_co_do_ioctl(BlockDriverState *bs, int req, void *buf)
>      }
>      qemu_coroutine_yield();
>  out:
> -    tracked_request_end(&tracked_req);
> +    bdrv_dec_in_flight(bs);
>      return co.ret;
>  }
>  
> @@ -2667,6 +2687,9 @@ BlockAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
>                                              bs, cb, opaque);
>      Coroutine *co;
>  
> +    /* Matched by bdrv_co_complete's bdrv_dec_in_flight.  */
> +    bdrv_inc_in_flight(bs);
> +
>      acb->need_bh = true;
>      acb->req.error = -EINPROGRESS;
>      acb->req.req = req;
> diff --git a/block/mirror.c b/block/mirror.c
> index 793c20c..3f163b8 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -94,6 +94,7 @@ static void mirror_bh_cb(void *opaque)
>  
>      qemu_bh_delete(op->co_enter_bh);
>      g_free(op);
> +    bdrv_dec_in_flight(s->common.bs);
>      if (s->waiting_for_io) {
>          qemu_coroutine_enter(s->common.co, NULL);
>      }
> @@ -134,6 +135,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>       * If we call qemu_coroutine_enter here, there is the possibility
>       * of a deadlock when the coroutine calls bdrv_drained_begin.
>       */
> +    bdrv_inc_in_flight(s->common.bs);
>      op->co_enter_bh = qemu_bh_new(mirror_bh_cb, op);
>      qemu_bh_schedule(op->co_enter_bh);
>  }
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 2838814..89c38c0 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -63,8 +63,6 @@
>  enum BdrvTrackedRequestType {
>      BDRV_TRACKED_READ,
>      BDRV_TRACKED_WRITE,
> -    BDRV_TRACKED_FLUSH,
> -    BDRV_TRACKED_IOCTL,
>      BDRV_TRACKED_DISCARD,

Okay, so flush and ioctl are not needed, but why is discard different?

Fam

>  };
>  
> @@ -406,7 +404,8 @@ struct BlockDriverState {
>      /* Callback before write request is processed */
>      NotifierWithReturnList before_write_notifiers;
>  
> -    /* number of in-flight serialising requests */
> +    /* number of in-flight requests; overall and serialising */
> +    unsigned int in_flight;
>      unsigned int serialising_in_flight;
>  
>      /* I/O throttling.
> @@ -713,6 +712,9 @@ bool bdrv_requests_pending(BlockDriverState *bs);
>  void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
>  void bdrv_undo_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *in);
>  
> +void bdrv_inc_in_flight(BlockDriverState *bs);
> +void bdrv_dec_in_flight(BlockDriverState *bs);
> +
>  void blockdev_close_all_bdrv_states(void);
>  
>  #endif /* BLOCK_INT_H */
> -- 
> 2.5.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time Paolo Bonzini
@ 2016-03-09  3:41   ` Fam Zheng
  2016-03-09  7:49     ` Paolo Bonzini
  2016-03-16 16:39   ` Stefan Hajnoczi
  1 sibling, 1 reply; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  3:41 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> bdrv_requests_pending is checking children to also wait until internal
> requests (such as metadata writes) have completed.  However, checking
> children is in general overkill because, apart from this special case,
> the parent's in_flight count will always be incremented by at least one
> for every request in the child.

While the patch looks good, I'm not sure I understand this: aren't children's
requests generated by the child itself, how is parent's in_flight incremented?

> 
> Since internal requests are only generated by the parent in the child,
> instead visit the tree parent first, and then wait for internal I/O in
> the children to complete.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/io.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index a9a23a6..e0c9215 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -249,6 +249,23 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>      }
>  }
>  
> +static bool bdrv_drain_io_recurse(BlockDriverState *bs)
> +{
> +    BdrvChild *child;
> +    bool waited = false;
> +
> +    while (atomic_read(&bs->in_flight) > 0) {
> +        aio_poll(bdrv_get_aio_context(bs), true);
> +        waited = true;
> +    }
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        waited |= bdrv_drain_io_recurse(child->bs);
> +    }
> +
> +    return waited;
> +}
> +
>  /*
>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>   * and suspend block driver's internal I/O until next request arrives.
> @@ -265,10 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
>      bdrv_no_throttling_begin(bs);
>      bdrv_io_unplugged_begin(bs);
>      bdrv_drain_recurse(bs);
> -    while (bdrv_requests_pending(bs)) {
> -        /* Keep iterating */
> -         aio_poll(bdrv_get_aio_context(bs), true);
> -    }
> +    bdrv_drain_io_recurse(bs);
>      bdrv_io_unplugged_end(bs);
>      bdrv_no_throttling_end(bs);
>  }
> @@ -319,10 +333,7 @@ void bdrv_drain_all(void)
>              aio_context_acquire(aio_context);
>              while ((bs = bdrv_next(bs))) {
>                  if (aio_context == bdrv_get_aio_context(bs)) {
> -                    if (bdrv_requests_pending(bs)) {
> -                        aio_poll(aio_context, true);
> -                        waited = true;
> -                    }
> +                    waited |= bdrv_drain_io_recurse(bs);
>                  }
>              }
>              aio_context_release(aio_context);
> -- 
> 2.5.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 02/16] block: move restarting of throttled reqs to block/throttle-groups.c
  2016-03-09  1:26   ` Fam Zheng
@ 2016-03-09  7:37     ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  7:37 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 02:26, Fam Zheng wrote:
>> diff --git a/block/throttle-groups.c b/block/throttle-groups.c
>> index 4920e09..eccfc0d 100644
>> --- a/block/throttle-groups.c
>> +++ b/block/throttle-groups.c
>> @@ -313,6 +313,17 @@ void coroutine_fn throttle_group_co_io_limits_intercept(BlockDriverState *bs,
>>      qemu_mutex_unlock(&tg->lock);
>>  }
>>  
>> +void throttle_group_restart_bs(BlockDriverState *bs)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < 2; i++) {
>> +        while (qemu_co_enter_next(&bs->throttled_reqs[i])) {
>> +            ;
>> +        }
>> +    }
>> +}
>> +
>>  /* Update the throttle configuration for a particular group. Similar
>>   * to throttle_config(), but guarantees atomicity within the
>>   * throttling group.
>> @@ -335,6 +346,10 @@ void throttle_group_config(BlockDriverState *bs, ThrottleConfig *cfg)
>>      }
>>      throttle_config(ts, tt, cfg);
>>      qemu_mutex_unlock(&tg->lock);
>> +
>> +    aio_context_acquire(bdrv_get_aio_context(bs));
>> +    throttle_group_restart_bs(bs);
>> +    aio_context_release(bdrv_get_aio_context(bs));
> 
> Could you explain why does this hunk belong to this patch?
> 
> Otherwise looks good.

Sure.  It's moved from bdrv_set_io_limits, which calls
throttle_group_config, to throttle_group_config itself:

>>  void bdrv_set_io_limits(BlockDriverState *bs,
>>                          ThrottleConfig *cfg)
>>  {
>> -    int i;
>> -
>>      throttle_group_config(bs, cfg);
>> -
>> -    for (i = 0; i < 2; i++) {
>> -        qemu_co_enter_next(&bs->throttled_reqs[i]);
>> -    }
>>  }
>>  

But in v2 I'll change it to only restart the first request so there is
no semantic change.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 03/16] block: introduce bdrv_no_throttling_begin/end
  2016-03-09  1:45   ` Fam Zheng
@ 2016-03-09  7:40     ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  7:40 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 02:45, Fam Zheng wrote:
>> > @@ -555,11 +574,7 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
>> >       * will not fire; so the I/O throttling function has to be disabled here
>> >       * if it has been enabled.
>> >       */
>> > -    if (bs->io_limits_enabled) {
>> > -        fprintf(stderr, "Disabling I/O throttling on '%s' due "
>> > -                        "to synchronous I/O.\n", bdrv_get_device_name(bs));
>> > -        bdrv_io_limits_disable(bs);
>> > -    }
>> > +    bdrv_no_throttling_begin(bs);
>> >  
>> >      if (qemu_in_coroutine()) {
>> >          /* Fast-path if already in coroutine context */
>> > @@ -573,6 +588,8 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
>> >              aio_poll(aio_context, true);
>> >          }
>> >      }
>> > +
>> > +    bdrv_no_throttling_end(bs);
> 
> Does this change the behavior? There wasn't a bdrv_io_limits_enable() here, and
> the throttle doesn't come back automatically. Just want to make sure it's
> intended.

Yes, it's intended.  As long as the I/O stays synchronous, throttling is
disabled.  If it starts to be asynchronous, it will be re-enabled
automatically.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 05/16] mirror: use bottom half to re-enter coroutine
  2016-03-09  3:19   ` Fam Zheng
@ 2016-03-09  7:41     ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  7:41 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 04:19, Fam Zheng wrote:
>> > +    /* The I/O operation is not finished until the callback returns.
>> > +     * If we call qemu_coroutine_enter here, there is the possibility
>> > +     * of a deadlock when the coroutine calls bdrv_drained_begin.
>> > +     */
>> > +    op->co_enter_bh = qemu_bh_new(mirror_bh_cb, op);
> Shouldn't this be aio_bh_new()?

Yes, of course.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests
  2016-03-09  3:35   ` Fam Zheng
@ 2016-03-09  7:43     ` Paolo Bonzini
  2016-03-09  8:00       ` Fam Zheng
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  7:43 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 04:35, Fam Zheng wrote:
>> >  enum BdrvTrackedRequestType {
>> >      BDRV_TRACKED_READ,
>> >      BDRV_TRACKED_WRITE,
>> > -    BDRV_TRACKED_FLUSH,
>> > -    BDRV_TRACKED_IOCTL,
>> >      BDRV_TRACKED_DISCARD,
> Okay, so flush and ioctl are not needed, but why is discard different?

Discard can modify the contents of the device, so I think it's safer to
serialize it against RMW and copy-on-read operations.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time
  2016-03-09  3:41   ` Fam Zheng
@ 2016-03-09  7:49     ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  7:49 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 04:41, Fam Zheng wrote:
> > bdrv_requests_pending is checking children to also wait until internal
> > requests (such as metadata writes) have completed.  However, checking
> > children is in general overkill because, apart from this special case,
> > the parent's in_flight count will always be incremented by at least one
> > for every request in the child.
> 
> While the patch looks good, I'm not sure I understand this: aren't children's
> requests generated by the child itself, how is parent's in_flight incremented?

What I mean is that there are two cases of bs generating I/O on
bs->file->bs:

1) writes caused by an operation on bs, e.g. a bdrv_aio_write to bs.  In
this case, the bdrv_aio_write increments bs's in_flight count

2) asynchronous metadata writes or flushes.  Such writes can be started
even if bs's in_flight count is zero, but not after the .bdrv_drain
callback has been invoked.  So the correct sequence is:

    - finish current I/O

    - call bdrv_drain callback

    - recurse on children

This is what is needed for QED's callback to work correctly, and I'll
implement it this way in v2.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests
  2016-03-09  7:43     ` Paolo Bonzini
@ 2016-03-09  8:00       ` Fam Zheng
  2016-03-09  8:22         ` Paolo Bonzini
  0 siblings, 1 reply; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  8:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-block, qemu-devel, stefanha

On Wed, 03/09 08:43, Paolo Bonzini wrote:
> 
> 
> On 09/03/2016 04:35, Fam Zheng wrote:
> >> >  enum BdrvTrackedRequestType {
> >> >      BDRV_TRACKED_READ,
> >> >      BDRV_TRACKED_WRITE,
> >> > -    BDRV_TRACKED_FLUSH,
> >> > -    BDRV_TRACKED_IOCTL,
> >> >      BDRV_TRACKED_DISCARD,
> > Okay, so flush and ioctl are not needed, but why is discard different?
> 
> Discard can modify the contents of the device, so I think it's safer to
> serialize it against RMW and copy-on-read operations.

Okay, that makes sense, but ioctl like SG_IO can also modify content, no?

Fam

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests Paolo Bonzini
@ 2016-03-09  8:13   ` Fam Zheng
  2016-03-09  8:23     ` Paolo Bonzini
  2016-03-16 18:04   ` Stefan Hajnoczi
  1 sibling, 1 reply; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  8:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> diff --git a/block/qed.c b/block/qed.c
> index ebba220..e90792f 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -352,7 +352,9 @@ static void qed_start_need_check_timer(BDRVQEDState *s)
>  static void qed_cancel_need_check_timer(BDRVQEDState *s)
>  {
>      trace_qed_cancel_need_check_timer(s);
> -    timer_del(s->need_check_timer);
> +    if (s->need_check_timer) {
> +        timer_del(s->need_check_timer);
> +    }
>  }

Not clear why this change is needed in this patch, but it is obviously not
wrong.  If this is to mask a bug, it at least deserves a comment.

The other parts of the patch looks good to me.

Fam

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests
  2016-03-09  8:00       ` Fam Zheng
@ 2016-03-09  8:22         ` Paolo Bonzini
  2016-03-09  8:33           ` Fam Zheng
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  8:22 UTC (permalink / raw)
  To: Fam Zheng; +Cc: stefanha, qemu-devel, qemu-block



On 09/03/2016 09:00, Fam Zheng wrote:
>> > On 09/03/2016 04:35, Fam Zheng wrote:
>>>>> > >> >  enum BdrvTrackedRequestType {
>>>>> > >> >      BDRV_TRACKED_READ,
>>>>> > >> >      BDRV_TRACKED_WRITE,
>>>>> > >> > -    BDRV_TRACKED_FLUSH,
>>>>> > >> > -    BDRV_TRACKED_IOCTL,
>>>>> > >> >      BDRV_TRACKED_DISCARD,
>>> > > Okay, so flush and ioctl are not needed, but why is discard different?
>> > 
>> > Discard can modify the contents of the device, so I think it's safer to
>> > serialize it against RMW and copy-on-read operations.
> Okay, that makes sense, but ioctl like SG_IO can also modify content, no?

If you use SG_IO you shouldn't use RMW (because scsi_read_complete traps
READ CAPACITY and sets the host block size as the guest block size) or
copy-on-read (because raw has no backing file).

Besides, BDRV_TRACKED_IOCTL didn't include sector_num/nr_sectors
operations so it didn't provide serialization.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests
  2016-03-09  8:13   ` Fam Zheng
@ 2016-03-09  8:23     ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  8:23 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 09:13, Fam Zheng wrote:
>> > @@ -352,7 +352,9 @@ static void qed_start_need_check_timer(BDRVQEDState *s)
>> >  static void qed_cancel_need_check_timer(BDRVQEDState *s)
>> >  {
>> >      trace_qed_cancel_need_check_timer(s);
>> > -    timer_del(s->need_check_timer);
>> > +    if (s->need_check_timer) {
>> > +        timer_del(s->need_check_timer);
>> > +    }
>> >  }
> Not clear why this change is needed in this patch, but it is obviously not
> wrong.  If this is to mask a bug, it at least deserves a comment.
> 
> The other parts of the patch looks good to me.

I need to check. :)

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread Paolo Bonzini
@ 2016-03-09  8:30   ` Fam Zheng
  2016-03-09  8:55     ` Paolo Bonzini
  2016-03-09  9:10     ` Paolo Bonzini
  0 siblings, 2 replies; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  8:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> aio_poll is not thread safe; for example bdrv_drain can hang if
> the last in-flight I/O operation is completed in the I/O thread after
> the main thread has checked bs->in_flight.
> 
> The bug remains latent as long as all of it is called within
> aio_context_acquire/aio_context_release, but this will change soon.
> 
> To fix this, if bdrv_drain is called from outside the I/O thread handle
> it internally in the BDS, without involving AioContext and aio_poll.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block.c                   |  2 ++
>  block/io.c                | 21 ++++++++++++++++++---
>  include/block/block_int.h |  5 ++++-
>  3 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fb02d7f..601a73f 100644
> --- a/block.c
> +++ b/block.c
> @@ -267,6 +267,7 @@ BlockDriverState *bdrv_new(void)
>      qemu_co_queue_init(&bs->throttled_reqs[1]);
>      bs->refcnt = 1;
>      bs->aio_context = qemu_get_aio_context();
> +    qemu_event_init(&bs->in_flight_event, true);
>  
>      QTAILQ_INSERT_TAIL(&all_bdrv_states, bs, bs_list);
>  
> @@ -2395,6 +2396,7 @@ static void bdrv_delete(BlockDriverState *bs)
>      bdrv_make_anon(bs);
>  
>      QTAILQ_REMOVE(&all_bdrv_states, bs, bs_list);
> +    qemu_event_destroy(&bs->in_flight_event);
>  
>      g_free(bs);
>  }
> diff --git a/block/io.c b/block/io.c
> index 04b52c8..ea0546f 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -251,11 +251,24 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>  
>  static bool bdrv_drain_io_recurse(BlockDriverState *bs)
>  {
> -    BdrvChild *child;
> +    AioContext *ctx = bdrv_get_aio_context(bs);
>      bool waited = false;
> +    BdrvChild *child;
>  
>      while (atomic_read(&bs->in_flight) > 0) {
> -        aio_poll(bdrv_get_aio_context(bs), true);
> +        if (aio_context_in_iothread(ctx)) {
> +            /* This case should not occur at all, except for the
> +             * main thread.
> +             */

Maybe assert ctx == qemu_get_aio_context()?

> +            aio_poll(bdrv_get_aio_context(bs), true);
> +        } else {
> +            qemu_event_reset(&bs->in_flight_event);
> +            if (atomic_read(&bs->in_flight) > 0) {
> +                aio_context_release(bdrv_get_aio_context(bs));
> +                qemu_event_wait(&bs->in_flight_event);
> +                aio_context_acquire(bdrv_get_aio_context(bs));
> +            }
> +        }
>          waited = true;
>      }
>  
> @@ -465,7 +478,9 @@ void bdrv_inc_in_flight(BlockDriverState *bs)
>  
>  void bdrv_dec_in_flight(BlockDriverState *bs)
>  {
> -    atomic_dec(&bs->in_flight);
> +    if (atomic_fetch_dec(&bs->in_flight) == 1) {
> +        qemu_event_set(&bs->in_flight_event);
> +    }
>  }
>  
>  static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 89c38c0..9c96d5d 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -404,9 +404,12 @@ struct BlockDriverState {
>      /* Callback before write request is processed */
>      NotifierWithReturnList before_write_notifiers;
>  
> -    /* number of in-flight requests; overall and serialising */
> +    /* number of in-flight requests; overall and serialising.
> +     * in_flight_event is set when in_flight becomes 0.
> +     */
>      unsigned int in_flight;
>      unsigned int serialising_in_flight;
> +    QemuEvent in_flight_event;
>  
>      /* I/O throttling.
>       * throttle_state tells us if this BDS has I/O limits configured.
> -- 
> 2.5.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests
  2016-03-09  8:22         ` Paolo Bonzini
@ 2016-03-09  8:33           ` Fam Zheng
  0 siblings, 0 replies; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  8:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Wed, 03/09 09:22, Paolo Bonzini wrote:
> 
> 
> On 09/03/2016 09:00, Fam Zheng wrote:
> >> > On 09/03/2016 04:35, Fam Zheng wrote:
> >>>>> > >> >  enum BdrvTrackedRequestType {
> >>>>> > >> >      BDRV_TRACKED_READ,
> >>>>> > >> >      BDRV_TRACKED_WRITE,
> >>>>> > >> > -    BDRV_TRACKED_FLUSH,
> >>>>> > >> > -    BDRV_TRACKED_IOCTL,
> >>>>> > >> >      BDRV_TRACKED_DISCARD,
> >>> > > Okay, so flush and ioctl are not needed, but why is discard different?
> >> > 
> >> > Discard can modify the contents of the device, so I think it's safer to
> >> > serialize it against RMW and copy-on-read operations.
> > Okay, that makes sense, but ioctl like SG_IO can also modify content, no?
> 
> If you use SG_IO you shouldn't use RMW (because scsi_read_complete traps
> READ CAPACITY and sets the host block size as the guest block size) or
> copy-on-read (because raw has no backing file).
> 
> Besides, BDRV_TRACKED_IOCTL didn't include sector_num/nr_sectors
> operations so it didn't provide serialization.
> 

Yes, that's right. Thanks.

Fam

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (16 preceding siblings ...)
  2016-03-08 17:51 ` [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
@ 2016-03-09  8:46 ` Fam Zheng
  2016-03-16 18:18 ` Stefan Hajnoczi
  18 siblings, 0 replies; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  8:46 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: stefanha, qemu-devel, qemu-block

On Tue, 02/16 18:56, Paolo Bonzini wrote:
> So the fine-grained locking series has grown from 2 parts to 3...
> 
> This first part stops where we remove RFifoLock.

Modulo to the minor comments I had:

Reviewed-by: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread
  2016-03-09  8:30   ` Fam Zheng
@ 2016-03-09  8:55     ` Paolo Bonzini
  2016-03-09  9:10     ` Paolo Bonzini
  1 sibling, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  8:55 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 09:30, Fam Zheng wrote:
> > -        aio_poll(bdrv_get_aio_context(bs), true);
> > +        if (aio_context_in_iothread(ctx)) {
> > +            /* This case should not occur at all, except for the
> > +             * main thread.
> > +             */
> 
> Maybe assert ctx == qemu_get_aio_context()?

Sure.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread
  2016-03-09  8:30   ` Fam Zheng
  2016-03-09  8:55     ` Paolo Bonzini
@ 2016-03-09  9:10     ` Paolo Bonzini
  2016-03-09  9:27       ` Fam Zheng
  1 sibling, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-09  9:10 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-block, qemu-devel, stefanha



On 09/03/2016 09:30, Fam Zheng wrote:
> > -        aio_poll(bdrv_get_aio_context(bs), true);
> > +        if (aio_context_in_iothread(ctx)) {
> > +            /* This case should not occur at all, except for the
> > +             * main thread.
> > +             */
> 
> Maybe assert ctx == qemu_get_aio_context()?

Actually it happens for block/mirror.c's bdrv_drained_begin, but it's safe.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread
  2016-03-09  9:10     ` Paolo Bonzini
@ 2016-03-09  9:27       ` Fam Zheng
  0 siblings, 0 replies; 48+ messages in thread
From: Fam Zheng @ 2016-03-09  9:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-block, qemu-devel, stefanha

On Wed, 03/09 10:10, Paolo Bonzini wrote:
> 
> 
> On 09/03/2016 09:30, Fam Zheng wrote:
> > > -        aio_poll(bdrv_get_aio_context(bs), true);
> > > +        if (aio_context_in_iothread(ctx)) {
> > > +            /* This case should not occur at all, except for the
> > > +             * main thread.
> > > +             */
> > 
> > Maybe assert ctx == qemu_get_aio_context()?
> 
> Actually it happens for block/mirror.c's bdrv_drained_begin, but it's safe.

Oh yes, then we cannot assert, and the comment need adjustion. Thanks for
pointing out.

Fam

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time Paolo Bonzini
  2016-03-09  3:41   ` Fam Zheng
@ 2016-03-16 16:39   ` Stefan Hajnoczi
  2016-03-16 17:41     ` Paolo Bonzini
  1 sibling, 1 reply; 48+ messages in thread
From: Stefan Hajnoczi @ 2016-03-16 16:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 3880 bytes --]

On Tue, Feb 16, 2016 at 06:56:19PM +0100, Paolo Bonzini wrote:
> bdrv_requests_pending is checking children to also wait until internal
> requests (such as metadata writes) have completed.  However, checking
> children is in general overkill because, apart from this special case,
> the parent's in_flight count will always be incremented by at least one
> for every request in the child.
> 
> Since internal requests are only generated by the parent in the child,
> instead visit the tree parent first, and then wait for internal I/O in
> the children to complete.

This assumption is true if the BDS graph is a tree.  When there a
multiple roots (i.e. it's a directed acyclic graph), especially with
roots at different levels, then I wonder if pre-order bdrv_drain
traversal leaves open the possibility that I/Os sneak into parts of the
BDS graph that we thought was already drained.

The advantage of the current approach is that it really wait until the
whole (sub)tree is idle.

Concrete example: image fleecing involves adding an ephemeral qcow2 file
which is exported via NBD.  The guest keeps writing to the disk image as
normal but the original data will be backed up to the ephemeral qcow2
file before being overwritten, so that the client sees a point-in-time
snapshot of the disk image.

The tree looks like this:

          [NBD export]
             /
            v
[guest] temporary qcow2
   \    /
    v  v
    disk

Block backend access is in square brackets.  Nodes without square
brackets are BDS nodes.

If the guest wants to drain the disk, it's possible for new I/O requests
to enter the disk BDS while we're recursing to disk's children because
the NBD export socket fd is in the same AIOContext.  The socket fd is
therefore handled during aio_poll() calls.

I'm not 100% sure that this is a problem, but I wonder if you've thought
about this?

> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/io.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index a9a23a6..e0c9215 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -249,6 +249,23 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>      }
>  }
>  
> +static bool bdrv_drain_io_recurse(BlockDriverState *bs)
> +{
> +    BdrvChild *child;
> +    bool waited = false;
> +
> +    while (atomic_read(&bs->in_flight) > 0) {
> +        aio_poll(bdrv_get_aio_context(bs), true);
> +        waited = true;
> +    }
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        waited |= bdrv_drain_io_recurse(child->bs);
> +    }
> +
> +    return waited;
> +}
> +
>  /*
>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>   * and suspend block driver's internal I/O until next request arrives.
> @@ -265,10 +282,7 @@ void bdrv_drain(BlockDriverState *bs)
>      bdrv_no_throttling_begin(bs);
>      bdrv_io_unplugged_begin(bs);
>      bdrv_drain_recurse(bs);
> -    while (bdrv_requests_pending(bs)) {
> -        /* Keep iterating */
> -         aio_poll(bdrv_get_aio_context(bs), true);
> -    }
> +    bdrv_drain_io_recurse(bs);
>      bdrv_io_unplugged_end(bs);
>      bdrv_no_throttling_end(bs);
>  }
> @@ -319,10 +333,7 @@ void bdrv_drain_all(void)
>              aio_context_acquire(aio_context);
>              while ((bs = bdrv_next(bs))) {
>                  if (aio_context == bdrv_get_aio_context(bs)) {
> -                    if (bdrv_requests_pending(bs)) {
> -                        aio_poll(aio_context, true);
> -                        waited = true;
> -                    }
> +                    waited |= bdrv_drain_io_recurse(bs);
>                  }
>              }
>              aio_context_release(aio_context);
> -- 
> 2.5.0
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time
  2016-03-16 16:39   ` Stefan Hajnoczi
@ 2016-03-16 17:41     ` Paolo Bonzini
  2016-03-17  0:57       ` Fam Zheng
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-16 17:41 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel, qemu-block



On 16/03/2016 17:39, Stefan Hajnoczi wrote:
> The tree looks like this:
> 
>           [NBD export]
>              /
>             v
> [guest] temporary qcow2
>    \    /
>     v  v
>     disk
> 
> Block backend access is in square brackets.  Nodes without square
> brackets are BDS nodes.
> 
> If the guest wants to drain the disk, it's possible for new I/O requests
> to enter the disk BDS while we're recursing to disk's children because
> the NBD export socket fd is in the same AIOContext.  The socket fd is
> therefore handled during aio_poll() calls.
> 
> I'm not 100% sure that this is a problem, but I wonder if you've thought
> about this?

I hadn't, but I think this is handled by using
bdrv_drained_begin/bdrv_drained_end instead of bdrv_drain.  The NBD
export registers its callback as "external", and it is thus disabled
between bdrv_drained_begin and bdrv_drained_end.

It will indeed become more complex when BDSes won't have anymore a "home
AioContext" due to multiqueue.  I suspect that we should rethink the
strategy for enabling and disabling external callbacks.  For example we
could add callbacks to each BlockBackend that enable/disable external
callbacks, and when bdrv_drained_begin is called on a BDS, we call the
callbacks for all BlockBackends that are included in this BDS.  I'm not
sure if there's a way to go from a BDS to all the BBs above it.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 08/16] blockjob: introduce .drain callback for jobs
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 08/16] blockjob: introduce .drain callback for jobs Paolo Bonzini
@ 2016-03-16 17:56   ` Stefan Hajnoczi
  0 siblings, 0 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2016-03-16 17:56 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 5443 bytes --]

On Tue, Feb 16, 2016 at 06:56:20PM +0100, Paolo Bonzini wrote:
> This is required to decouple block jobs from running in an
> AioContext.  With multiqueue block devices, a BlockDriverState
> does not really belong to a single AioContext.
> 
> The solution is to first wait until all I/O operations are
> complete; then loop in the main thread for the block job to
> complete entirely.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/backup.c           |  7 +++++++
>  block/mirror.c           | 12 ++++++++++--
>  blockjob.c               | 16 +++++++++++++---
>  include/block/blockjob.h |  7 +++++++
>  4 files changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/block/backup.c b/block/backup.c
> index 00cafdb..2edb895 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -251,6 +251,12 @@ static void backup_abort(BlockJob *job)
>      }
>  }
>  
> +static void backup_drain(BlockJob *job)
> +{
> +    BackupBlockJob *s = container_of(job, BackupBlockJob, common);
> +    bdrv_drain(s->target);
> +}
> +
>  static const BlockJobDriver backup_job_driver = {
>      .instance_size  = sizeof(BackupBlockJob),
>      .job_type       = BLOCK_JOB_TYPE_BACKUP,
> @@ -258,6 +264,7 @@ static const BlockJobDriver backup_job_driver = {
>      .iostatus_reset = backup_iostatus_reset,
>      .commit         = backup_commit,
>      .abort          = backup_abort,
> +    .drain          = backup_drain,
>  };
>  
>  static BlockErrorAction backup_error_action(BackupBlockJob *job,
> diff --git a/block/mirror.c b/block/mirror.c
> index 3f163b8..b473a1b 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -358,7 +358,7 @@ static void mirror_free_init(MirrorBlockJob *s)
>      }
>  }
>  
> -static void mirror_drain(MirrorBlockJob *s)
> +static void mirror_wait_for_completion(MirrorBlockJob *s)
>  {
>      while (s->in_flight > 0) {
>          s->waiting_for_io = true;
> @@ -627,7 +627,7 @@ immediate_exit:
>           * the target is a copy of the source.
>           */
>          assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
> -        mirror_drain(s);
> +        mirror_wait_for_completion(s);
>      }
>  
>      assert(s->in_flight == 0);
> @@ -708,12 +708,19 @@ static void mirror_complete(BlockJob *job, Error **errp)
>      block_job_enter(&s->common);
>  }
>  
> +static void mirror_drain(BlockJob *job)
> +{
> +    MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
> +    bdrv_drain(s->target);
> +}
> +
>  static const BlockJobDriver mirror_job_driver = {
>      .instance_size = sizeof(MirrorBlockJob),
>      .job_type      = BLOCK_JOB_TYPE_MIRROR,
>      .set_speed     = mirror_set_speed,
>      .iostatus_reset= mirror_iostatus_reset,
>      .complete      = mirror_complete,
> +    .drain         = mirror_drain,
>  };
>  
>  static const BlockJobDriver commit_active_job_driver = {
> @@ -723,6 +730,7 @@ static const BlockJobDriver commit_active_job_driver = {
>      .iostatus_reset
>                     = mirror_iostatus_reset,
>      .complete      = mirror_complete,
> +    .drain         = mirror_drain,
>  };
>  
>  static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target,
> diff --git a/blockjob.c b/blockjob.c
> index 9fc37ca..5bb6f9b 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -289,16 +289,26 @@ static int block_job_finish_sync(BlockJob *job,
>      assert(bs->job == job);
>  
>      block_job_ref(job);
> +
> +    /* finish will call block_job_enter (see e.g. block_job_cancel,
> +     * or mirror_complete in block/mirror.c).  Barring bugs in the
> +     * job coroutine, bdrv_drain should be enough to induce progress
> +     * until the job completes or moves to the main thread.
> +    */
>      finish(job, &local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
>          block_job_unref(job);
>          return -EBUSY;
>      }
> +    while (!job->deferred_to_main_loop && !job->completed) {
> +        bdrv_drain(bs);
> +        if (job->driver->drain) {
> +            job->driver->drain(job);
> +        }
> +    }
>      while (!job->completed) {
> -        aio_poll(job->deferred_to_main_loop ? qemu_get_aio_context() :
> -                                              bdrv_get_aio_context(bs),
> -                 true);
> +        aio_poll(qemu_get_aio_context(), true);
>      }

This adds the assumption that block_job_defer_to_main_loop() is only
used to complete the block job.  If we really need to make this
assumption then we could rename it complete_in_main_loop() or similar so
that is clear.

>      ret = (job->cancelled && job->ret == 0) ? -ECANCELED : job->ret;
>      block_job_unref(job);
> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
> index 8bedc49..8e564df 100644
> --- a/include/block/blockjob.h
> +++ b/include/block/blockjob.h
> @@ -70,6 +70,13 @@ typedef struct BlockJobDriver {
>       * never both.
>       */
>      void (*abort)(BlockJob *job);
> +
> +    /**
> +     * If the callback is not NULL, it will be invoked when the job has to be
> +     * synchronously cancelled or completed; it should drain BlockDriverStates
> +     * as required to ensure progress.
> +     */
> +    void (*drain)(BlockJob *job);
>  } BlockJobDriver;
>  
>  /**
> -- 
> 2.5.0
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests
  2016-02-16 17:56 ` [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests Paolo Bonzini
  2016-03-09  8:13   ` Fam Zheng
@ 2016-03-16 18:04   ` Stefan Hajnoczi
  1 sibling, 0 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2016-03-16 18:04 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 594 bytes --]

On Tue, Feb 16, 2016 at 06:56:21PM +0100, Paolo Bonzini wrote:
> Synchronous I/O should in general happen either in the main thread (e.g.
> for bdrv_open and bdrv_create) or between bdrv_drained_begin and
> bdrv_drained_end.  Therefore, the simplest way to wait for it to finish
> is to wait for _all_ pending I/O to complete.

This seems okay although it opens the door to performance degradation
due to waiting for all requests instead of one specific request.  That
shouldn't be a problem since synchronous code paths are usually not
intended to have parallel I/O going on at the same time.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
  2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
                   ` (17 preceding siblings ...)
  2016-03-09  8:46 ` Fam Zheng
@ 2016-03-16 18:18 ` Stefan Hajnoczi
  2016-03-16 22:29   ` Paolo Bonzini
  18 siblings, 1 reply; 48+ messages in thread
From: Stefan Hajnoczi @ 2016-03-16 18:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 6746 bytes --]

On Tue, Feb 16, 2016 at 06:56:12PM +0100, Paolo Bonzini wrote:
> So the fine-grained locking series has grown from 2 parts to 3...
> 
> This first part stops where we remove RFifoLock.
> 
> In the previous submission, the bulk of the series took care of
> associating an AioContext to a thread, so that aio_poll could run event
> handlers only on that thread.  This was done to fix a race in bdrv_drain.
> There were two disadvantages:
> 
> 1) reporting progress from aio_poll to the main thread was Really Hard.
> Despite being relatively sure of the correctness---also thanks to formal
> models---it's not the kind of code that I'd lightly donate to posterity.
> 
> 2) the strict dependency between bdrv_drain and a single AioContext
> would have been bad for multiqueue.
> 
> Instead, this series does the same trick (do not run dataplane event handlers
> from the main thread) but reports progress directly at the BlockDriverState
> level.
> 
> To do so, the mechanism to track in-flight requests is changed to a
> simple counter.  This is more flexible than BdrvTrackedRequest, and lets
> the block/io.c code track precisely the initiation and completion of the
> requests.  In particular, the counter remains nonzero when a bottom half
> is required to "really" complete the request.  bdrv_drain doesn't rely
> anymore on a non-blocking aio_poll to execute those bottom halves.
> 
> It is also more efficient; while BdrvTrackedRequest has to remain
> for request serialisation, with this series we can in principle make
> BdrvTrackedRequest optional and enable it only when something (automatic
> RMW or copy-on-read) requires request serialisation.
> 
> In general this flows nicely, the only snag being patch 5.  The problem
> here is that mirror is calling bdrv_drain from an AIO callback, which
> used to be okay (because bdrv_drain more or less tried to guess when
> all AIO callbacks were done) but now causes a deadlock (because bdrv_drain
> really checks for AIO callbacks to be complete).  To add to the complication,
> the bdrv_drain happens far away from the AIO callback, in the coroutine that
> the AIO callback enters.  The situation here is admittedly underdefined,
> and Stefan pointed out that the same solution is found in many other places
> in the QEMU block layer.  Therefore I think it's acceptable.  I'm pointing
> it out explicitly though, because (together with patch 8) it is an example
> of the issues caused by the new, stricter definition of bdrv_drain.
> 
> Patches 1-7 organize bdrv_drain into separate phases, with a well-defined
> order between the BDS and it children.  The phases are: 1) disable
> throttling; 2) disable io_plug; 3) call bdrv_drain callbacks; 4) wait
> for I/O to finish; 5) re-enable io_plug and throttling.  Previously,
> throttling of throttling and io_plug were handled by bdrv_flush_io_queue,
> which was repeatedly called as part of the I/O wait loop.  This part of
> the series removes bdrv_flush_io_queue.
> 
> Patch 8-11 replace aio_poll with bdrv_drain so that the work done
> so far applies to all former callers of aio_poll.
> 
> Patches 12-13 introduce the logic that lets the main thread wait remotely
> for an iothread to drain the BDS.  Patches 14-16 then achieve the purpose
> of this series, removing the long-running AioContext critical section
> from iothread_run and removing RFifoLock.
> 
> The series passes check-block.sh with a few fixes applied for pre-existing
> bugs:
> 
>     vl: fix migration from prelaunch state
>     migration: fix incorrect memory_global_dirty_log_start outside BQL
>     qed: fix bdrv_qed_drain
> 
> Paolo
> 
> Paolo Bonzini (16):
>   block: make bdrv_start_throttled_reqs return void
>   block: move restarting of throttled reqs to block/throttle-groups.c
>   block: introduce bdrv_no_throttling_begin/end
>   block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end
>   mirror: use bottom half to re-enter coroutine
>   block: add BDS field to count in-flight requests
>   block: change drain to look only at one child at a time
>   blockjob: introduce .drain callback for jobs
>   block: wait for all pending I/O when doing synchronous requests
>   nfs: replace aio_poll with bdrv_drain
>   sheepdog: disable dataplane
>   aio: introduce aio_context_in_iothread
>   block: only call aio_poll from iothread
>   iothread: release AioContext around aio_poll
>   qemu-thread: introduce QemuRecMutex
>   aio: convert from RFifoLock to QemuRecMutex
> 
>  async.c                         |  28 +---
>  block.c                         |   4 +-
>  block/backup.c                  |   7 +
>  block/io.c                      | 281 +++++++++++++++++++++++++---------------
>  block/linux-aio.c               |  13 +-
>  block/mirror.c                  |  37 +++++-
>  block/nfs.c                     |  50 ++++---
>  block/qed-table.c               |  16 +--
>  block/qed.c                     |   4 +-
>  block/raw-aio.h                 |   2 +-
>  block/raw-posix.c               |  16 +--
>  block/sheepdog.c                |  19 +++
>  block/throttle-groups.c         |  19 +++
>  blockjob.c                      |  16 ++-
>  docs/multiple-iothreads.txt     |  40 +++---
>  include/block/aio.h             |  13 +-
>  include/block/block.h           |   3 +-
>  include/block/block_int.h       |  22 +++-
>  include/block/blockjob.h        |   7 +
>  include/block/throttle-groups.h |   1 +
>  include/qemu/rfifolock.h        |  54 --------
>  include/qemu/thread-posix.h     |   6 +
>  include/qemu/thread-win32.h     |  10 ++
>  include/qemu/thread.h           |   3 +
>  iothread.c                      |  20 +--
>  stubs/Makefile.objs             |   1 +
>  stubs/iothread.c                |   8 ++
>  tests/.gitignore                |   1 -
>  tests/Makefile                  |   2 -
>  tests/qemu-iotests/060          |   8 +-
>  tests/qemu-iotests/060.out      |   4 +-
>  tests/test-aio.c                |  22 ++--
>  tests/test-rfifolock.c          |  91 -------------
>  util/Makefile.objs              |   1 -
>  util/qemu-thread-posix.c        |  13 ++
>  util/qemu-thread-win32.c        |  25 ++++
>  util/rfifolock.c                |  78 -----------
>  37 files changed, 471 insertions(+), 474 deletions(-)
>  delete mode 100644 include/qemu/rfifolock.h
>  create mode 100644 stubs/iothread.c
>  delete mode 100644 tests/test-rfifolock.c
>  delete mode 100644 util/rfifolock.c

Looks good overall.  I'm a little nervous about merging it for QEMU 2.6
but the block job, NBD, and data plane tests should give it a good
workout.

I have posted comments on a few patches.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
  2016-03-16 18:18 ` Stefan Hajnoczi
@ 2016-03-16 22:29   ` Paolo Bonzini
  2016-03-17 13:44     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-16 22:29 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block



On 16/03/2016 19:18, Stefan Hajnoczi wrote:
> Looks good overall.  I'm a little nervous about merging it for QEMU 2.6
> but the block job, NBD, and data plane tests should give it a good
> workout.

Apart from QEMU nearing hard freeze, I totally understand not wanting to
commit to merging part 1 of n where n will probably be a dozen or so.
I'm open to experimenting with different models for handling long-term
contributions.

For example, each part will probably have an uncontroversial and
generally useful prefix---for example patches 1-4 in this case, or the
change to a single linux-aio context per iothread.  You could merge
those only, and for the rest, I will maintain myself a branch with R-b
from maintainers.  Master will be periodically merged into it, but not
too frequently---it could be only after each part is accepted, or when
there is some important bugfix to catch.  Once the whole multiqueue
thing gets somewhere I would send you a pull request with the entire
feature, which would consist of say 200 patches all with a Reviewed-by
already.

This is just a possibility; if you have any other idea, I'd be happy to
follow it.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time
  2016-03-16 17:41     ` Paolo Bonzini
@ 2016-03-17  0:57       ` Fam Zheng
  0 siblings, 0 replies; 48+ messages in thread
From: Fam Zheng @ 2016-03-17  0:57 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, qemu-block, qemu-devel, Stefan Hajnoczi

On Wed, 03/16 18:41, Paolo Bonzini wrote:
> 
> 
> On 16/03/2016 17:39, Stefan Hajnoczi wrote:
> > The tree looks like this:
> > 
> >           [NBD export]
> >              /
> >             v
> > [guest] temporary qcow2
> >    \    /
> >     v  v
> >     disk
> > 
> > Block backend access is in square brackets.  Nodes without square
> > brackets are BDS nodes.
> > 
> > If the guest wants to drain the disk, it's possible for new I/O requests
> > to enter the disk BDS while we're recursing to disk's children because
> > the NBD export socket fd is in the same AIOContext.  The socket fd is
> > therefore handled during aio_poll() calls.
> > 
> > I'm not 100% sure that this is a problem, but I wonder if you've thought
> > about this?
> 
> I hadn't, but I think this is handled by using
> bdrv_drained_begin/bdrv_drained_end instead of bdrv_drain.  The NBD
> export registers its callback as "external", and it is thus disabled
> between bdrv_drained_begin and bdrv_drained_end.
> 
> It will indeed become more complex when BDSes won't have anymore a "home

It probably means BBs won't have a "home AioContext" too, in that case.

> AioContext" due to multiqueue.  I suspect that we should rethink the
> strategy for enabling and disabling external callbacks.  For example we
> could add callbacks to each BlockBackend that enable/disable external
> callbacks, and when bdrv_drained_begin is called on a BDS, we call the
> callbacks for all BlockBackends that are included in this BDS.  I'm not
> sure if there's a way to go from a BDS to all the BBs above it.

If none of the bdrv_drained_begin callers is in hot path, I think we can simply
call disable on each aio context that can send request to BB/BDS.

Fam

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
  2016-03-16 22:29   ` Paolo Bonzini
@ 2016-03-17 13:44     ` Stefan Hajnoczi
  2016-03-17 13:48       ` Paolo Bonzini
  0 siblings, 1 reply; 48+ messages in thread
From: Stefan Hajnoczi @ 2016-03-17 13:44 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-block, qemu-devel, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1547 bytes --]

On Wed, Mar 16, 2016 at 11:29:02PM +0100, Paolo Bonzini wrote:
> On 16/03/2016 19:18, Stefan Hajnoczi wrote:
> > Looks good overall.  I'm a little nervous about merging it for QEMU 2.6
> > but the block job, NBD, and data plane tests should give it a good
> > workout.
> 
> Apart from QEMU nearing hard freeze, I totally understand not wanting to
> commit to merging part 1 of n where n will probably be a dozen or so.
> I'm open to experimenting with different models for handling long-term
> contributions.
> 
> For example, each part will probably have an uncontroversial and
> generally useful prefix---for example patches 1-4 in this case, or the
> change to a single linux-aio context per iothread.  You could merge
> those only, and for the rest, I will maintain myself a branch with R-b
> from maintainers.  Master will be periodically merged into it, but not
> too frequently---it could be only after each part is accepted, or when
> there is some important bugfix to catch.  Once the whole multiqueue
> thing gets somewhere I would send you a pull request with the entire
> feature, which would consist of say 200 patches all with a Reviewed-by
> already.
> 
> This is just a possibility; if you have any other idea, I'd be happy to
> follow it.

That sounds reasonable.  I guess you are sending a) infrastructure and safe
changes alongside b) longer-term work.  If you indicate which patches
are a) then that makes it easier to merge parts into qemu.git before all
the long-term work is complete.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
  2016-03-17 13:44     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2016-03-17 13:48       ` Paolo Bonzini
  2016-03-18 15:49         ` Stefan Hajnoczi
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2016-03-17 13:48 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, Stefan Hajnoczi



On 17/03/2016 14:44, Stefan Hajnoczi wrote:
> > For example, each part will probably have an uncontroversial and
> > generally useful prefix---for example patches 1-4 in this case, or the
> > change to a single linux-aio context per iothread.  You could merge
> > those only, and for the rest, I will maintain myself a branch with R-b
> > from maintainers.  Master will be periodically merged into it, but not
> > too frequently---it could be only after each part is accepted, or when
> > there is some important bugfix to catch.  Once the whole multiqueue
> > thing gets somewhere I would send you a pull request with the entire
> > feature, which would consist of say 200 patches all with a Reviewed-by
> > already.
> > 
> > This is just a possibility; if you have any other idea, I'd be happy to
> > follow it.
>
> That sounds reasonable.  I guess you are sending a) infrastructure and safe
> changes alongside b) longer-term work.  If you indicate which patches
> are a) then that makes it easier to merge parts into qemu.git before all
> the long-term work is complete.

Great, let's try it then.  For this series (well, for v2 of this series)
only patches 1-4 would be considered infrastructure.  They were sent
before soft freeze, would they be acceptable for 2.6?

In general I would send "safe" patches as [PATCH mm/nn] and everything
else as [PATCH multiqueue mm/nn] or similar, but in either case I'd be
seeking formal maintainer review as soon as I send them.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite
  2016-03-17 13:48       ` Paolo Bonzini
@ 2016-03-18 15:49         ` Stefan Hajnoczi
  0 siblings, 0 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2016-03-18 15:49 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Stefan Hajnoczi, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1693 bytes --]

On Thu, Mar 17, 2016 at 02:48:00PM +0100, Paolo Bonzini wrote:
> 
> 
> On 17/03/2016 14:44, Stefan Hajnoczi wrote:
> > > For example, each part will probably have an uncontroversial and
> > > generally useful prefix---for example patches 1-4 in this case, or the
> > > change to a single linux-aio context per iothread.  You could merge
> > > those only, and for the rest, I will maintain myself a branch with R-b
> > > from maintainers.  Master will be periodically merged into it, but not
> > > too frequently---it could be only after each part is accepted, or when
> > > there is some important bugfix to catch.  Once the whole multiqueue
> > > thing gets somewhere I would send you a pull request with the entire
> > > feature, which would consist of say 200 patches all with a Reviewed-by
> > > already.
> > > 
> > > This is just a possibility; if you have any other idea, I'd be happy to
> > > follow it.
> >
> > That sounds reasonable.  I guess you are sending a) infrastructure and safe
> > changes alongside b) longer-term work.  If you indicate which patches
> > are a) then that makes it easier to merge parts into qemu.git before all
> > the long-term work is complete.
> 
> Great, let's try it then.  For this series (well, for v2 of this series)
> only patches 1-4 would be considered infrastructure.  They were sent
> before soft freeze, would they be acceptable for 2.6?
> 
> In general I would send "safe" patches as [PATCH mm/nn] and everything
> else as [PATCH multiqueue mm/nn] or similar, but in either case I'd be
> seeking formal maintainer review as soon as I send them.

Okay.  I'll hope over to the v2 series to take a look.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2016-03-18 15:49 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-16 17:56 [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 01/16] block: make bdrv_start_throttled_reqs return void Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 02/16] block: move restarting of throttled reqs to block/throttle-groups.c Paolo Bonzini
2016-03-09  1:26   ` Fam Zheng
2016-03-09  7:37     ` Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 03/16] block: introduce bdrv_no_throttling_begin/end Paolo Bonzini
2016-03-09  1:45   ` Fam Zheng
2016-03-09  7:40     ` Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 04/16] block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 05/16] mirror: use bottom half to re-enter coroutine Paolo Bonzini
2016-03-09  3:19   ` Fam Zheng
2016-03-09  7:41     ` Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 06/16] block: add BDS field to count in-flight requests Paolo Bonzini
2016-03-09  3:35   ` Fam Zheng
2016-03-09  7:43     ` Paolo Bonzini
2016-03-09  8:00       ` Fam Zheng
2016-03-09  8:22         ` Paolo Bonzini
2016-03-09  8:33           ` Fam Zheng
2016-02-16 17:56 ` [Qemu-devel] [PATCH 07/16] block: change drain to look only at one child at a time Paolo Bonzini
2016-03-09  3:41   ` Fam Zheng
2016-03-09  7:49     ` Paolo Bonzini
2016-03-16 16:39   ` Stefan Hajnoczi
2016-03-16 17:41     ` Paolo Bonzini
2016-03-17  0:57       ` Fam Zheng
2016-02-16 17:56 ` [Qemu-devel] [PATCH 08/16] blockjob: introduce .drain callback for jobs Paolo Bonzini
2016-03-16 17:56   ` Stefan Hajnoczi
2016-02-16 17:56 ` [Qemu-devel] [PATCH 09/16] block: wait for all pending I/O when doing synchronous requests Paolo Bonzini
2016-03-09  8:13   ` Fam Zheng
2016-03-09  8:23     ` Paolo Bonzini
2016-03-16 18:04   ` Stefan Hajnoczi
2016-02-16 17:56 ` [Qemu-devel] [PATCH 10/16] nfs: replace aio_poll with bdrv_drain Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 11/16] sheepdog: disable dataplane Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 12/16] aio: introduce aio_context_in_iothread Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 13/16] block: only call aio_poll from iothread Paolo Bonzini
2016-03-09  8:30   ` Fam Zheng
2016-03-09  8:55     ` Paolo Bonzini
2016-03-09  9:10     ` Paolo Bonzini
2016-03-09  9:27       ` Fam Zheng
2016-02-16 17:56 ` [Qemu-devel] [PATCH 14/16] iothread: release AioContext around aio_poll Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 15/16] qemu-thread: introduce QemuRecMutex Paolo Bonzini
2016-02-16 17:56 ` [Qemu-devel] [PATCH 16/16] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
2016-03-08 17:51 ` [Qemu-devel] [PATCH 00/16] AioContext fine-grained locking, part 1 of 3, including bdrv_drain rewrite Paolo Bonzini
2016-03-09  8:46 ` Fam Zheng
2016-03-16 18:18 ` Stefan Hajnoczi
2016-03-16 22:29   ` Paolo Bonzini
2016-03-17 13:44     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2016-03-17 13:48       ` Paolo Bonzini
2016-03-18 15:49         ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).