* [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations
@ 2017-04-07  6:54 Fam Zheng
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 1/6] block: Fix unpaired aio_disable_external in external snapshot Fam Zheng
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  6:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Fam Zheng, Kevin Wolf,
	Max Reitz, Eric Blake, Stefan Hajnoczi

v2: - Drop patch 4 of v1. On second thought, neither it nor Kevin's
      suggestion to move the BH processing to bdrv_drain_recurse/BDRV_POLL_WHILE
      is a complete fix, so leave that for a separate patch.
    - Add rev-by to patches 1, 3, 4.
    - Split the new assertions out of v1's patch 1 into the new patch 2. [Kevin]
    - Rewrite patch 5. Fix the block job's coroutine when a BDS is moved to a
      different aio context. [Kevin]
    - Add patch 6.

Crashes are reported on dataplane devices when doing snapshot and commit under
guest I/O.

With this series, Ed's test case '176' now passes (rerun 10+ times):

    https://github.com/skyportsystems/qemu-1/commits/eswierk-iotests-2.9

The biggest fix is patch 5, which addresses a race condition between the main
thread and the iothread.

Fam Zheng (6):
  block: Fix unpaired aio_disable_external in external snapshot
  block: Assert attached child node has right aio context
  mirror: Fix aio context of mirror_top_bs
  block: Quiesce old aio context during bdrv_set_aio_context
  coroutine: Explicitly specify AioContext when entering coroutine
  tests/block-job-txn: Don't start block job before adding to txn

 block.c                    | 30 ++++++++++++++++++++++++++----
 block/blkdebug.c           |  4 ++--
 block/blkverify.c          |  8 ++++----
 block/block-backend.c      |  4 ++--
 block/io.c                 | 14 +++++++-------
 block/mirror.c             |  3 ++-
 block/quorum.c             | 16 ++++++++--------
 block/sheepdog.c           |  4 ++--
 blockdev.c                 |  4 ++--
 blockjob.c                 |  4 ++--
 hw/9pfs/9p.c               |  4 ++--
 hw/9pfs/coth.c             |  4 ++--
 include/block/block.h      | 11 +++++++++++
 include/qemu/coroutine.h   | 11 ++++++-----
 include/qemu/main-loop.h   |  2 +-
 migration/colo.c           |  3 ++-
 migration/migration.c      |  2 +-
 nbd/server.c               |  5 +++--
 qemu-img.c                 |  4 ++--
 qemu-io-cmds.c             |  2 +-
 tests/test-blockjob-txn.c  |  6 +++++-
 tests/test-coroutine.c     | 41 +++++++++++++++++++++--------------------
 tests/test-thread-pool.c   |  2 +-
 util/async.c               |  4 ++--
 util/qemu-coroutine-io.c   |  3 ++-
 util/qemu-coroutine-lock.c |  6 ++++--
 util/qemu-coroutine.c      | 10 +++++-----
 util/trace-events          |  2 +-
 28 files changed, 129 insertions(+), 84 deletions(-)

-- 
2.9.3


* [Qemu-devel] [PATCH v2 1/6] block: Fix unpaired aio_disable_external in external snapshot
  2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
@ 2017-04-07  6:54 ` Fam Zheng
  2017-04-07 13:23   ` Stefan Hajnoczi
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 2/6] block: Assert attached child node has right aio context Fam Zheng
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  6:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Fam Zheng, Kevin Wolf,
	Max Reitz, Eric Blake, Stefan Hajnoczi

bdrv_replace_child_noperm tries to hand over the quiesce_counter state
from the old bs to the new one, but if the two are not in the same aio
context this results in unpaired aio_disable_external/aio_enable_external
calls.

Fix this by setting the correct aio context before calling
bdrv_append().
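
Roughly, the intended ordering in external_snapshot_prepare() then becomes
(a simplified sketch; the actual change is in the diff below):

    /* move new_bs into the AioContext of the node it is about to replace */
    bdrv_set_aio_context(state->new_bs, state->aio_context);

    /* now bdrv_append() -> bdrv_replace_child_noperm() hands over the
     * quiesce_counter state between nodes that share one AioContext */
    bdrv_append(state->new_bs, state->old_bs, &local_err);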

Reported-by: Ed Swierk <eswierk@skyportsystems.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 blockdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 040c152..4927914 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1772,6 +1772,8 @@ static void external_snapshot_prepare(BlkActionState *common,
         return;
     }
 
+    bdrv_set_aio_context(state->new_bs, state->aio_context);
+
     /* This removes our old bs and adds the new bs. This is an operation that
      * can fail, so we need to do it in .prepare; undoing it for abort is
      * always possible. */
@@ -1789,8 +1791,6 @@ static void external_snapshot_commit(BlkActionState *common)
     ExternalSnapshotState *state =
                              DO_UPCAST(ExternalSnapshotState, common, common);
 
-    bdrv_set_aio_context(state->new_bs, state->aio_context);
-
     /* We don't need (or want) to use the transactional
      * bdrv_reopen_multiple() across all the entries at once, because we
      * don't want to abort all of them if one of them fails the reopen */
-- 
2.9.3


* [Qemu-devel] [PATCH v2 2/6] block: Assert attached child node has right aio context
  2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 1/6] block: Fix unpaired aio_disable_external in external snapshot Fam Zheng
@ 2017-04-07  6:54 ` Fam Zheng
  2017-04-07 13:24   ` Stefan Hajnoczi
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 3/6] mirror: Fix aio context of mirror_top_bs Fam Zheng
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  6:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Fam Zheng, Kevin Wolf,
	Max Reitz, Eric Blake, Stefan Hajnoczi

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block.c b/block.c
index 927ba89..b8a3011 100644
--- a/block.c
+++ b/block.c
@@ -1752,6 +1752,9 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 {
     BlockDriverState *old_bs = child->bs;
 
+    if (old_bs && new_bs) {
+        assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
+    }
     if (old_bs) {
         if (old_bs->quiesce_counter && child->role->drained_end) {
             child->role->drained_end(child);
@@ -1852,6 +1855,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
     bdrv_get_cumulative_perm(parent_bs, &perm, &shared_perm);
 
     assert(parent_bs->drv);
+    assert(bdrv_get_aio_context(parent_bs) == bdrv_get_aio_context(child_bs));
     parent_bs->drv->bdrv_child_perm(parent_bs, NULL, child_role,
                                     perm, shared_perm, &perm, &shared_perm);
 
-- 
2.9.3


* [Qemu-devel] [PATCH v2 3/6] mirror: Fix aio context of mirror_top_bs
  2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 1/6] block: Fix unpaired aio_disable_external in external snapshot Fam Zheng
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 2/6] block: Assert attached child node has right aio context Fam Zheng
@ 2017-04-07  6:54 ` Fam Zheng
  2017-04-07 13:24   ` Stefan Hajnoczi
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context Fam Zheng
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  6:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Fam Zheng, Kevin Wolf,
	Max Reitz, Eric Blake, Stefan Hajnoczi

mirror_top_bs should be moved to the same AioContext as the source before
it is inserted into the graph.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/mirror.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/mirror.c b/block/mirror.c
index 9e2fecc..e904fef 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1148,6 +1148,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
         return;
     }
     mirror_top_bs->total_sectors = bs->total_sectors;
+    bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
 
     /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
      * it alive until block_job_create() even if bs has no parent. */
-- 
2.9.3


* [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context
  2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
                   ` (2 preceding siblings ...)
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 3/6] mirror: Fix aio context of mirror_top_bs Fam Zheng
@ 2017-04-07  6:54 ` Fam Zheng
  2017-04-07 12:50   ` Stefan Hajnoczi
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine Fam Zheng
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  6:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Fam Zheng, Kevin Wolf,
	Max Reitz, Eric Blake, Stefan Hajnoczi

The fact that bs->aio_context is changing can confuse the dataplane
iothread, because of the now fine-grained aio context locking. The
bdrv_drain here should really be a bdrv_drained_begin/end pair, but since
bs->aio_context is changing, we can just use aio_disable_external and
block_job_pause instead.
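
Put together, bdrv_set_aio_context() ends up bracketing the context switch
roughly like this (a sketch condensed from the diff below):

    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_disable_external(ctx);        /* no new requests from devices */
    if (bs->job) {
        block_job_pause(bs->job);     /* and none from the block job */
    }
    bdrv_drain(bs);                   /* wait for in-flight requests */

    /* ... detach from ctx, attach to new_context ... */

    if (bs->job) {
        block_job_resume(bs->job);
    }
    aio_enable_external(ctx);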

Reported-by: Ed Swierk <eswierk@skyportsystems.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index b8a3011..512515a 100644
--- a/block.c
+++ b/block.c
@@ -4396,11 +4396,14 @@ void bdrv_attach_aio_context(BlockDriverState *bs,
 
 void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
 {
-    AioContext *ctx;
+    AioContext *ctx = bdrv_get_aio_context(bs);
 
+    aio_disable_external(ctx);
+    if (bs->job) {
+        block_job_pause(bs->job);
+    }
     bdrv_drain(bs); /* ensure there are no in-flight requests */
 
-    ctx = bdrv_get_aio_context(bs);
     while (aio_poll(ctx, false)) {
         /* wait for all bottom halves to execute */
     }
@@ -4413,6 +4416,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
     aio_context_acquire(new_context);
     bdrv_attach_aio_context(bs, new_context);
     aio_context_release(new_context);
+    if (bs->job) {
+        block_job_resume(bs->job);
+    }
+    aio_enable_external(ctx);
 }
 
 void bdrv_add_aio_context_notifier(BlockDriverState *bs,
-- 
2.9.3


* [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
                   ` (3 preceding siblings ...)
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context Fam Zheng
@ 2017-04-07  6:54 ` Fam Zheng
  2017-04-07  9:57   ` Fam Zheng
                     ` (3 more replies)
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn Fam Zheng
  2017-04-07 12:45 ` [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Kevin Wolf
  6 siblings, 4 replies; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  6:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Fam Zheng, Kevin Wolf,
	Max Reitz, Eric Blake, Stefan Hajnoczi

Coroutine in block layer should always be waken up in bs->aio_context
rather than the "current" context where it is entered. They differ when
the main loop is doing QMP tasks.

Race conditions happen without this patch, because the wrong context is
acquired in co_schedule_bh_cb, while the entered coroutine works on a
different one:

  main loop                                iothread
-----------------------------------------------------------------------
  blockdev_snapshot
    aio_context_acquire(bs->ctx)
    bdrv_flush(bs)
      bdrv_co_flush(bs)
        ...
        qemu_coroutine_yield(co)
      BDRV_POLL_WHILE()
        aio_context_release(bs->ctx)
                                            aio_context_acquire(bs->ctx)
                                              ...
                                                aio_co_wake(co)
        aio_poll(qemu_aio_context)              ...
          co_schedule_bh_cb()                   ...
            qemu_coroutine_enter(co)            ...
              /* (A) bdrv_co_flush(bs)              /* (B) I/O on bs */
                      continues... */
                                            aio_context_release(bs->ctx)

Both (A) and (B) can access resources protected by bs->ctx, but (A) is
not thread-safe.

Make the block layer explicitly specify a desired context for the
entered coroutine. For the rest callers, stick to the old behavior,
qemu_get_aio_context() or qemu_get_current_aio_context().
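
At a typical block-layer call site the conversion looks like this (sketch;
see the block/io.c hunks below for the real instances):

    /* before: the coroutine gets associated with the caller's context */
    co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
    qemu_coroutine_enter(co);

    /* after: the block layer names the AioContext explicitly */
    co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
    bdrv_coroutine_enter(child->bs, co); /* enters in bdrv_get_aio_context(bs) */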

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block.c                    | 15 +++++++++++++--
 block/blkdebug.c           |  4 ++--
 block/blkverify.c          |  8 ++++----
 block/block-backend.c      |  4 ++--
 block/io.c                 | 14 +++++++-------
 block/mirror.c             |  2 +-
 block/quorum.c             | 16 ++++++++--------
 block/sheepdog.c           |  4 ++--
 blockjob.c                 |  4 ++--
 hw/9pfs/9p.c               |  4 ++--
 hw/9pfs/coth.c             |  4 ++--
 include/block/block.h      | 11 +++++++++++
 include/qemu/coroutine.h   | 11 ++++++-----
 include/qemu/main-loop.h   |  2 +-
 migration/colo.c           |  3 ++-
 migration/migration.c      |  2 +-
 nbd/server.c               |  5 +++--
 qemu-img.c                 |  4 ++--
 qemu-io-cmds.c             |  2 +-
 tests/test-coroutine.c     | 41 +++++++++++++++++++++--------------------
 tests/test-thread-pool.c   |  2 +-
 util/async.c               |  4 ++--
 util/qemu-coroutine-io.c   |  3 ++-
 util/qemu-coroutine-lock.c |  6 ++++--
 util/qemu-coroutine.c      | 10 +++++-----
 util/trace-events          |  2 +-
 26 files changed, 108 insertions(+), 79 deletions(-)

diff --git a/block.c b/block.c
index 512515a..6513484 100644
--- a/block.c
+++ b/block.c
@@ -357,10 +357,11 @@ int bdrv_create(BlockDriver *drv, const char* filename,
         /* Fast-path if already in coroutine context */
         bdrv_create_co_entry(&cco);
     } else {
+        AioContext *ctx = qemu_get_aio_context();
         co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
-        qemu_coroutine_enter(co);
+        qemu_coroutine_enter(ctx, co);
         while (cco.ret == NOT_DONE) {
-            aio_poll(qemu_get_aio_context(), true);
+            aio_poll(ctx, true);
         }
     }
 
@@ -4324,6 +4325,16 @@ AioContext *bdrv_get_aio_context(BlockDriverState *bs)
     return bs->aio_context;
 }
 
+void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine *coroutine)
+{
+    qemu_coroutine_enter(bdrv_get_aio_context(bs), coroutine);
+}
+
+void bdrv_coroutine_enter_if_inactive(BlockDriverState *bs, Coroutine *co)
+{
+    qemu_coroutine_enter_if_inactive(bdrv_get_aio_context(bs), co);
+}
+
 static void bdrv_do_remove_aio_context_notifier(BdrvAioNotifier *ban)
 {
     QLIST_REMOVE(ban, list);
diff --git a/block/blkdebug.c b/block/blkdebug.c
index 67e8024..2370f73 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -610,7 +610,7 @@ static int blkdebug_debug_resume(BlockDriverState *bs, const char *tag)
 
     QLIST_FOREACH_SAFE(r, &s->suspended_reqs, next, next) {
         if (!strcmp(r->tag, tag)) {
-            qemu_coroutine_enter(r->co);
+            bdrv_coroutine_enter(bs, r->co);
             return 0;
         }
     }
@@ -636,7 +636,7 @@ static int blkdebug_debug_remove_breakpoint(BlockDriverState *bs,
     }
     QLIST_FOREACH_SAFE(r, &s->suspended_reqs, next, r_next) {
         if (!strcmp(r->tag, tag)) {
-            qemu_coroutine_enter(r->co);
+            bdrv_coroutine_enter(bs, r->co);
             ret = 0;
         }
     }
diff --git a/block/blkverify.c b/block/blkverify.c
index 9a1e21c..c11a636 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -172,7 +172,7 @@ static void coroutine_fn blkverify_do_test_req(void *opaque)
     r->ret = r->request_fn(s->test_file, r->offset, r->bytes, r->qiov,
                            r->flags);
     r->done++;
-    qemu_coroutine_enter_if_inactive(r->co);
+    bdrv_coroutine_enter_if_inactive(r->bs, r->co);
 }
 
 static void coroutine_fn blkverify_do_raw_req(void *opaque)
@@ -182,7 +182,7 @@ static void coroutine_fn blkverify_do_raw_req(void *opaque)
     r->raw_ret = r->request_fn(r->bs->file, r->offset, r->bytes, r->raw_qiov,
                                r->flags);
     r->done++;
-    qemu_coroutine_enter_if_inactive(r->co);
+    bdrv_coroutine_enter_if_inactive(r->bs, r->co);
 }
 
 static int coroutine_fn
@@ -207,8 +207,8 @@ blkverify_co_prwv(BlockDriverState *bs, BlkverifyRequest *r, uint64_t offset,
     co_a = qemu_coroutine_create(blkverify_do_test_req, r);
     co_b = qemu_coroutine_create(blkverify_do_raw_req, r);
 
-    qemu_coroutine_enter(co_a);
-    qemu_coroutine_enter(co_b);
+    bdrv_coroutine_enter(bs, co_a);
+    bdrv_coroutine_enter(bs, co_b);
 
     while (r->done < 2) {
         qemu_coroutine_yield();
diff --git a/block/block-backend.c b/block/block-backend.c
index 0b63773..5eeaad1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1007,7 +1007,7 @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
         co_entry(&rwco);
     } else {
         Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(blk_bs(blk), co);
         BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
     }
 
@@ -1114,7 +1114,7 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
     acb->has_returned = false;
 
     co = qemu_coroutine_create(co_entry, acb);
-    qemu_coroutine_enter(co);
+    bdrv_coroutine_enter(blk_bs(blk), co);
 
     acb->has_returned = true;
     if (acb->rwco.ret != NOT_DONE) {
diff --git a/block/io.c b/block/io.c
index 2709a70..30083a7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -616,7 +616,7 @@ static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
         bdrv_rw_co_entry(&rwco);
     } else {
         co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(child->bs, co);
         BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
     }
     return rwco.ret;
@@ -1873,7 +1873,7 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
     } else {
         co = qemu_coroutine_create(bdrv_get_block_status_above_co_entry,
                                    &data);
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, !data.done);
     }
     return data.ret;
@@ -1999,7 +1999,7 @@ bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         };
         Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
 
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(bs, co);
         while (data.ret == -EINPROGRESS) {
             aio_poll(bdrv_get_aio_context(bs), true);
         }
@@ -2216,7 +2216,7 @@ static BlockAIOCB *bdrv_co_aio_prw_vector(BdrvChild *child,
     acb->is_write = is_write;
 
     co = qemu_coroutine_create(bdrv_co_do_rw, acb);
-    qemu_coroutine_enter(co);
+    bdrv_coroutine_enter(child->bs, co);
 
     bdrv_co_maybe_schedule_bh(acb);
     return &acb->common;
@@ -2247,7 +2247,7 @@ BlockAIOCB *bdrv_aio_flush(BlockDriverState *bs,
     acb->req.error = -EINPROGRESS;
 
     co = qemu_coroutine_create(bdrv_aio_flush_co_entry, acb);
-    qemu_coroutine_enter(co);
+    bdrv_coroutine_enter(bs, co);
 
     bdrv_co_maybe_schedule_bh(acb);
     return &acb->common;
@@ -2380,7 +2380,7 @@ int bdrv_flush(BlockDriverState *bs)
         bdrv_flush_co_entry(&flush_co);
     } else {
         co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
     }
 
@@ -2527,7 +2527,7 @@ int bdrv_pdiscard(BlockDriverState *bs, int64_t offset, int count)
         bdrv_pdiscard_co_entry(&rwco);
     } else {
         co = qemu_coroutine_create(bdrv_pdiscard_co_entry, &rwco);
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, rwco.ret == NOT_DONE);
     }
 
diff --git a/block/mirror.c b/block/mirror.c
index e904fef..a68855c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -130,7 +130,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     g_free(op);
 
     if (s->waiting_for_io) {
-        qemu_coroutine_enter(s->common.co);
+        bdrv_coroutine_enter(s->source, s->common.co);
     }
 }
 
diff --git a/block/quorum.c b/block/quorum.c
index 40205fb..4d42e4a 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -270,16 +270,16 @@ static void quorum_rewrite_entry(void *opaque)
     QuorumCo *co = opaque;
     QuorumAIOCB *acb = co->acb;
     BDRVQuorumState *s = acb->bs->opaque;
+    BdrvChild *child = s->children[co->idx];
 
     /* Ignore any errors, it's just a correction attempt for already
      * corrupted data. */
-    bdrv_co_pwritev(s->children[co->idx], acb->offset, acb->bytes,
-                    acb->qiov, 0);
+    bdrv_co_pwritev(child, acb->offset, acb->bytes, acb->qiov, 0);
 
     /* Wake up the caller after the last rewrite */
     acb->rewrite_count--;
     if (!acb->rewrite_count) {
-        qemu_coroutine_enter_if_inactive(acb->co);
+        bdrv_coroutine_enter_if_inactive(child->bs, acb->co);
     }
 }
 
@@ -318,7 +318,7 @@ static bool quorum_rewrite_bad_versions(QuorumAIOCB *acb,
             };
 
             co = qemu_coroutine_create(quorum_rewrite_entry, &data);
-            qemu_coroutine_enter(co);
+            bdrv_coroutine_enter(acb->bs, co);
         }
     }
 
@@ -602,7 +602,7 @@ static void read_quorum_children_entry(void *opaque)
 
     /* Wake up the caller after the last read */
     if (acb->count == s->num_children) {
-        qemu_coroutine_enter_if_inactive(acb->co);
+        bdrv_coroutine_enter_if_inactive(sacb->bs, acb->co);
     }
 }
 
@@ -626,7 +626,7 @@ static int read_quorum_children(QuorumAIOCB *acb)
         };
 
         co = qemu_coroutine_create(read_quorum_children_entry, &data);
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(acb->bs, co);
     }
 
     while (acb->count < s->num_children) {
@@ -712,7 +712,7 @@ static void write_quorum_entry(void *opaque)
 
     /* Wake up the caller after the last write */
     if (acb->count == s->num_children) {
-        qemu_coroutine_enter_if_inactive(acb->co);
+        bdrv_coroutine_enter_if_inactive(sacb->bs, acb->co);
     }
 }
 
@@ -731,7 +731,7 @@ static int quorum_co_pwritev(BlockDriverState *bs, uint64_t offset,
         };
 
         co = qemu_coroutine_create(write_quorum_entry, &data);
-        qemu_coroutine_enter(co);
+        bdrv_coroutine_enter(bs, co);
     }
 
     while (acb->count < s->num_children) {
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 1b71fc8..ac6d32a 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -736,10 +736,10 @@ static int do_req(int sockfd, BlockDriverState *bs, SheepdogReq *hdr,
     } else {
         co = qemu_coroutine_create(do_co_req, &srco);
         if (bs) {
-            qemu_coroutine_enter(co);
+            bdrv_coroutine_enter(bs, co);
             BDRV_POLL_WHILE(bs, !srco.finished);
         } else {
-            qemu_coroutine_enter(co);
+            bdrv_coroutine_enter(bs, co);
             while (!srco.finished) {
                 aio_poll(qemu_get_aio_context(), true);
             }
diff --git a/blockjob.c b/blockjob.c
index 9b619f385..6e48932 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -290,7 +290,7 @@ void block_job_start(BlockJob *job)
     job->pause_count--;
     job->busy = true;
     job->paused = false;
-    qemu_coroutine_enter(job->co);
+    bdrv_coroutine_enter(blk_bs(job->blk), job->co);
 }
 
 void block_job_ref(BlockJob *job)
@@ -532,7 +532,7 @@ void block_job_user_resume(BlockJob *job)
 void block_job_enter(BlockJob *job)
 {
     if (job->co && !job->busy) {
-        qemu_coroutine_enter(job->co);
+        bdrv_coroutine_enter(blk_bs(job->blk), job->co);
     }
 }
 
diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index c80ba67..af4acb4 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -3463,7 +3463,7 @@ void pdu_submit(V9fsPDU *pdu)
         handler = v9fs_fs_ro;
     }
     co = qemu_coroutine_create(handler, pdu);
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
 }
 
 /* Returns 0 on success, 1 on failure. */
@@ -3596,7 +3596,7 @@ void v9fs_reset(V9fsState *s)
     }
 
     co = qemu_coroutine_create(virtfs_co_reset, &data);
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
 
     while (!data.done) {
         aio_poll(qemu_get_aio_context(), true);
diff --git a/hw/9pfs/coth.c b/hw/9pfs/coth.c
index 89018de..70e2667 100644
--- a/hw/9pfs/coth.c
+++ b/hw/9pfs/coth.c
@@ -22,14 +22,14 @@
 static void coroutine_enter_cb(void *opaque, int ret)
 {
     Coroutine *co = opaque;
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
 }
 
 /* Called from worker thread.  */
 static int coroutine_enter_func(void *arg)
 {
     Coroutine *co = arg;
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
     return 0;
 }
 
diff --git a/include/block/block.h b/include/block/block.h
index 5149260..00db53f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -556,6 +556,17 @@ bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag);
 AioContext *bdrv_get_aio_context(BlockDriverState *bs);
 
 /**
+ * Transfer control to @co in the aio context of @bs
+ */
+void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine *co);
+
+/**
+ * Transfer control to @co in the aio context of @bs if it's not active (i.e.
+ * part of the call stack of the running coroutine). Otherwise, do nothing.
+ */
+void bdrv_coroutine_enter_if_inactive(BlockDriverState *bs, Coroutine *co);
+
+/**
  * bdrv_set_aio_context:
  *
  * Changes the #AioContext used for fd handlers, timers, and BHs by this
diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index e60beaf..77f2de1 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -66,15 +66,16 @@ typedef void coroutine_fn CoroutineEntry(void *opaque);
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque);
 
 /**
- * Transfer control to a coroutine
+ * Transfer control to a coroutine and associate it to @ctx
  */
-void qemu_coroutine_enter(Coroutine *coroutine);
+void qemu_coroutine_enter(AioContext *ctx, Coroutine *coroutine);
 
 /**
- * Transfer control to a coroutine if it's not active (i.e. part of the call
- * stack of the running coroutine). Otherwise, do nothing.
+ * Transfer control to a coroutine and associate it to @ctx, if it's not active
+ * (i.e. part of the call stack of the running coroutine). Otherwise, do
+ * nothing.
  */
-void qemu_coroutine_enter_if_inactive(Coroutine *co);
+void qemu_coroutine_enter_if_inactive(AioContext *ctx, Coroutine *co);
 
 /**
  * Transfer control back to a coroutine's caller
diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
index d7e24af..8a19e10 100644
--- a/include/qemu/main-loop.h
+++ b/include/qemu/main-loop.h
@@ -64,7 +64,7 @@ int qemu_init_main_loop(Error **errp);
  *
  *     void enter_co_bh(void *opaque) {
  *         QEMUCoroutine *co = opaque;
- *         qemu_coroutine_enter(co);
+ *         qemu_coroutine_enter(ctx, co);
  *     }
  *
  *     ...
diff --git a/migration/colo.c b/migration/colo.c
index c19eb3f..7167ce7 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -100,7 +100,8 @@ static void secondary_vm_do_failover(void)
     qemu_sem_post(&mis->colo_incoming_sem);
     /* For Secondary VM, jump to incoming co */
     if (mis->migration_incoming_co) {
-        qemu_coroutine_enter(mis->migration_incoming_co);
+        qemu_coroutine_enter(qemu_get_aio_context(),
+                             mis->migration_incoming_co);
     }
 }
 
diff --git a/migration/migration.c b/migration/migration.c
index 54060f7..82ed7e4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -449,7 +449,7 @@ void migration_fd_process_incoming(QEMUFile *f)
 
     migrate_decompress_threads_create();
     qemu_file_set_blocking(f, false);
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
 }
 
 
diff --git a/nbd/server.c b/nbd/server.c
index 924a1fe..a2635d2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -108,7 +108,7 @@ static gboolean nbd_negotiate_continue(QIOChannel *ioc,
                                        GIOCondition condition,
                                        void *opaque)
 {
-    qemu_coroutine_enter(opaque);
+    qemu_coroutine_enter(ioc->ctx, opaque);
     return TRUE;
 }
 
@@ -1418,5 +1418,6 @@ void nbd_client_new(NBDExport *exp,
 
     data->client = client;
     data->co = qemu_coroutine_create(nbd_co_client_start, data);
-    qemu_coroutine_enter(data->co);
+    qemu_coroutine_enter(exp ? exp->ctx : qemu_get_aio_context(),
+                         data->co);
 }
diff --git a/qemu-img.c b/qemu-img.c
index b220cf7..ffb09f7 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1830,7 +1830,7 @@ static void coroutine_fn convert_co_do_copy(void *opaque)
                      * s->wait_sector_num[i] == -1 during A -> B.  Therefore
                      * B will never enter A during this time window.
                      */
-                    qemu_coroutine_enter(s->co[i]);
+                    qemu_coroutine_enter(qemu_get_aio_context(), s->co[i]);
                     break;
                 }
             }
@@ -1896,7 +1896,7 @@ static int convert_do_copy(ImgConvertState *s)
     for (i = 0; i < s->num_coroutines; i++) {
         s->co[i] = qemu_coroutine_create(convert_co_do_copy, s);
         s->wait_sector_num[i] = -1;
-        qemu_coroutine_enter(s->co[i]);
+        qemu_coroutine_enter(qemu_get_aio_context(), s->co[i]);
     }
 
     while (s->ret == -EINPROGRESS) {
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 883f53b..312fc6d 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -521,7 +521,7 @@ static int do_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
     }
 
     co = qemu_coroutine_create(co_pwrite_zeroes_entry, &data);
-    qemu_coroutine_enter(co);
+    bdrv_coroutine_enter(blk_bs(blk), co);
     while (!data.done) {
         aio_poll(blk_get_aio_context(blk), true);
     }
diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index abd97c2..64f2563 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/coroutine.h"
 #include "qemu/coroutine_int.h"
+#include "qemu/main-loop.h"
 
 /*
  * Check that qemu_in_coroutine() works
@@ -31,7 +32,7 @@ static void test_in_coroutine(void)
     g_assert(!qemu_in_coroutine());
 
     coroutine = qemu_coroutine_create(verify_in_coroutine, NULL);
-    qemu_coroutine_enter(coroutine);
+    qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
 }
 
 /*
@@ -49,7 +50,7 @@ static void test_self(void)
     Coroutine *coroutine;
 
     coroutine = qemu_coroutine_create(verify_self, &coroutine);
-    qemu_coroutine_enter(coroutine);
+    qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
 }
 
 /*
@@ -79,9 +80,9 @@ static void coroutine_fn verify_entered_step_1(void *opaque)
 
     coroutine = qemu_coroutine_create(verify_entered_step_2, self);
     g_assert(!qemu_coroutine_entered(coroutine));
-    qemu_coroutine_enter(coroutine);
+    qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
     g_assert(!qemu_coroutine_entered(coroutine));
-    qemu_coroutine_enter(coroutine);
+    qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
 }
 
 static void test_entered(void)
@@ -90,7 +91,7 @@ static void test_entered(void)
 
     coroutine = qemu_coroutine_create(verify_entered_step_1, NULL);
     g_assert(!qemu_coroutine_entered(coroutine));
-    qemu_coroutine_enter(coroutine);
+    qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
 }
 
 /*
@@ -113,7 +114,7 @@ static void coroutine_fn nest(void *opaque)
         Coroutine *child;
 
         child = qemu_coroutine_create(nest, nd);
-        qemu_coroutine_enter(child);
+        qemu_coroutine_enter(qemu_get_aio_context(), child);
     }
 
     nd->n_return++;
@@ -129,7 +130,7 @@ static void test_nesting(void)
     };
 
     root = qemu_coroutine_create(nest, &nd);
-    qemu_coroutine_enter(root);
+    qemu_coroutine_enter(qemu_get_aio_context(), root);
 
     /* Must enter and return from max nesting level */
     g_assert_cmpint(nd.n_enter, ==, nd.max);
@@ -159,7 +160,7 @@ static void test_yield(void)
 
     coroutine = qemu_coroutine_create(yield_5_times, &done);
     while (!done) {
-        qemu_coroutine_enter(coroutine);
+        qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
         i++;
     }
     g_assert_cmpint(i, ==, 5); /* coroutine must yield 5 times */
@@ -173,7 +174,7 @@ static void coroutine_fn c2_fn(void *opaque)
 static void coroutine_fn c1_fn(void *opaque)
 {
     Coroutine *c2 = opaque;
-    qemu_coroutine_enter(c2);
+    qemu_coroutine_enter(qemu_get_aio_context(), c2);
 }
 
 static void test_co_queue(void)
@@ -185,12 +186,12 @@ static void test_co_queue(void)
     c2 = qemu_coroutine_create(c2_fn, NULL);
     c1 = qemu_coroutine_create(c1_fn, c2);
 
-    qemu_coroutine_enter(c1);
+    qemu_coroutine_enter(qemu_get_aio_context(), c1);
 
     /* c1 shouldn't be used any more now; make sure we segfault if it is */
     tmp = *c1;
     memset(c1, 0xff, sizeof(Coroutine));
-    qemu_coroutine_enter(c2);
+    qemu_coroutine_enter(qemu_get_aio_context(), c2);
 
     /* Must restore the coroutine now to avoid corrupted pool */
     *c1 = tmp;
@@ -214,13 +215,13 @@ static void test_lifecycle(void)
 
     /* Create, enter, and return from coroutine */
     coroutine = qemu_coroutine_create(set_and_exit, &done);
-    qemu_coroutine_enter(coroutine);
+    qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
     g_assert(done); /* expect done to be true (first time) */
 
     /* Repeat to check that no state affects this test */
     done = false;
     coroutine = qemu_coroutine_create(set_and_exit, &done);
-    qemu_coroutine_enter(coroutine);
+    qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
     g_assert(done); /* expect done to be true (second time) */
 }
 
@@ -256,10 +257,10 @@ static void do_order_test(void)
 
     co = qemu_coroutine_create(co_order_test, NULL);
     record_push(1, 1);
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
     record_push(1, 2);
     g_assert(!qemu_in_coroutine());
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
     record_push(1, 3);
     g_assert(!qemu_in_coroutine());
 }
@@ -297,7 +298,7 @@ static void perf_lifecycle(void)
     g_test_timer_start();
     for (i = 0; i < max; i++) {
         coroutine = qemu_coroutine_create(empty_coroutine, NULL);
-        qemu_coroutine_enter(coroutine);
+        qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
     }
     duration = g_test_timer_elapsed();
 
@@ -321,7 +322,7 @@ static void perf_nesting(void)
             .max      = maxnesting,
         };
         root = qemu_coroutine_create(nest, &nd);
-        qemu_coroutine_enter(root);
+        qemu_coroutine_enter(qemu_get_aio_context(), root);
     }
     duration = g_test_timer_elapsed();
 
@@ -354,7 +355,7 @@ static void perf_yield(void)
 
     g_test_timer_start();
     while (i > 0) {
-        qemu_coroutine_enter(coroutine);
+        qemu_coroutine_enter(qemu_get_aio_context(), coroutine);
     }
     duration = g_test_timer_elapsed();
 
@@ -401,8 +402,8 @@ static void perf_cost(void)
     g_test_timer_start();
     while (i++ < maxcycles) {
         co = qemu_coroutine_create(perf_cost_func, &i);
-        qemu_coroutine_enter(co);
-        qemu_coroutine_enter(co);
+        qemu_coroutine_enter(qemu_get_aio_context(), co);
+        qemu_coroutine_enter(qemu_get_aio_context(), co);
     }
     duration = g_test_timer_elapsed();
     ops = (long)(maxcycles / (duration * 1000));
diff --git a/tests/test-thread-pool.c b/tests/test-thread-pool.c
index 91b4ec5..ba4fee4 100644
--- a/tests/test-thread-pool.c
+++ b/tests/test-thread-pool.c
@@ -94,7 +94,7 @@ static void test_submit_co(void)
     WorkerTestData data;
     Coroutine *co = qemu_coroutine_create(co_test_cb, &data);
 
-    qemu_coroutine_enter(co);
+    qemu_coroutine_enter(qemu_get_aio_context(), co);
 
     /* Back here once the worker has started.  */
 
diff --git a/util/async.c b/util/async.c
index 663e297..3289a36 100644
--- a/util/async.c
+++ b/util/async.c
@@ -388,7 +388,7 @@ static void co_schedule_bh_cb(void *opaque)
         QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
         trace_aio_co_schedule_bh_cb(ctx, co);
         aio_context_acquire(ctx);
-        qemu_coroutine_enter(co);
+        qemu_coroutine_enter(ctx, co);
         aio_context_release(ctx);
     }
 }
@@ -464,7 +464,7 @@ void aio_co_wake(struct Coroutine *co)
         QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
     } else {
         aio_context_acquire(ctx);
-        qemu_coroutine_enter(co);
+        qemu_coroutine_enter(ctx, co);
         aio_context_release(ctx);
     }
 }
diff --git a/util/qemu-coroutine-io.c b/util/qemu-coroutine-io.c
index 44a8969..26818af 100644
--- a/util/qemu-coroutine-io.c
+++ b/util/qemu-coroutine-io.c
@@ -75,7 +75,8 @@ static void fd_coroutine_enter(void *opaque)
 {
     FDYieldUntilData *data = opaque;
     qemu_set_fd_handler(data->fd, NULL, NULL, NULL);
-    qemu_coroutine_enter(data->co);
+    /* XXX: do we need an explicit ctx? */
+    qemu_coroutine_enter(qemu_get_current_aio_context(), data->co);
 }
 
 void coroutine_fn yield_until_fd_readable(int fd)
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index 6328eed..42c18d8 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -81,7 +81,8 @@ void qemu_co_queue_run_restart(Coroutine *co)
     trace_qemu_co_queue_run_restart(co);
     while ((next = QSIMPLEQ_FIRST(&co->co_queue_wakeup))) {
         QSIMPLEQ_REMOVE_HEAD(&co->co_queue_wakeup, co_queue_next);
-        qemu_coroutine_enter(next);
+        assert(next->ctx);
+        qemu_coroutine_enter(next->ctx, next);
     }
 }
 
@@ -125,7 +126,8 @@ bool qemu_co_enter_next(CoQueue *queue)
     }
 
     QSIMPLEQ_REMOVE_HEAD(&queue->entries, co_queue_next);
-    qemu_coroutine_enter(next);
+    assert(next->ctx);
+    qemu_coroutine_enter(next->ctx, next);
     return true;
 }
 
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 72412e5..47329a0 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -102,12 +102,12 @@ static void coroutine_delete(Coroutine *co)
     qemu_coroutine_delete(co);
 }
 
-void qemu_coroutine_enter(Coroutine *co)
+void qemu_coroutine_enter(AioContext *ctx, Coroutine *co)
 {
     Coroutine *self = qemu_coroutine_self();
     CoroutineAction ret;
 
-    trace_qemu_coroutine_enter(self, co, co->entry_arg);
+    trace_qemu_coroutine_enter(self, ctx, co, co->entry_arg);
 
     if (co->caller) {
         fprintf(stderr, "Co-routine re-entered recursively\n");
@@ -115,7 +115,7 @@ void qemu_coroutine_enter(Coroutine *co)
     }
 
     co->caller = self;
-    co->ctx = qemu_get_current_aio_context();
+    co->ctx = ctx;
 
     /* Store co->ctx before anything that stores co.  Matches
      * barrier in aio_co_wake and qemu_co_mutex_wake.
@@ -139,10 +139,10 @@ void qemu_coroutine_enter(Coroutine *co)
     }
 }
 
-void qemu_coroutine_enter_if_inactive(Coroutine *co)
+void qemu_coroutine_enter_if_inactive(AioContext *ctx, Coroutine *co)
 {
     if (!qemu_coroutine_entered(co)) {
-        qemu_coroutine_enter(co);
+        qemu_coroutine_enter(ctx, co);
     }
 }
 
diff --git a/util/trace-events b/util/trace-events
index ac27d94..d3c39a6 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -22,7 +22,7 @@ buffer_move(const char *buf, size_t len, const char *from) "%s: %zd bytes from %
 buffer_free(const char *buf, size_t len) "%s: capacity %zd"
 
 # util/qemu-coroutine.c
-qemu_coroutine_enter(void *from, void *to, void *opaque) "from %p to %p opaque %p"
+qemu_coroutine_enter(void *ctx, void *from, void *to, void *opaque) "ctx %p from %p to %p opaque %p"
 qemu_coroutine_yield(void *from, void *to) "from %p to %p"
 qemu_coroutine_terminate(void *co) "self %p"
 
-- 
2.9.3


* [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn
  2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
                   ` (4 preceding siblings ...)
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine Fam Zheng
@ 2017-04-07  6:54 ` Fam Zheng
  2017-04-07 13:28   ` Stefan Hajnoczi
  2017-04-07 12:45 ` [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Kevin Wolf
  6 siblings, 1 reply; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  6:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Fam Zheng, Kevin Wolf,
	Max Reitz, Eric Blake, Stefan Hajnoczi

Previously, the job could already complete before test_block_job_start()
returned; as a result, the transactional state of other jobs added to the
same txn later could not be handled correctly.

Move the block_job_start() calls into the callers, after the
block_job_txn_add_job() calls.
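
The resulting setup in the pair-job tests then looks like this (sketch):

    txn = block_job_txn_new();
    job1 = test_block_job_start(1, true, expected1, &result1);
    block_job_txn_add_job(txn, job1);
    job2 = test_block_job_start(2, true, expected2, &result2);
    block_job_txn_add_job(txn, job2);
    /* start only after both jobs have joined the transaction */
    block_job_start(job1);
    block_job_start(job2);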

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/test-blockjob-txn.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/test-blockjob-txn.c b/tests/test-blockjob-txn.c
index 4ccbda1..0f80194 100644
--- a/tests/test-blockjob-txn.c
+++ b/tests/test-blockjob-txn.c
@@ -110,7 +110,6 @@ static BlockJob *test_block_job_start(unsigned int iterations,
     s->result = result;
     data->job = s;
     data->result = result;
-    block_job_start(&s->common);
     return &s->common;
 }
 
@@ -123,6 +122,7 @@ static void test_single_job(int expected)
     txn = block_job_txn_new();
     job = test_block_job_start(1, true, expected, &result);
     block_job_txn_add_job(txn, job);
+    block_job_start(job);
 
     if (expected == -ECANCELED) {
         block_job_cancel(job);
@@ -164,6 +164,8 @@ static void test_pair_jobs(int expected1, int expected2)
     block_job_txn_add_job(txn, job1);
     job2 = test_block_job_start(2, true, expected2, &result2);
     block_job_txn_add_job(txn, job2);
+    block_job_start(job1);
+    block_job_start(job2);
 
     if (expected1 == -ECANCELED) {
         block_job_cancel(job1);
@@ -223,6 +225,8 @@ static void test_pair_jobs_fail_cancel_race(void)
     block_job_txn_add_job(txn, job1);
     job2 = test_block_job_start(2, false, 0, &result2);
     block_job_txn_add_job(txn, job2);
+    block_job_start(job1);
+    block_job_start(job2);
 
     block_job_cancel(job1);
 
-- 
2.9.3


* Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine Fam Zheng
@ 2017-04-07  9:57   ` Fam Zheng
  2017-04-07 13:26   ` Stefan Hajnoczi
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Fam Zheng @ 2017-04-07  9:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf, Max Reitz,
	Eric Blake, Stefan Hajnoczi

On Fri, 04/07 14:54, Fam Zheng wrote:
> 
>   main loop                                iothread
> -----------------------------------------------------------------------
>   blockdev_snapshot
>     aio_context_acquire(bs->ctx)
>     bdrv_flush(bs)
>       bdrv_co_flush(bs)
>         ...
>         qemu_coroutine_yield(co)
>       BDRV_POLL_WHILE()
>         aio_context_release(bs->ctx)
>                                             aio_context_acquire(bs->ctx)
>                                               ...
>                                                 aio_co_wake(co)
>         aio_poll(qemu_aio_context)              ...
>           co_schedule_bh_cb()                   ...
>             qemu_coroutine_enter(co)            ...
>               /* (A) bdrv_co_flush(bs)              /* (B) I/O on bs */
>                       continues... */
>                                             aio_context_release(bs->ctx)

After talking to Kevin on IRC: the aio_context_acquire() in the iothread
could be the one in the virtqueue handler:

  main loop                                iothread
-----------------------------------------------------------------------
  blockdev_snapshot
    aio_context_acquire(bs->ctx)
                                           virtio_scsi_data_plane_handle_cmd
    bdrv_drained_begin(bs->ctx)
    bdrv_flush(bs)
      bdrv_co_flush(bs)                      aio_context_acquire(bs->ctx).enter
        ...
        qemu_coroutine_yield(co)
      BDRV_POLL_WHILE()
        aio_context_release(bs->ctx)
                                             aio_context_acquire(bs->ctx).return
                                               ...
                                                 aio_co_wake(co)
        aio_poll(qemu_aio_context)               ...
          co_schedule_bh_cb()                    ...
            qemu_coroutine_enter(co)             ...

              /* (A) bdrv_co_flush(bs)           /* (B) I/O on bs */
                      continues... */
                                             aio_context_release(bs->ctx)
        aio_context_acquire(bs->ctx)

Note that in this special case, bdrv_drained_begin() doesn't do the "release,
poll, acquire" dance in BDRV_POLL_WHILE, because bs->in_flight == 0. Might this
be the root cause of the race?  (Ed's test case showed that, apart from vq
handlers, block jobs in the iothread can also trigger the same race pattern.
For that part we need John's patches to pause block jobs in bdrv_drained_begin.)

Fam


* Re: [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations
  2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
                   ` (5 preceding siblings ...)
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn Fam Zheng
@ 2017-04-07 12:45 ` Kevin Wolf
  6 siblings, 0 replies; 26+ messages in thread
From: Kevin Wolf @ 2017-04-07 12:45 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Max Reitz,
	Eric Blake, Stefan Hajnoczi

On 07.04.2017 at 08:54, Fam Zheng wrote:
> v2: - Drop patch 4 in v1. A second thought made me feel neither it nor Kevin's
>       suggestion to move the BH process to bdrv_drain_recurse/BDRV_POLL_WHILE
>       is a complete fix. So leave it for a separate patch.
>     - Add rev-by to patches 1, 3, 4.
>     - Split from patch 1 in v1 and add patch 2, for the new assertions. [Kevin]
>     - Rewrite patch 5. Fix block job's co when a BDS is moved to a different
>       aio context. [Kevin]
>     - Add patch 6.
> 
> Crashes are reported on dataplane devices when doing snapshot and commit under
> guest I/O.
> 
> With this series, Ed's test case '176' now passes (reran 10+ times):
> 
>     https://github.com/skyportsystems/qemu-1/commits/eswierk-iotests-2.9
> 
> The biggest fix for this is patch 5, which fixes a race condition between main
> thread and iothread.

Thanks, applied patches 1-3 for now. I feel the other patches still need
closer review and potentially discussion.

Kevin


* Re: [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context Fam Zheng
@ 2017-04-07 12:50   ` Stefan Hajnoczi
  2017-04-08  3:43     ` Fam Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 12:50 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake

On Fri, Apr 07, 2017 at 02:54:12PM +0800, Fam Zheng wrote:
> @@ -4413,6 +4416,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
>      aio_context_acquire(new_context);
>      bdrv_attach_aio_context(bs, new_context);
>      aio_context_release(new_context);
> +    if (bs->job) {
> +        block_job_resume(bs->job);
> +    }

Should this be called before aio_context_release(new_context)?


* Re: [Qemu-devel] [PATCH v2 1/6] block: Fix unpaired aio_disable_external in external snapshot
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 1/6] block: Fix unpaired aio_disable_external in external snapshot Fam Zheng
@ 2017-04-07 13:23   ` Stefan Hajnoczi
  0 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 13:23 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake

On Fri, Apr 07, 2017 at 02:54:09PM +0800, Fam Zheng wrote:
> bdrv_replace_child_noperm tries to hand over the quiesce_counter state
> from old bs to the new one, but if they are not on the same aio context
> this causes unbalance.
> 
> Fix this by setting the correct aio context before calling
> bdrv_append().
> 
> Reported-by: Ed Swierk <eswierk@skyportsystems.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  blockdev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


* Re: [Qemu-devel] [PATCH v2 2/6] block: Assert attached child node has right aio context
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 2/6] block: Assert attached child node has right aio context Fam Zheng
@ 2017-04-07 13:24   ` Stefan Hajnoczi
  0 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 13:24 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake

On Fri, Apr 07, 2017 at 02:54:10PM +0800, Fam Zheng wrote:
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block.c | 4 ++++
>  1 file changed, 4 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


* Re: [Qemu-devel] [PATCH v2 3/6] mirror: Fix aio context of mirror_top_bs
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 3/6] mirror: Fix aio context of mirror_top_bs Fam Zheng
@ 2017-04-07 13:24   ` Stefan Hajnoczi
  0 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 13:24 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake

On Fri, Apr 07, 2017 at 02:54:11PM +0800, Fam Zheng wrote:
> It should be moved to the same context as source, before inserting to the
> graph.
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/mirror.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


* Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine Fam Zheng
  2017-04-07  9:57   ` Fam Zheng
@ 2017-04-07 13:26   ` Stefan Hajnoczi
  2017-04-08  3:27     ` Fam Zheng
  2017-04-07 14:07   ` Eric Blake
  2017-04-07 15:14   ` Kevin Wolf
  3 siblings, 1 reply; 26+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 13:26 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake

On Fri, Apr 07, 2017 at 02:54:13PM +0800, Fam Zheng wrote:
> Coroutine in block layer should always be waken up in bs->aio_context
> rather than the "current" context where it is entered. They differ when
> the main loop is doing QMP tasks.
> 
> Race conditions happen without this patch, because the wrong context is
> acquired in co_schedule_bh_cb, while the entered coroutine works on a
> different one:
> 
>   main loop                                iothread
> -----------------------------------------------------------------------
>   blockdev_snapshot
>     aio_context_acquire(bs->ctx)
>     bdrv_flush(bs)
>       bdrv_co_flush(bs)
>         ...
>         qemu_coroutine_yield(co)
>       BDRV_POLL_WHILE()
>         aio_context_release(bs->ctx)
>                                             aio_context_acquire(bs->ctx)
>                                               ...
>                                                 aio_co_wake(co)
>         aio_poll(qemu_aio_context)              ...
>           co_schedule_bh_cb()                   ...
>             qemu_coroutine_enter(co)            ...
>               /* (A) bdrv_co_flush(bs)              /* (B) I/O on bs */
>                       continues... */
>                                             aio_context_release(bs->ctx)
> 
> Both (A) and (B) can access resources protected by bs->ctx, but (A) is
> not thread-safe.
> 
> Make the block layer explicitly specify a desired context for the
> entered coroutine. For the rest callers, stick to the old behavior,
> qemu_get_aio_context() or qemu_get_current_aio_context().

I wasn't expecting the patch to touch so many places.  If a coroutine
can be permanently associated with an AioContext then there's no need to
keep assigning co->ctx on each qemu_coroutine_enter().

I don't know about this one.  Paolo is the best person to review it
since he's most familiar with the recent Coroutine/AioContext changes.

Stefan


* Re: [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn Fam Zheng
@ 2017-04-07 13:28   ` Stefan Hajnoczi
  2017-04-07 18:05     ` John Snow
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 13:28 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake, jsnow

On Fri, Apr 07, 2017 at 02:54:14PM +0800, Fam Zheng wrote:
> Previously, before test_block_job_start returns, the job can already
> complete, as a result, the transactional state of other jobs added to
> the same txn later cannot be handled correctly.
> 
> Move the block_job_start() calls to callers after
> block_job_txn_add_job() calls.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  tests/test-blockjob-txn.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

CCing John Snow because he looked at block jobs completing during txn
setup recently.

Stefan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine Fam Zheng
  2017-04-07  9:57   ` Fam Zheng
  2017-04-07 13:26   ` Stefan Hajnoczi
@ 2017-04-07 14:07   ` Eric Blake
  2017-04-07 15:14   ` Kevin Wolf
  3 siblings, 0 replies; 26+ messages in thread
From: Eric Blake @ 2017-04-07 14:07 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf, Max Reitz,
	Stefan Hajnoczi

On 04/07/2017 01:54 AM, Fam Zheng wrote:
> Coroutine in block layer should always be waken up in bs->aio_context

s/waken up/awakened/

> rather than the "current" context where it is entered. They differ when
> the main loop is doing QMP tasks.
> 
> Race conditions happen without this patch, because the wrong context is
> acquired in co_schedule_bh_cb, while the entered coroutine works on a
> different one:
> 
>   main loop                                iothread
> -----------------------------------------------------------------------
>   blockdev_snapshot
>     aio_context_acquire(bs->ctx)
>     bdrv_flush(bs)
>       bdrv_co_flush(bs)
>         ...
>         qemu_coroutine_yield(co)
>       BDRV_POLL_WHILE()
>         aio_context_release(bs->ctx)
>                                             aio_context_acquire(bs->ctx)
>                                               ...
>                                                 aio_co_wake(co)
>         aio_poll(qemu_aio_context)              ...
>           co_schedule_bh_cb()                   ...
>             qemu_coroutine_enter(co)            ...
>               /* (A) bdrv_co_flush(bs)              /* (B) I/O on bs */
>                       continues... */
>                                             aio_context_release(bs->ctx)
> 
> Both (A) and (B) can access resources protected by bs->ctx, but (A) is
> not thread-safe.
> 
> Make the block layer explicitly specify a desired context for the
> entered coroutine. For the rest callers, stick to the old behavior,

s/rest/remaining/

> qemu_get_aio_context() or qemu_get_current_aio_context().
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---

At this point, I'm still more comfortable waiting for Paolo's review.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine Fam Zheng
                     ` (2 preceding siblings ...)
  2017-04-07 14:07   ` Eric Blake
@ 2017-04-07 15:14   ` Kevin Wolf
  2017-04-10  0:56     ` Paolo Bonzini
  3 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2017-04-07 15:14 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Max Reitz,
	Eric Blake, Stefan Hajnoczi

Am 07.04.2017 um 08:54 hat Fam Zheng geschrieben:
> Coroutine in block layer should always be waken up in bs->aio_context
> rather than the "current" context where it is entered. They differ when
> the main loop is doing QMP tasks.
> 
> Race conditions happen without this patch, because the wrong context is
> acquired in co_schedule_bh_cb, while the entered coroutine works on a
> different one:
> 
>   main loop                                iothread
> -----------------------------------------------------------------------
>   blockdev_snapshot
>     aio_context_acquire(bs->ctx)
>     bdrv_flush(bs)
>       bdrv_co_flush(bs)
>         ...
>         qemu_coroutine_yield(co)
>       BDRV_POLL_WHILE()
>         aio_context_release(bs->ctx)
>                                             aio_context_acquire(bs->ctx)
>                                               ...
>                                                 aio_co_wake(co)
>         aio_poll(qemu_aio_context)              ...
>           co_schedule_bh_cb()                   ...
>             qemu_coroutine_enter(co)            ...
>               /* (A) bdrv_co_flush(bs)              /* (B) I/O on bs */
>                       continues... */
>                                             aio_context_release(bs->ctx)

As I discussed with Fam on IRC this morning, what this patch tries to
fix (acquiring the wrong context) is not the real problem, but just a
symptom.

If you look at the example above (the same is true for the other example
that Fam posted in a reply), you see that the monitor called
aio_context_acquire() specifically to prevent other users from
interfering with its activities. The real bug is that the iothread even
had a chance to run. One part of the reason is that
BDRV_POLL_WHILE() drops the lock since commit c9d1a561, so just calling
aio_context_acquire() doesn't protect you any more if there is any
chance that a nested function calls BDRV_POLL_WHILE().

I haven't checked what was done when merging said commit, but I kind of
expect that we didn't carefully audit all callers of
aio_context_acquire() to check whether they can cope with this. Specifically,
monitor commands tend to rely on the fact that they keep the lock and
therefore nobody else can interfere. When I scrolled through blockdev.c
this morning, I saw a few suspicious ones that could be broken now.

Now, of course, some callers additionally call bdrv_drained_begin(), and
the snapshot code is one of them. This should in theory be safe, but in
practice even both aio_context_acquire() and bdrv_drained_begin()
together don't give us the exclusive access that these callers want.
This is the real bug to address.
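
To make the expectation explicit, the pattern these monitor commands
rely on is roughly this (a sketch, not an exact excerpt from blockdev.c):

    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);
    bdrv_drained_begin(bs);      /* supposed to quiesce all other users of bs */
    /* ...modify the graph, take the snapshot, etc., assuming exclusive
     * access; but any nested BDRV_POLL_WHILE() drops the lock and lets
     * the iothread run... */
    bdrv_drained_end(bs);
    aio_context_release(ctx);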

I don't know enough about this code to say whether aio_context_acquire()
alone should give the same guarantees again (I suspect it became
impractical?) or whether we need to fix only bdrv_drained_begin() and
add it to more places. So I'll join the others in this email thread:

Paolo, do you have an opinion on this?

Kevin

> Both (A) and (B) can access resources protected by bs->ctx, but (A) is
> not thread-safe.
> 
> Make the block layer explicitly specify a desired context for the
> entered coroutine. For the rest callers, stick to the old behavior,
> qemu_get_aio_context() or qemu_get_current_aio_context().
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn
  2017-04-07 13:28   ` Stefan Hajnoczi
@ 2017-04-07 18:05     ` John Snow
  2017-04-08  3:39       ` Fam Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: John Snow @ 2017-04-07 18:05 UTC (permalink / raw)
  To: Stefan Hajnoczi, Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake



On 04/07/2017 09:28 AM, Stefan Hajnoczi wrote:
> On Fri, Apr 07, 2017 at 02:54:14PM +0800, Fam Zheng wrote:
>> Previously, before test_block_job_start returns, the job can already
>> complete, as a result, the transactional state of other jobs added to
>> the same txn later cannot be handled correctly.
>>
>> Move the block_job_start() calls to callers after
>> block_job_txn_add_job() calls.
>>
>> Signed-off-by: Fam Zheng <famz@redhat.com>
>> ---
>>  tests/test-blockjob-txn.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> CCing John Snow because he looked at block jobs completing during txn
> setup recently.
> 
> Stefan
> 

This matches the changes we made to qmp_transaction, but I forgot to (or
didn't take care to) change the qtest, as it didn't cause a regression
at the time.

I wonder if I should make it a runtime error to add a job to a
transaction that has already "started", to make sure this interface is
not misused; this test highlights that there were still some remaining
"bad" uses of the interface.

Regardless...

Thanks for the CC. ACK

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-07 13:26   ` Stefan Hajnoczi
@ 2017-04-08  3:27     ` Fam Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Fam Zheng @ 2017-04-08  3:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake

On Fri, 04/07 14:26, Stefan Hajnoczi wrote:
> I wasn't expecting the patch to touch so many places.  If a coroutine
> can be permanently associated with an AioContext then there's no need to
> keep assigning co->ctx on each qemu_coroutine_enter().

Maybe, but bs->job->co->ctx has to change when bdrv_set_aio_context() is
called on bs. In v1 the co->ctx was associated at qemu_coroutine_create()
time, but that failed to handle bdrv_set_aio_context().
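
So a creation-time association would also need something like this in
bdrv_set_aio_context() (sketch only; co->ctx is internal to the
coroutine code, so it would really go through a helper):

    void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
    {
        /* ... existing detach from the old context / attach to new_context ... */
        if (bs->job && bs->job->co) {
            bs->job->co->ctx = new_context;  /* keep the job coroutine in sync */
        }
    }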

Fam

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn
  2017-04-07 18:05     ` John Snow
@ 2017-04-08  3:39       ` Fam Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Fam Zheng @ 2017-04-08  3:39 UTC (permalink / raw)
  To: John Snow
  Cc: Stefan Hajnoczi, qemu-devel, Paolo Bonzini, qemu-block,
	Ed Swierk, Kevin Wolf, Max Reitz, Eric Blake

On Fri, 04/07 14:05, John Snow wrote:
> 
> 
> On 04/07/2017 09:28 AM, Stefan Hajnoczi wrote:
> > On Fri, Apr 07, 2017 at 02:54:14PM +0800, Fam Zheng wrote:
> >> Previously, before test_block_job_start returns, the job can already
> >> complete, as a result, the transactional state of other jobs added to
> >> the same txn later cannot be handled correctly.
> >>
> >> Move the block_job_start() calls to callers after
> >> block_job_txn_add_job() calls.
> >>
> >> Signed-off-by: Fam Zheng <famz@redhat.com>
> >> ---
> >>  tests/test-blockjob-txn.c | 6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > CCing John Snow because he looked at block jobs completing during txn
> > setup recently.
> > 
> > Stefan
> > 
> 
> This matches the changes we made to qmp_transaction, but I forgot to (or
> didn't take care to)  change the qtest as it didn't cause a regression
> at the time.
> 
> I wonder if I should make it a runtime error to add a job to a
> transaction which has already "started" to make sure that this interface
> is not misused, as this test highlights that there were still some
> remaining "bad" uses of the interface.
> 
> Regardless...
> 
> Thanks for the CC. ACK

So, shall we merge this for 2.9?

Fam

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context
  2017-04-07 12:50   ` Stefan Hajnoczi
@ 2017-04-08  3:43     ` Fam Zheng
  2017-04-10  8:06       ` Kevin Wolf
  0 siblings, 1 reply; 26+ messages in thread
From: Fam Zheng @ 2017-04-08  3:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Paolo Bonzini, qemu-block, Ed Swierk, Kevin Wolf,
	Max Reitz, Eric Blake

On Fri, 04/07 13:50, Stefan Hajnoczi wrote:
> On Fri, Apr 07, 2017 at 02:54:12PM +0800, Fam Zheng wrote:
> > @@ -4413,6 +4416,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
> >      aio_context_acquire(new_context);
> >      bdrv_attach_aio_context(bs, new_context);
> >      aio_context_release(new_context);
> > +    if (bs->job) {
> > +        block_job_resume(bs->job);
> > +    }
> 
> Should this be called before aio_context_release(new_context)?

Yes, and I'm going to replace it with bdrv_parent_drained_begin() as Kevin
suggested.

Fam

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-07 15:14   ` Kevin Wolf
@ 2017-04-10  0:56     ` Paolo Bonzini
  2017-04-10  1:43       ` Fam Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2017-04-10  0:56 UTC (permalink / raw)
  To: Kevin Wolf, Fam Zheng; +Cc: qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi



On 07/04/2017 23:14, Kevin Wolf wrote:
> One part of the reason is that
> BDRV_POLL_WHILE() drops the lock since commit c9d1a561, so just calling
> aio_context_acquire() doesn't protect you any more if there is any
> chance that a nested function calls BDRV_POLL_WHILE().

Hi guys, sorry for the late reply.

There wasn't actually much that changed in commit c9d1a561.  Everything
that could break was also broken before.

With commit c9d1a561 the logic is

	aio_context_acquire
	prepare I/O
	aio_context_release
	poll main AioContext
				aio_context_acquire
				aio_poll secondary AioContext
				complete I/O
			    <-- bdrv_wakeup
	aio_context_acquire
				aio_context_release
	Sync I/O is complete

while before it was

	aio_context_acquire
	prepare I/O
	aio_poll secondary AioContext
	complete I/O
	Sync I/O is complete

The code that can run in "aio_poll secondary AioContext" is the same in
both cases.  The difference is that, after c9d1a561, it always runs in
one thread which eliminates the need for RFifoLock's contention
callbacks (and a bunch of bdrv_drain_all deadlocks that arose from the
combination of contention callbacks and recursive locking).

The patch that introduced the bug is the one that introduced the "home
AioContext" co->ctx for coroutines, because now you have

	aio_context_acquire
	prepare I/O
	aio_context_release
	poll main AioContext
				aio_context_acquire
				aio_poll secondary AioContext
				aio_co_wake
	complete I/O
	bdrv_wakeup
	aio_context_acquire
				aio_context_release
	Sync I/O is complete

and if "complete I/O" causes anything to happen in the iothread, bad
things happen.

I think Fam's analysis is right.  This patch will hopefully be reverted
in 2.10, but right now it's the right thing to do.  However, I don't
like very much adding an argument to qemu_coroutine_enter because:

1) coroutines can be used without AioContexts (CoMutex has a dependency
on aio_co_wake, but it is hidden behind the API and if you have a single
AioContext you could just define aio_co_wake to be qemu_coroutine_enter).

2) the value that is assigned to co->ctx is not the AioContext in which
the coroutine is running.

It would be better to split aio_co_wake so that aio_co_wake is

    /* Read coroutine before co->ctx.  Matches smp_wmb in
     * qemu_coroutine_enter.
     */
    smp_read_barrier_depends();
    ctx = atomic_read(&co->ctx);
    aio_co_enter(ctx, co);

and aio_co_enter is the rest of the logic.
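
i.e. something like this (hand-waving sketch; the real thing would also
have to handle being called from coroutine context):

    void aio_co_enter(AioContext *ctx, Coroutine *co)
    {
        if (ctx != qemu_get_current_aio_context()) {
            /* wrong thread: defer to ctx through a scheduled bottom half */
            aio_co_schedule(ctx, co);
            return;
        }
        qemu_coroutine_enter(co);
    }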

Then the coroutine code always runs in the right thread, which obviously
is thread-safe.

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
  2017-04-10  0:56     ` Paolo Bonzini
@ 2017-04-10  1:43       ` Fam Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Fam Zheng @ 2017-04-10  1:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block, Max Reitz

On Mon, 04/10 08:56, Paolo Bonzini wrote:
> 
> 
> On 07/04/2017 23:14, Kevin Wolf wrote:
> > One part of the reason is that
> > BDRV_POLL_WHILE() drops the lock since commit c9d1a561, so just calling
> > aio_context_acquire() doesn't protect you any more if there is any
> > chance that a nested function calls BDRV_POLL_WHILE().
> 
> Hi guys, sorry for the late reply.
> 
> There wasn't actually much that changed in commit c9d1a561.  Everything
> that could break was also broken before.
> 
> With commit c9d1a561 the logic is
> 
> 	aio_context_acquire
> 	prepare I/O
> 	aio_context_release
> 	poll main AioContext
> 				aio_context_acquire
> 				aio_poll secondary AioContext
> 				complete I/O
> 			    <-- bdrv_wakeup
> 	aio_context_acquire
> 				aio_context_release
> 	Sync I/O is complete
> 
> while before it was
> 
> 	aio_context_acquire
> 	prepare I/O
> 	aio_poll secondary AioContext
> 	complete I/O
> 	Sync I/O is complete
> 
> The code that can run in "aio_poll secondary AioContext" is the same in
> both cases.

This is not completely true. Before that change, a blocked aio_context_acquire()
in virtio_scsi_data_plane_handle_cmd() wouldn't be woken up during the sync I/O.
Of course the aio context pushdown (9d4566544) happened after c9d1a561, so this
is a compound consequence of the two.

Also, since 997235485, bdrv_drain_poll no longer polls even once when
!bdrv_requests_pending(bs), which made the situation a bit worse: had we run
the BDRV_POLL_WHILE body at least once even when the condition is false,
virtio_scsi_data_plane_handle_cmd could have returned before
bdrv_drained_begin() returns.

> The difference is that, after c9d1a561, it always runs in
> one thread which eliminates the need for RFifoLock's contention
> callbacks (and a bunch of bdrv_drain_all deadlocks that arose from the
> combination of contention callbacks and recursive locking).
> 
> The patch that introduced the bug is the one that introduced the "home
> AioContext" co->ctx for coroutines, because now you have
> 
> 	aio_context_acquire
> 	prepare I/O
> 	aio_context_release
> 	poll main AioContext
> 				aio_context_acquire
> 				aio_poll secondary AioContext
> 				aio_co_wake
> 	complete I/O
> 	bdrv_wakeup
> 	aio_context_acquire
> 				aio_context_release
> 	Sync I/O is complete
> 
> and if "complete I/O" causes anything to happen in the iothread, bad
> things happen.
> 
> I think Fam's analysis is right.  This patch will hopefully be reverted
> in 2.10, but right now it's the right thing to do.  However, I don't
> like very much adding an argument to qemu_coroutine_enter because:
> 
> 1) coroutines can be used without AioContexts (CoMutex has a dependency
> on aio_co_wake, but it is hidden behind the API and if you have a single
> AioContext you could just define aio_co_wake to be qemu_coroutine_enter).
> 
> 2) the value that is assigned to co->ctx is not the AioContext in which
> the coroutine is running.
> 
> It would be better to split aio_co_wake so that aio_co_wake is
> 
>     /* Read coroutine before co->ctx.  Matches smp_wmb in
>      * qemu_coroutine_enter.
>      */
>     smp_read_barrier_depends();
>     ctx = atomic_read(&co->ctx);
>     aio_co_enter(ctx, co);
> 
> and aio_co_enter is the rest of the logic.

Does aio_co_enter also solve 1) for you? If so, I can add it in v3.

Fam

> 
> Then the coroutine code always runs in the right thread, which obviously
> is thread-safe.
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context
  2017-04-08  3:43     ` Fam Zheng
@ 2017-04-10  8:06       ` Kevin Wolf
  2017-04-10  8:45         ` Fam Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2017-04-10  8:06 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Stefan Hajnoczi, qemu-devel, Paolo Bonzini, qemu-block,
	Ed Swierk, Max Reitz, Eric Blake

Am 08.04.2017 um 05:43 hat Fam Zheng geschrieben:
> On Fri, 04/07 13:50, Stefan Hajnoczi wrote:
> > On Fri, Apr 07, 2017 at 02:54:12PM +0800, Fam Zheng wrote:
> > > @@ -4413,6 +4416,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
> > >      aio_context_acquire(new_context);
> > >      bdrv_attach_aio_context(bs, new_context);
> > >      aio_context_release(new_context);
> > > +    if (bs->job) {
> > > +        block_job_resume(bs->job);
> > > +    }
> > 
> > Should this be called before aio_context_release(new_context)?
> 
> Yes, and I'm going to replace it with bdrv_parent_drained_begin() as Kevin
> suggested.

I think at the moment bdrv_parent_drained_begin() can't replace it yet,
but you need both.

Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context
  2017-04-10  8:06       ` Kevin Wolf
@ 2017-04-10  8:45         ` Fam Zheng
  2017-04-10  9:11           ` Kevin Wolf
  0 siblings, 1 reply; 26+ messages in thread
From: Fam Zheng @ 2017-04-10  8:45 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, qemu-devel, Paolo Bonzini, qemu-block,
	Ed Swierk, Max Reitz, Eric Blake

On Mon, 04/10 10:06, Kevin Wolf wrote:
> Am 08.04.2017 um 05:43 hat Fam Zheng geschrieben:
> > On Fri, 04/07 13:50, Stefan Hajnoczi wrote:
> > > On Fri, Apr 07, 2017 at 02:54:12PM +0800, Fam Zheng wrote:
> > > > @@ -4413,6 +4416,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
> > > >      aio_context_acquire(new_context);
> > > >      bdrv_attach_aio_context(bs, new_context);
> > > >      aio_context_release(new_context);
> > > > +    if (bs->job) {
> > > > +        block_job_resume(bs->job);
> > > > +    }
> > > 
> > > Should this be called before aio_context_release(new_context)?
> > 
> > Yes, and I'm going to replace it with bdrv_parent_drained_begin() as Kevin
> > suggested.
> 
> I think at the moment bdrv_parent_drained_begin() can't replace it yet,
> but you need both.

I think we have it already, see 600ac6a0e (blockjob: add devops to blockjob
backends):

    bdrv_parent_drained_begin
     -> blk_root_drained_begin
      -> block_job_drained_begin
       -> block_job_pause
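
So for v3 bdrv_set_aio_context() could take roughly this shape (sketch
of the idea, not final code):

    void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
    {
        /* the caller is expected to hold the old context's lock */
        bdrv_parent_drained_begin(bs);   /* pauses the job via the chain above */
        bdrv_drained_begin(bs);
        /* ... existing detach from the old context / attach to new_context ... */
        bdrv_drained_end(bs);
        bdrv_parent_drained_end(bs);
    }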

Fam

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context
  2017-04-10  8:45         ` Fam Zheng
@ 2017-04-10  9:11           ` Kevin Wolf
  0 siblings, 0 replies; 26+ messages in thread
From: Kevin Wolf @ 2017-04-10  9:11 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Stefan Hajnoczi, qemu-devel, Paolo Bonzini, qemu-block,
	Ed Swierk, Max Reitz, Eric Blake

Am 10.04.2017 um 10:45 hat Fam Zheng geschrieben:
> On Mon, 04/10 10:06, Kevin Wolf wrote:
> > Am 08.04.2017 um 05:43 hat Fam Zheng geschrieben:
> > > On Fri, 04/07 13:50, Stefan Hajnoczi wrote:
> > > > On Fri, Apr 07, 2017 at 02:54:12PM +0800, Fam Zheng wrote:
> > > > > @@ -4413,6 +4416,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
> > > > >      aio_context_acquire(new_context);
> > > > >      bdrv_attach_aio_context(bs, new_context);
> > > > >      aio_context_release(new_context);
> > > > > +    if (bs->job) {
> > > > > +        block_job_resume(bs->job);
> > > > > +    }
> > > > 
> > > > Should this be called before aio_context_release(new_context)?
> > > 
> > > Yes, and I'm going to replace it with bdrv_parent_drained_begin() as Kevin
> > > suggested.
> > 
> > I think at the moment bdrv_parent_drained_begin() can't replace it yet,
> > but you need both.
> 
> I think we have it already, see 600ac6a0e (blockjob: add devops to blockjob
> backends):
> 
>     bdrv_parent_drained_begin
>      -> blk_root_drained_begin
>       -> block_job_drained_begin
>        -> block_job_pause

Ah, yes, you're right. Somehow I thought this was only for 2.10.

Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-04-10  9:11 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-07  6:54 [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Fam Zheng
2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 1/6] block: Fix unpaired aio_disable_external in external snapshot Fam Zheng
2017-04-07 13:23   ` Stefan Hajnoczi
2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 2/6] block: Assert attached child node has right aio context Fam Zheng
2017-04-07 13:24   ` Stefan Hajnoczi
2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 3/6] mirror: Fix aio context of mirror_top_bs Fam Zheng
2017-04-07 13:24   ` Stefan Hajnoczi
2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 4/6] block: Quiesce old aio context during bdrv_set_aio_context Fam Zheng
2017-04-07 12:50   ` Stefan Hajnoczi
2017-04-08  3:43     ` Fam Zheng
2017-04-10  8:06       ` Kevin Wolf
2017-04-10  8:45         ` Fam Zheng
2017-04-10  9:11           ` Kevin Wolf
2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine Fam Zheng
2017-04-07  9:57   ` Fam Zheng
2017-04-07 13:26   ` Stefan Hajnoczi
2017-04-08  3:27     ` Fam Zheng
2017-04-07 14:07   ` Eric Blake
2017-04-07 15:14   ` Kevin Wolf
2017-04-10  0:56     ` Paolo Bonzini
2017-04-10  1:43       ` Fam Zheng
2017-04-07  6:54 ` [Qemu-devel] [PATCH v2 6/6] tests/block-job-txn: Don't start block job before adding to txn Fam Zheng
2017-04-07 13:28   ` Stefan Hajnoczi
2017-04-07 18:05     ` John Snow
2017-04-08  3:39       ` Fam Zheng
2017-04-07 12:45 ` [Qemu-devel] [PATCH v2 0/6] block: Fixes regarding dataplane and management operations Kevin Wolf
