* [Qemu-devel] [PATCH 0/4] block-backend: Queue requests while drained
@ 2019-07-25 16:27 Kevin Wolf
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 1/4] block: Remove blk_pread_unthrottled() Kevin Wolf
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Kevin Wolf @ 2019-07-25 16:27 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, den, qemu-devel, mreitz, dplotnikov

This series fixes the problem that devices like IDE, which submit
requests as a direct result of I/O from the CPU thread, can continue to
submit new requests even in a drained section.

In order to avoid a dependency for this series, I borrowed a patch from
Max.

Kevin Wolf (3):
  block: Remove blk_pread_unthrottled()
  mirror: Keep target drained until graph changes are done
  block-backend: Queue requests while drained

Max Reitz (1):
  block: Reduce (un)drains when replacing a child

 include/sysemu/block-backend.h | 13 +++---
 block.c                        | 49 +++++++++++++-------
 block/backup.c                 |  1 +
 block/block-backend.c          | 85 +++++++++++++++++++++++-----------
 block/commit.c                 |  2 +
 block/mirror.c                 | 20 ++++----
 blockjob.c                     |  3 ++
 hw/block/hd-geometry.c         |  7 +--
 tests/test-bdrv-drain.c        |  1 +
 9 files changed, 118 insertions(+), 63 deletions(-)

-- 
2.20.1




* [Qemu-devel] [PATCH 1/4] block: Remove blk_pread_unthrottled()
  2019-07-25 16:27 [Qemu-devel] [PATCH 0/4] block-backend: Queue requests while drained Kevin Wolf
@ 2019-07-25 16:27 ` Kevin Wolf
  2019-07-26  9:18   ` Max Reitz
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 2/4] block: Reduce (un)drains when replacing a child Kevin Wolf
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2019-07-25 16:27 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, den, qemu-devel, mreitz, dplotnikov

The functionality offered by blk_pread_unthrottled() goes back to commit
498e386c584. Back then, we couldn't perform I/O throttling with
synchronous requests because timers wouldn't be executed in polling
loops, so the commit automatically disabled I/O throttling as soon as a
synchronous request was issued.

However, for geometry detection during disk initialisation, we always
used (and still use) synchronous requests even if guest requests use AIO
later. Geometry detection was not supposed to disable I/O throttling, so
bdrv_pread_unthrottled() was introduced, which disabled throttling only
temporarily.

None of this is necessary any more because we do run timers in polling
loops, and even synchronous requests now use the coroutine
infrastructure internally. For this reason, commit 90c78624f already
removed the automatic disabling of I/O throttling.

It's time to get rid of the workaround for the removed code, and its
abuse of blk_root_drained_begin()/end(), as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/sysemu/block-backend.h |  2 --
 block/block-backend.c          | 16 ----------------
 hw/block/hd-geometry.c         |  7 +------
 3 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 733c4957eb..7320b58467 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -117,8 +117,6 @@ char *blk_get_attached_dev_id(BlockBackend *blk);
 BlockBackend *blk_by_dev(void *dev);
 BlockBackend *blk_by_qdev_id(const char *id, Error **errp);
 void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
-int blk_pread_unthrottled(BlockBackend *blk, int64_t offset, uint8_t *buf,
-                          int bytes);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
                                unsigned int bytes, QEMUIOVector *qiov,
                                BdrvRequestFlags flags);
diff --git a/block/block-backend.c b/block/block-backend.c
index 0056b526b8..fdd6b01ecf 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1237,22 +1237,6 @@ static int blk_prw(BlockBackend *blk, int64_t offset, uint8_t *buf,
     return rwco.ret;
 }
 
-int blk_pread_unthrottled(BlockBackend *blk, int64_t offset, uint8_t *buf,
-                          int count)
-{
-    int ret;
-
-    ret = blk_check_byte_request(blk, offset, count);
-    if (ret < 0) {
-        return ret;
-    }
-
-    blk_root_drained_begin(blk->root);
-    ret = blk_pread(blk, offset, buf, count);
-    blk_root_drained_end(blk->root, NULL);
-    return ret;
-}
-
 int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                       int bytes, BdrvRequestFlags flags)
 {
diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c
index 79384a2b0a..dcbccee294 100644
--- a/hw/block/hd-geometry.c
+++ b/hw/block/hd-geometry.c
@@ -63,12 +63,7 @@ static int guess_disk_lchs(BlockBackend *blk,
 
     blk_get_geometry(blk, &nb_sectors);
 
-    /**
-     * The function will be invoked during startup not only in sync I/O mode,
-     * but also in async I/O mode. So the I/O throttling function has to
-     * be disabled temporarily here, not permanently.
-     */
-    if (blk_pread_unthrottled(blk, 0, buf, BDRV_SECTOR_SIZE) < 0) {
+    if (blk_pread(blk, 0, buf, BDRV_SECTOR_SIZE) < 0) {
         return -1;
     }
     /* test msdos magic */
-- 
2.20.1




* [Qemu-devel] [PATCH 2/4] block: Reduce (un)drains when replacing a child
  2019-07-25 16:27 [Qemu-devel] [PATCH 0/4] block-backend: Queue requests while drained Kevin Wolf
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 1/4] block: Remove blk_pread_unthrottled() Kevin Wolf
@ 2019-07-25 16:27 ` Kevin Wolf
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done Kevin Wolf
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained Kevin Wolf
  3 siblings, 0 replies; 14+ messages in thread
From: Kevin Wolf @ 2019-07-25 16:27 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, den, qemu-devel, mreitz, dplotnikov

From: Max Reitz <mreitz@redhat.com>

Currently, bdrv_replace_child_noperm() undrains the parent until it is
completely undrained, then re-drains it after attaching the new child
node.

This is a problem with bdrv_drop_intermediate(): We want to keep the
whole subtree drained, including parents, while the operation is
under way.  bdrv_replace_child_noperm() breaks this by allowing every
parent to become unquiesced briefly, and then redraining it.

In fact, there is no reason why the parent should become unquiesced and
be allowed to submit requests to the new child node if that new node is
supposed to be kept drained.  So if anything, we have to drain the
parent before detaching the old child node.  Conversely, we have to
undrain it only after attaching the new child node.

Thus, change the whole drain algorithm here: calculate the number of
times we have to drain/undrain the parent before replacing the child
node, then drain it (if necessary), replace the child node, and finally
undrain it.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 49 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/block.c b/block.c
index cbd8da5f3b..e1595bd058 100644
--- a/block.c
+++ b/block.c
@@ -2238,13 +2238,27 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
                                       BlockDriverState *new_bs)
 {
     BlockDriverState *old_bs = child->bs;
-    int i;
+    int new_bs_quiesce_counter;
+    int drain_saldo;
 
     assert(!child->frozen);
 
     if (old_bs && new_bs) {
         assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
     }
+
+    new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
+    drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter;
+
+    /*
+     * If the new child node is drained but the old one was not, flush
+     * all outstanding requests to the old child node.
+     */
+    while (drain_saldo > 0 && child->role->drained_begin) {
+        bdrv_parent_drained_begin_single(child, true);
+        drain_saldo--;
+    }
+
     if (old_bs) {
         /* Detach first so that the recursive drain sections coming from @child
          * are already gone and we only end the drain sections that came from
@@ -2252,28 +2266,22 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
         if (child->role->detach) {
             child->role->detach(child);
         }
-        while (child->parent_quiesce_counter) {
-            bdrv_parent_drained_end_single(child);
-        }
         QLIST_REMOVE(child, next_parent);
-    } else {
-        assert(child->parent_quiesce_counter == 0);
     }
 
     child->bs = new_bs;
 
     if (new_bs) {
         QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
-        if (new_bs->quiesce_counter) {
-            int num = new_bs->quiesce_counter;
-            if (child->role->parent_is_bds) {
-                num -= bdrv_drain_all_count;
-            }
-            assert(num >= 0);
-            for (i = 0; i < num; i++) {
-                bdrv_parent_drained_begin_single(child, true);
-            }
-        }
+
+        /*
+         * Detaching the old node may have led to the new node's
+         * quiesce_counter having been decreased.  Not a problem, we
+         * just need to recognize this here and then invoke
+         * drained_end appropriately more often.
+         */
+        assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
+        drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
 
         /* Attach only after starting new drained sections, so that recursive
          * drain sections coming from @child don't get an extra .drained_begin
@@ -2282,6 +2290,15 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
             child->role->attach(child);
         }
     }
+
+    /*
+     * If the old child node was drained but the new one is not, allow
+     * requests to come in only after the new node has been attached.
+     */
+    while (drain_saldo < 0 && child->role->drained_end) {
+        bdrv_parent_drained_end_single(child);
+        drain_saldo++;
+    }
 }
 
 /*
-- 
2.20.1




* [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done
  2019-07-25 16:27 [Qemu-devel] [PATCH 0/4] block-backend: Queue requests while drained Kevin Wolf
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 1/4] block: Remove blk_pread_unthrottled() Kevin Wolf
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 2/4] block: Reduce (un)drains when replacing a child Kevin Wolf
@ 2019-07-25 16:27 ` Kevin Wolf
  2019-07-25 17:03   ` Eric Blake
  2019-07-26  9:52   ` Max Reitz
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained Kevin Wolf
  3 siblings, 2 replies; 14+ messages in thread
From: Kevin Wolf @ 2019-07-25 16:27 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, den, qemu-devel, mreitz, dplotnikov

Calling bdrv_drained_end() for target_bs can restarts requests too
early, so that they would execute on mirror_top_bs, which however has
already dropped all permissions.

Keep the target node drained until all graph changes have completed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/mirror.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 8cb75fb409..7483051f8d 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -644,6 +644,11 @@ static int mirror_exit_common(Job *job)
     bdrv_ref(mirror_top_bs);
     bdrv_ref(target_bs);
 
+    /* The mirror job has no requests in flight any more, but we need to
+     * drain potential other users of the BDS before changing the graph. */
+    assert(s->in_drain);
+    bdrv_drained_begin(target_bs);
+
     /* Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
      * inserting target_bs at s->to_replace, where we might not be able to get
      * these permissions.
@@ -684,12 +689,7 @@ static int mirror_exit_common(Job *job)
             bdrv_reopen_set_read_only(target_bs, ro, NULL);
         }
 
-        /* The mirror job has no requests in flight any more, but we need to
-         * drain potential other users of the BDS before changing the graph. */
-        assert(s->in_drain);
-        bdrv_drained_begin(target_bs);
         bdrv_replace_node(to_replace, target_bs, &local_err);
-        bdrv_drained_end(target_bs);
         if (local_err) {
             error_report_err(local_err);
             ret = -EPERM;
@@ -704,7 +704,6 @@ static int mirror_exit_common(Job *job)
         aio_context_release(replace_aio_context);
     }
     g_free(s->replaces);
-    bdrv_unref(target_bs);
 
     /*
      * Remove the mirror filter driver from the graph. Before this, get rid of
@@ -724,9 +723,12 @@ static int mirror_exit_common(Job *job)
     bs_opaque->job = NULL;
 
     bdrv_drained_end(src);
+    bdrv_drained_end(target_bs);
+
     s->in_drain = false;
     bdrv_unref(mirror_top_bs);
     bdrv_unref(src);
+    bdrv_unref(target_bs);
 
     return ret;
 }
-- 
2.20.1




* [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained
  2019-07-25 16:27 [Qemu-devel] [PATCH 0/4] block-backend: Queue requests while drained Kevin Wolf
                   ` (2 preceding siblings ...)
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done Kevin Wolf
@ 2019-07-25 16:27 ` Kevin Wolf
  2019-07-25 17:06   ` Eric Blake
  2019-07-26 10:50   ` Max Reitz
  3 siblings, 2 replies; 14+ messages in thread
From: Kevin Wolf @ 2019-07-25 16:27 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, den, qemu-devel, mreitz, dplotnikov

This fixes devices like IDE that can still start new requests from I/O
handlers in the CPU thread while the block backend is drained.

The basic assumption is that in a drain section, no new requests should
be allowed through a BlockBackend (blk_drained_begin/end don't exist,
we get drain sections only on the node level). However, there are two
special cases where requests should not be queued:

1. Block jobs: We already make sure that block jobs are paused in a
   drain section, so they won't start new requests. However, if the
   drain_begin is called on the job's BlockBackend first, it can happen
   that we deadlock because the job stays busy until it reaches a pause
   point - which it can't if it's requests aren't processed any more.

   The proper solution here would be to make all requests through the
   job's filter node instead of using a BlockBackend. For now, just
   disabling request queuin on the job BlockBackend is simpler.

2. In test cases where making requests through bdrv_* would be
   cumbersome because we'd need a BdrvChild. As we already got the
   functionality to disable request queuing from 1., use it in tests,
   too, for convenience.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/sysemu/block-backend.h | 11 +++---
 block/backup.c                 |  1 +
 block/block-backend.c          | 69 +++++++++++++++++++++++++++++-----
 block/commit.c                 |  2 +
 block/mirror.c                 |  6 ++-
 blockjob.c                     |  3 ++
 tests/test-bdrv-drain.c        |  1 +
 7 files changed, 76 insertions(+), 17 deletions(-)

diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 7320b58467..d453a4e9a1 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -104,6 +104,7 @@ void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm);
 
 void blk_set_allow_write_beyond_eof(BlockBackend *blk, bool allow);
 void blk_set_allow_aio_context_change(BlockBackend *blk, bool allow);
+void blk_set_disable_request_queuing(BlockBackend *blk, bool disable);
 void blk_iostatus_enable(BlockBackend *blk);
 bool blk_iostatus_is_enabled(const BlockBackend *blk);
 BlockDeviceIoStatus blk_iostatus(const BlockBackend *blk);
@@ -119,10 +120,10 @@ BlockBackend *blk_by_qdev_id(const char *id, Error **errp);
 void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
                                unsigned int bytes, QEMUIOVector *qiov,
-                               BdrvRequestFlags flags);
+                               BdrvRequestFlags flags, bool wait_while_drained);
 int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
-                               unsigned int bytes, QEMUIOVector *qiov,
-                               BdrvRequestFlags flags);
+                                unsigned int bytes, QEMUIOVector *qiov,
+                                BdrvRequestFlags flags, bool wait_while_drained);
 
 static inline int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset,
                                             unsigned int bytes, void *buf,
@@ -130,7 +131,7 @@ static inline int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset,
 {
     QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 
-    return blk_co_preadv(blk, offset, bytes, &qiov, flags);
+    return blk_co_preadv(blk, offset, bytes, &qiov, flags, true);
 }
 
 static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
@@ -139,7 +140,7 @@ static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
 {
     QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 
-    return blk_co_pwritev(blk, offset, bytes, &qiov, flags);
+    return blk_co_pwritev(blk, offset, bytes, &qiov, flags, true);
 }
 
 int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
diff --git a/block/backup.c b/block/backup.c
index 715e1d3be8..f66b2f4ee7 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -635,6 +635,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     if (ret < 0) {
         goto error;
     }
+    blk_set_disable_request_queuing(job->target, true);
 
     job->on_source_error = on_source_error;
     job->on_target_error = on_target_error;
diff --git a/block/block-backend.c b/block/block-backend.c
index fdd6b01ecf..603b281743 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -79,6 +79,9 @@ struct BlockBackend {
     QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
 
     int quiesce_counter;
+    CoQueue queued_requests;
+    bool disable_request_queuing;
+
     VMChangeStateEntry *vmsh;
     bool force_allow_inactivate;
 
@@ -339,6 +342,7 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
 
     block_acct_init(&blk->stats);
 
+    qemu_co_queue_init(&blk->queued_requests);
     notifier_list_init(&blk->remove_bs_notifiers);
     notifier_list_init(&blk->insert_bs_notifiers);
     QLIST_INIT(&blk->aio_notifiers);
@@ -1096,6 +1100,11 @@ void blk_set_allow_aio_context_change(BlockBackend *blk, bool allow)
     blk->allow_aio_context_change = allow;
 }
 
+void blk_set_disable_request_queuing(BlockBackend *blk, bool disable)
+{
+    blk->disable_request_queuing = disable;
+}
+
 static int blk_check_byte_request(BlockBackend *blk, int64_t offset,
                                   size_t size)
 {
@@ -1127,13 +1136,26 @@ static int blk_check_byte_request(BlockBackend *blk, int64_t offset,
     return 0;
 }
 
+static void blk_wait_while_drained(BlockBackend *blk)
+{
+    if (blk->quiesce_counter && !blk->disable_request_queuing) {
+        qemu_co_queue_wait(&blk->queued_requests, NULL);
+    }
+}
+
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
                                unsigned int bytes, QEMUIOVector *qiov,
-                               BdrvRequestFlags flags)
+                               BdrvRequestFlags flags, bool wait_while_drained)
 {
     int ret;
-    BlockDriverState *bs = blk_bs(blk);
+    BlockDriverState *bs;
 
+    if (wait_while_drained) {
+        blk_wait_while_drained(blk);
+    }
+
+    /* Call blk_bs() only after waiting, the graph may have changed */
+    bs = blk_bs(blk);
     trace_blk_co_preadv(blk, bs, offset, bytes, flags);
 
     ret = blk_check_byte_request(blk, offset, bytes);
@@ -1156,11 +1178,17 @@ int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
 
 int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
                                 unsigned int bytes, QEMUIOVector *qiov,
-                                BdrvRequestFlags flags)
+                                BdrvRequestFlags flags, bool wait_while_drained)
 {
     int ret;
-    BlockDriverState *bs = blk_bs(blk);
+    BlockDriverState *bs;
+
+    if (wait_while_drained) {
+        blk_wait_while_drained(blk);
+    }
 
+    /* Call blk_bs() only after waiting, the graph may have changed */
+    bs = blk_bs(blk);
     trace_blk_co_pwritev(blk, bs, offset, bytes, flags);
 
     ret = blk_check_byte_request(blk, offset, bytes);
@@ -1198,7 +1226,7 @@ static void blk_read_entry(void *opaque)
     QEMUIOVector *qiov = rwco->iobuf;
 
     rwco->ret = blk_co_preadv(rwco->blk, rwco->offset, qiov->size,
-                              qiov, rwco->flags);
+                              qiov, rwco->flags, true);
     aio_wait_kick();
 }
 
@@ -1208,7 +1236,7 @@ static void blk_write_entry(void *opaque)
     QEMUIOVector *qiov = rwco->iobuf;
 
     rwco->ret = blk_co_pwritev(rwco->blk, rwco->offset, qiov->size,
-                               qiov, rwco->flags);
+                               qiov, rwco->flags, true);
     aio_wait_kick();
 }
 
@@ -1349,9 +1377,15 @@ static void blk_aio_read_entry(void *opaque)
     BlkRwCo *rwco = &acb->rwco;
     QEMUIOVector *qiov = rwco->iobuf;
 
+    if (rwco->blk->quiesce_counter) {
+        blk_dec_in_flight(rwco->blk);
+        blk_wait_while_drained(rwco->blk);
+        blk_inc_in_flight(rwco->blk);
+    }
+
     assert(qiov->size == acb->bytes);
     rwco->ret = blk_co_preadv(rwco->blk, rwco->offset, acb->bytes,
-                              qiov, rwco->flags);
+                              qiov, rwco->flags, false);
     blk_aio_complete(acb);
 }
 
@@ -1361,9 +1395,15 @@ static void blk_aio_write_entry(void *opaque)
     BlkRwCo *rwco = &acb->rwco;
     QEMUIOVector *qiov = rwco->iobuf;
 
+    if (rwco->blk->quiesce_counter) {
+        blk_dec_in_flight(rwco->blk);
+        blk_wait_while_drained(rwco->blk);
+        blk_inc_in_flight(rwco->blk);
+    }
+
     assert(!qiov || qiov->size == acb->bytes);
     rwco->ret = blk_co_pwritev(rwco->blk, rwco->offset, acb->bytes,
-                               qiov, rwco->flags);
+                               qiov, rwco->flags, false);
     blk_aio_complete(acb);
 }
 
@@ -1482,6 +1522,8 @@ void blk_aio_cancel_async(BlockAIOCB *acb)
 
 int blk_co_ioctl(BlockBackend *blk, unsigned long int req, void *buf)
 {
+    blk_wait_while_drained(blk);
+
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
@@ -1522,7 +1564,11 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
 
 int blk_co_pdiscard(BlockBackend *blk, int64_t offset, int bytes)
 {
-    int ret = blk_check_byte_request(blk, offset, bytes);
+    int ret;
+
+    blk_wait_while_drained(blk);
+
+    ret = blk_check_byte_request(blk, offset, bytes);
     if (ret < 0) {
         return ret;
     }
@@ -2004,7 +2050,7 @@ int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                                       int bytes, BdrvRequestFlags flags)
 {
     return blk_co_pwritev(blk, offset, bytes, NULL,
-                          flags | BDRV_REQ_ZERO_WRITE);
+                          flags | BDRV_REQ_ZERO_WRITE, true);
 }
 
 int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, const void *buf,
@@ -2232,6 +2278,9 @@ static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
         if (blk->dev_ops && blk->dev_ops->drained_end) {
             blk->dev_ops->drained_end(blk->dev_opaque);
         }
+        while (qemu_co_enter_next(&blk->queued_requests, NULL)) {
+            /* Resume all queued requests */
+        }
     }
 }
 
diff --git a/block/commit.c b/block/commit.c
index 2c5a6d4ebc..408ae15389 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -350,6 +350,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
     if (ret < 0) {
         goto fail;
     }
+    blk_set_disable_request_queuing(s->base, true);
     s->base_bs = base;
 
     /* Required permissions are already taken with block_job_add_bdrv() */
@@ -358,6 +359,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
     if (ret < 0) {
         goto fail;
     }
+    blk_set_disable_request_queuing(s->top, true);
 
     s->backing_file_str = g_strdup(backing_file_str);
     s->on_error = on_error;
diff --git a/block/mirror.c b/block/mirror.c
index 7483051f8d..8d0a3a987d 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -231,7 +231,8 @@ static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
         return;
     }
 
-    ret = blk_co_pwritev(s->target, op->offset, op->qiov.size, &op->qiov, 0);
+    ret = blk_co_pwritev(s->target, op->offset, op->qiov.size, &op->qiov, 0,
+                         false);
     mirror_write_complete(op, ret);
 }
 
@@ -1237,7 +1238,7 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
         switch (method) {
         case MIRROR_METHOD_COPY:
             ret = blk_co_pwritev(job->target, dirty_offset, dirty_bytes,
-                                 qiov ? &target_qiov : NULL, flags);
+                                 qiov ? &target_qiov : NULL, flags, false);
             break;
 
         case MIRROR_METHOD_ZERO:
@@ -1624,6 +1625,7 @@ static BlockJob *mirror_start_job(
         blk_set_force_allow_inactivate(s->target);
     }
     blk_set_allow_aio_context_change(s->target, true);
+    blk_set_disable_request_queuing(s->target, true);
 
     s->replaces = g_strdup(replaces);
     s->on_source_error = on_source_error;
diff --git a/blockjob.c b/blockjob.c
index 20b7f557da..73d9f1ba2b 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -445,6 +445,9 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
 
     bdrv_op_unblock(bs, BLOCK_OP_TYPE_DATAPLANE, job->blocker);
 
+    /* Disable request queuing in the BlockBackend to avoid deadlocks on drain:
+     * The job reports that it's busy until it reaches a pause point. */
+    blk_set_disable_request_queuing(blk, true);
     blk_set_allow_aio_context_change(blk, true);
 
     /* Only set speed when necessary to avoid NotSupported error */
diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 03fa1142a1..3fcf7c1c95 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -677,6 +677,7 @@ static void test_iothread_common(enum drain_type drain_type, int drain_thread)
                               &error_abort);
     s = bs->opaque;
     blk_insert_bs(blk, bs, &error_abort);
+    blk_set_disable_request_queuing(blk, true);
 
     blk_set_aio_context(blk, ctx_a, &error_abort);
     aio_context_acquire(ctx_a);
-- 
2.20.1




* Re: [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done Kevin Wolf
@ 2019-07-25 17:03   ` Eric Blake
  2019-07-26  9:52   ` Max Reitz
  1 sibling, 0 replies; 14+ messages in thread
From: Eric Blake @ 2019-07-25 17:03 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: dplotnikov, vsementsov, den, mreitz, qemu-devel



On 7/25/19 11:27 AM, Kevin Wolf wrote:
> Calling bdrv_drained_end() for target_bs can restarts requests too

restart

> early, so that they would execute on mirror_top_bs, which however has
> already dropped all permissions.
> 
> Keep the target node drained until all graph changes have completed.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/mirror.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 8cb75fb409..7483051f8d 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -644,6 +644,11 @@ static int mirror_exit_common(Job *job)
>      bdrv_ref(mirror_top_bs);
>      bdrv_ref(target_bs);
>  
> +    /* The mirror job has no requests in flight any more, but we need to
> +     * drain potential other users of the BDS before changing the graph. */

Is checkpatch going to gripe about your comment style,

> +    assert(s->in_drain);
> +    bdrv_drained_begin(target_bs);
> +
>      /* Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
>       * inserting target_bs at s->to_replace, where we might not be able to get
>       * these permissions.
> @@ -684,12 +689,7 @@ static int mirror_exit_common(Job *job)
>              bdrv_reopen_set_read_only(target_bs, ro, NULL);
>          }
>  
> -        /* The mirror job has no requests in flight any more, but we need to
> -         * drain potential other users of the BDS before changing the graph. */

even though it is just code motion?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained Kevin Wolf
@ 2019-07-25 17:06   ` Eric Blake
  2019-07-26 10:50   ` Max Reitz
  1 sibling, 0 replies; 14+ messages in thread
From: Eric Blake @ 2019-07-25 17:06 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: dplotnikov, vsementsov, den, mreitz, qemu-devel



On 7/25/19 11:27 AM, Kevin Wolf wrote:
> This fixes devices like IDE that can still start new requests from I/O
> handlers in the CPU thread while the block backend is drained.
> 
> The basic assumption is that in a drain section, no new requests should
> be allowed through a BlockBackend (blk_drained_begin/end don't exist,
> we get drain sections only on the node level). However, there are two
> special cases where requests should not be queued:
> 
> 1. Block jobs: We already make sure that block jobs are paused in a
>    drain section, so they won't start new requests. However, if the
>    drain_begin is called on the job's BlockBackend first, it can happen
>    that we deadlock because the job stays busy until it reaches a pause
>    point - which it can't if it's requests aren't processed any more.

its (remember, "it's" is only okay if "it is" works as well)

> 
>    The proper solution here would be to make all requests through the
>    job's filter node instead of using a BlockBackend. For now, just
>    disabling request queuin on the job BlockBackend is simpler.

queuing

> 
> 2. In test cases where making requests through bdrv_* would be
>    cumbersome because we'd need a BdrvChild. As we already got the
>    functionality to disable request queuing from 1., use it in tests,
>    too, for convenience.
> 
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [Qemu-devel] [PATCH 1/4] block: Remove blk_pread_unthrottled()
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 1/4] block: Remove blk_pread_unthrottled() Kevin Wolf
@ 2019-07-26  9:18   ` Max Reitz
  0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2019-07-26  9:18 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: dplotnikov, vsementsov, den, qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 1604 bytes --]

On 25.07.19 18:27, Kevin Wolf wrote:
> The functionality offered by blk_pread_unthrottled() goes back to commit
> 498e386c584. Then, we couldn't perform I/O throttling with synchronous
> requests because timers wouldn't be executed in polling loops. So the
> commit automatically disabled I/O throttling as soon as a synchronous
> request was issued.
> 
> However, for geometry detection during disk initialisation, we always
> used (and still use) synchronous requests even if guest requests use AIO
> later. Geometry detection was not supposed to disable I/O throttling, so
> blk_pread_unthrottled() was introduced, which disabled throttling only
> temporarily.
> 
> All of this isn't necessary any more because we do run timers in polling
> loops and even synchronous requests are now using coroutine
> infrastructure internally. For this reason, commit 90c78624f already
> removed the automatic disabling of I/O throttling.
> 
> It's time to get rid of the workaround for the removed code, and its
> abuse of blk_root_drained_begin()/end(), as well.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/sysemu/block-backend.h |  2 --
>  block/block-backend.c          | 16 ----------------
>  hw/block/hd-geometry.c         |  7 +------
>  3 files changed, 1 insertion(+), 24 deletions(-)

It took me a bit of git blaming to find out more about the history of
timer execution (and finally arrived at
https://lists.nongnu.org/archive/html/qemu-devel/2013-08/msg03060.html),
but now I’m reasonably confident.

Reviewed-by: Max Reitz <mreitz@redhat.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done Kevin Wolf
  2019-07-25 17:03   ` Eric Blake
@ 2019-07-26  9:52   ` Max Reitz
  2019-07-26 11:36     ` Kevin Wolf
  1 sibling, 1 reply; 14+ messages in thread
From: Max Reitz @ 2019-07-26  9:52 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: dplotnikov, vsementsov, den, qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 3966 bytes --]

On 25.07.19 18:27, Kevin Wolf wrote:
> Calling bdrv_drained_end() for target_bs can restart requests too
> early, so that they would execute on mirror_top_bs, which however has
> already dropped all permissions.
> 
> Keep the target node drained until all graph changes have completed.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/mirror.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 8cb75fb409..7483051f8d 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -644,6 +644,11 @@ static int mirror_exit_common(Job *job)
>      bdrv_ref(mirror_top_bs);
>      bdrv_ref(target_bs);
>  
> +    /* The mirror job has no requests in flight any more, but we need to
> +     * drain potential other users of the BDS before changing the graph. */
> +    assert(s->in_drain);
> +    bdrv_drained_begin(target_bs);
> +

In contrast to what Eric said, I think it is a problem that this is just
code motion.

The comment doesn’t tell the reason why the target needs to be drained
here.  Other users of the BDS have their own BdrvChild and thus their
own permissions, their requests do not go through mirror.

So in addition to why the target needs to be drained around
bdrv_replace_node(), the comment should tell why we need to drain it
here, like the commit message does.

Now, the thing is, I don’t quite understand the connection between the
target and mirror_top_bs that the commit message wants to establish.

I see the following problem:
(1) We drain src (at the end of mirror_run()).
(2) This implicitly drains mirror_top_bs.
(3) We drain target.
(4) bdrv_replace_node() replaces src by target, thus replacing the drain
    on mirror_top_bs from src by the one from target.
(5) We undrain target, thus also undraining mirror_top_bs.
(6) After all is done, we undrain src, which has no effect on
    mirror_top_bs, because they haven’t been connected since (4).

I suppose (5) is the problem.  This patch moves it down to (6), so
mirror_top_bs is drained as long as src is drained.

(If to_replace is not src, then src will stay attached, which keeps
mirror_top_bs drained, too.)

This makes it seem to me like the actually important thing is to drain
mirror_top_bs, not target.  If so, it would seem more obvious to me to
just add a drain on mirror_top_bs than to move the existing target drain.

>      /* Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
>       * inserting target_bs at s->to_replace, where we might not be able to get
>       * these permissions.
> @@ -684,12 +689,7 @@ static int mirror_exit_common(Job *job)
>              bdrv_reopen_set_read_only(target_bs, ro, NULL);
>          }
>  
> -        /* The mirror job has no requests in flight any more, but we need to
> -         * drain potential other users of the BDS before changing the graph. */
> -        assert(s->in_drain);
> -        bdrv_drained_begin(target_bs);

By the way, don’t we need to drain to_replace also?  In case it isn’t src?

Max

>          bdrv_replace_node(to_replace, target_bs, &local_err);
> -        bdrv_drained_end(target_bs);
>          if (local_err) {
>              error_report_err(local_err);
>              ret = -EPERM;
> @@ -704,7 +704,6 @@ static int mirror_exit_common(Job *job)
>          aio_context_release(replace_aio_context);
>      }
>      g_free(s->replaces);
> -    bdrv_unref(target_bs);
>  
>      /*
>       * Remove the mirror filter driver from the graph. Before this, get rid of
> @@ -724,9 +723,12 @@ static int mirror_exit_common(Job *job)
>      bs_opaque->job = NULL;
>  
>      bdrv_drained_end(src);
> +    bdrv_drained_end(target_bs);
> +
>      s->in_drain = false;
>      bdrv_unref(mirror_top_bs);
>      bdrv_unref(src);
> +    bdrv_unref(target_bs);
>  
>      return ret;
>  }
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained
  2019-07-25 16:27 ` [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained Kevin Wolf
  2019-07-25 17:06   ` Eric Blake
@ 2019-07-26 10:50   ` Max Reitz
  2019-07-26 11:49     ` Kevin Wolf
  1 sibling, 1 reply; 14+ messages in thread
From: Max Reitz @ 2019-07-26 10:50 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: dplotnikov, vsementsov, den, qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 4158 bytes --]

On 25.07.19 18:27, Kevin Wolf wrote:
> This fixes device like IDE that can still start new requests from I/O

*devices

> handlers in the CPU thread while the block backend is drained.
> 
> The basic assumption is that in a drain section, no new requests should
> be allowed through a BlockBackend (blk_drained_begin/end don't exist,
> we get drain sections only on the node level). However, there are two
> special cases where requests should not be queued:
> 
> 1. Block jobs: We already make sure that block jobs are paused in a
>    drain section, so they won't start new requests. However, if the
>    drain_begin is called on the job's BlockBackend first, it can happen
>    that we deadlock because the job stays busy until it reaches a pause
>    point - which it can't if it's requests aren't processed any more.
> 
>    The proper solution here would be to make all requests through the
>    job's filter node instead of using a BlockBackend. For now, just
>    disabling request queuin on the job BlockBackend is simpler.

Yep, seems reasonable.

(We’d need a relationship that a BB is owned by some job, and then pause
the job when the BB is drained, I suppose.  But that’s exactly
accomplished by not making the job use a BB, but its BdrvChild
references instead.)

> 2. In test cases where making requests through bdrv_* would be
>    cumbersome because we'd need a BdrvChild. As we already got the
>    functionality to disable request queuing from 1., use it in tests,
>    too, for convenience.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/sysemu/block-backend.h | 11 +++---
>  block/backup.c                 |  1 +
>  block/block-backend.c          | 69 +++++++++++++++++++++++++++++-----
>  block/commit.c                 |  2 +
>  block/mirror.c                 |  6 ++-
>  blockjob.c                     |  3 ++
>  tests/test-bdrv-drain.c        |  1 +
>  7 files changed, 76 insertions(+), 17 deletions(-)

[...]

> diff --git a/block/block-backend.c b/block/block-backend.c
> index fdd6b01ecf..603b281743 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c

[...]

> @@ -1127,13 +1136,26 @@ static int blk_check_byte_request(BlockBackend *blk, int64_t offset,
>      return 0;
>  }
>  
> +static void blk_wait_while_drained(BlockBackend *blk)

+coroutine_fn?  (Maybe even blk_co_wait...)

> +{
> +    if (blk->quiesce_counter && !blk->disable_request_queuing) {
> +        qemu_co_queue_wait(&blk->queued_requests, NULL);
> +    }
> +}
> +
>  int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
>                                 unsigned int bytes, QEMUIOVector *qiov,
> -                               BdrvRequestFlags flags)
> +                               BdrvRequestFlags flags, bool wait_while_drained)

What’s the purpose of this parameter?  How would it hurt to always
wait_while_drained?

I see the following callers of blk_co_p{read,write}v() that call it with
wait_while_drained=false:

1. blk_aio_{read,write}_entry(): They wait themselves, so they don’t
   need these functions to wait.  But OTOH, because they have waited, we
   know that the BB is not quiesced here, so we won’t wait here anyway.
   (These functions should be coroutine_fn, too, by the way)

2. mirror: It disables request queuing anyway, so wait_while_drained
   doesn’t have any effect.

>  {
>      int ret;
> -    BlockDriverState *bs = blk_bs(blk);
> +    BlockDriverState *bs;
>  
> +    if (wait_while_drained) {
> +        blk_wait_while_drained(blk);
> +    }

[...]

What about blk_co_flush()?  Should that wait, too?

> @@ -2232,6 +2278,9 @@ static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
>          if (blk->dev_ops && blk->dev_ops->drained_end) {
>              blk->dev_ops->drained_end(blk->dev_opaque);
>          }
> +        while (qemu_co_enter_next(&blk->queued_requests, NULL)) {
> +            /* Resume all queued requests */
> +        }

Wouldn’t qemu_co_queue_restart_all(&blk->queued_requests) achieve the same?

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done
  2019-07-26  9:52   ` Max Reitz
@ 2019-07-26 11:36     ` Kevin Wolf
  2019-07-26 12:30       ` Max Reitz
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2019-07-26 11:36 UTC (permalink / raw)
  To: Max Reitz; +Cc: dplotnikov, vsementsov, den, qemu-block, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4001 bytes --]

Am 26.07.2019 um 11:52 hat Max Reitz geschrieben:
> On 25.07.19 18:27, Kevin Wolf wrote:
> > Calling bdrv_drained_end() for target_bs can restart requests too
> > early, so that they would execute on mirror_top_bs, which however has
> > already dropped all permissions.
> > 
> > Keep the target node drained until all graph changes have completed.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block/mirror.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block/mirror.c b/block/mirror.c
> > index 8cb75fb409..7483051f8d 100644
> > --- a/block/mirror.c
> > +++ b/block/mirror.c
> > @@ -644,6 +644,11 @@ static int mirror_exit_common(Job *job)
> >      bdrv_ref(mirror_top_bs);
> >      bdrv_ref(target_bs);
> >  
> > +    /* The mirror job has no requests in flight any more, but we need to
> > +     * drain potential other users of the BDS before changing the graph. */
> > +    assert(s->in_drain);
> > +    bdrv_drained_begin(target_bs);
> > +
> 
> In contrast to what Eric said, I think it is a problem that this is just
> code motion.
> 
> The comment doesn’t tell the reason why the target needs to be drained
> here.  Other users of the BDS have their own BdrvChild and thus their
> own permissions, their requests do not go through mirror.
> 
> So in addition to why the target needs to be drained around
> bdrv_replace_node(), the comment should tell why we need to drain it
> here, like the commit message does.
> 
> Now, the thing is, I don’t quite understand the connection between the
> target and mirror_top_bs that the commit message wants to establish.
> 
> I see the following problem:
> (1) We drain src (at the end of mirror_run()).
> (2) This implicitly drains mirror_top_bs.
> (3) We drain target.
> (4) bdrv_replace_node() replaces src by target, thus replacing the drain
>     on mirror_top_bs from src by the one from target.
> (5) We undrain target, thus also undraining mirror_top_bs.

(5.5) Remove mirror_top_bs from the target chain

> (6) After all is done, we undrain src, which has no effect on
>     mirror_top_bs, because they haven’t been connected since (4).
> 
> I suppose (5) is the problem.  This patch moves it down to (6), so
> mirror_top_bs is drained as long as src is drained.

The problem is that (5) happens before (5.5), so we can start requests
on a node that we're about to remove (without draining it again before).

> (If to_replace is not src, then src will stay attached, which keeps
> mirror_top_bs drained, too.)
> 
> This makes it seem to me like the actually important thing is to drain
> mirror_top_bs, not target.  If so, it would seem more obvious to me to
> just add a drain on mirror_top_bs than to move the existing target drain.

Do you really think having a third drained section makes things easier
to understand? Draining both source and target while we're modifying the
graph seems pretty intuitive to me - which is also why I moved the
bdrv_drained_begin() to the very start instead of looking for the first
operation that actually strictly needs it.

> >      /* Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
> >       * inserting target_bs at s->to_replace, where we might not be able to get
> >       * these permissions.
> > @@ -684,12 +689,7 @@ static int mirror_exit_common(Job *job)
> >              bdrv_reopen_set_read_only(target_bs, ro, NULL);
> >          }
> >  
> > -        /* The mirror job has no requests in flight any more, but we need to
> > -         * drain potential other users of the BDS before changing the graph. */
> > -        assert(s->in_drain);
> > -        bdrv_drained_begin(target_bs);
> 
> By the way, don’t we need to drain to_replace also?  In case it isn’t src?

I think to_replace is required to be in the subtree of src, no?

Though maybe it could have another parent, so you might be right.

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained
  2019-07-26 10:50   ` Max Reitz
@ 2019-07-26 11:49     ` Kevin Wolf
  2019-07-26 12:34       ` Max Reitz
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2019-07-26 11:49 UTC (permalink / raw)
  To: Max Reitz; +Cc: dplotnikov, vsementsov, den, qemu-block, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5304 bytes --]

Am 26.07.2019 um 12:50 hat Max Reitz geschrieben:
> On 25.07.19 18:27, Kevin Wolf wrote:
> > This fixes device like IDE that can still start new requests from I/O
> 
> *devices
> 
> > handlers in the CPU thread while the block backend is drained.
> > 
> > The basic assumption is that in a drain section, no new requests should
> > be allowed through a BlockBackend (blk_drained_begin/end don't exist,
> > we get drain sections only on the node level). However, there are two
> > special cases where requests should not be queued:
> > 
> > 1. Block jobs: We already make sure that block jobs are paused in a
> >    drain section, so they won't start new requests. However, if the
> >    drain_begin is called on the job's BlockBackend first, it can happen
> >    that we deadlock because the job stays busy until it reaches a pause
> >    point - which it can't if it's requests aren't processed any more.
> > 
> >    The proper solution here would be to make all requests through the
> >    job's filter node instead of using a BlockBackend. For now, just
> >    disabling request queuin on the job BlockBackend is simpler.
> 
> Yep, seems reasonable.
> 
> (We’d need a relationship that a BB is owned by some job, and then pause
> the job when the BB is drained, I suppose.  But that’s exactly
> accomplished by not making the job use a BB, but its BdrvChild
> references instead.)

We actually had this before commit ad90feba, when we changed it to use
the job's BdrvChild objects instead. All block jobs have both currently,
they just don't use their BdrvChild objects much.

> > 2. In test cases where making requests through bdrv_* would be
> >    cumbersome because we'd need a BdrvChild. As we already got the
> >    functionality to disable request queuing from 1., use it in tests,
> >    too, for convenience.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  include/sysemu/block-backend.h | 11 +++---
> >  block/backup.c                 |  1 +
> >  block/block-backend.c          | 69 +++++++++++++++++++++++++++++-----
> >  block/commit.c                 |  2 +
> >  block/mirror.c                 |  6 ++-
> >  blockjob.c                     |  3 ++
> >  tests/test-bdrv-drain.c        |  1 +
> >  7 files changed, 76 insertions(+), 17 deletions(-)
> 
> [...]
> 
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index fdd6b01ecf..603b281743 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> 
> [...]
> 
> > @@ -1127,13 +1136,26 @@ static int blk_check_byte_request(BlockBackend *blk, int64_t offset,
> >      return 0;
> >  }
> >  
> > +static void blk_wait_while_drained(BlockBackend *blk)
> 
> +coroutine_fn?  (Maybe even blk_co_wait...)
> 
> > +{
> > +    if (blk->quiesce_counter && !blk->disable_request_queuing) {
> > +        qemu_co_queue_wait(&blk->queued_requests, NULL);
> > +    }
> > +}
> > +
> >  int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
> >                                 unsigned int bytes, QEMUIOVector *qiov,
> > -                               BdrvRequestFlags flags)
> > +                               BdrvRequestFlags flags, bool wait_while_drained)
> 
> What’s the purpose of this parameter?  How would it hurt to always
> wait_while_drained?
> 
> I see the following callers of blk_co_p{read,write}v() that call it with
> wait_while_drained=false:
> 
> 1. blk_aio_{read,write}_entry(): They wait themselves, so they don’t
>    need these functions to wait.  But OTOH, because they have waited, we
>    know that the BB is not quiesced here, so we won’t wait here anyway.
>    (These functions should be coroutine_fn, too, by the way)

I think I was worried that the coroutine might yield between the two
places. Later I noticed that blk_wait_while_drained() must be the very
first thing anyway, so maybe it doesn't matter any more now.

If we did yield here for requests coming from blk_aio_prwv(), in_flight
would be increased and drain would deadlock.

Would you prefer if I just unconditionally wait if we're drained?

> 2. mirror: It disables request queuing anyway, so wait_while_drained
>    doesn’t have any effect.

Yes, I wasn't sure what to use there. false seemed like it would be
less likely to cause misunderstandings because it just repeats what
would happen anyway.

> >  {
> >      int ret;
> > -    BlockDriverState *bs = blk_bs(blk);
> > +    BlockDriverState *bs;
> >  
> > +    if (wait_while_drained) {
> > +        blk_wait_while_drained(blk);
> > +    }
> 
> [...]
> 
> What about blk_co_flush()?  Should that wait, too?

Hm, probably, yes.

> > @@ -2232,6 +2278,9 @@ static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
> >          if (blk->dev_ops && blk->dev_ops->drained_end) {
> >              blk->dev_ops->drained_end(blk->dev_opaque);
> >          }
> > +        while (qemu_co_enter_next(&blk->queued_requests, NULL)) {
> > +            /* Resume all queued requests */
> > +        }
> 
> Wouldn’t qemu_co_queue_restart_all(&blk->queued_requests) achieve the same?

It would fail an assertion because we're not in coroutine context.
(Guess what my first attempt was!)

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done
  2019-07-26 11:36     ` Kevin Wolf
@ 2019-07-26 12:30       ` Max Reitz
  0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2019-07-26 12:30 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: dplotnikov, vsementsov, den, qemu-block, qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 5280 bytes --]

On 26.07.19 13:36, Kevin Wolf wrote:
> Am 26.07.2019 um 11:52 hat Max Reitz geschrieben:
>> On 25.07.19 18:27, Kevin Wolf wrote:
>>> Calling bdrv_drained_end() for target_bs can restart requests too
>>> early, so that they would execute on mirror_top_bs, which however has
>>> already dropped all permissions.
>>>
>>> Keep the target node drained until all graph changes have completed.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>  block/mirror.c | 14 ++++++++------
>>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/block/mirror.c b/block/mirror.c
>>> index 8cb75fb409..7483051f8d 100644
>>> --- a/block/mirror.c
>>> +++ b/block/mirror.c
>>> @@ -644,6 +644,11 @@ static int mirror_exit_common(Job *job)
>>>      bdrv_ref(mirror_top_bs);
>>>      bdrv_ref(target_bs);
>>>  
>>> +    /* The mirror job has no requests in flight any more, but we need to
>>> +     * drain potential other users of the BDS before changing the graph. */
>>> +    assert(s->in_drain);
>>> +    bdrv_drained_begin(target_bs);
>>> +
>>
>> In contrast to what Eric said, I think it is a problem that this is just
>> code motion.
>>
>> The comment doesn’t tell the reason why the target needs to be drained
>> here.  Other users of the BDS have their own BdrvChild and thus their
>> own permissions, their requests do not go through mirror.
>>
>> So in addition to why the target needs to be drained around
>> bdrv_replace_node(), the comment should tell why we need to drain it
>> here, like the commit message does.
>>
>> Now, the thing is, I don’t quite understand the connection between the
>> target and mirror_top_bs that the commit message wants to establish.
>>
>> I see the following problem:
>> (1) We drain src (at the end of mirror_run()).
>> (2) This implicitly drains mirror_top_bs.
>> (3) We drain target.
>> (4) bdrv_replace_node() replaces src by target, thus replacing the drain
>>     on mirror_top_bs from src by the one from target.
>> (5) We undrain target, thus also undraining mirror_top_bs.
> 
> (5.5) Remove mirror_top_bs from the target chain
> 
>> (6) After all is done, we undrain src, which has no effect on
>>     mirror_top_bs, because they haven’t been connected since (4).
>>
>> I suppose (5) is the problem.  This patch moves it down to (6), so
>> mirror_top_bs is drained as long as src is drained.
> 
> The problem is that (5) happens before (5.5), so we can start requests
> on a node that we're about to remove (without draining it again before).

Well, yes.  I generally put that under the idea of “We set
bs_opaque->stop, so we shouldn’t issue any further requests” (which I
find implied by “has already dropped all permissions” in your commit
message).

>> (If to_replace is not src, then src will stay attached, which keeps
>> mirror_top_bs drained, too.)
>>
>> This makes it seem to me like the actually important thing is to drain
>> mirror_top_bs, not target.  If so, it would seem more obvious to me to
>> just add a drain on mirror_top_bs than to move the existing target drain.
> 
> Do you really think having a third drained section makes things easier
> to understand?

Yes, I do.  It makes immediate sense because of the bs_opaque->stop
concept.  As you explain yourself, mirror_top_bs dropped all
permissions, it mustn’t perform any further requests.  As such, it must
be drained.

>                Draining both source and target while we're modifying the
> graph seems pretty intuitive to me - which is also why I moved the
> bdrv_drained_begin() to the very start instead of looking for the first
> operation that actually strictly needs it.

The problem for me is that we don’t actually care about whether the
target is drained or not, do we?  Anyone can access it at basically any
point[1]; we don’t care.

The point is that mirror must not perform any further requests.  Thus it
should be mirror_top_bs that’s drained.

[1] Maybe not during bdrv_replace_node(), even though I don’t quite know
why.  Why do we care about other users of target accessing it while we
attach more parents to it?

>>>      /* Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
>>>       * inserting target_bs at s->to_replace, where we might not be able to get
>>>       * these permissions.
>>> @@ -684,12 +689,7 @@ static int mirror_exit_common(Job *job)
>>>              bdrv_reopen_set_read_only(target_bs, ro, NULL);
>>>          }
>>>  
>>> -        /* The mirror job has no requests in flight any more, but we need to
>>> -         * drain potential other users of the BDS before changing the graph. */
>>> -        assert(s->in_drain);
>>> -        bdrv_drained_begin(target_bs);
>>
>> By the way, don’t we need to drain to_replace also?  In case it isn’t src?
> 
> I think to_replace is required to be in the subtree of src, no?
> 
> Though maybe it could have another parent, so you might be right.

That might be broken, but there could be a throttle node between src and
to_replace.  Not sure whether draining src would drain that, too.

But we don’t, actually, because bdrv_replace_node() already takes care
of keeping @from drained.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained
  2019-07-26 11:49     ` Kevin Wolf
@ 2019-07-26 12:34       ` Max Reitz
  0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2019-07-26 12:34 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: dplotnikov, vsementsov, den, qemu-block, qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 5445 bytes --]

On 26.07.19 13:49, Kevin Wolf wrote:
> Am 26.07.2019 um 12:50 hat Max Reitz geschrieben:
>> On 25.07.19 18:27, Kevin Wolf wrote:
>>> This fixes device like IDE that can still start new requests from I/O
>>
>> *devices
>>
>>> handlers in the CPU thread while the block backend is drained.
>>>
>>> The basic assumption is that in a drain section, no new requests should
>>> be allowed through a BlockBackend (blk_drained_begin/end don't exist,
>>> we get drain sections only on the node level). However, there are two
>>> special cases where requests should not be queued:
>>>
>>> 1. Block jobs: We already make sure that block jobs are paused in a
>>>    drain section, so they won't start new requests. However, if the
>>>    drain_begin is called on the job's BlockBackend first, it can happen
>>>    that we deadlock because the job stays busy until it reaches a pause
>>>    point - which it can't if it's requests aren't processed any more.
>>>
>>>    The proper solution here would be to make all requests through the
>>>    job's filter node instead of using a BlockBackend. For now, just
>>>    disabling request queuin on the job BlockBackend is simpler.
>>
>> Yep, seems reasonable.
>>
>> (We’d need a relationship that a BB is owned by some job, and then pause
>> the job when the BB is drained, I suppose.  But that’s exactly
>> accomplished by not making the job use a BB, but its BdrvChild
>> references instead.)
> 
> We actually had this before commit ad90feba, when we changed it to use
> the job's BdrvChild objects instead. All block jobs have both currently,
> they just don't use their BdrvChild objects much.
> 
>>> 2. In test cases where making requests through bdrv_* would be
>>>    cumbersome because we'd need a BdrvChild. As we already got the
>>>    functionality to disable request queuing from 1., use it in tests,
>>>    too, for convenience.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>  include/sysemu/block-backend.h | 11 +++---
>>>  block/backup.c                 |  1 +
>>>  block/block-backend.c          | 69 +++++++++++++++++++++++++++++-----
>>>  block/commit.c                 |  2 +
>>>  block/mirror.c                 |  6 ++-
>>>  blockjob.c                     |  3 ++
>>>  tests/test-bdrv-drain.c        |  1 +
>>>  7 files changed, 76 insertions(+), 17 deletions(-)
>>
>> [...]
>>
>>> diff --git a/block/block-backend.c b/block/block-backend.c
>>> index fdd6b01ecf..603b281743 100644
>>> --- a/block/block-backend.c
>>> +++ b/block/block-backend.c
>>
>> [...]
>>
>>> @@ -1127,13 +1136,26 @@ static int blk_check_byte_request(BlockBackend *blk, int64_t offset,
>>>      return 0;
>>>  }
>>>  
>>> +static void blk_wait_while_drained(BlockBackend *blk)
>>
>> +coroutine_fn?  (Maybe even blk_co_wait...)
>>
>>> +{
>>> +    if (blk->quiesce_counter && !blk->disable_request_queuing) {
>>> +        qemu_co_queue_wait(&blk->queued_requests, NULL);
>>> +    }
>>> +}
>>> +
>>>  int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
>>>                                 unsigned int bytes, QEMUIOVector *qiov,
>>> -                               BdrvRequestFlags flags)
>>> +                               BdrvRequestFlags flags, bool wait_while_drained)
>>
>> What’s the purpose of this parameter?  How would it hurt to always
>> wait_while_drained?
>>
>> I see the following callers of blk_co_p{read,write}v() that call it with
>> wait_while_drained=false:
>>
>> 1. blk_aio_{read,write}_entry(): They wait themselves, so they don’t
>>    need these functions to wait.  But OTOH, because they have waited, we
>>    know that the BB is not quiesced here, so we won’t wait here anyway.
>>    (These functions should be coroutine_fn, too, by the way)
> 
> I think I was worried that the coroutine might yield between the two
> places. Later I noticed that blk_wait_while_drained() must be the very
> first thing anyway, so maybe it doesn't matter any more now.
> 
> If we did yield here for requests coming from blk_aio_prwv(), in_flight
> would be increased and drain would deadlock.
> 
> Would you prefer if I just unconditionally wait if we're drained?

I think I would, yes.

>> 2. mirror: It disables request queuing anyway, so wait_while_drained
>>    doesn’t have any effect.
> 
> Yes, I wasn't sure what to use there. false seemed like it would be
> less likely to cause misunderstandings because it just repeats what
> would happen anyway.
> 
>>>  {
>>>      int ret;
>>> -    BlockDriverState *bs = blk_bs(blk);
>>> +    BlockDriverState *bs;
>>>  
>>> +    if (wait_while_drained) {
>>> +        blk_wait_while_drained(blk);
>>> +    }
>>
>> [...]
>>
>> What about blk_co_flush()?  Should that wait, too?
> 
> Hm, probably, yes.
> 
>>> @@ -2232,6 +2278,9 @@ static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
>>>          if (blk->dev_ops && blk->dev_ops->drained_end) {
>>>              blk->dev_ops->drained_end(blk->dev_opaque);
>>>          }
>>> +        while (qemu_co_enter_next(&blk->queued_requests, NULL)) {
>>> +            /* Resume all queued requests */
>>> +        }
>>
>> Wouldn’t qemu_co_queue_restart_all(&blk->queued_requests) achieve the same?
> 
> It would fail an assertion because we're not in coroutine context.
> (Guess what my first attempt was!)

:-)

Max




Thread overview: 14+ messages
2019-07-25 16:27 [Qemu-devel] [PATCH 0/4] block-backend: Queue requests while drained Kevin Wolf
2019-07-25 16:27 ` [Qemu-devel] [PATCH 1/4] block: Remove blk_pread_unthrottled() Kevin Wolf
2019-07-26  9:18   ` Max Reitz
2019-07-25 16:27 ` [Qemu-devel] [PATCH 2/4] block: Reduce (un)drains when replacing a child Kevin Wolf
2019-07-25 16:27 ` [Qemu-devel] [PATCH 3/4] mirror: Keep target drained until graph changes are done Kevin Wolf
2019-07-25 17:03   ` Eric Blake
2019-07-26  9:52   ` Max Reitz
2019-07-26 11:36     ` Kevin Wolf
2019-07-26 12:30       ` Max Reitz
2019-07-25 16:27 ` [Qemu-devel] [PATCH 4/4] block-backend: Queue requests while drained Kevin Wolf
2019-07-25 17:06   ` Eric Blake
2019-07-26 10:50   ` Max Reitz
2019-07-26 11:49     ` Kevin Wolf
2019-07-26 12:34       ` Max Reitz
