* [PATCH 00/13] block: Simplify drain
@ 2022-11-08 12:37 Kevin Wolf
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
                   ` (14 more replies)
  0 siblings, 15 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

I'm aware that exactly nobody has been looking forward to a series with
this title, but it has to be. The way drain works means that we need to
poll in bdrv_replace_child_noperm() and that makes things rather messy
with Emanuele's multiqueue work because you must not poll while you hold
the graph lock.

The other reason why it has to be is that drain is way too complex and
there are too many different cases. Some simplification like this will
hopefully make it considerably more maintainable. The diffstat probably
tells you something, too.

There are roughly speaking three parts in this series:

1. Make BlockDriver.bdrv_drained_begin/end() non-coroutine_fn again,
   which allows us to not poll on bdrv_drained_end() any more.

2. Remove subtree drains. They are a considerable complication in the
   whole drain machinery (in particular, they require polling in the
   BdrvChildClass.attach/detach() callbacks that are called during
   bdrv_replace_child_noperm()) and none of their users actually has a
   good reason to use them.

3. Finally get rid of polling in bdrv_replace_child_noperm() by
   requiring that the child is already drained by the caller and calling
   callbacks only once and not again for every nested drain section.

If necessary, a prefix of this series covering only the first part, or
only the first two, can be merged on its own and would still make sense.
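For readers who haven't worked with the drain machinery, the core invariant all three parts preserve can be modelled in a few lines of plain C. This is a toy model with invented names, not QEMU's implementation: a node's quiesce counter goes up on drained_begin(), down on drained_end(), and new I/O is only allowed while it is zero, so drained sections nest.

```c
#include <assert.h>

/* Toy model of a node's quiesce counter.  Drained sections nest:
 * I/O is allowed again only once every toy_drained_begin() has been
 * paired with a toy_drained_end(). */
typedef struct ToyNode {
    int quiesce_counter;
} ToyNode;

static void toy_drained_begin(ToyNode *n)
{
    n->quiesce_counter++;
}

static void toy_drained_end(ToyNode *n)
{
    assert(n->quiesce_counter > 0);
    n->quiesce_counter--;
}

static int toy_io_allowed(ToyNode *n)
{
    return n->quiesce_counter == 0;
}
```

The complexity the series removes is not this counter itself, but the polling and recursion layered around it.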

Kevin Wolf (13):
  qed: Don't yield in bdrv_qed_co_drain_begin()
  test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  block: Remove drained_end_counter
  block: Inline bdrv_drain_invoke()
  block: Drain individual nodes during reopen
  block: Don't use subtree drains in bdrv_drop_intermediate()
  stream: Replace subtree drain with a single node drain
  block: Remove subtree drains
  block: Call drain callbacks only once
  block: Remove ignore_bds_parents parameter from drain functions
  block: Don't poll in bdrv_replace_child_noperm()
  block: Remove poll parameter from bdrv_parent_drained_begin_single()

 include/block/block-global-state.h |   3 +
 include/block/block-io.h           |  52 +---
 include/block/block_int-common.h   |  17 +-
 include/block/block_int-io.h       |  12 -
 block.c                            | 132 ++++++-----
 block/block-backend.c              |   4 +-
 block/io.c                         | 281 ++++------------------
 block/qed.c                        |  24 +-
 block/replication.c                |   6 -
 block/stream.c                     |  20 +-
 block/throttle.c                   |   6 +-
 blockdev.c                         |  13 -
 blockjob.c                         |   2 +-
 tests/unit/test-bdrv-drain.c       | 369 +++++++----------------------
 14 files changed, 270 insertions(+), 671 deletions(-)

-- 
2.38.1



^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09  9:21   ` Vladimir Sementsov-Ogievskiy
                     ` (4 more replies)
  2022-11-08 12:37 ` [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end() Kevin Wolf
                   ` (13 subsequent siblings)
  14 siblings, 5 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

We want to change .bdrv_co_drained_begin() back to be a non-coroutine
callback, so in preparation, avoid yielding in its implementation.

Because we increase bs->in_flight and bdrv_drained_begin() polls, the
behaviour is unchanged.
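The keep-alive pattern the patch relies on — bump bs->in_flight before entering the background coroutine so that the poll loop in bdrv_drained_begin() waits for it to finish — can be sketched with an ordinary callback queue standing in for coroutines. This is a toy model; all names are invented:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for bs->in_flight plus a one-slot scheduler: instead of
 * entering a coroutine we queue a callback, and "polling" runs the
 * queued work until in_flight drops back to zero. */
typedef void ToyFn(void *opaque);

typedef struct ToyCtx {
    int in_flight;
    ToyFn *pending_fn;      /* at most one pending task in this toy */
    void *pending_opaque;
} ToyCtx;

static void toy_schedule(ToyCtx *ctx, ToyFn *fn, void *opaque)
{
    ctx->in_flight++;       /* like bdrv_inc_in_flight(bs) */
    ctx->pending_fn = fn;
    ctx->pending_opaque = opaque;
}

static void toy_poll_until_idle(ToyCtx *ctx)
{
    /* like the polling done by bdrv_drained_begin() */
    while (ctx->in_flight > 0) {
        ToyFn *fn = ctx->pending_fn;
        ctx->pending_fn = NULL;
        fn(ctx->pending_opaque);
        ctx->in_flight--;   /* like bdrv_dec_in_flight(bs) */
    }
}

static int work_done;

static void toy_task(void *opaque)
{
    (void)opaque;
    work_done = 1;
}
```

The point of the counter is exactly what the commit message states: as long as it is non-zero, the drain poll cannot terminate, so the background work is guaranteed to complete within the drained section.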

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qed.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 2f36ad342c..013f826c44 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -282,9 +282,8 @@ static void coroutine_fn qed_unplug_allocating_write_reqs(BDRVQEDState *s)
     qemu_co_mutex_unlock(&s->table_lock);
 }
 
-static void coroutine_fn qed_need_check_timer_entry(void *opaque)
+static void coroutine_fn qed_need_check_timer(BDRVQEDState *s)
 {
-    BDRVQEDState *s = opaque;
     int ret;
 
     trace_qed_need_check_timer_cb(s);
@@ -310,9 +309,20 @@ static void coroutine_fn qed_need_check_timer_entry(void *opaque)
     (void) ret;
 }
 
+static void coroutine_fn qed_need_check_timer_entry(void *opaque)
+{
+    BDRVQEDState *s = opaque;
+
+    qed_need_check_timer(opaque);
+    bdrv_dec_in_flight(s->bs);
+}
+
 static void qed_need_check_timer_cb(void *opaque)
 {
+    BDRVQEDState *s = opaque;
     Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
+
+    bdrv_inc_in_flight(s->bs);
     qemu_coroutine_enter(co);
 }
 
@@ -363,8 +373,12 @@ static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
      * header is flushed.
      */
     if (s->need_check_timer && timer_pending(s->need_check_timer)) {
+        Coroutine *co;
+
         qed_cancel_need_check_timer(s);
-        qed_need_check_timer_entry(s);
+        co = qemu_coroutine_create(qed_need_check_timer_entry, s);
+        bdrv_inc_in_flight(bs);
+        aio_co_enter(bdrv_get_aio_context(bs), co);
     }
 }
 
-- 
2.38.1




* [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 10:50   ` Vladimir Sementsov-Ogievskiy
                     ` (3 more replies)
  2022-11-08 12:37 ` [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn Kevin Wolf
                   ` (12 subsequent siblings)
  14 siblings, 4 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
callbacks, so in preparation, avoid yielding in their implementation.

This does almost the same as the existing logic in bdrv_drain_invoke(),
by creating and entering coroutines internally. However, since the test
case is by far the heaviest user of coroutine code in drain callbacks,
it is preferable to have the complexity in the test case rather than the
drain core, which is already complicated enough without this.

The behaviour for bdrv_drain_begin() is unchanged because we increase
bs->in_flight and this is still polled. However, bdrv_drain_end() no
longer waits for the spawned coroutine to complete. This is fine; we
don't rely on bdrv_drain_end() restarting all operations immediately
before the next aio_poll().
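The asymmetry described above — begin still waits for queued work, while end merely schedules it and returns, leaving completion to a later poll — can be modelled as a deferred-work queue. This is schematic toy code, not the real QEMU mechanism; names are invented:

```c
#include <assert.h>

/* Toy deferred-work queue: the end side enqueues a callback but does
 * not run it; only an explicit poll (like aio_poll()) runs what is
 * pending.  This mirrors bdrv_drain_end() no longer waiting for the
 * coroutine it spawns. */
#define TOY_MAX_PENDING 8

typedef void ToyFn(void);

static ToyFn *toy_pending[TOY_MAX_PENDING];
static int toy_npending;

static void toy_defer(ToyFn *fn)
{
    assert(toy_npending < TOY_MAX_PENDING);
    toy_pending[toy_npending++] = fn;
}

static void toy_poll(void)
{
    while (toy_npending > 0) {
        ToyFn *fn = toy_pending[--toy_npending];
        fn();
    }
}

static int restarted;

static void toy_restart_requests(void)
{
    restarted = 1;
}

/* drain end in this model: schedule the restart, return immediately */
static void toy_drain_end(void)
{
    toy_defer(toy_restart_requests);
}
```

This is also why the later patches need explicit aio_poll() calls in the test case: an observer who wants to see the post-drain state must run the pending work first.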

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/unit/test-bdrv-drain.c | 64 ++++++++++++++++++++++++++----------
 1 file changed, 46 insertions(+), 18 deletions(-)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 09dc4a4891..24f34e24ad 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -38,12 +38,22 @@ typedef struct BDRVTestState {
     bool sleep_in_drain_begin;
 } BDRVTestState;
 
+static void coroutine_fn sleep_in_drain_begin(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+
+    qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
+    bdrv_dec_in_flight(bs);
+}
+
 static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
 {
     BDRVTestState *s = bs->opaque;
     s->drain_count++;
     if (s->sleep_in_drain_begin) {
-        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
+        Coroutine *co = qemu_coroutine_create(sleep_in_drain_begin, bs);
+        bdrv_inc_in_flight(bs);
+        aio_co_enter(bdrv_get_aio_context(bs), co);
     }
 }
 
@@ -1916,6 +1926,21 @@ static int coroutine_fn bdrv_replace_test_co_preadv(BlockDriverState *bs,
     return 0;
 }
 
+static void coroutine_fn bdrv_replace_test_drain_co(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BDRVReplaceTestState *s = bs->opaque;
+
+    /* Keep waking io_co up until it is done */
+    while (s->io_co) {
+        aio_co_wake(s->io_co);
+        s->io_co = NULL;
+        qemu_coroutine_yield();
+    }
+    s->drain_co = NULL;
+    bdrv_dec_in_flight(bs);
+}
+
 /**
  * If .drain_count is 0, wake up .io_co if there is one; and set
  * .was_drained.
@@ -1926,20 +1951,27 @@ static void coroutine_fn bdrv_replace_test_co_drain_begin(BlockDriverState *bs)
     BDRVReplaceTestState *s = bs->opaque;
 
     if (!s->drain_count) {
-        /* Keep waking io_co up until it is done */
-        s->drain_co = qemu_coroutine_self();
-        while (s->io_co) {
-            aio_co_wake(s->io_co);
-            s->io_co = NULL;
-            qemu_coroutine_yield();
-        }
-        s->drain_co = NULL;
-
+        s->drain_co = qemu_coroutine_create(bdrv_replace_test_drain_co, bs);
+        bdrv_inc_in_flight(bs);
+        aio_co_enter(bdrv_get_aio_context(bs), s->drain_co);
         s->was_drained = true;
     }
     s->drain_count++;
 }
 
+static void coroutine_fn bdrv_replace_test_read_entry(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    char data;
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
+    int ret;
+
+    /* Queue a read request post-drain */
+    ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
+    g_assert(ret >= 0);
+    bdrv_dec_in_flight(bs);
+}
+
 /**
  * Reduce .drain_count, set .was_undrained once it reaches 0.
  * If .drain_count reaches 0 and the node has a backing file, issue a
@@ -1951,17 +1983,13 @@ static void coroutine_fn bdrv_replace_test_co_drain_end(BlockDriverState *bs)
 
     g_assert(s->drain_count > 0);
     if (!--s->drain_count) {
-        int ret;
-
         s->was_undrained = true;
 
         if (bs->backing) {
-            char data;
-            QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
-
-            /* Queue a read request post-drain */
-            ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
-            g_assert(ret >= 0);
+            Coroutine *co = qemu_coroutine_create(bdrv_replace_test_read_entry,
+                                                  bs);
+            bdrv_inc_in_flight(bs);
+            aio_co_enter(bdrv_get_aio_context(bs), co);
         }
     }
 }
-- 
2.38.1




* [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
  2022-11-08 12:37 ` [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end() Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 14:29   ` Vladimir Sementsov-Ogievskiy
                     ` (3 more replies)
  2022-11-08 12:37 ` [PATCH 04/13] block: Remove drained_end_counter Kevin Wolf
                   ` (11 subsequent siblings)
  14 siblings, 4 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Polling during bdrv_drained_end() can be problematic (and in the future,
we may get cases for bdrv_drained_begin() where polling is forbidden,
and we don't care about already in-flight requests, but just want to
prevent new requests from arriving).

The fact that the .bdrv_drained_begin/end callbacks run in a coroutine
is the only reason why we have to do this polling, so make them
non-coroutine callbacks again. None of the callback implementations
actually yields any more.

This means that bdrv_drained_end() effectively doesn't poll any more,
even though the AIO_WAIT_WHILE() loops are still there (their condition
is false from the beginning). This is generally not a problem, but in
test-bdrv-drain, some additional explicit aio_poll() calls need to be
added because the test case wants to verify the final state after BHs
have executed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block_int-common.h | 10 ++++---
 block.c                          |  4 +--
 block/io.c                       | 49 +++++---------------------------
 block/qed.c                      |  4 +--
 block/throttle.c                 |  6 ++--
 tests/unit/test-bdrv-drain.c     | 18 ++++++------
 6 files changed, 30 insertions(+), 61 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 5a2cc077a0..0956acbb60 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -735,17 +735,19 @@ struct BlockDriver {
     void (*bdrv_io_unplug)(BlockDriverState *bs);
 
     /**
-     * bdrv_co_drain_begin is called if implemented in the beginning of a
+     * bdrv_drain_begin is called if implemented in the beginning of a
      * drain operation to drain and stop any internal sources of requests in
      * the driver.
-     * bdrv_co_drain_end is called if implemented at the end of the drain.
+     * bdrv_drain_end is called if implemented at the end of the drain.
      *
      * They should be used by the driver to e.g. manage scheduled I/O
      * requests, or toggle an internal state. After the end of the drain new
      * requests will continue normally.
+     *
+     * Implementations of both functions must not call aio_poll().
      */
-    void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
-    void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
+    void (*bdrv_drain_begin)(BlockDriverState *bs);
+    void (*bdrv_drain_end)(BlockDriverState *bs);
 
     bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
     bool coroutine_fn (*bdrv_co_can_store_new_dirty_bitmap)(
diff --git a/block.c b/block.c
index 3bd594eb2a..fed8077993 100644
--- a/block.c
+++ b/block.c
@@ -1705,8 +1705,8 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
     assert(is_power_of_2(bs->bl.request_alignment));
 
     for (i = 0; i < bs->quiesce_counter; i++) {
-        if (drv->bdrv_co_drain_begin) {
-            drv->bdrv_co_drain_begin(bs);
+        if (drv->bdrv_drain_begin) {
+            drv->bdrv_drain_begin(bs);
         }
     }
 
diff --git a/block/io.c b/block/io.c
index 34b30e304e..183b407f5b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -250,55 +250,20 @@ typedef struct {
     int *drained_end_counter;
 } BdrvCoDrainData;
 
-static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
-{
-    BdrvCoDrainData *data = opaque;
-    BlockDriverState *bs = data->bs;
-
-    if (data->begin) {
-        bs->drv->bdrv_co_drain_begin(bs);
-    } else {
-        bs->drv->bdrv_co_drain_end(bs);
-    }
-
-    /* Set data->done and decrement drained_end_counter before bdrv_wakeup() */
-    qatomic_mb_set(&data->done, true);
-    if (!data->begin) {
-        qatomic_dec(data->drained_end_counter);
-    }
-    bdrv_dec_in_flight(bs);
-
-    g_free(data);
-}
-
-/* Recursively call BlockDriver.bdrv_co_drain_begin/end callbacks */
+/* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */
 static void bdrv_drain_invoke(BlockDriverState *bs, bool begin,
                               int *drained_end_counter)
 {
-    BdrvCoDrainData *data;
-
-    if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) ||
-            (!begin && !bs->drv->bdrv_co_drain_end)) {
+    if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
+            (!begin && !bs->drv->bdrv_drain_end)) {
         return;
     }
 
-    data = g_new(BdrvCoDrainData, 1);
-    *data = (BdrvCoDrainData) {
-        .bs = bs,
-        .done = false,
-        .begin = begin,
-        .drained_end_counter = drained_end_counter,
-    };
-
-    if (!begin) {
-        qatomic_inc(drained_end_counter);
+    if (begin) {
+        bs->drv->bdrv_drain_begin(bs);
+    } else {
+        bs->drv->bdrv_drain_end(bs);
     }
-
-    /* Make sure the driver callback completes during the polling phase for
-     * drain_begin. */
-    bdrv_inc_in_flight(bs);
-    data->co = qemu_coroutine_create(bdrv_drain_invoke_entry, data);
-    aio_co_schedule(bdrv_get_aio_context(bs), data->co);
 }
 
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
diff --git a/block/qed.c b/block/qed.c
index 013f826c44..301ff8fd86 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -365,7 +365,7 @@ static void bdrv_qed_attach_aio_context(BlockDriverState *bs,
     }
 }
 
-static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
+static void bdrv_qed_co_drain_begin(BlockDriverState *bs)
 {
     BDRVQEDState *s = bs->opaque;
 
@@ -1661,7 +1661,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_co_check            = bdrv_qed_co_check,
     .bdrv_detach_aio_context  = bdrv_qed_detach_aio_context,
     .bdrv_attach_aio_context  = bdrv_qed_attach_aio_context,
-    .bdrv_co_drain_begin      = bdrv_qed_co_drain_begin,
+    .bdrv_drain_begin         = bdrv_qed_co_drain_begin,
 };
 
 static void bdrv_qed_init(void)
diff --git a/block/throttle.c b/block/throttle.c
index 131eba3ab4..6e3ae1b355 100644
--- a/block/throttle.c
+++ b/block/throttle.c
@@ -214,7 +214,7 @@ static void throttle_reopen_abort(BDRVReopenState *reopen_state)
     reopen_state->opaque = NULL;
 }
 
-static void coroutine_fn throttle_co_drain_begin(BlockDriverState *bs)
+static void throttle_co_drain_begin(BlockDriverState *bs)
 {
     ThrottleGroupMember *tgm = bs->opaque;
     if (qatomic_fetch_inc(&tgm->io_limits_disabled) == 0) {
@@ -261,8 +261,8 @@ static BlockDriver bdrv_throttle = {
     .bdrv_reopen_commit                 =   throttle_reopen_commit,
     .bdrv_reopen_abort                  =   throttle_reopen_abort,
 
-    .bdrv_co_drain_begin                =   throttle_co_drain_begin,
-    .bdrv_co_drain_end                  =   throttle_co_drain_end,
+    .bdrv_drain_begin                   =   throttle_co_drain_begin,
+    .bdrv_drain_end                     =   throttle_co_drain_end,
 
     .is_filter                          =   true,
     .strong_runtime_opts                =   throttle_strong_runtime_opts,
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 24f34e24ad..695519ee02 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -46,7 +46,7 @@ static void coroutine_fn sleep_in_drain_begin(void *opaque)
     bdrv_dec_in_flight(bs);
 }
 
-static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
+static void bdrv_test_drain_begin(BlockDriverState *bs)
 {
     BDRVTestState *s = bs->opaque;
     s->drain_count++;
@@ -57,7 +57,7 @@ static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
     }
 }
 
-static void coroutine_fn bdrv_test_co_drain_end(BlockDriverState *bs)
+static void bdrv_test_drain_end(BlockDriverState *bs)
 {
     BDRVTestState *s = bs->opaque;
     s->drain_count--;
@@ -111,8 +111,8 @@ static BlockDriver bdrv_test = {
     .bdrv_close             = bdrv_test_close,
     .bdrv_co_preadv         = bdrv_test_co_preadv,
 
-    .bdrv_co_drain_begin    = bdrv_test_co_drain_begin,
-    .bdrv_co_drain_end      = bdrv_test_co_drain_end,
+    .bdrv_drain_begin       = bdrv_test_drain_begin,
+    .bdrv_drain_end         = bdrv_test_drain_end,
 
     .bdrv_child_perm        = bdrv_default_perms,
 
@@ -1703,6 +1703,7 @@ static void test_blockjob_commit_by_drained_end(void)
     bdrv_drained_begin(bs_child);
     g_assert(!job_has_completed);
     bdrv_drained_end(bs_child);
+    aio_poll(qemu_get_aio_context(), false);
     g_assert(job_has_completed);
 
     bdrv_unref(bs_parents[0]);
@@ -1858,6 +1859,7 @@ static void test_drop_intermediate_poll(void)
 
     g_assert(!job_has_completed);
     ret = bdrv_drop_intermediate(chain[1], chain[0], NULL);
+    aio_poll(qemu_get_aio_context(), false);
     g_assert(ret == 0);
     g_assert(job_has_completed);
 
@@ -1946,7 +1948,7 @@ static void coroutine_fn bdrv_replace_test_drain_co(void *opaque)
  * .was_drained.
  * Increment .drain_count.
  */
-static void coroutine_fn bdrv_replace_test_co_drain_begin(BlockDriverState *bs)
+static void bdrv_replace_test_drain_begin(BlockDriverState *bs)
 {
     BDRVReplaceTestState *s = bs->opaque;
 
@@ -1977,7 +1979,7 @@ static void coroutine_fn bdrv_replace_test_read_entry(void *opaque)
  * If .drain_count reaches 0 and the node has a backing file, issue a
  * read request.
  */
-static void coroutine_fn bdrv_replace_test_co_drain_end(BlockDriverState *bs)
+static void bdrv_replace_test_drain_end(BlockDriverState *bs)
 {
     BDRVReplaceTestState *s = bs->opaque;
 
@@ -2002,8 +2004,8 @@ static BlockDriver bdrv_replace_test = {
     .bdrv_close             = bdrv_replace_test_close,
     .bdrv_co_preadv         = bdrv_replace_test_co_preadv,
 
-    .bdrv_co_drain_begin    = bdrv_replace_test_co_drain_begin,
-    .bdrv_co_drain_end      = bdrv_replace_test_co_drain_end,
+    .bdrv_drain_begin       = bdrv_replace_test_drain_begin,
+    .bdrv_drain_end         = bdrv_replace_test_drain_end,
 
     .bdrv_child_perm        = bdrv_default_perms,
 };
-- 
2.38.1




* [PATCH 04/13] block: Remove drained_end_counter
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (2 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 14:44   ` Vladimir Sementsov-Ogievskiy
                     ` (2 more replies)
  2022-11-08 12:37 ` [PATCH 05/13] block: Inline bdrv_drain_invoke() Kevin Wolf
                   ` (10 subsequent siblings)
  14 siblings, 3 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

drained_end_counter is unused now; nobody changes its value any more.
It can be removed.

In cases where we had two almost identical functions that differed only
in whether the caller passes a drained_end_counter, or in whether they
poll for a local drained_end_counter to reach 0, these now become a
single function.
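The shape of this simplification — a polling wrapper around a *_no_poll variant collapsing into one function once nothing ever increments the counter — looks roughly like the following. This is schematic, not the real QEMU code; the names are invented for illustration:

```c
#include <assert.h>

static int quiesce_counter;

/* Old shape: the no-poll variant threads a counter through, and the
 * public function polls it back down to zero.  Once nothing ever
 * increments the counter, the loop body is dead and the pair can
 * collapse into a single function. */
static void drained_end_no_poll(int *counter)
{
    (void)counter;              /* nobody increments it any more */
    quiesce_counter--;
}

static void drained_end_old(void)
{
    int counter = 0;
    drained_end_no_poll(&counter);
    while (counter > 0) {
        /* would poll here; never entered once counter stays 0 */
    }
}

/* New shape: one function, no counter plumbing */
static void drained_end_new(void)
{
    quiesce_counter--;
}
```

Both shapes leave the quiesce counter in the same state; the patch just deletes the plumbing that no longer does anything.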

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-io.h         | 15 -----
 include/block/block_int-common.h |  6 +-
 block.c                          |  5 +-
 block/block-backend.c            |  4 +-
 block/io.c                       | 97 ++++++++------------------------
 blockjob.c                       |  2 +-
 6 files changed, 30 insertions(+), 99 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 770ddeb7c8..97e9ae8bee 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -235,21 +235,6 @@ int coroutine_fn bdrv_co_copy_range(BdrvChild *src, int64_t src_offset,
                                     int64_t bytes, BdrvRequestFlags read_flags,
                                     BdrvRequestFlags write_flags);
 
-/**
- * bdrv_drained_end_no_poll:
- *
- * Same as bdrv_drained_end(), but do not poll for the subgraph to
- * actually become unquiesced.  Therefore, no graph changes will occur
- * with this function.
- *
- * *drained_end_counter is incremented for every background operation
- * that is scheduled, and will be decremented for every operation once
- * it settles.  The caller must poll until it reaches 0.  The counter
- * should be accessed using atomic operations only.
- */
-void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter);
-
-
 /*
  * "I/O or GS" API functions. These functions can run without
  * the BQL, but only in one specific iothread/main loop.
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 0956acbb60..6504db4fd9 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -939,15 +939,11 @@ struct BdrvChildClass {
      * These functions must not change the graph (and therefore also must not
      * call aio_poll(), which could change the graph indirectly).
      *
-     * If drained_end() schedules background operations, it must atomically
-     * increment *drained_end_counter for each such operation and atomically
-     * decrement it once the operation has settled.
-     *
      * Note that this can be nested. If drained_begin() was called twice, new
      * I/O is allowed only after drained_end() was called twice, too.
      */
     void (*drained_begin)(BdrvChild *child);
-    void (*drained_end)(BdrvChild *child, int *drained_end_counter);
+    void (*drained_end)(BdrvChild *child);
 
     /*
      * Returns whether the parent has pending requests for the child. This
diff --git a/block.c b/block.c
index fed8077993..7a24bd4c36 100644
--- a/block.c
+++ b/block.c
@@ -1227,11 +1227,10 @@ static bool bdrv_child_cb_drained_poll(BdrvChild *child)
     return bdrv_drain_poll(bs, false, NULL, false);
 }
 
-static void bdrv_child_cb_drained_end(BdrvChild *child,
-                                      int *drained_end_counter)
+static void bdrv_child_cb_drained_end(BdrvChild *child)
 {
     BlockDriverState *bs = child->opaque;
-    bdrv_drained_end_no_poll(bs, drained_end_counter);
+    bdrv_drained_end(bs);
 }
 
 static int bdrv_child_cb_inactivate(BdrvChild *child)
diff --git a/block/block-backend.c b/block/block-backend.c
index c0c7d56c8d..ecdfeb49bb 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -129,7 +129,7 @@ static void blk_root_inherit_options(BdrvChildRole role, bool parent_is_format,
 }
 static void blk_root_drained_begin(BdrvChild *child);
 static bool blk_root_drained_poll(BdrvChild *child);
-static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter);
+static void blk_root_drained_end(BdrvChild *child);
 
 static void blk_root_change_media(BdrvChild *child, bool load);
 static void blk_root_resize(BdrvChild *child);
@@ -2549,7 +2549,7 @@ static bool blk_root_drained_poll(BdrvChild *child)
     return busy || !!blk->in_flight;
 }
 
-static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
+static void blk_root_drained_end(BdrvChild *child)
 {
     BlockBackend *blk = child->opaque;
     assert(blk->quiesce_counter);
diff --git a/block/io.c b/block/io.c
index 183b407f5b..41e6121c31 100644
--- a/block/io.c
+++ b/block/io.c
@@ -58,27 +58,19 @@ static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore,
     }
 }
 
-static void bdrv_parent_drained_end_single_no_poll(BdrvChild *c,
-                                                   int *drained_end_counter)
+void bdrv_parent_drained_end_single(BdrvChild *c)
 {
+    IO_OR_GS_CODE();
+
     assert(c->parent_quiesce_counter > 0);
     c->parent_quiesce_counter--;
     if (c->klass->drained_end) {
-        c->klass->drained_end(c, drained_end_counter);
+        c->klass->drained_end(c);
     }
 }
 
-void bdrv_parent_drained_end_single(BdrvChild *c)
-{
-    int drained_end_counter = 0;
-    IO_OR_GS_CODE();
-    bdrv_parent_drained_end_single_no_poll(c, &drained_end_counter);
-    BDRV_POLL_WHILE(c->bs, qatomic_read(&drained_end_counter) > 0);
-}
-
 static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore,
-                                    bool ignore_bds_parents,
-                                    int *drained_end_counter)
+                                    bool ignore_bds_parents)
 {
     BdrvChild *c;
 
@@ -86,7 +78,7 @@ static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore,
         if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) {
             continue;
         }
-        bdrv_parent_drained_end_single_no_poll(c, drained_end_counter);
+        bdrv_parent_drained_end_single(c);
     }
 }
 
@@ -247,12 +239,10 @@ typedef struct {
     bool poll;
     BdrvChild *parent;
     bool ignore_bds_parents;
-    int *drained_end_counter;
 } BdrvCoDrainData;
 
 /* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */
-static void bdrv_drain_invoke(BlockDriverState *bs, bool begin,
-                              int *drained_end_counter)
+static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
 {
     if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
             (!begin && !bs->drv->bdrv_drain_end)) {
@@ -303,8 +293,7 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
                                   BdrvChild *parent, bool ignore_bds_parents,
                                   bool poll);
 static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
-                                BdrvChild *parent, bool ignore_bds_parents,
-                                int *drained_end_counter);
+                                BdrvChild *parent, bool ignore_bds_parents);
 
 static void bdrv_co_drain_bh_cb(void *opaque)
 {
@@ -317,14 +306,12 @@ static void bdrv_co_drain_bh_cb(void *opaque)
         aio_context_acquire(ctx);
         bdrv_dec_in_flight(bs);
         if (data->begin) {
-            assert(!data->drained_end_counter);
             bdrv_do_drained_begin(bs, data->recursive, data->parent,
                                   data->ignore_bds_parents, data->poll);
         } else {
             assert(!data->poll);
             bdrv_do_drained_end(bs, data->recursive, data->parent,
-                                data->ignore_bds_parents,
-                                data->drained_end_counter);
+                                data->ignore_bds_parents);
         }
         aio_context_release(ctx);
     } else {
@@ -340,8 +327,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
                                                 bool begin, bool recursive,
                                                 BdrvChild *parent,
                                                 bool ignore_bds_parents,
-                                                bool poll,
-                                                int *drained_end_counter)
+                                                bool poll)
 {
     BdrvCoDrainData data;
     Coroutine *self = qemu_coroutine_self();
@@ -361,7 +347,6 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
         .parent = parent,
         .ignore_bds_parents = ignore_bds_parents,
         .poll = poll,
-        .drained_end_counter = drained_end_counter,
     };
 
     if (bs) {
@@ -404,7 +389,7 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
     }
 
     bdrv_parent_drained_begin(bs, parent, ignore_bds_parents);
-    bdrv_drain_invoke(bs, true, NULL);
+    bdrv_drain_invoke(bs, true);
 }
 
 static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
@@ -415,7 +400,7 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
 
     if (qemu_in_coroutine()) {
         bdrv_co_yield_to_drain(bs, true, recursive, parent, ignore_bds_parents,
-                               poll, NULL);
+                               poll);
         return;
     }
 
@@ -459,38 +444,24 @@ void bdrv_subtree_drained_begin(BlockDriverState *bs)
 
 /**
  * This function does not poll, nor must any of its recursively called
- * functions.  The *drained_end_counter pointee will be incremented
- * once for every background operation scheduled, and decremented once
- * the operation settles.  Therefore, the pointer must remain valid
- * until the pointee reaches 0.  That implies that whoever sets up the
- * pointee has to poll until it is 0.
- *
- * We use atomic operations to access *drained_end_counter, because
- * (1) when called from bdrv_set_aio_context_ignore(), the subgraph of
- *     @bs may contain nodes in different AioContexts,
- * (2) bdrv_drain_all_end() uses the same counter for all nodes,
- *     regardless of which AioContext they are in.
+ * functions.
  */
 static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
-                                BdrvChild *parent, bool ignore_bds_parents,
-                                int *drained_end_counter)
+                                BdrvChild *parent, bool ignore_bds_parents)
 {
     BdrvChild *child;
     int old_quiesce_counter;
 
-    assert(drained_end_counter != NULL);
-
     if (qemu_in_coroutine()) {
         bdrv_co_yield_to_drain(bs, false, recursive, parent, ignore_bds_parents,
-                               false, drained_end_counter);
+                               false);
         return;
     }
     assert(bs->quiesce_counter > 0);
 
     /* Re-enable things in child-to-parent order */
-    bdrv_drain_invoke(bs, false, drained_end_counter);
-    bdrv_parent_drained_end(bs, parent, ignore_bds_parents,
-                            drained_end_counter);
+    bdrv_drain_invoke(bs, false);
+    bdrv_parent_drained_end(bs, parent, ignore_bds_parents);
 
     old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
     if (old_quiesce_counter == 1) {
@@ -501,32 +472,21 @@ static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
         assert(!ignore_bds_parents);
         bs->recursive_quiesce_counter--;
         QLIST_FOREACH(child, &bs->children, next) {
-            bdrv_do_drained_end(child->bs, true, child, ignore_bds_parents,
-                                drained_end_counter);
+            bdrv_do_drained_end(child->bs, true, child, ignore_bds_parents);
         }
     }
 }
 
 void bdrv_drained_end(BlockDriverState *bs)
 {
-    int drained_end_counter = 0;
     IO_OR_GS_CODE();
-    bdrv_do_drained_end(bs, false, NULL, false, &drained_end_counter);
-    BDRV_POLL_WHILE(bs, qatomic_read(&drained_end_counter) > 0);
-}
-
-void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter)
-{
-    IO_CODE();
-    bdrv_do_drained_end(bs, false, NULL, false, drained_end_counter);
+    bdrv_do_drained_end(bs, false, NULL, false);
 }
 
 void bdrv_subtree_drained_end(BlockDriverState *bs)
 {
-    int drained_end_counter = 0;
     IO_OR_GS_CODE();
-    bdrv_do_drained_end(bs, true, NULL, false, &drained_end_counter);
-    BDRV_POLL_WHILE(bs, qatomic_read(&drained_end_counter) > 0);
+    bdrv_do_drained_end(bs, true, NULL, false);
 }
 
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent)
@@ -541,16 +501,12 @@ void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent)
 
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent)
 {
-    int drained_end_counter = 0;
     int i;
     IO_OR_GS_CODE();
 
     for (i = 0; i < old_parent->recursive_quiesce_counter; i++) {
-        bdrv_do_drained_end(child->bs, true, child, false,
-                            &drained_end_counter);
+        bdrv_do_drained_end(child->bs, true, child, false);
     }
-
-    BDRV_POLL_WHILE(child->bs, qatomic_read(&drained_end_counter) > 0);
 }
 
 void bdrv_drain(BlockDriverState *bs)
@@ -608,7 +564,7 @@ void bdrv_drain_all_begin(void)
     GLOBAL_STATE_CODE();
 
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true, NULL);
+        bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true);
         return;
     }
 
@@ -647,22 +603,19 @@ void bdrv_drain_all_begin(void)
 
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs)
 {
-    int drained_end_counter = 0;
     GLOBAL_STATE_CODE();
 
     g_assert(bs->quiesce_counter > 0);
     g_assert(!bs->refcnt);
 
     while (bs->quiesce_counter) {
-        bdrv_do_drained_end(bs, false, NULL, true, &drained_end_counter);
+        bdrv_do_drained_end(bs, false, NULL, true);
     }
-    BDRV_POLL_WHILE(bs, qatomic_read(&drained_end_counter) > 0);
 }
 
 void bdrv_drain_all_end(void)
 {
     BlockDriverState *bs = NULL;
-    int drained_end_counter = 0;
     GLOBAL_STATE_CODE();
 
     /*
@@ -678,13 +631,11 @@ void bdrv_drain_all_end(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_end(bs, false, NULL, true, &drained_end_counter);
+        bdrv_do_drained_end(bs, false, NULL, true);
         aio_context_release(aio_context);
     }
 
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
-    AIO_WAIT_WHILE(NULL, qatomic_read(&drained_end_counter) > 0);
-
     assert(bdrv_drain_all_count > 0);
     bdrv_drain_all_count--;
 }
diff --git a/blockjob.c b/blockjob.c
index 2d86014fa5..43d0db1f94 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -120,7 +120,7 @@ static bool child_job_drained_poll(BdrvChild *c)
     }
 }
 
-static void child_job_drained_end(BdrvChild *c, int *drained_end_counter)
+static void child_job_drained_end(BdrvChild *c)
 {
     BlockJob *job = c->opaque;
     job_resume(&job->job);
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 05/13] block: Inline bdrv_drain_invoke()
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (3 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 04/13] block: Remove drained_end_counter Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 15:34   ` Vladimir Sementsov-Ogievskiy
                     ` (3 more replies)
  2022-11-08 12:37 ` [PATCH 06/13] block: Drain individual nodes during reopen Kevin Wolf
                   ` (9 subsequent siblings)
  14 siblings, 4 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

bdrv_drain_invoke() now has two entirely separate cases that share no
code any more and are selected by a bool parameter. Each case has only
one caller. Just inline the function.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/block/io.c b/block/io.c
index 41e6121c31..c520183fb7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -241,21 +241,6 @@ typedef struct {
     bool ignore_bds_parents;
 } BdrvCoDrainData;
 
-/* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */
-static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
-{
-    if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
-            (!begin && !bs->drv->bdrv_drain_end)) {
-        return;
-    }
-
-    if (begin) {
-        bs->drv->bdrv_drain_begin(bs);
-    } else {
-        bs->drv->bdrv_drain_end(bs);
-    }
-}
-
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
 bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
                      BdrvChild *ignore_parent, bool ignore_bds_parents)
@@ -389,7 +374,9 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
     }
 
     bdrv_parent_drained_begin(bs, parent, ignore_bds_parents);
-    bdrv_drain_invoke(bs, true);
+    if (bs->drv && bs->drv->bdrv_drain_begin) {
+        bs->drv->bdrv_drain_begin(bs);
+    }
 }
 
 static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
@@ -460,7 +447,9 @@ static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
     assert(bs->quiesce_counter > 0);
 
     /* Re-enable things in child-to-parent order */
-    bdrv_drain_invoke(bs, false);
+    if (bs->drv && bs->drv->bdrv_drain_end) {
+        bs->drv->bdrv_drain_end(bs);
+    }
     bdrv_parent_drained_end(bs, parent, ignore_bds_parents);
 
     old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 06/13] block: Drain individual nodes during reopen
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (4 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 05/13] block: Inline bdrv_drain_invoke() Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 16:00   ` Vladimir Sementsov-Ogievskiy
  2022-11-08 12:37 ` [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate() Kevin Wolf
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

bdrv_reopen() and friends use subtree drains as a lazy way of covering
all the nodes they touch. It turns out that this lazy way is a lot more
complicated than just draining the nodes individually, even without
accounting for the additional complexity in the drain mechanism itself.

Simplify the code by switching to draining the individual nodes that are
already managed in the BlockReopenQueue anyway.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c             | 11 ++++-------
 block/replication.c |  6 ------
 blockdev.c          | 13 -------------
 3 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/block.c b/block.c
index 7a24bd4c36..5828b970e4 100644
--- a/block.c
+++ b/block.c
@@ -4142,7 +4142,7 @@ static bool bdrv_recurse_has_child(BlockDriverState *bs,
  * returns a pointer to bs_queue, which is either the newly allocated
  * bs_queue, or the existing bs_queue being used.
  *
- * bs must be drained between bdrv_reopen_queue() and bdrv_reopen_multiple().
+ * bs is drained here and undrained by bdrv_reopen_queue_free().
  */
 static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
                                                  BlockDriverState *bs,
@@ -4162,12 +4162,10 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
     int flags;
     QemuOpts *opts;
 
-    /* Make sure that the caller remembered to use a drained section. This is
-     * important to avoid graph changes between the recursive queuing here and
-     * bdrv_reopen_multiple(). */
-    assert(bs->quiesce_counter > 0);
     GLOBAL_STATE_CODE();
 
+    bdrv_drained_begin(bs);
+
     if (bs_queue == NULL) {
         bs_queue = g_new0(BlockReopenQueue, 1);
         QTAILQ_INIT(bs_queue);
@@ -4317,6 +4315,7 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
     if (bs_queue) {
         BlockReopenQueueEntry *bs_entry, *next;
         QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+            bdrv_drained_end(bs_entry->state.bs);
             qobject_unref(bs_entry->state.explicit_options);
             qobject_unref(bs_entry->state.options);
             g_free(bs_entry);
@@ -4464,7 +4463,6 @@ int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
 
     GLOBAL_STATE_CODE();
 
-    bdrv_subtree_drained_begin(bs);
     if (ctx != qemu_get_aio_context()) {
         aio_context_release(ctx);
     }
@@ -4475,7 +4473,6 @@ int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
     if (ctx != qemu_get_aio_context()) {
         aio_context_acquire(ctx);
     }
-    bdrv_subtree_drained_end(bs);
 
     return ret;
 }
diff --git a/block/replication.c b/block/replication.c
index f1eed25e43..c62f48a874 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -374,9 +374,6 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable,
         s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs);
     }
 
-    bdrv_subtree_drained_begin(hidden_disk->bs);
-    bdrv_subtree_drained_begin(secondary_disk->bs);
-
     if (s->orig_hidden_read_only) {
         QDict *opts = qdict_new();
         qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
@@ -401,9 +398,6 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable,
             aio_context_acquire(ctx);
         }
     }
-
-    bdrv_subtree_drained_end(hidden_disk->bs);
-    bdrv_subtree_drained_end(secondary_disk->bs);
 }
 
 static void backup_job_cleanup(BlockDriverState *bs)
diff --git a/blockdev.c b/blockdev.c
index 3f1dec6242..8ffb3d9537 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3547,8 +3547,6 @@ fail:
 void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 {
     BlockReopenQueue *queue = NULL;
-    GSList *drained = NULL;
-    GSList *p;
 
     /* Add each one of the BDS that we want to reopen to the queue */
     for (; reopen_list != NULL; reopen_list = reopen_list->next) {
@@ -3585,9 +3583,7 @@ void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
         ctx = bdrv_get_aio_context(bs);
         aio_context_acquire(ctx);
 
-        bdrv_subtree_drained_begin(bs);
         queue = bdrv_reopen_queue(queue, bs, qdict, false);
-        drained = g_slist_prepend(drained, bs);
 
         aio_context_release(ctx);
     }
@@ -3598,15 +3594,6 @@ void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 
 fail:
     bdrv_reopen_queue_free(queue);
-    for (p = drained; p; p = p->next) {
-        BlockDriverState *bs = p->data;
-        AioContext *ctx = bdrv_get_aio_context(bs);
-
-        aio_context_acquire(ctx);
-        bdrv_subtree_drained_end(bs);
-        aio_context_release(ctx);
-    }
-    g_slist_free(drained);
 }
 
 void qmp_blockdev_del(const char *node_name, Error **errp)
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate()
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (5 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 06/13] block: Drain individual nodes during reopen Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 16:18   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:20   ` Hanna Reitz
  2022-11-08 12:37 ` [PATCH 08/13] stream: Replace subtree drain with a single node drain Kevin Wolf
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Instead of using a subtree drain from the top node (which also drains
child nodes of base that we're not even interested in), use a normal
drain for base, which automatically drains all of the parents, too.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 5828b970e4..2f6b25875f 100644
--- a/block.c
+++ b/block.c
@@ -5581,7 +5581,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
     GLOBAL_STATE_CODE();
 
     bdrv_ref(top);
-    bdrv_subtree_drained_begin(top);
+    bdrv_drained_begin(base);
 
     if (!top->drv || !base->drv) {
         goto exit;
@@ -5654,7 +5654,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
 
     ret = 0;
 exit:
-    bdrv_subtree_drained_end(top);
+    bdrv_drained_end(base);
     bdrv_unref(top);
     return ret;
 }
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 08/13] stream: Replace subtree drain with a single node drain
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (6 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate() Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 16:52   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:21   ` Hanna Reitz
  2022-11-08 12:37 ` [PATCH 09/13] block: Remove subtree drains Kevin Wolf
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
graph changes between finding the base node and changing the block graph
as necessary on completion of the image streaming job.

The block graph could change between these two points because
bdrv_set_backing_hd() first drains the parent node, which involves
polling and can do anything.

Subtree draining was an imperfect way to make this less likely (because
with it, fewer callbacks are called during this window). Everyone agreed
that it's not really the right solution, and it was only committed as a
stopgap solution.

This replaces the subtree drain with a solution that simply drains the
parent node before we try to find the base node, and then calls a
version of bdrv_set_backing_hd() that doesn't drain, but just asserts
that the parent node is already drained.

This way, any graph changes caused by draining happen before we start
looking at the graph and things stay consistent between finding the base
node and changing the graph.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-global-state.h |  3 +++
 block.c                            | 17 ++++++++++++++---
 block/stream.c                     | 20 ++++++++++----------
 3 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index bb42ed9559..7923415d4e 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -82,6 +82,9 @@ int bdrv_open_file_child(const char *filename,
 BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp);
 int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
                         Error **errp);
+int bdrv_set_backing_hd_drained(BlockDriverState *bs,
+                                BlockDriverState *backing_hd,
+                                Error **errp);
 int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
                            const char *bdref_key, Error **errp);
 BlockDriverState *bdrv_open(const char *filename, const char *reference,
diff --git a/block.c b/block.c
index 2f6b25875f..43b893dd6c 100644
--- a/block.c
+++ b/block.c
@@ -3395,14 +3395,15 @@ static int bdrv_set_backing_noperm(BlockDriverState *bs,
     return bdrv_set_file_or_backing_noperm(bs, backing_hd, true, tran, errp);
 }
 
-int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
-                        Error **errp)
+int bdrv_set_backing_hd_drained(BlockDriverState *bs,
+                                BlockDriverState *backing_hd,
+                                Error **errp)
 {
     int ret;
     Transaction *tran = tran_new();
 
     GLOBAL_STATE_CODE();
-    bdrv_drained_begin(bs);
+    assert(bs->quiesce_counter > 0);
 
     ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
     if (ret < 0) {
@@ -3412,7 +3413,17 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     ret = bdrv_refresh_perms(bs, errp);
 out:
     tran_finalize(tran, ret);
+    return ret;
+}
 
+int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
+                        Error **errp)
+{
+    int ret;
+    GLOBAL_STATE_CODE();
+
+    bdrv_drained_begin(bs);
+    ret = bdrv_set_backing_hd_drained(bs, backing_hd, errp);
     bdrv_drained_end(bs);
 
     return ret;
diff --git a/block/stream.c b/block/stream.c
index 694709bd25..81dcf5a417 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -64,13 +64,16 @@ static int stream_prepare(Job *job)
     bdrv_cor_filter_drop(s->cor_filter_bs);
     s->cor_filter_bs = NULL;
 
-    bdrv_subtree_drained_begin(s->above_base);
+    /*
+     * bdrv_set_backing_hd() requires that unfiltered_bs is drained. Drain
+     * already here and use bdrv_set_backing_hd_drained() instead because
+     * the polling during drained_begin() might change the graph, and if we do
+     * this only later, we may end up working with the wrong base node (or it
+     * might even have gone away by the time we want to use it).
+     */
+    bdrv_drained_begin(unfiltered_bs);
 
     base = bdrv_filter_or_cow_bs(s->above_base);
-    if (base) {
-        bdrv_ref(base);
-    }
-
     unfiltered_base = bdrv_skip_filters(base);
 
     if (bdrv_cow_child(unfiltered_bs)) {
@@ -82,7 +85,7 @@ static int stream_prepare(Job *job)
             }
         }
 
-        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
+        bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
         ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
         if (local_err) {
             error_report_err(local_err);
@@ -92,10 +95,7 @@ static int stream_prepare(Job *job)
     }
 
 out:
-    if (base) {
-        bdrv_unref(base);
-    }
-    bdrv_subtree_drained_end(s->above_base);
+    bdrv_drained_end(unfiltered_bs);
     return ret;
 }
 
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 09/13] block: Remove subtree drains
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (7 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 08/13] stream: Replace subtree drain with a single node drain Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 17:22   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:22   ` Hanna Reitz
  2022-11-08 12:37 ` [PATCH 10/13] block: Call drain callbacks only once Kevin Wolf
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Subtree drains are not used any more. Remove them.

After this, BdrvChildClass.attach/detach() don't poll any more.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-io.h         |  18 +--
 include/block/block_int-common.h |   1 -
 include/block/block_int-io.h     |  12 --
 block.c                          |  20 +--
 block/io.c                       | 121 +++-----------
 tests/unit/test-bdrv-drain.c     | 261 ++-----------------------------
 6 files changed, 44 insertions(+), 389 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 97e9ae8bee..c35cb1e53f 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -303,8 +303,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c);
 /**
  * bdrv_drain_poll:
  *
- * Poll for pending requests in @bs, its parents (except for @ignore_parent),
- * and if @recursive is true its children as well (used for subtree drain).
+ * Poll for pending requests in @bs and its parents (except for @ignore_parent).
  *
  * If @ignore_bds_parents is true, parents that are BlockDriverStates must
  * ignore the drain request because they will be drained separately (used for
@@ -312,8 +311,8 @@ void bdrv_parent_drained_end_single(BdrvChild *c);
  *
  * This is part of bdrv_drained_begin.
  */
-bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
-                     BdrvChild *ignore_parent, bool ignore_bds_parents);
+bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
+                     bool ignore_bds_parents);
 
 /**
  * bdrv_drained_begin:
@@ -334,12 +333,6 @@ void bdrv_drained_begin(BlockDriverState *bs);
 void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
                                    BdrvChild *parent, bool ignore_bds_parents);
 
-/**
- * Like bdrv_drained_begin, but recursively begins a quiesced section for
- * exclusive access to all child nodes as well.
- */
-void bdrv_subtree_drained_begin(BlockDriverState *bs);
-
 /**
  * bdrv_drained_end:
  *
@@ -353,9 +346,4 @@ void bdrv_subtree_drained_begin(BlockDriverState *bs);
  */
 void bdrv_drained_end(BlockDriverState *bs);
 
-/**
- * End a quiescent section started by bdrv_subtree_drained_begin().
- */
-void bdrv_subtree_drained_end(BlockDriverState *bs);
-
 #endif /* BLOCK_IO_H */
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 6504db4fd9..65ee5fcbec 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -1184,7 +1184,6 @@ struct BlockDriverState {
 
     /* Accessed with atomic ops.  */
     int quiesce_counter;
-    int recursive_quiesce_counter;
 
     unsigned int write_gen;               /* Current data generation */
 
diff --git a/include/block/block_int-io.h b/include/block/block_int-io.h
index 4b0b3e17ef..8bc061ebb8 100644
--- a/include/block/block_int-io.h
+++ b/include/block/block_int-io.h
@@ -179,16 +179,4 @@ void bdrv_bsc_invalidate_range(BlockDriverState *bs,
  */
 void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
 
-
-/*
- * "I/O or GS" API functions. These functions can run without
- * the BQL, but only in one specific iothread/main loop.
- *
- * See include/block/block-io.h for more information about
- * the "I/O or GS" API.
- */
-
-void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
-void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
-
 #endif /* BLOCK_INT_IO_H */
diff --git a/block.c b/block.c
index 43b893dd6c..9d082631d9 100644
--- a/block.c
+++ b/block.c
@@ -1224,7 +1224,7 @@ static void bdrv_child_cb_drained_begin(BdrvChild *child)
 static bool bdrv_child_cb_drained_poll(BdrvChild *child)
 {
     BlockDriverState *bs = child->opaque;
-    return bdrv_drain_poll(bs, false, NULL, false);
+    return bdrv_drain_poll(bs, NULL, false);
 }
 
 static void bdrv_child_cb_drained_end(BdrvChild *child)
@@ -1474,8 +1474,6 @@ static void bdrv_child_cb_attach(BdrvChild *child)
         assert(!bs->file);
         bs->file = child;
     }
-
-    bdrv_apply_subtree_drain(child, bs);
 }
 
 static void bdrv_child_cb_detach(BdrvChild *child)
@@ -1486,8 +1484,6 @@ static void bdrv_child_cb_detach(BdrvChild *child)
         bdrv_backing_detach(child);
     }
 
-    bdrv_unapply_subtree_drain(child, bs);
-
     assert_bdrv_graph_writable(bs);
     QLIST_REMOVE(child, next);
     if (child == bs->backing) {
@@ -2843,9 +2839,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
     }
 
     if (old_bs) {
-        /* Detach first so that the recursive drain sections coming from @child
-         * are already gone and we only end the drain sections that came from
-         * elsewhere. */
         if (child->klass->detach) {
             child->klass->detach(child);
         }
@@ -2860,17 +2853,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
         QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
         /*
-         * Detaching the old node may have led to the new node's
-         * quiesce_counter having been decreased.  Not a problem, we
-         * just need to recognize this here and then invoke
-         * drained_end appropriately more often.
+         * Polling in bdrv_parent_drained_begin_single() may have led to the new
+         * node's quiesce_counter having been decreased.  Not a problem, we just
+         * need to recognize this here and then invoke drained_end appropriately
+         * more often.
          */
         assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
         drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
 
-        /* Attach only after starting new drained sections, so that recursive
-         * drain sections coming from @child don't get an extra .drained_begin
-         * callback. */
         if (child->klass->attach) {
             child->klass->attach(child);
         }
diff --git a/block/io.c b/block/io.c
index c520183fb7..870a25d7a5 100644
--- a/block/io.c
+++ b/block/io.c
@@ -235,17 +235,15 @@ typedef struct {
     BlockDriverState *bs;
     bool done;
     bool begin;
-    bool recursive;
     bool poll;
     BdrvChild *parent;
     bool ignore_bds_parents;
 } BdrvCoDrainData;
 
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
-bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
-                     BdrvChild *ignore_parent, bool ignore_bds_parents)
+bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
+                     bool ignore_bds_parents)
 {
-    BdrvChild *child, *next;
     IO_OR_GS_CODE();
 
     if (bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents)) {
@@ -256,29 +254,19 @@ bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
         return true;
     }
 
-    if (recursive) {
-        assert(!ignore_bds_parents);
-        QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
-            if (bdrv_drain_poll(child->bs, recursive, child, false)) {
-                return true;
-            }
-        }
-    }
-
     return false;
 }
 
-static bool bdrv_drain_poll_top_level(BlockDriverState *bs, bool recursive,
+static bool bdrv_drain_poll_top_level(BlockDriverState *bs,
                                       BdrvChild *ignore_parent)
 {
-    return bdrv_drain_poll(bs, recursive, ignore_parent, false);
+    return bdrv_drain_poll(bs, ignore_parent, false);
 }
 
-static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
-                                  BdrvChild *parent, bool ignore_bds_parents,
-                                  bool poll);
-static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
-                                BdrvChild *parent, bool ignore_bds_parents);
+static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
+                                  bool ignore_bds_parents, bool poll);
+static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
+                                bool ignore_bds_parents);
 
 static void bdrv_co_drain_bh_cb(void *opaque)
 {
@@ -291,12 +279,11 @@ static void bdrv_co_drain_bh_cb(void *opaque)
         aio_context_acquire(ctx);
         bdrv_dec_in_flight(bs);
         if (data->begin) {
-            bdrv_do_drained_begin(bs, data->recursive, data->parent,
-                                  data->ignore_bds_parents, data->poll);
+            bdrv_do_drained_begin(bs, data->parent, data->ignore_bds_parents,
+                                  data->poll);
         } else {
             assert(!data->poll);
-            bdrv_do_drained_end(bs, data->recursive, data->parent,
-                                data->ignore_bds_parents);
+            bdrv_do_drained_end(bs, data->parent, data->ignore_bds_parents);
         }
         aio_context_release(ctx);
     } else {
@@ -309,7 +296,7 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 }
 
 static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
-                                                bool begin, bool recursive,
+                                                bool begin,
                                                 BdrvChild *parent,
                                                 bool ignore_bds_parents,
                                                 bool poll)
@@ -328,7 +315,6 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
         .bs = bs,
         .done = false,
         .begin = begin,
-        .recursive = recursive,
         .parent = parent,
         .ignore_bds_parents = ignore_bds_parents,
         .poll = poll,
@@ -379,29 +365,16 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
     }
 }
 
-static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
-                                  BdrvChild *parent, bool ignore_bds_parents,
-                                  bool poll)
+static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
+                                  bool ignore_bds_parents, bool poll)
 {
-    BdrvChild *child, *next;
-
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(bs, true, recursive, parent, ignore_bds_parents,
-                               poll);
+        bdrv_co_yield_to_drain(bs, true, parent, ignore_bds_parents, poll);
         return;
     }
 
     bdrv_do_drained_begin_quiesce(bs, parent, ignore_bds_parents);
 
-    if (recursive) {
-        assert(!ignore_bds_parents);
-        bs->recursive_quiesce_counter++;
-        QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
-            bdrv_do_drained_begin(child->bs, true, child, ignore_bds_parents,
-                                  false);
-        }
-    }
-
     /*
      * Wait for drained requests to finish.
      *
@@ -413,35 +386,27 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
      */
     if (poll) {
         assert(!ignore_bds_parents);
-        BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent));
+        BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
     }
 }
 
 void bdrv_drained_begin(BlockDriverState *bs)
 {
     IO_OR_GS_CODE();
-    bdrv_do_drained_begin(bs, false, NULL, false, true);
-}
-
-void bdrv_subtree_drained_begin(BlockDriverState *bs)
-{
-    IO_OR_GS_CODE();
-    bdrv_do_drained_begin(bs, true, NULL, false, true);
+    bdrv_do_drained_begin(bs, NULL, false, true);
 }
 
 /**
  * This function does not poll, nor must any of its recursively called
  * functions.
  */
-static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
-                                BdrvChild *parent, bool ignore_bds_parents)
+static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
+                                bool ignore_bds_parents)
 {
-    BdrvChild *child;
     int old_quiesce_counter;
 
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(bs, false, recursive, parent, ignore_bds_parents,
-                               false);
+        bdrv_co_yield_to_drain(bs, false, parent, ignore_bds_parents, false);
         return;
     }
     assert(bs->quiesce_counter > 0);
@@ -456,46 +421,12 @@ static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
     if (old_quiesce_counter == 1) {
         aio_enable_external(bdrv_get_aio_context(bs));
     }
-
-    if (recursive) {
-        assert(!ignore_bds_parents);
-        bs->recursive_quiesce_counter--;
-        QLIST_FOREACH(child, &bs->children, next) {
-            bdrv_do_drained_end(child->bs, true, child, ignore_bds_parents);
-        }
-    }
 }
 
 void bdrv_drained_end(BlockDriverState *bs)
 {
     IO_OR_GS_CODE();
-    bdrv_do_drained_end(bs, false, NULL, false);
-}
-
-void bdrv_subtree_drained_end(BlockDriverState *bs)
-{
-    IO_OR_GS_CODE();
-    bdrv_do_drained_end(bs, true, NULL, false);
-}
-
-void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent)
-{
-    int i;
-    IO_OR_GS_CODE();
-
-    for (i = 0; i < new_parent->recursive_quiesce_counter; i++) {
-        bdrv_do_drained_begin(child->bs, true, child, false, true);
-    }
-}
-
-void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent)
-{
-    int i;
-    IO_OR_GS_CODE();
-
-    for (i = 0; i < old_parent->recursive_quiesce_counter; i++) {
-        bdrv_do_drained_end(child->bs, true, child, false);
-    }
+    bdrv_do_drained_end(bs, NULL, false);
 }
 
 void bdrv_drain(BlockDriverState *bs)
@@ -528,7 +459,7 @@ static bool bdrv_drain_all_poll(void)
     while ((bs = bdrv_next_all_states(bs))) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
         aio_context_acquire(aio_context);
-        result |= bdrv_drain_poll(bs, false, NULL, true);
+        result |= bdrv_drain_poll(bs, NULL, true);
         aio_context_release(aio_context);
     }
 
@@ -553,7 +484,7 @@ void bdrv_drain_all_begin(void)
     GLOBAL_STATE_CODE();
 
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true);
+        bdrv_co_yield_to_drain(NULL, true, NULL, true, true);
         return;
     }
 
@@ -578,7 +509,7 @@ void bdrv_drain_all_begin(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_begin(bs, false, NULL, true, false);
+        bdrv_do_drained_begin(bs, NULL, true, false);
         aio_context_release(aio_context);
     }
 
@@ -598,7 +529,7 @@ void bdrv_drain_all_end_quiesce(BlockDriverState *bs)
     g_assert(!bs->refcnt);
 
     while (bs->quiesce_counter) {
-        bdrv_do_drained_end(bs, false, NULL, true);
+        bdrv_do_drained_end(bs, NULL, true);
     }
 }
 
@@ -620,7 +551,7 @@ void bdrv_drain_all_end(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_end(bs, false, NULL, true);
+        bdrv_do_drained_end(bs, NULL, true);
         aio_context_release(aio_context);
     }
 
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 695519ee02..dda08de8db 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -156,7 +156,6 @@ static void call_in_coroutine(void (*entry)(void))
 enum drain_type {
     BDRV_DRAIN_ALL,
     BDRV_DRAIN,
-    BDRV_SUBTREE_DRAIN,
     DRAIN_TYPE_MAX,
 };
 
@@ -165,7 +164,6 @@ static void do_drain_begin(enum drain_type drain_type, BlockDriverState *bs)
     switch (drain_type) {
     case BDRV_DRAIN_ALL:        bdrv_drain_all_begin(); break;
     case BDRV_DRAIN:            bdrv_drained_begin(bs); break;
-    case BDRV_SUBTREE_DRAIN:    bdrv_subtree_drained_begin(bs); break;
     default:                    g_assert_not_reached();
     }
 }
@@ -175,7 +173,6 @@ static void do_drain_end(enum drain_type drain_type, BlockDriverState *bs)
     switch (drain_type) {
     case BDRV_DRAIN_ALL:        bdrv_drain_all_end(); break;
     case BDRV_DRAIN:            bdrv_drained_end(bs); break;
-    case BDRV_SUBTREE_DRAIN:    bdrv_subtree_drained_end(bs); break;
     default:                    g_assert_not_reached();
     }
 }
@@ -271,11 +268,6 @@ static void test_drv_cb_drain(void)
     test_drv_cb_common(BDRV_DRAIN, false);
 }
 
-static void test_drv_cb_drain_subtree(void)
-{
-    test_drv_cb_common(BDRV_SUBTREE_DRAIN, true);
-}
-
 static void test_drv_cb_co_drain_all(void)
 {
     call_in_coroutine(test_drv_cb_drain_all);
@@ -286,11 +278,6 @@ static void test_drv_cb_co_drain(void)
     call_in_coroutine(test_drv_cb_drain);
 }
 
-static void test_drv_cb_co_drain_subtree(void)
-{
-    call_in_coroutine(test_drv_cb_drain_subtree);
-}
-
 static void test_quiesce_common(enum drain_type drain_type, bool recursive)
 {
     BlockBackend *blk;
@@ -332,11 +319,6 @@ static void test_quiesce_drain(void)
     test_quiesce_common(BDRV_DRAIN, false);
 }
 
-static void test_quiesce_drain_subtree(void)
-{
-    test_quiesce_common(BDRV_SUBTREE_DRAIN, true);
-}
-
 static void test_quiesce_co_drain_all(void)
 {
     call_in_coroutine(test_quiesce_drain_all);
@@ -347,11 +329,6 @@ static void test_quiesce_co_drain(void)
     call_in_coroutine(test_quiesce_drain);
 }
 
-static void test_quiesce_co_drain_subtree(void)
-{
-    call_in_coroutine(test_quiesce_drain_subtree);
-}
-
 static void test_nested(void)
 {
     BlockBackend *blk;
@@ -402,158 +379,6 @@ static void test_nested(void)
     blk_unref(blk);
 }
 
-static void test_multiparent(void)
-{
-    BlockBackend *blk_a, *blk_b;
-    BlockDriverState *bs_a, *bs_b, *backing;
-    BDRVTestState *a_s, *b_s, *backing_s;
-
-    blk_a = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
-    bs_a = bdrv_new_open_driver(&bdrv_test, "test-node-a", BDRV_O_RDWR,
-                                &error_abort);
-    a_s = bs_a->opaque;
-    blk_insert_bs(blk_a, bs_a, &error_abort);
-
-    blk_b = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
-    bs_b = bdrv_new_open_driver(&bdrv_test, "test-node-b", BDRV_O_RDWR,
-                                &error_abort);
-    b_s = bs_b->opaque;
-    blk_insert_bs(blk_b, bs_b, &error_abort);
-
-    backing = bdrv_new_open_driver(&bdrv_test, "backing", 0, &error_abort);
-    backing_s = backing->opaque;
-    bdrv_set_backing_hd(bs_a, backing, &error_abort);
-    bdrv_set_backing_hd(bs_b, backing, &error_abort);
-
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 0);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 0);
-    g_assert_cmpint(backing->quiesce_counter, ==, 0);
-    g_assert_cmpint(a_s->drain_count, ==, 0);
-    g_assert_cmpint(b_s->drain_count, ==, 0);
-    g_assert_cmpint(backing_s->drain_count, ==, 0);
-
-    do_drain_begin(BDRV_SUBTREE_DRAIN, bs_a);
-
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 1);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 1);
-    g_assert_cmpint(backing->quiesce_counter, ==, 1);
-    g_assert_cmpint(a_s->drain_count, ==, 1);
-    g_assert_cmpint(b_s->drain_count, ==, 1);
-    g_assert_cmpint(backing_s->drain_count, ==, 1);
-
-    do_drain_begin(BDRV_SUBTREE_DRAIN, bs_b);
-
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 2);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 2);
-    g_assert_cmpint(backing->quiesce_counter, ==, 2);
-    g_assert_cmpint(a_s->drain_count, ==, 2);
-    g_assert_cmpint(b_s->drain_count, ==, 2);
-    g_assert_cmpint(backing_s->drain_count, ==, 2);
-
-    do_drain_end(BDRV_SUBTREE_DRAIN, bs_b);
-
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 1);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 1);
-    g_assert_cmpint(backing->quiesce_counter, ==, 1);
-    g_assert_cmpint(a_s->drain_count, ==, 1);
-    g_assert_cmpint(b_s->drain_count, ==, 1);
-    g_assert_cmpint(backing_s->drain_count, ==, 1);
-
-    do_drain_end(BDRV_SUBTREE_DRAIN, bs_a);
-
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 0);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 0);
-    g_assert_cmpint(backing->quiesce_counter, ==, 0);
-    g_assert_cmpint(a_s->drain_count, ==, 0);
-    g_assert_cmpint(b_s->drain_count, ==, 0);
-    g_assert_cmpint(backing_s->drain_count, ==, 0);
-
-    bdrv_unref(backing);
-    bdrv_unref(bs_a);
-    bdrv_unref(bs_b);
-    blk_unref(blk_a);
-    blk_unref(blk_b);
-}
-
-static void test_graph_change_drain_subtree(void)
-{
-    BlockBackend *blk_a, *blk_b;
-    BlockDriverState *bs_a, *bs_b, *backing;
-    BDRVTestState *a_s, *b_s, *backing_s;
-
-    blk_a = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
-    bs_a = bdrv_new_open_driver(&bdrv_test, "test-node-a", BDRV_O_RDWR,
-                                &error_abort);
-    a_s = bs_a->opaque;
-    blk_insert_bs(blk_a, bs_a, &error_abort);
-
-    blk_b = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
-    bs_b = bdrv_new_open_driver(&bdrv_test, "test-node-b", BDRV_O_RDWR,
-                                &error_abort);
-    b_s = bs_b->opaque;
-    blk_insert_bs(blk_b, bs_b, &error_abort);
-
-    backing = bdrv_new_open_driver(&bdrv_test, "backing", 0, &error_abort);
-    backing_s = backing->opaque;
-    bdrv_set_backing_hd(bs_a, backing, &error_abort);
-
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 0);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 0);
-    g_assert_cmpint(backing->quiesce_counter, ==, 0);
-    g_assert_cmpint(a_s->drain_count, ==, 0);
-    g_assert_cmpint(b_s->drain_count, ==, 0);
-    g_assert_cmpint(backing_s->drain_count, ==, 0);
-
-    do_drain_begin(BDRV_SUBTREE_DRAIN, bs_a);
-    do_drain_begin(BDRV_SUBTREE_DRAIN, bs_a);
-    do_drain_begin(BDRV_SUBTREE_DRAIN, bs_a);
-    do_drain_begin(BDRV_SUBTREE_DRAIN, bs_b);
-    do_drain_begin(BDRV_SUBTREE_DRAIN, bs_b);
-
-    bdrv_set_backing_hd(bs_b, backing, &error_abort);
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 5);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 5);
-    g_assert_cmpint(backing->quiesce_counter, ==, 5);
-    g_assert_cmpint(a_s->drain_count, ==, 5);
-    g_assert_cmpint(b_s->drain_count, ==, 5);
-    g_assert_cmpint(backing_s->drain_count, ==, 5);
-
-    bdrv_set_backing_hd(bs_b, NULL, &error_abort);
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 3);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 2);
-    g_assert_cmpint(backing->quiesce_counter, ==, 3);
-    g_assert_cmpint(a_s->drain_count, ==, 3);
-    g_assert_cmpint(b_s->drain_count, ==, 2);
-    g_assert_cmpint(backing_s->drain_count, ==, 3);
-
-    bdrv_set_backing_hd(bs_b, backing, &error_abort);
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 5);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 5);
-    g_assert_cmpint(backing->quiesce_counter, ==, 5);
-    g_assert_cmpint(a_s->drain_count, ==, 5);
-    g_assert_cmpint(b_s->drain_count, ==, 5);
-    g_assert_cmpint(backing_s->drain_count, ==, 5);
-
-    do_drain_end(BDRV_SUBTREE_DRAIN, bs_b);
-    do_drain_end(BDRV_SUBTREE_DRAIN, bs_b);
-    do_drain_end(BDRV_SUBTREE_DRAIN, bs_a);
-    do_drain_end(BDRV_SUBTREE_DRAIN, bs_a);
-    do_drain_end(BDRV_SUBTREE_DRAIN, bs_a);
-
-    g_assert_cmpint(bs_a->quiesce_counter, ==, 0);
-    g_assert_cmpint(bs_b->quiesce_counter, ==, 0);
-    g_assert_cmpint(backing->quiesce_counter, ==, 0);
-    g_assert_cmpint(a_s->drain_count, ==, 0);
-    g_assert_cmpint(b_s->drain_count, ==, 0);
-    g_assert_cmpint(backing_s->drain_count, ==, 0);
-
-    bdrv_unref(backing);
-    bdrv_unref(bs_a);
-    bdrv_unref(bs_b);
-    blk_unref(blk_a);
-    blk_unref(blk_b);
-}
-
 static void test_graph_change_drain_all(void)
 {
     BlockBackend *blk_a, *blk_b;
@@ -773,12 +598,6 @@ static void test_iothread_drain(void)
     test_iothread_common(BDRV_DRAIN, 1);
 }
 
-static void test_iothread_drain_subtree(void)
-{
-    test_iothread_common(BDRV_SUBTREE_DRAIN, 0);
-    test_iothread_common(BDRV_SUBTREE_DRAIN, 1);
-}
-
 
 typedef struct TestBlockJob {
     BlockJob common;
@@ -863,7 +682,6 @@ enum test_job_result {
 enum test_job_drain_node {
     TEST_JOB_DRAIN_SRC,
     TEST_JOB_DRAIN_SRC_CHILD,
-    TEST_JOB_DRAIN_SRC_PARENT,
 };
 
 static void test_blockjob_common_drain_node(enum drain_type drain_type,
@@ -901,9 +719,6 @@ static void test_blockjob_common_drain_node(enum drain_type drain_type,
     case TEST_JOB_DRAIN_SRC_CHILD:
         drain_bs = src_backing;
         break;
-    case TEST_JOB_DRAIN_SRC_PARENT:
-        drain_bs = src_overlay;
-        break;
     default:
         g_assert_not_reached();
     }
@@ -1055,10 +870,6 @@ static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
                                     TEST_JOB_DRAIN_SRC);
     test_blockjob_common_drain_node(drain_type, use_iothread, result,
                                     TEST_JOB_DRAIN_SRC_CHILD);
-    if (drain_type == BDRV_SUBTREE_DRAIN) {
-        test_blockjob_common_drain_node(drain_type, use_iothread, result,
-                                        TEST_JOB_DRAIN_SRC_PARENT);
-    }
 }
 
 static void test_blockjob_drain_all(void)
@@ -1071,11 +882,6 @@ static void test_blockjob_drain(void)
     test_blockjob_common(BDRV_DRAIN, false, TEST_JOB_SUCCESS);
 }
 
-static void test_blockjob_drain_subtree(void)
-{
-    test_blockjob_common(BDRV_SUBTREE_DRAIN, false, TEST_JOB_SUCCESS);
-}
-
 static void test_blockjob_error_drain_all(void)
 {
     test_blockjob_common(BDRV_DRAIN_ALL, false, TEST_JOB_FAIL_RUN);
@@ -1088,12 +894,6 @@ static void test_blockjob_error_drain(void)
     test_blockjob_common(BDRV_DRAIN, false, TEST_JOB_FAIL_PREPARE);
 }
 
-static void test_blockjob_error_drain_subtree(void)
-{
-    test_blockjob_common(BDRV_SUBTREE_DRAIN, false, TEST_JOB_FAIL_RUN);
-    test_blockjob_common(BDRV_SUBTREE_DRAIN, false, TEST_JOB_FAIL_PREPARE);
-}
-
 static void test_blockjob_iothread_drain_all(void)
 {
     test_blockjob_common(BDRV_DRAIN_ALL, true, TEST_JOB_SUCCESS);
@@ -1104,11 +904,6 @@ static void test_blockjob_iothread_drain(void)
     test_blockjob_common(BDRV_DRAIN, true, TEST_JOB_SUCCESS);
 }
 
-static void test_blockjob_iothread_drain_subtree(void)
-{
-    test_blockjob_common(BDRV_SUBTREE_DRAIN, true, TEST_JOB_SUCCESS);
-}
-
 static void test_blockjob_iothread_error_drain_all(void)
 {
     test_blockjob_common(BDRV_DRAIN_ALL, true, TEST_JOB_FAIL_RUN);
@@ -1121,12 +916,6 @@ static void test_blockjob_iothread_error_drain(void)
     test_blockjob_common(BDRV_DRAIN, true, TEST_JOB_FAIL_PREPARE);
 }
 
-static void test_blockjob_iothread_error_drain_subtree(void)
-{
-    test_blockjob_common(BDRV_SUBTREE_DRAIN, true, TEST_JOB_FAIL_RUN);
-    test_blockjob_common(BDRV_SUBTREE_DRAIN, true, TEST_JOB_FAIL_PREPARE);
-}
-
 
 typedef struct BDRVTestTopState {
     BdrvChild *wait_child;
@@ -1273,14 +1062,6 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
         bdrv_drain(child_bs);
         bdrv_unref(child_bs);
         break;
-    case BDRV_SUBTREE_DRAIN:
-        /* Would have to ref/unref bs here for !detach_instead_of_delete, but
-         * then the whole test becomes pointless because the graph changes
-         * don't occur during the drain any more. */
-        assert(detach_instead_of_delete);
-        bdrv_subtree_drained_begin(bs);
-        bdrv_subtree_drained_end(bs);
-        break;
     case BDRV_DRAIN_ALL:
         bdrv_drain_all_begin();
         bdrv_drain_all_end();
@@ -1315,11 +1096,6 @@ static void test_detach_by_drain(void)
     do_test_delete_by_drain(true, BDRV_DRAIN);
 }
 
-static void test_detach_by_drain_subtree(void)
-{
-    do_test_delete_by_drain(true, BDRV_SUBTREE_DRAIN);
-}
-
 
 struct detach_by_parent_data {
     BlockDriverState *parent_b;
@@ -1452,7 +1228,10 @@ static void test_detach_indirect(bool by_parent_cb)
     g_assert(acb != NULL);
 
     /* Drain and check the expected result */
-    bdrv_subtree_drained_begin(parent_b);
+    bdrv_drained_begin(parent_b);
+    bdrv_drained_begin(a);
+    bdrv_drained_begin(b);
+    bdrv_drained_begin(c);
 
     g_assert(detach_by_parent_data.child_c != NULL);
 
@@ -1467,12 +1246,15 @@ static void test_detach_indirect(bool by_parent_cb)
     g_assert(QLIST_NEXT(child_a, next) == NULL);
 
     g_assert_cmpint(parent_a->quiesce_counter, ==, 1);
-    g_assert_cmpint(parent_b->quiesce_counter, ==, 1);
+    g_assert_cmpint(parent_b->quiesce_counter, ==, 3);
     g_assert_cmpint(a->quiesce_counter, ==, 1);
-    g_assert_cmpint(b->quiesce_counter, ==, 0);
+    g_assert_cmpint(b->quiesce_counter, ==, 1);
     g_assert_cmpint(c->quiesce_counter, ==, 1);
 
-    bdrv_subtree_drained_end(parent_b);
+    bdrv_drained_end(parent_b);
+    bdrv_drained_end(a);
+    bdrv_drained_end(b);
+    bdrv_drained_end(c);
 
     bdrv_unref(parent_b);
     blk_unref(blk);
@@ -2202,70 +1984,47 @@ int main(int argc, char **argv)
 
     g_test_add_func("/bdrv-drain/driver-cb/drain_all", test_drv_cb_drain_all);
     g_test_add_func("/bdrv-drain/driver-cb/drain", test_drv_cb_drain);
-    g_test_add_func("/bdrv-drain/driver-cb/drain_subtree",
-                    test_drv_cb_drain_subtree);
 
     g_test_add_func("/bdrv-drain/driver-cb/co/drain_all",
                     test_drv_cb_co_drain_all);
     g_test_add_func("/bdrv-drain/driver-cb/co/drain", test_drv_cb_co_drain);
-    g_test_add_func("/bdrv-drain/driver-cb/co/drain_subtree",
-                    test_drv_cb_co_drain_subtree);
-
 
     g_test_add_func("/bdrv-drain/quiesce/drain_all", test_quiesce_drain_all);
     g_test_add_func("/bdrv-drain/quiesce/drain", test_quiesce_drain);
-    g_test_add_func("/bdrv-drain/quiesce/drain_subtree",
-                    test_quiesce_drain_subtree);
 
     g_test_add_func("/bdrv-drain/quiesce/co/drain_all",
                     test_quiesce_co_drain_all);
     g_test_add_func("/bdrv-drain/quiesce/co/drain", test_quiesce_co_drain);
-    g_test_add_func("/bdrv-drain/quiesce/co/drain_subtree",
-                    test_quiesce_co_drain_subtree);
 
     g_test_add_func("/bdrv-drain/nested", test_nested);
-    g_test_add_func("/bdrv-drain/multiparent", test_multiparent);
 
-    g_test_add_func("/bdrv-drain/graph-change/drain_subtree",
-                    test_graph_change_drain_subtree);
     g_test_add_func("/bdrv-drain/graph-change/drain_all",
                     test_graph_change_drain_all);
 
     g_test_add_func("/bdrv-drain/iothread/drain_all", test_iothread_drain_all);
     g_test_add_func("/bdrv-drain/iothread/drain", test_iothread_drain);
-    g_test_add_func("/bdrv-drain/iothread/drain_subtree",
-                    test_iothread_drain_subtree);
 
     g_test_add_func("/bdrv-drain/blockjob/drain_all", test_blockjob_drain_all);
     g_test_add_func("/bdrv-drain/blockjob/drain", test_blockjob_drain);
-    g_test_add_func("/bdrv-drain/blockjob/drain_subtree",
-                    test_blockjob_drain_subtree);
 
     g_test_add_func("/bdrv-drain/blockjob/error/drain_all",
                     test_blockjob_error_drain_all);
     g_test_add_func("/bdrv-drain/blockjob/error/drain",
                     test_blockjob_error_drain);
-    g_test_add_func("/bdrv-drain/blockjob/error/drain_subtree",
-                    test_blockjob_error_drain_subtree);
 
     g_test_add_func("/bdrv-drain/blockjob/iothread/drain_all",
                     test_blockjob_iothread_drain_all);
     g_test_add_func("/bdrv-drain/blockjob/iothread/drain",
                     test_blockjob_iothread_drain);
-    g_test_add_func("/bdrv-drain/blockjob/iothread/drain_subtree",
-                    test_blockjob_iothread_drain_subtree);
 
     g_test_add_func("/bdrv-drain/blockjob/iothread/error/drain_all",
                     test_blockjob_iothread_error_drain_all);
     g_test_add_func("/bdrv-drain/blockjob/iothread/error/drain",
                     test_blockjob_iothread_error_drain);
-    g_test_add_func("/bdrv-drain/blockjob/iothread/error/drain_subtree",
-                    test_blockjob_iothread_error_drain_subtree);
 
     g_test_add_func("/bdrv-drain/deletion/drain", test_delete_by_drain);
     g_test_add_func("/bdrv-drain/detach/drain_all", test_detach_by_drain_all);
     g_test_add_func("/bdrv-drain/detach/drain", test_detach_by_drain);
-    g_test_add_func("/bdrv-drain/detach/drain_subtree", test_detach_by_drain_subtree);
     g_test_add_func("/bdrv-drain/detach/parent_cb", test_detach_by_parent_cb);
     g_test_add_func("/bdrv-drain/detach/driver_cb", test_detach_by_driver_cb);
 
-- 
2.38.1



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 10/13] block: Call drain callbacks only once
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (8 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 09/13] block: Remove subtree drains Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 18:05   ` Vladimir Sementsov-Ogievskiy
                     ` (2 more replies)
  2022-11-08 12:37 ` [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions Kevin Wolf
                   ` (4 subsequent siblings)
  14 siblings, 3 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

We only need to call both the BlockDriver's callback and the parent
callbacks when going from undrained to drained or vice versa. A second
drain section doesn't make a difference for the driver or the parent;
they weren't supposed to send new requests before or after the second
drain anyway.

One thing that gets in the way is the 'ignore_bds_parents' parameter in
bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
for the first drain, bs->quiesce_counter will be non-zero, but the
parent callbacks still haven't been called, so a second drain where it
is false would still have to call them.

Instead of keeping track of this, let's just get rid of the parameter.
It was introduced in commit 6cd5c9d7b2d as an optimisation so that
during bdrv_drain_all(), we wouldn't recursively drain all parents up to
the root for each node, resulting in quadratic complexity. As it happens,
calling the callbacks only once solves the same problem, so as of this
patch, we'll still have O(n) complexity and ignore_bds_parents is not
needed any more.

This patch only ignores the 'ignore_bds_parents' parameter; the
parameter itself will be removed in a separate patch.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                      | 13 ++++++-------
 block/io.c                   | 24 +++++++++++++-----------
 tests/unit/test-bdrv-drain.c | 16 ++++++++++------
 3 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/block.c b/block.c
index 9d082631d9..8878586f6e 100644
--- a/block.c
+++ b/block.c
@@ -2816,7 +2816,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 {
     BlockDriverState *old_bs = child->bs;
     int new_bs_quiesce_counter;
-    int drain_saldo;
 
     assert(!child->frozen);
     assert(old_bs != new_bs);
@@ -2827,15 +2826,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
     }
 
     new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
-    drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter;
 
     /*
      * If the new child node is drained but the old one was not, flush
      * all outstanding requests to the old child node.
      */
-    while (drain_saldo > 0 && child->klass->drained_begin) {
+    if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {
         bdrv_parent_drained_begin_single(child, true);
-        drain_saldo--;
     }
 
     if (old_bs) {
@@ -2859,7 +2856,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
          * more often.
          */
         assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
-        drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
 
         if (child->klass->attach) {
             child->klass->attach(child);
@@ -2869,10 +2865,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
     /*
      * If the old child node was drained but the new one is not, allow
      * requests to come in only after the new node has been attached.
+     *
+     * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
+     * polls, which could have changed the value.
      */
-    while (drain_saldo < 0 && child->klass->drained_end) {
+    new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
+    if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
         bdrv_parent_drained_end_single(child);
-        drain_saldo++;
     }
 }
 
diff --git a/block/io.c b/block/io.c
index 870a25d7a5..87c7a92f15 100644
--- a/block/io.c
+++ b/block/io.c
@@ -62,7 +62,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c)
 {
     IO_OR_GS_CODE();
 
-    assert(c->parent_quiesce_counter > 0);
+    assert(c->parent_quiesce_counter == 1);
     c->parent_quiesce_counter--;
     if (c->klass->drained_end) {
         c->klass->drained_end(c);
@@ -109,6 +109,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore,
 void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
 {
     IO_OR_GS_CODE();
+    assert(c->parent_quiesce_counter == 0);
     c->parent_quiesce_counter++;
     if (c->klass->drained_begin) {
         c->klass->drained_begin(c);
@@ -352,16 +353,16 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
                                    BdrvChild *parent, bool ignore_bds_parents)
 {
     IO_OR_GS_CODE();
-    assert(!qemu_in_coroutine());
 
     /* Stop things in parent-to-child order */
     if (qatomic_fetch_inc(&bs->quiesce_counter) == 0) {
         aio_disable_external(bdrv_get_aio_context(bs));
-    }
 
-    bdrv_parent_drained_begin(bs, parent, ignore_bds_parents);
-    if (bs->drv && bs->drv->bdrv_drain_begin) {
-        bs->drv->bdrv_drain_begin(bs);
+        /* TODO Remove ignore_bds_parents, we don't consider it any more */
+        bdrv_parent_drained_begin(bs, parent, false);
+        if (bs->drv && bs->drv->bdrv_drain_begin) {
+            bs->drv->bdrv_drain_begin(bs);
+        }
     }
 }
 
@@ -412,13 +413,14 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
     assert(bs->quiesce_counter > 0);
 
     /* Re-enable things in child-to-parent order */
-    if (bs->drv && bs->drv->bdrv_drain_end) {
-        bs->drv->bdrv_drain_end(bs);
-    }
-    bdrv_parent_drained_end(bs, parent, ignore_bds_parents);
-
     old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
     if (old_quiesce_counter == 1) {
+        if (bs->drv && bs->drv->bdrv_drain_end) {
+            bs->drv->bdrv_drain_end(bs);
+        }
+        /* TODO Remove ignore_bds_parents, we don't consider it any more */
+        bdrv_parent_drained_end(bs, parent, false);
+
         aio_enable_external(bdrv_get_aio_context(bs));
     }
 }
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index dda08de8db..172bc6debc 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -296,7 +296,11 @@ static void test_quiesce_common(enum drain_type drain_type, bool recursive)
 
     do_drain_begin(drain_type, bs);
 
-    g_assert_cmpint(bs->quiesce_counter, ==, 1);
+    if (drain_type == BDRV_DRAIN_ALL) {
+        g_assert_cmpint(bs->quiesce_counter, ==, 2);
+    } else {
+        g_assert_cmpint(bs->quiesce_counter, ==, 1);
+    }
     g_assert_cmpint(backing->quiesce_counter, ==, !!recursive);
 
     do_drain_end(drain_type, bs);
@@ -348,8 +352,8 @@ static void test_nested(void)
 
     for (outer = 0; outer < DRAIN_TYPE_MAX; outer++) {
         for (inner = 0; inner < DRAIN_TYPE_MAX; inner++) {
-            int backing_quiesce = (outer != BDRV_DRAIN) +
-                                  (inner != BDRV_DRAIN);
+            int backing_quiesce = (outer == BDRV_DRAIN_ALL) +
+                                  (inner == BDRV_DRAIN_ALL);
 
             g_assert_cmpint(bs->quiesce_counter, ==, 0);
             g_assert_cmpint(backing->quiesce_counter, ==, 0);
@@ -359,10 +363,10 @@ static void test_nested(void)
             do_drain_begin(outer, bs);
             do_drain_begin(inner, bs);
 
-            g_assert_cmpint(bs->quiesce_counter, ==, 2);
+            g_assert_cmpint(bs->quiesce_counter, ==, 2 + !!backing_quiesce);
             g_assert_cmpint(backing->quiesce_counter, ==, backing_quiesce);
-            g_assert_cmpint(s->drain_count, ==, 2);
-            g_assert_cmpint(backing_s->drain_count, ==, backing_quiesce);
+            g_assert_cmpint(s->drain_count, ==, 1);
+            g_assert_cmpint(backing_s->drain_count, ==, !!backing_quiesce);
 
             do_drain_end(inner, bs);
             do_drain_end(outer, bs);
-- 
2.38.1




* [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (9 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 10/13] block: Call drain callbacks only once Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-09 18:57   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:23   ` Hanna Reitz
  2022-11-08 12:37 ` [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm() Kevin Wolf
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

ignore_bds_parents is now ignored, so we can just remove it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-io.h | 10 ++----
 block.c                  |  4 +--
 block/io.c               | 78 +++++++++++++++-------------------------
 3 files changed, 32 insertions(+), 60 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index c35cb1e53f..5b54ed4672 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -305,14 +305,9 @@ void bdrv_parent_drained_end_single(BdrvChild *c);
  *
  * Poll for pending requests in @bs and its parents (except for @ignore_parent).
  *
- * If @ignore_bds_parents is true, parents that are BlockDriverStates must
- * ignore the drain request because they will be drained separately (used for
- * drain_all).
- *
  * This is part of bdrv_drained_begin.
  */
-bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
-                     bool ignore_bds_parents);
+bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent);
 
 /**
  * bdrv_drained_begin:
@@ -330,8 +325,7 @@ void bdrv_drained_begin(BlockDriverState *bs);
  * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already
  * running requests to complete.
  */
-void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
-                                   BdrvChild *parent, bool ignore_bds_parents);
+void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, BdrvChild *parent);
 
 /**
  * bdrv_drained_end:
diff --git a/block.c b/block.c
index 8878586f6e..5f5f79cd16 100644
--- a/block.c
+++ b/block.c
@@ -1218,13 +1218,13 @@ static char *bdrv_child_get_parent_desc(BdrvChild *c)
 static void bdrv_child_cb_drained_begin(BdrvChild *child)
 {
     BlockDriverState *bs = child->opaque;
-    bdrv_do_drained_begin_quiesce(bs, NULL, false);
+    bdrv_do_drained_begin_quiesce(bs, NULL);
 }
 
 static bool bdrv_child_cb_drained_poll(BdrvChild *child)
 {
     BlockDriverState *bs = child->opaque;
-    return bdrv_drain_poll(bs, NULL, false);
+    return bdrv_drain_poll(bs, NULL);
 }
 
 static void bdrv_child_cb_drained_end(BdrvChild *child)
diff --git a/block/io.c b/block/io.c
index 87c7a92f15..4a83359a8f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -45,13 +45,12 @@ static void bdrv_parent_cb_resize(BlockDriverState *bs);
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int64_t offset, int64_t bytes, BdrvRequestFlags flags);
 
-static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore,
-                                      bool ignore_bds_parents)
+static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore)
 {
     BdrvChild *c, *next;
 
     QLIST_FOREACH_SAFE(c, &bs->parents, next_parent, next) {
-        if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) {
+        if (c == ignore) {
             continue;
         }
         bdrv_parent_drained_begin_single(c, false);
@@ -69,13 +68,12 @@ void bdrv_parent_drained_end_single(BdrvChild *c)
     }
 }
 
-static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore,
-                                    bool ignore_bds_parents)
+static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore)
 {
     BdrvChild *c;
 
     QLIST_FOREACH(c, &bs->parents, next_parent) {
-        if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) {
+        if (c == ignore) {
             continue;
         }
         bdrv_parent_drained_end_single(c);
@@ -90,14 +88,13 @@ static bool bdrv_parent_drained_poll_single(BdrvChild *c)
     return false;
 }
 
-static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore,
-                                     bool ignore_bds_parents)
+static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore)
 {
     BdrvChild *c, *next;
     bool busy = false;
 
     QLIST_FOREACH_SAFE(c, &bs->parents, next_parent, next) {
-        if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) {
+        if (c == ignore) {
             continue;
         }
         busy |= bdrv_parent_drained_poll_single(c);
@@ -238,16 +235,14 @@ typedef struct {
     bool begin;
     bool poll;
     BdrvChild *parent;
-    bool ignore_bds_parents;
 } BdrvCoDrainData;
 
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
-bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
-                     bool ignore_bds_parents)
+bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent)
 {
     IO_OR_GS_CODE();
 
-    if (bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents)) {
+    if (bdrv_parent_drained_poll(bs, ignore_parent)) {
         return true;
     }
 
@@ -258,16 +253,9 @@ bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
     return false;
 }
 
-static bool bdrv_drain_poll_top_level(BlockDriverState *bs,
-                                      BdrvChild *ignore_parent)
-{
-    return bdrv_drain_poll(bs, ignore_parent, false);
-}
-
 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
-                                  bool ignore_bds_parents, bool poll);
-static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
-                                bool ignore_bds_parents);
+                                  bool poll);
+static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent);
 
 static void bdrv_co_drain_bh_cb(void *opaque)
 {
@@ -280,11 +268,10 @@ static void bdrv_co_drain_bh_cb(void *opaque)
         aio_context_acquire(ctx);
         bdrv_dec_in_flight(bs);
         if (data->begin) {
-            bdrv_do_drained_begin(bs, data->parent, data->ignore_bds_parents,
-                                  data->poll);
+            bdrv_do_drained_begin(bs, data->parent, data->poll);
         } else {
             assert(!data->poll);
-            bdrv_do_drained_end(bs, data->parent, data->ignore_bds_parents);
+            bdrv_do_drained_end(bs, data->parent);
         }
         aio_context_release(ctx);
     } else {
@@ -299,7 +286,6 @@ static void bdrv_co_drain_bh_cb(void *opaque)
 static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
                                                 bool begin,
                                                 BdrvChild *parent,
-                                                bool ignore_bds_parents,
                                                 bool poll)
 {
     BdrvCoDrainData data;
@@ -317,7 +303,6 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
         .done = false,
         .begin = begin,
         .parent = parent,
-        .ignore_bds_parents = ignore_bds_parents,
         .poll = poll,
     };
 
@@ -349,17 +334,14 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
     }
 }
 
-void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
-                                   BdrvChild *parent, bool ignore_bds_parents)
+void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, BdrvChild *parent)
 {
     IO_OR_GS_CODE();
 
     /* Stop things in parent-to-child order */
     if (qatomic_fetch_inc(&bs->quiesce_counter) == 0) {
         aio_disable_external(bdrv_get_aio_context(bs));
-
-        /* TODO Remove ignore_bds_parents, we don't consider it any more */
-        bdrv_parent_drained_begin(bs, parent, false);
+        bdrv_parent_drained_begin(bs, parent);
         if (bs->drv && bs->drv->bdrv_drain_begin) {
             bs->drv->bdrv_drain_begin(bs);
         }
@@ -367,14 +349,14 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
 }
 
 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
-                                  bool ignore_bds_parents, bool poll)
+                                  bool poll)
 {
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(bs, true, parent, ignore_bds_parents, poll);
+        bdrv_co_yield_to_drain(bs, true, parent, poll);
         return;
     }
 
-    bdrv_do_drained_begin_quiesce(bs, parent, ignore_bds_parents);
+    bdrv_do_drained_begin_quiesce(bs, parent);
 
     /*
      * Wait for drained requests to finish.
@@ -386,28 +368,26 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
      * nodes.
      */
     if (poll) {
-        assert(!ignore_bds_parents);
-        BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
+        BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs, parent));
     }
 }
 
 void bdrv_drained_begin(BlockDriverState *bs)
 {
     IO_OR_GS_CODE();
-    bdrv_do_drained_begin(bs, NULL, false, true);
+    bdrv_do_drained_begin(bs, NULL, true);
 }
 
 /**
  * This function does not poll, nor must any of its recursively called
  * functions.
  */
-static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
-                                bool ignore_bds_parents)
+static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent)
 {
     int old_quiesce_counter;
 
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(bs, false, parent, ignore_bds_parents, false);
+        bdrv_co_yield_to_drain(bs, false, parent, false);
         return;
     }
     assert(bs->quiesce_counter > 0);
@@ -418,9 +398,7 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
         if (bs->drv && bs->drv->bdrv_drain_end) {
             bs->drv->bdrv_drain_end(bs);
         }
-        /* TODO Remove ignore_bds_parents, we don't consider it any more */
-        bdrv_parent_drained_end(bs, parent, false);
-
+        bdrv_parent_drained_end(bs, parent);
         aio_enable_external(bdrv_get_aio_context(bs));
     }
 }
@@ -428,7 +406,7 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
 void bdrv_drained_end(BlockDriverState *bs)
 {
     IO_OR_GS_CODE();
-    bdrv_do_drained_end(bs, NULL, false);
+    bdrv_do_drained_end(bs, NULL);
 }
 
 void bdrv_drain(BlockDriverState *bs)
@@ -461,7 +439,7 @@ static bool bdrv_drain_all_poll(void)
     while ((bs = bdrv_next_all_states(bs))) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
         aio_context_acquire(aio_context);
-        result |= bdrv_drain_poll(bs, NULL, true);
+        result |= bdrv_drain_poll(bs, NULL);
         aio_context_release(aio_context);
     }
 
@@ -486,7 +464,7 @@ void bdrv_drain_all_begin(void)
     GLOBAL_STATE_CODE();
 
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(NULL, true, NULL, true, true);
+        bdrv_co_yield_to_drain(NULL, true, NULL, true);
         return;
     }
 
@@ -511,7 +489,7 @@ void bdrv_drain_all_begin(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_begin(bs, NULL, true, false);
+        bdrv_do_drained_begin(bs, NULL, false);
         aio_context_release(aio_context);
     }
 
@@ -531,7 +509,7 @@ void bdrv_drain_all_end_quiesce(BlockDriverState *bs)
     g_assert(!bs->refcnt);
 
     while (bs->quiesce_counter) {
-        bdrv_do_drained_end(bs, NULL, true);
+        bdrv_do_drained_end(bs, NULL);
     }
 }
 
@@ -553,7 +531,7 @@ void bdrv_drain_all_end(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_end(bs, NULL, true);
+        bdrv_do_drained_end(bs, NULL);
         aio_context_release(aio_context);
     }
 
-- 
2.38.1




* [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (10 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-11 11:21   ` Emanuele Giuseppe Esposito
  2022-11-14 20:22   ` Hanna Reitz
  2022-11-08 12:37 ` [PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single() Kevin Wolf
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

In order to make sure that bdrv_replace_child_noperm() doesn't have to
poll any more, get rid of the bdrv_parent_drained_begin_single() call.

This is possible now because we can require that the child is already
drained when the function is called (it better be, having in-flight
requests while modifying the graph isn't going to end well!) and we
don't call the parent drain callbacks more than once.

The additional drain calls needed in callers cause the test case to run
its code in the drain handler too early (bdrv_attach_child() drains
now), so modify it to only enable the code after the test setup has
completed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-io.h     |  8 ++++
 block.c                      | 72 +++++++++++++++++++++++++-----------
 block/io.c                   |  2 +-
 tests/unit/test-bdrv-drain.c | 10 +++++
 4 files changed, 70 insertions(+), 22 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 5b54ed4672..ddce8550a9 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -290,6 +290,14 @@ bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
  */
 void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll);
 
+/**
+ * bdrv_parent_drained_poll_single:
+ *
+ * Returns true if there is any pending activity to cease before @c can be
+ * called quiesced, false otherwise.
+ */
+bool bdrv_parent_drained_poll_single(BdrvChild *c);
+
 /**
  * bdrv_parent_drained_end_single:
  *
diff --git a/block.c b/block.c
index 5f5f79cd16..12039e9b8a 100644
--- a/block.c
+++ b/block.c
@@ -2399,6 +2399,20 @@ static void bdrv_replace_child_abort(void *opaque)
 
     GLOBAL_STATE_CODE();
     /* old_bs reference is transparently moved from @s to @s->child */
+    if (!s->child->bs) {
+        /*
+         * The parents were undrained when removing old_bs from the child. New
+         * requests can't have been made, though, because the child was empty.
+         *
+         * TODO Make bdrv_replace_child_noperm() transactionable to avoid
+         * undraining the parent in the first place. Once this is done, having
+         * new_bs drained when calling bdrv_replace_child_tran() is not a
+         * requirement any more.
+         */
+        bdrv_parent_drained_begin_single(s->child, false);
+        assert(!bdrv_parent_drained_poll_single(s->child));
+    }
+    assert(s->child->parent_quiesce_counter);
     bdrv_replace_child_noperm(s->child, s->old_bs);
     bdrv_unref(new_bs);
 }
@@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  *
  * Note: real unref of old_bs is done only on commit.
  *
+ * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept
+ * drained until the transaction is completed (this automatically implies that
+ * child remains drained, too).
+ *
  * The function doesn't update permissions, caller is responsible for this.
  */
 static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
                                     Transaction *tran)
 {
     BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
+
+    assert(child->parent_quiesce_counter);
+    assert(!new_bs || new_bs->quiesce_counter);
+
     *s = (BdrvReplaceChildState) {
         .child = child,
         .old_bs = child->bs,
@@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
     int new_bs_quiesce_counter;
 
     assert(!child->frozen);
+    /*
+     * When removing the child, it's the callers responsibility to make sure
+     * that no requests are in flight any more. Usually the parent is drained,
+     * but not through child->parent_quiesce_counter.
+     */
+    assert(!new_bs || child->parent_quiesce_counter);
     assert(old_bs != new_bs);
     GLOBAL_STATE_CODE();
 
@@ -2825,16 +2853,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
         assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
     }
 
-    new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
-
-    /*
-     * If the new child node is drained but the old one was not, flush
-     * all outstanding requests to the old child node.
-     */
-    if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {
-        bdrv_parent_drained_begin_single(child, true);
-    }
-
     if (old_bs) {
         if (child->klass->detach) {
             child->klass->detach(child);
@@ -2849,14 +2867,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
         assert_bdrv_graph_writable(new_bs);
         QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
-        /*
-         * Polling in bdrv_parent_drained_begin_single() may have led to the new
-         * node's quiesce_counter having been decreased.  Not a problem, we just
-         * need to recognize this here and then invoke drained_end appropriately
-         * more often.
-         */
-        assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
-
         if (child->klass->attach) {
             child->klass->attach(child);
         }
@@ -2865,9 +2875,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
     /*
      * If the old child node was drained but the new one is not, allow
      * requests to come in only after the new node has been attached.
-     *
-     * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
-     * polls, which could have changed the value.
      */
     new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
     if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
@@ -3004,6 +3011,12 @@ static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs,
     }
 
     bdrv_ref(child_bs);
+    /*
+     * Let every new BdrvChild start drained; inserting it in the graph with
+     * bdrv_replace_child_noperm() will undrain it if the child node is not
+     * drained. The child was only just created, so polling is not necessary.
+     */
+    bdrv_parent_drained_begin_single(new_child, false);
     bdrv_replace_child_noperm(new_child, child_bs);
 
     BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);
@@ -5053,7 +5066,10 @@ static void bdrv_remove_child(BdrvChild *child, Transaction *tran)
     }
 
     if (child->bs) {
+        BlockDriverState *bs = child->bs;
+        bdrv_drained_begin(bs);
         bdrv_replace_child_tran(child, NULL, tran);
+        bdrv_drained_end(bs);
     }
 
     tran_add(tran, &bdrv_remove_child_drv, child);
@@ -5070,6 +5086,15 @@ static void bdrv_remove_filter_or_cow_child(BlockDriverState *bs,
     bdrv_remove_child(bdrv_filter_or_cow_child(bs), tran);
 }
 
+static void undrain_on_clean_cb(void *opaque)
+{
+    bdrv_drained_end(opaque);
+}
+
+static TransactionActionDrv undrain_on_clean = {
+    .clean = undrain_on_clean_cb,
+};
+
 static int bdrv_replace_node_noperm(BlockDriverState *from,
                                     BlockDriverState *to,
                                     bool auto_skip, Transaction *tran,
@@ -5079,6 +5104,11 @@ static int bdrv_replace_node_noperm(BlockDriverState *from,
 
     GLOBAL_STATE_CODE();
 
+    bdrv_drained_begin(from);
+    bdrv_drained_begin(to);
+    tran_add(tran, &undrain_on_clean, from);
+    tran_add(tran, &undrain_on_clean, to);
+
     QLIST_FOREACH_SAFE(c, &from->parents, next_parent, next) {
         assert(c->bs == from);
         if (!should_update_child(c, to)) {
diff --git a/block/io.c b/block/io.c
index 4a83359a8f..d0f641926f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -80,7 +80,7 @@ static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore)
     }
 }
 
-static bool bdrv_parent_drained_poll_single(BdrvChild *c)
+bool bdrv_parent_drained_poll_single(BdrvChild *c)
 {
     if (c->klass->drained_poll) {
         return c->klass->drained_poll(c);
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 172bc6debc..2686a8acee 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1654,6 +1654,7 @@ static void test_drop_intermediate_poll(void)
 
 
 typedef struct BDRVReplaceTestState {
+    bool setup_completed;
     bool was_drained;
     bool was_undrained;
     bool has_read;
@@ -1738,6 +1739,10 @@ static void bdrv_replace_test_drain_begin(BlockDriverState *bs)
 {
     BDRVReplaceTestState *s = bs->opaque;
 
+    if (!s->setup_completed) {
+        return;
+    }
+
     if (!s->drain_count) {
         s->drain_co = qemu_coroutine_create(bdrv_replace_test_drain_co, bs);
         bdrv_inc_in_flight(bs);
@@ -1769,6 +1774,10 @@ static void bdrv_replace_test_drain_end(BlockDriverState *bs)
 {
     BDRVReplaceTestState *s = bs->opaque;
 
+    if (!s->setup_completed) {
+        return;
+    }
+
     g_assert(s->drain_count > 0);
     if (!--s->drain_count) {
         s->was_undrained = true;
@@ -1867,6 +1876,7 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
     bdrv_ref(old_child_bs);
     bdrv_attach_child(parent_bs, old_child_bs, "child", &child_of_bds,
                       BDRV_CHILD_COW, &error_abort);
+    parent_s->setup_completed = true;
 
     for (i = 0; i < old_drain_count; i++) {
         bdrv_drained_begin(old_child_bs);
-- 
2.38.1




* [PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single()
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (11 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm() Kevin Wolf
@ 2022-11-08 12:37 ` Kevin Wolf
  2022-11-14 20:24   ` Hanna Reitz
  2022-11-10 20:13 ` [PATCH 00/13] block: Simplify drain Stefan Hajnoczi
  2022-11-11 11:23 ` Emanuele Giuseppe Esposito
  14 siblings, 1 reply; 61+ messages in thread
From: Kevin Wolf @ 2022-11-08 12:37 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, eesposit, stefanha, hreitz, pbonzini, qemu-devel

All callers of bdrv_parent_drained_begin_single() pass poll=false now,
so we don't need the parameter any more.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-io.h | 5 ++---
 block.c                  | 4 ++--
 block/io.c               | 7 ++-----
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index ddce8550a9..35669f0e62 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -285,10 +285,9 @@ bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 /**
  * bdrv_parent_drained_begin_single:
  *
- * Begin a quiesced section for the parent of @c. If @poll is true, wait for
- * any pending activity to cease.
+ * Begin a quiesced section for the parent of @c.
  */
-void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll);
+void bdrv_parent_drained_begin_single(BdrvChild *c);
 
 /**
  * bdrv_parent_drained_poll_single:
diff --git a/block.c b/block.c
index 12039e9b8a..c200f7afa0 100644
--- a/block.c
+++ b/block.c
@@ -2409,7 +2409,7 @@ static void bdrv_replace_child_abort(void *opaque)
          * new_bs drained when calling bdrv_replace_child_tran() is not a
          * requirement any more.
          */
-        bdrv_parent_drained_begin_single(s->child, false);
+        bdrv_parent_drained_begin_single(s->child);
         assert(!bdrv_parent_drained_poll_single(s->child));
     }
     assert(s->child->parent_quiesce_counter);
@@ -3016,7 +3016,7 @@ static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs,
      * bdrv_replace_child_noperm() will undrain it if the child node is not
      * drained. The child was only just created, so polling is not necessary.
      */
-    bdrv_parent_drained_begin_single(new_child, false);
+    bdrv_parent_drained_begin_single(new_child);
     bdrv_replace_child_noperm(new_child, child_bs);
 
     BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);
diff --git a/block/io.c b/block/io.c
index d0f641926f..9bcb19e5ee 100644
--- a/block/io.c
+++ b/block/io.c
@@ -53,7 +53,7 @@ static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore)
         if (c == ignore) {
             continue;
         }
-        bdrv_parent_drained_begin_single(c, false);
+        bdrv_parent_drained_begin_single(c);
     }
 }
 
@@ -103,7 +103,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore)
     return busy;
 }
 
-void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
+void bdrv_parent_drained_begin_single(BdrvChild *c)
 {
     IO_OR_GS_CODE();
     assert(c->parent_quiesce_counter == 0);
@@ -111,9 +111,6 @@ void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
     if (c->klass->drained_begin) {
         c->klass->drained_begin(c);
     }
-    if (poll) {
-        BDRV_POLL_WHILE(c->bs, bdrv_parent_drained_poll_single(c));
-    }
 }
 
 static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src)
-- 
2.38.1




* Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
@ 2022-11-09  9:21   ` Vladimir Sementsov-Ogievskiy
  2022-11-09  9:27   ` Vladimir Sementsov-Ogievskiy
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09  9:21 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> We want to change .bdrv_co_drained_begin() back to be a non-coroutine
> callback, so in preparation, avoid yielding in its implementation.
> 
> Because we increase bs->in_flight and bdrv_drained_begin() polls, the
> behaviour is unchanged.
> 
> Signed-off-by: Kevin Wolf<kwolf@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
  2022-11-09  9:21   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-09  9:27   ` Vladimir Sementsov-Ogievskiy
  2022-11-09 12:22     ` Kevin Wolf
  2022-11-09 21:49   ` Stefan Hajnoczi
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09  9:27 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
>       int ret;
>   
>       trace_qed_need_check_timer_cb(s);
> @@ -310,9 +309,20 @@ static void coroutine_fn qed_need_check_timer_entry(void *opaque)
>       (void) ret;
>   }
>   
> +static void coroutine_fn qed_need_check_timer_entry(void *opaque)
> +{
> +    BDRVQEDState *s = opaque;
> +
> +    qed_need_check_timer(opaque);
> +    bdrv_dec_in_flight(s->bs);

hmm, one question: don't we need an aio_wait_kick() call here?

> +}
> +
>   static void qed_need_che

-- 
Best regards,
Vladimir




* Re: [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  2022-11-08 12:37 ` [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end() Kevin Wolf
@ 2022-11-09 10:50   ` Vladimir Sementsov-Ogievskiy
  2022-11-09 12:28     ` Kevin Wolf
  2022-11-09 13:45   ` Vladimir Sementsov-Ogievskiy
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 10:50 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
> callbacks, so in preparation, avoid yielding in their implementation.
> 
> This does almost the same as the existing logic in bdrv_drain_invoke(),
> by creating and entering coroutines internally. However, since the test
> case is by far the heaviest user of coroutine code in drain callbacks,
> it is preferable to have the complexity in the test case rather than the
> drain core, which is already complicated enough without this.
> 
> The behaviour for bdrv_drain_begin() is unchanged because we increase
> bs->in_flight and this is still polled. However, bdrv_drain_end()
> doesn't wait for the spawned coroutine to complete any more. This is
> fine, we don't rely on bdrv_drain_end() restarting all operations
> immediately before the next aio_poll().
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   tests/unit/test-bdrv-drain.c | 64 ++++++++++++++++++++++++++----------
>   1 file changed, 46 insertions(+), 18 deletions(-)
> 
> diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> index 09dc4a4891..24f34e24ad 100644
> --- a/tests/unit/test-bdrv-drain.c
> +++ b/tests/unit/test-bdrv-drain.c
> @@ -38,12 +38,22 @@ typedef struct BDRVTestState {
>       bool sleep_in_drain_begin;
>   } BDRVTestState;
>   
> +static void coroutine_fn sleep_in_drain_begin(void *opaque)
> +{
> +    BlockDriverState *bs = opaque;
> +
> +    qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> +    bdrv_dec_in_flight(bs);
> +}
> +
>   static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
>   {
>       BDRVTestState *s = bs->opaque;
>       s->drain_count++;
>       if (s->sleep_in_drain_begin) {
> -        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> +        Coroutine *co = qemu_coroutine_create(sleep_in_drain_begin, bs);
> +        bdrv_inc_in_flight(bs);
> +        aio_co_enter(bdrv_get_aio_context(bs), co);
>       }
>   }
>   
> @@ -1916,6 +1926,21 @@ static int coroutine_fn bdrv_replace_test_co_preadv(BlockDriverState *bs,
>       return 0;
>   }
>   
> +static void coroutine_fn bdrv_replace_test_drain_co(void *opaque)
> +{
> +    BlockDriverState *bs = opaque;
> +    BDRVReplaceTestState *s = bs->opaque;
> +
> +    /* Keep waking io_co up until it is done */
> +    while (s->io_co) {
> +        aio_co_wake(s->io_co);
> +        s->io_co = NULL;
> +        qemu_coroutine_yield();
> +    }
> +    s->drain_co = NULL;
> +    bdrv_dec_in_flight(bs);
> +}

Same question: don't we need aio_wait_kick() after decrementing in_flight?

Also, it seems we have an extra waiting level here: a special coroutine that waits in a loop.

Could we just do in .drain_begin:

if (s->io_co) {
    bdrv_inc_in_flight(bs);
}

and in .co_preadv instead of waking s->drain_co simply

if (s->drain_count == 1) {
   bdrv_dec_in_flight(bs);
   aio_wait_kick();
}


Or even better, increment in_flight when io_co becomes non-NULL.

> +
>   /**
>    * If .drain_count is 0, wake up .io_co if there is one; and set
>    * .was_drained.
> @@ -1926,20 +1951,27 @@ static void coroutine_fn bdrv_replace_test_co_drain_begin(BlockDriverState *bs)
>       BDRVReplaceTestState *s = bs->opaque;
>   
>       if (!s->drain_count) {
> -        /* Keep waking io_co up until it is done */
> -        s->drain_co = qemu_coroutine_self();
> -        while (s->io_co) {
> -            aio_co_wake(s->io_co);
> -            s->io_co = NULL;
> -            qemu_coroutine_yield();
> -        }
> -        s->drain_co = NULL;
> -
> +        s->drain_co = qemu_coroutine_create(bdrv_replace_test_drain_co, bs);
> +        bdrv_inc_in_flight(bs);
> +        aio_co_enter(bdrv_get_aio_context(bs), s->drain_co);
>           s->was_drained = true;
>       }
>       s->drain_count++;
>   }
>   
> +static void coroutine_fn bdrv_replace_test_read_entry(void *opaque)
> +{
> +    BlockDriverState *bs = opaque;
> +    char data;
> +    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
> +    int ret;
> +
> +    /* Queue a read request post-drain */
> +    ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
> +    g_assert(ret >= 0);
> +    bdrv_dec_in_flight(bs);
> +}
> +
>   /**
>    * Reduce .drain_count, set .was_undrained once it reaches 0.
>    * If .drain_count reaches 0 and the node has a backing file, issue a
> @@ -1951,17 +1983,13 @@ static void coroutine_fn bdrv_replace_test_co_drain_end(BlockDriverState *bs)
>   
>       g_assert(s->drain_count > 0);
>       if (!--s->drain_count) {
> -        int ret;
> -
>           s->was_undrained = true;
>   
>           if (bs->backing) {
> -            char data;
> -            QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
> -
> -            /* Queue a read request post-drain */
> -            ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
> -            g_assert(ret >= 0);
> +            Coroutine *co = qemu_coroutine_create(bdrv_replace_test_read_entry,
> +                                                  bs);
> +            bdrv_inc_in_flight(bs);
> +            aio_co_enter(bdrv_get_aio_context(bs), co);
>           }
>       }
>   }

-- 
Best regards,
Vladimir



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-09  9:27   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-09 12:22     ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-09 12:22 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Am 09.11.2022 um 10:27 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11/8/22 15:37, Kevin Wolf wrote:
> >       int ret;
> >       trace_qed_need_check_timer_cb(s);
> > @@ -310,9 +309,20 @@ static void coroutine_fn qed_need_check_timer_entry(void *opaque)
> >       (void) ret;
> >   }
> > +static void coroutine_fn qed_need_check_timer_entry(void *opaque)
> > +{
> > +    BDRVQEDState *s = opaque;
> > +
> > +    qed_need_check_timer(opaque);
> > +    bdrv_dec_in_flight(s->bs);
> 
> hmm, one question: don't we need aio_wait_kick() call here?

bdrv_dec_in_flight() already calls aio_wait_kick() internally, so any
places that use it don't need a separate aio_wait_kick().

Kevin




* Re: [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  2022-11-09 10:50   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-09 12:28     ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-09 12:28 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Am 09.11.2022 um 11:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11/8/22 15:37, Kevin Wolf wrote:
> > We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
> > callbacks, so in preparation, avoid yielding in their implementation.
> > 
> > This does almost the same as the existing logic in bdrv_drain_invoke(),
> > by creating and entering coroutines internally. However, since the test
> > case is by far the heaviest user of coroutine code in drain callbacks,
> > it is preferable to have the complexity in the test case rather than the
> > drain core, which is already complicated enough without this.
> > 
> > The behaviour for bdrv_drain_begin() is unchanged because we increase
> > bs->in_flight and this is still polled. However, bdrv_drain_end()
> > doesn't wait for the spawned coroutine to complete any more. This is
> > fine, we don't rely on bdrv_drain_end() restarting all operations
> > immediately before the next aio_poll().
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   tests/unit/test-bdrv-drain.c | 64 ++++++++++++++++++++++++++----------
> >   1 file changed, 46 insertions(+), 18 deletions(-)
> > 
> > diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> > index 09dc4a4891..24f34e24ad 100644
> > --- a/tests/unit/test-bdrv-drain.c
> > +++ b/tests/unit/test-bdrv-drain.c
> > @@ -38,12 +38,22 @@ typedef struct BDRVTestState {
> >       bool sleep_in_drain_begin;
> >   } BDRVTestState;
> > +static void coroutine_fn sleep_in_drain_begin(void *opaque)
> > +{
> > +    BlockDriverState *bs = opaque;
> > +
> > +    qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> > +    bdrv_dec_in_flight(bs);
> > +}
> > +
> >   static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
> >   {
> >       BDRVTestState *s = bs->opaque;
> >       s->drain_count++;
> >       if (s->sleep_in_drain_begin) {
> > -        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> > +        Coroutine *co = qemu_coroutine_create(sleep_in_drain_begin, bs);
> > +        bdrv_inc_in_flight(bs);
> > +        aio_co_enter(bdrv_get_aio_context(bs), co);
> >       }
> >   }
> > @@ -1916,6 +1926,21 @@ static int coroutine_fn bdrv_replace_test_co_preadv(BlockDriverState *bs,
> >       return 0;
> >   }
> > +static void coroutine_fn bdrv_replace_test_drain_co(void *opaque)
> > +{
> > +    BlockDriverState *bs = opaque;
> > +    BDRVReplaceTestState *s = bs->opaque;
> > +
> > +    /* Keep waking io_co up until it is done */
> > +    while (s->io_co) {
> > +        aio_co_wake(s->io_co);
> > +        s->io_co = NULL;
> > +        qemu_coroutine_yield();
> > +    }
> > +    s->drain_co = NULL;
> > +    bdrv_dec_in_flight(bs);
> > +}
> 
> Same question: don't we need aio_wait_kick() after decrementing in_flight?
> 
> Also, it seems we have an extra waiting level here: a special coroutine that waits in a loop.
> 
> Could we just do in .drain_begin:
> 
> if (s->io_co) {
>    bdrv_inc_in_flight(bs);
> }
> 
> and in .co_preadv instead of waking s->drain_co simply
> 
> if (s->drain_count == 1) {
>   bdrv_dec_in_flight(bs);
>   aio_wait_kick();
> }
> 
> or even better, increment in_flight when io_co becomes non-NULL.

I just did the minimal transformation of the existing code in the test
case.

These test cases often test specific interactions between coroutines, so
I could imagine that the additional yield is not just some inefficient
code, but coroutines that yield multiple times could actually be the
scenario that is supposed to be tested.

I didn't check it for this one, but making test cases more efficient
isn't automatically a good thing if they then end up not testing certain
code paths any more. So if you intend to make a change here, it would
need a careful analysis of all test cases that use the driver.

Kevin




* Re: [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  2022-11-08 12:37 ` [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end() Kevin Wolf
  2022-11-09 10:50   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-09 13:45   ` Vladimir Sementsov-Ogievskiy
  2022-11-11 11:14   ` Emanuele Giuseppe Esposito
  2022-11-14 18:16   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 13:45 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
> callbacks, so in preparation, avoid yielding in their implementation.
> 
> This does almost the same as the existing logic in bdrv_drain_invoke(),
> by creating and entering coroutines internally. However, since the test
> case is by far the heaviest user of coroutine code in drain callbacks,
> it is preferable to have the complexity in the test case rather than the
> drain core, which is already complicated enough without this.
> 
> The behaviour for bdrv_drain_begin() is unchanged because we increase
> bs->in_flight and this is still polled. However, bdrv_drain_end()
> doesn't wait for the spawned coroutine to complete any more. This is
> fine, we don't rely on bdrv_drain_end() restarting all operations
> immediately before the next aio_poll().
> 
> Signed-off-by: Kevin Wolf<kwolf@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  2022-11-08 12:37 ` [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn Kevin Wolf
@ 2022-11-09 14:29   ` Vladimir Sementsov-Ogievskiy
  2022-11-09 22:13   ` Stefan Hajnoczi
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 14:29 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> Polling during bdrv_drained_end() can be problematic (and in the future,
> we may get cases for bdrv_drained_begin() where polling is forbidden,
> and we don't care about already in-flight requests, but just want to
> prevent new requests from arriving).
> 
> The .bdrv_drained_begin/end callbacks running in a coroutine is the only
> reason why we have to do this polling, so make them non-coroutine
> callbacks again. None of the callers actually yield any more.
> 
> This means that bdrv_drained_end() effectively doesn't poll any more,
> even if AIO_WAIT_WHILE() loops are still there (their condition is false
> from the beginning). This is generally not a problem, but in
> test-bdrv-drain, some additional explicit aio_poll() calls need to be
> added because the test case wants to verify the final state after BHs
> have executed.

So, drained_end_counter is always zero since this commit (and is removed in the next one).

> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block_int-common.h | 10 ++++---
>   block.c                          |  4 +--
>   block/io.c                       | 49 +++++---------------------------
>   block/qed.c                      |  4 +--
>   block/throttle.c                 |  6 ++--
>   tests/unit/test-bdrv-drain.c     | 18 ++++++------
>   6 files changed, 30 insertions(+), 61 deletions(-)
> 

[..]

> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -365,7 +365,7 @@ static void bdrv_qed_attach_aio_context(BlockDriverState *bs,
>       }
>   }
>   
> -static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
> +static void bdrv_qed_co_drain_begin(BlockDriverState *bs)
>   {
>       BDRVQEDState *s = bs->opaque;
>   
> @@ -1661,7 +1661,7 @@ static BlockDriver bdrv_qed = {
>       .bdrv_co_check            = bdrv_qed_co_check,
>       .bdrv_detach_aio_context  = bdrv_qed_detach_aio_context,
>       .bdrv_attach_aio_context  = bdrv_qed_attach_aio_context,
> -    .bdrv_co_drain_begin      = bdrv_qed_co_drain_begin,
> +    .bdrv_drain_begin         = bdrv_qed_co_drain_begin,

Rename to bdrv_qed_drain_begin without _co_, as for the tests?


>   };
>   
>   static void bdrv_qed_init(void)
> diff --git a/block/throttle.c b/block/throttle.c
> index 131eba3ab4..6e3ae1b355 100644
> --- a/block/throttle.c
> +++ b/block/throttle.c
> @@ -214,7 +214,7 @@ static void throttle_reopen_abort(BDRVReopenState *reopen_state)
>       reopen_state->opaque = NULL;
>   }
>   
> -static void coroutine_fn throttle_co_drain_begin(BlockDriverState *bs)
> +static void throttle_co_drain_begin(BlockDriverState *bs)

and here.

And you didn't drop coroutine_fn for throttle_co_drain_end.

with that fixed:

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 04/13] block: Remove drained_end_counter
  2022-11-08 12:37 ` [PATCH 04/13] block: Remove drained_end_counter Kevin Wolf
@ 2022-11-09 14:44   ` Vladimir Sementsov-Ogievskiy
  2022-11-11 16:37     ` Kevin Wolf
  2022-11-11 11:15   ` Emanuele Giuseppe Esposito
  2022-11-14 18:19   ` Hanna Reitz
  2 siblings, 1 reply; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 14:44 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> drained_end_counter is unused now, nobody changes its value any more. It
> can be removed.
> 
> In cases where we had two almost identical functions that only differed
> in whether the caller passes drained_end_counter, or whether they would
> poll for a local drained_end_counter to reach 0, these become a single
> function.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

[..]

>   
>   /* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */

Not about this patch, but what is recursive in bdrv_drain_invoke()?

> -static void bdrv_drain_invoke(BlockDriverState *bs, bool begin,
> -                              int *drained_end_counter)
> +static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
>   {
>       if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
>               (!begin && !bs->drv->bdrv_drain_end)) {

[..]

>   
>   /**
>    * This function does not poll, nor must any of its recursively called
> - * functions.  The *drained_end_counter pointee will be incremented
> - * once 

Seems that this is already wrong after the previous commit. Not critical.

> for every background operation scheduled, and decremented once
> - * the operation settles.  Therefore, the pointer must remain valid
> - * until the pointee reaches 0.  That implies that whoever sets up the
> - * pointee has to poll until it is 0.
> - *
> - * We use atomic operations to access *drained_end_counter, because
> - * (1) when called from bdrv_set_aio_context_ignore(), the subgraph of
> - *     @bs may contain nodes in different AioContexts,
> - * (2) bdrv_drain_all_end() uses the same counter for all nodes,
> - *     regardless of which AioContext they are in.
> + * functions.
>    */
>   static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
> -                                BdrvChild *parent, bool ignore_bds_parents,
> -                                int *drained_end_counter)
> +                                BdrvChild *parent, bool ignore_bds_parents)
>   {
>       BdrvChild *child;
>       int old_quiesce_counter;
>   
> -    assert(drained_end_counter != NULL);
> -

[..]

-- 
Best regards,
Vladimir




* Re: [PATCH 05/13] block: Inline bdrv_drain_invoke()
  2022-11-08 12:37 ` [PATCH 05/13] block: Inline bdrv_drain_invoke() Kevin Wolf
@ 2022-11-09 15:34   ` Vladimir Sementsov-Ogievskiy
  2022-11-10 19:48   ` Stefan Hajnoczi
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 15:34 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> bdrv_drain_invoke() has now two entirely separate cases that share no
> code any more and are selected depending on a bool parameter. Each case
> has only one caller. Just inline the function.
> 
> Signed-off-by: Kevin Wolf<kwolf@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 06/13] block: Drain invidual nodes during reopen
  2022-11-08 12:37 ` [PATCH 06/13] block: Drain invidual nodes during reopen Kevin Wolf
@ 2022-11-09 16:00   ` Vladimir Sementsov-Ogievskiy
  2022-11-11 16:54     ` Kevin Wolf
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 16:00 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

In subject: individual

On 11/8/22 15:37, Kevin Wolf wrote:
> bdrv_reopen() and friends use subtree drains as a lazy way of covering
> all the nodes they touch. Turns out that this lazy way is a lot more
> complicated than just draining the nodes individually, even not
> accounting for the additional complexity in the drain mechanism itself.
> 
> Simplify the code by switching to draining the individual nodes that are
> already managed in the BlockReopenQueue anyway.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c             | 11 ++++-------
>   block/replication.c |  6 ------
>   blockdev.c          | 13 -------------
>   3 files changed, 4 insertions(+), 26 deletions(-)
> 

[..]

>       bdrv_reopen_queue_free(queue);
> -    for (p = drained; p; p = p->next) {
> -        BlockDriverState *bs = p->data;
> -        AioContext *ctx = bdrv_get_aio_context(bs);
> -
> -        aio_context_acquire(ctx);

In bdrv_reopen_queue_free() we don't have this acquire()/release() pair around bdrv_drained_end(). We don't need it anymore?

> -        bdrv_subtree_drained_end(bs);
> -        aio_context_release(ctx);
> -    }
> -    g_slist_free(drained);
>   }
>   
>   void qmp_blockdev_del(const char *node_name, Error **errp)

-- 
Best regards,
Vladimir




* Re: [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate()
  2022-11-08 12:37 ` [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate() Kevin Wolf
@ 2022-11-09 16:18   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:20   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 16:18 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> Instead of using a subtree drain from the top node (which also drains
> child nodes of base that we're not even interested in), use a normal
> drain for base, which automatically drains all of the parents, too.
> 
> Signed-off-by: Kevin Wolf<kwolf@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 08/13] stream: Replace subtree drain with a single node drain
  2022-11-08 12:37 ` [PATCH 08/13] stream: Replace subtree drain with a single node drain Kevin Wolf
@ 2022-11-09 16:52   ` Vladimir Sementsov-Ogievskiy
  2022-11-10 10:16     ` Kevin Wolf
  2022-11-14 18:21   ` Hanna Reitz
  1 sibling, 1 reply; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 16:52 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
> graph changes between finding the base node and changing the block graph
> as necessary on completion of the image streaming job.
> 
> The block graph could change between these two points because
> bdrv_set_backing_hd() first drains the parent node, which involved
> polling and can do anything.
> 
> Subtree draining was an imperfect way to make this less likely (because
> with it, fewer callbacks are called during this window). Everyone agreed
> that it's not really the right solution, and it was only committed as a
> stopgap solution.
> 
> This replaces the subtree drain with a solution that simply drains the
> parent node before we try to find the base node, and then call a version
> of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
> parent node is already drained.
> 
> This way, any graph changes caused by draining happen before we start
> looking at the graph and things stay consistent between finding the base
> node and changing the graph.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>

[..]

>   
>       base = bdrv_filter_or_cow_bs(s->above_base);
> -    if (base) {
> -        bdrv_ref(base);
> -    }
> -
>       unfiltered_base = bdrv_skip_filters(base);
>   
>       if (bdrv_cow_child(unfiltered_bs)) {
> @@ -82,7 +85,7 @@ static int stream_prepare(Job *job)
>               }
>           }
>   
> -        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
> +        bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
>           ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);

If we have yield points / polls during bdrv_set_backing_hd_drained() and bdrv_change_backing_file(), it's still bad and another graph-modifying operation may interleave. But b1e1af394d9 reports only polling in bdrv_set_backing_hd(), so I think it's OK not to care about the other cases.

>           if (local_err) {
>               error_report_err(local_err);
> @@ -92,10 +95,7 @@ static int stream_prepare(Job *job)
>       }
>   
>   out:
> -    if (base) {
> -        bdrv_unref(base);
> -    }
> -    bdrv_subtree_drained_end(s->above_base);
> +    bdrv_drained_end(unfiltered_bs);
>       return ret;
>   }
>   

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 09/13] block: Remove subtree drains
  2022-11-08 12:37 ` [PATCH 09/13] block: Remove subtree drains Kevin Wolf
@ 2022-11-09 17:22   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:22   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 17:22 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> Subtree drains are not used any more. Remove them.
> 
> After this, BdrvChildClass.attach/detach() don't poll any more.
> 
> Signed-off-by: Kevin Wolf<kwolf@redhat.com>


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 10/13] block: Call drain callbacks only once
  2022-11-08 12:37 ` [PATCH 10/13] block: Call drain callbacks only once Kevin Wolf
@ 2022-11-09 18:05   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 12:32     ` Kevin Wolf
  2022-11-09 18:54   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:23   ` Hanna Reitz
  2 siblings, 1 reply; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 18:05 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> We only need to call both the BlockDriver's callback and the parent
> callbacks when going from undrained to drained or vice versa. A second
> drain section doesn't make a difference for the driver or the parent,
> they weren't supposed to send new requests before and after the second
> drain.
> 
> One thing that gets in the way is the 'ignore_bds_parents' parameter in
> bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
> for the first drain, bs->quiesce_counter will be non-zero, but the
> parent callbacks still haven't been called, so a second drain where it
> is false would still have to call them.
> 
> Instead of keeping track of this, let's just get rid of the parameter.
> It was introduced in commit 6cd5c9d7b2d as an optimisation so that
> during bdrv_drain_all(), we wouldn't recursively drain all parents up to
> the root for each node, resulting in quadratic complexity. As it happens,
> calling the callbacks only once solves the same problem, so as of this
> patch, we'll still have O(n) complexity and ignore_bds_parents is not
> needed any more.
> 
> This patch only ignores the 'ignore_bds_parents' parameter. It will be
> removed in a separate patch.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                      | 13 ++++++-------
>   block/io.c                   | 24 +++++++++++++-----------
>   tests/unit/test-bdrv-drain.c | 16 ++++++++++------
>   3 files changed, 29 insertions(+), 24 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 9d082631d9..8878586f6e 100644
> --- a/block.c
> +++ b/block.c
> @@ -2816,7 +2816,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>   {
>       BlockDriverState *old_bs = child->bs;
>       int new_bs_quiesce_counter;
> -    int drain_saldo;
>   
>       assert(!child->frozen);
>       assert(old_bs != new_bs);
> @@ -2827,15 +2826,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>       }
>   
>       new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> -    drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter;
>   
>       /*
>        * If the new child node is drained but the old one was not, flush
>        * all outstanding requests to the old child node.
>        */
> -    while (drain_saldo > 0 && child->klass->drained_begin) {
> +    if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {

Looks like checking for child->klass->drained_begin was wrong even pre-patch?

Also, parent_quiesce_counter effectively becomes a boolean variable. Should we stress that with a new type and name?

>           bdrv_parent_drained_begin_single(child, true);
> -        drain_saldo--;
>       }
>   
>       if (old_bs) {
> @@ -2859,7 +2856,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>            * more often.
>            */

The comment above ^^^ should be updated; we are not going to call drained_end more than once anyway.

>           assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);

Do we still need this assertion and the comment at all?

> -        drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
>   
>           if (child->klass->attach) {
>               child->klass->attach(child);
> @@ -2869,10 +2865,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>       /*
>        * If the old child node was drained but the new one is not, allow
>        * requests to come in only after the new node has been attached.
> +     *
> +     * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
> +     * polls, which could have changed the value.
>        */
> -    while (drain_saldo < 0 && child->klass->drained_end) {
> +    new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> +    if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
>           bdrv_parent_drained_end_single(child);
> -        drain_saldo++;
>       }
>   }
>   
> diff --git a/block/io.c b/block/io.c
> index 870a25d7a5..87c7a92f15 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -62,7 +62,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c)
>   {
>       IO_OR_GS_CODE();
>   
> -    assert(c->parent_quiesce_counter > 0);
> +    assert(c->parent_quiesce_counter == 1);
>       c->parent_quiesce_counter--;
>       if (c->klass->drained_end) {
>           c->klass->drained_end(c);
> @@ -109,6 +109,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore,
>   void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
>   {
>       IO_OR_GS_CODE();
> +    assert(c->parent_quiesce_counter == 0);
>       c->parent_quiesce_counter++;
>       if (c->klass->drained_begin) {
>           c->klass->drained_begin(c);
> @@ -352,16 +353,16 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
>                                      BdrvChild *parent, bool ignore_bds_parents)
>   {
>       IO_OR_GS_CODE();
> -    assert(!qemu_in_coroutine());

Why is that dropped? It seems unrelated to the commit.

>   
>       /* Stop things in parent-to-child order */
>       if (qatomic_fetch_inc(&bs->quiesce_counter) == 0) {
>           aio_disable_external(bdrv_get_aio_context(bs));
> -    }
>   
> -    bdrv_parent_drained_begin(bs, parent, ignore_bds_parents);
> -    if (bs->drv && bs->drv->bdrv_drain_begin) {
> -        bs->drv->bdrv_drain_begin(bs);
> +        /* TODO Remove ignore_bds_parents, we don't consider it any more */
> +        bdrv_parent_drained_begin(bs, parent, false);
> +        if (bs->drv && bs->drv->bdrv_drain_begin) {
> +            bs->drv->bdrv_drain_begin(bs);
> +        }
>       }
>   }
>   
> @@ -412,13 +413,14 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
>       assert(bs->quiesce_counter > 0);
>   
>       /* Re-enable things in child-to-parent order */

The comment should be moved too, I think.

> -    if (bs->drv && bs->drv->bdrv_drain_end) {
> -        bs->drv->bdrv_drain_end(bs);
> -    }
> -    bdrv_parent_drained_end(bs, parent, ignore_bds_parents);
> -
>       old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
>       if (old_quiesce_counter == 1) {
> +        if (bs->drv && bs->drv->bdrv_drain_end) {
> +            bs->drv->bdrv_drain_end(bs);
> +        }
> +        /* TODO Remove ignore_bds_parents, we don't consider it any more */
> +        bdrv_parent_drained_end(bs, parent, false);
> +
>           aio_enable_external(bdrv_get_aio_context(bs));
>       }
>   }
> diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> index dda08de8db..172bc6debc 100644
> --- a/tests/unit/test-bdrv-drain.c
> +++ b/tests/unit/test-bdrv-drain.c
> @@ -296,7 +296,11 @@ static void test_quiesce_common(enum drain_type drain_type, bool recursive)
>   
>       do_drain_begin(drain_type, bs);
>   
> -    g_assert_cmpint(bs->quiesce_counter, ==, 1);
> +    if (drain_type == BDRV_DRAIN_ALL) {
> +        g_assert_cmpint(bs->quiesce_counter, ==, 2);
> +    } else {
> +        g_assert_cmpint(bs->quiesce_counter, ==, 1);
> +    }
>       g_assert_cmpint(backing->quiesce_counter, ==, !!recursive);
>   
>       do_drain_end(drain_type, bs);
> @@ -348,8 +352,8 @@ static void test_nested(void)
>   
>       for (outer = 0; outer < DRAIN_TYPE_MAX; outer++) {
>           for (inner = 0; inner < DRAIN_TYPE_MAX; inner++) {
> -            int backing_quiesce = (outer != BDRV_DRAIN) +
> -                                  (inner != BDRV_DRAIN);
> +            int backing_quiesce = (outer == BDRV_DRAIN_ALL) +
> +                                  (inner == BDRV_DRAIN_ALL);
>   
>               g_assert_cmpint(bs->quiesce_counter, ==, 0);
>               g_assert_cmpint(backing->quiesce_counter, ==, 0);
> @@ -359,10 +363,10 @@ static void test_nested(void)
>               do_drain_begin(outer, bs);
>               do_drain_begin(inner, bs);
>   
> -            g_assert_cmpint(bs->quiesce_counter, ==, 2);
> +            g_assert_cmpint(bs->quiesce_counter, ==, 2 + !!backing_quiesce);
>               g_assert_cmpint(backing->quiesce_counter, ==, backing_quiesce);
> -            g_assert_cmpint(s->drain_count, ==, 2);
> -            g_assert_cmpint(backing_s->drain_count, ==, backing_quiesce);
> +            g_assert_cmpint(s->drain_count, ==, 1);
> +            g_assert_cmpint(backing_s->drain_count, ==, !!backing_quiesce);
>   
>               do_drain_end(inner, bs);
>               do_drain_end(outer, bs);

-- 
Best regards,
Vladimir




* Re: [PATCH 10/13] block: Call drain callbacks only once
  2022-11-08 12:37 ` [PATCH 10/13] block: Call drain callbacks only once Kevin Wolf
  2022-11-09 18:05   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-09 18:54   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:23   ` Hanna Reitz
  2 siblings, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 18:54 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> One thing that gets in the way is the 'ignore_bds_parents' parameter in
> bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
> for the first drain, bs->quiesce_counter will be non-zero, but the
> parent callbacks still haven't been called, so a second drain where it
> is false would still have to call them.

This paragraph breaks my brain :/

Still, I understand the new concept and believe that dropping ignore_bds_parents and just calling the callbacks once (stopping the recursion when we find quiesce_counter > 0) is a good way.

-- 
Best regards,
Vladimir




* Re: [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions
  2022-11-08 12:37 ` [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions Kevin Wolf
@ 2022-11-09 18:57   ` Vladimir Sementsov-Ogievskiy
  2022-11-14 18:23   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-09 18:57 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/8/22 15:37, Kevin Wolf wrote:
> ignore_bds_parents is now ignored, so we can just remove it.> 
> Signed-off-by: Kevin Wolf<kwolf@redhat.com>

It's not obvious to me that it is ignored; some logic is still here. Maybe it all ends up doing nothing.

Still, I believe that we should get rid of ignore_bds_parents anyway at this point, now that we've changed the recursion concept in the previous commit.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
  2022-11-09  9:21   ` Vladimir Sementsov-Ogievskiy
  2022-11-09  9:27   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-09 21:49   ` Stefan Hajnoczi
  2022-11-10 11:07     ` Kevin Wolf
  2022-11-11 11:14   ` Emanuele Giuseppe Esposito
  2022-11-14 18:16   ` Hanna Reitz
  4 siblings, 1 reply; 61+ messages in thread
From: Stefan Hajnoczi @ 2022-11-09 21:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, eesposit, hreitz, pbonzini, qemu-devel

On Tue, Nov 08, 2022 at 01:37:26PM +0100, Kevin Wolf wrote:
> @@ -310,9 +309,20 @@ static void coroutine_fn qed_need_check_timer_entry(void *opaque)
>      (void) ret;
>  }
>  
> +static void coroutine_fn qed_need_check_timer_entry(void *opaque)
> +{
> +    BDRVQEDState *s = opaque;
> +
> +    qed_need_check_timer(opaque);
> +    bdrv_dec_in_flight(s->bs);
> +}
> +
>  static void qed_need_check_timer_cb(void *opaque)
>  {
> +    BDRVQEDState *s = opaque;
>      Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
> +
> +    bdrv_inc_in_flight(s->bs);
>      qemu_coroutine_enter(co);
>  }
>  
> @@ -363,8 +373,12 @@ static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
>       * header is flushed.
>       */
>      if (s->need_check_timer && timer_pending(s->need_check_timer)) {
> +        Coroutine *co;
> +
>          qed_cancel_need_check_timer(s);
> -        qed_need_check_timer_entry(s);
> +        co = qemu_coroutine_create(qed_need_check_timer_entry, s);
> +        bdrv_inc_in_flight(bs);

Please include comments that indicate where the inc/dec calls are
paired. This is like pairing memory barriers, where it can be very hard
to tell after the code has been written (and modified).



* Re: [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  2022-11-08 12:37 ` [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn Kevin Wolf
  2022-11-09 14:29   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-09 22:13   ` Stefan Hajnoczi
  2022-11-11 11:14   ` Emanuele Giuseppe Esposito
  2022-11-14 18:17   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Stefan Hajnoczi @ 2022-11-09 22:13 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, eesposit, hreitz, pbonzini, qemu-devel

On Tue, Nov 08, 2022 at 01:37:28PM +0100, Kevin Wolf wrote:
> Polling during bdrv_drained_end() can be problematic (and in the future,
> we may get cases for bdrv_drained_begin() where polling is forbidden,
> and we don't care about already in-flight requests, but just want to
> prevent new requests from arriving).
> 
> The .bdrv_drained_begin/end callbacks running in a coroutine is the only
> reason why we have to do this polling, so make them non-coroutine
> callbacks again. None of the callers actually yield any more.
> 
> This means that bdrv_drained_end() effectively doesn't poll any more,
> even if AIO_WAIT_WHILE() loops are still there (their condition is false
> from the beginning). This is generally not a problem, but in
> test-bdrv-drain, some additional explicit aio_poll() calls need to be
> added because the test case wants to verify the final state after BHs
> have executed.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/block/block_int-common.h | 10 ++++---
>  block.c                          |  4 +--
>  block/io.c                       | 49 +++++---------------------------
>  block/qed.c                      |  4 +--
>  block/throttle.c                 |  6 ++--
>  tests/unit/test-bdrv-drain.c     | 18 ++++++------
>  6 files changed, 30 insertions(+), 61 deletions(-)

Wow, surprisingly little has to change to make these non-coroutine_fn.

> 
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 5a2cc077a0..0956acbb60 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -735,17 +735,19 @@ struct BlockDriver {
>      void (*bdrv_io_unplug)(BlockDriverState *bs);
>  
>      /**
> -     * bdrv_co_drain_begin is called if implemented in the beginning of a
> +     * bdrv_drain_begin is called if implemented in the beginning of a
>       * drain operation to drain and stop any internal sources of requests in
>       * the driver.
> -     * bdrv_co_drain_end is called if implemented at the end of the drain.
> +     * bdrv_drain_end is called if implemented at the end of the drain.
>       *
>       * They should be used by the driver to e.g. manage scheduled I/O
>       * requests, or toggle an internal state. After the end of the drain new
>       * requests will continue normally.
> +     *
> +     * Implementations of both functions must not call aio_poll().
>       */
> -    void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
> -    void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
> +    void (*bdrv_drain_begin)(BlockDriverState *bs);
> +    void (*bdrv_drain_end)(BlockDriverState *bs);
>  
>      bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
>      bool coroutine_fn (*bdrv_co_can_store_new_dirty_bitmap)(
> diff --git a/block.c b/block.c
> index 3bd594eb2a..fed8077993 100644
> --- a/block.c
> +++ b/block.c
> @@ -1705,8 +1705,8 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
>      assert(is_power_of_2(bs->bl.request_alignment));
>  
>      for (i = 0; i < bs->quiesce_counter; i++) {
> -        if (drv->bdrv_co_drain_begin) {
> -            drv->bdrv_co_drain_begin(bs);
> +        if (drv->bdrv_drain_begin) {
> +            drv->bdrv_drain_begin(bs);
>          }
>      }
>  
> diff --git a/block/io.c b/block/io.c
> index 34b30e304e..183b407f5b 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -250,55 +250,20 @@ typedef struct {
>      int *drained_end_counter;
>  } BdrvCoDrainData;
>  
> -static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
> -{
> -    BdrvCoDrainData *data = opaque;
> -    BlockDriverState *bs = data->bs;
> -
> -    if (data->begin) {
> -        bs->drv->bdrv_co_drain_begin(bs);
> -    } else {
> -        bs->drv->bdrv_co_drain_end(bs);
> -    }
> -
> -    /* Set data->done and decrement drained_end_counter before bdrv_wakeup() */
> -    qatomic_mb_set(&data->done, true);
> -    if (!data->begin) {
> -        qatomic_dec(data->drained_end_counter);
> -    }
> -    bdrv_dec_in_flight(bs);
> -
> -    g_free(data);
> -}
> -
> -/* Recursively call BlockDriver.bdrv_co_drain_begin/end callbacks */
> +/* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */
>  static void bdrv_drain_invoke(BlockDriverState *bs, bool begin,
>                                int *drained_end_counter)
>  {
> -    BdrvCoDrainData *data;
> -
> -    if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) ||
> -            (!begin && !bs->drv->bdrv_co_drain_end)) {
> +    if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
> +            (!begin && !bs->drv->bdrv_drain_end)) {
>          return;
>      }
>  
> -    data = g_new(BdrvCoDrainData, 1);
> -    *data = (BdrvCoDrainData) {
> -        .bs = bs,
> -        .done = false,
> -        .begin = begin,
> -        .drained_end_counter = drained_end_counter,
> -    };
> -
> -    if (!begin) {
> -        qatomic_inc(drained_end_counter);
> +    if (begin) {
> +        bs->drv->bdrv_drain_begin(bs);
> +    } else {
> +        bs->drv->bdrv_drain_end(bs);
>      }
> -
> -    /* Make sure the driver callback completes during the polling phase for
> -     * drain_begin. */
> -    bdrv_inc_in_flight(bs);
> -    data->co = qemu_coroutine_create(bdrv_drain_invoke_entry, data);
> -    aio_co_schedule(bdrv_get_aio_context(bs), data->co);
>  }
>  
>  /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
> diff --git a/block/qed.c b/block/qed.c
> index 013f826c44..301ff8fd86 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -365,7 +365,7 @@ static void bdrv_qed_attach_aio_context(BlockDriverState *bs,
>      }
>  }
>  
> -static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
> +static void bdrv_qed_co_drain_begin(BlockDriverState *bs)

This function needs to be renamed s/_co_//.

>  {
>      BDRVQEDState *s = bs->opaque;
>  
> @@ -1661,7 +1661,7 @@ static BlockDriver bdrv_qed = {
>      .bdrv_co_check            = bdrv_qed_co_check,
>      .bdrv_detach_aio_context  = bdrv_qed_detach_aio_context,
>      .bdrv_attach_aio_context  = bdrv_qed_attach_aio_context,
> -    .bdrv_co_drain_begin      = bdrv_qed_co_drain_begin,
> +    .bdrv_drain_begin         = bdrv_qed_co_drain_begin,
>  };
>  
>  static void bdrv_qed_init(void)
> diff --git a/block/throttle.c b/block/throttle.c
> index 131eba3ab4..6e3ae1b355 100644
> --- a/block/throttle.c
> +++ b/block/throttle.c
> @@ -214,7 +214,7 @@ static void throttle_reopen_abort(BDRVReopenState *reopen_state)
>      reopen_state->opaque = NULL;
>  }
>  
> -static void coroutine_fn throttle_co_drain_begin(BlockDriverState *bs)
> +static void throttle_co_drain_begin(BlockDriverState *bs)

Same here.

>  {
>      ThrottleGroupMember *tgm = bs->opaque;
>      if (qatomic_fetch_inc(&tgm->io_limits_disabled) == 0) {
> @@ -261,8 +261,8 @@ static BlockDriver bdrv_throttle = {
>      .bdrv_reopen_commit                 =   throttle_reopen_commit,
>      .bdrv_reopen_abort                  =   throttle_reopen_abort,
>  
> -    .bdrv_co_drain_begin                =   throttle_co_drain_begin,
> -    .bdrv_co_drain_end                  =   throttle_co_drain_end,

Is throttle_co_drain_end() still marked coroutine_fn? It also needs to
be renamed to throttle_drain_end().

> +    .bdrv_drain_begin                   =   throttle_co_drain_begin,
> +    .bdrv_drain_end                     =   throttle_co_drain_end,
>  
>      .is_filter                          =   true,
>      .strong_runtime_opts                =   throttle_strong_runtime_opts,
> diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> index 24f34e24ad..695519ee02 100644
> --- a/tests/unit/test-bdrv-drain.c
> +++ b/tests/unit/test-bdrv-drain.c
> @@ -46,7 +46,7 @@ static void coroutine_fn sleep_in_drain_begin(void *opaque)
>      bdrv_dec_in_flight(bs);
>  }
>  
> -static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
> +static void bdrv_test_drain_begin(BlockDriverState *bs)
>  {
>      BDRVTestState *s = bs->opaque;
>      s->drain_count++;
> @@ -57,7 +57,7 @@ static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
>      }
>  }
>  
> -static void coroutine_fn bdrv_test_co_drain_end(BlockDriverState *bs)
> +static void bdrv_test_drain_end(BlockDriverState *bs)
>  {
>      BDRVTestState *s = bs->opaque;
>      s->drain_count--;
> @@ -111,8 +111,8 @@ static BlockDriver bdrv_test = {
>      .bdrv_close             = bdrv_test_close,
>      .bdrv_co_preadv         = bdrv_test_co_preadv,
>  
> -    .bdrv_co_drain_begin    = bdrv_test_co_drain_begin,
> -    .bdrv_co_drain_end      = bdrv_test_co_drain_end,
> +    .bdrv_drain_begin       = bdrv_test_drain_begin,
> +    .bdrv_drain_end         = bdrv_test_drain_end,
>  
>      .bdrv_child_perm        = bdrv_default_perms,
>  
> @@ -1703,6 +1703,7 @@ static void test_blockjob_commit_by_drained_end(void)
>      bdrv_drained_begin(bs_child);
>      g_assert(!job_has_completed);
>      bdrv_drained_end(bs_child);
> +    aio_poll(qemu_get_aio_context(), false);
>      g_assert(job_has_completed);
>  
>      bdrv_unref(bs_parents[0]);
> @@ -1858,6 +1859,7 @@ static void test_drop_intermediate_poll(void)
>  
>      g_assert(!job_has_completed);
>      ret = bdrv_drop_intermediate(chain[1], chain[0], NULL);
> +    aio_poll(qemu_get_aio_context(), false);
>      g_assert(ret == 0);
>      g_assert(job_has_completed);
>  
> @@ -1946,7 +1948,7 @@ static void coroutine_fn bdrv_replace_test_drain_co(void *opaque)
>   * .was_drained.
>   * Increment .drain_count.
>   */
> -static void coroutine_fn bdrv_replace_test_co_drain_begin(BlockDriverState *bs)
> +static void bdrv_replace_test_drain_begin(BlockDriverState *bs)
>  {
>      BDRVReplaceTestState *s = bs->opaque;
>  
> @@ -1977,7 +1979,7 @@ static void coroutine_fn bdrv_replace_test_read_entry(void *opaque)
>   * If .drain_count reaches 0 and the node has a backing file, issue a
>   * read request.
>   */
> -static void coroutine_fn bdrv_replace_test_co_drain_end(BlockDriverState *bs)
> +static void bdrv_replace_test_drain_end(BlockDriverState *bs)
>  {
>      BDRVReplaceTestState *s = bs->opaque;
>  
> @@ -2002,8 +2004,8 @@ static BlockDriver bdrv_replace_test = {
>      .bdrv_close             = bdrv_replace_test_close,
>      .bdrv_co_preadv         = bdrv_replace_test_co_preadv,
>  
> -    .bdrv_co_drain_begin    = bdrv_replace_test_co_drain_begin,
> -    .bdrv_co_drain_end      = bdrv_replace_test_co_drain_end,
> +    .bdrv_drain_begin       = bdrv_replace_test_drain_begin,
> +    .bdrv_drain_end         = bdrv_replace_test_drain_end,
>  
>      .bdrv_child_perm        = bdrv_default_perms,
>  };
> -- 
> 2.38.1
> 



* Re: [PATCH 08/13] stream: Replace subtree drain with a single node drain
  2022-11-09 16:52   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-10 10:16     ` Kevin Wolf
  2022-11-10 11:25       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 61+ messages in thread
From: Kevin Wolf @ 2022-11-10 10:16 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 09.11.2022 at 17:52, Vladimir Sementsov-Ogievskiy wrote:
> On 11/8/22 15:37, Kevin Wolf wrote:
> > The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
> > graph changes between finding the base node and changing the block graph
> > as necessary on completion of the image streaming job.
> > 
> > The block graph could change between these two points because
> > bdrv_set_backing_hd() first drains the parent node, which involved
> > polling and can do anything.
> > 
> > Subtree draining was an imperfect way to make this less likely (because
> > with it, fewer callbacks are called during this window). Everyone agreed
> > that it's not really the right solution, and it was only committed as a
> > stopgap solution.
> > 
> > This replaces the subtree drain with a solution that simply drains the
> > parent node before we try to find the base node, and then call a version
> > of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
> > parent node is already drained.
> > 
> > This way, any graph changes caused by draining happen before we start
> > looking at the graph and things stay consistent between finding the base
> > node and changing the graph.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> 
> [..]
> 
> >       base = bdrv_filter_or_cow_bs(s->above_base);
> > -    if (base) {
> > -        bdrv_ref(base);
> > -    }
> > -
> >       unfiltered_base = bdrv_skip_filters(base);
> >       if (bdrv_cow_child(unfiltered_bs)) {
> > @@ -82,7 +85,7 @@ static int stream_prepare(Job *job)
> >               }
> >           }
> > -        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
> > +        bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
> >           ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
> 
> If we have yield points / polls during bdrv_set_backing_hd_drained()
> and bdrv_change_backing_file(), it's still bad and another
> graph-modifying operation may interleave. But b1e1af394d9 reports only
> polling in bdrv_set_backing_hd(), so I think it's OK to not care about
> other cases.

At this point in the series, bdrv_replace_child_noperm() can indeed
still poll. I'm not sure how bad it is, but at this point we're already
reconfiguring the graph with two specific nodes and somehow this poll
hasn't caused problems in the past. Anyway, at the end of the series,
there isn't any polling left in bdrv_set_backing_hd_drained(), as far
as I can tell.

bdrv_change_backing_file() will certainly poll because it does I/O to
the image file. However, the change to the graph is completed at that
point, so I don't think it's a problem. Do you think it would be worth
putting a comment before bdrv_change_backing_file() that mentions that
the graph may change again from here on, but we've completed the graph
change?

> >           if (local_err) {
> >               error_report_err(local_err);
> > @@ -92,10 +95,7 @@ static int stream_prepare(Job *job)
> >       }
> >   out:
> > -    if (base) {
> > -        bdrv_unref(base);
> > -    }
> > -    bdrv_subtree_drained_end(s->above_base);
> > +    bdrv_drained_end(unfiltered_bs);
> >       return ret;
> >   }
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

Thanks.

Kevin




* Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-09 21:49   ` Stefan Hajnoczi
@ 2022-11-10 11:07     ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-10 11:07 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, eesposit, hreitz, pbonzini, qemu-devel

On 09.11.2022 at 22:49, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2022 at 01:37:26PM +0100, Kevin Wolf wrote:
> > @@ -310,9 +309,20 @@ static void coroutine_fn qed_need_check_timer_entry(void *opaque)
> >      (void) ret;
> >  }
> >  
> > +static void coroutine_fn qed_need_check_timer_entry(void *opaque)
> > +{
> > +    BDRVQEDState *s = opaque;
> > +
> > +    qed_need_check_timer(opaque);
> > +    bdrv_dec_in_flight(s->bs);
> > +}
> > +
> >  static void qed_need_check_timer_cb(void *opaque)
> >  {
> > +    BDRVQEDState *s = opaque;
> >      Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
> > +
> > +    bdrv_inc_in_flight(s->bs);
> >      qemu_coroutine_enter(co);
> >  }
> >  
> > @@ -363,8 +373,12 @@ static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
> >       * header is flushed.
> >       */
> >      if (s->need_check_timer && timer_pending(s->need_check_timer)) {
> > +        Coroutine *co;
> > +
> >          qed_cancel_need_check_timer(s);
> > -        qed_need_check_timer_entry(s);
> > +        co = qemu_coroutine_create(qed_need_check_timer_entry, s);
> > +        bdrv_inc_in_flight(bs);
> 
> Please include comments that indicate where inc/dec are paired. This is
> like pairing memory barriers where it can be very hard to know after the
> code has been written (and modified).

I can do this, of course, if you like to have it in qed. However, it's
not something we're doing elsewhere.

bdrv_inc/dec_in_flight() are a lot simpler than barriers which
synchronise two completely independently running tasks. You just need to
follow the control flow from the inc() and you'll find the dec(). They
are much more similar to taking and releasing a lock than to barriers.

Callbacks always make the code a little harder to read, but personally I
think inc() before scheduling a new coroutine and then dec() at the end
of its entry function is a very obvious pattern that exists in other
places, too.
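
The pattern described here, inc() before scheduling a new coroutine and dec() at the end of its entry function, can be illustrated with a minimal stand-in (all names are hypothetical; a direct function call stands in for entering a coroutine):

```c
#include <assert.h>

/* Hypothetical stand-in for a block driver state with an in-flight counter. */
typedef struct State {
    int in_flight;
    int work_done;
} State;

static void work_entry(State *s)
{
    s->work_done++;  /* the deferred work itself */
    s->in_flight--;  /* paired with the increment in schedule_work() */
}

static void schedule_work(State *s)
{
    s->in_flight++;  /* paired with the decrement at the end of work_entry() */
    work_entry(s);   /* stands in for qemu_coroutine_enter() */
}
```

Drain code can then poll until in_flight drops back to zero, knowing that any scheduled work has completed.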

Kevin



* Re: [PATCH 08/13] stream: Replace subtree drain with a single node drain
  2022-11-10 10:16     ` Kevin Wolf
@ 2022-11-10 11:25       ` Vladimir Sementsov-Ogievskiy
  2022-11-10 17:27         ` Kevin Wolf
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-10 11:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 11/10/22 13:16, Kevin Wolf wrote:
> On 09.11.2022 at 17:52, Vladimir Sementsov-Ogievskiy wrote:
>> On 11/8/22 15:37, Kevin Wolf wrote:
>>> The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
>>> graph changes between finding the base node and changing the block graph
>>> as necessary on completion of the image streaming job.
>>>
>>> The block graph could change between these two points because
>>> bdrv_set_backing_hd() first drains the parent node, which involved
>>> polling and can do anything.
>>>
>>> Subtree draining was an imperfect way to make this less likely (because
>>> with it, fewer callbacks are called during this window). Everyone agreed
>>> that it's not really the right solution, and it was only committed as a
>>> stopgap solution.
>>>
>>> This replaces the subtree drain with a solution that simply drains the
>>> parent node before we try to find the base node, and then call a version
>>> of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
>>> parent node is already drained.
>>>
>>> This way, any graph changes caused by draining happen before we start
>>> looking at the graph and things stay consistent between finding the base
>>> node and changing the graph.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> [..]
>>
>>>        base = bdrv_filter_or_cow_bs(s->above_base);
>>> -    if (base) {
>>> -        bdrv_ref(base);
>>> -    }
>>> -
>>>        unfiltered_base = bdrv_skip_filters(base);
>>>        if (bdrv_cow_child(unfiltered_bs)) {
>>> @@ -82,7 +85,7 @@ static int stream_prepare(Job *job)
>>>                }
>>>            }
>>> -        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
>>> +        bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
>>>            ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
>> If we have yield points / polls during bdrv_set_backing_hd_drained()
>> and bdrv_change_backing_file(), it's still bad and another
>> graph-modifying operation may interleave. But b1e1af394d9 reports only
>> polling in bdrv_set_backing_hd(), so I think it's OK to not care about
>> other cases.
> At this point in the series, bdrv_replace_child_noperm() can indeed
> still poll. I'm not sure how bad it is, but at this point we're already
> reconfiguring the graph with two specific nodes and somehow this poll
> hasn't caused problems in the past. Anyway, at the end of the series,
> there isn't any polling left in bdrv_set_backing_hd_drained(), as far
> as I can tell.
> 
> bdrv_change_backing_file() will certainly poll because it does I/O to
> the image file. However, the change to the graph is completed at that
> point, so I don't think it's a problem. Do you think it would be worth
> putting a comment before bdrv_change_backing_file() that mentions that
> the graph may change again from here on, but we've completed the graph
> change?
> 

A comment won't hurt. I think it's theoretically possible that we

1. change the graph
2. yield in bdrv_change_backing_file
3. switch to another graph-modifying operation, change backing file and do another bdrv_change_backing_file()
4. return to bdrv_change_backing_file() of [2] and write wrong backing file to metadata

And the only solution for such things that I can imagine is a kind of global graph-modifying lock, which would have to be held around the whole graph-modifying operation, including writing the metadata. Probably we shouldn't care until we have real bug reports of it. Actually, I hope that the only user who starts stream and commit jobs in parallel on the same backing chain is our iotests :)
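
The four steps above boil down to a lost-update race on the metadata: a value computed before a yield overwrites a newer one written during the yield. A compressed sketch (all names hypothetical; plain function calls stand in for the coroutine switches):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical image metadata holding the backing file name. */
typedef struct Meta {
    char backing[32];
} Meta;

static void write_backing(Meta *m, const char *val)
{
    strncpy(m->backing, val, sizeof(m->backing) - 1);
    m->backing[sizeof(m->backing) - 1] = '\0';
}

static void lost_update_race(Meta *m)
{
    const char *op_a = "base-A";  /* [1] op A changes the graph, computes its value */
                                  /* [2] op A yields in bdrv_change_backing_file() */
    write_backing(m, "base-B");   /* [3] op B runs during the yield and writes */
    write_backing(m, op_a);       /* [4] op A resumes: the stale value wins */
}
```

After lost_update_race() the metadata says base-A although the graph was last pointed at base-B, which is exactly the inconsistency that ordering the updates (or a global lock) would prevent.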


-- 
Best regards,
Vladimir




* Re: [PATCH 08/13] stream: Replace subtree drain with a single node drain
  2022-11-10 11:25       ` Vladimir Sementsov-Ogievskiy
@ 2022-11-10 17:27         ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-10 17:27 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

On 10.11.2022 at 12:25, Vladimir Sementsov-Ogievskiy wrote:
> On 11/10/22 13:16, Kevin Wolf wrote:
> > On 09.11.2022 at 17:52, Vladimir Sementsov-Ogievskiy wrote:
> > > On 11/8/22 15:37, Kevin Wolf wrote:
> > > > The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
> > > > graph changes between finding the base node and changing the block graph
> > > > as necessary on completion of the image streaming job.
> > > > 
> > > > The block graph could change between these two points because
> > > > bdrv_set_backing_hd() first drains the parent node, which involved
> > > > polling and can do anything.
> > > > 
> > > > Subtree draining was an imperfect way to make this less likely (because
> > > > with it, fewer callbacks are called during this window). Everyone agreed
> > > > that it's not really the right solution, and it was only committed as a
> > > > stopgap solution.
> > > > 
> > > > This replaces the subtree drain with a solution that simply drains the
> > > > parent node before we try to find the base node, and then call a version
> > > > of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
> > > > parent node is already drained.
> > > > 
> > > > This way, any graph changes caused by draining happen before we start
> > > > looking at the graph and things stay consistent between finding the base
> > > > node and changing the graph.
> > > > 
> > > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > > [..]
> > > 
> > > >        base = bdrv_filter_or_cow_bs(s->above_base);
> > > > -    if (base) {
> > > > -        bdrv_ref(base);
> > > > -    }
> > > > -
> > > >        unfiltered_base = bdrv_skip_filters(base);
> > > >        if (bdrv_cow_child(unfiltered_bs)) {
> > > > @@ -82,7 +85,7 @@ static int stream_prepare(Job *job)
> > > >                }
> > > >            }
> > > > -        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
> > > > +        bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
> > > >            ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
> > > If we have yield points / polls during bdrv_set_backing_hd_drained()
> > > and bdrv_change_backing_file(), it's still bad and another
> > > graph-modifying operation may interleave. But b1e1af394d9 reports only
> > > polling in bdrv_set_backing_hd(), so I think it's OK to not care about
> > > other cases.
> > At this point in the series, bdrv_replace_child_noperm() can indeed
> > still poll. I'm not sure how bad it is, but at this point we're already
> > reconfiguring the graph with two specific nodes and somehow this poll
> > hasn't caused problems in the past. Anyway, at the end of the series,
> > there isn't any polling left in bdrv_set_backing_hd_drained(), as far
> > as I can tell.
> > 
> > bdrv_change_backing_file() will certainly poll because it does I/O to
> > the image file. However, the change to the graph is completed at that
> > point, so I don't think it's a problem. Do you think it would be worth
> > putting a comment before bdrv_change_backing_file() that mentions that
> > the graph may change again from here on, but we've completed the graph
> > change?
> > 
> 
> Comment won't hurt. I think theoretically that's possible that we
> 
> 1. change the graph
> 2. yield in bdrv_change_backing_file
> 3. switch to another graph-modifying operation, change backing file and do another bdrv_change_backing_file()
> 4. return to bdrv_change_backing_file() of [2] and write wrong backing file to metadata
> 
> And the only solution for such things that I can imagine is a kind of
> global graph-modifying lock, which should be held around the whole
> graph modifying operation, including writing metadata.

Actually, I don't think this is the case. The problem that you get here
is just that we haven't really defined what happens when you get two
concurrent .bdrv_change_backing_file requests. To solve this, you don't
need to lock the whole graph, you just need to order the updates at the
block driver level instead of doing them in parallel, so that we know
that the last .bdrv_change_backing_file call wins. I think taking
s->lock in qcow2 would already achieve this (but still lock more than is
strictly necessary).

> Probably, we shouldn't care until we have real bug reports of it.
> Actually I hope that the only user who start stream and commit jobs in
> parallel on same backing-chain is our iotests :)

Yes, it sounds very theoretical. :-)

Kevin




* Re: [PATCH 05/13] block: Inline bdrv_drain_invoke()
  2022-11-08 12:37 ` [PATCH 05/13] block: Inline bdrv_drain_invoke() Kevin Wolf
  2022-11-09 15:34   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-10 19:48   ` Stefan Hajnoczi
  2022-11-11 11:15   ` Emanuele Giuseppe Esposito
  2022-11-14 18:19   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Stefan Hajnoczi @ 2022-11-10 19:48 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, eesposit, hreitz, pbonzini, qemu-devel

On Tue, Nov 08, 2022 at 01:37:30PM +0100, Kevin Wolf wrote:
> bdrv_drain_invoke() has now two entirely separate cases that share no
> code any more and are selected depending on a bool parameter. Each case
> has only one caller. Just inline the function.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/io.c | 23 ++++++-----------------
>  1 file changed, 6 insertions(+), 17 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>



* Re: [PATCH 00/13] block: Simplify drain
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (12 preceding siblings ...)
  2022-11-08 12:37 ` [PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single() Kevin Wolf
@ 2022-11-10 20:13 ` Stefan Hajnoczi
  2022-11-11 11:23 ` Emanuele Giuseppe Esposito
  14 siblings, 0 replies; 61+ messages in thread
From: Stefan Hajnoczi @ 2022-11-10 20:13 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, eesposit, hreitz, pbonzini, qemu-devel

On Tue, Nov 08, 2022 at 01:37:25PM +0100, Kevin Wolf wrote:
> I'm aware that exactly nobody has been looking forward to a series with
> this title, but it has to be. The way drain works means that we need to
> poll in bdrv_replace_child_noperm() and that makes things rather messy
> with Emanuele's multiqueue work because you must not poll while you hold
> the graph lock.
> 
> The other reason why it has to be is that drain is way too complex and
> there are too many different cases. Some simplification like this will
> hopefully make it considerably more maintainable. The diffstat probably
> tells something, too.
> 
> There are roughly speaking three parts in this series:
> 
> 1. Make BlockDriver.bdrv_drained_begin/end() non-coroutine_fn again,
>    which allows us to not poll on bdrv_drained_end() any more.
> 
> 2. Remove subtree drains. They are a considerable complication in the
>    whole drain machinery (in particular, they require polling in the
>    BdrvChildClass.attach/detach() callbacks that are called during
>    bdrv_replace_child_noperm()) and none of their users actually has a
>    good reason to use them.
> 
> 3. Finally get rid of polling in bdrv_replace_child_noperm() by
>    requiring that the child is already drained by the caller and calling
>    callbacks only once and not again for every nested drain section.
> 
> If necessary, a prefix of this series can be merged that covers only the
> first or the first two parts and it would still make sense.
> 
> Kevin Wolf (13):
>   qed: Don't yield in bdrv_qed_co_drain_begin()
>   test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
>   block: Revert .bdrv_drained_begin/end to non-coroutine_fn
>   block: Remove drained_end_counter
>   block: Inline bdrv_drain_invoke()
>   block: Drain invidual nodes during reopen
>   block: Don't use subtree drains in bdrv_drop_intermediate()
>   stream: Replace subtree drain with a single node drain
>   block: Remove subtree drains
>   block: Call drain callbacks only once
>   block: Remove ignore_bds_parents parameter from drain functions
>   block: Don't poll in bdrv_replace_child_noperm()
>   block: Remove poll parameter from bdrv_parent_drained_begin_single()
> 
>  include/block/block-global-state.h |   3 +
>  include/block/block-io.h           |  52 +---
>  include/block/block_int-common.h   |  17 +-
>  include/block/block_int-io.h       |  12 -
>  block.c                            | 132 ++++++-----
>  block/block-backend.c              |   4 +-
>  block/io.c                         | 281 ++++------------------
>  block/qed.c                        |  24 +-
>  block/replication.c                |   6 -
>  block/stream.c                     |  20 +-
>  block/throttle.c                   |   6 +-
>  blockdev.c                         |  13 -
>  blockjob.c                         |   2 +-
>  tests/unit/test-bdrv-drain.c       | 369 +++++++----------------------
>  14 files changed, 270 insertions(+), 671 deletions(-)

I have looked through all patches but don't understand the code well
enough to give an opinion or spot bugs. Removing subtree drains and
aio_poll() in bdrv_replace_child_noperm() are nice.

Acked-by: Stefan Hajnoczi <stefanha@redhat.com>



* Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
                     ` (2 preceding siblings ...)
  2022-11-09 21:49   ` Stefan Hajnoczi
@ 2022-11-11 11:14   ` Emanuele Giuseppe Esposito
  2022-11-14 18:16   ` Hanna Reitz
  4 siblings, 0 replies; 61+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 11:14 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: stefanha, hreitz, pbonzini, qemu-devel



Am 08/11/2022 um 13:37 schrieb Kevin Wolf:
> We want to change .bdrv_co_drained_begin() back to be a non-coroutine
> callback, so in preparation, avoid yielding in its implementation.
>
> Because we increase bs->in_flight and bdrv_drained_begin() polls, the
> behaviour is unchanged.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>


Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>





* Re: [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  2022-11-08 12:37 ` [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end() Kevin Wolf
  2022-11-09 10:50   ` Vladimir Sementsov-Ogievskiy
  2022-11-09 13:45   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-11 11:14   ` Emanuele Giuseppe Esposito
  2022-11-14 18:16   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 11:14 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: stefanha, hreitz, pbonzini, qemu-devel



Am 08/11/2022 um 13:37 schrieb Kevin Wolf:
> We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
> callbacks, so in preparation, avoid yielding in their implementation.
>
> This does almost the same as the existing logic in bdrv_drain_invoke(),
> by creating and entering coroutines internally. However, since the test
> case is by far the heaviest user of coroutine code in drain callbacks,
> it is preferable to have the complexity in the test case rather than the
> drain core, which is already complicated enough without this.
>
> The behaviour for bdrv_drain_begin() is unchanged because we increase
> bs->in_flight and this is still polled. However, bdrv_drain_end()
> doesn't wait for the spawned coroutine to complete any more. This is
> fine, we don't rely on bdrv_drain_end() restarting all operations
> immediately before the next aio_poll().
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>


Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>




* Re: [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  2022-11-08 12:37 ` [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn Kevin Wolf
  2022-11-09 14:29   ` Vladimir Sementsov-Ogievskiy
  2022-11-09 22:13   ` Stefan Hajnoczi
@ 2022-11-11 11:14   ` Emanuele Giuseppe Esposito
  2022-11-14 18:17   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 11:14 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: stefanha, hreitz, pbonzini, qemu-devel



Am 08/11/2022 um 13:37 schrieb Kevin Wolf:
> Polling during bdrv_drained_end() can be problematic (and in the future,
> we may get cases for bdrv_drained_begin() where polling is forbidden,
> and we don't care about already in-flight requests, but just want to
> prevent new requests from arriving).
>
> The .bdrv_drained_begin/end callbacks running in a coroutine is the only
> reason why we have to do this polling, so make them non-coroutine
> callbacks again. None of the callers actually yield any more.
>
> This means that bdrv_drained_end() effectively doesn't poll any more,
> even if AIO_WAIT_WHILE() loops are still there (their condition is false
> from the beginning). This is generally not a problem, but in
> test-bdrv-drain, some additional explicit aio_poll() calls need to be
> added because the test case wants to verify the final state after BHs
> have executed.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>


Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>




* Re: [PATCH 04/13] block: Remove drained_end_counter
  2022-11-08 12:37 ` [PATCH 04/13] block: Remove drained_end_counter Kevin Wolf
  2022-11-09 14:44   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-11 11:15   ` Emanuele Giuseppe Esposito
  2022-11-14 18:19   ` Hanna Reitz
  2 siblings, 0 replies; 61+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 11:15 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: stefanha, hreitz, pbonzini, qemu-devel



Am 08/11/2022 um 13:37 schrieb Kevin Wolf:
> drained_end_counter is unused now, nobody changes its value any more. It
> can be removed.
>
> In cases where we had two almost identical functions that only differed
> in whether the caller passes drained_end_counter, or whether they would
> poll for a local drained_end_counter to reach 0, these become a single
> function.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>


Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>




* Re: [PATCH 05/13] block: Inline bdrv_drain_invoke()
  2022-11-08 12:37 ` [PATCH 05/13] block: Inline bdrv_drain_invoke() Kevin Wolf
  2022-11-09 15:34   ` Vladimir Sementsov-Ogievskiy
  2022-11-10 19:48   ` Stefan Hajnoczi
@ 2022-11-11 11:15   ` Emanuele Giuseppe Esposito
  2022-11-14 18:19   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 11:15 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: stefanha, hreitz, pbonzini, qemu-devel



Am 08/11/2022 um 13:37 schrieb Kevin Wolf:
> bdrv_drain_invoke() has now two entirely separate cases that share no
> code any more and are selected depending on a bool parameter. Each case
> has only one caller. Just inline the function.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>


Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>




* Re: [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()
  2022-11-08 12:37 ` [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm() Kevin Wolf
@ 2022-11-11 11:21   ` Emanuele Giuseppe Esposito
  2022-11-14 20:22   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 11:21 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: stefanha, hreitz, pbonzini, qemu-devel



Am 08/11/2022 um 13:37 schrieb Kevin Wolf:
> In order to make sure that bdrv_replace_child_noperm() doesn't have to
> poll any more, get rid of the bdrv_parent_drained_begin_single() call.
> 
> This is possible now because we can require that the child is already
> drained when the function is called (it better be, having in-flight
> requests while modifying the graph isn't going to end well!) and we
> don't call the parent drain callbacks more than once.
> 
> The additional drain calls needed in callers cause the test case to run
> its code in the drain handler too early (bdrv_attach_child() drains
> now), so modify it to only enable the code after the test setup has
> completed.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/block/block-io.h     |  8 ++++
>  block.c                      | 72 +++++++++++++++++++++++++-----------
>  block/io.c                   |  2 +-
>  tests/unit/test-bdrv-drain.c | 10 +++++
>  4 files changed, 70 insertions(+), 22 deletions(-)
> 
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index 5b54ed4672..ddce8550a9 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -290,6 +290,14 @@ bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
>   */
>  void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll);
>  
> +/**
> + * bdrv_parent_drained_poll_single:
> + *
> + * Returns true if there is any pending activity to cease before @c can be
> + * called quiesced, false otherwise.
> + */
> +bool bdrv_parent_drained_poll_single(BdrvChild *c);
> +
>  /**
>   * bdrv_parent_drained_end_single:
>   *
> diff --git a/block.c b/block.c
> index 5f5f79cd16..12039e9b8a 100644
> --- a/block.c
> +++ b/block.c
> @@ -2399,6 +2399,20 @@ static void bdrv_replace_child_abort(void *opaque)
>  
>      GLOBAL_STATE_CODE();
>      /* old_bs reference is transparently moved from @s to @s->child */
> +    if (!s->child->bs) {
> +        /*
> +         * The parents were undrained when removing old_bs from the child. New
> +         * requests can't have been made, though, because the child was empty.
> +         *
> +         * TODO Make bdrv_replace_child_noperm() transactionable to avoid
> +         * undraining the parent in the first place. Once this is done, having
> +         * new_bs drained when calling bdrv_replace_child_tran() is not a
> +         * requirement any more.
> +         */
> +        bdrv_parent_drained_begin_single(s->child, false);
> +        assert(!bdrv_parent_drained_poll_single(s->child));
> +    }
> +    assert(s->child->parent_quiesce_counter);
>      bdrv_replace_child_noperm(s->child, s->old_bs);
>      bdrv_unref(new_bs);
>  }
> @@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = {
>   *
>   * Note: real unref of old_bs is done only on commit.
>   *
> + * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept
> + * drained until the transaction is completed (this automatically implies that
> + * child remains drained, too).
> + *
>   * The function doesn't update permissions, caller is responsible for this.
>   */
>  static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
>                                      Transaction *tran)
>  {
>      BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
> +
> +    assert(child->parent_quiesce_counter);
> +    assert(!new_bs || new_bs->quiesce_counter);
> +
>      *s = (BdrvReplaceChildState) {
>          .child = child,
>          .old_bs = child->bs,
> @@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>      int new_bs_quiesce_counter;
>  
>      assert(!child->frozen);
> +    /*
> +     * When removing the child, it's the callers responsibility to make sure
> +     * that no requests are in flight any more. Usually the parent is drained,
> +     * but not through child->parent_quiesce_counter.
> +     */
> +    assert(!new_bs || child->parent_quiesce_counter);
>      assert(old_bs != new_bs);
>      GLOBAL_STATE_CODE();
>  
> @@ -2825,16 +2853,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>          assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
>      }
>  
> -    new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> -
> -    /*
> -     * If the new child node is drained but the old one was not, flush
> -     * all outstanding requests to the old child node.
> -     */
> -    if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {
> -        bdrv_parent_drained_begin_single(child, true);
> -    }
> -
>      if (old_bs) {
>          if (child->klass->detach) {
>              child->klass->detach(child);
> @@ -2849,14 +2867,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>          assert_bdrv_graph_writable(new_bs);
>          QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
>  
> -        /*
> -         * Polling in bdrv_parent_drained_begin_single() may have led to the new
> -         * node's quiesce_counter having been decreased.  Not a problem, we just
> -         * need to recognize this here and then invoke drained_end appropriately
> -         * more often.
> -         */
> -        assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
> -
>          if (child->klass->attach) {
>              child->klass->attach(child);
>          }
> @@ -2865,9 +2875,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>      /*
>       * If the old child node was drained but the new one is not, allow
>       * requests to come in only after the new node has been attached.
> -     *
> -     * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
> -     * polls, which could have changed the value.
>       */
>      new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
>      if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
> @@ -3004,6 +3011,12 @@ static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs,
>      }
>  
>      bdrv_ref(child_bs);
> +    /*
> +     * Let every new BdrvChild start drained, inserting it in the graph with
> +     * bdrv_replace_child_noperm() will undrain it if the child node is not
> +     * drained. The child was only just created, so polling is not necessary.
> +     */

I think there's a better way to write this; I find it complicated to read.

Also, I don't really understand how you cover the case where we are
replacing a child with another one (so both old and new are non-NULL and
not newly created), and where `old`, for example, could (?) have a drained
counter greater than `new`'s.
Before we had all the drain saldo logic for this, but now it's gone.

Thank you,
Emanuele

> +    bdrv_parent_drained_begin_single(new_child, false);
>      bdrv_replace_child_noperm(new_child, child_bs);
>  
>      BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);
> @@ -5053,7 +5066,10 @@ static void bdrv_remove_child(BdrvChild *child, Transaction *tran)
>      }
>  
>      if (child->bs) {
> +        BlockDriverState *bs = child->bs;
> +        bdrv_drained_begin(bs);
>          bdrv_replace_child_tran(child, NULL, tran);
> +        bdrv_drained_end(bs);
>      }
>  
>      tran_add(tran, &bdrv_remove_child_drv, child);
> @@ -5070,6 +5086,15 @@ static void bdrv_remove_filter_or_cow_child(BlockDriverState *bs,
>      bdrv_remove_child(bdrv_filter_or_cow_child(bs), tran);
>  }
>  
> +static void undrain_on_clean_cb(void *opaque)
> +{
> +    bdrv_drained_end(opaque);
> +}
> +
> +static TransactionActionDrv undrain_on_clean = {
> +    .clean = undrain_on_clean_cb,
> +};
> +
>  static int bdrv_replace_node_noperm(BlockDriverState *from,
>                                      BlockDriverState *to,
>                                      bool auto_skip, Transaction *tran,
> @@ -5079,6 +5104,11 @@ static int bdrv_replace_node_noperm(BlockDriverState *from,
>  
>      GLOBAL_STATE_CODE();
>  
> +    bdrv_drained_begin(from);
> +    bdrv_drained_begin(to);
> +    tran_add(tran, &undrain_on_clean, from);
> +    tran_add(tran, &undrain_on_clean, to);
> +
>      QLIST_FOREACH_SAFE(c, &from->parents, next_parent, next) {
>          assert(c->bs == from);
>          if (!should_update_child(c, to)) {
> diff --git a/block/io.c b/block/io.c
> index 4a83359a8f..d0f641926f 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -80,7 +80,7 @@ static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore)
>      }
>  }
>  
> -static bool bdrv_parent_drained_poll_single(BdrvChild *c)
> +bool bdrv_parent_drained_poll_single(BdrvChild *c)
>  {
>      if (c->klass->drained_poll) {
>          return c->klass->drained_poll(c);
> diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> index 172bc6debc..2686a8acee 100644
> --- a/tests/unit/test-bdrv-drain.c
> +++ b/tests/unit/test-bdrv-drain.c
> @@ -1654,6 +1654,7 @@ static void test_drop_intermediate_poll(void)
>  
>  
>  typedef struct BDRVReplaceTestState {
> +    bool setup_completed;
>      bool was_drained;
>      bool was_undrained;
>      bool has_read;
> @@ -1738,6 +1739,10 @@ static void bdrv_replace_test_drain_begin(BlockDriverState *bs)
>  {
>      BDRVReplaceTestState *s = bs->opaque;
>  
> +    if (!s->setup_completed) {
> +        return;
> +    }
> +
>      if (!s->drain_count) {
>          s->drain_co = qemu_coroutine_create(bdrv_replace_test_drain_co, bs);
>          bdrv_inc_in_flight(bs);
> @@ -1769,6 +1774,10 @@ static void bdrv_replace_test_drain_end(BlockDriverState *bs)
>  {
>      BDRVReplaceTestState *s = bs->opaque;
>  
> +    if (!s->setup_completed) {
> +        return;
> +    }
> +
>      g_assert(s->drain_count > 0);
>      if (!--s->drain_count) {
>          s->was_undrained = true;
> @@ -1867,6 +1876,7 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
>      bdrv_ref(old_child_bs);
>      bdrv_attach_child(parent_bs, old_child_bs, "child", &child_of_bds,
>                        BDRV_CHILD_COW, &error_abort);
> +    parent_s->setup_completed = true;
>  
>      for (i = 0; i < old_drain_count; i++) {
>          bdrv_drained_begin(old_child_bs);
> 




* Re: [PATCH 00/13] block: Simplify drain
  2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
                   ` (13 preceding siblings ...)
  2022-11-10 20:13 ` [PATCH 00/13] block: Simplify drain Stefan Hajnoczi
@ 2022-11-11 11:23 ` Emanuele Giuseppe Esposito
  14 siblings, 0 replies; 61+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 11:23 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: stefanha, hreitz, pbonzini, qemu-devel



Am 08/11/2022 um 13:37 schrieb Kevin Wolf:
> I'm aware that exactly nobody has been looking forward to a series with
> this title, but it has to be. The way drain works means that we need to
> poll in bdrv_replace_child_noperm() and that makes things rather messy
> with Emanuele's multiqueue work because you must not poll while you hold
> the graph lock.
> 
> The other reason why it has to be is that drain is way too complex and
> there are too many different cases. Some simplification like this will
> hopefully make it considerably more maintainable. The diffstat probably
> tells something, too.
> 
> There are roughly speaking three parts in this series:
> 
> 1. Make BlockDriver.bdrv_drained_begin/end() non-coroutine_fn again,
>    which allows us to not poll on bdrv_drained_end() any more.
> 
> 2. Remove subtree drains. They are a considerable complication in the
>    whole drain machinery (in particular, they require polling in the
>    BdrvChildClass.attach/detach() callbacks that are called during
>    bdrv_replace_child_noperm()) and none of their users actually has a
>    good reason to use them.
> 
> 3. Finally get rid of polling in bdrv_replace_child_noperm() by
>    requiring that the child is already drained by the caller and calling
>    callbacks only once and not again for every nested drain section.
> 
> If necessary, a prefix of this series can be merged that covers only the
> first or the first two parts and it would still make sense.

I added my Reviewed-by where I felt comfortable with the code; the other
parts I am not confident enough to review.
But yes, if this works it will be very helpful for the AioContext lock
removal!

Thank you,
Emanuele

> 
> Kevin Wolf (13):
>   qed: Don't yield in bdrv_qed_co_drain_begin()
>   test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
>   block: Revert .bdrv_drained_begin/end to non-coroutine_fn
>   block: Remove drained_end_counter
>   block: Inline bdrv_drain_invoke()
>   block: Drain invidual nodes during reopen
>   block: Don't use subtree drains in bdrv_drop_intermediate()
>   stream: Replace subtree drain with a single node drain
>   block: Remove subtree drains
>   block: Call drain callbacks only once
>   block: Remove ignore_bds_parents parameter from drain functions
>   block: Don't poll in bdrv_replace_child_noperm()
>   block: Remove poll parameter from bdrv_parent_drained_begin_single()
> 
>  include/block/block-global-state.h |   3 +
>  include/block/block-io.h           |  52 +---
>  include/block/block_int-common.h   |  17 +-
>  include/block/block_int-io.h       |  12 -
>  block.c                            | 132 ++++++-----
>  block/block-backend.c              |   4 +-
>  block/io.c                         | 281 ++++------------------
>  block/qed.c                        |  24 +-
>  block/replication.c                |   6 -
>  block/stream.c                     |  20 +-
>  block/throttle.c                   |   6 +-
>  blockdev.c                         |  13 -
>  blockjob.c                         |   2 +-
>  tests/unit/test-bdrv-drain.c       | 369 +++++++----------------------
>  14 files changed, 270 insertions(+), 671 deletions(-)
> 




* Re: [PATCH 04/13] block: Remove drained_end_counter
  2022-11-09 14:44   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-11 16:37     ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-11 16:37 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Am 09.11.2022 um 15:44 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11/8/22 15:37, Kevin Wolf wrote:
> > drained_end_counter is unused now, nobody changes its value any more. It
> > can be removed.
> > 
> > In cases where we had two almost identical functions that only differed
> > in whether the caller passes drained_end_counter, or whether they would
> > poll for a local drained_end_counter to reach 0, these become a single
> > function.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> 
> [..]
> 
> >   /* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */
> 
> Not about this patch, but what is recursive in bdrv_drain_invoke() ?

Nothing today, but it used to be the case. Looks like I forgot to remove
the comment in commit 7d40d9ef five years ago.

> > -static void bdrv_drain_invoke(BlockDriverState *bs, bool begin,
> > -                              int *drained_end_counter)
> > +static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
> >   {
> >       if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
> >               (!begin && !bs->drv->bdrv_drain_end)) {
> 
> [..]
> 
> >   /**
> >    * This function does not poll, nor must any of its recursively called
> > - * functions.  The *drained_end_counter pointee will be incremented
> > - * once
> 
> Seems that is wrong already after previous commit.. Not critical

I think it's technically correct: We don't schedule background
operations any more, so any statement about them is true.  :-)

You're right, it could be updated in the previous commit, but maybe it's
easier to read the patches when you need to verify my claim only in this
patch. As you said, it doesn't matter much anyway, at the end of the
series it's gone.

Kevin

> > for every background operation scheduled, and decremented once
> > - * the operation settles.  Therefore, the pointer must remain valid
> > - * until the pointee reaches 0.  That implies that whoever sets up the
> > - * pointee has to poll until it is 0.
> > - *
> > - * We use atomic operations to access *drained_end_counter, because
> > - * (1) when called from bdrv_set_aio_context_ignore(), the subgraph of
> > - *     @bs may contain nodes in different AioContexts,
> > - * (2) bdrv_drain_all_end() uses the same counter for all nodes,
> > - *     regardless of which AioContext they are in.
> > + * functions.
> >    */
> >   static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
> > -                                BdrvChild *parent, bool ignore_bds_parents,
> > -                                int *drained_end_counter)
> > +                                BdrvChild *parent, bool ignore_bds_parents)
> >   {
> >       BdrvChild *child;
> >       int old_quiesce_counter;
> > -    assert(drained_end_counter != NULL);
> > -
> 
> [..]
> 
> -- 
> Best regards,
> Vladimir
> 




* Re: [PATCH 06/13] block: Drain invidual nodes during reopen
  2022-11-09 16:00   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-11 16:54     ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-11 16:54 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Am 09.11.2022 um 17:00 hat Vladimir Sementsov-Ogievskiy geschrieben:
> In subject: individual
> 
> On 11/8/22 15:37, Kevin Wolf wrote:
> > bdrv_reopen() and friends use subtree drains as a lazy way of covering
> > all the nodes they touch. Turns out that this lazy way is a lot more
> > complicated than just draining the nodes individually, even not
> > accounting for the additional complexity in the drain mechanism itself.
> > 
> > Simplify the code by switching to draining the individual nodes that are
> > already managed in the BlockReopenQueue anyway.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block.c             | 11 ++++-------
> >   block/replication.c |  6 ------
> >   blockdev.c          | 13 -------------
> >   3 files changed, 4 insertions(+), 26 deletions(-)
> > 
> 
> [..]
> 
> >       bdrv_reopen_queue_free(queue);
> > -    for (p = drained; p; p = p->next) {
> > -        BlockDriverState *bs = p->data;
> > -        AioContext *ctx = bdrv_get_aio_context(bs);
> > -
> > -        aio_context_acquire(ctx);
> 
> In bdrv_reopen_queue_free() we don't have this acquire()/release()
> pair around bdrv_drained_end(). We don't need it anymore?

Good catch, I think we do.

Reopen is a bit messy with AioContext locks. I think the rule is
supposed to be that bdrv_reopen_queue() requires that the lock for
bs->aio_context is held, and bdrv_reopen_multiple() requires that no
AioContext lock is held, right?

Because the former is not actually true: qmp_blockdev_reopen() and the
'replication' block driver do indeed take the lock, but bdrv_reopen()
drops it for both functions!

So I think we also need an additional fix for bdrv_reopen() to drop the
lock only after calling bdrv_reopen_queue(). It may not have made a
difference before, but now that we call bdrv_drained_begin() in it, it
seems important.

Kevin




* Re: [PATCH 10/13] block: Call drain callbacks only once
  2022-11-09 18:05   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-14 12:32     ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-14 12:32 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: qemu-block, eesposit, stefanha, hreitz, pbonzini, qemu-devel

Am 09.11.2022 um 19:05 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11/8/22 15:37, Kevin Wolf wrote:
> > We only need to call both the BlockDriver's callback and the parent
> > callbacks when going from undrained to drained or vice versa. A second
> > drain section doesn't make a difference for the driver or the parent,
> > they weren't supposed to send new requests before and after the second
> > drain.
> > 
> > One thing that gets in the way is the 'ignore_bds_parents' parameter in
> > bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
> > for the first drain, bs->quiesce_counter will be non-zero, but the
> > parent callbacks still haven't been called, so a second drain where it
> > is false would still have to call them.
> > 
> > Instead of keeping track of this, let's just get rid of the parameter.
> > It was introduced in commit 6cd5c9d7b2d as an optimisation so that
> > during bdrv_drain_all(), we wouldn't recursively drain all parents up to
> > the root for each node, resulting in quadratic complexity. As it happens,
> > calling the callbacks only once solves the same problem, so as of this
> > patch, we'll still have O(n) complexity and ignore_bds_parents is not
> > needed any more.
> > 
> > This patch only ignores the 'ignore_bds_parents' parameter. It will be
> > removed in a separate patch.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block.c                      | 13 ++++++-------
> >   block/io.c                   | 24 +++++++++++++-----------
> >   tests/unit/test-bdrv-drain.c | 16 ++++++++++------
> >   3 files changed, 29 insertions(+), 24 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 9d082631d9..8878586f6e 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2816,7 +2816,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
> >   {
> >       BlockDriverState *old_bs = child->bs;
> >       int new_bs_quiesce_counter;
> > -    int drain_saldo;
> >       assert(!child->frozen);
> >       assert(old_bs != new_bs);
> > @@ -2827,15 +2826,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
> >       }
> >       new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> > -    drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter;
> >       /*
> >        * If the new child node is drained but the old one was not, flush
> >        * all outstanding requests to the old child node.
> >        */
> > -    while (drain_saldo > 0 && child->klass->drained_begin) {
> > +    if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {
> 
> Looks like checking for child->klass->drained_begin was a wrong thing
> even prepatch?

I'm not sure if it was strictly wrong in practice, but it was at least
unnecessary. It would have been wrong if a BdrvChildClass implemented,
for example, .drained_begin, but not .drained_end. But I think we always
implement all three of .drained_begin/poll/end or none of them.

> Also, parent_quiesce_counter actually becomes a boolean variable..
> Should we stress it by new type and name?

Ok, but I would do that in a separate patch. Maybe 'bool drains_parent'.

> >           bdrv_parent_drained_begin_single(child, true);
> > -        drain_saldo--;
> >       }
> >       if (old_bs) {
> > @@ -2859,7 +2856,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
> >            * more often.
> >            */
> 
> the comment above ^^^ should be updated, we are not going to call
> drained_end more than once anyway
> 
> >           assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
> 
> do we still need this assertion and the comment at all?

Patch 12 removes both, but I can do it already here.

> > -        drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
> >           if (child->klass->attach) {
> >               child->klass->attach(child);
> > @@ -2869,10 +2865,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
> >       /*
> >        * If the old child node was drained but the new one is not, allow
> >        * requests to come in only after the new node has been attached.
> > +     *
> > +     * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
> > +     * polls, which could have changed the value.
> >        */
> > -    while (drain_saldo < 0 && child->klass->drained_end) {
> > +    new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> > +    if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
> >           bdrv_parent_drained_end_single(child);
> > -        drain_saldo++;
> >       }
> >   }
> > diff --git a/block/io.c b/block/io.c
> > index 870a25d7a5..87c7a92f15 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -62,7 +62,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c)
> >   {
> >       IO_OR_GS_CODE();
> > -    assert(c->parent_quiesce_counter > 0);
> > +    assert(c->parent_quiesce_counter == 1);
> >       c->parent_quiesce_counter--;
> >       if (c->klass->drained_end) {
> >           c->klass->drained_end(c);
> > @@ -109,6 +109,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore,
> >   void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
> >   {
> >       IO_OR_GS_CODE();
> > +    assert(c->parent_quiesce_counter == 0);
> >       c->parent_quiesce_counter++;
> >       if (c->klass->drained_begin) {
> >           c->klass->drained_begin(c);
> > @@ -352,16 +353,16 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
> >                                      BdrvChild *parent, bool ignore_bds_parents)
> >   {
> >       IO_OR_GS_CODE();
> > -    assert(!qemu_in_coroutine());
> 
> why that is dropped? seems unrelated to the commit

I'm sure I added it because I actually got an assertion failure, but I
can't reproduce it on this commit now. At the end of the series, tests
do fail unless it is removed. I'll double-check which commit is the
right one to remove it in.

> >       /* Stop things in parent-to-child order */
> >       if (qatomic_fetch_inc(&bs->quiesce_counter) == 0) {
> >           aio_disable_external(bdrv_get_aio_context(bs));
> > -    }
> > -    bdrv_parent_drained_begin(bs, parent, ignore_bds_parents);
> > -    if (bs->drv && bs->drv->bdrv_drain_begin) {
> > -        bs->drv->bdrv_drain_begin(bs);
> > +        /* TODO Remove ignore_bds_parents, we don't consider it any more */
> > +        bdrv_parent_drained_begin(bs, parent, false);
> > +        if (bs->drv && bs->drv->bdrv_drain_begin) {
> > +            bs->drv->bdrv_drain_begin(bs);
> > +        }
> >       }
> >   }
> > @@ -412,13 +413,14 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
> >       assert(bs->quiesce_counter > 0);
> >       /* Re-enable things in child-to-parent order */
> 
> the comment should be moved too, I think

It is the same place as in bdrv_do_drained_begin_quiesce().

> > -    if (bs->drv && bs->drv->bdrv_drain_end) {
> > -        bs->drv->bdrv_drain_end(bs);
> > -    }
> > -    bdrv_parent_drained_end(bs, parent, ignore_bds_parents);
> > -
> >       old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
> >       if (old_quiesce_counter == 1) {
> > +        if (bs->drv && bs->drv->bdrv_drain_end) {
> > +            bs->drv->bdrv_drain_end(bs);
> > +        }
> > +        /* TODO Remove ignore_bds_parents, we don't consider it any more */
> > +        bdrv_parent_drained_end(bs, parent, false);
> > +
> >           aio_enable_external(bdrv_get_aio_context(bs));
> >       }
> >   }
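
For illustration, the reordering quoted above boils down to running the
end callbacks only on the 1 -> 0 transition of the node's quiesce
counter, i.e. only once per nested drain section. A minimal toy model
(the names are made up, this is not QEMU code):

```c
#include <assert.h>

/* Toy model of the reordered bdrv_do_drained_end(): the driver and
 * parent end callbacks fire only when quiesce_counter drops to 0. */
typedef struct {
    int quiesce_counter;
    int end_callbacks_called;
} ToyBds;

static void toy_drained_end(ToyBds *bs)
{
    assert(bs->quiesce_counter > 0);
    if (--bs->quiesce_counter == 0) {
        /* stands in for bdrv_drain_end + bdrv_parent_drained_end */
        bs->end_callbacks_called++;
    }
}
```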

Kevin



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
  2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
                     ` (3 preceding siblings ...)
  2022-11-11 11:14   ` Emanuele Giuseppe Esposito
@ 2022-11-14 18:16   ` Hanna Reitz
  4 siblings, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:16 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> We want to change .bdrv_co_drained_begin() back to be a non-coroutine
> callback, so in preparation, avoid yielding in its implementation.
>
> Because we increase bs->in_flight and bdrv_drained_begin() polls, the
> behaviour is unchanged.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/qed.c | 20 +++++++++++++++++---
>   1 file changed, 17 insertions(+), 3 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  2022-11-08 12:37 ` [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end() Kevin Wolf
                     ` (2 preceding siblings ...)
  2022-11-11 11:14   ` Emanuele Giuseppe Esposito
@ 2022-11-14 18:16   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:16 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
> callbacks, so in preparation, avoid yielding in their implementation.
>
> This does almost the same as the existing logic in bdrv_drain_invoke(),
> by creating and entering coroutines internally. However, since the test
> case is by far the heaviest user of coroutine code in drain callbacks,
> it is preferable to have the complexity in the test case rather than the
> drain core, which is already complicated enough without this.
>
> The behaviour for bdrv_drain_begin() is unchanged because we increase
> bs->in_flight and this is still polled. However, bdrv_drain_end()
> doesn't wait for the spawned coroutine to complete any more. This is
> fine, we don't rely on bdrv_drain_end() restarting all operations
> immediately before the next aio_poll().
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   tests/unit/test-bdrv-drain.c | 64 ++++++++++++++++++++++++++----------
>   1 file changed, 46 insertions(+), 18 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  2022-11-08 12:37 ` [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn Kevin Wolf
                     ` (2 preceding siblings ...)
  2022-11-11 11:14   ` Emanuele Giuseppe Esposito
@ 2022-11-14 18:17   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:17 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> Polling during bdrv_drained_end() can be problematic (and in the future,
> we may get cases for bdrv_drained_begin() where polling is forbidden,
> and we don't care about already in-flight requests, but just want to
> prevent new requests from arriving).
>
> The .bdrv_drained_begin/end callbacks running in a coroutine is the only
> reason why we have to do this polling, so make them non-coroutine
> callbacks again. None of the callers actually yield any more.
>
> This means that bdrv_drained_end() effectively doesn't poll any more,
> even if AIO_WAIT_WHILE() loops are still there (their condition is false
> from the beginning). This is generally not a problem, but in
> test-bdrv-drain, some additional explicit aio_poll() calls need to be
> added because the test case wants to verify the final state after BHs
> have executed.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block_int-common.h | 10 ++++---
>   block.c                          |  4 +--
>   block/io.c                       | 49 +++++---------------------------
>   block/qed.c                      |  4 +--
>   block/throttle.c                 |  6 ++--
>   tests/unit/test-bdrv-drain.c     | 18 ++++++------
>   6 files changed, 30 insertions(+), 61 deletions(-)

As the others have already suggested, I too would drop the _co_ in qed
and throttle, and the coroutine_fn in throttle.  With that done:

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 04/13] block: Remove drained_end_counter
  2022-11-08 12:37 ` [PATCH 04/13] block: Remove drained_end_counter Kevin Wolf
  2022-11-09 14:44   ` Vladimir Sementsov-Ogievskiy
  2022-11-11 11:15   ` Emanuele Giuseppe Esposito
@ 2022-11-14 18:19   ` Hanna Reitz
  2 siblings, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:19 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> drained_end_counter is unused now, nobody changes its value any more. It
> can be removed.
>
> In cases where we had two almost identical functions that only differed
> in whether the caller passes drained_end_counter, or whether they would
> poll for a local drained_end_counter to reach 0, these become a single
> function.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block-io.h         | 15 -----
>   include/block/block_int-common.h |  6 +-
>   block.c                          |  5 +-
>   block/block-backend.c            |  4 +-
>   block/io.c                       | 97 ++++++++------------------------
>   blockjob.c                       |  2 +-
>   6 files changed, 30 insertions(+), 99 deletions(-)

The comments on bdrv_drained_end() and bdrv_parent_drained_end_single() 
in include/block/block-io.h still say that they poll some AioContext 
“which may result in a graph change”.  That’s no longer the case, 
though, so those paragraphs should be dropped, I think.

Apart from that, looks good.

Hanna




* Re: [PATCH 05/13] block: Inline bdrv_drain_invoke()
  2022-11-08 12:37 ` [PATCH 05/13] block: Inline bdrv_drain_invoke() Kevin Wolf
                     ` (2 preceding siblings ...)
  2022-11-11 11:15   ` Emanuele Giuseppe Esposito
@ 2022-11-14 18:19   ` Hanna Reitz
  3 siblings, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:19 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> bdrv_drain_invoke() has now two entirely separate cases that share no
> code any more and are selected depending on a bool parameter. Each case
> has only one caller. Just inline the function.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/io.c | 23 ++++++-----------------
>   1 file changed, 6 insertions(+), 17 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate()
  2022-11-08 12:37 ` [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate() Kevin Wolf
  2022-11-09 16:18   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-14 18:20   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:20 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> Instead of using a subtree drain from the top node (which also drains
> child nodes of base that we're not even interested in), use a normal
> drain for base, which automatically drains all of the parents, too.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 08/13] stream: Replace subtree drain with a single node drain
  2022-11-08 12:37 ` [PATCH 08/13] stream: Replace subtree drain with a single node drain Kevin Wolf
  2022-11-09 16:52   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-14 18:21   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:21 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
> graph changes between finding the base node and changing the block graph
> as necessary on completion of the image streaming job.
>
> The block graph could change between these two points because
> bdrv_set_backing_hd() first drains the parent node, which involved
> polling and can do anything.
>
> Subtree draining was an imperfect way to make this less likely (because
> with it, fewer callbacks are called during this window). Everyone agreed
> that it's not really the right solution, and it was only committed as a
> stopgap solution.
>
> This replaces the subtree drain with a solution that simply drains the
> parent node before we try to find the base node, and then call a version
> of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
> parent node is already drained.
>
> This way, any graph changes caused by draining happen before we start
> looking at the graph and things stay consistent between finding the base
> node and changing the graph.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block-global-state.h |  3 +++
>   block.c                            | 17 ++++++++++++++---
>   block/stream.c                     | 20 ++++++++++----------
>   3 files changed, 27 insertions(+), 13 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 09/13] block: Remove subtree drains
  2022-11-08 12:37 ` [PATCH 09/13] block: Remove subtree drains Kevin Wolf
  2022-11-09 17:22   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-14 18:22   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:22 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> Subtree drains are not used any more. Remove them.
>
> After this, BdrvChildClass.attach/detach() don't poll any more.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block-io.h         |  18 +--
>   include/block/block_int-common.h |   1 -
>   include/block/block_int-io.h     |  12 --
>   block.c                          |  20 +--
>   block/io.c                       | 121 +++-----------
>   tests/unit/test-bdrv-drain.c     | 261 ++-----------------------------
>   6 files changed, 44 insertions(+), 389 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 10/13] block: Call drain callbacks only once
  2022-11-08 12:37 ` [PATCH 10/13] block: Call drain callbacks only once Kevin Wolf
  2022-11-09 18:05   ` Vladimir Sementsov-Ogievskiy
  2022-11-09 18:54   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-14 18:23   ` Hanna Reitz
  2 siblings, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:23 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> We only need to call both the BlockDriver's callback and the parent
> callbacks when going from undrained to drained or vice versa. A second
> drain section doesn't make a difference for the driver or the parent,
> they weren't supposed to send new requests before and after the second
> drain.
>
> One thing that gets in the way is the 'ignore_bds_parents' parameter in
> bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
> for the first drain, bs->quiesce_counter will be non-zero, but the
> parent callbacks still haven't been called, so a second drain where it
> is false would still have to call them.
>
> Instead of keeping track of this, let's just get rid of the parameter.
> It was introduced in commit 6cd5c9d7b2d as an optimisation so that
> during bdrv_drain_all(), we wouldn't recursively drain all parents up to
> the root for each node, resulting in quadratic complexity. As it happens,
> calling the callbacks only once solves the same problem, so as of this
> patch, we'll still have O(n) complexity and ignore_bds_parents is not
> needed any more.
>
> This patch only ignores the 'ignore_bds_parents' parameter. It will be
> removed in a separate patch.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                      | 13 ++++++-------
>   block/io.c                   | 24 +++++++++++++-----------
>   tests/unit/test-bdrv-drain.c | 16 ++++++++++------
>   3 files changed, 29 insertions(+), 24 deletions(-)

I too would like parent_quiesce_counter to become `bool 
parent_quiesced`, but:

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions
  2022-11-08 12:37 ` [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions Kevin Wolf
  2022-11-09 18:57   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-14 18:23   ` Hanna Reitz
  1 sibling, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 18:23 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> ignore_bds_parents is now ignored, so we can just remove it.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block-io.h | 10 ++----
>   block.c                  |  4 +--
>   block/io.c               | 78 +++++++++++++++-------------------------
>   3 files changed, 32 insertions(+), 60 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()
  2022-11-08 12:37 ` [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm() Kevin Wolf
  2022-11-11 11:21   ` Emanuele Giuseppe Esposito
@ 2022-11-14 20:22   ` Hanna Reitz
  2022-11-17 13:27     ` Kevin Wolf
  1 sibling, 1 reply; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 20:22 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> In order to make sure that bdrv_replace_child_noperm() doesn't have to
> poll any more, get rid of the bdrv_parent_drained_begin_single() call.
>
> This is possible now because we can require that the child is already
> drained when the function is called (it better be, having in-flight
> requests while modifying the graph isn't going to end well!) and we
> don't call the parent drain callbacks more than once.
>
> The additional drain calls needed in callers cause the test case to run
> its code in the drain handler too early (bdrv_attach_child() drains
> now), so modify it to only enable the code after the test setup has
> completed.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block-io.h     |  8 ++++
>   block.c                      | 72 +++++++++++++++++++++++++-----------
>   block/io.c                   |  2 +-
>   tests/unit/test-bdrv-drain.c | 10 +++++
>   4 files changed, 70 insertions(+), 22 deletions(-)

I find this change complicated.  I understand it’s the point of the 
series, but I find it difficult to grasp.  But I guess there can be no 
drain series without such a patch.

As usual, I was very skeptical of the code at first, and over time 
slowly realized that I’m mostly confused by the comments, and the code 
seems fine.  Ah, well.

[...]

> diff --git a/block.c b/block.c
> index 5f5f79cd16..12039e9b8a 100644
> --- a/block.c
> +++ b/block.c

[...]

> @@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = {
>    *
>    * Note: real unref of old_bs is done only on commit.
>    *
> + * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept
> + * drained until the transaction is completed (this automatically implies that
> + * child remains drained, too).

I find “child” extremely ambiguous.  The problem is that there generally 
is no way to drain a BdrvChild object, is there?  You can only drain the 
BDS in it, which then drains the parent through the BdrvChild object.  
Historically, I don’t think there was ever a place where we cared about 
the BdrvChild object between the two being drained, was there?  I mean, 
now there apparently is, in bdrv_child_attach_common(), but that’s a 
different story.

So the problem is that “draining a BdrvChild object” generally appears 
in the context of bdrv_parent_drained_*() functions, i.e. actually 
functions draining the parent.  Which makes it a bit confusing to refer 
to a BdrvChild object just as “child”.

I know that “child” here refers to the variable (or does it not?), but 
that is why I really prefer marking variables that are just plain 
English words, e.g. as @child or `child`, so it’s clear they are a name 
and not a noun.

In any case, because the concept is generally to drain the `child->bs` 
instead of the BdrvChild object directly, I understand the comment to 
mean: “Both the old child (`child->bs`) and `new_bs` (if non-NULL) must 
be drained.  `new_bs` must be kept drained until the transaction is 
completed.  This implies that the parent too will be kept drained until 
the transaction is completed by the BdrvChild object `child`.”

Or am I misunderstanding something, and the distinction between `child` 
and `child->bs` and the parent node is important here? (Would be good to 
know. :))

> + *
>    * The function doesn't update permissions, caller is responsible for this.
>    */
>   static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
>                                       Transaction *tran)
>   {
>       BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
> +
> +    assert(child->parent_quiesce_counter);
> +    assert(!new_bs || new_bs->quiesce_counter);
> +
>       *s = (BdrvReplaceChildState) {
>           .child = child,
>           .old_bs = child->bs,
> @@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child,

This function now has its callers fulfill kind of a complicated 
contract.  I would prefer that to be written out in a doc comment, 
especially because it sounds like the assertions can’t cover everything 
(i.e. callers that remove a child are required to have stopped issuing 
requests to that child, but they are free to do that in any way they 
want, so no assertion will check for it here).

>       int new_bs_quiesce_counter;
>   
>       assert(!child->frozen);
> +    /*
> +     * When removing the child, it's the callers responsibility to make sure
> +     * that no requests are in flight any more. Usually the parent is drained,
> +     * but not through child->parent_quiesce_counter.
> +     */

When I see a comment above an assertion, I immediately assume it is 
going to describe what the assertion checks.  Unless I’m 
misunderstanding something (again?), this comment actually describes 
what the assertion *does not* check.  I find that confusing, especially 
because the comment leads with “it’s the caller’s responsibility”, which 
to me implies “and that’s why we check it here in this assertion”, 
because assertions are there to verify that contracts are met.

The assertion verifies that the parent must be drained (through @child), 
unless the child is removed, which case isn’t covered by the assertion.  
That “isn’t covered” is then described by the comment, right?

I’d prefer the comment to lead with describing what the assertion does 
check, and then transitioning to “But in case the child is removed, we 
ignore that, and just note that it’s the caller’s responsibility to...”.

Also, the comment doesn’t explicitly say why we don’t check it in the 
assertion.  It says “usually” and “child->parent_quiesce_counter”, which 
implies “can’t get any information from child->parent_quiesce_counter, 
and regardless, callers can do what they want to achieve quiescing with 
regard to this child, so there’s nothing we can check”.  It feels like 
we can just say outright that there’s an informal contract that we can’t 
formally verify here, but callers naturally still must adhere to it.  It 
would be interesting to know (and note) why that is, though, i.e. why we 
can’t have parents be drained through the BdrvChild object for the child 
that is being removed.

I understand the intention behind the assertion to be: “We require the 
parent not to have in-flight requests to the BdrvChild object 
manipulated here.  In most cases, we verify that by requiring the parent 
be drained through this BdrvChild object.  However, when a child is 
being removed, we skip formal verification, because we leave callers 
free in deciding how to ensure that no requests are in flight.  Usually, 
they will still have the parent be drained (even if not through this 
BdrvChild object), but we don’t require that.”

I may well be wrong, but then it would be good for a comment to correct 
me. :)

(Interestingly, because bdrv_replace_child_noperm() no longer polls 
itself, it can’t know for sure that `child->parent_quiesce_counter > 0` 
means that there are no requests in flight.)

> +    assert(!new_bs || child->parent_quiesce_counter);
>       assert(old_bs != new_bs);
>       GLOBAL_STATE_CODE();

[...]

> @@ -2865,9 +2875,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>       /*
>        * If the old child node was drained but the new one is not, allow

This now also covers the case where there was no old child node, but the 
parent was simply drained via an empty BdrvChild by the caller.

>        * requests to come in only after the new node has been attached.
> -     *
> -     * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
> -     * polls, which could have changed the value.
>        */
>       new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
>       if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
> @@ -3004,6 +3011,12 @@ static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs,
>       }
>   
>       bdrv_ref(child_bs);
> +    /*
> +     * Let every new BdrvChild start drained, inserting it in the graph with
> +     * bdrv_replace_child_noperm() will undrain it if the child node is not
> +     * drained. The child was only just created, so polling is not necessary.

I feel like this is hiding some complexity.  Unless I missed something, 
draining a BdrvChild always meant draining the parent. But here, it 
absolutely does not mean that, and maybe that deserves a big warning sign?

Beginning a drain without poll means quiescing.  You assert that there 
can be no requests to the new child, which I agree on[1].  The 
combination of no new requests coming in, and no requests being there at 
this point is what being drained means.  So @new_child is indeed “drained”.

But the parent isn’t drained, because it isn’t polled.  There may still 
be requests in flight to its other children.  That’s really interesting, 
and I found it extremely confusing until I wrote ten paragraphs in reply 
here and scrapped most of them again.  Whenever I find this to be my 
reaction to something, I really wish for a detailed comment that 
explains the situation.

I would like the comment to:
- Expand on what “only just created” means.  As it’s written, that could 
mean relying on a race condition.  At which point would the parent be 
able to send requests?  (I assume either the .attach() in 
bdrv_replace_child_noperm(), or when this function returns, whichever 
comes first.  (The former always comes first.))
- Say in more detail that calling bdrv_parent_drained_begin_single() 
without polling will quiesce the parent, preventing new requests from 
appearing.
- Note that because there are no requests in flight, and because no new 
requests can then appear, the BdrvChild is drained.
- Note that the parent is only quiesced, not drained, and may still have 
requests in flight to other children, but naturally we don’t care about 
them.

I feel like the comment tries to hide all that complexity simply by 
avoiding the word “parent”.

[1] As far as I can piece together, no requests to the new child can 
have started yet, because this function creates the BdrvChild object, so 
before it is returned to the caller (or BdrvChildClass.attach() is 
called in bdrv_replace_child_noperm()), the block driver won’t generate 
requests to it.
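
To make the quiesce/drain distinction above concrete, here is a toy
model (illustrative only, not QEMU code): quiescing merely blocks new
requests, while polling additionally waits for in-flight ones, and
only the combination is a full drain.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the poll parameter of a drained_begin-style function:
 * quiesced == true stops *new* requests; polling also waits for
 * *in-flight* requests to complete. */
typedef struct {
    bool quiesced;
    int in_flight;
} ToyParent;

static void toy_complete_one(ToyParent *p)
{
    if (p->in_flight > 0) {
        p->in_flight--;
    }
}

static void toy_parent_drained_begin_single(ToyParent *p, bool poll)
{
    p->quiesced = true;                 /* no new requests from now on */
    if (poll) {
        while (p->in_flight > 0) {      /* AIO_WAIT_WHILE() stand-in */
            toy_complete_one(p);
        }
    }
}
```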

Hanna

> +     */
> +    bdrv_parent_drained_begin_single(new_child, false);
>       bdrv_replace_child_noperm(new_child, child_bs);
>   
>       BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);




* Re: [PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single()
  2022-11-08 12:37 ` [PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single() Kevin Wolf
@ 2022-11-14 20:24   ` Hanna Reitz
  0 siblings, 0 replies; 61+ messages in thread
From: Hanna Reitz @ 2022-11-14 20:24 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: eesposit, stefanha, pbonzini, qemu-devel

On 08.11.22 13:37, Kevin Wolf wrote:
> All callers of bdrv_parent_drained_begin_single() pass poll=false now,
> so we don't need the parameter any more.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/block-io.h | 5 ++---
>   block.c                  | 4 ++--
>   block/io.c               | 7 ++-----
>   3 files changed, 6 insertions(+), 10 deletions(-)

Well, “drained_begin” does not mean “drain”, so...

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()
  2022-11-14 20:22   ` Hanna Reitz
@ 2022-11-17 13:27     ` Kevin Wolf
  0 siblings, 0 replies; 61+ messages in thread
From: Kevin Wolf @ 2022-11-17 13:27 UTC (permalink / raw)
  To: Hanna Reitz; +Cc: qemu-block, eesposit, stefanha, pbonzini, qemu-devel

Am 14.11.2022 um 21:22 hat Hanna Reitz geschrieben:
> On 08.11.22 13:37, Kevin Wolf wrote:
> > In order to make sure that bdrv_replace_child_noperm() doesn't have to
> > poll any more, get rid of the bdrv_parent_drained_begin_single() call.
> > 
> > This is possible now because we can require that the child is already
> > drained when the function is called (it better be, having in-flight
> > requests while modifying the graph isn't going to end well!) and we
> > don't call the parent drain callbacks more than once.
> > 
> > The additional drain calls needed in callers cause the test case to run
> > its code in the drain handler too early (bdrv_attach_child() drains
> > now), so modify it to only enable the code after the test setup has
> > completed.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   include/block/block-io.h     |  8 ++++
> >   block.c                      | 72 +++++++++++++++++++++++++-----------
> >   block/io.c                   |  2 +-
> >   tests/unit/test-bdrv-drain.c | 10 +++++
> >   4 files changed, 70 insertions(+), 22 deletions(-)
> 
> I find this change complicated.  I understand it’s the point of the series,
> but I find it difficult to grasp.  But I guess there can be no drain series
> without such a patch.
> 
> As usual, I was very skeptical of the code at first, and over time slowly
> realized that I’m mostly confused by the comments, and the code seems fine. 
> Ah, well.

I spent a while thinking about how to do things differently, but my
conclusion is that just improving the comments is probably the best option.

The real condition in bdrv_replace_child_noperm() is: If you want to
change the BdrvChild to point to a drained node with child->bs, either
child->parent_quiesce_counter must already be non-zero or you must be
able to increase it and keep the parent's quiesce_counter consistent
with that, but without polling or starting new requests.

This patch generalised the condition to non-drained child nodes as well
just because it's easier to verify when it's a condition that applies
always. It also picked the first option, child->parent_quiesce_counter
already being non-zero.

If we wanted to implement the second option, the potential problem is in
the "without starting new requests" condition, because .drained_begin
can start new requests on the child (e.g. restarting throttled requests,
so that we can actually drain the request queue).

Note that we're not even interested in draining the request queue here,
we already assume that the parent doesn't have active requests on the
child (otherwise we would always have gotten crashes). We just want to
get the counters to agree with each other.
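
To make that concrete, here is a stand-alone toy model of the bookkeeping
(the field names mirror the real ones, but the types and helper names are
simplified and hypothetical, not QEMU code):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for BlockDriverState and BdrvChild */
typedef struct BDS {
    int quiesce_counter;        /* how many drain sections cover this node */
} BDS;

typedef struct Child {
    BDS *bs;
    int parent_quiesce_counter; /* drains propagated to the parent via this child */
} Child;

/* bdrv_parent_drained_begin_single(poll=false): quiesce, don't poll */
static void parent_drained_begin_single(Child *c)
{
    c->parent_quiesce_counter++;
    /* the real code would also invoke the parent's .drained_begin here */
}

/*
 * bdrv_replace_child_noperm(), reduced to the counter rule: switching to
 * a (possibly drained) node is only safe without polling if the parent is
 * already quiesced through this child.
 */
static void replace_child_noperm(Child *c, BDS *new_bs)
{
    assert(!new_bs || c->parent_quiesce_counter);
    c->bs = new_bs;
}
```

So "getting the counters to agree" is nothing more than making sure the
first assert above holds before the graph change happens.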

Maybe the most reasonable approach for this would be formally requiring
that .drained_begin both in BdrvChildClass and BlockDriver not do
anything if the thing in question is already quiesced (I think this is
true in practice; for BlockDriver probably only after the earlier
patches in this series) and then assert(bdrv_parent_is_drained(child))
in bdrv_replace_child_noperm(), which would require a new BdrvChildClass
callback. Then the bdrv_parent_drained_begin_single() call could stay
around, but wouldn't have to poll any more.

Of course, all of the drains in this patch would have to stay anyway to
make sure that the parent is already drained. So I'm not sure if it's
any simpler or better in any way than requiring that the parent was
already drained through _this_ BdrvChild.

What do you think?

> [...]
> 
> > diff --git a/block.c b/block.c
> > index 5f5f79cd16..12039e9b8a 100644
> > --- a/block.c
> > +++ b/block.c
> 
> [...]
> 
> > @@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = {
> >    *
> >    * Note: real unref of old_bs is done only on commit.
> >    *
> > + * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept
> > + * drained until the transaction is completed (this automatically implies that
> > + * child remains drained, too).
> 
> I find “child” extremely ambiguous.  The problem is that there generally is
> no way to drain a BdrvChild object, is there?  You can only drain the BDS in
> it, which then drains the parent through the BdrvChild object. 
> Historically, I don’t think there was ever a place where we cared about the
> BdrvChild object between the two to be drained, was there?  I mean, now
> there apparently is, in bdrv_child_attach_common(), but that’s a different
> story.

I think we've always cared about the parent drain happening through the
BdrvChild, though, at least since your commit 804db8ea which introduced
BdrvChild.parent_quiesce_counter.

Whether or not to call the BdrvChild itself drained in this case is
probably more a question of terminology.

If we want to avoid calling a BdrvChild drained, I guess I could require
child->bs to be drained instead, which implies the condition we're
really interested in.

> So the problem is that “draining a BdrvChild object” generally appears in
> the context of bdrv_parent_drained_*() functions, i.e. actually functions
> draining the parent.  Which makes it a bit confusing to refer to a BdrvChild
> object just as “child”.
> 
> I know that “child” here refers to the variable (or does it not?), but that
> is why I really prefer marking variables that are just plain English words,
> e.g. as @child or `child`, so it’s clear they are a name and not a noun.

That's fair, I should add that @. (Yes, it does.)

> In any case, because the concept is generally to drain the `child->bs`
> instead of the BdrvChild object directly, I understand the comment to mean:
> “Both the old child (`child->bs`) and `new_bs` (if non-NULL) must be
> drained.  `new_bs` must be kept drained until the transaction is completed. 
> This implies that the parent too will be kept drained until the transaction
> is completed by the BdrvChild object `child`.”
> 
> Or am I misunderstanding something, and the distinction between `child` and
> `child->bs` and the parent node is important here? (Would be good to know.
> :))

I'm not sure how a transaction "is completed by the BdrvChild object"
(isn't it the caller that finalises the transaction?), but I think
otherwise that's equivalent to what I was trying to express.

Oh, is it just the word order that confused me, and you really mean
"will be kept drained by the BdrvChild object until the transaction is
completed (by the caller)"?

> > + *
> >    * The function doesn't update permissions, caller is responsible for this.
> >    */
> >   static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
> >                                       Transaction *tran)
> >   {
> >       BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
> > +
> > +    assert(child->parent_quiesce_counter);
> > +    assert(!new_bs || new_bs->quiesce_counter);
> > +
> >       *s = (BdrvReplaceChildState) {
> >           .child = child,
> >           .old_bs = child->bs,
> > @@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
> 
> This function now has its callers fulfill kind of a complicated contract.  I
> would prefer that to be written out in a doc comment, especially because it
> sounds like the assertions can’t cover everything (i.e. callers that remove
> a child are required to have stopped issuing requests to that child, but
> they are free to do that in any way they want, so no assertion will check
> for it here).

Ok, I can add a comment.

I don't think the contract is complicated: The parent has to be drained
through this BdrvChild (because new_bs could already be drained), except
for new_bs == NULL, which is obviously not attaching the child to a
drained node.

> >       int new_bs_quiesce_counter;
> >       assert(!child->frozen);
> > +    /*
> > +     * When removing the child, it's the callers responsibility to make sure
> > +     * that no requests are in flight any more. Usually the parent is drained,
> > +     * but not through child->parent_quiesce_counter.
> > +     */
> 
> When I see a comment above an assertion, I immediately assume it is going to
> describe what the assertion checks.  Unless I’m misunderstanding something
> (again?), this comment actually describes what the assertion *does not*
> check.  I find that confusing, especially because the comment leads with
> “it’s the caller’s responsibility”, which to me implies “and that’s why we
> check it here in this assertion”, because assertions are there to verify
> that contracts are met.

The comment is bad, I must have been confused myself while writing it.

The logic here isn't even about requests in flight. It's true that there
must be none, but that's a separate requirement. What it is about is
maintaining consistency between child->parent_quiesce_counter and the
parent's own quiesce_counter in case of switching to a drained node,
without having to poll - which is only possible if the parent is already
drained.

That we require the parent to be drained through this specific BdrvChild
is a choice with the intention to keep things simpler, as explained
above.

> The assertion verifies that the parent must be drained (through @child),
> unless the child is removed, which case isn’t covered by the assertion. 
> That “isn’t covered” is then described by the comment, right?
> 
> I’d prefer the comment to lead with describing what the assertion does
> check, and then transitioning to “But in case the child is removed, we
> ignore that, and just note that it’s the caller’s responsibility to...”.
> 
> Also, the comment doesn’t explicitly say why we don’t check it in the
> assertion.  It says “usually” and “child->parent_quiesce_counter”, which
> implies “can’t get any information from child->parent_quiesce_counter, and
> regardless, callers can do what they want do achieve quiescing in regards to
> this child, so there’s nothing we can check”.  It feels like we can just say
> outright that there’s an informal contract that we can’t formally verify
> here, but callers naturally still must adhere to it.  It would be
> interesting to know (and note) why that is, though, i.e. why we can’t have
> parents be drained through the BdrvChild object for the child that is being
> removed.

We could require that, it would just be more complicated for the callers
that pass a constant NULL, for no real benefit.

> I understand the intention behind the assertion to be: “We require the
> parent not to have in-flight requests to the BdrvChild object manipulated
> here.  In most cases, we verify that by requiring the parent be drained
> through this BdrvChild object.  However, when a child is being removed, we
> skip formal verification, because we leave callers free in deciding how to
> ensure that no requests are in flight.  Usually, they will still have the
> parent be drained (even if not through this BdrvChild object), but we don’t
> require that.”
> 
> I may well be wrong, but then it would be good for a comment to correct me.
> :)
> 
> (Interestingly, because bdrv_replace_child_noperm() no longer polls itself,
> it can’t know for sure that `child->parent_quiesce_counter > 0` means that
> there are no requests in flight.)
> 
> > +    assert(!new_bs || child->parent_quiesce_counter);
> >       assert(old_bs != new_bs);
> >       GLOBAL_STATE_CODE();
> 
> [...]
> 
> > @@ -2865,9 +2875,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
> >       /*
> >        * If the old child node was drained but the new one is not, allow
> 
> This now also covers the case where there was no old child node, but the
> parent was simply drained via an empty BdrvChild by the caller.

I'm not sure how to express this concisely if we want to avoid calling
the BdrvChild itself drained.

> >        * requests to come in only after the new node has been attached.
> > -     *
> > -     * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
> > -     * polls, which could have changed the value.
> >        */
> >       new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> >       if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
> > @@ -3004,6 +3011,12 @@ static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs,
> >       }
> >       bdrv_ref(child_bs);
> > +    /*
> > +     * Let every new BdrvChild start drained, inserting it in the graph with
> > +     * bdrv_replace_child_noperm() will undrain it if the child node is not
> > +     * drained. The child was only just created, so polling is not necessary.
> 
> I feel like this is hiding some complexity.  Unless I missed something,
> draining a BdrvChild always meant draining the parent. But here, it
> absolutely does not mean that, and maybe that deserves a big warning sign?
> 
> Beginning a drain without poll means quiescing.  You assert that there can
> be no requests to the new child, which I agree on[1].  The combination of no
> new requests coming in, and no requests being there at this point is what
> being drained means.  So @new_child is indeed “drained”.
> 
> But the parent isn’t drained, because it isn’t polled.  There may still be
> requests in flight to its other children.  That’s really interesting, and I
> found it extremely confusing until I wrote ten paragraphs in reply here and
> scrapped most of them again.  Whenever I find this to be my reaction to
> something, I really wish for a detailed comment that explains the situation.
> 
> I would like the comment to:
> - Expand on what “only just created” means.  As it’s written, that could
> mean relying on a race condition.  At which point would the parent be able
> to send requests?  (I assume either the .attach() in
> bdrv_replace_child_noperm(), or when this function returns, whichever comes
> first.  (The former always comes first.))

I don't think .attach is supposed to create requests - even less so
while the BdrvChild is drained. It may schedule a BH, but that won't be
executed until this function returns.

This is not documented explicitly, maybe we should document it.

I suppose .drained_end can create requests in general, but it wouldn't
make sense to me if it did that for a new child. It generally just
resumes operations that were stopped because of the drain, but there was
no operation on a new child yet.
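
As a stand-alone sketch of that sequence (again a toy model, not the real
API - only the counter handling is kept, everything else is omitted):

```c
#include <assert.h>
#include <stddef.h>

typedef struct BDS {
    int quiesce_counter;
} BDS;

typedef struct Child {
    BDS *bs;
    int parent_quiesce_counter;
} Child;

/* bdrv_attach_child_common(), reduced to its drain bookkeeping */
static Child attach_child_common(BDS *child_bs)
{
    Child c = { .bs = NULL, .parent_quiesce_counter = 0 };

    /*
     * The new BdrvChild starts drained. The child was only just created,
     * so there can be no requests yet and no polling is needed.
     */
    c.parent_quiesce_counter++;

    /* bdrv_replace_child_noperm() inserts it into the graph ... */
    c.bs = child_bs;

    /* ... and undrains it again if the child node is not drained */
    if (child_bs->quiesce_counter == 0 && c.parent_quiesce_counter) {
        c.parent_quiesce_counter--;
    }

    return c;
}
```

Either way, the BdrvChild ends up with a counter that agrees with the node
it now points to, without ever having polled.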

> - Say in more detail that calling bdrv_parent_drained_begin_single() without
> polling will quiesce the parent, preventing new requests from appearing.
> - Note that because there are no requests in flight, and because no new
> requests can then appear, the BdrvChild is drained.
> - Note that the parent is only quiesced, not drained, and may still have
> requests in flight to other children, but naturally we don’t care about
> them.

All of this is true, but at the same time not related to what the
bdrv_parent_drained_begin_single() call is meant for - increasing the
parent's quiesce_counter from the BdrvChild before calling
bdrv_replace_child_noperm(), so that we don't have to do it inside of
that function where we don't know that we don't have to poll.

That there are no requests in flight when you change child->bs is a
requirement that we already had before this patch.

If it feels better to you, we could even just poll here (and drop patch
13, because the poll parameter would still be used).

The part that is important in the context of Emanuele's patches that
will follow is that we poll outside of a bdrv_graph_wrlock/wrunlock()
section. This might mean that we'd have to pull the polling further
down into the callers in the long run. Emanuele's current patches only
put the lock in bdrv_replace_child_noperm(), but generally speaking you
wouldn't want the graph to change between two related changes, so I'm
almost sure that the lock will be taken in callers in the future.
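
The ordering constraint can be sketched like this (a toy model;
graph_wrlock()/graph_wrunlock() stand in for the real bdrv_graph_*
functions, and the bool is a stand-in for the actual lock):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: polling can run BHs and callbacks that themselves need the
 * graph, so it must happen strictly outside the writer-locked section.
 */
static bool graph_locked;

static void graph_wrlock(void)   { graph_locked = true; }
static void graph_wrunlock(void) { graph_locked = false; }

static void drain_and_poll(void)
{
    assert(!graph_locked);  /* polling under the write lock would deadlock */
}

static void replace_child_noperm(void)
{
    assert(graph_locked);   /* graph changes require the write lock ... */
    /* ... and must not poll in here */
}

/* The pattern the callers end up with: drain first, then lock, then modify */
static void caller_replaces_child(void)
{
    drain_and_poll();
    graph_wrlock();
    replace_child_noperm();
    graph_wrunlock();
}
```

Whether the lock is taken inside bdrv_replace_child_noperm() or in its
callers, the poll has to stay on the unlocked side of that boundary.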

> I feel like the comment tries to hide all that complexity simply by avoiding
> the word “parent”.
> 
> [1] As far as I can piece together, no requests to the new child can have
> started yet, because this function creates the BdrvChild object, so before
> it is returned to the caller (or BdrvChildClass.attach() is called in
> bdrv_replace_child_noperm()), the block driver won’t generate requests to
> it.

Kevin




Thread overview: 61+ messages
2022-11-08 12:37 [PATCH 00/13] block: Simplify drain Kevin Wolf
2022-11-08 12:37 ` [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin() Kevin Wolf
2022-11-09  9:21   ` Vladimir Sementsov-Ogievskiy
2022-11-09  9:27   ` Vladimir Sementsov-Ogievskiy
2022-11-09 12:22     ` Kevin Wolf
2022-11-09 21:49   ` Stefan Hajnoczi
2022-11-10 11:07     ` Kevin Wolf
2022-11-11 11:14   ` Emanuele Giuseppe Esposito
2022-11-14 18:16   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end() Kevin Wolf
2022-11-09 10:50   ` Vladimir Sementsov-Ogievskiy
2022-11-09 12:28     ` Kevin Wolf
2022-11-09 13:45   ` Vladimir Sementsov-Ogievskiy
2022-11-11 11:14   ` Emanuele Giuseppe Esposito
2022-11-14 18:16   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn Kevin Wolf
2022-11-09 14:29   ` Vladimir Sementsov-Ogievskiy
2022-11-09 22:13   ` Stefan Hajnoczi
2022-11-11 11:14   ` Emanuele Giuseppe Esposito
2022-11-14 18:17   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 04/13] block: Remove drained_end_counter Kevin Wolf
2022-11-09 14:44   ` Vladimir Sementsov-Ogievskiy
2022-11-11 16:37     ` Kevin Wolf
2022-11-11 11:15   ` Emanuele Giuseppe Esposito
2022-11-14 18:19   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 05/13] block: Inline bdrv_drain_invoke() Kevin Wolf
2022-11-09 15:34   ` Vladimir Sementsov-Ogievskiy
2022-11-10 19:48   ` Stefan Hajnoczi
2022-11-11 11:15   ` Emanuele Giuseppe Esposito
2022-11-14 18:19   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 06/13] block: Drain invidual nodes during reopen Kevin Wolf
2022-11-09 16:00   ` Vladimir Sementsov-Ogievskiy
2022-11-11 16:54     ` Kevin Wolf
2022-11-08 12:37 ` [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate() Kevin Wolf
2022-11-09 16:18   ` Vladimir Sementsov-Ogievskiy
2022-11-14 18:20   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 08/13] stream: Replace subtree drain with a single node drain Kevin Wolf
2022-11-09 16:52   ` Vladimir Sementsov-Ogievskiy
2022-11-10 10:16     ` Kevin Wolf
2022-11-10 11:25       ` Vladimir Sementsov-Ogievskiy
2022-11-10 17:27         ` Kevin Wolf
2022-11-14 18:21   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 09/13] block: Remove subtree drains Kevin Wolf
2022-11-09 17:22   ` Vladimir Sementsov-Ogievskiy
2022-11-14 18:22   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 10/13] block: Call drain callbacks only once Kevin Wolf
2022-11-09 18:05   ` Vladimir Sementsov-Ogievskiy
2022-11-14 12:32     ` Kevin Wolf
2022-11-09 18:54   ` Vladimir Sementsov-Ogievskiy
2022-11-14 18:23   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions Kevin Wolf
2022-11-09 18:57   ` Vladimir Sementsov-Ogievskiy
2022-11-14 18:23   ` Hanna Reitz
2022-11-08 12:37 ` [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm() Kevin Wolf
2022-11-11 11:21   ` Emanuele Giuseppe Esposito
2022-11-14 20:22   ` Hanna Reitz
2022-11-17 13:27     ` Kevin Wolf
2022-11-08 12:37 ` [PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single() Kevin Wolf
2022-11-14 20:24   ` Hanna Reitz
2022-11-10 20:13 ` [PATCH 00/13] block: Simplify drain Stefan Hajnoczi
2022-11-11 11:23 ` Emanuele Giuseppe Esposito
