qemu-devel.nongnu.org archive mirror
* [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch
@ 2020-12-14 17:05 Sergio Lopez
  2020-12-14 17:05 ` [PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements Sergio Lopez
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Sergio Lopez @ 2020-12-14 17:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Stefano Stabellini, Sergio Lopez,
	qemu-block, Paul Durrant, Michael S. Tsirkin, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, Anthony Perard, xen-devel

This series allows the NBD server to properly switch between AIO contexts,
having quiesced recv_coroutine and send_coroutine before doing the transition.

We need this because, when stopping the dataplane, we send devices running in
IOThread-owned contexts back to the main context, something that can happen
multiple times during the lifetime of a VM (usually during the boot sequence or
on a reboot), and we drag the NBD server of the corresponding export with it.

While there, also fix a problem caused by a cross-dependency between
closing the exports' client connections and draining the block
layer. The visible effect of this problem was QEMU hanging when the
guest requests a power off while there's an active NBD client.

v2:
 - Replace "virtio-blk: Acquire context while switching them on
 dataplane start" with "block: Honor blk_set_aio_context() context
 requirements" (Kevin Wolf)
 - Add "block: Avoid processing BDS twice in
 bdrv_set_aio_context_ignore()"
 - Add "block: Close block exports in two steps"
 - Rename nbd_read_eof() to nbd_server_read_eof() (Eric Blake)
 - Fix double space and typo in comment. (Eric Blake)

Sergio Lopez (4):
  block: Honor blk_set_aio_context() context requirements
  block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  nbd/server: Quiesce coroutines on context switch
  block: Close block exports in two steps

 block.c                         |  27 ++++++-
 block/export/export.c           |  10 +--
 blockdev-nbd.c                  |   2 +-
 hw/block/dataplane/virtio-blk.c |   4 ++
 hw/block/dataplane/xen-block.c  |   7 +-
 hw/scsi/virtio-scsi.c           |   6 +-
 include/block/export.h          |   4 +-
 nbd/server.c                    | 120 ++++++++++++++++++++++++++++----
 qemu-nbd.c                      |   2 +-
 stubs/blk-exp-close-all.c       |   2 +-
 10 files changed, 156 insertions(+), 28 deletions(-)

-- 
2.26.2




^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements
  2020-12-14 17:05 [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
@ 2020-12-14 17:05 ` Sergio Lopez
  2020-12-15 11:58   ` Kevin Wolf
  2020-12-14 17:05 ` [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore() Sergio Lopez
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Sergio Lopez @ 2020-12-14 17:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Stefano Stabellini, Sergio Lopez,
	qemu-block, Paul Durrant, Michael S. Tsirkin, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, Anthony Perard, xen-devel

The documentation for bdrv_set_aio_context_ignore() states this:

 * The caller must own the AioContext lock for the old AioContext of bs, but it
 * must not own the AioContext lock for new_context (unless new_context is the
 * same as the current context of bs).

As blk_set_aio_context() makes use of this function, this rule also
applies to it.

Fix all occurrences where this rule wasn't honored.
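
In practice, honoring the rule means callers follow this shape (a minimal
sketch distilled from the hunks below; 'blk', 'new_ctx' and 'errp' stand
for whatever the caller has at hand):

    AioContext *old_ctx = blk_get_aio_context(blk);
    int ret;

    /* Hold the lock of the old context across the switch... */
    aio_context_acquire(old_ctx);
    /* ...but don't hold the lock of new_ctx (unless it equals old_ctx). */
    ret = blk_set_aio_context(blk, new_ctx, errp);
    aio_context_release(old_ctx);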

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Sergio Lopez <slp@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 4 ++++
 hw/block/dataplane/xen-block.c  | 7 ++++++-
 hw/scsi/virtio-scsi.c           | 6 ++++--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 37499c5564..e9050c8987 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -172,6 +172,7 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
     VirtIOBlockDataPlane *s = vblk->dataplane;
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vblk)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
+    AioContext *old_context;
     unsigned i;
     unsigned nvqs = s->conf->num_queues;
     Error *local_err = NULL;
@@ -214,7 +215,10 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
     vblk->dataplane_started = true;
     trace_virtio_blk_data_plane_start(s);
 
+    old_context = blk_get_aio_context(s->conf->conf.blk);
+    aio_context_acquire(old_context);
     r = blk_set_aio_context(s->conf->conf.blk, s->ctx, &local_err);
+    aio_context_release(old_context);
     if (r < 0) {
         error_report_err(local_err);
         goto fail_guest_notifiers;
diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index 71c337c7b7..3675f8deaf 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -725,6 +725,7 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
 {
     ERRP_GUARD();
     XenDevice *xendev = dataplane->xendev;
+    AioContext *old_context;
     unsigned int ring_size;
     unsigned int i;
 
@@ -808,10 +809,14 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
         goto stop;
     }
 
-    aio_context_acquire(dataplane->ctx);
+    old_context = blk_get_aio_context(dataplane->blk);
+    aio_context_acquire(old_context);
     /* If other users keep the BlockBackend in the iothread, that's ok */
     blk_set_aio_context(dataplane->blk, dataplane->ctx, NULL);
+    aio_context_release(old_context);
+
     /* Only reason for failure is a NULL channel */
+    aio_context_acquire(dataplane->ctx);
     xen_device_set_event_channel_context(xendev, dataplane->event_channel,
                                          dataplane->ctx, &error_abort);
     aio_context_release(dataplane->ctx);
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 3db9a8aae9..7a347ceac5 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -821,15 +821,17 @@ static void virtio_scsi_hotplug(HotplugHandler *hotplug_dev, DeviceState *dev,
     VirtIODevice *vdev = VIRTIO_DEVICE(hotplug_dev);
     VirtIOSCSI *s = VIRTIO_SCSI(vdev);
     SCSIDevice *sd = SCSI_DEVICE(dev);
+    AioContext *old_context;
     int ret;
 
     if (s->ctx && !s->dataplane_fenced) {
         if (blk_op_is_blocked(sd->conf.blk, BLOCK_OP_TYPE_DATAPLANE, errp)) {
             return;
         }
-        virtio_scsi_acquire(s);
+        old_context = blk_get_aio_context(sd->conf.blk);
+        aio_context_acquire(old_context);
         ret = blk_set_aio_context(sd->conf.blk, s->ctx, errp);
-        virtio_scsi_release(s);
+        aio_context_release(old_context);
         if (ret < 0) {
             return;
         }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-14 17:05 [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
  2020-12-14 17:05 ` [PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements Sergio Lopez
@ 2020-12-14 17:05 ` Sergio Lopez
  2020-12-15 12:12   ` Kevin Wolf
  2020-12-14 17:05 ` [PATCH v2 3/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Sergio Lopez @ 2020-12-14 17:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Stefano Stabellini, Sergio Lopez,
	qemu-block, Paul Durrant, Michael S. Tsirkin, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, Anthony Perard, xen-devel

While processing the parents of a BDS, one of the parents may process
the child that's doing the tail recursion, which leads to a BDS being
processed twice. This is especially problematic for the aio_notifiers,
as they might attempt to work on both the old and the new AIO
contexts.

To avoid this, add the BDS pointer to the ignore list, and check the
child BDS pointer while iterating over the children.

Signed-off-by: Sergio Lopez <slp@redhat.com>
---
 block.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index f1cedac362..bc8a66ab6e 100644
--- a/block.c
+++ b/block.c
@@ -6465,12 +6465,17 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
     bdrv_drained_begin(bs);
 
     QLIST_FOREACH(child, &bs->children, next) {
-        if (g_slist_find(*ignore, child)) {
+        if (g_slist_find(*ignore, child) || g_slist_find(*ignore, child->bs)) {
             continue;
         }
         *ignore = g_slist_prepend(*ignore, child);
         bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
     }
+    /*
+     * Add a reference to this BS to the ignore list, so its
+     * parents won't attempt to process it again.
+     */
+    *ignore = g_slist_prepend(*ignore, bs);
     QLIST_FOREACH(child, &bs->parents, next_parent) {
         if (g_slist_find(*ignore, child)) {
             continue;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 3/4] nbd/server: Quiesce coroutines on context switch
  2020-12-14 17:05 [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
  2020-12-14 17:05 ` [PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements Sergio Lopez
  2020-12-14 17:05 ` [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore() Sergio Lopez
@ 2020-12-14 17:05 ` Sergio Lopez
  2020-12-14 17:05 ` [PATCH v2 4/4] block: Close block exports in two steps Sergio Lopez
  2021-01-20 20:49 ` [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Eric Blake
  4 siblings, 0 replies; 25+ messages in thread
From: Sergio Lopez @ 2020-12-14 17:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Stefano Stabellini, Sergio Lopez,
	qemu-block, Paul Durrant, Michael S. Tsirkin, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, Anthony Perard, xen-devel

When switching between AIO contexts we need to make sure that both
recv_coroutine and send_coroutine are not scheduled to run. Otherwise,
QEMU may crash while attaching the new context with an error like
this one:

aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'

To achieve this we need a local implementation of
'qio_channel_readv_all_eof' named 'nbd_read_eof' (a trick already done
by 'nbd/client.c') that allows us to interrupt the operation and to
know when recv_coroutine is yielding.

With this in place, we delegate detaching the AIO context to the
owning context with a BH ('nbd_aio_detach_bh') scheduled using
'aio_wait_bh_oneshot'. This BH signals that we need to quiesce the
channel by setting 'client->quiescing' to 'true', and either waits for
the coroutine to finish using AIO_WAIT_WHILE or, if it's yielding in
'nbd_read_eof', actively enters the coroutine to interrupt it.

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900326
Signed-off-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 nbd/server.c | 120 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 106 insertions(+), 14 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 613ed2634a..7229f487d2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -132,6 +132,9 @@ struct NBDClient {
     CoMutex send_lock;
     Coroutine *send_coroutine;
 
+    bool read_yielding;
+    bool quiescing;
+
     QTAILQ_ENTRY(NBDClient) next;
     int nb_requests;
     bool closing;
@@ -1352,14 +1355,60 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp)
     return 0;
 }
 
-static int nbd_receive_request(QIOChannel *ioc, NBDRequest *request,
+/* nbd_read_eof
+ * Tries to read @size bytes from @ioc. This is a local implementation of
+ * qio_channel_readv_all_eof. We have it here because we need it to be
+ * interruptible and to know when the coroutine is yielding.
+ * Returns 1 on success
+ *         0 on eof, when no data was read (errp is not set)
+ *         negative errno on failure (errp is set)
+ */
+static inline int coroutine_fn
+nbd_read_eof(NBDClient *client, void *buffer, size_t size, Error **errp)
+{
+    bool partial = false;
+
+    assert(size);
+    while (size > 0) {
+        struct iovec iov = { .iov_base = buffer, .iov_len = size };
+        ssize_t len;
+
+        len = qio_channel_readv(client->ioc, &iov, 1, errp);
+        if (len == QIO_CHANNEL_ERR_BLOCK) {
+            client->read_yielding = true;
+            qio_channel_yield(client->ioc, G_IO_IN);
+            client->read_yielding = false;
+            if (client->quiescing) {
+                return -EAGAIN;
+            }
+            continue;
+        } else if (len < 0) {
+            return -EIO;
+        } else if (len == 0) {
+            if (partial) {
+                error_setg(errp,
+                           "Unexpected end-of-file before all bytes were read");
+                return -EIO;
+            } else {
+                return 0;
+            }
+        }
+
+        partial = true;
+        size -= len;
+        buffer = (uint8_t *) buffer + len;
+    }
+    return 1;
+}
+
+static int nbd_receive_request(NBDClient *client, NBDRequest *request,
                                Error **errp)
 {
     uint8_t buf[NBD_REQUEST_SIZE];
     uint32_t magic;
     int ret;
 
-    ret = nbd_read(ioc, buf, sizeof(buf), "request", errp);
+    ret = nbd_read_eof(client, buf, sizeof(buf), errp);
     if (ret < 0) {
         return ret;
     }
@@ -1480,11 +1529,37 @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
 
     QTAILQ_FOREACH(client, &exp->clients, next) {
         qio_channel_attach_aio_context(client->ioc, ctx);
+
+        assert(client->recv_coroutine == NULL);
+        assert(client->send_coroutine == NULL);
+
+        if (client->quiescing) {
+            client->quiescing = false;
+            nbd_client_receive_next_request(client);
+        }
+    }
+}
+
+static void nbd_aio_detach_bh(void *opaque)
+{
+    NBDExport *exp = opaque;
+    NBDClient *client;
+
+    QTAILQ_FOREACH(client, &exp->clients, next) {
+        qio_channel_detach_aio_context(client->ioc);
+        client->quiescing = true;
+
         if (client->recv_coroutine) {
-            aio_co_schedule(ctx, client->recv_coroutine);
+            if (client->read_yielding) {
+                qemu_aio_coroutine_enter(exp->common.ctx,
+                                         client->recv_coroutine);
+            } else {
+                AIO_WAIT_WHILE(exp->common.ctx, client->recv_coroutine != NULL);
+            }
         }
+
         if (client->send_coroutine) {
-            aio_co_schedule(ctx, client->send_coroutine);
+            AIO_WAIT_WHILE(exp->common.ctx, client->send_coroutine != NULL);
         }
     }
 }
@@ -1492,13 +1567,10 @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
 static void blk_aio_detach(void *opaque)
 {
     NBDExport *exp = opaque;
-    NBDClient *client;
 
     trace_nbd_blk_aio_detach(exp->name, exp->common.ctx);
 
-    QTAILQ_FOREACH(client, &exp->clients, next) {
-        qio_channel_detach_aio_context(client->ioc);
-    }
+    aio_wait_bh_oneshot(exp->common.ctx, nbd_aio_detach_bh, exp);
 
     exp->common.ctx = NULL;
 }
@@ -2151,20 +2223,23 @@ static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle,
 
 /* nbd_co_receive_request
  * Collect a client request. Return 0 if request looks valid, -EIO to drop
- * connection right away, and any other negative value to report an error to
- * the client (although the caller may still need to disconnect after reporting
- * the error).
+ * connection right away, -EAGAIN to indicate we were interrupted and the
+ * channel should be quiesced, and any other negative value to report an error
+ * to the client (although the caller may still need to disconnect after
+ * reporting the error).
  */
 static int nbd_co_receive_request(NBDRequestData *req, NBDRequest *request,
                                   Error **errp)
 {
     NBDClient *client = req->client;
     int valid_flags;
+    int ret;
 
     g_assert(qemu_in_coroutine());
     assert(client->recv_coroutine == qemu_coroutine_self());
-    if (nbd_receive_request(client->ioc, request, errp) < 0) {
-        return -EIO;
+    ret = nbd_receive_request(client, request, errp);
+    if (ret < 0) {
+        return  ret;
     }
 
     trace_nbd_co_receive_request_decode_type(request->handle, request->type,
@@ -2507,6 +2582,17 @@ static coroutine_fn void nbd_trip(void *opaque)
         return;
     }
 
+    if (client->quiescing) {
+        /*
+         * We're switching between AIO contexts. Don't attempt to receive a new
+         * request and kick the main context which may be waiting for us.
+         */
+        nbd_client_put(client);
+        client->recv_coroutine = NULL;
+        aio_wait_kick();
+        return;
+    }
+
     req = nbd_request_get(client);
     ret = nbd_co_receive_request(req, &request, &local_err);
     client->recv_coroutine = NULL;
@@ -2519,6 +2605,11 @@ static coroutine_fn void nbd_trip(void *opaque)
         goto done;
     }
 
+    if (ret == -EAGAIN) {
+        assert(client->quiescing);
+        goto done;
+    }
+
     nbd_client_receive_next_request(client);
     if (ret == -EIO) {
         goto disconnect;
@@ -2565,7 +2656,8 @@ disconnect:
 
 static void nbd_client_receive_next_request(NBDClient *client)
 {
-    if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS) {
+    if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS &&
+        !client->quiescing) {
         nbd_client_get(client);
         client->recv_coroutine = qemu_coroutine_create(nbd_trip, client);
         aio_co_schedule(client->exp->common.ctx, client->recv_coroutine);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 4/4] block: Close block exports in two steps
  2020-12-14 17:05 [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
                   ` (2 preceding siblings ...)
  2020-12-14 17:05 ` [PATCH v2 3/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
@ 2020-12-14 17:05 ` Sergio Lopez
  2020-12-15 15:34   ` Kevin Wolf
  2021-01-20 20:49 ` [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Eric Blake
  4 siblings, 1 reply; 25+ messages in thread
From: Sergio Lopez @ 2020-12-14 17:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Stefano Stabellini, Sergio Lopez,
	qemu-block, Paul Durrant, Michael S. Tsirkin, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, Anthony Perard, xen-devel

There's a cross-dependency between closing the block exports and
draining the block layer. The latter needs that we close all export's
client connections to ensure they won't queue more requests, but the
exports may have coroutines yielding in the block layer, which implies
they can't be fully closed until we drain it.

To break this cross-dependency, this change adds a "bool wait"
argument to blk_exp_close_all() and blk_exp_close_all_type(), so
callers can decide whether they want to wait for the exports to be
fully quiesced, or just return after requesting them to shut down.

Then, in bdrv_close_all we make two calls, one without waiting to
close all client connections, and another after draining the block
layer, this time waiting for the exports to be fully quiesced.

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900505
Signed-off-by: Sergio Lopez <slp@redhat.com>
---
 block.c                   | 20 +++++++++++++++++++-
 block/export/export.c     | 10 ++++++----
 blockdev-nbd.c            |  2 +-
 include/block/export.h    |  4 ++--
 qemu-nbd.c                |  2 +-
 stubs/blk-exp-close-all.c |  2 +-
 6 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index bc8a66ab6e..41db70ac07 100644
--- a/block.c
+++ b/block.c
@@ -4472,13 +4472,31 @@ static void bdrv_close(BlockDriverState *bs)
 void bdrv_close_all(void)
 {
     assert(job_next(NULL) == NULL);
-    blk_exp_close_all();
+
+    /*
+     * There's a cross-dependency between closing the block exports and
+     * draining the block layer. The latter needs that we close all export's
+     * client connections to ensure they won't queue more requests, but the
+     * exports may have coroutines yielding in the block layer, which implies
+     * they can't be fully closed until we drain it.
+     *
+     * Make a first call to close all export's client connections, without
+     * waiting for each export to be fully quiesced.
+     */
+    blk_exp_close_all(false);
 
     /* Drop references from requests still in flight, such as canceled block
      * jobs whose AIO context has not been polled yet */
     bdrv_drain_all();
 
     blk_remove_all_bs();
+
+    /*
+     * Make a second call to shut down the exports, this time waiting for them
+     * to be fully quiesced.
+     */
+    blk_exp_close_all(true);
+
     blockdev_close_all_bdrv_states();
 
     assert(QTAILQ_EMPTY(&all_bdrv_states));
diff --git a/block/export/export.c b/block/export/export.c
index bad6f21b1c..0124ebd9f9 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -280,7 +280,7 @@ static bool blk_exp_has_type(BlockExportType type)
 }
 
 /* type == BLOCK_EXPORT_TYPE__MAX for all types */
-void blk_exp_close_all_type(BlockExportType type)
+void blk_exp_close_all_type(BlockExportType type, bool wait)
 {
     BlockExport *exp, *next;
 
@@ -293,12 +293,14 @@ void blk_exp_close_all_type(BlockExportType type)
         blk_exp_request_shutdown(exp);
     }
 
-    AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
+    if (wait) {
+        AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
+    }
 }
 
-void blk_exp_close_all(void)
+void blk_exp_close_all(bool wait)
 {
-    blk_exp_close_all_type(BLOCK_EXPORT_TYPE__MAX);
+    blk_exp_close_all_type(BLOCK_EXPORT_TYPE__MAX, wait);
 }
 
 void qmp_block_export_add(BlockExportOptions *export, Error **errp)
diff --git a/blockdev-nbd.c b/blockdev-nbd.c
index d8443d235b..d71d4da7c2 100644
--- a/blockdev-nbd.c
+++ b/blockdev-nbd.c
@@ -266,7 +266,7 @@ void qmp_nbd_server_stop(Error **errp)
         return;
     }
 
-    blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD);
+    blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD, true);
 
     nbd_server_free(nbd_server);
     nbd_server = NULL;
diff --git a/include/block/export.h b/include/block/export.h
index 7feb02e10d..71c25928ce 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -83,7 +83,7 @@ BlockExport *blk_exp_find(const char *id);
 void blk_exp_ref(BlockExport *exp);
 void blk_exp_unref(BlockExport *exp);
 void blk_exp_request_shutdown(BlockExport *exp);
-void blk_exp_close_all(void);
-void blk_exp_close_all_type(BlockExportType type);
+void blk_exp_close_all(bool wait);
+void blk_exp_close_all_type(BlockExportType type, bool wait);
 
 #endif
diff --git a/qemu-nbd.c b/qemu-nbd.c
index a7075c5419..928f4466f6 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -1122,7 +1122,7 @@ int main(int argc, char **argv)
     do {
         main_loop_wait(false);
         if (state == TERMINATE) {
-            blk_exp_close_all();
+            blk_exp_close_all(true);
             state = TERMINATED;
         }
     } while (state != TERMINATED);
diff --git a/stubs/blk-exp-close-all.c b/stubs/blk-exp-close-all.c
index 1c71316763..ecd0ce611f 100644
--- a/stubs/blk-exp-close-all.c
+++ b/stubs/blk-exp-close-all.c
@@ -2,6 +2,6 @@
 #include "block/export.h"
 
 /* Only used in programs that support block exports (libblockdev.fa) */
-void blk_exp_close_all(void)
+void blk_exp_close_all(bool wait)
 {
 }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements
  2020-12-14 17:05 ` [PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements Sergio Lopez
@ 2020-12-15 11:58   ` Kevin Wolf
  0 siblings, 0 replies; 25+ messages in thread
From: Kevin Wolf @ 2020-12-15 11:58 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On 14.12.2020 at 18:05, Sergio Lopez wrote:
> The documentation for bdrv_set_aio_context_ignore() states this:
> 
>  * The caller must own the AioContext lock for the old AioContext of bs, but it
>  * must not own the AioContext lock for new_context (unless new_context is the
>  * same as the current context of bs).
> 
> As blk_set_aio_context() makes use of this function, this rule also
> applies to it.
> 
> Fix all occurrences where this rule wasn't honored.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Sergio Lopez <slp@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-14 17:05 ` [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore() Sergio Lopez
@ 2020-12-15 12:12   ` Kevin Wolf
  2020-12-15 13:15     ` Sergio Lopez
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin Wolf @ 2020-12-15 12:12 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On 14.12.2020 at 18:05, Sergio Lopez wrote:
> While processing the parents of a BDS, one of the parents may process
> the child that's doing the tail recursion, which leads to a BDS being
> processed twice. This is especially problematic for the aio_notifiers,
> as they might attempt to work on both the old and the new AIO
> contexts.
> 
> To avoid this, add the BDS pointer to the ignore list, and check the
> child BDS pointer while iterating over the children.
> 
> Signed-off-by: Sergio Lopez <slp@redhat.com>

Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/

What is the specific scenario where you saw this breaking? Did you have
multiple BdrvChild connections between two nodes so that we would go to
the parent node through one and then come back to the child node through
the other?

Maybe if what we really need to do is not processing every edge once,
but processing every node once, the list should be changed to contain
_only_ BDS objects. But then blk_do_set_aio_context() probably won't
work any more because it can't have blk->root ignored any more...

Anyway, if we end up changing what the list contains, the comment needs
an update, too. Currently it says:

 * @ignore will accumulate all visited BdrvChild object. The caller is
 * responsible for freeing the list afterwards.

Another option: Split the parents QLIST_FOREACH loop in two. First add
all parent BdrvChild objects to the ignore list, remember which of them
were newly added, and only after adding all of them call
child->klass->set_aio_ctx() for each parent that was previously not on
the ignore list. This will avoid that we come back to the same node
because all of its incoming edges are ignored now.
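
In code, that two-pass variant might look roughly like this (an untested
sketch based on the existing parents loop, not something from this series):

    GSList *added = NULL;

    /* Pass 1: mark every not-yet-ignored parent edge as visited. */
    QLIST_FOREACH(child, &bs->parents, next_parent) {
        if (g_slist_find(*ignore, child)) {
            continue;
        }
        *ignore = g_slist_prepend(*ignore, child);
        added = g_slist_prepend(added, child);
    }

    /* Pass 2: only now recurse into the parents added in pass 1. */
    QLIST_FOREACH(child, &bs->parents, next_parent) {
        if (g_slist_find(added, child)) {
            assert(child->klass->set_aio_ctx);
            child->klass->set_aio_ctx(child, new_context, ignore);
        }
    }

    g_slist_free(added);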

Kevin

>  block.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block.c b/block.c
> index f1cedac362..bc8a66ab6e 100644
> --- a/block.c
> +++ b/block.c
> @@ -6465,12 +6465,17 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
>      bdrv_drained_begin(bs);
>  
>      QLIST_FOREACH(child, &bs->children, next) {
> -        if (g_slist_find(*ignore, child)) {
> +        if (g_slist_find(*ignore, child) || g_slist_find(*ignore, child->bs)) {
>              continue;
>          }
>          *ignore = g_slist_prepend(*ignore, child);
>          bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
>      }
> +    /*
> +     * Add a reference to this BS to the ignore list, so its
> +     * parents won't attempt to process it again.
> +     */
> +    *ignore = g_slist_prepend(*ignore, bs);
>      QLIST_FOREACH(child, &bs->parents, next_parent) {
>          if (g_slist_find(*ignore, child)) {
>              continue;
> -- 
> 2.26.2
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-15 12:12   ` Kevin Wolf
@ 2020-12-15 13:15     ` Sergio Lopez
  2020-12-15 15:01       ` Kevin Wolf
  0 siblings, 1 reply; 25+ messages in thread
From: Sergio Lopez @ 2020-12-15 13:15 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> On 14.12.2020 at 18:05, Sergio Lopez wrote:
> > While processing the parents of a BDS, one of the parents may process
> > the child that's doing the tail recursion, which leads to a BDS being
> > processed twice. This is especially problematic for the aio_notifiers,
> > as they might attempt to work on both the old and the new AIO
> > contexts.
> > 
> > To avoid this, add the BDS pointer to the ignore list, and check the
> > child BDS pointer while iterating over the children.
> > 
> > Signed-off-by: Sergio Lopez <slp@redhat.com>
> 
> Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/

I know, it's effective but quite ugly...

> What is the specific scenario where you saw this breaking? Did you have
> multiple BdrvChild connections between two nodes so that we would go to
> the parent node through one and then come back to the child node through
> the other?

I don't think this is a corner case. If the graph is walked top->down,
there's no problem since children are added to the ignore list before
getting processed, and siblings don't process each other. But, if the
graph is walked bottom->up, a BDS will start processing its parents
without adding itself to the ignore list, so there's nothing
preventing them from processing it again.

I'm pasting here an annotated trace of bdrv_set_aio_context_ignore I
generated while triggering the issue:

<----- begin ------>
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing itself
bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing parents

 - We enter b_s_a_c_i with BDS 2fbf660 the first time:
 
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children

 - We enter b_s_a_c_i with BDS 3bc0c00, a child of 2fbf660:
 
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 enter
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing children
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing itself

 - We start processing its parents:
 
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents

 - We enter b_s_a_c_i with BDS 2e48030, a parent of 2fbf660:
 
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children

 - We enter b_s_a_c_i with BDS 2fbf660 again, because parent
   2e48030 didn't find it in the ignore list:
   
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing parents
bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing itself

 - BDS 2fbf660 will be processed here a second time, triggering the
   issue:
   
bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself
<----- end ------>

I suspect this has been happening for a while, and has only surfaced
now due to the need to run an AIO context BH in an aio_notifier
function that the "nbd/server: Quiesce coroutines on context switch"
patch introduces. There the problem is that the first time the
aio_notifier AIO detach function is called, it works on the old
context (as it should be), and the second one works on the new context
(which is wrong).
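
(For reference, the aio_notifier in question is the pair of attach/detach
hooks the export registers on its BlockBackend, roughly along the lines
of:

    blk_add_aio_context_notifier(blk, blk_aio_attached, blk_aio_detach, exp);

so a node that gets processed twice also has its detach hook invoked
twice.)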

> Maybe if what we really need to do is not processing every edge once,
> but processing every node once, the list should be changed to contain
> _only_ BDS objects. But then blk_do_set_aio_context() probably won't
> work any more because it can't have blk->root ignored any more...

I tried that in my first attempt and it broke badly. I didn't take a
deeper look at the causes.

> Anyway, if we end up changing what the list contains, the comment needs
> an update, too. Currently it says:
> 
>  * @ignore will accumulate all visited BdrvChild object. The caller is
>  * responsible for freeing the list afterwards.
> 
> Another option: Split the parents QLIST_FOREACH loop in two. First add
> all parent BdrvChild objects to the ignore list, remember which of them
> were newly added, and only after adding all of them call
> child->klass->set_aio_ctx() for each parent that was previously not on
> the ignore list. This will avoid that we come back to the same node
> because all of its incoming edges are ignored now.

I don't think this strategy will fix the issue illustrated in the
trace above, as the BdrvChild pointer of the BDS processing its
parents won't be on the ignore list by the time one of its parents
starts processing its own children.

Thanks,
Sergio.

> >  block.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block.c b/block.c
> > index f1cedac362..bc8a66ab6e 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -6465,12 +6465,17 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
> >      bdrv_drained_begin(bs);
> >  
> >      QLIST_FOREACH(child, &bs->children, next) {
> > -        if (g_slist_find(*ignore, child)) {
> > +        if (g_slist_find(*ignore, child) || g_slist_find(*ignore, child->bs)) {
> >              continue;
> >          }
> >          *ignore = g_slist_prepend(*ignore, child);
> >          bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
> >      }
> > +    /*
> > +     * Add a reference to this BS to the ignore list, so its
> > +     * parents won't attempt to process it again.
> > +     */
> > +    *ignore = g_slist_prepend(*ignore, bs);
> >      QLIST_FOREACH(child, &bs->parents, next_parent) {
> >          if (g_slist_find(*ignore, child)) {
> >              continue;
> > -- 
> > 2.26.2
> > 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-15 13:15     ` Sergio Lopez
@ 2020-12-15 15:01       ` Kevin Wolf
  2020-12-15 17:23         ` Sergio Lopez
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin Wolf @ 2020-12-15 15:01 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On 15.12.2020 at 14:15, Sergio Lopez wrote:
> On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > On 14.12.2020 at 18:05, Sergio Lopez wrote:
> > > While processing the parents of a BDS, one of the parents may process
> > > the child that's doing the tail recursion, which leads to a BDS being
> > > processed twice. This is especially problematic for the aio_notifiers,
> > > as they might attempt to work on both the old and the new AIO
> > > contexts.
> > > 
> > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > child BDS pointer while iterating over the children.
> > > 
> > > Signed-off-by: Sergio Lopez <slp@redhat.com>
> > 
> > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> 
> I know, it's effective but quite ugly...
> 
> > What is the specific scenario where you saw this breaking? Did you have
> > multiple BdrvChild connections between two nodes so that we would go to
> > the parent node through one and then come back to the child node through
> > the other?
> 
> I don't think this is a corner case. If the graph is walked top->down,
> there's no problem since children are added to the ignore list before
> getting processed, and siblings don't process each other. But, if the
> graph is walked bottom->up, a BDS will start processing its parents
> without adding itself to the ignore list, so there's nothing
> preventing them from processing it again.

I don't understand. child is added to ignore before calling the parent
callback on it, so how can we come back through the same BdrvChild?

    QLIST_FOREACH(child, &bs->parents, next_parent) {
        if (g_slist_find(*ignore, child)) {
            continue;
        }
        assert(child->klass->set_aio_ctx);
        *ignore = g_slist_prepend(*ignore, child);
        child->klass->set_aio_ctx(child, new_context, ignore);
    }

> I'm pasting here an annotated trace of bdrv_set_aio_context_ignore I
> generated while triggering the issue:
> 
> <----- begin ------>
> bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter
> bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children
> bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 enter
> bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing children
> bdrv_set_aio_context_ignore: bs=0x555ee2e52060 enter
> bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing children
> bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing parents
> bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing itself
> bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing parents
> 
>  - We enter b_s_a_c_i with BDS 2fbf660 the first time:
>  
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children
> 
>  - We enter b_s_a_c_i with BDS 3bc0c00, a child of 2fbf660:
>  
> bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 enter
> bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing children
> bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing parents

> 
>  - We start processing its parents:
>  
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents
> 
>  - We enter b_s_a_c_i with BDS 2e48030, a parent of 2fbf660:
>  
> bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter
> bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children
> 
>  - We enter b_s_a_c_i with BDS 2fbf660 again, because parent
>    2e48030 didn't find it in the ignore list:
>    
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself
> bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing parents
> bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing itself
> 
>  - BDS 2fbf660 will be processed here a second time, triggering the
>    issue:
>    
> bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself
> <----- end ------>

You didn't dump the BdrvChild here. I think that would add some
information on why we re-entered 0x555ee2fbf660. Maybe you can also add
bs->drv->format_name for each node to make the scenario less abstract?

So far my reconstruction of the graph is something like this:

0x555ee2e48030 --+
   |  |          |
   |  |          +-> 0x555ee2e5d420 -> 0x555ee2e52060
   v  v          |
0x555ee2fbf660 --+
           |
           +-------> 0x555ee3bc0c00

It doesn't look quite trivial, but if 0x555ee2e48030 is the filter node
of a block job, it's not hard to imagine either.

> I suspect this has been happening for a while, and has only surfaced
> now due to the need to run an AIO context BH in an aio_notifier
> function that the "nbd/server: Quiesce coroutines on context switch"
> patch introduces. There the problem is that the first time the
> aio_notifier AIO detach function is called, it works on the old
> context (as it should be), and the second one works on the new context
> (which is wrong).
> 
> > Maybe if what we really need to do is not processing every edge once,
> > but processing every node once, the list should be changed to contain
> > _only_ BDS objects. But then blk_do_set_aio_context() probably won't
> > work any more because it can't have blk->root ignored any more...
> 
> I tried that in my first attempt and it broke badly. I didn't take a
> deeper look at the causes.
> 
> > Anyway, if we end up changing what the list contains, the comment needs
> > an update, too. Currently it says:
> > 
> >  * @ignore will accumulate all visited BdrvChild object. The caller is
> >  * responsible for freeing the list afterwards.
> > 
> > Another option: Split the parents QLIST_FOREACH loop in two. First add
> > all parent BdrvChild objects to the ignore list, remember which of them
> > were newly added, and only after adding all of them call
> > child->klass->set_aio_ctx() for each parent that was previously not on
> > the ignore list. This will avoid that we come back to the same node
> > because all of its incoming edges are ignored now.
> 
> I don't think this strategy will fix the issue illustrated in the
> trace above, as the BdrvChild pointer of the BDS processing its
> parents won't be on the ignore list by the time one of its parents
> starts processing its own children.

But why? We do append to the ignore list each time before we recurse
into a child or parent node. The only way I see is if you have two
separate BdrvChild links between the nodes.

Kevin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: Close block exports in two steps
  2020-12-14 17:05 ` [PATCH v2 4/4] block: Close block exports in two steps Sergio Lopez
@ 2020-12-15 15:34   ` Kevin Wolf
  2020-12-15 17:26     ` Sergio Lopez
  2020-12-21 17:07     ` Sergio Lopez
  0 siblings, 2 replies; 25+ messages in thread
From: Kevin Wolf @ 2020-12-15 15:34 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On 14.12.2020 at 18:05, Sergio Lopez wrote:
> There's a cross-dependency between closing the block exports and
> draining the block layer. The latter needs that we close all export's
> client connections to ensure they won't queue more requests, but the
> exports may have coroutines yielding in the block layer, which implies
> they can't be fully closed until we drain it.

A coroutine that yielded must have some way to be reentered. So I guess
the question becomes why they aren't reentered until drain. We do
process events:

    AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));

So in theory, anything that would finalise the block export closing
should still execute.

What is the difference that drain makes compared to a simple
AIO_WAIT_WHILE, so that coroutines are reentered during drain, but not
during AIO_WAIT_WHILE?

This is an even more interesting question because the NBD server is neither a
block node nor a BdrvChildClass implementation, so it shouldn't even
notice a drain operation.

Kevin

> To break this cross-dependency, this change adds a "bool wait"
> argument to blk_exp_close_all() and blk_exp_close_all_type(), so
> callers can decide whether they want to wait for the exports to be
> fully quiesced, or just return after requesting them to shut down.
> 
> Then, in bdrv_close_all we make two calls, one without waiting to
> close all client connections, and another after draining the block
> layer, this time waiting for the exports to be fully quiesced.
> 
> RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900505
> Signed-off-by: Sergio Lopez <slp@redhat.com>
> ---
>  block.c                   | 20 +++++++++++++++++++-
>  block/export/export.c     | 10 ++++++----
>  blockdev-nbd.c            |  2 +-
>  include/block/export.h    |  4 ++--
>  qemu-nbd.c                |  2 +-
>  stubs/blk-exp-close-all.c |  2 +-
>  6 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/block.c b/block.c
> index bc8a66ab6e..41db70ac07 100644
> --- a/block.c
> +++ b/block.c
> @@ -4472,13 +4472,31 @@ static void bdrv_close(BlockDriverState *bs)
>  void bdrv_close_all(void)
>  {
>      assert(job_next(NULL) == NULL);
> -    blk_exp_close_all();
> +
> +    /*
> +     * There's a cross-dependency between closing the block exports and
> +     * draining the block layer. The latter needs that we close all export's
> +     * client connections to ensure they won't queue more requests, but the
> +     * exports may have coroutines yielding in the block layer, which implies
> +     * they can't be fully closed until we drain it.
> +     *
> +     * Make a first call to close all export's client connections, without
> +     * waiting for each export to be fully quiesced.
> +     */
> +    blk_exp_close_all(false);
>  
>      /* Drop references from requests still in flight, such as canceled block
>       * jobs whose AIO context has not been polled yet */
>      bdrv_drain_all();
>  
>      blk_remove_all_bs();
> +
> +    /*
> +     * Make a second call to shut down the exports, this time waiting for them
> +     * to be fully quiesced.
> +     */
> +    blk_exp_close_all(true);
> +
>      blockdev_close_all_bdrv_states();
>  
>      assert(QTAILQ_EMPTY(&all_bdrv_states));
> diff --git a/block/export/export.c b/block/export/export.c
> index bad6f21b1c..0124ebd9f9 100644
> --- a/block/export/export.c
> +++ b/block/export/export.c
> @@ -280,7 +280,7 @@ static bool blk_exp_has_type(BlockExportType type)
>  }
>  
>  /* type == BLOCK_EXPORT_TYPE__MAX for all types */
> -void blk_exp_close_all_type(BlockExportType type)
> +void blk_exp_close_all_type(BlockExportType type, bool wait)
>  {
>      BlockExport *exp, *next;
>  
> @@ -293,12 +293,14 @@ void blk_exp_close_all_type(BlockExportType type)
>          blk_exp_request_shutdown(exp);
>      }
>  
> -    AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
> +    if (wait) {
> +        AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
> +    }
>  }
>  
> -void blk_exp_close_all(void)
> +void blk_exp_close_all(bool wait)
>  {
> -    blk_exp_close_all_type(BLOCK_EXPORT_TYPE__MAX);
> +    blk_exp_close_all_type(BLOCK_EXPORT_TYPE__MAX, wait);
>  }
>  
>  void qmp_block_export_add(BlockExportOptions *export, Error **errp)
> diff --git a/blockdev-nbd.c b/blockdev-nbd.c
> index d8443d235b..d71d4da7c2 100644
> --- a/blockdev-nbd.c
> +++ b/blockdev-nbd.c
> @@ -266,7 +266,7 @@ void qmp_nbd_server_stop(Error **errp)
>          return;
>      }
>  
> -    blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD);
> +    blk_exp_close_all_type(BLOCK_EXPORT_TYPE_NBD, true);
>  
>      nbd_server_free(nbd_server);
>      nbd_server = NULL;
> diff --git a/include/block/export.h b/include/block/export.h
> index 7feb02e10d..71c25928ce 100644
> --- a/include/block/export.h
> +++ b/include/block/export.h
> @@ -83,7 +83,7 @@ BlockExport *blk_exp_find(const char *id);
>  void blk_exp_ref(BlockExport *exp);
>  void blk_exp_unref(BlockExport *exp);
>  void blk_exp_request_shutdown(BlockExport *exp);
> -void blk_exp_close_all(void);
> -void blk_exp_close_all_type(BlockExportType type);
> +void blk_exp_close_all(bool wait);
> +void blk_exp_close_all_type(BlockExportType type, bool wait);
>  
>  #endif
> diff --git a/qemu-nbd.c b/qemu-nbd.c
> index a7075c5419..928f4466f6 100644
> --- a/qemu-nbd.c
> +++ b/qemu-nbd.c
> @@ -1122,7 +1122,7 @@ int main(int argc, char **argv)
>      do {
>          main_loop_wait(false);
>          if (state == TERMINATE) {
> -            blk_exp_close_all();
> +            blk_exp_close_all(true);
>              state = TERMINATED;
>          }
>      } while (state != TERMINATED);
> diff --git a/stubs/blk-exp-close-all.c b/stubs/blk-exp-close-all.c
> index 1c71316763..ecd0ce611f 100644
> --- a/stubs/blk-exp-close-all.c
> +++ b/stubs/blk-exp-close-all.c
> @@ -2,6 +2,6 @@
>  #include "block/export.h"
>  
>  /* Only used in programs that support block exports (libblockdev.fa) */
> -void blk_exp_close_all(void)
> +void blk_exp_close_all(bool wait)
>  {
>  }
> -- 
> 2.26.2
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-15 15:01       ` Kevin Wolf
@ 2020-12-15 17:23         ` Sergio Lopez
  2020-12-16 12:35           ` Kevin Wolf
  0 siblings, 1 reply; 25+ messages in thread
From: Sergio Lopez @ 2020-12-15 17:23 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> On 15.12.2020 at 14:15, Sergio Lopez wrote:
> > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > On 14.12.2020 at 18:05, Sergio Lopez wrote:
> > > > While processing the parents of a BDS, one of the parents may process
> > > > the child that's doing the tail recursion, which leads to a BDS being
> > > > processed twice. This is especially problematic for the aio_notifiers,
> > > > as they might attempt to work on both the old and the new AIO
> > > > contexts.
> > > > 
> > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > child BDS pointer while iterating over the children.
> > > > 
> > > > Signed-off-by: Sergio Lopez <slp@redhat.com>
> > > 
> > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > 
> > I know, it's effective but quite ugly...
> > 
> > > What is the specific scenario where you saw this breaking? Did you have
> > > multiple BdrvChild connections between two nodes so that we would go to
> > > the parent node through one and then come back to the child node through
> > > the other?
> > 
> > I don't think this is a corner case. If the graph is walked top->down,
> > there's no problem since children are added to the ignore list before
> > getting processed, and siblings don't process each other. But, if the
> > graph is walked bottom->up, a BDS will start processing its parents
> > without adding itself to the ignore list, so there's nothing
> > preventing them from processing it again.
> 
> I don't understand. child is added to ignore before calling the parent
> callback on it, so how can we come back through the same BdrvChild?
> 
>     QLIST_FOREACH(child, &bs->parents, next_parent) {
>         if (g_slist_find(*ignore, child)) {
>             continue;
>         }
>         assert(child->klass->set_aio_ctx);
>         *ignore = g_slist_prepend(*ignore, child);
>         child->klass->set_aio_ctx(child, new_context, ignore);
>     }

Perhaps I'm missing something, but the way I understand it, that loop
is adding the BdrvChild pointer of each of its parents, but not the
BdrvChild pointer of the BDS that was passed as an argument to
b_s_a_c_i.

> You didn't dump the BdrvChild here. I think that would add some
> information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> bs->drv->format_name for each node to make the scenario less abstract?

I've generated another trace with more data:

bs=0x565505e48030 (backup-top) enter
bs=0x565505e48030 (backup-top) processing children
bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
bs=0x565505e5d420 (qcow2) enter
bs=0x565505e5d420 (qcow2) processing children
bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
bs=0x565505e52060 (file) enter
bs=0x565505e52060 (file) processing children
bs=0x565505e52060 (file) processing parents
bs=0x565505e52060 (file) processing itself
bs=0x565505e5d420 (qcow2) processing parents
bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
bs=0x565505fbf660 (qcow2) enter
bs=0x565505fbf660 (qcow2) processing children
bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
bs=0x565506bc0c00 (file) enter
bs=0x565506bc0c00 (file) processing children
bs=0x565506bc0c00 (file) processing parents
bs=0x565506bc0c00 (file) processing itself
bs=0x565505fbf660 (qcow2) processing parents
bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
bs=0x565505e48030 (backup-top) enter
bs=0x565505e48030 (backup-top) processing children
bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
bs=0x565505fbf660 (qcow2) enter
bs=0x565505fbf660 (qcow2) processing children
bs=0x565505fbf660 (qcow2) processing parents
bs=0x565505fbf660 (qcow2) processing itself
bs=0x565505e48030 (backup-top) processing parents
bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
bs=0x565505e48030 (backup-top) processing itself
bs=0x565505fbf660 (qcow2) processing itself


So it seems this is happening:

backup-top (5e48030) <---------| (5)
   |    |                      |
   |    | (6) ------------> qcow2 (5fbf660)
   |                           ^    |
   |                       (3) |    | (4)
   |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
   |
   |-> (2) file (5e52060)

backup-top (5e48030), the BDS that was passed as argument in the first
bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
is processing its parents, and the latter is also re-entered when the
first one starts processing its children again.

Sergio.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: Close block exports in two steps
  2020-12-15 15:34   ` Kevin Wolf
@ 2020-12-15 17:26     ` Sergio Lopez
  2020-12-21 17:07     ` Sergio Lopez
  1 sibling, 0 replies; 25+ messages in thread
From: Sergio Lopez @ 2020-12-15 17:26 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On Tue, Dec 15, 2020 at 04:34:05PM +0100, Kevin Wolf wrote:
> On 14.12.2020 at 18:05, Sergio Lopez wrote:
> > There's a cross-dependency between closing the block exports and
> > draining the block layer. The latter needs that we close all export's
> > client connections to ensure they won't queue more requests, but the
> > exports may have coroutines yielding in the block layer, which implies
> > they can't be fully closed until we drain it.
> 
> A coroutine that yielded must have some way to be reentered. So I guess
> the question becomes why they aren't reentered until drain. We do
> process events:
> 
>     AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
> 
> So in theory, anything that would finalise the block export closing
> should still execute.
> 
> What is the difference that drain makes compared to a simple
> AIO_WAIT_WHILE, so that coroutines are reentered during drain, but not
> during AIO_WAIT_WHILE?
> 
> This is an even more interesting question because the NBD server is neither a
> block node nor a BdrvChildClass implementation, so it shouldn't even
> notice a drain operation.

I agree that this deserves a deeper analysis. I'm going to drop
this patch from the series, and will re-analyze the issue later.

Thanks,
Sergio.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-15 17:23         ` Sergio Lopez
@ 2020-12-16 12:35           ` Kevin Wolf
  2020-12-16 14:55             ` Sergio Lopez
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin Wolf @ 2020-12-16 12:35 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 6881 bytes --]

Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
> On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> > Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > > While processing the parents of a BDS, one of the parents may process
> > > > > the child that's doing the tail recursion, which leads to a BDS being
> > > > > processed twice. This is especially problematic for the aio_notifiers,
> > > > > as they might attempt to work on both the old and the new AIO
> > > > > contexts.
> > > > > 
> > > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > > child BDS pointer while iterating over the children.
> > > > > 
> > > > > Signed-off-by: Sergio Lopez <slp@redhat.com>
> > > > 
> > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > > 
> > > I know, it's effective but quite ugly...
> > > 
> > > > What is the specific scenario where you saw this breaking? Did you have
> > > > multiple BdrvChild connections between two nodes so that we would go to
> > > > the parent node through one and then come back to the child node through
> > > > the other?
> > > 
> > > I don't think this is a corner case. If the graph is walked top->down,
> > > there's no problem since children are added to the ignore list before
> > > getting processed, and siblings don't process each other. But, if the
> > > graph is walked bottom->up, a BDS will start processing its parents
> > > without adding itself to the ignore list, so there's nothing
> > > preventing them from processing it again.
> > 
> > I don't understand. child is added to ignore before calling the parent
> > callback on it, so how can we come back through the same BdrvChild?
> > 
> >     QLIST_FOREACH(child, &bs->parents, next_parent) {
> >         if (g_slist_find(*ignore, child)) {
> >             continue;
> >         }
> >         assert(child->klass->set_aio_ctx);
> >         *ignore = g_slist_prepend(*ignore, child);
> >         child->klass->set_aio_ctx(child, new_context, ignore);
> >     }
> 
> Perhaps I'm missing something, but the way I understand it, that loop
> is adding the BdrvChild pointer of each of its parents, but not the
> BdrvChild pointer of the BDS that was passed as an argument to
> b_s_a_c_i.

Generally, the caller has already done that.

In the theoretical case that it was the outermost call in the recursion
and it hasn't (I couldn't find any such case), I think we should still
call the callback for the passed BdrvChild like we currently do.

> > You didn't dump the BdrvChild here. I think that would add some
> > information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> > bs->drv->format_name for each node to make the scenario less abstract?
> 
> I've generated another trace with more data:
> 
> bs=0x565505e48030 (backup-top) enter
> bs=0x565505e48030 (backup-top) processing children
> bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
> bs=0x565505e5d420 (qcow2) enter
> bs=0x565505e5d420 (qcow2) processing children
> bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
> bs=0x565505e52060 (file) enter
> bs=0x565505e52060 (file) processing children
> bs=0x565505e52060 (file) processing parents
> bs=0x565505e52060 (file) processing itself
> bs=0x565505e5d420 (qcow2) processing parents
> bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
> bs=0x565505fbf660 (qcow2) enter
> bs=0x565505fbf660 (qcow2) processing children
> bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
> bs=0x565506bc0c00 (file) enter
> bs=0x565506bc0c00 (file) processing children
> bs=0x565506bc0c00 (file) processing parents
> bs=0x565506bc0c00 (file) processing itself
> bs=0x565505fbf660 (qcow2) processing parents
> bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
> bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
> bs=0x565505e48030 (backup-top) enter
> bs=0x565505e48030 (backup-top) processing children
> bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
> bs=0x565505fbf660 (qcow2) enter
> bs=0x565505fbf660 (qcow2) processing children
> bs=0x565505fbf660 (qcow2) processing parents
> bs=0x565505fbf660 (qcow2) processing itself
> bs=0x565505e48030 (backup-top) processing parents
> bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
> bs=0x565505e48030 (backup-top) processing itself
> bs=0x565505fbf660 (qcow2) processing itself

Hm, is this complete? I see no "processing itself" for
bs=0x565505e5d420. Or is this because it crashed before getting there?

Anyway, trying to reconstruct the block graph with BdrvChild pointers
annotated at the edges:

BlockBackend
      |
      v
  backup-top ------------------------+
      |   |                          |
      |   +-----------------------+  |
      |            0x5655068b8510 |  | 0x565505e3c450
      |                           |  |
      | 0x565505e42090            |  |
      v                           |  |
    qcow2 ---------------------+  |  |
      |                        |  |  |
      | 0x565505e52060         |  |  | ??? [1]
      |                        |  |  |  |
      v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
    file                       v  v  v  v
                             qcow2 (backing)
                                    |
                                    | 0x565505e41d20
                                    v
                                  file

[1] This seems to be a BdrvChild with a non-BDS parent. Probably a
    BdrvChild directly owned by the backup job.

> So it seems this is happening:
> 
> backup-top (5e48030) <---------| (5)
>    |    |                      |
>    |    | (6) ------------> qcow2 (5fbf660)
>    |                           ^    |
>    |                       (3) |    | (4)
>    |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
>    |
>    |-> (2) file (5e52060)
> 
> backup-top (5e48030), the BDS that was passed as argument in the first
> bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
> is processing its parents, and the latter is also re-entered when the
> first one starts processing its children again.

Yes, but look at the BdrvChild pointers, it is through different edges
that we come back to the same node. No BdrvChild is used twice.

If backup-top had added all of its children to the ignore list before
calling into the overlay qcow2, the backing qcow2 wouldn't eventually
have called back into backup-top.
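
Something like the following for the children loop, i.e. a first pass
that only marks the edges and a second pass that recurses (just a sketch
of the idea, not a tested patch):

    GSList *children_to_process = NULL;
    GSList *entry;
    BdrvChild *child;

    /* First pass: mark every child edge of bs before recursing anywhere,
     * so that no parent callback can come back into bs through them. */
    QLIST_FOREACH(child, &bs->children, next) {
        if (g_slist_find(*ignore, child)) {
            continue;
        }
        *ignore = g_slist_prepend(*ignore, child);
        children_to_process = g_slist_prepend(children_to_process, child);
    }

    /* Second pass: recurse only into the children that were newly marked. */
    for (entry = children_to_process; entry; entry = entry->next) {
        child = entry->data;
        bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
    }
    g_slist_free(children_to_process);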

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-16 12:35           ` Kevin Wolf
@ 2020-12-16 14:55             ` Sergio Lopez
  2020-12-16 18:31               ` Kevin Wolf
  0 siblings, 1 reply; 25+ messages in thread
From: Sergio Lopez @ 2020-12-16 14:55 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 9625 bytes --]

On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
> > On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> > > Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > > > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > > > While processing the parents of a BDS, one of the parents may process
> > > > > > the child that's doing the tail recursion, which leads to a BDS being
> > > > > > processed twice. This is especially problematic for the aio_notifiers,
> > > > > > as they might attempt to work on both the old and the new AIO
> > > > > > contexts.
> > > > > > 
> > > > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > > > child BDS pointer while iterating over the children.
> > > > > > 
> > > > > > Signed-off-by: Sergio Lopez <slp@redhat.com>
> > > > > 
> > > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > > > 
> > > > I know, it's effective but quite ugly...
> > > > 
> > > > > What is the specific scenario where you saw this breaking? Did you have
> > > > > multiple BdrvChild connections between two nodes so that we would go to
> > > > > the parent node through one and then come back to the child node through
> > > > > the other?
> > > > 
> > > > I don't think this is a corner case. If the graph is walked top->down,
> > > > there's no problem since children are added to the ignore list before
> > > > getting processed, and siblings don't process each other. But, if the
> > > > graph is walked bottom->up, a BDS will start processing its parents
> > > > without adding itself to the ignore list, so there's nothing
> > > > preventing them from processing it again.
> > > 
> > > I don't understand. child is added to ignore before calling the parent
> > > callback on it, so how can we come back through the same BdrvChild?
> > > 
> > >     QLIST_FOREACH(child, &bs->parents, next_parent) {
> > >         if (g_slist_find(*ignore, child)) {
> > >             continue;
> > >         }
> > >         assert(child->klass->set_aio_ctx);
> > >         *ignore = g_slist_prepend(*ignore, child);
> > >         child->klass->set_aio_ctx(child, new_context, ignore);
> > >     }
> > 
> > Perhaps I'm missing something, but the way I understand it, that loop
> > is adding the BdrvChild pointer of each of its parents, but not the
> > BdrvChild pointer of the BDS that was passed as an argument to
> > b_s_a_c_i.
> 
> Generally, the caller has already done that.
> 
> In the theoretical case that it was the outermost call in the recursion
> and it hasn't (I couldn't find any such case), I think we should still
> call the callback for the passed BdrvChild like we currently do.
> 
> > > You didn't dump the BdrvChild here. I think that would add some
> > > information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> > > bs->drv->format_name for each node to make the scenario less abstract?
> > 
> > I've generated another trace with more data:
> > 
> > bs=0x565505e48030 (backup-top) enter
> > bs=0x565505e48030 (backup-top) processing children
> > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
> > bs=0x565505e5d420 (qcow2) enter
> > bs=0x565505e5d420 (qcow2) processing children
> > bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
> > bs=0x565505e52060 (file) enter
> > bs=0x565505e52060 (file) processing children
> > bs=0x565505e52060 (file) processing parents
> > bs=0x565505e52060 (file) processing itself
> > bs=0x565505e5d420 (qcow2) processing parents
> > bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
> > bs=0x565505fbf660 (qcow2) enter
> > bs=0x565505fbf660 (qcow2) processing children
> > bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
> > bs=0x565506bc0c00 (file) enter
> > bs=0x565506bc0c00 (file) processing children
> > bs=0x565506bc0c00 (file) processing parents
> > bs=0x565506bc0c00 (file) processing itself
> > bs=0x565505fbf660 (qcow2) processing parents
> > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
> > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
> > bs=0x565505e48030 (backup-top) enter
> > bs=0x565505e48030 (backup-top) processing children
> > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
> > bs=0x565505fbf660 (qcow2) enter
> > bs=0x565505fbf660 (qcow2) processing children
> > bs=0x565505fbf660 (qcow2) processing parents
> > bs=0x565505fbf660 (qcow2) processing itself
> > bs=0x565505e48030 (backup-top) processing parents
> > bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
> > bs=0x565505e48030 (backup-top) processing itself
> > bs=0x565505fbf660 (qcow2) processing itself
> 
> Hm, is this complete? I see no "processing itself" for
> bs=0x565505e5d420. Or is this because it crashed before getting there?

Yes, it crashes there. I forgot to mention that, sorry.

> Anyway, trying to reconstruct the block graph with BdrvChild pointers
> annotated at the edges:
> 
> BlockBackend
>       |
>       v
>   backup-top ------------------------+
>       |   |                          |
>       |   +-----------------------+  |
>       |            0x5655068b8510 |  | 0x565505e3c450
>       |                           |  |
>       | 0x565505e42090            |  |
>       v                           |  |
>     qcow2 ---------------------+  |  |
>       |                        |  |  |
>       | 0x565505e52060         |  |  | ??? [1]
>       |                        |  |  |  |
>       v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
>     file                       v  v  v  v
>                              qcow2 (backing)
>                                     |
>                                     | 0x565505e41d20
>                                     v
>                                   file
> 
> [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
>     BdrvChild directly owned by the backup job.
> 
> > So it seems this is happening:
> > 
> > backup-top (5e48030) <---------| (5)
> >    |    |                      |
> >    |    | (6) ------------> qcow2 (5fbf660)
> >    |                           ^    |
> >    |                       (3) |    | (4)
> >    |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
> >    |
> >    |-> (2) file (5e52060)
> > 
> > backup-top (5e48030), the BDS that was passed as argument in the first
> > bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
> > is processing its parents, and the latter is also re-entered when the
> > first one starts processing its children again.
> 
> Yes, but look at the BdrvChild pointers, it is through different edges
> that we come back to the same node. No BdrvChild is used twice.
> 
> If backup-top had added all of its children to the ignore list before
> calling into the overlay qcow2, the backing qcow2 wouldn't eventually
> have called back into backup-top.

I've tested a patch that first adds every child to the ignore list,
and then processes those that weren't there before, as you suggested
on a previous email. With that, the offending qcow2 is not re-entered,
so we avoid the crash, but backup-top is still entered twice:

bs=0x560db0e3b030 (backup-top) enter
bs=0x560db0e3b030 (backup-top) processing children
bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
bs=0x560db0fb2660 (qcow2) enter
bs=0x560db0fb2660 (qcow2) processing children
bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
bs=0x560db1bb3c00 (file) enter
bs=0x560db1bb3c00 (file) processing children
bs=0x560db1bb3c00 (file) processing parents
bs=0x560db1bb3c00 (file) processing itself
bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
bs=0x560db0e50420 (qcow2) enter
bs=0x560db0e50420 (qcow2) processing children
bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
bs=0x560db0e45060 (file) enter
bs=0x560db0e45060 (file) processing children
bs=0x560db0e45060 (file) processing parents
bs=0x560db0e45060 (file) processing itself
bs=0x560db0e50420 (qcow2) processing parents
bs=0x560db0e50420 (qcow2) processing itself
bs=0x560db0fb2660 (qcow2) processing parents
bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
bs=0x560db0e3b030 (backup-top) enter
bs=0x560db0e3b030 (backup-top) processing children
bs=0x560db0e3b030 (backup-top) processing parents
bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
bs=0x560db0e3b030 (backup-top) processing itself
bs=0x560db0fb2660 (qcow2) processing itself
bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
bs=0x560db0e50420 (qcow2) enter
bs=0x560db0e3b030 (backup-top) processing parents
bs=0x560db0e3b030 (backup-top) processing itself

I see that "blk_do_set_aio_context()" passes "blk->root" to
"bdrv_child_try_set_aio_context()" so it's already in the ignore list,
so I'm not sure what's happening here. Is backup-top referenced
from two different BdrvChild objects, or is "blk->root" not pointing to
backup-top's BDS?
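
I guess I can check that by printing the pointers right where
blk_do_set_aio_context() hands blk->root over (just a debugging
one-liner, untested):

    /* Before the bdrv_child_try_set_aio_context() call: if blk->root->bs
     * matches the backup-top address in the trace, the BlockBackend really
     * sits on top of the filter node. */
    fprintf(stderr, "blk=%p blk->root=%p blk->root->bs=%p\n",
            blk, blk->root, blk->root->bs);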

Thanks,
Sergio.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-16 14:55             ` Sergio Lopez
@ 2020-12-16 18:31               ` Kevin Wolf
  2020-12-17  9:37                 ` Sergio Lopez
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin Wolf @ 2020-12-16 18:31 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 10878 bytes --]

Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
> On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> > Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
> > > On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> > > > Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > > > > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > > > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > > > > While processing the parents of a BDS, one of the parents may process
> > > > > > > the child that's doing the tail recursion, which leads to a BDS being
> > > > > > > processed twice. This is especially problematic for the aio_notifiers,
> > > > > > > as they might attempt to work on both the old and the new AIO
> > > > > > > contexts.
> > > > > > > 
> > > > > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > > > > child BDS pointer while iterating over the children.
> > > > > > > 
> > > > > > > Signed-off-by: Sergio Lopez <slp@redhat.com>
> > > > > > 
> > > > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > > > > 
> > > > > I know, it's effective but quite ugly...
> > > > > 
> > > > > > What is the specific scenario where you saw this breaking? Did you have
> > > > > > multiple BdrvChild connections between two nodes so that we would go to
> > > > > > the parent node through one and then come back to the child node through
> > > > > > the other?
> > > > > 
> > > > > I don't think this is a corner case. If the graph is walked top->down,
> > > > > there's no problem since children are added to the ignore list before
> > > > > getting processed, and siblings don't process each other. But, if the
> > > > > graph is walked bottom->up, a BDS will start processing its parents
> > > > > without adding itself to the ignore list, so there's nothing
> > > > > preventing them from processing it again.
> > > > 
> > > > I don't understand. child is added to ignore before calling the parent
> > > > callback on it, so how can we come back through the same BdrvChild?
> > > > 
> > > >     QLIST_FOREACH(child, &bs->parents, next_parent) {
> > > >         if (g_slist_find(*ignore, child)) {
> > > >             continue;
> > > >         }
> > > >         assert(child->klass->set_aio_ctx);
> > > >         *ignore = g_slist_prepend(*ignore, child);
> > > >         child->klass->set_aio_ctx(child, new_context, ignore);
> > > >     }
> > > 
> > > Perhaps I'm missing something, but the way I understand it, that loop
> > > is adding the BdrvChild pointer of each of its parents, but not the
> > > BdrvChild pointer of the BDS that was passed as an argument to
> > > b_s_a_c_i.
> > 
> > Generally, the caller has already done that.
> > 
> > In the theoretical case that it was the outermost call in the recursion
> > and it hasn't (I couldn't find any such case), I think we should still
> > call the callback for the passed BdrvChild like we currently do.
> > 
> > > > You didn't dump the BdrvChild here. I think that would add some
> > > > information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> > > > bs->drv->format_name for each node to make the scenario less abstract?
> > > 
> > > I've generated another trace with more data:
> > > 
> > > bs=0x565505e48030 (backup-top) enter
> > > bs=0x565505e48030 (backup-top) processing children
> > > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
> > > bs=0x565505e5d420 (qcow2) enter
> > > bs=0x565505e5d420 (qcow2) processing children
> > > bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
> > > bs=0x565505e52060 (file) enter
> > > bs=0x565505e52060 (file) processing children
> > > bs=0x565505e52060 (file) processing parents
> > > bs=0x565505e52060 (file) processing itself
> > > bs=0x565505e5d420 (qcow2) processing parents
> > > bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
> > > bs=0x565505fbf660 (qcow2) enter
> > > bs=0x565505fbf660 (qcow2) processing children
> > > bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
> > > bs=0x565506bc0c00 (file) enter
> > > bs=0x565506bc0c00 (file) processing children
> > > bs=0x565506bc0c00 (file) processing parents
> > > bs=0x565506bc0c00 (file) processing itself
> > > bs=0x565505fbf660 (qcow2) processing parents
> > > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
> > > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
> > > bs=0x565505e48030 (backup-top) enter
> > > bs=0x565505e48030 (backup-top) processing children
> > > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
> > > bs=0x565505fbf660 (qcow2) enter
> > > bs=0x565505fbf660 (qcow2) processing children
> > > bs=0x565505fbf660 (qcow2) processing parents
> > > bs=0x565505fbf660 (qcow2) processing itself
> > > bs=0x565505e48030 (backup-top) processing parents
> > > bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
> > > bs=0x565505e48030 (backup-top) processing itself
> > > bs=0x565505fbf660 (qcow2) processing itself
> > 
> > Hm, is this complete? I see no "processing itself" for
> > bs=0x565505e5d420. Or is this because it crashed before getting there?
> 
> Yes, it crashes there. I forgot to mention that, sorry.
> 
> > Anyway, trying to reconstruct the block graph with BdrvChild pointers
> > annotated at the edges:
> > 
> > BlockBackend
> >       |
> >       v
> >   backup-top ------------------------+
> >       |   |                          |
> >       |   +-----------------------+  |
> >       |            0x5655068b8510 |  | 0x565505e3c450
> >       |                           |  |
> >       | 0x565505e42090            |  |
> >       v                           |  |
> >     qcow2 ---------------------+  |  |
> >       |                        |  |  |
> >       | 0x565505e52060         |  |  | ??? [1]
> >       |                        |  |  |  |
> >       v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
> >     file                       v  v  v  v
> >                              qcow2 (backing)
> >                                     |
> >                                     | 0x565505e41d20
> >                                     v
> >                                   file
> > 
> > [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
> >     BdrvChild directly owned by the backup job.
> > 
> > > So it seems this is happening:
> > > 
> > > backup-top (5e48030) <---------| (5)
> > >    |    |                      |
> > >    |    | (6) ------------> qcow2 (5fbf660)
> > >    |                           ^    |
> > >    |                       (3) |    | (4)
> > >    |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
> > >    |
> > >    |-> (2) file (5e52060)
> > > 
> > > backup-top (5e48030), the BDS that was passed as argument in the first
> > > bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
> > > is processing its parents, and the latter is also re-entered when the
> > > first one starts processing its children again.
> > 
> > Yes, but look at the BdrvChild pointers, it is through different edges
> > that we come back to the same node. No BdrvChild is used twice.
> > 
> > If backup-top had added all of its children to the ignore list before
> > calling into the overlay qcow2, the backing qcow2 wouldn't eventually
> > have called back into backup-top.
> 
> I've tested a patch that first adds every child to the ignore list,
> and then processes those that weren't there before, as you suggested
> on a previous email. With that, the offending qcow2 is not re-entered,
> so we avoid the crash, but backup-top is still entered twice:

I think we also need to add every parent to the ignore list before calling
callbacks, though it doesn't look like this is the problem you're
currently seeing.
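
That would be the same two-pass pattern applied to the parents loop,
roughly (again, only a sketch):

    GSList *parents_to_process = NULL;
    GSList *entry;
    BdrvChild *child;

    QLIST_FOREACH(child, &bs->parents, next_parent) {
        if (g_slist_find(*ignore, child)) {
            continue;
        }
        *ignore = g_slist_prepend(*ignore, child);
        parents_to_process = g_slist_prepend(parents_to_process, child);
    }
    for (entry = parents_to_process; entry; entry = entry->next) {
        child = entry->data;
        assert(child->klass->set_aio_ctx);
        child->klass->set_aio_ctx(child, new_context, ignore);
    }
    g_slist_free(parents_to_process);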

> bs=0x560db0e3b030 (backup-top) enter
> bs=0x560db0e3b030 (backup-top) processing children
> bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
> bs=0x560db0fb2660 (qcow2) enter
> bs=0x560db0fb2660 (qcow2) processing children
> bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
> bs=0x560db1bb3c00 (file) enter
> bs=0x560db1bb3c00 (file) processing children
> bs=0x560db1bb3c00 (file) processing parents
> bs=0x560db1bb3c00 (file) processing itself
> bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
> bs=0x560db0e50420 (qcow2) enter
> bs=0x560db0e50420 (qcow2) processing children
> bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
> bs=0x560db0e45060 (file) enter
> bs=0x560db0e45060 (file) processing children
> bs=0x560db0e45060 (file) processing parents
> bs=0x560db0e45060 (file) processing itself
> bs=0x560db0e50420 (qcow2) processing parents
> bs=0x560db0e50420 (qcow2) processing itself
> bs=0x560db0fb2660 (qcow2) processing parents
> bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
> bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
> bs=0x560db0e3b030 (backup-top) enter
> bs=0x560db0e3b030 (backup-top) processing children
> bs=0x560db0e3b030 (backup-top) processing parents
> bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
> bs=0x560db0e3b030 (backup-top) processing itself
> bs=0x560db0fb2660 (qcow2) processing itself
> bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
> bs=0x560db0e50420 (qcow2) enter
> bs=0x560db0e3b030 (backup-top) processing parents
> bs=0x560db0e3b030 (backup-top) processing itself
> 
> I see that "blk_do_set_aio_context()" passes "blk->root" to
> "bdrv_child_try_set_aio_context()" so it's already in the ignore list,
> so I'm not sure what's happening here. Is backup-top referenced
> from two different BdrvChild objects, or is "blk->root" not pointing to
> backup-top's BDS?

The second time that backup-top is entered, it is not as the BDS of
blk->root, but as the parent node of the overlay qcow2. Which is
interesting, because last time it was still the backing qcow2, so the
change did have _some_ effect.

The part that I don't understand is why you still get the line with
child=0x560db1b14a20, because when you add all children to the ignore
list first, that should have been put into the ignore list as one of the
first things in the whole process (when backup-top was first entered).

Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
but isn't actually present in backup-top's bs->children?
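
Extending your instrumentation to dump both edge lists of each node on
entry would show that directly, e.g. (sketch; child->name should be the
child role such as "backing" or "file"):

    BdrvChild *child;

    /* Dump both directions of bs's edges: a parent edge whose pointer never
     * shows up in any node's child dump belongs to a non-BDS parent, or the
     * two nodes disagree about the edge. */
    QLIST_FOREACH(child, &bs->children, next) {
        fprintf(stderr, "bs=%p (%s) child edge %p -> bs=%p\n",
                bs, bs->drv->format_name, child, child->bs);
    }
    QLIST_FOREACH(child, &bs->parents, next_parent) {
        fprintf(stderr, "bs=%p (%s) parent edge %p (%s)\n",
                bs, bs->drv->format_name, child,
                child->name ? child->name : "(unnamed)");
    }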

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-16 18:31               ` Kevin Wolf
@ 2020-12-17  9:37                 ` Sergio Lopez
  2020-12-17 10:58                   ` Kevin Wolf
  0 siblings, 1 reply; 25+ messages in thread
From: Sergio Lopez @ 2020-12-17  9:37 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 11982 bytes --]

On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
> Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
> > On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> > > Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
> > > > On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> > > > > Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > > > > > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > > > > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > > > > > While processing the parents of a BDS, one of the parents may process
> > > > > > > > the child that's doing the tail recursion, which leads to a BDS being
> > > > > > > > processed twice. This is especially problematic for the aio_notifiers,
> > > > > > > > as they might attempt to work on both the old and the new AIO
> > > > > > > > contexts.
> > > > > > > > 
> > > > > > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > > > > > child BDS pointer while iterating over the children.
> > > > > > > > 
> > > > > > > > Signed-off-by: Sergio Lopez <slp@redhat.com>
> > > > > > > 
> > > > > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > > > > > 
> > > > > > I know, it's effective but quite ugly...
> > > > > > 
> > > > > > > What is the specific scenario where you saw this breaking? Did you have
> > > > > > > multiple BdrvChild connections between two nodes so that we would go to
> > > > > > > the parent node through one and then come back to the child node through
> > > > > > > the other?
> > > > > > 
> > > > > > I don't think this is a corner case. If the graph is walked top->down,
> > > > > > there's no problem since children are added to the ignore list before
> > > > > > getting processed, and siblings don't process each other. But, if the
> > > > > > graph is walked bottom->up, a BDS will start processing its parents
> > > > > > without adding itself to the ignore list, so there's nothing
> > > > > > preventing them from processing it again.
> > > > > 
> > > > > I don't understand. child is added to ignore before calling the parent
> > > > > callback on it, so how can we come back through the same BdrvChild?
> > > > > 
> > > > >     QLIST_FOREACH(child, &bs->parents, next_parent) {
> > > > >         if (g_slist_find(*ignore, child)) {
> > > > >             continue;
> > > > >         }
> > > > >         assert(child->klass->set_aio_ctx);
> > > > >         *ignore = g_slist_prepend(*ignore, child);
> > > > >         child->klass->set_aio_ctx(child, new_context, ignore);
> > > > >     }
> > > > 
> > > > Perhaps I'm missing something, but the way I understand it, that loop
> > > > is adding the BdrvChild pointer of each of its parents, but not the
> > > > BdrvChild pointer of the BDS that was passed as an argument to
> > > > b_s_a_c_i.
> > > 
> > > Generally, the caller has already done that.
> > > 
> > > In the theoretical case that it was the outermost call in the recursion
> > > and it hasn't (I couldn't find any such case), I think we should still
> > > call the callback for the passed BdrvChild like we currently do.
> > > 
> > > > > You didn't dump the BdrvChild here. I think that would add some
> > > > > information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> > > > > bs->drv->format_name for each node to make the scenario less abstract?
> > > > 
> > > > I've generated another trace with more data:
> > > > 
> > > > bs=0x565505e48030 (backup-top) enter
> > > > bs=0x565505e48030 (backup-top) processing children
> > > > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
> > > > bs=0x565505e5d420 (qcow2) enter
> > > > bs=0x565505e5d420 (qcow2) processing children
> > > > bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
> > > > bs=0x565505e52060 (file) enter
> > > > bs=0x565505e52060 (file) processing children
> > > > bs=0x565505e52060 (file) processing parents
> > > > bs=0x565505e52060 (file) processing itself
> > > > bs=0x565505e5d420 (qcow2) processing parents
> > > > bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
> > > > bs=0x565505fbf660 (qcow2) enter
> > > > bs=0x565505fbf660 (qcow2) processing children
> > > > bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
> > > > bs=0x565506bc0c00 (file) enter
> > > > bs=0x565506bc0c00 (file) processing children
> > > > bs=0x565506bc0c00 (file) processing parents
> > > > bs=0x565506bc0c00 (file) processing itself
> > > > bs=0x565505fbf660 (qcow2) processing parents
> > > > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
> > > > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
> > > > bs=0x565505e48030 (backup-top) enter
> > > > bs=0x565505e48030 (backup-top) processing children
> > > > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
> > > > bs=0x565505fbf660 (qcow2) enter
> > > > bs=0x565505fbf660 (qcow2) processing children
> > > > bs=0x565505fbf660 (qcow2) processing parents
> > > > bs=0x565505fbf660 (qcow2) processing itself
> > > > bs=0x565505e48030 (backup-top) processing parents
> > > > bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
> > > > bs=0x565505e48030 (backup-top) processing itself
> > > > bs=0x565505fbf660 (qcow2) processing itself
> > > 
> > > Hm, is this complete? I see no "processing itself" for
> > > bs=0x565505e5d420. Or is this because it crashed before getting there?
> > 
> > Yes, it crashes there. I forgot to mention that, sorry.
> > 
> > > Anyway, trying to reconstruct the block graph with BdrvChild pointers
> > > annotated at the edges:
> > > 
> > > BlockBackend
> > >       |
> > >       v
> > >   backup-top ------------------------+
> > >       |   |                          |
> > >       |   +-----------------------+  |
> > >       |            0x5655068b8510 |  | 0x565505e3c450
> > >       |                           |  |
> > >       | 0x565505e42090            |  |
> > >       v                           |  |
> > >     qcow2 ---------------------+  |  |
> > >       |                        |  |  |
> > >       | 0x565505e52060         |  |  | ??? [1]
> > >       |                        |  |  |  |
> > >       v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
> > >     file                       v  v  v  v
> > >                              qcow2 (backing)
> > >                                     |
> > >                                     | 0x565505e41d20
> > >                                     v
> > >                                   file
> > > 
> > > [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
> > >     BdrvChild directly owned by the backup job.
> > > 
> > > > So it seems this is happening:
> > > > 
> > > > backup-top (5e48030) <---------| (5)
> > > >    |    |                      |
> > > >    |    | (6) ------------> qcow2 (5fbf660)
> > > >    |                           ^    |
> > > >    |                       (3) |    | (4)
> > > >    |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
> > > >    |
> > > >    |-> (2) file (5e52060)
> > > > 
> > > > backup-top (5e48030), the BDS that was passed as argument in the first
> > > > bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
> > > > is processing its parents, and the latter is also re-entered when the
> > > > first one starts processing its children again.
> > > 
> > > Yes, but look at the BdrvChild pointers, it is through different edges
> > > that we come back to the same node. No BdrvChild is used twice.
> > > 
> > > If backup-top had added all of its children to the ignore list before
> > > calling into the overlay qcow2, the backing qcow2 wouldn't eventually
> > > have called back into backup-top.
> > 
> > I've tested a patch that first adds every child to the ignore list,
> > and then processes those that weren't there before, as you suggested
> > on a previous email. With that, the offending qcow2 is not re-entered,
> > so we avoid the crash, but backup-top is still entered twice:
> 
> I think we also need to add every parent to the ignore list before calling
> callbacks, though it doesn't look like this is the problem you're
> currently seeing.

I agree.

> > bs=0x560db0e3b030 (backup-top) enter
> > bs=0x560db0e3b030 (backup-top) processing children
> > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
> > bs=0x560db0fb2660 (qcow2) enter
> > bs=0x560db0fb2660 (qcow2) processing children
> > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
> > bs=0x560db1bb3c00 (file) enter
> > bs=0x560db1bb3c00 (file) processing children
> > bs=0x560db1bb3c00 (file) processing parents
> > bs=0x560db1bb3c00 (file) processing itself
> > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
> > bs=0x560db0e50420 (qcow2) enter
> > bs=0x560db0e50420 (qcow2) processing children
> > bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
> > bs=0x560db0e45060 (file) enter
> > bs=0x560db0e45060 (file) processing children
> > bs=0x560db0e45060 (file) processing parents
> > bs=0x560db0e45060 (file) processing itself
> > bs=0x560db0e50420 (qcow2) processing parents
> > bs=0x560db0e50420 (qcow2) processing itself
> > bs=0x560db0fb2660 (qcow2) processing parents
> > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
> > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
> > bs=0x560db0e3b030 (backup-top) enter
> > bs=0x560db0e3b030 (backup-top) processing children
> > bs=0x560db0e3b030 (backup-top) processing parents
> > bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
> > bs=0x560db0e3b030 (backup-top) processing itself
> > bs=0x560db0fb2660 (qcow2) processing itself
> > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
> > bs=0x560db0e50420 (qcow2) enter
> > bs=0x560db0e3b030 (backup-top) processing parents
> > bs=0x560db0e3b030 (backup-top) processing itself
> > 
> > I see that "blk_do_set_aio_context()" passes "blk->root" to
> > "bdrv_child_try_set_aio_context()" so it's already in the ignore list,
> > so I'm not sure what's happening here. Is backup-top referenced
> > from two different BdrvChild objects, or is "blk->root" not pointing to
> > backup-top's BDS?
> 
> The second time that backup-top is entered, it is not as the BDS of
> blk->root, but as the parent node of the overlay qcow2. Which is
> interesting, because last time it was still the backing qcow2, so the
> change did have _some_ effect.
> 
> The part that I don't understand is why you still get the line with
> child=0x560db1b14a20, because when you add all children to the ignore
> list first, that should have been put into the ignore list as one of the
> first things in the whole process (when backup-top was first entered).
> 
> Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
> but isn't actually present in backup-top's bs->children?

Exactly, that line corresponds to this chunk of code:

<---- begin ---->
    QLIST_FOREACH(child, &bs->parents, next_parent) {
        if (g_slist_find(*ignore, child)) {
            continue;
        }
        assert(child->klass->set_aio_ctx);
        *ignore = g_slist_prepend(*ignore, child);
        fprintf(stderr, "bs=%p (%s) calling set_aio_ctx child=%p\n", bs, bs->drv->format_name, child);
        child->klass->set_aio_ctx(child, new_context, ignore);
    }
<---- end ---->

Do you think it's safe to re-enter backup-top, or should we look for a
way to avoid this?

Thanks,
Sergio.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-17  9:37                 ` Sergio Lopez
@ 2020-12-17 10:58                   ` Kevin Wolf
  2020-12-17 12:50                     ` Vladimir Sementsov-Ogievskiy
  2020-12-17 13:09                     ` Sergio Lopez
  0 siblings, 2 replies; 25+ messages in thread
From: Kevin Wolf @ 2020-12-17 10:58 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 13070 bytes --]

Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
> On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
> > Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
> > > On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> > > > Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
> > > > > On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
> > > > > > Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
> > > > > > > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
> > > > > > > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > > > > > > > > While processing the parents of a BDS, one of the parents may process
> > > > > > > > > the child that's doing the tail recursion, which leads to a BDS being
> > > > > > > > > processed twice. This is especially problematic for the aio_notifiers,
> > > > > > > > > as they might attempt to work on both the old and the new AIO
> > > > > > > > > contexts.
> > > > > > > > > 
> > > > > > > > > To avoid this, add the BDS pointer to the ignore list, and check the
> > > > > > > > > child BDS pointer while iterating over the children.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Sergio Lopez <slp@redhat.com>
> > > > > > > > 
> > > > > > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
> > > > > > > 
> > > > > > > I know, it's effective but quite ugly...
> > > > > > > 
> > > > > > > > What is the specific scenario where you saw this breaking? Did you have
> > > > > > > > multiple BdrvChild connections between two nodes so that we would go to
> > > > > > > > the parent node through one and then come back to the child node through
> > > > > > > > the other?
> > > > > > > 
> > > > > > > I don't think this is a corner case. If the graph is walked top->down,
> > > > > > > there's no problem since children are added to the ignore list before
> > > > > > > getting processed, and siblings don't process each other. But, if the
> > > > > > > graph is walked bottom->up, a BDS will start processing its parents
> > > > > > > without adding itself to the ignore list, so there's nothing
> > > > > > > preventing them from processing it again.
> > > > > > 
> > > > > > I don't understand. child is added to ignore before calling the parent
> > > > > > callback on it, so how can we come back through the same BdrvChild?
> > > > > > 
> > > > > >     QLIST_FOREACH(child, &bs->parents, next_parent) {
> > > > > >         if (g_slist_find(*ignore, child)) {
> > > > > >             continue;
> > > > > >         }
> > > > > >         assert(child->klass->set_aio_ctx);
> > > > > >         *ignore = g_slist_prepend(*ignore, child);
> > > > > >         child->klass->set_aio_ctx(child, new_context, ignore);
> > > > > >     }
> > > > > 
> > > > > Perhaps I'm missing something, but the way I understand it, that loop
> > > > > is adding the BdrvChild pointer of each of its parents, but not the
> > > > > BdrvChild pointer of the BDS that was passed as an argument to
> > > > > b_s_a_c_i.
> > > > 
> > > > Generally, the caller has already done that.
> > > > 
> > > > In the theoretical case that it was the outermost call in the recursion
> > > > and it hasn't (I couldn't find any such case), I think we should still
> > > > call the callback for the passed BdrvChild like we currently do.
> > > > 
> > > > > > You didn't dump the BdrvChild here. I think that would add some
> > > > > > information on why we re-entered 0x555ee2fbf660. Maybe you can also add
> > > > > > bs->drv->format_name for each node to make the scenario less abstract?
> > > > > 
> > > > > I've generated another trace with more data:
> > > > > 
> > > > > bs=0x565505e48030 (backup-top) enter
> > > > > bs=0x565505e48030 (backup-top) processing children
> > > > > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
> > > > > bs=0x565505e5d420 (qcow2) enter
> > > > > bs=0x565505e5d420 (qcow2) processing children
> > > > > bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
> > > > > bs=0x565505e52060 (file) enter
> > > > > bs=0x565505e52060 (file) processing children
> > > > > bs=0x565505e52060 (file) processing parents
> > > > > bs=0x565505e52060 (file) processing itself
> > > > > bs=0x565505e5d420 (qcow2) processing parents
> > > > > bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
> > > > > bs=0x565505fbf660 (qcow2) enter
> > > > > bs=0x565505fbf660 (qcow2) processing children
> > > > > bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
> > > > > bs=0x565506bc0c00 (file) enter
> > > > > bs=0x565506bc0c00 (file) processing children
> > > > > bs=0x565506bc0c00 (file) processing parents
> > > > > bs=0x565506bc0c00 (file) processing itself
> > > > > bs=0x565505fbf660 (qcow2) processing parents
> > > > > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
> > > > > bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
> > > > > bs=0x565505e48030 (backup-top) enter
> > > > > bs=0x565505e48030 (backup-top) processing children
> > > > > bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
> > > > > bs=0x565505fbf660 (qcow2) enter
> > > > > bs=0x565505fbf660 (qcow2) processing children
> > > > > bs=0x565505fbf660 (qcow2) processing parents
> > > > > bs=0x565505fbf660 (qcow2) processing itself
> > > > > bs=0x565505e48030 (backup-top) processing parents
> > > > > bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
> > > > > bs=0x565505e48030 (backup-top) processing itself
> > > > > bs=0x565505fbf660 (qcow2) processing itself
> > > > 
> > > > Hm, is this complete? I see no "processing itself" for
> > > > bs=0x565505e5d420. Or is this because it crashed before getting there?
> > > 
> > > Yes, it crashes there. I forgot to mention that, sorry.
> > > 
> > > > Anyway, trying to reconstruct the block graph with BdrvChild pointers
> > > > annotated at the edges:
> > > > 
> > > > BlockBackend
> > > >       |
> > > >       v
> > > >   backup-top ------------------------+
> > > >       |   |                          |
> > > >       |   +-----------------------+  |
> > > >       |            0x5655068b8510 |  | 0x565505e3c450
> > > >       |                           |  |
> > > >       | 0x565505e42090            |  |
> > > >       v                           |  |
> > > >     qcow2 ---------------------+  |  |
> > > >       |                        |  |  |
> > > >       | 0x565505e52060         |  |  | ??? [1]
> > > >       |                        |  |  |  |
> > > >       v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
> > > >     file                       v  v  v  v
> > > >                              qcow2 (backing)
> > > >                                     |
> > > >                                     | 0x565505e41d20
> > > >                                     v
> > > >                                   file
> > > > 
> > > > [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
> > > >     BdrvChild directly owned by the backup job.
> > > > 
> > > > > So it seems this is happening:
> > > > > 
> > > > > backup-top (5e48030) <---------| (5)
> > > > >    |    |                      |
> > > > >    |    | (6) ------------> qcow2 (5fbf660)
> > > > >    |                           ^    |
> > > > >    |                       (3) |    | (4)
> > > > >    |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
> > > > >    |
> > > > >    |-> (2) file (5e52060)
> > > > > 
> > > > > backup-top (5e48030), the BDS that was passed as argument in the first
> > > > > bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
> > > > > is processing its parents, and the latter is also re-entered when the
> > > > > first one starts processing its children again.
> > > > 
> > > > Yes, but look at the BdrvChild pointers, it is through different edges
> > > > that we come back to the same node. No BdrvChild is used twice.
> > > > 
> > > > If backup-top had added all of its children to the ignore list before
> > > > calling into the overlay qcow2, the backing qcow2 wouldn't eventually
> > > > have called back into backup-top.
> > > 
> > > I've tested a patch that first adds every child to the ignore list,
> > > and then processes those that weren't there before, as you suggested
> > > on a previous email. With that, the offending qcow2 is not re-entered,
> > > so we avoid the crash, but backup-top is still entered twice:
> > 
> > I think we also need to add every parent to the ignore list before calling
> > callbacks, though it doesn't look like this is the problem you're
> > currently seeing.
> 
> I agree.
> 
> > > bs=0x560db0e3b030 (backup-top) enter
> > > bs=0x560db0e3b030 (backup-top) processing children
> > > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
> > > bs=0x560db0fb2660 (qcow2) enter
> > > bs=0x560db0fb2660 (qcow2) processing children
> > > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
> > > bs=0x560db1bb3c00 (file) enter
> > > bs=0x560db1bb3c00 (file) processing children
> > > bs=0x560db1bb3c00 (file) processing parents
> > > bs=0x560db1bb3c00 (file) processing itself
> > > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
> > > bs=0x560db0e50420 (qcow2) enter
> > > bs=0x560db0e50420 (qcow2) processing children
> > > bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
> > > bs=0x560db0e45060 (file) enter
> > > bs=0x560db0e45060 (file) processing children
> > > bs=0x560db0e45060 (file) processing parents
> > > bs=0x560db0e45060 (file) processing itself
> > > bs=0x560db0e50420 (qcow2) processing parents
> > > bs=0x560db0e50420 (qcow2) processing itself
> > > bs=0x560db0fb2660 (qcow2) processing parents
> > > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
> > > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
> > > bs=0x560db0e3b030 (backup-top) enter
> > > bs=0x560db0e3b030 (backup-top) processing children
> > > bs=0x560db0e3b030 (backup-top) processing parents
> > > bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
> > > bs=0x560db0e3b030 (backup-top) processing itself
> > > bs=0x560db0fb2660 (qcow2) processing itself
> > > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
> > > bs=0x560db0e50420 (qcow2) enter
> > > bs=0x560db0e3b030 (backup-top) processing parents
> > > bs=0x560db0e3b030 (backup-top) processing itself
> > > 
> > > I see that "blk_do_set_aio_context()" passes "blk->root" to
> > > "bdrv_child_try_set_aio_context()" so it's already in the ignore list,
> > > so I'm not sure what's happening here. Is backup-top referenced
> > > from two different BdrvChild objects, or is "blk->root" not pointing to
> > > backup-top's BDS?
> > 
> > The second time that backup-top is entered, it is not as the BDS of
> > blk->root, but as the parent node of the overlay qcow2. Which is
> > interesting, because last time it was still the backing qcow2, so the
> > change did have _some_ effect.
> > 
> > The part that I don't understand is why you still get the line with
> > child=0x560db1b14a20, because when you add all children to the ignore
> > list first, that should have been put into the ignore list as one of the
> > first things in the whole process (when backup-top was first entered).
> > 
> > Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
> > but isn't actually present in backup-top's bs->children?
> 
> Exactly, that line corresponds to this chunk of code:
> 
> <---- begin ---->
>     QLIST_FOREACH(child, &bs->parents, next_parent) {
>         if (g_slist_find(*ignore, child)) {
>             continue;
>         }
>         assert(child->klass->set_aio_ctx);
>         *ignore = g_slist_prepend(*ignore, child);
>         fprintf(stderr, "bs=%p (%s) calling set_aio_ctx child=%p\n", bs, bs->drv->format_name, child);
>         child->klass->set_aio_ctx(child, new_context, ignore);
>     }
> <---- end ---->
> 
> Do you think it's safe to re-enter backup-top, or should we look for a
> way to avoid this?

I think it should be avoided, but I don't understand why putting all
children of backup-top into the ignore list doesn't already avoid it. If
backup-top is in the parents list of qcow2, then qcow2 should be in the
children list of backup-top and therefore the BdrvChild should already
be in the ignore list.

The only way I can explain this is that backup-top and qcow2 have
different ideas about which BdrvChild objects exist that connect them.
Or that the graph changes between both places, but I don't see how that
could happen in bdrv_set_aio_context_ignore().
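
One way to test that hypothesis would be a consistency scan over the
graph before starting the switch, roughly like this (a debugging sketch;
it assumes bdrv_next_all_states() can be used here to walk all nodes):

    BlockDriverState *it;
    BdrvChild *p, *c;

    QLIST_FOREACH(p, &bs->parents, next_parent) {
        bool found = false;

        /* Look for this parent edge in every node's children list.  If it
         * is missing everywhere, its parent is not a BDS (BlockBackend or
         * job), or the two nodes really do disagree about the edge. */
        for (it = bdrv_next_all_states(NULL); it;
             it = bdrv_next_all_states(it)) {
            QLIST_FOREACH(c, &it->children, next) {
                if (c == p) {
                    found = true;
                }
            }
        }
        fprintf(stderr, "bs=%p parent edge %p (%s): %s\n",
                bs, p, p->name ? p->name : "(unnamed)",
                found ? "listed in some node's children" : "no BDS lists it");
    }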

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-17 10:58                   ` Kevin Wolf
@ 2020-12-17 12:50                     ` Vladimir Sementsov-Ogievskiy
  2020-12-17 13:06                       ` Kevin Wolf
  2020-12-17 13:09                     ` Sergio Lopez
  1 sibling, 1 reply; 25+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-12-17 12:50 UTC (permalink / raw)
  To: Kevin Wolf, Sergio Lopez
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Michael S. Tsirkin,
	Paul Durrant, qemu-devel, Max Reitz, Stefan Hajnoczi, xen-devel,
	Anthony Perard, Paolo Bonzini

17.12.2020 13:58, Kevin Wolf wrote:
> Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
>> On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
>>> Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
>>>> On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
>>>>> Am 15.12.2020 um 18:23 hat Sergio Lopez geschrieben:
>>>>>> On Tue, Dec 15, 2020 at 04:01:19PM +0100, Kevin Wolf wrote:
>>>>>>> Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben:
>>>>>>>> On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote:
>>>>>>>>> Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
>>>>>>>>>> While processing the parents of a BDS, one of the parents may process
>>>>>>>>>> the child that's doing the tail recursion, which leads to a BDS being
>>>>>>>>>> processed twice. This is especially problematic for the aio_notifiers,
>>>>>>>>>> as they might attempt to work on both the old and the new AIO
>>>>>>>>>> contexts.
>>>>>>>>>>
>>>>>>>>>> To avoid this, add the BDS pointer to the ignore list, and check the
>>>>>>>>>> child BDS pointer while iterating over the children.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Sergio Lopez <slp@redhat.com>
>>>>>>>>>
>>>>>>>>> Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/
>>>>>>>>
>>>>>>>> I know, it's effective but quite ugly...
>>>>>>>>
>>>>>>>>> What is the specific scenario where you saw this breaking? Did you have
>>>>>>>>> multiple BdrvChild connections between two nodes so that we would go to
>>>>>>>>> the parent node through one and then come back to the child node through
>>>>>>>>> the other?
>>>>>>>>
>>>>>>>> I don't think this is a corner case. If the graph is walked top->down,
>>>>>>>> there's no problem since children are added to the ignore list before
>>>>>>>> getting processed, and siblings don't process each other. But, if the
>>>>>>>> graph is walked bottom->up, a BDS will start processing its parents
>>>>>>>> without adding itself to the ignore list, so there's nothing
>>>>>>>> preventing them from processing it again.
>>>>>>>
>>>>>>> I don't understand. child is added to ignore before calling the parent
>>>>>>> callback on it, so how can we come back through the same BdrvChild?
>>>>>>>
>>>>>>>      QLIST_FOREACH(child, &bs->parents, next_parent) {
>>>>>>>          if (g_slist_find(*ignore, child)) {
>>>>>>>              continue;
>>>>>>>          }
>>>>>>>          assert(child->klass->set_aio_ctx);
>>>>>>>          *ignore = g_slist_prepend(*ignore, child);
>>>>>>>          child->klass->set_aio_ctx(child, new_context, ignore);
>>>>>>>      }
>>>>>>
>>>>>> Perhaps I'm missing something, but the way I understand it, that loop
>>>>>> is adding the BdrvChild pointer of each of its parents, but not the
>>>>>> BdrvChild pointer of the BDS that was passed as an argument to
>>>>>> b_s_a_c_i.
>>>>>
>>>>> Generally, the caller has already done that.
>>>>>
>>>>> In the theoretical case that it was the outermost call in the recursion
>>>>> and it hasn't (I couldn't find any such case), I think we should still
>>>>> call the callback for the passed BdrvChild like we currently do.
>>>>>
>>>>>>> You didn't dump the BdrvChild here. I think that would add some
>>>>>>> information on why we re-entered 0x555ee2fbf660. Maybe you can also add
>>>>>>> bs->drv->format_name for each node to make the scenario less abstract?
>>>>>>
>>>>>> I've generated another trace with more data:
>>>>>>
>>>>>> bs=0x565505e48030 (backup-top) enter
>>>>>> bs=0x565505e48030 (backup-top) processing children
>>>>>> bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e42090 (child->bs=0x565505e5d420)
>>>>>> bs=0x565505e5d420 (qcow2) enter
>>>>>> bs=0x565505e5d420 (qcow2) processing children
>>>>>> bs=0x565505e5d420 (qcow2) calling bsaci child=0x565505e41ea0 (child->bs=0x565505e52060)
>>>>>> bs=0x565505e52060 (file) enter
>>>>>> bs=0x565505e52060 (file) processing children
>>>>>> bs=0x565505e52060 (file) processing parents
>>>>>> bs=0x565505e52060 (file) processing itself
>>>>>> bs=0x565505e5d420 (qcow2) processing parents
>>>>>> bs=0x565505e5d420 (qcow2) calling set_aio_ctx child=0x5655066a34d0
>>>>>> bs=0x565505fbf660 (qcow2) enter
>>>>>> bs=0x565505fbf660 (qcow2) processing children
>>>>>> bs=0x565505fbf660 (qcow2) calling bsaci child=0x565505e41d20 (child->bs=0x565506bc0c00)
>>>>>> bs=0x565506bc0c00 (file) enter
>>>>>> bs=0x565506bc0c00 (file) processing children
>>>>>> bs=0x565506bc0c00 (file) processing parents
>>>>>> bs=0x565506bc0c00 (file) processing itself
>>>>>> bs=0x565505fbf660 (qcow2) processing parents
>>>>>> bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x565505fc7aa0
>>>>>> bs=0x565505fbf660 (qcow2) calling set_aio_ctx child=0x5655068b8510
>>>>>> bs=0x565505e48030 (backup-top) enter
>>>>>> bs=0x565505e48030 (backup-top) processing children
>>>>>> bs=0x565505e48030 (backup-top) calling bsaci child=0x565505e3c450 (child->bs=0x565505fbf660)
>>>>>> bs=0x565505fbf660 (qcow2) enter
>>>>>> bs=0x565505fbf660 (qcow2) processing children
>>>>>> bs=0x565505fbf660 (qcow2) processing parents
>>>>>> bs=0x565505fbf660 (qcow2) processing itself
>>>>>> bs=0x565505e48030 (backup-top) processing parents
>>>>>> bs=0x565505e48030 (backup-top) calling set_aio_ctx child=0x565505e402d0
>>>>>> bs=0x565505e48030 (backup-top) processing itself
>>>>>> bs=0x565505fbf660 (qcow2) processing itself
>>>>>
>>>>> Hm, is this complete? I see no "processing itself" for
>>>>> bs=0x565505e5d420. Or is this because it crashed before getting there?
>>>>
>>>> Yes, it crashes there. I forgot to mention that, sorry.
>>>>
>>>>> Anyway, trying to reconstruct the block graph with BdrvChild pointers
>>>>> annotated at the edges:
>>>>>
>>>>> BlockBackend
>>>>>        |
>>>>>        v
>>>>>    backup-top ------------------------+
>>>>>        |   |                          |
>>>>>        |   +-----------------------+  |
>>>>>        |            0x5655068b8510 |  | 0x565505e3c450
>>>>>        |                           |  |
>>>>>        | 0x565505e42090            |  |
>>>>>        v                           |  |
>>>>>      qcow2 ---------------------+  |  |
>>>>>        |                        |  |  |
>>>>>        | 0x565505e52060         |  |  | ??? [1]
>>>>>        |                        |  |  |  |
>>>>>        v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
>>>>>      file                       v  v  v  v
>>>>>                               qcow2 (backing)
>>>>>                                      |
>>>>>                                      | 0x565505e41d20
>>>>>                                      v
>>>>>                                    file
>>>>>
>>>>> [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
>>>>>      BdrvChild directly owned by the backup job.
>>>>>
>>>>>> So it seems this is happening:
>>>>>>
>>>>>> backup-top (5e48030) <---------| (5)
>>>>>>     |    |                      |
>>>>>>     |    | (6) ------------> qcow2 (5fbf660)
>>>>>>     |                           ^    |
>>>>>>     |                       (3) |    | (4)
>>>>>>     |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
>>>>>>     |
>>>>>>     |-> (2) file (5e52060)
>>>>>>
>>>>>> backup-top (5e48030), the BDS that was passed as argument in the first
>>>>>> bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
>>>>>> is processing its parents, and the latter is also re-entered when the
>>>>>> first one starts processing its children again.
>>>>>
>>>>> Yes, but look at the BdrvChild pointers, it is through different edges
>>>>> that we come back to the same node. No BdrvChild is used twice.
>>>>>
>>>>> If backup-top had added all of its children to the ignore list before
>>>>> calling into the overlay qcow2, the backing qcow2 wouldn't eventually
>>>>> have called back into backup-top.
>>>>
>>>> I've tested a patch that first adds every child to the ignore list,
>>>> and then processes those that weren't there before, as you suggested
>>>> on a previous email. With that, the offending qcow2 is not re-entered,
>>>> so we avoid the crash, but backup-top is still entered twice:
>>>
>>> I think we also need to add every parent to the ignore list before calling
>>> callbacks, though it doesn't look like this is the problem you're
>>> currently seeing.
>>
>> I agree.
>>
>>>> bs=0x560db0e3b030 (backup-top) enter
>>>> bs=0x560db0e3b030 (backup-top) processing children
>>>> bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
>>>> bs=0x560db0fb2660 (qcow2) enter
>>>> bs=0x560db0fb2660 (qcow2) processing children
>>>> bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
>>>> bs=0x560db1bb3c00 (file) enter
>>>> bs=0x560db1bb3c00 (file) processing children
>>>> bs=0x560db1bb3c00 (file) processing parents
>>>> bs=0x560db1bb3c00 (file) processing itself
>>>> bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
>>>> bs=0x560db0e50420 (qcow2) enter
>>>> bs=0x560db0e50420 (qcow2) processing children
>>>> bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
>>>> bs=0x560db0e45060 (file) enter
>>>> bs=0x560db0e45060 (file) processing children
>>>> bs=0x560db0e45060 (file) processing parents
>>>> bs=0x560db0e45060 (file) processing itself
>>>> bs=0x560db0e50420 (qcow2) processing parents
>>>> bs=0x560db0e50420 (qcow2) processing itself
>>>> bs=0x560db0fb2660 (qcow2) processing parents
>>>> bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
>>>> bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
>>>> bs=0x560db0e3b030 (backup-top) enter
>>>> bs=0x560db0e3b030 (backup-top) processing children
>>>> bs=0x560db0e3b030 (backup-top) processing parents
>>>> bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
>>>> bs=0x560db0e3b030 (backup-top) processing itself
>>>> bs=0x560db0fb2660 (qcow2) processing itself
>>>> bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
>>>> bs=0x560db0e50420 (qcow2) enter
>>>> bs=0x560db0e3b030 (backup-top) processing parents
>>>> bs=0x560db0e3b030 (backup-top) processing itself
>>>>
>>>> I see that "blk_do_set_aio_context()" passes "blk->root" to
>>>> "bdrv_child_try_set_aio_context()" so it's already in the ignore list,
>>>> so I'm not sure what's happening here. Is backup-top referenced
>>>> from two different BdrvChild or is "blk->root" not pointing to
>>>> backup-top's BDS?
>>>
>>> The second time that backup-top is entered, it is not as the BDS of
>>> blk->root, but as the parent node of the overlay qcow2. Which is
>>> interesting, because last time it was still the backing qcow2, so the
>>> change did have _some_ effect.
>>>
>>> The part that I don't understand is why you still get the line with
>>> child=0x560db1b14a20, because when you add all children to the ignore
>>> list first, that should have been put into the ignore list as one of the
>>> first things in the whole process (when backup-top was first entered).
>>>
>>> Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
>>> but isn't actually present in backup-top's bs->children?
>>
>> Exactly, that line corresponds to this chunk of code:
>>
>> <---- begin ---->
>>      QLIST_FOREACH(child, &bs->parents, next_parent) {
>>          if (g_slist_find(*ignore, child)) {
>>              continue;
>>          }
>>          assert(child->klass->set_aio_ctx);
>>          *ignore = g_slist_prepend(*ignore, child);
>>          fprintf(stderr, "bs=%p (%s) calling set_aio_ctx child=%p\n", bs, bs->drv->format_name, child);
>>          child->klass->set_aio_ctx(child, new_context, ignore);
>>      }
>> <---- end ---->
>>
>> Do you think it's safe to re-enter backup-top, or should we look for a
>> way to avoid this?
> 
> I think it should be avoided, but I don't understand why putting all
> children of backup-top into the ignore list doesn't already avoid it. If
> backup-top is in the parents list of qcow2, then qcow2 should be in the
> children list of backup-top and therefore the BdrvChild should already
> be in the ignore list.
> 
> The only way I can explain this is that backup-top and qcow2 have
> different ideas about which BdrvChild objects exist that connect them.
> Or that the graph changes between both places, but I don't see how that
> could happen in bdrv_set_aio_context_ignore().
> 

bdrv_set_aio_context_ignore() does bdrv_drained_begin(). As I reported recently, nothing prevents a job from finishing and modifying the graph during another drained section. That may be the case here.

If backup-top is involved, I suppose the graph modification happens in backup_clean, when we remove the filter. Who is calling set_aio_context in this case? I mean, what is the backtrace of bdrv_set_aio_context_ignore()?


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-17 12:50                     ` Vladimir Sementsov-Ogievskiy
@ 2020-12-17 13:06                       ` Kevin Wolf
  2020-12-17 13:27                         ` Sergio Lopez
  2020-12-17 14:01                         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 25+ messages in thread
From: Kevin Wolf @ 2020-12-17 13:06 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Fam Zheng, Stefano Stabellini, Sergio Lopez, qemu-block,
	Michael S. Tsirkin, Paul Durrant, qemu-devel, Max Reitz,
	Stefan Hajnoczi, xen-devel, Anthony Perard, Paolo Bonzini

Am 17.12.2020 um 13:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 17.12.2020 13:58, Kevin Wolf wrote:
> > Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
> > > On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
> > > > Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
> > > > > On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> > > > > > Anyway, trying to reconstruct the block graph with BdrvChild pointers
> > > > > > annotated at the edges:
> > > > > > 
> > > > > > BlockBackend
> > > > > >        |
> > > > > >        v
> > > > > >    backup-top ------------------------+
> > > > > >        |   |                          |
> > > > > >        |   +-----------------------+  |
> > > > > >        |            0x5655068b8510 |  | 0x565505e3c450
> > > > > >        |                           |  |
> > > > > >        | 0x565505e42090            |  |
> > > > > >        v                           |  |
> > > > > >      qcow2 ---------------------+  |  |
> > > > > >        |                        |  |  |
> > > > > >        | 0x565505e52060         |  |  | ??? [1]
> > > > > >        |                        |  |  |  |
> > > > > >        v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
> > > > > >      file                       v  v  v  v
> > > > > >                               qcow2 (backing)
> > > > > >                                      |
> > > > > >                                      | 0x565505e41d20
> > > > > >                                      v
> > > > > >                                    file
> > > > > > 
> > > > > > [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
> > > > > >      BdrvChild directly owned by the backup job.
> > > > > > 
> > > > > > > So it seems this is happening:
> > > > > > > 
> > > > > > > backup-top (5e48030) <---------| (5)
> > > > > > >     |    |                      |
> > > > > > >     |    | (6) ------------> qcow2 (5fbf660)
> > > > > > >     |                           ^    |
> > > > > > >     |                       (3) |    | (4)
> > > > > > >     |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
> > > > > > >     |
> > > > > > >     |-> (2) file (5e52060)
> > > > > > > 
> > > > > > > backup-top (5e48030), the BDS that was passed as argument in the first
> > > > > > > bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
> > > > > > > is processing its parents, and the latter is also re-entered when the
> > > > > > > first one starts processing its children again.
> > > > > > 
> > > > > > Yes, but look at the BdrvChild pointers, it is through different edges
> > > > > > that we come back to the same node. No BdrvChild is used twice.
> > > > > > 
> > > > > > If backup-top had added all of its children to the ignore list before
> > > > > > calling into the overlay qcow2, the backing qcow2 wouldn't eventually
> > > > > > have called back into backup-top.
> > > > > 
> > > > > I've tested a patch that first adds every child to the ignore list,
> > > > > and then processes those that weren't there before, as you suggested
> > > > > on a previous email. With that, the offending qcow2 is not re-entered,
> > > > > so we avoid the crash, but backup-top is still entered twice:
> > > > 
> > > > I think we also need to add every parent to the ignore list before calling
> > > > callbacks, though it doesn't look like this is the problem you're
> > > > currently seeing.
> > > 
> > > I agree.
> > > 
> > > > > bs=0x560db0e3b030 (backup-top) enter
> > > > > bs=0x560db0e3b030 (backup-top) processing children
> > > > > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
> > > > > bs=0x560db0fb2660 (qcow2) enter
> > > > > bs=0x560db0fb2660 (qcow2) processing children
> > > > > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
> > > > > bs=0x560db1bb3c00 (file) enter
> > > > > bs=0x560db1bb3c00 (file) processing children
> > > > > bs=0x560db1bb3c00 (file) processing parents
> > > > > bs=0x560db1bb3c00 (file) processing itself
> > > > > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
> > > > > bs=0x560db0e50420 (qcow2) enter
> > > > > bs=0x560db0e50420 (qcow2) processing children
> > > > > bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
> > > > > bs=0x560db0e45060 (file) enter
> > > > > bs=0x560db0e45060 (file) processing children
> > > > > bs=0x560db0e45060 (file) processing parents
> > > > > bs=0x560db0e45060 (file) processing itself
> > > > > bs=0x560db0e50420 (qcow2) processing parents
> > > > > bs=0x560db0e50420 (qcow2) processing itself
> > > > > bs=0x560db0fb2660 (qcow2) processing parents
> > > > > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
> > > > > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
> > > > > bs=0x560db0e3b030 (backup-top) enter
> > > > > bs=0x560db0e3b030 (backup-top) processing children
> > > > > bs=0x560db0e3b030 (backup-top) processing parents
> > > > > bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
> > > > > bs=0x560db0e3b030 (backup-top) processing itself
> > > > > bs=0x560db0fb2660 (qcow2) processing itself
> > > > > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
> > > > > bs=0x560db0e50420 (qcow2) enter
> > > > > bs=0x560db0e3b030 (backup-top) processing parents
> > > > > bs=0x560db0e3b030 (backup-top) processing itself
> > > > > 
> > > > > I see that "blk_do_set_aio_context()" passes "blk->root" to
> > > > > "bdrv_child_try_set_aio_context()" so it's already in the ignore list,
> > > > > so I'm not sure what's happening here. Is backup-top referenced
> > > > > from two different BdrvChild or is "blk->root" not pointing to
> > > > > backup-top's BDS?
> > > > 
> > > > The second time that backup-top is entered, it is not as the BDS of
> > > > blk->root, but as the parent node of the overlay qcow2. Which is
> > > > interesting, because last time it was still the backing qcow2, so the
> > > > change did have _some_ effect.
> > > > 
> > > > The part that I don't understand is why you still get the line with
> > > > child=0x560db1b14a20, because when you add all children to the ignore
> > > > list first, that should have been put into the ignore list as one of the
> > > > first things in the whole process (when backup-top was first entered).
> > > > 
> > > > Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
> > > > but isn't actually present in backup-top's bs->children?
> > > 
> > > Exactly, that line corresponds to this chunk of code:
> > > 
> > > <---- begin ---->
> > >      QLIST_FOREACH(child, &bs->parents, next_parent) {
> > >          if (g_slist_find(*ignore, child)) {
> > >              continue;
> > >          }
> > >          assert(child->klass->set_aio_ctx);
> > >          *ignore = g_slist_prepend(*ignore, child);
> > >          fprintf(stderr, "bs=%p (%s) calling set_aio_ctx child=%p\n", bs, bs->drv->format_name, child);
> > >          child->klass->set_aio_ctx(child, new_context, ignore);
> > >      }
> > > <---- end ---->
> > > 
> > > Do you think it's safe to re-enter backup-top, or should we look for a
> > > way to avoid this?
> > 
> > I think it should be avoided, but I don't understand why putting all
> > children of backup-top into the ignore list doesn't already avoid it. If
> > backup-top is in the parents list of qcow2, then qcow2 should be in the
> > children list of backup-top and therefore the BdrvChild should already
> > be in the ignore list.
> > 
> > The only way I can explain this is that backup-top and qcow2 have
> > different ideas about which BdrvChild objects exist that connect them.
> > Or that the graph changes between both places, but I don't see how that
> > could happen in bdrv_set_aio_context_ignore().
> > 
> 
> bdrv_set_aio_context_ignore() does bdrv_drained_begin(). As I reported
> recently, nothing prevents a job from finishing and modifying the graph
> during another drained section. That may be the case here.

Good point, this might be the same bug then.

If everything worked correctly, a job completion could only happen on
the outer bdrv_set_aio_context_ignore(). But after that, we are already
in a drain section, so the job should be quiesced and a second drain
shouldn't cause any additional graph changes.

I would have to go back to the other discussion, but I think it was
related to block jobs that are already in the completion process and
keep moving forward even though they are supposed to be quiesced.

If I remember correctly, actually pausing them at this point looked
difficult. Maybe what we should then do is letting .drained_poll return
true until they have actually fully completed?

Ah, but was this something that would deadlock because the job
completion callbacks use drain sections themselves?
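
As a rough illustration of that .drained_poll idea (purely a sketch:
deferred_to_main_loop is only used here as an approximation of "the job has
started completing", and the deadlock concern above is left open):

    static bool child_job_drained_poll(BdrvChild *c)
    {
        BlockJob *bjob = c->opaque;
        Job *job = &bjob->job;

        /* Sketch only: keep the drain polling while a job that has already
         * entered its completion path has not fully completed, so that its
         * completion callbacks (and the graph changes they make) cannot run
         * inside a later drained section. */
        if (job->deferred_to_main_loop && !job_is_completed(job)) {
            return true;
        }

        /* Simplified stand-in for the existing quiescence check. */
        return job->busy;
    }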

> If backup-top is involved, I suppose the graph modification happens in
> backup_clean, when we remove the filter. Who is calling
> set_aio_context in this case? I mean, what is the backtrace of
> bdrv_set_aio_context_ignore()?

Sergio, can you provide the backtrace and also test if the theory with a
job completion in the middle of the process is what you actually hit?

Kevin



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-17 10:58                   ` Kevin Wolf
  2020-12-17 12:50                     ` Vladimir Sementsov-Ogievskiy
@ 2020-12-17 13:09                     ` Sergio Lopez
  1 sibling, 0 replies; 25+ messages in thread
From: Sergio Lopez @ 2020-12-17 13:09 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2401 bytes --]

On Thu, Dec 17, 2020 at 11:58:30AM +0100, Kevin Wolf wrote:
> Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
> > Do you think it's safe to re-enter backup-top, or should we look for a
> > way to avoid this?
> 
> I think it should be avoided, but I don't understand why putting all
> children of backup-top into the ignore list doesn't already avoid it. If
> backup-top is in the parents list of qcow2, then qcow2 should be in the
> children list of backup-top and therefore the BdrvChild should already
> be in the ignore list.
> 
> The only way I can explain this is that backup-top and qcow2 have
> different ideas about which BdrvChild objects exist that connect them.
> Or that the graph changes between both places, but I don't see how that
> could happen in bdrv_set_aio_context_ignore().

I've been digging around with gdb, and found that, at that point, the
backup-top BDS is actually referenced by two different BdrvChild
objects:

(gdb) p *(BdrvChild *) 0x560c40f7e400
$84 = {bs = 0x560c40c4c030, name = 0x560c41ca4960 "root", klass = 0x560c3eae7c20 <child_root>, 
  role = 20, opaque = 0x560c41ca4610, perm = 3, shared_perm = 29, has_backup_perm = false, 
  backup_perm = 0, backup_shared_perm = 31, frozen = false, parent_quiesce_counter = 2, next = {
    le_next = 0x0, le_prev = 0x0}, next_parent = {le_next = 0x0, le_prev = 0x560c40c44338}}

(gdb) p sibling
$72 = (BdrvChild *) 0x560c40981840
(gdb) p *sibling
$73 = {bs = 0x560c40c4c030, name = 0x560c4161be20 "main node", klass = 0x560c3eae6a40 <child_job>, 
  role = 0, opaque = 0x560c4161bc00, perm = 0, shared_perm = 31, has_backup_perm = false, 
  backup_perm = 0, backup_shared_perm = 0, frozen = false, parent_quiesce_counter = 2, next = {
    le_next = 0x0, le_prev = 0x0}, next_parent = {le_next = 0x560c40c442d0, le_prev = 0x560c40c501c0}}

When the chain of calls to switch AIO contexts is started, backup-top
is the first one to be processed. blk_do_set_aio_context() instructs
bdrv_child_try_set_aio_context() to add blk->root (0x560c40f7e400) as
the first element of the ignore list, but the referenced BDS is still
re-entered through the other BdrvChild (0x560c40981840) by one of the
children of the latter.

I can't think of a way of preventing this other than keeping track of
BDS pointers in the ignore list too. Do you think there are any
alternatives?
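
In code, keeping track of the BDS pointers too would look roughly like this
(a sketch of the idea only, not the exact patch):

    void bdrv_set_aio_context_ignore(BlockDriverState *bs,
                                     AioContext *new_context, GSList **ignore)
    {
        BdrvChild *child;

        /* Remember the node itself, so that reaching it again through a
         * different BdrvChild (e.g. the job's "main node" child above) is
         * a no-op. */
        if (g_slist_find(*ignore, bs)) {
            return;
        }
        *ignore = g_slist_prepend(*ignore, bs);

        QLIST_FOREACH(child, &bs->children, next) {
            /* Skip edges already visited and nodes already recorded. */
            if (g_slist_find(*ignore, child) ||
                g_slist_find(*ignore, child->bs)) {
                continue;
            }
            *ignore = g_slist_prepend(*ignore, child);
            bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
        }

        /* ... the parents loop and the context switch of bs follow ... */
    }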

Thanks,
Sergio.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-17 13:06                       ` Kevin Wolf
@ 2020-12-17 13:27                         ` Sergio Lopez
  2020-12-17 14:01                         ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 25+ messages in thread
From: Sergio Lopez @ 2020-12-17 13:27 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	Michael S. Tsirkin, Paul Durrant, qemu-devel, Max Reitz,
	Stefano Stabellini, Stefan Hajnoczi, xen-devel, Anthony Perard,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 13874 bytes --]

On Thu, Dec 17, 2020 at 02:06:02PM +0100, Kevin Wolf wrote:
> Am 17.12.2020 um 13:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > 17.12.2020 13:58, Kevin Wolf wrote:
> > > Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
> > > > On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
> > > > > Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
> > > > > > On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
> > > > > > > Anyway, trying to reconstruct the block graph with BdrvChild pointers
> > > > > > > annotated at the edges:
> > > > > > > 
> > > > > > > BlockBackend
> > > > > > >        |
> > > > > > >        v
> > > > > > >    backup-top ------------------------+
> > > > > > >        |   |                          |
> > > > > > >        |   +-----------------------+  |
> > > > > > >        |            0x5655068b8510 |  | 0x565505e3c450
> > > > > > >        |                           |  |
> > > > > > >        | 0x565505e42090            |  |
> > > > > > >        v                           |  |
> > > > > > >      qcow2 ---------------------+  |  |
> > > > > > >        |                        |  |  |
> > > > > > >        | 0x565505e52060         |  |  | ??? [1]
> > > > > > >        |                        |  |  |  |
> > > > > > >        v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
> > > > > > >      file                       v  v  v  v
> > > > > > >                               qcow2 (backing)
> > > > > > >                                      |
> > > > > > >                                      | 0x565505e41d20
> > > > > > >                                      v
> > > > > > >                                    file
> > > > > > > 
> > > > > > > [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
> > > > > > >      BdrvChild directly owned by the backup job.
> > > > > > > 
> > > > > > > > So it seems this is happening:
> > > > > > > > 
> > > > > > > > backup-top (5e48030) <---------| (5)
> > > > > > > >     |    |                      |
> > > > > > > >     |    | (6) ------------> qcow2 (5fbf660)
> > > > > > > >     |                           ^    |
> > > > > > > >     |                       (3) |    | (4)
> > > > > > > >     |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
> > > > > > > >     |
> > > > > > > >     |-> (2) file (5e52060)
> > > > > > > > 
> > > > > > > > backup-top (5e48030), the BDS that was passed as argument in the first
> > > > > > > > bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
> > > > > > > > is processing its parents, and the latter is also re-entered when the
> > > > > > > > first one starts processing its children again.
> > > > > > > 
> > > > > > > Yes, but look at the BdrvChild pointers, it is through different edges
> > > > > > > that we come back to the same node. No BdrvChild is used twice.
> > > > > > > 
> > > > > > > If backup-top had added all of its children to the ignore list before
> > > > > > > calling into the overlay qcow2, the backing qcow2 wouldn't eventually
> > > > > > > have called back into backup-top.
> > > > > > 
> > > > > > I've tested a patch that first adds every child to the ignore list,
> > > > > > and then processes those that weren't there before, as you suggested
> > > > > > on a previous email. With that, the offending qcow2 is not re-entered,
> > > > > > so we avoid the crash, but backup-top is still entered twice:
> > > > > 
> > > > > I think we also need to add every parent to the ignore list before calling
> > > > > callbacks, though it doesn't look like this is the problem you're
> > > > > currently seeing.
> > > > 
> > > > I agree.
> > > > 
> > > > > > bs=0x560db0e3b030 (backup-top) enter
> > > > > > bs=0x560db0e3b030 (backup-top) processing children
> > > > > > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
> > > > > > bs=0x560db0fb2660 (qcow2) enter
> > > > > > bs=0x560db0fb2660 (qcow2) processing children
> > > > > > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
> > > > > > bs=0x560db1bb3c00 (file) enter
> > > > > > bs=0x560db1bb3c00 (file) processing children
> > > > > > bs=0x560db1bb3c00 (file) processing parents
> > > > > > bs=0x560db1bb3c00 (file) processing itself
> > > > > > bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
> > > > > > bs=0x560db0e50420 (qcow2) enter
> > > > > > bs=0x560db0e50420 (qcow2) processing children
> > > > > > bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
> > > > > > bs=0x560db0e45060 (file) enter
> > > > > > bs=0x560db0e45060 (file) processing children
> > > > > > bs=0x560db0e45060 (file) processing parents
> > > > > > bs=0x560db0e45060 (file) processing itself
> > > > > > bs=0x560db0e50420 (qcow2) processing parents
> > > > > > bs=0x560db0e50420 (qcow2) processing itself
> > > > > > bs=0x560db0fb2660 (qcow2) processing parents
> > > > > > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
> > > > > > bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
> > > > > > bs=0x560db0e3b030 (backup-top) enter
> > > > > > bs=0x560db0e3b030 (backup-top) processing children
> > > > > > bs=0x560db0e3b030 (backup-top) processing parents
> > > > > > bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
> > > > > > bs=0x560db0e3b030 (backup-top) processing itself
> > > > > > bs=0x560db0fb2660 (qcow2) processing itself
> > > > > > bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
> > > > > > bs=0x560db0e50420 (qcow2) enter
> > > > > > bs=0x560db0e3b030 (backup-top) processing parents
> > > > > > bs=0x560db0e3b030 (backup-top) processing itself
> > > > > > 
> > > > > > I see that "blk_do_set_aio_context()" passes "blk->root" to
> > > > > > "bdrv_child_try_set_aio_context()" so it's already in the ignore list,
> > > > > > so I'm not sure what's happening here. Is backup-top referenced
> > > > > > from two different BdrvChild or is "blk->root" not pointing to
> > > > > > backup-top's BDS?
> > > > > 
> > > > > The second time that backup-top is entered, it is not as the BDS of
> > > > > blk->root, but as the parent node of the overlay qcow2. Which is
> > > > > interesting, because last time it was still the backing qcow2, so the
> > > > > change did have _some_ effect.
> > > > > 
> > > > > The part that I don't understand is why you still get the line with
> > > > > child=0x560db1b14a20, because when you add all children to the ignore
> > > > > list first, that should have been put into the ignore list as one of the
> > > > > first things in the whole process (when backup-top was first entered).
> > > > > 
> > > > > Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
> > > > > but isn't actually present in backup-top's bs->children?
> > > > 
> > > > Exactly, that line corresponds to this chunk of code:
> > > > 
> > > > <---- begin ---->
> > > >      QLIST_FOREACH(child, &bs->parents, next_parent) {
> > > >          if (g_slist_find(*ignore, child)) {
> > > >              continue;
> > > >          }
> > > >          assert(child->klass->set_aio_ctx);
> > > >          *ignore = g_slist_prepend(*ignore, child);
> > > >          fprintf(stderr, "bs=%p (%s) calling set_aio_ctx child=%p\n", bs, bs->drv->format_name, child);
> > > >          child->klass->set_aio_ctx(child, new_context, ignore);
> > > >      }
> > > > <---- end ---->
> > > > 
> > > > Do you think it's safe to re-enter backup-top, or should we look for a
> > > > way to avoid this?
> > > 
> > > I think it should be avoided, but I don't understand why putting all
> > > children of backup-top into the ignore list doesn't already avoid it. If
> > > backup-top is in the parents list of qcow2, then qcow2 should be in the
> > > children list of backup-top and therefore the BdrvChild should already
> > > be in the ignore list.
> > > 
> > > The only way I can explain this is that backup-top and qcow2 have
> > > different ideas about which BdrvChild objects exist that connect them.
> > > Or that the graph changes between both places, but I don't see how that
> > > could happen in bdrv_set_aio_context_ignore().
> > > 
> > 
> > bdrv_set_aio_context_ignore() does bdrv_drained_begin(). As I reported
> > recently, nothing prevents a job from finishing and modifying the graph
> > during another drained section. That may be the case here.
> 
> Good point, this might be the same bug then.
> 
> If everything worked correctly, a job completion could only happen on
> the outer bdrv_set_aio_context_ignore(). But after that, we are already
> in a drain section, so the job should be quiesced and a second drain
> shouldn't cause any additional graph changes.
> 
> I would have to go back to the other discussion, but I think it was
> related to block jobs that are already in the completion process and
> keep moving forward even though they are supposed to be quiesced.
> 
> If I remember correctly, actually pausing them at this point looked
> difficult. Maybe what we should then do is letting .drained_poll return
> true until they have actually fully completed?
> 
> Ah, but was this something that would deadlock because the job
> completion callbacks use drain sections themselves?
> 
> > If backup-top is involved, I suppose the graph modification happens in
> > backup_clean, when we remove the filter. Who is calling
> > set_aio_context in this case? I mean, what is the backtrace of
> > bdrv_set_aio_context_ignore()?
> 
> Sergio, can you provide the backtrace and also test if the theory with a
> job completion in the middle of the process is what you actually hit?

No, I'm sure the job is not finishing in the middle of the
set_aio_context chain, which is started by a
virtio_blk_data_plane_[start|stop], which in turn is triggered by a
guest reboot.

This is a stack trace that reaches to the point in which backup-top is
entered a second time:

#0  0x0000560c3e173bbd in child_job_set_aio_ctx
    (c=<optimized out>, ctx=0x560c40c45630, ignore=0x7f6d4eeb6f40) at ../blockjob.c:159
#1  0x0000560c3e1aefc6 in bdrv_set_aio_context_ignore
    (bs=0x560c40dc3660, new_context=0x560c40c45630, ignore=0x7f6d4eeb6f40) at ../block.c:6509
#2  0x0000560c3e1aee8a in bdrv_set_aio_context_ignore
    (bs=bs@entry=0x560c40c4c030, new_context=new_context@entry=0x560c40c45630, ignore=ignore@entry=0x7f6d4eeb6f40) at ../block.c:6487
#3  0x0000560c3e1af503 in bdrv_child_try_set_aio_context
    (bs=bs@entry=0x560c40c4c030, ctx=ctx@entry=0x560c40c45630, ignore_child=<optimized out>, errp=errp@entry=0x7f6d4eeb6fc8) at ../block.c:6619
#4  0x0000560c3e1e561a in blk_do_set_aio_context
    (blk=0x560c41ca4610, new_context=0x560c40c45630, update_root_node=update_root_node@entry=true, errp=errp@entry=0x7f6d4eeb6fc8) at ../block/block-backend.c:2027
#5  0x0000560c3e1e740d in blk_set_aio_context
    (blk=<optimized out>, new_context=<optimized out>, errp=errp@entry=0x7f6d4eeb6fc8)
    at ../block/block-backend.c:2048
#6  0x0000560c3e10de78 in virtio_blk_data_plane_start (vdev=<optimized out>)
    at ../hw/block/dataplane/virtio-blk.c:220
#7  0x0000560c3de691d2 in virtio_bus_start_ioeventfd (bus=bus@entry=0x560c41ca1e98)
    at ../hw/virtio/virtio-bus.c:222
#8  0x0000560c3de4f907 in virtio_pci_start_ioeventfd (proxy=0x560c41c99d90)
    at ../hw/virtio/virtio-pci.c:1261
#9  0x0000560c3de4f907 in virtio_pci_common_write
    (opaque=0x560c41c99d90, addr=<optimized out>, val=<optimized out>, size=<optimized out>)
    at ../hw/virtio/virtio-pci.c:1261
#10 0x0000560c3e145d81 in memory_region_write_accessor
    (mr=0x560c41c9a770, addr=20, value=<optimized out>, size=1, shift=<optimized out>, mask=<optimized out>, attrs=...) at ../softmmu/memory.c:491
#11 0x0000560c3e1447de in access_with_adjusted_size
    (addr=addr@entry=20, value=value@entry=0x7f6d4eeb71a8, size=size@entry=1, access_size_min=<optimized out>, access_size_max=<optimized out>, access_fn=
    0x560c3e145c80 <memory_region_write_accessor>, mr=0x560c41c9a770, attrs=...)
    at ../softmmu/memory.c:552
#12 0x0000560c3e148052 in memory_region_dispatch_write
    (mr=mr@entry=0x560c41c9a770, addr=20, data=<optimized out>, op=<optimized out>, attrs=attrs@entry=...) at ../softmmu/memory.c:1501
#13 0x0000560c3e06b5b7 in flatview_write_continue
    (fv=fv@entry=0x7f6d400ed3e0, addr=addr@entry=4261429268, attrs=..., ptr=ptr@entry=0x7f6d71dad028, len=len@entry=1, addr1=<optimized out>, l=<optimized out>, mr=0x560c41c9a770)
    at /home/BZs/1900326/qemu/include/qemu/host-utils.h:164
#14 0x0000560c3e06b7d6 in flatview_write
    (fv=0x7f6d400ed3e0, addr=addr@entry=4261429268, attrs=attrs@entry=..., buf=buf@entry=0x7f6d71dad028, len=len@entry=1) at ../softmmu/physmem.c:2799
#15 0x0000560c3e06e330 in address_space_write
    (as=0x560c3ec0a920 <address_space_memory>, addr=4261429268, attrs=..., buf=buf@entry=0x7f6d71dad028, len=1) at ../softmmu/physmem.c:2891
#16 0x0000560c3e06e3ba in address_space_rw (as=<optimized out>, addr=<optimized out>, attrs=..., 
    attrs@entry=..., buf=buf@entry=0x7f6d71dad028, len=<optimized out>, is_write=<optimized out>)
    at ../softmmu/physmem.c:2901
#17 0x0000560c3e10021a in kvm_cpu_exec (cpu=cpu@entry=0x560c40d7e0d0) at ../accel/kvm/kvm-all.c:2541
#18 0x0000560c3e1445e5 in kvm_vcpu_thread_fn (arg=arg@entry=0x560c40d7e0d0) at ../accel/kvm/kvm-cpus.c:49
#19 0x0000560c3e2c798a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:521
#20 0x00007f6d6ba8614a in start_thread () at /lib64/libpthread.so.0
#21 0x00007f6d6b7b8763 in clone () at /lib64/libc.so.6

Thanks,
Sergio.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
  2020-12-17 13:06                       ` Kevin Wolf
  2020-12-17 13:27                         ` Sergio Lopez
@ 2020-12-17 14:01                         ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 25+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-12-17 14:01 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, Sergio Lopez, qemu-block,
	Michael S. Tsirkin, Paul Durrant, qemu-devel, Max Reitz,
	Stefan Hajnoczi, xen-devel, Anthony Perard, Paolo Bonzini

17.12.2020 16:06, Kevin Wolf wrote:
> Am 17.12.2020 um 13:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 17.12.2020 13:58, Kevin Wolf wrote:
>>> Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
>>>> On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
>>>>> Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
>>>>>> On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
>>>>>>> Anyway, trying to reconstruct the block graph with BdrvChild pointers
>>>>>>> annotated at the edges:
>>>>>>>
>>>>>>> BlockBackend
>>>>>>>         |
>>>>>>>         v
>>>>>>>     backup-top ------------------------+
>>>>>>>         |   |                          |
>>>>>>>         |   +-----------------------+  |
>>>>>>>         |            0x5655068b8510 |  | 0x565505e3c450
>>>>>>>         |                           |  |
>>>>>>>         | 0x565505e42090            |  |
>>>>>>>         v                           |  |
>>>>>>>       qcow2 ---------------------+  |  |
>>>>>>>         |                        |  |  |
>>>>>>>         | 0x565505e52060         |  |  | ??? [1]
>>>>>>>         |                        |  |  |  |
>>>>>>>         v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
>>>>>>>       file                       v  v  v  v
>>>>>>>                                qcow2 (backing)
>>>>>>>                                       |
>>>>>>>                                       | 0x565505e41d20
>>>>>>>                                       v
>>>>>>>                                     file
>>>>>>>
>>>>>>> [1] This seems to be a BdrvChild with a non-BDS parent. Probably a
>>>>>>>       BdrvChild directly owned by the backup job.
>>>>>>>
>>>>>>>> So it seems this is happening:
>>>>>>>>
>>>>>>>> backup-top (5e48030) <---------| (5)
>>>>>>>>      |    |                      |
>>>>>>>>      |    | (6) ------------> qcow2 (5fbf660)
>>>>>>>>      |                           ^    |
>>>>>>>>      |                       (3) |    | (4)
>>>>>>>>      |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
>>>>>>>>      |
>>>>>>>>      |-> (2) file (5e52060)
>>>>>>>>
>>>>>>>> backup-top (5e48030), the BDS that was passed as argument in the first
>>>>>>>> bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
>>>>>>>> is processing its parents, and the latter is also re-entered when the
>>>>>>>> first one starts processing its children again.
>>>>>>>
>>>>>>> Yes, but look at the BdrvChild pointers, it is through different edges
>>>>>>> that we come back to the same node. No BdrvChild is used twice.
>>>>>>>
>>>>>>> If backup-top had added all of its children to the ignore list before
>>>>>>> calling into the overlay qcow2, the backing qcow2 wouldn't eventually
>>>>>>> have called back into backup-top.
>>>>>>
>>>>>> I've tested a patch that first adds every child to the ignore list,
>>>>>> and then processes those that weren't there before, as you suggested
>>>>>> on a previous email. With that, the offending qcow2 is not re-entered,
>>>>>> so we avoid the crash, but backup-top is still entered twice:
>>>>>
>>>>> I think we also need to add every parent to the ignore list before calling
>>>>> callbacks, though it doesn't look like this is the problem you're
>>>>> currently seeing.
>>>>
>>>> I agree.
>>>>
>>>>>> bs=0x560db0e3b030 (backup-top) enter
>>>>>> bs=0x560db0e3b030 (backup-top) processing children
>>>>>> bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
>>>>>> bs=0x560db0fb2660 (qcow2) enter
>>>>>> bs=0x560db0fb2660 (qcow2) processing children
>>>>>> bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
>>>>>> bs=0x560db1bb3c00 (file) enter
>>>>>> bs=0x560db1bb3c00 (file) processing children
>>>>>> bs=0x560db1bb3c00 (file) processing parents
>>>>>> bs=0x560db1bb3c00 (file) processing itself
>>>>>> bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
>>>>>> bs=0x560db0e50420 (qcow2) enter
>>>>>> bs=0x560db0e50420 (qcow2) processing children
>>>>>> bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
>>>>>> bs=0x560db0e45060 (file) enter
>>>>>> bs=0x560db0e45060 (file) processing children
>>>>>> bs=0x560db0e45060 (file) processing parents
>>>>>> bs=0x560db0e45060 (file) processing itself
>>>>>> bs=0x560db0e50420 (qcow2) processing parents
>>>>>> bs=0x560db0e50420 (qcow2) processing itself
>>>>>> bs=0x560db0fb2660 (qcow2) processing parents
>>>>>> bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
>>>>>> bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
>>>>>> bs=0x560db0e3b030 (backup-top) enter
>>>>>> bs=0x560db0e3b030 (backup-top) processing children
>>>>>> bs=0x560db0e3b030 (backup-top) processing parents
>>>>>> bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
>>>>>> bs=0x560db0e3b030 (backup-top) processing itself
>>>>>> bs=0x560db0fb2660 (qcow2) processing itself
>>>>>> bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
>>>>>> bs=0x560db0e50420 (qcow2) enter
>>>>>> bs=0x560db0e3b030 (backup-top) processing parents
>>>>>> bs=0x560db0e3b030 (backup-top) processing itself
>>>>>>
>>>>>> I see that "blk_do_set_aio_context()" passes "blk->root" to
>>>>>> "bdrv_child_try_set_aio_context()" so it's already in the ignore list,
>>>>>> so I'm not sure what's happening here. Is backup-top referenced
>>>>>> from two different BdrvChild or is "blk->root" not pointing to
>>>>>> backup-top's BDS?
>>>>>
>>>>> The second time that backup-top is entered, it is not as the BDS of
>>>>> blk->root, but as the parent node of the overlay qcow2. Which is
>>>>> interesting, because last time it was still the backing qcow2, so the
>>>>> change did have _some_ effect.
>>>>>
>>>>> The part that I don't understand is why you still get the line with
>>>>> child=0x560db1b14a20, because when you add all children to the ignore
>>>>> list first, that should have been put into the ignore list as one of the
>>>>> first things in the whole process (when backup-top was first entered).
>>>>>
>>>>> Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
>>>>> but isn't actually present in backup-top's bs->children?
>>>>
>>>> Exactly, that line corresponds to this chunk of code:
>>>>
>>>> <---- begin ---->
>>>>       QLIST_FOREACH(child, &bs->parents, next_parent) {
>>>>           if (g_slist_find(*ignore, child)) {
>>>>               continue;
>>>>           }
>>>>           assert(child->klass->set_aio_ctx);
>>>>           *ignore = g_slist_prepend(*ignore, child);
>>>>           fprintf(stderr, "bs=%p (%s) calling set_aio_ctx child=%p\n", bs, bs->drv->format_name, child);
>>>>           child->klass->set_aio_ctx(child, new_context, ignore);
>>>>       }
>>>> <---- end ---->
>>>>
>>>> Do you think it's safe to re-enter backup-top, or should we look for a
>>>> way to avoid this?
>>>
>>> I think it should be avoided, but I don't understand why putting all
>>> children of backup-top into the ignore list doesn't already avoid it. If
>>> backup-top is in the parents list of qcow2, then qcow2 should be in the
>>> children list of backup-top and therefore the BdrvChild should already
>>> be in the ignore list.
>>>
>>> The only way I can explain this is that backup-top and qcow2 have
>>> different ideas about which BdrvChild objects exist that connect them.
>>> Or that the graph changes between both places, but I don't see how that
>>> could happen in bdrv_set_aio_context_ignore().
>>>
>>
>> bdrv_set_aio_context_ignore() does bdrv_drained_begin(). As I reported
>> recently, nothing prevents a job from finishing and modifying the graph
>> during another drained section. That may be the case here.
> 
> Good point, this might be the same bug then.
> 
> If everything worked correctly, a job completion could only happen on
> the outer bdrv_set_aio_context_ignore(). But after that, we are already
> in a drain section, so the job should be quiesced and a second drain
> shouldn't cause any additional graph changes.
> 
> I would have to go back to the other discussion, but I think it was
> related to block jobs that are already in the completion process and
> keep moving forward even though they are supposed to be quiesced.
> 
> If I remember correctly, actually pausing them at this point looked
> difficult. Maybe what we should then do is letting .drained_poll return
> true until they have actually fully completed?
> 
> Ah, but was this something that would deadlock because the job
> completion callbacks use drain sections themselves?

Hmm, I've recently sent a good example of a deadlock in the email "aio-poll dead-lock".

I don't have a better idea than moving all graph modifications (together with
their corresponding drained sections) into coroutines and protecting them with
a global coroutine mutex.
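
A very rough shape of that idea (all names here are hypothetical; this only
shows the locking pattern, not a worked-out design):

    typedef struct GraphUpdate {
        BlockDriverState *bs;
        void (*apply)(struct GraphUpdate *u);   /* the actual graph change */
    } GraphUpdate;

    static CoMutex graph_update_lock;   /* initialised once with qemu_co_mutex_init() */

    /* Hypothetical: run a graph modification from coroutine context, holding
     * one global coroutine mutex around its drained section. */
    static void coroutine_fn graph_update_co(void *opaque)
    {
        GraphUpdate *u = opaque;

        qemu_co_mutex_lock(&graph_update_lock);
        bdrv_drained_begin(u->bs);
        u->apply(u);
        bdrv_drained_end(u->bs);
        qemu_co_mutex_unlock(&graph_update_lock);
    }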

> 
>> If backup-top is involved, I suppose the graph modification happens in
>> backup_clean, when we remove the filter. Who is calling
>> set_aio_context in this case? I mean, what is the backtrace of
>> bdrv_set_aio_context_ignore()?
> 
> Sergio, can you provide the backtrace and also test if the theory with a
> job completion in the middle of the process is what you actually hit?
> 
> Kevin
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: Close block exports in two steps
  2020-12-15 15:34   ` Kevin Wolf
  2020-12-15 17:26     ` Sergio Lopez
@ 2020-12-21 17:07     ` Sergio Lopez
  1 sibling, 0 replies; 25+ messages in thread
From: Sergio Lopez @ 2020-12-21 17:07 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, Stefano Stabellini, qemu-block, Paul Durrant,
	Michael S. Tsirkin, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

On Tue, Dec 15, 2020 at 04:34:05PM +0100, Kevin Wolf wrote:
> Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben:
> > There's a cross-dependency between closing the block exports and
> > draining the block layer. The latter needs that we close all export's
> > client connections to ensure they won't queue more requests, but the
> > exports may have coroutines yielding in the block layer, which implies
> > they can't be fully closed until we drain it.
> 
> A coroutine that yielded must have some way to be reentered. So I guess
> the question becomes why they aren't reentered until drain. We do
> process events:
> 
>     AIO_WAIT_WHILE(NULL, blk_exp_has_type(type));
> 
> So in theory, anything that would finalise the block export closing
> should still execute.
> 
> What is the difference that drain makes compared to a simple
> AIO_WAIT_WHILE, so that coroutine are reentered during drain, but not
> during AIO_WAIT_WHILE?
> 
> This is an even more interesting question because the NBD server isn't a
> block node nor a BdrvChildClass implementation, so it shouldn't even
> notice a drain operation.

OK, I took a deeper dive into the issue. While shutting down the guest,
some coroutines from the NBD server are stuck here:

nbd_trip
 nbd_handle_request
  nbd_do_cmd_read
   nbd_co_send_sparse_read
    blk_pread
     blk_prw
      blk_read_entry
       blk_do_preadv
        blk_wait_while_drained
         qemu_co_queue_wait

This happens because bdrv_close_all() is called after
bdrv_drain_all_begin(), so all block backends are quiesced.

An alternative approach to this patch would be moving
blk_exp_close_all() to vl.c:qemu_cleanup, before
bdrv_drain_all_begin().

Do you have a preference for one of these options?
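
For reference, the second option would amount to something like this in
vl.c:qemu_cleanup() (a sketch; the real function contains many more steps):

    void qemu_cleanup(void)
    {
        /* ... */

        /* Close export client connections first, so their coroutines are
         * not left waiting in blk_wait_while_drained() once everything is
         * drained. */
        blk_exp_close_all();

        bdrv_drain_all_begin();
        bdrv_close_all();

        /* ... */
    }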

Thanks,
Sergio.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch
  2020-12-14 17:05 [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
                   ` (3 preceding siblings ...)
  2020-12-14 17:05 ` [PATCH v2 4/4] block: Close block exports in two steps Sergio Lopez
@ 2021-01-20 20:49 ` Eric Blake
  2021-01-21  5:57   ` Sergio Lopez
  4 siblings, 1 reply; 25+ messages in thread
From: Eric Blake @ 2021-01-20 20:49 UTC (permalink / raw)
  To: Sergio Lopez, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, Stefano Stabellini, qemu-block,
	Paul Durrant, Michael S. Tsirkin, Max Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Anthony Perard, xen-devel

On 12/14/20 11:05 AM, Sergio Lopez wrote:
> This series allows the NBD server to properly switch between AIO contexts,
> having quiesced recv_coroutine and send_coroutine before doing the transition.
> 
> We need this because we send back devices running in IO Thread owned contexts
> to the main context when stopping the data plane, something that can happen
> multiple times during the lifetime of a VM (usually during the boot sequence or
> on a reboot), and we drag the NBD server of the correspoing export with it.
> 
> While there, fix also a problem caused by a cross-dependency between
> closing the export's client connections and draining the block
> layer. The visible effect of this problem was QEMU getting hung when
> the guest request a power off while there's an active NBD client.
> 
> v2:
>  - Replace "virtio-blk: Acquire context while switching them on
>  dataplane start" with "block: Honor blk_set_aio_context() context
>  requirements" (Kevin Wolf)
>  - Add "block: Avoid processing BDS twice in
>  bdrv_set_aio_context_ignore()"
>  - Add "block: Close block exports in two steps"
>  - Rename nbd_read_eof() to nbd_server_read_eof() (Eric Blake)
>  - Fix double space and typo in comment. (Eric Blake)

ping - where do we stand on this series?  Patches 1 and 3 have positive
reviews, I'll queue them now; patches 2 and 4 had enough comments that
I'm guessing I should wait for a v3?


> 
> Sergio Lopez (4):
>   block: Honor blk_set_aio_context() context requirements
>   block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
>   nbd/server: Quiesce coroutines on context switch
>   block: Close block exports in two steps
> 
>  block.c                         |  27 ++++++-
>  block/export/export.c           |  10 +--
>  blockdev-nbd.c                  |   2 +-
>  hw/block/dataplane/virtio-blk.c |   4 ++
>  hw/block/dataplane/xen-block.c  |   7 +-
>  hw/scsi/virtio-scsi.c           |   6 +-
>  include/block/export.h          |   4 +-
>  nbd/server.c                    | 120 ++++++++++++++++++++++++++++----
>  qemu-nbd.c                      |   2 +-
>  stubs/blk-exp-close-all.c       |   2 +-
>  10 files changed, 156 insertions(+), 28 deletions(-)
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch
  2021-01-20 20:49 ` [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Eric Blake
@ 2021-01-21  5:57   ` Sergio Lopez
  0 siblings, 0 replies; 25+ messages in thread
From: Sergio Lopez @ 2021-01-21  5:57 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Fam Zheng, Stefano Stabellini, qemu-block,
	Paul Durrant, Michael S. Tsirkin, qemu-devel, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, Anthony Perard, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2850 bytes --]

On Wed, Jan 20, 2021 at 02:49:14PM -0600, Eric Blake wrote:
> On 12/14/20 11:05 AM, Sergio Lopez wrote:
> > This series allows the NBD server to properly switch between AIO contexts,
> > having quiesced recv_coroutine and send_coroutine before doing the transition.
> > 
> > We need this because we send back devices running in IO Thread owned contexts
> > to the main context when stopping the data plane, something that can happen
> > multiple times during the lifetime of a VM (usually during the boot sequence or
> > on a reboot), and we drag the NBD server of the correspoing export with it.
> > 
> > While there, fix also a problem caused by a cross-dependency between
> > closing the export's client connections and draining the block
> > layer. The visible effect of this problem was QEMU getting hung when
> > the guest request a power off while there's an active NBD client.
> > 
> > v2:
> >  - Replace "virtio-blk: Acquire context while switching them on
> >  dataplane start" with "block: Honor blk_set_aio_context() context
> >  requirements" (Kevin Wolf)
> >  - Add "block: Avoid processing BDS twice in
> >  bdrv_set_aio_context_ignore()"
> >  - Add "block: Close block exports in two steps"
> >  - Rename nbd_read_eof() to nbd_server_read_eof() (Eric Blake)
> >  - Fix double space and typo in comment. (Eric Blake)
> 
> ping - where do we stand on this series?  Patches 1 and 3 have positive
> reviews, I'll queue them now; patches 2 and 4 had enough comments that
> I'm guessing I should wait for a v3?

Yes, I have a v3 almost ready and will send it out today. I think it'd
be better to pull all four patches at the same time, as "block: Honor
blk_set_aio_context() context requirements" may cause trouble without
the patch that avoids double processing in
"bdrv_set_aio_context_ignore()".

Thanks,
Sergio.
 
> > 
> > Sergio Lopez (4):
> >   block: Honor blk_set_aio_context() context requirements
> >   block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()
> >   nbd/server: Quiesce coroutines on context switch
> >   block: Close block exports in two steps
> > 
> >  block.c                         |  27 ++++++-
> >  block/export/export.c           |  10 +--
> >  blockdev-nbd.c                  |   2 +-
> >  hw/block/dataplane/virtio-blk.c |   4 ++
> >  hw/block/dataplane/xen-block.c  |   7 +-
> >  hw/scsi/virtio-scsi.c           |   6 +-
> >  include/block/export.h          |   4 +-
> >  nbd/server.c                    | 120 ++++++++++++++++++++++++++++----
> >  qemu-nbd.c                      |   2 +-
> >  stubs/blk-exp-close-all.c       |   2 +-
> >  10 files changed, 156 insertions(+), 28 deletions(-)
> > 
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3226
> Virtualization:  qemu.org | libvirt.org
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-01-21  5:59 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-14 17:05 [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
2020-12-14 17:05 ` [PATCH v2 1/4] block: Honor blk_set_aio_context() context requirements Sergio Lopez
2020-12-15 11:58   ` Kevin Wolf
2020-12-14 17:05 ` [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore() Sergio Lopez
2020-12-15 12:12   ` Kevin Wolf
2020-12-15 13:15     ` Sergio Lopez
2020-12-15 15:01       ` Kevin Wolf
2020-12-15 17:23         ` Sergio Lopez
2020-12-16 12:35           ` Kevin Wolf
2020-12-16 14:55             ` Sergio Lopez
2020-12-16 18:31               ` Kevin Wolf
2020-12-17  9:37                 ` Sergio Lopez
2020-12-17 10:58                   ` Kevin Wolf
2020-12-17 12:50                     ` Vladimir Sementsov-Ogievskiy
2020-12-17 13:06                       ` Kevin Wolf
2020-12-17 13:27                         ` Sergio Lopez
2020-12-17 14:01                         ` Vladimir Sementsov-Ogievskiy
2020-12-17 13:09                     ` Sergio Lopez
2020-12-14 17:05 ` [PATCH v2 3/4] nbd/server: Quiesce coroutines on context switch Sergio Lopez
2020-12-14 17:05 ` [PATCH v2 4/4] block: Close block exports in two steps Sergio Lopez
2020-12-15 15:34   ` Kevin Wolf
2020-12-15 17:26     ` Sergio Lopez
2020-12-21 17:07     ` Sergio Lopez
2021-01-20 20:49 ` [PATCH v2 0/4] nbd/server: Quiesce coroutines on context switch Eric Blake
2021-01-21  5:57   ` Sergio Lopez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).