[PATCH v6 0/5] block/nbd: drop connection

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v6 0/5] block/nbd: drop connection_co
@ 2021-09-02 10:38 Vladimir Sementsov-Ogievskiy
  2021-09-02 10:38 ` [PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally Vladimir Sementsov-Ogievskiy
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-02 10:38 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, hreitz, kwolf, vsementsov, eblake, rkagan

v6:
01,02: add Eric's r-b
03: make new interface clearer
04,05: rebased on updated 03

Vladimir Sementsov-Ogievskiy (5):
  block/nbd: nbd_channel_error() shutdown channel unconditionally
  block/nbd: move nbd_recv_coroutines_wake_all() up
  block/nbd: refactor nbd_recv_coroutines_wake_all()
  block/nbd: drop connection_co
  block/nbd: check that received handle is valid

 block/nbd.c  | 412 +++++++++++++++------------------------------------
 nbd/client.c |   2 -
 2 files changed, 121 insertions(+), 293 deletions(-)

-- 
2.29.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally
  2021-09-02 10:38 [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
@ 2021-09-02 10:38 ` Vladimir Sementsov-Ogievskiy
  2021-09-03 16:16   ` Eric Blake
  2021-09-02 10:38 ` [PATCH v6 2/5] block/nbd: move nbd_recv_coroutines_wake_all() up Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-02 10:38 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, hreitz, kwolf, vsementsov, eblake, rkagan

Don't rely on connection being totally broken in case of -EIO. More
safe and correct just shutdown the channel anyway, as we change the
state and going to reconnect.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/nbd.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index f6ff1c4fb4..d88f4b954c 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -129,15 +129,16 @@ static bool nbd_client_connected(BDRVNBDState *s)
 
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
+    if (nbd_client_connected(s)) {
+        qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+    }
+
     if (ret == -EIO) {
         if (nbd_client_connected(s)) {
             s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
                                             NBD_CLIENT_CONNECTING_NOWAIT;
         }
     } else {
-        if (nbd_client_connected(s)) {
-            qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
-        }
         s->state = NBD_CLIENT_QUIT;
     }
 }
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 2/5] block/nbd: move nbd_recv_coroutines_wake_all() up
  2021-09-02 10:38 [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
  2021-09-02 10:38 ` [PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally Vladimir Sementsov-Ogievskiy
@ 2021-09-02 10:38 ` Vladimir Sementsov-Ogievskiy
  2021-09-02 10:38 ` [PATCH v6 3/5] block/nbd: refactor nbd_recv_coroutines_wake_all() Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-02 10:38 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, hreitz, kwolf, vsementsov, eblake, rkagan

We are going to use it in nbd_channel_error(), so move it up. Note,
that we are going also refactor and rename
nbd_recv_coroutines_wake_all() in future anyway, so keeping it where it
is and making forward declaration doesn't make real sense.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/nbd.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index d88f4b954c..32e3826ba2 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -127,6 +127,20 @@ static bool nbd_client_connected(BDRVNBDState *s)
     return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTED;
 }
 
+static void nbd_recv_coroutines_wake_all(BDRVNBDState *s)
+{
+    int i;
+
+    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
+        NBDClientRequest *req = &s->requests[i];
+
+        if (req->coroutine && req->receiving) {
+            req->receiving = false;
+            aio_co_wake(req->coroutine);
+        }
+    }
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
     if (nbd_client_connected(s)) {
@@ -143,20 +157,6 @@ static void nbd_channel_error(BDRVNBDState *s, int ret)
     }
 }
 
-static void nbd_recv_coroutines_wake_all(BDRVNBDState *s)
-{
-    int i;
-
-    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-        NBDClientRequest *req = &s->requests[i];
-
-        if (req->coroutine && req->receiving) {
-            req->receiving = false;
-            aio_co_wake(req->coroutine);
-        }
-    }
-}
-
 static void reconnect_delay_timer_del(BDRVNBDState *s)
 {
     if (s->reconnect_delay_timer) {
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 3/5] block/nbd: refactor nbd_recv_coroutines_wake_all()
  2021-09-02 10:38 [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
  2021-09-02 10:38 ` [PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally Vladimir Sementsov-Ogievskiy
  2021-09-02 10:38 ` [PATCH v6 2/5] block/nbd: move nbd_recv_coroutines_wake_all() up Vladimir Sementsov-Ogievskiy
@ 2021-09-02 10:38 ` Vladimir Sementsov-Ogievskiy
  2021-09-03 17:36   ` Eric Blake
  2021-09-02 10:38 ` [PATCH v6 4/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-02 10:38 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, hreitz, kwolf, vsementsov, eblake, rkagan

Split out nbd_recv_coroutine_wake_one(), as it will be used in
separate.
Rename the function and add a possibility to wake only first found
sleeping coroutine.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 32e3826ba2..52b0733684 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -127,16 +127,24 @@ static bool nbd_client_connected(BDRVNBDState *s)
     return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTED;
 }
 
-static void nbd_recv_coroutines_wake_all(BDRVNBDState *s)
+static bool nbd_recv_coroutine_wake_one(NBDClientRequest *req)
+{
+    if (req->receiving) {
+        req->receiving = false;
+        aio_co_wake(req->coroutine);
+        return true;
+    }
+
+    return false;
+}
+
+static void nbd_recv_coroutines_wake(BDRVNBDState *s, bool all)
 {
     int i;
 
     for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-        NBDClientRequest *req = &s->requests[i];
-
-        if (req->coroutine && req->receiving) {
-            req->receiving = false;
-            aio_co_wake(req->coroutine);
+        if (nbd_recv_coroutine_wake_one(&s->requests[i]) && !all) {
+            return;
         }
     }
 }
@@ -415,7 +423,7 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
 
     while (s->in_flight > 0) {
         qemu_co_mutex_unlock(&s->send_mutex);
-        nbd_recv_coroutines_wake_all(s);
+        nbd_recv_coroutines_wake(s, true);
         s->wait_in_flight = true;
         qemu_coroutine_yield();
         s->wait_in_flight = false;
@@ -558,7 +566,7 @@ static coroutine_fn void nbd_connection_entry(void *opaque)
     }
 
     qemu_co_queue_restart_all(&s->free_sema);
-    nbd_recv_coroutines_wake_all(s);
+    nbd_recv_coroutines_wake(s, true);
     bdrv_dec_in_flight(s->bs);
 
     s->connection_co = NULL;
@@ -1035,7 +1043,7 @@ static coroutine_fn int nbd_co_receive_one_chunk(
     if (s->connection_co && !s->wait_in_flight) {
         /*
          * We must check s->wait_in_flight, because we may entered by
-         * nbd_recv_coroutines_wake_all(), in this case we should not
+         * nbd_recv_coroutines_wake(), in this case we should not
          * wake connection_co here, it will woken by last request.
          */
         aio_co_wake(s->connection_co);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 4/5] block/nbd: drop connection_co
  2021-09-02 10:38 [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2021-09-02 10:38 ` [PATCH v6 3/5] block/nbd: refactor nbd_recv_coroutines_wake_all() Vladimir Sementsov-Ogievskiy
@ 2021-09-02 10:38 ` Vladimir Sementsov-Ogievskiy
  2021-09-24 17:44   ` Eric Blake
  2021-09-02 10:38 ` [PATCH v6 5/5] block/nbd: check that received handle is valid Vladimir Sementsov-Ogievskiy
  2021-09-22  7:48 ` [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
  5 siblings, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-02 10:38 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, hreitz, kwolf, vsementsov, eblake, rkagan

OK, that's a big rewrite of the logic.

Pre-patch we have an always running coroutine - connection_co. It does
reply receiving and reconnecting. And it leads to a lot of difficult
and unobvious code around drained sections and context switch. We also
abuse bs->in_flight counter which is increased for connection_co and
temporary decreased in points where we want to allow drained section to
begin. One of these place is in another file: in nbd_read_eof() in
nbd/client.c.

We also cancel reconnect and requests waiting for reconnect on drained
begin which is not correct. And this patch fixes that.

Let's finally drop this always running coroutine and go another way:
do both reconnect and receiving in request coroutines.

The detailed list of changes below (in the sequence of diff hunks).

1. receiving coroutines are woken directly from nbd_channel_error, when
   we change s->state

2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
   and in nbd_teardown_connection() all requests should already be
   finished (and reconnect is done from request). So
   nbd_co_establish_connection_cancel() is called from
   nbd_cancel_in_flight() (to cancel the request that is doing
   nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
   (previously we didn't need it, as reconnect delay only should cancel
   active requests not the reconnection itself. But now reconnection
   itself is done in the separate thread (we now call
   nbd_client_connection_enable_retry() in nbd_open()), and we need to
   cancel the requests that waits in nbd_co_establish_connection()
   now).

2. We do receive headers in request coroutine. But we also should
   dispatch replies for another pending requests. So,
   nbd_connection_entry() is turned into nbd_receive_replies(), which
   does reply dispatching until it receive another request headers, and
   returns when it receive the requested header.

3. All old staff around drained sections and context switch is dropped.
   In details:
   - we don't need to move connection_co to new aio context, as we
     don't have connection_co anymore
   - we don't have a fake "request" of connection_co (extra increasing
     in_flight), so don't care with it in drain_begin/end
   - we don't stop reconnection during drained section anymore. This
     means that drain_begin may wait for a long time (up to
     reconnect_delay). But that's an improvement and more correct
     behavior see below[*]

4. In nbd_teardown_connection() we don't have to wait for
   connection_co, as it is dropped. And cleanup for s->ioc and nbd_yank
   is moved here from removed connection_co.

5. In nbd_co_do_establish_connection() we now should handle
   NBD_CLIENT_CONNECTING_NOWAIT: if new request comes when we are in
   NBD_CLIENT_CONNECTING_NOWAIT, it still should call
   nbd_co_establish_connection() (who knows, maybe connection already
   established by thread in background). But we shouldn't wait: if
   nbd_co_establish_connection() can't return new channel immediately
   the request should fail (we are in NBD_CLIENT_CONNECTING_NOWAIT
   state).

6. nbd_reconnect_attempt() is simplified: it's now easier to wait for
   other requests in the caller, so here we just assert that fact.
   Also delay time is now initialized here: we can easily detect first
   attempt and start a timer.

7. nbd_co_reconnect_loop() is dropped, we don't need it. Reconnect
   retries are fully handle by thread (nbd/client-connection.c), delay
   timer we initialize in nbd_reconnect_attempt(), we don't have to
   bother with s->drained and friends. nbd_reconnect_attempt() now
   called from nbd_co_send_request().

8. nbd_connection_entry is dropped: reconnect is now handled by
   nbd_co_send_request(), receiving reply is now handled by
   nbd_receive_replies(): all handled from request coroutines.

9. So, welcome new nbd_receive_replies() called from request coroutine,
   that receives reply header instead of nbd_connection_entry().
   Like with sending requests, only one coroutine may receive in a
   moment. So we introduce receive_mutex, which is locked around
   nbd_receive_reply(). It also protects some related fields. Still,
   full audit of thread-safety in nbd driver is a separate task.
   New function waits for a reply with specified handle being received
   and works rather simple:

   Under mutex:
     - if current handle is 0, do receive by hand. If another handle
       received - switch to other request coroutine, release mutex and
       yield. Otherwise return success
     - if current handle == requested handle, we are done
     - otherwise, release mutex and yield

10: in nbd_co_send_request() we now do nbd_reconnect_attempt() if
    needed. Also waiting in free_sema queue we now wait for one of two
    conditions:
    - connectED, in_flight < MAX_NBD_REQUESTS (so we can start new one)
    - connectING, in_flight == 0, so we can call
      nbd_reconnect_attempt()
    And this logic is protected by s->send_mutex

    Also, on failure we don't have to care of removed s->connection_co

11. nbd_co_do_receive_one_chunk(): now instead of yield() and wait for
    s->connection_co we just call new nbd_receive_replies().

12. nbd_co_receive_one_chunk(): place where s->reply.handle becomes 0,
    which means that handling of the whole reply is finished. Here we
    need to wake one of coroutines sleeping in nbd_receive_replies().
    If now one sleeps - do nothing. That's another behavior change: we
    don't have endless recv() in the idle time. It may be considered as
    a drawback. If so, it may be fixed later.

13. nbd_reply_chunk_iter_receive(): don't care about removed
    connection_co, just ping in_flight waiters.

14. Don't create connection_co, enable retry in the connection thread
    (we don't have own reconnect loop anymore)

15. We need now nbd_co_establish_connection_cancel() call in
    nbd_cancel_in_flight(), to cancel the request that doing connection
    attempt.

[*], ok, now we don't cancel reconnect on drain begin. That's correct:
    reconnect feature leads to possibility of long-running requests (up
    to reconnect delay). Still, drain begin is not a reason to kill
    long requests. We should wait for them.

    This also means, that we can again reproduce a dead-lock, described
    in 8c517de24a8a1dcbeb54e7e12b5b0fda42a90ace.
    Why we are OK with it:
    1. Now this is not absolutely-dead dead-lock: the vm is unfrozen
       after reconnect delay. Actually 8c517de24a8a1dc fixed a bug in
       NBD logic, that was not described in 8c517de24a8a1dc and led to
       forever dead-lock. The problem was that nobody woken free_sema
       queue, but drain_begin can't finish until there is a request in
       free_sema queue. Now we have a reconnect delay timer that works
       well.
    2. It's not a problem of NBD driver, it's a problem of ide code,
       that does drain_begin under global mutex
    3. That doesn't reproduce if chose scsi instead of ide.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd.c  | 374 ++++++++++++++-------------------------------------
 nbd/client.c |   2 -
 2 files changed, 99 insertions(+), 277 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 52b0733684..170a8c8eeb 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -57,7 +57,7 @@
 typedef struct {
     Coroutine *coroutine;
     uint64_t offset;        /* original offset of the request */
-    bool receiving;         /* waiting for connection_co? */
+    bool receiving;         /* sleeping in the yield in nbd_receive_replies */
 } NBDClientRequest;
 
 typedef enum NBDClientState {
@@ -73,14 +73,10 @@ typedef struct BDRVNBDState {
 
     CoMutex send_mutex;
     CoQueue free_sema;
-    Coroutine *connection_co;
-    Coroutine *teardown_co;
-    QemuCoSleep reconnect_sleep;
-    bool drained;
-    bool wait_drained_end;
+
+    CoMutex receive_mutex;
     int in_flight;
     NBDClientState state;
-    bool wait_in_flight;
 
     QEMUTimer *reconnect_delay_timer;
 
@@ -163,6 +159,8 @@ static void nbd_channel_error(BDRVNBDState *s, int ret)
     } else {
         s->state = NBD_CLIENT_QUIT;
     }
+
+    nbd_recv_coroutines_wake(s, true);
 }
 
 static void reconnect_delay_timer_del(BDRVNBDState *s)
@@ -179,6 +177,7 @@ static void reconnect_delay_timer_cb(void *opaque)
 
     if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
         s->state = NBD_CLIENT_CONNECTING_NOWAIT;
+        nbd_co_establish_connection_cancel(s->conn);
         while (qemu_co_enter_next(&s->free_sema, NULL)) {
             /* Resume all queued requests */
         }
@@ -201,113 +200,21 @@ static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t expire_time_ns)
     timer_mod(s->reconnect_delay_timer, expire_time_ns);
 }
 
-static void nbd_client_detach_aio_context(BlockDriverState *bs)
-{
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-    /* Timer is deleted in nbd_client_co_drain_begin() */
-    assert(!s->reconnect_delay_timer);
-    /*
-     * If reconnect is in progress we may have no ->ioc.  It will be
-     * re-instantiated in the proper aio context once the connection is
-     * reestablished.
-     */
-    if (s->ioc) {
-        qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
-    }
-}
-
-static void nbd_client_attach_aio_context_bh(void *opaque)
-{
-    BlockDriverState *bs = opaque;
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-    if (s->connection_co) {
-        /*
-         * The node is still drained, so we know the coroutine has yielded in
-         * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
-         * it is entered for the first time. Both places are safe for entering
-         * the coroutine.
-         */
-        qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
-    }
-    bdrv_dec_in_flight(bs);
-}
-
-static void nbd_client_attach_aio_context(BlockDriverState *bs,
-                                          AioContext *new_context)
-{
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-    /*
-     * s->connection_co is either yielded from nbd_receive_reply or from
-     * nbd_co_reconnect_loop()
-     */
-    if (nbd_client_connected(s)) {
-        qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
-    }
-
-    bdrv_inc_in_flight(bs);
-
-    /*
-     * Need to wait here for the BH to run because the BH must run while the
-     * node is still drained.
-     */
-    aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh, bs);
-}
-
-static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
-{
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-    s->drained = true;
-    qemu_co_sleep_wake(&s->reconnect_sleep);
-
-    nbd_co_establish_connection_cancel(s->conn);
-
-    reconnect_delay_timer_del(s);
-
-    if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
-        s->state = NBD_CLIENT_CONNECTING_NOWAIT;
-        qemu_co_queue_restart_all(&s->free_sema);
-    }
-}
-
-static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
-{
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-    s->drained = false;
-    if (s->wait_drained_end) {
-        s->wait_drained_end = false;
-        aio_co_wake(s->connection_co);
-    }
-}
-
-
 static void nbd_teardown_connection(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+    assert(!s->in_flight);
+
     if (s->ioc) {
-        /* finish any pending coroutines */
         qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
+                                 nbd_yank, s->bs);
+        object_unref(OBJECT(s->ioc));
+        s->ioc = NULL;
     }
 
     s->state = NBD_CLIENT_QUIT;
-    if (s->connection_co) {
-        qemu_co_sleep_wake(&s->reconnect_sleep);
-        nbd_co_establish_connection_cancel(s->conn);
-    }
-    if (qemu_in_coroutine()) {
-        s->teardown_co = qemu_coroutine_self();
-        /* connection_co resumes us when it terminates */
-        qemu_coroutine_yield();
-        s->teardown_co = NULL;
-    } else {
-        BDRV_POLL_WHILE(bs, s->connection_co);
-    }
-    assert(!s->connection_co);
 }
 
 static bool nbd_client_connecting(BDRVNBDState *s)
@@ -372,10 +279,11 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     int ret;
+    bool blocking = nbd_client_connecting_wait(s);
 
     assert(!s->ioc);
 
-    s->ioc = nbd_co_establish_connection(s->conn, &s->info, true, errp);
+    s->ioc = nbd_co_establish_connection(s->conn, &s->info, blocking, errp);
     if (!s->ioc) {
         return -ECONNREFUSED;
     }
@@ -411,29 +319,22 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
     return 0;
 }
 
+/* called under s->send_mutex */
 static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
 {
-    if (!nbd_client_connecting(s)) {
-        return;
-    }
+    assert(nbd_client_connecting(s));
+    assert(s->in_flight == 0);
 
-    /* Wait for completion of all in-flight requests */
-
-    qemu_co_mutex_lock(&s->send_mutex);
-
-    while (s->in_flight > 0) {
-        qemu_co_mutex_unlock(&s->send_mutex);
-        nbd_recv_coroutines_wake(s, true);
-        s->wait_in_flight = true;
-        qemu_coroutine_yield();
-        s->wait_in_flight = false;
-        qemu_co_mutex_lock(&s->send_mutex);
-    }
-
-    qemu_co_mutex_unlock(&s->send_mutex);
-
-    if (!nbd_client_connecting(s)) {
-        return;
+    if (nbd_client_connecting_wait(s) && s->reconnect_delay &&
+        !s->reconnect_delay_timer)
+    {
+        /*
+         * It's first reconnect attempt after switching to
+         * NBD_CLIENT_CONNECTING_WAIT
+         */
+        reconnect_delay_timer_init(s,
+            qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
+            s->reconnect_delay * NANOSECONDS_PER_SECOND);
     }
 
     /*
@@ -453,135 +354,79 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
     nbd_co_do_establish_connection(s->bs, NULL);
 }
 
-static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState *s)
+static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t handle)
 {
-    uint64_t timeout = 1 * NANOSECONDS_PER_SECOND;
-    uint64_t max_timeout = 16 * NANOSECONDS_PER_SECOND;
+    int ret;
+    uint64_t ind = HANDLE_TO_INDEX(s, handle), ind2;
+    QEMU_LOCK_GUARD(&s->receive_mutex);
 
-    if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
-        reconnect_delay_timer_init(s, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
-                                   s->reconnect_delay * NANOSECONDS_PER_SECOND);
-    }
-
-    nbd_reconnect_attempt(s);
-
-    while (nbd_client_connecting(s)) {
-        if (s->drained) {
-            bdrv_dec_in_flight(s->bs);
-            s->wait_drained_end = true;
-            while (s->drained) {
-                /*
-                 * We may be entered once from nbd_client_attach_aio_context_bh
-                 * and then from nbd_client_co_drain_end. So here is a loop.
-                 */
-                qemu_coroutine_yield();
-            }
-            bdrv_inc_in_flight(s->bs);
-        } else {
-            qemu_co_sleep_ns_wakeable(&s->reconnect_sleep,
-                                      QEMU_CLOCK_REALTIME, timeout);
-            if (s->drained) {
-                continue;
-            }
-            if (timeout < max_timeout) {
-                timeout *= 2;
-            }
+    while (true) {
+        if (s->reply.handle == handle) {
+            /* We are done */
+            return 0;
         }
 
-        nbd_reconnect_attempt(s);
-    }
-
-    reconnect_delay_timer_del(s);
-}
+        if (!nbd_client_connected(s)) {
+            return -EIO;
+        }
 
-static coroutine_fn void nbd_connection_entry(void *opaque)
-{
-    BDRVNBDState *s = opaque;
-    uint64_t i;
-    int ret = 0;
-    Error *local_err = NULL;
+        if (s->reply.handle != 0) {
+            /*
+             * Some other request is being handling now. It should already be
+             * woken by thous who set s->reply.handle (or never wait in this
+             * yield). So, we should not wake it here.
+             */
+            ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
+            assert(!s->requests[ind2].receiving);
 
-    while (qatomic_load_acquire(&s->state) != NBD_CLIENT_QUIT) {
-        /*
-         * The NBD client can only really be considered idle when it has
-         * yielded from qio_channel_readv_all_eof(), waiting for data. This is
-         * the point where the additional scheduled coroutine entry happens
-         * after nbd_client_attach_aio_context().
-         *
-         * Therefore we keep an additional in_flight reference all the time and
-         * only drop it temporarily here.
-         */
+            s->requests[ind].receiving = true;
+            qemu_co_mutex_unlock(&s->receive_mutex);
 
-        if (nbd_client_connecting(s)) {
-            nbd_co_reconnect_loop(s);
-        }
+            qemu_coroutine_yield();
+            /*
+             * We may be woken for 3 reasons:
+             * 1. From this function, executing in parallel coroutine, when our
+             *    handle received.
+             * 2. From nbd_channel_error(), when connection is lost.
+             * 3. From nbd_co_receive_one_chunk(), when previous request is
+             *    finished and s->reply.handle set to 0.
+             * Anyway, it's OK to lock the mutex and go to the next iteration.
+             */
 
-        if (!nbd_client_connected(s)) {
+            qemu_co_mutex_lock(&s->receive_mutex);
+            assert(!s->requests[ind].receiving);
             continue;
         }
 
+        /* We are under mutex and handle is 0. We have to do the dirty work. */
         assert(s->reply.handle == 0);
-        ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, &local_err);
-
-        if (local_err) {
-            trace_nbd_read_reply_entry_fail(ret, error_get_pretty(local_err));
-            error_free(local_err);
-            local_err = NULL;
-        }
+        ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL);
         if (ret <= 0) {
-            nbd_channel_error(s, ret ? ret : -EIO);
-            continue;
+            ret = ret ? ret : -EIO;
+            nbd_channel_error(s, ret);
+            return ret;
         }
-
-        /*
-         * There's no need for a mutex on the receive side, because the
-         * handler acts as a synchronization point and ensures that only
-         * one coroutine is called until the reply finishes.
-         */
-        i = HANDLE_TO_INDEX(s, s->reply.handle);
-        if (i >= MAX_NBD_REQUESTS ||
-            !s->requests[i].coroutine ||
-            !s->requests[i].receiving ||
-            (nbd_reply_is_structured(&s->reply) && !s->info.structured_reply))
-        {
+        if (nbd_reply_is_structured(&s->reply) && !s->info.structured_reply) {
             nbd_channel_error(s, -EINVAL);
-            continue;
+            return -EINVAL;
         }
-
-        /*
-         * We're woken up again by the request itself.  Note that there
-         * is no race between yielding and reentering connection_co.  This
-         * is because:
-         *
-         * - if the request runs on the same AioContext, it is only
-         *   entered after we yield
-         *
-         * - if the request runs on a different AioContext, reentering
-         *   connection_co happens through a bottom half, which can only
-         *   run after we yield.
-         */
-        s->requests[i].receiving = false;
-        aio_co_wake(s->requests[i].coroutine);
-        qemu_coroutine_yield();
-    }
-
-    qemu_co_queue_restart_all(&s->free_sema);
-    nbd_recv_coroutines_wake(s, true);
-    bdrv_dec_in_flight(s->bs);
-
-    s->connection_co = NULL;
-    if (s->ioc) {
-        qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
-        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
-                                 nbd_yank, s->bs);
-        object_unref(OBJECT(s->ioc));
-        s->ioc = NULL;
-    }
-
-    if (s->teardown_co) {
-        aio_co_wake(s->teardown_co);
+        if (s->reply.handle == handle) {
+            /* We are done */
+            return 0;
+        }
+        ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
+        if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].coroutine) {
+            /*
+             * We only check that ind2 request exists. But don't check is it now
+             * waiting for the reply header or not. We can't just check
+             * s->requests[ind2].receiving: ind2 request may wait in trying to
+             * lock receive_mutex. So that's a TODO.
+             */
+            nbd_channel_error(s, -EINVAL);
+            return -EINVAL;
+        }
+        nbd_recv_coroutine_wake_one(&s->requests[ind2]);
     }
-    aio_wait_kick();
 }
 
 static int nbd_co_send_request(BlockDriverState *bs,
@@ -592,10 +437,17 @@ static int nbd_co_send_request(BlockDriverState *bs,
     int rc, i = -1;
 
     qemu_co_mutex_lock(&s->send_mutex);
-    while (s->in_flight == MAX_NBD_REQUESTS || nbd_client_connecting_wait(s)) {
+
+    while (s->in_flight == MAX_NBD_REQUESTS ||
+           (!nbd_client_connected(s) && s->in_flight > 0))
+    {
         qemu_co_queue_wait(&s->free_sema, &s->send_mutex);
     }
 
+    if (nbd_client_connecting(s)) {
+        nbd_reconnect_attempt(s);
+    }
+
     if (!nbd_client_connected(s)) {
         rc = -EIO;
         goto err;
@@ -642,10 +494,6 @@ err:
         if (i != -1) {
             s->requests[i].coroutine = NULL;
             s->in_flight--;
-        }
-        if (s->in_flight == 0 && s->wait_in_flight) {
-            aio_co_wake(s->connection_co);
-        } else {
             qemu_co_queue_next(&s->free_sema);
         }
     }
@@ -944,10 +792,7 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
     }
     *request_ret = 0;
 
-    /* Wait until we're woken up by nbd_connection_entry.  */
-    s->requests[i].receiving = true;
-    qemu_coroutine_yield();
-    assert(!s->requests[i].receiving);
+    nbd_receive_replies(s, handle);
     if (!nbd_client_connected(s)) {
         error_setg(errp, "Connection closed");
         return -EIO;
@@ -1040,14 +885,7 @@ static coroutine_fn int nbd_co_receive_one_chunk(
     }
     s->reply.handle = 0;
 
-    if (s->connection_co && !s->wait_in_flight) {
-        /*
-         * We must check s->wait_in_flight, because we may entered by
-         * nbd_recv_coroutines_wake(), in this case we should not
-         * wake connection_co here, it will woken by last request.
-         */
-        aio_co_wake(s->connection_co);
-    }
+    nbd_recv_coroutines_wake(s, false);
 
     return ret;
 }
@@ -1158,11 +996,7 @@ break_loop:
 
     qemu_co_mutex_lock(&s->send_mutex);
     s->in_flight--;
-    if (s->in_flight == 0 && s->wait_in_flight) {
-        aio_co_wake(s->connection_co);
-    } else {
-        qemu_co_queue_next(&s->free_sema);
-    }
+    qemu_co_queue_next(&s->free_sema);
     qemu_co_mutex_unlock(&s->send_mutex);
 
     return false;
@@ -1978,6 +1812,7 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
     s->bs = bs;
     qemu_co_mutex_init(&s->send_mutex);
     qemu_co_queue_init(&s->free_sema);
+    qemu_co_mutex_init(&s->receive_mutex);
 
     if (!yank_register_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name), errp)) {
         return -EEXIST;
@@ -1992,14 +1827,13 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
                                         s->x_dirty_bitmap, s->tlscreds);
 
     /* TODO: Configurable retry-until-timeout behaviour. */
+    s->state = NBD_CLIENT_CONNECTING_WAIT;
     ret = nbd_do_establish_connection(bs, errp);
     if (ret < 0) {
         goto fail;
     }
 
-    s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
-    bdrv_inc_in_flight(bs);
-    aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
+    nbd_client_connection_enable_retry(s->conn);
 
     return 0;
 
@@ -2153,6 +1987,8 @@ static void nbd_cancel_in_flight(BlockDriverState *bs)
         s->state = NBD_CLIENT_CONNECTING_NOWAIT;
         qemu_co_queue_restart_all(&s->free_sema);
     }
+
+    nbd_co_establish_connection_cancel(s->conn);
 }
 
 static BlockDriver bdrv_nbd = {
@@ -2173,10 +2009,6 @@ static BlockDriver bdrv_nbd = {
     .bdrv_refresh_limits        = nbd_refresh_limits,
     .bdrv_co_truncate           = nbd_co_truncate,
     .bdrv_getlength             = nbd_getlength,
-    .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
-    .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
-    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
-    .bdrv_co_drain_end          = nbd_client_co_drain_end,
     .bdrv_refresh_filename      = nbd_refresh_filename,
     .bdrv_co_block_status       = nbd_client_co_block_status,
     .bdrv_dirname               = nbd_dirname,
@@ -2202,10 +2034,6 @@ static BlockDriver bdrv_nbd_tcp = {
     .bdrv_refresh_limits        = nbd_refresh_limits,
     .bdrv_co_truncate           = nbd_co_truncate,
     .bdrv_getlength             = nbd_getlength,
-    .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
-    .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
-    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
-    .bdrv_co_drain_end          = nbd_client_co_drain_end,
     .bdrv_refresh_filename      = nbd_refresh_filename,
     .bdrv_co_block_status       = nbd_client_co_block_status,
     .bdrv_dirname               = nbd_dirname,
@@ -2231,10 +2059,6 @@ static BlockDriver bdrv_nbd_unix = {
     .bdrv_refresh_limits        = nbd_refresh_limits,
     .bdrv_co_truncate           = nbd_co_truncate,
     .bdrv_getlength             = nbd_getlength,
-    .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
-    .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
-    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
-    .bdrv_co_drain_end          = nbd_client_co_drain_end,
     .bdrv_refresh_filename      = nbd_refresh_filename,
     .bdrv_co_block_status       = nbd_client_co_block_status,
     .bdrv_dirname               = nbd_dirname,
diff --git a/nbd/client.c b/nbd/client.c
index 0c2db4bcba..30d5383cb1 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1434,9 +1434,7 @@ nbd_read_eof(BlockDriverState *bs, QIOChannel *ioc, void *buffer, size_t size,
 
         len = qio_channel_readv(ioc, &iov, 1, errp);
         if (len == QIO_CHANNEL_ERR_BLOCK) {
-            bdrv_dec_in_flight(bs);
             qio_channel_yield(ioc, G_IO_IN);
-            bdrv_inc_in_flight(bs);
             continue;
         } else if (len < 0) {
             return -EIO;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 5/5] block/nbd: check that received handle is valid
  2021-09-02 10:38 [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2021-09-02 10:38 ` [PATCH v6 4/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
@ 2021-09-02 10:38 ` Vladimir Sementsov-Ogievskiy
  2021-09-03 17:54   ` Eric Blake
  2021-09-22  7:48 ` [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
  5 siblings, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-02 10:38 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, hreitz, kwolf, vsementsov, eblake, rkagan

If we don't have active request, that waiting for this handle to be
received, we should report an error.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 170a8c8eeb..306b2de9f2 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -58,6 +58,7 @@ typedef struct {
     Coroutine *coroutine;
     uint64_t offset;        /* original offset of the request */
     bool receiving;         /* sleeping in the yield in nbd_receive_replies */
+    bool reply_possible;    /* reply header not yet received */
 } NBDClientRequest;
 
 typedef enum NBDClientState {
@@ -415,16 +416,11 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t handle)
             return 0;
         }
         ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
-        if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].coroutine) {
-            /*
-             * We only check that ind2 request exists. But don't check is it now
-             * waiting for the reply header or not. We can't just check
-             * s->requests[ind2].receiving: ind2 request may wait in trying to
-             * lock receive_mutex. So that's a TODO.
-             */
+        if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].reply_possible) {
             nbd_channel_error(s, -EINVAL);
             return -EINVAL;
         }
+        s->requests[ind2].reply_possible = false;
         nbd_recv_coroutine_wake_one(&s->requests[ind2]);
     }
 }
@@ -467,6 +463,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
     s->requests[i].coroutine = qemu_coroutine_self();
     s->requests[i].offset = request->from;
     s->requests[i].receiving = false;
+    s->requests[i].reply_possible = true;
 
     request->handle = INDEX_TO_HANDLE(s, i);
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally
  2021-09-02 10:38 ` [PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally Vladimir Sementsov-Ogievskiy
@ 2021-09-03 16:16   ` Eric Blake
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Blake @ 2021-09-03 16:16 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: kwolf, hreitz, qemu-devel, qemu-block, rkagan

On Thu, Sep 02, 2021 at 01:38:01PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Don't rely on connection being totally broken in case of -EIO. More
> safe and correct just shutdown the channel anyway, as we change the
> state and going to reconnect.

If you don't mind me tweaking grammar, I propose:

Safer and more correct is to just shut down the channel anyway, since
we change the state and plan on reconnecting.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  block/nbd.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index f6ff1c4fb4..d88f4b954c 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -129,15 +129,16 @@ static bool nbd_client_connected(BDRVNBDState *s)
>  
>  static void nbd_channel_error(BDRVNBDState *s, int ret)
>  {
> +    if (nbd_client_connected(s)) {
> +        qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +    }
> +
>      if (ret == -EIO) {
>          if (nbd_client_connected(s)) {
>              s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
>                                              NBD_CLIENT_CONNECTING_NOWAIT;
>          }
>      } else {
> -        if (nbd_client_connected(s)) {
> -            qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> -        }
>          s->state = NBD_CLIENT_QUIT;
>      }
>  }
> -- 
> 2.29.2
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 3/5] block/nbd: refactor nbd_recv_coroutines_wake_all()
  2021-09-02 10:38 ` [PATCH v6 3/5] block/nbd: refactor nbd_recv_coroutines_wake_all() Vladimir Sementsov-Ogievskiy
@ 2021-09-03 17:36   ` Eric Blake
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Blake @ 2021-09-03 17:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: kwolf, hreitz, qemu-devel, qemu-block, rkagan

On Thu, Sep 02, 2021 at 01:38:03PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Split out nbd_recv_coroutine_wake_one(), as it will be used in
> separate.
> Rename the function and add a possibility to wake only first found
> sleeping coroutine.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/nbd.c | 26 +++++++++++++++++---------
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 5/5] block/nbd: check that received handle is valid
  2021-09-02 10:38 ` [PATCH v6 5/5] block/nbd: check that received handle is valid Vladimir Sementsov-Ogievskiy
@ 2021-09-03 17:54   ` Eric Blake
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Blake @ 2021-09-03 17:54 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: kwolf, hreitz, qemu-devel, qemu-block, rkagan

On Thu, Sep 02, 2021 at 01:38:05PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> If we don't have active request, that waiting for this handle to be
> received, we should report an error.

If we don't have an active request waiting for this handle to be received,

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/nbd.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)

[I'm still taking my time going through 4/5, but hopefully that
doesn't impact my quicker review here]

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 0/5] block/nbd: drop connection_co
  2021-09-02 10:38 [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2021-09-02 10:38 ` [PATCH v6 5/5] block/nbd: check that received handle is valid Vladimir Sementsov-Ogievskiy
@ 2021-09-22  7:48 ` Vladimir Sementsov-Ogievskiy
  2021-09-22 13:12   ` Eric Blake
  5 siblings, 1 reply; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-22  7:48 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, hreitz, kwolf, eblake, rkagan

ping)

Only one patch is not reviewed. But the biggest one :(

02.09.2021 13:38, Vladimir Sementsov-Ogievskiy wrote:
> v6:
> 01,02: add Eric's r-b
> 03: make new interface clearer
> 04,05: rebased on updated 03
> 
> Vladimir Sementsov-Ogievskiy (5):
>    block/nbd: nbd_channel_error() shutdown channel unconditionally
>    block/nbd: move nbd_recv_coroutines_wake_all() up
>    block/nbd: refactor nbd_recv_coroutines_wake_all()
>    block/nbd: drop connection_co
>    block/nbd: check that received handle is valid
> 
>   block/nbd.c  | 412 +++++++++++++++------------------------------------
>   nbd/client.c |   2 -
>   2 files changed, 121 insertions(+), 293 deletions(-)
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 0/5] block/nbd: drop connection_co
  2021-09-22  7:48 ` [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
@ 2021-09-22 13:12   ` Eric Blake
  2021-09-22 13:49     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Blake @ 2021-09-22 13:12 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: kwolf, hreitz, qemu-devel, qemu-block, rkagan

On Wed, Sep 22, 2021 at 10:48:58AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> ping)
> 
> Only one patch is not reviewed. But the biggest one :(

Yeah, and it's on my list for this week.  I'm overdue for a pull
request, so I want to get this series in if at all possible, which
means I'll get on that (large) review soon.  Meanwhile, the ping is
appreciated!

> 
> 02.09.2021 13:38, Vladimir Sementsov-Ogievskiy wrote:
> > v6:
> > 01,02: add Eric's r-b
> > 03: make new interface clearer
> > 04,05: rebased on updated 03
> > 
> > Vladimir Sementsov-Ogievskiy (5):
> >    block/nbd: nbd_channel_error() shutdown channel unconditionally
> >    block/nbd: move nbd_recv_coroutines_wake_all() up
> >    block/nbd: refactor nbd_recv_coroutines_wake_all()
> >    block/nbd: drop connection_co
> >    block/nbd: check that received handle is valid
> > 
> >   block/nbd.c  | 412 +++++++++++++++------------------------------------
> >   nbd/client.c |   2 -
> >   2 files changed, 121 insertions(+), 293 deletions(-)
> > 
> 
> 
> -- 
> Best regards,
> Vladimir
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 0/5] block/nbd: drop connection_co
  2021-09-22 13:12   ` Eric Blake
@ 2021-09-22 13:49     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-22 13:49 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-block, qemu-devel, hreitz, kwolf, rkagan

22.09.2021 16:12, Eric Blake wrote:
> On Wed, Sep 22, 2021 at 10:48:58AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> ping)
>>
>> Only one patch is not reviewed. But the biggest one :(
> 
> Yeah, and it's on my list for this week.  I'm overdue for a pull
> request, so I want to get this series in if at all possible, which
> means I'll get on that (large) review soon.  Meanwhile, the ping is
> appreciated!
> 

Thanks!

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 4/5] block/nbd: drop connection_co
  2021-09-02 10:38 ` [PATCH v6 4/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
@ 2021-09-24 17:44   ` Eric Blake
  2021-09-27 11:57     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Blake @ 2021-09-24 17:44 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: kwolf, hreitz, qemu-devel, qemu-block, rkagan

On Thu, Sep 02, 2021 at 01:38:04PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> OK, that's a big rewrite of the logic.

And a time-consuming review on my part!

> 
> Pre-patch we have an always running coroutine - connection_co. It does
> reply receiving and reconnecting. And it leads to a lot of difficult
> and unobvious code around drained sections and context switch. We also
> abuse bs->in_flight counter which is increased for connection_co and
> temporary decreased in points where we want to allow drained section to
> begin. One of these place is in another file: in nbd_read_eof() in
> nbd/client.c.
> 
> We also cancel reconnect and requests waiting for reconnect on drained
> begin which is not correct. And this patch fixes that.
> 
> Let's finally drop this always running coroutine and go another way:
> do both reconnect and receiving in request coroutines.
> 
> The detailed list of changes below (in the sequence of diff hunks).

Well, depending on the git order file in use ;)

> 
> 1. receiving coroutines are woken directly from nbd_channel_error, when
>    we change s->state
> 
> 2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
>    and in nbd_teardown_connection() all requests should already be
>    finished (and reconnect is done from request). So
>    nbd_co_establish_connection_cancel() is called from
>    nbd_cancel_in_flight() (to cancel the request that is doing
>    nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
>    (previously we didn't need it, as reconnect delay only should cancel
>    active requests not the reconnection itself. But now reconnection

Missing )

>    itself is done in the separate thread (we now call
>    nbd_client_connection_enable_retry() in nbd_open()), and we need to
>    cancel the requests that waits in nbd_co_establish_connection()

Singular/plural disagreement. I think the intended meaning is
'requests that wait' and not 'request that waits'.

>    now).
> 
> 2. We do receive headers in request coroutine. But we also should

Second point 2; I'll call it 2A below, because it looks related to
point 8.

>    dispatch replies for another pending requests. So,

s/another/other/

>    nbd_connection_entry() is turned into nbd_receive_replies(), which
>    does reply dispatching until it receive another request headers, and

s/until/while/, s/another/other/

>    returns when it receive the requested header.

receives

> 
> 3. All old staff around drained sections and context switch is dropped.
>    In details:
>    - we don't need to move connection_co to new aio context, as we
>      don't have connection_co anymore
>    - we don't have a fake "request" of connection_co (extra increasing
>      in_flight), so don't care with it in drain_begin/end
>    - we don't stop reconnection during drained section anymore. This
>      means that drain_begin may wait for a long time (up to
>      reconnect_delay). But that's an improvement and more correct
>      behavior see below[*]
> 
> 4. In nbd_teardown_connection() we don't have to wait for
>    connection_co, as it is dropped. And cleanup for s->ioc and nbd_yank
>    is moved here from removed connection_co.
> 
> 5. In nbd_co_do_establish_connection() we now should handle
>    NBD_CLIENT_CONNECTING_NOWAIT: if new request comes when we are in
>    NBD_CLIENT_CONNECTING_NOWAIT, it still should call
>    nbd_co_establish_connection() (who knows, maybe connection already
>    established by thread in background). But we shouldn't wait: if

maybe the connection was already established by another thread in the background

>    nbd_co_establish_connection() can't return new channel immediately
>    the request should fail (we are in NBD_CLIENT_CONNECTING_NOWAIT
>    state).
> 
> 6. nbd_reconnect_attempt() is simplified: it's now easier to wait for
>    other requests in the caller, so here we just assert that fact.
>    Also delay time is now initialized here: we can easily detect first
>    attempt and start a timer.
> 
> 7. nbd_co_reconnect_loop() is dropped, we don't need it. Reconnect
>    retries are fully handle by thread (nbd/client-connection.c), delay
>    timer we initialize in nbd_reconnect_attempt(), we don't have to
>    bother with s->drained and friends. nbd_reconnect_attempt() now
>    called from nbd_co_send_request().

A lot going on there, but it's making sense so far.

> 
> 8. nbd_connection_entry is dropped: reconnect is now handled by
>    nbd_co_send_request(), receiving reply is now handled by
>    nbd_receive_replies(): all handled from request coroutines.
> 
> 9. So, welcome new nbd_receive_replies() called from request coroutine,
>    that receives reply header instead of nbd_connection_entry().
>    Like with sending requests, only one coroutine may receive in a
>    moment. So we introduce receive_mutex, which is locked around
>    nbd_receive_reply(). It also protects some related fields. Still,
>    full audit of thread-safety in nbd driver is a separate task.

It sounds like you're more worried about auditing for which locks are
held longer than necessary, rather than expecting to find cases where
we aren't thread-safe because we are lacking locks?  At any rate, as
this patch is already easier to reason about, I'm okay deferring that
audit to later patches.

>    New function waits for a reply with specified handle being received
>    and works rather simple:
> 
>    Under mutex:
>      - if current handle is 0, do receive by hand. If another handle
>        received - switch to other request coroutine, release mutex and
>        yield. Otherwise return success
>      - if current handle == requested handle, we are done
>      - otherwise, release mutex and yield
> 
> 10: in nbd_co_send_request() we now do nbd_reconnect_attempt() if
>     needed. Also waiting in free_sema queue we now wait for one of two
>     conditions:
>     - connectED, in_flight < MAX_NBD_REQUESTS (so we can start new one)
>     - connectING, in_flight == 0, so we can call
>       nbd_reconnect_attempt()
>     And this logic is protected by s->send_mutex
> 
>     Also, on failure we don't have to care of removed s->connection_co
> 
> 11. nbd_co_do_receive_one_chunk(): now instead of yield() and wait for
>     s->connection_co we just call new nbd_receive_replies().
> 
> 12. nbd_co_receive_one_chunk(): place where s->reply.handle becomes 0,
>     which means that handling of the whole reply is finished. Here we
>     need to wake one of coroutines sleeping in nbd_receive_replies().
>     If now one sleeps - do nothing. That's another behavior change: we

s/now one sleeps/none are sleeping/

>     don't have endless recv() in the idle time. It may be considered as
>     a drawback. If so, it may be fixed later.

The case where no coroutine is waiting to receive a reply should only
happen due to a malicious server (a compliant server will only send
replies matching an existing pending request).

> 
> 13. nbd_reply_chunk_iter_receive(): don't care about removed
>     connection_co, just ping in_flight waiters.
> 
> 14. Don't create connection_co, enable retry in the connection thread
>     (we don't have own reconnect loop anymore)
> 
> 15. We need now nbd_co_establish_connection_cancel() call in
>     nbd_cancel_in_flight(), to cancel the request that doing connection
>     attempt.

s/need now/now need to add a/
s/request that doing/request that is doing a/

> 
> [*], ok, now we don't cancel reconnect on drain begin. That's correct:
>     reconnect feature leads to possibility of long-running requests (up
>     to reconnect delay). Still, drain begin is not a reason to kill
>     long requests. We should wait for them.
> 
>     This also means, that we can again reproduce a dead-lock, described
>     in 8c517de24a8a1dcbeb54e7e12b5b0fda42a90ace.
>     Why we are OK with it:
>     1. Now this is not absolutely-dead dead-lock: the vm is unfrozen
>        after reconnect delay. Actually 8c517de24a8a1dc fixed a bug in
>        NBD logic, that was not described in 8c517de24a8a1dc and led to
>        forever dead-lock. The problem was that nobody woken free_sema

s/woken/woke the/

>        queue, but drain_begin can't finish until there is a request in
>        free_sema queue. Now we have a reconnect delay timer that works
>        well.
>     2. It's not a problem of NBD driver, it's a problem of ide code,
>        that does drain_begin under global mutex

I agree that this point means that it is appropriate to address that
in a separate patch series.  Hopefully someone tackles it before 6.2,
but even if not, I agree that the worst that happens is that we have a
command stalled waiting on a long timeout, rather than actual
deadlock.

>     3. That doesn't reproduce if chose scsi instead of ide.

Seems like this sentence fragment fits better as the tail of 2, rather
than as a separate bullet 3?  I'll reword as:

...it's a problem of the ide code, because it does drain_begin under
the global mutex; the problem doesn't reproduce when using scsi
instead of ide.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/nbd.c  | 374 ++++++++++++++-------------------------------------
>  nbd/client.c |   2 -
>  2 files changed, 99 insertions(+), 277 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 52b0733684..170a8c8eeb 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -57,7 +57,7 @@
>  typedef struct {
>      Coroutine *coroutine;
>      uint64_t offset;        /* original offset of the request */
> -    bool receiving;         /* waiting for connection_co? */
> +    bool receiving;         /* sleeping in the yield in nbd_receive_replies */
>  } NBDClientRequest;
>  
>  typedef enum NBDClientState {
> @@ -73,14 +73,10 @@ typedef struct BDRVNBDState {
>  
>      CoMutex send_mutex;
>      CoQueue free_sema;
> -    Coroutine *connection_co;
> -    Coroutine *teardown_co;
> -    QemuCoSleep reconnect_sleep;
> -    bool drained;
> -    bool wait_drained_end;
> +
> +    CoMutex receive_mutex;
>      int in_flight;
>      NBDClientState state;
> -    bool wait_in_flight;

Lots of struct rework, not quite sure which point(s) it falls under.

>  
>      QEMUTimer *reconnect_delay_timer;
>  
> @@ -163,6 +159,8 @@ static void nbd_channel_error(BDRVNBDState *s, int ret)
>      } else {
>          s->state = NBD_CLIENT_QUIT;
>      }
> +
> +    nbd_recv_coroutines_wake(s, true);
>  }

Matches point 1.

>  
>  static void reconnect_delay_timer_del(BDRVNBDState *s)
> @@ -179,6 +177,7 @@ static void reconnect_delay_timer_cb(void *opaque)
>  
>      if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
>          s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> +        nbd_co_establish_connection_cancel(s->conn);
>          while (qemu_co_enter_next(&s->free_sema, NULL)) {
>              /* Resume all queued requests */
>          }

Matches point 2.

> @@ -201,113 +200,21 @@ static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t expire_time_ns)
>      timer_mod(s->reconnect_delay_timer, expire_time_ns);
>  }
>  
> -static void nbd_client_detach_aio_context(BlockDriverState *bs)
> -{

> -static void nbd_client_attach_aio_context_bh(void *opaque)
> -{

> -static void nbd_client_attach_aio_context(BlockDriverState *bs,
> -                                          AioContext *new_context)

> -static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
> -{

> -static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
> -{

Matches point 3.

>  static void nbd_teardown_connection(BlockDriverState *bs)
>  {
>      BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>  
> +    assert(!s->in_flight);
> +
>      if (s->ioc) {
> -        /* finish any pending coroutines */
>          qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
> +                                 nbd_yank, s->bs);
> +        object_unref(OBJECT(s->ioc));
> +        s->ioc = NULL;
>      }
>  
>      s->state = NBD_CLIENT_QUIT;
> -    if (s->connection_co) {
> -        qemu_co_sleep_wake(&s->reconnect_sleep);
> -        nbd_co_establish_connection_cancel(s->conn);
> -    }
> -    if (qemu_in_coroutine()) {
> -        s->teardown_co = qemu_coroutine_self();
> -        /* connection_co resumes us when it terminates */
> -        qemu_coroutine_yield();
> -        s->teardown_co = NULL;
> -    } else {
> -        BDRV_POLL_WHILE(bs, s->connection_co);
> -    }
> -    assert(!s->connection_co);
>  }

Matches point 4.

>  
>  static bool nbd_client_connecting(BDRVNBDState *s)
> @@ -372,10 +279,11 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
>  {
>      BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>      int ret;
> +    bool blocking = nbd_client_connecting_wait(s);
>  
>      assert(!s->ioc);
>  
> -    s->ioc = nbd_co_establish_connection(s->conn, &s->info, true, errp);
> +    s->ioc = nbd_co_establish_connection(s->conn, &s->info, blocking, errp);
>      if (!s->ioc) {
>          return -ECONNREFUSED;
>      }

Point 5, although the details about NBD_CLIENT_CONNECTING_NOWAIT are
hidden beneath the helper functions used here.

> @@ -411,29 +319,22 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
>      return 0;
>  }
>  
> +/* called under s->send_mutex */
>  static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>  {
> -    if (!nbd_client_connecting(s)) {
> -        return;
> -    }
> +    assert(nbd_client_connecting(s));
> +    assert(s->in_flight == 0);
>  
> -    /* Wait for completion of all in-flight requests */
> -
> -    qemu_co_mutex_lock(&s->send_mutex);
> -
> -    while (s->in_flight > 0) {
> -        qemu_co_mutex_unlock(&s->send_mutex);
> -        nbd_recv_coroutines_wake(s, true);
> -        s->wait_in_flight = true;
> -        qemu_coroutine_yield();
> -        s->wait_in_flight = false;
> -        qemu_co_mutex_lock(&s->send_mutex);
> -    }
> -
> -    qemu_co_mutex_unlock(&s->send_mutex);
> -
> -    if (!nbd_client_connecting(s)) {
> -        return;
> +    if (nbd_client_connecting_wait(s) && s->reconnect_delay &&
> +        !s->reconnect_delay_timer)
> +    {
> +        /*
> +         * It's first reconnect attempt after switching to
> +         * NBD_CLIENT_CONNECTING_WAIT
> +         */
> +        reconnect_delay_timer_init(s,
> +            qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
> +            s->reconnect_delay * NANOSECONDS_PER_SECOND);
>      }
>  
>      /*

Matches point 6.

> @@ -453,135 +354,79 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>      nbd_co_do_establish_connection(s->bs, NULL);
>  }
>  
> -static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState *s)
> +static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t handle)

The diff gets a bit harder to read at this point (deleting one
function, renaming another, but portions get interleaved in), but it's
not impossible.

>  {
> -    uint64_t timeout = 1 * NANOSECONDS_PER_SECOND;
> -    uint64_t max_timeout = 16 * NANOSECONDS_PER_SECOND;
> +    int ret;
> +    uint64_t ind = HANDLE_TO_INDEX(s, handle), ind2;
> +    QEMU_LOCK_GUARD(&s->receive_mutex);
>  
> -    if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
> -        reconnect_delay_timer_init(s, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
> -                                   s->reconnect_delay * NANOSECONDS_PER_SECOND);
> -    }
> -
> -    nbd_reconnect_attempt(s);
> -
> -    while (nbd_client_connecting(s)) {
> -        if (s->drained) {
> -            bdrv_dec_in_flight(s->bs);
> -            s->wait_drained_end = true;
> -            while (s->drained) {
> -                /*
> -                 * We may be entered once from nbd_client_attach_aio_context_bh
> -                 * and then from nbd_client_co_drain_end. So here is a loop.
> -                 */
> -                qemu_coroutine_yield();
> -            }
> -            bdrv_inc_in_flight(s->bs);
> -        } else {
> -            qemu_co_sleep_ns_wakeable(&s->reconnect_sleep,
> -                                      QEMU_CLOCK_REALTIME, timeout);
> -            if (s->drained) {
> -                continue;
> -            }
> -            if (timeout < max_timeout) {
> -                timeout *= 2;
> -            }

The deletion of nbd_co_reconnect_loop matches point 7.

> +    while (true) {
> +        if (s->reply.handle == handle) {
> +            /* We are done */
> +            return 0;
>          }
>  
> -        nbd_reconnect_attempt(s);
> -    }
> -
> -    reconnect_delay_timer_del(s);
> -}
> +        if (!nbd_client_connected(s)) {
> +            return -EIO;
> +        }
>  
> -static coroutine_fn void nbd_connection_entry(void *opaque)

The renaming of nbd_connection_entry into nbd_receive_replies matches
points 2A and 8.

> -{
> -    BDRVNBDState *s = opaque;
> -    uint64_t i;
> -    int ret = 0;
> -    Error *local_err = NULL;
> +        if (s->reply.handle != 0) {
> +            /*
> +             * Some other request is being handling now. It should already be
> +             * woken by thous who set s->reply.handle (or never wait in this

s/thous who/whoever/

> +             * yield). So, we should not wake it here.
> +             */
> +            ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
> +            assert(!s->requests[ind2].receiving);
>  
> -    while (qatomic_load_acquire(&s->state) != NBD_CLIENT_QUIT) {
> -        /*
> -         * The NBD client can only really be considered idle when it has
> -         * yielded from qio_channel_readv_all_eof(), waiting for data. This is
> -         * the point where the additional scheduled coroutine entry happens
> -         * after nbd_client_attach_aio_context().
> -         *
> -         * Therefore we keep an additional in_flight reference all the time and
> -         * only drop it temporarily here.
> -         */
> +            s->requests[ind].receiving = true;
> +            qemu_co_mutex_unlock(&s->receive_mutex);
>  
> -        if (nbd_client_connecting(s)) {
> -            nbd_co_reconnect_loop(s);
> -        }
> +            qemu_coroutine_yield();
> +            /*
> +             * We may be woken for 3 reasons:
> +             * 1. From this function, executing in parallel coroutine, when our
> +             *    handle received.

handle is received

> +             * 2. From nbd_channel_error(), when connection is lost.
> +             * 3. From nbd_co_receive_one_chunk(), when previous request is
> +             *    finished and s->reply.handle set to 0.
> +             * Anyway, it's OK to lock the mutex and go to the next iteration.
> +             */
>  
> -        if (!nbd_client_connected(s)) {
> +            qemu_co_mutex_lock(&s->receive_mutex);
> +            assert(!s->requests[ind].receiving);
>              continue;
>          }

Matches point 9.

>  
> +        /* We are under mutex and handle is 0. We have to do the dirty work. */
>          assert(s->reply.handle == 0);
> -        ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, &local_err);
> -
> -        if (local_err) {
> -            trace_nbd_read_reply_entry_fail(ret, error_get_pretty(local_err));
> -            error_free(local_err);
> -            local_err = NULL;
> -        }
> +        ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL);
>          if (ret <= 0) {
> -            nbd_channel_error(s, ret ? ret : -EIO);
> -            continue;
> +            ret = ret ? ret : -EIO;
> +            nbd_channel_error(s, ret);
> +            return ret;
>          }
> -
> -        /*
> -         * There's no need for a mutex on the receive side, because the
> -         * handler acts as a synchronization point and ensures that only
> -         * one coroutine is called until the reply finishes.
> -         */
> -        i = HANDLE_TO_INDEX(s, s->reply.handle);
> -        if (i >= MAX_NBD_REQUESTS ||
> -            !s->requests[i].coroutine ||
> -            !s->requests[i].receiving ||
> -            (nbd_reply_is_structured(&s->reply) && !s->info.structured_reply))
> -        {
> +        if (nbd_reply_is_structured(&s->reply) && !s->info.structured_reply) {
>              nbd_channel_error(s, -EINVAL);
> -            continue;
> +            return -EINVAL;
>          }
> -
> -        /*
> -         * We're woken up again by the request itself.  Note that there
> -         * is no race between yielding and reentering connection_co.  This
> -         * is because:
> -         *
> -         * - if the request runs on the same AioContext, it is only
> -         *   entered after we yield
> -         *
> -         * - if the request runs on a different AioContext, reentering
> -         *   connection_co happens through a bottom half, which can only
> -         *   run after we yield.
> -         */
> -        s->requests[i].receiving = false;
> -        aio_co_wake(s->requests[i].coroutine);
> -        qemu_coroutine_yield();
> -    }
> -
> -    qemu_co_queue_restart_all(&s->free_sema);
> -    nbd_recv_coroutines_wake(s, true);
> -    bdrv_dec_in_flight(s->bs);
> -
> -    s->connection_co = NULL;
> -    if (s->ioc) {
> -        qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
> -        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
> -                                 nbd_yank, s->bs);
> -        object_unref(OBJECT(s->ioc));
> -        s->ioc = NULL;
> -    }
> -
> -    if (s->teardown_co) {
> -        aio_co_wake(s->teardown_co);
> +        if (s->reply.handle == handle) {
> +            /* We are done */
> +            return 0;
> +        }
> +        ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
> +        if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].coroutine) {
> +            /*
> +             * We only check that ind2 request exists. But don't check is it now

s/is it/whether it is/

> +             * waiting for the reply header or not. We can't just check
> +             * s->requests[ind2].receiving: ind2 request may wait in trying to
> +             * lock receive_mutex. So that's a TODO.
> +             */
> +            nbd_channel_error(s, -EINVAL);
> +            return -EINVAL;

For the record, the kernel folks are currently debating a patch series
to fix a use-after-free bug in nbd.ko when the server sends an
unexpected reply.  I'm so glad that qemu has already been handling
unexpected replies gracefully.
https://lists.debian.org/nbd/2021/09/msg00136.html

I hope your TODO doesn't represent some sort of CVE in qemu talking to
a malicious server.

> +        }
> +        nbd_recv_coroutine_wake_one(&s->requests[ind2]);
>      }
> -    aio_wait_kick();
>  }
>  
>  static int nbd_co_send_request(BlockDriverState *bs,
> @@ -592,10 +437,17 @@ static int nbd_co_send_request(BlockDriverState *bs,
>      int rc, i = -1;
>  
>      qemu_co_mutex_lock(&s->send_mutex);
> -    while (s->in_flight == MAX_NBD_REQUESTS || nbd_client_connecting_wait(s)) {
> +
> +    while (s->in_flight == MAX_NBD_REQUESTS ||
> +           (!nbd_client_connected(s) && s->in_flight > 0))
> +    {
>          qemu_co_queue_wait(&s->free_sema, &s->send_mutex);
>      }
>  
> +    if (nbd_client_connecting(s)) {
> +        nbd_reconnect_attempt(s);
> +    }
> +
>      if (!nbd_client_connected(s)) {
>          rc = -EIO;
>          goto err;
> @@ -642,10 +494,6 @@ err:
>          if (i != -1) {
>              s->requests[i].coroutine = NULL;
>              s->in_flight--;
> -        }
> -        if (s->in_flight == 0 && s->wait_in_flight) {
> -            aio_co_wake(s->connection_co);
> -        } else {
>              qemu_co_queue_next(&s->free_sema);
>          }
>      }

Matches point 10.

> @@ -944,10 +792,7 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
>      }
>      *request_ret = 0;
>  
> -    /* Wait until we're woken up by nbd_connection_entry.  */
> -    s->requests[i].receiving = true;
> -    qemu_coroutine_yield();
> -    assert(!s->requests[i].receiving);
> +    nbd_receive_replies(s, handle);
>      if (!nbd_client_connected(s)) {
>          error_setg(errp, "Connection closed");
>          return -EIO;

Matches point 11.

> @@ -1040,14 +885,7 @@ static coroutine_fn int nbd_co_receive_one_chunk(
>      }
>      s->reply.handle = 0;
>  
> -    if (s->connection_co && !s->wait_in_flight) {
> -        /*
> -         * We must check s->wait_in_flight, because we may entered by
> -         * nbd_recv_coroutines_wake(), in this case we should not
> -         * wake connection_co here, it will woken by last request.
> -         */
> -        aio_co_wake(s->connection_co);
> -    }
> +    nbd_recv_coroutines_wake(s, false);
>  
>      return ret;
>  }

Matches point 12.

> @@ -1158,11 +996,7 @@ break_loop:
>  
>      qemu_co_mutex_lock(&s->send_mutex);
>      s->in_flight--;
> -    if (s->in_flight == 0 && s->wait_in_flight) {
> -        aio_co_wake(s->connection_co);
> -    } else {
> -        qemu_co_queue_next(&s->free_sema);
> -    }
> +    qemu_co_queue_next(&s->free_sema);
>      qemu_co_mutex_unlock(&s->send_mutex);
>  
>      return false;

Matches point 13.

> @@ -1978,6 +1812,7 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>      s->bs = bs;
>      qemu_co_mutex_init(&s->send_mutex);
>      qemu_co_queue_init(&s->free_sema);
> +    qemu_co_mutex_init(&s->receive_mutex);
>  
>      if (!yank_register_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name), errp)) {
>          return -EEXIST;
> @@ -1992,14 +1827,13 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>                                          s->x_dirty_bitmap, s->tlscreds);
>  
>      /* TODO: Configurable retry-until-timeout behaviour. */
> +    s->state = NBD_CLIENT_CONNECTING_WAIT;
>      ret = nbd_do_establish_connection(bs, errp);
>      if (ret < 0) {
>          goto fail;
>      }
>  
> -    s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
> -    bdrv_inc_in_flight(bs);
> -    aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
> +    nbd_client_connection_enable_retry(s->conn);
>  
>      return 0;
>

Matches point 14.

> @@ -2153,6 +1987,8 @@ static void nbd_cancel_in_flight(BlockDriverState *bs)
>          s->state = NBD_CLIENT_CONNECTING_NOWAIT;
>          qemu_co_queue_restart_all(&s->free_sema);
>      }
> +
> +    nbd_co_establish_connection_cancel(s->conn);
>  }
>

Matches point 15.

>  static BlockDriver bdrv_nbd = {
> @@ -2173,10 +2009,6 @@ static BlockDriver bdrv_nbd = {
>      .bdrv_refresh_limits        = nbd_refresh_limits,
>      .bdrv_co_truncate           = nbd_co_truncate,
>      .bdrv_getlength             = nbd_getlength,
> -    .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
> -    .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
> -    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
> -    .bdrv_co_drain_end          = nbd_client_co_drain_end,
>      .bdrv_refresh_filename      = nbd_refresh_filename,
>      .bdrv_co_block_status       = nbd_client_co_block_status,
>      .bdrv_dirname               = nbd_dirname,
> @@ -2202,10 +2034,6 @@ static BlockDriver bdrv_nbd_tcp = {
>      .bdrv_refresh_limits        = nbd_refresh_limits,
>      .bdrv_co_truncate           = nbd_co_truncate,
>      .bdrv_getlength             = nbd_getlength,
> -    .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
> -    .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
> -    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
> -    .bdrv_co_drain_end          = nbd_client_co_drain_end,
>      .bdrv_refresh_filename      = nbd_refresh_filename,
>      .bdrv_co_block_status       = nbd_client_co_block_status,
>      .bdrv_dirname               = nbd_dirname,
> @@ -2231,10 +2059,6 @@ static BlockDriver bdrv_nbd_unix = {
>      .bdrv_refresh_limits        = nbd_refresh_limits,
>      .bdrv_co_truncate           = nbd_co_truncate,
>      .bdrv_getlength             = nbd_getlength,
> -    .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
> -    .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
> -    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
> -    .bdrv_co_drain_end          = nbd_client_co_drain_end,

Matches point 3

>      .bdrv_refresh_filename      = nbd_refresh_filename,
>      .bdrv_co_block_status       = nbd_client_co_block_status,
>      .bdrv_dirname               = nbd_dirname,
> diff --git a/nbd/client.c b/nbd/client.c
> index 0c2db4bcba..30d5383cb1 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -1434,9 +1434,7 @@ nbd_read_eof(BlockDriverState *bs, QIOChannel *ioc, void *buffer, size_t size,
>  
>          len = qio_channel_readv(ioc, &iov, 1, errp);
>          if (len == QIO_CHANNEL_ERR_BLOCK) {
> -            bdrv_dec_in_flight(bs);
>              qio_channel_yield(ioc, G_IO_IN);
> -            bdrv_inc_in_flight(bs);

Matches point 3.

>              continue;
>          } else if (len < 0) {
>              return -EIO;
> -- 
> 2.29.2
>

Verdict: I found some typos/grammar fixes, but I think I can touch
those up.  I'm now hammering on the patches in my NBD tree, and
barring any surprises, this will be in my next pull request Monday at
the latest.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 4/5] block/nbd: drop connection_co
  2021-09-24 17:44   ` Eric Blake
@ 2021-09-27 11:57     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 14+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-27 11:57 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-block, qemu-devel, hreitz, kwolf, rkagan

24.09.2021 20:44, Eric Blake wrote:
> On Thu, Sep 02, 2021 at 01:38:04PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> OK, that's a big rewrite of the logic.
> 
> And a time-consuming review on my part!
> 
>>
>> Pre-patch we have an always running coroutine - connection_co. It does
>> reply receiving and reconnecting. And it leads to a lot of difficult
>> and unobvious code around drained sections and context switch. We also
>> abuse bs->in_flight counter which is increased for connection_co and
>> temporary decreased in points where we want to allow drained section to
>> begin. One of these place is in another file: in nbd_read_eof() in
>> nbd/client.c.
>>
>> We also cancel reconnect and requests waiting for reconnect on drained
>> begin which is not correct. And this patch fixes that.
>>
>> Let's finally drop this always running coroutine and go another way:
>> do both reconnect and receiving in request coroutines.
>>
>> The detailed list of changes below (in the sequence of diff hunks).
> 
> Well, depending on the git order file in use ;)
> 
>>
>> 1. receiving coroutines are woken directly from nbd_channel_error, when
>>     we change s->state
>>
>> 2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
>>     and in nbd_teardown_connection() all requests should already be
>>     finished (and reconnect is done from request). So
>>     nbd_co_establish_connection_cancel() is called from
>>     nbd_cancel_in_flight() (to cancel the request that is doing
>>     nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
>>     (previously we didn't need it, as reconnect delay only should cancel
>>     active requests not the reconnection itself. But now reconnection
> 
> Missing )
> 
>>     itself is done in the separate thread (we now call
>>     nbd_client_connection_enable_retry() in nbd_open()), and we need to
>>     cancel the requests that waits in nbd_co_establish_connection()
> 
> Singular/plural disagreement. I think the intended meaning is
> 'requests that wait' and not 'request that waits'.
> 
>>     now).
>>
>> 2. We do receive headers in request coroutine. But we also should
> 
> Second point 2; I'll call it 2A below, because it looks related to
> point 8.

Ohh. OK.

> 
>>     dispatch replies for another pending requests. So,
> 
> s/another/other/
> 
>>     nbd_connection_entry() is turned into nbd_receive_replies(), which
>>     does reply dispatching until it receive another request headers, and
> 
> s/until/while/, s/another/other/
> 
>>     returns when it receive the requested header.
> 
> receives
> 
>>
>> 3. All old staff around drained sections and context switch is dropped.
>>     In details:
>>     - we don't need to move connection_co to new aio context, as we
>>       don't have connection_co anymore
>>     - we don't have a fake "request" of connection_co (extra increasing
>>       in_flight), so don't care with it in drain_begin/end
>>     - we don't stop reconnection during drained section anymore. This
>>       means that drain_begin may wait for a long time (up to
>>       reconnect_delay). But that's an improvement and more correct
>>       behavior see below[*]
>>
>> 4. In nbd_teardown_connection() we don't have to wait for
>>     connection_co, as it is dropped. And cleanup for s->ioc and nbd_yank
>>     is moved here from removed connection_co.
>>
>> 5. In nbd_co_do_establish_connection() we now should handle
>>     NBD_CLIENT_CONNECTING_NOWAIT: if new request comes when we are in
>>     NBD_CLIENT_CONNECTING_NOWAIT, it still should call
>>     nbd_co_establish_connection() (who knows, maybe connection already
>>     established by thread in background). But we shouldn't wait: if
> 
> maybe the connection was already established by another thread in the background

"another thread" sounds vague. I think better: by connection thread.

> 
>>     nbd_co_establish_connection() can't return new channel immediately
>>     the request should fail (we are in NBD_CLIENT_CONNECTING_NOWAIT
>>     state).
>>
>> 6. nbd_reconnect_attempt() is simplified: it's now easier to wait for
>>     other requests in the caller, so here we just assert that fact.
>>     Also delay time is now initialized here: we can easily detect first
>>     attempt and start a timer.
>>
>> 7. nbd_co_reconnect_loop() is dropped, we don't need it. Reconnect
>>     retries are fully handle by thread (nbd/client-connection.c), delay
>>     timer we initialize in nbd_reconnect_attempt(), we don't have to
>>     bother with s->drained and friends. nbd_reconnect_attempt() now
>>     called from nbd_co_send_request().
> 
> A lot going on there, but it's making sense so far.
> 
>>
>> 8. nbd_connection_entry is dropped: reconnect is now handled by
>>     nbd_co_send_request(), receiving reply is now handled by
>>     nbd_receive_replies(): all handled from request coroutines.
>>
>> 9. So, welcome new nbd_receive_replies() called from request coroutine,
>>     that receives reply header instead of nbd_connection_entry().
>>     Like with sending requests, only one coroutine may receive in a
>>     moment. So we introduce receive_mutex, which is locked around
>>     nbd_receive_reply(). It also protects some related fields. Still,
>>     full audit of thread-safety in nbd driver is a separate task.
> 
> It sounds like you're more worried about auditing for which locks are
> held longer than necessary, rather than expecting to find cases where
> we aren't thread-safe because we are lacking locks?  At any rate, as
> this patch is already easier to reason about, I'm okay deferring that
> audit to later patches.

I just try to not interduce new not-threadsafe things. But I'm not sure do we have a good kind of threadsafety in nbd driver code now.

> 
>>     New function waits for a reply with specified handle being received
>>     and works rather simple:
>>
>>     Under mutex:
>>       - if current handle is 0, do receive by hand. If another handle
>>         received - switch to other request coroutine, release mutex and
>>         yield. Otherwise return success
>>       - if current handle == requested handle, we are done
>>       - otherwise, release mutex and yield
>>
>> 10: in nbd_co_send_request() we now do nbd_reconnect_attempt() if
>>      needed. Also waiting in free_sema queue we now wait for one of two
>>      conditions:
>>      - connectED, in_flight < MAX_NBD_REQUESTS (so we can start new one)
>>      - connectING, in_flight == 0, so we can call
>>        nbd_reconnect_attempt()
>>      And this logic is protected by s->send_mutex
>>
>>      Also, on failure we don't have to care of removed s->connection_co
>>
>> 11. nbd_co_do_receive_one_chunk(): now instead of yield() and wait for
>>      s->connection_co we just call new nbd_receive_replies().
>>
>> 12. nbd_co_receive_one_chunk(): place where s->reply.handle becomes 0,
>>      which means that handling of the whole reply is finished. Here we
>>      need to wake one of coroutines sleeping in nbd_receive_replies().
>>      If now one sleeps - do nothing. That's another behavior change: we
> 
> s/now one sleeps/none are sleeping/
> 
>>      don't have endless recv() in the idle time. It may be considered as
>>      a drawback. If so, it may be fixed later.
> 
> The case where no coroutine is waiting to receive a reply should only
> happen due to a malicious server (a compliant server will only send
> replies matching an existing pending request).

With new code, when we don't have any in-flight requests - there are no waiting coroutines as well.
And nobody will note if connection silently get lost.

With old code, we always have recv() call called from connection_co, so we should catch disconnects (at least some kinds of disconnects or if keep-alive is set up properly) sooner and start reconnection procedure.

> 
>>
>> 13. nbd_reply_chunk_iter_receive(): don't care about removed
>>      connection_co, just ping in_flight waiters.
>>
>> 14. Don't create connection_co, enable retry in the connection thread
>>      (we don't have own reconnect loop anymore)
>>
>> 15. We need now nbd_co_establish_connection_cancel() call in
>>      nbd_cancel_in_flight(), to cancel the request that doing connection
>>      attempt.
> 
> s/need now/now need to add a/
> s/request that doing/request that is doing a/
> 
>>
>> [*], ok, now we don't cancel reconnect on drain begin. That's correct:
>>      reconnect feature leads to possibility of long-running requests (up
>>      to reconnect delay). Still, drain begin is not a reason to kill
>>      long requests. We should wait for them.
>>
>>      This also means, that we can again reproduce a dead-lock, described
>>      in 8c517de24a8a1dcbeb54e7e12b5b0fda42a90ace.
>>      Why we are OK with it:
>>      1. Now this is not absolutely-dead dead-lock: the vm is unfrozen
>>         after reconnect delay. Actually 8c517de24a8a1dc fixed a bug in
>>         NBD logic, that was not described in 8c517de24a8a1dc and led to
>>         forever dead-lock. The problem was that nobody woken free_sema
> 
> s/woken/woke the/
> 
>>         queue, but drain_begin can't finish until there is a request in
>>         free_sema queue. Now we have a reconnect delay timer that works
>>         well.
>>      2. It's not a problem of NBD driver, it's a problem of ide code,
>>         that does drain_begin under global mutex
> 
> I agree that this point means that it is appropriate to address that
> in a separate patch series.  Hopefully someone tackles it before 6.2,
> but even if not, I agree that the worst that happens is that we have a
> command stalled waiting on a long timeout, rather than actual
> deadlock.
> 
>>      3. That doesn't reproduce if chose scsi instead of ide.
> 
> Seems like this sentence fragment fits better as the tail of 2, rather
> than as a separate bullet 3?  I'll reword as:
> 
> ...it's a problem of the ide code, because it does drain_begin under
> the global mutex; the problem doesn't reproduce when using scsi
> instead of ide.

Sounds good

> 
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---

[..]

> 
> Verdict: I found some typos/grammar fixes, but I think I can touch
> those up.  I'm now hammering on the patches in my NBD tree, and
> barring any surprises, this will be in my next pull request Monday at
> the latest.
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 

Thanks a lot for reviewing, and time spent to understand and fix my English )

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-09-27 11:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02 10:38 [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
2021-09-02 10:38 ` [PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally Vladimir Sementsov-Ogievskiy
2021-09-03 16:16   ` Eric Blake
2021-09-02 10:38 ` [PATCH v6 2/5] block/nbd: move nbd_recv_coroutines_wake_all() up Vladimir Sementsov-Ogievskiy
2021-09-02 10:38 ` [PATCH v6 3/5] block/nbd: refactor nbd_recv_coroutines_wake_all() Vladimir Sementsov-Ogievskiy
2021-09-03 17:36   ` Eric Blake
2021-09-02 10:38 ` [PATCH v6 4/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
2021-09-24 17:44   ` Eric Blake
2021-09-27 11:57     ` Vladimir Sementsov-Ogievskiy
2021-09-02 10:38 ` [PATCH v6 5/5] block/nbd: check that received handle is valid Vladimir Sementsov-Ogievskiy
2021-09-03 17:54   ` Eric Blake
2021-09-22  7:48 ` [PATCH v6 0/5] block/nbd: drop connection_co Vladimir Sementsov-Ogievskiy
2021-09-22 13:12   ` Eric Blake
2021-09-22 13:49     ` Vladimir Sementsov-Ogievskiy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.