* [PATCH for-5.1? 0/3] Fix nbd reconnect dead-locks
@ 2020-07-20  9:00 Vladimir Sementsov-Ogievskiy
  2020-07-20  9:00 ` [PATCH 1/3] block/nbd: allow drain during reconnect attempt Vladimir Sementsov-Ogievskiy
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-20  9:00 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, qemu-devel, mreitz, den

Hi all!

I've found some deadlocks that can be easily triggered on the master
branch with the default NBD configuration (reconnect-delay is 0);
here are the fixes.

Patches 01-02 fix real deadlocks.
03 - hm, I'm not sure the problem is reachable on master; I faced it
in my development branch, where I move the initial connect into a
coroutine and introduce a non-blocking connect. So consider it an
intuitive fix - it just makes the code a bit better.

Vladimir Sementsov-Ogievskiy (3):
  block/nbd: allow drain during reconnect attempt
  block/nbd: on shutdown terminate connection attempt
  block/nbd: nbd_co_reconnect_loop(): don't sleep if drained

 block/nbd.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)

-- 
2.21.0




* [PATCH 1/3] block/nbd: allow drain during reconnect attempt
  2020-07-20  9:00 [PATCH for-5.1? 0/3] Fix nbd reconnect dead-locks Vladimir Sementsov-Ogievskiy
@ 2020-07-20  9:00 ` Vladimir Sementsov-Ogievskiy
  2020-07-20  9:02   ` Vladimir Sementsov-Ogievskiy
                     ` (3 more replies)
  2020-07-20  9:00 ` [PATCH 2/3] block/nbd: on shutdown terminate connection attempt Vladimir Sementsov-Ogievskiy
  2020-07-20  9:00 ` [PATCH 3/3] block/nbd: nbd_co_reconnect_loop(): don't sleep if drained Vladimir Sementsov-Ogievskiy
  2 siblings, 4 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-20  9:00 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, qemu-devel, mreitz, den

It should be safe to reenter qio_channel_yield() on the io/channel
read/write path, so it's safe to reduce in_flight and allow attaching
a new aio context. And there is no problem with allowing drain itself:
a connection attempt is not a guest request. Moreover, if the remote
server is down, we can hang in negotiation, blocking the drain section
and provoking a deadlock.

How to reproduce the deadlock:

1. Create nbd-fault-injector.conf with the following contents:

[inject-error "mega1"]
event=data
io=readwrite
when=before

2. In one terminal run nbd-fault-injector in a loop, like this:

n=1; while true; do
    echo $n; ((n++));
    ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
done

3. In another terminal run qemu-io in a loop, like this:

n=1; while true; do
    echo $n; ((n++));
    ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;
done

After some time, qemu-io will hang trying to drain, for example, like
this:

 #3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
    util/aio-posix.c:600
 #4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
    parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
 #5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
 #6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
 #7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
 #8 bdrv_open_inherit (filename=0x7fffba1563bc
    "nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
    flags=24578, parent=0x0, child_class=0x0, child_role=0,
    errp=0x7fffba154620) at block.c:3491
 #9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
    reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
    block.c:3513
 #10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
    reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
    block/block-backend.c:421

And connection_co stack like this:

 #0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
    action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
 #1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
 #2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
    io/channel.c:472
 #3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
    niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
 #4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
    niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
 #5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
    "\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
    io/channel.c:247
 #6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
    desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
    /work/src/qemu/master/include/block/nbd.h:365
 #7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
    desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
    /work/src/qemu/master/include/block/nbd.h:391
 #8 nbd_start_negotiate (aio_context=0x55f006bdd890,
    ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
    outioc=0x55f006bf19f8, structured_reply=true,
    zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
 #9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
    ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
    outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
    nbd/client.c:1032
 #10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
    block/nbd.c:1460
 #11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
 #12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
 #13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
 #14 coroutine_trampoline (i0=113190480, i1=22000) at
    util/coroutine-ucontext.c:173

Note that the hang may be triggered by another bug, so the whole case
is fixed only together with the commit "block/nbd: on shutdown
terminate connection attempt".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 65a4f56924..49254f1c3c 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -280,7 +280,18 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
         s->ioc = NULL;
     }
 
+    bdrv_dec_in_flight(s->bs);
     s->connect_status = nbd_client_connect(s->bs, &local_err);
+    s->wait_drained_end = true;
+    while (s->drained) {
+        /*
+         * We may be entered once from nbd_client_attach_aio_context_bh
+         * and then from nbd_client_co_drain_end. So here is a loop.
+         */
+        qemu_coroutine_yield();
+    }
+    bdrv_inc_in_flight(s->bs);
+
     error_free(s->connect_err);
     s->connect_err = NULL;
     error_propagate(&s->connect_err, local_err);
-- 
2.21.0




* [PATCH 2/3] block/nbd: on shutdown terminate connection attempt
  2020-07-20  9:00 [PATCH for-5.1? 0/3] Fix nbd reconnect dead-locks Vladimir Sementsov-Ogievskiy
  2020-07-20  9:00 ` [PATCH 1/3] block/nbd: allow drain during reconnect attempt Vladimir Sementsov-Ogievskiy
@ 2020-07-20  9:00 ` Vladimir Sementsov-Ogievskiy
  2020-07-23 18:52   ` Eric Blake
  2020-07-20  9:00 ` [PATCH 3/3] block/nbd: nbd_co_reconnect_loop(): don't sleep if drained Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-20  9:00 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, qemu-devel, mreitz, den

On shutdown, the NBD driver may be in a connecting state. We should
shut it down as well; otherwise we may hang in nbd_teardown_connection(),
waiting for connection_co to finish in the
BDRV_POLL_WHILE(bs, s->connection_co) loop if the remote server is down.

How to reproduce the deadlock:

1. Create nbd-fault-injector.conf with the following contents:

[inject-error "mega1"]
event=data
io=readwrite
when=before

2. In one terminal run nbd-fault-injector in a loop, like this:

n=1; while true; do
    echo $n; ((n++));
    ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
done

3. In another terminal run qemu-io in a loop, like this:

n=1; while true; do
    echo $n; ((n++));
    ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;
done

After some time, qemu-io will hang. Note that this hang may be
triggered by another bug, so the whole case is fixed only together with
the commit "block/nbd: allow drain during reconnect attempt".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 49254f1c3c..42146a26f7 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -206,11 +206,15 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-    if (s->state == NBD_CLIENT_CONNECTED) {
+    if (s->ioc) {
         /* finish any pending coroutines */
-        assert(s->ioc);
         qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+    } else if (s->sioc) {
+        /* abort negotiation */
+        qio_channel_shutdown(QIO_CHANNEL(s->sioc), QIO_CHANNEL_SHUTDOWN_BOTH,
+                             NULL);
     }
+
     s->state = NBD_CLIENT_QUIT;
     if (s->connection_co) {
         if (s->connection_co_sleep_ns_state) {
@@ -1452,6 +1456,8 @@ static int nbd_client_connect(BlockDriverState *bs, Error **errp)
         return -ECONNREFUSED;
     }
 
+    s->sioc = sioc;
+
     /* NBD handshake */
     trace_nbd_client_connect(s->export);
     qio_channel_set_blocking(QIO_CHANNEL(sioc), false, NULL);
@@ -1468,6 +1474,7 @@ static int nbd_client_connect(BlockDriverState *bs, Error **errp)
     g_free(s->info.name);
     if (ret < 0) {
         object_unref(OBJECT(sioc));
+        s->sioc = NULL;
         return ret;
     }
     if (s->x_dirty_bitmap && !s->info.base_allocation) {
@@ -1493,8 +1500,6 @@ static int nbd_client_connect(BlockDriverState *bs, Error **errp)
         }
     }
 
-    s->sioc = sioc;
-
     if (!s->ioc) {
         s->ioc = QIO_CHANNEL(sioc);
         object_ref(OBJECT(s->ioc));
@@ -1515,6 +1520,7 @@ static int nbd_client_connect(BlockDriverState *bs, Error **errp)
         nbd_send_request(s->ioc ?: QIO_CHANNEL(sioc), &request);
 
         object_unref(OBJECT(sioc));
+        s->sioc = NULL;
 
         return ret;
     }
-- 
2.21.0




* [PATCH 3/3] block/nbd: nbd_co_reconnect_loop(): don't sleep if drained
  2020-07-20  9:00 [PATCH for-5.1? 0/3] Fix nbd reconnect dead-locks Vladimir Sementsov-Ogievskiy
  2020-07-20  9:00 ` [PATCH 1/3] block/nbd: allow drain during reconnect attempt Vladimir Sementsov-Ogievskiy
  2020-07-20  9:00 ` [PATCH 2/3] block/nbd: on shutdown terminate connection attempt Vladimir Sementsov-Ogievskiy
@ 2020-07-20  9:00 ` Vladimir Sementsov-Ogievskiy
  2020-07-23 18:55   ` Eric Blake
  2 siblings, 1 reply; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-20  9:00 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, qemu-devel, mreitz, den

We try to go into a wakeable sleep so that, if drain begins, it will
break the sleep. But what if nbd_client_co_drain_begin() has already
been called and s->drained is already true? We'll go to sleep, and
drain will have to wait for the whole timeout. Let's improve this.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 42146a26f7..d9cde30457 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -327,8 +327,6 @@ static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState *s)
             qemu_co_queue_restart_all(&s->free_sema);
         }
 
-        qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
-                                  &s->connection_co_sleep_ns_state);
         if (s->drained) {
             bdrv_dec_in_flight(s->bs);
             s->wait_drained_end = true;
@@ -340,9 +338,12 @@ static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState *s)
                 qemu_coroutine_yield();
             }
             bdrv_inc_in_flight(s->bs);
-        }
-        if (timeout < max_timeout) {
-            timeout *= 2;
+        } else {
+            qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
+                                      &s->connection_co_sleep_ns_state);
+            if (timeout < max_timeout) {
+                timeout *= 2;
+            }
         }
 
         nbd_reconnect_attempt(s);
-- 
2.21.0




* Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
  2020-07-20  9:00 ` [PATCH 1/3] block/nbd: allow drain during reconnect attempt Vladimir Sementsov-Ogievskiy
@ 2020-07-20  9:02   ` Vladimir Sementsov-Ogievskiy
  2020-07-23 18:47   ` Eric Blake
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-20  9:02 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

20.07.2020 12:00, Vladimir Sementsov-Ogievskiy wrote:
> It should be to reenter qio_channel_yield() on io/channel read/write

should be safe I mean

> path, so it's safe to reduce in_flight and allow attaching new aio
> context. And no problem to allow drain itself: connection attempt is
> not a guest request. Moreover, if remote server is down, we can hang
> in negotiation, blocking drain section and provoking a dead lock.
> 
> How to reproduce the dead lock:
> 
> 1. Create nbd-fault-injector.conf with the following contents:
> 
> [inject-error "mega1"]
> event=data
> io=readwrite
> when=before
> 
> 2. In one terminal run nbd-fault-injector in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));
>      ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
> done
> 
> 3. In another terminal run qemu-io in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));
>      ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;
> done
> 
> After some time, qemu-io will hang trying to drain, for example, like
> this:
> 
>   #3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
>      util/aio-posix.c:600
>   #4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
>      parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
>   #5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
>   #6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
>   #7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
>   #8 bdrv_open_inherit (filename=0x7fffba1563bc
>      "nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
>      flags=24578, parent=0x0, child_class=0x0, child_role=0,
>      errp=0x7fffba154620) at block.c:3491
>   #9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
>      reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
>      block.c:3513
>   #10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
>      reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
>      block/block-backend.c:421
> 
> And connection_co stack like this:
> 
>   #0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
>      action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
>   #1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
>   #2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
>      io/channel.c:472
>   #3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
>      niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
>   #4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
>      niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
>   #5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
>      "\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
>      io/channel.c:247
>   #6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
>      desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
>      /work/src/qemu/master/include/block/nbd.h:365
>   #7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
>      desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
>      /work/src/qemu/master/include/block/nbd.h:391
>   #8 nbd_start_negotiate (aio_context=0x55f006bdd890,
>      ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
>      outioc=0x55f006bf19f8, structured_reply=true,
>      zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
>   #9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
>      ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
>      outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
>      nbd/client.c:1032
>   #10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
>      block/nbd.c:1460
>   #11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
>   #12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
>   #13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
>   #14 coroutine_trampoline (i0=113190480, i1=22000) at
>      util/coroutine-ucontext.c:173
> 
> Note, that the hang may be
> triggered by another bug, so the whole case is fixed only together with
> commit "block/nbd: on shutdown terminate connection attempt".
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/nbd.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 65a4f56924..49254f1c3c 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -280,7 +280,18 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>           s->ioc = NULL;
>       }
>   
> +    bdrv_dec_in_flight(s->bs);
>       s->connect_status = nbd_client_connect(s->bs, &local_err);
> +    s->wait_drained_end = true;
> +    while (s->drained) {
> +        /*
> +         * We may be entered once from nbd_client_attach_aio_context_bh
> +         * and then from nbd_client_co_drain_end. So here is a loop.
> +         */
> +        qemu_coroutine_yield();
> +    }
> +    bdrv_inc_in_flight(s->bs);
> +
>       error_free(s->connect_err);
>       s->connect_err = NULL;
>       error_propagate(&s->connect_err, local_err);
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
  2020-07-20  9:00 ` [PATCH 1/3] block/nbd: allow drain during reconnect attempt Vladimir Sementsov-Ogievskiy
  2020-07-20  9:02   ` Vladimir Sementsov-Ogievskiy
@ 2020-07-23 18:47   ` Eric Blake
  2020-07-24 10:04     ` Vladimir Sementsov-Ogievskiy
  2020-07-24 11:50     ` Vladimir Sementsov-Ogievskiy
  2020-07-24  9:49   ` Vladimir Sementsov-Ogievskiy
  2020-07-24 10:21   ` Vladimir Sementsov-Ogievskiy
  3 siblings, 2 replies; 13+ messages in thread
From: Eric Blake @ 2020-07-23 18:47 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

On 7/20/20 4:00 AM, Vladimir Sementsov-Ogievskiy wrote:
> It should be to reenter qio_channel_yield() on io/channel read/write
> path, so it's safe to reduce in_flight and allow attaching new aio
> context. And no problem to allow drain itself: connection attempt is
> not a guest request. Moreover, if remote server is down, we can hang
> in negotiation, blocking drain section and provoking a dead lock.
> 
> How to reproduce the dead lock:
> 

I tried to reproduce this, but in the several minutes it has taken me to 
write this email, it still has not hung.  Still, your stack trace is 
fairly good evidence of the problem, and adding a temporary sleep or 
running it under gdb with a breakpoint can probably make reproduction 
easier.

> 1. Create nbd-fault-injector.conf with the following contents:
> 
> [inject-error "mega1"]
> event=data
> io=readwrite
> when=before
> 
> 2. In one terminal run nbd-fault-injector in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));

Bashism, but not a problem for the commit message.

>      ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
> done
> 
> 3. In another terminal run qemu-io in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));
>      ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;

I prefer the spelling nbd:// for TCP connections, but also inconsequential.

> Note, that the hang may be
> triggered by another bug, so the whole case is fixed only together with
> commit "block/nbd: on shutdown terminate connection attempt".
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/nbd.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 65a4f56924..49254f1c3c 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -280,7 +280,18 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>           s->ioc = NULL;
>       }
>   
> +    bdrv_dec_in_flight(s->bs);
>       s->connect_status = nbd_client_connect(s->bs, &local_err);
> +    s->wait_drained_end = true;
> +    while (s->drained) {
> +        /*
> +         * We may be entered once from nbd_client_attach_aio_context_bh
> +         * and then from nbd_client_co_drain_end. So here is a loop.
> +         */
> +        qemu_coroutine_yield();
> +    }
> +    bdrv_inc_in_flight(s->bs);
> +

This is very similar to the code in nbd_co_reconnect_loop.  Does that 
function still need to wait on drained, since it calls 
nbd_reconnect_attempt which is now doing the same loop?  But off-hand, 
I'm not seeing a problem with keeping both places.

Reviewed-by: Eric Blake <eblake@redhat.com>

As a bug fix, I'll be including this in my NBD pull request for the next 
-rc build.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 2/3] block/nbd: on shutdown terminate connection attempt
  2020-07-20  9:00 ` [PATCH 2/3] block/nbd: on shutdown terminate connection attempt Vladimir Sementsov-Ogievskiy
@ 2020-07-23 18:52   ` Eric Blake
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Blake @ 2020-07-23 18:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

On 7/20/20 4:00 AM, Vladimir Sementsov-Ogievskiy wrote:
> On shutdown nbd driver may be in a connecting state. We should shutdown
> it as well, otherwise we may hang in
> nbd_teardown_connection, waiting for conneciton_co to finish in
> BDRV_POLL_WHILE(bs, s->connection_co) loop if remote server is down.
> 
> How to reproduce the dead lock:
> 

Same reproducer as in the previous patch (again, where a temporary sleep 
or well-placed gdb breakpoint may be more reliable than running two 
process loops).


> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/nbd.c | 14 ++++++++++----
>   1 file changed, 10 insertions(+), 4 deletions(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 3/3] block/nbd: nbd_co_reconnect_loop(): don't sleep if drained
  2020-07-20  9:00 ` [PATCH 3/3] block/nbd: nbd_co_reconnect_loop(): don't sleep if drained Vladimir Sementsov-Ogievskiy
@ 2020-07-23 18:55   ` Eric Blake
  2020-07-27 18:42     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Blake @ 2020-07-23 18:55 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

On 7/20/20 4:00 AM, Vladimir Sementsov-Ogievskiy wrote:
> We try to go to wakeable sleep, so that, if drain begins it will break
> the sleep. But what if nbd_client_co_drain_begin() already called and
> s->drained is already true? We'll go to sleep, and drain will have to
> wait for the whole timeout. Let's improve it.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/nbd.c | 11 ++++++-----
>   1 file changed, 6 insertions(+), 5 deletions(-)
> 

How frequently did you hit this case?  At any rate, the optimization 
looks sane, and I'm happy to include it in 5.1.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
  2020-07-20  9:00 ` [PATCH 1/3] block/nbd: allow drain during reconnect attempt Vladimir Sementsov-Ogievskiy
  2020-07-20  9:02   ` Vladimir Sementsov-Ogievskiy
  2020-07-23 18:47   ` Eric Blake
@ 2020-07-24  9:49   ` Vladimir Sementsov-Ogievskiy
  2020-07-24 10:21   ` Vladimir Sementsov-Ogievskiy
  3 siblings, 0 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-24  9:49 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

20.07.2020 12:00, Vladimir Sementsov-Ogievskiy wrote:
> It should be to reenter qio_channel_yield() on io/channel read/write
> path, so it's safe to reduce in_flight and allow attaching new aio
> context. And no problem to allow drain itself: connection attempt is
> not a guest request. Moreover, if remote server is down, we can hang
> in negotiation, blocking drain section and provoking a dead lock.
> 
> How to reproduce the dead lock:
> 
> 1. Create nbd-fault-injector.conf with the following contents:
> 
> [inject-error "mega1"]
> event=data
> io=readwrite
> when=before
> 
> 2. In one terminal run nbd-fault-injector in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));
>      ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
> done
> 
> 3. In another terminal run qemu-io in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));
>      ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;
> done
> 
> After some time, qemu-io will hang trying to drain, for example, like
> this:
> 
>   #3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
>      util/aio-posix.c:600
>   #4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
>      parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
>   #5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
>   #6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
>   #7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
>   #8 bdrv_open_inherit (filename=0x7fffba1563bc
>      "nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
>      flags=24578, parent=0x0, child_class=0x0, child_role=0,
>      errp=0x7fffba154620) at block.c:3491
>   #9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
>      reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
>      block.c:3513
>   #10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
>      reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
>      block/block-backend.c:421
> 
> And connection_co stack like this:
> 
>   #0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
>      action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
>   #1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
>   #2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
>      io/channel.c:472
>   #3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
>      niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
>   #4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
>      niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
>   #5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
>      "\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
>      io/channel.c:247
>   #6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
>      desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
>      /work/src/qemu/master/include/block/nbd.h:365
>   #7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
>      desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
>      /work/src/qemu/master/include/block/nbd.h:391
>   #8 nbd_start_negotiate (aio_context=0x55f006bdd890,
>      ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
>      outioc=0x55f006bf19f8, structured_reply=true,
>      zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
>   #9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
>      ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
>      outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
>      nbd/client.c:1032
>   #10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
>      block/nbd.c:1460
>   #11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
>   #12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
>   #13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
>   #14 coroutine_trampoline (i0=113190480, i1=22000) at
>      util/coroutine-ucontext.c:173
> 
> Note, that the hang may be
> triggered by another bug, so the whole case is fixed only together with
> commit "block/nbd: on shutdown terminate connection attempt".
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/nbd.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 65a4f56924..49254f1c3c 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -280,7 +280,18 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>           s->ioc = NULL;
>       }
>   
> +    bdrv_dec_in_flight(s->bs);
>       s->connect_status = nbd_client_connect(s->bs, &local_err);

Hmm. This inserts a yield between setting connect_status and connect_err.
I don't know whether that can be a problem, but at least it looks like bad
design. Better to use a local ret here and set connect_status together with
connect_err after the yield.
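
Something along these lines, perhaps (an untested sketch on top of this
patch; only the local 'ret' is new, everything else already exists in
nbd_reconnect_attempt()):

    int ret;

    bdrv_dec_in_flight(s->bs);
    ret = nbd_client_connect(s->bs, &local_err);
    s->wait_drained_end = true;
    while (s->drained) {
        /*
         * We may be entered once from nbd_client_attach_aio_context_bh
         * and then from nbd_client_co_drain_end. So here is a loop.
         */
        qemu_coroutine_yield();
    }
    bdrv_inc_in_flight(s->bs);

    /* Publish status and error together, only after the yields above. */
    s->connect_status = ret;
    error_free(s->connect_err);
    s->connect_err = NULL;
    error_propagate(&s->connect_err, local_err);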

> +    s->wait_drained_end = true;
> +    while (s->drained) {
> +        /*
> +         * We may be entered once from nbd_client_attach_aio_context_bh
> +         * and then from nbd_client_co_drain_end. So here is a loop.
> +         */
> +        qemu_coroutine_yield();
> +    }
> +    bdrv_inc_in_flight(s->bs);
> +
>       error_free(s->connect_err);
>       s->connect_err = NULL;
>       error_propagate(&s->connect_err, local_err);
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
  2020-07-23 18:47   ` Eric Blake
@ 2020-07-24 10:04     ` Vladimir Sementsov-Ogievskiy
  2020-07-24 11:50     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-24 10:04 UTC (permalink / raw)
  To: Eric Blake, qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

23.07.2020 21:47, Eric Blake wrote:
> On 7/20/20 4:00 AM, Vladimir Sementsov-Ogievskiy wrote:
>> It should be to reenter qio_channel_yield() on io/channel read/write
>> path, so it's safe to reduce in_flight and allow attaching new aio
>> context. And no problem to allow drain itself: connection attempt is
>> not a guest request. Moreover, if remote server is down, we can hang
>> in negotiation, blocking drain section and provoking a dead lock.
>>
>> How to reproduce the dead lock:
>>
> 
> I tried to reproduce this; but in the several minutes it has taken me to write this email, it still has not hung.  Still, your stack trace is fairly good evidence of the problem, where adding a temporary sleep or running it under gdb with a breakpoint can probably make reproduction easier.
> 
>> 1. Create nbd-fault-injector.conf with the following contents:
>>
>> [inject-error "mega1"]
>> event=data
>> io=readwrite
>> when=before
>>
>> 2. In one terminal run nbd-fault-injector in a loop, like this:
>>
>> n=1; while true; do
>>      echo $n; ((n++));
> 
> Bashism, but not a problem for the commit message.
> 
>>      ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
>> done
>>
>> 3. In another terminal run qemu-io in a loop, like this:
>>
>> n=1; while true; do
>>      echo $n; ((n++));
>>      ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;
> 
> I prefer the spelling nbd:// for TCP connections, but also inconsequential.
> 
>> Note, that the hang may be
>> triggered by another bug, so the whole case is fixed only together with
>> commit "block/nbd: on shutdown terminate connection attempt".
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/nbd.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/block/nbd.c b/block/nbd.c
>> index 65a4f56924..49254f1c3c 100644
>> --- a/block/nbd.c
>> +++ b/block/nbd.c
>> @@ -280,7 +280,18 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>>           s->ioc = NULL;
>>       }
>> +    bdrv_dec_in_flight(s->bs);
>>       s->connect_status = nbd_client_connect(s->bs, &local_err);
>> +    s->wait_drained_end = true;
>> +    while (s->drained) {
>> +        /*
>> +         * We may be entered once from nbd_client_attach_aio_context_bh
>> +         * and then from nbd_client_co_drain_end. So here is a loop.
>> +         */
>> +        qemu_coroutine_yield();
>> +    }
>> +    bdrv_inc_in_flight(s->bs);
>> +
> 
> This is very similar to the code in nbd_co_reconnect_loop.  Does that function still need to wait on drained, since it calls nbd_reconnect_attempt which is now doing the same loop?  But off-hand, I'm not seeing a problem with keeping both places.

I want to reduce in_flight around one operation only, and I'm afraid of continuing while drained. So, here is the pattern (shown concretely right after the list):

  - allow drain (by decreasing in_flight)
  - do some operation that is safe for a drained section
  - we're afraid that some further operations are unsafe for drained sections, so
    - disallow new drain (by increasing in_flight)
    - wait for the current drain to finish, if any
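
Concretely, that is just the hunk quoted above, annotated with these steps
(annotation only, nothing new):

    bdrv_dec_in_flight(s->bs);          /* allow drain */
    s->connect_status = nbd_client_connect(s->bs, &local_err);
                                        /* operation safe while drained */
    s->wait_drained_end = true;
    while (s->drained) {
        qemu_coroutine_yield();         /* let any in-progress drain finish */
    }
    bdrv_inc_in_flight(s->bs);          /* disallow new drain before unsafe operations */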

And I'm not sure that nbd_read_eof is not buggy: it just does dec/inc of in_flight around qio_channel_yield(), so nothing prevents us from continuing some other operation while in a drained section. The code in nbd_read_eof was introduced by d3bd5b90890f6715bce..
Is it safe?

> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
> As a bug fix, I'll be including this in my NBD pull request for the next -rc build.
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
  2020-07-20  9:00 ` [PATCH 1/3] block/nbd: allow drain during reconnect attempt Vladimir Sementsov-Ogievskiy
                     ` (2 preceding siblings ...)
  2020-07-24  9:49   ` Vladimir Sementsov-Ogievskiy
@ 2020-07-24 10:21   ` Vladimir Sementsov-Ogievskiy
  3 siblings, 0 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-24 10:21 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

20.07.2020 12:00, Vladimir Sementsov-Ogievskiy wrote:
> It should be to reenter qio_channel_yield() on io/channel read/write
> path, so it's safe to reduce in_flight and allow attaching new aio
> context. And no problem to allow drain itself: connection attempt is
> not a guest request. Moreover, if remote server is down, we can hang
> in negotiation, blocking drain section and provoking a dead lock.
> 
> How to reproduce the dead lock:
> 
> 1. Create nbd-fault-injector.conf with the following contents:
> 
> [inject-error "mega1"]
> event=data
> io=readwrite
> when=before
> 
> 2. In one terminal run nbd-fault-injector in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));
>      ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
> done
> 
> 3. In another terminal run qemu-io in a loop, like this:
> 
> n=1; while true; do
>      echo $n; ((n++));
>      ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;
> done
> 
> After some time, qemu-io will hang trying to drain, for example, like
> this:
> 
>   #3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
>      util/aio-posix.c:600
>   #4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
>      parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
>   #5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
>   #6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
>   #7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
>   #8 bdrv_open_inherit (filename=0x7fffba1563bc
>      "nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
>      flags=24578, parent=0x0, child_class=0x0, child_role=0,
>      errp=0x7fffba154620) at block.c:3491
>   #9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
>      reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
>      block.c:3513
>   #10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
>      reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
>      block/block-backend.c:421
> 
> And connection_co stack like this:
> 
>   #0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
>      action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
>   #1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
>   #2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
>      io/channel.c:472
>   #3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
>      niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
>   #4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
>      niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
>   #5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
>      "\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
>      io/channel.c:247
>   #6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
>      desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
>      /work/src/qemu/master/include/block/nbd.h:365
>   #7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
>      desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
>      /work/src/qemu/master/include/block/nbd.h:391
>   #8 nbd_start_negotiate (aio_context=0x55f006bdd890,
>      ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
>      outioc=0x55f006bf19f8, structured_reply=true,
>      zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
>   #9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
>      ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
>      outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
>      nbd/client.c:1032
>   #10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
>      block/nbd.c:1460
>   #11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
>   #12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
>   #13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
>   #14 coroutine_trampoline (i0=113190480, i1=22000) at
>      util/coroutine-ucontext.c:173
> 
> Note, that the hang may be
> triggered by another bug, so the whole case is fixed only together with
> commit "block/nbd: on shutdown terminate connection attempt".
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/nbd.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 65a4f56924..49254f1c3c 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -280,7 +280,18 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>           s->ioc = NULL;
>       }
>   
> +    bdrv_dec_in_flight(s->bs);
>       s->connect_status = nbd_client_connect(s->bs, &local_err);
> +    s->wait_drained_end = true;
> +    while (s->drained) {
> +        /*
> +         * We may be entered once from nbd_client_attach_aio_context_bh
> +         * and then from nbd_client_co_drain_end. So here is a loop.
> +         */
> +        qemu_coroutine_yield();
> +    }
> +    bdrv_inc_in_flight(s->bs);

My next version of non-blocking connect will need nbd_establish_connection() to be above bdrv_dec_in_flight(). So, I want to resend this anyway.

> +
>       error_free(s->connect_err);
>       s->connect_err = NULL;
>       error_propagate(&s->connect_err, local_err);
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
  2020-07-23 18:47   ` Eric Blake
  2020-07-24 10:04     ` Vladimir Sementsov-Ogievskiy
@ 2020-07-24 11:50     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-24 11:50 UTC (permalink / raw)
  To: Eric Blake, qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

23.07.2020 21:47, Eric Blake wrote:
> On 7/20/20 4:00 AM, Vladimir Sementsov-Ogievskiy wrote:
>> It should be to reenter qio_channel_yield() on io/channel read/write
>> path, so it's safe to reduce in_flight and allow attaching new aio
>> context. And no problem to allow drain itself: connection attempt is
>> not a guest request. Moreover, if remote server is down, we can hang
>> in negotiation, blocking drain section and provoking a dead lock.
>>
>> How to reproduce the dead lock:
>>
> 
> I tried to reproduce this; but in the several minutes it has taken me to write this email, it still has not hung.  Still, your stack trace is fairly good evidence of the problem, where adding a temporary sleep or running it under gdb with a breakpoint can probably make reproduction easier.

I've tried to make a reproducer by adding a temporary BDRV_POLL_WHILE, but I failed.

One time, it reproduced for me after 4000 iterations, but other times a lot earlier.

It may help to start several qemu-io loops in parallel.

Also, iotest 83 for -nbd sometimes hangs for me as well.

-- 
Best regards,
Vladimir



* Re: [PATCH 3/3] block/nbd: nbd_co_reconnect_loop(): don't sleep if drained
  2020-07-23 18:55   ` Eric Blake
@ 2020-07-27 18:42     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-07-27 18:42 UTC (permalink / raw)
  To: Eric Blake, qemu-block; +Cc: kwolf, den, qemu-devel, mreitz

23.07.2020 21:55, Eric Blake wrote:
> On 7/20/20 4:00 AM, Vladimir Sementsov-Ogievskiy wrote:
>> We try to go to wakeable sleep, so that, if drain begins it will break
>> the sleep. But what if nbd_client_co_drain_begin() already called and
>> s->drained is already true? We'll go to sleep, and drain will have to
>> wait for the whole timeout. Let's improve it.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/nbd.c | 11 ++++++-----
>>   1 file changed, 6 insertions(+), 5 deletions(-)
>>
> 
> How frequently did you hit this case?  At any rate, the optimization looks sane, and I'm happy to include it in 5.1.
> 

I don't remember. Probably once? And, as I said in the cover letter, it was not even the master branch but some experiment of mine...

-- 
Best regards,
Vladimir

