* [Qemu-devel] [PATCH v2] nbd: fix the co_queue multi-adding bug
@ 2015-02-10  7:20 Bin Wu
  2015-02-10  9:06 ` Paolo Bonzini
  2015-02-12 17:03 ` Stefan Hajnoczi
  0 siblings, 2 replies; 3+ messages in thread
From: Bin Wu @ 2015-02-10  7:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, famz, Bin Wu, stefanha

From: Bin Wu <wu.wubin@huawei.com>

When we tested VM migration between different hosts with NBD
devices, we found that if we sent a cancel command right after
drive_mirror had started, a coroutine re-enter error would occur.
The backtrace was as follows:

(gdb) bt
00)  0x00007fdfc744d885 in raise () from /lib64/libc.so.6
01)  0x00007fdfc744ee61 in abort () from /lib64/libc.so.6
02)  0x00007fdfca467cc5 in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
at qemu-coroutine.c:118
03)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedb400) at
qemu-coroutine-lock.c:59
04)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
to=0x7fdfcaedb400) at qemu-coroutine.c:96
05)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
at qemu-coroutine.c:123
06)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedbdc0) at
qemu-coroutine-lock.c:59
07)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
to=0x7fdfcaedbdc0) at qemu-coroutine.c:96
08)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedbdc0, opaque=0x0)
at qemu-coroutine.c:123
09)  0x00007fdfca4a1fa4 in nbd_recv_coroutines_enter_all (s=0x7fdfcaef7dd0) at
block/nbd-client.c:41
10) 0x00007fdfca4a1ff9 in nbd_teardown_connection (client=0x7fdfcaef7dd0) at
block/nbd-client.c:50
11) 0x00007fdfca4a20f0 in nbd_reply_ready (opaque=0x7fdfcaef7dd0) at
block/nbd-client.c:92
12) 0x00007fdfca45ed80 in aio_dispatch (ctx=0x7fdfcae15e90) at aio-posix.c:144
13) 0x00007fdfca45ef1b in aio_poll (ctx=0x7fdfcae15e90, blocking=false) at
aio-posix.c:222
14) 0x00007fdfca448c34 in aio_ctx_dispatch (source=0x7fdfcae15e90, callback=0x0,
user_data=0x0) at async.c:212
15) 0x00007fdfc8f2f69a in g_main_context_dispatch () from
/usr/lib64/libglib-2.0.so.0
16) 0x00007fdfca45c391 in glib_pollfds_poll () at main-loop.c:190
17) 0x00007fdfca45c489 in os_host_main_loop_wait (timeout=1483677098) at
main-loop.c:235
18) 0x00007fdfca45c57b in main_loop_wait (nonblocking=0) at main-loop.c:484
19) 0x00007fdfca25f403 in main_loop () at vl.c:2249
20) 0x00007fdfca266fc2 in main (argc=42, argv=0x7ffff517d638,
envp=0x7ffff517d790) at vl.c:4814

We found that nbd_recv_coroutines_enter_all() (triggered by a cancel
command or by the network connection breaking down) enters a coroutine
that is still waiting for the send lock. If the lock is held by another
coroutine, the entered coroutine is added to the co_queue again. Later,
when the lock is released, the same coroutine is entered a second time
and the re-enter error occurs.
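
To make the failure mode concrete, here is a minimal sketch of the
double queueing. It is not part of the patch, only an illustration
written against the 2015-era coroutine API (the "block/coroutine.h"
header and the two-argument qemu_coroutine_enter() visible in the
backtrace); the holder/waiter names and the standalone main() are
hypothetical and assume the file is built inside a QEMU tree of that
time.

#include "block/coroutine.h"

static CoMutex lock;

static void coroutine_fn holder(void *opaque)
{
    qemu_co_mutex_lock(&lock);    /* take the lock first */
    qemu_coroutine_yield();       /* keep holding it, like a request
                                   * blocked while sending */
    qemu_co_mutex_unlock(&lock);  /* release: restarts the co_queue */
}

static void coroutine_fn waiter(void *opaque)
{
    qemu_co_mutex_lock(&lock);    /* lock busy: queued on the mutex's
                                   * co_queue, then yields */
    qemu_co_mutex_unlock(&lock);
}

int main(void)
{
    Coroutine *h, *w;

    qemu_co_mutex_init(&lock);
    h = qemu_coroutine_create(holder);
    w = qemu_coroutine_create(waiter);

    qemu_coroutine_enter(h, NULL);  /* holder owns the lock and yields */
    qemu_coroutine_enter(w, NULL);  /* waiter blocks, queued once */

    /* This is what nbd_recv_coroutines_enter_all() effectively does on
     * teardown: it enters a coroutine that is still waiting for the
     * lock. The waiter wakes up, sees the mutex still locked and puts
     * itself on the co_queue a second time. */
    qemu_coroutine_enter(w, NULL);

    /* Releasing the lock now restarts a queue that names the waiter
     * twice; the second wakeup hits the re-enter check in
     * qemu_coroutine_enter(), as in the backtrace above. */
    qemu_coroutine_enter(h, NULL);
    return 0;
}

With the fix below, a coroutine that is still blocked on send_mutex has
not yet been published in s->recv_coroutine[], so teardown never enters
it while it sits on the co_queue.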

This bug can be fixed simply by delaying the setting of recv_coroutine,
as suggested by Paolo. After applying this patch, we ran the cancel
operation during the mirror phase in a loop for more than 5 hours and
everything was fine. Without this patch, a coroutine re-enter error
occurs within 5 minutes.

Signed-off-by: Bin Wu <wu.wubin@huawei.com>
---
v2: fix the coroutine re-enter bug in the NBD code, not in the coroutine
infrastructure, as suggested by Paolo and Kevin.
---
 block/nbd-client.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 28bfb62..1b0f7c7 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -104,10 +104,21 @@ static int nbd_co_send_request(NbdClientSession *s,
     QEMUIOVector *qiov, int offset)
 {
     AioContext *aio_context;
-    int rc, ret;
+    int rc, ret, i;
 
     qemu_co_mutex_lock(&s->send_mutex);
+
+    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
+        if (s->recv_coroutine[i] == NULL) {
+            s->recv_coroutine[i] = qemu_coroutine_self();
+            break;
+        }
+    }
+
+    assert(i < MAX_NBD_REQUESTS);
+    request->handle = INDEX_TO_HANDLE(s, i);
     s->send_coroutine = qemu_coroutine_self();
+
     aio_context = bdrv_get_aio_context(s->bs);
     aio_set_fd_handler(aio_context, s->sock,
                        nbd_reply_ready, nbd_restart_write, s);
@@ -164,8 +175,6 @@ static void nbd_co_receive_reply(NbdClientSession *s,
 static void nbd_coroutine_start(NbdClientSession *s,
    struct nbd_request *request)
 {
-    int i;
-
     /* Poor man semaphore.  The free_sema is locked when no other request
      * can be accepted, and unlocked after receiving one reply.  */
     if (s->in_flight >= MAX_NBD_REQUESTS - 1) {
@@ -174,15 +183,7 @@ static void nbd_coroutine_start(NbdClientSession *s,
     }
     s->in_flight++;
 
-    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-        if (s->recv_coroutine[i] == NULL) {
-            s->recv_coroutine[i] = qemu_coroutine_self();
-            break;
-        }
-    }
-
-    assert(i < MAX_NBD_REQUESTS);
-    request->handle = INDEX_TO_HANDLE(s, i);
+    /* s->recv_coroutine[i] is set as soon as we get the send_lock.  */
 }
 
 static void nbd_coroutine_end(NbdClientSession *s,
-- 
1.7.12.4

* Re: [Qemu-devel] [PATCH v2] nbd: fix the co_queue multi-adding bug
  2015-02-10  7:20 [Qemu-devel] [PATCH v2] nbd: fix the co_queue multi-adding bug Bin Wu
@ 2015-02-10  9:06 ` Paolo Bonzini
  2015-02-12 17:03 ` Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Paolo Bonzini @ 2015-02-10  9:06 UTC (permalink / raw)
  To: Bin Wu, qemu-devel; +Cc: kwolf, famz, stefanha



On 10/02/2015 08:20, Bin Wu wrote:
> From: Bin Wu <wu.wubin@huawei.com>
> 
> When we tested VM migration between different hosts with NBD
> devices, we found that if we sent a cancel command right after
> drive_mirror had started, a coroutine re-enter error would occur.
> 
> [backtrace and patch quoted in full above, trimmed here]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

* Re: [Qemu-devel] [PATCH v2] nbd: fix the co_queue multi-adding bug
  2015-02-10  7:20 [Qemu-devel] [PATCH v2] nbd: fix the co_queue multi-adding bug Bin Wu
  2015-02-10  9:06 ` Paolo Bonzini
@ 2015-02-12 17:03 ` Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2015-02-12 17:03 UTC (permalink / raw)
  To: Bin Wu; +Cc: kwolf, pbonzini, famz, qemu-devel, stefanha

On Tue, Feb 10, 2015 at 03:20:46PM +0800, Bin Wu wrote:
> From: Bin Wu <wu.wubin@huawei.com>
> 
> When we tested VM migration between different hosts with NBD
> devices, we found that if we sent a cancel command right after
> drive_mirror had started, a coroutine re-enter error would occur.
> 
> [backtrace and patch quoted in full above, trimmed here]

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan
