* [Qemu-devel] [PATCH] fix the co_queue multi-adding bug
@ 2015-02-07  9:51 w00214312
  2015-02-07  9:51 ` [Qemu-devel] [PATCH] qemu-coroutine-lock: fix " w00214312
  2015-02-09  9:23 ` [Qemu-devel] [PATCH] fix the " Paolo Bonzini
  0 siblings, 2 replies; 10+ messages in thread
From: w00214312 @ 2015-02-07  9:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, famz, Bin Wu, stefanha

From: Bin Wu <wu.wubin@huawei.com>

When we test drive_mirror between different hosts over nbd devices,
we find that the qemu process sometimes crashes during the cancel phase.
Checking the crash core file, we find the following stack, which indicates
a coroutine re-enter error:

(gdb) bt
#0  0x00007fdfc744d885 in raise () from /lib64/libc.so.6
#1  0x00007fdfc744ee61 in abort () from /lib64/libc.so.6
#2  0x00007fdfca467cc5 in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
at qemu-coroutine.c:118
#3  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedb400) at
qemu-coroutine-lock.c:59
#4  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
to=0x7fdfcaedb400) at qemu-coroutine.c:96
#5  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
at qemu-coroutine.c:123
#6  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedbdc0) at
qemu-coroutine-lock.c:59
#7  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
to=0x7fdfcaedbdc0) at qemu-coroutine.c:96
#8  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedbdc0, opaque=0x0)
at qemu-coroutine.c:123
#9  0x00007fdfca4a1fa4 in nbd_recv_coroutines_enter_all (s=0x7fdfcaef7dd0) at
block/nbd-client.c:41
#10 0x00007fdfca4a1ff9 in nbd_teardown_connection (client=0x7fdfcaef7dd0) at
block/nbd-client.c:50
#11 0x00007fdfca4a20f0 in nbd_reply_ready (opaque=0x7fdfcaef7dd0) at
block/nbd-client.c:92
#12 0x00007fdfca45ed80 in aio_dispatch (ctx=0x7fdfcae15e90) at aio-posix.c:144
#13 0x00007fdfca45ef1b in aio_poll (ctx=0x7fdfcae15e90, blocking=false) at
aio-posix.c:222
#14 0x00007fdfca448c34 in aio_ctx_dispatch (source=0x7fdfcae15e90, callback=0x0,
user_data=0x0) at async.c:212
#15 0x00007fdfc8f2f69a in g_main_context_dispatch () from
/usr/lib64/libglib-2.0.so.0
#16 0x00007fdfca45c391 in glib_pollfds_poll () at main-loop.c:190
#17 0x00007fdfca45c489 in os_host_main_loop_wait (timeout=1483677098) at
main-loop.c:235
#18 0x00007fdfca45c57b in main_loop_wait (nonblocking=0) at main-loop.c:484
#19 0x00007fdfca25f403 in main_loop () at vl.c:2249
#20 0x00007fdfca266fc2 in main (argc=42, argv=0x7ffff517d638,
envp=0x7ffff517d790) at vl.c:4814

We find that the nbd_recv_coroutines_enter_all function enters a coroutine
which is waiting for the sending lock. If the lock is still held by another
coroutine, the entered coroutine adds itself to the co_queue again. Later,
when the lock is released, a coroutine re-enter error occurs.
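
The re-enter comes from the mutex slow path. As a rough illustration, the
lock code this series patches looks essentially like the following (a
simplified sketch of qemu-coroutine-lock.c, for illustration only, not part
of the patch):

void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
{
    Coroutine *self = qemu_coroutine_self();

    while (mutex->locked) {
        /* Each pass inserts self into mutex->queue.  If the coroutine
         * is entered from outside (here by nbd_recv_coroutines_enter_all)
         * while the lock is still held, it loops and inserts itself
         * again even though the first entry is still on the queue;
         * when the lock is finally released, the same coroutine is
         * restarted twice, which triggers the re-enter abort above. */
        qemu_co_queue_wait(&mutex->queue);
    }
    mutex->locked = true;
}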

Bin Wu (1):
  qemu-coroutine-lock: fix co_queue multi-adding bug

 include/block/coroutine_int.h | 1 +
 qemu-coroutine-lock.c         | 6 +++++-
 qemu-coroutine.c              | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

-- 
1.7.12.4


* [Qemu-devel] [PATCH] qemu-coroutine-lock: fix co_queue multi-adding bug
  2015-02-07  9:51 [Qemu-devel] [PATCH] fix the co_queue multi-adding bug w00214312
@ 2015-02-07  9:51 ` w00214312
  2015-02-09  8:12   ` Fam Zheng
  2015-02-09  9:23 ` [Qemu-devel] [PATCH] fix the " Paolo Bonzini
  1 sibling, 1 reply; 10+ messages in thread
From: w00214312 @ 2015-02-07  9:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, famz, Bin Wu, stefanha

From: Bin Wu <wu.wubin@huawei.com>

When a coroutine holds a lock, other coroutines that want to get
the lock must wait on a co_queue by adding themselves to the
CoQueue. However, if a waiting coroutine is woken up while the
lock is still held by another coroutine, this waiting coroutine
will add itself to the co_queue again. Later, when the lock
is released, a coroutine re-enter will occur.

We need to determine whether a coroutine is alread in the co_queue
before adding it to the waiting queue.

Signed-off-by: Bin Wu <wu.wubin@huawei.com>
---
 include/block/coroutine_int.h | 1 +
 qemu-coroutine-lock.c         | 6 +++++-
 qemu-coroutine.c              | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
index f133d65..c524990 100644
--- a/include/block/coroutine_int.h
+++ b/include/block/coroutine_int.h
@@ -42,6 +42,7 @@ struct Coroutine {
     /* Coroutines that should be woken up when we yield or terminate */
     QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
     QTAILQ_ENTRY(Coroutine) co_queue_next;
+    bool in_co_queue;
 };
 
 Coroutine *qemu_coroutine_new(void);
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index e4860ae..d256f53 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -36,7 +36,10 @@ void qemu_co_queue_init(CoQueue *queue)
 void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
 {
     Coroutine *self = qemu_coroutine_self();
-    QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
+    if (!self->in_co_queue) {
+        QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
+        self->in_co_queue = true;
+    }
     qemu_coroutine_yield();
     assert(qemu_in_coroutine());
 }
@@ -71,6 +74,7 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
 
     while ((next = QTAILQ_FIRST(&queue->entries)) != NULL) {
         QTAILQ_REMOVE(&queue->entries, next, co_queue_next);
+        next->in_co_queue = false;
         QTAILQ_INSERT_TAIL(&self->co_queue_wakeup, next, co_queue_next);
         trace_qemu_co_queue_next(next);
         if (single) {
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 525247b..a103721 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -75,6 +75,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     }
 
     co->entry = entry;
+    co->in_co_queue = false;
     QTAILQ_INIT(&co->co_queue_wakeup);
     return co;
 }
-- 
1.7.12.4


* Re: [Qemu-devel] [PATCH] qemu-coroutine-lock: fix co_queue multi-adding bug
  2015-02-07  9:51 ` [Qemu-devel] [PATCH] qemu-coroutine-lock: fix " w00214312
@ 2015-02-09  8:12   ` Fam Zheng
  2015-02-09  9:36     ` Bin Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Fam Zheng @ 2015-02-09  8:12 UTC (permalink / raw)
  To: w00214312; +Cc: kwolf, pbonzini, qemu-devel, stefanha

On Sat, 02/07 17:51, w00214312 wrote:
> From: Bin Wu <wu.wubin@huawei.com>
> 
> When a coroutine holds a lock, other coroutines that want to get
> the lock must wait on a co_queue by adding themselves to the
> CoQueue. However, if a waiting coroutine is woken up while the
> lock is still held by another coroutine, this waiting coroutine

Could you explain who wakes up the waiting coroutine? Maybe the bug is that
it shouldn't be woken up in the first place.

> will add itself to the co_queue again. Later, when the lock
> is released, a coroutine re-enter will occur.
> 
> We need to determine whether a coroutine is alread in the co_queue

s/alread/already/

Fam

> before adding it to the waiting queue.
> 
> Signed-off-by: Bin Wu <wu.wubin@huawei.com>
> ---
>  include/block/coroutine_int.h | 1 +
>  qemu-coroutine-lock.c         | 6 +++++-
>  qemu-coroutine.c              | 1 +
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
> index f133d65..c524990 100644
> --- a/include/block/coroutine_int.h
> +++ b/include/block/coroutine_int.h
> @@ -42,6 +42,7 @@ struct Coroutine {
>      /* Coroutines that should be woken up when we yield or terminate */
>      QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
>      QTAILQ_ENTRY(Coroutine) co_queue_next;
> +    bool in_co_queue;
>  };
>  
>  Coroutine *qemu_coroutine_new(void);
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index e4860ae..d256f53 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -36,7 +36,10 @@ void qemu_co_queue_init(CoQueue *queue)
>  void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
>  {
>      Coroutine *self = qemu_coroutine_self();
> -    QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
> +    if (!self->in_co_queue) {
> +        QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
> +        self->in_co_queue = true;
> +    }
>      qemu_coroutine_yield();
>      assert(qemu_in_coroutine());
>  }
> @@ -71,6 +74,7 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
>  
>      while ((next = QTAILQ_FIRST(&queue->entries)) != NULL) {
>          QTAILQ_REMOVE(&queue->entries, next, co_queue_next);
> +        next->in_co_queue = false;
>          QTAILQ_INSERT_TAIL(&self->co_queue_wakeup, next, co_queue_next);
>          trace_qemu_co_queue_next(next);
>          if (single) {
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 525247b..a103721 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -75,6 +75,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>      }
>  
>      co->entry = entry;
> +    co->in_co_queue = false;
>      QTAILQ_INIT(&co->co_queue_wakeup);
>      return co;
>  }
> -- 
> 1.7.12.4
> 
> 


* Re: [Qemu-devel] [PATCH] fix the co_queue multi-adding bug
  2015-02-07  9:51 [Qemu-devel] [PATCH] fix the co_queue multi-adding bug w00214312
  2015-02-07  9:51 ` [Qemu-devel] [PATCH] qemu-coroutine-lock: fix " w00214312
@ 2015-02-09  9:23 ` Paolo Bonzini
  2015-02-09  9:47   ` Bin Wu
  2015-02-10  6:34   ` Bin Wu
  1 sibling, 2 replies; 10+ messages in thread
From: Paolo Bonzini @ 2015-02-09  9:23 UTC (permalink / raw)
  To: w00214312, qemu-devel; +Cc: kwolf, famz, stefanha



On 07/02/2015 10:51, w00214312 wrote:
> From: Bin Wu <wu.wubin@huawei.com>
> 
> When we test drive_mirror between different hosts over nbd devices,
> we find that the qemu process sometimes crashes during the cancel phase.
> Checking the crash core file, we find the following stack, which indicates
> a coroutine re-enter error:

This bug probably can be fixed simply by delaying the setting of
recv_coroutine.

What are the symptoms if you only apply your "qemu-coroutine-lock: fix
co_queue multi-adding bug" patch but not "qemu-coroutine: fix
qemu_co_queue_run_restart error"?

Can you try the patch below?  (Compile-tested only).

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 6e1c97c..23d6a71 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -104,10 +104,21 @@ static int nbd_co_send_request(NbdClientSession *s,
     QEMUIOVector *qiov, int offset)
 {
     AioContext *aio_context;
-    int rc, ret;
+    int rc, ret, i;
 
     qemu_co_mutex_lock(&s->send_mutex);
+
+    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
+        if (s->recv_coroutine[i] == NULL) {
+            s->recv_coroutine[i] = qemu_coroutine_self();
+            break;
+        }
+    }
+
+    assert(i < MAX_NBD_REQUESTS);
+    request->handle = INDEX_TO_HANDLE(s, i);
     s->send_coroutine = qemu_coroutine_self();
+
     aio_context = bdrv_get_aio_context(s->bs);
     aio_set_fd_handler(aio_context, s->sock,
                        nbd_reply_ready, nbd_restart_write, s);
@@ -164,8 +175,6 @@ static void nbd_co_receive_reply(NbdClientSession *s,
 static void nbd_coroutine_start(NbdClientSession *s,
    struct nbd_request *request)
 {
-    int i;
-
     /* Poor man semaphore.  The free_sema is locked when no other request
      * can be accepted, and unlocked after receiving one reply.  */
     if (s->in_flight >= MAX_NBD_REQUESTS - 1) {
@@ -174,15 +183,7 @@ static void nbd_coroutine_start(NbdClientSession *s,
     }
     s->in_flight++;
 
-    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-        if (s->recv_coroutine[i] == NULL) {
-            s->recv_coroutine[i] = qemu_coroutine_self();
-            break;
-        }
-    }
-
-    assert(i < MAX_NBD_REQUESTS);
-    request->handle = INDEX_TO_HANDLE(s, i);
+    /* s->recv_coroutine[i] is set as soon as we get the send_lock.  */
 }
 
 static void nbd_coroutine_end(NbdClientSession *s,


* Re: [Qemu-devel] [PATCH] qemu-coroutine-lock: fix co_queue multi-adding bug
  2015-02-09  8:12   ` Fam Zheng
@ 2015-02-09  9:36     ` Bin Wu
  2015-02-09  9:37       ` Paolo Bonzini
  2015-02-09 10:12       ` Kevin Wolf
  0 siblings, 2 replies; 10+ messages in thread
From: Bin Wu @ 2015-02-09  9:36 UTC (permalink / raw)
  To: Fam Zheng; +Cc: kwolf, pbonzini, qemu-devel, stefanha

On 2015/2/9 16:12, Fam Zheng wrote:
> On Sat, 02/07 17:51, w00214312 wrote:
>> From: Bin Wu <wu.wubin@huawei.com>
>>
>> When a coroutine holds a lock, other coroutines that want to get
>> the lock must wait on a co_queue by adding themselves to the
>> CoQueue. However, if a waiting coroutine is woken up while the
>> lock is still held by another coroutine, this waiting coroutine
> 
> Could you explain who wakes up the waiting coroutine? Maybe the bug is that
> it shouldn't be woken up in the first place.
> 

During the mirror phase with nbd devices, if we send a cancel command or
the physical network breaks down, the source qemu process receives a readable
event and the main loop invokes nbd_reply_ready to deal with it. This
function finds that the connection is down and then goes into
nbd_teardown_connection, which wakes up all working coroutines
via nbd_recv_coroutines_enter_all. These coroutines include the one which holds
the sending lock, the ones waiting for the lock, and the ones waiting to
receive messages.
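
For reference, nbd_recv_coroutines_enter_all (block/nbd-client.c:41 in the
backtrace above) is essentially the following loop; this is a simplified
sketch from memory, so treat the details as approximate:

static void nbd_recv_coroutines_enter_all(NbdClientSession *s)
{
    int i;

    /* Enter every coroutine that has claimed a recv_coroutine[] slot,
     * regardless of its current state -- including coroutines that are
     * still queued on the send lock, which is the unsafe wakeup at
     * issue here. */
    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
        if (s->recv_coroutine[i]) {
            qemu_coroutine_enter(s->recv_coroutine[i], NULL);
        }
    }
}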

I think the purpose of nbd_recv_coroutines_enter_all is to terminate all waiting
coroutines by waking them up. If the coroutine waiting for the lock is
allowed to be woken up, this implementation is OK. If not, we need to distinguish
the coroutines waiting to receive messages from the ones waiting for the lock.

In my opinion, if the coroutines waiting for a lock are allowed to be woken
up, the implementation should be more robust :>

>> will add itself to the co_queue again. Later, when the lock
>> is released, a coroutine re-enter will occur.
>>
>> We need to determine whether a coroutine is alread in the co_queue
> 
> s/alread/already/
> 
> Fam
> 

Thanks, my mistake.

>> before adding it to the waiting queue.
>>
>> Signed-off-by: Bin Wu <wu.wubin@huawei.com>
>> ---
>>  include/block/coroutine_int.h | 1 +
>>  qemu-coroutine-lock.c         | 6 +++++-
>>  qemu-coroutine.c              | 1 +
>>  3 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
>> index f133d65..c524990 100644
>> --- a/include/block/coroutine_int.h
>> +++ b/include/block/coroutine_int.h
>> @@ -42,6 +42,7 @@ struct Coroutine {
>>      /* Coroutines that should be woken up when we yield or terminate */
>>      QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
>>      QTAILQ_ENTRY(Coroutine) co_queue_next;
>> +    bool in_co_queue;
>>  };
>>  
>>  Coroutine *qemu_coroutine_new(void);
>> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
>> index e4860ae..d256f53 100644
>> --- a/qemu-coroutine-lock.c
>> +++ b/qemu-coroutine-lock.c
>> @@ -36,7 +36,10 @@ void qemu_co_queue_init(CoQueue *queue)
>>  void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
>>  {
>>      Coroutine *self = qemu_coroutine_self();
>> -    QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
>> +    if (!self->in_co_queue) {
>> +        QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
>> +        self->in_co_queue = true;
>> +    }
>>      qemu_coroutine_yield();
>>      assert(qemu_in_coroutine());
>>  }
>> @@ -71,6 +74,7 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
>>  
>>      while ((next = QTAILQ_FIRST(&queue->entries)) != NULL) {
>>          QTAILQ_REMOVE(&queue->entries, next, co_queue_next);
>> +        next->in_co_queue = false;
>>          QTAILQ_INSERT_TAIL(&self->co_queue_wakeup, next, co_queue_next);
>>          trace_qemu_co_queue_next(next);
>>          if (single) {
>> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
>> index 525247b..a103721 100644
>> --- a/qemu-coroutine.c
>> +++ b/qemu-coroutine.c
>> @@ -75,6 +75,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>>      }
>>  
>>      co->entry = entry;
>> +    co->in_co_queue = false;
>>      QTAILQ_INIT(&co->co_queue_wakeup);
>>      return co;
>>  }
>> -- 
>> 1.7.12.4
>>
>>
> 
> .
> 

-- 
Bin Wu


* Re: [Qemu-devel] [PATCH] qemu-coroutine-lock: fix co_queue multi-adding bug
  2015-02-09  9:36     ` Bin Wu
@ 2015-02-09  9:37       ` Paolo Bonzini
  2015-02-09 10:12       ` Kevin Wolf
  1 sibling, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2015-02-09  9:37 UTC (permalink / raw)
  To: Bin Wu, Fam Zheng; +Cc: kwolf, qemu-devel, stefanha



On 09/02/2015 10:36, Bin Wu wrote:
> During the mirror phase with nbd devices, if we send a cancel command or
> the physical network breaks down, the source qemu process receives a readable
> event and the main loop invokes nbd_reply_ready to deal with it. This
> function finds that the connection is down and then goes into
> nbd_teardown_connection, which wakes up all working coroutines
> via nbd_recv_coroutines_enter_all. These coroutines include the one which holds
> the sending lock, the ones waiting for the lock, and the ones waiting to
> receive messages.
> 
> I think the purpose of nbd_recv_coroutines_enter_all is to terminate all waiting
> coroutines by waking them up. If the coroutine waiting for the lock is
> allowed to be woken up, this implementation is OK. If not, we need to distinguish
> the coroutines waiting to receive messages from the ones waiting for the lock.
> 
> In my opinion, if the coroutines waiting for a lock are allowed to be woken
> up, the implementation should be more robust :>

No, it's not allowed.

Paolo


* Re: [Qemu-devel] [PATCH] fix the co_queue multi-adding bug
  2015-02-09  9:23 ` [Qemu-devel] [PATCH] fix the " Paolo Bonzini
@ 2015-02-09  9:47   ` Bin Wu
  2015-02-10  6:34   ` Bin Wu
  1 sibling, 0 replies; 10+ messages in thread
From: Bin Wu @ 2015-02-09  9:47 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, famz, stefanha

On 2015/2/9 17:23, Paolo Bonzini wrote:
> 
> 
> On 07/02/2015 10:51, w00214312 wrote:
>> From: Bin Wu <wu.wubin@huawei.com>
>>
>> When we test drive_mirror between different hosts over nbd devices,
>> we find that the qemu process sometimes crashes during the cancel phase.
>> Checking the crash core file, we find the following stack, which indicates
>> a coroutine re-enter error:
> 
> This bug probably can be fixed simply by delaying the setting of
> recv_coroutine.
> 
> What are the symptoms if you only apply your "qemu-coroutine-lock: fix
> co_queue multi-adding bug" patch but not "qemu-coroutine: fix
> qemu_co_queue_run_restart error"?
> 
> Can you try the patch below?  (Compile-tested only).
> 

Yes, I think this patch can solve the problem too. I will try the patch later.

> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 6e1c97c..23d6a71 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -104,10 +104,21 @@ static int nbd_co_send_request(NbdClientSession *s,
>      QEMUIOVector *qiov, int offset)
>  {
>      AioContext *aio_context;
> -    int rc, ret;
> +    int rc, ret, i;
>  
>      qemu_co_mutex_lock(&s->send_mutex);
> +
> +    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
> +        if (s->recv_coroutine[i] == NULL) {
> +            s->recv_coroutine[i] = qemu_coroutine_self();
> +            break;
> +        }
> +    }
> +
> +    assert(i < MAX_NBD_REQUESTS);
> +    request->handle = INDEX_TO_HANDLE(s, i);
>      s->send_coroutine = qemu_coroutine_self();
> +
>      aio_context = bdrv_get_aio_context(s->bs);
>      aio_set_fd_handler(aio_context, s->sock,
>                         nbd_reply_ready, nbd_restart_write, s);
> @@ -164,8 +175,6 @@ static void nbd_co_receive_reply(NbdClientSession *s,
>  static void nbd_coroutine_start(NbdClientSession *s,
>     struct nbd_request *request)
>  {
> -    int i;
> -
>      /* Poor man semaphore.  The free_sema is locked when no other request
>       * can be accepted, and unlocked after receiving one reply.  */
>      if (s->in_flight >= MAX_NBD_REQUESTS - 1) {
> @@ -174,15 +183,7 @@ static void nbd_coroutine_start(NbdClientSession *s,
>      }
>      s->in_flight++;
>  
> -    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
> -        if (s->recv_coroutine[i] == NULL) {
> -            s->recv_coroutine[i] = qemu_coroutine_self();
> -            break;
> -        }
> -    }
> -
> -    assert(i < MAX_NBD_REQUESTS);
> -    request->handle = INDEX_TO_HANDLE(s, i);
> +    /* s->recv_coroutine[i] is set as soon as we get the send_lock.  */
>  }
>  
>  static void nbd_coroutine_end(NbdClientSession *s,
> 
> 
> 

-- 
Bin Wu


* Re: [Qemu-devel] [PATCH] qemu-coroutine-lock: fix co_queue multi-adding bug
  2015-02-09  9:36     ` Bin Wu
  2015-02-09  9:37       ` Paolo Bonzini
@ 2015-02-09 10:12       ` Kevin Wolf
  2015-02-10  1:08         ` Bin Wu
  1 sibling, 1 reply; 10+ messages in thread
From: Kevin Wolf @ 2015-02-09 10:12 UTC (permalink / raw)
  To: Bin Wu; +Cc: pbonzini, Fam Zheng, qemu-devel, stefanha

Am 09.02.2015 um 10:36 hat Bin Wu geschrieben:
> On 2015/2/9 16:12, Fam Zheng wrote:
> > On Sat, 02/07 17:51, w00214312 wrote:
> >> From: Bin Wu <wu.wubin@huawei.com>
> >>
> >> When a coroutine holds a lock, other coroutines that want to get
> >> the lock must wait on a co_queue by adding themselves to the
> >> CoQueue. However, if a waiting coroutine is woken up while the
> >> lock is still held by another coroutine, this waiting coroutine
> > 
> > Could you explain who wakes up the waiting coroutine? Maybe the bug is that
> > it shouldn't be woken up in the first place.
> > 
> 
> During the mirror phase with nbd devices, if we send a cancel command or
> the physical network breaks down, the source qemu process receives a readable
> event and the main loop invokes nbd_reply_ready to deal with it. This
> function finds that the connection is down and then goes into
> nbd_teardown_connection, which wakes up all working coroutines
> via nbd_recv_coroutines_enter_all. These coroutines include the one which holds
> the sending lock, the ones waiting for the lock, and the ones waiting to
> receive messages.

This is the bug. It's not allowed to reenter a coroutine if you don't
know its state. NBD needs a fix, not the generic coroutine
infrastructure.

If we want to change anything in the lock implementation, it should be
adding an assertion to catch such violations of the rule. (Untested, but
I think the assertion should hold true.)

Kevin

diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index e4860ae..25fc111 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -123,9 +123,8 @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
 
     trace_qemu_co_mutex_lock_entry(mutex, self);
 
-    while (mutex->locked) {
-        qemu_co_queue_wait(&mutex->queue);
-    }
+    qemu_co_queue_wait(&mutex->queue);
+    assert(!mutex->locked);
 
     mutex->locked = true;


* Re: [Qemu-devel] [PATCH] qemu-coroutine-lock: fix co_queue multi-adding bug
  2015-02-09 10:12       ` Kevin Wolf
@ 2015-02-10  1:08         ` Bin Wu
  0 siblings, 0 replies; 10+ messages in thread
From: Bin Wu @ 2015-02-10  1:08 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pbonzini, Fam Zheng, qemu-devel, stefanha

On 2015/2/9 18:12, Kevin Wolf wrote:
> Am 09.02.2015 um 10:36 hat Bin Wu geschrieben:
>> On 2015/2/9 16:12, Fam Zheng wrote:
>>> On Sat, 02/07 17:51, w00214312 wrote:
>>>> From: Bin Wu <wu.wubin@huawei.com>
>>>>
>>>> When a coroutine holds a lock, other coroutines that want to get
>>>> the lock must wait on a co_queue by adding themselves to the
>>>> CoQueue. However, if a waiting coroutine is woken up while the
>>>> lock is still held by another coroutine, this waiting coroutine
>>>
>>> Could you explain who wakes up the waiting coroutine? Maybe the bug is that
>>> it shouldn't be woken up in the first place.
>>>
>>
>> During the mirror phase with nbd devices, if we send a cancel command or
>> the physical network breaks down, the source qemu process receives a readable
>> event and the main loop invokes nbd_reply_ready to deal with it. This
>> function finds that the connection is down and then goes into
>> nbd_teardown_connection, which wakes up all working coroutines
>> via nbd_recv_coroutines_enter_all. These coroutines include the one which holds
>> the sending lock, the ones waiting for the lock, and the ones waiting to
>> receive messages.
> 
> This is the bug. It's not allowed to reenter a coroutine if you don't
> know its state. NBD needs a fix, not the generic coroutine
> infrastructure.
> 
> If we want to change anything in the lock implementation, it should be
> adding an assertion to catch such violations of the rule. (Untested, but
> I think the assertion should hold true.)
> 
> Kevin
> 
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index e4860ae..25fc111 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -123,9 +123,8 @@ void coroutine_fn qemu_co_mutex_lock(CoMutex *mutex)
>  
>      trace_qemu_co_mutex_lock_entry(mutex, self);
>  
> -    while (mutex->locked) {
> -        qemu_co_queue_wait(&mutex->queue);
> -    }
> +    qemu_co_queue_wait(&mutex->queue);
> +    assert(!mutex->locked);
>  
>      mutex->locked = true;
> 
> .
> 

I see. Paolo's patch in his first reply can fix the bug in NBD (not in the
coroutine code). I will complete that patch, run some tests, and send another
version later.

-- 
Bin Wu


* Re: [Qemu-devel] [PATCH] fix the co_queue multi-adding bug
  2015-02-09  9:23 ` [Qemu-devel] [PATCH] fix the " Paolo Bonzini
  2015-02-09  9:47   ` Bin Wu
@ 2015-02-10  6:34   ` Bin Wu
  1 sibling, 0 replies; 10+ messages in thread
From: Bin Wu @ 2015-02-10  6:34 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, famz, stefanha

On 2015/2/9 17:23, Paolo Bonzini wrote:
> 
> 
> On 07/02/2015 10:51, w00214312 wrote:
>> From: Bin Wu <wu.wubin@huawei.com>
>>
>> When we test drive_mirror between different hosts over nbd devices,
>> we find that the qemu process sometimes crashes during the cancel phase.
>> Checking the crash core file, we find the following stack, which indicates
>> a coroutine re-enter error:
> 
> This bug probably can be fixed simply by delaying the setting of
> recv_coroutine.
> 
> What are the symptoms if you only apply your "qemu-coroutine-lock: fix
> co_queue multi-adding bug" patch but not "qemu-coroutine: fix
> qemu_co_queue_run_restart error"?

These two patches solve two different problems:
- "qemu-coroutine-lock: fix co_queue multi-adding bug" solves the coroutine
re-enter problem, which we found when sending a cancel command just after
drive_mirror has started.
- "qemu-coroutine: fix qemu_co_queue_run_restart error" solves a segfault
that occurs during the drive_mirror phase between two VMs copying large
files to each other.

> 
> Can you try the patch below?  (Compile-tested only).
> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 6e1c97c..23d6a71 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -104,10 +104,21 @@ static int nbd_co_send_request(NbdClientSession *s,
>      QEMUIOVector *qiov, int offset)
>  {
>      AioContext *aio_context;
> -    int rc, ret;
> +    int rc, ret, i;
>  
>      qemu_co_mutex_lock(&s->send_mutex);
> +
> +    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
> +        if (s->recv_coroutine[i] == NULL) {
> +            s->recv_coroutine[i] = qemu_coroutine_self();
> +            break;
> +        }
> +    }
> +
> +    assert(i < MAX_NBD_REQUESTS);
> +    request->handle = INDEX_TO_HANDLE(s, i);
>      s->send_coroutine = qemu_coroutine_self();
> +
>      aio_context = bdrv_get_aio_context(s->bs);
>      aio_set_fd_handler(aio_context, s->sock,
>                         nbd_reply_ready, nbd_restart_write, s);
> @@ -164,8 +175,6 @@ static void nbd_co_receive_reply(NbdClientSession *s,
>  static void nbd_coroutine_start(NbdClientSession *s,
>     struct nbd_request *request)
>  {
> -    int i;
> -
>      /* Poor man semaphore.  The free_sema is locked when no other request
>       * can be accepted, and unlocked after receiving one reply.  */
>      if (s->in_flight >= MAX_NBD_REQUESTS - 1) {
> @@ -174,15 +183,7 @@ static void nbd_coroutine_start(NbdClientSession *s,
>      }
>      s->in_flight++;
>  
> -    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
> -        if (s->recv_coroutine[i] == NULL) {
> -            s->recv_coroutine[i] = qemu_coroutine_self();
> -            break;
> -        }
> -    }
> -
> -    assert(i < MAX_NBD_REQUESTS);
> -    request->handle = INDEX_TO_HANDLE(s, i);
> +    /* s->recv_coroutine[i] is set as soon as we get the send_lock.  */
>  }
>  
>  static void nbd_coroutine_end(NbdClientSession *s,
> 
> 
> 

-- 
Bin Wu

