* [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock
@ 2018-05-06 14:54 Lidong Chen
  2018-05-06 14:54 ` [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect Lidong Chen
  2018-05-08 17:19 ` [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock Dr. David Alan Gilbert
  0 siblings, 2 replies; 19+ messages in thread
From: Lidong Chen @ 2018-05-06 14:54 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel, galsha, aviadye, adido, Lidong Chen

The rdma_delete_block function deletes an RDMALocalBlock based on its index
field, but does not update the index fields of the remaining blocks. The next
invocation of rdma_delete_block therefore operates on stale indices and does
not work correctly.

If migration is started and cancelled repeatedly, ibv_dereg_mr is never
invoked for some RDMALocalBlocks, so the kernel mm_struct pinned memory count
(vmpin) is not decreased. Once vmpin grows larger than the max locked memory
limit, ibv_reg_mr fails and migration cannot be started successfully again.

Signed-off-by: Lidong Chen <lidongchen@tencent.com>
---
 migration/rdma.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/migration/rdma.c b/migration/rdma.c
index ed9cfb1..0dd4033 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -713,6 +713,9 @@ static int rdma_delete_block(RDMAContext *rdma, RDMALocalBlock *block)
             memcpy(local->block + block->index, old + (block->index + 1),
                 sizeof(RDMALocalBlock) *
                     (local->nb_blocks - (block->index + 1)));
+            for (x = block->index; x < local->nb_blocks - 1; x++) {
+                local->block[x].index--;
+            }
         }
     } else {
         assert(block == local->block);
@@ -3398,6 +3401,10 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque)
             qsort(rdma->local_ram_blocks.block,
                   rdma->local_ram_blocks.nb_blocks,
                   sizeof(RDMALocalBlock), dest_ram_sort_func);
+            for (i = 0; i < local->nb_blocks; i++) {
+                local->block[i].index = i;
+            }
+
             if (rdma->pin_all) {
                 ret = qemu_rdma_reg_whole_ram_blocks(rdma);
                 if (ret) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-06 14:54 [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock Lidong Chen
@ 2018-05-06 14:54 ` Lidong Chen
  2018-05-08 18:40   ` Dr. David Alan Gilbert
  2018-05-08 17:19 ` [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock Dr. David Alan Gilbert
  1 sibling, 1 reply; 19+ messages in thread
From: Lidong Chen @ 2018-05-06 14:54 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel, galsha, aviadye, adido, Lidong Chen

When a migration is cancelled during RDMA precopy, the source qemu main thread sometimes hangs.

The backtrace is:
    (gdb) bt
    #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
    #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
    #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
    #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
    #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
    #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
    #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
    #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
    #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
    #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
    #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
    #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
        at util/async.c:261
    #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
    #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
    #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
    #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
    #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
    #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752

Sometimes the RDMA_CM_EVENT_DISCONNECTED event is not delivered after rdma_disconnect.
I have not found the root cause of the missing RDMA_CM_EVENT_DISCONNECTED event, but
it can be reproduced when ibv_dereg_mr is not invoked to release all ram blocks, which
is fixed in the previous patch.

In any case, rdma_get_cm_event should not be invoked in the main thread, and the event
channel is destroyed later in qemu_rdma_cleanup anyway.

Signed-off-by: Lidong Chen <lidongchen@tencent.com>
---
 migration/rdma.c       | 12 ++----------
 migration/trace-events |  1 -
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 0dd4033..92e4d30 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
 
 static void qemu_rdma_cleanup(RDMAContext *rdma)
 {
-    struct rdma_cm_event *cm_event;
-    int ret, idx;
+    int idx;
 
     if (rdma->cm_id && rdma->connected) {
         if ((rdma->error_state ||
@@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
             qemu_rdma_post_send_control(rdma, NULL, &head);
         }
 
-        ret = rdma_disconnect(rdma->cm_id);
-        if (!ret) {
-            trace_qemu_rdma_cleanup_waiting_for_disconnect();
-            ret = rdma_get_cm_event(rdma->channel, &cm_event);
-            if (!ret) {
-                rdma_ack_cm_event(cm_event);
-            }
-        }
+        rdma_disconnect(rdma->cm_id);
         trace_qemu_rdma_cleanup_disconnect();
         rdma->connected = false;
     }
diff --git a/migration/trace-events b/migration/trace-events
index d6be74b..64573ff 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
 qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
 qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
 qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_cleanup_waiting_for_disconnect(void) ""
 qemu_rdma_close(void) ""
 qemu_rdma_connect_pin_all_requested(void) ""
 qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock
  2018-05-06 14:54 [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock Lidong Chen
  2018-05-06 14:54 ` [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect Lidong Chen
@ 2018-05-08 17:19 ` Dr. David Alan Gilbert
  2018-05-09  1:39   ` 858585 jemmy
  1 sibling, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-08 17:19 UTC (permalink / raw)
  To: Lidong Chen; +Cc: quintela, qemu-devel, galsha, aviadye, adido, Lidong Chen

* Lidong Chen (jemmy858585@gmail.com) wrote:
> rdma_delete_block function deletes RDMALocalBlock base on index field,
> but not update the index field. So when next time invoke rdma_delete_block,
> it will not work correctly.
> 
> If start and cancel migration repeatedly, some RDMALocalBlock not invoke
> ibv_dereg_mr to decrease kernel mm_struct vmpin. When vmpin is large than
> max locked memory limitation, ibv_reg_mr will failed, and migration can not
> start successfully again.
> 
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> ---
>  migration/rdma.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index ed9cfb1..0dd4033 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -713,6 +713,9 @@ static int rdma_delete_block(RDMAContext *rdma, RDMALocalBlock *block)
>              memcpy(local->block + block->index, old + (block->index + 1),
>                  sizeof(RDMALocalBlock) *
>                      (local->nb_blocks - (block->index + 1)));
> +            for (x = block->index; x < local->nb_blocks - 1; x++) {
> +                local->block[x].index--;
> +            }

Yes; is that equivalent to   local->blocks[x].index = x;   ?

>          }
>      } else {
>          assert(block == local->block);
> @@ -3398,6 +3401,10 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque)
>              qsort(rdma->local_ram_blocks.block,
>                    rdma->local_ram_blocks.nb_blocks,
>                    sizeof(RDMALocalBlock), dest_ram_sort_func);
> +            for (i = 0; i < local->nb_blocks; i++) {
> +                local->block[i].index = i;
> +            }
> +

Which is basically the way that one does it;

OK, it's a while since I looked at this but I think it fixes my 3 year
old 03fcab38617 patch, so



Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

>              if (rdma->pin_all) {
>                  ret = qemu_rdma_reg_whole_ram_blocks(rdma);
>                  if (ret) {
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-06 14:54 ` [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect Lidong Chen
@ 2018-05-08 18:40   ` Dr. David Alan Gilbert
  2018-05-09  3:43     ` 858585 jemmy
  0 siblings, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-08 18:40 UTC (permalink / raw)
  To: Lidong Chen; +Cc: quintela, qemu-devel, galsha, aviadye, adido, Lidong Chen

* Lidong Chen (jemmy858585@gmail.com) wrote:
> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
> 
> The backtrace is:
>     (gdb) bt
>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
>         at util/async.c:261
>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
> 
> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
> in previous patch.

Does this happen without your other changes?
Can you give me instructions to reproduce it, and also say which
cards you were using?

> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
> is also destroyed in qemu_rdma_cleanup.
> 
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> ---
>  migration/rdma.c       | 12 ++----------
>  migration/trace-events |  1 -
>  2 files changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 0dd4033..92e4d30 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>  
>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>  {
> -    struct rdma_cm_event *cm_event;
> -    int ret, idx;
> +    int idx;
>  
>      if (rdma->cm_id && rdma->connected) {
>          if ((rdma->error_state ||
> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>              qemu_rdma_post_send_control(rdma, NULL, &head);
>          }
>  
> -        ret = rdma_disconnect(rdma->cm_id);
> -        if (!ret) {
> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
> -            if (!ret) {
> -                rdma_ack_cm_event(cm_event);
> -            }
> -        }
> +        rdma_disconnect(rdma->cm_id);

I'm worried whether this change could break stuff:
The docs say for rdma_disconnect that it flushes any posted work
requests to the completion queue;  so unless we wait for the event
do we know the stuff has been flushed?   In the normal non-cancel case
I'm worried that means we could lose something.
(But I don't know the rdma/infiniband specs well enough to know if it's
really a problem).

Dave

>          trace_qemu_rdma_cleanup_disconnect();
>          rdma->connected = false;
>      }
> diff --git a/migration/trace-events b/migration/trace-events
> index d6be74b..64573ff 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>  qemu_rdma_cleanup_disconnect(void) ""
> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>  qemu_rdma_close(void) ""
>  qemu_rdma_connect_pin_all_requested(void) ""
>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock
  2018-05-08 17:19 ` [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock Dr. David Alan Gilbert
@ 2018-05-09  1:39   ` 858585 jemmy
  0 siblings, 0 replies; 19+ messages in thread
From: 858585 jemmy @ 2018-05-09  1:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Gal Shachaf, Aviad Yehezkel, adido,
	Lidong Chen

On Wed, May 9, 2018 at 1:19 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Lidong Chen (jemmy858585@gmail.com) wrote:
>> rdma_delete_block function deletes RDMALocalBlock base on index field,
>> but not update the index field. So when next time invoke rdma_delete_block,
>> it will not work correctly.
>>
>> If start and cancel migration repeatedly, some RDMALocalBlock not invoke
>> ibv_dereg_mr to decrease kernel mm_struct vmpin. When vmpin is large than
>> max locked memory limitation, ibv_reg_mr will failed, and migration can not
>> start successfully again.
>>
>> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> ---
>>  migration/rdma.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index ed9cfb1..0dd4033 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -713,6 +713,9 @@ static int rdma_delete_block(RDMAContext *rdma, RDMALocalBlock *block)
>>              memcpy(local->block + block->index, old + (block->index + 1),
>>                  sizeof(RDMALocalBlock) *
>>                      (local->nb_blocks - (block->index + 1)));
>> +            for (x = block->index; x < local->nb_blocks - 1; x++) {
>> +                local->block[x].index--;
>> +            }
>
> Yes; is that equivalent to   local->blocks[x].index = x;   ?

yes, it's equivalent.
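
A tiny standalone sketch (toy code, not taken from QEMU) of why the two
forms agree, assuming the surviving entries held contiguous indices
before the deletion:

#include <assert.h>

struct blk { int index; };

/* After entry 'del' is removed from a packed array of n entries, every
 * entry shifted down by the memcpy still carries its old index, which
 * is exactly one larger than its new position, so decrementing it is
 * the same as assigning x directly. */
static void fix_indices(struct blk *b, int n, int del)
{
    for (int x = del; x < n - 1; x++) {
        b[x].index--;               /* old value was x + 1 */
        assert(b[x].index == x);    /* equivalent to b[x].index = x */
    }
}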

>
>>          }
>>      } else {
>>          assert(block == local->block);
>> @@ -3398,6 +3401,10 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque)
>>              qsort(rdma->local_ram_blocks.block,
>>                    rdma->local_ram_blocks.nb_blocks,
>>                    sizeof(RDMALocalBlock), dest_ram_sort_func);
>> +            for (i = 0; i < local->nb_blocks; i++) {
>> +                local->block[i].index = i;
>> +            }
>> +
>
> Which is basically the way that one does it;
>
> OK, it's a while since I looked at this but I think it fixes my 3 year
> old 03fcab38617 patch, so
>
>
>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>
>>              if (rdma->pin_all) {
>>                  ret = qemu_rdma_reg_whole_ram_blocks(rdma);
>>                  if (ret) {
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-08 18:40   ` Dr. David Alan Gilbert
@ 2018-05-09  3:43     ` 858585 jemmy
  2018-05-11 18:03       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 19+ messages in thread
From: 858585 jemmy @ 2018-05-09  3:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Aviad Yehezkel
  Cc: Juan Quintela, qemu-devel, Gal Shachaf, adido, Lidong Chen

On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Lidong Chen (jemmy858585@gmail.com) wrote:
>> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
>>
>> The backtrace is:
>>     (gdb) bt
>>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
>>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
>>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
>>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
>>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
>>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
>>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
>>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
>>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
>>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
>>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
>>         at util/async.c:261
>>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
>>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
>>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
>>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
>>
>> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
>> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
>> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
>> in previous patch.
>
> Does this happen without your other changes?

Yes, this issue also happens on v2.12.0, based on
commit 4743c23509a51bd4ee85cc272287a41917d1be35.

> Can you give me instructions to repeat it and also say which
> cards you wereusing?

This issue can be reproduced by starting and then cancelling a migration;
it shows up within fewer than 10 attempts.

The command line is:
virsh migrate --live --copy-storage-all  --undefinesource --persistent
--timeout 10800 \
 --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
qemu+ssh://9.16.46.142/system rdma://9.16.46.142

The NIC I use is:
0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]
0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]

This issue is related to ibv_dereg_mr: if ibv_dereg_mr is not invoked
for all ram blocks, the issue can be reproduced.
With the bugs fixed so that ibv_dereg_mr releases all ram blocks,
the issue never happens.

On the kernel side there is also a bug that causes ram blocks not to be
released when cancelling live migration:
https://patchwork.kernel.org/patch/10385781/

>
>> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
>> is also destroyed in qemu_rdma_cleanup.
>>
>> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> ---
>>  migration/rdma.c       | 12 ++----------
>>  migration/trace-events |  1 -
>>  2 files changed, 2 insertions(+), 11 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 0dd4033..92e4d30 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>>
>>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>>  {
>> -    struct rdma_cm_event *cm_event;
>> -    int ret, idx;
>> +    int idx;
>>
>>      if (rdma->cm_id && rdma->connected) {
>>          if ((rdma->error_state ||
>> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>>              qemu_rdma_post_send_control(rdma, NULL, &head);
>>          }
>>
>> -        ret = rdma_disconnect(rdma->cm_id);
>> -        if (!ret) {
>> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> -            if (!ret) {
>> -                rdma_ack_cm_event(cm_event);
>> -            }
>> -        }
>> +        rdma_disconnect(rdma->cm_id);
>
> I'm worried whether this change could break stuff:
> The docs say for rdma_disconnect that it flushes any posted work
> requests to the completion queue;  so unless we wait for the event
> do we know the stuff has been flushed?   In the normal non-cancel case
> I'm worried that means we could lose something.
> (But I don't know the rdma/infiniband specs well enough to know if it's
> really a problem).

The qemu_fclose function invokes qemu_fflush(f) before invoking f->ops->close,
so I think it is safe for the normal migration case.
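
For reference, a heavily abridged sketch of that ordering as I read it
in migration/qemu-file.c (error handling trimmed; shown only to
illustrate that the flush happens before the close that reaches
qemu_rdma_cleanup):

int qemu_fclose(QEMUFile *f)
{
    int ret;

    qemu_fflush(f);                  /* pending data is pushed out first */
    ret = qemu_file_get_error(f);

    if (f->ops->close) {             /* for RDMA this reaches qio_channel_rdma_close() */
        int ret2 = f->ops->close(f->opaque);
        if (ret >= 0) {
            ret = ret2;
        }
    }
    g_free(f);
    return ret;
}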

As for the root cause of not receiving the RDMA_CM_EVENT_DISCONNECTED event
after rdma_disconnect, I have looped in Aviad Yehezkel <aviadye@mellanox.com>;
Aviad will help us find the root cause.

>
> Dave
>
>>          trace_qemu_rdma_cleanup_disconnect();
>>          rdma->connected = false;
>>      }
>> diff --git a/migration/trace-events b/migration/trace-events
>> index d6be74b..64573ff 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>>  qemu_rdma_cleanup_disconnect(void) ""
>> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>>  qemu_rdma_close(void) ""
>>  qemu_rdma_connect_pin_all_requested(void) ""
>>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-09  3:43     ` 858585 jemmy
@ 2018-05-11 18:03       ` Dr. David Alan Gilbert
  2018-05-14  7:32         ` 858585 jemmy
  0 siblings, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-11 18:03 UTC (permalink / raw)
  To: 858585 jemmy
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

* 858585 jemmy (jemmy858585@gmail.com) wrote:
> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * Lidong Chen (jemmy858585@gmail.com) wrote:
> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
> >>
> >> The backtrace is:
> >>     (gdb) bt
> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
> >>         at util/async.c:261
> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
> >>
> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
> >> in previous patch.
> >
> > Does this happen without your other changes?
> 
> Yes, this issue also happen on v2.12.0. base on
> commit 4743c23509a51bd4ee85cc272287a41917d1be35
> 
> > Can you give me instructions to repeat it and also say which
> > cards you wereusing?
> 
> This issue can be reproduced by start and cancel migration.
> less than 10 times, this issue will be reproduced.
> 
> The command line is:
> virsh migrate --live --copy-storage-all  --undefinesource --persistent
> --timeout 10800 \
>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
> 
> The net card i use is :
> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
> [ConnectX-4 Lx]
> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
> [ConnectX-4 Lx]
> 
> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
> all ram block, this issue can be reproduced.
> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
> this issue never happens.

Maybe that is the right fix; I can imagine that the RDMA code doesn't
like closing down if there are still ramblocks registered that
potentially could have incoming DMA?

> And for the kernel part, there is a bug also cause not release ram
> block when canceling live migration.
> https://patchwork.kernel.org/patch/10385781/

OK, that's a pain; which threads are doing the dereg - is some stuff
in the migration thread and some stuff in the main thread on cleanup?

Dave

> >
> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
> >> is also destroyed in qemu_rdma_cleanup.
> >>
> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >> ---
> >>  migration/rdma.c       | 12 ++----------
> >>  migration/trace-events |  1 -
> >>  2 files changed, 2 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/migration/rdma.c b/migration/rdma.c
> >> index 0dd4033..92e4d30 100644
> >> --- a/migration/rdma.c
> >> +++ b/migration/rdma.c
> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
> >>
> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
> >>  {
> >> -    struct rdma_cm_event *cm_event;
> >> -    int ret, idx;
> >> +    int idx;
> >>
> >>      if (rdma->cm_id && rdma->connected) {
> >>          if ((rdma->error_state ||
> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
> >>          }
> >>
> >> -        ret = rdma_disconnect(rdma->cm_id);
> >> -        if (!ret) {
> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
> >> -            if (!ret) {
> >> -                rdma_ack_cm_event(cm_event);
> >> -            }
> >> -        }
> >> +        rdma_disconnect(rdma->cm_id);
> >
> > I'm worried whether this change could break stuff:
> > The docs say for rdma_disconnect that it flushes any posted work
> > requests to the completion queue;  so unless we wait for the event
> > do we know the stuff has been flushed?   In the normal non-cancel case
> > I'm worried that means we could lose something.
> > (But I don't know the rdma/infiniband specs well enough to know if it's
> > really a problem).
> 
> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
> so I think it's safe for normal migration case.
> 
> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
> after rdma_disconnect,
> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
> find the root cause.
> 
> >
> > Dave
> >
> >>          trace_qemu_rdma_cleanup_disconnect();
> >>          rdma->connected = false;
> >>      }
> >> diff --git a/migration/trace-events b/migration/trace-events
> >> index d6be74b..64573ff 100644
> >> --- a/migration/trace-events
> >> +++ b/migration/trace-events
> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
> >>  qemu_rdma_cleanup_disconnect(void) ""
> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
> >>  qemu_rdma_close(void) ""
> >>  qemu_rdma_connect_pin_all_requested(void) ""
> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
> >> --
> >> 1.8.3.1
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-11 18:03       ` Dr. David Alan Gilbert
@ 2018-05-14  7:32         ` 858585 jemmy
  2018-05-14 19:27           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 19+ messages in thread
From: 858585 jemmy @ 2018-05-14  7:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * Lidong Chen (jemmy858585@gmail.com) wrote:
>> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
>> >>
>> >> The backtrace is:
>> >>     (gdb) bt
>> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
>> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
>> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
>> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
>> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
>> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
>> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
>> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
>> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
>> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
>> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
>> >>         at util/async.c:261
>> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
>> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
>> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
>> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
>> >>
>> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
>> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
>> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
>> >> in previous patch.
>> >
>> > Does this happen without your other changes?
>>
>> Yes, this issue also happen on v2.12.0. base on
>> commit 4743c23509a51bd4ee85cc272287a41917d1be35
>>
>> > Can you give me instructions to repeat it and also say which
>> > cards you wereusing?
>>
>> This issue can be reproduced by start and cancel migration.
>> less than 10 times, this issue will be reproduced.
>>
>> The command line is:
>> virsh migrate --live --copy-storage-all  --undefinesource --persistent
>> --timeout 10800 \
>>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
>> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
>>
>> The net card i use is :
>> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
>> [ConnectX-4 Lx]
>> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
>> [ConnectX-4 Lx]
>>
>> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
>> all ram block, this issue can be reproduced.
>> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
>> this issue never happens.
>
> Maybe that is the right fix; I can imagine that the RDMA code doesn't
> like closing down if there are still ramblocks registered that
> potentially could have incoming DMA?
>
>> And for the kernel part, there is a bug also cause not release ram
>> block when canceling live migration.
>> https://patchwork.kernel.org/patch/10385781/
>
> OK, that's a pain; which threads are doing the dereg - is some stuff
> in the migration thread and some stuff in the main thread on cleanup?

Yes, the migration thread invokes ibv_reg_mr, and the main thread's
bottom half invokes ibv_dereg_mr.
By the time the main thread schedules s->cleanup_bh, the migration
thread may already have exited.

I also find that ibv_dereg_mr can take a long time for virtual
machines with a large memory size.

The test result is:
 10GB  326ms
 20GB  699ms
 30GB  1021ms
 40GB  1387ms
 50GB  1712ms
 60GB  2034ms
 70GB  2457ms
 80GB  2807ms
 90GB  3107ms
100GB 3474ms
110GB 3735ms
120GB 4064ms
130GB 4567ms
140GB 4886ms

When cancelling migration, the guest OS hangs for a while.
So I think we should invoke qemu_fclose(s->to_dst_file) in
migration_thread, without holding the iothread lock.
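
A rough sketch of the idea (hypothetical helper, not an actual patch;
the interaction with migrate_fd_cleanup and the error paths would still
need care):

/* Hypothetical: close the file from the migration thread so that the
 * slow ibv_dereg_mr() calls triggered by qemu_rdma_cleanup() run here,
 * without holding the iothread lock; the cleanup BH would then see
 * to_dst_file == NULL and skip the close. */
static void close_to_dst_file_from_migration_thread(MigrationState *s)
{
    QEMUFile *f = s->to_dst_file;

    if (!f) {
        return;
    }
    s->to_dst_file = NULL;
    qemu_fclose(f);    /* qio_channel_rdma_close() -> qemu_rdma_cleanup() */
}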

>
> Dave
>
>> >
>> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
>> >> is also destroyed in qemu_rdma_cleanup.
>> >>
>> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> >> ---
>> >>  migration/rdma.c       | 12 ++----------
>> >>  migration/trace-events |  1 -
>> >>  2 files changed, 2 insertions(+), 11 deletions(-)
>> >>
>> >> diff --git a/migration/rdma.c b/migration/rdma.c
>> >> index 0dd4033..92e4d30 100644
>> >> --- a/migration/rdma.c
>> >> +++ b/migration/rdma.c
>> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>> >>
>> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >>  {
>> >> -    struct rdma_cm_event *cm_event;
>> >> -    int ret, idx;
>> >> +    int idx;
>> >>
>> >>      if (rdma->cm_id && rdma->connected) {
>> >>          if ((rdma->error_state ||
>> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
>> >>          }
>> >>
>> >> -        ret = rdma_disconnect(rdma->cm_id);
>> >> -        if (!ret) {
>> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> >> -            if (!ret) {
>> >> -                rdma_ack_cm_event(cm_event);
>> >> -            }
>> >> -        }
>> >> +        rdma_disconnect(rdma->cm_id);
>> >
>> > I'm worried whether this change could break stuff:
>> > The docs say for rdma_disconnect that it flushes any posted work
>> > requests to the completion queue;  so unless we wait for the event
>> > do we know the stuff has been flushed?   In the normal non-cancel case
>> > I'm worried that means we could lose something.
>> > (But I don't know the rdma/infiniband specs well enough to know if it's
>> > really a problem).
>>
>> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
>> so I think it's safe for normal migration case.
>>
>> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
>> after rdma_disconnect,
>> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
>> find the root cause.
>>
>> >
>> > Dave
>> >
>> >>          trace_qemu_rdma_cleanup_disconnect();
>> >>          rdma->connected = false;
>> >>      }
>> >> diff --git a/migration/trace-events b/migration/trace-events
>> >> index d6be74b..64573ff 100644
>> >> --- a/migration/trace-events
>> >> +++ b/migration/trace-events
>> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>> >>  qemu_rdma_cleanup_disconnect(void) ""
>> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>> >>  qemu_rdma_close(void) ""
>> >>  qemu_rdma_connect_pin_all_requested(void) ""
>> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>> >> --
>> >> 1.8.3.1
>> >>
>> > --
>> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-14  7:32         ` 858585 jemmy
@ 2018-05-14 19:27           ` Dr. David Alan Gilbert
  2018-05-16  8:31             ` 858585 jemmy
  0 siblings, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-14 19:27 UTC (permalink / raw)
  To: 858585 jemmy
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

* 858585 jemmy (jemmy858585@gmail.com) wrote:
> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
> >> <dgilbert@redhat.com> wrote:
> >> > * Lidong Chen (jemmy858585@gmail.com) wrote:
> >> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
> >> >>
> >> >> The backtrace is:
> >> >>     (gdb) bt
> >> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
> >> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
> >> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
> >> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
> >> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
> >> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
> >> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
> >> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
> >> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
> >> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
> >> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
> >> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
> >> >>         at util/async.c:261
> >> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> >> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
> >> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
> >> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
> >> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
> >> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
> >> >>
> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
> >> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
> >> >> in previous patch.
> >> >
> >> > Does this happen without your other changes?
> >>
> >> Yes, this issue also happen on v2.12.0. base on
> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
> >>
> >> > Can you give me instructions to repeat it and also say which
> >> > cards you wereusing?
> >>
> >> This issue can be reproduced by start and cancel migration.
> >> less than 10 times, this issue will be reproduced.
> >>
> >> The command line is:
> >> virsh migrate --live --copy-storage-all  --undefinesource --persistent
> >> --timeout 10800 \
> >>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
> >> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
> >>
> >> The net card i use is :
> >> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
> >> [ConnectX-4 Lx]
> >> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
> >> [ConnectX-4 Lx]
> >>
> >> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
> >> all ram block, this issue can be reproduced.
> >> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
> >> this issue never happens.
> >
> > Maybe that is the right fix; I can imagine that the RDMA code doesn't
> > like closing down if there are still ramblocks registered that
> > potentially could have incoming DMA?
> >
> >> And for the kernel part, there is a bug also cause not release ram
> >> block when canceling live migration.
> >> https://patchwork.kernel.org/patch/10385781/
> >
> > OK, that's a pain; which threads are doing the dereg - is some stuff
> > in the migration thread and some stuff in the main thread on cleanup?
> 
> Yes, the migration thread invokes ibv_reg_mr, and the main thread bh
> will invoke ibv_dereg_mr.

OK, although my reading of that discussion is that *should* be allowed
(with your fix).

> and when the main thread schedule s->cleanup_bh, the migration thread
> may have been exited.
> 
> And I find ibv_dereg_mr may take a long time for the big memory size
> virtual machine.
> 
> The test result is:
>  10GB  326ms
>  20GB  699ms
>  30GB  1021ms
>  40GB  1387ms
>  50GB  1712ms
>  60GB  2034ms
>  70GB  2457ms
>  80GB  2807ms
>  90GB  3107ms
> 100GB 3474ms
> 110GB 3735ms
> 120GB 4064ms
> 130GB 4567ms
> 140GB 4886ms
> 
> when canceling migration, the guest os will hang for a while.
> so I think we should invoke qemu_fclose(s->to_dst_file)  in
> migration_thread, and without hold io thread lock.

Be very careful; it's tricky to get it right to handle all the failure
cases; that's why it's much easier to clean up right at the end.
It might make sense to try and do the deregister in the migration
thread but also to just make sure at the end.

I wonder why dereg_mr takes so long - I could understand if reg_mr
took a long time, but why for dereg, that sounds like the easy side.

Dave


> >
> > Dave
> >
> >> >
> >> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
> >> >> is also destroyed in qemu_rdma_cleanup.
> >> >>
> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >> >> ---
> >> >>  migration/rdma.c       | 12 ++----------
> >> >>  migration/trace-events |  1 -
> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
> >> >>
> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
> >> >> index 0dd4033..92e4d30 100644
> >> >> --- a/migration/rdma.c
> >> >> +++ b/migration/rdma.c
> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
> >> >>
> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >>  {
> >> >> -    struct rdma_cm_event *cm_event;
> >> >> -    int ret, idx;
> >> >> +    int idx;
> >> >>
> >> >>      if (rdma->cm_id && rdma->connected) {
> >> >>          if ((rdma->error_state ||
> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
> >> >>          }
> >> >>
> >> >> -        ret = rdma_disconnect(rdma->cm_id);
> >> >> -        if (!ret) {
> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
> >> >> -            if (!ret) {
> >> >> -                rdma_ack_cm_event(cm_event);
> >> >> -            }
> >> >> -        }
> >> >> +        rdma_disconnect(rdma->cm_id);
> >> >
> >> > I'm worried whether this change could break stuff:
> >> > The docs say for rdma_disconnect that it flushes any posted work
> >> > requests to the completion queue;  so unless we wait for the event
> >> > do we know the stuff has been flushed?   In the normal non-cancel case
> >> > I'm worried that means we could lose something.
> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
> >> > really a problem).
> >>
> >> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
> >> so I think it's safe for normal migration case.
> >>
> >> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
> >> after rdma_disconnect,
> >> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
> >> find the root cause.
> >>
> >> >
> >> > Dave
> >> >
> >> >>          trace_qemu_rdma_cleanup_disconnect();
> >> >>          rdma->connected = false;
> >> >>      }
> >> >> diff --git a/migration/trace-events b/migration/trace-events
> >> >> index d6be74b..64573ff 100644
> >> >> --- a/migration/trace-events
> >> >> +++ b/migration/trace-events
> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
> >> >>  qemu_rdma_cleanup_disconnect(void) ""
> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
> >> >>  qemu_rdma_close(void) ""
> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
> >> >> --
> >> >> 1.8.3.1
> >> >>
> >> > --
> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-14 19:27           ` Dr. David Alan Gilbert
@ 2018-05-16  8:31             ` 858585 jemmy
  2018-05-16  9:39               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 19+ messages in thread
From: 858585 jemmy @ 2018-05-16  8:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>> >> <dgilbert@redhat.com> wrote:
>> >> > * Lidong Chen (jemmy858585@gmail.com) wrote:
>> >> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
>> >> >>
>> >> >> The backtrace is:
>> >> >>     (gdb) bt
>> >> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
>> >> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
>> >> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
>> >> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>> >> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
>> >> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
>> >> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
>> >> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
>> >> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
>> >> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
>> >> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
>> >> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
>> >> >>         at util/async.c:261
>> >> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>> >> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
>> >> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
>> >> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
>> >> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
>> >> >>
>> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
>> >> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
>> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
>> >> >> in previous patch.
>> >> >
>> >> > Does this happen without your other changes?
>> >>
>> >> Yes, this issue also happen on v2.12.0. base on
>> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
>> >>
>> >> > Can you give me instructions to repeat it and also say which
>> >> > cards you wereusing?
>> >>
>> >> This issue can be reproduced by start and cancel migration.
>> >> less than 10 times, this issue will be reproduced.
>> >>
>> >> The command line is:
>> >> virsh migrate --live --copy-storage-all  --undefinesource --persistent
>> >> --timeout 10800 \
>> >>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
>> >> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
>> >>
>> >> The net card i use is :
>> >> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> [ConnectX-4 Lx]
>> >> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> [ConnectX-4 Lx]
>> >>
>> >> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
>> >> all ram block, this issue can be reproduced.
>> >> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
>> >> this issue never happens.
>> >
>> > Maybe that is the right fix; I can imagine that the RDMA code doesn't
>> > like closing down if there are still ramblocks registered that
>> > potentially could have incoming DMA?
>> >
>> >> And for the kernel part, there is a bug also cause not release ram
>> >> block when canceling live migration.
>> >> https://patchwork.kernel.org/patch/10385781/
>> >
>> > OK, that's a pain; which threads are doing the dereg - is some stuff
>> > in the migration thread and some stuff in the main thread on cleanup?
>>
>> Yes, the migration thread invokes ibv_reg_mr, and the main thread bh
>> will invoke ibv_dereg_mr.
>
> OK, although my reading of that discussion is that *should* be allowed
> (with your fix).
>
>> and when the main thread schedule s->cleanup_bh, the migration thread
>> may have been exited.
>>
>> And I find ibv_dereg_mr may take a long time for the big memory size
>> virtual machine.
>>
>> The test result is:
>>  10GB  326ms
>>  20GB  699ms
>>  30GB  1021ms
>>  40GB  1387ms
>>  50GB  1712ms
>>  60GB  2034ms
>>  70GB  2457ms
>>  80GB  2807ms
>>  90GB  3107ms
>> 100GB 3474ms
>> 110GB 3735ms
>> 120GB 4064ms
>> 130GB 4567ms
>> 140GB 4886ms
>>
>> when canceling migration, the guest os will hang for a while.
>> so I think we should invoke qemu_fclose(s->to_dst_file)  in
>> migration_thread, and without hold io thread lock.
>
> Be very careful; it's tricky to get it right to handle all the failure
> cases; that's why it's much easier to clean up right at the end.
> It might make sense to try and do the deregister in the migration
> thread but also to just make sure at the end.

Yes, I will make sure it is done at the end of the migration thread.

>
> I wonder why dereg_mr takes so long - I could understand if reg_mr
> took a long time, but why for dereg, that sounds like the easy side.

I used perf to collect profiling information while ibv_dereg_mr was being invoked.

-   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
   - put_compound_page
      - 98.45% put_page
           __ib_umem_release
           ib_umem_release
           dereg_mr
           mlx5_ib_dereg_mr
           ib_dereg_mr
           uverbs_free_mr
           remove_commit_idr_uobject
           _rdma_remove_commit_uobject
           rdma_remove_commit_uobject
           ib_uverbs_dereg_mr
           ib_uverbs_write
           vfs_write
           sys_write
           system_call_fastpath
           __GI___libc_write
           0
      + 1.55% __ib_umem_release
+   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
+   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
+   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
+   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
+   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
+   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
+   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
+   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
+   3.65%  client2  [kernel.kallsyms]  [k] release_pages
+   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
+   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
+   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
+   3.09%  client2  [kernel.kallsyms]  [k] put_page

the reason is that __ib_umem_release will loop many times for each page.

static void __ib_umem_release(struct ib_device *dev, struct ib_umem
*umem, int dirty)
{
    struct scatterlist *sg;
    struct page *page;
    int i;

    if (umem->nmap > 0)
         ib_dma_unmap_sg(dev, umem->sg_head.sgl,
                         umem->npages,
                         DMA_BIDIRECTIONAL);

    for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {   << loop a lot of times for each page here
         page = sg_page(sg);
         if (umem->writable && dirty)
             set_page_dirty_lock(page);
         put_page(page);
    }

    sg_free_table(&umem->sg_head);
    return;
}
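
As a rough back-of-the-envelope estimate (assuming 4KB pages and reusing the
140GB / 4886ms figure from the table above, so this is an approximation rather
than a new measurement):

    140GB / 4KB pages   ~= 36.7 million pages
    4886ms / 36.7M      ~= 0.13us per page

So each set_page_dirty_lock()/put_page() pair is cheap by itself; the total
time simply grows with the number of pinned pages, which is why the stall
scales with the guest memory size.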

>
> Dave
>
>
>> >
>> > Dave
>> >
>> >> >
>> >> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
>> >> >> is also destroyed in qemu_rdma_cleanup.
>> >> >>
>> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> >> >> ---
>> >> >>  migration/rdma.c       | 12 ++----------
>> >> >>  migration/trace-events |  1 -
>> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
>> >> >>
>> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
>> >> >> index 0dd4033..92e4d30 100644
>> >> >> --- a/migration/rdma.c
>> >> >> +++ b/migration/rdma.c
>> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>> >> >>
>> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >>  {
>> >> >> -    struct rdma_cm_event *cm_event;
>> >> >> -    int ret, idx;
>> >> >> +    int idx;
>> >> >>
>> >> >>      if (rdma->cm_id && rdma->connected) {
>> >> >>          if ((rdma->error_state ||
>> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
>> >> >>          }
>> >> >>
>> >> >> -        ret = rdma_disconnect(rdma->cm_id);
>> >> >> -        if (!ret) {
>> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> >> >> -            if (!ret) {
>> >> >> -                rdma_ack_cm_event(cm_event);
>> >> >> -            }
>> >> >> -        }
>> >> >> +        rdma_disconnect(rdma->cm_id);
>> >> >
>> >> > I'm worried whether this change could break stuff:
>> >> > The docs say for rdma_disconnect that it flushes any posted work
>> >> > requests to the completion queue;  so unless we wait for the event
>> >> > do we know the stuff has been flushed?   In the normal non-cancel case
>> >> > I'm worried that means we could lose something.
>> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
>> >> > really a problem).
>> >>
>> >> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
>> >> so I think it's safe for normal migration case.
>> >>
>> >> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
>> >> after rdma_disconnect,
>> >> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
>> >> find the root cause.
>> >>
>> >> >
>> >> > Dave
>> >> >
>> >> >>          trace_qemu_rdma_cleanup_disconnect();
>> >> >>          rdma->connected = false;
>> >> >>      }
>> >> >> diff --git a/migration/trace-events b/migration/trace-events
>> >> >> index d6be74b..64573ff 100644
>> >> >> --- a/migration/trace-events
>> >> >> +++ b/migration/trace-events
>> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>> >> >>  qemu_rdma_cleanup_disconnect(void) ""
>> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>> >> >>  qemu_rdma_close(void) ""
>> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
>> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>> >> >> --
>> >> >> 1.8.3.1
>> >> >>
>> >> > --
>> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> > --
>> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-16  8:31             ` 858585 jemmy
@ 2018-05-16  9:39               ` Dr. David Alan Gilbert
  2018-05-16  9:45                 ` 858585 jemmy
  0 siblings, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-16  9:39 UTC (permalink / raw)
  To: 858585 jemmy
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

* 858585 jemmy (jemmy858585@gmail.com) wrote:
> On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
> >> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
> >> <dgilbert@redhat.com> wrote:
> >> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
> >> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
> >> >> <dgilbert@redhat.com> wrote:
> >> >> > * Lidong Chen (jemmy858585@gmail.com) wrote:
> >> >> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
> >> >> >>
> >> >> >> The backtrace is:
> >> >> >>     (gdb) bt
> >> >> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
> >> >> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
> >> >> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
> >> >> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
> >> >> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
> >> >> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
> >> >> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
> >> >> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
> >> >> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
> >> >> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
> >> >> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
> >> >> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
> >> >> >>         at util/async.c:261
> >> >> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> >> >> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
> >> >> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
> >> >> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
> >> >> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
> >> >> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
> >> >> >>
> >> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
> >> >> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
> >> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
> >> >> >> in previous patch.
> >> >> >
> >> >> > Does this happen without your other changes?
> >> >>
> >> >> Yes, this issue also happen on v2.12.0. base on
> >> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
> >> >>
> >> >> > Can you give me instructions to repeat it and also say which
> >> >> > cards you wereusing?
> >> >>
> >> >> This issue can be reproduced by start and cancel migration.
> >> >> less than 10 times, this issue will be reproduced.
> >> >>
> >> >> The command line is:
> >> >> virsh migrate --live --copy-storage-all  --undefinesource --persistent
> >> >> --timeout 10800 \
> >> >>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
> >> >> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
> >> >>
> >> >> The net card i use is :
> >> >> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
> >> >> [ConnectX-4 Lx]
> >> >> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
> >> >> [ConnectX-4 Lx]
> >> >>
> >> >> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
> >> >> all ram block, this issue can be reproduced.
> >> >> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
> >> >> this issue never happens.
> >> >
> >> > Maybe that is the right fix; I can imagine that the RDMA code doesn't
> >> > like closing down if there are still ramblocks registered that
> >> > potentially could have incoming DMA?
> >> >
> >> >> And for the kernel part, there is a bug also cause not release ram
> >> >> block when canceling live migration.
> >> >> https://patchwork.kernel.org/patch/10385781/
> >> >
> >> > OK, that's a pain; which threads are doing the dereg - is some stuff
> >> > in the migration thread and some stuff in the main thread on cleanup?
> >>
> >> Yes, the migration thread invokes ibv_reg_mr, and the main thread bh
> >> will invoke ibv_dereg_mr.
> >
> > OK, although my reading of that discussion is that *should* be allowed
> > (with your fix).
> >
> >> and when the main thread schedule s->cleanup_bh, the migration thread
> >> may have been exited.
> >>
> >> And I find ibv_dereg_mr may take a long time for the big memory size
> >> virtual machine.
> >>
> >> The test result is:
> >>  10GB  326ms
> >>  20GB  699ms
> >>  30GB  1021ms
> >>  40GB  1387ms
> >>  50GB  1712ms
> >>  60GB  2034ms
> >>  70GB  2457ms
> >>  80GB  2807ms
> >>  90GB  3107ms
> >> 100GB 3474ms
> >> 110GB 3735ms
> >> 120GB 4064ms
> >> 130GB 4567ms
> >> 140GB 4886ms
> >>
> >> when canceling migration, the guest os will hang for a while.
> >> so I think we should invoke qemu_fclose(s->to_dst_file)  in
> >> migration_thread, and without hold io thread lock.
> >
> > Be very careful; it's tricky to get it right to handle all the failure
> > cases; that's why it's much easier to clean up right at the end.
> > It might make sense to try and do the deregister in the migration
> > thread but also to just make sure at the end.
> 
> Yes, I will make sure at the end of migration thread.
> 
> >
> > I wonder why dereg_mr takes so long - I could understand if reg_mr
> > took a long time, but why for dereg, that sounds like the easy side.
> 
> I use perf collect the information when ibv_dereg_mr is invoked.
> 
> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>                                                           `
>    - put_compound_page
>       - 98.45% put_page
>            __ib_umem_release
>            ib_umem_release
>            dereg_mr
>            mlx5_ib_dereg_mr
>            ib_dereg_mr
>            uverbs_free_mr
>            remove_commit_idr_uobject
>            _rdma_remove_commit_uobject
>            rdma_remove_commit_uobject
>            ib_uverbs_dereg_mr
>            ib_uverbs_write
>            vfs_write
>            sys_write
>            system_call_fastpath
>            __GI___libc_write
>            0
>       + 1.55% __ib_umem_release
> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
> 
> the reason is __ib_umem_release will loop many times for each page.
> 
> static void __ib_umem_release(struct ib_device *dev, struct ib_umem
> *umem, int dirty)
> {
>     struct scatterlist *sg;
>     struct page *page;
>     int i;
> 
>     if (umem->nmap > 0)
>          ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>                                         umem->npages,
>                                         DMA_BIDIRECTIONAL);
> 
>          for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {      <<
> loop a lot of times for each page.here

Why 'lot of times for each page'?  I don't know this code at all, but
I'd expected once per page?

With your other kernel fix, does the problem of the missing
RDMA_CM_EVENT_DISCONNECTED events go away?

Dave

>               page = sg_page(sg);
>               if (umem->writable && dirty)
>                   set_page_dirty_lock(page);
>               put_page(page);
>          }
> 
>          sg_free_table(&umem->sg_head);
>          return;
> }
> 
> >
> > Dave
> >
> >
> >> >
> >> > Dave
> >> >
> >> >> >
> >> >> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
> >> >> >> is also destroyed in qemu_rdma_cleanup.
> >> >> >>
> >> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >> >> >> ---
> >> >> >>  migration/rdma.c       | 12 ++----------
> >> >> >>  migration/trace-events |  1 -
> >> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
> >> >> >>
> >> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
> >> >> >> index 0dd4033..92e4d30 100644
> >> >> >> --- a/migration/rdma.c
> >> >> >> +++ b/migration/rdma.c
> >> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
> >> >> >>
> >> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >> >>  {
> >> >> >> -    struct rdma_cm_event *cm_event;
> >> >> >> -    int ret, idx;
> >> >> >> +    int idx;
> >> >> >>
> >> >> >>      if (rdma->cm_id && rdma->connected) {
> >> >> >>          if ((rdma->error_state ||
> >> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
> >> >> >>          }
> >> >> >>
> >> >> >> -        ret = rdma_disconnect(rdma->cm_id);
> >> >> >> -        if (!ret) {
> >> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
> >> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
> >> >> >> -            if (!ret) {
> >> >> >> -                rdma_ack_cm_event(cm_event);
> >> >> >> -            }
> >> >> >> -        }
> >> >> >> +        rdma_disconnect(rdma->cm_id);
> >> >> >
> >> >> > I'm worried whether this change could break stuff:
> >> >> > The docs say for rdma_disconnect that it flushes any posted work
> >> >> > requests to the completion queue;  so unless we wait for the event
> >> >> > do we know the stuff has been flushed?   In the normal non-cancel case
> >> >> > I'm worried that means we could lose something.
> >> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
> >> >> > really a problem).
> >> >>
> >> >> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
> >> >> so I think it's safe for normal migration case.
> >> >>
> >> >> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
> >> >> after rdma_disconnect,
> >> >> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
> >> >> find the root cause.
> >> >>
> >> >> >
> >> >> > Dave
> >> >> >
> >> >> >>          trace_qemu_rdma_cleanup_disconnect();
> >> >> >>          rdma->connected = false;
> >> >> >>      }
> >> >> >> diff --git a/migration/trace-events b/migration/trace-events
> >> >> >> index d6be74b..64573ff 100644
> >> >> >> --- a/migration/trace-events
> >> >> >> +++ b/migration/trace-events
> >> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
> >> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
> >> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
> >> >> >>  qemu_rdma_cleanup_disconnect(void) ""
> >> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
> >> >> >>  qemu_rdma_close(void) ""
> >> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
> >> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
> >> >> >> --
> >> >> >> 1.8.3.1
> >> >> >>
> >> >> > --
> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> > --
> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-16  9:39               ` Dr. David Alan Gilbert
@ 2018-05-16  9:45                 ` 858585 jemmy
  2018-05-16  9:53                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 19+ messages in thread
From: 858585 jemmy @ 2018-05-16  9:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

On Wed, May 16, 2018 at 5:39 PM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> >> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
>> >> <dgilbert@redhat.com> wrote:
>> >> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> >> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>> >> >> <dgilbert@redhat.com> wrote:
>> >> >> > * Lidong Chen (jemmy858585@gmail.com) wrote:
>> >> >> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
>> >> >> >>
>> >> >> >> The backtrace is:
>> >> >> >>     (gdb) bt
>> >> >> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
>> >> >> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
>> >> >> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
>> >> >> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>> >> >> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
>> >> >> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
>> >> >> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
>> >> >> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
>> >> >> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
>> >> >> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
>> >> >> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
>> >> >> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
>> >> >> >>         at util/async.c:261
>> >> >> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >> >> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>> >> >> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
>> >> >> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
>> >> >> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
>> >> >> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
>> >> >> >>
>> >> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
>> >> >> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
>> >> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
>> >> >> >> in previous patch.
>> >> >> >
>> >> >> > Does this happen without your other changes?
>> >> >>
>> >> >> Yes, this issue also happen on v2.12.0. base on
>> >> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
>> >> >>
>> >> >> > Can you give me instructions to repeat it and also say which
>> >> >> > cards you wereusing?
>> >> >>
>> >> >> This issue can be reproduced by start and cancel migration.
>> >> >> less than 10 times, this issue will be reproduced.
>> >> >>
>> >> >> The command line is:
>> >> >> virsh migrate --live --copy-storage-all  --undefinesource --persistent
>> >> >> --timeout 10800 \
>> >> >>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
>> >> >> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
>> >> >>
>> >> >> The net card i use is :
>> >> >> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> >> [ConnectX-4 Lx]
>> >> >> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> >> [ConnectX-4 Lx]
>> >> >>
>> >> >> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
>> >> >> all ram block, this issue can be reproduced.
>> >> >> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
>> >> >> this issue never happens.
>> >> >
>> >> > Maybe that is the right fix; I can imagine that the RDMA code doesn't
>> >> > like closing down if there are still ramblocks registered that
>> >> > potentially could have incoming DMA?
>> >> >
>> >> >> And for the kernel part, there is a bug also cause not release ram
>> >> >> block when canceling live migration.
>> >> >> https://patchwork.kernel.org/patch/10385781/
>> >> >
>> >> > OK, that's a pain; which threads are doing the dereg - is some stuff
>> >> > in the migration thread and some stuff in the main thread on cleanup?
>> >>
>> >> Yes, the migration thread invokes ibv_reg_mr, and the main thread bh
>> >> will invoke ibv_dereg_mr.
>> >
>> > OK, although my reading of that discussion is that *should* be allowed
>> > (with your fix).
>> >
>> >> and when the main thread schedule s->cleanup_bh, the migration thread
>> >> may have been exited.
>> >>
>> >> And I find ibv_dereg_mr may take a long time for the big memory size
>> >> virtual machine.
>> >>
>> >> The test result is:
>> >>  10GB  326ms
>> >>  20GB  699ms
>> >>  30GB  1021ms
>> >>  40GB  1387ms
>> >>  50GB  1712ms
>> >>  60GB  2034ms
>> >>  70GB  2457ms
>> >>  80GB  2807ms
>> >>  90GB  3107ms
>> >> 100GB 3474ms
>> >> 110GB 3735ms
>> >> 120GB 4064ms
>> >> 130GB 4567ms
>> >> 140GB 4886ms
>> >>
>> >> when canceling migration, the guest os will hang for a while.
>> >> so I think we should invoke qemu_fclose(s->to_dst_file)  in
>> >> migration_thread, and without hold io thread lock.
>> >
>> > Be very careful; it's tricky to get it right to handle all the failure
>> > cases; that's why it's much easier to clean up right at the end.
>> > It might make sense to try and do the deregister in the migration
>> > thread but also to just make sure at the end.
>>
>> Yes, I will make sure at the end of migration thread.
>>
>> >
>> > I wonder why dereg_mr takes so long - I could understand if reg_mr
>> > took a long time, but why for dereg, that sounds like the easy side.
>>
>> I use perf collect the information when ibv_dereg_mr is invoked.
>>
>> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>>                                                           `
>>    - put_compound_page
>>       - 98.45% put_page
>>            __ib_umem_release
>>            ib_umem_release
>>            dereg_mr
>>            mlx5_ib_dereg_mr
>>            ib_dereg_mr
>>            uverbs_free_mr
>>            remove_commit_idr_uobject
>>            _rdma_remove_commit_uobject
>>            rdma_remove_commit_uobject
>>            ib_uverbs_dereg_mr
>>            ib_uverbs_write
>>            vfs_write
>>            sys_write
>>            system_call_fastpath
>>            __GI___libc_write
>>            0
>>       + 1.55% __ib_umem_release
>> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
>> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
>> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>>
>> the reason is __ib_umem_release will loop many times for each page.
>>
>> static void __ib_umem_release(struct ib_device *dev, struct ib_umem
>> *umem, int dirty)
>> {
>>     struct scatterlist *sg;
>>     struct page *page;
>>     int i;
>>
>>     if (umem->nmap > 0)
>>          ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>>                                         umem->npages,
>>                                         DMA_BIDIRECTIONAL);
>>
>>          for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {      <<
>> loop a lot of times for each page.here
>
> Why 'lot of times for each page'?  I don't know this code at all, but
> I'd expected once per page?

Sorry, once per page, but a lot of pages for a big-memory virtual machine.

>
> With your other kernel fix, does the problem of the missing
> RDMA_CM_EVENT_DISCONNECTED events go away?

Yes, after the kernel and qemu fixes, this issue never happens again.
Do you think we should remove the rdma_get_cm_event call after rdma_disconnect?

>
> Dave
>
>>               page = sg_page(sg);
>>               if (umem->writable && dirty)
>>                   set_page_dirty_lock(page);
>>               put_page(page);
>>          }
>>
>>          sg_free_table(&umem->sg_head);
>>          return;
>> }
>>
>> >
>> > Dave
>> >
>> >
>> >> >
>> >> > Dave
>> >> >
>> >> >> >
>> >> >> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
>> >> >> >> is also destroyed in qemu_rdma_cleanup.
>> >> >> >>
>> >> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> >> >> >> ---
>> >> >> >>  migration/rdma.c       | 12 ++----------
>> >> >> >>  migration/trace-events |  1 -
>> >> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
>> >> >> >>
>> >> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
>> >> >> >> index 0dd4033..92e4d30 100644
>> >> >> >> --- a/migration/rdma.c
>> >> >> >> +++ b/migration/rdma.c
>> >> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>> >> >> >>
>> >> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >> >>  {
>> >> >> >> -    struct rdma_cm_event *cm_event;
>> >> >> >> -    int ret, idx;
>> >> >> >> +    int idx;
>> >> >> >>
>> >> >> >>      if (rdma->cm_id && rdma->connected) {
>> >> >> >>          if ((rdma->error_state ||
>> >> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
>> >> >> >>          }
>> >> >> >>
>> >> >> >> -        ret = rdma_disconnect(rdma->cm_id);
>> >> >> >> -        if (!ret) {
>> >> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>> >> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> >> >> >> -            if (!ret) {
>> >> >> >> -                rdma_ack_cm_event(cm_event);
>> >> >> >> -            }
>> >> >> >> -        }
>> >> >> >> +        rdma_disconnect(rdma->cm_id);
>> >> >> >
>> >> >> > I'm worried whether this change could break stuff:
>> >> >> > The docs say for rdma_disconnect that it flushes any posted work
>> >> >> > requests to the completion queue;  so unless we wait for the event
>> >> >> > do we know the stuff has been flushed?   In the normal non-cancel case
>> >> >> > I'm worried that means we could lose something.
>> >> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
>> >> >> > really a problem).
>> >> >>
>> >> >> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
>> >> >> so I think it's safe for normal migration case.
>> >> >>
>> >> >> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
>> >> >> after rdma_disconnect,
>> >> >> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
>> >> >> find the root cause.
>> >> >>
>> >> >> >
>> >> >> > Dave
>> >> >> >
>> >> >> >>          trace_qemu_rdma_cleanup_disconnect();
>> >> >> >>          rdma->connected = false;
>> >> >> >>      }
>> >> >> >> diff --git a/migration/trace-events b/migration/trace-events
>> >> >> >> index d6be74b..64573ff 100644
>> >> >> >> --- a/migration/trace-events
>> >> >> >> +++ b/migration/trace-events
>> >> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>> >> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>> >> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>> >> >> >>  qemu_rdma_cleanup_disconnect(void) ""
>> >> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>> >> >> >>  qemu_rdma_close(void) ""
>> >> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
>> >> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>> >> >> >> --
>> >> >> >> 1.8.3.1
>> >> >> >>
>> >> >> > --
>> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> >> > --
>> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> > --
>> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-16  9:45                 ` 858585 jemmy
@ 2018-05-16  9:53                   ` Dr. David Alan Gilbert
  2018-05-16 12:48                     ` 858585 jemmy
  0 siblings, 1 reply; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-16  9:53 UTC (permalink / raw)
  To: 858585 jemmy
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

* 858585 jemmy (jemmy858585@gmail.com) wrote:
> On Wed, May 16, 2018 at 5:39 PM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
> >> On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
> >> <dgilbert@redhat.com> wrote:
> >> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
> >> >> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
> >> >> <dgilbert@redhat.com> wrote:
> >> >> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
> >> >> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
> >> >> >> <dgilbert@redhat.com> wrote:
> >> >> >> > * Lidong Chen (jemmy858585@gmail.com) wrote:
> >> >> >> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
> >> >> >> >>
> >> >> >> >> The backtrace is:
> >> >> >> >>     (gdb) bt
> >> >> >> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
> >> >> >> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
> >> >> >> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
> >> >> >> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
> >> >> >> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
> >> >> >> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
> >> >> >> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
> >> >> >> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
> >> >> >> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
> >> >> >> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
> >> >> >> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
> >> >> >> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
> >> >> >> >>         at util/async.c:261
> >> >> >> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> >> >> >> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
> >> >> >> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
> >> >> >> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
> >> >> >> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
> >> >> >> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
> >> >> >> >>
> >> >> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
> >> >> >> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
> >> >> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
> >> >> >> >> in previous patch.
> >> >> >> >
> >> >> >> > Does this happen without your other changes?
> >> >> >>
> >> >> >> Yes, this issue also happen on v2.12.0. base on
> >> >> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
> >> >> >>
> >> >> >> > Can you give me instructions to repeat it and also say which
> >> >> >> > cards you wereusing?
> >> >> >>
> >> >> >> This issue can be reproduced by start and cancel migration.
> >> >> >> less than 10 times, this issue will be reproduced.
> >> >> >>
> >> >> >> The command line is:
> >> >> >> virsh migrate --live --copy-storage-all  --undefinesource --persistent
> >> >> >> --timeout 10800 \
> >> >> >>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
> >> >> >> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
> >> >> >>
> >> >> >> The net card i use is :
> >> >> >> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
> >> >> >> [ConnectX-4 Lx]
> >> >> >> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
> >> >> >> [ConnectX-4 Lx]
> >> >> >>
> >> >> >> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
> >> >> >> all ram block, this issue can be reproduced.
> >> >> >> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
> >> >> >> this issue never happens.
> >> >> >
> >> >> > Maybe that is the right fix; I can imagine that the RDMA code doesn't
> >> >> > like closing down if there are still ramblocks registered that
> >> >> > potentially could have incoming DMA?
> >> >> >
> >> >> >> And for the kernel part, there is a bug also cause not release ram
> >> >> >> block when canceling live migration.
> >> >> >> https://patchwork.kernel.org/patch/10385781/
> >> >> >
> >> >> > OK, that's a pain; which threads are doing the dereg - is some stuff
> >> >> > in the migration thread and some stuff in the main thread on cleanup?
> >> >>
> >> >> Yes, the migration thread invokes ibv_reg_mr, and the main thread bh
> >> >> will invoke ibv_dereg_mr.
> >> >
> >> > OK, although my reading of that discussion is that *should* be allowed
> >> > (with your fix).
> >> >
> >> >> and when the main thread schedule s->cleanup_bh, the migration thread
> >> >> may have been exited.
> >> >>
> >> >> And I find ibv_dereg_mr may take a long time for the big memory size
> >> >> virtual machine.
> >> >>
> >> >> The test result is:
> >> >>  10GB  326ms
> >> >>  20GB  699ms
> >> >>  30GB  1021ms
> >> >>  40GB  1387ms
> >> >>  50GB  1712ms
> >> >>  60GB  2034ms
> >> >>  70GB  2457ms
> >> >>  80GB  2807ms
> >> >>  90GB  3107ms
> >> >> 100GB 3474ms
> >> >> 110GB 3735ms
> >> >> 120GB 4064ms
> >> >> 130GB 4567ms
> >> >> 140GB 4886ms
> >> >>
> >> >> when canceling migration, the guest os will hang for a while.
> >> >> so I think we should invoke qemu_fclose(s->to_dst_file)  in
> >> >> migration_thread, and without hold io thread lock.
> >> >
> >> > Be very careful; it's tricky to get it right to handle all the failure
> >> > cases; that's why it's much easier to clean up right at the end.
> >> > It might make sense to try and do the deregister in the migration
> >> > thread but also to just make sure at the end.
> >>
> >> Yes, I will make sure at the end of migration thread.
> >>
> >> >
> >> > I wonder why dereg_mr takes so long - I could understand if reg_mr
> >> > took a long time, but why for dereg, that sounds like the easy side.
> >>
> >> I use perf collect the information when ibv_dereg_mr is invoked.
> >>
> >> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
> >>                                                           `
> >>    - put_compound_page
> >>       - 98.45% put_page
> >>            __ib_umem_release
> >>            ib_umem_release
> >>            dereg_mr
> >>            mlx5_ib_dereg_mr
> >>            ib_dereg_mr
> >>            uverbs_free_mr
> >>            remove_commit_idr_uobject
> >>            _rdma_remove_commit_uobject
> >>            rdma_remove_commit_uobject
> >>            ib_uverbs_dereg_mr
> >>            ib_uverbs_write
> >>            vfs_write
> >>            sys_write
> >>            system_call_fastpath
> >>            __GI___libc_write
> >>            0
> >>       + 1.55% __ib_umem_release
> >> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
> >> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
> >> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
> >> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
> >> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
> >> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
> >> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
> >> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
> >> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
> >> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
> >> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
> >> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
> >> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
> >>
> >> the reason is __ib_umem_release will loop many times for each page.
> >>
> >> static void __ib_umem_release(struct ib_device *dev, struct ib_umem
> >> *umem, int dirty)
> >> {
> >>     struct scatterlist *sg;
> >>     struct page *page;
> >>     int i;
> >>
> >>     if (umem->nmap > 0)
> >>          ib_dma_unmap_sg(dev, umem->sg_head.sgl,
> >>                                         umem->npages,
> >>                                         DMA_BIDIRECTIONAL);
> >>
> >>          for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {      <<
> >> loop a lot of times for each page.here
> >
> > Why 'lot of times for each page'?  I don't know this code at all, but
> > I'd expected once per page?
> 
> sorry, once per page, but a lot of page for a big size virtual machine.

Ah OK; so yes, it seems best if you can find a way to do the release in
the migration thread then; still, maybe this is something some
of the kernel people could look at speeding up?

> >
> > With your other kernel fix, does the problem of the missing
> > RDMA_CM_EVENT_DISCONNECTED events go away?
> 
> Yes, after kernel and qemu fixed, this issue never happens again.

I'm confused; which qemu fix? My question was whether the kernel fix by
itself fixed the problem of the missing event.

> Do you think we should remove rdma_get_cm_event after rdma_disconnect?

I don't think so; if 'rdma_disconnect' is supposed to generate the event
I think we're supposed to wait for it to know that the disconnect is 
really complete.

Dave

> 
> >
> > Dave
> >
> >>               page = sg_page(sg);
> >>               if (umem->writable && dirty)
> >>                   set_page_dirty_lock(page);
> >>               put_page(page);
> >>          }
> >>
> >>          sg_free_table(&umem->sg_head);
> >>          return;
> >> }
> >>
> >> >
> >> > Dave
> >> >
> >> >
> >> >> >
> >> >> > Dave
> >> >> >
> >> >> >> >
> >> >> >> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
> >> >> >> >> is also destroyed in qemu_rdma_cleanup.
> >> >> >> >>
> >> >> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >> >> >> >> ---
> >> >> >> >>  migration/rdma.c       | 12 ++----------
> >> >> >> >>  migration/trace-events |  1 -
> >> >> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
> >> >> >> >>
> >> >> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
> >> >> >> >> index 0dd4033..92e4d30 100644
> >> >> >> >> --- a/migration/rdma.c
> >> >> >> >> +++ b/migration/rdma.c
> >> >> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
> >> >> >> >>
> >> >> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >> >> >>  {
> >> >> >> >> -    struct rdma_cm_event *cm_event;
> >> >> >> >> -    int ret, idx;
> >> >> >> >> +    int idx;
> >> >> >> >>
> >> >> >> >>      if (rdma->cm_id && rdma->connected) {
> >> >> >> >>          if ((rdma->error_state ||
> >> >> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
> >> >> >> >>          }
> >> >> >> >>
> >> >> >> >> -        ret = rdma_disconnect(rdma->cm_id);
> >> >> >> >> -        if (!ret) {
> >> >> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
> >> >> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
> >> >> >> >> -            if (!ret) {
> >> >> >> >> -                rdma_ack_cm_event(cm_event);
> >> >> >> >> -            }
> >> >> >> >> -        }
> >> >> >> >> +        rdma_disconnect(rdma->cm_id);
> >> >> >> >
> >> >> >> > I'm worried whether this change could break stuff:
> >> >> >> > The docs say for rdma_disconnect that it flushes any posted work
> >> >> >> > requests to the completion queue;  so unless we wait for the event
> >> >> >> > do we know the stuff has been flushed?   In the normal non-cancel case
> >> >> >> > I'm worried that means we could lose something.
> >> >> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
> >> >> >> > really a problem).
> >> >> >>
> >> >> >> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
> >> >> >> so I think it's safe for normal migration case.
> >> >> >>
> >> >> >> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
> >> >> >> after rdma_disconnect,
> >> >> >> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
> >> >> >> find the root cause.
> >> >> >>
> >> >> >> >
> >> >> >> > Dave
> >> >> >> >
> >> >> >> >>          trace_qemu_rdma_cleanup_disconnect();
> >> >> >> >>          rdma->connected = false;
> >> >> >> >>      }
> >> >> >> >> diff --git a/migration/trace-events b/migration/trace-events
> >> >> >> >> index d6be74b..64573ff 100644
> >> >> >> >> --- a/migration/trace-events
> >> >> >> >> +++ b/migration/trace-events
> >> >> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
> >> >> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
> >> >> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
> >> >> >> >>  qemu_rdma_cleanup_disconnect(void) ""
> >> >> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
> >> >> >> >>  qemu_rdma_close(void) ""
> >> >> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
> >> >> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
> >> >> >> >> --
> >> >> >> >> 1.8.3.1
> >> >> >> >>
> >> >> >> > --
> >> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> >> > --
> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> > --
> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-16  9:53                   ` Dr. David Alan Gilbert
@ 2018-05-16 12:48                     ` 858585 jemmy
  2018-05-16 13:13                       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 19+ messages in thread
From: 858585 jemmy @ 2018-05-16 12:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

On Wed, May 16, 2018 at 5:53 PM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> On Wed, May 16, 2018 at 5:39 PM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> >> On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
>> >> <dgilbert@redhat.com> wrote:
>> >> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> >> >> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
>> >> >> <dgilbert@redhat.com> wrote:
>> >> >> > * 858585 jemmy (jemmy858585@gmail.com) wrote:
>> >> >> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>> >> >> >> <dgilbert@redhat.com> wrote:
>> >> >> >> > * Lidong Chen (jemmy858585@gmail.com) wrote:
>> >> >> >> >> When cancel migration during RDMA precopy, the source qemu main thread hangs sometime.
>> >> >> >> >>
>> >> >> >> >> The backtrace is:
>> >> >> >> >>     (gdb) bt
>> >> >> >> >>     #0  0x00007f249eabd43d in write () from /lib64/libpthread.so.0
>> >> >> >> >>     #1  0x00007f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
>> >> >> >> >>     #2  0x00000000007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at migration/rdma.c:2296
>> >> >> >> >>     #3  0x00000000007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>> >> >> >> >>     #4  0x00000000008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at io/channel.c:273
>> >> >> >> >>     #5  0x00000000007a8765 in channel_close (opaque=0x3bfcc30) at migration/qemu-file-channel.c:98
>> >> >> >> >>     #6  0x00000000007a71f9 in qemu_fclose (f=0x527c000) at migration/qemu-file.c:334
>> >> >> >> >>     #7  0x0000000000795b96 in migrate_fd_cleanup (opaque=0x3b46280) at migration/migration.c:1162
>> >> >> >> >>     #8  0x000000000093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
>> >> >> >> >>     #9  0x000000000093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
>> >> >> >> >>     #10 0x000000000093f2ad in aio_dispatch (ctx=0x3b121c0) at util/aio-posix.c:436
>> >> >> >> >>     #11 0x000000000093ab41 in aio_ctx_dispatch (source=0x3b121c0, callback=0x0, user_data=0x0)
>> >> >> >> >>         at util/async.c:261
>> >> >> >> >>     #12 0x00007f249f73c7aa in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >> >> >> >>     #13 0x000000000093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>> >> >> >> >>     #14 0x000000000093dd4e in os_host_main_loop_wait (timeout=28000000) at util/main-loop.c:263
>> >> >> >> >>     #15 0x000000000093de05 in main_loop_wait (nonblocking=0) at util/main-loop.c:522
>> >> >> >> >>     #16 0x00000000005bc6a5 in main_loop () at vl.c:1944
>> >> >> >> >>     #17 0x00000000005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
>> >> >> >> >>
>> >> >> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect sometime.
>> >> >> >> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED event, but
>> >> >> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks which fixed
>> >> >> >> >> in previous patch.
>> >> >> >> >
>> >> >> >> > Does this happen without your other changes?
>> >> >> >>
>> >> >> >> Yes, this issue also happen on v2.12.0. base on
>> >> >> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
>> >> >> >>
>> >> >> >> > Can you give me instructions to repeat it and also say which
>> >> >> >> > cards you wereusing?
>> >> >> >>
>> >> >> >> This issue can be reproduced by start and cancel migration.
>> >> >> >> less than 10 times, this issue will be reproduced.
>> >> >> >>
>> >> >> >> The command line is:
>> >> >> >> virsh migrate --live --copy-storage-all  --undefinesource --persistent
>> >> >> >> --timeout 10800 \
>> >> >> >>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
>> >> >> >> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
>> >> >> >>
>> >> >> >> The net card i use is :
>> >> >> >> 0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> >> >> [ConnectX-4 Lx]
>> >> >> >> 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> >> >> [ConnectX-4 Lx]
>> >> >> >>
>> >> >> >> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
>> >> >> >> all ram block, this issue can be reproduced.
>> >> >> >> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
>> >> >> >> this issue never happens.
>> >> >> >
>> >> >> > Maybe that is the right fix; I can imagine that the RDMA code doesn't
>> >> >> > like closing down if there are still ramblocks registered that
>> >> >> > potentially could have incoming DMA?
>> >> >> >
>> >> >> >> And for the kernel part, there is a bug also cause not release ram
>> >> >> >> block when canceling live migration.
>> >> >> >> https://patchwork.kernel.org/patch/10385781/
>> >> >> >
>> >> >> > OK, that's a pain; which threads are doing the dereg - is some stuff
>> >> >> > in the migration thread and some stuff in the main thread on cleanup?
>> >> >>
>> >> >> Yes, the migration thread invokes ibv_reg_mr, and the main thread bh
>> >> >> will invoke ibv_dereg_mr.
>> >> >
>> >> > OK, although my reading of that discussion is that *should* be allowed
>> >> > (with your fix).
>> >> >
>> >> >> and when the main thread schedule s->cleanup_bh, the migration thread
>> >> >> may have been exited.
>> >> >>
>> >> >> And I find ibv_dereg_mr may take a long time for the big memory size
>> >> >> virtual machine.
>> >> >>
>> >> >> The test result is:
>> >> >>  10GB  326ms
>> >> >>  20GB  699ms
>> >> >>  30GB  1021ms
>> >> >>  40GB  1387ms
>> >> >>  50GB  1712ms
>> >> >>  60GB  2034ms
>> >> >>  70GB  2457ms
>> >> >>  80GB  2807ms
>> >> >>  90GB  3107ms
>> >> >> 100GB 3474ms
>> >> >> 110GB 3735ms
>> >> >> 120GB 4064ms
>> >> >> 130GB 4567ms
>> >> >> 140GB 4886ms
>> >> >>
>> >> >> when canceling migration, the guest os will hang for a while.
>> >> >> so I think we should invoke qemu_fclose(s->to_dst_file)  in
>> >> >> migration_thread, and without hold io thread lock.
>> >> >
>> >> > Be very careful; it's tricky to get it right to handle all the failure
>> >> > cases; that's why it's much easier to clean up right at the end.
>> >> > It might make sense to try and do the deregister in the migration
>> >> > thread but also to just make sure at the end.
>> >>
>> >> Yes, I will make sure at the end of migration thread.
>> >>
>> >> >
>> >> > I wonder why dereg_mr takes so long - I could understand if reg_mr
>> >> > took a long time, but why for dereg, that sounds like the easy side.
>> >>
>> >> I use perf collect the information when ibv_dereg_mr is invoked.
>> >>
>> >> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>> >>                                                           `
>> >>    - put_compound_page
>> >>       - 98.45% put_page
>> >>            __ib_umem_release
>> >>            ib_umem_release
>> >>            dereg_mr
>> >>            mlx5_ib_dereg_mr
>> >>            ib_dereg_mr
>> >>            uverbs_free_mr
>> >>            remove_commit_idr_uobject
>> >>            _rdma_remove_commit_uobject
>> >>            rdma_remove_commit_uobject
>> >>            ib_uverbs_dereg_mr
>> >>            ib_uverbs_write
>> >>            vfs_write
>> >>            sys_write
>> >>            system_call_fastpath
>> >>            __GI___libc_write
>> >>            0
>> >>       + 1.55% __ib_umem_release
>> >> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
>> >> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>> >> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>> >> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>> >> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>> >> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>> >> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>> >> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>> >> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>> >> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>> >> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>> >> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
>> >> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>> >>
>> >> the reason is __ib_umem_release will loop many times for each page.
>> >>
>> >> static void __ib_umem_release(struct ib_device *dev, struct ib_umem
>> >> *umem, int dirty)
>> >> {
>> >>     struct scatterlist *sg;
>> >>     struct page *page;
>> >>     int i;
>> >>
>> >>     if (umem->nmap > 0)
>> >>          ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>> >>                                         umem->npages,
>> >>                                         DMA_BIDIRECTIONAL);
>> >>
>> >>          for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {      <<
>> >> loop a lot of times for each page.here
>> >
>> > Why 'lot of times for each page'?  I don't know this code at all, but
>> > I'd expected once per page?
>>
>> sorry, once per page, but a lot of page for a big size virtual machine.
>
> Ah OK; so yes it seems best if you can find a way to do the release in
> the migration thread then;  still maybe this is something some
> of the kernel people could look at speeding up?

The kernel code does not seem complex, and I have no idea how to speed it up.

>
>> >
>> > With your other kernel fix, does the problem of the missing
>> > RDMA_CM_EVENT_DISCONNECTED events go away?
>>
>> Yes, after kernel and qemu fixed, this issue never happens again.
>
> I'm confused; which qemu fix; my question was whether the kernel fix by
> itself fixed the problem of the missing event.

This qemu fix:
migration: update index field when delete or qsort RDMALocalBlock

This issue is also caused by some ram blocks not being released, but I have
not found the root cause.

>
>> Do you think we should remove rdma_get_cm_event after rdma_disconnect?
>
> I don't think so; if 'rdma_disconnect' is supposed to generate the event
> I think we're supposed to wait for it to know that the disconnect is
> really complete.

After moving qemu_fclose to the migration thread, it will not block the main
thread while waiting for the disconnection event.
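
For reference, the wait being discussed is the usual librdmacm sequence below.
This is only a sketch (the helper name wait_for_disconnect is made up for
illustration); it is roughly what the removed hunk did, plus a check of which
event actually arrived:

#include <rdma/rdma_cma.h>

/* Sketch only: block until the CM reports that the disconnect completed. */
static void wait_for_disconnect(struct rdma_cm_id *id,
                                struct rdma_event_channel *channel)
{
    struct rdma_cm_event *cm_event;

    if (rdma_disconnect(id) < 0) {
        return;                  /* disconnect failed, nothing to wait for */
    }
    /* rdma_get_cm_event() blocks; this is the call the backtrace shows
     * stuck in write() when the event never arrives. */
    while (rdma_get_cm_event(channel, &cm_event) == 0) {
        int done = (cm_event->event == RDMA_CM_EVENT_DISCONNECTED);
        rdma_ack_cm_event(cm_event);
        if (done) {
            break;
        }
    }
}

Run from the migration thread, a loop like this can block as long as it needs
to without stalling the main loop.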

>
> Dave
>
>>
>> >
>> > Dave
>> >
>> >>               page = sg_page(sg);
>> >>               if (umem->writable && dirty)
>> >>                   set_page_dirty_lock(page);
>> >>               put_page(page);
>> >>          }
>> >>
>> >>          sg_free_table(&umem->sg_head);
>> >>          return;
>> >> }
>> >>
>> >> >
>> >> > Dave
>> >> >
>> >> >
>> >> >> >
>> >> >> > Dave
>> >> >> >
>> >> >> >> >
>> >> >> >> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
>> >> >> >> >> is also destroyed in qemu_rdma_cleanup.
>> >> >> >> >>
>> >> >> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> >> >> >> >> ---
>> >> >> >> >>  migration/rdma.c       | 12 ++----------
>> >> >> >> >>  migration/trace-events |  1 -
>> >> >> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
>> >> >> >> >>
>> >> >> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
>> >> >> >> >> index 0dd4033..92e4d30 100644
>> >> >> >> >> --- a/migration/rdma.c
>> >> >> >> >> +++ b/migration/rdma.c
>> >> >> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>> >> >> >> >>
>> >> >> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >> >> >>  {
>> >> >> >> >> -    struct rdma_cm_event *cm_event;
>> >> >> >> >> -    int ret, idx;
>> >> >> >> >> +    int idx;
>> >> >> >> >>
>> >> >> >> >>      if (rdma->cm_id && rdma->connected) {
>> >> >> >> >>          if ((rdma->error_state ||
>> >> >> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
>> >> >> >> >>          }
>> >> >> >> >>
>> >> >> >> >> -        ret = rdma_disconnect(rdma->cm_id);
>> >> >> >> >> -        if (!ret) {
>> >> >> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>> >> >> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> >> >> >> >> -            if (!ret) {
>> >> >> >> >> -                rdma_ack_cm_event(cm_event);
>> >> >> >> >> -            }
>> >> >> >> >> -        }
>> >> >> >> >> +        rdma_disconnect(rdma->cm_id);
>> >> >> >> >
>> >> >> >> > I'm worried whether this change could break stuff:
>> >> >> >> > The docs say for rdma_disconnect that it flushes any posted work
>> >> >> >> > requests to the completion queue;  so unless we wait for the event
>> >> >> >> > do we know the stuff has been flushed?   In the normal non-cancel case
>> >> >> >> > I'm worried that means we could lose something.
>> >> >> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
>> >> >> >> > really a problem).
>> >> >> >>
>> >> >> >> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
>> >> >> >> so I think it's safe for normal migration case.
>> >> >> >>
>> >> >> >> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
>> >> >> >> after rdma_disconnect,
>> >> >> >> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
>> >> >> >> find the root cause.
>> >> >> >>
>> >> >> >> >
>> >> >> >> > Dave
>> >> >> >> >
>> >> >> >> >>          trace_qemu_rdma_cleanup_disconnect();
>> >> >> >> >>          rdma->connected = false;
>> >> >> >> >>      }
>> >> >> >> >> diff --git a/migration/trace-events b/migration/trace-events
>> >> >> >> >> index d6be74b..64573ff 100644
>> >> >> >> >> --- a/migration/trace-events
>> >> >> >> >> +++ b/migration/trace-events
>> >> >> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>> >> >> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>> >> >> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>> >> >> >> >>  qemu_rdma_cleanup_disconnect(void) ""
>> >> >> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>> >> >> >> >>  qemu_rdma_close(void) ""
>> >> >> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
>> >> >> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>> >> >> >> >> --
>> >> >> >> >> 1.8.3.1
>> >> >> >> >>
>> >> >> >> > --
>> >> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> >> >> > --
>> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> >> > --
>> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> > --
>> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-16 12:48                     ` 858585 jemmy
@ 2018-05-16 13:13                       ` Dr. David Alan Gilbert
       [not found]                         ` <AM6PR05MB4360C1A885C2B052583F47C7C2920@AM6PR05MB4360.eurprd05.prod.outlook.com>
  2018-05-22 10:12                         ` [Qemu-devel] " 858585 jemmy
  0 siblings, 2 replies; 19+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-16 13:13 UTC (permalink / raw)
  To: 858585 jemmy
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, adido,
	Lidong Chen

* 858585 jemmy (jemmy858585@gmail.com) wrote:

<snip>

> >> >> > I wonder why dereg_mr takes so long - I could understand if reg_mr
> >> >> > took a long time, but why for dereg, that sounds like the easy side.
> >> >>
> >> >> I use perf collect the information when ibv_dereg_mr is invoked.
> >> >>
> >> >> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
> >> >>                                                           `
> >> >>    - put_compound_page
> >> >>       - 98.45% put_page
> >> >>            __ib_umem_release
> >> >>            ib_umem_release
> >> >>            dereg_mr
> >> >>            mlx5_ib_dereg_mr
> >> >>            ib_dereg_mr
> >> >>            uverbs_free_mr
> >> >>            remove_commit_idr_uobject
> >> >>            _rdma_remove_commit_uobject
> >> >>            rdma_remove_commit_uobject
> >> >>            ib_uverbs_dereg_mr
> >> >>            ib_uverbs_write
> >> >>            vfs_write
> >> >>            sys_write
> >> >>            system_call_fastpath
> >> >>            __GI___libc_write
> >> >>            0
> >> >>       + 1.55% __ib_umem_release
> >> >> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
> >> >> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
> >> >> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
> >> >> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
> >> >> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
> >> >> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
> >> >> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
> >> >> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
> >> >> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
> >> >> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
> >> >> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
> >> >> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
> >> >> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
> >> >>
> >> >> the reason is __ib_umem_release will loop many times for each page.
> >> >>
> >> >> static void __ib_umem_release(struct ib_device *dev, struct ib_umem
> >> >> *umem, int dirty)
> >> >> {
> >> >>     struct scatterlist *sg;
> >> >>     struct page *page;
> >> >>     int i;
> >> >>
> >> >>     if (umem->nmap > 0)
> >> >>          ib_dma_unmap_sg(dev, umem->sg_head.sgl,
> >> >>                                         umem->npages,
> >> >>                                         DMA_BIDIRECTIONAL);
> >> >>
> >> >>          for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {      <<
> >> >> loop a lot of times for each page.here
> >> >
> >> > Why 'lot of times for each page'?  I don't know this code at all, but
> >> > I'd expected once per page?
> >>
> >> sorry, once per page, but a lot of page for a big size virtual machine.
> >
> > Ah OK; so yes it seems best if you can find a way to do the release in
> > the migration thread then;  still maybe this is something some
> > of the kernel people could look at speeding up?
> 
> The kernel code seem is not complex, and I have no idea how to speed up.

Me neither; but I'll ask around.

> >> >
> >> > With your other kernel fix, does the problem of the missing
> >> > RDMA_CM_EVENT_DISCONNECTED events go away?
> >>
> >> Yes, after kernel and qemu fixed, this issue never happens again.
> >
> > I'm confused; which qemu fix; my question was whether the kernel fix by
> > itself fixed the problem of the missing event.
> 
> this qemu fix:
> migration: update index field when delete or qsort RDMALocalBlock

OK good; so then we shouldn't need this 2/2 patch.

> this issue also cause by some ram block is not released. but I do not
> find the root cause.

Hmm, we should try and track that down.

> >
> >> Do you think we should remove rdma_get_cm_event after rdma_disconnect?
> >
> > I don't think so; if 'rdma_disconnect' is supposed to generate the event
> > I think we're supposed to wait for it to know that the disconnect is
> > really complete.
> 
> After move qemu_fclose to migration thread, it will not block the main
> thread when wait
> the disconnection event.

I'm not sure about moving the fclose to the migration thread; it worries
me because of the interaction with cancel and other failures.

Dave

> >
> > Dave
> >
> >>
> >> >
> >> > Dave
> >> >
> >> >>               page = sg_page(sg);
> >> >>               if (umem->writable && dirty)
> >> >>                   set_page_dirty_lock(page);
> >> >>               put_page(page);
> >> >>          }
> >> >>
> >> >>          sg_free_table(&umem->sg_head);
> >> >>          return;
> >> >> }
> >> >>
> >> >> >
> >> >> > Dave
> >> >> >
> >> >> >
> >> >> >> >
> >> >> >> > Dave
> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event channel
> >> >> >> >> >> is also destroyed in qemu_rdma_cleanup.
> >> >> >> >> >>
> >> >> >> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >> >> >> >> >> ---
> >> >> >> >> >>  migration/rdma.c       | 12 ++----------
> >> >> >> >> >>  migration/trace-events |  1 -
> >> >> >> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
> >> >> >> >> >>
> >> >> >> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
> >> >> >> >> >> index 0dd4033..92e4d30 100644
> >> >> >> >> >> --- a/migration/rdma.c
> >> >> >> >> >> +++ b/migration/rdma.c
> >> >> >> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
> >> >> >> >> >>
> >> >> >> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >> >> >> >>  {
> >> >> >> >> >> -    struct rdma_cm_event *cm_event;
> >> >> >> >> >> -    int ret, idx;
> >> >> >> >> >> +    int idx;
> >> >> >> >> >>
> >> >> >> >> >>      if (rdma->cm_id && rdma->connected) {
> >> >> >> >> >>          if ((rdma->error_state ||
> >> >> >> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
> >> >> >> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
> >> >> >> >> >>          }
> >> >> >> >> >>
> >> >> >> >> >> -        ret = rdma_disconnect(rdma->cm_id);
> >> >> >> >> >> -        if (!ret) {
> >> >> >> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
> >> >> >> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
> >> >> >> >> >> -            if (!ret) {
> >> >> >> >> >> -                rdma_ack_cm_event(cm_event);
> >> >> >> >> >> -            }
> >> >> >> >> >> -        }
> >> >> >> >> >> +        rdma_disconnect(rdma->cm_id);
> >> >> >> >> >
> >> >> >> >> > I'm worried whether this change could break stuff:
> >> >> >> >> > The docs say for rdma_disconnect that it flushes any posted work
> >> >> >> >> > requests to the completion queue;  so unless we wait for the event
> >> >> >> >> > do we know the stuff has been flushed?   In the normal non-cancel case
> >> >> >> >> > I'm worried that means we could lose something.
> >> >> >> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
> >> >> >> >> > really a problem).
> >> >> >> >>
> >> >> >> >> In qemu_fclose function, it invoke qemu_fflush(f) before invoke f->ops->close.
> >> >> >> >> so I think it's safe for normal migration case.
> >> >> >> >>
> >> >> >> >> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED event
> >> >> >> >> after rdma_disconnect,
> >> >> >> >> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help us
> >> >> >> >> find the root cause.
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > Dave
> >> >> >> >> >
> >> >> >> >> >>          trace_qemu_rdma_cleanup_disconnect();
> >> >> >> >> >>          rdma->connected = false;
> >> >> >> >> >>      }
> >> >> >> >> >> diff --git a/migration/trace-events b/migration/trace-events
> >> >> >> >> >> index d6be74b..64573ff 100644
> >> >> >> >> >> --- a/migration/trace-events
> >> >> >> >> >> +++ b/migration/trace-events
> >> >> >> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
> >> >> >> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
> >> >> >> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
> >> >> >> >> >>  qemu_rdma_cleanup_disconnect(void) ""
> >> >> >> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
> >> >> >> >> >>  qemu_rdma_close(void) ""
> >> >> >> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
> >> >> >> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
> >> >> >> >> >> --
> >> >> >> >> >> 1.8.3.1
> >> >> >> >> >>
> >> >> >> >> > --
> >> >> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> >> >> > --
> >> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> >> > --
> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> > --
> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
       [not found]                             ` <CAOGPPbenO1P9VbdpDVD5_SNJeoF4eHxkLg=rF7LM9HpbyoR0uw@mail.gmail.com>
@ 2018-05-17  7:31                               ` Aviad Yehezkel
  2018-05-17  7:41                                 ` 858585 jemmy
  0 siblings, 1 reply; 19+ messages in thread
From: Aviad Yehezkel @ 2018-05-17  7:31 UTC (permalink / raw)
  To: 858585 jemmy
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf,
	Adi Dotan, Lidong Chen



On 5/17/2018 5:42 AM, 858585 jemmy wrote:
> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
> <aviadye@dev.mellanox.co.il> wrote:
>> Hi Lidong and David,
>> Sorry for the late response, I had to ramp up on migration code and build a
>> setup on my side.
>>
>> PSB my comments for this patch below.
>> For the RDMA post-copy patches I will comment next week after testing on
>> Mellanox side too.
>>
>> Thanks!
>>
>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:
>>>
>>> -----Original Message-----
>>> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
>>> Sent: Wednesday, May 16, 2018 4:13 PM
>>> To: 858585 jemmy <jemmy858585@gmail.com>
>>> Cc: Aviad Yehezkel <aviadye@mellanox.com>; Juan Quintela
>>> <quintela@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Gal Shachaf
>>> <galsha@mellanox.com>; Adi Dotan <adido@mellanox.com>; Lidong Chen
>>> <lidongchen@tencent.com>
>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED
>>> event after rdma_disconnect
>>>
>>> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>>>
>>> <snip>
>>>
>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if
>>>>>>>>> reg_mr took a long time, but why for dereg, that sounds like the
>>>>>>>>> easy side.
>>>>>>>> I use perf collect the information when ibv_dereg_mr is invoked.
>>>>>>>>
>>>>>>>> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>>>>>>>>                                                             `
>>>>>>>>      - put_compound_page
>>>>>>>>         - 98.45% put_page
>>>>>>>>              __ib_umem_release
>>>>>>>>              ib_umem_release
>>>>>>>>              dereg_mr
>>>>>>>>              mlx5_ib_dereg_mr
>>>>>>>>              ib_dereg_mr
>>>>>>>>              uverbs_free_mr
>>>>>>>>              remove_commit_idr_uobject
>>>>>>>>              _rdma_remove_commit_uobject
>>>>>>>>              rdma_remove_commit_uobject
>>>>>>>>              ib_uverbs_dereg_mr
>>>>>>>>              ib_uverbs_write
>>>>>>>>              vfs_write
>>>>>>>>              sys_write
>>>>>>>>              system_call_fastpath
>>>>>>>>              __GI___libc_write
>>>>>>>>              0
>>>>>>>>         + 1.55% __ib_umem_release
>>>>>>>> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
>>>>>>>> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>>>>>>>> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>>>>>>>> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>>>>>>>> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>>>>>>>> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>>>>>>>> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>>>>>>>> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>>>>>>>> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>>>>>>>> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>>>>>>>> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>>>>>>>> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
>>>>>>>> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>>>>>>>>
>>>>>>>> the reason is __ib_umem_release will loop many times for each page.
>>>>>>>>
>>>>>>>> static void __ib_umem_release(struct ib_device *dev, struct
>>>>>>>> ib_umem *umem, int dirty) {
>>>>>>>>       struct scatterlist *sg;
>>>>>>>>       struct page *page;
>>>>>>>>       int i;
>>>>>>>>
>>>>>>>>       if (umem->nmap > 0)
>>>>>>>>            ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>>>>>>>>                                           umem->npages,
>>>>>>>>                                           DMA_BIDIRECTIONAL);
>>>>>>>>
>>>>>>>>            for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
>>>>>>>> <<
>>>>>>>> loop a lot of times for each page.here
>>>>>>> Why 'lot of times for each page'?  I don't know this code at all,
>>>>>>> but I'd expected once per page?
>>>>>> sorry, once per page, but a lot of page for a big size virtual machine.
>>>>> Ah OK; so yes it seems best if you can find a way to do the release
>>>>> in the migration thread then;  still maybe this is something some of
>>>>> the kernel people could look at speeding up?
>>>> The kernel code seem is not complex, and I have no idea how to speed up.
>>> Me neither; but I'll ask around.
>> I will ask internally and get back on this one.
>>>
>>>>>>> With your other kernel fix, does the problem of the missing
>>>>>>> RDMA_CM_EVENT_DISCONNECTED events go away?
>>>>>> Yes, after kernel and qemu fixed, this issue never happens again.
>>>>> I'm confused; which qemu fix; my question was whether the kernel fix
>>>>> by itself fixed the problem of the missing event.
>>>> this qemu fix:
>>>> migration: update index field when delete or qsort RDMALocalBlock
>>> OK good; so then we shouldn't need this 2/2 patch.
>> It is good that the qemu fix solved this issue, but my colleagues and I
>> think we need the 2/2 patch anyway.
>> According to the IB spec, once the active side sends a DREQ message it
>> should wait for a DREP message, and only once that arrives should it
>> trigger a DISCONNECT event. The DREP message can be dropped due to
>> network issues.
>> For that case the spec defines a DREP_timeout state in the CM state
>> machine: if the DREP is dropped we should get a timeout and a
>> TIMEWAIT_EXIT event will be triggered.
>> Unfortunately the current kernel CM implementation doesn't include the
>> DREP_timeout state, so in the above scenario we will get neither a
>> DISCONNECT nor a TIMEWAIT_EXIT event.
>> (A similar scenario also exists for the passive side.)
>> We think it is best to apply this patch until we enhance the kernel code.
>>
> Hi Aviad:
>      How long about the DREP_timeout state?
>      Do RDMA have some tools like tcpdump? then I can use to confirm whether
>      DREP message has received.
>
>      Thanks.
Are you running IB or RoCE?
>>>> this issue also cause by some ram block is not released. but I do not
>>>> find the root cause.
>>> Hmm, we should try and track that down.
>>>
>>>>>> Do you think we should remove rdma_get_cm_event after rdma_disconnect?
>>>>> I don't think so; if 'rdma_disconnect' is supposed to generate the
>>>>> event I think we're supposed to wait for it to know that the
>>>>> disconnect is really complete.
>>>> After move qemu_fclose to migration thread, it will not block the main
>>>> thread when wait the disconnection event.
>>> I'm not sure about moving the fclose to the migration thread; it worries
>>> me with the interaction with cancel and other failures.
>>>
>>> Dave
>>>
>>>>> Dave
>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>>                 page = sg_page(sg);
>>>>>>>>                 if (umem->writable && dirty)
>>>>>>>>                     set_page_dirty_lock(page);
>>>>>>>>                 put_page(page);
>>>>>>>>            }
>>>>>>>>
>>>>>>>>            sg_free_table(&umem->sg_head);
>>>>>>>>            return;
>>>>>>>> }
>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>>>>> Anyway, it should not invoke rdma_get_cm_event in main
>>>>>>>>>>>>>> thread, and the event channel is also destroyed in
>>>>>>>>>>>>>> qemu_rdma_cleanup.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>    migration/rdma.c       | 12 ++----------
>>>>>>>>>>>>>>    migration/trace-events |  1 -
>>>>>>>>>>>>>>    2 files changed, 2 insertions(+), 11 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/migration/rdma.c b/migration/rdma.c index
>>>>>>>>>>>>>> 0dd4033..92e4d30 100644
>>>>>>>>>>>>>> --- a/migration/rdma.c
>>>>>>>>>>>>>> +++ b/migration/rdma.c
>>>>>>>>>>>>>> @@ -2275,8 +2275,7 @@ static int
>>>>>>>>>>>>>> qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    static void qemu_rdma_cleanup(RDMAContext *rdma)  {
>>>>>>>>>>>>>> -    struct rdma_cm_event *cm_event;
>>>>>>>>>>>>>> -    int ret, idx;
>>>>>>>>>>>>>> +    int idx;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        if (rdma->cm_id && rdma->connected) {
>>>>>>>>>>>>>>            if ((rdma->error_state ||
>>>>>>>>>>>>>> @@ -2290,14 +2289,7 @@ static void
>>>>>>>>>>>>>> qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>                qemu_rdma_post_send_control(rdma, NULL, &head);
>>>>>>>>>>>>>>            }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -        ret = rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>> -        if (!ret) {
>>>>>>>>>>>>>> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>>>>>>>>>>>>>> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>>>>>>>>>>>>>> -            if (!ret) {
>>>>>>>>>>>>>> -                rdma_ack_cm_event(cm_event);
>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>> -        }
>>>>>>>>>>>>>> +        rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>> I'm worried whether this change could break stuff:
>>>>>>>>>>>>> The docs say for rdma_disconnect that it flushes any posted work
>>>>>>>>>>>>> requests to the completion queue;  so unless we wait for the
>>>>>>>>>>>>> event
>>>>>>>>>>>>> do we know the stuff has been flushed?   In the normal
>>>>>>>>>>>>> non-cancel case
>>>>>>>>>>>>> I'm worried that means we could lose something.
>>>>>>>>>>>>> (But I don't know the rdma/infiniband specs well enough to know
>>>>>>>>>>>>> if it's
>>>>>>>>>>>>> really a problem).
>>>>>>>>>>>> In qemu_fclose function, it invoke qemu_fflush(f) before invoke
>>>>>>>>>>>> f->ops->close.
>>>>>>>>>>>> so I think it's safe for normal migration case.
>>>>>>>>>>>>
>>>>>>>>>>>> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED
>>>>>>>>>>>> event
>>>>>>>>>>>> after rdma_disconnect,
>>>>>>>>>>>> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help
>>>>>>>>>>>> us
>>>>>>>>>>>> find the root cause.
>>
>> Regarding the previous discussion on that thread:
>> rdma_disconnect can be invoked even without invoking ibv_dereg_mr first.
>>
>> The common teardown flow is:
>>
>> rdma_disconnect
>>    <-- here resources like QPs are still alive and can DMA to RAM
>> rdma_destroy_id
>>    <-- all CM resources are destroyed, but the MR can still be active
>> ibv_dereg_mr
>>
>>
>>
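
(In librdmacm/libibverbs terms, an untested illustration of that ordering,
using the obvious handles; since the MR hangs off the protection domain
rather than the cm_id, deregistration can legitimately come last:)

rdma_disconnect(id);   /* QP still alive, guest RAM still a valid DMA target */
rdma_destroy_qp(id);
rdma_destroy_id(id);   /* CM state gone, the MR is still registered          */
ibv_dereg_mr(mr);      /* only now are the pinned pages released             */
ibv_dealloc_pd(pd);
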
>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>
>>>>>>>>>>>>>>            trace_qemu_rdma_cleanup_disconnect();
>>>>>>>>>>>>>>            rdma->connected = false;
>>>>>>>>>>>>>>        }
>>>>>>>>>>>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>>>>>>>>>>>> index d6be74b..64573ff 100644
>>>>>>>>>>>>>> --- a/migration/trace-events
>>>>>>>>>>>>>> +++ b/migration/trace-events
>>>>>>>>>>>>>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>>>>>>>>>>>>>>    qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after
>>>>>>>>>>>>>> listen: %p"
>>>>>>>>>>>>>>    qemu_rdma_block_for_wrid_miss(const char *wcompstr, int
>>>>>>>>>>>>>> wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s
>>>>>>>>>>>>>> (%" PRIu64 ")"
>>>>>>>>>>>>>>    qemu_rdma_cleanup_disconnect(void) ""
>>>>>>>>>>>>>> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>>>>>>>>>>>>>>    qemu_rdma_close(void) ""
>>>>>>>>>>>>>>    qemu_rdma_connect_pin_all_requested(void) ""
>>>>>>>>>>>>>>    qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> 1.8.3.1
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>>>> --
>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>> --
>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-17  7:31                               ` [Qemu-devel] FW: " Aviad Yehezkel
@ 2018-05-17  7:41                                 ` 858585 jemmy
  2018-05-17  7:46                                   ` Aviad Yehezkel
  0 siblings, 1 reply; 19+ messages in thread
From: 858585 jemmy @ 2018-05-17  7:41 UTC (permalink / raw)
  To: Aviad Yehezkel
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf,
	Adi Dotan, Lidong Chen

On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel
<aviadye@dev.mellanox.co.il> wrote:
>
>
> On 5/17/2018 5:42 AM, 858585 jemmy wrote:
>>
>> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
>> <aviadye@dev.mellanox.co.il> wrote:
>>>
>>> Hi Lidong and David,
>>> Sorry for the late response, I had to ramp up on migration code and build
>>> a
>>> setup on my side.
>>>
>>> PSB my comments for this patch below.
>>> For the RDMA post-copy patches I will comment next week after testing on
>>> Mellanox side too.
>>>
>>> Thanks!
>>>
>>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
>>>> Sent: Wednesday, May 16, 2018 4:13 PM
>>>> To: 858585 jemmy <jemmy858585@gmail.com>
>>>> Cc: Aviad Yehezkel <aviadye@mellanox.com>; Juan Quintela
>>>> <quintela@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Gal Shachaf
>>>> <galsha@mellanox.com>; Adi Dotan <adido@mellanox.com>; Lidong Chen
>>>> <lidongchen@tencent.com>
>>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED
>>>> event after rdma_disconnect
>>>>
>>>> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>>>>
>>>> <snip>
>>>>
>>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if
>>>>>>>>>> reg_mr took a long time, but why for dereg, that sounds like the
>>>>>>>>>> easy side.
>>>>>>>>>
>>>>>>>>> I use perf collect the information when ibv_dereg_mr is invoked.
>>>>>>>>>
>>>>>>>>> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>>>>>>>>>                                                             `
>>>>>>>>>      - put_compound_page
>>>>>>>>>         - 98.45% put_page
>>>>>>>>>              __ib_umem_release
>>>>>>>>>              ib_umem_release
>>>>>>>>>              dereg_mr
>>>>>>>>>              mlx5_ib_dereg_mr
>>>>>>>>>              ib_dereg_mr
>>>>>>>>>              uverbs_free_mr
>>>>>>>>>              remove_commit_idr_uobject
>>>>>>>>>              _rdma_remove_commit_uobject
>>>>>>>>>              rdma_remove_commit_uobject
>>>>>>>>>              ib_uverbs_dereg_mr
>>>>>>>>>              ib_uverbs_write
>>>>>>>>>              vfs_write
>>>>>>>>>              sys_write
>>>>>>>>>              system_call_fastpath
>>>>>>>>>              __GI___libc_write
>>>>>>>>>              0
>>>>>>>>>         + 1.55% __ib_umem_release
>>>>>>>>> +   8.31%  client2  [kernel.kallsyms]  [k]
>>>>>>>>> compound_unlock_irqrestore
>>>>>>>>> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>>>>>>>>> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>>>>>>>>> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>>>>>>>>> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>>>>>>>>> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>>>>>>>>> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>>>>>>>>> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>>>>>>>>> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>>>>>>>>> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>>>>>>>>> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>>>>>>>>> +   3.13%  client2  [kernel.kallsyms]  [k]
>>>>>>>>> get_pageblock_flags_group
>>>>>>>>> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>>>>>>>>>
>>>>>>>>> the reason is __ib_umem_release will loop many times for each page.
>>>>>>>>>
>>>>>>>>> static void __ib_umem_release(struct ib_device *dev, struct
>>>>>>>>> ib_umem *umem, int dirty) {
>>>>>>>>>       struct scatterlist *sg;
>>>>>>>>>       struct page *page;
>>>>>>>>>       int i;
>>>>>>>>>
>>>>>>>>>       if (umem->nmap > 0)
>>>>>>>>>            ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>>>>>>>>>                                           umem->npages,
>>>>>>>>>                                           DMA_BIDIRECTIONAL);
>>>>>>>>>
>>>>>>>>>            for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
>>>>>>>>> <<
>>>>>>>>> loop a lot of times for each page.here
>>>>>>>>
>>>>>>>> Why 'lot of times for each page'?  I don't know this code at all,
>>>>>>>> but I'd expected once per page?
>>>>>>>
>>>>>>> sorry, once per page, but a lot of page for a big size virtual
>>>>>>> machine.
>>>>>>
>>>>>> Ah OK; so yes it seems best if you can find a way to do the release
>>>>>> in the migration thread then;  still maybe this is something some of
>>>>>> the kernel people could look at speeding up?
>>>>>
>>>>> The kernel code seem is not complex, and I have no idea how to speed
>>>>> up.
>>>>
>>>> Me neither; but I'll ask around.
>>>
>>> I will ask internally and get back on this one.
>>>>
>>>>
>>>>>>>> With your other kernel fix, does the problem of the missing
>>>>>>>> RDMA_CM_EVENT_DISCONNECTED events go away?
>>>>>>>
>>>>>>> Yes, after kernel and qemu fixed, this issue never happens again.
>>>>>>
>>>>>> I'm confused; which qemu fix; my question was whether the kernel fix
>>>>>> by itself fixed the problem of the missing event.
>>>>>
>>>>> this qemu fix:
>>>>> migration: update index field when delete or qsort RDMALocalBlock
>>>>
>>>> OK good; so then we shouldn't need this 2/2 patch.
>>>
>>> It is good that the qemu fix solved this issue but me and my colleagues
>>> think we need 2/2 patch anyway.
>>> According to IB Spec once active side send DREQ message he should wait
>>> for
>>> DREP message and only once it arrived it should trigger a DISCONNECT
>>> event.
>>> DREP message can be dropped due to network issues.
>>> For that case the spec defines a DREP_timeout state in the CM state
>>> machine,
>>> if the DREP is dropped we should get a timeout and a TIMEWAIT_EXIT event
>>> will be trigger.
>>> Unfortunately the current kernel CM implementation doesn't include the
>>> DREP_timeout state and in above scenario we will not get DISCONNECT or
>>> TIMEWAIT_EXIT events.
>>> (Similar scenario exists also for passive side).
>>> We think it is best to apply this patch until we will enhance the kernel
>>> code.
>>>
>> Hi Aviad:
>>      How long about the DREP_timeout state?
>>      Do RDMA have some tools like tcpdump? then I can use to confirm
>> whether
>>      DREP message has received.
>>
>>      Thanks.
>
> Are you running IB or RoCE?

RoCE v2.

>>>>>
>>>>> this issue also cause by some ram block is not released. but I do not
>>>>> find the root cause.
>>>>
>>>> Hmm, we should try and track that down.
>>>>
>>>>>>> Do you think we should remove rdma_get_cm_event after
>>>>>>> rdma_disconnect?
>>>>>>
>>>>>> I don't think so; if 'rdma_disconnect' is supposed to generate the
>>>>>> event I think we're supposed to wait for it to know that the
>>>>>> disconnect is really complete.
>>>>>
>>>>> After move qemu_fclose to migration thread, it will not block the main
>>>>> thread when wait the disconnection event.
>>>>
>>>> I'm not sure about moving the fclose to the migration thread; it worries
>>>> me with the interaction with cancel and other failures.
>>>>
>>>> Dave
>>>>
>>>>>> Dave
>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>>>                 page = sg_page(sg);
>>>>>>>>>                 if (umem->writable && dirty)
>>>>>>>>>                     set_page_dirty_lock(page);
>>>>>>>>>                 put_page(page);
>>>>>>>>>            }
>>>>>>>>>
>>>>>>>>>            sg_free_table(&umem->sg_head);
>>>>>>>>>            return;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anyway, it should not invoke rdma_get_cm_event in main
>>>>>>>>>>>>>>> thread, and the event channel is also destroyed in
>>>>>>>>>>>>>>> qemu_rdma_cleanup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>    migration/rdma.c       | 12 ++----------
>>>>>>>>>>>>>>>    migration/trace-events |  1 -
>>>>>>>>>>>>>>>    2 files changed, 2 insertions(+), 11 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/migration/rdma.c b/migration/rdma.c index
>>>>>>>>>>>>>>> 0dd4033..92e4d30 100644
>>>>>>>>>>>>>>> --- a/migration/rdma.c
>>>>>>>>>>>>>>> +++ b/migration/rdma.c
>>>>>>>>>>>>>>> @@ -2275,8 +2275,7 @@ static int
>>>>>>>>>>>>>>> qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    static void qemu_rdma_cleanup(RDMAContext *rdma)  {
>>>>>>>>>>>>>>> -    struct rdma_cm_event *cm_event;
>>>>>>>>>>>>>>> -    int ret, idx;
>>>>>>>>>>>>>>> +    int idx;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        if (rdma->cm_id && rdma->connected) {
>>>>>>>>>>>>>>>            if ((rdma->error_state ||
>>>>>>>>>>>>>>> @@ -2290,14 +2289,7 @@ static void
>>>>>>>>>>>>>>> qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>>                qemu_rdma_post_send_control(rdma, NULL,
>>>>>>>>>>>>>>> &head);
>>>>>>>>>>>>>>>            }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -        ret = rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>> -        if (!ret) {
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> trace_qemu_rdma_cleanup_waiting_for_disconnect();
>>>>>>>>>>>>>>> -            ret = rdma_get_cm_event(rdma->channel,
>>>>>>>>>>>>>>> &cm_event);
>>>>>>>>>>>>>>> -            if (!ret) {
>>>>>>>>>>>>>>> -                rdma_ack_cm_event(cm_event);
>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>> -        }
>>>>>>>>>>>>>>> +        rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm worried whether this change could break stuff:
>>>>>>>>>>>>>> The docs say for rdma_disconnect that it flushes any posted
>>>>>>>>>>>>>> work
>>>>>>>>>>>>>> requests to the completion queue;  so unless we wait for the
>>>>>>>>>>>>>> event
>>>>>>>>>>>>>> do we know the stuff has been flushed?   In the normal
>>>>>>>>>>>>>> non-cancel case
>>>>>>>>>>>>>> I'm worried that means we could lose something.
>>>>>>>>>>>>>> (But I don't know the rdma/infiniband specs well enough to
>>>>>>>>>>>>>> know
>>>>>>>>>>>>>> if it's
>>>>>>>>>>>>>> really a problem).
>>>>>>>>>>>>>
>>>>>>>>>>>>> In qemu_fclose function, it invoke qemu_fflush(f) before invoke
>>>>>>>>>>>>> f->ops->close.
>>>>>>>>>>>>> so I think it's safe for normal migration case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED
>>>>>>>>>>>>> event
>>>>>>>>>>>>> after rdma_disconnect,
>>>>>>>>>>>>> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help
>>>>>>>>>>>>> us
>>>>>>>>>>>>> find the root cause.
>>>
>>>
>>> Regarding previous discussion on that thread:
>>> rdma_disconnect can be invoke even without invoking ib_dreg_mr first.
>>>
>>> common teardown flow:
>>>
>>> rdma_disconnect
>>>    <-- here resources like QPs are still alive and can DMA to RAM
>>> rdma_destroy_cm_id
>>>    <-- All resources are destroyed by MR can be still active
>>> rdma_dreg_mr
>>>
>>>
>>>
>>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            trace_qemu_rdma_cleanup_disconnect();
>>>>>>>>>>>>>>>            rdma->connected = false;
>>>>>>>>>>>>>>>        }
>>>>>>>>>>>>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>>>>>>>>>>>>> index d6be74b..64573ff 100644
>>>>>>>>>>>>>>> --- a/migration/trace-events
>>>>>>>>>>>>>>> +++ b/migration/trace-events
>>>>>>>>>>>>>>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>>>>>>>>>>>>>>>    qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context
>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>> listen: %p"
>>>>>>>>>>>>>>>    qemu_rdma_block_for_wrid_miss(const char *wcompstr, int
>>>>>>>>>>>>>>> wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s
>>>>>>>>>>>>>>> (%d) but got %s
>>>>>>>>>>>>>>> (%" PRIu64 ")"
>>>>>>>>>>>>>>>    qemu_rdma_cleanup_disconnect(void) ""
>>>>>>>>>>>>>>> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>>>>>>>>>>>>>>>    qemu_rdma_close(void) ""
>>>>>>>>>>>>>>>    qemu_rdma_connect_pin_all_requested(void) ""
>>>>>>>>>>>>>>>    qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> 1.8.3.1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>
>>>>>> --
>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>
>>>> --
>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-17  7:41                                 ` 858585 jemmy
@ 2018-05-17  7:46                                   ` Aviad Yehezkel
  0 siblings, 0 replies; 19+ messages in thread
From: Aviad Yehezkel @ 2018-05-17  7:46 UTC (permalink / raw)
  To: 858585 jemmy
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf,
	Adi Dotan, Lidong Chen



On 5/17/2018 10:41 AM, 858585 jemmy wrote:
> On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel
> <aviadye@dev.mellanox.co.il> wrote:
>>
>> On 5/17/2018 5:42 AM, 858585 jemmy wrote:
>>> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
>>> <aviadye@dev.mellanox.co.il> wrote:
>>>> Hi Lidong and David,
>>>> Sorry for the late response, I had to ramp up on migration code and build
>>>> a
>>>> setup on my side.
>>>>
>>>> PSB my comments for this patch below.
>>>> For the RDMA post-copy patches I will comment next week after testing on
>>>> Mellanox side too.
>>>>
>>>> Thanks!
>>>>
>>>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:
>>>>>
>>>>> -----Original Message-----
>>>>> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
>>>>> Sent: Wednesday, May 16, 2018 4:13 PM
>>>>> To: 858585 jemmy <jemmy858585@gmail.com>
>>>>> Cc: Aviad Yehezkel <aviadye@mellanox.com>; Juan Quintela
>>>>> <quintela@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Gal Shachaf
>>>>> <galsha@mellanox.com>; Adi Dotan <adido@mellanox.com>; Lidong Chen
>>>>> <lidongchen@tencent.com>
>>>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED
>>>>> event after rdma_disconnect
>>>>>
>>>>> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if
>>>>>>>>>>> reg_mr took a long time, but why for dereg, that sounds like the
>>>>>>>>>>> easy side.
>>>>>>>>>> I use perf collect the information when ibv_dereg_mr is invoked.
>>>>>>>>>>
>>>>>>>>>> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>>>>>>>>>>                                                              `
>>>>>>>>>>       - put_compound_page
>>>>>>>>>>          - 98.45% put_page
>>>>>>>>>>               __ib_umem_release
>>>>>>>>>>               ib_umem_release
>>>>>>>>>>               dereg_mr
>>>>>>>>>>               mlx5_ib_dereg_mr
>>>>>>>>>>               ib_dereg_mr
>>>>>>>>>>               uverbs_free_mr
>>>>>>>>>>               remove_commit_idr_uobject
>>>>>>>>>>               _rdma_remove_commit_uobject
>>>>>>>>>>               rdma_remove_commit_uobject
>>>>>>>>>>               ib_uverbs_dereg_mr
>>>>>>>>>>               ib_uverbs_write
>>>>>>>>>>               vfs_write
>>>>>>>>>>               sys_write
>>>>>>>>>>               system_call_fastpath
>>>>>>>>>>               __GI___libc_write
>>>>>>>>>>               0
>>>>>>>>>>          + 1.55% __ib_umem_release
>>>>>>>>>> +   8.31%  client2  [kernel.kallsyms]  [k]
>>>>>>>>>> compound_unlock_irqrestore
>>>>>>>>>> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>>>>>>>>>> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>>>>>>>>>> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>>>>>>>>>> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>>>>>>>>>> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>>>>>>>>>> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>>>>>>>>>> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>>>>>>>>>> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>>>>>>>>>> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>>>>>>>>>> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>>>>>>>>>> +   3.13%  client2  [kernel.kallsyms]  [k]
>>>>>>>>>> get_pageblock_flags_group
>>>>>>>>>> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>>>>>>>>>>
>>>>>>>>>> the reason is __ib_umem_release will loop many times for each page.
>>>>>>>>>>
>>>>>>>>>> static void __ib_umem_release(struct ib_device *dev, struct
>>>>>>>>>> ib_umem *umem, int dirty) {
>>>>>>>>>>        struct scatterlist *sg;
>>>>>>>>>>        struct page *page;
>>>>>>>>>>        int i;
>>>>>>>>>>
>>>>>>>>>>        if (umem->nmap > 0)
>>>>>>>>>>             ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>>>>>>>>>>                                            umem->npages,
>>>>>>>>>>                                            DMA_BIDIRECTIONAL);
>>>>>>>>>>
>>>>>>>>>>             for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
>>>>>>>>>> <<
>>>>>>>>>> loop a lot of times for each page.here
>>>>>>>>> Why 'lot of times for each page'?  I don't know this code at all,
>>>>>>>>> but I'd expected once per page?
>>>>>>>> sorry, once per page, but a lot of page for a big size virtual
>>>>>>>> machine.
>>>>>>> Ah OK; so yes it seems best if you can find a way to do the release
>>>>>>> in the migration thread then;  still maybe this is something some of
>>>>>>> the kernel people could look at speeding up?
>>>>>> The kernel code seem is not complex, and I have no idea how to speed
>>>>>> up.
>>>>> Me neither; but I'll ask around.
>>>> I will ask internally and get back on this one.
>>>>>
>>>>>>>>> With your other kernel fix, does the problem of the missing
>>>>>>>>> RDMA_CM_EVENT_DISCONNECTED events go away?
>>>>>>>> Yes, after kernel and qemu fixed, this issue never happens again.
>>>>>>> I'm confused; which qemu fix; my question was whether the kernel fix
>>>>>>> by itself fixed the problem of the missing event.
>>>>>> this qemu fix:
>>>>>> migration: update index field when delete or qsort RDMALocalBlock
>>>>> OK good; so then we shouldn't need this 2/2 patch.
>>>> It is good that the qemu fix solved this issue but me and my colleagues
>>>> think we need 2/2 patch anyway.
>>>> According to IB Spec once active side send DREQ message he should wait
>>>> for
>>>> DREP message and only once it arrived it should trigger a DISCONNECT
>>>> event.
>>>> DREP message can be dropped due to network issues.
>>>> For that case the spec defines a DREP_timeout state in the CM state
>>>> machine,
>>>> if the DREP is dropped we should get a timeout and a TIMEWAIT_EXIT event
>>>> will be trigger.
>>>> Unfortunately the current kernel CM implementation doesn't include the
>>>> DREP_timeout state and in above scenario we will not get DISCONNECT or
>>>> TIMEWAIT_EXIT events.
>>>> (Similar scenario exists also for passive side).
>>>> We think it is best to apply this patch until we will enhance the kernel
>>>> code.
>>>>
>>> Hi Aviad:
>>>       How long about the DREP_timeout state?
>>>       Do RDMA have some tools like tcpdump? then I can use to confirm
>>> whether
>>>       DREP message has received.
>>>
>>>       Thanks.
>> Are you running IB or RoCE?
> RoCE v2.
Then download the latest Wireshark; it should have RDMA over RoCE support.
Capture packets on the associated Ethernet interface (RoCE v2 traffic is
UDP with destination port 4791, so a capture filter on that port isolates it).
Let me know if you need help.
>
>>>>>> this issue also cause by some ram block is not released. but I do not
>>>>>> find the root cause.
>>>>> Hmm, we should try and track that down.
>>>>>
>>>>>>>> Do you think we should remove rdma_get_cm_event after
>>>>>>>> rdma_disconnect?
>>>>>>> I don't think so; if 'rdma_disconnect' is supposed to generate the
>>>>>>> event I think we're supposed to wait for it to know that the
>>>>>>> disconnect is really complete.
>>>>>> After move qemu_fclose to migration thread, it will not block the main
>>>>>> thread when wait the disconnection event.
>>>>> I'm not sure about moving the fclose to the migration thread; it worries
>>>>> me with the interaction with cancel and other failures.
>>>>>
>>>>> Dave
>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>>                  page = sg_page(sg);
>>>>>>>>>>                  if (umem->writable && dirty)
>>>>>>>>>>                      set_page_dirty_lock(page);
>>>>>>>>>>                  put_page(page);
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             sg_free_table(&umem->sg_head);
>>>>>>>>>>             return;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Anyway, it should not invoke rdma_get_cm_event in main
>>>>>>>>>>>>>>>> thread, and the event channel is also destroyed in
>>>>>>>>>>>>>>>> qemu_rdma_cleanup.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>     migration/rdma.c       | 12 ++----------
>>>>>>>>>>>>>>>>     migration/trace-events |  1 -
>>>>>>>>>>>>>>>>     2 files changed, 2 insertions(+), 11 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/migration/rdma.c b/migration/rdma.c index
>>>>>>>>>>>>>>>> 0dd4033..92e4d30 100644
>>>>>>>>>>>>>>>> --- a/migration/rdma.c
>>>>>>>>>>>>>>>> +++ b/migration/rdma.c
>>>>>>>>>>>>>>>> @@ -2275,8 +2275,7 @@ static int
>>>>>>>>>>>>>>>> qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     static void qemu_rdma_cleanup(RDMAContext *rdma)  {
>>>>>>>>>>>>>>>> -    struct rdma_cm_event *cm_event;
>>>>>>>>>>>>>>>> -    int ret, idx;
>>>>>>>>>>>>>>>> +    int idx;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         if (rdma->cm_id && rdma->connected) {
>>>>>>>>>>>>>>>>             if ((rdma->error_state ||
>>>>>>>>>>>>>>>> @@ -2290,14 +2289,7 @@ static void
>>>>>>>>>>>>>>>> qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>>>                 qemu_rdma_post_send_control(rdma, NULL,
>>>>>>>>>>>>>>>> &head);
>>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -        ret = rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>>> -        if (!ret) {
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> trace_qemu_rdma_cleanup_waiting_for_disconnect();
>>>>>>>>>>>>>>>> -            ret = rdma_get_cm_event(rdma->channel,
>>>>>>>>>>>>>>>> &cm_event);
>>>>>>>>>>>>>>>> -            if (!ret) {
>>>>>>>>>>>>>>>> -                rdma_ack_cm_event(cm_event);
>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>> -        }
>>>>>>>>>>>>>>>> +        rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>> I'm worried whether this change could break stuff:
>>>>>>>>>>>>>>> The docs say for rdma_disconnect that it flushes any posted
>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>> requests to the completion queue;  so unless we wait for the
>>>>>>>>>>>>>>> event
>>>>>>>>>>>>>>> do we know the stuff has been flushed?   In the normal
>>>>>>>>>>>>>>> non-cancel case
>>>>>>>>>>>>>>> I'm worried that means we could lose something.
>>>>>>>>>>>>>>> (But I don't know the rdma/infiniband specs well enough to
>>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>> if it's
>>>>>>>>>>>>>>> really a problem).
>>>>>>>>>>>>>> In qemu_fclose function, it invoke qemu_fflush(f) before invoke
>>>>>>>>>>>>>> f->ops->close.
>>>>>>>>>>>>>> so I think it's safe for normal migration case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the root cause why not receive RDMA_CM_EVENT_DISCONNECTED
>>>>>>>>>>>>>> event
>>>>>>>>>>>>>> after rdma_disconnect,
>>>>>>>>>>>>>> I loop in Aviad Yehezkel<aviadye@mellanox.com>, Aviad will help
>>>>>>>>>>>>>> us
>>>>>>>>>>>>>> find the root cause.
>>>>
>>>> Regarding previous discussion on that thread:
>>>> rdma_disconnect can be invoke even without invoking ib_dreg_mr first.
>>>>
>>>> common teardown flow:
>>>>
>>>> rdma_disconnect
>>>>     <-- here resources like QPs are still alive and can DMA to RAM
>>>> rdma_destroy_cm_id
>>>>     <-- All resources are destroyed by MR can be still active
>>>> rdma_dreg_mr
>>>>
>>>>
>>>>
>>>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>             trace_qemu_rdma_cleanup_disconnect();
>>>>>>>>>>>>>>>>             rdma->connected = false;
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>>>>>>>>>>>>>> index d6be74b..64573ff 100644
>>>>>>>>>>>>>>>> --- a/migration/trace-events
>>>>>>>>>>>>>>>> +++ b/migration/trace-events
>>>>>>>>>>>>>>>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>>>>>>>>>>>>>>>>     qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context
>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>> listen: %p"
>>>>>>>>>>>>>>>>     qemu_rdma_block_for_wrid_miss(const char *wcompstr, int
>>>>>>>>>>>>>>>> wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s
>>>>>>>>>>>>>>>> (%d) but got %s
>>>>>>>>>>>>>>>> (%" PRIu64 ")"
>>>>>>>>>>>>>>>>     qemu_rdma_cleanup_disconnect(void) ""
>>>>>>>>>>>>>>>> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>>>>>>>>>>>>>>>>     qemu_rdma_close(void) ""
>>>>>>>>>>>>>>>>     qemu_rdma_connect_pin_all_requested(void) ""
>>>>>>>>>>>>>>>>     qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 1.8.3.1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>>>> --
>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>> --
>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
  2018-05-16 13:13                       ` Dr. David Alan Gilbert
       [not found]                         ` <AM6PR05MB4360C1A885C2B052583F47C7C2920@AM6PR05MB4360.eurprd05.prod.outlook.com>
@ 2018-05-22 10:12                         ` 858585 jemmy
  1 sibling, 0 replies; 19+ messages in thread
From: 858585 jemmy @ 2018-05-22 10:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf,
	Adi Dotan, Lidong Chen

On Wed, May 16, 2018 at 9:13 PM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>
> <snip>
>
>> >> >> > I wonder why dereg_mr takes so long - I could understand if reg_mr
>> >> >> > took a long time, but why for dereg, that sounds like the easy side.
>> >> >>
>> >> >> I use perf collect the information when ibv_dereg_mr is invoked.
>> >> >>
>> >> >> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>> >> >>                                                           `
>> >> >>    - put_compound_page
>> >> >>       - 98.45% put_page
>> >> >>            __ib_umem_release
>> >> >>            ib_umem_release
>> >> >>            dereg_mr
>> >> >>            mlx5_ib_dereg_mr
>> >> >>            ib_dereg_mr
>> >> >>            uverbs_free_mr
>> >> >>            remove_commit_idr_uobject
>> >> >>            _rdma_remove_commit_uobject
>> >> >>            rdma_remove_commit_uobject
>> >> >>            ib_uverbs_dereg_mr
>> >> >>            ib_uverbs_write
>> >> >>            vfs_write
>> >> >>            sys_write
>> >> >>            system_call_fastpath
>> >> >>            __GI___libc_write
>> >> >>            0
>> >> >>       + 1.55% __ib_umem_release
>> >> >> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
>> >> >> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>> >> >> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>> >> >> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>> >> >> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>> >> >> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>> >> >> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>> >> >> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>> >> >> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>> >> >> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>> >> >> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>> >> >> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
>> >> >> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>> >> >>
>> >> >> the reason is __ib_umem_release will loop many times for each page.
>> >> >>
>> >> >> static void __ib_umem_release(struct ib_device *dev, struct ib_umem
>> >> >> *umem, int dirty)
>> >> >> {
>> >> >>     struct scatterlist *sg;
>> >> >>     struct page *page;
>> >> >>     int i;
>> >> >>
>> >> >>     if (umem->nmap > 0)
>> >> >>          ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>> >> >>                                         umem->npages,
>> >> >>                                         DMA_BIDIRECTIONAL);
>> >> >>
>> >> >>          for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {      <<
>> >> >> loop a lot of times for each page.here
>> >> >
>> >> > Why 'lot of times for each page'?  I don't know this code at all, but
>> >> > I'd expected once per page?
>> >>
>> >> sorry, once per page, but a lot of page for a big size virtual machine.
>> >
>> > Ah OK; so yes it seems best if you can find a way to do the release in
>> > the migration thread then;  still maybe this is something some
>> > of the kernel people could look at speeding up?
>>
>> The kernel code seem is not complex, and I have no idea how to speed up.
>
> Me neither; but I'll ask around.
>
>> >> >
>> >> > With your other kernel fix, does the problem of the missing
>> >> > RDMA_CM_EVENT_DISCONNECTED events go away?
>> >>
>> >> Yes, after kernel and qemu fixed, this issue never happens again.
>> >
>> > I'm confused; which qemu fix; my question was whether the kernel fix by
>> > itself fixed the problem of the missing event.
>>
>> this qemu fix:
>> migration: update index field when delete or qsort RDMALocalBlock
>
> OK good; so then we shouldn't need this 2/2 patch.
>
>> this issue also cause by some ram block is not released. but I do not
>> find the root cause.
>
> Hmm, we should try and track that down.
>
>> >
>> >> Do you think we should remove rdma_get_cm_event after rdma_disconnect?
>> >
>> > I don't think so; if 'rdma_disconnect' is supposed to generate the event
>> > I think we're supposed to wait for it to know that the disconnect is
>> > really complete.
>>
>> After move qemu_fclose to migration thread, it will not block the main
>> thread when wait
>> the disconnection event.
>
> I'm not sure about moving the fclose to the migration thread; it worries
> me with the interaction with cancel and other failures.

Bad news: I found today that the destination qemu also has this problem.
When the precopy migration finishes, the destination qemu main thread
(process_incoming_migration_co) calls qemu_fclose(f), which causes the
guest OS to hang for a while.

So I think it is better to solve this problem inside rdma.c itself, rather
than changing how migration.c calls qemu_fclose.

We can create a separate thread in qemu_rdma_cleanup and release the
resources in that separate thread, as in the sketch below.
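
A minimal sketch of that idea (inside migration/rdma.c; the helper names are
invented here for illustration, and a real patch would also need to
synchronize with the cancel and error paths so nothing frees the RDMAContext
underneath the worker):

#include "qemu/thread.h"

/* Sketch only: hand the slow teardown to a detached thread so the thread
 * calling qemu_fclose() is not blocked while the kernel releases every
 * pinned page in __ib_umem_release(). */
static void *qemu_rdma_cleanup_thread(void *opaque)
{
    RDMAContext *rdma = opaque;

    qemu_rdma_cleanup(rdma);    /* rdma_disconnect(), ibv_dereg_mr(), ... */
    return NULL;
}

static void qemu_rdma_cleanup_async(RDMAContext *rdma)
{
    QemuThread thread;

    qemu_thread_create(&thread, "rdma-cleanup", qemu_rdma_cleanup_thread,
                       rdma, QEMU_THREAD_DETACHED);
}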

>
> Dave
>
>> >
>> > Dave
>> >
>> >>
>> >> >
>> >> > Dave
>> >> >
>> >> >>         page = sg_page(sg);
>> >> >>         if (umem->writable && dirty)
>> >> >>             set_page_dirty_lock(page);
>> >> >>         put_page(page);
>> >> >>     }
>> >> >>
>> >> >>     sg_free_table(&umem->sg_head);
>> >> >>     return;
>> >> >> }
>> >> >>
>> >> >> >
>> >> >> > Dave
>> >> >> >
>> >> >> >
>> >> >> >> >
>> >> >> >> > Dave
>> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >> Anyway, rdma_get_cm_event should not be invoked in the main thread, and the event channel
>> >> >> >> >> >> is also destroyed in qemu_rdma_cleanup.
>> >> >> >> >> >>
>> >> >> >> >> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> >> >> >> >> >> ---
>> >> >> >> >> >>  migration/rdma.c       | 12 ++----------
>> >> >> >> >> >>  migration/trace-events |  1 -
>> >> >> >> >> >>  2 files changed, 2 insertions(+), 11 deletions(-)
>> >> >> >> >> >>
>> >> >> >> >> >> diff --git a/migration/rdma.c b/migration/rdma.c
>> >> >> >> >> >> index 0dd4033..92e4d30 100644
>> >> >> >> >> >> --- a/migration/rdma.c
>> >> >> >> >> >> +++ b/migration/rdma.c
>> >> >> >> >> >> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>> >> >> >> >> >>
>> >> >> >> >> >>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >> >> >> >>  {
>> >> >> >> >> >> -    struct rdma_cm_event *cm_event;
>> >> >> >> >> >> -    int ret, idx;
>> >> >> >> >> >> +    int idx;
>> >> >> >> >> >>
>> >> >> >> >> >>      if (rdma->cm_id && rdma->connected) {
>> >> >> >> >> >>          if ((rdma->error_state ||
>> >> >> >> >> >> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>> >> >> >> >> >>              qemu_rdma_post_send_control(rdma, NULL, &head);
>> >> >> >> >> >>          }
>> >> >> >> >> >>
>> >> >> >> >> >> -        ret = rdma_disconnect(rdma->cm_id);
>> >> >> >> >> >> -        if (!ret) {
>> >> >> >> >> >> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>> >> >> >> >> >> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> >> >> >> >> >> -            if (!ret) {
>> >> >> >> >> >> -                rdma_ack_cm_event(cm_event);
>> >> >> >> >> >> -            }
>> >> >> >> >> >> -        }
>> >> >> >> >> >> +        rdma_disconnect(rdma->cm_id);
>> >> >> >> >> >
>> >> >> >> >> > I'm worried whether this change could break stuff:
>> >> >> >> >> > The docs say for rdma_disconnect that it flushes any posted work
>> >> >> >> >> > requests to the completion queue;  so unless we wait for the event
>> >> >> >> >> > do we know the stuff has been flushed?   In the normal non-cancel case
>> >> >> >> >> > I'm worried that means we could lose something.
>> >> >> >> >> > (But I don't know the rdma/infiniband specs well enough to know if it's
>> >> >> >> >> > really a problem).
>> >> >> >> >>
>> >> >> >> >> The qemu_fclose function invokes qemu_fflush(f) before invoking f->ops->close,
>> >> >> >> >> so I think it is safe for the normal migration case.
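
For what it is worth, one alternative would be to reap whatever
rdma_disconnect() flushes to the completion queue directly, instead of
blocking on rdma_get_cm_event(). This is only a sketch inside
migration/rdma.c, assuming the completion queue is reachable as rdma->cq and
with a helper name invented here:

static void qemu_rdma_drain_cq_sketch(RDMAContext *rdma)
{
    struct ibv_wc wc;

    /* ibv_poll_cq() returns the number of completions it reaped; 0 means
     * the CQ is currently empty.  A real version would also check
     * wc.status for errors instead of discarding the completions. */
    while (ibv_poll_cq(rdma->cq, 1, &wc) > 0) {
        /* discard */
    }
}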
>> >> >> >> >>
>> >> >> >> >> As for the root cause of why the RDMA_CM_EVENT_DISCONNECTED event is not
>> >> >> >> >> received after rdma_disconnect, I am looping in
>> >> >> >> >> Aviad Yehezkel <aviadye@mellanox.com>; Aviad will help us find the root cause.
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > Dave
>> >> >> >> >> >
>> >> >> >> >> >>          trace_qemu_rdma_cleanup_disconnect();
>> >> >> >> >> >>          rdma->connected = false;
>> >> >> >> >> >>      }
>> >> >> >> >> >> diff --git a/migration/trace-events b/migration/trace-events
>> >> >> >> >> >> index d6be74b..64573ff 100644
>> >> >> >> >> >> --- a/migration/trace-events
>> >> >> >> >> >> +++ b/migration/trace-events
>> >> >> >> >> >> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>> >> >> >> >> >>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>> >> >> >> >> >>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>> >> >> >> >> >>  qemu_rdma_cleanup_disconnect(void) ""
>> >> >> >> >> >> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>> >> >> >> >> >>  qemu_rdma_close(void) ""
>> >> >> >> >> >>  qemu_rdma_connect_pin_all_requested(void) ""
>> >> >> >> >> >>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>> >> >> >> >> >> --
>> >> >> >> >> >> 1.8.3.1
>> >> >> >> >> >>
>> >> >> >> >> > --
>> >> >> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> >> >> >> > --
>> >> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> >> >> > --
>> >> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> >> > --
>> >> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> > --
>> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
