qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] enable fsdax rdma migration
@ 2021-07-31 14:03 Li Zhijian
  2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Li Zhijian @ 2021-07-31 14:03 UTC (permalink / raw)
  To: quintela, dgilbert, qemu-devel; +Cc: Li Zhijian

Previous qemu face 2 problems when migrating a fsdax memory backend with
RDMA protocol.
(1) ibv_reg_mr failed with Operation not supported
(2) requester(source) side could receive RNR NAK.

For the (1), we can try to register memory region with ODP feature which
has already been implemented in some modern HCA hardware/drivers.
For the (2), IB provides advise API to prefetch pages in specific memory
region. It can help driver reduce the page fault on responder(destination)
side during RDMA_WRITE.

Li Zhijian (2):
  migration/rdma: Try to register On-Demand Paging memory region
  migration/rdma: advise prefetch write for ODP region

 migration/rdma.c       | 67 ++++++++++++++++++++++++++++++++++++------
 migration/trace-events |  2 ++
 2 files changed, 60 insertions(+), 9 deletions(-)

-- 
2.31.1





^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region
  2021-07-31 14:03 [PATCH 0/2] enable fsdax rdma migration Li Zhijian
@ 2021-07-31 14:03 ` Li Zhijian
  2021-08-22  8:53   ` Marcel Apfelbaum
  2021-07-31 14:03 ` [PATCH 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian
  2021-08-16  2:10 ` [PATCH 0/2] enable fsdax rdma migration lizhijian
  2 siblings, 1 reply; 9+ messages in thread
From: Li Zhijian @ 2021-07-31 14:03 UTC (permalink / raw)
  To: quintela, dgilbert, qemu-devel; +Cc: Li Zhijian

Previously, for the fsdax mem-backend-file, it will register failed with
Operation not supported. In this case, we can try to register it with
On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2].

[1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
[2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 migration/rdma.c       | 27 ++++++++++++++++++---------
 migration/trace-events |  1 +
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 5c2d113aa94..8784b5f22a6 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
     RDMALocalBlocks *local = &rdma->local_ram_blocks;
 
     for (i = 0; i < local->nb_blocks; i++) {
+        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
+
+on_demand:
         local->block[i].mr =
             ibv_reg_mr(rdma->pd,
                     local->block[i].local_host_addr,
-                    local->block[i].length,
-                    IBV_ACCESS_LOCAL_WRITE |
-                    IBV_ACCESS_REMOTE_WRITE
+                    local->block[i].length, access
                     );
         if (!local->block[i].mr) {
-            perror("Failed to register local dest ram block!");
+            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
+                access |= IBV_ACCESS_ON_DEMAND;
+                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
+                goto on_demand;
+            }
+            perror("Failed to register local dest ram block!\n");
             break;
         }
         rdma->total_registrations++;
@@ -1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
      */
     if (!block->pmr[chunk]) {
         uint64_t len = chunk_end - chunk_start;
+        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0;
 
         trace_qemu_rdma_register_and_get_keys(len, chunk_start);
 
-        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
-                chunk_start, len,
-                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
-                        IBV_ACCESS_REMOTE_WRITE) : 0));
-
+on_demand:
+        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
         if (!block->pmr[chunk]) {
+            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
+                access |= IBV_ACCESS_ON_DEMAND;
+                trace_qemu_rdma_register_odp_mr(block->block_name);
+                goto on_demand;
+            }
             perror("Failed to register chunk!");
             fprintf(stderr, "Chunk details: block: %d chunk index %d"
                             " start %" PRIuPTR " end %" PRIuPTR
diff --git a/migration/trace-events b/migration/trace-events
index a1c0f034ab8..5f6aa580def 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block
 qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
 qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
 qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
+qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
 qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
 qemu_rdma_registration_handle_finished(void) ""
 qemu_rdma_registration_handle_ram_blocks(void) ""
-- 
2.31.1





^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] migration/rdma: advise prefetch write for ODP region
  2021-07-31 14:03 [PATCH 0/2] enable fsdax rdma migration Li Zhijian
  2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
@ 2021-07-31 14:03 ` Li Zhijian
  2021-08-22  8:39   ` Marcel Apfelbaum
  2021-08-16  2:10 ` [PATCH 0/2] enable fsdax rdma migration lizhijian
  2 siblings, 1 reply; 9+ messages in thread
From: Li Zhijian @ 2021-07-31 14:03 UTC (permalink / raw)
  To: quintela, dgilbert, qemu-devel; +Cc: Li Zhijian

The responder mr registering with ODP will sent RNR NAK back to
the requester in the face of the page fault.
---------
ibv_poll_cq wc.status=13 RNR retry counter exceeded!
ibv_poll_cq wrid=WRITE RDMA!
---------
ibv_advise_mr(3) helps to make pages present before the actual IO is
conducted so that the responder does page fault as little as possible.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 migration/rdma.c       | 40 ++++++++++++++++++++++++++++++++++++++++
 migration/trace-events |  1 +
 2 files changed, 41 insertions(+)

diff --git a/migration/rdma.c b/migration/rdma.c
index 8784b5f22a6..a2ad00d665f 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1117,6 +1117,30 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
     return 0;
 }
 
+/*
+ * ibv_advise_mr to avoid RNR NAK error as far as possible.
+ * The responder mr registering with ODP will sent RNR NAK back to
+ * the requester in the face of the page fault.
+ */
+static void qemu_rdma_advise_prefetch_write_mr(struct ibv_pd *pd, uint64_t addr,
+                                               uint32_t len,  uint32_t lkey,
+                                               const char *name, bool wr)
+{
+    int ret;
+    int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
+                 IBV_ADVISE_MR_ADVICE_PREFETCH;
+    struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
+
+    ret = ibv_advise_mr(pd, advice,
+                        IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
+    /* ignore the error */
+    if (ret) {
+        trace_qemu_rdma_advise_mr(name, len, addr, strerror(errno));
+    } else {
+        trace_qemu_rdma_advise_mr(name, len, addr, "successed");
+    }
+}
+
 static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
 {
     int i;
@@ -1140,6 +1164,17 @@ on_demand:
             perror("Failed to register local dest ram block!\n");
             break;
         }
+
+        if (access & IBV_ACCESS_ON_DEMAND) {
+            qemu_rdma_advise_prefetch_write_mr(rdma->pd,
+                                               (uintptr_t)
+                                               local->block[i].local_host_addr,
+                                               local->block[i].length,
+                                               local->block[i].mr->lkey,
+                                               local->block[i].block_name,
+                                               true);
+        }
+
         rdma->total_registrations++;
     }
 
@@ -1244,6 +1279,11 @@ on_demand:
                             rdma->total_registrations);
             return -1;
         }
+        if (access & IBV_ACCESS_ON_DEMAND) {
+            qemu_rdma_advise_prefetch_write_mr(rdma->pd, (uintptr_t)chunk_start,
+                                               len, block->pmr[chunk]->lkey,
+                                               block->block_name, rkey);
+        }
         rdma->total_registrations++;
     }
 
diff --git a/migration/trace-events b/migration/trace-events
index 5f6aa580def..901c1d54c12 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -213,6 +213,7 @@ qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other complet
 qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
 qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
 qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
+qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch write at %" PRIu32 "@0x%" PRIx64 ": %s"
 qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
 qemu_rdma_registration_handle_finished(void) ""
 qemu_rdma_registration_handle_ram_blocks(void) ""
-- 
2.31.1





^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2] enable fsdax rdma migration
  2021-07-31 14:03 [PATCH 0/2] enable fsdax rdma migration Li Zhijian
  2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
  2021-07-31 14:03 ` [PATCH 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian
@ 2021-08-16  2:10 ` lizhijian
  2 siblings, 0 replies; 9+ messages in thread
From: lizhijian @ 2021-08-16  2:10 UTC (permalink / raw)
  To: lizhijian, quintela, dgilbert, qemu-devel


ping...

Hey Dave, could you help to take a look :)

Thanks
Zhijian


On 31/07/2021 22:03, Li Zhijian wrote:
> Previous qemu face 2 problems when migrating a fsdax memory backend with
> RDMA protocol.
> (1) ibv_reg_mr failed with Operation not supported
> (2) requester(source) side could receive RNR NAK.
>
> For the (1), we can try to register memory region with ODP feature which
> has already been implemented in some modern HCA hardware/drivers.
> For the (2), IB provides advise API to prefetch pages in specific memory
> region. It can help driver reduce the page fault on responder(destination)
> side during RDMA_WRITE.
>
> Li Zhijian (2):
>    migration/rdma: Try to register On-Demand Paging memory region
>    migration/rdma: advise prefetch write for ODP region
>
>   migration/rdma.c       | 67 ++++++++++++++++++++++++++++++++++++------
>   migration/trace-events |  2 ++
>   2 files changed, 60 insertions(+), 9 deletions(-)
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] migration/rdma: advise prefetch write for ODP region
  2021-07-31 14:03 ` [PATCH 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian
@ 2021-08-22  8:39   ` Marcel Apfelbaum
  2021-08-23  1:21     ` lizhijian
  0 siblings, 1 reply; 9+ messages in thread
From: Marcel Apfelbaum @ 2021-08-22  8:39 UTC (permalink / raw)
  To: Li Zhijian; +Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela

Hi,

On Sat, Jul 31, 2021 at 5:03 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
>
> The responder mr registering with ODP will sent RNR NAK back to
> the requester in the face of the page fault.
> ---------
> ibv_poll_cq wc.status=13 RNR retry counter exceeded!
> ibv_poll_cq wrid=WRITE RDMA!
> ---------
> ibv_advise_mr(3) helps to make pages present before the actual IO is
> conducted so that the responder does page fault as little as possible.
>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
>  migration/rdma.c       | 40 ++++++++++++++++++++++++++++++++++++++++
>  migration/trace-events |  1 +
>  2 files changed, 41 insertions(+)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 8784b5f22a6..a2ad00d665f 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1117,6 +1117,30 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>      return 0;
>  }
>
> +/*
> + * ibv_advise_mr to avoid RNR NAK error as far as possible.
> + * The responder mr registering with ODP will sent RNR NAK back to
> + * the requester in the face of the page fault.
> + */
> +static void qemu_rdma_advise_prefetch_write_mr(struct ibv_pd *pd, uint64_t addr,
> +                                               uint32_t len,  uint32_t lkey,
> +                                               const char *name, bool wr)
> +{
> +    int ret;
> +    int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
> +                 IBV_ADVISE_MR_ADVICE_PREFETCH;
> +    struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
> +
> +    ret = ibv_advise_mr(pd, advice,
> +                        IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
> +    /* ignore the error */
> +    if (ret) {
> +        trace_qemu_rdma_advise_mr(name, len, addr, strerror(errno));
> +    } else {
> +        trace_qemu_rdma_advise_mr(name, len, addr, "successed");
> +    }
> +}
> +
>  static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>  {
>      int i;
> @@ -1140,6 +1164,17 @@ on_demand:
>              perror("Failed to register local dest ram block!\n");
>              break;
>          }
> +
> +        if (access & IBV_ACCESS_ON_DEMAND) {
> +            qemu_rdma_advise_prefetch_write_mr(rdma->pd,
> +                                               (uintptr_t)
> +                                               local->block[i].local_host_addr,
> +                                               local->block[i].length,
> +                                               local->block[i].mr->lkey,
> +                                               local->block[i].block_name,
> +                                               true);
> +        }
> +
>          rdma->total_registrations++;
>      }
>
> @@ -1244,6 +1279,11 @@ on_demand:
>                              rdma->total_registrations);
>              return -1;
>          }
> +        if (access & IBV_ACCESS_ON_DEMAND) {
> +            qemu_rdma_advise_prefetch_write_mr(rdma->pd, (uintptr_t)chunk_start,
> +                                               len, block->pmr[chunk]->lkey,
> +                                               block->block_name, rkey);
> +        }
>          rdma->total_registrations++;
>      }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index 5f6aa580def..901c1d54c12 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -213,6 +213,7 @@ qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other complet
>  qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
>  qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
>  qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
> +qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch write at %" PRIu32 "@0x%" PRIx64 ": %s"
>  qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
>  qemu_rdma_registration_handle_finished(void) ""
>  qemu_rdma_registration_handle_ram_blocks(void) ""
> --
> 2.31.1
>

Following https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/man/ibv_advise_mr.3.md
it looks like it is a best-effort optimization,
I don't see any down-sides to it.
However it seems like it is recommended to use
IBV_ADVISE_MR_FLAG_FLUSH in order to
increase the optimization chances.

Anyway

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region
  2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
@ 2021-08-22  8:53   ` Marcel Apfelbaum
  2021-08-23  1:41     ` lizhijian
  0 siblings, 1 reply; 9+ messages in thread
From: Marcel Apfelbaum @ 2021-08-22  8:53 UTC (permalink / raw)
  To: Li Zhijian; +Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela

Hi

On Sat, Jul 31, 2021 at 5:00 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
>
> Previously, for the fsdax mem-backend-file, it will register failed with
> Operation not supported. In this case, we can try to register it with
> On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2].
>
> [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
> [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
>  migration/rdma.c       | 27 ++++++++++++++++++---------
>  migration/trace-events |  1 +
>  2 files changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 5c2d113aa94..8784b5f22a6 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>      RDMALocalBlocks *local = &rdma->local_ram_blocks;
>
>      for (i = 0; i < local->nb_blocks; i++) {
> +        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
> +
> +on_demand:
>          local->block[i].mr =
>              ibv_reg_mr(rdma->pd,
>                      local->block[i].local_host_addr,
> -                    local->block[i].length,
> -                    IBV_ACCESS_LOCAL_WRITE |
> -                    IBV_ACCESS_REMOTE_WRITE
> +                    local->block[i].length, access
>                      );
>          if (!local->block[i].mr) {
> -            perror("Failed to register local dest ram block!");
> +            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
> +                access |= IBV_ACCESS_ON_DEMAND;
> +                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
> +                goto on_demand;

Wouldn't it be better to check first if the device supports ODP ?
Something like:
    ret = ibv_exp_query_device(context, &dattr);
    if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)...

Also, I  am not (personally) too fond of the "on_demand" label usage here,
however I will let the maintainer/others decide.

Thanks,
Marcel

> +            }
> +            perror("Failed to register local dest ram block!\n");
>              break;
>          }
>          rdma->total_registrations++;
> @@ -1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
>       */
>      if (!block->pmr[chunk]) {
>          uint64_t len = chunk_end - chunk_start;
> +        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0;
>
>          trace_qemu_rdma_register_and_get_keys(len, chunk_start);
>
> -        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
> -                chunk_start, len,
> -                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
> -                        IBV_ACCESS_REMOTE_WRITE) : 0));
> -
> +on_demand:
> +        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
>          if (!block->pmr[chunk]) {
> +            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
> +                access |= IBV_ACCESS_ON_DEMAND;
> +                trace_qemu_rdma_register_odp_mr(block->block_name);
> +                goto on_demand;
> +            }
>              perror("Failed to register chunk!");
>              fprintf(stderr, "Chunk details: block: %d chunk index %d"
>                              " start %" PRIuPTR " end %" PRIuPTR
> diff --git a/migration/trace-events b/migration/trace-events
> index a1c0f034ab8..5f6aa580def 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block
>  qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
>  qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
>  qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
> +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
>  qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
>  qemu_rdma_registration_handle_finished(void) ""
>  qemu_rdma_registration_handle_ram_blocks(void) ""
> --
> 2.31.1
>
>
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] migration/rdma: advise prefetch write for ODP region
  2021-08-22  8:39   ` Marcel Apfelbaum
@ 2021-08-23  1:21     ` lizhijian
  0 siblings, 0 replies; 9+ messages in thread
From: lizhijian @ 2021-08-23  1:21 UTC (permalink / raw)
  To: Marcel Apfelbaum, lizhijian
  Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela

Hi Marcel


On 22/08/2021 16:39, Marcel Apfelbaum wrote:
> Hi,
>
> On Sat, Jul 31, 2021 at 5:03 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
>> The responder mr registering with ODP will sent RNR NAK back to
>> the requester in the face of the page fault.
>> ---------
>> ibv_poll_cq wc.status=13 RNR retry counter exceeded!
>> ibv_poll_cq wrid=WRITE RDMA!
>> ---------
>> ibv_advise_mr(3) helps to make pages present before the actual IO is
>> conducted so that the responder does page fault as little as possible.
>>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>>   migration/rdma.c       | 40 ++++++++++++++++++++++++++++++++++++++++
>>   migration/trace-events |  1 +
>>   2 files changed, 41 insertions(+)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 8784b5f22a6..a2ad00d665f 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1117,6 +1117,30 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>>       return 0;
>>   }
>>
>> +/*
>> + * ibv_advise_mr to avoid RNR NAK error as far as possible.
>> + * The responder mr registering with ODP will sent RNR NAK back to
>> + * the requester in the face of the page fault.
>> + */
>> +static void qemu_rdma_advise_prefetch_write_mr(struct ibv_pd *pd, uint64_t addr,
>> +                                               uint32_t len,  uint32_t lkey,
>> +                                               const char *name, bool wr)
>> +{
>> +    int ret;
>> +    int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
>> +                 IBV_ADVISE_MR_ADVICE_PREFETCH;
>> +    struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
>> +
>> +    ret = ibv_advise_mr(pd, advice,
>> +                        IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
>> +    /* ignore the error */
> Following https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/man/ibv_advise_mr.3.md
> it looks like it is a best-effort optimization,
> I don't see any down-sides to it.
> However it seems like it is recommended to use
> IBV_ADVISE_MR_FLAG_FLUSH in order to
> increase the optimization chances.
Good catch,  i will update it soon.


Thanks

>
> Anyway
>
> Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>
> Thanks,
> Marcel
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region
  2021-08-22  8:53   ` Marcel Apfelbaum
@ 2021-08-23  1:41     ` lizhijian
  2021-08-23  7:03       ` Marcel Apfelbaum
  0 siblings, 1 reply; 9+ messages in thread
From: lizhijian @ 2021-08-23  1:41 UTC (permalink / raw)
  To: Marcel Apfelbaum, lizhijian
  Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela



On 22/08/2021 16:53, Marcel Apfelbaum wrote:
> Hi
>
> On Sat, Jul 31, 2021 at 5:00 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
>> Previously, for the fsdax mem-backend-file, it will register failed with
>> Operation not supported. In this case, we can try to register it with
>> On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2].
>>
>> [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
>> [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>>   migration/rdma.c       | 27 ++++++++++++++++++---------
>>   migration/trace-events |  1 +
>>   2 files changed, 19 insertions(+), 9 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 5c2d113aa94..8784b5f22a6 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>>       RDMALocalBlocks *local = &rdma->local_ram_blocks;
>>
>>       for (i = 0; i < local->nb_blocks; i++) {
>> +        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
>> +
>> +on_demand:
>>           local->block[i].mr =
>>               ibv_reg_mr(rdma->pd,
>>                       local->block[i].local_host_addr,
>> -                    local->block[i].length,
>> -                    IBV_ACCESS_LOCAL_WRITE |
>> -                    IBV_ACCESS_REMOTE_WRITE
>> +                    local->block[i].length, access
>>                       );
>>           if (!local->block[i].mr) {
>> -            perror("Failed to register local dest ram block!");
>> +            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
>> +                access |= IBV_ACCESS_ON_DEMAND;
>> +                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
>> +                goto on_demand;
> Wouldn't it be better to check first if the device supports ODP ?
> Something like:
>      ret = ibv_exp_query_device(context, &dattr);
>      if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)...

Good idea !



>
> Also, I  am not (personally) too fond of the "on_demand" label usage here,
> however I will let the maintainer/others decide.
Indeed, how just repeating the ibv_reg_mr() instead of a 'go to'

Thanks
Zhijian



>
> Thanks,
> Marcel
>
>> +            }
>> +            perror("Failed to register local dest ram block!\n");
>>               break;
>>           }
>>           rdma->total_registrations++;
>> @@ -1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
>>        */
>>       if (!block->pmr[chunk]) {
>>           uint64_t len = chunk_end - chunk_start;
>> +        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0;
>>
>>           trace_qemu_rdma_register_and_get_keys(len, chunk_start);
>>
>> -        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
>> -                chunk_start, len,
>> -                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
>> -                        IBV_ACCESS_REMOTE_WRITE) : 0));
>> -
>> +on_demand:
>> +        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
>>           if (!block->pmr[chunk]) {
>> +            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
>> +                access |= IBV_ACCESS_ON_DEMAND;
>> +                trace_qemu_rdma_register_odp_mr(block->block_name);
>> +                goto on_demand;
>> +            }
>>               perror("Failed to register chunk!");
>>               fprintf(stderr, "Chunk details: block: %d chunk index %d"
>>                               " start %" PRIuPTR " end %" PRIuPTR
>> diff --git a/migration/trace-events b/migration/trace-events
>> index a1c0f034ab8..5f6aa580def 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block
>>   qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
>>   qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
>>   qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
>> +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
>>   qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
>>   qemu_rdma_registration_handle_finished(void) ""
>>   qemu_rdma_registration_handle_ram_blocks(void) ""
>> --
>> 2.31.1
>>
>>
>>
>>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region
  2021-08-23  1:41     ` lizhijian
@ 2021-08-23  7:03       ` Marcel Apfelbaum
  0 siblings, 0 replies; 9+ messages in thread
From: Marcel Apfelbaum @ 2021-08-23  7:03 UTC (permalink / raw)
  To: lizhijian; +Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela

Hi Zhijian,

On Mon, Aug 23, 2021 at 4:41 AM lizhijian@fujitsu.com
<lizhijian@fujitsu.com> wrote:
>
>
>
> On 22/08/2021 16:53, Marcel Apfelbaum wrote:
> > Hi
> >
> > On Sat, Jul 31, 2021 at 5:00 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
> >> Previously, for the fsdax mem-backend-file, it will register failed with
> >> Operation not supported. In this case, we can try to register it with
> >> On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2].
> >>
> >> [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
> >> [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3
> >> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> >> ---
> >>   migration/rdma.c       | 27 ++++++++++++++++++---------
> >>   migration/trace-events |  1 +
> >>   2 files changed, 19 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/migration/rdma.c b/migration/rdma.c
> >> index 5c2d113aa94..8784b5f22a6 100644
> >> --- a/migration/rdma.c
> >> +++ b/migration/rdma.c
> >> @@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
> >>       RDMALocalBlocks *local = &rdma->local_ram_blocks;
> >>
> >>       for (i = 0; i < local->nb_blocks; i++) {
> >> +        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
> >> +
> >> +on_demand:
> >>           local->block[i].mr =
> >>               ibv_reg_mr(rdma->pd,
> >>                       local->block[i].local_host_addr,
> >> -                    local->block[i].length,
> >> -                    IBV_ACCESS_LOCAL_WRITE |
> >> -                    IBV_ACCESS_REMOTE_WRITE
> >> +                    local->block[i].length, access
> >>                       );
> >>           if (!local->block[i].mr) {
> >> -            perror("Failed to register local dest ram block!");
> >> +            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
> >> +                access |= IBV_ACCESS_ON_DEMAND;
> >> +                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
> >> +                goto on_demand;
> > Wouldn't it be better to check first if the device supports ODP ?
> > Something like:
> >      ret = ibv_exp_query_device(context, &dattr);
> >      if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)...
>
> Good idea !
>
>
>
> >
> > Also, I  am not (personally) too fond of the "on_demand" label usage here,
> > however I will let the maintainer/others decide.
> Indeed, how just repeating the ibv_reg_mr() instead of a 'go to'
>

Any "boring"/standard approach would do.

Thanks,
Marcel

> Thanks
> Zhijian
>
>
>
> >
> > Thanks,
> > Marcel
> >
> >> +            }
> >> +            perror("Failed to register local dest ram block!\n");
> >>               break;
> >>           }
> >>           rdma->total_registrations++;
> >> @@ -1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
> >>        */
> >>       if (!block->pmr[chunk]) {
> >>           uint64_t len = chunk_end - chunk_start;
> >> +        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0;
> >>
> >>           trace_qemu_rdma_register_and_get_keys(len, chunk_start);
> >>
> >> -        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
> >> -                chunk_start, len,
> >> -                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
> >> -                        IBV_ACCESS_REMOTE_WRITE) : 0));
> >> -
> >> +on_demand:
> >> +        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
> >>           if (!block->pmr[chunk]) {
> >> +            if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
> >> +                access |= IBV_ACCESS_ON_DEMAND;
> >> +                trace_qemu_rdma_register_odp_mr(block->block_name);
> >> +                goto on_demand;
> >> +            }
> >>               perror("Failed to register chunk!");
> >>               fprintf(stderr, "Chunk details: block: %d chunk index %d"
> >>                               " start %" PRIuPTR " end %" PRIuPTR
> >> diff --git a/migration/trace-events b/migration/trace-events
> >> index a1c0f034ab8..5f6aa580def 100644
> >> --- a/migration/trace-events
> >> +++ b/migration/trace-events
> >> @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block
> >>   qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
> >>   qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
> >>   qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
> >> +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
> >>   qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
> >>   qemu_rdma_registration_handle_finished(void) ""
> >>   qemu_rdma_registration_handle_ram_blocks(void) ""
> >> --
> >> 2.31.1
> >>
> >>
> >>
> >>
> >


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-08-23  7:04 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-31 14:03 [PATCH 0/2] enable fsdax rdma migration Li Zhijian
2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
2021-08-22  8:53   ` Marcel Apfelbaum
2021-08-23  1:41     ` lizhijian
2021-08-23  7:03       ` Marcel Apfelbaum
2021-07-31 14:03 ` [PATCH 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian
2021-08-22  8:39   ` Marcel Apfelbaum
2021-08-23  1:21     ` lizhijian
2021-08-16  2:10 ` [PATCH 0/2] enable fsdax rdma migration lizhijian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).