qemu-devel.nongnu.org archive mirror
* [PATCH v2 0/2] enable fsdax rdma migration
@ 2021-08-23  3:33 Li Zhijian
  2021-08-23  3:33 ` [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Li Zhijian @ 2021-08-23  3:33 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel, Li Zhijian

Previous QEMU versions face two problems when migrating an fsdax memory
backend over the RDMA protocol:
(1) ibv_reg_mr() fails with "Operation not supported".
(2) The requester (source) side can receive an RNR NAK.

For (1), we can try to register the memory region with the ODP feature,
which is already implemented in some modern HCA hardware/drivers.
For (2), IB provides an advise API to prefetch pages in a specific memory
region. It helps the driver reduce page faults on the responder
(destination) side during RDMA_WRITE.

CC: marcel.apfelbaum@gmail.com

Li Zhijian (2):
  migration/rdma: Try to register On-Demand Paging memory region
  migration/rdma: advise prefetch write for ODP region

 migration/rdma.c       | 117 +++++++++++++++++++++++++++++++++--------
 migration/trace-events |   2 +
 2 files changed, 98 insertions(+), 21 deletions(-)

-- 
2.31.1


* [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region
  2021-08-23  3:33 [PATCH v2 0/2] enable fsdax rdma migration Li Zhijian
@ 2021-08-23  3:33 ` Li Zhijian
  2021-08-23  8:42   ` lizhijian
  2021-08-23  3:33 ` [PATCH v2 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian
  2021-08-23  8:41 ` [PATCH v2 0/2] enable fsdax rdma migration lizhijian
  2 siblings, 1 reply; 8+ messages in thread
From: Li Zhijian @ 2021-08-23  3:33 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel, Li Zhijian

Previously, registering an fsdax mem-backend-file failed with
"Operation not supported". In this case, we can try to register it with
On-Demand Paging[1], as rpma_mr_reg() does in librpma[2].

[1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
[2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3

CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>

---
V2: add ODP sanity check and remove goto
---
 migration/rdma.c       | 73 ++++++++++++++++++++++++++++++------------
 migration/trace-events |  1 +
 2 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 5c2d113aa94..eb80431aae2 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1117,19 +1117,47 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
     return 0;
 }
 
+/* Check whether On-Demand Paging is supported by the RDMA device */
+static bool rdma_support_odp(struct ibv_context *dev)
+{
+    struct ibv_device_attr_ex attr = {0};
+    int ret = ibv_query_device_ex(dev, NULL, &attr);
+    if (ret) {
+        return false;
+    }
+
+    if (attr.odp_caps.general_caps & IBV_ODP_SUPPORT) {
+        return true;
+    }
+
+    return false;
+}
+
 static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
 {
     int i;
     RDMALocalBlocks *local = &rdma->local_ram_blocks;
 
     for (i = 0; i < local->nb_blocks; i++) {
+        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
+
         local->block[i].mr =
             ibv_reg_mr(rdma->pd,
                     local->block[i].local_host_addr,
-                    local->block[i].length,
-                    IBV_ACCESS_LOCAL_WRITE |
-                    IBV_ACCESS_REMOTE_WRITE
+                    local->block[i].length, access
                     );
+
+        if (!local->block[i].mr &&
+            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
+                access |= IBV_ACCESS_ON_DEMAND;
+                /* register ODP mr */
+                local->block[i].mr =
+                    ibv_reg_mr(rdma->pd,
+                               local->block[i].local_host_addr,
+                               local->block[i].length, access);
+                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
+        }
+
         if (!local->block[i].mr) {
             perror("Failed to register local dest ram block!");
             break;
@@ -1215,28 +1243,33 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
      */
     if (!block->pmr[chunk]) {
         uint64_t len = chunk_end - chunk_start;
+        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE :
+                     0;
 
         trace_qemu_rdma_register_and_get_keys(len, chunk_start);
 
-        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
-                chunk_start, len,
-                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
-                        IBV_ACCESS_REMOTE_WRITE) : 0));
-
-        if (!block->pmr[chunk]) {
-            perror("Failed to register chunk!");
-            fprintf(stderr, "Chunk details: block: %d chunk index %d"
-                            " start %" PRIuPTR " end %" PRIuPTR
-                            " host %" PRIuPTR
-                            " local %" PRIuPTR " registrations: %d\n",
-                            block->index, chunk, (uintptr_t)chunk_start,
-                            (uintptr_t)chunk_end, host_addr,
-                            (uintptr_t)block->local_host_addr,
-                            rdma->total_registrations);
-            return -1;
+        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
+        if (!block->pmr[chunk] &&
+            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
+            access |= IBV_ACCESS_ON_DEMAND;
+            /* register ODP mr */
+            block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
+            trace_qemu_rdma_register_odp_mr(block->block_name);
         }
-        rdma->total_registrations++;
     }
+    if (!block->pmr[chunk]) {
+        perror("Failed to register chunk!");
+        fprintf(stderr, "Chunk details: block: %d chunk index %d"
+                        " start %" PRIuPTR " end %" PRIuPTR
+                        " host %" PRIuPTR
+                        " local %" PRIuPTR " registrations: %d\n",
+                        block->index, chunk, (uintptr_t)chunk_start,
+                        (uintptr_t)chunk_end, host_addr,
+                        (uintptr_t)block->local_host_addr,
+                        rdma->total_registrations);
+        return -1;
+    }
+    rdma->total_registrations++;
 
     if (lkey) {
         *lkey = block->pmr[chunk]->lkey;
diff --git a/migration/trace-events b/migration/trace-events
index a1c0f034ab8..5f6aa580def 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block
 qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
 qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
 qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
+qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
 qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
 qemu_rdma_registration_handle_finished(void) ""
 qemu_rdma_registration_handle_ram_blocks(void) ""
-- 
2.31.1






* [PATCH v2 2/2] migration/rdma: advise prefetch write for ODP region
  2021-08-23  3:33 [PATCH v2 0/2] enable fsdax rdma migration Li Zhijian
  2021-08-23  3:33 ` [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
@ 2021-08-23  3:33 ` Li Zhijian
  2021-08-23  8:42   ` lizhijian
  2021-08-23  8:41 ` [PATCH v2 0/2] enable fsdax rdma migration lizhijian
  2 siblings, 1 reply; 8+ messages in thread
From: Li Zhijian @ 2021-08-23  3:33 UTC (permalink / raw)
  To: quintela, dgilbert; +Cc: qemu-devel, Li Zhijian

A responder MR registered with ODP will send an RNR NAK back to
the requester when a page fault occurs:
---------
ibv_poll_cq wc.status=13 RNR retry counter exceeded!
ibv_poll_cq wrid=WRITE RDMA!
---------
ibv_advise_mr(3) helps make the pages present before the actual I/O is
issued, so that the responder page-faults as little as possible.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

---
V2: use IBV_ADVISE_MR_FLAG_FLUSH instead of IB_UVERBS_ADVISE_MR_FLAG_FLUSH
    and add Reviewed-by tag. # Marcel
---
 migration/rdma.c       | 40 ++++++++++++++++++++++++++++++++++++++++
 migration/trace-events |  1 +
 2 files changed, 41 insertions(+)

diff --git a/migration/rdma.c b/migration/rdma.c
index eb80431aae2..6c2cc3f617c 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1133,6 +1133,30 @@ static bool rdma_support_odp(struct ibv_context *dev)
     return false;
 }
 
+/*
+ * ibv_advise_mr() to avoid RNR NAK errors as far as possible.
+ * A responder MR registered with ODP will send an RNR NAK back to
+ * the requester when a page fault occurs.
+ */
+static void qemu_rdma_advise_prefetch_mr(struct ibv_pd *pd, uint64_t addr,
+                                         uint32_t len,  uint32_t lkey,
+                                         const char *name, bool wr)
+{
+    int ret;
+    int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
+                 IBV_ADVISE_MR_ADVICE_PREFETCH;
+    struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
+
+    ret = ibv_advise_mr(pd, advice,
+                        IBV_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
+    /* ignore the error */
+    if (ret) {
+        trace_qemu_rdma_advise_mr(name, len, addr, strerror(errno));
+    } else {
+        trace_qemu_rdma_advise_mr(name, len, addr, "succeeded");
+    }
+}
+
 static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
 {
     int i;
@@ -1156,6 +1180,15 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
                                local->block[i].local_host_addr,
                                local->block[i].length, access);
                 trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
+
+                if (local->block[i].mr) {
+                    qemu_rdma_advise_prefetch_mr(rdma->pd,
+                                    (uintptr_t)local->block[i].local_host_addr,
+                                    local->block[i].length,
+                                    local->block[i].mr->lkey,
+                                    local->block[i].block_name,
+                                    true);
+                }
         }
 
         if (!local->block[i].mr) {
@@ -1255,6 +1288,13 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
             /* register ODP mr */
             block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
             trace_qemu_rdma_register_odp_mr(block->block_name);
+
+            if (block->pmr[chunk]) {
+                qemu_rdma_advise_prefetch_mr(rdma->pd, (uintptr_t)chunk_start,
+                                            len, block->pmr[chunk]->lkey,
+                                            block->block_name, rkey);
+
+            }
         }
     }
     if (!block->pmr[chunk]) {
diff --git a/migration/trace-events b/migration/trace-events
index 5f6aa580def..a8ae163707c 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -213,6 +213,7 @@ qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other complet
 qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
 qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
 qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
+qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch at %" PRIu32 "@0x%" PRIx64 ": %s"
 qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
 qemu_rdma_registration_handle_finished(void) ""
 qemu_rdma_registration_handle_ram_blocks(void) ""
-- 
2.31.1






* Re: [PATCH v2 0/2] enable fsdax rdma migration
  2021-08-23  3:33 [PATCH v2 0/2] enable fsdax rdma migration Li Zhijian
  2021-08-23  3:33 ` [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
  2021-08-23  3:33 ` [PATCH v2 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian
@ 2021-08-23  8:41 ` lizhijian
  2021-08-23  8:53   ` Marcel Apfelbaum
  2 siblings, 1 reply; 8+ messages in thread
From: lizhijian @ 2021-08-23  8:41 UTC (permalink / raw)
  To: lizhijian, quintela, dgilbert, Marcel Apfelbaum; +Cc: qemu-devel

CCing  Marcel


On 23/08/2021 11:33, Li Zhijian wrote:
> [cover letter quoted in full; snipped]


* Re: [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region
  2021-08-23  3:33 ` [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian
@ 2021-08-23  8:42   ` lizhijian
  2021-08-23  8:52     ` Marcel Apfelbaum
  0 siblings, 1 reply; 8+ messages in thread
From: lizhijian @ 2021-08-23  8:42 UTC (permalink / raw)
  To: lizhijian, quintela, dgilbert, Marcel Apfelbaum; +Cc: qemu-devel

CCing  Marcel


On 23/08/2021 11:33, Li Zhijian wrote:
> [patch quoted in full; snipped]


* Re: [PATCH v2 2/2] migration/rdma: advise prefetch write for ODP region
  2021-08-23  3:33 ` [PATCH v2 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian
@ 2021-08-23  8:42   ` lizhijian
  0 siblings, 0 replies; 8+ messages in thread
From: lizhijian @ 2021-08-23  8:42 UTC (permalink / raw)
  To: lizhijian, quintela, dgilbert, Marcel Apfelbaum; +Cc: qemu-devel

CCing Marcel


On 23/08/2021 11:33, Li Zhijian wrote:
> [patch quoted in full; snipped]


* Re: [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region
  2021-08-23  8:42   ` lizhijian
@ 2021-08-23  8:52     ` Marcel Apfelbaum
  0 siblings, 0 replies; 8+ messages in thread
From: Marcel Apfelbaum @ 2021-08-23  8:52 UTC (permalink / raw)
  To: lizhijian; +Cc: qemu-devel, dgilbert, quintela

Hi Zhijian,

On Mon, Aug 23, 2021 at 11:42 AM lizhijian@fujitsu.com
<lizhijian@fujitsu.com> wrote:
>
> CCing  Marcel
>
>
> On 23/08/2021 11:33, Li Zhijian wrote:
> > [patch quoted in full; snipped]

Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel



* Re: [PATCH v2 0/2] enable fsdax rdma migration
  2021-08-23  8:41 ` [PATCH v2 0/2] enable fsdax rdma migration lizhijian
@ 2021-08-23  8:53   ` Marcel Apfelbaum
  0 siblings, 0 replies; 8+ messages in thread
From: Marcel Apfelbaum @ 2021-08-23  8:53 UTC (permalink / raw)
  To: lizhijian; +Cc: qemu-devel, dgilbert, quintela

Hi Zhijian,

On Mon, Aug 23, 2021 at 11:41 AM lizhijian@fujitsu.com
<lizhijian@fujitsu.com> wrote:
>
> CCing  Marcel
>
>
> On 23/08/2021 11:33, Li Zhijian wrote:
> > [cover letter quoted in full; snipped]

Series
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>

Thanks,
Marcel



