* [PATCH 0/2] enable fsdax rdma migration @ 2021-07-31 14:03 Li Zhijian 2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Li Zhijian @ 2021-07-31 14:03 UTC (permalink / raw) To: quintela, dgilbert, qemu-devel; +Cc: Li Zhijian Previously, QEMU faced two problems when migrating a fsdax memory backend with the RDMA protocol: (1) ibv_reg_mr() failed with "Operation not supported"; (2) the requester (source) side could receive RNR NAKs. For (1), we can try to register the memory region with the ODP feature, which has already been implemented in some modern HCA hardware/drivers. For (2), IB provides an advise API to prefetch pages in a specific memory region. It can help the driver reduce page faults on the responder (destination) side during RDMA_WRITE. Li Zhijian (2): migration/rdma: Try to register On-Demand Paging memory region migration/rdma: advise prefetch write for ODP region migration/rdma.c | 67 ++++++++++++++++++++++++++++++++++++------ migration/trace-events | 2 ++ 2 files changed, 60 insertions(+), 9 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region 2021-07-31 14:03 [PATCH 0/2] enable fsdax rdma migration Li Zhijian @ 2021-07-31 14:03 ` Li Zhijian 2021-08-22 8:53 ` Marcel Apfelbaum 2021-07-31 14:03 ` [PATCH 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian 2021-08-16 2:10 ` [PATCH 0/2] enable fsdax rdma migration lizhijian 2 siblings, 1 reply; 9+ messages in thread From: Li Zhijian @ 2021-07-31 14:03 UTC (permalink / raw) To: quintela, dgilbert, qemu-devel; +Cc: Li Zhijian Previously, registering a fsdax mem-backend-file would fail with "Operation not supported". In this case, we can try to register it with On-Demand Paging[1], as rpma_mr_reg() does in rpma[2]. [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3 Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> --- migration/rdma.c | 27 ++++++++++++++++++--------- migration/trace-events | 1 + 2 files changed, 19 insertions(+), 9 deletions(-) diff --git a/migration/rdma.c b/migration/rdma.c index 5c2d113aa94..8784b5f22a6 100644 --- a/migration/rdma.c +++ b/migration/rdma.c @@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma) RDMALocalBlocks *local = &rdma->local_ram_blocks; for (i = 0; i < local->nb_blocks; i++) { + int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE; + +on_demand: local->block[i].mr = ibv_reg_mr(rdma->pd, local->block[i].local_host_addr, - local->block[i].length, - IBV_ACCESS_LOCAL_WRITE | - IBV_ACCESS_REMOTE_WRITE + local->block[i].length, access ); if (!local->block[i].mr) { - perror("Failed to register local dest ram block!"); + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { + access |= IBV_ACCESS_ON_DEMAND; + trace_qemu_rdma_register_odp_mr(local->block[i].block_name); + goto on_demand; + } + perror("Failed to register local dest ram block!\n"); break; } rdma->total_registrations++; @@ 
-1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma, */ if (!block->pmr[chunk]) { uint64_t len = chunk_end - chunk_start; + int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0; trace_qemu_rdma_register_and_get_keys(len, chunk_start); - block->pmr[chunk] = ibv_reg_mr(rdma->pd, - chunk_start, len, - (rkey ? (IBV_ACCESS_LOCAL_WRITE | - IBV_ACCESS_REMOTE_WRITE) : 0)); - +on_demand: + block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access); if (!block->pmr[chunk]) { + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { + access |= IBV_ACCESS_ON_DEMAND; + trace_qemu_rdma_register_odp_mr(block->block_name); + goto on_demand; + } perror("Failed to register chunk!"); fprintf(stderr, "Chunk details: block: %d chunk index %d" " start %" PRIuPTR " end %" PRIuPTR diff --git a/migration/trace-events b/migration/trace-events index a1c0f034ab8..5f6aa580def 100644 --- a/migration/trace-events +++ b/migration/trace-events @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d" qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.." qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p" +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s" qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64 qemu_rdma_registration_handle_finished(void) "" qemu_rdma_registration_handle_ram_blocks(void) "" -- 2.31.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region 2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian @ 2021-08-22 8:53 ` Marcel Apfelbaum 2021-08-23 1:41 ` lizhijian 0 siblings, 1 reply; 9+ messages in thread From: Marcel Apfelbaum @ 2021-08-22 8:53 UTC (permalink / raw) To: Li Zhijian; +Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela Hi On Sat, Jul 31, 2021 at 5:00 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote: > > Previously, for the fsdax mem-backend-file, it will register failed with > Operation not supported. In this case, we can try to register it with > On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2]. > > [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x > [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3 > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> > --- > migration/rdma.c | 27 ++++++++++++++++++--------- > migration/trace-events | 1 + > 2 files changed, 19 insertions(+), 9 deletions(-) > > diff --git a/migration/rdma.c b/migration/rdma.c > index 5c2d113aa94..8784b5f22a6 100644 > --- a/migration/rdma.c > +++ b/migration/rdma.c > @@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma) > RDMALocalBlocks *local = &rdma->local_ram_blocks; > > for (i = 0; i < local->nb_blocks; i++) { > + int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE; > + > +on_demand: > local->block[i].mr = > ibv_reg_mr(rdma->pd, > local->block[i].local_host_addr, > - local->block[i].length, > - IBV_ACCESS_LOCAL_WRITE | > - IBV_ACCESS_REMOTE_WRITE > + local->block[i].length, access > ); > if (!local->block[i].mr) { > - perror("Failed to register local dest ram block!"); > + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { > + access |= IBV_ACCESS_ON_DEMAND; > + trace_qemu_rdma_register_odp_mr(local->block[i].block_name); > + goto on_demand; Wouldn't it be better to check first if the 
device supports ODP ? Something like: ret = ibv_exp_query_device(context, &dattr); if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)... Also, I am not (personally) too fond of the "on_demand" label usage here, however I will let the maintainer/others decide. Thanks, Marcel > + } > + perror("Failed to register local dest ram block!\n"); > break; > } > rdma->total_registrations++; > @@ -1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma, > */ > if (!block->pmr[chunk]) { > uint64_t len = chunk_end - chunk_start; > + int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0; > > trace_qemu_rdma_register_and_get_keys(len, chunk_start); > > - block->pmr[chunk] = ibv_reg_mr(rdma->pd, > - chunk_start, len, > - (rkey ? (IBV_ACCESS_LOCAL_WRITE | > - IBV_ACCESS_REMOTE_WRITE) : 0)); > - > +on_demand: > + block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access); > if (!block->pmr[chunk]) { > + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { > + access |= IBV_ACCESS_ON_DEMAND; > + trace_qemu_rdma_register_odp_mr(block->block_name); > + goto on_demand; > + } > perror("Failed to register chunk!"); > fprintf(stderr, "Chunk details: block: %d chunk index %d" > " start %" PRIuPTR " end %" PRIuPTR > diff --git a/migration/trace-events b/migration/trace-events > index a1c0f034ab8..5f6aa580def 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block > qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d" > qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.." 
> qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p" > +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s" > qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64 > qemu_rdma_registration_handle_finished(void) "" > qemu_rdma_registration_handle_ram_blocks(void) "" > -- > 2.31.1 > > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region 2021-08-22 8:53 ` Marcel Apfelbaum @ 2021-08-23 1:41 ` lizhijian 2021-08-23 7:03 ` Marcel Apfelbaum 0 siblings, 1 reply; 9+ messages in thread From: lizhijian @ 2021-08-23 1:41 UTC (permalink / raw) To: Marcel Apfelbaum, lizhijian Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela On 22/08/2021 16:53, Marcel Apfelbaum wrote: > Hi > > On Sat, Jul 31, 2021 at 5:00 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote: >> Previously, for the fsdax mem-backend-file, it will register failed with >> Operation not supported. In this case, we can try to register it with >> On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2]. >> >> [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x >> [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3 >> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> >> --- >> migration/rdma.c | 27 ++++++++++++++++++--------- >> migration/trace-events | 1 + >> 2 files changed, 19 insertions(+), 9 deletions(-) >> >> diff --git a/migration/rdma.c b/migration/rdma.c >> index 5c2d113aa94..8784b5f22a6 100644 >> --- a/migration/rdma.c >> +++ b/migration/rdma.c >> @@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma) >> RDMALocalBlocks *local = &rdma->local_ram_blocks; >> >> for (i = 0; i < local->nb_blocks; i++) { >> + int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE; >> + >> +on_demand: >> local->block[i].mr = >> ibv_reg_mr(rdma->pd, >> local->block[i].local_host_addr, >> - local->block[i].length, >> - IBV_ACCESS_LOCAL_WRITE | >> - IBV_ACCESS_REMOTE_WRITE >> + local->block[i].length, access >> ); >> if (!local->block[i].mr) { >> - perror("Failed to register local dest ram block!"); >> + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { >> + access |= IBV_ACCESS_ON_DEMAND; >> + trace_qemu_rdma_register_odp_mr(local->block[i].block_name); >> + goto on_demand; > Wouldn't it be 
better to check first if the device supports ODP ? > Something like: > ret = ibv_exp_query_device(context, &dattr); > if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)... Good idea! > > Also, I am not (personally) too fond of the "on_demand" label usage here, > however I will let the maintainer/others decide. Indeed, how about just repeating the ibv_reg_mr() call instead of using a 'goto'? Thanks Zhijian > > Thanks, > Marcel > >> + } >> + perror("Failed to register local dest ram block!\n"); >> break; >> } >> rdma->total_registrations++; >> @@ -1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma, >> */ >> if (!block->pmr[chunk]) { >> uint64_t len = chunk_end - chunk_start; >> + int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0; >> >> trace_qemu_rdma_register_and_get_keys(len, chunk_start); >> >> - block->pmr[chunk] = ibv_reg_mr(rdma->pd, >> - chunk_start, len, >> - (rkey ? (IBV_ACCESS_LOCAL_WRITE | >> - IBV_ACCESS_REMOTE_WRITE) : 0)); >> - >> +on_demand: >> + block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access); >> if (!block->pmr[chunk]) { >> + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { >> + access |= IBV_ACCESS_ON_DEMAND; >> + trace_qemu_rdma_register_odp_mr(block->block_name); >> + goto on_demand; >> + } >> perror("Failed to register chunk!"); >> fprintf(stderr, "Chunk details: block: %d chunk index %d" >> " start %" PRIuPTR " end %" PRIuPTR >> diff --git a/migration/trace-events b/migration/trace-events >> index a1c0f034ab8..5f6aa580def 100644 >> --- a/migration/trace-events >> +++ b/migration/trace-events >> @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block >> qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d" >> qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.." 
>> qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p" >> +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s" >> qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64 >> qemu_rdma_registration_handle_finished(void) "" >> qemu_rdma_registration_handle_ram_blocks(void) "" >> -- >> 2.31.1 >> >> >> >> > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region 2021-08-23 1:41 ` lizhijian @ 2021-08-23 7:03 ` Marcel Apfelbaum 0 siblings, 0 replies; 9+ messages in thread From: Marcel Apfelbaum @ 2021-08-23 7:03 UTC (permalink / raw) To: lizhijian; +Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela Hi Zhijian, On Mon, Aug 23, 2021 at 4:41 AM lizhijian@fujitsu.com <lizhijian@fujitsu.com> wrote: > > > > On 22/08/2021 16:53, Marcel Apfelbaum wrote: > > Hi > > > > On Sat, Jul 31, 2021 at 5:00 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote: > >> Previously, for the fsdax mem-backend-file, it will register failed with > >> Operation not supported. In this case, we can try to register it with > >> On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2]. > >> > >> [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x > >> [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3 > >> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> > >> --- > >> migration/rdma.c | 27 ++++++++++++++++++--------- > >> migration/trace-events | 1 + > >> 2 files changed, 19 insertions(+), 9 deletions(-) > >> > >> diff --git a/migration/rdma.c b/migration/rdma.c > >> index 5c2d113aa94..8784b5f22a6 100644 > >> --- a/migration/rdma.c > >> +++ b/migration/rdma.c > >> @@ -1123,15 +1123,21 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma) > >> RDMALocalBlocks *local = &rdma->local_ram_blocks; > >> > >> for (i = 0; i < local->nb_blocks; i++) { > >> + int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE; > >> + > >> +on_demand: > >> local->block[i].mr = > >> ibv_reg_mr(rdma->pd, > >> local->block[i].local_host_addr, > >> - local->block[i].length, > >> - IBV_ACCESS_LOCAL_WRITE | > >> - IBV_ACCESS_REMOTE_WRITE > >> + local->block[i].length, access > >> ); > >> if (!local->block[i].mr) { > >> - perror("Failed to register local dest ram block!"); > >> + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { > >> + 
access |= IBV_ACCESS_ON_DEMAND; > >> + trace_qemu_rdma_register_odp_mr(local->block[i].block_name); > >> + goto on_demand; > > Wouldn't it be better to check first if the device supports ODP ? > > Something like: > > ret = ibv_exp_query_device(context, &dattr); > > if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)... > > Good idea ! > > > > > > > Also, I am not (personally) too fond of the "on_demand" label usage here, > > however I will let the maintainer/others decide. > Indeed, how just repeating the ibv_reg_mr() instead of a 'go to' > Any "boring"/standard approach would do. Thanks, Marcel > Thanks > Zhijian > > > > > > > Thanks, > > Marcel > > > >> + } > >> + perror("Failed to register local dest ram block!\n"); > >> break; > >> } > >> rdma->total_registrations++; > >> @@ -1215,15 +1221,18 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma, > >> */ > >> if (!block->pmr[chunk]) { > >> uint64_t len = chunk_end - chunk_start; > >> + int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE : 0; > >> > >> trace_qemu_rdma_register_and_get_keys(len, chunk_start); > >> > >> - block->pmr[chunk] = ibv_reg_mr(rdma->pd, > >> - chunk_start, len, > >> - (rkey ? 
(IBV_ACCESS_LOCAL_WRITE | > >> - IBV_ACCESS_REMOTE_WRITE) : 0)); > >> - > >> +on_demand: > >> + block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access); > >> if (!block->pmr[chunk]) { > >> + if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) { > >> + access |= IBV_ACCESS_ON_DEMAND; > >> + trace_qemu_rdma_register_odp_mr(block->block_name); > >> + goto on_demand; > >> + } > >> perror("Failed to register chunk!"); > >> fprintf(stderr, "Chunk details: block: %d chunk index %d" > >> " start %" PRIuPTR " end %" PRIuPTR > >> diff --git a/migration/trace-events b/migration/trace-events > >> index a1c0f034ab8..5f6aa580def 100644 > >> --- a/migration/trace-events > >> +++ b/migration/trace-events > >> @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block > >> qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d" > >> qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.." > >> qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p" > >> +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s" > >> qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64 > >> qemu_rdma_registration_handle_finished(void) "" > >> qemu_rdma_registration_handle_ram_blocks(void) "" > >> -- > >> 2.31.1 > >> > >> > >> > >> > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] migration/rdma: advise prefetch write for ODP region 2021-07-31 14:03 [PATCH 0/2] enable fsdax rdma migration Li Zhijian 2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian @ 2021-07-31 14:03 ` Li Zhijian 2021-08-22 8:39 ` Marcel Apfelbaum 2021-08-16 2:10 ` [PATCH 0/2] enable fsdax rdma migration lizhijian 2 siblings, 1 reply; 9+ messages in thread From: Li Zhijian @ 2021-07-31 14:03 UTC (permalink / raw) To: quintela, dgilbert, qemu-devel; +Cc: Li Zhijian A responder MR registered with ODP will send an RNR NAK back to the requester when a page fault occurs: --------- ibv_poll_cq wc.status=13 RNR retry counter exceeded! ibv_poll_cq wrid=WRITE RDMA! --------- ibv_advise_mr(3) helps to make pages present before the actual I/O is conducted, so that the responder page-faults as rarely as possible. Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> --- migration/rdma.c | 40 ++++++++++++++++++++++++++++++++++++++++ migration/trace-events | 1 + 2 files changed, 41 insertions(+) diff --git a/migration/rdma.c b/migration/rdma.c index 8784b5f22a6..a2ad00d665f 100644 --- a/migration/rdma.c +++ b/migration/rdma.c @@ -1117,6 +1117,30 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma) return 0; } +/* + * Use ibv_advise_mr to avoid RNR NAK errors as far as possible. + * A responder MR registered with ODP will send an RNR NAK back to + * the requester when a page fault occurs. + */ +static void qemu_rdma_advise_prefetch_write_mr(struct ibv_pd *pd, uint64_t addr, + uint32_t len, uint32_t lkey, + const char *name, bool wr) +{ + int ret; + int advice = wr ? 
IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE : + IBV_ADVISE_MR_ADVICE_PREFETCH; + struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len}; + + ret = ibv_advise_mr(pd, advice, + IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sg_list, 1); + /* ignore the error */ + if (ret) { + trace_qemu_rdma_advise_mr(name, len, addr, strerror(errno)); + } else { + trace_qemu_rdma_advise_mr(name, len, addr, "succeeded"); + } +} + static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma) { int i; @@ -1140,6 +1164,17 @@ on_demand: perror("Failed to register local dest ram block!\n"); break; } + + if (access & IBV_ACCESS_ON_DEMAND) { + qemu_rdma_advise_prefetch_write_mr(rdma->pd, + (uintptr_t) + local->block[i].local_host_addr, + local->block[i].length, + local->block[i].mr->lkey, + local->block[i].block_name, + true); + } + rdma->total_registrations++; } @@ -1244,6 +1279,11 @@ on_demand: rdma->total_registrations); return -1; } + if (access & IBV_ACCESS_ON_DEMAND) { + qemu_rdma_advise_prefetch_write_mr(rdma->pd, (uintptr_t)chunk_start, + len, block->pmr[chunk]->lkey, + block->block_name, rkey); + } rdma->total_registrations++; } diff --git a/migration/trace-events b/migration/trace-events index 5f6aa580def..901c1d54c12 100644 --- a/migration/trace-events +++ b/migration/trace-events @@ -213,6 +213,7 @@ qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other complet qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.." 
qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p" qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s" +qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch write at %" PRIu32 "@0x%" PRIx64 ": %s" qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64 qemu_rdma_registration_handle_finished(void) "" qemu_rdma_registration_handle_ram_blocks(void) "" -- 2.31.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] migration/rdma: advise prefetch write for ODP region 2021-07-31 14:03 ` [PATCH 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian @ 2021-08-22 8:39 ` Marcel Apfelbaum 2021-08-23 1:21 ` lizhijian 0 siblings, 1 reply; 9+ messages in thread From: Marcel Apfelbaum @ 2021-08-22 8:39 UTC (permalink / raw) To: Li Zhijian; +Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela Hi, On Sat, Jul 31, 2021 at 5:03 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote: > > The responder mr registering with ODP will sent RNR NAK back to > the requester in the face of the page fault. > --------- > ibv_poll_cq wc.status=13 RNR retry counter exceeded! > ibv_poll_cq wrid=WRITE RDMA! > --------- > ibv_advise_mr(3) helps to make pages present before the actual IO is > conducted so that the responder does page fault as little as possible. > > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> > --- > migration/rdma.c | 40 ++++++++++++++++++++++++++++++++++++++++ > migration/trace-events | 1 + > 2 files changed, 41 insertions(+) > > diff --git a/migration/rdma.c b/migration/rdma.c > index 8784b5f22a6..a2ad00d665f 100644 > --- a/migration/rdma.c > +++ b/migration/rdma.c > @@ -1117,6 +1117,30 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma) > return 0; > } > > +/* > + * ibv_advise_mr to avoid RNR NAK error as far as possible. > + * The responder mr registering with ODP will sent RNR NAK back to > + * the requester in the face of the page fault. > + */ > +static void qemu_rdma_advise_prefetch_write_mr(struct ibv_pd *pd, uint64_t addr, > + uint32_t len, uint32_t lkey, > + const char *name, bool wr) > +{ > + int ret; > + int advice = wr ? 
IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE : > + IBV_ADVISE_MR_ADVICE_PREFETCH; > + struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len}; > + > + ret = ibv_advise_mr(pd, advice, > + IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sg_list, 1); > + /* ignore the error */ > + if (ret) { > + trace_qemu_rdma_advise_mr(name, len, addr, strerror(errno)); > + } else { > + trace_qemu_rdma_advise_mr(name, len, addr, "successed"); > + } > +} > + > static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma) > { > int i; > @@ -1140,6 +1164,17 @@ on_demand: > perror("Failed to register local dest ram block!\n"); > break; > } > + > + if (access & IBV_ACCESS_ON_DEMAND) { > + qemu_rdma_advise_prefetch_write_mr(rdma->pd, > + (uintptr_t) > + local->block[i].local_host_addr, > + local->block[i].length, > + local->block[i].mr->lkey, > + local->block[i].block_name, > + true); > + } > + > rdma->total_registrations++; > } > > @@ -1244,6 +1279,11 @@ on_demand: > rdma->total_registrations); > return -1; > } > + if (access & IBV_ACCESS_ON_DEMAND) { > + qemu_rdma_advise_prefetch_write_mr(rdma->pd, (uintptr_t)chunk_start, > + len, block->pmr[chunk]->lkey, > + block->block_name, rkey); > + } > rdma->total_registrations++; > } > > diff --git a/migration/trace-events b/migration/trace-events > index 5f6aa580def..901c1d54c12 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -213,6 +213,7 @@ qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other complet > qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.." 
> qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p" > qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s" > +qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch write at %" PRIu32 "@0x%" PRIx64 ": %s" > qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64 > qemu_rdma_registration_handle_finished(void) "" > qemu_rdma_registration_handle_ram_blocks(void) "" > -- > 2.31.1 > Following https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/man/ibv_advise_mr.3.md it looks like it is a best-effort optimization, I don't see any down-sides to it. However it seems like it is recommended to use IBV_ADVISE_MR_FLAG_FLUSH in order to increase the optimization chances. Anyway Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> Thanks, Marcel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] migration/rdma: advise prefetch write for ODP region 2021-08-22 8:39 ` Marcel Apfelbaum @ 2021-08-23 1:21 ` lizhijian 0 siblings, 0 replies; 9+ messages in thread From: lizhijian @ 2021-08-23 1:21 UTC (permalink / raw) To: Marcel Apfelbaum, lizhijian Cc: qemu devel list, Dr. David Alan Gilbert, Juan Quintela Hi Marcel On 22/08/2021 16:39, Marcel Apfelbaum wrote: > Hi, > > On Sat, Jul 31, 2021 at 5:03 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote: >> The responder mr registering with ODP will sent RNR NAK back to >> the requester in the face of the page fault. >> --------- >> ibv_poll_cq wc.status=13 RNR retry counter exceeded! >> ibv_poll_cq wrid=WRITE RDMA! >> --------- >> ibv_advise_mr(3) helps to make pages present before the actual IO is >> conducted so that the responder does page fault as little as possible. >> >> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> >> --- >> migration/rdma.c | 40 ++++++++++++++++++++++++++++++++++++++++ >> migration/trace-events | 1 + >> 2 files changed, 41 insertions(+) >> >> diff --git a/migration/rdma.c b/migration/rdma.c >> index 8784b5f22a6..a2ad00d665f 100644 >> --- a/migration/rdma.c >> +++ b/migration/rdma.c >> @@ -1117,6 +1117,30 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma) >> return 0; >> } >> >> +/* >> + * ibv_advise_mr to avoid RNR NAK error as far as possible. >> + * The responder mr registering with ODP will sent RNR NAK back to >> + * the requester in the face of the page fault. >> + */ >> +static void qemu_rdma_advise_prefetch_write_mr(struct ibv_pd *pd, uint64_t addr, >> + uint32_t len, uint32_t lkey, >> + const char *name, bool wr) >> +{ >> + int ret; >> + int advice = wr ? 
IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE : >> + IBV_ADVISE_MR_ADVICE_PREFETCH; >> + struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len}; >> + >> + ret = ibv_advise_mr(pd, advice, >> + IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sg_list, 1); >> + /* ignore the error */ > Following https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/man/ibv_advise_mr.3.md > it looks like it is a best-effort optimization, > I don't see any down-sides to it. > However it seems like it is recommended to use > IBV_ADVISE_MR_FLAG_FLUSH in order to > increase the optimization chances. Good catch, i will update it soon. Thanks > > Anyway > > Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> > > Thanks, > Marcel > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] enable fsdax rdma migration 2021-07-31 14:03 [PATCH 0/2] enable fsdax rdma migration Li Zhijian 2021-07-31 14:03 ` [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region Li Zhijian 2021-07-31 14:03 ` [PATCH 2/2] migration/rdma: advise prefetch write for ODP region Li Zhijian @ 2021-08-16 2:10 ` lizhijian 2 siblings, 0 replies; 9+ messages in thread From: lizhijian @ 2021-08-16 2:10 UTC (permalink / raw) To: lizhijian, quintela, dgilbert, qemu-devel ping... Hey Dave, could you help to take a look :) Thanks Zhijian On 31/07/2021 22:03, Li Zhijian wrote: > Previous qemu face 2 problems when migrating a fsdax memory backend with > RDMA protocol. > (1) ibv_reg_mr failed with Operation not supported > (2) requester(source) side could receive RNR NAK. > > For the (1), we can try to register memory region with ODP feature which > has already been implemented in some modern HCA hardware/drivers. > For the (2), IB provides advise API to prefetch pages in specific memory > region. It can help driver reduce the page fault on responder(destination) > side during RDMA_WRITE. > > Li Zhijian (2): > migration/rdma: Try to register On-Demand Paging memory region > migration/rdma: advise prefetch write for ODP region > > migration/rdma.c | 67 ++++++++++++++++++++++++++++++++++++------ > migration/trace-events | 2 ++ > 2 files changed, 60 insertions(+), 9 deletions(-) > ^ permalink raw reply [flat|nested] 9+ messages in thread