From: Li Zhijian <lizhijian@fujitsu.com>
To: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>,
	<linux-rdma@vger.kernel.org>, <leonro@nvidia.com>,
	<jgg@nvidia.com>, <zyjzyj2000@gmail.com>
Cc: <nvdimm@lists.linux.dev>, <linux-kernel@vger.kernel.org>,
	<rpearsonhpe@gmail.com>, <yangx.jy@fujitsu.com>,
	<y-goto@fujitsu.com>
Subject: Re: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE
Date: Fri, 9 Sep 2022 11:07:33 +0800
Message-ID: <f4da3894-488b-fc6a-fa04-482f1354865a@fujitsu.com>
In-Reply-To: <cover.1662461897.git.matsuda-daisuke@fujitsu.com>

Daisuke

Great job.

I love this feature. Before starting to review your patches, I tested it with
QEMU migration over RDMA (with an fsdax memory-backend), a setup that had
previously worked with MLX5.

This time, with your ODP patches, it works on RXE too, though ibv_advise_mr may not be ready yet.


Thanks
Zhijian


On 07/09/2022 10:42, Daisuke Matsuda wrote:
> Hi everyone,
>
> This patch series implements the On-Demand Paging feature in the SoftRoCE
> (rxe) driver, which has so far been available only in the mlx5 driver[1].
>
> [Overview]
> When applications register a memory region(MR), RDMA drivers normally pin
> pages in the MR so that physical addresses are never changed during RDMA
> communication. This requires the MR to fit in physical memory and
> inevitably leads to memory pressure. On the other hand, On-Demand Paging
> (ODP) allows applications to register MRs without pinning pages. They are
> paged-in when the driver requires and paged-out when the OS reclaims. As a
> result, it is possible to register a large MR that does not fit in physical
> memory without taking up so much physical memory.
>
> [Why add this feature?]
> We, Fujitsu, have contributed to RDMA with a view to using it with
> persistent memory. Persistent memory can host a filesystem that allows
> applications to read/write files directly without involving page cache.
> This is called FS-DAX(filesystem direct access) mode. There is a problem
> that data on DAX-enabled filesystem cannot be duplicated with software RAID
> or other hardware methods. Data replication with RDMA, which features
> high-speed connections, is the best solution for the problem.
>
> However, there is a known issue that hinders using RDMA with FS-DAX. When
> RDMA operations to a file and update of the file metadata are processed
> concurrently on the same node, illegal memory accesses can be executed,
> disregarding the updated metadata. This is because RDMA operations do not
> go through page cache but access data directly. There was an effort[2] to
> solve this problem, but it was rejected in the end. Though there is no
> general solution available, it is possible to work around the problem using
> the ODP feature that has been available only in mlx5. ODP enables drivers
> to update metadata before processing RDMA operations.
>
> We have enhanced the rxe to expedite the usage of persistent memory. Our
> contribution to rxe includes RDMA Atomic write[3] and RDMA Flush[4]. With
> them being merged to rxe along with ODP, an environment will be ready for
> developers to create and test software for RDMA with FS-DAX. There is a
> library(librpma)[5] being developed for this purpose. This environment
> can be used by anybody without any special hardware but an ordinary
> computer with a normal NIC though it is inferior to hardware
> implementations in terms of performance.
>
> [Design considerations]
> ODP has been available only in mlx5, but functions and data structures
> that can be used commonly are provided in ib_uverbs(infiniband/core). The
> interface is heavily dependent on HMM infrastructure[6], and this patchset
> uses them as much as possible. While mlx5 has both Explicit and Implicit ODP
> features along with prefetch feature, this patchset implements the Explicit
> ODP feature only.
>
> As an important change, it is necessary to convert triple tasklets
> (requester, responder and completer) to workqueues because they must be
> able to sleep in order to trigger page fault before accessing MRs. I did a
> test shown in the 2nd patch and found that the change makes the latency
> higher while improving the bandwidth. Though it may be possible to create a
> new independent workqueue for page fault execution, it is not a very
> sensible solution since the tasklets would have to busy-wait for its
> completion in that case.
>
> If responder and completer sleep, it becomes more likely that packet drop
> occurs because of overflow in receiver queue. There are multiple queues
> involved, but, as SoftRoCE uses UDP, the most important one would be the
> UDP buffers. The size can be configured with the net.core.rmem_default and
> net.core.rmem_max sysctl parameters. Users should change these values in
> case of packet drop, but a page fault would typically not take long enough
> to cause the problem.
>
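
To check the current limits before tuning them, a minimal user-space sketch in
C could look like the following. It only reads the two sysctl values named
above through /proc/sys. The helper names (parse_sysctl_value,
read_sysctl_long, print_udp_rmem_limits) are made up for illustration and are
not part of rxe or of this series.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one decimal value from an already-open sysctl file.
 * Returns 0 on success, -1 if no number could be read. */
static int parse_sysctl_value(FILE *f, long *out)
{
    assert(f != NULL && out != NULL);
    return fscanf(f, "%ld", out) == 1 ? 0 : -1;
}

/* Open a /proc/sys entry by path and read its value.
 * Returns 0 on success, -1 if the file is missing or malformed. */
static int read_sysctl_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ret;

    if (!f)
        return -1;
    ret = parse_sysctl_value(f, out);
    fclose(f);
    return ret;
}

/* Print the two receive-buffer limits mentioned in the cover letter.
 * Returns how many of them could be read on this system. */
static int print_udp_rmem_limits(void)
{
    static const char *const paths[] = {
        "/proc/sys/net/core/rmem_default",
        "/proc/sys/net/core/rmem_max",
    };
    int ok = 0;

    for (unsigned i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        long val;

        if (read_sysctl_long(paths[i], &val) == 0) {
            printf("%s = %ld bytes\n", paths[i], val);
            ok++;
        } else {
            printf("%s: not readable here\n", paths[i]);
        }
    }
    return ok;
}
```

Calling print_udp_rmem_limits() from main() prints whichever values are
readable. Raising the limits would be done separately with sysctl(8), e.g.
`sysctl -w net.core.rmem_max=<bytes>`; no particular value is suggested here.
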
> [How does ODP work?]
> "struct ib_umem_odp" is used to manage pages. It is created for each
> ODP-enabled MR on its registration. This struct holds a pair of arrays
> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> PFNs are stored in the driver page table. They are updated on page-in and
> page-out, both of which use the common interface in ib_uverbs.
>
> Page-in can occur when requester, responder or completer access an MR in
> order to process RDMA operations. If they find that the pages being
> accessed are not present in physical memory or requisite permissions are
> not set on the pages, they trigger a page fault to make pages present with
> proper permissions and at the same time update the driver page table. After
> confirming the presence of the pages, they execute memory access such as
> read, write or atomic operations.
>
> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> update of a file that is being used as an MR). When creating an ODP-enabled
> MR, the driver registers an MMU notifier callback. When the kernel issues a
> page invalidation notification, the callback is invoked to unmap DMA
> addresses and update the driver page table. After that, the kernel releases
> the pages.
>
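
The page-in and page-out flow described above can be sketched as a toy
user-space model in C. This is only an illustration under stated assumptions:
struct toy_odp_mr and every toy_* name are hypothetical, the frame numbers are
invented, and nothing here is the real struct ib_umem_odp or the ib_uverbs
interface. It only mirrors the idea of a pair of arrays that is filled on
fault and cleared on invalidation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NPAGES     8
#define TOY_PAGE_SHIFT 12
#define TOY_PERM_READ  0x1u
#define TOY_PERM_WRITE 0x2u
#define TOY_PERM_MASK  (TOY_PERM_READ | TOY_PERM_WRITE)

/* Hypothetical stand-in for a per-MR driver page table. */
struct toy_odp_mr {
    uint64_t pfn_list[TOY_NPAGES]; /* 0 means "page not present" */
    uint64_t dma_list[TOY_NPAGES]; /* fake DMA address | permission bits */
    unsigned faults;               /* number of simulated page faults */
};

/* Is the page present with all of the requested permissions? */
static bool toy_page_ok(const struct toy_odp_mr *mr, size_t idx, unsigned perm)
{
    assert(idx < TOY_NPAGES);
    return mr->pfn_list[idx] != 0 && (mr->dma_list[idx] & perm) == perm;
}

/* Page-in: simulate a fault and record the page in the table,
 * keeping any permissions that were already granted. */
static void toy_page_in(struct toy_odp_mr *mr, size_t idx, unsigned perm)
{
    uint64_t pfn = 0x1000 + idx; /* invented frame number, never 0 */
    uint64_t old = mr->dma_list[idx] & TOY_PERM_MASK;

    mr->pfn_list[idx] = pfn;
    mr->dma_list[idx] = (pfn << TOY_PAGE_SHIFT) | old | perm;
    mr->faults++;
}

/* Access path used by requester/responder/completer in this model:
 * check the table first and fault only when needed.
 * Returns true if a fault had to be triggered. */
static bool toy_access(struct toy_odp_mr *mr, size_t idx, unsigned perm)
{
    if (toy_page_ok(mr, idx, perm))
        return false;
    toy_page_in(mr, idx, perm);
    return true;
}

/* Page-out: what an invalidation callback would do to pages [start, end). */
static void toy_invalidate(struct toy_odp_mr *mr, size_t start, size_t end)
{
    for (size_t i = start; i < end && i < TOY_NPAGES; i++) {
        mr->dma_list[i] = 0;
        mr->pfn_list[i] = 0;
    }
}
```

In this model the first toy_access() to a page faults, a repeated access with
the same permission does not, and any access after toy_invalidate() faults
again. The real driver additionally has to serialize these paths against each
other, which the sketch leaves out.
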
> [Supported operations]
> All operations are supported on RC connection. Atomic write[3] and Flush[4]
> operations, which are still under discussion, are also going to be
> supported after their patches are merged. On UD connection, Send, Recv,
> SRQ-Recv are supported. Because other operations are not supported on mlx5,
> I follow the same decision for now.
>
> [How to test ODP?]
> There are only a few resources available for testing. pyverbs testcases in
> rdma-core and perftest[7] are the recommended ones. Note that you may have to
> build perftest from upstream since older versions do not handle ODP
> capabilities correctly.
>
> [Future work]
> My next work will be the prefetch feature. It allows applications to
> trigger page fault using ibv_advise_mr(3) to optimize performance. Some
> existing software like librpma uses this feature. Additionally, I think we
> can also add the implicit ODP feature in the future.
>
> [1] [RFC 00/20] On demand paging
> https://www.spinics.net/lists/linux-rdma/msg18906.html
>
> [2] [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
> https://lore.kernel.org/nvdimm/20190809225833.6657-1-ira.weiny@intel.com/
>
> [3] [RESEND PATCH v5 0/2] RDMA/rxe: Add RDMA Atomic Write operation
> https://www.spinics.net/lists/linux-rdma/msg111428.html
>
> [4] [PATCH v4 0/6] RDMA/rxe: Add RDMA FLUSH operation
> https://www.spinics.net/lists/kernel/msg4462045.html
>
> [5] librpma: Remote Persistent Memory Access Library
> https://github.com/pmem/rpma
>
> [6] Heterogeneous Memory Management (HMM)
> https://www.kernel.org/doc/html/latest/mm/hmm.html
>
> [7] linux-rdma/perftest: Infiniband Verbs Performance Tests
> https://github.com/linux-rdma/perftest
>
> Daisuke Matsuda (7):
>    IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
>    RDMA/rxe: Convert the triple tasklets to workqueues
>    RDMA/rxe: Cleanup code for responder Atomic operations
>    RDMA/rxe: Add page invalidation support
>    RDMA/rxe: Allow registering MRs for On-Demand Paging
>    RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
>    RDMA/rxe: Add support for the traditional Atomic operations with ODP
>
>   drivers/infiniband/core/umem_odp.c    |   6 +-
>   drivers/infiniband/hw/mlx5/odp.c      |   4 +-
>   drivers/infiniband/sw/rxe/Makefile    |   5 +-
>   drivers/infiniband/sw/rxe/rxe.c       |  18 ++
>   drivers/infiniband/sw/rxe/rxe_comp.c  |  42 +++-
>   drivers/infiniband/sw/rxe/rxe_loc.h   |  11 +-
>   drivers/infiniband/sw/rxe/rxe_mr.c    |   7 +-
>   drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
>   drivers/infiniband/sw/rxe/rxe_odp.c   | 329 ++++++++++++++++++++++++++
>   drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
>   drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++---
>   drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
>   drivers/infiniband/sw/rxe/rxe_req.c   |  14 +-
>   drivers/infiniband/sw/rxe/rxe_resp.c  | 175 +++++++-------
>   drivers/infiniband/sw/rxe/rxe_resp.h  |  44 ++++
>   drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------
>   drivers/infiniband/sw/rxe/rxe_task.h  |  69 ------
>   drivers/infiniband/sw/rxe/rxe_verbs.c |  16 +-
>   drivers/infiniband/sw/rxe/rxe_verbs.h |  10 +-
>   drivers/infiniband/sw/rxe/rxe_wq.c    | 161 +++++++++++++
>   drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++
>   21 files changed, 824 insertions(+), 386 deletions(-)
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h
>   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
>   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h
>
