* [PATCH RFC 0/9] A rendezvous module
@ 2021-03-19 12:56 kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 1/9] RDMA/rv: Public interface for the RDMA Rendezvous module kaike.wan
                   ` (9 more replies)
  0 siblings, 10 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

RDMA transactions on RC QPs have a high demand for memory for HPC
applications such as MPI, especially on nodes with a high CPU core
count, where a process is normally dispatched to each core and an RC QP
is created for each remote process. For a 100-node fabric,
about 10 GB - 100 GB of memory is required for WQEs/buffers/QP state on
each server node. This high demand imposes a severe restriction
on the scalability of the HPC fabric.
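
As a rough sketch of where such an estimate comes from (the process
count per node and the per-QP cost here are illustrative assumptions,
not measurements):

  100 processes/node x (100 processes/node x 99 remote nodes)
      ~= 10^6 RC QPs per node
  at roughly 10 KB - 100 KB per QP for WQEs/buffers/QP state
      ~= 10 GB - 100 GB per node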

A number of solutions have been implemented over the years. UD-based
solutions solve the scalability problem by requiring only one UD
QP per process, which can send a message to or receive a message from
any other process. However, a UD QP does not have the performance of an
RC QP, and the application has to manage the segmentation of large
messages.

SRQ reduces the memory demand by sharing a receive queue among multiple
QPs. However, it still requires an RC QP for each remote
process and each RC QP still requires a send queue. In addition, it is
an optional feature.

XRC further reduces the memory demand by allowing a single QP per
process to communicate with any process on a remote node. In this
mode, each process requires only one QP for each remote node. Again,
this is optional and not all vendors support it.

Dynamically Connected transport minimizes the memory usage, but requires
vendor specific hardware and changes in applications.
Additionally, all of these mechanisms can induce latency jitter due to
the use of more QPs and more QP state, and hence additional
PCIe transactions at scale where NIC QP/CQ caches overflow.

Based on this, we are proposing a generic, vendor-agnostic approach
that addresses the RC scalability problem and potentially improves RDMA
performance for larger messages by using RDMA Write as part of a
rendezvous protocol.

Here are the key points of the proposal:
- For each device used by a job on a given node, there are a fixed
  number of RC connections (default: 4) between this node and any
  other remote node, no matter how many processes are running on each
  node for this job. That is, for a given job and device, all the
  processes on a given node will communicate with all the processes of
  the same job on a remote node through the same fixed number of
  connections. This eliminates the increased memory demand caused by
  core count increase and reduces the overall RC connection setup
  costs.
- A memory region (MR) cache is added to reduce MR
  registration/deregistration costs, as the same buffers tend to be
  reused by applications. The MR cache is per process and can be
  accessed only by processes in the same job.
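
For comparison, a rough count of RC connections per node under this
proposal, assuming the default of 4 shared connections per node pair
and the same illustrative 100-node job:

  4 connections x 99 remote nodes = 396 RC connections per node

independent of the number of processes per node, versus the per-process
full mesh estimated above.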

Here is how we are proposing to implement this application-enabling
rendezvous module, which takes advantage of the various features
of the RDMA subsystem:
- Create a rendezvous module (rv) under drivers/infiniband/ulp/rv/. This
  can be changed if a better location is recommended.
- The rv module adds a char device file /dev/rv that user
  applications/middlewares can open to communicate with the rv module
  (PSM3 OFI provider in libfabric 1.12.0 being the primary use case).
  The file interface is crucial for associating the job/process
  with the MRs, the connection endpoints, and RDMA requests.
  rv_file_open - Open a handle to the rv module for the current user
                 process.
  rv_file_close - Close the handle and clean up.
  rv_file_ioctl - The main communication methods: to attach to a device,
                  to register/deregister mr, to create connection
                  endpoint, to connect, to query, to get stats, to do
                  RDMA transactions, etc.
  rv_file_mmap - Mmap an event ring into user space for the current user
                 process.
- Basic mode of operations (PSM3 is used as an example for user
  applications):
  - A middleware (like MPI) has out-of-band communication channels
    between any two nodes, which are used to establish high performance
    communications for providers such as PSM3.
  - On each node, the PSM3 provider opens the rv module, and issues an
    attach request to bind to a specific local device and specifies
    information specific to the job. This associates the user process
    with the RDMA device and the job.
  - On each node, the PSM3 provider will establish user space low
    latency eager and control message communications, typically via user
    space UD QPs.
  - On each node, the PSM3 provider will mmap a ring buffer for events
    from the rv module.
  - On each node, the PSM3 provider creates a set of connection
    endpoints by passing the destination info to the rv module. If the
    local node is chosen as the listener for the connection (based on
    address comparison between the two nodes), it will start to listen
    for any incoming IB CM connection requests and accept them when they
    arrive. This step associates the shared connection endpoints with
    the job and device.
  - On each node, the PSM3 provider will request connection
    establishment through the rv module. On the client node, rv sends an
    IB CM request to the remote node. On the listener node, nothing
    additional needs to be done for the connect request from the PSM3
    provider.
  - On each node, the PSM3 provider will wait until the RC connection is
    established. The PSM3 provider will query the rv module about the
    connection status periodically.
  - As large MPI messages are requested, the PSM3 provider will request
    rv to register MRs for the MPI application’s send and receive
    buffers.  The rv module makes use of its MR cache to limit when such
    requests need to interact with verbs MR calls for the NIC.
    The PSM3 provider control message mechanisms are used to exchange IO
    virtual addresses and rkeys for such buffers and MRs.
  - For RDMA transactions, the RDMA WRITE WITH IMMED request will be
    used. The immediate data will contain info about the target
    process on the remote node and which outstanding rendezvous message
    this RDMA is for. For send completion and receive completion,
    an event will be posted into the event ring buffer and the PSM3
    provider can poll for completion.
  - As RDMA transactions complete, the PSM3 provider will indicate
    completion of the corresponding MPI send/receive to the MPI
    middleware/application. The PSM3 provider will also inform the rv
    module it may deregister the MR.  The deregistration allows rv
    to cache the MR for use in future requests by the PSM3 provider to
    register the same memory for use in another MPI message. The
    cached MR will be removed if the user buffer is freed and an MMU
    notifier event is received by the rv module.
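
To make the sequence above concrete, here is a minimal user space
sketch of the open/attach/mmap portion (the ioctl names and structures
come from rv_user_ioctls.h in patch 1; the field values and the mmap
offset convention are illustrative assumptions, and error handling is
omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <rdma/rv_user_ioctls.h>

/*
 * Hypothetical helper showing the attach + event ring setup sequence.
 * A real PSM3 attach also fills in job_key, loc_addr/loc_gid and the
 * timeout fields.
 */
static int rv_example_attach(const char *ib_dev, __u32 num_events)
{
        struct rv_attach_params ap = { 0 };
        struct rv_ring_header *ring;
        int fd;

        fd = open(RV_FILE_NAME, O_RDWR);        /* /dev/rv */

        strncpy(ap.in.dev_name, ib_dev, RV_MAX_DEV_NAME_LEN - 1);
        ap.in.rdma_mode = RV_RDMA_MODE_KERNEL;  /* kernel RC QPs + MR cache */
        ap.in.port_num = 1;                     /* illustrative */
        ap.in.num_conn = 4;                     /* shared conns per node pair */
        ap.in.cq_entries = num_events;          /* event ring size */
        ioctl(fd, RV_IOCTL_ATTACH, &ap);        /* bind process to device/job */

        /* Map the completion event ring for polling. */
        ring = mmap(NULL, RV_RING_ALLOC_LEN(num_events),
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        (void)ring;

        /* CONN_CREATE/CONN_CONNECT, REG_MEM and POST_RDMA_WR_IMMED follow. */
        return fd;
}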

Please comment,

Thank you,

Kaike

Kaike Wan (9):
  RDMA/rv: Public interface for the RDMA Rendezvous module
  RDMA/rv: Add the internal header files
  RDMA/rv: Add the rv module
  RDMA/rv: Add functions for memory region cache
  RDMA/rv: Add function to register/deregister memory region
  RDMA/rv: Add connection management functions
  RDMA/rv: Add functions for RDMA transactions
  RDMA/rv: Add functions for file operations
  RDMA/rv: Integrate the file operations into the rv module

 MAINTAINERS                                |    6 +
 drivers/infiniband/Kconfig                 |    1 +
 drivers/infiniband/core/core_priv.h        |    4 +
 drivers/infiniband/core/rdma_core.c        |  237 ++
 drivers/infiniband/core/uverbs_cmd.c       |   54 +-
 drivers/infiniband/ulp/Makefile            |    1 +
 drivers/infiniband/ulp/rv/Kconfig          |   11 +
 drivers/infiniband/ulp/rv/Makefile         |    9 +
 drivers/infiniband/ulp/rv/rv.h             |  892 ++++++
 drivers/infiniband/ulp/rv/rv_conn.c        | 3037 ++++++++++++++++++++
 drivers/infiniband/ulp/rv/rv_file.c        | 1196 ++++++++
 drivers/infiniband/ulp/rv/rv_main.c        |  271 ++
 drivers/infiniband/ulp/rv/rv_mr.c          |  393 +++
 drivers/infiniband/ulp/rv/rv_mr_cache.c    |  428 +++
 drivers/infiniband/ulp/rv/rv_mr_cache.h    |  152 +
 drivers/infiniband/ulp/rv/rv_rdma.c        |  416 +++
 drivers/infiniband/ulp/rv/trace.c          |    7 +
 drivers/infiniband/ulp/rv/trace.h          |   10 +
 drivers/infiniband/ulp/rv/trace_conn.h     |  547 ++++
 drivers/infiniband/ulp/rv/trace_dev.h      |   82 +
 drivers/infiniband/ulp/rv/trace_mr.h       |  109 +
 drivers/infiniband/ulp/rv/trace_mr_cache.h |  181 ++
 drivers/infiniband/ulp/rv/trace_rdma.h     |  254 ++
 drivers/infiniband/ulp/rv/trace_user.h     |  321 +++
 include/rdma/uverbs_types.h                |   10 +
 include/uapi/rdma/rv_user_ioctls.h         |  558 ++++
 26 files changed, 9139 insertions(+), 48 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rv/Kconfig
 create mode 100644 drivers/infiniband/ulp/rv/Makefile
 create mode 100644 drivers/infiniband/ulp/rv/rv.h
 create mode 100644 drivers/infiniband/ulp/rv/rv_conn.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_file.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_main.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.h
 create mode 100644 drivers/infiniband/ulp/rv/rv_rdma.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_conn.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_dev.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr_cache.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_rdma.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_user.h
 create mode 100644 include/uapi/rdma/rv_user_ioctls.h

-- 
2.18.1



* [PATCH RFC 1/9] RDMA/rv: Public interface for the RDMA Rendezvous module
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 16:00   ` Jason Gunthorpe
  2021-03-19 18:42   ` kernel test robot
  2021-03-19 12:56 ` [PATCH RFC 2/9] RDMA/rv: Add the internal header files kaike.wan
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

The RDMA Rendezvous (rv) module provides an interface for HPC
middleware to improve performance by caching memory region
registrations, and to improve the scalability of RDMA transactions
through connection management between nodes. This mechanism
is implemented through the following ioctl requests:
- ATTACH: to attach to an RDMA device.
- REG_MEM: to register a user/kernel memory region.
- DEREG_MEM: to release application use of MR, allowing it to
             remain in cache.
- GET_CACHE_STATS: to get cache statistics.
- CONN_CREATE: to create an RC connection.
- CONN_CONNECT: to start the connection.
- CONN_GET_CONN_COUNT: to use as part of error recovery from lost
                       messages in the application.
- CONN_GET_STATS: to get connection statistics.
- GET_EVENT_STATS: to get the RDMA event statistics.
- POST_RDMA_WR_IMMED: to post an RDMA WRITE WITH IMMED request.
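
In addition to the ioctls above, the header defines the event ring
delivered to user space via mmap. Below is a minimal sketch of a
consumer of that ring (the head/tail roles are taken from the
rv_ring_header comments in the header; the wrap-around and
memory-ordering details shown here are assumptions, not the actual PSM3
code):

/*
 * 'ring' is the rv_ring_header mapped via the rv mmap interface and
 * 'num_entries' matches the cq_entries used at attach time (both are
 * assumptions of this example).
 */
static void rv_example_poll_ring(struct rv_ring_header *ring,
                                 __u32 num_entries)
{
        while (ring->head != ring->tail) {      /* producer (rv) adds at tail */
                struct rv_event *ev = &ring->entries[ring->head];

                if (ev->event_type == RV_WC_RDMA_WRITE) {
                        /* send completion: ev->wc.wr_id names the request */
                } else if (ev->event_type == RV_WC_RECV_RDMA_WITH_IMM) {
                        /* recv completion: ev->wc.imm_data identifies the
                         * target process and rendezvous message
                         */
                }
                /* consumer removes at head */
                ring->head = (ring->head + 1) % num_entries;
        }
}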

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 include/uapi/rdma/rv_user_ioctls.h | 558 +++++++++++++++++++++++++++++
 1 file changed, 558 insertions(+)
 create mode 100644 include/uapi/rdma/rv_user_ioctls.h

diff --git a/include/uapi/rdma/rv_user_ioctls.h b/include/uapi/rdma/rv_user_ioctls.h
new file mode 100644
index 000000000000..97e35b722443
--- /dev/null
+++ b/include/uapi/rdma/rv_user_ioctls.h
@@ -0,0 +1,558 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#ifndef __RV_USER_IOCTL_H__
+#define __RV_USER_IOCTL_H__
+#include <rdma/rdma_user_ioctl.h>
+#include <rdma/ib_user_sa.h>
+#include <rdma/ib_user_verbs.h>
+
+/* Checking /Documentation/userspace-api/ioctl/ioctl-number.rst */
+#define RV_MAGIC RDMA_IOCTL_MAGIC
+#define RV_FILE_NAME "/dev/rv"
+
+/*
+ * Handles are opaque to application; they are meaningful only to the
+ * RV driver
+ */
+
+/* this version of ABI */
+#define RV_ABI_VER_MAJOR 1
+#define RV_ABI_VER_MINOR 0
+
+struct rv_query_params_out {
+		/* ABI version */
+	__u16 major_rev;
+	__u16 minor_rev;
+	__u32 resv1;
+	__aligned_u64 capability;
+	__aligned_u64 resv2[6];
+};
+
+#define RV_IOCTL_QUERY _IOR(RV_MAGIC, 0xFC, struct rv_query_params_out)
+
+/* Mode for use of rv module by PSM */
+#define RV_RDMA_MODE_USER 0	/* user MR caching only */
+#define RV_RDMA_MODE_KERNEL 1	/* + kernel RC QPs with kernel MR caching */
+#define RV_RDMA_MODE_MAX 1
+
+#define RV_MAX_DEV_NAME_LEN IB_DEVICE_NAME_MAX
+#define RV_MAX_NUM_CONN 16
+#define RV_MAX_INDEX_BITS 12
+#define RV_MAX_JOB_KEY_LEN 16
+#define RV_MAX_CQ_ENTRIES 10000
+
+/*
+ * mr_cache_size is in MBs and if 0 will use module param as default
+ * num_conn - number of QPs between each pair of nodes
+ * loc_addr - used to select client/listen vs rem_addr
+ * index_bits - num high bits of immed data with rv index
+ * loc_gid_index - SGID for client connections
+ * loc_gid[16] - to double check gid_index unchanged
+ * job_key[RV_MAX_JOB_KEY_LEN] - unique uuid per job
+ * job_key_len - len, if 0 matches jobs with len==0 only
+ * q_depth - size of QP and per QP CQs
+ * reconnect_timeout - in seconds from loss to restoration
+ * hb_interval - in milliseconds between heartbeats
+ */
+struct rv_attach_params_in {
+	char dev_name[RV_MAX_DEV_NAME_LEN];
+	__u32 mr_cache_size;
+	__u8 rdma_mode;
+
+	/* additional information for RV_RDMA_MODE_KERNEL */
+	__u8 port_num;
+	__u8 num_conn;
+	__u32 loc_addr;
+	__u8 index_bits;
+	__u16 loc_gid_index;
+	__u8 loc_gid[16];
+	__u8 job_key[RV_MAX_JOB_KEY_LEN];
+	__u8 job_key_len;
+	__aligned_u64 service_id;
+	__aligned_u64 context;
+	__u32 cq_entries;
+	__u32 q_depth;
+	__u32 reconnect_timeout;
+	__u32 hb_interval;
+};
+
+/*
+ * rv_index - unique within job on given NIC
+ * mr_cache_size - in MBs
+ * q_depth - size of QP and per QP CQs
+ * reconnect_timeout - value being used
+ */
+struct rv_attach_params_out {
+	__u32 rv_index;
+	__u32 mr_cache_size;
+	__u32 q_depth;
+	__u32 reconnect_timeout;
+};
+
+struct rv_attach_params {
+	union {
+		struct rv_attach_params_in in;
+		struct rv_attach_params_out out;
+	};
+};
+
+#define RV_IOCTL_ATTACH		_IOWR(RV_MAGIC, 0xF1, struct rv_attach_params)
+
+/* The buffer is used to register a kernel mr */
+#define IBV_ACCESS_KERNEL 0x80000000
+
+/*
+ * ibv_pd_handle - user space appl allocated pd
+ * ulen - driver_udata inlen
+ * *udata - driver_udata inbuf
+ */
+struct rv_mem_params_in {
+	__u32 ibv_pd_handle;
+	__u32 cmd_fd_int;
+	__aligned_u64 addr;
+	__aligned_u64 length;
+	__u32 access;
+	size_t ulen;
+	void *udata;
+};
+
+struct rv_mem_params_out {
+	__aligned_u64 mr_handle;
+	__aligned_u64 iova;
+	__u32 lkey;
+	__u32 rkey;
+};
+
+struct rv_mem_params {
+	union {
+		struct rv_mem_params_in in;
+		struct rv_mem_params_out out;
+	};
+};
+
+#define RV_IOCTL_REG_MEM	_IOWR(RV_MAGIC, 0xF3, struct rv_mem_params)
+
+struct rv_dereg_params_in {
+	__aligned_u64 mr_handle;
+	__aligned_u64 addr;
+	__aligned_u64 length;
+	__u32 access;
+};
+
+#define RV_IOCTL_DEREG_MEM	_IOW(RV_MAGIC, 0xF4, struct rv_dereg_params_in)
+
+/*
+ * MR cache statistics are as follows:
+ *   cache_size - Current memory in the cache in bytes
+ *   max_cache_size - Maximum of cache_size in bytes
+ *   limit_cache_size - Maximum allowed cache_size in MB
+ *   count - Current number of MRs in the cache
+ *   max_count - Maximum of count
+ *   inuse - Current number of MRs in the cache with refcount > 0
+ *   max_inuse - Maximum of inuse
+ *   inuse_bytes - Current number of bytes in cache for MRs with refcount > 0
+ *   max_inuse_bytes - Maximum of inuse_bytes
+ *   max_refcount - Maximum of refcount for an MR
+ *   Event counts:
+ *      hit - Cache hit
+ *      miss - Cache miss and added
+ *      full - Cache miss but can't add since full
+ *      failed - Cache miss but can't add since reg_mr failed
+ *      remove - Refcount == 0 and removed by mmu notifier event or cache deinit
+ *      evict - Removed from cache due to lack of cache space
+ *   Number of valid IOCTL_REG_MEM calls = hit+miss+full+failed
+ *      (counts omit some EINVAL and EFAULT REG_MEM use cases)
+ */
+
+struct rv_cache_stats_params_out {
+	__aligned_u64 cache_size;
+	__aligned_u64 max_cache_size;
+	__u32 limit_cache_size;
+	__u32 count;
+	__u32 max_count;
+	__u32 inuse;
+	__u32 max_inuse;
+	__aligned_u64 inuse_bytes;
+	__aligned_u64 max_inuse_bytes;
+	__u32 max_refcount;
+	__aligned_u64 hit;
+	__aligned_u64 miss;
+	__aligned_u64 full;
+	__aligned_u64 failed;
+	__aligned_u64 remove;
+	__aligned_u64 evict;
+};
+
+#define RV_IOCTL_GET_CACHE_STATS _IOR(RV_MAGIC, 0xF7, \
+				      struct rv_cache_stats_params_out)
+
+/*
+ * The create provides an ah_attr.  Field use is as follows:
+ * sanity checked with attach:
+ *	grh.sgid_index, port_num
+ *	possibly in future: src_path_bits
+ * identify QPs which can be shared
+ *	rem_addr, ah.is_global, grh.dgid or dlid
+ *	possibly in future: sl, traffic_class, flow_label, maybe static_rate
+ * sanity checked with connect path:
+ *	dlid, grh.dgid
+ *	could check: sl
+ *	don't want to check: static_rate (could negotiate down in path)
+ * validated with inbound REQ
+ *	port_num, grh.dgid or dlid
+ * Not used: hop_limit (locally resolve)
+ * rem_addr - used to select client/listen vs loc_addr
+ */
+struct rv_conn_create_params_in {
+	struct ib_uverbs_ah_attr ah;
+	__u32 rem_addr;
+	__aligned_u64 context;
+};
+
+/*
+ * handle - rv_user_conn for future conn/discon calls
+ * conn_handle - rv_conn for completions only
+ */
+struct rv_conn_create_params_out {
+	__aligned_u64 handle;
+	__aligned_u64 conn_handle;
+};
+
+struct rv_conn_create_params {
+	union {
+		struct rv_conn_create_params_in in;
+		struct rv_conn_create_params_out out;
+	};
+};
+
+#define RV_IOCTL_CONN_CREATE	_IOWR(RV_MAGIC, 0xF8, \
+				      struct rv_conn_create_params)
+
+struct rv_conn_connect_params_in {
+	__aligned_u64 handle;
+	struct ib_user_path_rec path;
+};
+
+#define RV_IOCTL_CONN_CONNECT	_IOW(RV_MAGIC, 0xF9, \
+				     struct rv_conn_connect_params_in)
+
+struct rv_conn_connected_params_in {
+	__aligned_u64 handle;
+};
+
+#define RV_IOCTL_CONN_CONNECTED	_IOW(RV_MAGIC, 0xFA, \
+				     struct rv_conn_connected_params_in)
+
+/*
+ * get connection count for a specific sconn
+ * returns:
+ *	0 - count returned
+ *	EIO - connection lost and unrecoverable
+ *	EINVAL - invalid handle and/or index
+ * A 32b count is sufficient to handle constant RV reconnects, with a
+ * 100ms delay between each established connection, for up to 13 years.
+ */
+struct rv_conn_get_conn_count_params_in {
+	__aligned_u64 handle;
+	__u8 index;
+};
+
+/* we return count as an output parameter (vs ioctl ret) so can use full 32b */
+struct rv_conn_get_conn_count_params_out {
+	__u32 count;
+};
+
+struct rv_conn_get_conn_count_params {
+	union {
+		struct rv_conn_get_conn_count_params_in in;
+		struct rv_conn_get_conn_count_params_out out;
+	};
+};
+
+#define RV_IOCTL_CONN_GET_CONN_COUNT	_IOWR(RV_MAGIC, 0xFF, \
+					   struct rv_conn_get_conn_count_params)
+
+/* index to get agg of sconn in given conn */
+#define RV_CONN_STATS_AGGREGATE 255
+
+/*
+ * handle - if 0, aggregate of all rv's sconn returned
+ * index - ignored if !handle, otherwise specific sconn index
+ */
+struct rv_conn_get_stats_params_in {
+	__aligned_u64 handle;
+	__u8 index;
+};
+
+/*
+ * flags can be combined when get aggregate results
+ * so separate bits for client vs server
+ */
+#define RV_CONN_STAT_FLAG_SERVER 0x01
+#define RV_CONN_STAT_FLAG_CLIENT 0x02
+#define RV_CONN_STAT_FLAG_WAS_CONNECTED 0x04
+
+/*
+ * index - mimics input value
+ * flags
+ * num_conn - total QPs included
+ * CM events
+ *	req_error - IB_CM_REQ_ERROR
+ *	req_recv - IB_CM_REQ_RECEIVED
+ *	rep_error - IB_CM_REP_ERROR
+ *	rep_recv - IB_CM_REP_RECEIVED
+ *	rtu_recv - IB_CM_RTU_RECEIVED
+ *	established - IB_CM_USER_ESTABLISHED - via ib_cm_notify
+ *	dreq_error - IB_CM_DREQ_ERROR
+ *	dreq_recv - IB_CM_DREQ_RECEIVED
+ *	drep_recv - IB_CM_DREP_RECEIVED
+ *	timewait - IB_CM_TIMEWAIT_EXIT
+ *	mra_recv - IB_CM_MRA_RECEIVED
+ *	rej_recv - IB_CM_REJ_RECEIVED
+ *	lap_error - IB_CM_LAP_ERROR
+ *	lap_recv - IB_CM_LAP_RECEIVED
+ *	apr_recv - IB_CM_APR_RECEIVED
+ *	unexp_event - SIDR and any others
+ * outbound CM messages
+ *	req_sent - CM REQ
+ *	rep_sent - CM REP
+ *	rtu_sent - CM RTU
+ *	rej_sent - CM REJ
+ *	dreq_sent - CM DREQ
+ *	drep_sent - CM DREP
+ *	(re)connect time does not include wait nor resolver time
+ *	wait_time - microseconds for initial connect
+ *	resolve_time - microseconds for initial connect
+ *	connect_time - microseconds for initial connect
+ *	connected_time - microseconds were connected
+ *	resolve - attempts at resolving
+ *	resolve_fail - hard failures
+ *	conn_recovery - # times recovered connection
+ *	rewait_time - microseconds for connect recovery
+ *	reresolve_time - microseconds for connect recovery
+ *	reconnect_time - microseconds for connect recovery
+ *	max_rewait_time - microseconds for connect recovery
+ *	max_reresolve_time - microseconds for connect recovery
+ *	max_reconnect_time - microseconds for connect recovery
+ *	reresolve - attempts at resolving
+ *	reresolve_fail - hard failures
+ *	post_write - successful post_rdma_write
+ *	post_write_fail - failed at time of posting
+ *	post_write_bytes - for successful post
+ *	outstand_send_write - sent RDMA Write waiting for CQE
+ *	send_write_cqe - successful sent RDMA Write CQE
+ *	send_write_cqe_fail - sent RDMA Write CQE with bad status
+ *	recv_write_cqe - successful recv RDMA Write CQE
+ *	recv_write_bytes - successful recv RDMA Write
+ *	recv_cqe_fail - recv CQE with bad status
+ *	post_hb - successful post of heartbeat
+ *	post_hb_fail - failed at time of posting
+ *	send_hb_cqe - successful sent heartbeat CQE
+ *	send_hb_cqe_fail - sent heartbeat CQE with bad status
+ *	recv_hb_cqe - successful recv heartbeat CQE
+ */
+struct rv_conn_get_stats_params_out {
+	__u8 index;
+	__u8 flags;
+	__u32 num_conn;
+
+	/* CM events */
+	__u32 req_error;
+	__u32 req_recv;
+	__u32 rep_error;
+	__u32 rep_recv;
+	__u32 rtu_recv;
+	__u32 established;
+	__u32 dreq_error;
+	__u32 dreq_recv;
+	__u32 drep_recv;
+	__u32 timewait;
+	__u32 mra_recv;
+	__u32 rej_recv;
+	__u32 lap_error;
+	__u32 lap_recv;
+	__u32 apr_recv;
+	__u32 unexp_event;
+
+	/* outbound CM messages */
+	__u32 req_sent;
+	__u32 rep_sent;
+	__u32 rtu_sent;
+	__u32 rej_sent;
+	__u32 dreq_sent;
+	__u32 drep_sent;
+	__aligned_u64 wait_time;
+	__aligned_u64 resolve_time;
+	__aligned_u64 connect_time;
+	__aligned_u64 connected_time;
+	__u32 resolve;
+	__u32 resolve_fail;
+	__u32 conn_recovery;
+	__aligned_u64 rewait_time;
+	__aligned_u64 reresolve_time;
+	__aligned_u64 reconnect_time;
+	__aligned_u64 max_rewait_time;
+	__aligned_u64 max_reresolve_time;
+	__aligned_u64 max_reconnect_time;
+	__u32 reresolve;
+	__u32 reresolve_fail;
+
+	__aligned_u64 post_write;
+	__aligned_u64 post_write_fail;
+	__aligned_u64 post_write_bytes;
+	__u32 outstand_send_write;
+	__aligned_u64 send_write_cqe;
+	__aligned_u64 send_write_cqe_fail;
+
+	__aligned_u64 recv_write_cqe;
+	__aligned_u64 recv_write_bytes;
+	__aligned_u64 recv_cqe_fail;
+
+	__aligned_u64 post_hb;
+	__aligned_u64 post_hb_fail;
+	__aligned_u64 send_hb_cqe;
+	__aligned_u64 send_hb_cqe_fail;
+	__aligned_u64 recv_hb_cqe;
+};
+
+struct rv_conn_get_stats_params {
+	union {
+		struct rv_conn_get_stats_params_in in;
+		struct rv_conn_get_stats_params_out out;
+	};
+};
+
+#define RV_IOCTL_CONN_GET_STATS	_IOWR(RV_MAGIC, 0xFD, \
+				      struct rv_conn_get_stats_params)
+
+/*
+ * send_write_cqe - successful sent RDMA Write CQE
+ * send_write_cqe_fail - sent RDMA Write CQE with bad status
+ * send_write_bytes - for successful send
+ * recv_write_cqe - successful recv RDMA Write CQE
+ * recv_write_cqe_fail - recv RDMA Write CQE with bad status
+ * recv_write_bytes - successful recv RDMA Write
+ */
+struct rv_event_stats_params_out {
+	__aligned_u64 send_write_cqe;
+	__aligned_u64 send_write_cqe_fail;
+	__aligned_u64 send_write_bytes;
+
+	__aligned_u64 recv_write_cqe;
+	__aligned_u64 recv_write_cqe_fail;
+	__aligned_u64 recv_write_bytes;
+};
+
+#define RV_IOCTL_GET_EVENT_STATS _IOR(RV_MAGIC, 0xFE, \
+				      struct rv_event_stats_params_out)
+
+/*
+ * handle - from create_conn
+ * application source buffer and a kernel lkey for it
+ *	loc_addr
+ *	loc_mr_handle
+ * local MR - selected by 3-tuple (addr, len, access)
+ *	loc_mr_addr
+ *	loc_mr_length
+ *	loc_mr_access
+ * remote application dest buffer and a kernel rkey for it
+ *	rem_addr
+ *	rkey
+ * length
+ * wr_id - application context, included in RV SQ completion events
+ * immed
+ */
+struct rv_post_write_params_in {
+	__aligned_u64 handle;
+	__aligned_u64 loc_addr;
+	__aligned_u64 loc_mr_handle;
+	__aligned_u64 loc_mr_addr;
+	__aligned_u64 loc_mr_length;
+	__u32 loc_mr_access;
+	__u32 rkey;
+	__aligned_u64 rem_addr;
+	__aligned_u64 length;
+	__aligned_u64 wr_id;
+	__u32 immed;
+};
+
+struct rv_post_write_params_out {
+	__u8 sconn_index;
+	__u32 conn_count;
+};
+
+struct rv_post_write_params {
+	union {
+		struct rv_post_write_params_in in;
+		struct rv_post_write_params_out out;
+	};
+};
+
+#define RV_IOCTL_POST_RDMA_WR_IMMED _IOWR(RV_MAGIC, 0xFB, \
+					  struct rv_post_write_params)
+
+enum rv_event_type {
+	RV_WC_RDMA_WRITE,		/* send RDMA Write CQE */
+	RV_WC_RECV_RDMA_WITH_IMM,	/* recv RDMA Write w/immed CQE */
+};
+
+/*
+ * events placed on ring buffer for delivery to user space.
+ * Carefully sized to be a multiple of 64 bytes for cache alignment.
+ * Must pack to get good field alignment and desired 64B overall size
+ * Unlike verbs, all rv_event fields are defined even when
+ * rv_event.wc.status != IB_WC_SUCCESS. Only sent writes can report bad status.
+ * event_type - enum rv_event_type
+ * wc - send or recv work completions
+ *	status - ib_wc_status
+ *	resv1 - alignment
+ *	imm_data - for RV_WC_RECV_RDMA_WITH_IMM only
+ *	wr_id - PSM wr_id for RV_WC_RDMA_WRITE only
+ *	conn_handle - conn handle. For efficiency in completion processing, this
+ *		handle is the rv_conn handle, not the rv_user_conn.
+ *		Main use is sanity checks.  On Recv PSM must use imm_data to
+ *		efficiently identify source.
+ *	byte_len - unlike verbs API, this is always valid
+ *	resv2 - alignment
+ * cache_align -  not used, but forces overall struct to 64B size
+ */
+struct rv_event {
+	__u8		event_type;
+	union {
+		struct {
+			__u8		status;
+			__u16	resv1;
+			__u32	imm_data;
+			__aligned_u64	wr_id;
+			__aligned_u64	conn_handle;
+			__u32	byte_len;
+			__u32	resv2;
+		} __attribute__((__packed__)) wc;
+		struct {
+			__u8 pad[7];
+			__u64 pad2[7];
+		} __attribute__((__packed__)) cache_align;
+	};
+} __attribute__((__packed__));
+
+/*
+ * head - consumer removes here
+ * tail - producer adds here
+ * overflow_cnt - number of times producer overflowed ring and discarded
+ * pad - 64B cache alignment for entries
+ */
+struct rv_ring_header {
+	volatile __u32 head;
+	volatile __u32 tail;
+	volatile __u64 overflow_cnt;
+	__aligned_u64 pad[6];
+	struct rv_event entries[];
+};
+
+#define RV_RING_ALLOC_LEN(num_entries) \
+	((__u32)(((num_entries) * sizeof(struct rv_event)) + \
+	       sizeof(struct rv_ring_header)))
+
+#endif /* __RV_USER_IOCTL_H__ */
-- 
2.18.1



* [PATCH RFC 2/9] RDMA/rv: Add the internal header files
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 1/9] RDMA/rv: Public interface for the RDMA Rendezvous module kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 16:02   ` Jason Gunthorpe
  2021-03-19 12:56 ` [PATCH RFC 3/9] RDMA/rv: Add the rv module kaike.wan
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

The two header files include the defines, structures, macros,
function prototypes, and inline functions used in the module.
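
One pattern documented in rv.h is the get_alloc approach for shared
objects (rv_job_dev, rv_listener, rv_conn): look up an existing entry
and take a reference, otherwise allocate and add, with a mutex
preventing duplicate additions and RCU protecting readers elsewhere.
Below is a simplified, hypothetical sketch of that shape (not the
actual rv code; rv_obj, rv_obj_list and the key field are placeholders):

#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/types.h>

struct rv_obj {                 /* placeholder for rv_job_dev/rv_conn/etc. */
        struct list_head entry;
        struct kref kref;
        u64 key;
};

struct rv_obj_list {
        struct mutex mutex;     /* prevents duplicate add in get_alloc */
        struct list_head head;  /* searched under RCU by other readers */
};

static struct rv_obj *rv_obj_get_alloc(struct rv_obj_list *list, u64 key)
{
        struct rv_obj *obj;

        mutex_lock(&list->mutex);
        list_for_each_entry(obj, &list->head, entry) {
                /* skip entries whose kref has already dropped to zero */
                if (obj->key == key && kref_get_unless_zero(&obj->kref))
                        goto unlock;
        }
        obj = kzalloc(sizeof(*obj), GFP_KERNEL);
        if (obj) {
                kref_init(&obj->kref);          /* starts at 1 */
                obj->key = key;
                list_add_rcu(&obj->entry, &list->head);
        }
unlock:
        mutex_unlock(&list->mutex);
        return obj;
}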

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 drivers/infiniband/ulp/rv/rv.h          | 892 ++++++++++++++++++++++++
 drivers/infiniband/ulp/rv/rv_mr_cache.h | 152 ++++
 2 files changed, 1044 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rv/rv.h
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.h

diff --git a/drivers/infiniband/ulp/rv/rv.h b/drivers/infiniband/ulp/rv/rv.h
new file mode 100644
index 000000000000..20ca65387d22
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv.h
@@ -0,0 +1,892 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+#ifndef __RV_H__
+#define __RV_H__
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/scatterlist.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_sa.h>
+#include <rdma/ib_cm.h>
+#include <rdma/rdma_cm.h>
+#include <linux/mutex.h>
+#include <linux/socket.h>
+#include <linux/timer.h>
+#include <rdma/ib_addr.h>
+#include <rdma/ib_cm.h>
+#include <linux/moduleparam.h>
+
+#include "rv_mr_cache.h"
+#include <rdma/rv_user_ioctls.h>
+
+/*
+ * Lock Hierarchy
+ * Listed in the order in which locks may be acquired:
+ * mm->mmap_lock - held during mmap call and mmu_notifier_register
+ * rv_user.mutex
+ * rv_job_dev_list_mutex
+ * rv_job_dev.conn_list_mutex
+ * rv_device.listener_mutex
+ * rv_sconn.mutex - we can never hold this while calling destroy_cm_id
+ *	because destroy_cm_id will wait for handlers via its mutex and deadlock
+ * rv_job_dev_list rcu
+ * rv_dev_list_lock
+ * rv_job_dev.conn_list rcu - no sub locks
+ * rv_device.listener_lock - no sub locks
+ * rv_user.umrs.cache.lock - no sub locks
+ * rv_job_dev.user_array_lock - no sub locks
+ * ring.lock - no sub locks
+ * mr_pd_uobject_lock - no sub locks
+ * rv_conn.next_lock - no sub locks
+ * rv_sconn.drain_lock - no sub locks
+ */
+
+/*
+ * Our goal is to allocate shared resources, but all processes are equal peers.
+ * So we use a get_alloc approach where processes attempt to get a resource
+ * reference (lookup and get) and if it's not on the list, it is allocated and
+ * added.  This is used for rv_job_dev, rv_listener and rv_conn.
+ * Each such list is protected by a mutex to prevent duplicate additions
+ * and an RCU for actual list access.  The mutex also protects
+ * get_alloc/list_del races.  All get_alloc calls are in preemptible
+ * app ioctl call context.
+ * rv_listener list is rarely searched, so it simply uses a mutex and spinlock.
+ */
+
+/*
+ * When working with IB CM, Async events and timeouts we can't predict if and
+ * when their callbacks will occur so they can't have a reference held in
+ * advance. This presents opportunities for the callback to race with rv_conn
+ * destruction. To solve this problem, we use rv_conn_get_check to get a
+ * reference to rv_conn only if rv_conn.kref != 0. rv_conn and rv_sconn
+ * destruction uses destroy_cm_id and del_timer_sync to stop all callbacks (and
+ * allow any outstanding handlers to finish) before it proceeds to free rv_conn.
+ * This ensures the cm_id, etc can remain valid enough for the callback to
+ * call rv_conn_get_check and decide whether to simply ignore the event.
+ * It also allows the rv_sconn to be the ib_cm, CQ, QP and timer context and
+ * avoid searches. Once the callback has the reference, it is protected
+ * from other threads calling rv_conn_release.
+ */
+
+#define DRIVER_NAME "rv"
+
+#define RV_INVALID -1
+
+/* For errors to surface on the console */
+#define rv_err(idx, fmt, ...) \
+	pr_err("[%s:%s %d]: " fmt, DRIVER_NAME, __func__, idx, ##__VA_ARGS__)
+
+#define rv_ptr_err(prefix, ptr, fmt, ...) \
+	pr_err("[%s:%s %s 0x%p]: " fmt, DRIVER_NAME, __func__, prefix, ptr, \
+	       ##__VA_ARGS__)
+
+/* For debugging*/
+#define rv_dbg(idx, fmt, ...) \
+	pr_debug("[%s:%s %d]: " fmt, DRIVER_NAME, __func__, idx, ##__VA_ARGS__)
+
+/* For general console info */
+#define rv_info(idx, fmt, ...) \
+	pr_info("[%s:%s %d]: " fmt, DRIVER_NAME, __func__, idx, ##__VA_ARGS__)
+
+/* For debugging with any pointer */
+#define rv_ptr_dbg(prefix, ptr, fmt, ...) \
+	pr_debug("[%s:%s %s 0x%p]: " fmt, DRIVER_NAME, __func__, prefix, ptr, \
+		 ##__VA_ARGS__)
+
+#define rv_ptr_info(prefix, ptr, fmt, ...) \
+	pr_info("[%s:%s %s 0x%p]: " fmt, DRIVER_NAME, __func__, prefix, ptr, \
+		##__VA_ARGS__)
+
+/* For debugging with rv_conn_info */
+#define rv_conn_err(ptr, fmt, ...) rv_ptr_err("sconn", ptr, fmt, ##__VA_ARGS__)
+#define rv_conn_info(ptr, fmt, ...) rv_ptr_info("sconn", ptr, fmt,\
+						##__VA_ARGS__)
+#define rv_conn_dbg(ptr, fmt, ...) rv_ptr_dbg("sconn", ptr, fmt, ##__VA_ARGS__)
+
+/* For debugging with rv_device */
+#define rv_dev_err(ptr, fmt, ...) rv_ptr_err("dev", ptr, fmt, ##__VA_ARGS__)
+#define rv_dev_info(ptr, fmt, ...) rv_ptr_info("dev", ptr, fmt, ##__VA_ARGS__)
+#define rv_dev_dbg(ptr, fmt, ...) rv_ptr_dbg("dev", ptr, fmt, ##__VA_ARGS__)
+
+/* For debugging with ib cm id */
+#define rv_cm_err(ptr, fmt, ...) rv_ptr_err("cm_id", ptr, fmt, ##__VA_ARGS__)
+#define rv_cm_info(ptr, fmt, ...) rv_ptr_info("cm_id", ptr, fmt, ##__VA_ARGS__)
+#define rv_cm_dbg(ptr, fmt, ...) rv_ptr_dbg("cm_id", ptr, fmt, ##__VA_ARGS__)
+
+struct rv_device;
+
+/*
+ * A listener can handle more than 1 job on a given dev.
+ * There is a listener for each unique service_id on each dev and
+ * could be as few as 1 even for multiple jobs.
+ * For IB, a listener services all ports on a given HCA.
+ * listener_entry - entry on rv_device.listener_list
+ * dev - device being listened on
+ * cm_id - actual IB CM listener, stores service_id here
+ */
+struct rv_listener {
+	struct list_head listener_entry;
+	struct rv_device *dev;
+	struct ib_cm_id *cm_id;
+	struct kref kref;
+};
+
+/*
+ * For each physical device, RV establishes a single rv_device which
+ * is shared across all jobs, listeners, etc.
+ * ib_dev - underlying RDMA device
+ * dev_entry - entry on rv_dev_list
+ *
+ * IB cm listener management:
+ *	listener_mutex, listener_lock, listener_list
+ *
+ * user_list - list of rv_user (protected by rv_dev_list_lock)
+ */
+struct rv_device {
+	struct ib_device *ib_dev;
+	struct list_head dev_entry;
+	struct kref kref;
+	struct ib_event_handler event_handler;
+
+	struct mutex listener_mutex; /* avoid duplicate add in get_alloc */
+	spinlock_t listener_lock; /* protect list search, add, remove */
+	struct list_head listener_list;
+
+	struct list_head user_list;
+};
+
+#define RV_CONN_MAX_ACTIVE_WQ_ENTRIES 100 /* used for conn handling */
+
+#define RV_RESOLVER_RETRY 5		/* max retries (0 == try once) */
+#define RV_RESOLVER_TIMEOUT 10000	/* in milliseconds */
+
+/* duration client spends in RV_DELAY before re-attempt reconnect */
+#define RV_RECONNECT_DELAY msecs_to_jiffies(100)
+
+/* private data in REQ and REP is used to negotiate features and exchange
+ * additional information.
+ * REQ.ver is the version of the sender and defines the priv_data structure.
+ * Newer versions of struct should be a superset of older versions.
+ * REP.ver is the version the listener is accepting and defines REP priv_data.
+ * New client, old listener:
+ *	listener accepts REQ >= its version, responds with its version in REP
+ * Old client, new listener:
+ *	listener accepts REQ, responds with REP matching client's version
+ * Protocol ver is now 2.  For simplicity, protocols 0 and 1 (pre-releases) are
+ * not accepted.
+ */
+
+#define RV_PRIVATE_DATA_MAGIC 0x00125550534d2121ULL
+#define RV_PRIVATE_DATA_VER 2
+
+/*
+ * Private data used in CM REQ
+ * laid out for good alignment of fields without packing
+ * index indicates index of rv_sconn within its rv_conn and differentiates
+ * the multiple connection REQs for num_conn > 1
+ * The job_key selects the rv_job_dev on receiver side
+ * uid is 32b
+ */
+struct rv_req_priv_data {
+	u64 magic;
+	u32 ver;
+	u16 resv1;
+	u8 index;
+	u8 job_key_len;
+	u8 job_key[RV_MAX_JOB_KEY_LEN];
+	uid_t uid;
+};
+
+/*
+ * Private data used in CM REP
+ * laid out for good alignment of fields without packing
+ */
+struct rv_rep_priv_data {
+	u64 magic;
+	u32 ver;
+};
+
+/*
+ * RV_INIT - client side create_conn done
+ * RV_WAITING - server side create_conn done, listening
+ * RV_RESOLVING - client side resolving ethernet dmac via ARP
+ * RV_CONNECTING - client or server side going through CM states
+ * RV_CONNECTED - connection established
+ * RV_DISCONNECTING - connection teardown
+ * RV_DELAY- RV delay before restarting client connection
+ * RV_ERROR - connection lost or failed
+ * destroying - ref count == 0
+ */
+enum rv_sconn_state {
+	RV_INIT = 1,
+	RV_WAITING,
+	RV_RESOLVING,
+	RV_CONNECTING,
+	RV_CONNECTED,
+	RV_DISCONNECTING,
+	RV_DELAY,
+	RV_ERROR,
+};
+
+/*
+ * Meaning of each state and status of key fields once in state and
+ * have unlocked mutex.
+ * In all states (except while destroying) immutable fields in rv_conn valid.
+ *
+ * WAS_CONNECTED flag essentially creates a superstate indicating connection
+ * recovery for WAITING, RESOLVING, and CONNECTING.  During connection
+ * recovery, conn_timer is running for reconnect_timeout and once it fires,
+ * the connect recovery is aborted and moved to ERROR
+ *
+ * If WAS_CONNECTED, post write in wrong state returns EAGAIN instead of EINVAL
+ *
+ * A complication is identifying when a connection is down for a receiving
+ * end of a traffic pattern.  The receiver may see no packets and can't tell
+ * a path down from an idle QP.  To address this, a periodic zero-length RDMA
+ * Write can be sent if no traffic has been sent or received for a while.  This
+ * situation is no worse than user space UD PSM as the job failure may
+ * occur over an out of band network to kill the job.  A gap is the client side
+ * of a connection desiring recovery, which requires the heartbeat to recognize.
+ *
+ * If the receiver happens to also be the server side of rv_sconn, we may
+ * get a REQ while in connected because the sender may get a QP timeout
+ * long before the receiver heartbeat notices.  We treat this as disconnect
+ * and if appropriate (likely) begin connection recovery.
+ *
+ * RV_INIT: initial state for client side connections (after 1st create_conn)
+ *	cm_id, primary_path are NULL
+ *	dev_addr, resolver_retry_left  uninitialized
+ *	qp in RESET state, no outstanding CQEs nor WQEs
+ *	conn_timer not running (no reconnect_timeout)
+ *	delay_timer not running
+ *	hb_timer not running
+ *	Next States:
+ *		RESOLVING - user calls cm_connect
+ *		destroying - user close
+ * RV_WAITING: initial state for server side connections (after 1st create_conn)
+ *	a listener exists at rv_job_dev level, rv_cm_server_handler
+ *	cm_id NULL
+ *	start_time is set when enter state
+ *	qp in RESET state, no outstanding CQEs nor WQEs
+ *	if WAS_CONNECTED, conn_timer is running for reconnect_timeout
+ *	delay_timer not running
+ *	hb_timer not running
+ *	DRAINING, SQ_DRAINED and RQ_DRAINED flags clear
+ *	Next States:
+ *		CONNECTING - inbound REQ
+ *		ERROR - reconnect_timeout expires
+ *		destroying - user close
+ * RV_RESOLVING: 1st step in establishing a client connection (ARP)
+ *	For non-ethernet, this is a brief transient state (only inside mutex)
+ *	cm_id established (client side), rv_cm_handler
+ *	resolver_retry_left valid
+ *	primary_path != NULL, but contents incomplete
+ *		dmac, route_resolved, hop_limit uninitialized
+ *	dev_addr undefined
+ *	start_time is set when enter state
+ *	a rdma_resolve_ip callback is scheduled
+ *	qp in RESET state, no outstanding CQEs nor WQEs
+ *	if WAS_CONNECTED, conn_timer is running for reconnect_timeout
+ *	delay_timer not running
+ *	hb_timer not running
+ *	DRAINING, SQ_DRAINED and RQ_DRAINED flags clear
+ *	Next States:
+ *		CONNECTING - resolution successfully complete
+ *		ERROR - resolving hard fail or retry exceeded or connect timeout
+ *		RESOLVING - cb error and < retry limit
+ *		DELAY - cb error > retry limit and WAS_CONNECTED and reconnect
+ *		destroying - user close
+ * RV_CONNECTING: client or server connection in hands of IB CM
+ *	cm_id established (either side), rv_cm_handler
+ *	primary_path NULL
+ *	dev_addr valid on client side
+ *	resolver_retry_left undefined
+ *	start_time is set when enter state
+ *	CM may have a rep or rtu outstanding
+ *	qp progressing through RESET->INIT->RTR->RTS via IB CM states/events
+ *	if WAS_CONNECTED, conn_timer is running for reconnect_timeout
+ *	delay_timer not running
+ *	hb_timer not running
+ *	DRAINING, SQ_DRAINED and RQ_DRAINED flags clear
+ *	Next States:
+ *		CONNECTED - client gets REP, server gets RTU or Established
+ *		ERROR - REJ, REQ err, REP err, DREQ, QP event or connect timeout
+ *		DISCONNECTING - WAS_CONNECTED and REQ err, REJ, REP err, QP evt
+ *				or REQ
+ *		destroying - user close
+ * RV_CONNECTED: client or server connection usable
+ *	cm_id established (either side), rv_cm_handler
+ *	primary_path NULL
+ *	dev_addr valid on client side
+ *	resolver_retry_left undefined
+ *	no outbound CM operations outstanding
+ *	start_time is set when enter state
+ *	qp in RTS, has recv WQEs posted
+ *	qp may have outstanding CQEs and WQEs
+ *	conn_timer not running (no reconnect_timeout)
+ *	delay_timer not running
+ *	hb_timer may be running (client side only)
+ *	WAS_CONNECTED is set on entry
+ *	DRAINING, SQ_DRAINED and RQ_DRAINED flags clear
+ *	heartbeat timer running
+ *		record post SQ and RQ CQE count when start timer
+ *		when fires, if counts same, send 0 len RDMA Write
+ *		TBD - set larger than timeout*retry or simply track SQ so
+ *			don't overflow SQ
+ *		TBD - stagger heartbeat sends on different sconns
+ *	Next States:
+ *		ERROR - no reconnect and get DREQ, QP event, REQ
+ *		DISCONNECTING - reconnect and get DREQ, QP event (send DREQ),
+ *				get REQ (remote end reconnecting)
+ *		destroying - user close
+ * RV_DISCONNECTING: client or server connection cleanup in prep for reconnect
+ *	cm_id established (either side), rv_cm_handler
+ *	primary_path NULL
+ *	dev_addr valid on client side
+ *	resolver_retry_left undefined
+ *	DREP or DREQ outbound CM operations may be outstanding
+ *	start_time is set when enter state
+ *	qp in ERROR, SCONN_DRAINING flag set,
+ *		- waiting for "drain indicator" WQEs' CQEs on RQ and SQ
+ *		- drain indicator CQEs set SQ_DRAINED or RQ_DRAINED
+ *	qp may have other outstanding CQEs and WQEs
+ *	if WAS_CONNECTED, conn_timer may be running for reconnect_timeout
+ *	delay_timer not running
+ *	hb_timer not running
+ *	Note: we do not depend on DREP and ignore DREQ Err or DREP Err
+ *		path to remote node may be down
+ *	Next States:
+ *		ERROR - unable to queue CM_id destroy or reset QP
+ *			or connect timeout
+ *		DELAY - SQ & RQ drain finish for client
+ *		WAITING - SQ & RQ drain finish for server (reset QP,
+ *				cm_id NULL and queued for destroy)
+ *		destroying - user close
+ * RV_DELAY: client reconnection delay
+ *	cm_id established, rv_cm_handler
+ *	primary_path NULL
+ *	dev_addr valid on client side
+ *	resolver_retry_left undefined
+ *	DREP or DREQ outbound CM operations may be outstanding on prior cm_id
+ *	start_time is set when enter state
+ *	qp in RESET
+ *	prior qp has no outstanding CQEs and WQEs
+ *	if WAS_CONNECTED, conn_timer may be running for reconnect_timeout
+ *	delay_timer is running
+ *	hb_timer not running
+ *	DRAINING, SQ_DRAINED and RQ_DRAINED flags clear
+ *	while in this state all other errors are ignored
+ *	Next States:
+ *		RESOLVING - timer expires
+ *		ERROR - reconnect timeout
+ *		destroying - user close
+ * RV_ERROR: terminal state for lost/failed connection
+ *	may enter from any state
+ *	cm_id may be NULL or established with rv_cm_handler
+ *	CM may have an outstanding message
+ *		typically REJ, DREQ or DREP, but could be REQ, REP or RTU
+ *	primary_path NULL
+ *	dev_addr, resolver_retry_left undefined
+ *	no rdma_resolve_ip callback scheduled (cancel when enter RV_ERROR)
+ *	qp in ERROR, SCONN_DRAINING flag set or qp in RESET or no QP
+ *	SQ_DRAINED and RQ_DRAINED flags may be progressing on drain
+ *	qp may have outstanding CQEs and WQEs
+ *	conn_timer not running (no reconnect_timeout)
+ *	delay_timer not running
+ *	hb_timer not running
+ *	Next States:
+ *		destroying - user close
+ * destroying: rv_conn kref is 0
+ *	may enter from any state
+ *	on entry:
+ *		cm_id may be NULL or established with rv_cm_handler
+ *		a rdma_resolve_ip callback may be scheduled
+ *		qp may have outstanding CQEs and WQEs and callbacks active
+ *		conn_timer may be running (reconnect_timeout or reconnect delay)
+ *	all fields (including immutable) will be released via rv_sconn_deinit
+ *	no way out other than kfree of parent rv_conn
+ *	may preempt, does not require mutex (kref==0 protects races with cb)
+ *	Next States:
+ *		N/A - conn/sconn freed
+ *
+ * typical client side state sequence:
+ *	RV_INIT->RV_RESOLVING->RV_CONNECTING->RV_CONNECTED->destroying
+ * typical server side state sequence:
+ *	RV_WAITING->RV_CONNECTING->RV_CONNECTED->destroying
+ *
+ * typical client side recovery state sequence:
+ *	...->RV_CONNECTED->DISCONNECTING->DELAY->RV_RESOLVING->...
+ * typical server side recovery state sequence:
+ *	...->RV_CONNECTED->DISCONNECTING->WAITING->...
+ */
+
+/*
+ * rv_sconn.flags bit numbers
+ *	RV_SCONN_SERVER	- server vs client side (immutable)
+ *	RV_SCONN_WAS_CONNECTED - was RV_CONNECTED at least once
+ *	RV_SCONN_DRAINING - started draining in DISCONNECTING
+ *	RV_SCONN_SQ_DRAINED - SQ drained in DISCONNECTING
+ *	RV_SCONN_RQ_DRAINED - RQ drained in DISCONNECTING
+ *	RV_SCONN_ASYNC - got async event
+ */
+#define RV_SCONN_SERVER		0
+#define RV_SCONN_WAS_CONNECTED	1
+#define RV_SCONN_DRAINING	2
+#define RV_SCONN_SQ_DRAINED	3
+#define RV_SCONN_RQ_DRAINED	4
+
+/*
+ * rv_sconn.stats.cm_evt_cnt[] classify events > CM_APR_RECEIVED as unexpected.
+ * Beyond that are only SIDR events
+ */
+#define RV_CM_EVENT_MAX IB_CM_APR_RECEIVED
+#define RV_CM_EVENT_UNEXP ((enum ib_cm_event_type)(RV_CM_EVENT_MAX + 1))
+
+/*
+ * a single QP/connection
+ * mutex - protects state driven fields
+ *
+ * These are set once at create and can be read without lock
+ *	index - unique index for connection within rv_conn
+ *	qp, send_cq, recv_cq, max_send_wr (from ib_qp_cap)
+ *	cqe - for recv completions
+ *
+ * Fields below require the mutex and their validity depends on the current
+ * value of rv_sconn.state and the RV_SCONN_SERVER flag.
+ *
+ * Basic fields:
+ *	state, flags
+ *	start_time - when started waiting, resolving or connecting
+ *	cm_id - our actual connection
+ *	path - from PSM.  Consistency checked on listener, for client connect
+ *		we use path.dlid != 0 to identify if path has been initialized
+ *
+ * Client only fields:
+ *	resolver_retry_left, primary_path, dev_addr
+ *	For RoCE our resolver step fills in dev_addr with resolved RoCE HW
+ *	addresses (aka MAC address)
+ *
+ * Async QP Draining,
+ *	drain_lock -  protects these and enter;test of RC_SCONN_*DRAIN* flags
+ *	drain_lock, rdrain_cqe, sdrain_cqe, drain_work (for drain done handler)
+ *	drain_timer - for drain timeout
+ *	done_wr_list - most recent completed pend_wr's
+ *	done_wr_count - number of entries on list
+ *
+ * reconnect_timeout timer: conn_timer, timer_work
+ * RV_DELAY timer: delay_timer, delay_work
+ * Heartbeat: hb_timer, act_count (activity since last hb), hb_cqe, hb_work
+ *
+ * Stats:
+ *	all but atomic CQE stats protected by rv_sconn.mutex
+ * connection:
+ *	*_time values are in microseconds
+ *	max_* is the largest observed for all reconnects on this sconn
+ *	stats for each CM packet explicitly sent (req, rep, rtu, dreq, drep)
+ *	stats for each CM packet explicitly send (req, rep, rtu, dreq, drep)
+ *	initial connect: wait_time, resolve_time, connect_time
+ *		connect time does not include wait_time nor resolve_time
+ *		resolve - attempts, resolve_fail - unexpected local issues
+ *	connected_time - total time connected (after initial + after recovery)
+ *	conn_recovery - # times recovered connection
+ *	connect recovery: rewait_time, reresolve_time, reconnect_time
+ *		reconnect time does not include wait_time nor resolve_time
+ *		reresolve - attempts, reresolve_fail - unexpected local issues
+ * data movement:
+ *	post_* is SQ posts (success, fail, payload byte count)
+ *	outstand_send_write - current send writes waiting for CQE
+ *	send_write_cqe_* is SQ cqes (RDMA Write w/Immediate)
+ *	recv_write_cqe_* is RQ cqes (RDMA Write w/Immediate)
+ *	recv_cqe_fail - RQ CQEs with bad status (opcode undefined)
+ *	*_hb_* - heartbeat
+ */
+
+struct rv_sconn {
+	struct mutex mutex; /* lock for state driven fields */
+	u8 index;
+	struct ib_qp *qp;
+	struct ib_cq *send_cq;
+	struct ib_cq *recv_cq;
+	struct ib_cqe cqe;
+	struct rv_conn *parent;
+
+	unsigned long flags;
+	enum rv_sconn_state state;
+	ktime_t start_time;
+	struct ib_cm_id *cm_id;
+	struct ib_user_path_rec path;
+
+	u32 resolver_retry_left;
+	struct sa_path_rec *primary_path;
+	struct rdma_dev_addr dev_addr;
+
+	/* protects these & enter;test RC_SCONN_*DRAIN* flags */
+	spinlock_t drain_lock;
+	struct ib_cqe rdrain_cqe;
+	struct ib_cqe sdrain_cqe;
+	struct work_struct drain_work;
+	struct timer_list drain_timer;
+
+	struct timer_list conn_timer;
+	struct work_struct timer_work;
+
+	struct timer_list delay_timer;
+	struct work_struct delay_work;
+
+	struct timer_list hb_timer;
+	u64 act_count;
+	struct ib_cqe hb_cqe;
+	struct work_struct hb_work;
+
+	struct {
+		u32 cm_evt_cnt[RV_CM_EVENT_MAX + 2];
+		u32 req_sent;
+		u32 rep_sent;
+		u32 rtu_sent;
+		u32 rej_sent;
+		u32 dreq_sent;
+		u32 drep_sent;
+		u64 wait_time;
+		u64 resolve_time;
+		u64 connect_time;
+		u64 connected_time;
+		u32 resolve;
+		u32 resolve_fail;
+		u32 conn_recovery;
+		u64 rewait_time;
+		u64 reresolve_time;
+		u64 reconnect_time;
+		u64 max_rewait_time;
+		u64 max_reresolve_time;
+		u64 max_reconnect_time;
+		u32 reresolve;
+		u32 reresolve_fail;
+		u64 post_write;
+		u64 post_write_fail;
+		u64 post_write_bytes;
+		u64 post_hb;
+		u64 post_hb_fail;
+
+		atomic_t outstand_send_write;
+		atomic64_t send_write_cqe;
+		atomic64_t send_write_cqe_fail;
+		atomic64_t recv_write_cqe;
+		atomic64_t recv_write_bytes;
+		atomic64_t recv_cqe_fail;
+		atomic64_t send_hb_cqe;
+		atomic64_t send_hb_cqe_fail;
+		atomic64_t recv_hb_cqe;
+	} stats;
+};
+
+/*
+ * A load balanced multi QP connection using multiple underlying connections
+ * and is shared by multiple rv_user's.
+ * num_conn, jdev, ah and rem_addr are immutable (set once at create)
+ * Entry in rv_job_dev.conn_list: conn_entry, rcu
+ * sconn round robin IO: next_lock, next
+ *	next_lock also protects read;inc of sconn->stats.outstand_send_write
+ */
+struct rv_conn {
+	u8 num_conn;
+	struct rv_job_dev *jdev;
+	struct ib_uverbs_ah_attr ah;
+	u32 rem_addr;
+
+	struct list_head conn_entry;
+	struct rcu_head rcu;
+
+	struct kref kref;
+	struct work_struct put_work;
+
+	spinlock_t next_lock; /* protect rv_conn.next & read;inc outstand_wr */
+	u8 next;
+	struct work_struct free_work;
+
+	struct rv_sconn sconn_arr[];
+};
+
+/*
+ * from IBTA 1.4 Vol1 section A3.2.3.4.  Externally Admined Service IDs
+ * 1h:OUI:OUI:OUI:hh:hh:hh:hh
+ * using Intel 00:12:55 OUI
+ */
+#define RV_DFLT_SERVICE_ID 0x1000125500000001ULL
+
+/*
+ * the set of MRs registered by a single rv_user (on a single NIC)
+ * These are cached for efficiency.
+ * When using kernel rendezvous QPs (eg. rv_conn) these MRs are
+ * registered with the rv_job_dev.pd.
+ * When using user space QPs, we register with the user supplied pd
+ * cache has its own lock
+ * jdev, rv_inx, CQs, QP set once at alloc
+ * need parent rv_user.mutex for: cqe, post REG_MR to QP (doit_reg_mem), stats
+ */
+struct rv_user_mrs {
+	struct rv_mr_cache cache;
+	struct rv_job_dev *jdev;
+	int rv_inx; /* our creator, for logging */
+
+	struct kref kref;
+	struct work_struct put_work;
+
+	struct {
+		u64 failed;	/* cache miss and failed to register */
+	} stats;
+};
+
+/*
+ * Resources shared among all rv_user's using a given NIC port within a job
+ * for a given kuid.  On the wire we use uid to limit jobs to a single user.
+ * This approach has similarities to the hfi1 and Omni-Path jkey concept
+ * but we are not restricted to the 16b HW jkey field here, so we use all
+ * 32b of uid on the wire.
+ *
+ * There will typically be 1-8 NICs per node and 1-2 concurrent jobs
+ * with one rv_user per CPU core per NIC.
+ *
+ * kref tracks total jdev references while user_kref is the subset of
+ * references representing attached rv_user objects. user_kref <= kref.
+ * To get_alloc a jdev we must successfully get both. Once user_kref
+ * is zero, the jdev is removed from rv_job_dev_list and future attach
+ * or inbound REQ processing will not find it anymore.  At this point it is
+ * destined for destruction as soon as the remaining callbacks with a reference
+ * occur/finish. jdev get_alloc searches which see (and skip) jdevs with
+ * user_kref==0 but kref>0 only happen when a new job starts without a unique
+ * job key, as the previous job is cleaning up. Such cases are rare.
+ * By decoupling the jobs into different jdev's the new job may also have
+ * different attach and connection params.
+ *
+ * These fields are immutable and can be accessed without a lock:
+ *	kuid, uid, dev, pd, dev_name, port_num,
+ *	num_conn - number of rv_sconn per rv_conn
+ *	index_bits - number of high bits of RDMA immed data to hold rv index
+ *	loc_gid_index - SGID - only used on client rv_sconn
+ *	loc_addr - abstraction of address compared with rem_addr to select
+ *		client/server mode for each rv_sconn at conn_create time.
+ *	log_gid - SGID - to double check loc_gid_index still is same GID
+ *	service_id, q_depth
+ *	qp_depth - max send and recv WQEs to use (N/A space for drain)
+ *	reconnect_timeout (seconds), hb_interval (milliseconds),
+ *	sgid_attr - local NIC gid & address for use by resolver
+ *	max_users - 1<<index_bits
+ *
+ * job_dev_entry, rcu - entry on rv_job_dev_list
+ * conn_list - list of shared rv_conn, protected by RCU and conn_list_mutex
+ *	conn_list_mutex prevents duplicate add in get_alloc
+ * listener - created and shared when 1st server rv_conn is added
+ * user_array[max_users] - rv_users sharing this jdev. user_array_lock protects
+ *	rv_user.index - subscript to this array for given rv_user so
+ *		RDMA recv immediate data can simply index to find which
+ *		rv_user to deliver the completion to.
+ *	user_array_next - where to start next search for free slot
+ */
+struct rv_job_dev {
+	uid_t uid;
+	kuid_t kuid;
+	struct rv_device *dev;
+	struct ib_pd *pd;
+	char dev_name[RV_MAX_DEV_NAME_LEN];
+	u8 port_num;
+	u8 num_conn;
+	u8 index_bits;
+	u16 loc_gid_index;
+	u32 loc_addr;
+	u8 loc_gid[16];
+	u8 job_key[RV_MAX_JOB_KEY_LEN];
+	u8 job_key_len;
+	u64 service_id;
+	u32 q_depth;
+	u32 qp_depth;
+	u32 reconnect_timeout;
+	u32 hb_interval;
+	const struct ib_gid_attr *sgid_attr;
+	int max_users;
+
+	struct kref kref;
+	struct list_head job_dev_entry;
+	struct rcu_head rcu;
+
+	struct mutex conn_list_mutex; /* prevent duplicate add in get_alloc */
+	struct list_head conn_list;
+	struct rv_listener *listener;
+
+	spinlock_t user_array_lock;/* protects add/remove from user_array */
+	u32 user_array_next;
+	struct kref user_kref;
+	struct rv_user *user_array[];
+};
+
+/*
+ * rv_user represents a single open fd from a user
+ * In multi-rail a process may have multiple rv_user (separate open/close)
+ *
+ * mutex - prevents concurrent ATTACH ioctl and protects conn_list
+ *	also protects doit_reg vs self and vs doit_dreg races
+ * attached - set last during ATTACH after dev/jdev, rdma_mode and cq_entries
+ *	have been set.  We have no detach.
+ * inx - immutable ID assigned to rv_user strictly for logging
+ * rdma_mode - indicates USER (MRs only) or KERNEL (jdev, conns, etc) attach
+ *	For rdma_mode KERNEL these are also valid:
+ *		context, cq_entries
+ *		index - rv_user index within rv_job_dev.user_array[]
+ *		cqr - event ring to deliver send and recv RDMA completion
+ *			events to PSM (only if ATTACH.cq_entries!=0)
+ * umrs - rv_user_mrs tracking the MRs registered by this rv_user
+ * conn_xa - ID indexed list of rv_conn (entries assigned on create_conn)
+ *	ID 0 is reserved, PSM uses this value to indicate uninitialized rv intf
+ *	rv_user.mutex protects, so no need to use xas_lock.
+ * user_entry - entry in rv_device.user_list
+ * compl - completion for detach
+ */
+struct rv_user {
+	struct mutex mutex; /* single thread most actions for a user */
+
+	int inx;
+
+	u8 rdma_mode;
+	u8 attached;
+	u8 was_attached;
+	union {
+		struct rv_device *dev;
+		struct rv_job_dev *jdev;
+	};
+
+	u64 context;
+	u32 cq_entries;
+	u16 index;
+	struct rv_user_ring *cqr;
+	struct rv_user_mrs *umrs;
+
+	struct xarray conn_xa;
+
+	struct list_head user_entry;
+	struct completion compl;
+};
+
+/*
+ * an event ring for use by a single rv_user
+ * allows events to be efficiently passed from rv to PSM for PSM polling
+ * Immutable fields set on alloc:
+ *	rv_inx - our creator's rv_user.inx (for logging)
+ *	num_entries, page, order
+ * lock - protects kernel posting to ring and stats
+ * hdr - mmapped into PSM
+ * stats - index for each is rv_event_type excluding RV_TEST_EVENT
+ */
+struct rv_user_ring {
+	int rv_inx;
+	u32 num_entries;
+	unsigned long page;
+	unsigned int order;
+
+	spinlock_t lock; /* protects posting to ring and stats*/
+	struct rv_ring_header *hdr;
+	struct {
+		u64 cqe[2];
+		u64 cqe_fail[2];
+		u64 bytes[2];
+	} stats;
+};
+
+/*
+ * an inflight RDMA write on behalf of an rv_user
+ *
+ * user_index - rv_user index within rv_job_dev.user_array[]
+ * umrs - MR cache holding the local MR
+ * mrc, loc_addr - local MR (with lkey), source memory address for RDMA Write
+ * rkey, rem_addr - dest memory address for RDMA Write
+ * wr_id - for completion event to PSM (supplied by PSM)
+ * done_wr_entry - Entry in rv_sconn.done_wr_list
+ */
+struct rv_pend_write {
+	struct ib_cqe cqe;
+	u16 user_index;
+	struct rv_user_mrs *umrs;
+	struct rv_sconn *sconn;
+
+	struct rv_mr_cached *mrc;
+	u64 loc_addr;
+	u32 rkey;
+	u64 rem_addr;
+	size_t length;
+	u32 immed;
+	u64 wr_id;
+};
+
+extern unsigned int enable_user_mr;
+
+/* Prototypes */
+struct rv_device *rv_device_get_add_user(char *dev_name, struct rv_user *rv);
+void rv_device_get(struct rv_device *dev);
+void rv_device_put(struct rv_device *dev);
+int rv_device_del_user(struct rv_user *rv);
+
+void rv_listener_get(struct rv_listener *listener);
+void rv_listener_put(struct rv_listener *listener);
+struct rv_listener *rv_listener_get_alloc(struct rv_device *dev,
+					  u64 service_id,
+					  ib_cm_handler handler);
+
+int rv_file_init(void);
+void rv_file_uninit(void);
+void rv_queue_work(struct work_struct *work);
+void rv_queue_work2(struct work_struct *work);
+void rv_flush_work2(void);
+void rv_queue_work3(struct work_struct *work);
+
+int doit_reg_mem(struct rv_user *rv, unsigned long arg);
+int doit_dereg_mem(struct rv_user *rv, unsigned long arg);
+void rv_user_mrs_get(struct rv_user_mrs *umrs);
+void rv_user_mrs_put(struct rv_user_mrs *umrs);
+struct rv_user_mrs *rv_user_mrs_alloc(struct rv_user *rv, u32 cache_size);
+void rv_user_mrs_attach(struct rv_user *rv);
+int rv_drv_api_dereg_mem(struct mr_info *mr);
+
+int rv_drv_prepost_recv(struct rv_sconn *sconn);
+void rv_recv_done(struct ib_cq *cq, struct ib_wc *wc);
+void rv_report_cqe_error(struct ib_cq *cq, struct ib_wc *wc,
+			 struct rv_sconn *sconn, const char *opname);
+int doit_post_rdma_write(struct rv_user *rv, unsigned long arg);
+
+static inline struct rv_conn *user_conn_find(struct rv_user *rv, u64 handle)
+{
+	return xa_load(&rv->conn_xa, handle);
+}
+
+void rv_conn_put(struct rv_conn *conn);
+int rv_conn_get_check(struct rv_conn *conn);
+void rv_conn_get(struct rv_conn *conn);
+int doit_conn_create(struct rv_user *rv, unsigned long arg);
+int doit_conn_connect(struct rv_user *rv, unsigned long arg);
+int doit_conn_connected(struct rv_user *rv, unsigned long arg);
+int doit_conn_get_conn_count(struct rv_user *rv, unsigned long arg);
+int doit_conn_get_stats(struct rv_user *rv, unsigned long arg);
+int cmp_gid(const void *gid1, const void *gid2);
+
+void rv_job_dev_get(struct rv_job_dev *jdev);
+void rv_job_dev_put(struct rv_job_dev *jdev);
+
+static inline bool rv_job_dev_has_users(struct rv_job_dev *jdev)
+{
+	return kref_read(&jdev->user_kref);
+}
+
+static inline bool rv_jdev_protocol_roce(const struct rv_job_dev *jdev)
+{
+	return rdma_protocol_roce(jdev->dev->ib_dev, jdev->port_num);
+}
+
+struct rv_sconn *
+rv_find_sconn_from_req(struct ib_cm_id *id,
+		       const struct ib_cm_req_event_param *param,
+		       struct rv_req_priv_data *priv_data);
+
+void rv_detach_user(struct rv_user *rv);
+
+#endif /* __RV_H__ */
diff --git a/drivers/infiniband/ulp/rv/rv_mr_cache.h b/drivers/infiniband/ulp/rv/rv_mr_cache.h
new file mode 100644
index 000000000000..6d36ff11b5b6
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv_mr_cache.h
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+#ifndef __RV_MR_CACHE_H__
+#define __RV_MR_CACHE_H__
+
+#include <linux/types.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/mmu_notifier.h>
+#include <linux/interval_tree_generic.h>
+
+#define MAX_RB_SIZE 256 /* Default MR cache size in MB */
+#define RV_RB_MAX_ACTIVE_WQ_ENTRIES 5
+
+/*
+ * The MR cache holds registered MRs and tracks reference counts for each.
+ * Entries with a refcount==0 may remain in the cache and on an lru_list.
+ * If the MMU notifier indicates pages are about to be freed, the entry
+ * is removed from the cache if its refcount==0.  Otherwise there
+ * are IOs in flight (the app should not free memory for buffers with IOs
+ * in flight) and the entry is not removed.
+ * If a new cache entry is needed (cache miss), entries will be evicted, oldest
+ * to newest based on the lru_list, until there is space for the new entry.
+ *
+ * max_size - limit allowed for total_size in bytes, immutable
+ * ops_arg - owner context for all ops calls, immutable
+ * mn - MMU notifier
+ * lock - protects the RB-tree, lru_list, del_list, total_size, and stats
+ * root - an RB-tree with an interval based lookup
+ * total_size - current bytes in the cache
+ * ops - owner callbacks for major cache events
+ * mm - for MMU notifier
+ * lru_list - ordered list, most recently used to least recently used
+ * del_work, del_list, wq - handle deletes on a work queue
+ *
+ * Statistics:
+ *	max_cache_size - max bytes in the cache
+ *	count - Current number of MRs in the cache
+ *	max_count - Maximum of count
+ *	inuse - Current number of MRs with refcount > 0
+ *	max_inuse - Maximum of inuse
+ *	inuse_bytes - number of bytes with refcount > 0
+ *	max_inuse_bytes - Maximum of inuse_bytes
+ *	max_refcount - Maximum of refcount for any MR
+ *	hit - Cache hit
+ *	miss - Cache miss and added
+ *	full - Cache miss and can't add since full
+ *	evict - Removed due to lack of cache space
+ *	remove - Refcount==0 and removed by MMU notifier event
+ */
+struct rv_mr_cache {
+	u64 max_size;
+	void *ops_arg;
+	struct mmu_notifier mn;
+	spinlock_t lock; /* See above */
+	struct rb_root_cached root;
+	u64 total_size;
+	const struct rv_mr_cache_ops *ops;
+	struct mm_struct *mm;
+	struct list_head lru_list;
+	struct work_struct del_work;
+	struct list_head del_list;
+	struct workqueue_struct *wq;
+
+	struct {
+		u64 max_cache_size;
+		u32 count;
+		u32 max_count;
+		u32 inuse;
+		u32 max_inuse;
+		u64 inuse_bytes;
+		u64 max_inuse_bytes;
+		u32 max_refcount;
+		u64 hit;
+		u64 miss;
+		u64 full;
+		u64 evict;
+		u64 remove;
+	} stats;
+};
+
+/*
+ * basic info about an MR
+ * ib_pd - converted from user version
+ * fd - converted from user provided cmd_fd
+ */
+struct mr_info {
+	struct ib_mr *ib_mr;
+	struct ib_pd *ib_pd;
+	struct fd fd;
+};
+
+/* an MR entry in the MR cache RB-tree */
+struct rv_mr_cached {
+	struct mr_info mr;
+	u64 addr;
+	u64 len;
+	u32 access;
+	u64 __last;
+	atomic_t refcount;
+	struct rb_node node;
+	struct list_head list;
+};
+
+/* callbacks for each major cache event */
+struct rv_mr_cache_ops {
+	bool (*filter)(struct rv_mr_cached *mrc, u64 addr, u64 len, u32 acc);
+	void (*get)(struct rv_mr_cache *cache,
+		    void *ops_arg, struct rv_mr_cached *mrc);
+	int (*put)(struct rv_mr_cache *cache,
+		   void *ops_arg, struct rv_mr_cached *mrc);
+	int (*invalidate)(struct rv_mr_cache *cache,
+			  void *ops_arg, struct rv_mr_cached *mrc);
+	int (*evict)(struct rv_mr_cache *cache,
+		     void *ops_arg, struct rv_mr_cached *mrc,
+		     void *evict_arg, bool *stop);
+};
+
+void rv_mr_cache_update_stats_max(struct rv_mr_cache *cache,
+				  int refcount);
+
+int rv_mr_cache_insert(struct rv_mr_cache *cache, struct rv_mr_cached *mrc);
+
+void rv_mr_cache_evict(struct rv_mr_cache *cache, void *evict_arg);
+
+struct rv_mr_cached *rv_mr_cache_search_get(struct rv_mr_cache *cache,
+					    u64 addr, u64 len, u32 acc,
+					    bool update_hit);
+struct rv_mr_cached *rv_mr_cache_search_put(struct rv_mr_cache *cache,
+					    u64 addr, u64 len, u32 acc);
+void rv_mr_cache_put(struct rv_mr_cache *cache, struct rv_mr_cached *mrc);
+
+int rv_mr_cache_init(int rv_inx, struct rv_mr_cache *cache,
+		     const struct rv_mr_cache_ops *ops, void *priv,
+		     struct mm_struct *mm, u32 cache_size);
+void rv_mr_cache_deinit(int rv_inx, struct rv_mr_cache *cache);
+
+/*
+ * evict operation argument
+ * cleared - count evicted so far in bytes
+ * target - target count to evict in bytes
+ */
+struct evict_data {
+	u64 cleared;
+	u64 target;
+};
+
+#endif /* __RV_MR_CACHE_H__ */
-- 
2.18.1


* [PATCH RFC 3/9] RDMA/rv: Add the rv module
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 1/9] RDMA/rv: Public interferce for the RDMA Rendezvous module kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 2/9] RDMA/rv: Add the internal header files kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 4/9] RDMA/rv: Add functions for memory region cache kaike.wan
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

Add the rv module, the Makefile, and Kconfig file.

Also add the functions to manage IB devices.

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 MAINTAINERS                           |   6 +
 drivers/infiniband/Kconfig            |   1 +
 drivers/infiniband/ulp/Makefile       |   1 +
 drivers/infiniband/ulp/rv/Kconfig     |  11 ++
 drivers/infiniband/ulp/rv/Makefile    |   9 +
 drivers/infiniband/ulp/rv/rv_main.c   | 266 ++++++++++++++++++++++++++
 drivers/infiniband/ulp/rv/trace.c     |   7 +
 drivers/infiniband/ulp/rv/trace.h     |   5 +
 drivers/infiniband/ulp/rv/trace_dev.h |  82 ++++++++
 9 files changed, 388 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rv/Kconfig
 create mode 100644 drivers/infiniband/ulp/rv/Makefile
 create mode 100644 drivers/infiniband/ulp/rv/rv_main.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_dev.h

diff --git a/MAINTAINERS b/MAINTAINERS
index d92f85ca831d..ba50affec9bc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15547,6 +15547,12 @@ L:	linux-rdma@vger.kernel.org
 S:	Maintained
 F:	drivers/infiniband/ulp/rtrs/
 
+RV DRIVER
+M:	Kaike Wan <kaike.wan@intel.com>
+L:	linux-rdma@vger.kernel.org
+S:	Supported
+F:	drivers/infiniband/ulp/rv
+
 RXRPC SOCKETS (AF_RXRPC)
 M:	David Howells <dhowells@redhat.com>
 L:	linux-afs@lists.infradead.org
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 04a78d9f8fe3..5086164c836f 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -107,5 +107,6 @@ source "drivers/infiniband/ulp/isert/Kconfig"
 source "drivers/infiniband/ulp/rtrs/Kconfig"
 
 source "drivers/infiniband/ulp/opa_vnic/Kconfig"
+source "drivers/infiniband/ulp/rv/Kconfig"
 
 endif # INFINIBAND
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index 4d0004b58377..f925deb9241c 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_INFINIBAND_ISER)		+= iser/
 obj-$(CONFIG_INFINIBAND_ISERT)		+= isert/
 obj-$(CONFIG_INFINIBAND_OPA_VNIC)	+= opa_vnic/
 obj-$(CONFIG_INFINIBAND_RTRS)		+= rtrs/
+obj-$(CONFIG_INFINIBAND_RV)		+= rv/
diff --git a/drivers/infiniband/ulp/rv/Kconfig b/drivers/infiniband/ulp/rv/Kconfig
new file mode 100644
index 000000000000..32a0523ff8ce
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+#
+# Copyright(c) 2020 - 2021 Intel Corporation.
+#
+config INFINIBAND_RV
+	tristate "InfiniBand Rendezvous Module"
+	depends on X86_64 && INFINIBAND
+	help
+	  The rendezvous module provides mechanisms for HPC middleware
+	  to cache memory region registration, to manage connections
+	  between nodes, and to improve the scalability of RDMA transactions.
diff --git a/drivers/infiniband/ulp/rv/Makefile b/drivers/infiniband/ulp/rv/Makefile
new file mode 100644
index 000000000000..07a7a7dd9c3b
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+#
+# Copyright(c) 2020 - 2021 Intel Corporation.
+#
+obj-$(CONFIG_INFINIBAND_RV) += rv.o
+
+rv-y := rv_main.o trace.o
+
+CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/ulp/rv/rv_main.c b/drivers/infiniband/ulp/rv/rv_main.c
new file mode 100644
index 000000000000..7f81f97a01f0
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv_main.c
@@ -0,0 +1,266 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+/* This file contains the base of the rendezvous RDMA driver */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/err.h>
+#include <linux/parser.h>
+
+#include <rdma/ib_user_sa.h>
+
+#include "rv.h"
+#include "trace.h"
+
+MODULE_AUTHOR("Kaike Wan");
+MODULE_DESCRIPTION("Rendezvous Module");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static int rv_add_one(struct ib_device *device);
+static void rv_remove_one(struct ib_device *device, void *client_data);
+static void rv_rename_dev(struct ib_device *device, void *client_data);
+
+static struct ib_client rv_client = {
+	.name = "rv",
+	.add = rv_add_one,
+	.remove = rv_remove_one,
+	.rename = rv_rename_dev
+};
+
+static struct list_head rv_dev_list;	/* list of rv_device */
+static spinlock_t rv_dev_list_lock;
+
+/* get a device reference and add an rv_user to rv_device.user_list */
+struct rv_device *rv_device_get_add_user(char *dev_name, struct rv_user *rv)
+{
+	struct rv_device *dev;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rv_dev_list_lock, flags);
+	list_for_each_entry(dev, &rv_dev_list, dev_entry) {
+		if (strcmp(dev->ib_dev->name, dev_name) == 0) {
+			if (!kref_get_unless_zero(&dev->kref))
+				continue; /* skip, going away */
+			list_add_tail(&rv->user_entry, &dev->user_list);
+			spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+			trace_rv_dev_get(dev_name, kref_read(&dev->kref));
+			return dev;
+		}
+	}
+	spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+	rv_err(RV_INVALID, "Could not find IB dev %s\n", dev_name);
+	return NULL;
+}
+
+static void rv_device_release(struct kref *kref)
+{
+	struct rv_device *dev = container_of(kref, struct rv_device, kref);
+
+	ib_unregister_event_handler(&dev->event_handler); /* may need sooner */
+	kfree(dev);
+}
+
+void rv_device_get(struct rv_device *dev)
+{
+	kref_get(&dev->kref);
+}
+
+void rv_device_put(struct rv_device *dev)
+{
+	trace_rv_dev_put(dev->ib_dev ? dev->ib_dev->name : "nil",
+			 kref_read(&dev->kref));
+	kref_put(&dev->kref, rv_device_release);
+}
+
+/*
+ * Remove a rv_user from rv_device.user_list
+ *
+ * @rv - The rv_user to remove
+ *
+ * Return:
+ *   0 - The rv_user is in rv_device.user_list and removed;
+ *   1 - The rv_user was already removed from rv_device.user_list.
+ */
+int rv_device_del_user(struct rv_user *rv)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&rv_dev_list_lock, flags);
+	if (list_empty(&rv->user_entry))
+		ret = 1;
+	else
+		list_del_init(&rv->user_entry);
+	spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+
+	return ret;
+}
+
+/* verbs device level async events */
+static void rv_device_event_handler(struct ib_event_handler *handler,
+				    struct ib_event *event)
+{
+	struct rv_device *dev;
+
+	dev = ib_get_client_data(event->device, &rv_client);
+	if (!dev || dev->ib_dev != event->device)
+		return;
+
+	trace_rv_device_event(dev->ib_dev->name, ib_event_msg(event->event));
+	switch (event->event) {
+	case IB_EVENT_DEVICE_FATAL:
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_PORT_ACTIVE:
+	case IB_EVENT_LID_CHANGE:
+	case IB_EVENT_PKEY_CHANGE:
+	case IB_EVENT_SM_CHANGE:
+	case IB_EVENT_CLIENT_REREGISTER:
+	case IB_EVENT_GID_CHANGE:
+	default:
+		break;
+	}
+}
+
+static int rv_add_one(struct ib_device *device)
+{
+	struct rv_device *dev;
+	unsigned long flags;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return -ENOMEM;
+	dev->ib_dev = device;
+	kref_init(&dev->kref);
+	mutex_init(&dev->listener_mutex);
+	spin_lock_init(&dev->listener_lock);
+	INIT_LIST_HEAD(&dev->listener_list);
+	INIT_LIST_HEAD(&dev->user_list);
+	spin_lock_irqsave(&rv_dev_list_lock, flags);
+	list_add(&dev->dev_entry, &rv_dev_list);
+	spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+	trace_rv_dev_add(device->name, kref_read(&dev->kref));
+	ib_set_client_data(device, &rv_client, dev);
+
+	INIT_IB_EVENT_HANDLER(&dev->event_handler, device,
+			      rv_device_event_handler);
+	ib_register_event_handler(&dev->event_handler);
+
+	return 0;
+}
+
+/*
+ * Called on device removal, gets users off the device
+ *
+ * At the same time, applications will get device async events which should
+ * trigger them to start user space cleanup and close.
+ *
+ * We remove the rv_user from the user_list so that the user application knows
+ * that the remove_one handler is cleaning up this rv_user. After this,
+ * the rv->user_entry itself is an empty list, an indicator that the
+ * remove_one handler owns this rv_user.
+ *
+ * To comply with lock hierarchy, we must release rv_dev_list_lock so
+ * rv_detach_user can get rv->mutex.  The empty rv->user_entry will prevent
+ * a race with rv_user starting its own detach.
+ */
+static void rv_device_detach_users(struct rv_device *dev)
+{
+	unsigned long flags;
+	struct rv_user *rv;
+
+	spin_lock_irqsave(&rv_dev_list_lock, flags);
+	while (!list_empty(&dev->user_list)) {
+		rv = list_first_entry(&dev->user_list, struct rv_user,
+				      user_entry);
+		list_del_init(&rv->user_entry);
+
+		spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+		/* Detach user here */
+		spin_lock_irqsave(&rv_dev_list_lock, flags);
+	}
+	spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+}
+
+/*
+ * device removal handler
+ *
+ * We allow a wait_time of 2 seconds for applications to clean up and
+ * close.  Typically they will get an async event and react quickly.
+ * After that we forcibly remove the remaining users and then wait for
+ * the internal references to be released by their callbacks.
+ */
+static void rv_remove_one(struct ib_device *device, void *client_data)
+{
+	struct rv_device *dev = client_data;
+	unsigned long flags;
+	unsigned long wait_time = 2000; /* 2 seconds */
+	unsigned long sleep_time = msecs_to_jiffies(100);
+	unsigned long end;
+
+	trace_rv_dev_remove(device->name, kref_read(&dev->kref));
+	spin_lock_irqsave(&rv_dev_list_lock, flags);
+	list_del(&dev->dev_entry);
+	spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+
+	end = jiffies + msecs_to_jiffies(wait_time);
+	while (time_before(jiffies, end) && !list_empty(&dev->user_list))
+		schedule_timeout_interruptible(sleep_time);
+
+	rv_device_detach_users(dev);
+
+	while (kref_read(&dev->kref) > 1)
+		schedule_timeout_interruptible(sleep_time);
+
+	rv_device_put(dev);
+}
+
+static void rv_rename_dev(struct ib_device *device, void *client_data)
+{
+}
+
+static void rv_init_devices(void)
+{
+	spin_lock_init(&rv_dev_list_lock);
+	INIT_LIST_HEAD(&rv_dev_list);
+}
+
+/* uses synchronize_rcu to ensure previous kfree_rcu of references is done */
+static void rv_deinit_devices(void)
+{
+	struct rv_device *dev, *temp;
+	unsigned long flags;
+
+	synchronize_rcu();
+	spin_lock_irqsave(&rv_dev_list_lock, flags);
+	list_for_each_entry_safe(dev, temp, &rv_dev_list, dev_entry) {
+		list_del(&dev->dev_entry);
+		rv_device_put(dev);
+	}
+	spin_unlock_irqrestore(&rv_dev_list_lock, flags);
+}
+
+static int __init rv_init_module(void)
+{
+	pr_info("Loading rendezvous module\n");
+
+	rv_init_devices();
+
+	if (ib_register_client(&rv_client)) {
+		rv_err(RV_INVALID, "Failed to register with the IB core\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void __exit rv_cleanup_module(void)
+{
+	ib_unregister_client(&rv_client);
+	rv_deinit_devices();
+}
+
+module_init(rv_init_module);
+module_exit(rv_cleanup_module);
diff --git a/drivers/infiniband/ulp/rv/trace.c b/drivers/infiniband/ulp/rv/trace.c
new file mode 100644
index 000000000000..b27536056c60
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace.c
@@ -0,0 +1,7 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#define CREATE_TRACE_POINTS
+#include <rdma/rv_user_ioctls.h>
+#include "trace.h"
diff --git a/drivers/infiniband/ulp/rv/trace.h b/drivers/infiniband/ulp/rv/trace.h
new file mode 100644
index 000000000000..cb1d1d087e16
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#include "trace_dev.h"
diff --git a/drivers/infiniband/ulp/rv/trace_dev.h b/drivers/infiniband/ulp/rv/trace_dev.h
new file mode 100644
index 000000000000..2bfc6b07d518
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace_dev.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#if !defined(__RV_TRACE_DEV_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __RV_TRACE_DEV_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rv_dev
+
+DECLARE_EVENT_CLASS(/* dev */
+	rv_dev_template,
+	TP_PROTO(const char *dev_name, u32 refcount),
+	TP_ARGS(dev_name, refcount),
+	TP_STRUCT__entry(/* entry */
+		__string(name, dev_name)
+		__field(u32, refcount)
+	),
+	TP_fast_assign(/* assign */
+		__assign_str(name, dev_name);
+		__entry->refcount = refcount;
+	),
+	TP_printk(/* print */
+		"name %s, refcount %u",
+		__get_str(name),
+		__entry->refcount
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_dev_template, rv_dev_add,
+	TP_PROTO(const char *dev_name, u32 refcount),
+	TP_ARGS(dev_name, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_dev_template, rv_dev_remove,
+	TP_PROTO(const char *dev_name, u32 refcount),
+	TP_ARGS(dev_name, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_dev_template, rv_dev_get,
+	TP_PROTO(const char *dev_name, u32 refcount),
+	TP_ARGS(dev_name, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_dev_template, rv_dev_put,
+	TP_PROTO(const char *dev_name, u32 refcount),
+	TP_ARGS(dev_name, refcount)
+);
+
+TRACE_EVENT(/* event */
+	rv_device_event,
+	TP_PROTO(const char *dev_name, const char *evt_name),
+	TP_ARGS(dev_name, evt_name),
+	TP_STRUCT__entry(/* entry */
+		__string(device, dev_name)
+		__string(event, evt_name)
+	),
+	TP_fast_assign(/* assign */
+		__assign_str(device, dev_name);
+		__assign_str(event, evt_name);
+	),
+	TP_printk(/* print */
+		"Device %s Event %s",
+		__get_str(device),
+		__get_str(event)
+	)
+);
+
+#endif /* __RV_TRACE_DEV_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_dev
+#include <trace/define_trace.h>
-- 
2.18.1


* [PATCH RFC 4/9] RDMA/rv: Add functions for memory region cache
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
                   ` (2 preceding siblings ...)
  2021-03-19 12:56 ` [PATCH RFC 3/9] RDMA/rv: Add the rv module kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 5/9] RDMA/rv: Add function to register/deregister memory region kaike.wan
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

The MR cache is implemented as an RB tree. Each node is indexed
by a simple (address, length, access_flags) tuple, without any
consideration of buffer overlap. When a node's refcount drops
to 0, the node is not removed from the cache. Instead, it is placed
on an LRU list from which it can be evicted if the cache memory limit
is reached. However, if the user buffer backing the memory region is
freed, the node is removed when the MMU notifier event is received.
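
For reference, below is a minimal caller sketch of the cache API
declared in rv_mr_cache.h (patch 2/9). Error handling is elided and
register_new_mr() is a hypothetical placeholder for the owner's MR
registration path, not a function in this series:

	struct rv_mr_cached *mrc;
	int ret;

	/* fast path: take a reference on a cached MR covering the buffer */
	mrc = rv_mr_cache_search_get(cache, addr, len, access, true);
	if (!mrc) {
		/* miss: register a new MR (hypothetical helper) ... */
		mrc = register_new_mr(addr, len, access);
		if (!mrc)
			return -ENOMEM;
		/* ... and insert it; a reference is held on success */
		ret = rv_mr_cache_insert(cache, mrc);
		if (ret)
			return ret;
	}

	/* ... issue RDMA using mrc->mr.ib_mr ... */

	/* release; at refcount 0 the entry is placed on the LRU list */
	rv_mr_cache_put(cache, mrc);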

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 drivers/infiniband/ulp/rv/Makefile         |   2 +-
 drivers/infiniband/ulp/rv/rv_mr_cache.c    | 428 +++++++++++++++++++++
 drivers/infiniband/ulp/rv/trace.h          |   1 +
 drivers/infiniband/ulp/rv/trace_mr_cache.h | 181 +++++++++
 4 files changed, 611 insertions(+), 1 deletion(-)
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.c
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr_cache.h

diff --git a/drivers/infiniband/ulp/rv/Makefile b/drivers/infiniband/ulp/rv/Makefile
index 07a7a7dd9c3b..01e93dc25f1d 100644
--- a/drivers/infiniband/ulp/rv/Makefile
+++ b/drivers/infiniband/ulp/rv/Makefile
@@ -4,6 +4,6 @@
 #
 obj-$(CONFIG_INFINIBAND_RV) += rv.o
 
-rv-y := rv_main.o trace.o
+rv-y := rv_main.o rv_mr_cache.o trace.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/ulp/rv/rv_mr_cache.c b/drivers/infiniband/ulp/rv/rv_mr_cache.c
new file mode 100644
index 000000000000..48ea7c958f74
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv_mr_cache.c
@@ -0,0 +1,428 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/scatterlist.h>
+#include <linux/debugfs.h>
+#include <linux/interval_tree_generic.h>
+
+#include <rdma/ib_user_sa.h>
+
+#include "rv.h"
+#include "trace.h"
+
+static unsigned int mr_cache_size = MAX_RB_SIZE;
+
+module_param(mr_cache_size, uint, 0444);
+MODULE_PARM_DESC(mr_cache_size, "Size of mr cache (in MB)");
+
+static void handle_remove(struct work_struct *work);
+static void do_remove(struct rv_mr_cache *cache, struct list_head *del_list);
+static u32 rv_cache_evict(struct rv_mr_cache *cache, u64 mbytes);
+static int mmu_notifier_range_start(struct mmu_notifier *,
+				    const struct mmu_notifier_range *);
+static struct rv_mr_cached *rv_mr_cache_search(struct rv_mr_cache *cache,
+					       u64 addr, u64 len, u32 acc);
+static void rv_update_mrc_stats_add(struct rv_mr_cache *cache,
+				    struct rv_mr_cached *mrc);
+static void rv_update_mrc_stats_remove(struct rv_mr_cache *cache,
+				       struct rv_mr_cached *mrc);
+
+static const struct mmu_notifier_ops mn_opts = {
+	.invalidate_range_start = mmu_notifier_range_start,
+};
+
+static u64 mrc_start(struct rv_mr_cached *mrc)
+{
+	return mrc->addr;
+}
+
+static u64 mrc_last(struct rv_mr_cached *mrc)
+{
+	return mrc->addr + mrc->len - 1;
+}
+
+INTERVAL_TREE_DEFINE(struct rv_mr_cached, node, u64, __last,
+		     mrc_start, mrc_last, static, rv_int_rb);
+
+/*
+ * MMU notifier callback
+ *
+ * If the address range overlaps an MR which is in use (refcount>0)
+ * we refuse to remove it.  Otherwise we remove it from the MR cache
+ * by getting it off the LRU list and RB-tree and schedule the
+ * MR deregistration.
+ */
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
+				    const struct mmu_notifier_range *range)
+{
+	struct rv_mr_cache *cache = container_of(mn, struct rv_mr_cache, mn);
+	struct rb_root_cached *root = &cache->root;
+	struct rv_mr_cached *mrc, *ptr = NULL;
+	unsigned long flags;
+	bool added = false;
+
+	spin_lock_irqsave(&cache->lock, flags);
+	for (mrc = rv_int_rb_iter_first(root, range->start, range->end - 1);
+	     mrc; mrc = ptr) {
+		ptr = rv_int_rb_iter_next(mrc, range->start, range->end - 1);
+		if (cache->ops->invalidate(cache, cache->ops_arg, mrc)) {
+			trace_rv_mr_cache_notifier(mrc->addr, mrc->len,
+						   mrc->access);
+			rv_int_rb_remove(mrc, root);
+			list_move(&mrc->list, &cache->del_list);
+			cache->stats.remove++;
+			rv_update_mrc_stats_remove(cache, mrc);
+			added = true;
+		}
+	}
+	spin_unlock_irqrestore(&cache->lock, flags);
+
+	if (added)
+		queue_work(cache->wq, &cache->del_work);
+	return 0;
+}
+
+/* MR deregistration is done on a per-rv_user work queue. */
+int rv_mr_cache_init(int rv_inx, struct rv_mr_cache *cache,
+		     const struct rv_mr_cache_ops *ops, void *priv,
+		     struct mm_struct *mm, u32 cache_size)
+{
+	char wq_name[25];
+	int ret = 0;
+
+	snprintf(wq_name, sizeof(wq_name), "rv-%d", rv_inx);
+	cache->wq = alloc_workqueue(wq_name,
+				    WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE
+					| WQ_MEM_RECLAIM,
+				    RV_RB_MAX_ACTIVE_WQ_ENTRIES);
+	if (!cache->wq)
+		return -ENOMEM;
+
+	trace_rv_mr_cache_wq_alloc(wq_name);
+	cache->root = RB_ROOT_CACHED;
+	cache->ops = ops;
+	cache->ops_arg = priv;
+
+	INIT_HLIST_NODE(&cache->mn.hlist);
+	spin_lock_init(&cache->lock);
+
+	cache->mn.ops = &mn_opts;
+	cache->mm = mm;
+
+	INIT_WORK(&cache->del_work, handle_remove);
+	INIT_LIST_HEAD(&cache->del_list);
+	INIT_LIST_HEAD(&cache->lru_list);
+
+	if (cache_size)
+		cache->max_size = (u64)cache_size * 1024 * 1024;
+	else
+		cache->max_size = (u64)mr_cache_size * 1024 * 1024;
+
+	if (mm) {
+		ret = mmu_notifier_register(&cache->mn, cache->mm);
+		if (ret)
+			goto bail_wq;
+	}
+
+	return ret;
+
+bail_wq:
+	destroy_workqueue(cache->wq);
+	cache->wq = NULL;
+	return ret;
+}
+
+/* All remaining entries in the cache are deregistered */
+void rv_mr_cache_deinit(int rv_inx, struct rv_mr_cache *cache)
+{
+	struct rv_mr_cached *mrc;
+	struct rb_node *node;
+	unsigned long flags;
+	struct list_head del_list;
+
+	if (cache->mm)
+		mmu_notifier_unregister(&cache->mn, cache->mm);
+
+	INIT_LIST_HEAD(&del_list);
+
+	spin_lock_irqsave(&cache->lock, flags);
+	while ((node = rb_first_cached(&cache->root))) {
+		mrc = rb_entry(node, struct rv_mr_cached, node);
+		trace_rv_mr_cache_deinit(mrc->addr, mrc->len, mrc->access,
+					 atomic_read(&mrc->refcount));
+		rb_erase_cached(node, &cache->root);
+		list_move(&mrc->list, &del_list);
+		cache->stats.remove++;
+		rv_update_mrc_stats_remove(cache, mrc);
+	}
+	WARN_ON(cache->total_size);
+
+	spin_unlock_irqrestore(&cache->lock, flags);
+
+	do_remove(cache, &del_list);
+
+	if (cache->wq) {
+		char wq_name[25];
+
+		snprintf(wq_name, sizeof(wq_name), "rv-%d", rv_inx);
+		trace_rv_mr_cache_wq_destroy(wq_name);
+		flush_workqueue(cache->wq);
+		destroy_workqueue(cache->wq);
+	}
+	cache->wq = NULL;
+	cache->mm = NULL;
+}
+
+/* called with cache->lock */
+void rv_mr_cache_update_stats_max(struct rv_mr_cache *cache, int refcount)
+{
+	if ((u32)refcount > cache->stats.max_refcount)
+		cache->stats.max_refcount = (u32)refcount;
+	if (cache->stats.inuse > cache->stats.max_inuse)
+		cache->stats.max_inuse = cache->stats.inuse;
+	if (cache->stats.inuse_bytes > cache->stats.max_inuse_bytes)
+		cache->stats.max_inuse_bytes = cache->stats.inuse_bytes;
+	if (cache->stats.count > cache->stats.max_count)
+		cache->stats.max_count = cache->stats.count;
+	if (cache->total_size > cache->stats.max_cache_size)
+		cache->stats.max_cache_size = cache->total_size;
+}
+
+/* gets a reference to mrc on behalf of caller */
+int rv_mr_cache_insert(struct rv_mr_cache *cache,
+		       struct rv_mr_cached *mrc)
+{
+	struct rv_mr_cached *existing;
+	unsigned long flags;
+	u64 new_len, evict_len;
+	int ret = 0;
+
+again:
+	trace_rv_mr_cache_insert(mrc->addr, mrc->len, mrc->access);
+
+	spin_lock_irqsave(&cache->lock, flags);
+	existing = rv_mr_cache_search(cache, mrc->addr, mrc->len, mrc->access);
+	if (existing) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	new_len = cache->total_size + mrc->len;
+	if (new_len > cache->max_size) {
+		spin_unlock_irqrestore(&cache->lock, flags);
+
+		trace_rv_mr_cache_cache_full(cache->max_size, cache->total_size,
+					     mrc->len);
+
+		evict_len = new_len - cache->max_size;
+		if (rv_cache_evict(cache, evict_len) >= evict_len)
+			goto again;
+		spin_lock_irqsave(&cache->lock, flags);
+		cache->stats.full++;
+		spin_unlock_irqrestore(&cache->lock, flags);
+		return -ENOMEM;
+	}
+
+	rv_int_rb_insert(mrc, &cache->root);
+	INIT_LIST_HEAD(&mrc->list);
+
+	cache->ops->get(cache, cache->ops_arg, mrc);
+	cache->stats.miss++;
+	rv_update_mrc_stats_add(cache, mrc);
+unlock:
+	spin_unlock_irqrestore(&cache->lock, flags);
+	return ret;
+}
+
+/* Caller must hold cache->lock */
+static struct rv_mr_cached *rv_mr_cache_search(struct rv_mr_cache *cache,
+					       u64 addr, u64 len, u32 acc)
+{
+	struct rv_mr_cached *mrc = NULL;
+
+	trace_rv_mr_cache_search_enter(addr, len, acc);
+
+	if (!cache->ops->filter) {
+		mrc = rv_int_rb_iter_first(&cache->root, addr,
+					   (addr + len) - 1);
+		if (mrc)
+			trace_rv_mr_cache_search_mrc(mrc->addr, mrc->len,
+						     mrc->access);
+	} else {
+		for (mrc = rv_int_rb_iter_first(&cache->root, addr,
+						(addr + len) - 1);
+		     mrc;
+		     mrc = rv_int_rb_iter_next(mrc, addr, (addr + len) - 1)) {
+			trace_rv_mr_cache_search_mrc(mrc->addr, mrc->len,
+						     mrc->access);
+			if (cache->ops->filter(mrc, addr, len, acc))
+				return mrc;
+		}
+	}
+	return mrc;
+}
+
+/* look for a cache hit.  If get a hit, make sure removed from LRU list */
+struct rv_mr_cached *rv_mr_cache_search_get(struct rv_mr_cache *cache,
+					    u64 addr, u64 len, u32 acc,
+					    bool update_hit)
+{
+	unsigned long flags;
+	struct rv_mr_cached *mrc;
+
+	spin_lock_irqsave(&cache->lock, flags);
+	mrc =  rv_mr_cache_search(cache, addr, len, acc);
+	if (mrc) {
+		cache->ops->get(cache, cache->ops_arg, mrc);
+		if (update_hit)
+			cache->stats.hit++;
+		list_del_init(&mrc->list);
+	}
+	spin_unlock_irqrestore(&cache->lock, flags);
+
+	return mrc;
+}
+
+/*
+ * release a cache reference by address.
+ * This is called from user ioctl, so we must make sure they don't
+ * dereg twice yielding a negative refcount.
+ * The released entry goes on our LRU list to prioritize evictions.
+ */
+struct rv_mr_cached *rv_mr_cache_search_put(struct rv_mr_cache *cache,
+					    u64 addr, u64 len, u32 acc)
+{
+	unsigned long flags;
+	struct rv_mr_cached *mrc;
+
+	spin_lock_irqsave(&cache->lock, flags);
+	mrc =  rv_mr_cache_search(cache, addr, len, acc);
+	if (mrc) {
+		if (!atomic_read(&mrc->refcount)) {
+			mrc = NULL;
+			goto unlock;
+		}
+		if (!cache->ops->put(cache, cache->ops_arg, mrc))
+			list_add(&mrc->list, &cache->lru_list);
+	}
+unlock:
+	spin_unlock_irqrestore(&cache->lock, flags);
+
+	return mrc;
+}
+
+/* Simple release, the entry goes on our LRU list to prioritize evictions. */
+void rv_mr_cache_put(struct rv_mr_cache *cache, struct rv_mr_cached *mrc)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cache->lock, flags);
+	if (!cache->ops->put(cache, cache->ops_arg, mrc))
+		list_add(&mrc->list, &cache->lru_list);
+	spin_unlock_irqrestore(&cache->lock, flags);
+}
+
+/*
+ * evict entries from the cache, least recently used first.
+ * We evict until we reach the goal or LRU_list is empty. Evicted
+ * entries are removed from the cache and deregistered.
+ */
+void rv_mr_cache_evict(struct rv_mr_cache *cache, void *evict_arg)
+{
+	struct rv_mr_cached *mrc, *temp;
+	struct list_head del_list;
+	unsigned long flags;
+	bool stop = false;
+
+	INIT_LIST_HEAD(&del_list);
+
+	spin_lock_irqsave(&cache->lock, flags);
+	list_for_each_entry_safe_reverse(mrc, temp, &cache->lru_list, list) {
+		if (cache->ops->evict(cache, cache->ops_arg, mrc, evict_arg,
+				      &stop)) {
+			trace_rv_mr_cache_evict_evict(mrc->addr, mrc->len,
+						      mrc->access);
+			rv_int_rb_remove(mrc, &cache->root);
+			list_move(&mrc->list, &del_list);
+			cache->stats.evict++;
+			rv_update_mrc_stats_remove(cache, mrc);
+		} else {
+			trace_rv_mr_cache_evict_keep(mrc->addr, mrc->len,
+						     mrc->access);
+		}
+		if (stop)
+			break;
+	}
+	spin_unlock_irqrestore(&cache->lock, flags);
+
+	do_remove(cache, &del_list);
+}
+
+/*
+ * Call the remove function for the given cache and the list.  This
+ * is expected to be called with a delete list extracted from cache.
+ * The caller does NOT need the cache->lock.
+ */
+static void do_remove(struct rv_mr_cache *cache, struct list_head *del_list)
+{
+	struct rv_mr_cached *mrc;
+
+	while (!list_empty(del_list)) {
+		mrc = list_first_entry(del_list, struct rv_mr_cached, list);
+		list_del(&mrc->list);
+		/* Deregister the mr here */
+		kfree(mrc);
+	}
+}
+
+/*
+ * Work queue function to remove all nodes that have been queued up to
+ * be removed.	The key feature is that mm->mmap_lock is not being held
+ * and the remove callback can sleep while taking it, if needed.
+ */
+static void handle_remove(struct work_struct *work)
+{
+	struct rv_mr_cache *cache = container_of(work, struct rv_mr_cache,
+						 del_work);
+	struct list_head del_list;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cache->lock, flags);
+	list_replace_init(&cache->del_list, &del_list);
+	spin_unlock_irqrestore(&cache->lock, flags);
+
+	do_remove(cache, &del_list);
+}
+
+static u32 rv_cache_evict(struct rv_mr_cache *cache, u64 mbytes)
+{
+	struct evict_data evict_data;
+
+	evict_data.cleared = 0;
+	evict_data.target = mbytes;
+	trace_rv_mr_cache_cache_evict(evict_data.cleared, evict_data.target,
+				      cache->total_size);
+	rv_mr_cache_evict(cache, &evict_data);
+	trace_rv_mr_cache_cache_evict(evict_data.cleared, evict_data.target,
+				      cache->total_size);
+	return evict_data.cleared;
+}
+
+static void rv_update_mrc_stats_add(struct rv_mr_cache *cache,
+				    struct rv_mr_cached *mrc)
+{
+	cache->total_size += mrc->len;
+	cache->stats.count++;
+	rv_mr_cache_update_stats_max(cache, atomic_read(&mrc->refcount));
+}
+
+static void rv_update_mrc_stats_remove(struct rv_mr_cache *cache,
+				       struct rv_mr_cached *mrc)
+{
+	cache->total_size -= mrc->len;
+	cache->stats.count--;
+}
diff --git a/drivers/infiniband/ulp/rv/trace.h b/drivers/infiniband/ulp/rv/trace.h
index cb1d1d087e16..7a4cc4919693 100644
--- a/drivers/infiniband/ulp/rv/trace.h
+++ b/drivers/infiniband/ulp/rv/trace.h
@@ -2,4 +2,5 @@
 /*
  * Copyright(c) 2020 - 2021 Intel Corporation.
  */
+#include "trace_mr_cache.h"
 #include "trace_dev.h"
diff --git a/drivers/infiniband/ulp/rv/trace_mr_cache.h b/drivers/infiniband/ulp/rv/trace_mr_cache.h
new file mode 100644
index 000000000000..4b1f668d4b87
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace_mr_cache.h
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#if !defined(__RV_TRACE_MR_CACHE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __RV_TRACE_MR_CACHE_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rv_mr_cache
+
+DECLARE_EVENT_CLASS(/* rv_mr_cache */
+	rv_mr_cache_template,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc),
+	TP_STRUCT__entry(/* entry */
+		__field(u64, addr)
+		__field(u64, len)
+		__field(u32, acc)
+	),
+	TP_fast_assign(/* assign */
+		__entry->addr = addr;
+		__entry->len = len;
+		__entry->acc = acc;
+	),
+	TP_printk(/* print */
+		"addr 0x%llx, len %llu acc 0x%x",
+		__entry->addr,
+		__entry->len,
+		__entry->acc
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_template, rv_mr_cache_insert,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_template, rv_mr_cache_search_enter,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_template, rv_mr_cache_search_mrc,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_template, rv_mr_cache_remove,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_template, rv_mr_cache_evict_evict,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_template, rv_mr_cache_evict_keep,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_template, rv_mr_cache_notifier,
+	TP_PROTO(u64 addr, u64 len, u32 acc),
+	TP_ARGS(addr, len, acc)
+);
+
+TRACE_EVENT(/* event */
+	rv_mr_cache_cache_full,
+	TP_PROTO(u64 max, u64 total, u64 cur),
+	TP_ARGS(max, total, cur),
+	TP_STRUCT__entry(/* entry */
+		__field(u64, max)
+		__field(u64, total)
+		__field(u64, cur)
+	),
+	TP_fast_assign(/* assign */
+		__entry->max = max;
+		__entry->total = total;
+		__entry->cur = cur;
+	),
+	TP_printk(/* print */
+		"Cache Full max %llu, total %llu, cur %llu",
+		__entry->max,
+		__entry->total,
+		__entry->cur
+	)
+);
+
+TRACE_EVENT(/* event */
+	rv_mr_cache_deinit,
+	TP_PROTO(u64 addr, u64 len, u32 acc, int cnt),
+	TP_ARGS(addr, len, acc, cnt),
+	TP_STRUCT__entry(/* entry */
+		__field(u64, addr)
+		__field(u64, len)
+		__field(u32, acc)
+		__field(int, cnt)
+	),
+	TP_fast_assign(/* assign */
+		__entry->addr = addr;
+		__entry->len = len;
+		__entry->acc = acc;
+		__entry->cnt = cnt;
+	),
+	TP_printk(/* print */
+		"addr 0x%llx, len %llu, acc %u refcnt %d",
+		__entry->addr,
+		__entry->len,
+		__entry->acc,
+		__entry->cnt
+	)
+);
+
+DECLARE_EVENT_CLASS(/* rv_mr_cache_wq */
+	rv_mr_cache_wq_template,
+	TP_PROTO(const char *wq_name),
+	TP_ARGS(wq_name),
+	TP_STRUCT__entry(/* entry */
+		__string(name, wq_name)
+	),
+	TP_fast_assign(/* assign */
+		__assign_str(name, wq_name);
+	),
+	TP_printk(/* print */
+		"Workqueue %s",
+		__get_str(name)
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_wq_template, rv_mr_cache_wq_alloc,
+	TP_PROTO(const char *wq_name),
+	TP_ARGS(wq_name)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_cache_wq_template, rv_mr_cache_wq_destroy,
+	TP_PROTO(const char *wq_name),
+	TP_ARGS(wq_name)
+);
+
+TRACE_EVENT(/* event */
+	rv_mr_cache_cache_evict,
+	TP_PROTO(u64 cleared, u64 target, u64 total_size),
+	TP_ARGS(cleared, target, total_size),
+	TP_STRUCT__entry(/* entry */
+		__field(u64, cleared)
+		__field(u64, target)
+		__field(u64, total_size)
+	),
+	TP_fast_assign(/* assign */
+		__entry->cleared = cleared;
+		__entry->target = target;
+		__entry->total_size = total_size;
+	),
+	TP_printk(/* print */
+		"cleared 0x%llx target 0x%llx total_size 0x%llx",
+		__entry->cleared,
+		__entry->target,
+		__entry->total_size
+	)
+);
+
+#endif /* __RV_TRACE_MR_CACHE_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_mr_cache
+#include <trace/define_trace.h>
-- 
2.18.1


* [PATCH RFC 5/9] RDMA/rv: Add function to register/deregister memory region
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
                   ` (3 preceding siblings ...)
  2021-03-19 12:56 ` [PATCH RFC 4/9] RDMA/rv: Add functions for memory region cache kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 6/9] RDMA/rv: Add connection management functions kaike.wan
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

This patch adds functions for user applications to register/deregister
memory regions (mrs) for user buffers. Two types of mrs are supported:
user mrs and kernel mrs.

User mrs are used solely by the user application. The reason that
the user mrs are cached in the rv module instead of in the user
application is that a middleware application may not know when a user
buffer (allocated by an upper layer application) is freed, and thus
cannot free any stale cache nodes. The rv module, on the other hand,
can register an MMU notifier callback to promptly remove such stale
cache nodes.

Kernel mrs are used by the rv module for any RDMA transactions between
nodes.

A user mr is registered in a way similar to the verbs interface. A
kernel mr is registered in a manner similar to ib_reg_user_mr() for
on-demand paging. RDMA hardware may have to be qualified for this
mechanism.
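
As an illustration, here is a minimal sketch of how a ULP could use the
two helpers this patch adds and exports, rdma_reg_user_mr() and
rdma_dereg_user_mr() (signatures are taken from the patch below; ib_dev,
cmd_fd, pd_handle, buf and len are hypothetical caller-supplied values):

	struct fd fd;
	struct ib_mr *mr;

	/* register a user buffer against the PD identified by the
	 * application's uverbs (cmd_fd, pd_handle) pair
	 */
	mr = rdma_reg_user_mr(ib_dev, cmd_fd, pd_handle,
			      (u64)(uintptr_t)buf, len,
			      IB_ACCESS_LOCAL_WRITE, 0, NULL, &fd);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	/* ... use mr->lkey / mr->rkey for RDMA ... */

	/* deregister; this also drops the fd reference taken at registration */
	return rdma_dereg_user_mr(mr, &fd);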

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 drivers/infiniband/core/core_priv.h     |   4 +
 drivers/infiniband/core/rdma_core.c     | 237 ++++++++++++++
 drivers/infiniband/core/uverbs_cmd.c    |  54 +---
 drivers/infiniband/ulp/rv/Makefile      |   2 +-
 drivers/infiniband/ulp/rv/rv_file.c     |  31 ++
 drivers/infiniband/ulp/rv/rv_mr.c       | 393 ++++++++++++++++++++++++
 drivers/infiniband/ulp/rv/rv_mr_cache.c |   2 +-
 drivers/infiniband/ulp/rv/trace.h       |   2 +
 drivers/infiniband/ulp/rv/trace_mr.h    | 109 +++++++
 drivers/infiniband/ulp/rv/trace_user.h  |  66 ++++
 include/rdma/uverbs_types.h             |  10 +
 11 files changed, 860 insertions(+), 50 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rv/rv_file.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr.c
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_user.h

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 315f7a297eee..678fb9a3d77b 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -404,4 +404,8 @@ void rdma_umap_priv_init(struct rdma_umap_priv *priv,
 
 void ib_cq_pool_cleanup(struct ib_device *dev);
 
+struct ib_mr *uverbs_reg_mr(struct uverbs_attr_bundle *attrs, u32 pd_handle,
+			    u64 start, u64 length, u64 hca_va,
+			    u32 access_flags, struct ib_udata *driver_udata);
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
index 75eafd9208aa..22120adf0ef2 100644
--- a/drivers/infiniband/core/rdma_core.c
+++ b/drivers/infiniband/core/rdma_core.c
@@ -1013,3 +1013,240 @@ void uverbs_finalize_object(struct ib_uobject *uobj,
 		WARN_ON(true);
 	}
 }
+
+static struct ib_uverbs_file *get_ufile_from_fd(struct fd *fd)
+{
+	return fd->file ? (struct ib_uverbs_file *)fd->file->private_data :
+			  NULL;
+}
+
+static struct ib_uverbs_file *get_ufile(u32 cmd_fd, struct fd *fd)
+{
+	struct fd f;
+	struct ib_uverbs_file *file;
+
+	/* fd to "struct fd" */
+	f = fdget(cmd_fd);
+
+	file = get_ufile_from_fd(&f);
+	if (!file)
+		goto bail;
+
+	memcpy(fd, &f, sizeof(f));
+	return file;
+bail:
+	fdput(f);
+	return NULL;
+}
+
+struct ib_mr *uverbs_reg_mr(struct uverbs_attr_bundle *attrs, u32 pd_handle,
+			    u64 start, u64 length, u64 hca_va,
+			    u32 access_flags, struct ib_udata *driver_udata)
+{
+	struct ib_uobject           *uobj;
+	struct ib_pd                *pd;
+	struct ib_mr                *mr;
+	int                          ret;
+	struct ib_device            *ib_dev;
+
+	uobj = uobj_alloc(UVERBS_OBJECT_MR, attrs, &ib_dev);
+	if (IS_ERR(uobj))
+		return (struct ib_mr *)uobj;
+
+	ret = ib_check_mr_access(ib_dev, access_flags);
+	if (ret)
+		goto err_free;
+
+	pd = uobj_get_obj_read(pd, UVERBS_OBJECT_PD, pd_handle, attrs);
+	if (!pd) {
+		ret = -EINVAL;
+		goto err_free;
+	}
+
+	mr = pd->device->ops.reg_user_mr(pd, start, length, hca_va,
+					 access_flags, driver_udata);
+	if (IS_ERR(mr)) {
+		ret = PTR_ERR(mr);
+		goto err_put;
+	}
+
+	mr->device  = pd->device;
+	mr->pd      = pd;
+	mr->type    = IB_MR_TYPE_USER;
+	mr->dm      = NULL;
+	mr->sig_attrs = NULL;
+	mr->uobject = uobj;
+	atomic_inc(&pd->usecnt);
+	mr->iova = hca_va;
+
+	rdma_restrack_new(&mr->res, RDMA_RESTRACK_MR);
+	rdma_restrack_set_name(&mr->res, NULL);
+	rdma_restrack_add(&mr->res);
+
+	uobj->object = mr;
+	uobj_put_obj_read(pd);
+	uobj_finalize_uobj_create(uobj, attrs);
+
+	return mr;
+
+err_put:
+	uobj_put_obj_read(pd);
+err_free:
+	uobj_alloc_abort(uobj, attrs);
+	return ERR_PTR(ret);
+}
+
+struct ib_mr *rdma_reg_user_mr(struct ib_device *ib_dev, u32 cmd_fd,
+			       u32 pd_handle, u64 start, u64 length,
+			       u32 access_flags, size_t ulen, void *udata,
+			       struct fd *fd)
+{
+	struct ib_mr *mr;
+	struct ib_uverbs_file *ufile;
+	struct uverbs_attr_bundle attrs;
+	int srcu_key;
+	struct ib_device *device;
+
+	if (!fd) {
+		mr = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	memset(&attrs, 0, sizeof(attrs));
+	ufile = get_ufile(cmd_fd, fd);
+	if (!ufile) {
+		mr = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	srcu_key = srcu_read_lock(&ufile->device->disassociate_srcu);
+	/* Validate the target ib_device */
+	device = srcu_dereference(ufile->device->ib_dev,
+				  &ufile->device->disassociate_srcu);
+	if (device != ib_dev) {
+		mr = ERR_PTR(-EINVAL);
+		goto out_fd;
+	}
+	attrs.ufile = ufile;
+	attrs.driver_udata.inlen = ulen;
+	if (ulen)
+		attrs.driver_udata.inbuf = udata;
+
+	mr = uverbs_reg_mr(&attrs, pd_handle, start, length, start,
+			   access_flags, &attrs.driver_udata);
+	if (IS_ERR(mr))
+		goto out_fd;
+	else
+		uverbs_finalize_object(attrs.uobject, UVERBS_ACCESS_NEW, true,
+				       true, &attrs);
+
+	goto out_unlock;
+
+out_fd:
+	fdput(*fd);
+out_unlock:
+	srcu_read_unlock(&ufile->device->disassociate_srcu, srcu_key);
+out:
+	return mr;
+}
+EXPORT_SYMBOL(rdma_reg_user_mr);
+
+int rdma_dereg_user_mr(struct ib_mr *mr, struct fd *fd)
+{
+	struct ib_uverbs_file *ufile;
+	struct uverbs_attr_bundle attrs;
+	int srcu_key;
+	int ret;
+
+	if (!fd || !mr || !mr->uobject)
+		return -EINVAL;
+
+	memset(&attrs, 0, sizeof(attrs));
+	ufile = get_ufile_from_fd(fd);
+	if (!ufile || !ufile->device)
+		return -EINVAL;
+
+	srcu_key = srcu_read_lock(&ufile->device->disassociate_srcu);
+	attrs.ufile = ufile;
+
+	ret = uobj_perform_destroy(UVERBS_OBJECT_MR, (u32)mr->uobject->id,
+				   &attrs);
+	fdput(*fd);
+
+	srcu_read_unlock(&ufile->device->disassociate_srcu, srcu_key);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_dereg_user_mr);
+
+struct ib_mr *rdma_reg_kernel_mr(u32 cmd_fd, struct ib_pd *kern_pd, u64 start,
+				 u64 length, u32 access_flags, size_t ulen,
+				 void *udata, struct fd *fd)
+{
+	struct ib_mr *mr;
+	struct ib_uverbs_file *ufile;
+	struct uverbs_attr_bundle attrs;
+	int srcu_key;
+
+	if (!fd) {
+		mr = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	memset(&attrs, 0, sizeof(attrs));
+	ufile = get_ufile(cmd_fd, fd);
+	if (!ufile) {
+		mr = ERR_PTR(-EINVAL);
+		goto out;
+	}
+	srcu_key = srcu_read_lock(&ufile->device->disassociate_srcu);
+	attrs.ufile = ufile;
+	attrs.driver_udata.inlen = ulen;
+	if (ulen)
+		attrs.driver_udata.inbuf = udata;
+
+	mr = kern_pd->device->ops.reg_user_mr(kern_pd, start, length, start,
+					      access_flags,
+					      &attrs.driver_udata);
+	if (IS_ERR(mr)) {
+		fdput(*fd);
+		goto out_unlock;
+	}
+
+	mr->device  = kern_pd->device;
+	mr->pd      = kern_pd;
+	mr->dm      = NULL;
+	atomic_inc(&kern_pd->usecnt);
+
+out_unlock:
+	srcu_read_unlock(&ufile->device->disassociate_srcu, srcu_key);
+out:
+	return mr;
+}
+EXPORT_SYMBOL(rdma_reg_kernel_mr);
+
+int rdma_dereg_kernel_mr(struct ib_mr *mr, struct fd *fd)
+{
+	struct ib_pd *pd;
+	struct ib_uverbs_file *ufile;
+	int srcu_key;
+	int ret = 0;
+	struct ib_udata udata;
+
+	if (!fd || !mr)
+		return -EINVAL;
+
+	ufile = get_ufile_from_fd(fd);
+	if (!ufile)
+		return -EINVAL;
+
+	srcu_key = srcu_read_lock(&ufile->device->disassociate_srcu);
+	memset(&udata, 0, sizeof(udata));
+	pd = mr->pd;
+	ret = pd->device->ops.dereg_mr(mr, &udata);
+	atomic_dec(&pd->usecnt);
+	fdput(*fd);
+
+	srcu_read_unlock(&ufile->device->disassociate_srcu, srcu_key);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_dereg_kernel_mr);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index f5b8be3bedde..83751801d217 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -696,11 +696,8 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
 {
 	struct ib_uverbs_reg_mr_resp resp = {};
 	struct ib_uverbs_reg_mr      cmd;
-	struct ib_uobject           *uobj;
-	struct ib_pd                *pd;
 	struct ib_mr                *mr;
 	int                          ret;
-	struct ib_device *ib_dev;
 
 	ret = uverbs_request(attrs, &cmd, sizeof(cmd));
 	if (ret)
@@ -709,55 +706,16 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
 	if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
 		return -EINVAL;
 
-	uobj = uobj_alloc(UVERBS_OBJECT_MR, attrs, &ib_dev);
-	if (IS_ERR(uobj))
-		return PTR_ERR(uobj);
-
-	ret = ib_check_mr_access(ib_dev, cmd.access_flags);
-	if (ret)
-		goto err_free;
-
-	pd = uobj_get_obj_read(pd, UVERBS_OBJECT_PD, cmd.pd_handle, attrs);
-	if (!pd) {
-		ret = -EINVAL;
-		goto err_free;
-	}
-
-	mr = pd->device->ops.reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
-					 cmd.access_flags,
-					 &attrs->driver_udata);
-	if (IS_ERR(mr)) {
-		ret = PTR_ERR(mr);
-		goto err_put;
-	}
-
-	mr->device  = pd->device;
-	mr->pd      = pd;
-	mr->type    = IB_MR_TYPE_USER;
-	mr->dm	    = NULL;
-	mr->sig_attrs = NULL;
-	mr->uobject = uobj;
-	atomic_inc(&pd->usecnt);
-	mr->iova = cmd.hca_va;
-
-	rdma_restrack_new(&mr->res, RDMA_RESTRACK_MR);
-	rdma_restrack_set_name(&mr->res, NULL);
-	rdma_restrack_add(&mr->res);
-
-	uobj->object = mr;
-	uobj_put_obj_read(pd);
-	uobj_finalize_uobj_create(uobj, attrs);
+	mr = uverbs_reg_mr(attrs, cmd.pd_handle, cmd.start, cmd.length,
+			   cmd.hca_va, cmd.access_flags,
+			   &attrs->driver_udata);
+	if (IS_ERR(mr))
+		return PTR_ERR(mr);
 
 	resp.lkey = mr->lkey;
 	resp.rkey = mr->rkey;
-	resp.mr_handle = uobj->id;
+	resp.mr_handle = mr->uobject->id;
 	return uverbs_response(attrs, &resp, sizeof(resp));
-
-err_put:
-	uobj_put_obj_read(pd);
-err_free:
-	uobj_alloc_abort(uobj, attrs);
-	return ret;
 }
 
 static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
diff --git a/drivers/infiniband/ulp/rv/Makefile b/drivers/infiniband/ulp/rv/Makefile
index 01e93dc25f1d..677b113c0666 100644
--- a/drivers/infiniband/ulp/rv/Makefile
+++ b/drivers/infiniband/ulp/rv/Makefile
@@ -4,6 +4,6 @@
 #
 obj-$(CONFIG_INFINIBAND_RV) += rv.o
 
-rv-y := rv_main.o rv_mr_cache.o trace.o
+rv-y := rv_main.o rv_mr_cache.o rv_file.o rv_mr.o trace.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/ulp/rv/rv_file.c b/drivers/infiniband/ulp/rv/rv_file.c
new file mode 100644
index 000000000000..3625a9c1681a
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv_file.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+#include "rv.h"
+
+/* A workqueue for all */
+static struct workqueue_struct *rv_wq;
+
+void rv_queue_work(struct work_struct *work)
+{
+	queue_work(rv_wq, work);
+}
+
+void rv_job_dev_get(struct rv_job_dev *jdev)
+{
+	kref_get(&jdev->kref);
+}
+
+static void rv_job_dev_release(struct kref *kref)
+{
+	struct rv_job_dev *jdev = container_of(kref, struct rv_job_dev, kref);
+
+	kfree_rcu(jdev, rcu);
+}
+
+void rv_job_dev_put(struct rv_job_dev *jdev)
+{
+	kref_put(&jdev->kref, rv_job_dev_release);
+}
diff --git a/drivers/infiniband/ulp/rv/rv_mr.c b/drivers/infiniband/ulp/rv/rv_mr.c
new file mode 100644
index 000000000000..76786bd11502
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv_mr.c
@@ -0,0 +1,393 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+#include <rdma/uverbs_std_types.h>
+#include <rdma/uverbs_ioctl.h>
+
+#include "rv.h"
+#include "trace.h"
+
+unsigned int enable_user_mr;
+
+module_param(enable_user_mr, uint, 0444);
+MODULE_PARM_DESC(enable_user_mr, "Enable user mode MR caching");
+
+static void rv_handle_user_mrs_put(struct work_struct *work);
+
+static bool rv_cache_mrc_filter(struct rv_mr_cached *mrc, u64 addr,
+				u64 len, u32 acc);
+static void rv_cache_mrc_get(struct rv_mr_cache *cache,
+			     void *arg, struct rv_mr_cached *mrc);
+static int rv_cache_mrc_put(struct rv_mr_cache *cache,
+			    void *arg, struct rv_mr_cached *mrc);
+static int rv_cache_mrc_invalidate(struct rv_mr_cache *cache,
+				   void *arg, struct rv_mr_cached *mrc);
+static int rv_cache_mrc_evict(struct rv_mr_cache *cache,
+			      void *arg, struct rv_mr_cached *mrc,
+			      void *evict_arg, bool *stop);
+
+static const struct rv_mr_cache_ops rv_cache_ops = {
+	.filter = rv_cache_mrc_filter,
+	.get = rv_cache_mrc_get,
+	.put = rv_cache_mrc_put,
+	.invalidate = rv_cache_mrc_invalidate,
+	.evict = rv_cache_mrc_evict
+};
+
+/* given an rv, find the proper ib_dev to use when registering user MRs */
+static struct ib_device *rv_ib_dev(struct rv_user *rv)
+{
+	struct rv_device *dev = rv->rdma_mode == RV_RDMA_MODE_USER ? rv->dev :
+				rv->jdev->dev;
+
+	return dev->ib_dev;
+}
+
+/* caller must hold rv->mutex */
+static int rv_drv_api_reg_mem(struct rv_user *rv,
+			      struct rv_mem_params_in *minfo,
+			      struct mr_info *mr)
+{
+	struct ib_mr *ib_mr;
+
+	mr->ib_mr = NULL;
+	mr->ib_pd = NULL;
+
+	/*
+	 * Check if the buffer is for kernel use. It should be noted that
+	 * the ibv_pd_handle value "0" is a valid user space pd handle.
+	 */
+	if (minfo->access & IBV_ACCESS_KERNEL)
+		ib_mr = rdma_reg_kernel_mr(minfo->cmd_fd_int, rv->jdev->pd,
+					   minfo->addr, minfo->length,
+					   minfo->access & ~IBV_ACCESS_KERNEL,
+					   minfo->ulen, minfo->udata,
+					   &mr->fd);
+	else
+		ib_mr = rdma_reg_user_mr(rv_ib_dev(rv), minfo->cmd_fd_int,
+					 minfo->ibv_pd_handle,
+					 minfo->addr, minfo->length,
+					 minfo->access, minfo->ulen,
+					 minfo->udata, &mr->fd);
+	if (IS_ERR(ib_mr)) {
+		rv_err(rv->inx, "reg_user_mr failed\n");
+		return  PTR_ERR(ib_mr);
+	}
+	/* A hardware driver may not set the iova field */
+	if (!ib_mr->iova)
+		ib_mr->iova = minfo->addr;
+
+	trace_rv_mr_info_reg(minfo->addr, minfo->length, minfo->access,
+			     ib_mr->lkey, ib_mr->rkey, ib_mr->iova,
+			     atomic_read(&ib_mr->pd->usecnt));
+	mr->ib_mr = ib_mr;
+	mr->ib_pd = ib_mr->pd;
+
+	return 0;
+}
+
+int rv_drv_api_dereg_mem(struct mr_info *mr)
+{
+	int ret;
+	struct rv_mr_cached *mrc = container_of(mr, struct rv_mr_cached, mr);
+
+	trace_rv_mr_info_dereg(mrc->addr, mrc->len, mrc->access,
+			       mr->ib_mr->lkey, mr->ib_mr->rkey,
+			       mr->ib_mr->iova,
+			       atomic_read(&mr->ib_pd->usecnt));
+
+	if (mrc->access & IBV_ACCESS_KERNEL)
+		ret = rdma_dereg_kernel_mr(mr->ib_mr, &mr->fd);
+	else
+		ret = rdma_dereg_user_mr(mr->ib_mr, &mr->fd);
+	if (!ret) {
+		mr->ib_mr = NULL;
+		mr->ib_pd = NULL;
+	}
+	return ret;
+}
+
+/* Caller must not hold rv->mutex */
+struct rv_user_mrs *rv_user_mrs_alloc(struct rv_user *rv, u32 cache_size)
+{
+	int ret;
+	struct rv_user_mrs *umrs;
+
+	umrs = kzalloc(sizeof(*umrs), GFP_KERNEL);
+	if (!umrs)
+		return ERR_PTR(-ENOMEM);
+
+	umrs->rv_inx = rv->inx;
+	ret = rv_mr_cache_init(rv->inx, &umrs->cache, &rv_cache_ops, NULL,
+			       current->mm, cache_size);
+	if (ret)
+		goto bail_free;
+	kref_init(&umrs->kref); /* refcount now 1 */
+	INIT_WORK(&umrs->put_work, rv_handle_user_mrs_put);
+	return umrs;
+
+bail_free:
+	kfree(umrs);
+	return ERR_PTR(ret);
+}
+
+/* called with rv->mutex */
+void rv_user_mrs_attach(struct rv_user *rv)
+{
+	struct rv_user_mrs *umrs = rv->umrs;
+
+	if (rv->rdma_mode == RV_RDMA_MODE_KERNEL) {
+		/*
+		 * For mode KERNEL the user_mrs object may survive past the
+		 * rv_user close, so we need our own jdev reference to dereg
+		 * MRs while outstanding send IOs complete.
+		 * For mode USER, the MRs use the user's pd and rv_user
+		 * will free all MRs during close.
+		 *
+		 * The jdev->pd we will use for MRs and QP needs a ref to jdev.
+		 */
+		rv_job_dev_get(rv->jdev);
+		umrs->jdev = rv->jdev;
+	}
+	trace_rv_user_mrs_attach(umrs->rv_inx, umrs->jdev,
+				 umrs->cache.total_size, umrs->cache.max_size,
+				 kref_read(&umrs->kref));
+}
+
+static void rv_user_mrs_release(struct rv_user_mrs *umrs)
+{
+	trace_rv_user_mrs_release(umrs->rv_inx, umrs->jdev,
+				  umrs->cache.total_size, umrs->cache.max_size,
+				  kref_read(&umrs->kref));
+	rv_mr_cache_deinit(umrs->rv_inx, &umrs->cache);
+	if (umrs->jdev)
+		rv_job_dev_put(umrs->jdev);
+	kfree(umrs);
+}
+
+static void rv_handle_user_mrs_put(struct work_struct *work)
+{
+	struct rv_user_mrs *umrs = container_of(work, struct rv_user_mrs,
+						put_work);
+
+	rv_user_mrs_release(umrs);
+}
+
+static void rv_user_mrs_schedule_release(struct kref *kref)
+{
+	struct rv_user_mrs *umrs = container_of(kref, struct rv_user_mrs, kref);
+
+	/*
+	 * Since this function may be called from rv_write_done(),
+	 * we can't call rv_user_mrs_release() directly to destroy its
+	 * RC QP and run rv_mr_cache_deinit() (and wait for completion).
+	 * Instead, put the cleanup on a workqueue thread.
+	 */
+	rv_queue_work(&umrs->put_work);
+}
+
+void rv_user_mrs_get(struct rv_user_mrs *umrs)
+{
+	kref_get(&umrs->kref);
+}
+
+void rv_user_mrs_put(struct rv_user_mrs *umrs)
+{
+	kref_put(&umrs->kref, rv_user_mrs_schedule_release);
+}
+
+int doit_reg_mem(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_mem_params mparams;
+	struct rv_mr_cached *mrc;
+	int ret;
+	struct rv_user_mrs *umrs = rv->umrs;
+
+	if (copy_from_user(&mparams.in, (void __user *)arg,
+			   sizeof(mparams.in)))
+		return -EFAULT;
+
+	if (!enable_user_mr && !(mparams.in.access & IBV_ACCESS_KERNEL))
+		return -EINVAL;
+
+	/*
+	 * rv->mutex protects use of the umrs QP for REG_MR; it also
+	 * protects the rb_search/rb_insert sequence against races with
+	 * other doit_reg_mem and doit_dereg_mem calls
+	 */
+	mutex_lock(&rv->mutex);
+	if (!rv->attached) {
+		ret = rv->was_attached ? -ENXIO : -EINVAL;
+		goto bail_unlock;
+	}
+	if (rv->rdma_mode != RV_RDMA_MODE_KERNEL &&
+	    (mparams.in.access & IBV_ACCESS_KERNEL)) {
+		ret = -EINVAL;
+		goto bail_unlock;
+	}
+
+	trace_rv_mr_reg(rv->rdma_mode, mparams.in.addr,
+			mparams.in.length, mparams.in.access);
+	/* get a reference; if found, update hit stats */
+	mrc = rv_mr_cache_search_get(&umrs->cache, mparams.in.addr,
+				     mparams.in.length, mparams.in.access,
+				     true);
+	if (mrc)
+		goto cont;
+
+	/* create a new mrc for rb tree */
+	mrc = kzalloc(sizeof(*mrc), GFP_KERNEL);
+	if (!mrc) {
+		ret = -ENOMEM;
+		umrs->stats.failed++;
+		goto bail_unlock;
+	}
+
+	/* register using verbs callback */
+	ret = rv_drv_api_reg_mem(rv, &mparams.in, &mrc->mr);
+	if (ret) {
+		umrs->stats.failed++;
+		goto bail_free;
+	}
+	mrc->addr = mparams.in.addr;
+	mrc->len = mparams.in.length;
+	mrc->access = mparams.in.access;
+
+	ret = rv_mr_cache_insert(&umrs->cache, mrc);
+	if (ret)
+		goto bail_dereg;
+
+cont:
+	/* return the mr handle, lkey & rkey */
+	mparams.out.mr_handle = (uint64_t)mrc;
+	mparams.out.iova = mrc->mr.ib_mr->iova;
+	mparams.out.lkey = mrc->mr.ib_mr->lkey;
+	mparams.out.rkey = mrc->mr.ib_mr->rkey;
+
+	if (copy_to_user((void __user *)arg, &mparams.out,
+			 sizeof(mparams.out))) {
+		ret = -EFAULT;
+		goto bail_put;
+	}
+
+	mutex_unlock(&rv->mutex);
+
+	return 0;
+
+bail_dereg:
+	if (rv_drv_api_dereg_mem(&mrc->mr))
+		rv_err(rv->inx, "dereg_mem failed during cleanup\n");
+bail_free:
+	kfree(mrc);
+bail_unlock:
+	mutex_unlock(&rv->mutex);
+	return ret;
+
+bail_put:
+	rv_mr_cache_put(&umrs->cache, mrc);
+	mutex_unlock(&rv->mutex);
+	return ret;
+}
+
+int doit_dereg_mem(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_mr_cached *mrc;
+	struct rv_dereg_params_in dparams;
+	int ret = -EINVAL;
+
+	if (copy_from_user(&dparams, (void __user *)arg, sizeof(dparams)))
+		return -EFAULT;
+
+	/* rv->mutex protects possible race with doit_reg_mem */
+	mutex_lock(&rv->mutex);
+	if (!rv->attached) {
+		ret = rv->was_attached ? -ENXIO : -EINVAL;
+		goto bail_unlock;
+	}
+
+	mrc = rv_mr_cache_search_put(&rv->umrs->cache, dparams.addr,
+				     dparams.length, dparams.access);
+	if (!mrc)
+		goto bail_unlock;
+
+	mutex_unlock(&rv->mutex);
+	trace_rv_mr_dereg(rv->rdma_mode, dparams.addr,
+			  dparams.length, dparams.access);
+
+	return 0;
+
+bail_unlock:
+	mutex_unlock(&rv->mutex);
+	return ret;
+}
+
+/* called with cache->lock */
+static bool rv_cache_mrc_filter(struct rv_mr_cached *mrc, u64 addr,
+				u64 len, u32 acc)
+{
+	return mrc->addr == addr && mrc->len == len && mrc->access == acc;
+}
+
+/* called with cache->lock */
+static void rv_cache_mrc_get(struct rv_mr_cache *cache,
+			     void *arg, struct rv_mr_cached *mrc)
+{
+	int refcount;
+
+	refcount = atomic_inc_return(&mrc->refcount);
+	if (refcount == 1) {
+		cache->stats.inuse++;
+		cache->stats.inuse_bytes += mrc->len;
+	}
+	rv_mr_cache_update_stats_max(cache, refcount);
+}
+
+/* called with cache->lock */
+static int rv_cache_mrc_put(struct rv_mr_cache *cache,
+			    void *arg, struct rv_mr_cached *mrc)
+{
+	int refcount;
+
+	refcount = atomic_dec_return(&mrc->refcount);
+	if (!refcount) {
+		cache->stats.inuse--;
+		cache->stats.inuse_bytes -= mrc->len;
+	}
+	return refcount;
+}
+
+/* called with cache->lock */
+static int rv_cache_mrc_invalidate(struct rv_mr_cache *cache,
+				   void *arg, struct rv_mr_cached *mrc)
+{
+	if (!atomic_read(&mrc->refcount))
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 1 if the mrc can be evicted from the cache
+ *
+ * Called with cache->lock
+ */
+static int rv_cache_mrc_evict(struct rv_mr_cache *cache,
+			      void *arg, struct rv_mr_cached *mrc,
+			      void *evict_arg, bool *stop)
+{
+	struct evict_data *evict_data = evict_arg;
+
+	/* is this mrc still being used? */
+	if (atomic_read(&mrc->refcount))
+		return 0; /* keep this mrc */
+
+	/* this mrc will be evicted, add its size to our count */
+	evict_data->cleared += mrc->len;
+
+	/* have enough bytes been cleared? */
+	if (evict_data->cleared >= evict_data->target)
+		*stop = true;
+
+	return 1; /* remove this mrc */
+}
diff --git a/drivers/infiniband/ulp/rv/rv_mr_cache.c b/drivers/infiniband/ulp/rv/rv_mr_cache.c
index 48ea7c958f74..830ee246006b 100644
--- a/drivers/infiniband/ulp/rv/rv_mr_cache.c
+++ b/drivers/infiniband/ulp/rv/rv_mr_cache.c
@@ -374,7 +374,7 @@ static void do_remove(struct rv_mr_cache *cache, struct list_head *del_list)
 	while (!list_empty(del_list)) {
 		mrc = list_first_entry(del_list, struct rv_mr_cached, list);
 		list_del(&mrc->list);
-		/* Deregister the mr here */
+		rv_drv_api_dereg_mem(&mrc->mr);
 		kfree(mrc);
 	}
 }
diff --git a/drivers/infiniband/ulp/rv/trace.h b/drivers/infiniband/ulp/rv/trace.h
index 7a4cc4919693..d2827582be05 100644
--- a/drivers/infiniband/ulp/rv/trace.h
+++ b/drivers/infiniband/ulp/rv/trace.h
@@ -4,3 +4,5 @@
  */
 #include "trace_mr_cache.h"
 #include "trace_dev.h"
+#include "trace_mr.h"
+#include "trace_user.h"
diff --git a/drivers/infiniband/ulp/rv/trace_mr.h b/drivers/infiniband/ulp/rv/trace_mr.h
new file mode 100644
index 000000000000..158c1b106b77
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace_mr.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#if !defined(__RV_TRACE_MR_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __RV_TRACE_MR_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rv_mr
+
+#define MR_INFO_PRN "addr 0x%llx len 0x%llx acc 0x%x lkey 0x%x rkey 0x%x " \
+		    "iova 0x%llx pd_usecnt %u"
+
+DECLARE_EVENT_CLASS(/* mr */
+	rv_mr_template,
+	TP_PROTO(u8 mode, u64 addr, u64 len, u32 acc),
+	TP_ARGS(mode, addr, len, acc),
+	TP_STRUCT__entry(/* entry */
+		__field(u8, mode)
+		__field(u64, addr)
+		__field(u64, len)
+		__field(u32, acc)
+	),
+	TP_fast_assign(/* assign */
+		__entry->mode = mode;
+		__entry->addr = addr;
+		__entry->len = len;
+		__entry->acc = acc;
+	),
+	TP_printk(/* print */
+		"mode 0x%x addr 0x%llx, len %llu acc 0x%x",
+		__entry->mode,
+		__entry->addr,
+		__entry->len,
+		__entry->acc
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_template, rv_mr_reg,
+	TP_PROTO(u8 mode, u64 addr, u64 len, u32 acc),
+	TP_ARGS(mode, addr, len, acc)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_template, rv_mr_dereg,
+	TP_PROTO(u8 mode, u64 addr, u64 len, u32 acc),
+	TP_ARGS(mode, addr, len, acc)
+);
+
+DECLARE_EVENT_CLASS(/* mr_info */
+	rv_mr_info_template,
+	TP_PROTO(u64 addr, u64 len, u32 acc, u32 lkey,
+		 u32 rkey, u64 iova, u32 pd_usecnt),
+	TP_ARGS(addr, len, acc, lkey, rkey, iova, pd_usecnt),
+	TP_STRUCT__entry(/* entry */
+		__field(u64, addr)
+		__field(u64, len)
+		__field(u32, acc)
+		__field(u32, lkey)
+		__field(u32, rkey)
+		__field(u64, iova)
+		__field(u32, cnt)
+	),
+	TP_fast_assign(/* assign */
+		__entry->addr = addr;
+		__entry->len = len;
+		__entry->acc = acc;
+		__entry->lkey = lkey;
+		__entry->rkey = rkey;
+		__entry->iova = iova;
+		__entry->cnt = pd_usecnt;
+	),
+	TP_printk(/* print */
+		MR_INFO_PRN,
+		__entry->addr,
+		__entry->len,
+		__entry->acc,
+		__entry->lkey,
+		__entry->rkey,
+		__entry->iova,
+		__entry->cnt
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_info_template, rv_mr_info_reg,
+	TP_PROTO(u64 addr, u64 len, u32 acc, u32 lkey,
+		 u32 rkey, u64 iova, u32 pd_usecnt),
+	TP_ARGS(addr, len, acc, lkey, rkey, iova, pd_usecnt)
+);
+
+DEFINE_EVENT(/* event */
+	rv_mr_info_template, rv_mr_info_dereg,
+	TP_PROTO(u64 addr, u64 len, u32 acc, u32 lkey,
+		 u32 rkey, u64 iova, u32 pd_usecnt),
+	TP_ARGS(addr, len, acc, lkey, rkey, iova, pd_usecnt)
+);
+
+#endif /* __RV_TRACE_MR_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_mr
+#include <trace/define_trace.h>
diff --git a/drivers/infiniband/ulp/rv/trace_user.h b/drivers/infiniband/ulp/rv/trace_user.h
new file mode 100644
index 000000000000..2707e39bdfd6
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace_user.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 Intel Corporation.
+ */
+#if !defined(__RV_TRACE_USER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __RV_TRACE_USER_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rv_user
+
+#define RV_USER_MRS_PRN "rv_inx %d jdev %p total_size 0x%llx max_size 0x%llx " \
+			"refcount %u"
+
+DECLARE_EVENT_CLASS(/* user_mrs */
+	rv_user_mrs_template,
+	TP_PROTO(int rv_inx, void *jdev, u64 total_size, u64 max_size,
+		 u32 refcount),
+	TP_ARGS(rv_inx, jdev, total_size, max_size, refcount),
+	TP_STRUCT__entry(/* entry */
+		__field(int, rv_inx)
+		__field(void *, jdev)
+		__field(u64, total_size)
+		__field(u64, max_size)
+		__field(u32, refcount)
+	),
+	TP_fast_assign(/* assign */
+		__entry->rv_inx = rv_inx;
+		__entry->jdev = jdev;
+		__entry->total_size = total_size;
+		__entry->max_size = max_size;
+		__entry->refcount = refcount;
+	),
+	TP_printk(/* print */
+		RV_USER_MRS_PRN,
+		__entry->rv_inx,
+		__entry->jdev,
+		__entry->total_size,
+		__entry->max_size,
+		__entry->refcount
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_mrs_template, rv_user_mrs_attach,
+	TP_PROTO(int rv_inx, void *jdev, u64 total_size, u64 max_size,
+		 u32 refcount),
+	TP_ARGS(rv_inx, jdev, total_size, max_size, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_mrs_template, rv_user_mrs_release,
+	TP_PROTO(int rv_inx, void *jdev, u64 total_size, u64 max_size,
+		 u32 refcount),
+	TP_ARGS(rv_inx, jdev, total_size, max_size, refcount)
+);
+
+#endif /* __RV_TRACE_USER_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_user
+#include <trace/define_trace.h>
diff --git a/include/rdma/uverbs_types.h b/include/rdma/uverbs_types.h
index ccd11631c167..93aefd4d085f 100644
--- a/include/rdma/uverbs_types.h
+++ b/include/rdma/uverbs_types.h
@@ -134,6 +134,16 @@ static inline void uverbs_uobject_get(struct ib_uobject *uobject)
 }
 void uverbs_uobject_put(struct ib_uobject *uobject);
 
+struct ib_mr *rdma_reg_user_mr(struct ib_device *ib_dev, u32 cmd_fd,
+			       u32 pd_handle, u64 start, u64 length,
+			       u32 access_flags, size_t ulen, void *udata,
+			       struct fd *fd);
+int rdma_dereg_user_mr(struct ib_mr *mr, struct fd *fd);
+struct ib_mr *rdma_reg_kernel_mr(u32 cmd_fd, struct ib_pd *kern_pd, u64 start,
+				 u64 length, u32 access_flags, size_t ulen,
+				 void *udata, struct fd *fd);
+int rdma_dereg_kernel_mr(struct ib_mr *mr, struct fd *fd);
+
 struct uverbs_obj_fd_type {
 	/*
 	 * In fd based objects, uverbs_obj_type_ops points to generic
-- 
2.18.1



* [PATCH RFC 6/9] RDMA/rv: Add connection management functions
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
                   ` (4 preceding siblings ...)
  2021-03-19 12:56 ` [PATCH RFC 5/9] RDMA/rv: Add function to register/deregister memory region kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 7/9] RDMA/rv: Add functions for RDMA transactions kaike.wan
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

To improve the scalability of RDMA transactions, there will be only one
connection between any two nodes. Within each node pair, one node will
be the client and the other node will be the server, depending on the
lids/gids of the two nodes (see the sketch after the function list below).
However, to make best use of the bandwidth, each connection can have
multiple RC QPs that are shared among all of the processes within a job.
Connections are established through the IB CM interface.

This patch adds the following functions:
- Listener functions to wait for any IB CM requests from the same
  job.
- Client functions to send IB CM requests.
- Functions to manage the lifetime of the connection object.
- Functions to manage RC QPs.
- IB CM event handlers for client and server.
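
For illustration only (not part of this patch), below is a minimal sketch of
one way the client/server role could be chosen for a node pair.
rv_node_is_client() is a hypothetical name; cmp_gid() is the helper added in
rv_conn.c below, and the "larger address becomes the client" rule is merely
one example of a choice that both nodes can compute consistently:

/* Illustrative sketch only: pick the IB CM client side for a node pair */
static bool rv_node_is_client(const union ib_gid *loc_gid, u32 loc_lid,
			      const union ib_gid *rem_gid, u32 rem_lid)
{
	int cmp = cmp_gid(loc_gid, rem_gid);	/* helper from rv_conn.c */

	if (cmp)
		return cmp > 0;		/* e.g. larger GID acts as CM client */
	return loc_lid > rem_lid;	/* LID-routed fabric fallback */
}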

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 drivers/infiniband/ulp/rv/Makefile     |    2 +-
 drivers/infiniband/ulp/rv/rv_conn.c    | 3037 ++++++++++++++++++++++++
 drivers/infiniband/ulp/rv/rv_file.c    |  125 +
 drivers/infiniband/ulp/rv/rv_rdma.c    |  103 +
 drivers/infiniband/ulp/rv/trace.h      |    2 +
 drivers/infiniband/ulp/rv/trace_conn.h |  529 +++++
 drivers/infiniband/ulp/rv/trace_rdma.h |  129 +
 drivers/infiniband/ulp/rv/trace_user.h |   57 +-
 8 files changed, 3982 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rv/rv_conn.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_rdma.c
 create mode 100644 drivers/infiniband/ulp/rv/trace_conn.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_rdma.h

diff --git a/drivers/infiniband/ulp/rv/Makefile b/drivers/infiniband/ulp/rv/Makefile
index 677b113c0666..7211e6975d1d 100644
--- a/drivers/infiniband/ulp/rv/Makefile
+++ b/drivers/infiniband/ulp/rv/Makefile
@@ -4,6 +4,6 @@
 #
 obj-$(CONFIG_INFINIBAND_RV) += rv.o
 
-rv-y := rv_main.o rv_mr_cache.o rv_file.o rv_mr.o trace.o
+rv-y := rv_main.o rv_mr_cache.o rv_file.o rv_mr.o trace.o rv_rdma.o rv_conn.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/ulp/rv/rv_conn.c b/drivers/infiniband/ulp/rv/rv_conn.c
new file mode 100644
index 000000000000..842b0369125c
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv_conn.c
@@ -0,0 +1,3037 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+#define DRAIN_TIMEOUT 5 /* in seconds */
+
+#include <rdma/ib_marshall.h>
+#include <linux/nospec.h>
+
+#include "rv.h"
+#include "trace.h"
+
+static int rv_resolve_ip(struct rv_sconn *sconn);
+static int rv_err_qp(struct ib_qp *qp);
+static int rv_create_qp(int rv_inx, struct rv_sconn *sconn,
+			struct rv_job_dev *jdev);
+static void rv_destroy_qp(struct rv_sconn *sconn);
+static int rv_sconn_can_reconn(struct rv_sconn *sconn);
+static void rv_sconn_timeout_func(struct timer_list *timer);
+static void rv_sconn_timeout_work(struct work_struct *work);
+static void rv_sconn_delay_func(struct timer_list *timer);
+static void rv_sconn_delay_work(struct work_struct *work);
+static void rv_sconn_hb_func(struct timer_list *timer);
+static void rv_sconn_hb_work(struct work_struct *work);
+static void rv_hb_done(struct ib_cq *cq, struct ib_wc *wc);
+static void rv_sconn_enter_disconnecting(struct rv_sconn *sconn,
+					 const char *reason);
+static void rv_sconn_done_disconnecting(struct rv_sconn *sconn);
+static void rv_sconn_drain_timeout_func(struct timer_list *timer);
+
+static int rv_cm_handler(struct ib_cm_id *id, const struct ib_cm_event *evt);
+static void rv_send_req(struct rv_sconn *sconn);
+static void rv_qp_event(struct ib_event *event, void *context);
+static void rv_cq_event(struct ib_event *event, void *context);
+
+static void rv_sconn_free_primary_path(struct rv_sconn *sconn)
+{
+	kfree(sconn->primary_path);
+	sconn->primary_path = NULL;
+}
+
+/*
+ * Clean up an rv_sconn - a careful dance to shut down all sconn activity.
+ * rv_conn.kref is already 0, so all sconn QP/resolve/CM callbacks
+ * will test sconn->parent->kref and return without starting new work;
+ * the CM listener callback also won't accept a new REQ for this rv_conn.
+ * After rdma_addr_cancel - no resolver callback in flight nor scheduled.
+ * After ib_destroy_cm_id - CM will ensure no callbacks are active.
+ * After rv_destroy_qp - QP empty/drained, no more QP events, no more CQEs.
+ */
+static void rv_sconn_deinit(struct rv_sconn *sconn)
+{
+	trace_rv_sconn_deinit(sconn, sconn->index,
+			      sconn->qp ? sconn->qp->qp_num : 0,
+			      sconn->parent, sconn->flags, (u32)sconn->state,
+			      sconn->cm_id, sconn->resolver_retry_left);
+
+	del_timer_sync(&sconn->drain_timer);
+	del_timer_sync(&sconn->conn_timer);
+	del_timer_sync(&sconn->delay_timer);
+	del_timer_sync(&sconn->hb_timer);
+
+	if (sconn->state == RV_RESOLVING)
+		rdma_addr_cancel(&sconn->dev_addr);
+
+	if (sconn->cm_id) {
+		ib_destroy_cm_id(sconn->cm_id);
+		sconn->cm_id = NULL;
+	}
+
+	rv_destroy_qp(sconn);
+
+	rv_sconn_free_primary_path(sconn);
+}
+
+/*
+ * We flush wq2 to ensure all prior QP drain/destroy work items
+ * (especially those for the sconns in our conn) are done before
+ * we free the conn.  This prevents late RQ CQEs from dereferencing sconn
+ * after it has been freed.
+ */
+static void rv_handle_free_conn(struct work_struct *work)
+{
+	struct rv_conn *conn = container_of(work, struct rv_conn, free_work);
+
+	trace_rv_conn_release(conn, conn->rem_addr, conn->ah.is_global,
+			      conn->ah.dlid,
+			      be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[0])),
+			      be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[8])),
+			      conn->num_conn, conn->next,
+			      conn->jdev, kref_read(&conn->kref));
+	rv_flush_work2();
+	kfree_rcu(conn, rcu);
+}
+
+static void rv_conn_release(struct rv_conn *conn)
+{
+	int i;
+
+	trace_rv_conn_release(conn, conn->rem_addr, conn->ah.is_global,
+			      conn->ah.dlid,
+			      be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[0])),
+			      be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[8])),
+			      conn->num_conn, conn->next,
+			      conn->jdev, kref_read(&conn->kref));
+
+	mutex_lock(&conn->jdev->conn_list_mutex);
+	list_del_rcu(&conn->conn_entry);
+	mutex_unlock(&conn->jdev->conn_list_mutex);
+
+	for (i = 0; i < conn->num_conn; i++)
+		rv_sconn_deinit(&conn->sconn_arr[i]);
+	rv_job_dev_put(conn->jdev);
+	rv_queue_work3(&conn->free_work);
+}
+
+/*
+ * Since this function may be called from rv_cm_handler(), we can't call
+ * rv_conn_release() directly to destroy the cm_id (and wait on cm handler
+ * mutex).
+ * Instead, put the cleanup on a workqueue thread.
+ */
+static void rv_conn_schedule_release(struct kref *kref)
+{
+	struct rv_conn *conn = container_of(kref, struct rv_conn, kref);
+
+	rv_queue_work(&conn->put_work);
+}
+
+void rv_conn_put(struct rv_conn *conn)
+{
+	kref_put(&conn->kref, rv_conn_schedule_release);
+}
+
+/* return 0 if successful get, return -ENXIO if object going away */
+int rv_conn_get_check(struct rv_conn *conn)
+{
+	return kref_get_unless_zero(&conn->kref) ? 0 : -ENXIO;
+}
+
+void rv_conn_get(struct rv_conn *conn)
+{
+	kref_get(&conn->kref);
+}
+
+/* we can get away with a quick read of state without the mutex */
+static int rv_sconn_connected(struct rv_sconn *sconn)
+{
+	switch (sconn->state) {
+	case RV_CONNECTED:
+		return 1;
+	case RV_ERROR:
+		return -EIO;
+	default:
+		return test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags);
+	}
+}
+
+static int rv_conn_connected(struct rv_conn *conn)
+{
+	int i;
+	int ret;
+
+	for (i = 0; i < conn->num_conn; i++) {
+		ret = rv_sconn_connected(&conn->sconn_arr[i]);
+		if (ret <= 0)
+			return ret;
+	}
+	return 1;
+}
+
+/*
+ * Returns 1 if gid1 > gid2
+ * Returns 0 if the same
+ * Returns -1 if gid1 < gid2
+ */
+int cmp_gid(const void *gid1, const void *gid2)
+{
+	u64 *subn1, *subn2, *ifid1, *ifid2;
+
+	subn1 = (u64 *)gid1;
+	ifid1 = (u64 *)gid1 + 1;
+
+	subn2 = (u64 *)gid2;
+	ifid2 = (u64 *)gid2 + 1;
+
+	if (*subn1 != *subn2) {
+		if (*subn1 > *subn2)
+			return 1;
+		else
+			return -1;
+	}
+	if (*ifid1 > *ifid2)
+		return 1;
+	else if (*ifid2 > *ifid1)
+		return -1;
+	else
+		return 0;
+}
+
+/* in microseconds */
+static u64 rv_sconn_time_elapsed(struct rv_sconn *sconn)
+{
+	return ktime_us_delta(ktime_get(), sconn->start_time);
+}
+
+/*
+ * LID is 10 characters + \0
+ * IPv4 is 15 + \0
+ * IPv6 is 39 + \0
+ */
+#define RV_MAX_ADDR_STR 40
+
+static char *show_gid(char *buf, size_t size, const u8 *gid)
+{
+	if (ipv6_addr_v4mapped((struct in6_addr *)gid))
+		snprintf(buf, size, "%pI4", &gid[12]);
+	else
+		snprintf(buf, size, "%pI6", gid);
+	return buf;
+}
+
+static char *show_rem_addr(char *buf, size_t size, struct rv_sconn *sconn)
+{
+	struct rv_conn *conn = sconn->parent;
+
+	if (!conn->ah.is_global)
+		snprintf(buf, size, "LID 0x%x", conn->ah.dlid);
+	else
+		show_gid(buf, size, conn->ah.grh.dgid);
+	return buf;
+}
+
+static const char *get_device_name(struct rv_sconn *sconn)
+{
+	struct ib_device *ib_dev = sconn->parent->jdev->dev->ib_dev;
+
+	if (ib_dev)
+		return ib_dev->name;
+	else
+		return "unknown";
+}
+
+/*
+ * Move to the new state and handle basic transition activities
+ *
+ * rv_sconn.mutex must be held
+ * reason - used in log messages for transitions out of RV_CONNECTED
+ *	or to RV_ERROR
+ */
+static void rv_sconn_set_state(struct rv_sconn *sconn, enum rv_sconn_state new,
+			       const char *reason)
+{
+	enum rv_sconn_state old = sconn->state;
+	char buf[RV_MAX_ADDR_STR];
+
+	/* some log messages for major transitions */
+	if (old == RV_CONNECTED && new != RV_CONNECTED)
+		rv_conn_err(sconn,
+			    "Conn Lost to %s via %s: sconn inx %u qp %u: %s\n",
+			    show_rem_addr(buf, sizeof(buf), sconn),
+			    get_device_name(sconn), sconn->index,
+			    sconn->qp ? sconn->qp->qp_num : 0, reason);
+	if (old != RV_CONNECTED && new == RV_CONNECTED &&
+	    test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+		rv_conn_err(sconn,
+			    "Reconnected to %s via %s: sconn index %u qp %u\n",
+			    show_rem_addr(buf, sizeof(buf), sconn),
+			    get_device_name(sconn), sconn->index,
+			    sconn->qp ? sconn->qp->qp_num : 0);
+	else if (old != RV_ERROR && new == RV_ERROR) {
+		if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+			rv_conn_err(sconn,
+				    "Unable to Reconn to %s via %s: sconn %u qp %u: %s\n",
+				    show_rem_addr(buf, sizeof(buf), sconn),
+				    get_device_name(sconn), sconn->index,
+				    sconn->qp ? sconn->qp->qp_num : 0, reason);
+		else
+			rv_conn_err(sconn,
+				    "Unable to Connect to %s via %s: sconn %u qp %u: %s\n",
+				    show_rem_addr(buf, sizeof(buf), sconn),
+				    get_device_name(sconn), sconn->index,
+				    sconn->qp ? sconn->qp->qp_num : 0, reason);
+	}
+
+	/*
+	 * process exit from old state
+	 * elapsed time measured for success or failure
+	 */
+	if (old == RV_WAITING && new != RV_WAITING) {
+		if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags)) {
+			u64 elapsed = rv_sconn_time_elapsed(sconn);
+
+			sconn->stats.rewait_time += elapsed;
+			if (elapsed > sconn->stats.max_rewait_time)
+				sconn->stats.max_rewait_time = elapsed;
+		} else {
+			sconn->stats.wait_time = rv_sconn_time_elapsed(sconn);
+		}
+	} else if (old == RV_RESOLVING && new != RV_RESOLVING) {
+		if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags)) {
+			u64 elapsed = rv_sconn_time_elapsed(sconn);
+
+			sconn->stats.reresolve_time += elapsed;
+			if (elapsed > sconn->stats.max_reresolve_time)
+				sconn->stats.max_reresolve_time = elapsed;
+		} else {
+			sconn->stats.resolve_time =
+				rv_sconn_time_elapsed(sconn);
+		}
+	} else if (old == RV_CONNECTING && new != RV_CONNECTING) {
+		if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags)) {
+			u64 elapsed = rv_sconn_time_elapsed(sconn);
+
+			sconn->stats.reconnect_time += elapsed;
+			if (elapsed > sconn->stats.max_reconnect_time)
+				sconn->stats.max_reconnect_time = elapsed;
+		} else {
+			sconn->stats.connect_time =
+				rv_sconn_time_elapsed(sconn);
+		}
+	} else if (old == RV_CONNECTED && new != RV_CONNECTED) {
+		del_timer_sync(&sconn->hb_timer);
+		sconn->stats.connected_time += rv_sconn_time_elapsed(sconn);
+		if (new != RV_ERROR) {
+			/* reconnect starts on 1st exit from CONNECTED */
+			sconn->conn_timer.expires = jiffies +
+				       sconn->parent->jdev->reconnect_timeout *
+				       HZ;
+			add_timer(&sconn->conn_timer);
+		}
+	} else if (old == RV_DISCONNECTING && new != RV_DISCONNECTING) {
+		del_timer_sync(&sconn->drain_timer);
+	}
+
+	/* process entry to new state */
+	if (old != RV_WAITING && new == RV_WAITING) {
+		sconn->start_time = ktime_get();
+	} else if (old != RV_RESOLVING && new == RV_RESOLVING) {
+		sconn->start_time = ktime_get();
+	} else if (old != RV_CONNECTING && new == RV_CONNECTING) {
+		sconn->start_time = ktime_get();
+	} else if (old != RV_CONNECTED && new == RV_CONNECTED) {
+		if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+			sconn->stats.conn_recovery++;
+		sconn->start_time = ktime_get();
+		set_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags);
+		del_timer_sync(&sconn->conn_timer);
+	} else if (old != RV_DELAY && new == RV_DELAY) {
+		sconn->delay_timer.expires = jiffies + RV_RECONNECT_DELAY;
+		add_timer(&sconn->delay_timer);
+	} else if (old != RV_ERROR && new == RV_ERROR) {
+		del_timer_sync(&sconn->hb_timer);
+		del_timer_sync(&sconn->conn_timer);
+		del_timer_sync(&sconn->delay_timer);
+		if (sconn->qp) {
+			/* this will trigger the QP to self drain */
+			rv_err_qp(sconn->qp);
+			set_bit(RV_SCONN_DRAINING, &sconn->flags);
+		}
+	}
+
+	sconn->state = new;
+	trace_rv_sconn_set_state(sconn, sconn->index,
+				 sconn->qp ? sconn->qp->qp_num : 0,
+				 sconn->parent, sconn->flags,
+				 (u32)sconn->state, sconn->cm_id,
+				 sconn->resolver_retry_left);
+}
+
+static int rv_sconn_move_qp_to_rtr(struct rv_sconn *sconn, u32 *psn)
+{
+	struct ib_qp_attr qp_attr;
+	int attr_mask = 0;
+	int ret;
+
+	/* move QP to INIT */
+	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.qp_state = IB_QPS_INIT;
+	ret = ib_cm_init_qp_attr(sconn->cm_id, &qp_attr, &attr_mask);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to init qp_attr for INIT: %d\n",
+			    ret);
+		return ret;
+	}
+	trace_rv_msg_qp_rtr(sconn, sconn->index, "pkey_index + sconn",
+			    (u64)qp_attr.pkey_index, (u64)sconn);
+	ret = ib_modify_qp(sconn->qp, &qp_attr, attr_mask);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to move qp %u into INIT: %d\n",
+			    sconn->qp ? sconn->qp->qp_num : 0,  ret);
+		return ret;
+	}
+
+	/* move QP to RTR */
+	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.qp_state = IB_QPS_RTR;
+	ret = ib_cm_init_qp_attr(sconn->cm_id, &qp_attr, &attr_mask);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to init qp_attr for RTR: %d\n", ret);
+		return ret;
+	}
+	if (psn) {
+		*psn = prandom_u32() & 0xffffff;
+		qp_attr.rq_psn = *psn;
+	}
+	trace_rv_msg_qp_rtr(sconn, sconn->index, "dlid | dqp_num, mtu | rq_psn",
+			    (u64)(qp_attr.ah_attr.ib.dlid |
+				  ((u64)qp_attr.dest_qp_num) << 32),
+			    (u64)(qp_attr.path_mtu |
+				  ((u64)qp_attr.rq_psn) << 32));
+	ret = ib_modify_qp(sconn->qp, &qp_attr, attr_mask);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to move qp %u into RTR: %d\n",
+			    sconn->qp ? sconn->qp->qp_num : 0,  ret);
+		return ret;
+	}
+
+	/* post recv WQEs */
+	ret = rv_drv_prepost_recv(sconn);
+	if (ret)
+		rv_conn_err(sconn, "Failed to prepost qp %u recv WQEs: %d\n",
+			    sconn->qp ? sconn->qp->qp_num : 0,  ret);
+
+	return ret;
+}
+
+static int rv_sconn_move_qp_to_rts(struct rv_sconn *sconn)
+{
+	struct ib_qp_attr qp_attr;
+	int attr_mask;
+	int ret;
+
+	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(sconn->cm_id, &qp_attr, &attr_mask);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to init qp_attr for RTS: %d\n", ret);
+		return ret;
+	}
+	ret = ib_modify_qp(sconn->qp, &qp_attr, attr_mask);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to move qp %u into RTS: %d\n",
+			    sconn->qp ? sconn->qp->qp_num : 0,  ret);
+		return ret;
+	}
+	return ret;
+}
+
+/*
+ * validate REP basics
+ * - private_data format and version
+ *	rev must always be an exact version we support
+ *	reject rev 0 & 1, only support rev 2
+ * - SRQ
+ */
+static int rv_check_rep_basics(struct rv_sconn *sconn,
+			       const struct ib_cm_rep_event_param *param,
+			       struct rv_rep_priv_data *priv_data)
+{
+	if (priv_data->magic != RV_PRIVATE_DATA_MAGIC) {
+		rv_conn_err(sconn,
+			    "Inval CM REP recv: magic 0x%llx expected 0x%llx\n",
+			    priv_data->magic, RV_PRIVATE_DATA_MAGIC);
+		return -EINVAL;
+	}
+	if (priv_data->ver != RV_PRIVATE_DATA_VER) {
+		rv_conn_err(sconn,
+			    "Invalid CM REP recv: rv version %d expected %d\n",
+			    priv_data->ver, RV_PRIVATE_DATA_VER);
+		return -EINVAL;
+	}
+	if (param->srq) {
+		rv_conn_err(sconn, "Invalid srq %d\n", param->srq);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Client side inbound CM REP handler
+ * caller must hold a rv_conn reference and the sconn->mutex
+ * This func does not release that ref
+ */
+static void rv_cm_rep_handler(struct rv_sconn *sconn,
+			      const struct ib_cm_rep_event_param *param,
+			      const void *private_data)
+{
+	int ret;
+	struct rv_job_dev *jdev = sconn->parent->jdev;
+	struct rv_rep_priv_data *priv_data =
+				 (struct rv_rep_priv_data *)private_data;
+	const char *reason;
+
+	if (rv_check_rep_basics(sconn, param, priv_data)) {
+		reason = "invalid REP";
+		goto rej;
+	}
+
+	if (sconn->state != RV_CONNECTING) {
+		reason = "unexpected REP";
+		goto rej;
+	}
+
+	if (rv_sconn_move_qp_to_rtr(sconn, NULL))
+		goto err;
+
+	if (rv_sconn_move_qp_to_rts(sconn))
+		goto err;
+
+	ret = ib_send_cm_rtu(sconn->cm_id, NULL, 0);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to send cm RTU: %d\n", ret);
+		goto err;
+	}
+	sconn->stats.rtu_sent++;
+	trace_rv_msg_cm_rep_handler(sconn, sconn->index, "Sending RTU", 0,
+				    (u64)sconn);
+	if (jdev->hb_interval) {
+		sconn->hb_timer.expires = jiffies +
+					  msecs_to_jiffies(jdev->hb_interval);
+		add_timer(&sconn->hb_timer);
+	}
+	rv_sconn_set_state(sconn, RV_CONNECTED, "");
+	return;
+
+err:
+	/* do not try to retry/recover for fundamental QP errors */
+	if (!ib_send_cm_rej(sconn->cm_id, IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES,
+			    NULL, 0, NULL, 0)) {
+		u64 val = (u64)IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES;
+
+		sconn->stats.rej_sent++;
+		trace_rv_msg_cm_rep_handler(sconn, sconn->index,
+					    "Sending REJ reason",
+					    val, (u64)sconn);
+	}
+	rv_sconn_set_state(sconn, RV_ERROR, "local error handling REP");
+	return;
+
+rej:
+	if (!ib_send_cm_rej(sconn->cm_id, IB_CM_REJ_CONSUMER_DEFINED,
+			    NULL, 0, NULL, 0)) {
+		sconn->stats.rej_sent++;
+		trace_rv_msg_cm_rep_handler(sconn, sconn->index,
+					    "Sending REJ reason",
+					(u64)IB_CM_REJ_CONSUMER_DEFINED,
+					(u64)sconn);
+	}
+	rv_sconn_set_state(sconn, RV_ERROR, reason);
+}
+
+/*
+ * validate REQ basics
+ * - private_data format and version
+ *	reject rev 0 & 1
+ *	accept >=2, assume future versions will be forward compatible
+ * - QP type and APM
+ */
+static int rv_check_req_basics(struct ib_cm_id *id,
+			       const struct ib_cm_req_event_param *param,
+			       struct rv_req_priv_data *priv_data)
+{
+	if (priv_data->magic != RV_PRIVATE_DATA_MAGIC) {
+		rv_cm_err(id,
+			  "Inval CM REQ recv: magic 0x%llx expected 0x%llx\n",
+			  priv_data->magic, RV_PRIVATE_DATA_MAGIC);
+		return -EINVAL;
+	}
+	if (priv_data->ver < RV_PRIVATE_DATA_VER) {
+		rv_cm_err(id,
+			  "Invalid CM REQ recv: rv version %d expected %d\n",
+			  priv_data->ver, RV_PRIVATE_DATA_VER);
+		return -EINVAL;
+	}
+
+	if (param->qp_type != IB_QPT_RC || param->srq) {
+		rv_cm_err(id,
+			  "Invalid qp_type 0x%x or srq %d\n",
+			  param->qp_type, param->srq);
+		return -EINVAL;
+	}
+
+	if (param->alternate_path) {
+		rv_cm_err(id,
+			  "Invalid CM REQ recv: alt path not allowed\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/* validate REQ primary_path against sconn->conn->ah from create_conn */
+static int rv_sconn_req_check_ah(const struct rv_sconn *sconn,
+				 const struct sa_path_rec *path)
+{
+	struct rv_conn *conn = sconn->parent;
+	int ret = -EINVAL;
+
+#define RV_CHECK(f1, f2) (path->f1 != conn->ah.f2)
+#define RV_CHECK_BE32(f1, f2) (be32_to_cpu(path->f1) != conn->ah.f2)
+#define RV_REPORT(f1, f2, text, format) \
+		rv_conn_err(sconn, "CM REQ inconsistent " text " " format \
+			    " with create_conn " format "\n", \
+			    path->f1, conn->ah.f2)
+
+	if (RV_CHECK(sl, sl))
+		RV_REPORT(sl, sl, "SL", "%u");
+	else if (conn->ah.is_global &&
+		 RV_CHECK(traffic_class, grh.traffic_class))
+		RV_REPORT(traffic_class, grh.traffic_class, "traffic_class",
+			  "%u");
+	else if (conn->ah.is_global &&
+		 RV_CHECK_BE32(flow_label, grh.flow_label))
+		RV_REPORT(flow_label, grh.flow_label, "flow_label", "0x%x");
+		/* for RoCE hop_limit overridden by resolver */
+	else if (conn->ah.is_global && !rv_jdev_protocol_roce(conn->jdev) &&
+		 RV_CHECK(hop_limit, grh.hop_limit))
+		RV_REPORT(hop_limit, grh.hop_limit, "hop_limit", "%u");
+	else if (RV_CHECK(rate, static_rate))
+		RV_REPORT(rate, static_rate, "rate", "%u");
+#undef RV_CHECK
+#undef RV_CHECK_BE32
+#undef RV_REPORT
+	else
+		ret = 0;
+	return ret;
+}
+
+/* validate REQ primary_path against sconn->path from cm_connect */
+static int rv_sconn_req_check_path(const struct rv_sconn *sconn,
+				   const struct sa_path_rec *path)
+{
+	int ret = -EINVAL;
+
+#define RV_CHECK(field) (path->field != sconn->path.field)
+#define RV_REPORT(field, text, format) \
+		     rv_conn_err(sconn, "CM REQ inconsistent " text " " format \
+				 " with connect " format "\n", \
+				 path->field, sconn->path.field)
+	if (RV_CHECK(pkey))
+		RV_REPORT(pkey, "pkey", "0x%x");
+	else if (RV_CHECK(mtu))
+		RV_REPORT(mtu, "mtu", "%u");
+	else if (RV_CHECK(sl))
+		RV_REPORT(sl, "SL", "%u");
+	else if (RV_CHECK(traffic_class))
+		RV_REPORT(traffic_class, "traffic_class", "%u");
+	else if (RV_CHECK(flow_label))
+		RV_REPORT(flow_label, "flow_label", "0x%x");
+	else if (RV_CHECK(rate))
+		RV_REPORT(rate, "rate", "%u");
+		/* for RoCE hop_limit overridden by resolver */
+	else if (!rv_jdev_protocol_roce(sconn->parent->jdev) &&
+		 RV_CHECK(hop_limit))
+		RV_REPORT(hop_limit, "hop_limit", "%u");
+	else if (path->packet_life_time < sconn->path.packet_life_time)
+		RV_REPORT(packet_life_time, "packet_life_time", "%u");
+#undef RV_CHECK
+#undef RV_REPORT
+	else
+		ret = 0;
+	return ret;
+}
+
+/*
+ * caller must hold a rv_conn reference and sconn->mutex.
+ * This func does not release the ref nor mutex
+ * The private data version must be <= version in REQ and reflect a
+ * version both client and listener support.
+ * We currently only support version 2.
+ */
+static void rv_send_rep(struct rv_sconn *sconn,
+			const struct ib_cm_req_event_param *param, u32 psn)
+{
+	struct ib_cm_rep_param rep;
+	struct rv_rep_priv_data priv_data = {
+			.magic = RV_PRIVATE_DATA_MAGIC,
+			.ver = RV_PRIVATE_DATA_VER,
+		};
+	int ret;
+
+	memset(&rep, 0, sizeof(rep));
+	rep.qp_num = sconn->qp->qp_num;
+	rep.rnr_retry_count = min_t(unsigned int, 7, param->rnr_retry_count);
+	rep.flow_control = 1;
+	rep.failover_accepted = 0;
+	rep.srq = !!(sconn->qp->srq);
+	rep.responder_resources = 0;
+	rep.initiator_depth = 0;
+	rep.starting_psn = psn;
+
+	rep.private_data = &priv_data;
+	rep.private_data_len = sizeof(priv_data);
+
+	ret = ib_send_cm_rep(sconn->cm_id, &rep);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to send CM REP: %d\n", ret);
+		goto err;
+	}
+	sconn->stats.rep_sent++;
+	trace_rv_msg_cm_req_handler(sconn, sconn->index, "Sending REP", 0,
+				    (u64)sconn);
+	rv_sconn_set_state(sconn, RV_CONNECTING, "");
+	return;
+
+err:
+	if (!ib_send_cm_rej(sconn->cm_id, IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES,
+			    NULL, 0, NULL, 0)) {
+		u64 val =  (u64)IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES;
+
+		sconn->stats.rej_sent++;
+		trace_rv_msg_cm_req_handler(sconn, sconn->index,
+					    "Sending REJ reason",
+					    val, (u64)sconn);
+	}
+	rv_sconn_set_state(sconn, RV_ERROR, "local error sending REP");
+}
+
+/*
+ * Server side inbound CM REQ handler
+ * special cases:
+ *	if RV_CONNECTING - RTU got lost and remote is already trying again
+ *	if RV_CONNECTED - remote figured out the connection is down first
+ * return:
+ *	0 - sconn has taken ownership of the cm_id
+ *	<0 - CM should destroy the id
+ * rv_find_sconn_from_req validates REQ against jdev:
+ *	job key, local port, local device, sconn index,
+ *	remote address (dgid or dlid), hb_interval
+ * For valid REQs, we establish a new IB CM handler for subsequent CM events
+ */
+static int rv_cm_req_handler(struct ib_cm_id *id,
+			     const struct ib_cm_req_event_param *param,
+			     void *private_data)
+{
+	struct rv_req_priv_data *priv_data =
+				 (struct rv_req_priv_data *)private_data;
+	struct rv_sconn *sconn = NULL;
+	u32 psn;
+
+	if (rv_check_req_basics(id, param, priv_data))
+		goto rej;
+
+	sconn = rv_find_sconn_from_req(id, param, priv_data);
+	if (!sconn) {
+		rv_cm_err(id, "Could not find conn for the request\n");
+		goto rej;
+	}
+
+	mutex_lock(&sconn->mutex);
+
+	sconn->stats.cm_evt_cnt[IB_CM_REQ_RECEIVED]++;
+	trace_rv_sconn_req_handler(sconn, sconn->index,
+				   sconn->qp ? sconn->qp->qp_num : 0,
+				   sconn->parent, sconn->flags,
+				   (u32)sconn->state, id,
+				   sconn->resolver_retry_left);
+
+	if (rv_sconn_req_check_ah(sconn, param->primary_path))
+		goto rej;
+	if (sconn->path.dlid &&
+	    rv_sconn_req_check_path(sconn, param->primary_path))
+		goto rej;
+
+	switch (sconn->state) {
+	case RV_WAITING:
+		break;
+	case RV_CONNECTING:
+	case RV_CONNECTED:
+		if (rv_sconn_can_reconn(sconn))
+			rv_sconn_enter_disconnecting(sconn,
+						     "remote reconnecting");
+		/* FALLTHROUGH */
+	default:
+		goto rej;
+	}
+	if (!sconn->qp)
+		goto rej;
+
+	sconn->cm_id = id;
+	id->context = sconn;
+
+	id->cm_handler = rv_cm_handler;
+	if (rv_sconn_move_qp_to_rtr(sconn, &psn))
+		goto err;
+
+	rv_send_rep(sconn, param, psn);
+	mutex_unlock(&sconn->mutex);
+	rv_conn_put(sconn->parent);
+	return 0;
+
+err:
+	if (!ib_send_cm_rej(id, IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES, NULL, 0,
+			    NULL, 0)) {
+		u64 val =  (u64)IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES;
+
+		sconn->stats.rej_sent++;
+		trace_rv_msg_cm_req_handler(sconn, sconn->index,
+					    "Sending REJ reason",
+					    val, (u64)sconn);
+	}
+	rv_sconn_set_state(sconn, RV_ERROR, "local error handling REQ");
+	mutex_unlock(&sconn->mutex);
+	rv_conn_put(sconn->parent);
+	return 0;
+
+rej:
+	if (!ib_send_cm_rej(id, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0)) {
+		if (sconn)
+			sconn->stats.rej_sent++;
+		trace_rv_msg_cm_req_handler(sconn, sconn ? sconn->index : 0,
+					    "Sending REJ reason",
+					    (u64)IB_CM_REJ_CONSUMER_DEFINED,
+					    (u64)sconn);
+	}
+	if (sconn) {
+		mutex_unlock(&sconn->mutex);
+		rv_conn_put(sconn->parent);
+	}
+	return -EINVAL;
+}
+
+/* must hold sconn->mutex */
+static int rv_sconn_can_reconn(struct rv_sconn *sconn)
+{
+	return test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags) &&
+	       sconn->parent->jdev->reconnect_timeout &&
+	       rv_job_dev_has_users(sconn->parent->jdev);
+}
+
+/*
+ * Post a WR, rv_drain_done will handle when SQ is empty
+ * caller must hold a reference and QP must be in QPS_ERR
+ * an additional reference is established on behalf of the WR's CQ callback
+ */
+static int rv_start_drain_sq(struct rv_sconn *sconn)
+{
+	struct ib_rdma_wr swr = {
+		.wr = {
+			.wr_cqe	= &sconn->sdrain_cqe,
+			.opcode	= IB_WR_RDMA_WRITE,
+		},
+	};
+	int ret;
+
+	rv_conn_get(sconn->parent);
+	ret = ib_post_send(sconn->qp, &swr.wr, NULL);
+	if (ret) {
+		rv_conn_err(sconn, "failed to drain send queue: post %d\n",
+			    ret);
+		rv_conn_put(sconn->parent);
+	}
+	return ret;
+}
+
+/*
+ * Post a WR, rv_drain_done will handle when RQ is empty
+ * caller must hold a reference and QP must be in QPS_ERR
+ * an additional reference is established on behalf of the WR's CQ callback
+ */
+static int rv_start_drain_rq(struct rv_sconn *sconn)
+{
+	struct ib_recv_wr rwr = {
+		.wr_cqe	= &sconn->rdrain_cqe,
+	};
+	int ret;
+
+	rv_conn_get(sconn->parent);
+	ret = ib_post_recv(sconn->qp, &rwr, NULL);
+	if (ret) {
+		rv_conn_err(sconn, "failed to drain recv queue: post %d\n",
+			    ret);
+		rv_conn_put(sconn->parent);
+	}
+	return ret;
+}
+
+/* in soft IRQ context, a reference held on our behalf */
+static void rv_rq_drain_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rv_sconn *sconn = container_of(wc->wr_cqe,
+					      struct rv_sconn, rdrain_cqe);
+
+	if (test_bit(RV_SCONN_DRAINING, &sconn->flags)) {
+		set_bit(RV_SCONN_RQ_DRAINED, &sconn->flags);
+		trace_rv_sconn_drain_done(sconn, sconn->index,
+					  sconn->qp ? sconn->qp->qp_num : 0,
+					  sconn->parent, sconn->flags,
+					  (u32)sconn->state, sconn->cm_id,
+					  sconn->resolver_retry_left);
+		if (test_bit(RV_SCONN_SQ_DRAINED, &sconn->flags)) {
+			del_timer_sync(&sconn->drain_timer);
+			rv_queue_work(&sconn->drain_work);
+			return;
+		}
+	}
+	rv_conn_put(sconn->parent);
+}
+
+/* in soft IRQ context, a reference held on our behalf */
+static void rv_sq_drain_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rv_sconn *sconn = container_of(wc->wr_cqe,
+					      struct rv_sconn, sdrain_cqe);
+
+	if (test_bit(RV_SCONN_DRAINING, &sconn->flags)) {
+		set_bit(RV_SCONN_SQ_DRAINED, &sconn->flags);
+		trace_rv_sconn_drain_done(sconn, sconn->index,
+					  sconn->qp ? sconn->qp->qp_num : 0,
+					  sconn->parent, sconn->flags,
+					  (u32)sconn->state, sconn->cm_id,
+					  sconn->resolver_retry_left);
+		if (test_bit(RV_SCONN_RQ_DRAINED, &sconn->flags)) {
+			del_timer_sync(&sconn->drain_timer);
+			rv_queue_work(&sconn->drain_work);
+			return;
+		}
+	}
+	rv_conn_put(sconn->parent);
+}
+
+/*
+ * timeout exhausted on a drain CQE callback
+ * a rv_conn reference is held by the outstanding RQ and SQ drains
+ * we assume we have waited long enough that CQE callback is not coming
+ * and will not race with this func
+ */
+static void rv_sconn_drain_timeout_func(struct timer_list *timer)
+{
+	struct rv_sconn *sconn = container_of(timer, struct rv_sconn,
+					      drain_timer);
+
+	if (!sconn->parent)
+		return;
+	if (!test_bit(RV_SCONN_SQ_DRAINED, &sconn->flags) &&
+	    !test_bit(RV_SCONN_RQ_DRAINED, &sconn->flags))
+		rv_conn_put(sconn->parent);
+
+	if (!test_bit(RV_SCONN_RQ_DRAINED, &sconn->flags)) {
+		set_bit(RV_SCONN_RQ_DRAINED, &sconn->flags);
+		rv_conn_dbg(sconn,
+			    "drain recv queue sconn index %u qp %u conn %p\n",
+			    sconn->index, sconn->qp ? sconn->qp->qp_num : 0,
+			    sconn->parent);
+	}
+	if (!test_bit(RV_SCONN_SQ_DRAINED, &sconn->flags)) {
+		set_bit(RV_SCONN_SQ_DRAINED, &sconn->flags);
+		rv_conn_dbg(sconn,
+			    "drain send queue sconn index %u qp %u conn %p\n",
+			    sconn->index, sconn->qp ? sconn->qp->qp_num : 0,
+			    sconn->parent);
+	}
+	rv_queue_work(&sconn->drain_work);
+}
+
+/*
+ * must hold sconn->mutex and have a reference
+ * If QP is in QPS_RESET, nothing to do
+ * drain_lock makes sure no recv WQEs get reposted after our drain WQE
+ */
+static void rv_sconn_enter_disconnecting(struct rv_sconn *sconn,
+					 const char *reason)
+{
+	unsigned long flags;
+	int ret;
+
+	rv_sconn_set_state(sconn, RV_DISCONNECTING, reason);
+
+	ret = rv_err_qp(sconn->qp);
+	if (ret == 1) {
+		rv_sconn_done_disconnecting(sconn);
+		return;
+	} else if (ret) {
+		goto fail;
+	}
+
+	spin_lock_irqsave(&sconn->drain_lock, flags);
+	set_bit(RV_SCONN_DRAINING, &sconn->flags);
+	sconn->drain_timer.expires = jiffies + DRAIN_TIMEOUT * HZ;
+	add_timer(&sconn->drain_timer);
+
+	ret = rv_start_drain_rq(sconn);
+	ret |= rv_start_drain_sq(sconn);
+	spin_unlock_irqrestore(&sconn->drain_lock, flags);
+	if (!ret)
+		return;
+fail:
+	trace_rv_msg_enter_disconnect(sconn, sconn->index,
+				      "Unable to move QP to error", 0, 0);
+	rv_sconn_set_state(sconn, RV_ERROR, "unable to drain QP");
+}
+
+struct rv_dest_cm_work_item {
+	struct work_struct destroy_work;
+	struct rv_sconn *sconn;
+	struct ib_cm_id *cm_id;
+	struct ib_qp *qp;
+};
+
+/*
+ * destroy the CM_ID and the QP
+ * Once ib_destroy_cm_id returns, all CM callbacks are done
+ * Any WQEs/CQEs in flight must be drained before this handler is scheduled
+ */
+static void rv_handle_destroy_qp_cm(struct work_struct *work)
+{
+	struct rv_dest_cm_work_item *item = container_of(work,
+				struct rv_dest_cm_work_item, destroy_work);
+
+	ib_destroy_cm_id(item->cm_id);
+
+	ib_destroy_qp(item->qp);
+
+	rv_conn_put(item->sconn->parent);
+	kfree(item);
+}
+
+/*
+ * must hold sconn->mutex
+ * QP is now drained and no longer posting recv nor sends
+ * We start fresh with a new QP and cm_id.  This lets CM do its own
+ * timewait handling and avoids any stale packets arriving on our new QP.
+ * To conform to the lock hierarchy, schedule the actual destroy in a WQ;
+ * we can't destroy the cm_id while holding sconn->mutex nor in a CM callback.
+ */
+static void rv_sconn_done_disconnecting(struct rv_sconn *sconn)
+{
+	struct rv_dest_cm_work_item *item;
+	struct rv_job_dev *jdev = sconn->parent->jdev;
+	int ret;
+	struct ib_cm_id *id;
+
+	trace_rv_sconn_done_discon(sconn, sconn->index,
+				   sconn->qp ? sconn->qp->qp_num : 0,
+				   sconn->parent, sconn->flags,
+				   (u32)sconn->state, sconn->cm_id,
+				   sconn->resolver_retry_left);
+
+	item = kzalloc(sizeof(*item), GFP_KERNEL);
+	if (!item)
+		goto fail;
+	rv_conn_get(sconn->parent);
+	INIT_WORK(&item->destroy_work, rv_handle_destroy_qp_cm);
+	item->sconn = sconn;
+	item->cm_id = sconn->cm_id;
+	sconn->cm_id = NULL;
+	item->qp = sconn->qp;
+	sconn->qp = NULL;
+	rv_queue_work(&item->destroy_work);
+
+	clear_bit(RV_SCONN_DRAINING, &sconn->flags);
+	clear_bit(RV_SCONN_RQ_DRAINED, &sconn->flags);
+	clear_bit(RV_SCONN_SQ_DRAINED, &sconn->flags);
+
+	ret = rv_create_qp(RV_INVALID, sconn, jdev);
+	if (ret) {
+		rv_conn_err(sconn, "Failed to re-create qp: %d\n", ret);
+		goto fail;
+	}
+
+	if (test_bit(RV_SCONN_SERVER, &sconn->flags)) {
+		rv_sconn_set_state(sconn, RV_WAITING, "");
+		return;
+	}
+
+	id = ib_create_cm_id(jdev->dev->ib_dev, rv_cm_handler, sconn);
+	if (IS_ERR(id)) {
+		rv_conn_err(sconn, "Create CM ID failed\n");
+		goto fail;
+	}
+	sconn->cm_id = id;
+	rv_sconn_set_state(sconn, RV_DELAY, "");
+	return;
+
+fail:
+	rv_sconn_set_state(sconn, RV_ERROR, "local error disconnecting");
+}
+
+/* only allowed in RV_DISCONNECTING or RV_ERROR */
+static void rv_sconn_drain_work(struct work_struct *work)
+{
+	struct rv_sconn *sconn = container_of(work, struct rv_sconn,
+					      drain_work);
+
+	mutex_lock(&sconn->mutex);
+	if (sconn->state == RV_DISCONNECTING)
+		rv_sconn_done_disconnecting(sconn);
+	else
+		WARN_ON(sconn->state != RV_ERROR);
+	mutex_unlock(&sconn->mutex);
+
+	rv_conn_put(sconn->parent);
+}
+
+/*
+ * rv_cm_handler - the client callback function from IB CM
+ * @cm_id: Handle for connection manager
+ * @event: The event it caught
+ * Be reminded that we cannot destroy the cm_id in this thread.
+ */
+static int rv_cm_handler(struct ib_cm_id *id, const struct ib_cm_event *evt)
+{
+	struct rv_sconn *sconn = id->context;
+
+	trace_rv_cm_event_handler((u32)evt->event, id, sconn);
+	if (!sconn || !sconn->parent)
+		return 0;
+	if (rv_conn_get_check(sconn->parent))
+		return 0;
+	trace_rv_sconn_cm_handler(sconn, sconn->index,
+				  sconn->qp ? sconn->qp->qp_num : 0,
+				  sconn->parent, sconn->flags,
+				  (u32)sconn->state, sconn->cm_id,
+				  sconn->resolver_retry_left);
+
+	mutex_lock(&sconn->mutex);
+	sconn->stats.cm_evt_cnt[min(evt->event, RV_CM_EVENT_UNEXP)]++;
+
+	if (sconn->cm_id != id)
+		goto unlock;
+
+	switch (evt->event) {
+	case IB_CM_REP_RECEIVED:
+		rv_cm_rep_handler(sconn, &evt->param.rep_rcvd,
+				  evt->private_data);
+		break;
+	case IB_CM_RTU_RECEIVED:
+	case IB_CM_USER_ESTABLISHED:
+		if (sconn->state != RV_CONNECTING) {
+			if (!ib_send_cm_dreq(id, NULL, 0)) {
+				sconn->stats.dreq_sent++;
+				trace_rv_msg_cm_handler(sconn, sconn->index,
+							"Sending DREQ", 0,
+							(u64)sconn);
+			}
+			rv_sconn_set_state(sconn, RV_ERROR, "unexpected RTU");
+		} else if (rv_sconn_move_qp_to_rts(sconn)) {
+			if (!ib_send_cm_dreq(id, NULL, 0)) {
+				sconn->stats.dreq_sent++;
+				trace_rv_msg_cm_handler(sconn, sconn->index,
+							"Sending DREQ", 0,
+							(u64)sconn);
+			}
+			rv_sconn_set_state(sconn, RV_ERROR,
+					   "local error handling RTU");
+		} else {
+			rv_sconn_set_state(sconn, RV_CONNECTED, "");
+		}
+		break;
+
+	case IB_CM_REQ_ERROR:
+		trace_rv_msg_cm_handler(sconn, sconn->index,
+					"Sending CM REQ failed, send_status",
+					(u64)evt->param.send_status,
+					(u64)sconn);
+		if (sconn->state == RV_CONNECTING && rv_sconn_can_reconn(sconn))
+			rv_sconn_enter_disconnecting(sconn, "no REQ response");
+		else
+			rv_sconn_set_state(sconn, RV_ERROR, "no REQ response");
+		break;
+	case IB_CM_REP_ERROR:
+		trace_rv_msg_cm_handler(sconn, sconn->index,
+					"Sending CM REP failed, send_status",
+					(u64)evt->param.send_status,
+					(u64)sconn);
+		if (sconn->state == RV_CONNECTING && rv_sconn_can_reconn(sconn))
+			rv_sconn_enter_disconnecting(sconn, "no REP response");
+		else
+			rv_sconn_set_state(sconn, RV_ERROR, "no REP response");
+		break;
+	case IB_CM_REJ_RECEIVED:
+		trace_rv_msg_cm_handler(sconn, sconn->index,
+					"CM REJ received reason",
+					(u64)evt->param.rej_rcvd.reason,
+					(u64)sconn);
+		if (sconn->state == RV_CONNECTING && rv_sconn_can_reconn(sconn))
+			rv_sconn_enter_disconnecting(sconn, "received REJ");
+		else
+			rv_sconn_set_state(sconn, RV_ERROR, "received REJ");
+		break;
+	case IB_CM_DREQ_RECEIVED:
+		if (!ib_send_cm_drep(id, NULL, 0)) {
+			sconn->stats.drep_sent++;
+			trace_rv_msg_cm_handler(sconn, sconn->index,
+						"Sending DREP", 0, (u64)sconn);
+		}
+
+		if (sconn->state != RV_DISCONNECTING) {
+			if ((sconn->state == RV_CONNECTED ||
+			     sconn->state == RV_CONNECTING) &&
+			    rv_sconn_can_reconn(sconn))
+				rv_sconn_enter_disconnecting(sconn,
+							     "received DREQ");
+			else
+				rv_sconn_set_state(sconn, RV_ERROR,
+						   "received DREQ");
+		}
+		break;
+
+	case IB_CM_TIMEWAIT_EXIT:
+		break;
+	case IB_CM_MRA_RECEIVED:
+		break;
+	case IB_CM_DREQ_ERROR:
+	case IB_CM_DREP_RECEIVED:
+		break;
+
+	case IB_CM_REQ_RECEIVED:
+	case IB_CM_LAP_ERROR:
+	case IB_CM_LAP_RECEIVED:
+	case IB_CM_APR_RECEIVED:
+		break;
+	case IB_CM_SIDR_REQ_ERROR:
+	case IB_CM_SIDR_REQ_RECEIVED:
+	case IB_CM_SIDR_REP_RECEIVED:
+		break;
+	default:
+		rv_conn_err(sconn, "Unhandled CM event %d\n", evt->event);
+		WARN_ON(1);
+		rv_sconn_set_state(sconn, RV_ERROR, "invalid CM event");
+		break;
+	}
+
+unlock:
+	mutex_unlock(&sconn->mutex);
+
+	rv_conn_put(sconn->parent);
+
+	return 0;
+}
+
+/*
+ * rv_cm_server_handler - The server callback function from IB CM
+ * @cm_id: Handle for connection manager. This is a newly created cm_id
+ *         for the new connection, which is different from the original
+ *         listener cm_id.
+ * @event: The event it caught
+ * It only handles incoming REQs. All other events should go to rv_cm_handler
+ */
+static int rv_cm_server_handler(struct ib_cm_id *id,
+				const struct ib_cm_event *evt)
+{
+	int ret = 0;
+
+	trace_rv_cm_event_server_handler((u32)evt->event, id, NULL);
+	switch (evt->event) {
+	case IB_CM_REQ_RECEIVED:
+		ret = rv_cm_req_handler(id, &evt->param.req_rcvd,
+					evt->private_data);
+		break;
+	default:
+		rv_cm_err(id, "Unhandled CM event %d\n", evt->event);
+		WARN_ON(1);
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int rv_conn_cmp_params(int rv_inx, const struct rv_conn *conn,
+			      const struct rv_conn_create_params_in *param)
+{
+	if (param->rem_addr != conn->rem_addr) {
+		trace_rv_msg_cmp_params(rv_inx,
+					"rem_addr differ, skipping",
+					(u64)param->rem_addr,
+					(u64)conn->rem_addr);
+		return 1;
+	}
+
+	if (param->ah.is_global != conn->ah.is_global) {
+		trace_rv_msg_cmp_params(rv_inx,
+					"Global flags differ, skipping",
+					(u64)param->ah.is_global,
+					(u64)conn->ah.is_global);
+		return 1;
+	}
+
+	if (param->ah.is_global) {
+		if (cmp_gid(&param->ah.grh.dgid[0],
+			    &conn->ah.grh.dgid[0]) == 0) {
+			trace_rv_msg_cmp_params(rv_inx,
+						"GIDs match",
+						*(u64 *)&param->ah.grh.dgid[8],
+						*(u64 *)&conn->ah.grh.dgid[8]);
+			return 0;
+		}
+		trace_rv_msg_cmp_params(rv_inx, "GIDs do not match",
+					*(u64 *)&param->ah.grh.dgid[8],
+					*(u64 *)&conn->ah.grh.dgid[8]);
+		return 1;
+
+	} else {
+		if (param->ah.dlid == conn->ah.dlid) {
+			trace_rv_msg_cmp_params(rv_inx, "Found matching dlid",
+						(u64)param->ah.dlid,
+						(u64)conn->ah.dlid);
+			return 0;
+		}
+		trace_rv_msg_cmp_params(rv_inx, "DLID not matching",
+					(u64)param->ah.dlid,
+					(u64)conn->ah.dlid);
+		return 1;
+	}
+}
+
+/*
+ * Search the list for the GID or DLID in the AH
+ * Caller must hold rv_user.mutex
+ */
+static struct rv_conn *
+user_conn_exist(struct rv_user *rv, struct rv_conn_create_params_in *param)
+{
+	struct rv_conn *conn;
+
+	XA_STATE(xas, &rv->conn_xa, 0);
+
+	xas_for_each(&xas, conn, UINT_MAX) {
+		trace_rv_msg_conn_exist(rv->inx, "Conn found in list",
+					(u64)conn, 0);
+		if (!rv_conn_cmp_params(rv->inx, conn, param))
+			return conn;
+	}
+
+	return NULL;
+}
+
+struct rv_cq_event_work_item {
+	struct work_struct cq_event_work;
+	struct rv_sconn *sconn;
+	struct ib_event event;
+};
+
+/*
+ * CQ Async event callback worker
+ * Must make sure the CQs are still relevant as they could have changed
+ */
+static void rv_cq_event_work(struct work_struct *work)
+{
+	struct rv_cq_event_work_item *item = container_of(work,
+				struct rv_cq_event_work_item, cq_event_work);
+	struct rv_sconn *sconn = item->sconn;
+
+	trace_rv_sconn_cq_event(sconn, sconn->index,
+				sconn->qp ? sconn->qp->qp_num : 0,
+				sconn->parent, sconn->flags,
+				(u32)sconn->state, sconn->cm_id,
+				sconn->resolver_retry_left);
+
+	mutex_lock(&sconn->mutex);
+	if (sconn->send_cq != item->event.element.cq &&
+	    sconn->recv_cq != item->event.element.cq)
+		goto unlock;
+
+	switch (item->event.event) {
+	case IB_EVENT_CQ_ERR:
+		if (!ib_send_cm_dreq(sconn->cm_id, NULL, 0)) {
+			sconn->stats.dreq_sent++;
+			trace_rv_msg_cq_event(sconn, sconn->index,
+					      "Sending DREQ", 0, (u64)sconn);
+		}
+		rv_sconn_set_state(sconn, RV_ERROR, "CQ error");
+		break;
+	default:
+		break;
+	}
+unlock:
+	mutex_unlock(&sconn->mutex);
+	rv_conn_put(sconn->parent);
+	kfree(item);
+}
+
+/*
+ * CQ Async Event
+ * Non-preemptible, so real work in WQ
+ */
+static void rv_cq_event(struct ib_event *event, void *context)
+{
+	struct rv_sconn *sconn = (struct rv_sconn *)context;
+	struct rv_cq_event_work_item *item;
+	char *cq_text;
+
+	if (!sconn || !sconn->parent)
+		return;
+	if (rv_conn_get_check(sconn->parent))
+		return;
+	if (sconn->send_cq == event->element.cq)
+		cq_text = "Send";
+	else if (sconn->recv_cq == event->element.cq)
+		cq_text = "Recv";
+	else
+		cq_text = "Unkn";
+
+	rv_conn_err(sconn,
+		    "%s CQ Event received: %s: sconn index %u qp %u\n",
+		    cq_text, ib_event_msg(event->event), sconn->index,
+		    sconn->qp ? sconn->qp->qp_num : 0);
+
+	item = kzalloc(sizeof(*item), GFP_ATOMIC);
+	if (!item)
+		goto fail;
+	INIT_WORK(&item->cq_event_work, rv_cq_event_work);
+	item->sconn = sconn;
+	item->event = *event;
+	rv_queue_work(&item->cq_event_work);
+	return;
+
+fail:
+	rv_conn_err(sconn,
+		    "No mem for %s CQ Evt: %s: sconn index %u qp %u conn %p\n",
+		    cq_text, ib_event_msg(event->event), sconn->index,
+		    sconn->qp ? sconn->qp->qp_num : 0, sconn->parent);
+	rv_conn_put(sconn->parent);
+}
+
+struct rv_qp_event_work_item {
+	struct work_struct qp_event_work;
+	struct rv_sconn *sconn;
+	struct ib_event event;
+};
+
+/*
+ * QP Async event callback worker
+ * Must make sure the QP is still relevant as it could have changed
+ * Unfortunately LID_CHANGE, PORT_ERR, PORT_ACTIVE and GID_CHANGE are only
+ * reported at the device level, but a QP event is likely to follow soon after
+ */
+static void rv_qp_event_work(struct work_struct *work)
+{
+	struct rv_qp_event_work_item *item = container_of(work,
+				struct rv_qp_event_work_item, qp_event_work);
+	struct rv_sconn *sconn = item->sconn;
+
+	trace_rv_sconn_qp_event(sconn, sconn->index,
+				sconn->qp ? sconn->qp->qp_num : 0,
+				sconn->parent, sconn->flags,
+				(u32)sconn->state, sconn->cm_id,
+				sconn->resolver_retry_left);
+
+	mutex_lock(&sconn->mutex);
+	if (sconn->qp != item->event.element.qp)
+		goto unlock;
+
+	switch (item->event.event) {
+	case IB_EVENT_PATH_MIG:
+		if (sconn->state == RV_CONNECTED)
+			ib_cm_notify(sconn->cm_id, item->event.event);
+		break;
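+	/* COMM_EST: data arrived before the RTU; let the CM complete establishment */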
+	case IB_EVENT_COMM_EST:
+		if (sconn->state == RV_CONNECTING)
+			ib_cm_notify(sconn->cm_id, item->event.event);
+		break;
+	case IB_EVENT_QP_FATAL:
+	case IB_EVENT_QP_REQ_ERR:
+	case IB_EVENT_QP_ACCESS_ERR:
+		if (!ib_send_cm_dreq(sconn->cm_id, NULL, 0)) {
+			sconn->stats.dreq_sent++;
+			trace_rv_msg_qp_event(sconn, sconn->index,
+					      "Sending DREQ", 0, (u64)sconn);
+		}
+		if (sconn->state != RV_DISCONNECTING) {
+			if ((sconn->state == RV_CONNECTED ||
+			     sconn->state == RV_CONNECTING) &&
+			    rv_sconn_can_reconn(sconn))
+				rv_sconn_enter_disconnecting(sconn, "QP error");
+			else
+				rv_sconn_set_state(sconn, RV_ERROR, "QP error");
+		}
+		break;
+	default:
+		break;
+	}
+unlock:
+	mutex_unlock(&sconn->mutex);
+	rv_conn_put(sconn->parent);
+	kfree(item);
+}
+
+/*
+ * QP Async Event
+ * Non-preemptible, so real work in WQ
+ */
+static void rv_qp_event(struct ib_event *event, void *context)
+{
+	struct rv_sconn *sconn = (struct rv_sconn *)context;
+	struct rv_qp_event_work_item *item;
+
+	if (!sconn || !sconn->parent)
+		return;
+	if (rv_conn_get_check(sconn->parent))
+		return;
+
+	rv_conn_err(sconn,
+		    "QP Event received: %s: sconn index %u qp %u\n",
+		    ib_event_msg(event->event), sconn->index,
+		    event->element.qp->qp_num);
+
+	item = kzalloc(sizeof(*item), GFP_ATOMIC);
+	if (!item)
+		goto fail;
+	INIT_WORK(&item->qp_event_work, rv_qp_event_work);
+	item->sconn = sconn;
+	item->event = *event;
+	rv_queue_work(&item->qp_event_work);
+	return;
+
+fail:
+	rv_conn_err(sconn,
+		    "No mem for QP Event: %s: sconn index %u qp %u conn %p\n",
+		    ib_event_msg(event->event), sconn->index,
+		    event->element.qp->qp_num, sconn->parent);
+	rv_conn_put(sconn->parent);
+}
+
+/*
+ * shared rv_conn QP create and re-create
+ * Allocate 2 extra WQEs and CQEs in each direction to leave room for the
+ * error recovery drain and the drain in release (in the rare case of a
+ * release during error recovery both may be needed), plus one for the
+ * heartbeat.
+ * The mlx5 driver requires recv_sge > 0, even though we expect no data.
+ * Returns:
+ *	0 on success
+ *	-ENOSPC - QP from device too small
+ *		(note: can't be -ENXIO since that means device removed)
+ *	error from ib_create_qp
+ */
+static int rv_create_qp(int rv_inx, struct rv_sconn *sconn,
+			struct rv_job_dev *jdev)
+{
+	int ret = 0;
+	struct ib_qp_init_attr qp_attr;
+	int alloced_s_cq = 0;
+	int alloced_r_cq = 0;
+	u32 qp_depth;
+
+	qp_depth = jdev->qp_depth + 3;
+
+	if (!sconn->send_cq) {
+		sconn->send_cq = ib_alloc_cq(jdev->dev->ib_dev, sconn,
+					     qp_depth, 0, IB_POLL_SOFTIRQ);
+		if (IS_ERR(sconn->send_cq)) {
+			ret = PTR_ERR(sconn->send_cq);
+			rv_err(rv_inx, "Creating send cq failed %d\n", ret);
+			goto bail_out;
+		}
+		sconn->send_cq->event_handler = rv_cq_event;
+		alloced_s_cq = 1;
+	}
+
+	if (!sconn->recv_cq) {
+		sconn->recv_cq = ib_alloc_cq(jdev->dev->ib_dev, sconn,
+					     qp_depth, 0, IB_POLL_SOFTIRQ);
+		if (IS_ERR(sconn->recv_cq)) {
+			ret = PTR_ERR(sconn->recv_cq);
+			rv_err(rv_inx, "Creating recv cq failed %d\n", ret);
+			goto bail_s_cq;
+		}
+		sconn->recv_cq->event_handler = rv_cq_event;
+		alloced_r_cq = 1;
+	}
+
+	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.event_handler = rv_qp_event;
+	qp_attr.qp_context = sconn;
+	qp_attr.cap.max_send_wr = qp_depth;
+	qp_attr.cap.max_recv_wr = qp_depth;
+	qp_attr.cap.max_recv_sge = 1;
+	qp_attr.cap.max_send_sge = 1;
+	qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+	qp_attr.qp_type = IB_QPT_RC;
+	qp_attr.send_cq = sconn->send_cq;
+	qp_attr.recv_cq = sconn->recv_cq;
+
+	sconn->qp = ib_create_qp(jdev->pd, &qp_attr);
+	if (IS_ERR(sconn->qp)) {
+		ret = PTR_ERR(sconn->qp);
+		sconn->qp = NULL;
+		goto bail_r_cq;
+	}
+	if (qp_attr.cap.max_recv_wr < qp_depth ||
+	    qp_attr.cap.max_send_wr < qp_depth) {
+		ret = -ENOSPC;
+		goto bail_qp;
+	}
+
+	return 0;
+
+bail_qp:
+	ib_destroy_qp(sconn->qp);
+	sconn->qp = NULL;
+bail_r_cq:
+	if (alloced_r_cq) {
+		ib_free_cq(sconn->recv_cq);
+		sconn->recv_cq = NULL;
+	}
+bail_s_cq:
+	if (alloced_s_cq) {
+		ib_free_cq(sconn->send_cq);
+		sconn->send_cq = NULL;
+	}
+bail_out:
+
+	return ret;
+}
+
+static int rv_query_qp_state(struct ib_qp *qp)
+{
+	struct ib_qp_attr attr;
+	struct ib_qp_init_attr qp_init_attr;
+	int ret;
+
+	ret = ib_query_qp(qp, &attr, IB_QP_STATE, &qp_init_attr);
+	if (ret) {
+		rv_err(RV_INVALID, "failed to query qp %u: %d\n",
+		       qp->qp_num, ret);
+		return ret;
+	}
+	trace_rv_msg_err_qp(RV_INVALID, "qp_state", (u64)qp->qp_num,
+			    (u64)attr.qp_state);
+
+	return attr.qp_state;
+}
+
+/*
+ * If the QP is not in the RESET state, move it to the ERROR state.
+ *
+ * Return:  0 - QP is in (or was moved to) the ERROR state
+ *          1 - QP in RESET, nothing to do
+ *         <0 - failure
+ */
+static int rv_err_qp(struct ib_qp *qp)
+{
+	struct ib_qp_attr attr;
+	int ret;
+
+	ret = rv_query_qp_state(qp);
+	if (ret < 0)
+		return ret;
+
+	if (ret == IB_QPS_RESET)
+		return 1;
+
+	if (ret == IB_QPS_ERR)
+		return 0;
+
+	attr.qp_state = IB_QPS_ERR;
+	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
+
+	return ret;
+}
+
+struct rv_dest_qp_work_item {
+	struct work_struct destroy_work;
+	struct ib_qp *qp;
+	struct ib_cq *send_cq;
+	struct ib_cq *recv_cq;
+};
+
+/* only used if QP needs to be drained */
+static void rv_handle_destroy_qp(struct work_struct *work)
+{
+	struct rv_dest_qp_work_item *item = container_of(work,
+				struct rv_dest_qp_work_item, destroy_work);
+
+	trace_rv_msg_destroy_qp(NULL, RV_INVALID, "destroy qp",
+				item->qp ? (u64)item->qp->qp_num : 0, 0);
+	if (item->qp) {
+		ib_drain_qp(item->qp);
+		ib_destroy_qp(item->qp);
+	}
+	if (item->recv_cq)
+		ib_free_cq(item->recv_cq);
+
+	if (item->send_cq)
+		ib_free_cq(item->send_cq);
+	kfree(item);
+}
+
+/*
+ * destroy QP and CQs, cannot hold sconn->mutex
+ * Drain the qp before destroying it to avoid the race between QP destroy
+ * and completion handler. Timeout protects against CQ issues.
+ */
+static void rv_destroy_qp(struct rv_sconn *sconn)
+{
+	int qps = -1;
+	struct rv_dest_qp_work_item *item;
+
+	if (sconn->qp)
+		qps = rv_query_qp_state(sconn->qp);
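+	/* a QP that left RESET may need draining; defer that to the WQ */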
+	if (qps >= 0 && qps != IB_QPS_RESET) {
+		item = kzalloc(sizeof(*item), GFP_KERNEL);
+		if (item) {
+			trace_rv_msg_destroy_qp(sconn, sconn->index,
+						"queue destroy qp",
+						(u64)sconn->qp->qp_num,
+						(u64)sconn);
+			INIT_WORK(&item->destroy_work, rv_handle_destroy_qp);
+			item->qp = sconn->qp;
+			item->recv_cq = sconn->recv_cq;
+			item->send_cq = sconn->send_cq;
+			sconn->qp = NULL;
+			sconn->recv_cq = NULL;
+			sconn->send_cq = NULL;
+
+			rv_queue_work2(&item->destroy_work);
+			return;
+		}
+	}
+	trace_rv_msg_destroy_qp(sconn, sconn->index, "destroy qp",
+				sconn->qp ? (u64)sconn->qp->qp_num : 0,
+				(u64)sconn);
+	if (sconn->qp) {
+		if (qps >= 0 && qps != IB_QPS_RESET)
+			ib_drain_qp(sconn->qp);
+		ib_destroy_qp(sconn->qp);
+		sconn->qp = NULL;
+	}
+
+	if (sconn->recv_cq) {
+		ib_free_cq(sconn->recv_cq);
+		sconn->recv_cq = NULL;
+	}
+
+	if (sconn->send_cq) {
+		ib_free_cq(sconn->send_cq);
+		sconn->send_cq = NULL;
+	}
+}
+
+/*
+ * only for use by rv_conn_alloc
+ * others must use rv_conn_get_alloc or rv_conn_get
+ * must be called with jdev->conn_list_mutex held
+ * We create the QP now to confirm it can be created before going any
+ * further; otherwise it is not really needed until the REQ handler on the
+ * server or connect on the client.
+ */
+static int rv_sconn_init(struct rv_user *rv, struct rv_sconn *sconn,
+			 struct rv_conn_create_params_in *param,
+			 struct rv_conn *parent, u8 index)
+{
+	struct rv_job_dev *jdev = rv->jdev;
+	int ret;
+
+	sconn->index = index;
+	sconn->parent = parent;
+
+	mutex_init(&sconn->mutex);
+
+	spin_lock_init(&sconn->drain_lock);
+
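+	/* timers fire in soft IRQ context; real work is deferred to the WQ */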
+	INIT_WORK(&sconn->drain_work, rv_sconn_drain_work);
+	timer_setup(&sconn->drain_timer, rv_sconn_drain_timeout_func, 0);
+
+	timer_setup(&sconn->conn_timer, rv_sconn_timeout_func, 0);
+	INIT_WORK(&sconn->timer_work, rv_sconn_timeout_work);
+
+	timer_setup(&sconn->delay_timer, rv_sconn_delay_func, 0);
+	INIT_WORK(&sconn->delay_work, rv_sconn_delay_work);
+
+	timer_setup(&sconn->hb_timer, rv_sconn_hb_func, 0);
+	INIT_WORK(&sconn->hb_work, rv_sconn_hb_work);
+
+	sconn->cqe.done = rv_recv_done;
+	sconn->rdrain_cqe.done = rv_rq_drain_done;
+	sconn->sdrain_cqe.done = rv_sq_drain_done;
+	sconn->hb_cqe.done = rv_hb_done;
+
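+	/* the side with the lower address acts as the CM listener (server) */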
+	if (jdev->loc_addr < param->rem_addr)
+		set_bit(RV_SCONN_SERVER, &sconn->flags);
+
+	ret = rv_create_qp(rv->inx, sconn, jdev);
+	if (ret) {
+		rv_err(rv->inx, "Failed to create qp: %d\n", ret);
+		goto bail;
+	}
+
+	if (test_bit(RV_SCONN_SERVER, &sconn->flags)) {
+		if (!jdev->listener) {
+			jdev->listener = rv_listener_get_alloc(jdev->dev,
+							       jdev->service_id,
+							  rv_cm_server_handler);
+			if (!jdev->listener) {
+				rv_err(rv->inx,
+				       "Failed to get/allocate listener\n");
+				goto bail_qp;
+			}
+		}
+		sconn->state = RV_WAITING;
+		sconn->start_time = ktime_get();
+	} else {
+		sconn->state = RV_INIT;
+	}
+	atomic_set(&sconn->stats.outstand_send_write, 0);
+	atomic64_set(&sconn->stats.send_write_cqe, 0);
+	atomic64_set(&sconn->stats.send_write_cqe_fail, 0);
+	atomic64_set(&sconn->stats.recv_write_cqe, 0);
+	atomic64_set(&sconn->stats.recv_write_bytes, 0);
+	atomic64_set(&sconn->stats.recv_cqe_fail, 0);
+	atomic64_set(&sconn->stats.send_hb_cqe, 0);
+	atomic64_set(&sconn->stats.send_hb_cqe_fail, 0);
+	atomic64_set(&sconn->stats.recv_hb_cqe, 0);
+	trace_rv_sconn_init(sconn, sconn->index, sconn->qp->qp_num,
+			    sconn->parent, sconn->flags, (u32)sconn->state,
+			    sconn->cm_id, sconn->resolver_retry_left);
+	return 0;
+
+bail_qp:
+	rv_destroy_qp(sconn);
+	ret = -ENOMEM;
+bail:
+	return ret;
+}
+
+static void rv_handle_conn_put(struct work_struct *work)
+{
+	struct rv_conn *conn = container_of(work, struct rv_conn, put_work);
+
+	rv_conn_release(conn);
+}
+
+/*
+ * only for use by rv_conn_get_alloc
+ * others must use rv_conn_get_alloc or rv_conn_get
+ * must be called with jdev->conn_list_mutex held
+ */
+static struct rv_conn *rv_conn_alloc(struct rv_user *rv,
+				     struct rv_conn_create_params_in *param)
+{
+	struct rv_conn *conn;
+	struct rv_job_dev *jdev = rv->jdev;
+	int i;
+
+	conn = kzalloc(sizeof(*conn) +
+		       sizeof(conn->sconn_arr[0]) * jdev->num_conn,
+		       GFP_KERNEL);
+	if (!conn)
+		return NULL;
+
+	conn->num_conn = jdev->num_conn;
+	rv_job_dev_get(jdev);
+	conn->jdev = jdev;
+	memcpy(&conn->ah, &param->ah, sizeof(struct ib_uverbs_ah_attr));
+	conn->rem_addr = (u64)param->rem_addr;
+
+	kref_init(&conn->kref);
+	INIT_WORK(&conn->put_work, rv_handle_conn_put);
+	INIT_WORK(&conn->free_work, rv_handle_free_conn);
+
+	spin_lock_init(&conn->next_lock);
+	for (i = 0; i < conn->num_conn; i++) {
+		if (rv_sconn_init(rv, &conn->sconn_arr[i], param, conn, i))
+			goto bail_conn;
+	}
+	trace_rv_conn_alloc(conn, conn->rem_addr, conn->ah.is_global,
+			    conn->ah.dlid,
+			    be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[0])),
+			    be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[8])),
+			    conn->num_conn, conn->next,
+			    conn->jdev, kref_read(&conn->kref));
+
+	return conn;
+
+bail_conn:
+	for (--i; i >= 0; i--)
+		rv_sconn_deinit(&conn->sconn_arr[i]);
+	rv_job_dev_put(jdev);
+	kfree(conn);
+	return NULL;
+}
+
+/*
+ * get a reference to the matching rv_conn.  Allocate an rv_conn if no
+ * match found.  kref_get_unless_zero avoids race w/ release removing from list
+ */
+static struct rv_conn *rv_conn_get_alloc(struct rv_user *rv,
+					 struct rv_conn_create_params_in *param)
+{
+	struct rv_conn *conn;
+	struct rv_job_dev *jdev = rv->jdev;
+
+	mutex_lock(&jdev->conn_list_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(conn, &jdev->conn_list, conn_entry) {
+		if (rv_conn_cmp_params(rv->inx, conn, param))
+			continue;
+		if (!kref_get_unless_zero(&conn->kref))
+			continue;
+		rcu_read_unlock();
+		goto done;
+	}
+	rcu_read_unlock();
+	conn = rv_conn_alloc(rv, param);
+	if (!conn)
+		goto done;
+	list_add_rcu(&conn->conn_entry, &jdev->conn_list);
+done:
+	mutex_unlock(&jdev->conn_list_mutex);
+	return conn;
+}
+
+/* validate conn_create addressing params against the jdev */
+static int rv_jdev_check_create_ah(int rv_inx, const struct rv_job_dev *jdev,
+				   const struct rv_conn_create_params_in *param)
+{
+	if (!param->ah.dlid && !rv_jdev_protocol_roce(jdev)) {
+		rv_err(rv_inx, "create_conn: DLID must be non-zero\n");
+		return -EINVAL;
+	}
+	if (param->ah.is_global &&
+	    jdev->loc_gid_index != param->ah.grh.sgid_index) {
+		rv_err(rv_inx, "create_conn: incorrect sgid_index\n");
+		return -EINVAL;
+	}
+	if (jdev->port_num != param->ah.port_num) {
+		rv_err(rv_inx, "create_conn: incorrect port_num\n");
+		return -EINVAL;
+	}
+	if (jdev->loc_addr == param->rem_addr) {
+		rv_err(rv_inx, "create_conn: loopback not allowed\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * validate conn_create ah against conn->ah
+ * Assumes caller has used rv_jdev_check_create_ah and
+ * assumes conn_get_alloc matched on rem_addr, is_global and (dgid or dlid).
+ * Confirms the rest of ah is consistent
+ */
+static int rv_conn_create_check_ah(int rv_inx, const struct rv_conn *conn,
+				   const struct ib_uverbs_ah_attr *ah)
+{
+	int ret = -EEXIST;
+
+#define RV_CHECK(field) (ah->field != conn->ah.field)
+#define RV_REPORT(field, text, format) \
+		     rv_err(rv_inx, "create_conn: inconsistent " text " " \
+			    format " with other processes " format "\n", \
+			    ah->field, conn->ah.field)
+	if (RV_CHECK(dlid))
+		RV_REPORT(dlid, "DLID", "0x%x");
+	else if (RV_CHECK(src_path_bits))
+		RV_REPORT(src_path_bits, "src_path_bits", "0x%x");
+	else if (RV_CHECK(sl))
+		RV_REPORT(sl, "SL", "%u");
+	else if (conn->ah.is_global && RV_CHECK(grh.traffic_class))
+		RV_REPORT(grh.traffic_class, "traffic_class", "%u");
+	else if (conn->ah.is_global && RV_CHECK(grh.flow_label))
+		RV_REPORT(grh.flow_label, "flow_label", "0x%x");
+	else if (RV_CHECK(static_rate))
+		RV_REPORT(static_rate, "rate", "%u");
+	else if (conn->ah.is_global && RV_CHECK(grh.hop_limit))
+		RV_REPORT(grh.hop_limit, "hop_limit", "%u");
+	else
+		ret = 0;
+#undef RV_CHECK
+#undef RV_REPORT
+
+	return ret;
+}
+
+int doit_conn_create(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_conn_create_params param;
+	struct rv_conn *conn;
+	struct rv_job_dev *jdev;
+	int ret = 0;
+	u32 id;
+
+	if (copy_from_user(&param.in, (void __user *)arg, sizeof(param.in)))
+		return -EFAULT;
+	trace_rv_conn_create_req(/* trace */
+		param.in.rem_addr, param.in.ah.is_global,
+		param.in.ah.grh.sgid_index, param.in.ah.port_num,
+		param.in.ah.dlid,
+		be64_to_cpu(*((__be64 *)&param.in.ah.grh.dgid[0])),
+		be64_to_cpu(*((__be64 *)&param.in.ah.grh.dgid[8])));
+
+	mutex_lock(&rv->mutex);
+	if (!rv->attached) {
+		ret = rv->was_attached ? -ENXIO : -EINVAL;
+		goto bail_unlock;
+	}
+	if (rv->rdma_mode != RV_RDMA_MODE_KERNEL) {
+		ret = -EINVAL;
+		goto bail_unlock;
+	}
+
+	jdev = rv->jdev;
+	trace_rv_jdev_conn_create(jdev, jdev->dev_name, jdev->num_conn,
+				  jdev->index_bits, jdev->loc_gid_index,
+				  jdev->loc_addr, jdev->job_key_len,
+				  jdev->job_key, jdev->service_id,
+				  jdev->q_depth, jdev->user_array_next,
+				  kref_read(&jdev->kref));
+	ret = rv_jdev_check_create_ah(rv->inx, jdev, &param.in);
+	if (ret)
+		goto bail_unlock;
+
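+	/* an rv_user may only create one handle per remote address */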
+	conn = user_conn_exist(rv, &param.in);
+	if (conn) {
+		trace_rv_msg_conn_create(rv->inx, "User_conn exists",
+					 (u64)conn, 0);
+		ret = -EBUSY;
+		goto bail_unlock;
+	}
+
+	conn = rv_conn_get_alloc(rv, &param.in);
+	if (!conn) {
+		rv_err(rv->inx, "Failed to get/allocate conn\n");
+		ret = -ENOMEM;
+		goto bail_unlock;
+	}
+	trace_rv_conn_create(conn, conn->rem_addr, conn->ah.is_global,
+			     conn->ah.dlid,
+			     be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[0])),
+			     be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[8])),
+			     conn->num_conn, conn->next,
+			     conn->jdev, kref_read(&conn->kref));
+	ret = rv_conn_create_check_ah(rv->inx, conn, &param.in.ah);
+	if (ret)
+		goto bail_put;
+
+	ret = xa_alloc(&rv->conn_xa, &id, conn, XA_LIMIT(1, UINT_MAX),
+		       GFP_KERNEL);
+	if (ret)
+		goto bail_put;
+
+	param.out.handle = (u64)id;
+	param.out.conn_handle = (u64)conn;
+	ret = copy_to_user((void __user *)arg, &param.out, sizeof(param.out));
+	if (ret) {
+		ret = -EFAULT;
+		goto bail_xa;
+	}
+
+	trace_rv_msg_uconn_create(rv->inx, "rv_user create uconn", (u64)conn,
+				  0);
+	mutex_unlock(&rv->mutex);
+
+	return 0;
+
+bail_xa:
+	xa_erase(&rv->conn_xa, id);
+bail_put:
+	rv_conn_put(conn);
+bail_unlock:
+	mutex_unlock(&rv->mutex);
+	return ret;
+}
+
+/*
+ * Address Resolver callback.
+ * There is a slight chance the device bounced and changed mode from RoCE
+ * to IB or iWARP, but in that case the GIDs we have are wrong anyway.
+ * So just let the resolver struggle and hit its retry limit instead of
+ * trying to redo rdma_protocol_roce(), etc.
+ * PSM will fail and close in this case anyway.
+ */
+static void rv_resolve_ip_cb(int status, struct sockaddr *src_addr,
+			     struct rdma_dev_addr *addr, void *context)
+{
+	struct rv_sconn *sconn = (struct rv_sconn *)context;
+
+	if (!sconn || !sconn->parent)
+		return;
+	if (rv_conn_get_check(sconn->parent))
+		return;
+
+	mutex_lock(&sconn->mutex);
+	trace_rv_sconn_resolve_cb(sconn, sconn->index,
+				  sconn->qp ? sconn->qp->qp_num : 0,
+				  sconn->parent, sconn->flags,
+				  (u32)sconn->state, sconn->cm_id,
+				  sconn->resolver_retry_left);
+	if (sconn->state != RV_RESOLVING)
+		goto unlock;
+
+	if (status) {
+		rv_conn_err(sconn, "failed to resolve_ip status %d\n", status);
+		goto retry;
+	}
+	if (addr != &sconn->dev_addr) {
+		rv_conn_err(sconn, "wrong dev_addr in callback\n");
+		goto fail;
+	}
+	if (addr->sgid_attr != sconn->parent->jdev->sgid_attr) {
+		rv_conn_err(sconn, "wrong sgid_attr in callback\n");
+		goto fail;
+	}
+	sconn->primary_path->roce.route_resolved = true;
+	sa_path_set_dmac(sconn->primary_path, addr->dst_dev_addr);
+	sconn->primary_path->hop_limit = addr->hoplimit;
+
+	rv_send_req(sconn);
+unlock:
+	mutex_unlock(&sconn->mutex);
+	rv_conn_put(sconn->parent);
+	return;
+
+retry:
+	if (sconn->resolver_retry_left) {
+		sconn->resolver_retry_left--;
+		if (!rv_resolve_ip(sconn)) {
+			mutex_unlock(&sconn->mutex);
+			rv_conn_put(sconn->parent);
+			return;
+		}
+	}
+	if (rv_sconn_can_reconn(sconn))
+		rv_sconn_set_state(sconn, RV_DELAY, "");
+fail:
+	if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+		sconn->stats.reresolve_fail++;
+	else
+		sconn->stats.resolve_fail++;
+	rv_sconn_free_primary_path(sconn);
+	if (sconn->state != RV_DELAY)
+		rv_sconn_set_state(sconn, RV_ERROR,
+				   "unable to resolve address");
+	mutex_unlock(&sconn->mutex);
+	rv_conn_put(sconn->parent);
+}
+
+/*
+ * Algorithm based on roce_resolve_route_from_path
+ * Caller must hold a rv_conn reference. This func does not release that ref
+ * Caller holds mutex and has validated sconn->state, caller will release mutex
+ */
+static int rv_resolve_ip(struct rv_sconn *sconn)
+{
+	union {
+		struct sockaddr     _sockaddr;
+		struct sockaddr_in  _sockaddr_in;
+		struct sockaddr_in6 _sockaddr_in6;
+	} src_addr, dst_addr;
+
+	if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+		sconn->stats.reresolve++;
+	else
+		sconn->stats.resolve++;
+	rdma_gid2ip(&src_addr._sockaddr, &sconn->primary_path->sgid);
+	rdma_gid2ip(&dst_addr._sockaddr, &sconn->primary_path->dgid);
+
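+	/* rdma_gid2ip yields AF_INET for IPv4-mapped GIDs, AF_INET6 otherwise */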
+	if (src_addr._sockaddr.sa_family != dst_addr._sockaddr.sa_family)
+		return -EINVAL;
+
+	memset(&sconn->dev_addr, 0, sizeof(sconn->dev_addr));
+	sconn->dev_addr.net = &init_net; /* mandatory, but will not be used */
+	sconn->dev_addr.sgid_attr = sconn->parent->jdev->sgid_attr;
+
+	return rdma_resolve_ip(&src_addr._sockaddr, &dst_addr._sockaddr,
+				&sconn->dev_addr, RV_RESOLVER_TIMEOUT,
+				rv_resolve_ip_cb, true, sconn);
+}
+
+/*
+ * Gets the connection establishment ball rolling.
+ * After this everything proceeds via callbacks or timeouts.
+ * Caller must hold a rv_conn reference. This func does not release that ref.
+ * Caller holds mutex and has validated sconn->state, caller will release mutex
+ * For IB/OPA, no need to resolve IP to dmac, so move to next step.
+ */
+static void rv_resolve_path(struct rv_sconn *sconn)
+{
+	int ret;
+	struct rv_job_dev *jdev = sconn->parent->jdev;
+
+	rv_sconn_set_state(sconn, RV_RESOLVING, "");
+
+	sconn->resolver_retry_left = RV_RESOLVER_RETRY;
+
+	trace_rv_sconn_resolve(sconn, sconn->index, sconn->qp->qp_num,
+			       sconn->parent, sconn->flags, (u32)sconn->state,
+			       sconn->cm_id, sconn->resolver_retry_left);
+	sconn->primary_path = kzalloc(sizeof(*sconn->primary_path), GFP_KERNEL);
+	if (!sconn->primary_path)
+		goto err;
+
+	/* this sets record type to IB or OPA, fix up below for RoCE */
+	ib_copy_path_rec_from_user(sconn->primary_path, &sconn->path);
+	sconn->primary_path->service_id = cpu_to_be64(jdev->service_id);
+
+	if (rv_jdev_protocol_roce(jdev)) {
+		sconn->primary_path->rec_type =
+			sa_conv_gid_to_pathrec_type(jdev->sgid_attr->gid_type);
+		if (unlikely(!sa_path_is_roce(sconn->primary_path)))
+			goto err;
+		ret = rv_resolve_ip(sconn);
+		if (ret)
+			goto err;
+		return;
+	}
+	rv_send_req(sconn);
+	return;
+
+err:
+	if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+		sconn->stats.reresolve_fail++;
+	else
+		sconn->stats.resolve_fail++;
+	rv_sconn_free_primary_path(sconn);
+	rv_sconn_set_state(sconn, RV_ERROR, "local error resolving address");
+}
+
+/* caller must hold a rv_conn reference. This func does not release that ref */
+static void rv_send_req(struct rv_sconn *sconn)
+{
+	struct rv_job_dev *jdev = sconn->parent->jdev;
+	struct ib_cm_req_param req;
+	struct rv_req_priv_data priv_data = {
+			.magic = RV_PRIVATE_DATA_MAGIC,
+			.ver = RV_PRIVATE_DATA_VER,
+		};
+	int ret;
+
+	memset(&req, 0, sizeof(req));
+	req.ppath_sgid_attr = jdev->sgid_attr;
+	req.flow_control = 1;
+	req.retry_count = 7;
+	req.responder_resources = 0;
+	req.rnr_retry_count = 7;
+	req.max_cm_retries = 15;
+	req.primary_path = sconn->primary_path;
+	req.service_id = req.primary_path->service_id;
+	req.initiator_depth = 0;
+	req.remote_cm_response_timeout = 17;
+	req.local_cm_response_timeout = 17;
+	req.qp_num = sconn->qp->qp_num;
+	req.qp_type = sconn->qp->qp_type;
+	req.srq = !!(sconn->qp->srq);
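+	/* starting PSN is a random 24-bit value */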
+	req.starting_psn = prandom_u32() & 0xffffff;
+
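+	/* private data (uid, job_key, index) lets the server match this REQ */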
+	req.private_data = &priv_data;
+	req.private_data_len = sizeof(priv_data);
+	priv_data.index = sconn->index;
+	priv_data.job_key_len = jdev->job_key_len;
+	memcpy(priv_data.job_key, jdev->job_key, sizeof(priv_data.job_key));
+	priv_data.uid = jdev->uid;
+	trace_rv_msg_send_req(sconn, sconn->index,
+			      "sending rec_type | route_resolved, dmac",
+			      (u64)(req.primary_path->rec_type |
+			      (((u64)req.primary_path->roce.route_resolved) <<
+			       31)),
+			      (u64)(req.primary_path->roce.dmac[0] |
+			      ((u64)req.primary_path->roce.dmac[1]) << 8 |
+			      ((u64)req.primary_path->roce.dmac[2]) << 16 |
+			      ((u64)req.primary_path->roce.dmac[3]) << 24 |
+			      ((u64)req.primary_path->roce.dmac[4]) << 32 |
+			      ((u64)req.primary_path->roce.dmac[5]) << 40));
+
+	ret = ib_send_cm_req(sconn->cm_id, &req);
+	rv_sconn_free_primary_path(sconn);
+	if (!ret) {
+		sconn->stats.req_sent++;
+		trace_rv_msg_send_req(sconn, sconn->index, "Sending REQ", 0,
+				      (u64)sconn);
+		rv_sconn_set_state(sconn, RV_CONNECTING, "");
+	} else {
+		rv_conn_err(sconn, "Failed to send cm req. %d\n", ret);
+		rv_sconn_set_state(sconn, RV_ERROR, "local error sending REQ");
+	}
+}
+
+/*
+ * called on work queue with rv_conn reference held on our behalf
+ * if in CONNECTING:
+ *	the IB CM listener could have a REP outstanding, REJ cancels it,
+ *	or we could have sent or gotten an RTU and raced with the CM callback.
+ *	Tell IB CM to send REJ and DREQ; it will sort things out for us.
+ * if already in CONNECTED do nothing, we got in just under the time limit.
+ */
+static void rv_sconn_timeout_work(struct work_struct *work)
+{
+	struct rv_sconn *sconn = container_of(work, struct rv_sconn,
+					      timer_work);
+
+	mutex_lock(&sconn->mutex);
+	trace_rv_sconn_timeout_work(sconn, sconn->index,
+				    sconn->qp ? sconn->qp->qp_num : 0,
+				    sconn->parent, sconn->flags,
+				    (u32)sconn->state, sconn->cm_id,
+				    sconn->resolver_retry_left);
+	switch (sconn->state) {
+	case RV_RESOLVING:
+		rv_sconn_free_primary_path(sconn);
+		rdma_addr_cancel(&sconn->dev_addr);
+		rv_sconn_set_state(sconn, RV_ERROR, "connection timeout");
+		break;
+	case RV_CONNECTING:
+		if (!ib_send_cm_rej(sconn->cm_id, IB_CM_REJ_TIMEOUT, NULL, 0,
+				    NULL, 0)) {
+			sconn->stats.rej_sent++;
+			trace_rv_msg_sconn_timeout_work(sconn, sconn->index,
+							"Sending REJ reason",
+							(u64)IB_CM_REJ_TIMEOUT,
+							(u64)sconn);
+		}
+		if (!ib_send_cm_dreq(sconn->cm_id, NULL, 0)) {
+			sconn->stats.dreq_sent++;
+			trace_rv_msg_sconn_timeout_work(sconn, sconn->index,
+							"Sending DREQ", 0,
+							(u64)sconn);
+		}
+		fallthrough;
+	case RV_WAITING:
+	case RV_DISCONNECTING:
+	case RV_DELAY:
+		rv_sconn_set_state(sconn, RV_ERROR, "connection timeout");
+		break;
+	case RV_CONNECTED:
+		break;
+	case RV_INIT:
+	case RV_ERROR:
+	default:
+		break;
+	}
+	mutex_unlock(&sconn->mutex);
+
+	rv_conn_put(sconn->parent);
+}
+
+/* called in soft IRQ context, so the real work is done in the WQ */
+static void rv_sconn_timeout_func(struct timer_list *timer)
+{
+	struct rv_sconn *sconn = container_of(timer, struct rv_sconn,
+					      conn_timer);
+
+	if (!sconn->parent)
+		return;
+	if (rv_conn_get_check(sconn->parent))
+		return;
+	rv_queue_work(&sconn->timer_work);
+}
+
+/* called on work queue with rv_conn reference held on our behalf */
+static void rv_sconn_delay_work(struct work_struct *work)
+{
+	struct rv_sconn *sconn = container_of(work, struct rv_sconn,
+					      delay_work);
+
+	mutex_lock(&sconn->mutex);
+	trace_rv_sconn_delay_work(sconn, sconn->index,
+				  sconn->qp ? sconn->qp->qp_num : 0,
+				  sconn->parent, sconn->flags,
+				  (u32)sconn->state, sconn->cm_id,
+				  sconn->resolver_retry_left);
+	if (sconn->state == RV_DELAY)
+		rv_resolve_path(sconn);
+	mutex_unlock(&sconn->mutex);
+
+	rv_conn_put(sconn->parent);
+}
+
+/* called in soft IRQ context, so the real work is done in the WQ */
+static void rv_sconn_delay_func(struct timer_list *timer)
+{
+	struct rv_sconn *sconn = container_of(timer, struct rv_sconn,
+					      delay_timer);
+
+	if (!sconn->parent)
+		return;
+	if (rv_conn_get_check(sconn->parent))
+		return;
+	rv_queue_work(&sconn->delay_work);
+}
+
+/*
+ * validate the conn_connect path against sconn->path
+ */
+static int rv_sconn_connect_check_path(int rv_inx, const struct rv_sconn *sconn,
+				       const struct ib_user_path_rec *path)
+{
+	char buf1[RV_MAX_ADDR_STR];
+	char buf2[RV_MAX_ADDR_STR];
+	int ret = -EEXIST;
+
+#define RV_CHECK(field) (path->field != sconn->path.field)
+#define RV_REPORT(field, text, format) \
+		     rv_err(rv_inx, "connect: inconsistent " text " " format \
+			    " with other processes " format "\n", \
+			    path->field, sconn->path.field)
+
+	if (RV_CHECK(dlid))
+		RV_REPORT(dlid, "DLID", "0x%x");
+	else if (cmp_gid(path->dgid, sconn->path.dgid))
+		rv_err(rv_inx,
+		       "connect: inconsistent dest %s with other proc %s\n",
+		       show_gid(buf1, sizeof(buf1), path->dgid),
+		       show_gid(buf2, sizeof(buf2), sconn->path.dgid));
+	else if (RV_CHECK(slid))
+		RV_REPORT(slid, "SLID", "0x%x");
+	else if (cmp_gid(path->sgid, sconn->path.sgid))
+		rv_err(rv_inx,
+		       "connect: inconsistent src %s with other processes %s\n",
+		       show_gid(buf1, sizeof(buf1), path->sgid),
+		       show_gid(buf2, sizeof(buf2), sconn->path.sgid));
+	else if (RV_CHECK(pkey))
+		RV_REPORT(pkey, "pkey", "0x%x");
+	else if (RV_CHECK(mtu))
+		RV_REPORT(mtu, "mtu", "%u");
+	else if (RV_CHECK(sl))
+		RV_REPORT(sl, "SL", "%u");
+	else if (RV_CHECK(traffic_class))
+		RV_REPORT(traffic_class, "traffic_class", "%u");
+	else if (RV_CHECK(flow_label))
+		RV_REPORT(flow_label, "flow_label", "0x%x");
+	else if (RV_CHECK(rate))
+		RV_REPORT(rate, "rate", "%u");
+	else if (RV_CHECK(hop_limit))
+		RV_REPORT(hop_limit, "hop_limit", "%u");
+	else if (RV_CHECK(packet_life_time))
+		RV_REPORT(packet_life_time, "packet_life_time", "%u");
+#undef RV_CHECK
+#undef RV_REPORT
+	else
+		ret = 0;
+	return ret;
+}
+
+/*
+ * start connection establishment; the client side completes via callbacks.
+ * Caller must hold a rv_conn reference. This func does not release that ref.
+ * A zero sconn->path.dlid identifies the 1st connect call for the given sconn.
+ * On subsequent calls we only need to check the params match the existing path.
+ */
+static int rv_sconn_connect(int rv_inx, struct rv_sconn *sconn,
+			    struct rv_conn_connect_params_in *params)
+{
+	struct rv_job_dev *jdev = sconn->parent->jdev;
+	int ret = 0;
+	struct ib_cm_id *id;
+
+	mutex_lock(&sconn->mutex);
+
+	if (!sconn->path.dlid)
+		sconn->path = params->path;
+
+	if (sconn->state != RV_INIT) {
+		ret = rv_sconn_connect_check_path(rv_inx, sconn,
+						  &params->path);
+		mutex_unlock(&sconn->mutex);
+		return ret;
+	}
+
+	sconn->path = params->path;
+
+	id = ib_create_cm_id(jdev->dev->ib_dev, rv_cm_handler, sconn);
+	if (IS_ERR(id)) {
+		rv_err(rv_inx, "Create CM ID failed\n");
+		rv_sconn_set_state(sconn, RV_ERROR,
+				   "local error preparing client");
+		mutex_unlock(&sconn->mutex);
+		return PTR_ERR(id);
+	}
+	sconn->cm_id = id;
+
+	rv_resolve_path(sconn);
+	mutex_unlock(&sconn->mutex);
+	return ret;
+}
+
+/*
+ * validate rv_user supplied path is consistent with conn->ah from create_conn
+ * sgid already checked against jdev in caller
+ */
+static int rv_conn_connect_check_ah(int rv_inx, const struct rv_conn *conn,
+				    const struct ib_user_path_rec *path)
+{
+	char buf1[RV_MAX_ADDR_STR];
+	char buf2[RV_MAX_ADDR_STR];
+	int ret = -EINVAL;
+
+#define RV_CHECK(f1, f2) (path->f1 != conn->ah.f2)
+#define RV_CHECK_BE32(f1, f2) (be32_to_cpu(path->f1) != conn->ah.f2)
+#define RV_REPORT(f1, f2, text, format) \
+		     rv_err(rv_inx, "connect: inconsistent " text " " \
+			    format " with create_conn " format "\n", \
+			    path->f1, conn->ah.f2)
+	if (be16_to_cpu(path->dlid) != conn->ah.dlid)
+		rv_err(rv_inx,
+		       "connect: inconsistent DLID 0x%x with create_conn 0x%x\n",
+		       be16_to_cpu(path->dlid), conn->ah.dlid);
+	else if (conn->ah.is_global &&
+		 cmp_gid(conn->ah.grh.dgid, path->dgid))
+		rv_err(rv_inx,
+		       "connect: inconsistent dest %s with other proc %s\n",
+		       show_gid(buf1, sizeof(buf1), path->dgid),
+		       show_gid(buf2, sizeof(buf2), conn->ah.grh.dgid));
+	else if (RV_CHECK(sl, sl))
+		RV_REPORT(sl, sl, "SL", "%u");
+	else if (conn->ah.is_global &&
+		 RV_CHECK(traffic_class, grh.traffic_class))
+		RV_REPORT(traffic_class, grh.traffic_class, "traffic_class",
+			  "%u");
+	else if (conn->ah.is_global &&
+		 RV_CHECK_BE32(flow_label, grh.flow_label))
+		RV_REPORT(flow_label, grh.flow_label, "flow_label", "0x%x");
+	else if (conn->ah.is_global && RV_CHECK(hop_limit, grh.hop_limit))
+		RV_REPORT(hop_limit, grh.hop_limit, "hop_limit", "%u");
+	else if (RV_CHECK(rate, static_rate))
+		RV_REPORT(rate, static_rate, "rate", "%u");
+	else
+		ret = 0;
+#undef RV_CHECK
+#undef RV_CHECK_BE32
+#undef RV_REPORT
+	return ret;
+}
+
+static int rv_conn_connect(int rv_inx, struct rv_conn *conn,
+			   struct rv_conn_connect_params_in *params)
+{
+	int i;
+	int ret;
+
+	trace_rv_conn_connect(conn, conn->rem_addr, conn->ah.is_global,
+			      conn->ah.dlid,
+			      be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[0])),
+			      be64_to_cpu(*((__be64 *)&conn->ah.grh.dgid[8])),
+			      conn->num_conn, conn->next,
+			      conn->jdev, kref_read(&conn->kref));
+
+	ret = rv_conn_connect_check_ah(rv_inx, conn, &params->path);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < conn->num_conn; i++) {
+		ret = rv_sconn_connect(rv_inx, &conn->sconn_arr[i], params);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+/* validate connect against jdev->ah */
+static int rv_jdev_check_connect_path(int rv_inx, const struct rv_job_dev *jdev,
+				      const struct ib_user_path_rec *path)
+{
+	char buf1[RV_MAX_ADDR_STR];
+	char buf2[RV_MAX_ADDR_STR];
+
+	if (cmp_gid(path->sgid, jdev->loc_gid)) {
+		rv_err(rv_inx, "connect: inconsistent src %s with attach %s\n",
+		       show_gid(buf1, sizeof(buf1), path->sgid),
+		       show_gid(buf2, sizeof(buf2), jdev->loc_gid));
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * PSM guarantees that both sides have created their connection prior to
+ * either trying to connect it.
+ */
+int doit_conn_connect(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_conn_connect_params_in params;
+	struct rv_conn *conn;
+	int ret = 0;
+
+	if (copy_from_user(&params, (void __user *)arg, sizeof(params)))
+		return -EFAULT;
+
+	mutex_lock(&rv->mutex);
+	if (!rv->attached) {
+		ret = rv->was_attached ? -ENXIO : -EINVAL;
+		goto unlock;
+	}
+	if (rv->rdma_mode != RV_RDMA_MODE_KERNEL) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	ret = rv_jdev_check_connect_path(rv->inx, rv->jdev, &params.path);
+	if (ret)
+		goto unlock;
+	conn = user_conn_find(rv, params.handle);
+	if (!conn) {
+		rv_err(rv->inx, "connect: No connection found\n");
+		ret = -EINVAL;
+		goto unlock;
+	}
+	trace_rv_msg_uconn_connect(rv->inx, "rv_user connect", (u64)conn, 0);
+
+	ret = rv_conn_connect(rv->inx, conn, &params);
+	if (ret) {
+		rv_err(rv->inx, "Failed to connect to server: %d\n", ret);
+		xa_erase(&rv->conn_xa, params.handle);
+		rv_conn_put(conn);
+	}
+unlock:
+	mutex_unlock(&rv->mutex);
+
+	return ret;
+}
+
+int doit_conn_connected(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_conn_connected_params_in params;
+	struct rv_conn *conn;
+	int ret = 0;
+
+	if (copy_from_user(&params, (void __user *)arg, sizeof(params)))
+		return -EFAULT;
+
+	mutex_lock(&rv->mutex);
+	conn = user_conn_find(rv, params.handle);
+	if (!conn) {
+		rv_err(rv->inx, "connected: No connection found\n");
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	ret = rv_conn_connected(conn);
+unlock:
+	mutex_unlock(&rv->mutex);
+
+	return ret;
+}
+
+int doit_conn_get_conn_count(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_conn_get_conn_count_params params;
+	struct rv_conn *conn;
+	struct rv_sconn *sconn;
+	int ret = 0;
+	u8 index;
+
+	if (copy_from_user(&params.in, (void __user *)arg, sizeof(params.in)))
+		return -EFAULT;
+
+	mutex_lock(&rv->mutex);
+	if (!rv->attached) {
+		ret = rv->was_attached ? -ENXIO : -EINVAL;
+		goto unlock;
+	}
+	if (rv->rdma_mode != RV_RDMA_MODE_KERNEL) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	conn = user_conn_find(rv, params.in.handle);
+	if (!conn) {
+		rv_err(rv->inx, "get_conn_count: No connection found\n");
+		ret = -EINVAL;
+		goto unlock;
+	}
+	if (params.in.index >= conn->num_conn) {
+		rv_err(rv->inx, "get_conn_count: Invalid index: %d\n",
+		       params.in.index);
+		ret = -EINVAL;
+		goto unlock;
+	}
+	index = array_index_nospec(params.in.index, conn->num_conn);
+
+	sconn = &conn->sconn_arr[index];
+
+	mutex_lock(&sconn->mutex);
+	if (sconn->state == RV_ERROR)
+		ret = -EIO;
+	else
+		params.out.count = sconn->stats.conn_recovery +
+				   (test_bit(RV_SCONN_WAS_CONNECTED,
+					     &sconn->flags) ? 1 : 0);
+	mutex_unlock(&sconn->mutex);
+	if (ret)
+		goto unlock;
+
+	if (copy_to_user((void __user *)arg, &params.out, sizeof(params.out)))
+		ret = -EFAULT;
+unlock:
+	mutex_unlock(&rv->mutex);
+
+	return ret;
+}
+
+static void rv_sconn_add_stats(struct rv_sconn *sconn,
+			       struct rv_conn_get_stats_params *params)
+{
+	mutex_lock(&sconn->mutex);
+	params->out.num_conn++;
+	if (test_bit(RV_SCONN_SERVER, &sconn->flags))
+		params->out.flags |= RV_CONN_STAT_FLAG_SERVER;
+	else
+		params->out.flags |= RV_CONN_STAT_FLAG_CLIENT;
+	if (!test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+		params->out.flags &= ~RV_CONN_STAT_FLAG_WAS_CONNECTED;
+
+#define RV_ADD_CM_EVT_STAT(sconn, params, s, evt) \
+	((params)->out.s += (sconn)->stats.cm_evt_cnt[evt])
+
+#define RV_ADD_STAT(sconn, params, s) \
+	((params)->out.s += (sconn)->stats.s)
+
+#define RV_ADD_ATOMIC_STAT(sconn, params, s) \
+	((params)->out.s += atomic_read(&(sconn)->stats.s))
+
+#define RV_ADD_ATOMIC64_STAT(sconn, params, s) \
+	((params)->out.s += atomic64_read(&(sconn)->stats.s))
+
+#define RV_MAX_STAT(sconn, params, s) \
+	((params)->out.s = max((params)->out.s, (sconn)->stats.s))
+
+	RV_ADD_CM_EVT_STAT(sconn, params, req_error, IB_CM_REQ_ERROR);
+	RV_ADD_CM_EVT_STAT(sconn, params, rep_error, IB_CM_REP_ERROR);
+	RV_ADD_CM_EVT_STAT(sconn, params, rep_recv, IB_CM_REP_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, rtu_recv, IB_CM_RTU_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, established, IB_CM_USER_ESTABLISHED);
+	RV_ADD_CM_EVT_STAT(sconn, params, dreq_error, IB_CM_DREQ_ERROR);
+	RV_ADD_CM_EVT_STAT(sconn, params, dreq_recv, IB_CM_DREQ_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, drep_recv, IB_CM_DREP_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, timewait, IB_CM_TIMEWAIT_EXIT);
+	RV_ADD_CM_EVT_STAT(sconn, params, mra_recv, IB_CM_MRA_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, rej_recv, IB_CM_REJ_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, lap_error, IB_CM_LAP_ERROR);
+	RV_ADD_CM_EVT_STAT(sconn, params, lap_recv, IB_CM_LAP_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, apr_recv, IB_CM_APR_RECEIVED);
+	RV_ADD_CM_EVT_STAT(sconn, params, unexp_event, RV_CM_EVENT_UNEXP);
+
+	RV_ADD_STAT(sconn, params, req_sent);
+	RV_ADD_STAT(sconn, params, rep_sent);
+	RV_ADD_STAT(sconn, params, rtu_sent);
+	RV_ADD_STAT(sconn, params, rej_sent);
+	RV_ADD_STAT(sconn, params, dreq_sent);
+	RV_ADD_STAT(sconn, params, drep_sent);
+
+	RV_MAX_STAT(sconn, params, wait_time);
+	RV_MAX_STAT(sconn, params, resolve_time);
+	RV_MAX_STAT(sconn, params, connect_time);
+	RV_MAX_STAT(sconn, params, connected_time);
+	RV_ADD_STAT(sconn, params, resolve);
+	RV_ADD_STAT(sconn, params, resolve_fail);
+	RV_ADD_STAT(sconn, params, conn_recovery);
+	RV_MAX_STAT(sconn, params, rewait_time);
+	RV_MAX_STAT(sconn, params, reresolve_time);
+	RV_MAX_STAT(sconn, params, reconnect_time);
+	RV_MAX_STAT(sconn, params, max_rewait_time);
+	RV_MAX_STAT(sconn, params, max_reresolve_time);
+	RV_MAX_STAT(sconn, params, max_reconnect_time);
+	RV_ADD_STAT(sconn, params, reresolve);
+	RV_ADD_STAT(sconn, params, reresolve_fail);
+	RV_ADD_STAT(sconn, params, post_write);
+	RV_ADD_STAT(sconn, params, post_write_fail);
+	RV_ADD_STAT(sconn, params, post_write_bytes);
+	RV_ADD_STAT(sconn, params, post_hb);
+	RV_ADD_STAT(sconn, params, post_hb_fail);
+
+	RV_ADD_ATOMIC_STAT(sconn, params, outstand_send_write);
+	RV_ADD_ATOMIC64_STAT(sconn, params, send_write_cqe);
+	RV_ADD_ATOMIC64_STAT(sconn, params, send_write_cqe_fail);
+	RV_ADD_ATOMIC64_STAT(sconn, params, recv_write_cqe);
+	RV_ADD_ATOMIC64_STAT(sconn, params, recv_write_bytes);
+	RV_ADD_ATOMIC64_STAT(sconn, params, recv_cqe_fail);
+	RV_ADD_ATOMIC64_STAT(sconn, params, send_hb_cqe);
+	RV_ADD_ATOMIC64_STAT(sconn, params, send_hb_cqe_fail);
+	RV_ADD_ATOMIC64_STAT(sconn, params, recv_hb_cqe);
+#undef RV_ADD_CM_EVT_STAT
+#undef RV_ADD_STAT
+#undef RV_ADD_ATOMIC_STAT
+#undef RV_ADD_ATOMIC64_STAT
+#undef RV_MAX_STAT
+	mutex_unlock(&sconn->mutex);
+}
+
+/* add up all the stats for sconns in given conn */
+static void rv_conn_add_stats(struct rv_conn *conn,
+			      struct rv_conn_get_stats_params *params)
+{
+	int i;
+
+	for (i = 0; i < conn->num_conn; i++)
+		rv_sconn_add_stats(&conn->sconn_arr[i], params);
+}
+
+int doit_conn_get_stats(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_conn_get_stats_params params;
+	struct rv_conn *conn;
+	int ret = 0;
+	u8 index;
+
+	if (copy_from_user(&params.in, (void __user *)arg, sizeof(params.in)))
+		return -EFAULT;
+
+	mutex_lock(&rv->mutex);
+	if (!rv->attached) {
+		ret = rv->was_attached ? -ENXIO : -EINVAL;
+		goto unlock;
+	}
+	if (rv->rdma_mode != RV_RDMA_MODE_KERNEL) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
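+	/* a zero handle means aggregate stats across all of this user's conns */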
+	if (params.in.handle) {
+		conn = user_conn_find(rv, params.in.handle);
+		if (!conn) {
+			rv_err(rv->inx,
+			       "conn_get_stats: No connection found\n");
+			ret = -EINVAL;
+			goto unlock;
+		}
+		index = params.in.index;
+
+		memset(&params, 0, sizeof(params));
+		params.out.flags = RV_CONN_STAT_FLAG_WAS_CONNECTED;
+		params.out.index = index;
+
+		if (index == RV_CONN_STATS_AGGREGATE) {
+			rv_conn_add_stats(conn, &params);
+		} else if (index >= conn->num_conn) {
+			ret = -EINVAL;
+			goto unlock;
+		} else {
+			index = array_index_nospec(index, conn->num_conn);
+			rv_sconn_add_stats(&conn->sconn_arr[index], &params);
+		}
+	} else {
+		XA_STATE(xas, &rv->conn_xa, 0);
+
+		memset(&params, 0, sizeof(params));
+		params.out.flags = RV_CONN_STAT_FLAG_WAS_CONNECTED;
+		params.out.index = RV_CONN_STATS_AGGREGATE;
+
+		xas_for_each(&xas, conn, UINT_MAX)
+			rv_conn_add_stats(conn, &params);
+	}
+
+	if (copy_to_user((void __user *)arg, &params.out, sizeof(params.out)))
+		ret = -EFAULT;
+unlock:
+	mutex_unlock(&rv->mutex);
+
+	return ret;
+}
+
+/*
+ * We have a rv_conn reference for the heartbeat cqe
+ * We let QP Async Event callback handle the errors for us.
+ * Note: rv_conn_put can put rv_job_dev and trigger whole job cleanup
+ */
+static void rv_hb_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rv_sconn *sconn = container_of(wc->wr_cqe,
+					      struct rv_sconn, hb_cqe);
+
+	trace_rv_wc_hb_done((u64)sconn, wc->status, wc->opcode, wc->byte_len,
+			    0);
+	trace_rv_sconn_hb_done(sconn, sconn->index,
+			       sconn->qp ? sconn->qp->qp_num : 0,
+			       sconn->parent, sconn->flags,
+			       (u32)sconn->state, 0);
+
+	if (wc->status) {
+		rv_report_cqe_error(cq, wc, sconn, "Heartbeat");
+		atomic64_inc(&sconn->stats.send_hb_cqe_fail);
+	} else {
+		struct rv_job_dev *jdev = sconn->parent->jdev;
+
+		WARN_ON(wc->qp != sconn->qp);
+		atomic64_inc(&sconn->stats.send_hb_cqe);
+		sconn->hb_timer.expires = jiffies +
+					  msecs_to_jiffies(jdev->hb_interval);
+		add_timer(&sconn->hb_timer);
+	}
+
+	rv_conn_put(sconn->parent);
+}
+
+/*
+ * issue HB WQE as needed.
+ * if there has been activity, no need for a new HB packet
+ * called on work queue with rv_conn reference held on our behalf
+ */
+static void rv_sconn_hb_work(struct work_struct *work)
+{
+	struct rv_sconn *sconn = container_of(work, struct rv_sconn, hb_work);
+	struct ib_send_wr swr = {
+			.opcode	= IB_WR_SEND,
+			.wr_cqe	= &sconn->hb_cqe,
+			.send_flags = IB_SEND_SIGNALED,
+	};
+	u64 old_act_count;
+	int ret;
+
+	mutex_lock(&sconn->mutex);
+
+	if (sconn->state != RV_CONNECTED)
+		goto unlock;
+
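+	/* if there was any activity since the last check, skip the HB and rearm */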
+	old_act_count = sconn->act_count;
+	sconn->act_count = sconn->stats.post_write +
+			   atomic64_read(&sconn->stats.recv_write_cqe) +
+			   atomic64_read(&sconn->stats.recv_hb_cqe);
+	if (sconn->act_count > old_act_count) {
+		struct rv_job_dev *jdev = sconn->parent->jdev;
+
+		sconn->hb_timer.expires = jiffies +
+					  msecs_to_jiffies(jdev->hb_interval);
+		add_timer(&sconn->hb_timer);
+		goto unlock;
+	}
+
+	trace_rv_sconn_hb_post(sconn, sconn->index, sconn->qp->qp_num,
+			       sconn->parent, sconn->flags,
+			       (u32)sconn->state, 0);
+	rv_conn_get(sconn->parent);
+	ret = ib_post_send(sconn->qp, &swr, NULL);
+	if (ret) {
+		sconn->stats.post_hb_fail++;
+		rv_conn_err(sconn, "failed to send hb: post %d\n", ret);
+		rv_conn_put(sconn->parent);
+	} else {
+		sconn->stats.post_hb++;
+	}
+
+unlock:
+	mutex_unlock(&sconn->mutex);
+
+	rv_conn_put(sconn->parent);
+}
+
+/* called in soft IRQ context, so the real work is done in the WQ */
+static void rv_sconn_hb_func(struct timer_list *timer)
+{
+	struct rv_sconn *sconn = container_of(timer, struct rv_sconn, hb_timer);
+
+	if (!sconn->parent)
+		return;
+	if (rv_conn_get_check(sconn->parent))
+		return;
+	rv_queue_work(&sconn->hb_work);
+}
+
+static void rv_listener_release(struct kref *kref)
+{
+	struct rv_listener *listener =
+		container_of(kref, struct rv_listener, kref);
+	struct rv_device *dev = listener->dev;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->listener_lock, flags);
+	list_del(&listener->listener_entry);
+	spin_unlock_irqrestore(&dev->listener_lock, flags);
+
+	ib_destroy_cm_id(listener->cm_id);
+
+	rv_device_put(dev);
+	kfree(listener);
+}
+
+void rv_listener_put(struct rv_listener *listener)
+{
+	trace_rv_listener_put(listener->dev->ib_dev->name,
+			      be64_to_cpu(listener->cm_id->service_id),
+			      kref_read(&listener->kref));
+	kref_put(&listener->kref, rv_listener_release);
+}
+
+/*
+ * only for use by rv_listener_get_alloc
+ * all others must use rv_listener_get_alloc or rv_listener_get
+ */
+static struct rv_listener *rv_listener_alloc(struct rv_device *dev,
+					     u64 service_id,
+					     ib_cm_handler handler)
+{
+	struct rv_listener *listener;
+	int ret;
+
+	listener = kzalloc(sizeof(*listener), GFP_KERNEL);
+	if (!listener)
+		return NULL;
+
+	listener->cm_id = ib_create_cm_id(dev->ib_dev, handler, listener);
+	if (IS_ERR(listener->cm_id)) {
+		rv_ptr_err("listener", listener, "Failed to create CM ID\n");
+		goto err_free;
+	}
+
+	ret = ib_cm_listen(listener->cm_id, cpu_to_be64(service_id), 0);
+	if (ret) {
+		rv_ptr_err("listener", listener, "CM listen failed: %d\n", ret);
+		goto err_cm;
+	}
+	rv_device_get(dev);
+	listener->dev = dev;
+
+	kref_init(&listener->kref);
+
+	return listener;
+err_cm:
+	ib_destroy_cm_id(listener->cm_id);
+err_free:
+	kfree(listener);
+
+	return NULL;
+}
+
+struct rv_listener *rv_listener_get_alloc(struct rv_device *dev, u64 service_id,
+					  ib_cm_handler handler)
+{
+	unsigned long flags;
+	struct rv_listener *entry = NULL;
+	__be64 sid = cpu_to_be64(service_id);
+
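+	/*
+	 * the mutex serializes allocation so each service_id gets a single
+	 * listener; the spinlock protects the list itself
+	 */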
+	mutex_lock(&dev->listener_mutex);
+	spin_lock_irqsave(&dev->listener_lock, flags);
+	list_for_each_entry(entry, &dev->listener_list, listener_entry) {
+		if (sid == entry->cm_id->service_id) {
+			if (!kref_get_unless_zero(&entry->kref))
+				continue;
+			goto done;
+		}
+	}
+	spin_unlock_irqrestore(&dev->listener_lock, flags);
+	entry = rv_listener_alloc(dev, service_id, handler);
+	if (!entry)
+		goto unlock_mutex;
+	trace_rv_listener_get(dev->ib_dev->name, service_id,
+			      kref_read(&entry->kref));
+	spin_lock_irqsave(&dev->listener_lock, flags);
+	list_add(&entry->listener_entry, &dev->listener_list);
+
+done:
+	spin_unlock_irqrestore(&dev->listener_lock, flags);
+unlock_mutex:
+	mutex_unlock(&dev->listener_mutex);
+	return entry;
+}
diff --git a/drivers/infiniband/ulp/rv/rv_file.c b/drivers/infiniband/ulp/rv/rv_file.c
index 3625a9c1681a..9d23503a30d9 100644
--- a/drivers/infiniband/ulp/rv/rv_file.c
+++ b/drivers/infiniband/ulp/rv/rv_file.c
@@ -3,16 +3,49 @@
  * Copyright(c) 2020 - 2021 Intel Corporation.
  */
 
+#include <rdma/ib_cache.h>
+#include <linux/cdev.h>
+
 #include "rv.h"
+#include "trace.h"
 
 /* A workqueue for all */
 static struct workqueue_struct *rv_wq;
+static struct workqueue_struct *rv_wq2;
+static struct workqueue_struct *rv_wq3;
+
+/*
+ * We expect relatively few jobs per node (typically 1)
+ * and relatively few devices per node (typically 1 to 8),
+ * so the list of job_devs should be short; it is only used
+ * at job launch and shutdown.
+ *
+ * The search key is job_key, dev_name, port_num; a linear search of the
+ * short list is fine.  The mutex avoids duplicate get_alloc adds and RCU
+ * protects list access.  See the rv.h comments about "get_alloc".
+ */
+static struct list_head rv_job_dev_list;
 
 void rv_queue_work(struct work_struct *work)
 {
 	queue_work(rv_wq, work);
 }
 
+void rv_queue_work2(struct work_struct *work)
+{
+	queue_work(rv_wq2, work);
+}
+
+void rv_queue_work3(struct work_struct *work)
+{
+	queue_work(rv_wq3, work);
+}
+
+void rv_flush_work2(void)
+{
+	flush_workqueue(rv_wq2);
+}
+
 void rv_job_dev_get(struct rv_job_dev *jdev)
 {
 	kref_get(&jdev->kref);
@@ -29,3 +62,95 @@ void rv_job_dev_put(struct rv_job_dev *jdev)
 {
 	kref_put(&jdev->kref, rv_job_dev_release);
 }
+
+/*
+ * confirm that we expected a REQ from this remote node on this port.
+ * Note the CM swaps src and dest, so dest here refers to the remote node.
+ */
+static struct rv_sconn *
+rv_conn_match_req(struct rv_conn *conn,
+		  const struct ib_cm_req_event_param *param,
+		  struct rv_req_priv_data *priv_data)
+{
+	if (param->port != conn->ah.port_num)
+		return NULL;
+	if ((param->primary_path->rec_type == SA_PATH_REC_TYPE_IB &&
+	     be16_to_cpu(param->primary_path->ib.dlid) != conn->ah.dlid) ||
+	    (param->primary_path->rec_type == SA_PATH_REC_TYPE_OPA &&
+	     be32_to_cpu(param->primary_path->opa.dlid) != conn->ah.dlid) ||
+	    (conn->ah.is_global &&
+	     cmp_gid(&param->primary_path->dgid, conn->ah.grh.dgid)))
+		return NULL;
+
+	if (priv_data->index >= conn->num_conn)
+		return NULL;
+
+	return &conn->sconn_arr[priv_data->index];
+}
+
+/*
+ * Within an rv_job_dev, find the server rv_sconn which matches the incoming
+ * CM request.
+ * The caller holds the rv_job_dev_list rcu_read_lock.
+ * If found, the refcount of the matching rv_conn will be incremented.
+ */
+static struct rv_sconn *
+rv_jdev_find_conn(struct rv_job_dev *jdev,
+		  const struct ib_cm_req_event_param *param,
+		  struct rv_req_priv_data *priv_data)
+{
+	struct rv_conn *conn;
+	struct rv_sconn *sconn = NULL;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(conn, &jdev->conn_list, conn_entry) {
+		WARN_ON(jdev != conn->jdev);
+		sconn = rv_conn_match_req(conn, param, priv_data);
+		if (!sconn)
+			continue;
+		if (!kref_get_unless_zero(&conn->kref))
+			continue;
+		break;
+	}
+	rcu_read_unlock();
+
+	return sconn;
+}
+
+/*
+ * Find the rv_sconn matching the received REQ
+ * listener may be shared by rv_job_dev's so filter on dev 1st
+ */
+struct rv_sconn *
+rv_find_sconn_from_req(struct ib_cm_id *id,
+		       const struct ib_cm_req_event_param *param,
+		       struct rv_req_priv_data *priv_data)
+{
+	struct rv_sconn *sconn = NULL;
+	struct rv_listener *listener = id->context;
+	struct rv_job_dev *jdev;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(jdev, &rv_job_dev_list, job_dev_entry) {
+		if (listener->dev != jdev->dev)
+			continue;
+		if (priv_data->uid != jdev->uid)
+			continue;
+		if (priv_data->job_key_len != jdev->job_key_len ||
+		    memcmp(priv_data->job_key, jdev->job_key,
+			   jdev->job_key_len))
+			continue;
+		if (param->port != jdev->port_num ||
+		    cmp_gid(&param->primary_path->sgid, jdev->loc_gid))
+			continue;
+		if (!rv_job_dev_has_users(jdev))
+			continue;
+
+		sconn = rv_jdev_find_conn(jdev, param, priv_data);
+		if (sconn)
+			break;
+	}
+	rcu_read_unlock();
+
+	return sconn;
+}
diff --git a/drivers/infiniband/ulp/rv/rv_rdma.c b/drivers/infiniband/ulp/rv/rv_rdma.c
new file mode 100644
index 000000000000..10334c0441a5
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/rv_rdma.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+
+#include "rv.h"
+#include "trace.h"
+
+/*
+ * This is called in soft IRQ context for CQE handling.
+ * We just report errors here and let the QP Async Event handler decide
+ * how the sconn will react to the QP moving to QPS_ERR.
+ */
+void rv_report_cqe_error(struct ib_cq *cq, struct ib_wc *wc,
+			 struct rv_sconn *sconn, const char *opname)
+{
+	if (wc->status != IB_WC_WR_FLUSH_ERR)
+		rv_conn_err(sconn,
+			    "failed %s qp %u status %s (%d) for CQE %p\n",
+			    opname, wc->qp ? wc->qp->qp_num : 0,
+			    ib_wc_status_msg(wc->status), wc->status,
+			    wc->wr_cqe);
+}
+
+static int rv_drv_post_recv(struct rv_sconn *sconn)
+{
+	struct ib_recv_wr wr;
+	const struct ib_recv_wr *bad_wr;
+
+	trace_rv_sconn_recv_post(sconn, sconn->index, sconn->qp->qp_num,
+				 sconn->parent, sconn->flags,
+				 (u32)sconn->state, 0);
+
+	wr.next = NULL;
+	wr.wr_cqe = &sconn->cqe;
+	wr.sg_list = NULL;
+	wr.num_sge = 0; /* only expect inbound RDMA Write w/immed */
+	return ib_post_recv(sconn->qp, &wr, &bad_wr);
+}
+
+int rv_drv_prepost_recv(struct rv_sconn *sconn)
+{
+	int i;
+	int ret;
+	u32 qp_depth = sconn->parent->jdev->qp_depth;
+
+	trace_rv_msg_prepost_recv(sconn, sconn->index, "prepost recv",
+				  (u64)qp_depth, (u64)sconn);
+	for (i = 0; i < qp_depth; i++) {
+		ret = rv_drv_post_recv(sconn);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+/* drain_lock makes sure no recv WQEs get reposted after a drain WQE */
+void rv_recv_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rv_sconn *sconn = container_of(wc->wr_cqe,
+					      struct rv_sconn, cqe);
+	unsigned long flags;
+
+	trace_rv_wc_recv_done((u64)sconn, wc->status, wc->opcode, wc->byte_len,
+			      be32_to_cpu(wc->ex.imm_data));
+	if (!sconn->parent)
+		return;
+	if (rv_conn_get_check(sconn->parent))
+		return;
+	trace_rv_sconn_recv_done(sconn, sconn->index,
+				 wc->qp->qp_num, sconn->parent, sconn->flags,
+				 (u32)(sconn->state),
+				 be32_to_cpu(wc->ex.imm_data));
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			rv_report_cqe_error(cq, wc, sconn, "Recv bad status");
+			atomic64_inc(&sconn->stats.recv_cqe_fail);
+		}
+		goto put;
+	}
+	if (wc->qp != sconn->qp)
+		goto put;
+
+	if (unlikely(wc->opcode == IB_WC_RECV)) {
+		atomic64_inc(&sconn->stats.recv_hb_cqe);
+		goto repost;
+	}
+
+	/* use relaxed, no big deal if stats updated out of order */
+	atomic64_inc(&sconn->stats.recv_write_cqe);
+	atomic64_add_return_relaxed(wc->byte_len,
+				    &sconn->stats.recv_write_bytes);
+
+	if (unlikely(wc->opcode != IB_WC_RECV_RDMA_WITH_IMM))
+		rv_report_cqe_error(cq, wc, sconn, "Recv bad opcode");
+repost:
+	spin_lock_irqsave(&sconn->drain_lock, flags);
+	if (likely(!test_bit(RV_SCONN_DRAINING, &sconn->flags)))
+		rv_drv_post_recv(sconn);
+	spin_unlock_irqrestore(&sconn->drain_lock, flags);
+put:
+	rv_conn_put(sconn->parent);
+}
diff --git a/drivers/infiniband/ulp/rv/trace.h b/drivers/infiniband/ulp/rv/trace.h
index d2827582be05..8dc3313342e9 100644
--- a/drivers/infiniband/ulp/rv/trace.h
+++ b/drivers/infiniband/ulp/rv/trace.h
@@ -3,6 +3,8 @@
  * Copyright(c) 2020 - 2021 Intel Corporation.
  */
 #include "trace_mr_cache.h"
+#include "trace_conn.h"
 #include "trace_dev.h"
 #include "trace_mr.h"
 #include "trace_user.h"
+#include "trace_rdma.h"
diff --git a/drivers/infiniband/ulp/rv/trace_conn.h b/drivers/infiniband/ulp/rv/trace_conn.h
new file mode 100644
index 000000000000..2626545b0df6
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace_conn.h
@@ -0,0 +1,529 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#if !defined(__RV_TRACE_CONN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __RV_TRACE_CONN_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rv_conn
+
+#define RV_CONN_REQ_PRN  "rem_addr 0x%x global %u sgid_inx %u port_num %u " \
+			 "dlid 0x%x dgid 0x%llx %llx"
+
+#define RV_CONN_PRN  "Conn 0x%p rem_addr 0x%x global %u dlid 0x%x " \
+		     "dgid 0x%llx %llx num_conn %u next %u jdev 0x%p " \
+		     "refcount %u"
+
+#define RV_JDEV_PRN "jdev 0x%p dev %p num_conn %u index_bits %u " \
+		    "loc_gid_index %u loc addr 0x%x jkey_len %u " \
+		    "jkey 0x%s sid 0x%llx q_depth %u ua_next %u "\
+		    "refcount %u"
+
+#define RV_SCONN_PRN "sconn %p index %u qp 0x%x conn %p flags 0x%x state %u " \
+		     "cm_id %p retry %u"
+
+DECLARE_EVENT_CLASS(/* listener */
+	rv_listener_template,
+	TP_PROTO(const char *dev_name, u64 svc_id, u32 refcount),
+	TP_ARGS(dev_name, svc_id, refcount),
+	TP_STRUCT__entry(/* entry */
+		__string(name, dev_name)
+		__field(u64, sid)
+		__field(u32, count)
+	),
+	TP_fast_assign(/* assign */
+		__assign_str(name, dev_name);
+		__entry->sid = svc_id;
+		__entry->count = refcount;
+	),
+	TP_printk(/* print */
+		"Device %s sid 0x%llx refcount %u",
+		__get_str(name),
+		__entry->sid,
+		__entry->count
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_listener_template, rv_listener_get,
+	TP_PROTO(const char *dev_name, u64 svc_id, u32 refcount),
+	TP_ARGS(dev_name, svc_id, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_listener_template, rv_listener_put,
+	TP_PROTO(const char *dev_name, u64 svc_id, u32 refcount),
+	TP_ARGS(dev_name, svc_id, refcount)
+);
+
+TRACE_EVENT(/* event */
+	rv_conn_create_req,
+	TP_PROTO(u32 rem_addr, u8 global, u8 sgid_inx, u8 port_num, u16 dlid,
+		 u64 dgid1, u64 dgid2),
+	TP_ARGS(rem_addr, global, sgid_inx, port_num, dlid, dgid1, dgid2),
+	TP_STRUCT__entry(/* entry */
+		__field(u32, rem_addr)
+		__field(u8, global)
+		__field(u8, sgid_inx)
+		__field(u8, port_num)
+		__field(u16, dlid)
+		__field(u64, dgid1)
+		__field(u64, dgid2)
+	),
+	TP_fast_assign(/* assign */
+		__entry->rem_addr = rem_addr;
+		__entry->global = global;
+		__entry->sgid_inx = sgid_inx;
+		__entry->port_num = port_num;
+		__entry->dlid = dlid;
+		__entry->dgid1 = dgid1;
+		__entry->dgid2 = dgid2;
+	),
+	TP_printk(/* print */
+		RV_CONN_REQ_PRN,
+		__entry->rem_addr,
+		__entry->global,
+		__entry->sgid_inx,
+		__entry->port_num,
+		__entry->dlid,
+		__entry->dgid1,
+		__entry->dgid2
+	)
+);
+
+DECLARE_EVENT_CLASS(/* conn */
+	rv_conn_template,
+	TP_PROTO(void *ptr, u32 rem_addr, u8 global, u16 dlid, u64 dgid1,
+		 u64 dgid2, u8 num_conn, u32 next, void *jdev, u32 refcount),
+	TP_ARGS(ptr, rem_addr, global, dlid, dgid1, dgid2, num_conn, next,
+		jdev, refcount),
+	TP_STRUCT__entry(/* entry */
+		__field(void *, ptr)
+		__field(u32, rem_addr)
+		__field(u8, global)
+		__field(u16, dlid)
+		__field(u64, dgid1)
+		__field(u64, dgid2)
+		__field(u8, num_conn)
+		__field(u32, next)
+		__field(void *, jdev)
+		__field(u32, refcount)
+	),
+	TP_fast_assign(/* assign */
+		__entry->ptr = ptr;
+		__entry->rem_addr = rem_addr;
+		__entry->global = global;
+		__entry->dlid = dlid;
+		__entry->dgid1 = dgid1;
+		__entry->dgid2 = dgid2;
+		__entry->num_conn = num_conn;
+		__entry->next = next;
+		__entry->jdev = jdev;
+		__entry->refcount = refcount;
+	),
+	TP_printk(/* print */
+		RV_CONN_PRN,
+		__entry->ptr,
+		__entry->rem_addr,
+		__entry->global,
+		__entry->dlid,
+		__entry->dgid1,
+		__entry->dgid2,
+		__entry->num_conn,
+		__entry->next,
+		__entry->jdev,
+		__entry->refcount
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_conn_template, rv_conn_create,
+	TP_PROTO(void *ptr, u32 rem_addr, u8 global, u16 dlid, u64 dgid1,
+		 u64 dgid2, u8 num_conn, u32 next, void *jdev, u32 refcount),
+	TP_ARGS(ptr, rem_addr, global, dlid, dgid1, dgid2, num_conn, next,
+		jdev, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_conn_template, rv_conn_alloc,
+	TP_PROTO(void *ptr, u32 rem_addr, u8 global, u16 dlid, u64 dgid1,
+		 u64 dgid2, u8 num_conn, u32 next, void *jdev, u32 refcount),
+	TP_ARGS(ptr, rem_addr, global, dlid, dgid1, dgid2, num_conn, next,
+		jdev, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_conn_template, rv_conn_release,
+	TP_PROTO(void *ptr, u32 rem_addr, u8 global, u16 dlid, u64 dgid1,
+		 u64 dgid2, u8 num_conn, u32 next, void *jdev, u32 refcount),
+	TP_ARGS(ptr, rem_addr, global, dlid, dgid1, dgid2, num_conn, next,
+		jdev, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_conn_template, rv_conn_connect,
+	TP_PROTO(void *ptr, u32 rem_addr, u8 global, u16 dlid, u64 dgid1,
+		 u64 dgid2, u8 num_conn, u32 next, void *jdev, u32 refcount),
+	TP_ARGS(ptr, rem_addr, global, dlid, dgid1, dgid2, num_conn, next,
+		jdev, refcount)
+);
+
+DECLARE_EVENT_CLASS(/* jdev */
+	rv_jdev_template,
+	TP_PROTO(void *ptr, const char *dev_name, u8 num_conn, u8 index_bits,
+		 u16 loc_gid_index, u32 loc_addr, u8 jkey_len, u8 *jkey,
+		 u64 sid, u32 q_depth, u32 ua_next, u32 refcount),
+	TP_ARGS(ptr, dev_name, num_conn, index_bits, loc_gid_index, loc_addr,
+		jkey_len, jkey, sid, q_depth, ua_next, refcount),
+	TP_STRUCT__entry(/* entry */
+		__field(void *, ptr)
+		__string(name, dev_name)
+		__field(u8, num_conn)
+		__field(u8, index_bits)
+		__field(u16, loc_gid_index)
+		__field(u32, loc_addr)
+		__field(u8, jkey_len)
+		__array(u8, jkey, RV_MAX_JOB_KEY_LEN)
+		__field(u64, sid)
+		__field(u32, q_depth)
+		__field(u32, ua_next)
+		__field(u32, refcount)
+	),
+	TP_fast_assign(/* assign */
+		__entry->ptr = ptr;
+		__assign_str(name, dev_name);
+		__entry->num_conn = num_conn;
+		__entry->index_bits = index_bits;
+		__entry->loc_gid_index = loc_gid_index;
+		__entry->loc_addr = loc_addr;
+		__entry->jkey_len = jkey_len;
+		memcpy(__entry->jkey, jkey, RV_MAX_JOB_KEY_LEN);
+		__entry->sid = sid;
+		__entry->q_depth = q_depth;
+		__entry->ua_next = ua_next;
+		__entry->refcount = refcount;
+	),
+	TP_printk(/* print */
+		RV_JDEV_PRN,
+		__entry->ptr,
+		__get_str(name),
+		__entry->num_conn,
+		__entry->index_bits,
+		__entry->loc_gid_index,
+		__entry->loc_addr,
+		__entry->jkey_len,
+		__print_hex_str(__entry->jkey, RV_MAX_JOB_KEY_LEN),
+		__entry->sid,
+		__entry->q_depth,
+		__entry->ua_next,
+		__entry->refcount
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_jdev_template, rv_jdev_conn_create,
+	TP_PROTO(void *ptr, const char *dev_name, u8 num_conn, u8 index_bits,
+		 u16 loc_gid_index, u32 loc_addr, u8 jkey_len, u8 *jkey,
+		 u64 sid, u32 q_depth, u32 ua_next, u32 refcount),
+	TP_ARGS(ptr, dev_name, num_conn, index_bits, loc_gid_index, loc_addr,
+		jkey_len, jkey, sid, q_depth, ua_next, refcount)
+);
+
+DECLARE_EVENT_CLASS(/* sconn */
+	rv_sconn_template,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry),
+	TP_STRUCT__entry(/* entry */
+		__field(void *, ptr)
+		__field(u8, index)
+		__field(u32, qp_num)
+		__field(void *, conn)
+		__field(u32, flags)
+		__field(u32, state)
+		__field(void *, cm_id)
+		__field(u32, retry)
+	),
+	TP_fast_assign(/* assign */
+		__entry->ptr = ptr;
+		__entry->index = index;
+		__entry->qp_num = qp_num;
+		__entry->conn = conn;
+		__entry->flags = flags;
+		__entry->state = state;
+		__entry->cm_id = cm_id;
+		__entry->retry = retry;
+	),
+	TP_printk(/* print */
+		 RV_SCONN_PRN,
+		__entry->ptr,
+		__entry->index,
+		__entry->qp_num,
+		__entry->conn,
+		__entry->flags,
+		__entry->state,
+		__entry->cm_id,
+		__entry->retry
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_init,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_deinit,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_resolve,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_resolve_cb,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_cm_handler,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_set_state,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_req_handler,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_done_discon,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_drain_done,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_cq_event,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_qp_event,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_timeout_work,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_template, rv_sconn_delay_work,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, void *cm_id, u32 retry),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, cm_id, retry)
+);
+
+DECLARE_EVENT_CLASS(/* cm_event */
+	rv_cm_event_template,
+	TP_PROTO(u32 evt, void *cm_id, void *sconn),
+	TP_ARGS(evt, cm_id, sconn),
+	TP_STRUCT__entry(/* entry */
+		__field(u32, event)
+		__field(void *, cm_id)
+		__field(void *, sconn)
+	),
+	TP_fast_assign(/* assign */
+		__entry->event = evt;
+		__entry->cm_id = cm_id;
+		__entry->sconn = sconn;
+	),
+	TP_printk(/* print */
+		"Event %u cm_id %p sconn %p",
+		__entry->event,
+		__entry->cm_id,
+		__entry->sconn
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_cm_event_template, rv_cm_event_handler,
+	TP_PROTO(u32 evt, void *cm_id, void *sconn),
+	TP_ARGS(evt, cm_id, sconn)
+);
+
+DEFINE_EVENT(/* event */
+	rv_cm_event_template, rv_cm_event_server_handler,
+	TP_PROTO(u32 evt, void *cm_id, void *sconn),
+	TP_ARGS(evt, cm_id, sconn)
+);
+
+DECLARE_EVENT_CLASS(/* msg */
+	rv_sconn_msg_template,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2),
+	TP_STRUCT__entry(/* entry */
+		__field(void *, ptr)
+		__field(u8, index)
+		__string(msg, msg)
+		__field(u64, d1)
+		__field(u64, d2)
+	),
+	TP_fast_assign(/* assign */
+		__entry->ptr = ptr;
+		__entry->index = index;
+		__assign_str(msg, msg);
+		__entry->d1 = d1;
+		__entry->d2 = d2;
+	),
+	TP_printk(/* print */
+		"sconn %p index %u: %s 0x%llx 0x%llx",
+		__entry->ptr,
+		__entry->index,
+		__get_str(msg),
+		__entry->d1,
+		__entry->d2
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_destroy_qp,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_send_req,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_qp_rtr,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_cm_handler,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_cm_rep_handler,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_enter_disconnect,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_cm_req_handler,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_sconn_timeout_work,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_cq_event,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_qp_event,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_msg_template, rv_msg_prepost_recv,
+	TP_PROTO(void *ptr, u8 index, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(ptr, index, msg, d1, d2)
+);
+
+DECLARE_EVENT_CLASS(/* msg */
+	rv_msg_template,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2),
+	TP_STRUCT__entry(/* entry */
+		__field(int, inx)
+		__string(msg, msg)
+		__field(u64, d1)
+		__field(u64, d2)
+	),
+	TP_fast_assign(/* assign */
+		__entry->inx = inx;
+		__assign_str(msg, msg);
+		__entry->d1 = d1;
+		__entry->d2 = d2;
+	),
+	TP_printk(/* print */
+		"inx %u: %s 0x%llx 0x%llx",
+		__entry->inx,
+		__get_str(msg),
+		__entry->d1,
+		__entry->d2
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_msg_template, rv_msg_err_qp,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+#endif /* __RV_TRACE_CONN_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_conn
+#include <trace/define_trace.h>
diff --git a/drivers/infiniband/ulp/rv/trace_rdma.h b/drivers/infiniband/ulp/rv/trace_rdma.h
new file mode 100644
index 000000000000..54ea0cf2599c
--- /dev/null
+++ b/drivers/infiniband/ulp/rv/trace_rdma.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 - 2021 Intel Corporation.
+ */
+#if !defined(__RV_TRACE_RDMA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __RV_TRACE_RDMA_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rv_rdma
+
+#define RV_SCONN_RECV_PRN "sconn %p index %u qp 0x%x conn %p flags 0x%x " \
+			  " state %u immed 0x%x"
+
+DECLARE_EVENT_CLASS(/* recv */
+	rv_sconn_recv_template,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, u32 immed),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, immed),
+	TP_STRUCT__entry(/* entry */
+		__field(void *, ptr)
+		__field(u8, index)
+		__field(u32, qp_num)
+		__field(void *, conn)
+		__field(u32, flags)
+		__field(u32, state)
+		__field(u32, immed)
+	),
+	TP_fast_assign(/* assign */
+		__entry->ptr = ptr;
+		__entry->index = index;
+		__entry->qp_num = qp_num;
+		__entry->conn = conn;
+		__entry->flags = flags;
+		__entry->state = state;
+		__entry->immed = immed;
+	),
+	TP_printk(/* print */
+		 RV_SCONN_RECV_PRN,
+		__entry->ptr,
+		__entry->index,
+		__entry->qp_num,
+		__entry->conn,
+		__entry->flags,
+		__entry->state,
+		__entry->immed
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_recv_template, rv_sconn_recv_done,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, u32 immed),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, immed)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_recv_template, rv_sconn_recv_post,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, u32 immed),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, immed)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_recv_template, rv_sconn_hb_done,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, u32 immed),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, immed)
+);
+
+DEFINE_EVENT(/* event */
+	rv_sconn_recv_template, rv_sconn_hb_post,
+	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
+		 u32 state, u32 immed),
+	TP_ARGS(ptr, index, qp_num, conn, flags, state, immed)
+);
+
+DECLARE_EVENT_CLASS(/* wc */
+	rv_wc_template,
+	TP_PROTO(u64 wr_id, u32 status, u32 opcode, u32 byte_len,
+		 u32 imm_data),
+	TP_ARGS(wr_id, status, opcode, byte_len, imm_data),
+	TP_STRUCT__entry(/* entry */
+		__field(u64, wr_id)
+		__field(u32, status)
+		__field(u32, opcode)
+		__field(u32, byte_len)
+		__field(u32, imm_data)
+	),
+	TP_fast_assign(/* assign */
+		__entry->wr_id = wr_id;
+		__entry->status = status;
+		__entry->opcode = opcode;
+		__entry->byte_len = byte_len;
+		__entry->imm_data = imm_data;
+	),
+	TP_printk(/* print */
+		"wr_id 0x%llx status 0x%x opcode 0x%x byte_len 0x%x immed 0x%x",
+		__entry->wr_id,
+		__entry->status,
+		__entry->opcode,
+		__entry->byte_len,
+		__entry->imm_data
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_wc_template, rv_wc_recv_done,
+	TP_PROTO(u64 wr_id, u32 status, u32 opcode, u32 byte_len,
+		 u32 imm_data),
+	TP_ARGS(wr_id, status, opcode, byte_len, imm_data)
+);
+
+DEFINE_EVENT(/* event */
+	rv_wc_template, rv_wc_hb_done,
+	TP_PROTO(u64 wr_id, u32 status, u32 opcode, u32 byte_len,
+		 u32 imm_data),
+	TP_ARGS(wr_id, status, opcode, byte_len, imm_data)
+);
+
+#endif /* __RV_TRACE_RDMA_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_rdma
+#include <trace/define_trace.h>
diff --git a/drivers/infiniband/ulp/rv/trace_user.h b/drivers/infiniband/ulp/rv/trace_user.h
index 2707e39bdfd6..ce62c808ca10 100644
--- a/drivers/infiniband/ulp/rv/trace_user.h
+++ b/drivers/infiniband/ulp/rv/trace_user.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
 /*
- * Copyright(c) 2020 Intel Corporation.
+ * Copyright(c) 2020 - 2021 Intel Corporation.
  */
 #if !defined(__RV_TRACE_USER_H) || defined(TRACE_HEADER_MULTI_READ)
 #define __RV_TRACE_USER_H
@@ -14,6 +14,61 @@
 #define RV_USER_MRS_PRN "rv_nx %d jdev %p total_size 0x%llx max_size 0x%llx " \
 			"refcount %u"
 
+DECLARE_EVENT_CLASS(/* user msg */
+	rv_user_msg_template,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2),
+	TP_STRUCT__entry(/* entry */
+		__field(int, inx)
+		__string(msg, msg)
+		__field(u64, d1)
+		__field(u64, d2)
+	),
+	TP_fast_assign(/* assign */
+		__entry->inx = inx;
+		__assign_str(msg, msg);
+		__entry->d1 = d1;
+		__entry->d2 = d2;
+	),
+	TP_printk(/* print */
+		"rv_nx %d: %s 0x%llx 0x%llx",
+		__entry->inx,
+		__get_str(msg),
+		__entry->d1,
+		__entry->d2
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_uconn_create,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_uconn_connect,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_cmp_params,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_conn_exist,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_conn_create,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
 DECLARE_EVENT_CLASS(/* user_mrs */
 	rv_user_mrs_template,
 	TP_PROTO(int rv_inx, void *jdev, u64 total_size, u64 max_size,
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH RFC 7/9] RDMA/rv: Add functions for RDMA transactions
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
                   ` (5 preceding siblings ...)
  2021-03-19 12:56 ` [PATCH RFC 6/9] RDMA/rv: Add connection management functions kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 8/9] RDMA/rv: Add functions for file operations kaike.wan
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

The only RDMA request used by this module is the RDMA WRITE WITH
IMMEDIATE request. Part of the immediate data is used as a tag to
encode the intended receiving rv_user in the rv_conn object, and
the remaining bits are reserved for the user application (e.g., to associate
the inbound completion with a specific outstanding rendezvous IO).
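
A minimal sketch (user-space style C) of how the 32-bit immediate data
can be packed and unpacked: it mirrors the shift used by
rv_recv_rdma_write() in this patch, assumes 0 < index_bits < 32, and the
helper names are illustrative only, not part of the rv ABI:

  #include <stdint.h>

  /*
   * Sketch only: the receiver's rv_user index occupies the top
   * index_bits bits of the immediate data; the low bits carry the
   * application's own completion tag.
   */
  static inline uint32_t rv_immed_pack(uint32_t rv_index, uint32_t app_tag,
                                       unsigned int index_bits)
  {
          uint32_t app_mask = (1u << (32 - index_bits)) - 1;

          return (rv_index << (32 - index_bits)) | (app_tag & app_mask);
  }

  static inline uint32_t rv_immed_index(uint32_t immed,
                                        unsigned int index_bits)
  {
          return immed >> (32 - index_bits);
  }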

This patch adds the following functions:
- Send RDMA write with immediate request.
- Handle the send completion event.
- Receive the RDMA write with immediate request.
- Post events to the event ring (a consumer-side sketch follows this
  list).
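
For reference, a minimal sketch of the user-space side that consumes the
mmap'ed event ring filled by rv_user_ring_post_event() below. The kernel
advances tail and user space advances head; num_entries and the
handle_completion() callback are assumed to be supplied by the
application:

  /*
   * Sketch only: struct rv_ring_header/rv_event are the rv event ring
   * types used by the kernel producer; the loop itself is illustrative.
   */
  static void drain_event_ring(struct rv_ring_header *hdr,
                               uint32_t num_entries)
  {
          while (hdr->head != hdr->tail) {        /* ring not empty */
                  struct rv_event ev = hdr->entries[hdr->head];
                  uint32_t next = hdr->head + 1;

                  if (next == num_entries)
                          next = 0;
                  __sync_synchronize(); /* read ev before releasing slot */
                  hdr->head = next;
                  handle_completion(&ev); /* application-defined */
          }
  }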

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 drivers/infiniband/ulp/rv/rv_rdma.c    | 313 +++++++++++++++++++++++++
 drivers/infiniband/ulp/rv/trace_rdma.h | 125 ++++++++++
 drivers/infiniband/ulp/rv/trace_user.h |  31 +++
 3 files changed, 469 insertions(+)

diff --git a/drivers/infiniband/ulp/rv/rv_rdma.c b/drivers/infiniband/ulp/rv/rv_rdma.c
index 10334c0441a5..c30773e09ffe 100644
--- a/drivers/infiniband/ulp/rv/rv_rdma.c
+++ b/drivers/infiniband/ulp/rv/rv_rdma.c
@@ -6,6 +6,56 @@
 #include "rv.h"
 #include "trace.h"
 
+/*
+ * select next sconn to post and claim WQE by inc outstand_send_write
+ * if all sconn SQs are full, next is left back where it started
+ */
+static struct rv_sconn *rv_conn_next_sconn_to_post(struct rv_conn *conn)
+{
+	unsigned long flags;
+	struct rv_sconn *sconn;
+	u8 i;
+	u32 qp_depth = conn->jdev->qp_depth;
+
+	spin_lock_irqsave(&conn->next_lock, flags);
+	for (i = 0; i < conn->num_conn; i++) {
+		sconn = &conn->sconn_arr[conn->next];
+		conn->next = (conn->next + 1) % conn->num_conn;
+		if (atomic_read(&sconn->stats.outstand_send_write) < qp_depth) {
+			atomic_inc(&sconn->stats.outstand_send_write);
+			goto unlock;
+		}
+	}
+	sconn = NULL;
+unlock:
+	spin_unlock_irqrestore(&conn->next_lock, flags);
+	return sconn;
+}
+
+static int rv_drv_post_write_immed(struct rv_pend_write *pend_wr)
+{
+	struct ib_rdma_wr wr;
+	const struct ib_send_wr *bad_wr;
+	struct ib_sge list;
+	struct rv_mr_cached *mrc = pend_wr->mrc;
+
+	/* we xlat the user space loc_addr to an iova appropriate for the MR */
+	list.addr = mrc->mr.ib_mr->iova + (pend_wr->loc_addr - mrc->addr);
+	list.length = pend_wr->length;
+	list.lkey = mrc->mr.ib_mr->lkey;
+
+	wr.wr.next = NULL;
+	wr.wr.wr_cqe = &pend_wr->cqe;
+	wr.wr.sg_list = &list;
+	wr.wr.num_sge = 1;
+	wr.wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.wr.send_flags = IB_SEND_SIGNALED;
+	wr.wr.ex.imm_data = cpu_to_be32(pend_wr->immed);
+	wr.remote_addr = pend_wr->rem_addr;
+	wr.rkey = pend_wr->rkey;
+	return ib_post_send(pend_wr->sconn->qp, &wr.wr, &bad_wr);
+}
+
 /*
  * This is called in Soft IRQs for CQE handling.
  * We just report errors here, let the QP Async Event deal with
@@ -22,6 +72,246 @@ void rv_report_cqe_error(struct ib_cq *cq, struct ib_wc *wc,
 			    wc->wr_cqe);
 }
 
+static void rv_user_ring_post_event(struct rv_user_ring *ring,
+				    struct rv_event *ev)
+{
+	unsigned long flags;
+	struct rv_ring_header *hdr = ring->hdr;
+	int next;
+
+	trace_rv_user_ring_post_event(ring->rv_inx, ring->num_entries,
+				      ring->hdr->head, ring->hdr->tail);
+	trace_rv_event_post(ev->event_type, ev->wc.status, ev->wc.imm_data,
+			    ev->wc.wr_id, ev->wc.conn_handle,
+			    ev->wc.byte_len);
+	spin_lock_irqsave(&ring->lock, flags);
+	next = hdr->tail + 1;
+	if (next == ring->num_entries)
+		next = 0;
+	if (next == hdr->head)  {
+		hdr->overflow_cnt++;
+		rv_err(ring->rv_inx, "event ring full: head %u tail %u\n",
+		       hdr->head, hdr->tail);
+		goto unlock;
+	}
+
+	smp_rmb(); /* ensure we read head before writing event */
+	hdr->entries[hdr->tail] = *ev;
+	smp_wmb(); /* ensure ev written before advance tail */
+
+	hdr->tail = next;
+	if (ev->wc.status) {
+		ring->stats.cqe_fail[ev->event_type]++;
+	} else {
+		ring->stats.cqe[ev->event_type]++;
+		ring->stats.bytes[ev->event_type] += ev->wc.byte_len;
+	}
+unlock:
+	spin_unlock_irqrestore(&ring->lock, flags);
+}
+
+static void rv_post_user_event_by_index(struct rv_job_dev *jdev, u16 index,
+					struct rv_event *ev)
+{
+	unsigned long flags;
+	struct rv_user *rv;
+
+	spin_lock_irqsave(&jdev->user_array_lock, flags);
+	if (index >= jdev->max_users)
+		goto unlock;
+	rv = jdev->user_array[index];
+	if (rv && rv->cqr)
+		rv_user_ring_post_event(rv->cqr, ev);
+unlock:
+	spin_unlock_irqrestore(&jdev->user_array_lock, flags);
+}
+
+/*
+ * We have a rv_conn reference for the pend_wr
+ * pass all failures to PSM to deal with.  We can't attempt
+ * to retry the write in rv since it might have succeeded on remote
+ * end (eg. ack lost) and remote end may be using buffer for something
+ * else already
+ */
+static void rv_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rv_pend_write *pend_wr = container_of(wc->wr_cqe,
+						struct rv_pend_write, cqe);
+	struct rv_sconn *sconn = pend_wr->sconn;
+	struct rv_event ev = { 0 };
+
+	atomic_dec(&sconn->stats.outstand_send_write);
+	trace_rv_wc_write_done(pend_wr->wr_id, wc->status, wc->opcode,
+			       wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+	trace_rv_pend_write_done(pend_wr->user_index, pend_wr->sconn, pend_wr,
+				 pend_wr->loc_addr, pend_wr->rkey,
+				 pend_wr->rem_addr, pend_wr->length,
+				 pend_wr->immed, pend_wr->wr_id);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS))
+		rv_report_cqe_error(cq, wc, pend_wr->sconn, "RDMA Write");
+	else if (wc->qp != sconn->qp)
+		rv_report_cqe_error(cq, wc, pend_wr->sconn, "Stale RDMA Write");
+
+	ev.event_type = RV_WC_RDMA_WRITE;
+	ev.wc.status = wc->status;
+	ev.wc.wr_id = pend_wr->wr_id;
+	ev.wc.conn_handle = (u64)pend_wr->sconn->parent;
+	ev.wc.byte_len = pend_wr->length;
+	trace_rv_event_write_done(ev.event_type, ev.wc.status, ev.wc.imm_data,
+				  ev.wc.wr_id, ev.wc.conn_handle,
+				  ev.wc.byte_len);
+
+	rv_mr_cache_put(&pend_wr->umrs->cache, pend_wr->mrc);
+
+	rv_post_user_event_by_index(pend_wr->sconn->parent->jdev,
+				    pend_wr->user_index, &ev);
+
+	if (wc->status)
+		atomic64_inc(&sconn->stats.send_write_cqe_fail);
+	else
+		atomic64_inc(&sconn->stats.send_write_cqe);
+
+	/* our rv_conn ref prevents user_mrs_put from triggering job cleanup */
+	rv_user_mrs_put(pend_wr->umrs);
+
+	/* rv_conn_put can put rv_job_dev and trigger whole job cleanup */
+	rv_conn_put(sconn->parent);
+
+	kfree(pend_wr);
+}
+
+/*
+ * we do not need a queue inside rv of unposted writes.  If this fails
+ * PSM will try to repost later.
+ * We use loc_addr/length/access to lookup MR in cache and then verify RDMA is
+ * consistent with loc_addr and length
+ */
+int doit_post_rdma_write(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_post_write_params pparams;
+	struct rv_conn *conn;
+	struct rv_sconn *sconn;
+	struct rv_mr_cached *mrc;
+	struct rv_pend_write *pend_wr;
+	int ret;
+
+	if (copy_from_user(&pparams.in, (void __user *)arg,
+			   sizeof(pparams.in)))
+		return -EFAULT;
+
+	mutex_lock(&rv->mutex);
+
+	conn = user_conn_find(rv, pparams.in.handle);
+	if (!conn) {
+		rv_err(rv->inx, "post_write: No connection found\n");
+		ret = -EINVAL;
+		goto bail_unlock;
+	}
+	sconn = rv_conn_next_sconn_to_post(conn);
+	if (unlikely(!sconn)) {
+		ret = -ENOMEM;
+		goto bail_unlock;
+	}
+
+	mrc = rv_mr_cache_search_get(&rv->umrs->cache, pparams.in.loc_mr_addr,
+				     pparams.in.loc_mr_length,
+				     pparams.in.loc_mr_access, false);
+	if (!mrc) {
+		rv_err(rv->inx, "post_write: bad loc_mr\n");
+		ret = -EINVAL;
+		goto bail_dec;
+	}
+
+	if (mrc->addr > (u64)pparams.in.loc_addr ||
+	    mrc->addr + mrc->len <
+	    (u64)pparams.in.loc_addr + pparams.in.length) {
+		rv_err(rv->inx, "post_write: addr inconsistent with loc_mr\n");
+		ret = -EINVAL;
+		goto bail_put_mr;
+	}
+	if (!(mrc->access & IBV_ACCESS_KERNEL)) {
+		rv_err(rv->inx, "post_write: loc_mr not a kernel MR\n");
+		ret = -EINVAL;
+		goto bail_put_mr;
+	}
+
+	pend_wr = kzalloc(sizeof(*pend_wr), GFP_KERNEL);
+	if (!pend_wr) {
+		ret = -ENOMEM;
+		goto bail_put_mr;
+	}
+	pend_wr->cqe.done = rv_rdma_write_done;
+	pend_wr->user_index = rv->index;
+
+	rv_user_mrs_get(rv->umrs);
+	pend_wr->umrs = rv->umrs;
+
+	rv_conn_get(sconn->parent);
+	pend_wr->sconn = sconn;
+
+	pend_wr->mrc = mrc;
+	pend_wr->loc_addr = (u64)pparams.in.loc_addr;
+	pend_wr->rem_addr = pparams.in.rem_addr;
+	pend_wr->rkey = pparams.in.rkey;
+	pend_wr->length = pparams.in.length;
+	pend_wr->immed = pparams.in.immed;
+	pend_wr->wr_id = pparams.in.wr_id;
+
+	mutex_lock(&sconn->mutex);
+	if (sconn->state != RV_CONNECTED) {
+		if (sconn->state == RV_ERROR)
+			ret = -EIO;
+		else if (test_bit(RV_SCONN_WAS_CONNECTED, &sconn->flags))
+			ret = -EAGAIN;
+		else
+			ret = -EINVAL;
+		mutex_unlock(&sconn->mutex);
+		goto bail_free_pend;
+	}
+
+	trace_rv_pend_write_post(pend_wr->user_index, pend_wr->sconn, pend_wr,
+				 pend_wr->loc_addr, pend_wr->rkey,
+				 pend_wr->rem_addr, pend_wr->length,
+				 pend_wr->immed, pend_wr->wr_id);
+	ret = rv_drv_post_write_immed(pend_wr);
+	if (ret) {
+		sconn->stats.post_write_fail++;
+	} else {
+		sconn->stats.post_write++;
+		sconn->stats.post_write_bytes += pparams.in.length;
+	}
+
+	pparams.out.sconn_index = sconn->index;
+	pparams.out.conn_count = sconn->stats.conn_recovery + 1;
+
+	mutex_unlock(&sconn->mutex);
+	if (ret) {
+		rv_err(rv->inx, "post_write: failed: %d\n", ret);
+		goto bail_free_pend;
+	}
+
+	if (copy_to_user((void __user *)arg, &pparams.out, sizeof(pparams.out)))
+		ret = -EFAULT;
+
+	mutex_unlock(&rv->mutex);
+
+	return ret;
+
+bail_free_pend:
+	rv_conn_put(pend_wr->sconn->parent);
+	rv_user_mrs_put(pend_wr->umrs);
+	kfree(pend_wr);
+
+bail_put_mr:
+	rv_mr_cache_put(&rv->umrs->cache, mrc);
+bail_dec:
+	atomic_dec(&sconn->stats.outstand_send_write);
+bail_unlock:
+	mutex_unlock(&rv->mutex);
+	return ret;
+}
+
 static int rv_drv_post_recv(struct rv_sconn *sconn)
 {
 	struct ib_recv_wr wr;
@@ -54,6 +344,27 @@ int rv_drv_prepost_recv(struct rv_sconn *sconn)
 	return 0;
 }
 
+static void rv_recv_rdma_write(struct rv_sconn *sconn, struct ib_wc *wc)
+{
+	struct rv_job_dev *jdev = sconn->parent->jdev;
+	u32 index = be32_to_cpu(wc->ex.imm_data) >> (32 - jdev->index_bits);
+	struct rv_event ev = { 0 };
+
+	ev.event_type = RV_WC_RECV_RDMA_WITH_IMM;
+	ev.wc.status = wc->status;
+	ev.wc.resv1 = 0;
+	ev.wc.imm_data = be32_to_cpu(wc->ex.imm_data);
+	ev.wc.wr_id = 0;	/* N/A */
+	ev.wc.conn_handle = (u64)sconn->parent;
+	ev.wc.byte_len = wc->byte_len;
+	ev.wc.resv2 = 0;
+	trace_rv_event_recv_write(ev.event_type, ev.wc.status, ev.wc.imm_data,
+				  ev.wc.wr_id, ev.wc.conn_handle,
+				  ev.wc.byte_len);
+
+	rv_post_user_event_by_index(jdev, index, &ev);
+}
+
 /* drain_lock makes sure no recv WQEs get reposted after a drain WQE */
 void rv_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
@@ -93,6 +404,8 @@ void rv_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 
 	if (unlikely(wc->opcode != IB_WC_RECV_RDMA_WITH_IMM))
 		rv_report_cqe_error(cq, wc, sconn, "Recv bad opcode");
+	else
+		rv_recv_rdma_write(sconn, wc);
 repost:
 	spin_lock_irqsave(&sconn->drain_lock, flags);
 	if (likely(!test_bit(RV_SCONN_DRAINING, &sconn->flags)))
diff --git a/drivers/infiniband/ulp/rv/trace_rdma.h b/drivers/infiniband/ulp/rv/trace_rdma.h
index 54ea0cf2599c..e662b2246d8c 100644
--- a/drivers/infiniband/ulp/rv/trace_rdma.h
+++ b/drivers/infiniband/ulp/rv/trace_rdma.h
@@ -11,9 +11,74 @@
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM rv_rdma
 
+#define RV_PEND_WRITE_PRN "user_inx %d sconn %p pend_wr %p loc_addr 0x%llx" \
+			  " rkey 0x%x rem_addr 0x%llx len 0x%llx immed 0x%x" \
+			  " wr_id 0x%llx"
+
 #define RV_SCONN_RECV_PRN "sconn %p index %u qp 0x%x conn %p flags 0x%x " \
 			  " state %u immed 0x%x"
 
+#define RV_EVENT_PRN "type 0x%x status 0x%x immed 0x%x wr_id 0x%llx " \
+		     "conn_handle 0x%llx len 0x%x"
+
+DECLARE_EVENT_CLASS(/* pend_write */
+	rv_pend_write_template,
+	TP_PROTO(int user_inx, void *sconn, void *pend_wr, u64 loc_addr,
+		 u32 rkey, u64 rem_addr, u64 len, u32 immed, u64 wr_id),
+	TP_ARGS(user_inx, sconn, pend_wr, loc_addr, rkey, rem_addr, len, immed,
+		wr_id),
+	TP_STRUCT__entry(/* entry */
+		__field(int, user_inx)
+		__field(void *, sconn)
+		__field(void *, pend_wr)
+		__field(u64, loc_addr)
+		__field(u32, rkey)
+		__field(u64, rem_addr)
+		__field(u64, len)
+		__field(u32, immed)
+		__field(u64, wr_id)
+	),
+	TP_fast_assign(/* assign */
+		__entry->user_inx = user_inx;
+		__entry->sconn = sconn;
+		__entry->pend_wr = pend_wr;
+		__entry->loc_addr = loc_addr;
+		__entry->rkey = rkey;
+		__entry->rem_addr = rem_addr;
+		__entry->len = len;
+		__entry->immed = immed;
+		__entry->wr_id = wr_id;
+	),
+	TP_printk(/* print */
+		RV_PEND_WRITE_PRN,
+		__entry->user_inx,
+		__entry->sconn,
+		__entry->pend_wr,
+		__entry->loc_addr,
+		__entry->rkey,
+		__entry->rem_addr,
+		__entry->len,
+		__entry->immed,
+		__entry->wr_id
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_pend_write_template, rv_pend_write_post,
+	TP_PROTO(int user_inx, void *sconn, void *pend_wr, u64 loc_addr,
+		 u32 rkey, u64 rem_addr, u64 len, u32 immed, u64 wr_id),
+	TP_ARGS(user_inx, sconn, pend_wr, loc_addr, rkey, rem_addr, len, immed,
+		wr_id)
+);
+
+DEFINE_EVENT(/* event */
+	rv_pend_write_template, rv_pend_write_done,
+	TP_PROTO(int user_inx, void *sconn, void *pend_wr, u64 loc_addr,
+		 u32 rkey, u64 rem_addr, u64 len, u32 immed, u64 wr_id),
+	TP_ARGS(user_inx, sconn, pend_wr, loc_addr, rkey, rem_addr, len, immed,
+		wr_id)
+);
+
 DECLARE_EVENT_CLASS(/* recv */
 	rv_sconn_recv_template,
 	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
@@ -77,6 +142,59 @@ DEFINE_EVENT(/* event */
 	TP_ARGS(ptr, index, qp_num, conn, flags, state, immed)
 );
 
+DECLARE_EVENT_CLASS(/* event */
+	rv_event_template,
+	TP_PROTO(u8 type, u8 status, u32 immed, u64 wr_id, u64 conn_handle,
+		 u32 len),
+	TP_ARGS(type, status, immed, wr_id, conn_handle, len),
+	TP_STRUCT__entry(/* entry */
+		__field(u8, type)
+		__field(u8, status)
+		__field(u32, immed)
+		__field(u64, wr_id)
+		__field(u64, conn_handle)
+		__field(u32, len)
+	),
+	TP_fast_assign(/* assign */
+		__entry->type = type;
+		__entry->status = status;
+		__entry->immed = immed;
+		__entry->wr_id = wr_id;
+		__entry->conn_handle = conn_handle;
+		__entry->len = len;
+	),
+	TP_printk(/* print */
+		RV_EVENT_PRN,
+		__entry->type,
+		__entry->status,
+		__entry->immed,
+		__entry->wr_id,
+		__entry->conn_handle,
+		__entry->len
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_event_template, rv_event_write_done,
+	TP_PROTO(u8 type, u8 status, u32 immed, u64 wr_id, u64 conn_handle,
+		 u32 len),
+	TP_ARGS(type, status, immed, wr_id, conn_handle, len)
+);
+
+DEFINE_EVENT(/* event */
+	rv_event_template, rv_event_post,
+	TP_PROTO(u8 type, u8 status, u32 immed, u64 wr_id, u64 conn_handle,
+		 u32 len),
+	TP_ARGS(type, status, immed, wr_id, conn_handle, len)
+);
+
+DEFINE_EVENT(/* event */
+	rv_event_template, rv_event_recv_write,
+	TP_PROTO(u8 type, u8 status, u32 immed, u64 wr_id, u64 conn_handle,
+		 u32 len),
+	TP_ARGS(type, status, immed, wr_id, conn_handle, len)
+);
+
 DECLARE_EVENT_CLASS(/* wc */
 	rv_wc_template,
 	TP_PROTO(u64 wr_id, u32 status, u32 opcode, u32 byte_len,
@@ -113,6 +231,13 @@ DEFINE_EVENT(/* event */
 	TP_ARGS(wr_id, status, opcode, byte_len, imm_data)
 );
 
+DEFINE_EVENT(/* event */
+	rv_wc_template, rv_wc_write_done,
+	TP_PROTO(u64 wr_id, u32 status, u32 opcode, u32 byte_len,
+		 u32 imm_data),
+	TP_ARGS(wr_id, status, opcode, byte_len, imm_data)
+);
+
 DEFINE_EVENT(/* event */
 	rv_wc_template, rv_wc_hb_done,
 	TP_PROTO(u64 wr_id, u32 status, u32 opcode, u32 byte_len,
diff --git a/drivers/infiniband/ulp/rv/trace_user.h b/drivers/infiniband/ulp/rv/trace_user.h
index ce62c808ca10..90f67537d249 100644
--- a/drivers/infiniband/ulp/rv/trace_user.h
+++ b/drivers/infiniband/ulp/rv/trace_user.h
@@ -112,6 +112,37 @@ DEFINE_EVENT(/* event */
 	TP_ARGS(rv_inx, jdev, total_size, max_size, refcount)
 );
 
+DECLARE_EVENT_CLASS(/* user_ring */
+	rv_user_ring_template,
+	TP_PROTO(int rv_inx, u32 count, u32 hd, u32 tail),
+	TP_ARGS(rv_inx, count, hd, tail),
+	TP_STRUCT__entry(/* entry */
+		__field(int, rv_inx)
+		__field(u32, count)
+		__field(u32, head)
+		__field(u32, tail)
+	),
+	TP_fast_assign(/* assign */
+		__entry->rv_inx = rv_inx;
+		__entry->count = count;
+		__entry->head = hd;
+		__entry->tail = tail;
+	),
+	TP_printk(/* print */
+		"rv_inx %d entries %u head %u tail %u",
+		__entry->rv_inx,
+		__entry->count,
+		__entry->head,
+		__entry->tail
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_ring_template, rv_user_ring_post_event,
+	TP_PROTO(int rv_inx, u32 count, u32 hd, u32 tail),
+	TP_ARGS(rv_inx, count, hd, tail)
+);
+
 #endif /* __RV_TRACE_USER_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH RFC 8/9] RDMA/rv: Add functions for file operations
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
                   ` (6 preceding siblings ...)
  2021-03-19 12:56 ` [PATCH RFC 7/9] RDMA/rv: Add functions for RDMA transactions kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 12:56 ` [PATCH RFC 9/9] RDMA/rv: Integrate the file operations into the rv module kaike.wan
  2021-03-19 13:53 ` [PATCH RFC 0/9] A rendezvous module Jason Gunthorpe
  9 siblings, 0 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

A process communicates with the rv module through the file interface
(a condensed user-space sketch follows this list):
- The process opens the /dev/rv device file.
- The process sends an Attach request to bind to an RDMA device.
- The process registers a number of user/kernel memory regions (MR).
- The process sends a Create Conn request to create a connection between
  two nodes. If the local node is the server, it will start listening
  to IB CM for any incoming connection requests.
- The process sends a Connect request to start the connection. For a
  server node, this is a no-op; for a client node, it sends the IB CM
  connection request.
- The process waits for all connections to be established by polling
  the rv module. Receive buffers will be posted.
- The process mmaps the event ring into user space.
- The process starts RDMA transactions by sending RDMA write with
  immediate requests to the rv module. Send completion events will be
  posted to the event ring.
- On the responder side, RDMA write with immediate requests will be
  received and receive completion events will be posted to the event
  ring buffer.
- The process will poll the event ring for completion events.
- When RDMA transactions are done, the process deregisters the memory
  regions.
- The process closes the file. During the close, any connections will
  be torn down, and the RDMA device will be detached if there are no
  more users. Explicit detach and disconnect requests are not required.
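
The same flow, condensed into user-space C as a sketch only: the ioctl
request names and parameter structs below are placeholders rather than
the actual rv ABI, and error handling is omitted:

  int fd = open("/dev/rv", O_RDWR);

  ioctl(fd, RV_IOCTL_ATTACH, &attach_params);      /* bind to RDMA device    */
  ioctl(fd, RV_IOCTL_REG_MEM, &mr_params);         /* register memory region */
  ioctl(fd, RV_IOCTL_CONN_CREATE, &create_params); /* server starts listening*/
  ioctl(fd, RV_IOCTL_CONNECT, &connect_params);    /* client sends IB CM REQ */
  /* ... poll until all connections report connected ... */

  ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  ioctl(fd, RV_IOCTL_POST_WRITE_IMMED, &wr_params); /* RDMA write w/ immed   */
  /* ... consume send/recv completion events from the mmap'ed ring ... */

  ioctl(fd, RV_IOCTL_DEREG_MEM, &mr_params);
  close(fd);   /* tears down connections, detaches the device if unused */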

Technically, the MR registration, RDMA transactions, and MR
deregistration occur for every application IO.

This patch adds the functions for the file operations and integrates
them with memory region registration, connection management, and RDMA
transactions.

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 drivers/infiniband/ulp/rv/rv_file.c    | 1040 ++++++++++++++++++++++++
 drivers/infiniband/ulp/rv/rv_main.c    |    2 +-
 drivers/infiniband/ulp/rv/trace_conn.h |   18 +
 drivers/infiniband/ulp/rv/trace_user.h |  169 ++++
 4 files changed, 1228 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/rv/rv_file.c b/drivers/infiniband/ulp/rv/rv_file.c
index 9d23503a30d9..e10bcdee11f5 100644
--- a/drivers/infiniband/ulp/rv/rv_file.c
+++ b/drivers/infiniband/ulp/rv/rv_file.c
@@ -9,11 +9,58 @@
 #include "rv.h"
 #include "trace.h"
 
+static unsigned long service_id = RV_DFLT_SERVICE_ID;
+
+module_param(service_id, ulong, 0444);
+MODULE_PARM_DESC(service_id, "Default service_id for IB CM QP connections");
+
+static unsigned int num_conn = 4;
+
+module_param(num_conn, uint, 0444);
+MODULE_PARM_DESC(num_conn, "Default # QPs between each pair of nodes");
+
+/*
+ * these are per node to node connection.
+ *
+ * conservative for now, we should service CQEs fast enough that smaller numbers
+ * would work, however at 100 remote nodes (hence 100 connections), 400,000
+ * CQEs at say 64B each, is only 50MB including send and recv
+ *
+ * A given PSM receiver process
+ * will not allow any more than HFI_TF_NFLOWS (32) inflight RDMA's coming
+ * toward it. So if we have ~100 processes per node, that limits inflight to
+ * 3200 coming toward us.
+ * while a given sender can have many in flight, the total a given destination
+ * process will allow is HFI_TF_NFLOWS (32), so size this same as recv CQ
+ * same reasoning applies to QP sizes
+ */
+#define RV_Q_DEPTH 4000 /* some headroom */
+
+static unsigned int q_depth = RV_Q_DEPTH;
+
+module_param(q_depth, uint, 0444);
+MODULE_PARM_DESC(q_depth, "Default size of queues per remote node");
+
+static int rv_file_mmap(struct file *fp, struct vm_area_struct *vma);
+static void rv_user_ring_free(struct rv_user_ring *ring);
+static void rv_user_detach_all(struct rv_user *rv);
+
+static atomic_t seq;
 /* A workqueue for all */
 static struct workqueue_struct *rv_wq;
 static struct workqueue_struct *rv_wq2;
 static struct workqueue_struct *rv_wq3;
 
+/* Device file access  */
+struct rv_devdata {
+	struct class *class;
+	dev_t dev;
+	struct cdev user_cdev;
+	struct device user_device;
+};
+
+static struct rv_devdata *rv_dd;
+
 /*
  * We expect relatively few jobs per node (typically 1)
  * and relatively few devices per node (typically 1 to 8)
@@ -24,7 +71,9 @@ static struct workqueue_struct *rv_wq3;
  * mutex avoids duplicate get_alloc adds, RCU protects list access.
  * See rv.h comments about "get_alloc" for more information.
  */
+static struct mutex rv_job_dev_list_mutex;
 static struct list_head rv_job_dev_list;
+static atomic_t rv_job_dev_cnt;
 
 void rv_queue_work(struct work_struct *work)
 {
@@ -46,16 +95,268 @@ void rv_flush_work2(void)
 	flush_workqueue(rv_wq2);
 }
 
+static int doit_query(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_query_params_out params = { 0 };
+	int ret = 0;
+
+	params.major_rev = RV_ABI_VER_MAJOR;
+	params.minor_rev = RV_ABI_VER_MINOR;
+	params.capability = 0;
+
+	if (copy_to_user((void __user *)arg, &params, sizeof(params)))
+		ret = -EFAULT;
+
+	return ret;
+}
+
+/*
+ * only for use by rv_job_dev_get_alloc,
+ * all other callers must use rv_job_dev_get_alloc or rv_job_dev_get
+ * user_array is sized at end of rv_job_dev
+ * When num_conn>1 we will stripe across all sconn, so
+ * any given sconn's qp_depth can be smaller. However, this striping can be
+ * imperfect since not every IO is the same size. We allocate an
+ * extra 5% to minimize skipping sconns on send scheduling
+ */
+static struct rv_job_dev *
+rv_job_dev_alloc(int rv_inx, struct rv_device *dev,
+		 const struct rv_attach_params *params)
+{
+	int ret;
+	struct rv_job_dev *jdev;
+	int max_users = (1 << params->in.index_bits);
+
+	jdev = kzalloc(sizeof(*jdev) + sizeof(jdev->user_array[0]) * max_users,
+		       GFP_KERNEL);
+	if (!jdev) {
+		ret = -ENOMEM;
+		goto bail;
+	}
+
+	jdev->kuid = current_uid();
+	jdev->uid = from_kuid(current_user_ns(), jdev->kuid);
+	jdev->max_users = max_users;
+	spin_lock_init(&jdev->user_array_lock);
+	kref_init(&jdev->kref);
+	kref_init(&jdev->user_kref);
+	mutex_init(&jdev->conn_list_mutex);
+	INIT_LIST_HEAD(&jdev->conn_list);
+	jdev->dev = dev;
+	jdev->pd = ib_alloc_pd(jdev->dev->ib_dev, 0);
+	if (IS_ERR(jdev->pd)) {
+		rv_err(rv_inx, "Could not allocate PD\n");
+		ret = PTR_ERR(jdev->pd);
+		goto bail_free;
+	}
+
+	memcpy(jdev->dev_name, params->in.dev_name, sizeof(jdev->dev_name));
+	jdev->port_num = params->in.port_num;
+	jdev->num_conn = params->in.num_conn;
+	jdev->index_bits = params->in.index_bits;
+	jdev->loc_gid_index = params->in.loc_gid_index;
+	jdev->loc_addr = params->in.loc_addr;
+	memcpy(jdev->loc_gid, params->in.loc_gid, sizeof(jdev->loc_gid));
+	memcpy(jdev->job_key, params->in.job_key, sizeof(jdev->job_key));
+	jdev->job_key_len = params->in.job_key_len;
+	jdev->service_id =  params->in.service_id;
+	jdev->q_depth =  params->in.q_depth;
+	jdev->qp_depth = (jdev->q_depth + jdev->num_conn - 1) / jdev->num_conn;
+	if (jdev->num_conn > 1)
+		jdev->qp_depth += jdev->qp_depth / 20;
+	jdev->reconnect_timeout =  params->in.reconnect_timeout;
+	jdev->hb_interval =  params->in.hb_interval;
+	jdev->sgid_attr = rdma_get_gid_attr(jdev->dev->ib_dev, jdev->port_num,
+					    jdev->loc_gid_index);
+	if (!jdev->sgid_attr) {
+		rv_err(rv_inx, "can't resolve sgid_attr\n");
+		ret = -ENOENT;
+		goto bail_mr;
+	}
+	if (memcmp(&jdev->loc_gid, &jdev->sgid_attr->gid,
+		   sizeof(jdev->loc_gid))) {
+		rv_err(rv_inx, "sgid_attr gid and loc_gid mismatch\n");
+		ret = -ENOENT;
+		goto bail_put;
+	}
+	trace_rv_jdev_alloc(jdev, jdev->dev_name, jdev->num_conn,
+			    jdev->index_bits, jdev->loc_gid_index,
+			    jdev->loc_addr, jdev->job_key_len,
+			    jdev->job_key, jdev->service_id,
+			    jdev->q_depth, jdev->user_array_next,
+			    kref_read(&jdev->kref));
+	return jdev;
+
+bail_put:
+	rdma_put_gid_attr(jdev->sgid_attr);
+bail_mr:
+	ib_dealloc_pd(jdev->pd);
+bail_free:
+	kfree(jdev);
+bail:
+	return ERR_PTR(ret);
+}
+
+static int rv_job_dev_consistent(struct rv_job_dev *jdev,
+				 const struct rv_attach_params *params)
+{
+	return (params->in.num_conn == jdev->num_conn &&
+		params->in.index_bits == jdev->index_bits &&
+		params->in.loc_gid_index == jdev->loc_gid_index &&
+		params->in.loc_addr == jdev->loc_addr &&
+		!memcmp(jdev->loc_gid, params->in.loc_gid,
+			   sizeof(jdev->loc_gid)) &&
+		params->in.service_id == jdev->service_id &&
+		params->in.q_depth == jdev->q_depth &&
+		params->in.reconnect_timeout == jdev->reconnect_timeout &&
+		params->in.hb_interval == jdev->hb_interval);
+}
+
+static void rv_job_dev_list_init(void)
+{
+	mutex_init(&rv_job_dev_list_mutex);
+	INIT_LIST_HEAD(&rv_job_dev_list);
+	atomic_set(&rv_job_dev_cnt, 0);
+}
+
+static void rv_job_dev_user_release(struct kref *kref)
+{
+	struct rv_job_dev *jdev = container_of(kref, struct rv_job_dev,
+					       user_kref);
+
+	mutex_lock(&rv_job_dev_list_mutex);
+	list_del_rcu(&jdev->job_dev_entry);
+	mutex_unlock(&rv_job_dev_list_mutex);
+}
+
+static void rv_job_dev_put_user(struct rv_job_dev *jdev)
+{
+	kref_put(&jdev->user_kref, rv_job_dev_user_release);
+}
+
+/*
+ * get a job_dev matching the given ATTACH.  If none is found, create one
+ * The job_dev returned must be released with rv_job_dev_put when done using.
+ * Get device 1st to reduce lock nesting.  Device search should be quick.
+ * While searching for jdev, likely to have more devs than jobs, so filter on
+ * dev 1st.  job_key_len can be 0, which matches only jobs with job_key_len==0
+ * Ideally each job should have a unique job_key (really just a job identifier),
+ * but all jobs or processes with the same job_key must have the same params.
+ */
+static struct rv_job_dev *rv_job_dev_get_alloc(struct rv_user *rv,
+					       struct rv_attach_params *params)
+{
+	struct rv_job_dev *jdev;
+	struct rv_device *dev;
+
+	dev = rv_device_get_add_user(params->in.dev_name, rv);
+	if (!dev) {
+		rv_err(rv->inx, "attach: KERNEL ib_dev %s not found\n",
+		       params->in.dev_name);
+		return ERR_PTR(-ENODEV);
+	}
+
+	mutex_lock(&rv_job_dev_list_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(jdev, &rv_job_dev_list, job_dev_entry) {
+		if (!uid_eq(jdev->kuid, current_uid()) ||
+		    dev != jdev->dev ||
+		    params->in.port_num != jdev->port_num ||
+		    params->in.job_key_len != jdev->job_key_len ||
+		    (params->in.job_key_len &&
+		     memcmp(params->in.job_key, jdev->job_key,
+			    jdev->job_key_len)))
+			continue;
+		if (!kref_get_unless_zero(&jdev->kref))
+			continue;
+		if (!kref_get_unless_zero(&jdev->user_kref)) {
+			rv_job_dev_put(jdev);
+			continue;
+		}
+		rcu_read_unlock();
+		if (!rv_job_dev_consistent(jdev, params)) {
+			mutex_unlock(&rv_job_dev_list_mutex);
+			rv_job_dev_put_user(jdev);
+			rv_job_dev_put(jdev);
+			jdev = ERR_PTR(-EBUSY);
+			goto bail_put;
+		}
+		mutex_unlock(&rv_job_dev_list_mutex);
+		rv_device_put(dev);
+		return jdev;
+	}
+	rcu_read_unlock();
+	jdev = rv_job_dev_alloc(rv->inx, dev, params);
+	if (IS_ERR(jdev))
+		goto bail_unlock;
+
+	list_add_tail_rcu(&jdev->job_dev_entry, &rv_job_dev_list);
+	atomic_inc(&rv_job_dev_cnt);
+
+	mutex_unlock(&rv_job_dev_list_mutex);
+	return jdev;
+
+bail_unlock:
+	mutex_unlock(&rv_job_dev_list_mutex);
+bail_put:
+	rv_device_del_user(rv);
+	rv_device_put(dev);
+	return jdev;
+}
+
 void rv_job_dev_get(struct rv_job_dev *jdev)
 {
 	kref_get(&jdev->kref);
 }
 
+struct rv_dest_pd_work_item {
+	struct work_struct destroy_work;
+	struct ib_pd *pd;
+	struct rv_device *dev;
+};
+
+static void rv_handle_destroy_pd(struct work_struct *work)
+{
+	struct rv_dest_pd_work_item *item = container_of(work,
+				struct rv_dest_pd_work_item, destroy_work);
+
+	flush_workqueue(rv_wq2);
+	ib_dealloc_pd(item->pd);
+	rv_device_put(item->dev);
+	kfree(item);
+}
+
 static void rv_job_dev_release(struct kref *kref)
 {
 	struct rv_job_dev *jdev = container_of(kref, struct rv_job_dev, kref);
+	struct rv_dest_pd_work_item *item;
+
+	trace_rv_jdev_release(jdev, jdev->dev_name, jdev->num_conn,
+			      jdev->index_bits, jdev->loc_gid_index,
+			      jdev->loc_addr, jdev->job_key_len,
+			      jdev->job_key, jdev->service_id,
+			      jdev->q_depth, jdev->user_array_next,
+			      kref_read(&jdev->kref));
 
+	WARN_ON(!list_empty(&jdev->conn_list)); /* RCU safe */
+
+	if (jdev->listener)
+		rv_listener_put(jdev->listener);
+	rdma_put_gid_attr(jdev->sgid_attr);
+	item = kzalloc(sizeof(*item), GFP_KERNEL);
+	if (item) {
+		INIT_WORK(&item->destroy_work, rv_handle_destroy_pd);
+		item->pd = jdev->pd;
+		item->dev = jdev->dev;
+		jdev->pd = NULL;
+		jdev->dev = NULL;
+		rv_queue_work3(&item->destroy_work);
+	} else {
+		ib_dealloc_pd(jdev->pd);
+		rv_device_put(jdev->dev);
+	}
 	kfree_rcu(jdev, rcu);
+	atomic_dec(&rv_job_dev_cnt);
 }
 
 void rv_job_dev_put(struct rv_job_dev *jdev)
@@ -63,6 +364,316 @@ void rv_job_dev_put(struct rv_job_dev *jdev)
 	kref_put(&jdev->kref, rv_job_dev_release);
 }
 
+/*
+ * make a bi-directional linkage between rv_user and rv_job_dev
+ * Each rv_user is assigned a unique index within its job_dev.
+ * This will be placed in RDMA immediate data on
+ * remote node so in recv CQE we can figure out which rv to deliver
+ * the RDMA w/immediate recv CQE event to
+ */
+static int rv_job_dev_add_user(struct rv_job_dev *jdev, struct rv_user *rv)
+{
+	unsigned long flags;
+	int i;
+	struct rv_user **jentry;
+	u32 next;
+
+	spin_lock_irqsave(&jdev->user_array_lock, flags);
+	next = jdev->user_array_next;
+	jentry = &jdev->user_array[next];
+	for (i = 0; i < jdev->max_users && *jentry; i++) {
+		if (++next >= jdev->max_users) {
+			next = 0;
+			jentry = &jdev->user_array[0];
+		} else {
+			jentry++;
+		}
+	}
+	if (i >= jdev->max_users) {
+		i = -ENOMEM;
+		goto unlock;
+	}
+	i = next;
+	if (++next >= jdev->max_users)
+		next = 0;
+	jdev->user_array_next = next;
+	*jentry = rv;
+	rv->index = i;
+
+unlock:
+	spin_unlock_irqrestore(&jdev->user_array_lock, flags);
+	return i;
+}
+
+/* break the bi-directional linkage between rv_user and rv_job_dev */
+static void rv_job_dev_del_user(struct rv_job_dev *jdev, struct rv_user *rv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&jdev->user_array_lock, flags);
+	WARN_ON(rv->index >= jdev->max_users);
+	WARN_ON(jdev->user_array[rv->index] != rv);
+	jdev->user_array[rv->index] = NULL;
+	rv->index = RV_INVALID;
+	spin_unlock_irqrestore(&jdev->user_array_lock, flags);
+}
+
+/* attach for rdma_mode KERNEL */
+static int rv_user_attach_kernel(struct rv_user *rv,
+				 struct rv_attach_params *params)
+{
+	struct rv_job_dev *jdev;
+	int ret;
+
+	jdev = rv_job_dev_get_alloc(rv, params);
+	if (IS_ERR(jdev))
+		return PTR_ERR(jdev);
+
+	rv->jdev = jdev;
+	rv->context = params->in.context;
+	rv->cq_entries = params->in.cq_entries;
+
+	ret = rv_job_dev_add_user(jdev, rv);
+	if (ret < 0) {
+		rv->jdev = NULL;
+		rv->context = 0;
+		rv->cq_entries = 0;
+		rv_job_dev_put_user(jdev);
+		rv_job_dev_put(jdev);
+		return ret;
+	}
+	return 0;
+}
+
+static void rv_user_detach_kernel(struct rv_user *rv)
+{
+	if (!rv->jdev)
+		return;
+	rv_job_dev_del_user(rv->jdev, rv);
+	rv_job_dev_put(rv->jdev);
+}
+
+/*
+ * Attach an rv_user to a jdev
+ * Note hb_interval must be less than reconnect_timeout otherwise listener
+ * could timeout before client side discovers it must reconnect
+ * To avoid deadlock rv_user_mrs_alloc must be called without rv->mutex
+ * because it will acquire mm->mmap_lock.
+ */
+static int doit_attach(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_attach_params params;
+	int ret;
+	struct rv_user_mrs *umrs;
+	u32 reconnect_timeout = 0;
+	u32 depth_out = 0;
+
+	if (copy_from_user(&params.in, (void __user *)arg, sizeof(params.in)))
+		return -EFAULT;
+
+	trace_rv_attach_req(rv->inx, params.in.dev_name, params.in.rdma_mode,
+			    params.in.port_num, params.in.loc_addr,
+			    params.in.job_key_len, params.in.job_key,
+			    params.in.service_id,
+			    params.in.cq_entries, params.in.q_depth,
+			    params.in.reconnect_timeout,
+			    params.in.hb_interval);
+	if (!params.in.dev_name[0] ||
+	    strnlen(params.in.dev_name, RV_MAX_DEV_NAME_LEN) >=
+		    RV_MAX_DEV_NAME_LEN) {
+		rv_err(rv->inx,
+		       "attach: dev_name empty or not nul terminated\n");
+		return -EINVAL;
+	}
+	if (params.in.rdma_mode > RV_RDMA_MODE_MAX) {
+		rv_err(rv->inx, "attach: rdma_mode invalid\n");
+		return -EINVAL;
+	}
+	if (params.in.rdma_mode == RV_RDMA_MODE_KERNEL) {
+		if (!params.in.port_num) {
+			rv_err(rv->inx, "attach: port_num invalid\n");
+			return -EINVAL;
+		}
+		if (params.in.num_conn > RV_MAX_NUM_CONN) {
+			rv_err(rv->inx,
+			       "attach: num_conn too large %d max %d\n",
+			       params.in.num_conn, RV_MAX_NUM_CONN);
+			return -EINVAL;
+		}
+		if (params.in.index_bits > RV_MAX_INDEX_BITS) {
+			rv_err(rv->inx,
+			       "attach: index_bits too large %d max %d\n",
+			       params.in.index_bits, RV_MAX_INDEX_BITS);
+			return -EINVAL;
+		}
+		if (params.in.job_key_len > RV_MAX_JOB_KEY_LEN) {
+			rv_err(rv->inx,
+			       "attach: job_key too large %u max %u\n",
+			       params.in.job_key_len, RV_MAX_JOB_KEY_LEN);
+			return -EINVAL;
+		}
+		if (params.in.cq_entries > RV_MAX_CQ_ENTRIES) {
+			rv_err(rv->inx,
+			       "attach: cq_entries too large %d max %d\n",
+			       params.in.cq_entries, RV_MAX_CQ_ENTRIES);
+			return -EINVAL;
+		}
+	} else if (!enable_user_mr) {
+		rv_err(rv->inx, "attach: rdma_mode user disabled\n");
+		return -EINVAL;
+	}
+
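+	/* apply driver defaults for any zero-valued inputs */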
+	if (!params.in.num_conn)
+		params.in.num_conn = num_conn;
+	if (!params.in.num_conn)
+		params.in.num_conn = 1;
+	if (!params.in.service_id)
+		params.in.service_id = service_id;
+	if (!params.in.q_depth)
+		params.in.q_depth = q_depth;
+	if (params.in.reconnect_timeout &&
+	    params.in.reconnect_timeout * 1000 <= params.in.hb_interval) {
+		rv_err(rv->inx,
+		       "reconnect_timeout (%u secs) < hb_interval (%u msec)\n",
+		       params.in.reconnect_timeout, params.in.hb_interval);
+		return -EINVAL;
+	}
+
+	umrs = rv_user_mrs_alloc(rv, params.in.mr_cache_size);
+	if (IS_ERR(umrs))
+		return PTR_ERR(umrs);
+
+	mutex_lock(&rv->mutex);
+	if (rv->attached || rv->was_attached) {
+		rv_err(rv->inx, "attach: already attached to device\n");
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	rv->rdma_mode = params.in.rdma_mode;
+
+	switch (rv->rdma_mode) {
+	case RV_RDMA_MODE_USER:
+		rv->dev = rv_device_get_add_user(params.in.dev_name, rv);
+		if (!rv->dev) {
+			rv_err(rv->inx, "attach: USER ib_dev %s not found\n",
+			       params.in.dev_name);
+			ret = -ENODEV;
+			goto unlock;
+		}
+		rv->index = RV_INVALID; /* N/A */
+		depth_out = 0; /* N/A */
+		reconnect_timeout = 0; /* N/A */
+		break;
+	case RV_RDMA_MODE_KERNEL:
+		ret = rv_user_attach_kernel(rv, &params);
+		if (ret) {
+			rv_err(rv->inx, "attach: for kernel mode\n");
+			goto unlock;
+		}
+		depth_out = rv->jdev->q_depth;
+		reconnect_timeout = rv->jdev->reconnect_timeout;
+		break;
+	}
+
+	rv->umrs = umrs;
+	rv_user_mrs_attach(rv);
+	params.out.rv_index = rv->index;
+	params.out.mr_cache_size = umrs->cache.max_size / (1024 * 1024);
+	params.out.q_depth = depth_out;
+	params.out.reconnect_timeout = reconnect_timeout;
+	if (copy_to_user((void __user *)arg, &params.out, sizeof(params.out))) {
+		ret = -EFAULT;
+		goto bail_detach;
+	}
+
+	/* set attached last */
+	rv->attached = 1;
+	rv->was_attached = 1;
+	trace_rv_user_attach(rv->inx, rv->rdma_mode, rv->attached,
+			     params.in.dev_name, rv->cq_entries, rv->index);
+	mutex_unlock(&rv->mutex);
+
+	return 0;
+
+bail_detach:
+	rv_device_del_user(rv);
+	rv_user_detach_all(rv);
+	mutex_unlock(&rv->mutex);
+	return ret;
+unlock:
+	mutex_unlock(&rv->mutex);
+	rv_user_mrs_put(umrs);
+	return ret;
+}
+
+/*
+ * Detach everything we find, for USER or KERNEL rdma_mode.
+ * Must hold rv->mutex before calling.
+ * Once rv_user.attached is 0, rv_user.dev/jdev is no longer valid, so
+ * we are paranoid and detach everything we find, even for USER mode.
+ * We wait for umrs->kref to reach 1 to ensure all pending writes have put
+ * their MRs.  This way our final umrs_put will free the mr_cache.
+ * Note: other callers of rv_user_mrs_put don't hold rv->mutex.
+ */
+static void rv_user_detach_all(struct rv_user *rv)
+{
+	struct rv_conn *conn;
+	unsigned long sleep_time = msecs_to_jiffies(100);
+
+	XA_STATE(xas, &rv->conn_xa, 0);
+
+	trace_rv_msg_detach_all(rv->inx, "rv_user_detach_all", 0, 0);
+	rv->attached = 0;
+	if (rv->rdma_mode == RV_RDMA_MODE_KERNEL && rv->jdev)
+		rv_job_dev_put_user(rv->jdev);
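+	/* remove each connection from the xarray and drop its reference */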
+	xas_for_each(&xas, conn, UINT_MAX) {
+		trace_rv_msg_uconn_remove(rv->inx, "rv_user remove uconn",
+					  (u64)conn, 0);
+		xas_store(&xas, NULL);
+		rv_conn_put(conn);
+	}
+
+	if (rv->umrs) {
+		while (kref_read(&rv->umrs->kref) > 1)
+			schedule_timeout_interruptible(sleep_time);
+
+		rv_user_mrs_put(rv->umrs);
+		rv->umrs = NULL;
+		flush_workqueue(rv_wq);
+	}
+
+	if (rv->rdma_mode == RV_RDMA_MODE_USER && rv->dev) {
+		rv_device_put(rv->dev);
+		rv->dev = NULL;
+	} else if (rv->rdma_mode == RV_RDMA_MODE_KERNEL && rv->jdev) {
+		rv_user_detach_kernel(rv);
+		rv->jdev = NULL;
+	}
+
+	rv->rdma_mode = RV_RDMA_MODE_USER;
+
+	complete(&rv->compl);
+}
+
+/* Other cleanup at file close time. Must hold rv->mutex. */
+static void rv_user_cleanup(struct rv_user *rv)
+{
+	if (rv->cqr) {
+		trace_rv_msg_cleanup(rv->inx, "freeing event ring",
+				     (u64)rv->cqr, 0);
+		rv_user_ring_free(rv->cqr);
+		rv->cqr = NULL;
+	}
+}
+
+void rv_detach_user(struct rv_user *rv)
+{
+	mutex_lock(&rv->mutex);
+	rv_user_detach_all(rv);
+	mutex_unlock(&rv->mutex);
+}
+
 /*
  * confirm that we expected a REQ from this remote node on this port.
  * Note CM swaps src vs dest so dest is remote node here
@@ -154,3 +765,432 @@ rv_find_sconn_from_req(struct ib_cm_id *id,
 
 	return sconn;
 }
+
+/* Technically don't need rv_user.mutex, but play it safe */
+static int doit_get_cache_stats(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_cache_stats_params_out params;
+	int ret = 0;
+	unsigned long flags;
+
+	mutex_lock(&rv->mutex);
+
+	memset(&params, 0, sizeof(params));
+	if (rv->attached && rv->umrs) {
+		struct rv_mr_cache *cache = &rv->umrs->cache;
+
+		spin_lock_irqsave(&cache->lock, flags);
+
+		params.cache_size = cache->total_size;
+		params.max_cache_size = cache->stats.max_cache_size;
+		params.limit_cache_size = cache->max_size / (1024 * 1024);
+		params.count = cache->stats.count;
+		params.max_count = cache->stats.max_count;
+		params.inuse = cache->stats.inuse;
+		params.max_inuse = cache->stats.max_inuse;
+		params.inuse_bytes = cache->stats.inuse_bytes;
+		params.max_inuse_bytes = cache->stats.max_inuse_bytes;
+		params.max_refcount = cache->stats.max_refcount;
+		params.hit = cache->stats.hit;
+		params.miss = cache->stats.miss;
+		params.full = cache->stats.full;
+		params.failed = rv->umrs->stats.failed;
+		params.remove = cache->stats.remove;
+		params.evict = cache->stats.evict;
+
+		spin_unlock_irqrestore(&cache->lock, flags);
+	}
+
+	if (copy_to_user((void __user *)arg, &params, sizeof(params)))
+		ret = -EFAULT;
+
+	mutex_unlock(&rv->mutex);
+
+	return ret;
+}
+
+/* Technically don't need rv_user.mutex, but play it safe */
+static int doit_get_event_stats(struct rv_user *rv, unsigned long arg)
+{
+	struct rv_event_stats_params_out params;
+	int ret = 0;
+	unsigned long flags;
+
+	mutex_lock(&rv->mutex);
+
+	memset(&params, 0, sizeof(params));
+	if (rv->attached && rv->cqr) {
+		struct rv_user_ring *cqr = rv->cqr;
+
+		spin_lock_irqsave(&cqr->lock, flags);
+
+		params.send_write_cqe = cqr->stats.cqe[RV_WC_RDMA_WRITE];
+		params.send_write_cqe_fail =
+			cqr->stats.cqe_fail[RV_WC_RDMA_WRITE];
+		params.send_write_bytes = cqr->stats.bytes[RV_WC_RDMA_WRITE];
+
+		params.recv_write_cqe =
+			cqr->stats.cqe[RV_WC_RECV_RDMA_WITH_IMM];
+		params.recv_write_cqe_fail =
+			cqr->stats.cqe_fail[RV_WC_RECV_RDMA_WITH_IMM];
+		params.recv_write_bytes =
+			cqr->stats.bytes[RV_WC_RECV_RDMA_WITH_IMM];
+
+		spin_unlock_irqrestore(&cqr->lock, flags);
+	}
+
+	if (copy_to_user((void __user *)arg, &params, sizeof(params)))
+		ret = -EFAULT;
+
+	mutex_unlock(&rv->mutex);
+
+	return ret;
+}
+
+static int rv_file_open(struct inode *inode, struct file *fp)
+{
+	struct rv_user *rv;
+
+	rv = kzalloc(sizeof(*rv), GFP_KERNEL);
+	if (!rv)
+		return -ENOMEM;
+	mutex_init(&rv->mutex);
+	xa_init_flags(&rv->conn_xa, XA_FLAGS_ALLOC);
+	INIT_LIST_HEAD(&rv->user_entry);
+	init_completion(&rv->compl);
+	rv->inx = atomic_inc_return(&seq);
+	rv->index = RV_INVALID;
+	fp->private_data = rv;
+	trace_rv_user_open(rv->inx, rv->rdma_mode, rv->attached, "n/a",
+			   rv->cq_entries, rv->index);
+
+	return 0;
+}
+
+/*
+ * Remove rv_user from dev->user_list first. If it is already
+ * removed from dev->user_list by remove_one(), wait for the
+ * detach to finish.  Otherwise detach_all here will remove it.
+ */
+static int rv_file_close(struct inode *inode, struct file *fp)
+{
+	struct rv_user *rv = fp->private_data;
+
+	trace_rv_user_close(rv->inx, rv->rdma_mode, rv->attached, "n/a",
+			    rv->cq_entries, rv->index);
+
+	fp->private_data = NULL;
+
+	mutex_lock(&rv->mutex);
+	if (rv->attached && rv_device_del_user(rv))
+		goto unlock;
+
+	rv_user_detach_all(rv);
+unlock:
+	mutex_unlock(&rv->mutex);
+	wait_for_completion(&rv->compl);
+
+	mutex_lock(&rv->mutex);
+	rv_user_cleanup(rv);
+	mutex_unlock(&rv->mutex);
+	xa_destroy(&rv->conn_xa);
+	kfree(rv);
+
+	return 0;
+}
+
+static long rv_file_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
+{
+	struct rv_user *rv = fp->private_data;
+
+	trace_rv_ioctl(rv->inx, cmd);
+	switch (cmd) {
+	case RV_IOCTL_QUERY:
+		return doit_query(rv, arg);
+
+	case RV_IOCTL_ATTACH:
+		return doit_attach(rv, arg);
+
+	case RV_IOCTL_REG_MEM:
+		return doit_reg_mem(rv, arg);
+
+	case RV_IOCTL_DEREG_MEM:
+		return doit_dereg_mem(rv, arg);
+
+	case RV_IOCTL_GET_CACHE_STATS:
+		return doit_get_cache_stats(rv, arg);
+
+	case RV_IOCTL_CONN_CREATE:
+		return doit_conn_create(rv, arg);
+
+	case RV_IOCTL_CONN_CONNECT:
+		return doit_conn_connect(rv, arg);
+
+	case RV_IOCTL_CONN_CONNECTED:
+		return doit_conn_connected(rv, arg);
+
+	case RV_IOCTL_CONN_GET_CONN_COUNT:
+		return doit_conn_get_conn_count(rv, arg);
+
+	case RV_IOCTL_CONN_GET_STATS:
+		return doit_conn_get_stats(rv, arg);
+
+	case RV_IOCTL_GET_EVENT_STATS:
+		return doit_get_event_stats(rv, arg);
+
+	case RV_IOCTL_POST_RDMA_WR_IMMED:
+		return doit_post_rdma_write(rv, arg);
+
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct file_operations rv_file_ops = {
+	.owner = THIS_MODULE,
+	.open = rv_file_open,
+	.release = rv_file_close,
+	.unlocked_ioctl = rv_file_ioctl,
+	.llseek = noop_llseek,
+	.mmap = rv_file_mmap,
+};
+
+static void rv_dev_release(struct device *dev)
+{
+	kfree(rv_dd);
+	rv_dd = NULL;
+}
+
+static char *rv_devnode(struct device *dev, umode_t *mode)
+{
+	if (mode)
+		*mode = 0666;
+	return kasprintf(GFP_KERNEL, "%s", dev_name(dev));
+}
+
+int rv_file_init(void)
+{
+	int ret;
+	struct device *device;
+	struct cdev *cdev;
+
+	atomic_set(&seq, 0);
+	rv_job_dev_list_init();
+	rv_wq = alloc_workqueue("rv_wq",
+				WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE,
+				RV_CONN_MAX_ACTIVE_WQ_ENTRIES);
+	if (!rv_wq)
+		return -ENOMEM;
+
+	rv_wq2 = alloc_workqueue("rv_wq2",
+				 WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE,
+				 RV_CONN_MAX_ACTIVE_WQ_ENTRIES);
+	if (!rv_wq2) {
+		ret = -ENOMEM;
+		goto fail_wq;
+	}
+	rv_wq3 = alloc_workqueue("rv_wq3",
+				 WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE,
+				 10);
+	if (!rv_wq3) {
+		ret = -ENOMEM;
+		goto fail_wq;
+	}
+
+	rv_dd = kzalloc(sizeof(*rv_dd), GFP_KERNEL);
+	if (!rv_dd) {
+		ret = -ENOMEM;
+		goto fail_wq;
+	}
+
+	rv_dd->class = class_create(THIS_MODULE, DRIVER_NAME);
+	if (IS_ERR(rv_dd->class)) {
+		ret = PTR_ERR(rv_dd->class);
+		pr_err("Could not create device class: %d\n", ret);
+		goto fail_free;
+	}
+	rv_dd->class->devnode = rv_devnode;
+
+	/* Allocate the dev_t */
+	ret = alloc_chrdev_region(&rv_dd->dev, 0, 1, DRIVER_NAME);
+	if (ret < 0) {
+		pr_err("Could not allocate chrdev region (err %d)\n", -ret);
+		goto fail_destroy;
+	}
+
+	/* Add the char device to both sysfs and devfs */
+	device = &rv_dd->user_device;
+	device_initialize(device);
+	device->class = rv_dd->class;
+	device->parent = NULL;
+	device->devt = rv_dd->dev;
+	device->release = rv_dev_release;
+	dev_set_name(device, "%s", DRIVER_NAME);
+	cdev = &rv_dd->user_cdev;
+	cdev_init(cdev, &rv_file_ops);
+	cdev->owner = THIS_MODULE;
+	ret = cdev_device_add(cdev, device);
+	if (ret < 0) {
+		pr_err("Could not add cdev for %s\n", DRIVER_NAME);
+		goto fail_release;
+	}
+
+	return 0;
+
+fail_release:
+	unregister_chrdev_region(rv_dd->dev, 1);
+fail_destroy:
+	class_destroy(rv_dd->class);
+fail_free:
+	kfree(rv_dd);
+	rv_dd = NULL;
+fail_wq:
+	if (rv_wq3) {
+		destroy_workqueue(rv_wq3);
+		rv_wq3 = NULL;
+	}
+	if (rv_wq2) {
+		destroy_workqueue(rv_wq2);
+		rv_wq2 = NULL;
+	}
+	destroy_workqueue(rv_wq);
+	rv_wq = NULL;
+
+	return ret;
+}
+
+/*
+ * We wait for all job devs to finish; at this point there are no more
+ * users.  Note that in RV_RDMA_MODE_USER mode no job_dev is allocated.
+ */
+void rv_file_uninit(void)
+{
+	unsigned long timeout = msecs_to_jiffies(100);
+
+	if (rv_dd) {
+		cdev_device_del(&rv_dd->user_cdev, &rv_dd->user_device);
+		unregister_chrdev_region(rv_dd->dev, 1);
+		class_destroy(rv_dd->class);
+		put_device(&rv_dd->user_device);
+	}
+
+	while (atomic_read(&rv_job_dev_cnt) > 0)
+		schedule_timeout_interruptible(timeout);
+
+	if (rv_wq3) {
+		flush_workqueue(rv_wq3);
+		destroy_workqueue(rv_wq3);
+		rv_wq3 = NULL;
+	}
+	if (rv_wq2) {
+		flush_workqueue(rv_wq2);
+		destroy_workqueue(rv_wq2);
+		rv_wq2 = NULL;
+	}
+	if (rv_wq) {
+		flush_workqueue(rv_wq);
+		destroy_workqueue(rv_wq);
+		rv_wq = NULL;
+	}
+}
+
+/* rv event reporting ring, allocated by mmap */
+static struct rv_user_ring *rv_user_ring_alloc(int rv_inx,
+					       u32 num_entries,
+					       struct vm_area_struct *vma)
+{
+	struct rv_user_ring *ring;
+	unsigned long len;
+	unsigned long pfn;
+	int ret;
+
+	len = RV_RING_ALLOC_LEN(num_entries);
+	len = ALIGN(len, SMP_CACHE_BYTES);
+	if (len > vma->vm_end - vma->vm_start) {
+		rv_err(rv_inx, "mmap too small for ring\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	ring = kzalloc(sizeof(*ring), GFP_KERNEL);
+	if (!ring)
+		return ERR_PTR(-ENOMEM);
+
+	ring->rv_inx = rv_inx;
+	ring->num_entries = num_entries;
+	ring->order = get_order(len);
+	ring->page = __get_free_pages(GFP_KERNEL | __GFP_ZERO, ring->order);
+	if (!ring->page) {
+		rv_err(rv_inx, "ring alloc failure\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+	pfn = virt_to_phys((void *)ring->page) >> PAGE_SHIFT;
+
+	/* remap kernel memory to userspace */
+	ret = remap_pfn_range(vma, vma->vm_start, pfn, len, vma->vm_page_prot);
+	if (ret < 0) {
+		rv_err(rv_inx, "remap failed page:0x%lx pfn:0x%lx\n",
+		       (unsigned long)ring->page, pfn);
+		goto fail;
+	}
+	spin_lock_init(&ring->lock);
+	ring->hdr = (struct rv_ring_header *)ring->page;
+	ring->hdr->head = 0;
+	ring->hdr->tail = 0;
+	ring->hdr->overflow_cnt = 0;
+	trace_rv_user_ring_alloc(ring->rv_inx, ring->num_entries,
+				 ring->hdr->head, ring->hdr->tail);
+	return ring;
+
+fail:
+	if (ring->page)
+		free_pages(ring->page, ring->order);
+	kfree(ring);
+	return ERR_PTR(ret);
+}
+
+/* We sanity check ring->page, should always be != 0 here */
+static void rv_user_ring_free(struct rv_user_ring *ring)
+{
+	trace_rv_user_ring_free(ring->rv_inx, ring->num_entries,
+				ring->hdr->head, ring->hdr->tail);
+	if (ring->page)
+		free_pages(ring->page, ring->order);
+	kfree(ring);
+}
+
+/*
+ * we ignore offset, but if we decide we need multiple queues in future we
+ * can use that to identify which queue is being requested
+ */
+static int rv_file_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+	struct rv_user *rv = fp->private_data;
+	struct rv_user_ring *ring;
+	int ret = -EINVAL;
+
+	trace_rv_msg_mmap(rv->inx, "vma", vma->vm_start, vma->vm_end);
+
+	mutex_lock(&rv->mutex);
+	if (rv->cqr) {
+		rv_err(rv->inx, "ring already allocated\n");
+		goto unlock;
+	}
+	if (!rv->cq_entries) {
+		rv_err(rv->inx, "ring disabled at attach time\n");
+		goto unlock;
+	}
+
+	ring = rv_user_ring_alloc(rv->inx, rv->cq_entries, vma);
+	if (IS_ERR(ring)) {
+		ret = PTR_ERR(ring);
+		goto unlock;
+	}
+	rv->cqr = ring;
+	ret = 0;
+
+unlock:
+	mutex_unlock(&rv->mutex);
+	return ret;
+}
diff --git a/drivers/infiniband/ulp/rv/rv_main.c b/drivers/infiniband/ulp/rv/rv_main.c
index 7f81f97a01f0..c40ee36c6f9e 100644
--- a/drivers/infiniband/ulp/rv/rv_main.c
+++ b/drivers/infiniband/ulp/rv/rv_main.c
@@ -178,7 +178,7 @@ static void rv_device_detach_users(struct rv_device *dev)
 		list_del_init(&rv->user_entry);
 
 		spin_unlock_irqrestore(&rv_dev_list_lock, flags);
-		/* Detach user here */
+		rv_detach_user(rv);
 		spin_lock_irqsave(&rv_dev_list_lock, flags);
 	}
 	spin_unlock_irqrestore(&rv_dev_list_lock, flags);
diff --git a/drivers/infiniband/ulp/rv/trace_conn.h b/drivers/infiniband/ulp/rv/trace_conn.h
index 2626545b0df6..5ac7ce72dbaf 100644
--- a/drivers/infiniband/ulp/rv/trace_conn.h
+++ b/drivers/infiniband/ulp/rv/trace_conn.h
@@ -233,6 +233,24 @@ DEFINE_EVENT(/* event */
 		jkey_len, jkey, sid, q_depth, ua_next, refcount)
 );
 
+DEFINE_EVENT(/* event */
+	rv_jdev_template, rv_jdev_alloc,
+	TP_PROTO(void *ptr, const char *dev_name, u8 num_conn, u8 index_bits,
+		 u16 loc_gid_index, u32 loc_addr, u8 jkey_len, u8 *jkey,
+		 u64 sid, u32 q_depth, u32 ua_next, u32 refcount),
+	TP_ARGS(ptr, dev_name, num_conn, index_bits, loc_gid_index, loc_addr,
+		jkey_len, jkey, sid, q_depth, ua_next, refcount)
+);
+
+DEFINE_EVENT(/* event */
+	rv_jdev_template, rv_jdev_release,
+	TP_PROTO(void *ptr, const char *dev_name, u8 num_conn, u8 index_bits,
+		 u16 loc_gid_index, u32 loc_addr, u8 jkey_len, u8 *jkey,
+		 u64 sid, u32 q_depth, u32 ua_next, u32 refcount),
+	TP_ARGS(ptr, dev_name, num_conn, index_bits, loc_gid_index, loc_addr,
+		jkey_len, jkey, sid, q_depth, ua_next, refcount)
+);
+
 DECLARE_EVENT_CLASS(/* sconn */
 	rv_sconn_template,
 	TP_PROTO(void *ptr, u8 index, u32 qp_num, void *conn, u32 flags,
diff --git a/drivers/infiniband/ulp/rv/trace_user.h b/drivers/infiniband/ulp/rv/trace_user.h
index 90f67537d249..d03a53f89cc6 100644
--- a/drivers/infiniband/ulp/rv/trace_user.h
+++ b/drivers/infiniband/ulp/rv/trace_user.h
@@ -11,9 +11,70 @@
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM rv_user
 
+#define RV_USER_PRN "Inx %d rdma_mode %u attached %u dev_name %s " \
+		    "cq_entries %u index %u"
+
 #define RV_USER_MRS_PRN "rv_nx %d jdev %p total_size 0x%llx max_size 0x%llx " \
 			"refcount %u"
 
+#define RV_ATTACH_REQ_PRN "inx %d Device %s rdma_mode %u port_num %u " \
+			  "loc_addr 0x%x jkey_len %u jkey 0x%s " \
+			  "service_id 0x%llx cq_entries %u q_depth %u " \
+			  "timeout %u hb_timeout %u"
+
+DECLARE_EVENT_CLASS(/* user */
+	rv_user_template,
+	TP_PROTO(int inx, u8 rdma_mode, u8 attached, const char *dev_name,
+		 u32 cq_entries, u16 index),
+	TP_ARGS(inx, rdma_mode, attached, dev_name, cq_entries, index),
+	TP_STRUCT__entry(/* entry */
+		__field(int, inx)
+		__field(u8, rdma_mode)
+		__field(u8, attached)
+		__string(name, dev_name)
+		__field(u32, cq_entries)
+		__field(u16, index)
+	),
+	TP_fast_assign(/* assign */
+		__entry->inx = inx;
+		__entry->rdma_mode = rdma_mode;
+		__entry->attached = attached;
+		__assign_str(name, dev_name);
+		__entry->cq_entries = cq_entries;
+		__entry->index = index;
+	),
+	TP_printk(/* print */
+		RV_USER_PRN,
+		__entry->inx,
+		__entry->rdma_mode,
+		__entry->attached,
+		__get_str(name),
+		__entry->cq_entries,
+		__entry->index
+	)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_template, rv_user_open,
+	TP_PROTO(int inx, u8 rdma_mode, u8 attached, const char *dev_name,
+		 u32 cq_entries, u16 index),
+	TP_ARGS(inx, rdma_mode, attached, dev_name, cq_entries, index)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_template, rv_user_close,
+	TP_PROTO(int inx, u8 rdma_mode, u8 attached, const char *dev_name,
+		 u32 cq_entries, u16 index),
+	TP_ARGS(inx, rdma_mode, attached, dev_name, cq_entries, index)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_template, rv_user_attach,
+	TP_PROTO(int inx, u8 rdma_mode, u8 attached, const char *dev_name,
+		 u32 cq_entries, u16 index),
+	TP_ARGS(inx, rdma_mode, attached, dev_name, cq_entries, index)
+);
+
 DECLARE_EVENT_CLASS(/* user msg */
 	rv_user_msg_template,
 	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
@@ -51,6 +112,24 @@ DEFINE_EVENT(/* event */
 	TP_ARGS(inx, msg, d1, d2)
 );
 
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_detach_all,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_uconn_remove,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_cleanup,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
 DEFINE_EVENT(/* event */
 	rv_user_msg_template, rv_msg_cmp_params,
 	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
@@ -69,6 +148,12 @@ DEFINE_EVENT(/* event */
 	TP_ARGS(inx, msg, d1, d2)
 );
 
+DEFINE_EVENT(/* event */
+	rv_user_msg_template, rv_msg_mmap,
+	TP_PROTO(int inx, const char *msg, u64 d1, u64 d2),
+	TP_ARGS(inx, msg, d1, d2)
+);
+
 DECLARE_EVENT_CLASS(/* user_mrs */
 	rv_user_mrs_template,
 	TP_PROTO(int rv_inx, void *jdev, u64 total_size, u64 max_size,
@@ -112,6 +197,59 @@ DEFINE_EVENT(/* event */
 	TP_ARGS(rv_inx, jdev, total_size, max_size, refcount)
 );
 
+TRACE_EVENT(/* event */
+	rv_attach_req,
+	TP_PROTO(int inx, char *dev_name, u8 rdma_mode, u8 port_num,
+		 u32 loc_addr, u8 jkey_len, u8 *jkey, u64 service_id,
+		 u32 cq_entries, u32 q_depth, u32 timeout, u32 hb_timeout),
+	TP_ARGS(inx, dev_name, rdma_mode, port_num, loc_addr, jkey_len, jkey,
+		service_id, cq_entries, q_depth, timeout, hb_timeout),
+	TP_STRUCT__entry(/* entry */
+		__field(int, inx)
+		__string(device, dev_name)
+		__field(u8, rdma_mode)
+		__field(u8, port_num)
+		__field(u32, loc_addr)
+		__field(u8, jkey_len)
+		__array(u8, jkey, RV_MAX_JOB_KEY_LEN)
+		__field(u64, service_id)
+		__field(u32, cq_entries)
+		__field(u32, q_depth)
+		__field(u32, timeout)
+		__field(u32, hb_timeout)
+	),
+	TP_fast_assign(/* assign */
+		__entry->inx = inx;
+		__assign_str(device, dev_name);
+		__entry->rdma_mode = rdma_mode;
+		__entry->port_num = port_num;
+		__entry->loc_addr = loc_addr;
+		__entry->jkey_len = jkey_len;
+		memcpy(__entry->jkey, jkey, RV_MAX_JOB_KEY_LEN);
+		__entry->service_id = service_id;
+		__entry->cq_entries = cq_entries;
+		__entry->q_depth = q_depth;
+		__entry->timeout = timeout;
+		__entry->hb_timeout = hb_timeout;
+	),
+	TP_printk(/* print */
+		RV_ATTACH_REQ_PRN,
+		__entry->inx,
+		__get_str(device),
+		__entry->rdma_mode,
+		__entry->port_num,
+		__entry->loc_addr,
+		__entry->jkey_len,
+		__print_hex_str(__entry->jkey, RV_MAX_JOB_KEY_LEN),
+		__entry->service_id,
+		__entry->cq_entries,
+		__entry->q_depth,
+		__entry->timeout,
+		__entry->hb_timeout
+	)
+);
+
 DECLARE_EVENT_CLASS(/* user_ring */
 	rv_user_ring_template,
 	TP_PROTO(int rv_inx, u32 count, u32 hd, u32 tail),
@@ -137,12 +275,43 @@ DECLARE_EVENT_CLASS(/* user_ring */
 	)
 );
 
+DEFINE_EVENT(/* event */
+	rv_user_ring_template, rv_user_ring_alloc,
+	TP_PROTO(int rv_inx, u32 count, u32 hd, u32 tail),
+	TP_ARGS(rv_inx, count, hd, tail)
+);
+
+DEFINE_EVENT(/* event */
+	rv_user_ring_template, rv_user_ring_free,
+	TP_PROTO(int rv_inx, u32 count, u32 hd, u32 tail),
+	TP_ARGS(rv_inx, count, hd, tail)
+);
+
 DEFINE_EVENT(/* event */
 	rv_user_ring_template, rv_user_ring_post_event,
 	TP_PROTO(int rv_inx, u32 count, u32 hd, u32 tail),
 	TP_ARGS(rv_inx, count, hd, tail)
 );
 
+TRACE_EVENT(/* event */
+	rv_ioctl,
+	TP_PROTO(int inx, u32 cmd),
+	TP_ARGS(inx, cmd),
+	TP_STRUCT__entry(/* entry */
+		__field(int, inx)
+		__field(u32, cmd)
+	),
+	TP_fast_assign(/* assign */
+		__entry->inx = inx;
+		__entry->cmd = cmd;
+	),
+	TP_printk(/* print */
+		"inx %d cmd 0x%x",
+		__entry->inx,
+		__entry->cmd
+	)
+);
+
 #endif /* __RV_TRACE_USER_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH RFC 9/9] RDMA/rv: Integrate the file operations into the rv module
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
                   ` (7 preceding siblings ...)
  2021-03-19 12:56 ` [PATCH RFC 8/9] RDMA/rv: Add functions for file operations kaike.wan
@ 2021-03-19 12:56 ` kaike.wan
  2021-03-19 13:53 ` [PATCH RFC 0/9] A rendezvous module Jason Gunthorpe
  9 siblings, 0 replies; 52+ messages in thread
From: kaike.wan @ 2021-03-19 12:56 UTC (permalink / raw)
  To: dledford, jgg; +Cc: linux-rdma, todd.rimmer, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

Integrate the file operations into the module_init and module_exit
functions so that user applications can access the rv module.

Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
---
 drivers/infiniband/ulp/rv/rv_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/infiniband/ulp/rv/rv_main.c b/drivers/infiniband/ulp/rv/rv_main.c
index c40ee36c6f9e..04c5a8606598 100644
--- a/drivers/infiniband/ulp/rv/rv_main.c
+++ b/drivers/infiniband/ulp/rv/rv_main.c
@@ -253,11 +253,16 @@ static int __init rv_init_module(void)
 		return -EINVAL;
 	}
 
+	if (rv_file_init()) {
+		ib_unregister_client(&rv_client);
+		return -EINVAL;
+	}
 	return 0;
 }
 
 static void __exit rv_cleanup_module(void)
 {
+	rv_file_uninit();
 	ib_unregister_client(&rv_client);
 	rv_deinit_devices();
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
                   ` (8 preceding siblings ...)
  2021-03-19 12:56 ` [PATCH RFC 9/9] RDMA/rv: Integrate the file operations into the rv module kaike.wan
@ 2021-03-19 13:53 ` Jason Gunthorpe
  2021-03-19 14:49   ` Wan, Kaike
  9 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 13:53 UTC (permalink / raw)
  To: kaike.wan; +Cc: dledford, linux-rdma, todd.rimmer

On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:

> - Basic mode of operations (PSM3 is used as an example for user
>   applications):
>   - A middleware (like MPI) has out-of-band communication channels
>     between any two nodes, which are used to establish high performance
>     communications for providers such as PSM3.

Huh? Doesn't PSM3 already use its own special non-verbs char devices
that already have memory caches and other stuff? Now you want to throw
that all away and do yet another char dev just for HFI? Why?

I also don't know why you picked the name rv, this looks like it has
little to do with the usual MPI rendezvous protocol. This is all about
bulk transfers. It is actually a lot like RDS. Maybe you should be
using RDS?

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 13:53 ` [PATCH RFC 0/9] A rendezvous module Jason Gunthorpe
@ 2021-03-19 14:49   ` Wan, Kaike
  2021-03-19 15:48     ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Wan, Kaike @ 2021-03-19 14:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, linux-rdma, Rimmer, Todd

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 19, 2021 9:53 AM
> To: Wan, Kaike <kaike.wan@intel.com>
> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd
> <todd.rimmer@intel.com>
> Subject: Re: [PATCH RFC 0/9] A rendezvous module
> 
> On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:
> 
> > - Basic mode of operations (PSM3 is used as an example for user
> >   applications):
> >   - A middleware (like MPI) has out-of-band communication channels
> >     between any two nodes, which are used to establish high performance
> >     communications for providers such as PSM3.
> 
> Huh? Doesn't PSM3 already use its own special non-verbs char devices that
> already have memory caches and other stuff? Now you want to throw that
> all away and do yet another char dev just for HFI? Why?
[Wan, Kaike] I think that you are referring to PSM2, which uses the OPA hfi1 driver that is specific to the OPA hardware.
PSM3 uses standard verbs drivers and supports standard RoCE.  A focus is the Intel RDMA Ethernet NICs. As such it cannot use the hfi1 driver through the special PSM2 interface. Rather it works with the hfi1 driver through standard verbs interface. The rv module was a new design to bring these concepts to standard transports and hardware.

> 
> I also don't know why you picked the name rv, this looks like it has little to do
> with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> actually a lot like RDS. Maybe you should be using RDS?
[Wan, Kaike] While there are similarities in concepts, details are different.  Quite frankly this could be viewed as an application accelerator much like RDS served that purpose for Oracle, which continues to be its main use case. The rv module is currently targeted at enabling the MPI/OFI/PSM3 application.

The name "rv" is chosen simply because this module is designed to enable the rendezvous protocol of the MPI/OFI/PSM3 application stack for large messages. Short messages are handled by eager transfer through UDP in PSM3.

FYI, there is an OFA presentation from Thurs reviewing PSM3 and RV and covering much of the architecture and rationale.
> 
> Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 14:49   ` Wan, Kaike
@ 2021-03-19 15:48     ` Jason Gunthorpe
  2021-03-19 19:22       ` Dennis Dalessandro
  2021-03-19 20:34       ` Hefty, Sean
  0 siblings, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 15:48 UTC (permalink / raw)
  To: Wan, Kaike; +Cc: dledford, linux-rdma, Rimmer, Todd

On Fri, Mar 19, 2021 at 02:49:29PM +0000, Wan, Kaike wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, March 19, 2021 9:53 AM
> > To: Wan, Kaike <kaike.wan@intel.com>
> > Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd
> > <todd.rimmer@intel.com>
> > Subject: Re: [PATCH RFC 0/9] A rendezvous module
> > 
> > On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:
> > 
> > > - Basic mode of operations (PSM3 is used as an example for user
> > >   applications):
> > >   - A middleware (like MPI) has out-of-band communication channels
> > >     between any two nodes, which are used to establish high performance
> > >     communications for providers such as PSM3.
> > 
> > Huh? Doesn't PSM3 already use its own special non-verbs char devices that
> > already have memory caches and other stuff? Now you want to throw that
> > all away and do yet another char dev just for HFI? Why?

> [Wan, Kaike] I think that you are referring to PSM2, which uses the
> OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
> standard verbs drivers and supports standard RoCE.  

Uhhh.. "PSM" has always been about the ipath special char device, and
if I recall properly the library was semi-discontinued and merged into
libfabric.

So here you are talking about a libfabric verbs provider that doesn't
use the ipath style char interface but uses verbs and this rv thing so
we call it a libfabric PSM3 provider because that's not confusing to
anyone at all..

> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
> hfi1 driver through the special PSM2 interface. 

These are the drivers that aren't merged yet, I see. So why are you
sending this now? I'm not interested to look at even more Intel code
when their driver saga is still ongoing for years.

> Rather it works with the hfi1 driver through standard verbs
> interface.

But nobody would do that right? You'd get better results using the
hfi1 native interfaces instead of their slow fake verbs stuff.

> > I also don't know why you picked the name rv, this looks like it has little to do
> > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> > actually a lot like RDS. Maybe you should be using RDS?

> [Wan, Kaike] While there are similarities in concepts, details are
> different.  

You should list these differences.

> Quite frankly this could be viewed as an application accelerator
> much like RDS served that purpose for Oracle, which continues to be
> its main use case.

Obviously, except it seems to be doing the same basic acceleration
technique as RDS.

> The name "rv" is chosen simply because this module is designed to
> enable the rendezvous protocol of the MPI/OFI/PSM3 application stack
> for large messages. Short messages are handled by eager transfer
> through UDP in PSM3.

A bad name seems like it will further limit potential re-use of this
code.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 1/9] RDMA/rv: Public interferce for the RDMA Rendezvous module
  2021-03-19 12:56 ` [PATCH RFC 1/9] RDMA/rv: Public interferce for the RDMA Rendezvous module kaike.wan
@ 2021-03-19 16:00   ` Jason Gunthorpe
  2021-03-19 18:42   ` kernel test robot
  1 sibling, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 16:00 UTC (permalink / raw)
  To: kaike.wan; +Cc: dledford, linux-rdma, todd.rimmer

On Fri, Mar 19, 2021 at 08:56:27AM -0400, kaike.wan@intel.com wrote:
> From: Kaike Wan <kaike.wan@intel.com>
> 
> The RDMA Rendezvous (rv) module provides an interface for HPC
> middlewares to improve performance by caching memory region
> registration, and improve the scalability of RDMA transactions
> through connection management between nodes. This mechanism
> is implemented through the following ioctl requests:
> - ATTACH: to attach to an RDMA device.
> - REG_MEM: to register a user/kernel memory region.
> - DEREG_MEM: to release application use of MR, allowing it to
>              remain in cache.
> - GET_CACHE_STATS: to get cache statistics.
> - CONN_CREATE: to create an RC connection.
> - CONN_CONNECT: to start the connection.
> - CONN_GET_CONN_COUNT: to use as part of error recovery from lost
>                        messages in the application.
> - CONN_GET_STATS: to get connection statistics.
> - GET_EVENT_STATS: to get the RDMA event statistics.
> - POST_RDMA_WR_IMMED: to post an RDMA WRITE WITH IMMED request.
> 
> Signed-off-by: Todd Rimmer <todd.rimmer@intel.com>
> Signed-off-by: Kaike Wan <kaike.wan@intel.com>
>  include/uapi/rdma/rv_user_ioctls.h | 558 +++++++++++++++++++++++++++++
>  1 file changed, 558 insertions(+)
>  create mode 100644 include/uapi/rdma/rv_user_ioctls.h
> 
> diff --git a/include/uapi/rdma/rv_user_ioctls.h b/include/uapi/rdma/rv_user_ioctls.h
> new file mode 100644
> index 000000000000..97e35b722443
> +++ b/include/uapi/rdma/rv_user_ioctls.h
> @@ -0,0 +1,558 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
> +/*
> + * Copyright(c) 2020 - 2021 Intel Corporation.
> + */
> +#ifndef __RV_USER_IOCTL_H__
> +#define __RV_USER_IOCTL_H__
> +#include <rdma/rdma_user_ioctl.h>
> +#include <rdma/ib_user_sa.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +/* Checking /Documentation/userspace-api/ioctl/ioctl-number.rst */
> +#define RV_MAGIC RDMA_IOCTL_MAGIC

No.

> +#define RV_FILE_NAME "/dev/rv"

No.

> +
> +/*
> + * Handles are opaque to application; they are meaningful only to the
> + * RV driver
> + */
> +
> +/* this version of ABI */
> +#define RV_ABI_VER_MAJOR 1
> +#define RV_ABI_VER_MINOR 0

No, see my remarks to your other intel colleagues doing the ioasid
stuff.

> +struct rv_query_params_out {
> +		/* ABI version */
> +	__u16 major_rev;
> +	__u16 minor_rev;
> +	__u32 resv1;
> +	__aligned_u64 capability;
> +	__aligned_u64 resv2[6];

No pre-adding reserved stuff

> +};
> +
> +#define RV_IOCTL_QUERY _IOR(RV_MAGIC, 0xFC, struct rv_query_params_out)
> +
> +/* Mode for use of rv module by PSM */
> +#define RV_RDMA_MODE_USER 0	/* user MR caching only */
> +#define RV_RDMA_MODE_KERNEL 1	/* + kernel RC QPs with kernel MR caching */

Huh? Mode sounds bad.

> +/*
> + * mr_cache_size is in MBs and if 0 will use module param as default
> + * num_conn - number of QPs between each pair of nodes
> + * loc_addr - used to select client/listen vs rem_addr
> + * index_bits - num high bits of immed data with rv index
> + * loc_gid_index - SGID for client connections
> + * loc_gid[16] - to double check gid_index unchanged
> + * job_key[RV_MAX_JOB_KEY_LEN] = unique uuid per job
> + * job_key_len - len, if 0 matches jobs with len==0 only
> + * q_depth - size of QP and per QP CQs
> + * reconnect_timeout - in seconds from loss to restoration
> + * hb_interval - in milliseconds between heartbeats
> + */
> +struct rv_attach_params_in {
> +	char dev_name[RV_MAX_DEV_NAME_LEN];

Guessing this is a no too.

> +	__u32 mr_cache_size;
> +	__u8 rdma_mode;
> +
> +	/* additional information for RV_RDMA_MODE_KERNEL */
> +	__u8 port_num;
> +	__u8 num_conn;

Lots of alignment holes, don't do that either.

> +struct rv_attach_params {
> +	union {
> +		struct rv_attach_params_in in;
> +		struct rv_attach_params_out out;
> +	};
> +};

Yikes, no

> +/* The buffer is used to register a kernel mr */
> +#define IBV_ACCESS_KERNEL 0x80000000

Huh? WTF on on many levels

> +/*
> + * ibv_pd_handle - user space appl allocated pd
> + * ulen - driver_udata inlen
> + * *udata - driver_updata inbuf
> + */
> +struct rv_mem_params_in {
> +	__u32 ibv_pd_handle;
> +	__u32 cmd_fd_int;

Really? You want to reach into the command FD from a ULP and extract
objects? Double yikes. Did you do this properly, taking care of every
lifetime rule and handling disassociation?

> +	__aligned_u64 addr;
> +	__aligned_u64 length;
> +	__u32 access;
> +	size_t ulen;
> +	void *udata;

'void *' is wrong for ioctls.

> +struct rv_conn_get_stats_params_out {

Too many stats, don't you think? Most of the header seems to be stats
of one thing or another.

> +/*
> + * events placed on ring buffer for delivery to user space.
> + * Carefully sized to be a multiple of 64 bytes for cache alignment.
> + * Must pack to get good field alignment and desired 64B overall size
> + * Unlike verbs, all rv_event fields are defined even when
> + * rv_event.wc.status != IB_WC_SUCCESS. Only sent writes can report bad status.
> + * event_type - enum rv_event_type
> + * wc - send or recv work completions
> + *	status - ib_wc_status
> + *	resv1 - alignment
> + *	imm_data - for RV_WC_RECV_RDMA_WITH_IMM only
> + *	wr_id - PSM wr_id for RV_WC_RDMA_WRITE only
> + *	conn_handle - conn handle. For efficiency in completion processing, this
> + *		handle is the rv_conn handle, not the rv_user_conn.
> + *		Main use is sanity checks.  On Recv PSM must use imm_data to
> + *		efficiently identify source.
> + *	byte_len - unlike verbs API, this is always valid
> + *	resv2 - alignment
> + * cache_align -  not used, but forces overall struct to 64B size
> + */
> +struct rv_event {
> +	__u8		event_type;
> +	union {
> +		struct {
> +			__u8		status;
> +			__u16	resv1;
> +			__u32	imm_data;
> +			__aligned_u64	wr_id;
> +			__aligned_u64	conn_handle;
> +			__u32	byte_len;
> +			__u32	resv2;
> +		} __attribute__((__packed__)) wc;
> +		struct {
> +			__u8 pad[7];
> +			uint64_t pad2[7];
> +		} __attribute__((__packed__)) cache_align;
> +	};
> +} __attribute__((__packed__));

Uhh, mixing packed and aligned_u64 is pretty silly. I don't think this
needs to be packed or written in this tortured way.

> +
> +/*
> + * head - consumer removes here
> + * tail - producer adds here
> + * overflow_cnt - number of times producer overflowed ring and discarded
> + * pad - 64B cache alignment for entries
> + */
> +struct rv_ring_header {
> +	volatile __u32 head;
> +	volatile __u32 tail;
> +	volatile __u64 overflow_cnt;

No on volatile, and missed a __aligned here

This uapi needs to be much better. It looks like the mess from the PSM
char dev just re-imported here.

At the very least split the caching from the other operations and
follow normal ioctl design

And you need to rethink the uverbs stuff.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 2/9] RDMA/rv: Add the internal header files
  2021-03-19 12:56 ` [PATCH RFC 2/9] RDMA/rv: Add the internal header files kaike.wan
@ 2021-03-19 16:02   ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 16:02 UTC (permalink / raw)
  To: kaike.wan; +Cc: dledford, linux-rdma, todd.rimmer

On Fri, Mar 19, 2021 at 08:56:28AM -0400, kaike.wan@intel.com wrote:

> +struct rv_mr_cache {
> +	u64 max_size;
> +	void *ops_arg;
> +	struct mmu_notifier mn;

Nope on using raw mmu_notifier, this code isn't done right in hfi1,
I'm not looking at it again here. Use the interval notifier.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 1/9] RDMA/rv: Public interferce for the RDMA Rendezvous module
  2021-03-19 12:56 ` [PATCH RFC 1/9] RDMA/rv: Public interferce for the RDMA Rendezvous module kaike.wan
  2021-03-19 16:00   ` Jason Gunthorpe
@ 2021-03-19 18:42   ` kernel test robot
  1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2021-03-19 18:42 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1793 bytes --]

Hi,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on rdma/for-next]
[also build test ERROR on linus/master v5.12-rc3 next-20210319]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/kaike-wan-intel-com/A-rendezvous-module/20210319-210005
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next
config: i386-randconfig-s002-20210318 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-277-gc089cd2d-dirty
        # https://github.com/0day-ci/linux/commit/96b4d25534a5c1ebaab65a35fd9cd242613cd6d1
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review kaike-wan-intel-com/A-rendezvous-module/20210319-210005
        git checkout 96b4d25534a5c1ebaab65a35fd9cd242613cd6d1
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <command-line>:32:
>> ./usr/include/rdma/rv_user_ioctls.h:117:2: error: unknown type name 'size_t'
     117 |  size_t ulen;
         |  ^~~~~~
>> ./usr/include/rdma/rv_user_ioctls.h:535:4: error: unknown type name 'uint64_t'
     535 |    uint64_t pad2[7];
         |    ^~~~~~~~

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 36429 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 15:48     ` Jason Gunthorpe
@ 2021-03-19 19:22       ` Dennis Dalessandro
  2021-03-19 19:44         ` Jason Gunthorpe
  2021-03-19 20:34       ` Hefty, Sean
  1 sibling, 1 reply; 52+ messages in thread
From: Dennis Dalessandro @ 2021-03-19 19:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Wan, Kaike; +Cc: dledford, linux-rdma, Rimmer, Todd

On 3/19/2021 11:48 AM, Jason Gunthorpe wrote:
> On Fri, Mar 19, 2021 at 02:49:29PM +0000, Wan, Kaike wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Friday, March 19, 2021 9:53 AM
>>> To: Wan, Kaike <kaike.wan@intel.com>
>>> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd
>>> <todd.rimmer@intel.com>
>>> Subject: Re: [PATCH RFC 0/9] A rendezvous module
>>>
>>> On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:
>>>
>>>> - Basic mode of operations (PSM3 is used as an example for user
>>>>    applications):
>>>>    - A middleware (like MPI) has out-of-band communication channels
>>>>      between any two nodes, which are used to establish high performance
>>>>      communications for providers such as PSM3.
>>>
>>> Huh? Doesn't PSM3 already use its own special non-verbs char devices that
>>> already have memory caches and other stuff? Now you want to throw that
>>> all away and do yet another char dev just for HFI? Why?
> 
>> [Wan, Kaike] I think that you are referring to PSM2, which uses the
>> OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
>> standard verbs drivers and supports standard RoCE.
> 
> Uhhh.. "PSM" has always been about the ipath special char device, and
> if I recall properly the library was semi-discontinued and merged into
> libfabric.

This driver is intended to work with a fork of the PSM2 library. The 
PSM2 library which is for Omni-Path is now maintained by Cornelis 
Networks on our GitHub. PSM3 is something from Intel for Ethernet. I 
know it's a bit confusing.

> So here you are talking about a libfabric verbs provider that doesn't
> use the ipath style char interface but uses verbs and this rv thing so
> we call it a libfabric PSM3 provider because that's not confusing to
> anyone at all..
> 
>> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
>> hfi1 driver through the special PSM2 interface.
> 
> These are the drivers that aren't merged yet, I see. So why are you
> sending this now? I'm not interested to look at even more Intel code
> when their driver saga is still ongoing for years.
> 
>> Rather it works with the hfi1 driver through standard verbs
>> interface.
> 
> But nobody would do that right? You'd get better results using the
> hfi1 native interfaces instead of their slow fake verbs stuff.

I can't imagine why. I'm not sure what you mean by our slow fake verbs 
stuff? We support verbs just fine. It's certainly not fake.

>>> I also don't know why you picked the name rv, this looks like it has little to do
>>> with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
>>> actually a lot like RDS. Maybe you should be using RDS?
.
.
>> The name "rv" is chosen simply because this module is designed to
>> enable the rendezvous protocol of the MPI/OFI/PSM3 application stack
>> for large messages. Short messages are handled by eager transfer
>> through UDP in PSM3.
> 
> A bad name seems like it will further limit potential re-use of this
> code.

As to the name, I started this driver when I was still at Intel before
handing the reins to Kaike when we spun out. I plucked the name rv out 
of the air without much consideration. While the code seems to have 
changed a lot, the name seems to have stuck. It was as Kaike mentions 
because it helps enable the rendezvous side of PSM3.

-Denny

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 19:22       ` Dennis Dalessandro
@ 2021-03-19 19:44         ` Jason Gunthorpe
  2021-03-19 20:12           ` Rimmer, Todd
  2021-03-19 20:18           ` Dennis Dalessandro
  0 siblings, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 19:44 UTC (permalink / raw)
  To: Dennis Dalessandro; +Cc: Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

On Fri, Mar 19, 2021 at 03:22:45PM -0400, Dennis Dalessandro wrote:

> > > [Wan, Kaike] I think that you are referring to PSM2, which uses the
> > > OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
> > > standard verbs drivers and supports standard RoCE.
> > 
> > Uhhh.. "PSM" has always been about the ipath special char device, and
> > if I recall properly the library was semi-discontinued and merged into
> > libfabric.
> 
> This driver is intended to work with a fork of the PSM2 library. The PSM2
> library which is for Omni-Path is now maintained by Cornelis Networks on our
> GitHub. PSM3 is something from Intel for Ethernet. I know it's a bit
> confusing.

"a bit" huh?

> > So here you are talking about a libfabric verbs provider that doesn't
> > use the ipath style char interface but uses verbs and this rv thing so
> > we call it a libfabric PSM3 provider because that's not confusing to
> > anyone at all..
> > 
> > > A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
> > > hfi1 driver through the special PSM2 interface.
> > 
> > These are the drivers that aren't merged yet, I see. So why are you
> > sending this now? I'm not interested to look at even more Intel code
> > when their driver saga is still ongoing for years.
> > 
> > > Rather it works with the hfi1 driver through standard verbs
> > > interface.
> > 
> > But nobody would do that right? You'd get better results using the
> > hfi1 native interfaces instead of their slow fake verbs stuff.
> 
> I can't imagine why. I'm not sure what you mean by our slow fake verbs
> stuff? We support verbs just fine. It's certainly not fake.

hfi1 calls to the kernel for data path operations - that is "fake" in
my book. Verbs was always about avoiding that kernel transition, to
put it back in betrays the spirit. So a kernel call for rv, or the hfi
cdev, or the verbs post-send is really all a wash.

I didn't understand your answer, do you see using this with hfi1 or
not? 

It looks a lot copy&pasted from the hfi1 driver, so now we are on our
third copy of this code :(

And why did it suddenly become a ULP that somehow shares uverbs
resources?? I'm pretty skeptical that can be made to work correctly..

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 19:44         ` Jason Gunthorpe
@ 2021-03-19 20:12           ` Rimmer, Todd
  2021-03-19 20:26             ` Jason Gunthorpe
  2021-03-19 20:18           ` Dennis Dalessandro
  1 sibling, 1 reply; 52+ messages in thread
From: Rimmer, Todd @ 2021-03-19 20:12 UTC (permalink / raw)
  To: Jason Gunthorpe, Dennis Dalessandro
  Cc: Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

> hfi1 calls to the kernel for data path operations - that is "fake" in my book. Verbs was always about avoiding that kernel transition, to put it back in betrays the spirit. So a kernel call for rv, or the hfi cdev, or the verbs post-send is really all a wash.

To be clear, different vendors have different priorities and hence different HW designs and approaches.  hfi1 approached the HPC latency needs with a uniquely scalable approach with very low latency @scale. Ironically, other vendors have since replicated some of those mechanisms with their own proprietary mechanisms, such as UD-X.  hfi1 approached storage needs and large message HPC transfers with a direct data placement mechanism (aka RDMA).  It fully supported the verbs API and met the performance needs of its customers with attractive price and power for its target markets.

Kaike cited hfi1 as just one example of an RDMA verbs device which rv can function for.  Various network and server vendors will each make their own recommendations for software configs and various end users will each have their own preferences and requirements.

Todd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 19:44         ` Jason Gunthorpe
  2021-03-19 20:12           ` Rimmer, Todd
@ 2021-03-19 20:18           ` Dennis Dalessandro
  2021-03-19 20:30             ` Jason Gunthorpe
  1 sibling, 1 reply; 52+ messages in thread
From: Dennis Dalessandro @ 2021-03-19 20:18 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

On 3/19/2021 3:44 PM, Jason Gunthorpe wrote:
> On Fri, Mar 19, 2021 at 03:22:45PM -0400, Dennis Dalessandro wrote:
> 
>>>> [Wan, Kaike] I think that you are referring to PSM2, which uses the
>>>> OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
>>>> standard verbs drivers and supports standard RoCE.
>>>
>>> Uhhh.. "PSM" has always been about the ipath special char device, and
>>> if I recall properly the library was semi-discontinued and merged into
>>> libfabric.
>>
>> This driver is intended to work with a fork of the PSM2 library. The PSM2
>> library which is for Omni-Path is now maintained by Cornelis Networks on our
>> GitHub. PSM3 is something from Intel for Ethernet. I know it's a bit
>> confusing.
> 
> "a bit" huh?

Just a bit. :)

> 
>>> So here you are talking about a libfabric verbs provider that doesn't
>>> use the ipath style char interface but uses verbs and this rv thing so
>>> we call it a libfabric PSM3 provider because that's not confusing to
>>> anyone at all..
>>>
>>>> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
>>>> hfi1 driver through the special PSM2 interface.
>>>
>>> These are the drivers that aren't merged yet, I see. So why are you
>>> sending this now? I'm not interested to look at even more Intel code
>>> when their driver saga is still ongoing for years.
>>>
>>>> Rather it works with the hfi1 driver through standard verbs
>>>> interface.
>>>
>>> But nobody would do that right? You'd get better results using the
>>> hfi1 native interfaces instead of their slow fake verbs stuff.
>>
>> I can't imagine why. I'm not sure what you mean by our slow fake verbs
>> stuff? We support verbs just fine. It's certainly not fake.
> 
> hfi1 calls to the kernel for data path operations - that is "fake" in
> my book. Verbs was always about avoiding that kernel transition, to
> put it back in betrays the spirit. So a kernel call for rv, or the hfi
> cdev, or the verbs post-send is really all a wash.

Probably better to argue that in another thread I guess.

> I didn't understand your answer, do you see using this with hfi1 or
> not?

I don't see how this could ever use hfi1. So no.

> It looks a lot copy&pasted from the hfi1 driver, so now we are on our
> third copy of this code :(

I haven't had a chance to look beyond the cover letter in depth at how 
things have changed. I really hope it's not that bad.

> And why did it suddenly become a ULP that somehow shares uverbs
> resources?? I'm pretty skeptical that can be made to work correctly..

I see your point.

I was just providing some background/clarification. Bottom line this has 
nothing to do with hfi1 or OPA or PSM2.

-Denny


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 20:12           ` Rimmer, Todd
@ 2021-03-19 20:26             ` Jason Gunthorpe
  2021-03-19 20:46               ` Rimmer, Todd
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 20:26 UTC (permalink / raw)
  To: Rimmer, Todd; +Cc: Dennis Dalessandro, Wan, Kaike, dledford, linux-rdma

On Fri, Mar 19, 2021 at 08:12:18PM +0000, Rimmer, Todd wrote:
> > hfi1 calls to the kernel for data path operations - that is "fake" in my book. Verbs was always about avoiding that kernel transition, to put it back in betrays the spirit. > So a kernel call for rv, or the hfi cdev, or the verbs post-send is really all a wash.
> 
> To be clear, different vendors have different priorities and hence
> different HW designs and approaches.  hfi1 approached the HPC
> latency needs with a uniquely scalable approach with very low
> latency @scale. 

Yes, we all know the marketing spin the vendors like to use here. A
kernel transition is a kernel transition, and the one in the HFI verbs
path through all the common code is particularly expensive.

I'm surprised to hear someone advocate that is a good thing when we
were all told that the hfi1 cdev *must* exist because the kernel
transition through verbs was far too expensive.

> Ironically, other vendors have since replicated some
> of those mechanisms with their own proprietary mechanisms, such as
> UD-X.  

What is a UD-X?

> Kaike cited hfi1 as just one example of an RDMA verbs device which rv
> can function for.

rv seems to completely destroy a lot of the HPC performance offloads
that vendors are layering on RC QPs (see what hns and mlx5 are doing),
so I have doubts about this.

It would be better if you could get at least one other device to agree
this is reasonable.

Or work with something that already exists for bulk messaging like
RDS.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 20:18           ` Dennis Dalessandro
@ 2021-03-19 20:30             ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 20:30 UTC (permalink / raw)
  To: Dennis Dalessandro; +Cc: Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

On Fri, Mar 19, 2021 at 04:18:56PM -0400, Dennis Dalessandro wrote:

> I was just providing some background/clarification. Bottom line this has
> nothing to do with hfi1 or OPA or PSM2.

Well, I hope your two companies can agree on naming if Cornelis
someday needs a PSM3 for its HW.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 15:48     ` Jason Gunthorpe
  2021-03-19 19:22       ` Dennis Dalessandro
@ 2021-03-19 20:34       ` Hefty, Sean
  2021-03-21 12:08         ` Jason Gunthorpe
  1 sibling, 1 reply; 52+ messages in thread
From: Hefty, Sean @ 2021-03-19 20:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Wan, Kaike; +Cc: dledford, linux-rdma, Rimmer, Todd

> > > I also don't know why you picked the name rv, this looks like it has little to do
> > > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> > > actually a lot like RDS. Maybe you should be using RDS?
> 
> > [Wan, Kaike] While there are similarities in concepts, details are
> > different.
> 
> You should list these differences.
> 
> > Quite frankly this could be viewed as an application accelerator
> > much like RDS served that purpose for Oracle, which continues to be
> > its main use case.
> 
> Obviously, except it seems to be doing the same basic acceleration
> technique as RDS.

A better name for this might be "scalable RDMA service", with RDMA meaning the transport operation.  My understanding is this is intended to be usable over any IB/RoCE device.

I'm not familiar enough with the details of RDS or this service to know the differences.

- Sean

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 20:26             ` Jason Gunthorpe
@ 2021-03-19 20:46               ` Rimmer, Todd
  2021-03-19 20:54                 ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Rimmer, Todd @ 2021-03-19 20:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dennis Dalessandro, Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

> I'm surprised to hear someone advocate that this is a good thing when we were
> all told that the hfi1 cdev *must* exist because the kernel transition through
> verbs was far too expensive.
It depends on the goal vendors have with verbs vs other APIs such as libfabric.  hfi1's verbs goal was focused on storage bandwidth and the cdev was focused on HPC latency and bandwidth for MPI via PSM2 and libfabric.  I'm unclear why we are debating hfi1 here, seems it should be in another thread.

> What is a UD-X?
UD-X is a vendor-specific set of HW interfaces and wire protocols implemented in UCX for the nVidia Connect-X series of network devices.  Many of its concepts are very similar to those which the ipath and hfi1 HW and software implemented.

> rv seems to completely destroy a lot of the HPC performance offloads that
> vendors are layering on RC QPs
Different vendors have different approaches to performance and chose different design trade-offs.

Todd


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 20:46               ` Rimmer, Todd
@ 2021-03-19 20:54                 ` Jason Gunthorpe
  2021-03-19 20:59                   ` Wan, Kaike
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 20:54 UTC (permalink / raw)
  To: Rimmer, Todd; +Cc: Dennis Dalessandro, Wan, Kaike, dledford, linux-rdma

On Fri, Mar 19, 2021 at 08:46:56PM +0000, Rimmer, Todd wrote:
> > I'm suprirsed to hear someone advocate that is a good thing when we were
> > all told that the hfi1 cdev *must* exist because the kernel transition through
> > verbs was far to expensive.
>
> It depends on the goal vendors have with verbs vs other APIs such as
> libfabric.  hfi1's verbs goal was focused on storage bandwidth and
> the cdev was focused on HPC latency and bandwidth for MPI via PSM2
> and libfabric.  I'm unclear why we are debating hfi1 here, seems it
> should be in another thread.

Because this is copied from hfi1?

> > What is a UD-X?
> UD-X is a vendor-specific set of HW interfaces and wire protocols
> implemented in UCX for the nVidia Connect-X series of network devices.
> Many of its concepts are very similar to those which the ipath and hfi1
> HW and software implemented.

Oh, there is lots of stuff in UCX; I'm not surprised you see similarities
to what PSM did since psm/libfabric/ucx are all solving the same
problems.

> > rv seems to completely destroy alot of the HPC performance offloads that
> > vendors are layering on RC QPs

> Different vendors have different approaches to performance and chose
> different design trade-offs.

That isn't my point, by limiting the usability you also restrict the
drivers where this would meaningfully be useful.

So far we now know that it is not useful for mlx5 or hfi1, that leaves
only hns unknown and still in the HPC arena.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 20:54                 ` Jason Gunthorpe
@ 2021-03-19 20:59                   ` Wan, Kaike
  2021-03-19 21:28                     ` Dennis Dalessandro
  0 siblings, 1 reply; 52+ messages in thread
From: Wan, Kaike @ 2021-03-19 20:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Rimmer, Todd; +Cc: Dennis Dalessandro, dledford, linux-rdma

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 19, 2021 4:55 PM
> To: Rimmer, Todd <todd.rimmer@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; Wan,
> Oh, there is lots of stuff in UCX; I'm not surprised you see similarities to what
> PSM did since psm/libfabric/ucx are all solving the same problems.
> 
> > > rv seems to completely destroy alot of the HPC performance offloads
> > > that vendors are layering on RC QPs
> 
> > Different vendors have different approaches to performance and chose
> > different design trade-offs.
> 
> That isn't my point, by limiting the usability you also restrict the drivers where
> this would meaningfully be useful.
> 
> So far we now know that it is not useful for mlx5 or hfi1, that leaves only hns
> unknown and still in the HPC arena.
[Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 20:59                   ` Wan, Kaike
@ 2021-03-19 21:28                     ` Dennis Dalessandro
  2021-03-19 21:58                       ` Wan, Kaike
  2021-03-19 22:57                       ` Rimmer, Todd
  0 siblings, 2 replies; 52+ messages in thread
From: Dennis Dalessandro @ 2021-03-19 21:28 UTC (permalink / raw)
  To: Wan, Kaike, Jason Gunthorpe, Rimmer, Todd; +Cc: dledford, linux-rdma

On 3/19/2021 4:59 PM, Wan, Kaike wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, March 19, 2021 4:55 PM
>> To: Rimmer, Todd <todd.rimmer@intel.com>
>> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; Wan,
>> Oh, there is lots of stuff in UCX, I'm not surprised you similarities to what
>> PSM did since psm/libfabric/ucx are all solving the same problems.
>>
>>>> rv seems to completely destroy alot of the HPC performance offloads
>>>> that vendors are layering on RC QPs
>>
>>> Different vendors have different approaches to performance and chose
>>> different design trade-offs.
>>
>> That isn't my point, by limiting the usability you also restrict the drivers where
>> this would meaningfully be useful.
>>
>> So far we now know that it is not useful for mlx5 or hfi1, that leaves only hns
>> unknown and still in the HPC arena.
> [Wan, Kaike] Incorrect. The rv module works with hfi1.

Interesting. I was thinking the opposite. So what's the benefit? When 
would someone want to do that?

-Denny

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 21:28                     ` Dennis Dalessandro
@ 2021-03-19 21:58                       ` Wan, Kaike
  2021-03-19 22:35                         ` Jason Gunthorpe
  2021-03-19 22:57                       ` Rimmer, Todd
  1 sibling, 1 reply; 52+ messages in thread
From: Wan, Kaike @ 2021-03-19 21:58 UTC (permalink / raw)
  To: Dennis Dalessandro, Jason Gunthorpe, Rimmer, Todd; +Cc: dledford, linux-rdma


> From: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> Sent: Friday, March 19, 2021 5:28 PM
> To: Wan, Kaike <kaike.wan@intel.com>; Jason Gunthorpe <jgg@nvidia.com>;
> Rimmer, Todd <todd.rimmer@intel.com>
> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH RFC 0/9] A rendezvous module
> 
> On 3/19/2021 4:59 PM, Wan, Kaike wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Friday, March 19, 2021 4:55 PM
> >> To: Rimmer, Todd <todd.rimmer@intel.com>
> >> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>;
> >> Wan, Oh, there is lots of stuff in UCX, I'm not surprised you
> >> similarities to what PSM did since psm/libfabric/ucx are all solving the
> same problems.
> >>
> >>>> rv seems to completely destroy alot of the HPC performance offloads
> >>>> that vendors are layering on RC QPs
> >>
> >>> Different vendors have different approaches to performance and chose
> >>> different design trade-offs.
> >>
> >> That isn't my point, by limiting the usability you also restrict the
> >> drivers where this would meaningfully be useful.
> >>
> >> So far we now know that it is not useful for mlx5 or hfi1, that
> >> leaves only hns unknown and still in the HPC arena.
> > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> Interesting. I was thinking the opposite. So what's the benefit? When would
> someone want to do that?
[Wan, Kaike] This is only because rv works with the generic Verbs interface and hfi1 happens to be one of the devices that implements the Verbs interface.
> 
> -Denny

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 21:58                       ` Wan, Kaike
@ 2021-03-19 22:35                         ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 22:35 UTC (permalink / raw)
  To: Wan, Kaike; +Cc: Dennis Dalessandro, Rimmer, Todd, dledford, linux-rdma

On Fri, Mar 19, 2021 at 09:58:21PM +0000, Wan, Kaike wrote:

> > On 3/19/2021 4:59 PM, Wan, Kaike wrote:
> > >> From: Jason Gunthorpe <jgg@nvidia.com>
> > >> Sent: Friday, March 19, 2021 4:55 PM
> > >> To: Rimmer, Todd <todd.rimmer@intel.com>
> > >> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>;
> > >> Wan, Oh, there is lots of stuff in UCX, I'm not surprised you
> > >> similarities to what PSM did since psm/libfabric/ucx are all solving the
> > same problems.
> > >>
> > >>>> rv seems to completely destroy alot of the HPC performance offloads
> > >>>> that vendors are layering on RC QPs
> > >>
> > >>> Different vendors have different approaches to performance and chose
> > >>> different design trade-offs.
> > >>
> > >> That isn't my point, by limiting the usability you also restrict the
> > >> drivers where this would meaningfully be useful.
> > >>
> > >> So far we now know that it is not useful for mlx5 or hfi1, that
> > >> leaves only hns unknown and still in the HPC arena.
> > > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> > 
> > Interesting. I was thinking the opposite. So what's the benefit? When would
> > someone want to do that?

> [Wan, Kaike] This is only because rv works with the generic Verbs
> interface and hfi1 happens to be one of the devices that implements
> the Verbs interface.

Well, it works with everything (except EFA), but that is not my point.

I said it is not useful.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 21:28                     ` Dennis Dalessandro
  2021-03-19 21:58                       ` Wan, Kaike
@ 2021-03-19 22:57                       ` Rimmer, Todd
  2021-03-19 23:06                         ` Jason Gunthorpe
                                           ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Rimmer, Todd @ 2021-03-19 22:57 UTC (permalink / raw)
  To: Dennis Dalessandro, Wan, Kaike, Jason Gunthorpe
  Cc: dledford, linux-rdma, Rimmer, Todd

> > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> Interesting. I was thinking the opposite. So what's the benefit? When would
> someone want to do that?
The more interesting scenario is for customers who would like to run libfabric and other Open Fabrics Alliance software over various verbs capable hardware.
Today PSM2 is a good choice for OPA hardware.  However for some other devices without existing libfabric providers, rxm and rxd are the best choices.
As was presented in Open Fabrics workshop today by James Erwin, PSM3 offers noticeable benefits over existing libfabric rxm and rxd providers
and the rv module offers noticeable performance benefits when using PSM3.

> This driver is intended to work with a fork of the PSM2 library. The
> PSM2 library which is for Omni-Path is now maintained by Cornelis
> Networks on our GitHub. PSM3 is something from Intel for Ethernet. I
> know it's a bit confusing.
Intel retains various IP and trademark rights.  Intel's marketing team analyzed and chose the name PSM3.  Obviously there are pluses and minuses to any name choice.

This is not unlike other industry software history where new major revisions often add and remove support for various HW generations.
PSM(1) - supported infinipath IB adapters; was a standalone API (various forms).
PSM2 - dropped support for infinipath and IB and added support for Omni-Path, along with various features; also added libfabric support.
PSM3 - dropped support for Omni-Path and added support for RoCE and verbs-capable devices, along with other features;
	also dropped the PSM2 API and standardized on libfabric.
All three have similar strategies of onload protocols for eager messages and shared kernel/HW resources for large messages
and direct data placement (RDMA).  So the name Performance Scaled Messaging is meant to reflect the concept and approach
as opposed to reflecting a specific HW implementation or even API.

PSM3 is only available as a libfabric provider.
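To make that shared eager/rendezvous strategy concrete, here is a minimal C
sketch of the per-message decision.  It is illustrative only: the function
names, the 64 KB cutoff and the stubbed bodies are assumptions made for the
example, not the actual PSM3 or rv interfaces.

#include <stdio.h>
#include <stddef.h>

#define EAGER_THRESHOLD (64 * 1024)	/* assumed cutoff, tunable in practice */

static int post_eager_copy(const void *buf, size_t len)
{
	/* Eager path: copy through pre-registered bounce buffers entirely in
	 * user space ("onload"), no per-message MR registration. */
	(void)buf;
	printf("eager send of %zu bytes\n", len);
	return 0;
}

static int post_rendezvous_rdma(const void *buf, size_t len)
{
	/* Rendezvous path: register (or look up in an MR cache) the user
	 * buffer and hand it to shared kernel/HW resources so the data is
	 * placed directly at the destination with RDMA Write. */
	(void)buf;
	printf("rendezvous RDMA of %zu bytes\n", len);
	return 0;
}

static int send_message(const void *buf, size_t len)
{
	return len <= EAGER_THRESHOLD ? post_eager_copy(buf, len)
				      : post_rendezvous_rdma(buf, len);
}

int main(void)
{
	static char payload[128 * 1024];

	send_message(payload, 1024);		/* small: eager */
	send_message(payload, sizeof(payload));	/* large: rendezvous */
	return 0;
}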

> I haven't had a chance to look beyond the cover letter in depth at how things
> have changed. I really hope it's not that bad.
A few stylistic elements did get carried forward, as you noticed, but this is much different from hfi1: it doesn't directly access hardware and is hence smaller.
We carefully looked at overlap with features in ib_core, and the patch set contains a couple of minor API additions to ib_core to simplify some operations
which others may find useful.

> I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol.
The focus of the design was to support the bulk transfer part of the MPI rendezvous protocol, hence the name rv.
We'd welcome other name suggestions; we wanted to keep the name simple and brief.

> No pre-adding reserved stuff
> Lots of alignment holes, don't do that either.
We'd like advice on a challenging situation.  Some customers desire NICs to support nVidia GPUs in some environments.
Unfortunately the nVidia GPU drivers are not upstream, and have not been for years.  So we are forced to have both out of tree
and upstream versions of the code.  We need the same applications to be able to work over both, so we would like the
GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things.
We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
mechanisms to hold places for GPU specific fields which can't be upstreamed.

Todd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 22:57                       ` Rimmer, Todd
@ 2021-03-19 23:06                         ` Jason Gunthorpe
  2021-03-20 16:39                         ` Dennis Dalessandro
  2021-03-23 15:30                         ` Christoph Hellwig
  2 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 23:06 UTC (permalink / raw)
  To: Rimmer, Todd; +Cc: Dennis Dalessandro, Wan, Kaike, dledford, linux-rdma

On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote:

> We'd like advice on a challenging situation.  Some customers desire
> NICs to support nVidia GPUs in some environments. Unfortunately the
> nVidia GPU drivers are not upstream, and have not been for years.

Yep.

> So we are forced to have both out of tree and upstream versions of
> the code.

Sure

>  We need the same applications to be able to work over
> both, so we would like the GPU enabled versions of the code to have
> the same ABI as the upstream code as this greatly simplifies things.

The only answer is to get it upstream past community review before you
ship anything to customers.

> We have removed all GPU specific code from the upstream submission,
> but used both the "alignment holes" and the "reserved" mechanisms to
> hold places for GPU specific fields which can't be upstreamed.

Sorry, nope, you have to do something else. We will *not* have garbage
in upstream or be restricted in any way based on out of tree code.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 22:57                       ` Rimmer, Todd
  2021-03-19 23:06                         ` Jason Gunthorpe
@ 2021-03-20 16:39                         ` Dennis Dalessandro
  2021-03-21  8:56                           ` Leon Romanovsky
  2021-03-23 15:30                         ` Christoph Hellwig
  2 siblings, 1 reply; 52+ messages in thread
From: Dennis Dalessandro @ 2021-03-20 16:39 UTC (permalink / raw)
  To: Rimmer, Todd, Wan, Kaike, Jason Gunthorpe; +Cc: dledford, linux-rdma

On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
>>
>> Interesting. I was thinking the opposite. So what's the benefit? When would
>> someone want to do that?
> The more interesting scenario is for customers who would like to run libfabric and other Open Fabrics Alliance software over various verbs capable hardware.

Ah ok that makes sense. Not that running it over hfi1 is the goal but 
being able to run over verbs devices. Makes sense to me now.

> Today PSM2 is a good choice for OPA hardware.  However for some other devices without existing libfabric providers, rxm and rxd are the best choices.
> As was presented in Open Fabrics workshop today by James Erwin, PSM3 offers noticeable benefits over existing libfabric rxm and rxd providers
> and the rv module offers noticeable performance benefits when using PSM3.

For those that haven't seen them, the talks will be posted to YouTube 
and/or the OpenFabrics.org web page. There are actually two talks on this 
stuff. The first, by Todd, is available now [1]; James' talk 
will be up soon I'm sure.

>> I haven't had a chance to look beyond the cover letter in depth at how things
>> have changed. I really hope it's not that bad.
> While a few stylistic elements got carried forward, as you noticed.  This is much different from hfi1 as it doesn't directly access hardware and is hence smaller.
> We carefully looked at overlap with features in ib_core and the patch set contains a couple minor API additions to ib_core to simplify some operations
> which others may find useful.

Right, so if there is common functionality between hfi1 and rv then it 
might belong in the core, especially if it's something 
that's common between a ULP and a HW driver.

>> I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol.
> The focus of the design was to support the bulk transfer part of the MPI rendezvous protocol, hence the name rv.
> We'd welcome other name suggestions, wanted to keep the name simple and brief.

Like I said previously you can place the blame for the name on me. Kaike 
and Todd just carried it forward. I think Sean had an idea in one of the 
other replies. Let's hear some other suggestions too.

>> No pre-adding reserved stuff
>> Lots of alignment holes, don't do that either.
> We'd like advice on a challenging situation.  Some customers desire NICs to support nVidia GPUs in some environments.
> Unfortunately the nVidia GPU drivers are not upstream, and have not been for years.  So we are forced to have both out of tree
> and upstream versions of the code.  We need the same applications to be able to work over both, so we would like the
> GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things.
> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> mechanisms to hold places for GPU specific fields which can't be upstreamed.

This problem extends to other drivers as well. I'm also interested in 
advice on the situation. I don't particularly like this either, but we 
need a way to accomplish the goal. We owe it to users to be flexible. 
Please offer suggestions.

[1] https://www.youtube.com/watch?v=iOvt_Iqz0uU

-Denny

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-20 16:39                         ` Dennis Dalessandro
@ 2021-03-21  8:56                           ` Leon Romanovsky
  2021-03-21 16:24                             ` Dennis Dalessandro
  0 siblings, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-03-21  8:56 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Rimmer, Todd, Wan, Kaike, Jason Gunthorpe, dledford, linux-rdma

On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
> > > > [Wan, Kaike] Incorrect. The rv module works with hfi1.

<...>

> > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> > mechanisms to hold places for GPU specific fields which can't be upstreamed.
> 
> This problem extends to other drivers as well. I'm also interested in advice
> on the situation. I don't particularly like this either, but we need a way
> to accomplish the goal. We owe it to users to be flexible. Please offer
> suggestions.

Sorry to interrupt, but it seems that the solution was already given here [1].
It wasn't said directly, but reading between the lines it means that you need
two things:
1. Upstream everything.
2. Find another vendor that jumps on this RV bandwagon.

[1] https://lore.kernel.org/linux-rdma/20210319230644.GI2356281@nvidia.com

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 20:34       ` Hefty, Sean
@ 2021-03-21 12:08         ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-21 12:08 UTC (permalink / raw)
  To: Hefty, Sean; +Cc: Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

On Fri, Mar 19, 2021 at 08:34:17PM +0000, Hefty, Sean wrote:
> > > > I also don't know why you picked the name rv, this looks like it has little to do
> > > > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> > > > actually a lot like RDS. Maybe you should be using RDS?
> > 
> > > [Wan, Kaike] While there are similarities in concepts, details are
> > > different.
> > 
> > You should list these differences.
> > 
> > > Quite frankly this could be viewed as an application accelerator
> > > much like RDS served that purpose for Oracle, which continues to be
> > > its main use case.
> > 
> > Obviously, except it seems to be doing the same basic acceleration
> > technique as RDS.
> 
> A better name for this might be "scalable RDMA service", with RDMA
> meaning the transport operation.  My understanding is this is
> intended to be usable over any IB/RoCE device.

It looks like it is really all about bulk zero-copy transfers with
kernel involvement.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-21  8:56                           ` Leon Romanovsky
@ 2021-03-21 16:24                             ` Dennis Dalessandro
  2021-03-21 16:45                               ` Jason Gunthorpe
  2021-03-23 15:33                               ` Christoph Hellwig
  0 siblings, 2 replies; 52+ messages in thread
From: Dennis Dalessandro @ 2021-03-21 16:24 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Rimmer, Todd, Wan, Kaike, Jason Gunthorpe, dledford, linux-rdma

On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> <...>
> 
>>> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
>>> mechanisms to hold places for GPU specific fields which can't be upstreamed.
>>
>> This problem extends to other drivers as well. I'm also interested in advice
>> on the situation. I don't particularly like this either, but we need a way
>> to accomplish the goal. We owe it to users to be flexible. Please offer
>> suggestions.
> 
> Sorry to interrupt, but it seems that solution was said here [1].
> It wasn't said directly, but between the lines it means that you need
> two things:
> 1. Upstream everything.

Completely agree. However the GPU code from nvidia is not upstream. I 
don't see that issue getting resolved in this code review. Let's move on.

> 2. Find another vendor that jumps on this RV bandwagon.

That's not a valid argument. Clearly this works over multiple vendors' HW.

We should be trying to get things upstream, not putting up walls when 
people want to submit new drivers. Calling code "garbage" [1] is not 
productive.

[1] https://lore.kernel.org/linux-rdma/20210319230644.GI2356281@nvidia.com

-Denny

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 16:24                             ` Dennis Dalessandro
@ 2021-03-21 16:45                               ` Jason Gunthorpe
  2021-03-21 17:21                                 ` Dennis Dalessandro
  2021-03-23 15:35                                 ` Christoph Hellwig
  2021-03-23 15:33                               ` Christoph Hellwig
  1 sibling, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-21 16:45 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Leon Romanovsky, Rimmer, Todd, Wan, Kaike, dledford, linux-rdma

On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
> On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
> > On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
> > > On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
> > > > > > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> > 
> > <...>
> > 
> > > > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> > > > mechanisms to hold places for GPU specific fields which can't be upstreamed.
> > > 
> > > This problem extends to other drivers as well. I'm also interested in advice
> > > on the situation. I don't particularly like this either, but we need a way
> > > to accomplish the goal. We owe it to users to be flexible. Please offer
> > > suggestions.
> > 
> > Sorry to interrupt, but it seems that solution was said here [1].
> > It wasn't said directly, but between the lines it means that you need
> > two things:
> > 1. Upstream everything.
> 
> Completely agree. However the GPU code from nvidia is not upstream. I don't
> see that issue getting resolved in this code review. Let's move on.
> 
> > 2. Find another vendor that jumps on this RV bandwagon.
> 
> That's not a valid argument. Clearly this works over multiple vendors HW.

At a certain point we have to decide if this is a generic code of some
kind or a driver-specific thing like HFI has.

There are also obvious technical problems designing it as a ULP, so it
is a very important question to answer. If it will only ever be used
by one Intel ethernet chip then maybe it isn't really generic code.

On the other hand it really looks like it overlaps in various ways
with both RDS and the qib/hfi1 cdev, so why isn't there any effort to
have some commonality??

> We should be trying to get things upstream, not putting up walls when people
> want to submit new drivers. Calling code "garbage" [1] is not productive.

Putting a bunch of misaligned structures and random reserved fields
*is* garbage by the upstream standard and if I send that to Linus I'll
get yelled at.

And you certainly can't say "we are already shipping this ABI so we
won't change it" either.

You can't square this circle by compromising the upstream world in any
way, it is simply not accepted by the overall community.

All of you should know this, I shouldn't have to lecture on this!

Also no, we should not be "trying to get things upstream" as some goal
in itself. Upstream is not a trashcan to dump stuff into, someone has
to maintain all of this long after it stops being interesting, so
there better be good reasons to put it here in the first place.

If it isn't obvious, I'll repeat again: I'm highly annoyed that Intel
is sending something like this RV, in the state it is in, to support
their own out-of-tree driver, when they themselves have been dragging
their feet for *years* on responding to review comments so that driver
can go upstream.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 16:45                               ` Jason Gunthorpe
@ 2021-03-21 17:21                                 ` Dennis Dalessandro
  2021-03-21 18:08                                   ` Jason Gunthorpe
                                                     ` (2 more replies)
  2021-03-23 15:35                                 ` Christoph Hellwig
  1 sibling, 3 replies; 52+ messages in thread
From: Dennis Dalessandro @ 2021-03-21 17:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Rimmer, Todd, Wan, Kaike, dledford, linux-rdma

On 3/21/2021 12:45 PM, Jason Gunthorpe wrote:
> On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
>> On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
>>> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
>>>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
>>>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
>>>
>>> <...>
>>>
>>>>> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
>>>>> mechanisms to hold places for GPU specific fields which can't be upstreamed.
>>>>
>>>> This problem extends to other drivers as well. I'm also interested in advice
>>>> on the situation. I don't particularly like this either, but we need a way
>>>> to accomplish the goal. We owe it to users to be flexible. Please offer
>>>> suggestions.
>>>
>>> Sorry to interrupt, but it seems that solution was said here [1].
>>> It wasn't said directly, but between the lines it means that you need
>>> two things:
>>> 1. Upstream everything.
>>
>> Completely agree. However the GPU code from nvidia is not upstream. I don't
>> see that issue getting resolved in this code review. Let's move on.
>>
>>> 2. Find another vendor that jumps on this RV bandwagon.
>>
>> That's not a valid argument. Clearly this works over multiple vendors HW.
> 
> At a certain point we have to decide if this is a generic code of some
> kind or a driver-specific thing like HFI has.
> 
> There are also obvious technical problems designing it as a ULP, so it
> is a very important question to answer. If it will only ever be used
> by one Intel ethernet chip then maybe it isn't really generic code.

Todd/Kaike, is there something in here that is specific to the Intel 
ethernet chip?

> On the other hand it really looks like it overlaps in various ways
> with both RDS and the qib/hfi1 cdev, so why isn't there any effort to
> have some commonality??

Maybe that's something that should be explored. Isn't this along the 
lines of stuff we talked about with the verbs 2.0 stuff, or whatever we 
ended up calling it?

>> We should be trying to get things upstream, not putting up walls when people
>> want to submit new drivers. Calling code "garbage" [1] is not productive.
> 
> Putting a bunch of misaligned structures and random reserved fields
> *is* garbage by the upstream standard and if I send that to Linus I'll
> get yelled at.

Not saying you should send this to Linus. I'm saying we should figure 
out a way to make it better, and insulting people and their hard work 
isn't helping. This is the kind of culture we are trying to get away 
from in the kernel world.

> And you certainly can't say "we are already shipping this ABI so we
> won't change it" either.
> 
> You can't square this circle by compromising the upstream world in any
> way, it is simply not accepted by the overall community.
> 
> All of you should know this, I shouldn't have to lecture on this!

No one is suggesting we compromise the upstream world. There is a bigger 
picture here. The answer for this driver may just be to take out the 
reserved stuff. That's pretty simple. The bigger question is how we 
deal with non-upstream code. It can't be a problem unique to the RDMA 
subsystem. How do others deal with it?

> Also no, we should not be "trying to get things upstream" as some goal
> in itself. Upstream is not a trashcan to dump stuff into, someone has
> to maintain all of this long after it stops being interesting, so
> there better be good reasons to put it here in the first place.

That is completely not what I meant at all. I mean we should be trying 
to get rid of the proprietary and out-of-tree stuff. It doesn't at all 
imply flinging crap against the wall and hoping it sticks. We should be 
encouraging vendors to submit their code and work with them to get it in 
shape. We clearly have a problem with vendor proprietary code not being 
open. Let's not encourage that behavior. Vendors should say "I want to 
submit my code to the Linux kernel," not "eh, it's too much of a hassle 
and kernel people are jerks, so we'll just post it on our website."

> If it isn't obvious, I'll repeat again: I'm highly annoyed that Intel
> is sending something like this RV, in the state it is in, to support
> their own out of tree driver, that they themselves have been dragging
> their feet on responding to review comments so it can be upstream for
> *years*.

To be fair it is sent as RFC. So to me that says they know it's a ways 
off from being ready to be included and are starting the process early.

-Denny


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 17:21                                 ` Dennis Dalessandro
@ 2021-03-21 18:08                                   ` Jason Gunthorpe
  2021-03-22 15:17                                     ` Rimmer, Todd
  2021-03-22 17:31                                     ` Hefty, Sean
  2021-03-21 19:19                                   ` Wan, Kaike
  2021-03-23 15:36                                   ` Christoph Hellwig
  2 siblings, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-21 18:08 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Leon Romanovsky, Rimmer, Todd, Wan, Kaike, dledford, linux-rdma

On Sun, Mar 21, 2021 at 01:21:14PM -0400, Dennis Dalessandro wrote:

> Maybe that's something that should be explored. Isn't this along the lines
> of stuff we talked about with the verbs 2.0 stuff, or whatever we ended up
> calling it.

I think verbs 2.0 turned into the ioctl uapi stuff, I don't remember anymore.

> > > We should be trying to get things upstream, not putting up walls when people
> > > want to submit new drivers. Calling code "garbage" [1] is not productive.
> > 
> > Putting a bunch of misaligned structures and random reserved fields
> > *is* garbage by the upstream standard and if I send that to Linus I'll
> > get yelled at.
> 
> Not saying you should send this to Linus. I'm saying we should figure out a
> way to make it better and insulting people and their hard work isn't
> helping.

No - you've missed what happened here. Todd responded very fast and
explained - Intel *knowingly* sent code that was sub-standard as some
calculated attempt to make Intel's life maintaining their out of tree
drivers easier.

This absolutely needs strong language as I can't catch everything and
people need to understand there are real consequences to violating the
community trust in this way!

> No one is suggesting to compromise the upstream world. 

I'm not sure what you mean - what could upstream do to change the
situation in any way other than compromising on what will be merged?

> There is a bigger picture here. The answer for this driver may just
> be take out the reserved stuff. That's pretty simple. The bigger
> question is how do we deal with non-upstream code. It can't be a
> problem unique to the RDMA subsystem. How do others deal with it?

The kernel community consensus is that upstream code is on its own.

We don't change the kernel to accommodate out-of-tree code. If the kernel
changes and out-of-tree code breaks, nobody cares.

Over a long time it has been proven that this methodology is a good
way to effect business change to align with the community consensus
development model - eventually the costs of being out of tree have bad
ROI and companies align.

> That is completely not what I meant at all. I mean we should be
> trying to get rid of the proprietary, and out of tree stuff. It
> doesn't at all imply to fling crap against the wall and hope it
> sticks. We should be encouraging vendors to submit their code and
> work with them to get it in shape.

Well, I am working on something like 4-5 Intel series right now, and
it sometimes does feel like flinging. Have you seen Greg KH's remarks
that he won't even look at Intel patches until they have internal
expert sign off?

> not encourage that behavior. Vendors should say I want to submit my code to
> the Linux kernel. Not eh, it's too much of a hassle and kernel people are
> jerks, so we'll just post it on our website.

There is a very, very fine line between "working with the community"
and "the community will provide free expert engineering work to our
staff".

> To be fair it is sent as RFC. So to me that says they know it's a ways off
> from being ready to be included and are starting the process early.

I'm not sure why it is an RFC. It sounds like it is much more than this
if someone has already made a version with links to the NVIDIA GPU
driver and has reached a point where they care about ABI stability?

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 17:21                                 ` Dennis Dalessandro
  2021-03-21 18:08                                   ` Jason Gunthorpe
@ 2021-03-21 19:19                                   ` Wan, Kaike
  2021-03-23 15:36                                   ` Christoph Hellwig
  2 siblings, 0 replies; 52+ messages in thread
From: Wan, Kaike @ 2021-03-21 19:19 UTC (permalink / raw)
  To: Dennis Dalessandro, Jason Gunthorpe
  Cc: Leon Romanovsky, Rimmer, Todd, dledford, linux-rdma

> From: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> Sent: Sunday, March 21, 2021 1:21 PM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>; Rimmer, Todd
> <todd.rimmer@intel.com>; Wan, Kaike <kaike.wan@intel.com>;
> dledford@redhat.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH RFC 0/9] A rendezvous module
> 
> On 3/21/2021 12:45 PM, Jason Gunthorpe wrote:
> > On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
> >> On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
> >>> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
> >>>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
> >>>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
> >>>
> >>> <...>
> >>>
> >>
> >>> 2. Find another vendor that jumps on this RV bandwagon.
> >>
> >> That's not a valid argument. Clearly this works over multiple vendors HW.
> >
> > At a certain point we have to decide if this is a generic code of some
> > kind or a driver-specific thing like HFI has.
> >
> > There are also obvious technial problems designing it as a ULP, so it
> > is a very important question to answer. If it will only ever be used
> > by one Intel ethernet chip then maybe it isn't really generic code.
> 
> Todd/Kaike, is there something in here that is specific to the Intel ethernet
> chip?
[Wan, Kaike] No. The code is generic to Verbs/RoCE devices.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 18:08                                   ` Jason Gunthorpe
@ 2021-03-22 15:17                                     ` Rimmer, Todd
  2021-03-22 16:47                                       ` Jason Gunthorpe
  2021-03-22 17:31                                     ` Hefty, Sean
  1 sibling, 1 reply; 52+ messages in thread
From: Rimmer, Todd @ 2021-03-22 15:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Dennis Dalessandro
  Cc: Leon Romanovsky, Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

> Over a long time it has been proven that this methodology is a good way to effect business change to align with the community consensus development model - eventually the costs of being out of tree have bad ROI and companies align.

Agree.  The key question is when nVidia will upstream its drivers so companies don't have to endure the resulting "bad ROI" of being forced to have unique out-of-tree solutions.

> > Putting a bunch of misaligned structures and random reserved fields
> > *is* garbage by the upstream standard and if I send that to Linus 
> > I'll get yelled at.
Let's not overexaggerate this.  The fields were organized in a logical manner for end-user consumption and understanding.  A couple of resv fields were used for alignment.  I alluded to the fact that we may have some ideas on how those fields or gaps could be used for GPU support in the future, and that this might help deal with the lack of upstream nVidia code and reduce the ROI challenges it causes for other vendors.  The discussion seemed centered around the attach parameters; attach is executed once per process at job start.

Todd


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-22 15:17                                     ` Rimmer, Todd
@ 2021-03-22 16:47                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-22 16:47 UTC (permalink / raw)
  To: Rimmer, Todd
  Cc: Dennis Dalessandro, Leon Romanovsky, Wan, Kaike, dledford, linux-rdma

On Mon, Mar 22, 2021 at 03:17:05PM +0000, Rimmer, Todd wrote:

> > Over a long time it has been proven that this methodology is a
> > good way to effect business change to align with the community
> > consensus development model - eventually the costs of being out of
> > tree have bad ROI and companies align.
> 
> Agree.  The key question is when nVidia will upstream its drivers
> so companies don't have to endure the resulting "bad ROI" of being
> forced to have unique out of tree solutions.

If you are working with the NVIDIA GPU and having inefficiencies then you
need to take it through your business relationship, not here.

> > > Putting a bunch of misaligned structures and random reserved
> > > fields *is* garbage by the upstream standard and if I send that
> > > to Linus I'll get yelled at.

> Let's not overexaggerate this.  The fields were organized in a
> logical manner for end-user consumption and understanding.  A couple
> of resv fields were used for alignment.

No, a couple of resv fields were randomly added and *a lot* of other
misalignments were ignored. That is not our standard for ABI design.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 18:08                                   ` Jason Gunthorpe
  2021-03-22 15:17                                     ` Rimmer, Todd
@ 2021-03-22 17:31                                     ` Hefty, Sean
  2021-03-23 22:56                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 52+ messages in thread
From: Hefty, Sean @ 2021-03-22 17:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Dennis Dalessandro
  Cc: Leon Romanovsky, Rimmer, Todd, Wan, Kaike, dledford, linux-rdma

> > To be fair it is sent as RFC. So to me that says they know it's a ways off
> > from being ready to be included and are starting the process early.
> 
> I'm not sure why it is RFC. It sounds like it is much more than this
> if someone has already made a version with links to the NVIDIA GPU
> driver and has reached a point where they care about ABI stability?

I can take some blame here.  A couple of us were asked to look at this module.  Because the functionality is intended to be device agnostic, we assumed there would be more scrutiny, and we weren't sure how acceptable some aspects would be (e.g. mr cache, ib cm data format).  Rather than debate this internally for months, rework the code, and still miss, we asked Kaike to post an RFC to get community feedback.  For *upstreaming* purposes it was intended as an RFC to gather feedback on the overall approach.  That should have been made clearer.

The direct feedback that Kaike is examining is the difference between this approach and RDS.

- Sean

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-19 22:57                       ` Rimmer, Todd
  2021-03-19 23:06                         ` Jason Gunthorpe
  2021-03-20 16:39                         ` Dennis Dalessandro
@ 2021-03-23 15:30                         ` Christoph Hellwig
  2021-03-23 15:46                           ` Jason Gunthorpe
  2 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2021-03-23 15:30 UTC (permalink / raw)
  To: Rimmer, Todd
  Cc: Dennis Dalessandro, Wan, Kaike, Jason Gunthorpe, dledford, linux-rdma

On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote:
> We'd like advice on a challenging situation.  Some customers desire NICs to support nVidia GPUs in some environments.
> Unfortunately the nVidia GPU drivers are not upstream, and have not been for years.  So we are forced to have both out of tree
> and upstream versions of the code.  We need the same applications to be able to work over both, so we would like the
> GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things.
> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> mechanisms to hold places for GPU specific fields which can't be upstreamed.

NVIDIA GPUs are supported by drivers/gpu/drm/nouveau/, and you are
encouraged to support them just like all the other in-tree GPU drivers.
Not sure what support a network protocol would need for a specific GPU.
You're probably trying to do something amazingly stupid here instead of
relying on proper kernel subsystem use.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 16:24                             ` Dennis Dalessandro
  2021-03-21 16:45                               ` Jason Gunthorpe
@ 2021-03-23 15:33                               ` Christoph Hellwig
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2021-03-23 15:33 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Leon Romanovsky, Rimmer, Todd, Wan, Kaike, Jason Gunthorpe,
	dledford, linux-rdma

On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
> Completely agree. However the GPU code from nvidia is not upstream. I don't
> see that issue getting resolved in this code review. Let's move on.

In which case you need to stop wasting everyone's time with your piece of
crap code base, because it is completely and utterly irrelevant.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 16:45                               ` Jason Gunthorpe
  2021-03-21 17:21                                 ` Dennis Dalessandro
@ 2021-03-23 15:35                                 ` Christoph Hellwig
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2021-03-23 15:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dennis Dalessandro, Leon Romanovsky, Rimmer, Todd, Wan, Kaike,
	dledford, linux-rdma

On Sun, Mar 21, 2021 at 01:45:04PM -0300, Jason Gunthorpe wrote:
> > We should be trying to get things upstream, not putting up walls when people
> > want to submit new drivers. Calling code "garbage" [1] is not productive.
> 
> Putting a bunch of misaligned structures and random reserved fields
> *is* garbage by the upstream standard and if I send that to Linus I'll
> get yelled at.

And when this is for out of tree crap it is per the very definition of
the word garbage.  If I was in your position I would not waste any time
on such an utterly disrespectful submission that violates all the
community rules.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-21 17:21                                 ` Dennis Dalessandro
  2021-03-21 18:08                                   ` Jason Gunthorpe
  2021-03-21 19:19                                   ` Wan, Kaike
@ 2021-03-23 15:36                                   ` Christoph Hellwig
  2 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2021-03-23 15:36 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Jason Gunthorpe, Leon Romanovsky, Rimmer, Todd, Wan, Kaike,
	dledford, linux-rdma

On Sun, Mar 21, 2021 at 01:21:14PM -0400, Dennis Dalessandro wrote:
> No one is suggesting to compromise the upstream world. There is a bigger
> picture here. The answer for this driver may just be take out the reserved
> stuff. That's pretty simple. The bigger question is how do we deal with
> non-upstream code. It can't be a problem unique to the RDMA subsystem. How
> do others deal with it?

By ignoring it, or if it gets too annoying by making life for it so
utterly painful that people give up on it.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-23 15:30                         ` Christoph Hellwig
@ 2021-03-23 15:46                           ` Jason Gunthorpe
  2021-03-23 16:07                             ` Christoph Hellwig
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 15:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rimmer, Todd, Dennis Dalessandro, Wan, Kaike, dledford, linux-rdma

On Tue, Mar 23, 2021 at 03:30:41PM +0000, Christoph Hellwig wrote:

> On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote:

> > We'd like advise on a challenging situation.  Some customers
> > desire NICs to support nVidia GPUs in some environments.
> > Unfortunately the nVidia GPU drivers are not upstream, and have
> > not been for years.  So we are forced to have both out of tree and
> > upstream versions of the code.  We need the same applications to
> > be able to work over both, so we would like the GPU enabled
> > versions of the code to have the same ABI as the upstream code as
> > this greatly simplifies things.  We have removed all GPU specific
> > code from the upstream submission, but used both the "alignment
> > holes" and the "reserved" mechanisms to hold places for GPU
> > specific fields which can't be upstreamed.
> 
> NVIDIA GPUs are supported by drivers/gpu/drm/nouveau/, and you are
> encouraged to support them just like all the other in-tree GPU
> drivers.  Not sure what support a network protocol would need for a
> specific GPU.  You're probably trying to do something amazingly
> stupid here instead of relying on proper kernel subsystem use.

The kernel building block for what they are trying to do with the GPU
is the recently merged DMABUF MR support in the RDMA subsystem.

I'd like to think that since Daniel's team at Intel got the DMABUF
stuff merged to support the applications Todd's Intel team is building
that this RV stuff is already fully ready for dmabuf... (hint hint)
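As a minimal sketch of that building block from user space (assuming a recent
rdma-core that exports ibv_reg_dmabuf_mr(), and assuming dmabuf_fd comes from
an in-tree exporter such as a DRM driver rather than from out-of-tree code):

#include <stdio.h>
#include <stddef.h>
#include <infiniband/verbs.h>

static struct ibv_mr *register_dmabuf_buffer(struct ibv_pd *pd, int dmabuf_fd,
					     size_t len)
{
	/* offset 0 and iova 0: expose the whole dma-buf starting at its base */
	struct ibv_mr *mr = ibv_reg_dmabuf_mr(pd, 0, len, 0, dmabuf_fd,
					      IBV_ACCESS_LOCAL_WRITE |
					      IBV_ACCESS_REMOTE_WRITE);

	if (!mr)
		perror("ibv_reg_dmabuf_mr");
	return mr;
}

The returned MR is then usable in RDMA Write work requests like any other
registered region, which is the kind of in-tree path being pointed at here,
as opposed to a driver-private peer-memory hook.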

What Todd is alluding to here is the hacky DMABUF alternative that is
in the NVIDIA GPU driver - which HPC networking companies must support
if they want to interwork with the NVIDIA GPU.

Every RDMA vendor playing in the HPC space has some out-of-tree driver
to enable this. :(

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-23 15:46                           ` Jason Gunthorpe
@ 2021-03-23 16:07                             ` Christoph Hellwig
  2021-03-23 17:25                               ` Rimmer, Todd
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2021-03-23 16:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Rimmer, Todd, Dennis Dalessandro, Wan, Kaike,
	dledford, linux-rdma

On Tue, Mar 23, 2021 at 12:46:26PM -0300, Jason Gunthorpe wrote:
> What Todd is alluding to here is the hacky DMABUF alternative that is
> in the NVIDIA GPU driver - which HPC networking companies must support
> if they want to interwork with the NVIDIA GPU.
> 
> Every RDMA vendor playing in the HPC space has some out-of-tree driver
> to enable this. :(

I can only recommend everyone buy from a less fucked up GPU or
accelerator vendor.  We're certainly going to do everything we can to
discourage making life easier for people who insist on doing the
stupidest possible thing.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-23 16:07                             ` Christoph Hellwig
@ 2021-03-23 17:25                               ` Rimmer, Todd
  2021-03-23 17:44                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Rimmer, Todd @ 2021-03-23 17:25 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: Dennis Dalessandro, Wan, Kaike, dledford, linux-rdma, Rimmer, Todd

> I can only recommend everyone buy from a less f***** up GPU or
> accelerator vendor.  
I would certainly love that.  This is not just a recent problem; it's been going on for at least 3-5 years with no end in sight.  And the nvidia driver itself is closed-source in the kernel :-( which makes tuning and debug even harder and continues to add costs to NIC vendors other than nVidia themselves to support this.

Back to the topic at hand, yes, there are a few misalignments in the ABI.  Most of the structures are carefully aligned.  Below I summarize the major structures and their alignment characteristics. In a few places we chose readability for the application programmer by ordering fields in a logical order, such as for statistics.   

In one place a superfluous resv field was used (rv_query_params_out), and when I alluded that it might be taken advantage of in the future to enable a common ABI for GPUs, we went down this deep rat hole.

In studying all the related fields, we found that in most cases, if we shuffled everything for maximum packing, the structures would still end up being about the same size, and this would all be for non-performance-path ABIs.

Here is a summary:
13 structures all perfectly aligned with no gaps, plus 7 structures below.

rv_query_params_out - has one 4-byte reserved field to guarantee alignment for a u64 which follows (see the sketch after this summary).

rv_attach_params_in - organized logically as early fields influence how later fields are interpreted.  Fair number of fields, two 1 byte gaps and one 7 byte gap.  Shuffling this might save about 4-8 bytes tops

rv_cache_stats_params_out - ordered logically by statistics meanings.  Two 4 byte gaps could be solved by having a less logical order.  Of course, applications reporting these statistics will tend to do output in the original order, so packing this results in a harder to use ABI and more difficult code review for application writers wanting to make sure they report all stats but do so in a logical order.

rv_conn_get_stats_params_out - one 2 byte gap (so the 1 bytes field mimicking the input request can be 1st), three 4 byte gaps.  Same explanation as rv_cache_stats_params_out

rv_conn_create_params_in - one 4 byte gap, easy enough to swap

rv_post_write_params_out - one 3 byte gap.  Presented in logical order, shuffling would still yield the same size as compiler will round up size.

rv_event - carefully packed and aligned.  Had to make this work on a wide range of compilers with a 1 byte common field defining which part of union was relevant.  Could put the same field in all unions to get rid of packed attribute if that is preferred.  We found other similar examples like this in an older 4.18 kernel, cited one below.
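
To make the two layout patterns concrete, here is a rough sketch; the struct and field names are made up for illustration and are not the actual rv ABI:

#include <linux/types.h>

/* Explicit padding: resv keeps the following u64 8-byte aligned on
 * every supported ABI and must be zeroed by the kernel. */
struct example_query_params_out {
        __u8  major_rev;
        __u8  minor_rev;
        __u16 flags;
        __u32 resv;
        __aligned_u64 capability;
};

/* "Logical" ordering that leaves an implicit 4-byte compiler hole
 * after num_conn; swapping fields or adding an explicit resv field
 * closes the hole without changing the overall size. */
struct example_create_params_in {
        __u32 num_conn;
        /* 4-byte hole here */
        __aligned_u64 context;
};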

It should be noted, there are existing examples with small gaps or reserved fields in the existing kernel and RDMA stack.  A few examples in ib_user_verbs.h include:

ib_uverbs_send_wr - 4 byte gap after ex for the rdma field in union.

ib_uverbs_flow_attr - 2 byte reserved field declared

ib_flow_spec - 2 byte gap after size field

rdma_netdev - 7 byte gap after port_num

./hw/mthca/mthca_eq.c - a use of packed for mthca_eqe very similar to the one complained about in rv_event

Todd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-23 17:25                               ` Rimmer, Todd
@ 2021-03-23 17:44                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 17:44 UTC (permalink / raw)
  To: Rimmer, Todd
  Cc: Christoph Hellwig, Dennis Dalessandro, Wan, Kaike, dledford, linux-rdma

On Tue, Mar 23, 2021 at 05:25:43PM +0000, Rimmer, Todd wrote:

> In studying all the related fields, in most cases if we shuffled
> everything for maximum packing, the structures would still end up
> being about the same size and this would all be for non-performance
> path ABIs.

It is not about size, it is about uABI compatibility. Do not place
packing holes in uABI structures, as it is *very complicated* to
guarantee those work across all the combinations of user/kernel we
have to support.

Every uABI structure should have explicit padding and should be
organized to minimize padding; "easy on the programmer" is not a
top concern.

You must ensure that all compilers generate the same structure byte
layout. On x86 this means a classic i386 compiler, an AMD64 x32 ABI
compiler, and the normal AMD64 compiler.
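
For illustration, this is the kind of compile-time check that catches a
layout difference (the struct here is hypothetical, not taken from the
rv ABI):

#include <assert.h>
#include <stddef.h>
#include <linux/types.h>

struct example_params {
        __u32 flags;
        __u32 resv;   /* explicit padding, not a compiler hole */
        __aligned_u64 handle;
};

/* With the explicit resv field this struct is 16 bytes with handle at
 * offset 8 on i386, x32 and AMD64 alike.  With a plain __u64 and no
 * resv, i386 would place handle at offset 4 in a 12-byte struct while
 * AMD64 pads it to offset 8, silently breaking the uABI. */
static_assert(sizeof(struct example_params) == 16,
              "uABI struct size must be identical on all supported ABIs");
static_assert(offsetof(struct example_params, handle) == 8,
              "u64 member must sit at the same offset on all ABIs");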

> It should be noted, there are existing examples with small gaps or
> reserved fields in the existing kernel and RDMA stack.  A few
> examples in ib_user_verbs.h include:

Past mistakes do not excuse future errors. Most of those are not uABI
structs.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/9] A rendezvous module
  2021-03-22 17:31                                     ` Hefty, Sean
@ 2021-03-23 22:56                                       ` Jason Gunthorpe
  2021-03-23 23:29                                         ` Rimmer, Todd
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 22:56 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Dennis Dalessandro, Leon Romanovsky, Rimmer, Todd, Wan, Kaike,
	dledford, linux-rdma

On Mon, Mar 22, 2021 at 05:31:07PM +0000, Hefty, Sean wrote:
> > > To be fair it is sent as RFC. So to me that says they know it's a ways off
> > > from being ready to be included and are starting the process early.
> > 
> > I'm not sure why it is RFC. It sounds like it is much more than this
> > if someone has already made a version with links to the NVIDIA GPU
> > driver and has reached a point where they care about ABI stability?
> 
> I can take some blame here.  A couple of us were asked to look at
> this module.  Because the functionality is intended to be device
> agnostic, we assumed there would be more scrutiny, and we weren't
> sure how acceptable some aspects would be (e.g. mr cache, ib cm data
> format).  Rather than debate this internally for months, rework the
> code, and still miss, we asked Kaike to post an RFC to get community
> feedback.  For *upstreaming* purposes it was intended as an RFC to
> gather feedback on the overall approach.  That should have been made
> clearer.

Well, it is hard to even have that kind of conversation when all the
details are wrong. The way this interfaces with uverbs is *just
wrong*; it completely ignores the locking model for how disassociation
works.

Something like this, which tries to integrate closely with uverbs,
really cannot exist without major surgery to uverbs. If you want to do
something along these lines, the uverbs parts cannot be in a ULP.

The mr cache could be moved into some kind of new uverb, that could be
interesting if it is carefully designed.

The actual transport code... that is going to be really hard. RDS
doesn't integrate with uverbs for a reason: the kernel side owns the
QPs and PDs.

How you create a QP owned by the kernel but linked to a PD owned by
uverbs is going to need very delicate and careful work to be somehow
compatible with our disassociation model.

Are you *sure* this needs to be in the kernel? You can't take a
context switch to a user process and use the shared verbs FD stuff
to create the de-duplicated QPs instead? It is a much simpler design.
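
As a rough sketch of that user-space path, assuming the
ibv_import_device()/ibv_import_pd() APIs in rdma-core (how the command
FD gets passed between processes over a Unix socket is omitted):

#include <infiniband/verbs.h>

/* Attach to another process's verbs context and protection domain via
 * its command FD.  A de-duplicated QP scheme would build on this kind
 * of sharing, though sharing the QPs themselves needs more than this. */
struct ibv_pd *attach_to_shared_verbs(int imported_cmd_fd,
                                      uint32_t pd_handle)
{
        struct ibv_context *ctx;

        ctx = ibv_import_device(imported_cmd_fd);
        if (!ctx)
                return NULL;

        return ibv_import_pd(ctx, pd_handle);
}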

Have you thought about an actual *shared verbs QP*? We have a lot of
other shared objects right now, so it is not such a big step. It does
require inter-process locking though - and that is non-trivial.

A shared QP with a kernel-owned send/recv to avoid the locking issue
could also be a very interesting solution.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH RFC 0/9] A rendezvous module
  2021-03-23 22:56                                       ` Jason Gunthorpe
@ 2021-03-23 23:29                                         ` Rimmer, Todd
  0 siblings, 0 replies; 52+ messages in thread
From: Rimmer, Todd @ 2021-03-23 23:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Hefty, Sean
  Cc: Dennis Dalessandro, Leon Romanovsky, Wan, Kaike, dledford,
	linux-rdma, Rimmer, Todd

> How you create a QP owned by the kernel but linked to a PD owned by uverbs is going to need very delicate and careful work to be somehow compatible with our disassociation model.

The primary usage mode is as follows:
The RC QP, PD and MR are all in the kernel.  The buffer virtual address and length are supplied by the user process and then used to look up an MR in the cache; on a miss, a kernel MR is created against the kernel PD.  There are separate MR caches per user process.

The IOs are initiated by the user and matched to an MR in the cache, then an RDMA Write with Immediate is posted on the kernel RC QP.

In concept, this is not unlike other kernel ULPs which perform direct data placement into user memory but use kernel QPs for connections to remote resources, such as various RDMA storage and filesystem ULPs.
The separation of the MR registration call from the IO allows the registration cost of a miss to be partially hidden behind the end-to-end RTS/CTS exchange that occurs in user space.
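
As a rough sketch of the flow from the application's point of view (the
wrapper names and signatures below are placeholders, not the actual rv
ioctl interface):

#include <stddef.h>
#include <stdint.h>

/* Placeholder wrappers standing in for the real rv char-device ioctls. */
int rv_reg_mem(int rv_fd, void *addr, size_t len, uint64_t *mr_handle);
int rv_post_rdma_write_immed(int rv_fd, uint64_t mr_handle, size_t len,
                             uint64_t remote_addr, uint32_t immed);

/* Rendezvous send of one large message in the primary usage mode. */
static int rdzv_send(int rv_fd, void *buf, size_t len,
                     uint64_t remote_addr, uint32_t immed)
{
        uint64_t mr;
        int ret;

        /* Cache lookup by (addr, len); on a miss the kernel registers
         * an MR against its own PD.  The RTS/CTS exchange runs in user
         * space in parallel, partially hiding the cost of a miss. */
        ret = rv_reg_mem(rv_fd, buf, len, &mr);
        if (ret)
                return ret;

        /* Direct data placement: an RDMA Write with Immediate is
         * posted on one of the shared kernel RC QPs to the remote node. */
        return rv_post_rdma_write_immed(rv_fd, mr, len, remote_addr, immed);
}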



There is a secondary usage mode where the MRs are cached but created against a user PD and later used by the user process against QPs in user space.  We found that this mode offered some slight latency advantages over the primary mode for tiny jobs, but suffered significant scalability issues.  Those latency advantages mainly manifested in microbenchmarks, but did help a few apps.


If it would simplify things, we could focus the discussion on the primary usage mode.  Conceptually, the secondary usage mode may be a good candidate for an extension to uverbs (some form of register-MR-with-caching API, where register MR checks a cache and deregister MR merely decrements a reference count in the cache).
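
A minimal sketch of what such an extension could look like; this is
purely illustrative and not an existing libibverbs API:

#include <stddef.h>
#include <infiniband/verbs.h>

struct cached_mr {
        struct ibv_mr *mr;
        int refcount;
        /* cache key: addr, length, access flags, owning PD ... */
};

/* On a hit, bump the refcount and return the existing entry; on a
 * miss, call ibv_reg_mr() and insert the result into the cache. */
struct cached_mr *mr_cache_reg(struct ibv_pd *pd, void *addr,
                               size_t length, int access);

/* Deregistration only drops a reference.  In this minimal version the
 * MR is released as soon as the count reaches zero; a real cache would
 * keep entries warm and evict them under cache-size pressure. */
static void mr_cache_dereg(struct cached_mr *cmr)
{
        if (--cmr->refcount == 0)
                ibv_dereg_mr(cmr->mr);
}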

Todd


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2021-03-23 23:30 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-19 12:56 [PATCH RFC 0/9] A rendezvous module kaike.wan
2021-03-19 12:56 ` [PATCH RFC 1/9] RDMA/rv: Public interferce for the RDMA Rendezvous module kaike.wan
2021-03-19 16:00   ` Jason Gunthorpe
2021-03-19 18:42   ` kernel test robot
2021-03-19 12:56 ` [PATCH RFC 2/9] RDMA/rv: Add the internal header files kaike.wan
2021-03-19 16:02   ` Jason Gunthorpe
2021-03-19 12:56 ` [PATCH RFC 3/9] RDMA/rv: Add the rv module kaike.wan
2021-03-19 12:56 ` [PATCH RFC 4/9] RDMA/rv: Add functions for memory region cache kaike.wan
2021-03-19 12:56 ` [PATCH RFC 5/9] RDMA/rv: Add function to register/deregister memory region kaike.wan
2021-03-19 12:56 ` [PATCH RFC 6/9] RDMA/rv: Add connection management functions kaike.wan
2021-03-19 12:56 ` [PATCH RFC 7/9] RDMA/rv: Add functions for RDMA transactions kaike.wan
2021-03-19 12:56 ` [PATCH RFC 8/9] RDMA/rv: Add functions for file operations kaike.wan
2021-03-19 12:56 ` [PATCH RFC 9/9] RDMA/rv: Integrate the file operations into the rv module kaike.wan
2021-03-19 13:53 ` [PATCH RFC 0/9] A rendezvous module Jason Gunthorpe
2021-03-19 14:49   ` Wan, Kaike
2021-03-19 15:48     ` Jason Gunthorpe
2021-03-19 19:22       ` Dennis Dalessandro
2021-03-19 19:44         ` Jason Gunthorpe
2021-03-19 20:12           ` Rimmer, Todd
2021-03-19 20:26             ` Jason Gunthorpe
2021-03-19 20:46               ` Rimmer, Todd
2021-03-19 20:54                 ` Jason Gunthorpe
2021-03-19 20:59                   ` Wan, Kaike
2021-03-19 21:28                     ` Dennis Dalessandro
2021-03-19 21:58                       ` Wan, Kaike
2021-03-19 22:35                         ` Jason Gunthorpe
2021-03-19 22:57                       ` Rimmer, Todd
2021-03-19 23:06                         ` Jason Gunthorpe
2021-03-20 16:39                         ` Dennis Dalessandro
2021-03-21  8:56                           ` Leon Romanovsky
2021-03-21 16:24                             ` Dennis Dalessandro
2021-03-21 16:45                               ` Jason Gunthorpe
2021-03-21 17:21                                 ` Dennis Dalessandro
2021-03-21 18:08                                   ` Jason Gunthorpe
2021-03-22 15:17                                     ` Rimmer, Todd
2021-03-22 16:47                                       ` Jason Gunthorpe
2021-03-22 17:31                                     ` Hefty, Sean
2021-03-23 22:56                                       ` Jason Gunthorpe
2021-03-23 23:29                                         ` Rimmer, Todd
2021-03-21 19:19                                   ` Wan, Kaike
2021-03-23 15:36                                   ` Christoph Hellwig
2021-03-23 15:35                                 ` Christoph Hellwig
2021-03-23 15:33                               ` Christoph Hellwig
2021-03-23 15:30                         ` Christoph Hellwig
2021-03-23 15:46                           ` Jason Gunthorpe
2021-03-23 16:07                             ` Christoph Hellwig
2021-03-23 17:25                               ` Rimmer, Todd
2021-03-23 17:44                                 ` Jason Gunthorpe
2021-03-19 20:18           ` Dennis Dalessandro
2021-03-19 20:30             ` Jason Gunthorpe
2021-03-19 20:34       ` Hefty, Sean
2021-03-21 12:08         ` Jason Gunthorpe
