Subject: [RFC PATCH v2 0/3] RDMA: add dma-buf support
From: Jianxin Xiong @ 2020-06-29 17:31 UTC
  To: linux-rdma
  Cc: Jianxin Xiong, Doug Ledford, Jason Gunthorpe, Sumit Semwal,
	Leon Romanovsky, Daniel Vetter

When enabled, an RDMA-capable NIC can perform peer-to-peer transactions
over PCIe to access local memory located on another device. This often
performs better than using a system memory buffer for RDMA and copying
data between that buffer and the device memory.

The current kernel RDMA stack uses get_user_pages() to pin the physical
pages backing the user buffer and dma_map_sg_attrs() to obtain the DMA
addresses for memory access. This usually doesn't work for peer device
memory because such memory lacks associated page structures.
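
For reference, here is a much-simplified sketch of that conventional
pin-and-map path (illustrative only, not the actual ib_umem_get() code;
locking, accounting, and error cleanup are omitted):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

static int pin_and_map(struct device *dma_device, unsigned long addr,
                       int npages, struct page **pages,
                       struct sg_table *sgt)
{
        int pinned, nmapped;

        /* Pin the pages backing the user buffer. This is the step that
         * fails for most device memory: there are no page structures. */
        pinned = get_user_pages_fast(addr, npages,
                                     FOLL_WRITE | FOLL_LONGTERM, pages);
        if (pinned != npages)
                return -EFAULT;

        /* Build a scatterlist and obtain DMA addresses for the device. */
        if (sg_alloc_table_from_pages(sgt, pages, npages, 0,
                                      (unsigned long)npages << PAGE_SHIFT,
                                      GFP_KERNEL))
                return -ENOMEM;

        nmapped = dma_map_sg_attrs(dma_device, sgt->sgl, sgt->orig_nents,
                                   DMA_BIDIRECTIONAL, 0);
        return nmapped ? 0 : -ENOMEM;
}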

Several mechanisms exist today to facilitate device memory access.

ZONE_DEVICE is a new zone for device memory in the memory management
subsystem. It allows pages of device memory to be described with
specialized page structures. As a result, calls like get_user_pages()
can succeed, although what can be done with these page structures may
differ from system memory. ZONE_DEVICE is further specialized into
multiple memory types, such as one type for PCI p2pmem/p2pdma and one
type for HMM.

PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing
in a PCI BAR and provides a set of calls to publish, discover, allocate,
and map such memory for peer-to-peer transactions. One feature of the
API is that the buffer is allocated by the side that does the DMA
transfer. This works well for the storage use case, but is awkward for
GPU-NIC communication, where the buffer is typically allocated by the
GPU driver rather than the NIC driver.
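
As an illustration of that model (hypothetical devices, BAR index, and
sizes; error handling omitted), the provider publishes BAR memory and
the client that performs the DMA allocates from it:

#include <linux/pci-p2pdma.h>
#include <linux/sizes.h>

/* Provider side: expose part of a BAR as p2p memory and publish it. */
static int provider_setup(struct pci_dev *provider)
{
        int rc = pci_p2pdma_add_resource(provider, 4 /* BAR */, SZ_1M, 0);

        if (rc)
                return rc;
        pci_p2pmem_publish(provider, true);
        return 0;
}

/* Client side: the device doing the DMA finds a provider and allocates
 * the buffer itself -- the part that fits storage better than GPU-NIC
 * communication, where the GPU driver owns the allocation. */
static void *client_alloc(struct pci_dev *client, struct pci_dev **provider)
{
        *provider = pci_p2pmem_find(&client->dev);
        if (!*provider)
                return NULL;
        return pci_alloc_p2pmem(*provider, SZ_64K);
}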

Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
and ZONE_DEVICE to support shared virtual address space and page
migration between system memory and device memory. HMM doesn't support
pinning device memory because pages located on the device must be able
to migrate to system memory when accessed by the CPU. Peer-to-peer
access is possible only if the peer can handle page faults; for RDMA,
that means the NIC must support on-demand paging.
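
To illustrate what that requirement implies for an importer (not code
from this series; the struct and the unmap helper are hypothetical), an
ODP-style consumer registers an mmu_interval_notifier and tears down its
device mapping when the pages move, so the next access faults and the
pages are looked up again:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

struct my_mr {
        struct mmu_interval_notifier notifier;  /* covers the user VA range */
        /* device-specific mapping state ... */
};

/* Hypothetical helper that invalidates the NIC translation for a range. */
static void my_nic_invalidate_range(struct my_mr *mr, unsigned long start,
                                    unsigned long end);

static bool my_invalidate(struct mmu_interval_notifier *mni,
                          const struct mmu_notifier_range *range,
                          unsigned long cur_seq)
{
        struct my_mr *mr = container_of(mni, struct my_mr, notifier);

        mmu_interval_set_seq(mni, cur_seq);
        /* Drop the device mapping; it is re-established on the next
         * page fault, possibly after migration back to system memory. */
        my_nic_invalidate_range(mr, range->start, range->end);
        return true;
}

static const struct mmu_interval_notifier_ops my_mni_ops = {
        .invalidate = my_invalidate,
};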

Dma-buf is a standard mechanism for sharing buffers among different
device drivers. The buffer to be shared is exported by the owning
driver and imported by the driver that wants to use it. The exporter
provides a set of ops that the importer can call to pin and map the
buffer. In addition, a file descriptor can be associated with a dma-
buf object as the handle that can be passed to user space.
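
A minimal sketch of that importer-side flow (error handling omitted;
'fd' is the file descriptor received from user space and 'dev' is the
importing device):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>

static struct sg_table *import_dmabuf(int fd, struct device *dev,
                                      struct dma_buf_attachment **attach_out)
{
        struct dma_buf *dmabuf;
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        dmabuf = dma_buf_get(fd);               /* take a reference via the fd */
        attach = dma_buf_attach(dmabuf, dev);   /* attach the importing device */
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

        *attach_out = attach;
        return sgt;     /* DMA addresses the device can use */
}

/* Teardown mirrors the setup:
 *      dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
 *      dma_buf_detach(dmabuf, attach);
 *      dma_buf_put(dmabuf);
 */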

This patch series adds a dma-buf importer role to the RDMA subsystem in
an attempt to support RDMA using device memory such as GPU VRAM. Dma-buf
is chosen for a few reasons: first, the API is relatively simple and
allows a lot of flexibility in implementing the buffer manipulation ops.
Second, it doesn't require page structures. Third, dma-buf is already
supported in many GPU drivers. However, we are aware that existing GPU
drivers don't allow pinning device memory via the dma-buf interface:
pinning and mapping a dma-buf causes the backing storage to migrate to
system RAM. This is due to the lack of knowledge about whether the
importer can perform peer-to-peer access and the lack of resource limit
controls for the GPU. On the first point, the latest dma-buf driver has
a peer-to-peer flag for the importer, but the flag is currently tied to
dynamic mapping support, which requires on-demand paging support from
the NIC to work. There are a few possible ways to address these issues,
such as decoupling the peer-to-peer flag from dynamic mapping, allowing
more leeway for individual drivers to make the pinning decision, and
adding GPU resource limit control via cgroup. We would like to get
comments on this patch series under the assumption that device memory
pinning via dma-buf is supported by some GPU drivers, and at the same
time we welcome open discussion on how to address the aforementioned
issues as well as GPU-NIC peer-to-peer access solutions in general.
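
For reference, the coupling mentioned above: the peer-to-peer flag sits
in the dynamic-attachment importer ops, so using it currently means also
implementing move_notify, i.e. being able to invalidate and re-fault
like an ODP-capable NIC (sketch only; the callback body is hypothetical):

#include <linux/dma-buf.h>

static void my_move_notify(struct dma_buf_attachment *attach)
{
        /* The exporter is about to move the buffer: invalidate the NIC
         * mapping here; it gets re-established on the next page fault. */
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,        /* importer can do PCIe p2p */
        .move_notify = my_move_notify,
};

/* attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, priv); */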

This is the second version of the patch series. Here are the changes
from the previous version:
* The Kconfig option is removed. There is no dependency issue since the
dma-buf driver is always enabled.
* The declarations of the new data structure and functions are
reorganized to minimize the visibility of the changes.
* The new uverbs command now goes through ioctl() instead of write().
* The rereg functionality is removed.
* Instead of adding a new device method for dma-buf specific
registration, the existing method is extended to accept an extra
parameter.
* The correct function is now used for address range checking.

This series is organized as follows. The first patch adds the common
code for importing a dma-buf from a file descriptor and pinning and
mapping the dma-buf pages. Patch 2 extends the reg_user_mr() method of
the ib_device structure to accept a dma-buf file descriptor as an extra
parameter; vendor drivers are updated with the change. Patch 3 adds a
new uverbs command for registering a dma-buf based memory region.
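
Purely as an illustration of the idea in patch 2 (the actual signature
is defined there and may differ), the extension amounts to threading a
dma-buf fd through the existing driver method, with an invalid fd
preserving the current behavior:

        struct ib_mr *(*reg_user_mr)(struct ib_pd *pd, u64 start,
                                     u64 length, u64 virt_addr,
                                     int dmabuf_fd, int access_flags,
                                     struct ib_udata *udata);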

Related user space RDMA library changes will be provided as a separate
patch series.

Jianxin Xiong (3):
  RDMA/umem: Support importing dma-buf as user memory region
  RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf
  RDMA/uverbs: Add uverbs command for dma-buf based MR registration

 drivers/infiniband/core/Makefile                |   2 +-
 drivers/infiniband/core/umem.c                  |   4 +
 drivers/infiniband/core/umem_dmabuf.c           | 105 ++++++++++++++++++++++
 drivers/infiniband/core/umem_dmabuf.h           |  11 +++
 drivers/infiniband/core/uverbs_cmd.c            |   2 +-
 drivers/infiniband/core/uverbs_std_types_mr.c   | 112 ++++++++++++++++++++++++
 drivers/infiniband/core/verbs.c                 |   2 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.c        |   7 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h        |   2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   3 +-
 drivers/infiniband/hw/cxgb4/mem.c               |   8 +-
 drivers/infiniband/hw/efa/efa.h                 |   2 +-
 drivers/infiniband/hw/efa/efa_verbs.c           |   7 +-
 drivers/infiniband/hw/hns/hns_roce_device.h     |   2 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c         |   7 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c       |   6 ++
 drivers/infiniband/hw/mlx4/mlx4_ib.h            |   2 +-
 drivers/infiniband/hw/mlx4/mr.c                 |   7 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h            |   2 +-
 drivers/infiniband/hw/mlx5/mr.c                 |  45 +++++++++-
 drivers/infiniband/hw/mthca/mthca_provider.c    |   8 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c     |   9 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h     |   3 +-
 drivers/infiniband/hw/qedr/verbs.c              |   8 +-
 drivers/infiniband/hw/qedr/verbs.h              |   3 +-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c    |   8 +-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.h    |   2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c    |   6 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h |   2 +-
 drivers/infiniband/sw/rdmavt/mr.c               |   6 +-
 drivers/infiniband/sw/rdmavt/mr.h               |   2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c           |   6 ++
 drivers/infiniband/sw/siw/siw_verbs.c           |   8 +-
 drivers/infiniband/sw/siw/siw_verbs.h           |   3 +-
 include/rdma/ib_umem.h                          |  14 ++-
 include/rdma/ib_verbs.h                         |   4 +-
 include/uapi/rdma/ib_user_ioctl_cmds.h          |  14 +++
 37 files changed, 410 insertions(+), 34 deletions(-)
 create mode 100644 drivers/infiniband/core/umem_dmabuf.c
 create mode 100644 drivers/infiniband/core/umem_dmabuf.h

-- 
1.8.3.1

