* RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support
       [not found]     ` <20200630173435.GK25301@ziepe.ca>
@ 2020-06-30 18:46       ` Xiong, Jianxin
  2020-06-30 19:17         ` Jason Gunthorpe
  2020-07-01  9:03         ` Christian König
  0 siblings, 2 replies; 28+ messages in thread
From: Xiong, Jianxin @ 2020-06-30 18:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Christian Koenig,
	Doug Ledford, Vetter, Daniel

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, June 30, 2020 10:35 AM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > Heterogeneous Memory Management (HMM) utilizes
> > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > address space and page migration between system memory and device
> > > > memory. HMM doesn't support pinning device memory because pages
> > > > located on device must be able to migrate to system memory when
> > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > >
> > > peer-peer access is currently not possible with hmm_range_fault().
> >
> > Currently hmm_range_fault() always sets the cpu access flag and device
> > private pages are migrated to the system RAM in the fault handler.
> > However, it's possible to have a modified code flow to keep the device
> > private page info for use with peer to peer access.
> 
> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.

But the pfn is supposed to be all that is needed.

> 
> > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> >
> > That's right, the patch alone is just half of the story. The
> > functionality depends on availability of dma-buf exporter that can pin
> > the device memory.
> 
> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> other parts need to be settled out first.

On the RDMA side, we mainly want to check whether the changes are acceptable, for example
the part about adding 'fd' to the device ops and the ioctl interface. All the previous
comments have been very helpful for refining the patch so that we are ready when
GPU-side support becomes available.

> 
> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> support non-dynamic mapping?

We are working in that direction.

> 
> > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > whether the importer can perform peer-to-peer access and the lack
> > > > of resource limit control measure for GPU. For the first part, the
> > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > but the flag is currently tied to dynamic mapping support, which
> > > > requires on-demand paging support from the NIC to work.
> > >
> > > ODP for DMA buf?
> >
> > Right.
> 
> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> locking might be a bit tricky though)

The main issue is that not all NICs support ODP.
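
For NICs that do support ODP, the dynamic kind of MR would roughly look like the
sketch below (just a strawman; it assumes the dma_buf_dynamic_attach()/move_notify
importer interface, uses a made-up invalidation helper, and leaves out the locking):

#include <linux/dma-buf.h>

/* called by the exporter before it moves the backing storage */
static void ib_umem_dmabuf_move_notify(struct dma_buf_attachment *attach)
{
	struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;

	/* tear down the NIC mapping; ODP re-faults the pages later */
	ib_umem_dmabuf_invalidate(umem_dmabuf);	/* hypothetical helper */
}

static const struct dma_buf_attach_ops ib_umem_dmabuf_attach_ops = {
	.move_notify = ib_umem_dmabuf_move_notify,
};

/* instead of dma_buf_attach() in the current patch: */
attach = dma_buf_dynamic_attach(dmabuf, device->dma_device,
				&ib_umem_dmabuf_attach_ops, umem_dmabuf);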

> 
> > > > There are a few possible ways to address these issues, such as
> > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > leeway for individual drivers to make the pinning decision and
> > > > adding GPU resource limit control via cgroup. We would like to get
> > > > comments on this patch series with the assumption that device
> > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > at the same time welcome open discussions on how to address the
> > > > aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
> > >
> > > These seem like DMA buf problems, not RDMA problems, why are you
> > > asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
> >
> > The intention is to have people from both RDMA and DMA buffer side to
> > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > MAINTAINERS file. I agree more people could be invited to the discussion.
> > Just added Christian Koenig to the cc-list.
> 
> Would be good to have added the drm lists too

Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.

> 
> > If the umem_description you mentioned is for information used to
> > create the umem (e.g. a structure for all the parameters), then this would work better.
> 
> It would make some more sense, and avoid all these weird EOPNOTSUPPS.

Good, thanks for the suggestion.
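
Something like this, perhaps (just a strawman, all names made up):

	struct ib_umem_description {
		unsigned long	addr;		/* user VA of the region */
		size_t		length;		/* length in bytes */
		int		access_flags;	/* IB_ACCESS_* */
		int		dmabuf_fd;	/* dma-buf fd, or -1 if unused */
	};

	/* one entry point instead of ib_umem_get() vs. ib_umem_dmabuf_get() */
	struct ib_umem *ib_umem_get_desc(struct ib_device *device,
					 const struct ib_umem_description *desc);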

> 
> Jason

* RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support
       [not found] <1593451903-30959-1-git-send-email-jianxin.xiong@intel.com>
       [not found] ` <20200629185152.GD25301@ziepe.ca>
@ 2020-06-30 18:56 ` Xiong, Jianxin
       [not found] ` <1593451903-30959-2-git-send-email-jianxin.xiong@intel.com>
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Xiong, Jianxin @ 2020-06-30 18:56 UTC (permalink / raw)
  To: linux-rdma
  Cc: Leon Romanovsky, dri-devel, Jason Gunthorpe, Doug Ledford,
	Vetter, Daniel, Christian Koenig

Added to cc-list:
Christian Koenig <christian.koenig@amd.com>
dri-devel@lists.freedesktop.org

> -----Original Message-----
> From: Xiong, Jianxin <jianxin.xiong@intel.com>
> Sent: Monday, June 29, 2020 10:32 AM
> To: linux-rdma@vger.kernel.org
> Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@ziepe.ca>; Sumit Semwal
> <sumit.semwal@linaro.org>; Leon Romanovsky <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>
> Subject: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> When enabled, an RDMA capable NIC can perform peer-to-peer transactions
> over PCIe to access the local memory located on another device. This can
> often lead to better performance than using a system memory buffer for
> RDMA and copying data between the buffer and device memory.
> 
> Current kernel RDMA stack uses get_user_pages() to pin the physical
> pages backing the user buffer and uses dma_map_sg_attrs() to get the
> dma addresses for memory access. This usually doesn't work for peer
> device memory due to the lack of associated page structures.
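> 
> Conceptually that pin-and-map path looks like the following (simplified
> sketch, error unwinding omitted; the real code lives in ib_umem_get()):
> 
> 	#include <linux/mm.h>
> 	#include <linux/scatterlist.h>
> 	#include <linux/dma-mapping.h>
> 
> 	static int pin_and_map(struct device *dma_dev, unsigned long addr,
> 			       unsigned long npages, struct page **pages,
> 			       struct sg_table *sgt)
> 	{
> 		/* pin the pages backing the user VA range */
> 		if (get_user_pages_fast(addr, npages, FOLL_WRITE, pages) != npages)
> 			return -EFAULT;
> 
> 		/* build a scatterlist and get bus addresses for the device */
> 		if (sg_alloc_table_from_pages(sgt, pages, npages, 0,
> 					      npages << PAGE_SHIFT, GFP_KERNEL))
> 			return -ENOMEM;
> 
> 		return dma_map_sg_attrs(dma_dev, sgt->sgl, sgt->orig_nents,
> 					DMA_BIDIRECTIONAL, 0) ? 0 : -ENOMEM;
> 	}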
> 
> Several mechanisms exist today to facilitate device memory access.
> 
> ZONE_DEVICE is a new zone for device memory in the memory management
> subsystem. It allows pages from device memory to be described with
> specialized page structures. As a result, calls like get_user_pages()
> can succeed, but what can be done with these page structures may
> differ from system memory. It is further specialized into multiple
> memory types, such as one type for PCI p2pmem/p2pdma and one type for
> HMM.
> 
> PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing
> in a PCI BAR and provides a set of calls to publish, discover, allocate,
> and map such memory for peer-to-peer transactions. One feature of the
> API is that the buffer is allocated by the side that does the DMA
> transfer. This works well with the storage use case, but is awkward
> with GPU-NIC communication, where typically the buffer is allocated by
> the GPU driver rather than the NIC driver.
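> 
> As a rough illustration of that model (provider-side sketch only, based
> on the pci_p2pdma_* helpers; error handling trimmed):
> 
> 	#include <linux/pci-p2pdma.h>
> 	#include <linux/sizes.h>
> 
> 	static int expose_p2pmem(struct pci_dev *pdev)
> 	{
> 		/* register 1 MiB of BAR 4 as p2p DMA memory */
> 		int rc = pci_p2pdma_add_resource(pdev, 4, SZ_1M, 0);
> 
> 		if (rc)
> 			return rc;
> 
> 		/* make the pool discoverable; client drivers that do the DMA
> 		 * then allocate their buffers from it with pci_alloc_p2pmem()
> 		 */
> 		pci_p2pmem_publish(pdev, true);
> 		return 0;
> 	}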
> 
> Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
> and ZONE_DEVICE to support shared virtual address space and page
> migration between system memory and device memory. HMM doesn't support
> pinning device memory because pages located on device must be able to
> migrate to system memory when accessed by CPU. Peer-to-peer access
> is possible if the peer can handle page fault. For RDMA, that means
> the NIC must support on-demand paging.
> 
> Dma-buf is a standard mechanism for sharing buffers among different
> device drivers. The buffer to be shared is exported by the owning
> driver and imported by the driver that wants to use it. The exporter
> provides a set of ops that the importer can call to pin and map the
> buffer. In addition, a file descriptor can be associated with a dma-
> buf object as the handle that can be passed to user space.
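> 
> The importer-side flow (which patch 1 of this series wraps in
> ib_umem_dmabuf_get()) boils down to:
> 
> 	#include <linux/dma-buf.h>
> 
> 	/* sketch: turn a dma-buf fd into a dma-mapped scatterlist */
> 	dmabuf = dma_buf_get(fd);
> 	attach = dma_buf_attach(dmabuf, device->dma_device);
> 	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
> 
> 	/* ... use sgt for DMA ... */
> 
> 	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
> 	dma_buf_detach(dmabuf, attach);
> 	dma_buf_put(dmabuf);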
> 
> This patch series adds the dma-buf importer role to the RDMA drivers in
> an attempt to support RDMA using device memory such as GPU VRAM. Dma-buf is
> chosen for a few reasons: first, the API is relatively simple and allows
> a lot of flexibility in implementing the buffer manipulation ops.
> Second, it doesn't require page structures. Third, dma-buf is already
> supported in many GPU drivers. However, we are aware that existing GPU
> drivers don't allow pinning device memory via the dma-buf interface.
> Pinning and mapping a dma-buf would cause the backing storage to migrate
> to system RAM. This is due to the lack of knowledge about whether the
> importer can perform peer-to-peer access and the lack of resource limit
> control measures for GPUs. For the first part, the latest dma-buf driver
> has a peer-to-peer flag for the importer, but the flag is currently tied
> to dynamic mapping support, which requires on-demand paging support from
> the NIC to work. There are a few possible ways to address these issues,
> such as decoupling peer-to-peer flag from dynamic mapping, allowing more
> leeway for individual drivers to make the pinning decision and adding
> GPU resource limit control via cgroup. We would like to get comments on
> this patch series with the assumption that device memory pinning via
> dma-buf is supported by some GPU drivers, and at the same time welcome
> open discussions on how to address the aforementioned issues as well as
> GPU-NIC peer-to-peer access solutions in general.
> 
> This is the second version of the patch series. Here are the changes
> from the previous version:
> * The Kconfig option is removed. There is no dependency issue since
> the dma-buf driver is always enabled.
> * The declarations of the new data structure and functions are reorganized to
> minimize the visibility of the changes.
> * The new uverbs command now goes through ioctl() instead of write().
> * The rereg functionality is removed.
> * Instead of adding a new device method for dma-buf specific registration,
> the existing method is extended to accept an extra parameter.
> * The correct function is now used for address range checking.
> 
> This series is organized as follows. The first patch adds the common
> code for importing dma-buf from a file descriptor and pinning and
> mapping the dma-buf pages. Patch 2 extends the reg_user_mr() method
> of the ib_device structure to accept dma-buf file descriptor as an extra
> parameter. Vendor drivers are updated with the change. Patch 3 adds a
> new uverbs command for registering dma-buf based memory region.
> 
> Related user space RDMA library changes will be provided as a separate
> patch series.
> 
> Jianxin Xiong (3):
>   RDMA/umem: Support importing dma-buf as user memory region
>   RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf
>   RDMA/uverbs: Add uverbs command for dma-buf based MR registration
> 
>  drivers/infiniband/core/Makefile                |   2 +-
>  drivers/infiniband/core/umem.c                  |   4 +
>  drivers/infiniband/core/umem_dmabuf.c           | 105 ++++++++++++++++++++++
>  drivers/infiniband/core/umem_dmabuf.h           |  11 +++
>  drivers/infiniband/core/uverbs_cmd.c            |   2 +-
>  drivers/infiniband/core/uverbs_std_types_mr.c   | 112 ++++++++++++++++++++++++
>  drivers/infiniband/core/verbs.c                 |   2 +-
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c        |   7 +-
>  drivers/infiniband/hw/bnxt_re/ib_verbs.h        |   2 +-
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   3 +-
>  drivers/infiniband/hw/cxgb4/mem.c               |   8 +-
>  drivers/infiniband/hw/efa/efa.h                 |   2 +-
>  drivers/infiniband/hw/efa/efa_verbs.c           |   7 +-
>  drivers/infiniband/hw/hns/hns_roce_device.h     |   2 +-
>  drivers/infiniband/hw/hns/hns_roce_mr.c         |   7 +-
>  drivers/infiniband/hw/i40iw/i40iw_verbs.c       |   6 ++
>  drivers/infiniband/hw/mlx4/mlx4_ib.h            |   2 +-
>  drivers/infiniband/hw/mlx4/mr.c                 |   7 +-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h            |   2 +-
>  drivers/infiniband/hw/mlx5/mr.c                 |  45 +++++++++-
>  drivers/infiniband/hw/mthca/mthca_provider.c    |   8 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c     |   9 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h     |   3 +-
>  drivers/infiniband/hw/qedr/verbs.c              |   8 +-
>  drivers/infiniband/hw/qedr/verbs.h              |   3 +-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c    |   8 +-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.h    |   2 +-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c    |   6 +-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h |   2 +-
>  drivers/infiniband/sw/rdmavt/mr.c               |   6 +-
>  drivers/infiniband/sw/rdmavt/mr.h               |   2 +-
>  drivers/infiniband/sw/rxe/rxe_verbs.c           |   6 ++
>  drivers/infiniband/sw/siw/siw_verbs.c           |   8 +-
>  drivers/infiniband/sw/siw/siw_verbs.h           |   3 +-
>  include/rdma/ib_umem.h                          |  14 ++-
>  include/rdma/ib_verbs.h                         |   4 +-
>  include/uapi/rdma/ib_user_ioctl_cmds.h          |  14 +++
>  37 files changed, 410 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> 
> --
> 1.8.3.1


* RE: [RFC PATCH v2 1/3] RDMA/umem: Support importing dma-buf as user memory region
       [not found] ` <1593451903-30959-2-git-send-email-jianxin.xiong@intel.com>
@ 2020-06-30 19:04   ` Xiong, Jianxin
  0 siblings, 0 replies; 28+ messages in thread
From: Xiong, Jianxin @ 2020-06-30 19:04 UTC (permalink / raw)
  To: linux-rdma
  Cc: Leon Romanovsky, dri-devel, Jason Gunthorpe, Doug Ledford,
	Vetter,  Daniel, Christian Koenig

Cc'd drm people.

> -----Original Message-----
> From: Xiong, Jianxin <jianxin.xiong@intel.com>
> Sent: Monday, June 29, 2020 10:32 AM
> To: linux-rdma@vger.kernel.org
> Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@ziepe.ca>; Sumit Semwal
> <sumit.semwal@linaro.org>; Leon Romanovsky <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>
> Subject: [RFC PATCH v2 1/3] RDMA/umem: Support importing dma-buf as user memory region
> 
> Dma-buf is a standard cross-driver buffer sharing mechanism that can be
> used to support peer-to-peer access from RDMA devices.
> 
> Device memory exported via dma-buf can be associated with a file
> descriptor. This is passed to the user space as a property associated
> with the buffer allocation. When the buffer is registered as a memory
> region, the file descriptor is passed to the RDMA driver along with
> other parameters.
> 
> Implement the common code for importing dma-buf object, and pinning and
> mapping dma-buf pages.
> 
> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
> Reviewed-by: Sean Hefty <sean.hefty@intel.com>
> Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> ---
>  drivers/infiniband/core/Makefile      |   2 +-
>  drivers/infiniband/core/umem.c        |   4 ++
>  drivers/infiniband/core/umem_dmabuf.c | 105 ++++++++++++++++++++++++++++++++++
>  drivers/infiniband/core/umem_dmabuf.h |  11 ++++
>  include/rdma/ib_umem.h                |  14 ++++-
>  5 files changed, 134 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> 
> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> index d1b14887..0d4efa0 100644
> --- a/drivers/infiniband/core/Makefile
> +++ b/drivers/infiniband/core/Makefile
> @@ -37,5 +37,5 @@ ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
>  				uverbs_std_types_mr.o uverbs_std_types_counters.o \
>  				uverbs_uapi.o uverbs_std_types_device.o \
>  				uverbs_std_types_async_fd.o
> -ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
> +ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o umem_dmabuf.o
>  ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 82455a1..bf1a6c1 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2005 Topspin Communications.  All rights reserved.
>   * Copyright (c) 2005 Cisco Systems.  All rights reserved.
>   * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -42,6 +43,7 @@
>  #include <rdma/ib_umem_odp.h>
> 
>  #include "uverbs.h"
> +#include "umem_dmabuf.h"
> 
>  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>  {
> @@ -317,6 +319,8 @@ void ib_umem_release(struct ib_umem *umem)
>  {
>  	if (!umem)
>  		return;
> +	if (umem->is_dmabuf)
> +		return ib_umem_dmabuf_release(umem);
>  	if (umem->is_odp)
>  		return ib_umem_odp_release(to_ib_umem_odp(umem));
> 
> diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> new file mode 100644
> index 0000000..406ca64
> --- /dev/null
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +/*
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> + */
> +
> +#include <linux/dma-buf.h>
> +#include <linux/dma-mapping.h>
> +
> +#include "uverbs.h"
> +
> +struct ib_umem_dmabuf {
> +	struct ib_umem	umem;
> +	struct dma_buf	*dmabuf;
> +	struct dma_buf_attachment *attach;
> +	struct sg_table *sgt;
> +};
> +
> +static inline struct ib_umem_dmabuf *to_ib_umem_dmabuf(struct ib_umem *umem)
> +{
> +	return container_of(umem, struct ib_umem_dmabuf, umem);
> +}
> +
> +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> +				   unsigned long addr, size_t size,
> +				   int dmabuf_fd, int access)
> +{
> +	struct ib_umem_dmabuf *umem_dmabuf;
> +	struct sg_table *sgt;
> +	enum dma_data_direction dir;
> +	long ret;
> +	unsigned long end;
> +
> +	if (check_add_overflow(addr, size, &end))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (unlikely(PAGE_ALIGN(end) < PAGE_SIZE))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (access & IB_ACCESS_ON_DEMAND)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	umem_dmabuf = kzalloc(sizeof(*umem_dmabuf), GFP_KERNEL);
> +	if (!umem_dmabuf)
> +		return ERR_PTR(-ENOMEM);
> +
> +	umem_dmabuf->umem.ibdev = device;
> +	umem_dmabuf->umem.length = size;
> +	umem_dmabuf->umem.address = addr;
> +	umem_dmabuf->umem.writable = ib_access_writable(access);
> +	umem_dmabuf->umem.is_dmabuf = 1;
> +
> +	umem_dmabuf->dmabuf = dma_buf_get(dmabuf_fd);
> +	if (IS_ERR(umem_dmabuf->dmabuf)) {
> +		ret = PTR_ERR(umem_dmabuf->dmabuf);
> +		goto out_free_umem;
> +	}
> +
> +	umem_dmabuf->attach = dma_buf_attach(umem_dmabuf->dmabuf,
> +					     device->dma_device);
> +	if (IS_ERR(umem_dmabuf->attach)) {
> +		ret = PTR_ERR(umem_dmabuf->attach);
> +		goto out_release_dmabuf;
> +	}
> +
> +	dir = umem_dmabuf->umem.writable ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> +	sgt = dma_buf_map_attachment(umem_dmabuf->attach, dir);
> +	if (IS_ERR(sgt)) {
> +		ret = PTR_ERR(sgt);
> +		goto out_detach_dmabuf;
> +	}
> +
> +	umem_dmabuf->sgt = sgt;
> +	umem_dmabuf->umem.sg_head = *sgt;
> +	umem_dmabuf->umem.nmap = sgt->nents;
> +	return &umem_dmabuf->umem;
> +
> +out_detach_dmabuf:
> +	dma_buf_detach(umem_dmabuf->dmabuf, umem_dmabuf->attach);
> +
> +out_release_dmabuf:
> +	dma_buf_put(umem_dmabuf->dmabuf);
> +
> +out_free_umem:
> +	kfree(umem_dmabuf);
> +	return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(ib_umem_dmabuf_get);
> +
> +void ib_umem_dmabuf_release(struct ib_umem *umem)
> +{
> +	enum dma_data_direction dir;
> +	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(umem);
> +
> +	dir = umem_dmabuf->umem.writable ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> +
> +	/*
> +	 * Only use the original sgt returned from dma_buf_map_attachment(),
> +	 * otherwise the scatterlist may be freed twice due to the map caching
> +	 * mechanism.
> +	 */
> +	dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt, dir);
> +	dma_buf_detach(umem_dmabuf->dmabuf, umem_dmabuf->attach);
> +	dma_buf_put(umem_dmabuf->dmabuf);
> +	kfree(umem_dmabuf);
> +}
> diff --git a/drivers/infiniband/core/umem_dmabuf.h b/drivers/infiniband/core/umem_dmabuf.h
> new file mode 100644
> index 0000000..485f653
> --- /dev/null
> +++ b/drivers/infiniband/core/umem_dmabuf.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +/*
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> + */
> +
> +#ifndef UMEM_DMABUF_H
> +#define UMEM_DMABUF_H
> +
> +void ib_umem_dmabuf_release(struct ib_umem *umem);
> +
> +#endif /* UMEM_DMABUF_H */
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index e3518fd..78331ce 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2007 Cisco Systems.  All rights reserved.
> + * Copyright (c) 2020 Intel Corporation.  All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -40,6 +41,7 @@
> 
>  struct ib_ucontext;
>  struct ib_umem_odp;
> +struct ib_umem_dmabuf;
> 
>  struct ib_umem {
>  	struct ib_device       *ibdev;
> @@ -48,6 +50,7 @@ struct ib_umem {
>  	unsigned long		address;
>  	u32 writable : 1;
>  	u32 is_odp : 1;
> +	u32 is_dmabuf : 1;
>  	struct work_struct	work;
>  	struct sg_table sg_head;
>  	int             nmap;
> @@ -78,6 +81,9 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  				     unsigned long pgsz_bitmap,
>  				     unsigned long virt);
> +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> +				   unsigned long addr, size_t size,
> +				   int dmabuf_fd, int access);
> 
>  #else /* CONFIG_INFINIBAND_USER_MEM */
> 
> @@ -100,7 +106,13 @@ static inline int ib_umem_find_best_pgsz(struct ib_umem *umem,
>  					 unsigned long virt) {
>  	return -EINVAL;
>  }
> +static inline struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> +						 unsigned long addr,
> +						 size_t size, int dmabuf_fd,
> +						 int access)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> 
>  #endif /* CONFIG_INFINIBAND_USER_MEM */
> -
>  #endif /* IB_UMEM_H */
> --
> 1.8.3.1


* RE: [RFC PATCH v2 2/3] RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf
       [not found] ` <1593451903-30959-3-git-send-email-jianxin.xiong@intel.com>
@ 2020-06-30 19:04   ` Xiong, Jianxin
  0 siblings, 0 replies; 28+ messages in thread
From: Xiong, Jianxin @ 2020-06-30 19:04 UTC (permalink / raw)
  To: linux-rdma
  Cc: Leon Romanovsky, dri-devel, Jason Gunthorpe, Doug Ledford,
	Vetter, Daniel, Christian Koenig

Cc'd drm people.

> -----Original Message-----
> From: Xiong, Jianxin <jianxin.xiong@intel.com>
> Sent: Monday, June 29, 2020 10:32 AM
> To: linux-rdma@vger.kernel.org
> Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@ziepe.ca>; Sumit Semwal
> <sumit.semwal@linaro.org>; Leon Romanovsky <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>
> Subject: [RFC PATCH v2 2/3] RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf
> 
> Add a parameter 'fd' for the file descriptor associated with the dma-buf
> object to be imported. A negative value indicates that dma-buf is not
> used.
> 
> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
> Reviewed-by: Sean Hefty <sean.hefty@intel.com>
> Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> ---
>  drivers/infiniband/core/uverbs_cmd.c            |  2 +-
>  drivers/infiniband/core/verbs.c                 |  2 +-
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c        |  7 +++-
>  drivers/infiniband/hw/bnxt_re/ib_verbs.h        |  2 +-
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |  3 +-
>  drivers/infiniband/hw/cxgb4/mem.c               |  8 ++++-
>  drivers/infiniband/hw/efa/efa.h                 |  2 +-
>  drivers/infiniband/hw/efa/efa_verbs.c           |  7 +++-
>  drivers/infiniband/hw/hns/hns_roce_device.h     |  2 +-
>  drivers/infiniband/hw/hns/hns_roce_mr.c         |  7 +++-
>  drivers/infiniband/hw/i40iw/i40iw_verbs.c       |  6 ++++
>  drivers/infiniband/hw/mlx4/mlx4_ib.h            |  2 +-
>  drivers/infiniband/hw/mlx4/mr.c                 |  7 +++-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +-
>  drivers/infiniband/hw/mlx5/mr.c                 | 45 ++++++++++++++++++++++---
>  drivers/infiniband/hw/mthca/mthca_provider.c    |  8 ++++-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c     |  9 ++++-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h     |  3 +-
>  drivers/infiniband/hw/qedr/verbs.c              |  8 ++++-
>  drivers/infiniband/hw/qedr/verbs.h              |  3 +-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c    |  8 ++++-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.h    |  2 +-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c    |  6 +++-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h |  2 +-
>  drivers/infiniband/sw/rdmavt/mr.c               |  6 +++-
>  drivers/infiniband/sw/rdmavt/mr.h               |  2 +-
>  drivers/infiniband/sw/rxe/rxe_verbs.c           |  6 ++++
>  drivers/infiniband/sw/siw/siw_verbs.c           |  8 ++++-
>  drivers/infiniband/sw/siw/siw_verbs.h           |  3 +-
>  include/rdma/ib_verbs.h                         |  4 +--
>  30 files changed, 150 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> index 060b4eb..0199da2 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -757,7 +757,7 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
>  	}
> 
>  	mr = pd->device->ops.reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
> -					 cmd.access_flags,
> +					 -1, cmd.access_flags,
>  					 &attrs->driver_udata);
>  	if (IS_ERR(mr)) {
>  		ret = PTR_ERR(mr);
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index 56a7133..aa067b2 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -2003,7 +2003,7 @@ struct ib_mr *ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  	}
> 
>  	mr = pd->device->ops.reg_user_mr(pd, start, length, virt_addr,
> -					 access_flags, NULL);
> +					 -1, access_flags, NULL);
> 
>  	if (IS_ERR(mr))
>  		return mr;
> diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> index 95f6d49..af40457 100644
> --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> @@ -3695,7 +3695,7 @@ static int fill_umem_pbl_tbl(struct ib_umem *umem, u64 *pbl_tbl_orig,
> 
>  /* uverbs */
>  struct ib_mr *bnxt_re_reg_user_mr(struct ib_pd *ib_pd, u64 start, u64 length,
> -				  u64 virt_addr, int mr_access_flags,
> +				  u64 virt_addr, int fd, int mr_access_flags,
>  				  struct ib_udata *udata)
>  {
>  	struct bnxt_re_pd *pd = container_of(ib_pd, struct bnxt_re_pd, ib_pd);
> @@ -3728,6 +3728,11 @@ struct ib_mr *bnxt_re_reg_user_mr(struct ib_pd *ib_pd, u64 start, u64 length,
>  	/* The fixed portion of the rkey is the same as the lkey */
>  	mr->ib_mr.rkey = mr->qplib_mr.rkey;
> 
> +	if (fd >= 0) {
> +		rc = -EOPNOTSUPP;
> +		goto free_mrw;
> +	}
> +
>  	umem = ib_umem_get(&rdev->ibdev, start, length, mr_access_flags);
>  	if (IS_ERR(umem)) {
>  		ibdev_err(&rdev->ibdev, "Failed to get umem");
> diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
> index 23d972d..040e85e 100644
> --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
> +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
> @@ -203,7 +203,7 @@ struct ib_mw *bnxt_re_alloc_mw(struct ib_pd *ib_pd, enum ib_mw_type type,
>  			       struct ib_udata *udata);
>  int bnxt_re_dealloc_mw(struct ib_mw *mw);
>  struct ib_mr *bnxt_re_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				  u64 virt_addr, int mr_access_flags,
> +				  u64 virt_addr, int fd, int mr_access_flags,
>  				  struct ib_udata *udata);
>  int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata);
>  void bnxt_re_dealloc_ucontext(struct ib_ucontext *context);
> diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
> index e8e11bd..e374f89 100644
> --- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
> +++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
> @@ -988,7 +988,8 @@ int c4iw_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
>  struct ib_mw *c4iw_alloc_mw(struct ib_pd *pd, enum ib_mw_type type,
>  			    struct ib_udata *udata);
>  struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start,
> -					   u64 length, u64 virt, int acc,
> +					   u64 length, u64 virt,
> +					   int fd, int acc,
>  					   struct ib_udata *udata);
>  struct ib_mr *c4iw_get_dma_mr(struct ib_pd *pd, int acc);
>  int c4iw_dereg_mr(struct ib_mr *ib_mr, struct ib_udata *udata);
> diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
> index 962dc97..b0b912a 100644
> --- a/drivers/infiniband/hw/cxgb4/mem.c
> +++ b/drivers/infiniband/hw/cxgb4/mem.c
> @@ -506,7 +506,8 @@ struct ib_mr *c4iw_get_dma_mr(struct ib_pd *pd, int acc)
>  }
> 
>  struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -			       u64 virt, int acc, struct ib_udata *udata)
> +			       u64 virt, int fd, int acc,
> +			       struct ib_udata *udata)
>  {
>  	__be64 *pages;
>  	int shift, n, i;
> @@ -543,6 +544,11 @@ struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> 
>  	mhp->rhp = rhp;
> 
> +	if (fd >= 0) {
> +		err = -EOPNOTSUPP;
> +		goto err_free_skb;
> +	}
> +
>  	mhp->umem = ib_umem_get(pd->device, start, length, acc);
>  	if (IS_ERR(mhp->umem))
>  		goto err_free_skb;
> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
> index aa7396a..21872a3 100644
> --- a/drivers/infiniband/hw/efa/efa.h
> +++ b/drivers/infiniband/hw/efa/efa.h
> @@ -142,7 +142,7 @@ struct ib_qp *efa_create_qp(struct ib_pd *ibpd,
>  int efa_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
>  		  struct ib_udata *udata);
>  struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length,
> -			 u64 virt_addr, int access_flags,
> +			 u64 virt_addr, int fd, int access_flags,
>  			 struct ib_udata *udata);
>  int efa_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
>  int efa_get_port_immutable(struct ib_device *ibdev, u8 port_num,
> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> index bf3120f..1362694 100644
> --- a/drivers/infiniband/hw/efa/efa_verbs.c
> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> @@ -1346,7 +1346,7 @@ static int efa_create_pbl(struct efa_dev *dev,
>  }
> 
>  struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length,
> -			 u64 virt_addr, int access_flags,
> +			 u64 virt_addr, int fd, int access_flags,
>  			 struct ib_udata *udata)
>  {
>  	struct efa_dev *dev = to_edev(ibpd->device);
> @@ -1386,6 +1386,11 @@ struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length,
>  		goto err_out;
>  	}
> 
> +	if (fd >= 0) {
> +		err = -EOPNOTSUPP;
> +		goto err_free;
> +	}
> +
>  	mr->umem = ib_umem_get(ibpd->device, start, length, access_flags);
>  	if (IS_ERR(mr->umem)) {
>  		err = PTR_ERR(mr->umem);
> diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
> index d7dcf6e..c2b94c6 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_device.h
> +++ b/drivers/infiniband/hw/hns/hns_roce_device.h
> @@ -1185,7 +1185,7 @@ int hns_roce_create_ah(struct ib_ah *ah, struct rdma_ah_attr *ah_attr,
> 
>  struct ib_mr *hns_roce_get_dma_mr(struct ib_pd *pd, int acc);
>  struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				   u64 virt_addr, int access_flags,
> +				   u64 virt_addr, int fd, int access_flags,
>  				   struct ib_udata *udata);
>  int hns_roce_rereg_user_mr(struct ib_mr *mr, int flags, u64 start, u64 length,
>  			   u64 virt_addr, int mr_access_flags, struct ib_pd *pd,
> diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
> index b9898e7..24dd79a 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_mr.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
> @@ -1130,7 +1130,7 @@ static int hns_roce_ib_umem_write_mr(struct hns_roce_dev *hr_dev,
>  }
> 
>  struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				   u64 virt_addr, int access_flags,
> +				   u64 virt_addr, int fd, int access_flags,
>  				   struct ib_udata *udata)
>  {
>  	struct hns_roce_dev *hr_dev = to_hr_dev(pd->device);
> @@ -1145,6 +1145,11 @@ struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  	if (!mr)
>  		return ERR_PTR(-ENOMEM);
> 
> +	if (fd >= 0) {
> +		ret = -EOPNOTSUPP;
> +		goto err_free;
> +	}
> +
>  	mr->umem = ib_umem_get(pd->device, start, length, access_flags);
>  	if (IS_ERR(mr->umem)) {
>  		ret = PTR_ERR(mr->umem);
> diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> index fa12929..d86bf12 100644
> --- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> @@ -1727,6 +1727,7 @@ static int i40iw_hwreg_mr(struct i40iw_device *iwdev,
>   * @start: virtual start address
>   * @length: length of mr
>   * @virt: virtual address
> + * @fd: file descriptor (for dma-buf)
>   * @acc: access of mr
>   * @udata: user data
>   */
> @@ -1734,6 +1735,7 @@ static struct ib_mr *i40iw_reg_user_mr(struct ib_pd *pd,
>  				       u64 start,
>  				       u64 length,
>  				       u64 virt,
> +				       int fd,
>  				       int acc,
>  				       struct ib_udata *udata)
>  {
> @@ -1764,6 +1766,10 @@ static struct ib_mr *i40iw_reg_user_mr(struct ib_pd *pd,
> 
>  	if (length > I40IW_MAX_MR_SIZE)
>  		return ERR_PTR(-EINVAL);
> +
> +	if (fd >= 0)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
>  	region = ib_umem_get(pd->device, start, length, acc);
>  	if (IS_ERR(region))
>  		return (struct ib_mr *)region;
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index d188573..d9db34b 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -732,7 +732,7 @@ int mlx4_ib_db_map_user(struct ib_udata *udata, unsigned long virt,
>  int mlx4_ib_umem_write_mtt(struct mlx4_ib_dev *dev, struct mlx4_mtt *mtt,
>  			   struct ib_umem *umem);
>  struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				  u64 virt_addr, int access_flags,
> +				  u64 virt_addr, int fd, int access_flags,
>  				  struct ib_udata *udata);
>  int mlx4_ib_dereg_mr(struct ib_mr *mr, struct ib_udata *udata);
>  struct ib_mw *mlx4_ib_alloc_mw(struct ib_pd *pd, enum ib_mw_type type,
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index b0121c9..db45abb 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -402,7 +402,7 @@ static struct ib_umem *mlx4_get_umem_mr(struct ib_device *device, u64 start,
>  }
> 
>  struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				  u64 virt_addr, int access_flags,
> +				  u64 virt_addr, int fd, int access_flags,
>  				  struct ib_udata *udata)
>  {
>  	struct mlx4_ib_dev *dev = to_mdev(pd->device);
> @@ -415,6 +415,11 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  	if (!mr)
>  		return ERR_PTR(-ENOMEM);
> 
> +	if (fd >= 0) {
> +		err = -EOPNOTSUPP;
> +		goto err_free;
> +	}
> +
>  	mr->umem = mlx4_get_umem_mr(pd->device, start, length, access_flags);
>  	if (IS_ERR(mr->umem)) {
>  		err = PTR_ERR(mr->umem);
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index 2e42258..5df9f0e 100644
> --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -1189,7 +1189,7 @@ int mlx5_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
>  int mlx5_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata);
>  struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc);
>  struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				  u64 virt_addr, int access_flags,
> +				  u64 virt_addr, int fd, int access_flags,
>  				  struct ib_udata *udata);
>  int mlx5_ib_advise_mr(struct ib_pd *pd,
>  		      enum ib_uverbs_advise_mr_advice advice,
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 6fa0a83..bd06e35 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -797,6 +797,39 @@ static int mr_umem_get(struct mlx5_ib_dev *dev, u64 start, u64 length,
>  	return 0;
>  }
> 
> +static int mr_umem_dmabuf_get(struct mlx5_ib_dev *dev, u64 start, u64 length,
> +			      int dmabuf_fd, int access_flags,
> +			      struct ib_umem **umem, int *npages,
> +			      int *page_shift, int *ncont, int *order)
> +{
> +	struct ib_umem *u;
> +
> +	*umem = NULL;
> +
> +	u = ib_umem_dmabuf_get(&dev->ib_dev, start, length, dmabuf_fd,
> +			       access_flags);
> +	if (IS_ERR(u)) {
> +		mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(u));
> +		return PTR_ERR(u);
> +	}
> +
> +	mlx5_ib_cont_pages(u, start, MLX5_MKEY_PAGE_SHIFT_MASK, npages,
> +			   page_shift, ncont, order);
> +
> +	if (!*npages) {
> +		mlx5_ib_warn(dev, "avoid zero region\n");
> +		ib_umem_release(u);
> +		return -EINVAL;
> +	}
> +
> +	*umem = u;
> +
> +	mlx5_ib_dbg(dev, "npages %d, ncont %d, order %d, page_shift %d\n",
> +		    *npages, *ncont, *order, *page_shift);
> +
> +	return 0;
> +}
> +
>  static void mlx5_ib_umr_done(struct ib_cq *cq, struct ib_wc *wc)
>  {
>  	struct mlx5_ib_umr_context *context =
> @@ -1228,7 +1261,7 @@ struct ib_mr *mlx5_ib_reg_dm_mr(struct ib_pd *pd, struct ib_dm *dm,
>  }
> 
>  struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				  u64 virt_addr, int access_flags,
> +				  u64 virt_addr, int fd, int access_flags,
>  				  struct ib_udata *udata)
>  {
>  	struct mlx5_ib_dev *dev = to_mdev(pd->device);
> @@ -1261,9 +1294,13 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  		return &mr->ibmr;
>  	}
> 
> -	err = mr_umem_get(dev, start, length, access_flags, &umem,
> -			  &npages, &page_shift, &ncont, &order);
> -
> +	if (fd < 0)
> +		err = mr_umem_get(dev, start, length, access_flags, &umem,
> +				  &npages, &page_shift, &ncont, &order);
> +	else
> +		err = mr_umem_dmabuf_get(dev, start, length, fd, access_flags,
> +					 &umem, &npages, &page_shift, &ncont,
> +					 &order);
>  	if (err < 0)
>  		return ERR_PTR(err);
> 
> diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
> index 69a3e4f..f1f93af 100644
> --- a/drivers/infiniband/hw/mthca/mthca_provider.c
> +++ b/drivers/infiniband/hw/mthca/mthca_provider.c
> @@ -852,7 +852,8 @@ static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc)
>  }
> 
>  static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				       u64 virt, int acc, struct ib_udata *udata)
> +				       u64 virt, int fd, int acc,
> +				       struct ib_udata *udata)
>  {
>  	struct mthca_dev *dev = to_mdev(pd->device);
>  	struct sg_dma_page_iter sg_iter;
> @@ -880,6 +881,11 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  	if (!mr)
>  		return ERR_PTR(-ENOMEM);
> 
> +	if (fd >= 0) {
> +		err = -EOPNOTSUPP;
> +		goto err;
> +	}
> +
>  	mr->umem = ib_umem_get(pd->device, start, length, acc);
>  	if (IS_ERR(mr->umem)) {
>  		err = PTR_ERR(mr->umem);
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> index 10e3438..2c25244 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> @@ -853,7 +853,8 @@ static void build_user_pbes(struct ocrdma_dev *dev, struct ocrdma_mr *mr,
>  }
> 
>  struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
> -				 u64 usr_addr, int acc, struct ib_udata *udata)
> +				 u64 usr_addr, int fd, int acc,
> +				 struct ib_udata *udata)
>  {
>  	int status = -ENOMEM;
>  	struct ocrdma_dev *dev = get_ocrdma_dev(ibpd->device);
> @@ -869,6 +870,12 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
>  	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
>  	if (!mr)
>  		return ERR_PTR(status);
> +
> +	if (fd >= 0) {
> +		status = -EOPNOTSUPP;
> +		goto umem_err;
> +	}
> +
>  	mr->umem = ib_umem_get(ibpd->device, start, len, acc);
>  	if (IS_ERR(mr->umem)) {
>  		status = -EFAULT;
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
> index 3a50108..bc3660e 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
> @@ -99,7 +99,8 @@ int ocrdma_post_srq_recv(struct ib_srq *, const struct ib_recv_wr *,
>  int ocrdma_dereg_mr(struct ib_mr *ib_mr, struct ib_udata *udata);
>  struct ib_mr *ocrdma_get_dma_mr(struct ib_pd *, int acc);
>  struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *, u64 start, u64 length,
> -				 u64 virt, int acc, struct ib_udata *);
> +				 u64 virt, int fd, int acc,
> +				 struct ib_udata *);
>  struct ib_mr *ocrdma_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
>  			      u32 max_num_sg, struct ib_udata *udata);
>  int ocrdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
> diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
> index a5bd3ad..95e865c 100644
> --- a/drivers/infiniband/hw/qedr/verbs.c
> +++ b/drivers/infiniband/hw/qedr/verbs.c
> @@ -2828,7 +2828,8 @@ static int init_mr_info(struct qedr_dev *dev, struct mr_info *info,
>  }
> 
>  struct ib_mr *qedr_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
> -			       u64 usr_addr, int acc, struct ib_udata *udata)
> +			       u64 usr_addr, int fd, int acc,
> +			       struct ib_udata *udata)
>  {
>  	struct qedr_dev *dev = get_qedr_dev(ibpd->device);
>  	struct qedr_mr *mr;
> @@ -2849,6 +2850,11 @@ struct ib_mr *qedr_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
> 
>  	mr->type = QEDR_MR_USER;
> 
> +	if (fd >= 0) {
> +		rc = -EOPNOTSUPP;
> +		goto err0;
> +	}
> +
>  	mr->umem = ib_umem_get(ibpd->device, start, len, acc);
>  	if (IS_ERR(mr->umem)) {
>  		rc = -EFAULT;
> diff --git a/drivers/infiniband/hw/qedr/verbs.h b/drivers/infiniband/hw/qedr/verbs.h
> index 1802784..78effd1 100644
> --- a/drivers/infiniband/hw/qedr/verbs.h
> +++ b/drivers/infiniband/hw/qedr/verbs.h
> @@ -78,7 +78,8 @@ int qedr_create_ah(struct ib_ah *ibah, struct rdma_ah_attr *attr, u32 flags,
>  struct ib_mr *qedr_get_dma_mr(struct ib_pd *, int acc);
> 
>  struct ib_mr *qedr_reg_user_mr(struct ib_pd *, u64 start, u64 length,
> -			       u64 virt, int acc, struct ib_udata *);
> +			       u64 virt, int fd, int acc,
> +			       struct ib_udata *);
> 
>  int qedr_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
>  		   int sg_nents, unsigned int *sg_offset);
> diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
> index 71f8233..58ed9d1 100644
> --- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
> +++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
> @@ -603,7 +603,8 @@ void usnic_ib_destroy_cq(struct ib_cq *cq, struct ib_udata *udata)
>  }
> 
>  struct ib_mr *usnic_ib_reg_mr(struct ib_pd *pd, u64 start, u64 length,
> -					u64 virt_addr, int access_flags,
> +					u64 virt_addr, int fd,
> +					int access_flags,
>  					struct ib_udata *udata)
>  {
>  	struct usnic_ib_mr *mr;
> @@ -616,6 +617,11 @@ struct ib_mr *usnic_ib_reg_mr(struct ib_pd *pd, u64 start, u64 length,
>  	if (!mr)
>  		return ERR_PTR(-ENOMEM);
> 
> +	if (fd >= 0) {
> +		err = -EOPNOTSUPP;
> +		goto err_free;
> +	}
> +
>  	mr->umem = usnic_uiom_reg_get(to_upd(pd)->umem_pd, start, length,
>  					access_flags, 0);
>  	if (IS_ERR_OR_NULL(mr->umem)) {
> diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
> index 2aedf78..42ba8b7 100644
> --- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
> +++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
> @@ -62,7 +62,7 @@ int usnic_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
>  		       struct ib_udata *udata);
>  void usnic_ib_destroy_cq(struct ib_cq *cq, struct ib_udata *udata);
>  struct ib_mr *usnic_ib_reg_mr(struct ib_pd *pd, u64 start, u64 length,
> -				u64 virt_addr, int access_flags,
> +				u64 virt_addr, int fd, int access_flags,
>  				struct ib_udata *udata);
>  int usnic_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
>  int usnic_ib_alloc_ucontext(struct ib_ucontext *uctx, struct ib_udata *udata);
> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> index b039f1f..8b8b10c 100644
> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> @@ -103,13 +103,14 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
>   * @start: starting address
>   * @length: length of region
>   * @virt_addr: I/O virtual address
> + * @fd: file descriptor (for dma-buf)
>   * @access_flags: access flags for memory region
>   * @udata: user data
>   *
>   * @return: ib_mr pointer on success, otherwise returns an errno.
>   */
>  struct ib_mr *pvrdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				 u64 virt_addr, int access_flags,
> +				 u64 virt_addr, int fd, int access_flags,
>  				 struct ib_udata *udata)
>  {
>  	struct pvrdma_dev *dev = to_vdev(pd->device);
> @@ -126,6 +127,9 @@ struct ib_mr *pvrdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  		return ERR_PTR(-EINVAL);
>  	}
> 
> +	if (fd >= 0)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
>  	umem = ib_umem_get(pd->device, start, length, access_flags);
>  	if (IS_ERR(umem)) {
>  		dev_warn(&dev->pdev->dev,
> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
> index e4a48f5..7bee943 100644
> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
> @@ -402,7 +402,7 @@ int pvrdma_modify_port(struct ib_device *ibdev, u8 port,
>  void pvrdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata);
>  struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc);
>  struct ib_mr *pvrdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -				 u64 virt_addr, int access_flags,
> +				 u64 virt_addr, int fd, int access_flags,
>  				 struct ib_udata *udata);
>  int pvrdma_dereg_mr(struct ib_mr *mr, struct ib_udata *udata);
>  struct ib_mr *pvrdma_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
> diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
> index 72f6534..cbd54de 100644
> --- a/drivers/infiniband/sw/rdmavt/mr.c
> +++ b/drivers/infiniband/sw/rdmavt/mr.c
> @@ -372,13 +372,14 @@ struct ib_mr *rvt_get_dma_mr(struct ib_pd *pd, int acc)
>   * @pd: protection domain for this memory region
>   * @start: starting userspace address
>   * @length: length of region to register
> + * @fd: file descriptor (for dma-buf)
>   * @mr_access_flags: access flags for this memory region
>   * @udata: unused by the driver
>   *
>   * Return: the memory region on success, otherwise returns an errno.
>   */
>  struct ib_mr *rvt_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -			      u64 virt_addr, int mr_access_flags,
> +			      u64 virt_addr, int fd, int mr_access_flags,
>  			      struct ib_udata *udata)
>  {
>  	struct rvt_mr *mr;
> @@ -390,6 +391,9 @@ struct ib_mr *rvt_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  	if (length == 0)
>  		return ERR_PTR(-EINVAL);
> 
> +	if (fd >= 0)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
>  	umem = ib_umem_get(pd->device, start, length, mr_access_flags);
>  	if (IS_ERR(umem))
>  		return (void *)umem;
> diff --git a/drivers/infiniband/sw/rdmavt/mr.h b/drivers/infiniband/sw/rdmavt/mr.h
> index 2c8d075..9d38d7d 100644
> --- a/drivers/infiniband/sw/rdmavt/mr.h
> +++ b/drivers/infiniband/sw/rdmavt/mr.h
> @@ -76,7 +76,7 @@ static inline struct rvt_mr *to_imr(struct ib_mr *ibmr)
>  /* Mem Regions */
>  struct ib_mr *rvt_get_dma_mr(struct ib_pd *pd, int acc);
>  struct ib_mr *rvt_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> -			      u64 virt_addr, int mr_access_flags,
> +			      u64 virt_addr, int fd, int mr_access_flags,
>  			      struct ib_udata *udata);
>  int rvt_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
>  struct ib_mr *rvt_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 9dd4bd7..f5f4cbe 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -930,6 +930,7 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
>  				     u64 start,
>  				     u64 length,
>  				     u64 iova,
> +				     int fd,
>  				     int access, struct ib_udata *udata)
>  {
>  	int err;
> @@ -947,6 +948,11 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
> 
>  	rxe_add_ref(pd);
> 
> +	if (fd >= 0) {
> +		err = -EOPNOTSUPP;
> +		goto err3;
> +	}
> +
>  	err = rxe_mem_init_user(pd, start, length, iova,
>  				access, udata, mr);
>  	if (err)
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
> index aeb842b..e88c077 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.c
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -1275,11 +1275,13 @@ int siw_dereg_mr(struct ib_mr *base_mr, struct ib_udata *udata)
>   * @start:	starting address of MR (virtual address)
>   * @len:	len of MR
>   * @rnic_va:	not used by siw
> + * @fd:		file descriptor (for dma-buf)
>   * @rights:	MR access rights
>   * @udata:	user buffer to communicate STag and Key.
>   */
>  struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
> -			      u64 rnic_va, int rights, struct ib_udata *udata)
> +			      u64 rnic_va, int fd, int rights,
> +			      struct ib_udata *udata)
>  {
>  	struct siw_mr *mr = NULL;
>  	struct siw_umem *umem = NULL;
> @@ -1315,6 +1317,10 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
>  			goto err_out;
>  		}
>  	}
> +	if (fd >= 0) {
> +		rv = -EOPNOTSUPP;
> +		goto err_out;
> +	}
>  	umem = siw_umem_get(start, len, ib_access_writable(rights));
>  	if (IS_ERR(umem)) {
>  		rv = PTR_ERR(umem);
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.h b/drivers/infiniband/sw/siw/siw_verbs.h
> index 1a73198..e08bf51 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.h
> +++ b/drivers/infiniband/sw/siw/siw_verbs.h
> @@ -67,7 +67,8 @@ int siw_post_receive(struct ib_qp *base_qp, const struct ib_recv_wr *wr,
>  int siw_poll_cq(struct ib_cq *base_cq, int num_entries, struct ib_wc *wc);
>  int siw_req_notify_cq(struct ib_cq *base_cq, enum ib_cq_notify_flags flags);
>  struct ib_mr *siw_reg_user_mr(struct ib_pd *base_pd, u64 start, u64 len,
> -			      u64 rnic_va, int rights, struct ib_udata *udata);
> +			      u64 rnic_va, int fd, int rights,
> +			      struct ib_udata *udata);
>  struct ib_mr *siw_alloc_mr(struct ib_pd *base_pd, enum ib_mr_type mr_type,
>  			   u32 max_sge, struct ib_udata *udata);
>  struct ib_mr *siw_get_dma_mr(struct ib_pd *base_pd, int rights);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index bbc5cfb..12a45a7 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1,7 +1,7 @@
>  /*
>   * Copyright (c) 2004 Mellanox Technologies Ltd.  All rights reserved.
>   * Copyright (c) 2004 Infinicon Corporation.  All rights reserved.
> - * Copyright (c) 2004 Intel Corporation.  All rights reserved.
> + * Copyright (c) 2004, 2020 Intel Corporation.  All rights reserved.
>   * Copyright (c) 2004 Topspin Corporation.  All rights reserved.
>   * Copyright (c) 2004 Voltaire Corporation.  All rights reserved.
>   * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
> @@ -2431,7 +2431,7 @@ struct ib_device_ops {
>  	int (*resize_cq)(struct ib_cq *cq, int cqe, struct ib_udata *udata);
>  	struct ib_mr *(*get_dma_mr)(struct ib_pd *pd, int mr_access_flags);
>  	struct ib_mr *(*reg_user_mr)(struct ib_pd *pd, u64 start, u64 length,
> -				     u64 virt_addr, int mr_access_flags,
> +				     u64 virt_addr, int fd, int mr_access_flags,
>  				     struct ib_udata *udata);
>  	int (*rereg_user_mr)(struct ib_mr *mr, int flags, u64 start, u64 length,
>  			     u64 virt_addr, int mr_access_flags,
> --
> 1.8.3.1


* RE: [RFC PATCH v2 3/3] RDMA/uverbs: Add uverbs command for dma-buf based MR registration
       [not found] ` <1593451903-30959-4-git-send-email-jianxin.xiong@intel.com>
@ 2020-06-30 19:05   ` Xiong, Jianxin
  0 siblings, 0 replies; 28+ messages in thread
From: Xiong, Jianxin @ 2020-06-30 19:05 UTC (permalink / raw)
  To: linux-rdma
  Cc: Leon Romanovsky, dri-devel, Jason Gunthorpe, Doug Ledford,
	Vetter, Daniel, Christian Koenig

Cc'd drm people.

> -----Original Message-----
> From: Xiong, Jianxin <jianxin.xiong@intel.com>
> Sent: Monday, June 29, 2020 10:32 AM
> To: linux-rdma@vger.kernel.org
> Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@ziepe.ca>; Sumit Semwal
> <sumit.semwal@linaro.org>; Leon Romanovsky <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>
> Subject: [RFC PATCH v2 3/3] RDMA/uverbs: Add uverbs command for dma-buf based MR registration
> 
> Add a uverbs command for registering a user memory region associated with a dma-buf file descriptor.
> 
> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
> Reviewed-by: Sean Hefty <sean.hefty@intel.com>
> Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> ---
>  drivers/infiniband/core/uverbs_std_types_mr.c | 112 ++++++++++++++++++++++++++
>  include/uapi/rdma/ib_user_ioctl_cmds.h        |  14 ++++
>  2 files changed, 126 insertions(+)
> 
> diff --git a/drivers/infiniband/core/uverbs_std_types_mr.c b/drivers/infiniband/core/uverbs_std_types_mr.c
> index c1286a5..2c9ff7c 100644
> --- a/drivers/infiniband/core/uverbs_std_types_mr.c
> +++ b/drivers/infiniband/core/uverbs_std_types_mr.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2018, Mellanox Technologies inc.  All rights reserved.
> + * Copyright (c) 2020, Intel Corporation.  All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU @@ -154,6 +155,85 @@ static int
> UVERBS_HANDLER(UVERBS_METHOD_DM_MR_REG)(
>  	return ret;
>  }
> 
> +static int UVERBS_HANDLER(UVERBS_METHOD_REG_DMABUF_MR)(
> +	struct uverbs_attr_bundle *attrs)
> +{
> +	struct ib_uobject *uobj =
> +		uverbs_attr_get_uobject(attrs, UVERBS_ATTR_REG_DMABUF_MR_HANDLE);
> +	struct ib_pd *pd =
> +		uverbs_attr_get_obj(attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);
> +	struct ib_device *ib_dev = pd->device;
> +
> +	u64 addr, length, hca_va;
> +	u32 fd;
> +	u32 access_flags;
> +	struct ib_mr *mr;
> +	int ret;
> +
> +	if (!ib_dev->ops.reg_user_mr)
> +		return -EOPNOTSUPP;
> +
> +	ret = uverbs_copy_from(&addr, attrs, UVERBS_ATTR_REG_DMABUF_MR_ADDR);
> +	if (ret)
> +		return ret;
> +
> +	ret = uverbs_copy_from(&length, attrs,
> +			       UVERBS_ATTR_REG_DMABUF_MR_LENGTH);
> +	if (ret)
> +		return ret;
> +
> +	ret = uverbs_copy_from(&hca_va, attrs,
> +			       UVERBS_ATTR_REG_DMABUF_MR_HCA_VA);
> +	if (ret)
> +		return ret;
> +
> +	ret = uverbs_copy_from(&fd, attrs,
> +			       UVERBS_ATTR_REG_DMABUF_MR_FD);
> +	if (ret)
> +		return ret;
> +
> +	ret = uverbs_get_flags32(&access_flags, attrs,
> +				 UVERBS_ATTR_REG_DMABUF_MR_ACCESS_FLAGS,
> +				 IB_ACCESS_SUPPORTED);
> +	if (ret)
> +		return ret;
> +
> +	ret = ib_check_mr_access(access_flags);
> +	if (ret)
> +		return ret;
> +
> +	mr = pd->device->ops.reg_user_mr(pd, addr, length, hca_va,
> +					   (int)(s32)fd, access_flags,
> +					   &attrs->driver_udata);
> +	if (IS_ERR(mr))
> +		return PTR_ERR(mr);
> +
> +	mr->device  = pd->device;
> +	mr->pd      = pd;
> +	mr->type    = IB_MR_TYPE_USER;
> +	mr->uobject = uobj;
> +	atomic_inc(&pd->usecnt);
> +
> +	uobj->object = mr;
> +
> +	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
> +			     &mr->lkey, sizeof(mr->lkey));
> +	if (ret)
> +		goto err_dereg;
> +
> +	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_DMABUF_MR_RESP_RKEY,
> +			     &mr->rkey, sizeof(mr->rkey));
> +	if (ret)
> +		goto err_dereg;
> +
> +	return 0;
> +
> +err_dereg:
> +	ib_dereg_mr_user(mr, uverbs_get_cleared_udata(attrs));
> +
> +	return ret;
> +}
> +
>  DECLARE_UVERBS_NAMED_METHOD(
>  	UVERBS_METHOD_ADVISE_MR,
>  	UVERBS_ATTR_IDR(UVERBS_ATTR_ADVISE_MR_PD_HANDLE,
> @@ -200,6 +280,37 @@ static int UVERBS_HANDLER(UVERBS_METHOD_DM_MR_REG)(
>  			    UVERBS_ATTR_TYPE(u32),
>  			    UA_MANDATORY));
> 
> +DECLARE_UVERBS_NAMED_METHOD(
> +	UVERBS_METHOD_REG_DMABUF_MR,
> +	UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
> +			UVERBS_OBJECT_MR,
> +			UVERBS_ACCESS_NEW,
> +			UA_MANDATORY),
> +	UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
> +			UVERBS_OBJECT_PD,
> +			UVERBS_ACCESS_READ,
> +			UA_MANDATORY),
> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_ADDR,
> +			   UVERBS_ATTR_TYPE(u64),
> +			   UA_MANDATORY),
> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_LENGTH,
> +			   UVERBS_ATTR_TYPE(u64),
> +			   UA_MANDATORY),
> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_HCA_VA,
> +			   UVERBS_ATTR_TYPE(u64),
> +			   UA_MANDATORY),
> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_FD,
> +			   UVERBS_ATTR_TYPE(u32),
> +			   UA_MANDATORY),
> +	UVERBS_ATTR_FLAGS_IN(UVERBS_ATTR_REG_DMABUF_MR_ACCESS_FLAGS,
> +			     enum ib_access_flags),
> +	UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
> +			    UVERBS_ATTR_TYPE(u32),
> +			    UA_MANDATORY),
> +	UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_REG_DMABUF_MR_RESP_RKEY,
> +			    UVERBS_ATTR_TYPE(u32),
> +			    UA_MANDATORY));
> +
>  DECLARE_UVERBS_NAMED_METHOD_DESTROY(
>  	UVERBS_METHOD_MR_DESTROY,
>  	UVERBS_ATTR_IDR(UVERBS_ATTR_DESTROY_MR_HANDLE,
> @@ -210,6 +321,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_DM_MR_REG)(
>  DECLARE_UVERBS_NAMED_OBJECT(
>  	UVERBS_OBJECT_MR,
>  	UVERBS_TYPE_ALLOC_IDR(uverbs_free_mr),
> +	&UVERBS_METHOD(UVERBS_METHOD_REG_DMABUF_MR),
>  	&UVERBS_METHOD(UVERBS_METHOD_DM_MR_REG),
>  	&UVERBS_METHOD(UVERBS_METHOD_MR_DESTROY),
>  	&UVERBS_METHOD(UVERBS_METHOD_ADVISE_MR));
> diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
> index d4ddbe4..31aacbf 100644
> --- a/include/uapi/rdma/ib_user_ioctl_cmds.h
> +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2018, Mellanox Technologies inc.  All rights reserved.
> + * Copyright (c) 2020, Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU @@ -161,6 +162,7 @@ enum uverbs_methods_mr {
>  	UVERBS_METHOD_DM_MR_REG,
>  	UVERBS_METHOD_MR_DESTROY,
>  	UVERBS_METHOD_ADVISE_MR,
> +	UVERBS_METHOD_REG_DMABUF_MR,
>  };
> 
>  enum uverbs_attrs_mr_destroy_ids {
> @@ -174,6 +176,18 @@ enum uverbs_attrs_advise_mr_cmd_attr_ids {
>  	UVERBS_ATTR_ADVISE_MR_SGE_LIST,
>  };
> 
> +enum uverbs_attrs_reg_dmabuf_mr_cmd_attr_ids {
> +	UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
> +	UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
> +	UVERBS_ATTR_REG_DMABUF_MR_ADDR,
> +	UVERBS_ATTR_REG_DMABUF_MR_LENGTH,
> +	UVERBS_ATTR_REG_DMABUF_MR_HCA_VA,
> +	UVERBS_ATTR_REG_DMABUF_MR_FD,
> +	UVERBS_ATTR_REG_DMABUF_MR_ACCESS_FLAGS,
> +	UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
> +	UVERBS_ATTR_REG_DMABUF_MR_RESP_RKEY,
> +};
> +
>  enum uverbs_attrs_create_counters_cmd_attr_ids {
>  	UVERBS_ATTR_CREATE_COUNTERS_HANDLE,
>  };
> --
> 1.8.3.1


* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-06-30 18:46       ` [RFC PATCH v2 0/3] RDMA: add dma-buf support Xiong, Jianxin
@ 2020-06-30 19:17         ` Jason Gunthorpe
  2020-06-30 20:08           ` Xiong, Jianxin
  2020-07-01  9:03         ` Christian König
  1 sibling, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-06-30 19:17 UTC (permalink / raw)
  To: Xiong, Jianxin
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Christian Koenig,
	Doug Ledford, Vetter, Daniel

On Tue, Jun 30, 2020 at 06:46:17PM +0000, Xiong, Jianxin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Tuesday, June 30, 2020 10:35 AM
> > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > 
> > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > address space and page migration between system memory and device
> > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > located on device must be able to migrate to system memory when
> > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > >
> > > > peer-peer access is currently not possible with hmm_range_fault().
> > >
> > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > private pages are migrated to the system RAM in the fault handler.
> > > However, it's possible to have a modified code flow to keep the device
> > > private page info for use with peer to peer access.
> > 
> > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> 
> But pfn is supposed to be all that is needed.

Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for
anything.

> > Well, what do you want to happen here? The RDMA parts are
> > reasonable, but I don't want to add new functionality without a
> > purpose - the other parts need to be settled out first.
> 
> At the RDMA side, we mainly want to check if the changes are
> acceptable. For example, the part about adding 'fd' to the device
> ops and the ioctl interface. All the previous comments are very
> helpful for us to refine the patch so that we can be ready when GPU
> side support becomes available.

Well, I'm not totally happy with the way the umem and the fd are
handled so roughly and incompletely..

> > Hum. This is not actually so hard to do. The whole dma buf
> > proposal would make a lot more sense if the 'dma buf MR' had to be
> > the dynamic kind and the driver had to provide the faulting. It
> > would not be so hard to change mlx5 to be able to work like this,
> > perhaps. (the locking might be a bit tricky though)
> 
> The main issue is that not all NICs support ODP.

Sure, but there is lots of infrastructure work here to be done on dma
buf, having a correct consumer in the form of ODP might be helpful to
advance it.

Jason

* RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-06-30 19:17         ` Jason Gunthorpe
@ 2020-06-30 20:08           ` Xiong, Jianxin
  2020-07-02 12:27             ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Xiong, Jianxin @ 2020-06-30 20:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Christian Koenig,
	Doug Ledford, Vetter, Daniel

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, June 30, 2020 12:17 PM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>; dri-
> devel@lists.freedesktop.org
> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> > >
> > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared
> > > > > > virtual address space and page migration between system memory
> > > > > > and device memory. HMM doesn't support pinning device memory
> > > > > > because pages located on device must be able to migrate to
> > > > > > system memory when accessed by CPU. Peer-to-peer access is
> > > > > > possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > >
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > >
> > > > Currently hmm_range_fault() always sets the cpu access flag and
> > > > device private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the
> > > > device private page info for use with peer to peer access.
> > >
> > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything
> useful.
> >
> > But pfn is supposed to be all that is needed.
> 
> Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.

Hmm. I thought the pfn corresponds to the address in the BAR range. I could be
wrong here. 

> 
> > > Well, what do you want to happen here? The RDMA parts are
> > > reasonable, but I don't want to add new functionality without a
> > > purpose - the other parts need to be settled out first.
> >
> > At the RDMA side, we mainly want to check if the changes are
> > acceptable. For example, the part about adding 'fd' to the device ops
> > and the ioctl interface. All the previous comments are very helpful
> > for us to refine the patch so that we can be ready when GPU side
> > support becomes available.
> 
> Well, I'm not totally happy with the way the umem and the fd is handled so roughly and incompletely..

Yes, this feedback is very helpful. Will work on improving the code.

> 
> > > Hum. This is not actually so hard to do. The whole dma buf proposal
> > > would make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would
> > > not be so hard to change mlx5 to be able to work like this, perhaps.
> > > (the locking might be a bit tricky though)
> >
> > The main issue is that not all NICs support ODP.
> 
> Sure, but there is lots of infrastructure work here to be done on dma buf, having a correct consumer in the form of ODP might be helpful to
> advance it.

Good point. Thanks.

> 
> Jason

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-06-30 18:46       ` [RFC PATCH v2 0/3] RDMA: add dma-buf support Xiong, Jianxin
  2020-06-30 19:17         ` Jason Gunthorpe
@ 2020-07-01  9:03         ` Christian König
  2020-07-01 12:07           ` Daniel Vetter
  2020-07-01 12:39           ` Jason Gunthorpe
  1 sibling, 2 replies; 28+ messages in thread
From: Christian König @ 2020-07-01  9:03 UTC (permalink / raw)
  To: Xiong, Jianxin, Jason Gunthorpe
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter, Daniel

Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@ziepe.ca>
>> Sent: Tuesday, June 30, 2020 10:35 AM
>> To: Xiong, Jianxin <jianxin.xiong@intel.com>
>> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
>> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
>> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
>>
>> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
>>>>> Heterogeneous Memory Management (HMM) utilizes
>>>>> mmu_interval_notifier and ZONE_DEVICE to support shared virtual
>>>>> address space and page migration between system memory and device
>>>>> memory. HMM doesn't support pinning device memory because pages
>>>>> located on device must be able to migrate to system memory when
>>>>> accessed by CPU. Peer-to-peer access is possible if the peer can
>>>>> handle page fault. For RDMA, that means the NIC must support on-demand paging.
>>>> peer-peer access is currently not possible with hmm_range_fault().
>>> Currently hmm_range_fault() always sets the cpu access flag and device
>>> private pages are migrated to the system RAM in the fault handler.
>>> However, it's possible to have a modified code flow to keep the device
>>> private page info for use with peer to peer access.
>> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> But pfn is supposed to be all that is needed.
>
>>>> So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
>>> That's right, the patch alone is just half of the story. The
>>> functionality depends on availability of dma-buf exporter that can pin
>>> the device memory.
>> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
>> other parts need to be settled out first.
> At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> comments are very helpful for us to refine the patch so that we can be ready when
> GPU side support becomes available.
>
>> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
>> support non-dynamic mapping?
> We are working on direct direction.
>
>>>>> migrate to system RAM. This is due to the lack of knowledge about
>>>>> whether the importer can perform peer-to-peer access and the lack
>>>>> of resource limit control measure for GPU. For the first part, the
>>>>> latest dma-buf driver has a peer-to-peer flag for the importer,
>>>>> but the flag is currently tied to dynamic mapping support, which
>>>>> requires on-demand paging support from the NIC to work.
>>>> ODP for DMA buf?
>>> Right.
>> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
>> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
>> locking might be a bit tricky though)
> The main issue is that not all NICs support ODP.

You don't need on-demand paging support from the NIC for dynamic mapping 
to work.

All you need is the ability to stop and wait for ongoing accesses to end,
and to make sure that new ones grab a new mapping.
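
As a rough sketch, the importer side boils down to the dynamic attach
hooks below (my_importer, rdma_quiesce_hw() and rdma_schedule_remap() are
hypothetical helpers; dma_buf_attach_ops, dma_buf_dynamic_attach() and the
allow_peer2peer flag are the dynamic dma-buf API in recent kernels):

static void my_move_notify(struct dma_buf_attachment *attach)
{
        struct my_importer *imp = attach->importer_priv;

        /* Stop HW access to the old mapping and wait for it to drain ... */
        rdma_quiesce_hw(imp);
        /* ... and make sure the next access grabs a fresh mapping. */
        rdma_schedule_remap(imp);
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,        /* P2P flag, recent kernels only */
        .move_notify     = my_move_notify,
};

/* imp->attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp); */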

Apart from that this is a rather interesting work.

Regards,
Christian.

>
>>>>> There are a few possible ways to address these issues, such as
>>>>> decoupling peer-to-peer flag from dynamic mapping, allowing more
>>>>> leeway for individual drivers to make the pinning decision and
>>>>> adding GPU resource limit control via cgroup. We would like to get
>>>>> comments on this patch series with the assumption that device
>>>>> memory pinning via dma-buf is supported by some GPU drivers, and
>>>>> at the same time welcome open discussions on how to address the
>>>>> aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
>>>> These seem like DMA buf problems, not RDMA problems, why are you
>>>> asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
>>> The intention is to have people from both RDMA and DMA buffer side to
>>> comment. Sumit Semwal is the DMA buffer maintainer according to the
>>> MAINTAINERS file. I agree more people could be invited to the discussion.
>>> Just added Christian Koenig to the cc-list.
>> Would be good to have added the drm lists too
> Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
>
>>> If the umem_description you mentioned is for information used to
>>> create the umem (e.g. a structure for all the parameters), then this would work better.
>> It would make some more sense, and avoid all these weird EOPNOTSUPPS.
> Good, thanks for the suggestion.
>
>> Jason


* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-01  9:03         ` Christian König
@ 2020-07-01 12:07           ` Daniel Vetter
  2020-07-01 12:14             ` Daniel Vetter
  2020-07-01 12:39           ` Jason Gunthorpe
  1 sibling, 1 reply; 28+ messages in thread
From: Daniel Vetter @ 2020-07-01 12:07 UTC (permalink / raw)
  To: Christian König
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Xiong, Jianxin

Either my mailer ate half the thread or it's still stuck somewhere, so
jumping in the middle a bit.

On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > > -----Original Message-----
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > 
> > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > address space and page migration between system memory and device
> > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > located on device must be able to migrate to system memory when
> > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the device
> > > > private page info for use with peer to peer access.
> > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > But pfn is supposed to be all that is needed.
> > 
> > > > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> > > > That's right, the patch alone is just half of the story. The
> > > > functionality depends on availability of dma-buf exporter that can pin
> > > > the device memory.
> > > Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> > > other parts need to be settled out first.
> > At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> > the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> > comments are very helpful for us to refine the patch so that we can be ready when
> > GPU side support becomes available.
> > 
> > > The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> > > support non-dynamic mapping?
> > We are working on direct direction.
> > 
> > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > requires on-demand paging support from the NIC to work.
> > > > > ODP for DMA buf?
> > > > Right.
> > > Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> > > locking might be a bit tricky though)
> > The main issue is that not all NICs support ODP.
> 
> You don't need on-demand paging support from the NIC for dynamic mapping to
> work.
> 
> All you need is the ability to stop wait for ongoing accesses to end and
> make sure that new ones grab a new mapping.

So, having not much of a clue about rdma myself, this sounds rather
interesting. Sure it would result in immediately re-acquiring the pages,
but that's also really all we need to be able to move buffers around on
the gpu side. And with dma_resv_lock there's no livelock risk if the NIC
immediately starts a kthread/work_struct which reacquires all the dma-bufs
and everything else it needs. Plus, with the full ww_mutex deadlock
backoff dance, there are no locking issues with having to acquire an
entire pile of dma_resv_locks; that's natively supported (gpus very much
need to be able to lock an arbitrary set of buffers).
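
Roughly, the dance in question looks like the sketch below, written
against the dma_resv API; the 'bufs' array is a made-up stand-in for
whatever list the RDMA importer would keep:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static int lock_all_dmabufs(struct dma_buf **bufs, int nbufs,
                            struct ww_acquire_ctx *ctx)
{
        int contended = -1;
        int i;

        ww_acquire_init(ctx, &reservation_ww_class);
retry:
        for (i = 0; i < nbufs; i++) {
                if (i == contended)
                        continue;       /* already held via the slow path */

                if (dma_resv_lock(bufs[i]->resv, ctx) == -EDEADLK) {
                        int busy = i;

                        /* Back off: drop every lock we already hold ... */
                        while (i--)
                                if (i != contended)
                                        dma_resv_unlock(bufs[i]->resv);
                        if (contended >= 0)
                                dma_resv_unlock(bufs[contended]->resv);

                        /* ... sleep on the contended one, then retry. */
                        dma_resv_lock_slow(bufs[busy]->resv, ctx);
                        contended = busy;
                        goto retry;
                }
        }
        ww_acquire_done(ctx);
        return 0;       /* caller unlocks all and calls ww_acquire_fini() */
}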

And I think if that would allow us to avoid the entire "avoid random
drivers pinning dma-buf into vram" discussion, it would be much better and
quicker to land something like that.

I guess the big question is going to be how to fit this into rdma, since
the ww_mutex deadlock backoff dance needs to be done at a fairly high
level. For gpu drivers it's always done at the top level ioctl entry
point.

> Apart from that this is a rather interesting work.
> 
> Regards,
> Christian.
> 
> > 
> > > > > > There are a few possible ways to address these issues, such as
> > > > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > > > leeway for individual drivers to make the pinning decision and
> > > > > > adding GPU resource limit control via cgroup. We would like to get
> > > > > > comments on this patch series with the assumption that device
> > > > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > > > at the same time welcome open discussions on how to address the
> > > > > > aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
> > > > > These seem like DMA buf problems, not RDMA problems, why are you
> > > > > asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
> > > > The intention is to have people from both RDMA and DMA buffer side to
> > > > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > > > MAINTAINERS file. I agree more people could be invited to the discussion.
> > > > Just added Christian Koenig to the cc-list.


MAINTAINERS also says to cc an entire pile of mailing lists, where the
usual suspects (including Christian and me) hang around. Is that the
reason I got only like half the thread here?

For next time around, really include everyone relevant here please.
-Daniel

> > > Would be good to have added the drm lists too
> > Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
> > 
> > > > If the umem_description you mentioned is for information used to
> > > > create the umem (e.g. a structure for all the parameters), then this would work better.
> > > It would make some more sense, and avoid all these weird EOPNOTSUPPS.
> > Good, thanks for the suggestion.
> > 
> > > Jason
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-01 12:07           ` Daniel Vetter
@ 2020-07-01 12:14             ` Daniel Vetter
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Vetter @ 2020-07-01 12:14 UTC (permalink / raw)
  To: Christian König
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Xiong, Jianxin

On Wed, Jul 01, 2020 at 02:07:44PM +0200, Daniel Vetter wrote:
> Either my mailer ate half the thread or it's still stuck somewhere, so
> jumping in the middle a bit.
> 
> On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> > Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > > > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > > > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > > 
> > > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > > address space and page migration between system memory and device
> > > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > > located on device must be able to migrate to system memory when
> > > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > > private pages are migrated to the system RAM in the fault handler.
> > > > > However, it's possible to have a modified code flow to keep the device
> > > > > private page info for use with peer to peer access.
> > > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > > But pfn is supposed to be all that is needed.
> > > 
> > > > > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> > > > > That's right, the patch alone is just half of the story. The
> > > > > functionality depends on availability of dma-buf exporter that can pin
> > > > > the device memory.
> > > > Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> > > > other parts need to be settled out first.
> > > At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> > > the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> > > comments are very helpful for us to refine the patch so that we can be ready when
> > > GPU side support becomes available.
> > > 
> > > > The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> > > > support non-dynamic mapping?
> > > We are working on direct direction.
> > > 
> > > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > > requires on-demand paging support from the NIC to work.
> > > > > > ODP for DMA buf?
> > > > > Right.
> > > > Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> > > > dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> > > > locking might be a bit tricky though)
> > > The main issue is that not all NICs support ODP.
> > 
> > You don't need on-demand paging support from the NIC for dynamic mapping to
> > work.
> > 
> > All you need is the ability to stop wait for ongoing accesses to end and
> > make sure that new ones grab a new mapping.
> 
> So having no clue about rdma myself much, this sounds rather interesting.
> Sure it would result in immediately re-acquiring the pages, but that's
> also really all we need to be able to move buffers around on the gpu side.
> And with dma_resv_lock there's no livelock risk if the NIC immediately
> starts a kthread/work_struct which reacquires all the dma-buf and
> everything else it needs. Plus also with the full ww_mutex deadlock
> backoff dance there's no locking issues with having to acquire an entire
> pile of dma_resv_lock, that's natively supported (gpus very much need to
> be able to lock arbitrary set of buffers).
> 
> And I think if that would allow us to avoid the entire "avoid random
> drivers pinning dma-buf into vram" discussions, much better and quicker to
> land something like that.
> 
> I guess the big question is going to be how to fit this into rdma, since
> the ww_mutex deadlock backoff dance needs to be done at a fairly high
> level. For gpu drivers it's always done at the top level ioctl entry
> point.

Also, just to alleviate fears: I think all that dynamic dma-buf stuff for
rdma should be doable this way _without_ having to interact with
dma_fence. Avoiding that I think is the biggest request Jason has in this
area :-)

Furthermore, it is officially ok to allocate memory while holding a
dma_resv_lock. What is not ok (and might cause issues if you somehow mix
up things in strange ways) is taking a userspace fault, because gpu
drivers must be able to take the dma_resv_lock in their fault handlers.
That might pose a problem.

Also, all these rules are now enforced by lockdep, might_fault() and
similar checks.
-Daniel

> 
> > Apart from that this is a rather interesting work.
> > 
> > Regards,
> > Christian.
> > 
> > > 
> > > > > > > There are a few possible ways to address these issues, such as
> > > > > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > > > > leeway for individual drivers to make the pinning decision and
> > > > > > > adding GPU resource limit control via cgroup. We would like to get
> > > > > > > comments on this patch series with the assumption that device
> > > > > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > > > > at the same time welcome open discussions on how to address the
> > > > > > > aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
> > > > > > These seem like DMA buf problems, not RDMA problems, why are you
> > > > > > asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
> > > > > The intention is to have people from both RDMA and DMA buffer side to
> > > > > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > > > > MAINTAINERS file. I agree more people could be invited to the discussion.
> > > > > Just added Christian Koenig to the cc-list.
> 
> 
> MAINTAINERS also says to cc and entire pile of mailing lists, where the
> usual suspects (including Christian and me) hang around. Is that the
> reason I got only like half the thread here?
> 
> For next time around, really include everyone relevant here please.
> -Daniel
> 
> > > > Would be good to have added the drm lists too
> > > Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
> > > 
> > > > > If the umem_description you mentioned is for information used to
> > > > > create the umem (e.g. a structure for all the parameters), then this would work better.
> > > > It would make some more sense, and avoid all these weird EOPNOTSUPPS.
> > > Good, thanks for the suggestion.
> > > 
> > > > Jason
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-01  9:03         ` Christian König
  2020-07-01 12:07           ` Daniel Vetter
@ 2020-07-01 12:39           ` Jason Gunthorpe
  2020-07-01 12:55             ` Christian König
  1 sibling, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-07-01 12:39 UTC (permalink / raw)
  To: Christian König
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Xiong, Jianxin

On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > 
> > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > address space and page migration between system memory and device
> > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > located on device must be able to migrate to system memory when
> > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the device
> > > > private page info for use with peer to peer access.
> > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > But pfn is supposed to be all that is needed.
> > 
> > > > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> > > > That's right, the patch alone is just half of the story. The
> > > > functionality depends on availability of dma-buf exporter that can pin
> > > > the device memory.
> > > Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> > > other parts need to be settled out first.
> > At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> > the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> > comments are very helpful for us to refine the patch so that we can be ready when
> > GPU side support becomes available.
> > 
> > > The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> > > support non-dynamic mapping?
> > We are working on direct direction.
> > 
> > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > requires on-demand paging support from the NIC to work.
> > > > > ODP for DMA buf?
> > > > Right.
> > > Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> > > locking might be a bit tricky though)
> > The main issue is that not all NICs support ODP.
> 
> You don't need on-demand paging support from the NIC for dynamic mapping to
> work.
> 
> All you need is the ability to stop wait for ongoing accesses to end and
> make sure that new ones grab a new mapping.

Swap and flush isn't a general HW ability either..

I'm unclear how this could be useful, it is guaranteed to corrupt
in-progress writes?

Did you mean pause, swap and resume? That's ODP.

Jason

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-01 12:39           ` Jason Gunthorpe
@ 2020-07-01 12:55             ` Christian König
  2020-07-01 15:42               ` Daniel Vetter
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2020-07-01 12:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Xiong, Jianxin

Am 01.07.20 um 14:39 schrieb Jason Gunthorpe:
> On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
>> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
>>>> From: Jason Gunthorpe <jgg@ziepe.ca>
>>>> Sent: Tuesday, June 30, 2020 10:35 AM
>>>> To: Xiong, Jianxin <jianxin.xiong@intel.com>
>>>> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
>>>> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
>>>> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
>>>>
>>>> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
>>>>>>> Heterogeneous Memory Management (HMM) utilizes
>>>>>>> mmu_interval_notifier and ZONE_DEVICE to support shared virtual
>>>>>>> address space and page migration between system memory and device
>>>>>>> memory. HMM doesn't support pinning device memory because pages
>>>>>>> located on device must be able to migrate to system memory when
>>>>>>> accessed by CPU. Peer-to-peer access is possible if the peer can
>>>>>>> handle page fault. For RDMA, that means the NIC must support on-demand paging.
>>>>>> peer-peer access is currently not possible with hmm_range_fault().
>>>>> Currently hmm_range_fault() always sets the cpu access flag and device
>>>>> private pages are migrated to the system RAM in the fault handler.
>>>>> However, it's possible to have a modified code flow to keep the device
>>>>> private page info for use with peer to peer access.
>>>> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
>>> But pfn is supposed to be all that is needed.
>>>
>>>>>> So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
>>>>> That's right, the patch alone is just half of the story. The
>>>>> functionality depends on availability of dma-buf exporter that can pin
>>>>> the device memory.
>>>> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
>>>> other parts need to be settled out first.
>>> At the RDMA side, we mainly want to check if the changes are acceptable. For example,
>>> the part about adding 'fd' to the device ops and the ioctl interface. All the previous
>>> comments are very helpful for us to refine the patch so that we can be ready when
>>> GPU side support becomes available.
>>>
>>>> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
>>>> support non-dynamic mapping?
>>> We are working on direct direction.
>>>
>>>>>>> migrate to system RAM. This is due to the lack of knowledge about
>>>>>>> whether the importer can perform peer-to-peer access and the lack
>>>>>>> of resource limit control measure for GPU. For the first part, the
>>>>>>> latest dma-buf driver has a peer-to-peer flag for the importer,
>>>>>>> but the flag is currently tied to dynamic mapping support, which
>>>>>>> requires on-demand paging support from the NIC to work.
>>>>>> ODP for DMA buf?
>>>>> Right.
>>>> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
>>>> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
>>>> locking might be a bit tricky though)
>>> The main issue is that not all NICs support ODP.
>> You don't need on-demand paging support from the NIC for dynamic mapping to
>> work.
>>
>> All you need is the ability to stop wait for ongoing accesses to end and
>> make sure that new ones grab a new mapping.
> Swap and flush isn't a general HW ability either..
>
> I'm unclear how this could be useful, it is guarenteed to corrupt
> in-progress writes?
>
> Did you mean pause, swap and resume? That's ODP.

Yes, something like this. And good to know, never heard of ODP.


On the GPU side we can pipeline things, e.g. you can program the
hardware so that page tables are changed at a certain point in time.

So what we do when we get a notification that a buffer will move
around is to mark this buffer in our structures as invalid and return a
fence so that the exporter is able to wait for ongoing stuff to finish.

The actual move then happens only after the ongoing operations on the 
GPU are finished and on the next operation we grab the new location of 
the buffer and re-program the page tables to it.

This way all the CPU does is really just planning asynchronous page 
table changes which are executed on the GPU later on.

You can of course do it synchronized as well, but this would hurt our 
performance pretty badly.
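
In rough pseudo-C the pattern looks like this (gpu_bo, mapping_valid and
gpu_queue_pte_update_async() are made-up names, not the actual amdgpu
code):

static void gpu_move_notify(struct dma_buf_attachment *attach)
{
        struct gpu_bo *bo = attach->importer_priv;

        /* Only mark the mapping stale here; the fences already installed
         * in the buffer's dma_resv let the exporter wait for the GPU work
         * that still uses the old location. */
        bo->mapping_valid = false;
}

/* Later, while validating the next submission (dma_resv already held): */
if (!bo->mapping_valid) {
        bo->sgt = dma_buf_map_attachment(bo->attach, DMA_BIDIRECTIONAL);
        /* Queue the page table update on the GPU, behind the fences of
         * the jobs that still reference the old mapping. */
        gpu_queue_pte_update_async(bo, bo->sgt);
        bo->mapping_valid = true;
}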

Regards,
Christian.

>
> Jason


* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-01 12:55             ` Christian König
@ 2020-07-01 15:42               ` Daniel Vetter
  2020-07-01 17:15                 ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Vetter @ 2020-07-01 15:42 UTC (permalink / raw)
  To: Christian König
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Xiong, Jianxin

On Wed, Jul 1, 2020 at 2:56 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 01.07.20 um 14:39 schrieb Jason Gunthorpe:
> > On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> >> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> >>>> From: Jason Gunthorpe <jgg@ziepe.ca>
> >>>> Sent: Tuesday, June 30, 2020 10:35 AM
> >>>> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> >>>> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> >>>> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> >>>> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> >>>>
> >>>> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> >>>>>>> Heterogeneous Memory Management (HMM) utilizes
> >>>>>>> mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> >>>>>>> address space and page migration between system memory and device
> >>>>>>> memory. HMM doesn't support pinning device memory because pages
> >>>>>>> located on device must be able to migrate to system memory when
> >>>>>>> accessed by CPU. Peer-to-peer access is possible if the peer can
> >>>>>>> handle page fault. For RDMA, that means the NIC must support on-demand paging.
> >>>>>> peer-peer access is currently not possible with hmm_range_fault().
> >>>>> Currently hmm_range_fault() always sets the cpu access flag and device
> >>>>> private pages are migrated to the system RAM in the fault handler.
> >>>>> However, it's possible to have a modified code flow to keep the device
> >>>>> private page info for use with peer to peer access.
> >>>> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> >>> But pfn is supposed to be all that is needed.
> >>>
> >>>>>> So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> >>>>> That's right, the patch alone is just half of the story. The
> >>>>> functionality depends on availability of dma-buf exporter that can pin
> >>>>> the device memory.
> >>>> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> >>>> other parts need to be settled out first.
> >>> At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> >>> the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> >>> comments are very helpful for us to refine the patch so that we can be ready when
> >>> GPU side support becomes available.
> >>>
> >>>> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> >>>> support non-dynamic mapping?
> >>> We are working on direct direction.
> >>>
> >>>>>>> migrate to system RAM. This is due to the lack of knowledge about
> >>>>>>> whether the importer can perform peer-to-peer access and the lack
> >>>>>>> of resource limit control measure for GPU. For the first part, the
> >>>>>>> latest dma-buf driver has a peer-to-peer flag for the importer,
> >>>>>>> but the flag is currently tied to dynamic mapping support, which
> >>>>>>> requires on-demand paging support from the NIC to work.
> >>>>>> ODP for DMA buf?
> >>>>> Right.
> >>>> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> >>>> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> >>>> locking might be a bit tricky though)
> >>> The main issue is that not all NICs support ODP.
> >> You don't need on-demand paging support from the NIC for dynamic mapping to
> >> work.
> >>
> >> All you need is the ability to stop wait for ongoing accesses to end and
> >> make sure that new ones grab a new mapping.
> > Swap and flush isn't a general HW ability either..
> >
> > I'm unclear how this could be useful, it is guarenteed to corrupt
> > in-progress writes?
> >
> > Did you mean pause, swap and resume? That's ODP.
>
> Yes, something like this. And good to know, never heard of ODP.

Hm I thought ODP was full hw page faults at an individual page level,
and this stop&resume is for the entire nic. Under the hood both apply
back-pressure on the network if a transmission can't be received, but
I thought the ODP one is a lot more fine-grained. For this dma_buf
stop-the-world approach we'd also need to guarantee _all_ buffers are
present again (without hw page faults on the nic side of things),
which has some additional locking implications.

> On the GPU side we can pipeline things, e.g. you can program the
> hardware that page tables are changed at a certain point in time.
>
> So what we do is when we get a notification that a buffer will move
> around is to mark this buffer in our structures as invalid and return a
> fence to so that the exporter is able to wait for ongoing stuff to finish.
>
> The actual move then happens only after the ongoing operations on the
> GPU are finished and on the next operation we grab the new location of
> the buffer and re-program the page tables to it.
>
> This way all the CPU does is really just planning asynchronous page
> table changes which are executed on the GPU later on.
>
> You can of course do it synchronized as well, but this would hurt our
> performance pretty badly.

So since Jason really doesn't like dma_fence much I think for rdma
synchronous it is. And it shouldn't really matter, since waiting for a
small transaction to complete at rdma wire speed isn't really that
long an operation. Should be much faster than waiting for the gpu to
complete quite sizeable amounts of rendering first. Plus with real hw
page faults it's really just pte write plus tlb flush, that should
definitely be possible from the ->move_notify hook.
But even the global stop-the-world approach I expect is going to be
much faster than a preempt request on a gpu.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Jason
>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-01 15:42               ` Daniel Vetter
@ 2020-07-01 17:15                 ` Jason Gunthorpe
  2020-07-02 13:10                   ` Daniel Vetter
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-07-01 17:15 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Christian König, Xiong, Jianxin

On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > >> All you need is the ability to stop wait for ongoing accesses to end and
> > >> make sure that new ones grab a new mapping.
> > > Swap and flush isn't a general HW ability either..
> > >
> > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > in-progress writes?
> > >
> > > Did you mean pause, swap and resume? That's ODP.
> >
> > Yes, something like this. And good to know, never heard of ODP.
> 
> Hm I thought ODP was full hw page faults at an individual page
> level,

Yes

> and this stop&resume is for the entire nic. Under the hood both apply
> back-pressure on the network if a transmission can't be received,
> but

NICs don't do stop and resume; blocking the Rx pipe is very
problematic and performance destroying.

The strategy for something like ODP is more complex, and so far no NIC
has deployed it at any granularity larger than per-page.

> So since Jason really doesn't like dma_fence much I think for rdma
> synchronous it is. And it shouldn't really matter, since waiting for a
> small transaction to complete at rdma wire speed isn't really that
> long an operation. 

Even if DMA fence were to somehow be involved, how would it look?

Jason

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-06-30 20:08           ` Xiong, Jianxin
@ 2020-07-02 12:27             ` Jason Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2020-07-02 12:27 UTC (permalink / raw)
  To: Xiong, Jianxin
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Christian Koenig,
	Doug Ledford, Vetter, Daniel

On Tue, Jun 30, 2020 at 08:08:46PM +0000, Xiong, Jianxin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Tuesday, June 30, 2020 12:17 PM
> > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>; dri-
> > devel@lists.freedesktop.org
> > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > 
> > > >
> > > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared
> > > > > > > virtual address space and page migration between system memory
> > > > > > > and device memory. HMM doesn't support pinning device memory
> > > > > > > because pages located on device must be able to migrate to
> > > > > > > system memory when accessed by CPU. Peer-to-peer access is
> > > > > > > possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > >
> > > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > >
> > > > > Currently hmm_range_fault() always sets the cpu access flag and
> > > > > device private pages are migrated to the system RAM in the fault handler.
> > > > > However, it's possible to have a modified code flow to keep the
> > > > > device private page info for use with peer to peer access.
> > > >
> > > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything
> > useful.
> > >
> > > But pfn is supposed to be all that is needed.
> > 
> > Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.
> 
> Hmm. I thought the pfn corresponds to the address in the BAR range. I could be
> wrong here. 

No, the DEVICE_PRIVATE pfn is a dummy that points at empty address space.

Jason

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-01 17:15                 ` Jason Gunthorpe
@ 2020-07-02 13:10                   ` Daniel Vetter
  2020-07-02 13:29                     ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Vetter @ 2020-07-02 13:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Christian König, Xiong, Jianxin

On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > >> All you need is the ability to stop wait for ongoing accesses to end and
> > > >> make sure that new ones grab a new mapping.
> > > > Swap and flush isn't a general HW ability either..
> > > >
> > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > in-progress writes?
> > > >
> > > > Did you mean pause, swap and resume? That's ODP.
> > >
> > > Yes, something like this. And good to know, never heard of ODP.
> > 
> > Hm I thought ODP was full hw page faults at an individual page
> > level,
> 
> Yes
> 
> > and this stop&resume is for the entire nic. Under the hood both apply
> > back-pressure on the network if a transmission can't be received,
> > but
> 
> NIC's don't do stop and resume, blocking the Rx pipe is very
> problematic and performance destroying.
> 
> The strategy for something like ODP is more complex, and so far no NIC
> has deployed it at any granularity larger than per-page.
> 
> > So since Jason really doesn't like dma_fence much I think for rdma
> > synchronous it is. And it shouldn't really matter, since waiting for a
> > small transaction to complete at rdma wire speed isn't really that
> > long an operation. 
> 
> Even if DMA fence were to somehow be involved, how would it look?

Well, above you're saying it would be performance-destroying, but let's
pretend that's not a problem :-) Also, I have no clue about rdma, so this
is really just the flow we have on the gpu side.

0. rdma driver maintains a list of all dma-buf that it has mapped
somewhere and is currently using for transactions

1. rdma driver gets a dma-buf move_notify callback on one of these
buffers. To handle that it:
	1. stops hw access somehow at the rx
	2. flushes caches and whatever else is needed
	3. moves the unmapped buffer onto a special list or marks it in some
	other way as unavailable
	4. launches a kthread/work_struct to fix everything back up

2. dma-buf exporter (gpu driver) can now issue the commands to move the
buffer around

3. rdma driver worker gets busy to restart rx:
	1. lock all dma-buf that are currently in use (dma_resv_lock).
	thanks to ww_mutex deadlock avoidance this is possible
	2. for any buffers which have been marked as unavailable in 1.3
	grab a new mapping (which might now be in system memory, or again
	peer2peer but different address)
	3. restart hw and rx
	4. unlock all dma-buf locks (dma_resv_unlock)

There is a minor problem because step 2 only queues up the buffer moves
behind a pile of dma_fences, and atm we haven't made it absolutely clear
who's responsible for waiting for those to complete. For gpu drivers it's
the importer, since gpu drivers don't have big qualms about dma_fences, so
3.2 would perhaps also include a dma_fence_wait to make sure the buffer
move has actually completed.
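
Very rough sketch of what I mean on the importer side, to make the flow
concrete - the rdma_* helpers are completely made up (whatever the driver
needs to quiesce, remap and restart its hw), so this is just the shape of
it, not tested:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

struct rdma_dmabuf_mr {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	bool invalidated;
	struct work_struct revalidate_work;
};

/* hypothetical driver hooks, not a real API */
void rdma_stop_rx_access(struct rdma_dmabuf_mr *mr);
void rdma_remap_mr(struct rdma_dmabuf_mr *mr, struct sg_table *sgt);
void rdma_restart_rx(struct rdma_dmabuf_mr *mr);

/* step 1: exporter is about to move the buffer, called with resv held */
static void rdma_dmabuf_move_notify(struct dma_buf_attachment *attach)
{
	struct rdma_dmabuf_mr *mr = attach->importer_priv;

	rdma_stop_rx_access(mr);		/* 1.1/1.2: quiesce + flush */
	if (mr->sgt) {
		dma_buf_unmap_attachment(attach, mr->sgt, DMA_BIDIRECTIONAL);
		mr->sgt = NULL;
	}
	mr->invalidated = true;			/* 1.3: mark unavailable */
	schedule_work(&mr->revalidate_work);	/* 1.4: fix up later */
}

static const struct dma_buf_attach_ops rdma_dmabuf_attach_ops = {
	.allow_peer2peer = true,		/* the p2p flag discussed earlier */
	.move_notify = rdma_dmabuf_move_notify,
};

/* step 3: worker re-establishes the mapping and restarts rx */
static void rdma_dmabuf_revalidate(struct work_struct *work)
{
	struct rdma_dmabuf_mr *mr =
		container_of(work, struct rdma_dmabuf_mr, revalidate_work);
	struct dma_buf *dmabuf = mr->attach->dmabuf;

	dma_resv_lock(dmabuf->resv, NULL);	/* 3.1 */
	/* wait for the exporter's move fences before touching the memory */
	dma_resv_wait_timeout_rcu(dmabuf->resv, true, false,
				  MAX_SCHEDULE_TIMEOUT);
	mr->sgt = dma_buf_map_attachment(mr->attach, DMA_BIDIRECTIONAL);
	if (!IS_ERR(mr->sgt)) {			/* 3.2: new addresses */
		rdma_remap_mr(mr, mr->sgt);
		mr->invalidated = false;
		rdma_restart_rx(mr);		/* 3.3 */
	}
	dma_resv_unlock(dmabuf->resv);		/* 3.4 */
}

The attachment would come from dma_buf_dynamic_attach(dmabuf, dev,
&rdma_dmabuf_attach_ops, mr), with INIT_WORK(&mr->revalidate_work,
rdma_dmabuf_revalidate) wiring up the worker. Error handling and locking
against concurrent invalidations is obviously missing here.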

The above flow is more or less exactly what happens for gpu workloads
where we can preempt running computations. Instead of stopping rx we
preempt the compute job and remove it from the scheduler queues, and
instead of restarting rx we just put the compute job back onto the
scheduler queue as eligible for gpu time. Otherwise it's exactly the same
stuff.

Of course if you only have a single compute job and too many such
interruptions, then performance is going to tank. Don't do that, instead
make sure you have enough vram or system memory or whatever :-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-02 13:10                   ` Daniel Vetter
@ 2020-07-02 13:29                     ` Jason Gunthorpe
  2020-07-02 14:50                       ` Christian König
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-07-02 13:29 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Christian König, Xiong, Jianxin

On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
> On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > > >> All you need is the ability to stop wait for ongoing accesses to end and
> > > > >> make sure that new ones grab a new mapping.
> > > > > Swap and flush isn't a general HW ability either..
> > > > >
> > > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > > in-progress writes?
> > > > >
> > > > > Did you mean pause, swap and resume? That's ODP.
> > > >
> > > > Yes, something like this. And good to know, never heard of ODP.
> > > 
> > > Hm I thought ODP was full hw page faults at an individual page
> > > level,
> > 
> > Yes
> > 
> > > and this stop&resume is for the entire nic. Under the hood both apply
> > > back-pressure on the network if a transmission can't be received,
> > > but
> > 
> > NIC's don't do stop and resume, blocking the Rx pipe is very
> > problematic and performance destroying.
> > 
> > The strategy for something like ODP is more complex, and so far no NIC
> > has deployed it at any granularity larger than per-page.
> > 
> > > So since Jason really doesn't like dma_fence much I think for rdma
> > > synchronous it is. And it shouldn't really matter, since waiting for a
> > > small transaction to complete at rdma wire speed isn't really that
> > > long an operation. 
> > 
> > Even if DMA fence were to somehow be involved, how would it look?
> 
> Well above you're saying it would be performance destroying, but let's
> pretend that's not a problem :-) Also, I have no clue about rdma, so this
> is really just the flow we have on the gpu side.

I see. No, this is not workable; the command flow in RDMA is not at
all like GPU - what you are proposing is a global 'stop the whole
chip' halt of the Tx and Rx flows for an undetermined time. Not feasible.

What we can do is use ODP techniques and pause only the MR attached to
the DMA buf with the process you outline below. This is not so hard to
implement.

> 3. rdma driver worker gets busy to restart rx:
> 	1. lock all dma-buf that are currently in use (dma_resv_lock).
> 	thanks to ww_mutex deadlock avoidance this is possible

Why all? Why not just lock the one that was invalidated to restore the
mappings? Is that some artifact of the GPU approach?

And why is this done with work queues and locking instead of a
callback saying the buffer is valid again?

Jason

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-02 13:29                     ` Jason Gunthorpe
@ 2020-07-02 14:50                       ` Christian König
  2020-07-02 18:15                         ` Daniel Vetter
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2020-07-02 14:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Daniel Vetter
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Xiong, Jianxin

Am 02.07.20 um 15:29 schrieb Jason Gunthorpe:
> On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
>> On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
>>> On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
>>>>>>> All you need is the ability to stop wait for ongoing accesses to end and
>>>>>>> make sure that new ones grab a new mapping.
>>>>>> Swap and flush isn't a general HW ability either..
>>>>>>
>>>>>> I'm unclear how this could be useful, it is guarenteed to corrupt
>>>>>> in-progress writes?
>>>>>>
>>>>>> Did you mean pause, swap and resume? That's ODP.
>>>>> Yes, something like this. And good to know, never heard of ODP.
>>>> Hm I thought ODP was full hw page faults at an individual page
>>>> level,
>>> Yes
>>>
>>>> and this stop&resume is for the entire nic. Under the hood both apply
>>>> back-pressure on the network if a transmission can't be received,
>>>> but
>>> NIC's don't do stop and resume, blocking the Rx pipe is very
>>> problematic and performance destroying.
>>>
>>> The strategy for something like ODP is more complex, and so far no NIC
>>> has deployed it at any granularity larger than per-page.
>>>
>>>> So since Jason really doesn't like dma_fence much I think for rdma
>>>> synchronous it is. And it shouldn't really matter, since waiting for a
>>>> small transaction to complete at rdma wire speed isn't really that
>>>> long an operation.
>>> Even if DMA fence were to somehow be involved, how would it look?
>> Well above you're saying it would be performance destroying, but let's
>> pretend that's not a problem :-) Also, I have no clue about rdma, so this
>> is really just the flow we have on the gpu side.
> I see, no, this is not workable, the command flow in RDMA is not at
> all like GPU - what you are a proposing is a global 'stop the whole
> chip' Tx and Rx flows for an undetermined time. Not feasible
>
> What we can do is use ODP techniques and pause only the MR attached to
> the DMA buf with the process you outline below. This is not so hard to
> implement.

Well it boils down to only two requirements:

1. You can stop accessing the memory or addresses exported by the DMA-buf.

2. Before the next access you need to acquire a new mapping.

How you do this is perfectly up to you. E.g. you can stop everything, 
just prevent access to this DMA-buf, or just pause the users of this 
DMA-buf....

>
>> 3. rdma driver worker gets busy to restart rx:
>> 	1. lock all dma-buf that are currently in use (dma_resv_lock).
>> 	thanks to ww_mutex deadlock avoidance this is possible
> Why all? Why not just lock the one that was invalidated to restore the
> mappings? That is some artifact of the GPU approach?

No, but you must make sure that mapping one doesn't invalidate others 
you need.

Otherwise you can end up in a nice live lock :)

> And why is this done with work queues and locking instead of a
> callback saying the buffer is valid again?

You can do this as well, but a work queue is usually easier to handle 
than a notification in an interrupt context of a foreign driver.

Regards,
Christian.

>
> Jason


* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-02 14:50                       ` Christian König
@ 2020-07-02 18:15                         ` Daniel Vetter
  2020-07-03 12:03                           ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Vetter @ 2020-07-02 18:15 UTC (permalink / raw)
  To: Christian König
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Xiong, Jianxin

On Thu, Jul 02, 2020 at 04:50:32PM +0200, Christian König wrote:
> Am 02.07.20 um 15:29 schrieb Jason Gunthorpe:
> > On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
> > > On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > > > > > > All you need is the ability to stop wait for ongoing accesses to end and
> > > > > > > > make sure that new ones grab a new mapping.
> > > > > > > Swap and flush isn't a general HW ability either..
> > > > > > > 
> > > > > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > > > > in-progress writes?
> > > > > > > 
> > > > > > > Did you mean pause, swap and resume? That's ODP.
> > > > > > Yes, something like this. And good to know, never heard of ODP.
> > > > > Hm I thought ODP was full hw page faults at an individual page
> > > > > level,
> > > > Yes
> > > > 
> > > > > and this stop&resume is for the entire nic. Under the hood both apply
> > > > > back-pressure on the network if a transmission can't be received,
> > > > > but
> > > > NIC's don't do stop and resume, blocking the Rx pipe is very
> > > > problematic and performance destroying.
> > > > 
> > > > The strategy for something like ODP is more complex, and so far no NIC
> > > > has deployed it at any granularity larger than per-page.
> > > > 
> > > > > So since Jason really doesn't like dma_fence much I think for rdma
> > > > > synchronous it is. And it shouldn't really matter, since waiting for a
> > > > > small transaction to complete at rdma wire speed isn't really that
> > > > > long an operation.
> > > > Even if DMA fence were to somehow be involved, how would it look?
> > > Well above you're saying it would be performance destroying, but let's
> > > pretend that's not a problem :-) Also, I have no clue about rdma, so this
> > > is really just the flow we have on the gpu side.
> > I see, no, this is not workable, the command flow in RDMA is not at
> > all like GPU - what you are a proposing is a global 'stop the whole
> > chip' Tx and Rx flows for an undetermined time. Not feasible

Yeah, I said I have no clue about rdma :-)

> > What we can do is use ODP techniques and pause only the MR attached to
> > the DMA buf with the process you outline below. This is not so hard to
> > implement.
> 
> Well it boils down to only two requirements:
> 
> 1. You can stop accessing the memory or addresses exported by the DMA-buf.
> 
> 2. Before the next access you need to acquire a new mapping.
> 
> How you do this is perfectly up to you. E.g. you can stop everything, just
> prevent access to this DMA-buf, or just pause the users of this DMA-buf....

Yeah in a gpu we also don't stop the entire world, only the context that
needs the buffer. If there's other stuff to run, we do keep running that.
Usually the reason for a buffer move is that we do actually have other
stuff that needs to be run, and which needs more vram for itself, so we
might have to throw out a few buffers from vram that can also be placed in
system memory.

Note that a modern gpu has multiple engines and most have or are gaining
hw scheduling of some sort, so there's a pile of concurrent gpu context
execution going on at any moment. The days where we just push gpu work
into a fifo are (mostly) long gone.

> > > 3. rdma driver worker gets busy to restart rx:
> > > 	1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > 	thanks to ww_mutex deadlock avoidance this is possible
> > Why all? Why not just lock the one that was invalidated to restore the
> > mappings? That is some artifact of the GPU approach?
> 
> No, but you must make sure that mapping one doesn't invalidate others you
> need.
> 
> Otherwise you can end up in a nice live lock :)

Also if you don't have pagefaults, but have to track busy memory at a
context level, you do need to grab all locks of all buffers you need, or
you'd race. There's nothing stopping a concurrent ->move_notify on some
other buffer you'll need otherwise, and if you try to be clever and roll
your own locking, you'll anger lockdep - your own lock will have to be
on both sides of ww_mutex or it won't work, and that deadlocks.

So ww_mutex multi-lock dance or bust.
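
For reference the usual dance (with the -EDEADLK backoff) looks roughly
like below - the buffer array handling is made up, untested:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static int lock_all_bufs(struct dma_buf **bufs, unsigned int count,
			 struct ww_acquire_ctx *ctx)
{
	struct dma_buf *contended = NULL;
	unsigned int i;
	int ret;

	ww_acquire_init(ctx, &reservation_ww_class);
retry:
	for (i = 0; i < count; i++) {
		if (bufs[i] == contended) {
			contended = NULL;	/* already locked via the slow path */
			continue;
		}
		ret = dma_resv_lock(bufs[i]->resv, ctx);
		if (ret) {
			unsigned int failed = i;

			while (i--)		/* drop everything we got so far */
				dma_resv_unlock(bufs[i]->resv);
			if (contended)
				dma_resv_unlock(contended->resv);
			if (ret != -EDEADLK) {
				ww_acquire_fini(ctx);
				return ret;
			}
			/* sleep on the contended lock, then start over */
			contended = bufs[failed];
			dma_resv_lock_slow(contended->resv, ctx);
			goto retry;
		}
	}
	ww_acquire_done(ctx);
	return 0;
}

The caller then does its thing, dma_resv_unlock()s every buffer and calls
ww_acquire_fini(ctx) when done. dma_resv_lock/dma_resv_lock_slow are just
thin wrappers around ww_mutex_lock on the shared reservation_ww_class,
which is what makes this safe across drivers.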

> > And why is this done with work queues and locking instead of a
> > callback saying the buffer is valid again?
> 
> You can do this as well, but a work queue is usually easier to handle than a
> notification in an interrupt context of a foreign driver.

Yeah you can just install a dma_fence callback but
- that's hardirq context
- if you don't have per-page hw faults you need the multi-lock ww_mutex
  dance anyway to avoid races.
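
If you do go the fence callback route it's really just a trampoline into a
worker, something like this - how you get at the move fence in the first
place is hand-waved here:

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct revalidate_ctx {
	struct dma_fence_cb cb;
	struct work_struct work;	/* does the dma_resv_lock + remap dance */
};

static void move_done_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct revalidate_ctx *rc = container_of(cb, struct revalidate_ctx, cb);

	/* hardirq context here, so just punt to sleepable context */
	schedule_work(&rc->work);
}

static void watch_buffer_move(struct revalidate_ctx *rc,
			      struct dma_fence *move_fence)
{
	/* non-zero return means the fence already signalled */
	if (dma_fence_add_callback(move_fence, &rc->cb, move_done_cb))
		schedule_work(&rc->work);
}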

On top of that, the exporter has no idea whether the importer still really
needs the mapping, or whether it was just kept around opportunistically.
PTE setup isn't free, so by default gpus keep everything mapped - at least
for the classic gl/vk execution model; compute like rocm/amdkfd is a bit
different here.

Cheers, Daniel

> 
> Regards,
> Christian.
> 
> > 
> > Jason
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-02 18:15                         ` Daniel Vetter
@ 2020-07-03 12:03                           ` Jason Gunthorpe
  2020-07-03 12:52                             ` Daniel Vetter
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-07-03 12:03 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Christian König, Xiong, Jianxin

On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
> > > > 3. rdma driver worker gets busy to restart rx:
> > > > 	1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > > 	thanks to ww_mutex deadlock avoidance this is possible
> > > Why all? Why not just lock the one that was invalidated to restore the
> > > mappings? That is some artifact of the GPU approach?
> > 
> > No, but you must make sure that mapping one doesn't invalidate others you
> > need.
> > 
> > Otherwise you can end up in a nice live lock :)
> 
> Also if you don't have pagefaults, but have to track busy memory at a
> context level, you do need to grab all locks of all buffers you need, or
> you'd race. There's nothing stopping a concurrent ->notify_move on some
> other buffer you'll need otherwise, and if you try to be clever and roll
> you're own locking, you'll anger lockdep - you're own lock will have to be
> on both sides of ww_mutex or it wont work, and that deadlocks.

So you are worried about atomically building some multi-buffer
transaction? I don't think this applies to RDMA, which isn't going to
be transactional here.

> > > And why is this done with work queues and locking instead of a
> > > callback saying the buffer is valid again?
> > 
> > You can do this as well, but a work queue is usually easier to handle than a
> > notification in an interrupt context of a foreign driver.
> 
> Yeah you can just install a dma_fence callback but
> - that's hardirq context
> - if you don't have per-page hw faults you need the multi-lock ww_mutex
>   dance anyway to avoid races.

It is still best to avoid the per-page faults and preload the new
mapping once it is ready.

Jason

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-03 12:03                           ` Jason Gunthorpe
@ 2020-07-03 12:52                             ` Daniel Vetter
  2020-07-03 13:14                               ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Vetter @ 2020-07-03 12:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Christian König, Xiong, Jianxin

On Fri, Jul 3, 2020 at 2:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
> > > > > 3. rdma driver worker gets busy to restart rx:
> > > > >         1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > > >         thanks to ww_mutex deadlock avoidance this is possible
> > > > Why all? Why not just lock the one that was invalidated to restore the
> > > > mappings? That is some artifact of the GPU approach?
> > >
> > > No, but you must make sure that mapping one doesn't invalidate others you
> > > need.
> > >
> > > Otherwise you can end up in a nice live lock :)
> >
> > Also if you don't have pagefaults, but have to track busy memory at a
> > context level, you do need to grab all locks of all buffers you need, or
> > you'd race. There's nothing stopping a concurrent ->notify_move on some
> > other buffer you'll need otherwise, and if you try to be clever and roll
> > you're own locking, you'll anger lockdep - you're own lock will have to be
> > on both sides of ww_mutex or it wont work, and that deadlocks.
>
> So you are worried about atomically building some multi buffer
> transaction? I don't think this applies to RDMA which isn't going to
> be transcational here..

So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into
when memory isn't needed anymore. I think in the rdma world that's
called registering memory, but not sure.

- for hw with hw faults you can pull in the memory when it's needed,
and if concurrently another cpu is taking away pages and invalidating
rdma hw mappings, then that's no problem.

So far so good - zero need for atomic transactions or anything like that.

But the answer I gave here is for when you don't have per-page hw
faulting on the rdma nic, but something that works at a much larger
level. For a gpu it would be a compute context, no idea what the
equivalent for rdma is. This could go up to and including the entire
nic stalling all rx with link level back pressure, but not
necessarily.

Once you go above a single page, or well, at least a single dma-buf
object, you need some way to know when you have all buffers and memory
mappings present again, because you can't restart with only partial
memory.

Now if the rdma memory programming is more like traditional
networking, where you have a single rx or tx buffer, then yeah you
don't need fancy multi-buffer locking. Like I said I have no idea how
many of the buffers you need to restart the rdma stuff for hw which
doesn't have per-page hw faulting.

> > > > And why is this done with work queues and locking instead of a
> > > > callback saying the buffer is valid again?
> > >
> > > You can do this as well, but a work queue is usually easier to handle than a
> > > notification in an interrupt context of a foreign driver.
> >
> > Yeah you can just install a dma_fence callback but
> > - that's hardirq context
> > - if you don't have per-page hw faults you need the multi-lock ww_mutex
> >   dance anyway to avoid races.
>
> It is still best to avoid the per-page faults and preload the new
> mapping once it is ready.

Sure, but I think that's entirely orthogonal.

Also, even if you don't need the multi-buffer dance (either because of hw
faults, or because the rdma rx/tx can be stopped at least at a
per-buffer level through some other means), you still need the
dma_resv_lock to serialize with the exporter. So you need a sleeping
context either way; hence some worker is needed.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-03 12:52                             ` Daniel Vetter
@ 2020-07-03 13:14                               ` Jason Gunthorpe
  2020-07-03 13:21                                 ` Christian König
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-07-03 13:14 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Christian König, Xiong, Jianxin

On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:

> So maybe I'm just totally confused about the rdma model. I thought:
> - you bind a pile of memory for various transactions, that might
> happen whenever. Kernel driver doesn't have much if any insight into
> when memory isn't needed anymore. I think in the rdma world that's
> called registering memory, but not sure.

Sure, but once registered, the memory can be used at any moment with
no visibility from the kernel.

Unlike the GPU, the transactions that trigger memory access do not go
through the kernel - so there is no ability to interrupt a command
flow and fiddle with mappings.

Jason

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-03 13:14                               ` Jason Gunthorpe
@ 2020-07-03 13:21                                 ` Christian König
  2020-07-07 21:58                                   ` Xiong, Jianxin
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2020-07-03 13:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Daniel Vetter
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Doug Ledford, Vetter,
	Daniel, Xiong, Jianxin

Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
>
>> So maybe I'm just totally confused about the rdma model. I thought:
>> - you bind a pile of memory for various transactions, that might
>> happen whenever. Kernel driver doesn't have much if any insight into
>> when memory isn't needed anymore. I think in the rdma world that's
>> called registering memory, but not sure.
> Sure, but once registered the memory is able to be used at any moment with
> no visibilty from the kernel.
>
> Unlike GPU the transactions that trigger memory access do not go
> through the kernel - so there is no ability to interrupt a command
> flow and fiddle with mappings.

This is the same for GPUs with user space queues as well.

But we can still tell, for a process, that it is using a DMA-buf which
has been moved out and so can't run any more unless the DMA-buf is
accessible again.

In other words you somehow need to make sure that the hardware is not 
accessing a piece of memory any more when you want to move it.

Christian.

>
> Jason


* RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-03 13:21                                 ` Christian König
@ 2020-07-07 21:58                                   ` Xiong, Jianxin
  2020-07-08  9:38                                     ` Christian König
  0 siblings, 1 reply; 28+ messages in thread
From: Xiong, Jianxin @ 2020-07-07 21:58 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Daniel Vetter
  Cc: linux-rdma, Vetter, Daniel, Doug Ledford, dri-devel, Leon Romanovsky


> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> > On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> >
> >> So maybe I'm just totally confused about the rdma model. I thought:
> >> - you bind a pile of memory for various transactions, that might
> >> happen whenever. Kernel driver doesn't have much if any insight into
> >> when memory isn't needed anymore. I think in the rdma world that's
> >> called registering memory, but not sure.
> > Sure, but once registered the memory is able to be used at any moment
> > with no visibilty from the kernel.
> >
> > Unlike GPU the transactions that trigger memory access do not go
> > through the kernel - so there is no ability to interrupt a command
> > flow and fiddle with mappings.
> 
> This is the same for GPUs with user space queues as well.
> 
> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
> accessible again.
> 
> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
> 

While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or match any other condition).

So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.

Jianxin

> Christian.
> 
> >
> > Jason


* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-07 21:58                                   ` Xiong, Jianxin
@ 2020-07-08  9:38                                     ` Christian König
  2020-07-08  9:49                                       ` Daniel Vetter
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2020-07-08  9:38 UTC (permalink / raw)
  To: Xiong, Jianxin, Jason Gunthorpe, Daniel Vetter
  Cc: linux-rdma, Vetter, Daniel, Doug Ledford, dri-devel, Leon Romanovsky

Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
>> -----Original Message-----
>> From: Christian König <christian.koenig@amd.com>
>> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
>>> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
>>>
>>>> So maybe I'm just totally confused about the rdma model. I thought:
>>>> - you bind a pile of memory for various transactions, that might
>>>> happen whenever. Kernel driver doesn't have much if any insight into
>>>> when memory isn't needed anymore. I think in the rdma world that's
>>>> called registering memory, but not sure.
>>> Sure, but once registered the memory is able to be used at any moment
>>> with no visibilty from the kernel.
>>>
>>> Unlike GPU the transactions that trigger memory access do not go
>>> through the kernel - so there is no ability to interrupt a command
>>> flow and fiddle with mappings.
>> This is the same for GPUs with user space queues as well.
>>
>> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
>> accessible again.
>>
>> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
>>
> While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
>
> So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.

And that's exactly the reason why I introduced explicit pin()/unpin() 
functions into the DMA-buf API: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
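
For an importer which can't handle invalidations (no ODP) the intended
sequence is roughly the following - untested sketch, error handling
trimmed, and it glosses over how the attachment interacts with the
dynamic-importer machinery:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Pin the buffer so the addresses stay valid until dma_buf_unpin(). */
static struct sg_table *pin_and_map(struct dma_buf *dmabuf,
				    struct device *dev,
				    struct dma_buf_attachment **out_attach)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt = NULL;
	int ret;

	attach = dma_buf_attach(dmabuf, dev);	/* simple/static attach path */
	if (IS_ERR(attach))
		return ERR_CAST(attach);

	dma_resv_lock(dmabuf->resv, NULL);
	ret = dma_buf_pin(attach);
	if (!ret) {
		sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
		if (IS_ERR(sgt)) {
			ret = PTR_ERR(sgt);
			dma_buf_unpin(attach);
		}
	}
	dma_resv_unlock(dmabuf->resv);

	if (ret) {
		dma_buf_detach(dmabuf, attach);
		return ERR_PTR(ret);
	}
	*out_attach = attach;
	return sgt;
}

Teardown is the reverse: dma_buf_unmap_attachment(), dma_buf_unpin() under
the resv lock again, then dma_buf_detach().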

It's just that at least our device drivers currently prevent P2P with 
pinned DMA-bufs for two main reasons:

a) To prevent denial-of-service attacks, because P2P BARs are a rather rare 
resource.

b) To prevent failures in configurations where P2P is not always possible 
between all devices which want to access a buffer.

Regards,
Christian.

>
> Jianxin
>
>> Christian.
>>
>>> Jason


* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-08  9:38                                     ` Christian König
@ 2020-07-08  9:49                                       ` Daniel Vetter
  2020-07-08 14:20                                         ` Christian König
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Vetter @ 2020-07-08  9:49 UTC (permalink / raw)
  To: Christian König
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Xiong, Jianxin

On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
> Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
> > > -----Original Message-----
> > > From: Christian König <christian.koenig@amd.com>
> > > Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> > > > On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> > > > 
> > > > > So maybe I'm just totally confused about the rdma model. I thought:
> > > > > - you bind a pile of memory for various transactions, that might
> > > > > happen whenever. Kernel driver doesn't have much if any insight into
> > > > > when memory isn't needed anymore. I think in the rdma world that's
> > > > > called registering memory, but not sure.
> > > > Sure, but once registered the memory is able to be used at any moment
> > > > with no visibilty from the kernel.
> > > > 
> > > > Unlike GPU the transactions that trigger memory access do not go
> > > > through the kernel - so there is no ability to interrupt a command
> > > > flow and fiddle with mappings.
> > > This is the same for GPUs with user space queues as well.
> > > 
> > > But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
> > > accessible again.
> > > 
> > > In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
> > > 
> > While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
> > 
> > So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
> 
> And that's exactly the reason why I introduced explicit pin()/unpin()
> functions into the DMA-buf API:
> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
> 
> It's just that at least our devices drivers currently prevent P2P with
> pinned DMA-buf's for two main reasons:
> 
> a) To prevent deny of service attacks because P2P BARs are a rather rare
> resource.
> 
> b) To prevent failures in configuration where P2P is not always possible
> between all devices which want to access a buffer.

So the above is more or less the question in the cover letter (which
didn't make it to dri-devel). Can we somehow throw that limitation out, or
is that simply not a good idea?

Simply moving buffers to system memory when they're pinned does avoid a
lot of headaches. For a specific custom-built system we can maybe avoid
that, but I think upstream is kinda a different thing.

Cheers, Daniel

> Regards,
> Christian.
> 
> > 
> > Jianxin
> > 
> > > Christian.
> > > 
> > > > Jason
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-08  9:49                                       ` Daniel Vetter
@ 2020-07-08 14:20                                         ` Christian König
  2020-07-08 14:33                                           ` Alex Deucher
  0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2020-07-08 14:20 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Xiong, Jianxin

Am 08.07.20 um 11:49 schrieb Daniel Vetter:
> On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
>> Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
>>>> -----Original Message-----
>>>> From: Christian König <christian.koenig@amd.com>
>>>> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
>>>>> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
>>>>>
>>>>>> So maybe I'm just totally confused about the rdma model. I thought:
>>>>>> - you bind a pile of memory for various transactions, that might
>>>>>> happen whenever. Kernel driver doesn't have much if any insight into
>>>>>> when memory isn't needed anymore. I think in the rdma world that's
>>>>>> called registering memory, but not sure.
>>>>> Sure, but once registered the memory is able to be used at any moment
>>>>> with no visibilty from the kernel.
>>>>>
>>>>> Unlike GPU the transactions that trigger memory access do not go
>>>>> through the kernel - so there is no ability to interrupt a command
>>>>> flow and fiddle with mappings.
>>>> This is the same for GPUs with user space queues as well.
>>>>
>>>> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
>>>> accessible again.
>>>>
>>>> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
>>>>
>>> While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
>>>
>>> So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
>> And that's exactly the reason why I introduced explicit pin()/unpin()
>> functions into the DMA-buf API:
>> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
>>
>> It's just that at least our devices drivers currently prevent P2P with
>> pinned DMA-buf's for two main reasons:
>>
>> a) To prevent deny of service attacks because P2P BARs are a rather rare
>> resource.
>>
>> b) To prevent failures in configuration where P2P is not always possible
>> between all devices which want to access a buffer.
> So the above is more or less the question in the cover letter (which
> didn't make it to dri-devel). Can we somehow throw that limitation out, or
> is that simply not a good idea?

At least for the AMD graphics drivers that's most certainly not a good idea.

We do have a use case where buffers need to be in system memory because 
P2P doesn't work.

And by pinning them to VRAM you can create a really nice denial-of-service 
attack against the X system.

> Simply moving buffers to system memory when they're pinned does simplify a
> lot of headaches. For a specific custom built system we can avoid that
> maybe, but I think upstream is kinda a different thing.

Yes, agree completely on that. Customers who are willing to take the 
risk can easily do this themselves.

But that is probably not something we should do for upstream.

Regards,
Christian.

>
> Cheers, Daniel
>
>> Regards,
>> Christian.
>>
>>> Jianxin
>>>
>>>> Christian.
>>>>
>>>>> Jason


* Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
  2020-07-08 14:20                                         ` Christian König
@ 2020-07-08 14:33                                           ` Alex Deucher
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Deucher @ 2020-07-08 14:33 UTC (permalink / raw)
  To: Christian König
  Cc: Leon Romanovsky, linux-rdma, dri-devel, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Xiong, Jianxin

On Wed, Jul 8, 2020 at 10:20 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 08.07.20 um 11:49 schrieb Daniel Vetter:
> > On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
> >> Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
> >>>> -----Original Message-----
> >>>> From: Christian König <christian.koenig@amd.com>
> >>>> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> >>>>> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> >>>>>
> >>>>>> So maybe I'm just totally confused about the rdma model. I thought:
> >>>>>> - you bind a pile of memory for various transactions, that might
> >>>>>> happen whenever. Kernel driver doesn't have much if any insight into
> >>>>>> when memory isn't needed anymore. I think in the rdma world that's
> >>>>>> called registering memory, but not sure.
> >>>>> Sure, but once registered the memory is able to be used at any moment
> >>>>> with no visibilty from the kernel.
> >>>>>
> >>>>> Unlike GPU the transactions that trigger memory access do not go
> >>>>> through the kernel - so there is no ability to interrupt a command
> >>>>> flow and fiddle with mappings.
> >>>> This is the same for GPUs with user space queues as well.
> >>>>
> >>>> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
> >>>> accessible again.
> >>>>
> >>>> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
> >>>>
> >>> While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
> >>>
> >>> So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
> >> And that's exactly the reason why I introduced explicit pin()/unpin()
> >> functions into the DMA-buf API:
> >> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
> >>
> >> It's just that at least our devices drivers currently prevent P2P with
> >> pinned DMA-buf's for two main reasons:
> >>
> >> a) To prevent deny of service attacks because P2P BARs are a rather rare
> >> resource.
> >>
> >> b) To prevent failures in configuration where P2P is not always possible
> >> between all devices which want to access a buffer.
> > So the above is more or less the question in the cover letter (which
> > didn't make it to dri-devel). Can we somehow throw that limitation out, or
> > is that simply not a good idea?
>
> At least for the AMD graphics drivers that's most certain not a good idea.
>
> We do have an use case where buffers need to be in system memory because
> P2P doesn't work.
>
> And by pinning them to VRAM you can create a really nice deny of service
> attack against the X system.
>

On the other hand, on modern platforms with large or resizable BARs,
you may end up with systems with more vram than system ram.

Alex


> > Simply moving buffers to system memory when they're pinned does simplify a
> > lot of headaches. For a specific custom built system we can avoid that
> > maybe, but I think upstream is kinda a different thing.
>
> Yes, agree completely on that. Customers which are willing to take the
> risk can easily do this themselves.
>
> But that is not something we should probably do for upstream.
>
> Regards,
> Christian.
>
> >
> > Cheers, Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> Jianxin
> >>>
> >>>> Christian.
> >>>>
> >>>>> Jason
>

end of thread, other threads:[~2020-07-08 14:34 UTC | newest]

Thread overview: 28+ messages
     [not found] <1593451903-30959-1-git-send-email-jianxin.xiong@intel.com>
     [not found] ` <20200629185152.GD25301@ziepe.ca>
     [not found]   ` <MW3PR11MB4555A99038FA0CFC3ED80D3DE56F0@MW3PR11MB4555.namprd11.prod.outlook.com>
     [not found]     ` <20200630173435.GK25301@ziepe.ca>
2020-06-30 18:46       ` [RFC PATCH v2 0/3] RDMA: add dma-buf support Xiong, Jianxin
2020-06-30 19:17         ` Jason Gunthorpe
2020-06-30 20:08           ` Xiong, Jianxin
2020-07-02 12:27             ` Jason Gunthorpe
2020-07-01  9:03         ` Christian König
2020-07-01 12:07           ` Daniel Vetter
2020-07-01 12:14             ` Daniel Vetter
2020-07-01 12:39           ` Jason Gunthorpe
2020-07-01 12:55             ` Christian König
2020-07-01 15:42               ` Daniel Vetter
2020-07-01 17:15                 ` Jason Gunthorpe
2020-07-02 13:10                   ` Daniel Vetter
2020-07-02 13:29                     ` Jason Gunthorpe
2020-07-02 14:50                       ` Christian König
2020-07-02 18:15                         ` Daniel Vetter
2020-07-03 12:03                           ` Jason Gunthorpe
2020-07-03 12:52                             ` Daniel Vetter
2020-07-03 13:14                               ` Jason Gunthorpe
2020-07-03 13:21                                 ` Christian König
2020-07-07 21:58                                   ` Xiong, Jianxin
2020-07-08  9:38                                     ` Christian König
2020-07-08  9:49                                       ` Daniel Vetter
2020-07-08 14:20                                         ` Christian König
2020-07-08 14:33                                           ` Alex Deucher
2020-06-30 18:56 ` Xiong, Jianxin
     [not found] ` <1593451903-30959-2-git-send-email-jianxin.xiong@intel.com>
2020-06-30 19:04   ` [RFC PATCH v2 1/3] RDMA/umem: Support importing dma-buf as user memory region Xiong, Jianxin
     [not found] ` <1593451903-30959-3-git-send-email-jianxin.xiong@intel.com>
2020-06-30 19:04   ` [RFC PATCH v2 2/3] RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf Xiong, Jianxin
     [not found] ` <1593451903-30959-4-git-send-email-jianxin.xiong@intel.com>
2020-06-30 19:05   ` [RFC PATCH v2 3/3] RDMA/uverbs: Add uverbs command for dma-buf based MR registration Xiong, Jianxin
