Linux-RDMA Archive on lore.kernel.org
* [PATCH v6 0/4] RDMA: Add dma-buf support
@ 2020-10-23 16:39 Jianxin Xiong
  2020-10-23 16:39 ` [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region Jianxin Xiong
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Jianxin Xiong @ 2020-10-23 16:39 UTC (permalink / raw)
  To: linux-rdma, dri-devel
  Cc: Jianxin Xiong, Doug Ledford, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Daniel Vetter

This is the sixth version of the patch set. Changelog:

v6:
* Move the dma-buf invalidation callback from the core to the device
  driver
* Move mapping update from work queue to pagefault handler
* Add dma-buf based MRs to the xarray of mmkeys so that the pagefault
  handler can be reached
* Update the new driver method and uverbs command signature by changing
  the parameter 'addr' to 'offset'
* Modify the sg list returned from dma_buf_map_attachment() based on
  the parameters 'offset' and 'length'
* Don't import dma-buf if 'dma_virt_ops' is used by the dma device
* The patch that clarifies dma-buf sg lists alignment has landed at
  https://cgit.freedesktop.org/drm/drm-misc/commit/?id=ac80cd17a615
  and thus is no longer included with this set

v5: https://www.spinics.net/lists/linux-rdma/msg96786.html
* Fix a few warnings reported by kernel test robot:
    - no previous prototype for function 'ib_umem_dmabuf_release' 
    - no previous prototype for function 'ib_umem_dmabuf_map_pages'
    - comparison of distinct pointer types in 'check_add_overflow'
* Add comment for the wait between getting the dma-buf sg table and
  updating the NIC page table

v4: https://www.spinics.net/lists/linux-rdma/msg96767.html
* Add a new ib_device method reg_user_mr_dmabuf() instead of expanding
  the existing method reg_user_mr()
* Use a separate code flow for dma-buf instead of adding special cases
  to the ODP memory region code path
* In the invalidation callback, the new mapping is updated as a whole
  using a work queue instead of being updated at page granularity in
  the page fault handler
* Use dma_resv_get_excl() and dma_fence_wait() to ensure the content of
  the pages has been moved to the new location before the new mapping
  is programmed into the NIC
* Add code to the ODP page fault handler to check the mapping status
* The new access flag added in v3 is removed.
* The checking for on-demand paging support in the new uverbs command
  is removed because it is implied by implementing the new ib_device
  method
* Clarify that dma-buf sg lists are page aligned

v3: https://www.spinics.net/lists/linux-rdma/msg96330.html
* Use dma_buf_dynamic_attach() instead of dma_buf_attach()
* Use on-demand paging mechanism to avoid pinning the GPU memory
* Instead of adding a new parameter to the device method for memory
  registration, pass all the attributes including the file descriptor
  as a structure
* Define a new access flag for dma-buf based memory region
* Check for on-demand paging support in the new uverbs command

v2: https://www.spinics.net/lists/linux-rdma/msg93643.html
* The Kconfig option is removed. There is no dependency issue since
  the dma-buf driver is always enabled.
* The declaration of new data structure and functions is reorganized to
  minimize the visibility of the changes.
* The new uverbs command now goes through ioctl() instead of write().
* The rereg functionality is removed.
* Instead of adding a new device method for dma-buf specific
  registration, the existing method is extended to accept an extra
  parameter.
* The correct function is now used for address range checking.

v1: https://www.spinics.net/lists/linux-rdma/msg90720.html
* The initial patch set
* Implement core functions for importing and mapping dma-buf
* Use dma-buf static attach interface
* Add two ib_device methods reg_user_mr_fd() and rereg_user_mr_fd()
* Add two uverbs commands via the write() interface
* Add Kconfig option
* Add dma-buf support to mlx5 device

When enabled, an RDMA capable NIC can perform peer-to-peer transactions
over PCIe to access the local memory located on another device. This can
often lead to better performance than using a system memory buffer for
RDMA and copying data between the buffer and device memory.

The current kernel RDMA stack uses get_user_pages() to pin the physical
pages backing the user buffer and uses dma_map_sg_attrs() to get the
dma addresses for memory access. This usually doesn't work for peer
device memory due to the lack of associated page structures.

Several mechanisms exist today to facilitate device memory access.

ZONE_DEVICE is a new zone for device memory in the memory management
subsystem. It allows pages from device memory to be described with
specialized page structures, but what can be done with these page
structures may be different from system memory. ZONE_DEVICE is further
specialized into multiple memory types, such as one type for PCI
p2pmem/p2pdma and one type for HMM.

PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing
in a PCI BAR and provides a set of calls to publish, discover, allocate,
and map such memory for peer-to-peer transactions. One feature of the
API is that the buffer is allocated by the side that does the DMA
transfer. This works well with the storage use case, but is awkward
with GPU-NIC communication, where typically the buffer is allocated by
the GPU driver rather than the NIC driver.

Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
and ZONE_DEVICE to support shared virtual address space and page
migration between system memory and device memory. HMM doesn't support
pinning device memory because pages located on device must be able to
migrate to system memory when accessed by CPU. Peer-to-peer access
is currently not supported by HMM.

Dma-buf is a standard mechanism for sharing buffers among different
device drivers. The buffer to be shared is exported by the owning
driver and imported by the driver that wants to use it. The exporter
provides a set of ops that the importer can call to pin and map the
buffer. In addition, a file descriptor can be associated with a
dma-buf object as the handle that can be passed to user space.

This patch series adds a dma-buf importer role to the RDMA driver in an
attempt to support RDMA using device memory such as GPU VRAM. Dma-buf is
chosen for a few reasons: first, the API is relatively simple and allows
a lot of flexibility in implementing the buffer manipulation ops.
Second, it doesn't require page structure. Third, dma-buf is already
supported in many GPU drivers. However, we are aware that existing GPU
drivers don't allow pinning device memory via the dma-buf interface.
Pinning would simply cause the backing storage to migrate to system RAM.
True peer-to-peer access is only possible using dynamic attach, which
requires on-demand paging support from the NIC to work. For this reason,
this series only works with ODP capable NICs.

This series consists of four patches. The first patch adds the common
code for importing dma-buf from a file descriptor and mapping the
dma-buf pages. Patch 2 adds the new driver method reg_user_mr_dmabuf().
Patch 3 adds a new uverbs command for registering a dma-buf based memory
region. Patch 4 adds dma-buf support to the mlx5 driver. The patch that
clarified in the dma-buf API documentation that dma-buf sg lists are
page aligned has already landed and is no longer part of this series.
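
As an illustration only, the page-aligned trimming that patch 1 applies
to the sg list returned by dma_buf_map_attachment() can be modeled in
user space roughly as follows. This is a sketch, not the kernel code:
'struct seg', seg_slice() and MODEL_PAGE_SIZE are hypothetical names, a
4 KiB page size is assumed, the caller supplies the output array instead
of sg_alloc_table(), and 'offset + length' is assumed not to overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct seg {
	uint64_t addr;
	uint64_t len;
};

/*
 * Copy into 'out' the parts of 'in' that cover [offset, offset + length)
 * after widening that range to page boundaries. Returns the number of
 * entries written; 'out' must have room for 'n' entries.
 */
size_t seg_slice(const struct seg *in, size_t n, uint64_t offset,
		 uint64_t length, struct seg *out)
{
	uint64_t start = offset & ~(MODEL_PAGE_SIZE - 1);
	uint64_t end = (offset + length + MODEL_PAGE_SIZE - 1) &
		       ~(MODEL_PAGE_SIZE - 1);
	uint64_t skip = start;		/* bytes still to skip */
	uint64_t want = end - start;	/* bytes still to emit */
	size_t count = 0;

	assert(in && out);

	for (size_t i = 0; i < n; i++) {
		/* Consume as much of the remaining skip as this entry holds. */
		uint64_t off = in[i].len < skip ? in[i].len : skip;
		uint64_t len = in[i].len - off;

		/* Never emit more than the range still needs. */
		if (len > want)
			len = want;
		if (len) {
			out[count].addr = in[i].addr + off;
			out[count].len = len;
			count++;
		}
		want -= len;
		skip -= off;
	}
	return count;
}
```

Entries that lie entirely before or after the requested range produce
no output, so the result holds only the pages the memory region covers.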

Related user space RDMA library changes will be provided as a separate
patch series.

Jianxin Xiong (4):
  RDMA/umem: Support importing dma-buf as user memory region
  RDMA/core: Add device method for registering dma-buf base memory
    region
  RDMA/uverbs: Add uverbs command for dma-buf based MR registration
  RDMA/mlx5: Support dma-buf based userspace memory region

 drivers/infiniband/core/Makefile              |   2 +-
 drivers/infiniband/core/device.c              |   1 +
 drivers/infiniband/core/umem.c                |   4 +
 drivers/infiniband/core/umem_dmabuf.c         | 197 ++++++++++++++++++++++++++
 drivers/infiniband/core/umem_dmabuf.h         |  11 ++
 drivers/infiniband/core/uverbs_std_types_mr.c | 112 +++++++++++++++
 drivers/infiniband/hw/mlx5/main.c             |   2 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |  11 ++
 drivers/infiniband/hw/mlx5/mr.c               | 114 ++++++++++++++-
 drivers/infiniband/hw/mlx5/odp.c              |  70 ++++++++-
 include/rdma/ib_umem.h                        |  35 ++++-
 include/rdma/ib_verbs.h                       |   6 +-
 include/uapi/rdma/ib_user_ioctl_cmds.h        |  14 ++
 13 files changed, 570 insertions(+), 9 deletions(-)
 create mode 100644 drivers/infiniband/core/umem_dmabuf.c
 create mode 100644 drivers/infiniband/core/umem_dmabuf.h

-- 
1.8.3.1



* [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 16:39 [PATCH v6 0/4] RDMA: Add dma-buf support Jianxin Xiong
@ 2020-10-23 16:39 ` Jianxin Xiong
  2020-10-23 16:49   ` Daniel Vetter
  2020-10-27 20:00   ` Jason Gunthorpe
  2020-10-23 16:39 ` [PATCH v6 2/4] RDMA/core: Add device method for registering dma-buf base " Jianxin Xiong
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 25+ messages in thread
From: Jianxin Xiong @ 2020-10-23 16:39 UTC (permalink / raw)
  To: linux-rdma, dri-devel
  Cc: Jianxin Xiong, Doug Ledford, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Daniel Vetter

Dma-buf is a standard cross-driver buffer sharing mechanism that can be
used to support peer-to-peer access from RDMA devices.

Device memory exported via dma-buf is associated with a file descriptor.
This is passed to the user space as a property associated with the
buffer allocation. When the buffer is registered as a memory region,
the file descriptor is passed to the RDMA driver along with other
parameters.

Implement the common code for importing dma-buf object and mapping
dma-buf pages.
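
For illustration, the range validation done before the import can be
modeled in user space roughly as below. This is a sketch under stated
assumptions, not the kernel code: dmabuf_range_ok() and MODEL_PAGE_SIZE
are hypothetical names, a 4 KiB page size is assumed, and the GCC/Clang
builtin __builtin_add_overflow() stands in for the kernel's
check_add_overflow().

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

static_assert((MODEL_PAGE_SIZE & (MODEL_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

/* Return true when [offset, offset + size) is an acceptable range. */
bool dmabuf_range_ok(unsigned long offset, unsigned long size)
{
	unsigned long end;

	/* Reject ranges whose end wraps past the largest unsigned long. */
	if (__builtin_add_overflow(offset, size, &end))
		return false;

	/*
	 * Rounding 'end' up to a page boundary must not wrap to zero.
	 * This also rejects an empty range at offset 0.
	 */
	if (((end + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1)) <
	    MODEL_PAGE_SIZE)
		return false;

	return true;
}
```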

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Acked-by: Christian Koenig <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/infiniband/core/Makefile      |   2 +-
 drivers/infiniband/core/umem.c        |   4 +
 drivers/infiniband/core/umem_dmabuf.c | 197 ++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/umem_dmabuf.h |  11 ++
 include/rdma/ib_umem.h                |  35 +++++-
 5 files changed, 247 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/core/umem_dmabuf.c
 create mode 100644 drivers/infiniband/core/umem_dmabuf.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index ccf2670..8ab4eea 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -40,5 +40,5 @@ ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
 				uverbs_std_types_srq.o \
 				uverbs_std_types_wq.o \
 				uverbs_std_types_qp.o
-ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
+ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o umem_dmabuf.o
 ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e9fecbd..2c45525 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2005 Topspin Communications.  All rights reserved.
  * Copyright (c) 2005 Cisco Systems.  All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2020 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -43,6 +44,7 @@
 #include <rdma/ib_umem_odp.h>
 
 #include "uverbs.h"
+#include "umem_dmabuf.h"
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
@@ -269,6 +271,8 @@ void ib_umem_release(struct ib_umem *umem)
 {
 	if (!umem)
 		return;
+	if (umem->is_dmabuf)
+		return ib_umem_dmabuf_release(to_ib_umem_dmabuf(umem));
 	if (umem->is_odp)
 		return ib_umem_odp_release(to_ib_umem_odp(umem));
 
diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
new file mode 100644
index 0000000..66b234d
--- /dev/null
+++ b/drivers/infiniband/core/umem_dmabuf.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright (c) 2020 Intel Corporation. All rights reserved.
+ */
+
+#include <linux/dma-buf.h>
+#include <linux/dma-resv.h>
+#include <linux/dma-mapping.h>
+
+#include "uverbs.h"
+#include "umem_dmabuf.h"
+
+/*
+ * Generate a new dma sg list from a sub range of an existing dma sg list.
+ * Both the input and output have their entries page aligned.
+ */
+static int ib_umem_dmabuf_sgt_slice(struct sg_table *sgt, u64 offset,
+				    u64 length, struct sg_table *new_sgt)
+{
+	struct scatterlist *sg, *new_sg;
+	u64 start, end, off, addr, len;
+	unsigned int new_nents;
+	int err;
+	int i;
+
+	start = ALIGN_DOWN(offset, PAGE_SIZE);
+	end = ALIGN(offset + length, PAGE_SIZE);
+
+	offset = start;
+	length = end - start;
+	new_nents = 0;
+	for_each_sgtable_dma_sg(sgt, sg, i) {
+		len = sg_dma_len(sg);
+		off = min(len, offset);
+		len -= off;
+		len = min(len, length);
+		if (len)
+			new_nents++;
+		length -= len;
+		offset -= off;
+	}
+
+	err = sg_alloc_table(new_sgt, new_nents, GFP_KERNEL);
+	if (err)
+		return err;
+
+	offset = start;
+	length = end - start;
+	new_sg = new_sgt->sgl;
+	for_each_sgtable_dma_sg(sgt, sg, i) {
+		addr = sg_dma_address(sg);
+		len = sg_dma_len(sg);
+		off = min(len, offset);
+		addr += off;
+		len -= off;
+		len = min(len, length);
+		if (len) {
+			sg_dma_address(new_sg) = addr;
+			sg_dma_len(new_sg) = len;
+			new_sg = sg_next(new_sg);
+		}
+		length -= len;
+		offset -= off;
+	}
+
+	return 0;
+}
+
+int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
+{
+	struct sg_table *sgt;
+	struct dma_fence *fence;
+	int err;
+
+	dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
+
+	sgt = dma_buf_map_attachment(umem_dmabuf->attach,
+				     DMA_BIDIRECTIONAL);
+
+	if (IS_ERR(sgt))
+		return PTR_ERR(sgt);
+
+	err = ib_umem_dmabuf_sgt_slice(sgt, umem_dmabuf->umem.address,
+				       umem_dmabuf->umem.length,
+				       &umem_dmabuf->umem.sg_head);
+	if (err) {
+		dma_buf_unmap_attachment(umem_dmabuf->attach, sgt,
+					 DMA_BIDIRECTIONAL);
+		return err;
+	}
+
+	umem_dmabuf->umem.nmap = umem_dmabuf->umem.sg_head.nents;
+	umem_dmabuf->sgt = sgt;
+
+	/*
+	 * Although the sg list is valid now, the content of the pages
+	 * may be not up-to-date. Wait for the exporter to finish
+	 * the migration.
+	 */
+	fence = dma_resv_get_excl(umem_dmabuf->attach->dmabuf->resv);
+	if (fence)
+		dma_fence_wait(fence, false);
+
+	return 0;
+}
+EXPORT_SYMBOL(ib_umem_dmabuf_map_pages);
+
+void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf)
+{
+	dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
+
+	if (!umem_dmabuf->sgt)
+		return;
+
+	sg_free_table(&umem_dmabuf->umem.sg_head);
+	dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt,
+				 DMA_BIDIRECTIONAL);
+	umem_dmabuf->sgt = NULL;
+}
+EXPORT_SYMBOL(ib_umem_dmabuf_unmap_pages);
+
+struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
+				   unsigned long offset, size_t size,
+				   int fd, int access,
+				   const struct dma_buf_attach_ops *ops)
+{
+	struct dma_buf *dmabuf;
+	struct ib_umem_dmabuf *umem_dmabuf;
+	struct ib_umem *umem;
+	unsigned long end;
+	long ret;
+
+	if (check_add_overflow(offset, (unsigned long)size, &end))
+		return ERR_PTR(-EINVAL);
+
+	if (unlikely(PAGE_ALIGN(end) < PAGE_SIZE))
+		return ERR_PTR(-EINVAL);
+
+	if (unlikely(!ops || !ops->move_notify))
+		return ERR_PTR(-EINVAL);
+
+#ifdef CONFIG_DMA_VIRT_OPS
+	if (device->dma_device->dma_ops == &dma_virt_ops)
+		return ERR_PTR(-EINVAL);
+#endif
+
+	umem_dmabuf = kzalloc(sizeof(*umem_dmabuf), GFP_KERNEL);
+	if (!umem_dmabuf)
+		return ERR_PTR(-ENOMEM);
+
+	umem = &umem_dmabuf->umem;
+	umem->ibdev = device;
+	umem->length = size;
+	umem->address = offset;
+	umem->iova = offset;
+	umem->writable = ib_access_writable(access);
+	umem->is_dmabuf = 1;
+
+	dmabuf = dma_buf_get(fd);
+	if (IS_ERR(dmabuf)) {
+		ret = PTR_ERR(dmabuf);
+		goto out_free_umem;
+	}
+
+	umem_dmabuf->attach = dma_buf_dynamic_attach(
+					dmabuf,
+					device->dma_device,
+					ops,
+					umem_dmabuf);
+	if (IS_ERR(umem_dmabuf->attach)) {
+		ret = PTR_ERR(umem_dmabuf->attach);
+		goto out_release_dmabuf;
+	}
+
+	return umem;
+
+out_release_dmabuf:
+	dma_buf_put(dmabuf);
+
+out_free_umem:
+	kfree(umem_dmabuf);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(ib_umem_dmabuf_get);
+
+void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf)
+{
+	struct dma_buf *dmabuf = umem_dmabuf->attach->dmabuf;
+
+	dma_resv_lock(dmabuf->resv, NULL);
+	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
+	dma_resv_unlock(dmabuf->resv);
+
+	dma_buf_detach(dmabuf, umem_dmabuf->attach);
+	dma_buf_put(dmabuf);
+	kfree(umem_dmabuf);
+}
diff --git a/drivers/infiniband/core/umem_dmabuf.h b/drivers/infiniband/core/umem_dmabuf.h
new file mode 100644
index 0000000..13acf55
--- /dev/null
+++ b/drivers/infiniband/core/umem_dmabuf.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright (c) 2020 Intel Corporation. All rights reserved.
+ */
+
+#ifndef UMEM_DMABUF_H
+#define UMEM_DMABUF_H
+
+void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf);
+
+#endif /* UMEM_DMABUF_H */
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 7059750..73a7b19 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
 /*
  * Copyright (c) 2007 Cisco Systems.  All rights reserved.
+ * Copyright (c) 2020 Intel Corporation.  All rights reserved.
  */
 
 #ifndef IB_UMEM_H
@@ -13,6 +14,7 @@
 
 struct ib_ucontext;
 struct ib_umem_odp;
+struct dma_buf_attach_ops;
 
 struct ib_umem {
 	struct ib_device       *ibdev;
@@ -22,12 +24,25 @@ struct ib_umem {
 	unsigned long		address;
 	u32 writable : 1;
 	u32 is_odp : 1;
+	u32 is_dmabuf : 1;
 	struct work_struct	work;
 	struct sg_table sg_head;
 	int             nmap;
 	unsigned int    sg_nents;
 };
 
+struct ib_umem_dmabuf {
+	struct ib_umem umem;
+	struct dma_buf_attachment *attach;
+	struct sg_table *sgt;
+	void *device_context;
+};
+
+static inline struct ib_umem_dmabuf *to_ib_umem_dmabuf(struct ib_umem *umem)
+{
+	return container_of(umem, struct ib_umem_dmabuf, umem);
+}
+
 /* Returns the offset of the umem start relative to the first page. */
 static inline int ib_umem_offset(struct ib_umem *umem)
 {
@@ -79,6 +94,12 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 				     unsigned long pgsz_bitmap,
 				     unsigned long virt);
+struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
+				   unsigned long offset, size_t size,
+				   int fd, int access,
+				   const struct dma_buf_attach_ops *ops);
+int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf);
+void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf);
 
 #else /* CONFIG_INFINIBAND_USER_MEM */
 
@@ -101,7 +122,19 @@ static inline unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 {
 	return 0;
 }
+static inline struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
+						 unsigned long offset,
+						 size_t size, int fd,
+						 int access,
+						 const struct dma_buf_attach_ops *ops)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
+{
+	return -EINVAL;
+}
+static inline void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf) { }
 
 #endif /* CONFIG_INFINIBAND_USER_MEM */
-
 #endif /* IB_UMEM_H */
-- 
1.8.3.1



* [PATCH v6 2/4] RDMA/core: Add device method for registering dma-buf base memory region
  2020-10-23 16:39 [PATCH v6 0/4] RDMA: Add dma-buf support Jianxin Xiong
  2020-10-23 16:39 ` [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region Jianxin Xiong
@ 2020-10-23 16:39 ` Jianxin Xiong
  2020-10-23 16:40 ` [PATCH v6 3/4] RDMA/uverbs: Add uverbs command for dma-buf based MR registration Jianxin Xiong
  2020-10-23 16:40 ` [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region Jianxin Xiong
  3 siblings, 0 replies; 25+ messages in thread
From: Jianxin Xiong @ 2020-10-23 16:39 UTC (permalink / raw)
  To: linux-rdma, dri-devel
  Cc: Jianxin Xiong, Doug Ledford, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Daniel Vetter

A dma-buf based memory region requires one extra parameter and is processed
quite differently. Adding a separate method allows clean separation from
regular memory regions.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Acked-by: Christian Koenig <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/infiniband/core/device.c | 1 +
 include/rdma/ib_verbs.h          | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index feaec8d..d6cd0ac 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2653,6 +2653,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
 	SET_DEVICE_OP(dev_ops, read_counters);
 	SET_DEVICE_OP(dev_ops, reg_dm_mr);
 	SET_DEVICE_OP(dev_ops, reg_user_mr);
+	SET_DEVICE_OP(dev_ops, reg_user_mr_dmabuf);
 	SET_DEVICE_OP(dev_ops, req_ncomp_notif);
 	SET_DEVICE_OP(dev_ops, req_notify_cq);
 	SET_DEVICE_OP(dev_ops, rereg_user_mr);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9bf6c31..5f0f8be 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2,7 +2,7 @@
 /*
  * Copyright (c) 2004 Mellanox Technologies Ltd.  All rights reserved.
  * Copyright (c) 2004 Infinicon Corporation.  All rights reserved.
- * Copyright (c) 2004 Intel Corporation.  All rights reserved.
+ * Copyright (c) 2004, 2020 Intel Corporation.  All rights reserved.
  * Copyright (c) 2004 Topspin Corporation.  All rights reserved.
  * Copyright (c) 2004 Voltaire Corporation.  All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
@@ -2429,6 +2429,10 @@ struct ib_device_ops {
 	struct ib_mr *(*reg_user_mr)(struct ib_pd *pd, u64 start, u64 length,
 				     u64 virt_addr, int mr_access_flags,
 				     struct ib_udata *udata);
+	struct ib_mr *(*reg_user_mr_dmabuf)(struct ib_pd *pd, u64 offset,
+				     u64 length, u64 virt_addr, int fd,
+				     int mr_access_flags,
+				     struct ib_udata *udata);
 	int (*rereg_user_mr)(struct ib_mr *mr, int flags, u64 start, u64 length,
 			     u64 virt_addr, int mr_access_flags,
 			     struct ib_pd *pd, struct ib_udata *udata);
-- 
1.8.3.1



* [PATCH v6 3/4] RDMA/uverbs: Add uverbs command for dma-buf based MR registration
  2020-10-23 16:39 [PATCH v6 0/4] RDMA: Add dma-buf support Jianxin Xiong
  2020-10-23 16:39 ` [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region Jianxin Xiong
  2020-10-23 16:39 ` [PATCH v6 2/4] RDMA/core: Add device method for registering dma-buf base " Jianxin Xiong
@ 2020-10-23 16:40 ` Jianxin Xiong
  2020-10-23 16:40 ` [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region Jianxin Xiong
  3 siblings, 0 replies; 25+ messages in thread
From: Jianxin Xiong @ 2020-10-23 16:40 UTC (permalink / raw)
  To: linux-rdma, dri-devel
  Cc: Jianxin Xiong, Doug Ledford, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Daniel Vetter

Implement a new uverbs ioctl method for memory registration with a file
descriptor as an extra parameter.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Acked-by: Christian Koenig <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/infiniband/core/uverbs_std_types_mr.c | 112 ++++++++++++++++++++++++++
 include/uapi/rdma/ib_user_ioctl_cmds.h        |  14 ++++
 2 files changed, 126 insertions(+)

diff --git a/drivers/infiniband/core/uverbs_std_types_mr.c b/drivers/infiniband/core/uverbs_std_types_mr.c
index 9b22bb5..f22e58b 100644
--- a/drivers/infiniband/core/uverbs_std_types_mr.c
+++ b/drivers/infiniband/core/uverbs_std_types_mr.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2018, Mellanox Technologies inc.  All rights reserved.
+ * Copyright (c) 2020, Intel Corporation.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -178,6 +179,85 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QUERY_MR)(
 	return IS_UVERBS_COPY_ERR(ret) ? ret : 0;
 }
 
+static int UVERBS_HANDLER(UVERBS_METHOD_REG_DMABUF_MR)(
+	struct uverbs_attr_bundle *attrs)
+{
+	struct ib_uobject *uobj =
+		uverbs_attr_get_uobject(attrs, UVERBS_ATTR_REG_DMABUF_MR_HANDLE);
+	struct ib_pd *pd =
+		uverbs_attr_get_obj(attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);
+	struct ib_device *ib_dev = pd->device;
+
+	u64 offset, length, virt_addr;
+	u32 fd, access_flags;
+	struct ib_mr *mr;
+	int ret;
+
+	if (!ib_dev->ops.reg_user_mr_dmabuf)
+		return -EOPNOTSUPP;
+
+	ret = uverbs_copy_from(&offset, attrs,
+			       UVERBS_ATTR_REG_DMABUF_MR_OFFSET);
+	if (ret)
+		return ret;
+
+	ret = uverbs_copy_from(&length, attrs,
+			       UVERBS_ATTR_REG_DMABUF_MR_LENGTH);
+	if (ret)
+		return ret;
+
+	ret = uverbs_copy_from(&virt_addr, attrs,
+			       UVERBS_ATTR_REG_DMABUF_MR_IOVA);
+	if (ret)
+		return ret;
+
+	ret = uverbs_copy_from(&fd, attrs,
+			       UVERBS_ATTR_REG_DMABUF_MR_FD);
+	if (ret)
+		return ret;
+
+	ret = uverbs_get_flags32(&access_flags, attrs,
+				 UVERBS_ATTR_REG_DMABUF_MR_ACCESS_FLAGS,
+				 IB_ACCESS_SUPPORTED);
+	if (ret)
+		return ret;
+
+	ret = ib_check_mr_access(access_flags);
+	if (ret)
+		return ret;
+
+	mr = pd->device->ops.reg_user_mr_dmabuf(pd, offset, length, virt_addr,
+						fd, access_flags,
+						&attrs->driver_udata);
+	if (IS_ERR(mr))
+		return PTR_ERR(mr);
+
+	mr->device  = pd->device;
+	mr->pd      = pd;
+	mr->type    = IB_MR_TYPE_USER;
+	mr->uobject = uobj;
+	atomic_inc(&pd->usecnt);
+
+	uobj->object = mr;
+
+	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
+			     &mr->lkey, sizeof(mr->lkey));
+	if (ret)
+		goto err_dereg;
+
+	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_DMABUF_MR_RESP_RKEY,
+			     &mr->rkey, sizeof(mr->rkey));
+	if (ret)
+		goto err_dereg;
+
+	return 0;
+
+err_dereg:
+	ib_dereg_mr_user(mr, uverbs_get_cleared_udata(attrs));
+
+	return ret;
+}
+
 DECLARE_UVERBS_NAMED_METHOD(
 	UVERBS_METHOD_ADVISE_MR,
 	UVERBS_ATTR_IDR(UVERBS_ATTR_ADVISE_MR_PD_HANDLE,
@@ -243,6 +323,37 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QUERY_MR)(
 			    UVERBS_ATTR_TYPE(u32),
 			    UA_MANDATORY));
 
+DECLARE_UVERBS_NAMED_METHOD(
+	UVERBS_METHOD_REG_DMABUF_MR,
+	UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
+			UVERBS_OBJECT_MR,
+			UVERBS_ACCESS_NEW,
+			UA_MANDATORY),
+	UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
+			UVERBS_OBJECT_PD,
+			UVERBS_ACCESS_READ,
+			UA_MANDATORY),
+	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_OFFSET,
+			   UVERBS_ATTR_TYPE(u64),
+			   UA_MANDATORY),
+	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_LENGTH,
+			   UVERBS_ATTR_TYPE(u64),
+			   UA_MANDATORY),
+	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_IOVA,
+			   UVERBS_ATTR_TYPE(u64),
+			   UA_MANDATORY),
+	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_REG_DMABUF_MR_FD,
+			   UVERBS_ATTR_TYPE(u32),
+			   UA_MANDATORY),
+	UVERBS_ATTR_FLAGS_IN(UVERBS_ATTR_REG_DMABUF_MR_ACCESS_FLAGS,
+			     enum ib_access_flags),
+	UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
+			    UVERBS_ATTR_TYPE(u32),
+			    UA_MANDATORY),
+	UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_REG_DMABUF_MR_RESP_RKEY,
+			    UVERBS_ATTR_TYPE(u32),
+			    UA_MANDATORY));
+
 DECLARE_UVERBS_NAMED_METHOD_DESTROY(
 	UVERBS_METHOD_MR_DESTROY,
 	UVERBS_ATTR_IDR(UVERBS_ATTR_DESTROY_MR_HANDLE,
@@ -253,6 +364,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QUERY_MR)(
 DECLARE_UVERBS_NAMED_OBJECT(
 	UVERBS_OBJECT_MR,
 	UVERBS_TYPE_ALLOC_IDR(uverbs_free_mr),
+	&UVERBS_METHOD(UVERBS_METHOD_REG_DMABUF_MR),
 	&UVERBS_METHOD(UVERBS_METHOD_DM_MR_REG),
 	&UVERBS_METHOD(UVERBS_METHOD_MR_DESTROY),
 	&UVERBS_METHOD(UVERBS_METHOD_ADVISE_MR),
diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
index 7968a18..dafc7eb 100644
--- a/include/uapi/rdma/ib_user_ioctl_cmds.h
+++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2018, Mellanox Technologies inc.  All rights reserved.
+ * Copyright (c) 2020, Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -251,6 +252,7 @@ enum uverbs_methods_mr {
 	UVERBS_METHOD_MR_DESTROY,
 	UVERBS_METHOD_ADVISE_MR,
 	UVERBS_METHOD_QUERY_MR,
+	UVERBS_METHOD_REG_DMABUF_MR,
 };
 
 enum uverbs_attrs_mr_destroy_ids {
@@ -272,6 +274,18 @@ enum uverbs_attrs_query_mr_cmd_attr_ids {
 	UVERBS_ATTR_QUERY_MR_RESP_IOVA,
 };
 
+enum uverbs_attrs_reg_dmabuf_mr_cmd_attr_ids {
+	UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
+	UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
+	UVERBS_ATTR_REG_DMABUF_MR_OFFSET,
+	UVERBS_ATTR_REG_DMABUF_MR_LENGTH,
+	UVERBS_ATTR_REG_DMABUF_MR_IOVA,
+	UVERBS_ATTR_REG_DMABUF_MR_FD,
+	UVERBS_ATTR_REG_DMABUF_MR_ACCESS_FLAGS,
+	UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
+	UVERBS_ATTR_REG_DMABUF_MR_RESP_RKEY,
+};
+
 enum uverbs_attrs_create_counters_cmd_attr_ids {
 	UVERBS_ATTR_CREATE_COUNTERS_HANDLE,
 };
-- 
1.8.3.1



* [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region
  2020-10-23 16:39 [PATCH v6 0/4] RDMA: Add dma-buf support Jianxin Xiong
                   ` (2 preceding siblings ...)
  2020-10-23 16:40 ` [PATCH v6 3/4] RDMA/uverbs: Add uverbs command for dma-buf based MR registration Jianxin Xiong
@ 2020-10-23 16:40 ` Jianxin Xiong
  2020-10-27 20:08   ` Jason Gunthorpe
  3 siblings, 1 reply; 25+ messages in thread
From: Jianxin Xiong @ 2020-10-23 16:40 UTC (permalink / raw)
  To: linux-rdma, dri-devel
  Cc: Jianxin Xiong, Doug Ledford, Jason Gunthorpe, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Daniel Vetter

Implement the new driver method 'reg_user_mr_dmabuf'.  Utilize the core
functions to import dma-buf based memory region and update the mappings.

Add code to handle dma-buf related page faults.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Acked-by: Christian Koenig <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/infiniband/hw/mlx5/main.c    |   2 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  11 ++++
 drivers/infiniband/hw/mlx5/mr.c      | 114 ++++++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/mlx5/odp.c     |  70 +++++++++++++++++++--
 4 files changed, 191 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 89e04ca..ec4ad2f 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
 /*
  * Copyright (c) 2013-2020, Mellanox Technologies inc. All rights reserved.
+ * Copyright (c) 2020, Intel Corporation. All rights reserved.
  */
 
 #include <linux/debugfs.h>
@@ -4060,6 +4061,7 @@ static int mlx5_ib_enable_driver(struct ib_device *dev)
 	.query_srq = mlx5_ib_query_srq,
 	.query_ucontext = mlx5_ib_query_ucontext,
 	.reg_user_mr = mlx5_ib_reg_user_mr,
+	.reg_user_mr_dmabuf = mlx5_ib_reg_user_mr_dmabuf,
 	.req_notify_cq = mlx5_ib_arm_cq,
 	.rereg_user_mr = mlx5_ib_rereg_user_mr,
 	.resize_cq = mlx5_ib_resize_cq,
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index b1f2b34..4b72ff9 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
 /*
  * Copyright (c) 2013-2020, Mellanox Technologies inc. All rights reserved.
+ * Copyright (c) 2020, Intel Corporation. All rights reserved.
  */
 
 #ifndef MLX5_IB_H
@@ -639,6 +640,12 @@ static inline bool is_odp_mr(struct mlx5_ib_mr *mr)
 	       mr->umem->is_odp;
 }
 
+static inline bool is_dmabuf_mr(struct mlx5_ib_mr *mr)
+{
+	return IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) && mr->umem &&
+	       mr->umem->is_dmabuf;
+}
+
 struct mlx5_ib_mw {
 	struct ib_mw		ibmw;
 	struct mlx5_core_mkey	mmkey;
@@ -1174,6 +1181,10 @@ int mlx5_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				  u64 virt_addr, int access_flags,
 				  struct ib_udata *udata);
+struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset,
+					 u64 length, u64 virt_addr,
+					 int fd, int access_flags,
+					 struct ib_udata *udata);
 int mlx5_ib_advise_mr(struct ib_pd *pd,
 		      enum ib_uverbs_advise_mr_advice advice,
 		      u32 flags,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index b261797..3bc412b 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2013-2015, Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2020, Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -36,6 +37,8 @@
 #include <linux/debugfs.h>
 #include <linux/export.h>
 #include <linux/delay.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-resv.h>
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 #include <rdma/ib_verbs.h>
@@ -1113,6 +1116,8 @@ int mlx5_ib_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 		dma_sync_single_for_cpu(ddev, dma, size, DMA_TO_DEVICE);
 		if (mr->umem->is_odp) {
 			mlx5_odp_populate_xlt(xlt, idx, npages, mr, flags);
+		} else if (mr->umem->is_dmabuf && (flags & MLX5_IB_UPD_XLT_ZAP)) {
+			memset(xlt, 0, size);
 		} else {
 			__mlx5_ib_populate_pas(dev, mr->umem, page_shift, idx,
 					       npages, xlt,
@@ -1462,6 +1467,111 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	return ERR_PTR(err);
 }
 
+static void mlx5_ib_dmabuf_invalidate_cb(struct dma_buf_attachment *attach)
+{
+	struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
+	struct mlx5_ib_mr *mr = umem_dmabuf->device_context;
+
+	mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT, MLX5_IB_UPD_XLT_ZAP);
+	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
+}
+
+static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
+	.allow_peer2peer = 1,
+	.move_notify = mlx5_ib_dmabuf_invalidate_cb,
+};
+
+struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset,
+					 u64 length, u64 virt_addr,
+					 int fd, int access_flags,
+					 struct ib_udata *udata)
+{
+	struct mlx5_ib_dev *dev = to_mdev(pd->device);
+	struct mlx5_ib_mr *mr = NULL;
+	struct ib_umem *umem;
+	struct ib_umem_dmabuf *umem_dmabuf;
+	int npages;
+	int order;
+	int err;
+
+	if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	mlx5_ib_dbg(dev,
+		    "offset 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x\n",
+		    offset, virt_addr, length, fd, access_flags);
+
+	if (!mlx5_ib_can_load_pas_with_umr(dev, length))
+		return ERR_PTR(-EINVAL);
+
+	umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags,
+				  &mlx5_ib_dmabuf_attach_ops);
+	if (IS_ERR(umem)) {
+		mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
+		return ERR_CAST(umem);
+	}
+
+	npages = ib_umem_num_pages(umem);
+	if (!npages) {
+		mlx5_ib_warn(dev, "avoid zero region\n");
+		ib_umem_release(umem);
+		return ERR_PTR(-EINVAL);
+	}
+
+	order = ilog2(roundup_pow_of_two(npages));
+
+	mlx5_ib_dbg(dev, "npages %d, ncont %d, order %d, page_shift %d\n",
+		    npages, npages, order, PAGE_SHIFT);
+
+	mr = alloc_mr_from_cache(pd, umem, virt_addr, length, npages,
+				 PAGE_SHIFT, order, access_flags);
+	if (IS_ERR(mr))
+		mr = NULL;
+
+	if (!mr) {
+		mutex_lock(&dev->slow_path_mutex);
+		mr = reg_create(NULL, pd, virt_addr, length, umem, npages,
+				PAGE_SHIFT, access_flags, false);
+		mutex_unlock(&dev->slow_path_mutex);
+	}
+
+	if (IS_ERR(mr)) {
+		err = PTR_ERR(mr);
+		goto error;
+	}
+
+	mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key);
+
+	mr->umem = umem;
+	set_mr_fields(dev, mr, npages, length, access_flags);
+
+	umem_dmabuf = to_ib_umem_dmabuf(umem);
+	umem_dmabuf->device_context = mr;
+
+	err = mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT,
+				 MLX5_IB_UPD_XLT_ENABLE | MLX5_IB_UPD_XLT_ZAP);
+
+	if (err) {
+		dereg_mr(dev, mr);
+		return ERR_PTR(err);
+	}
+
+	init_waitqueue_head(&mr->q_deferred_work);
+	atomic_set(&mr->num_deferred_work, 0);
+	err = xa_err(xa_store(&dev->odp_mkeys,
+			      mlx5_base_mkey(mr->mmkey.key), &mr->mmkey,
+			      GFP_KERNEL));
+	if (err) {
+		dereg_mr(dev, mr);
+		return ERR_PTR(err);
+	}
+
+	return &mr->ibmr;
+error:
+	ib_umem_release(umem);
+	return ERR_PTR(err);
+}
+
 /**
  * mlx5_mr_cache_invalidate - Fence all DMA on the MR
  * @mr: The MR to fence
@@ -1536,7 +1646,7 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
 	if (!mr->umem)
 		return -EINVAL;
 
-	if (is_odp_mr(mr))
+	if (is_odp_mr(mr) || is_dmabuf_mr(mr))
 		return -EOPNOTSUPP;
 
 	if (flags & IB_MR_REREG_TRANS) {
@@ -1695,7 +1805,7 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 	struct ib_umem *umem = mr->umem;
 
 	/* Stop all DMA */
-	if (is_odp_mr(mr))
+	if (is_odp_mr(mr) || is_dmabuf_mr(mr))
 		mlx5_ib_fence_odp_mr(mr);
 	else
 		clean_mr(dev, mr);
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5c853ec..7bc863b 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -33,6 +33,8 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 #include <linux/kernel.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-resv.h>
 
 #include "mlx5_ib.h"
 #include "cmd.h"
@@ -204,6 +206,16 @@ static void dma_fence_odp_mr(struct mlx5_ib_mr *mr)
 	}
 }
 
+static void dma_fence_dmabuf_mr(struct mlx5_ib_mr *mr)
+{
+	mlx5_mr_cache_invalidate(mr);
+
+	if (!mr->cache_ent) {
+		mlx5_core_destroy_mkey(mr->dev->mdev, &mr->mmkey);
+		WARN_ON(mr->descs);
+	}
+}
+
 /*
  * This must be called after the mr has been removed from implicit_children
  * and the SRCU synchronized.  NOTE: The MR does not necessarily have to be
@@ -661,6 +673,9 @@ void mlx5_ib_fence_odp_mr(struct mlx5_ib_mr *mr)
 
 	wait_event(mr->q_deferred_work, !atomic_read(&mr->num_deferred_work));
 
+	if (is_dmabuf_mr(mr))
+		return dma_fence_dmabuf_mr(mr);
+
 	dma_fence_odp_mr(mr);
 }
 
@@ -801,6 +816,52 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
  * Returns:
  *  -EFAULT: The io_virt->bcnt is not within the MR, it covers pages that are
  *           not accessible, or the MR is no longer valid.
+ *  -EAGAIN: The operation should be retried
+ *
+ *  >0: Number of pages mapped
+ */
+static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_umem *umem,
+			       u64 io_virt, size_t bcnt, u32 *bytes_mapped,
+			       u32 flags)
+{
+	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(umem);
+	u64 user_va;
+	u64 end;
+	int npages;
+	int err;
+
+	if (unlikely(io_virt < mr->mmkey.iova))
+		return -EFAULT;
+	if (check_add_overflow(io_virt - mr->mmkey.iova,
+			       (u64)umem->address, &user_va))
+		return -EFAULT;
+
+	/* Overflow has already been checked at the umem creation time */
+	end = umem->address + umem->length;
+	if (unlikely(user_va >= end || end  - user_va < bcnt))
+		return -EFAULT;
+
+	dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
+	err = ib_umem_dmabuf_map_pages(umem_dmabuf);
+	if (!err)
+		err = mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT, 0);
+	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
+
+	if (err)
+		return err;
+
+	if (bytes_mapped)
+		*bytes_mapped += bcnt;
+
+	npages = (ALIGN(user_va + bcnt, PAGE_SIZE) -
+		 ALIGN_DOWN(user_va, PAGE_SIZE)) >> PAGE_SHIFT;
+	return npages;
+}
+
+/*
+ * Returns:
+ *  -EFAULT: The io_virt->bcnt is not within the MR, it covers pages that are
+ *           not accessible, or the MR is no longer valid.
  *  -EAGAIN/-ENOMEM: The operation should be retried
  *
  *  -EINVAL/others: General internal malfunction
@@ -811,6 +872,10 @@ static int pagefault_mr(struct mlx5_ib_mr *mr, u64 io_virt, size_t bcnt,
 {
 	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
 
+	if (is_dmabuf_mr(mr))
+		return pagefault_dmabuf_mr(mr, mr->umem, io_virt, bcnt,
+					   bytes_mapped, flags);
+
 	lockdep_assert_held(&mr->dev->odp_srcu);
 	if (unlikely(io_virt < mr->mmkey.iova))
 		return -EFAULT;
@@ -1747,7 +1812,6 @@ static void destroy_prefetch_work(struct prefetch_mr_work *work)
 {
 	struct mlx5_ib_dev *dev = to_mdev(pd->device);
 	struct mlx5_core_mkey *mmkey;
-	struct ib_umem_odp *odp;
 	struct mlx5_ib_mr *mr;
 
 	lockdep_assert_held(&dev->odp_srcu);
@@ -1761,11 +1825,9 @@ static void destroy_prefetch_work(struct prefetch_mr_work *work)
 	if (mr->ibmr.pd != pd)
 		return NULL;
 
-	odp = to_ib_umem_odp(mr->umem);
-
 	/* prefetch with write-access must be supported by the MR */
 	if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE &&
-	    !odp->umem.writable)
+	    !mr->umem->writable)
 		return NULL;
 
 	return mr;
-- 
1.8.3.1
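
As a reading aid for pagefault_dmabuf_mr() in the patch above, the range
check and the page-count arithmetic can be modelled in plain userspace C.
This is only a sketch under stated assumptions: the demo_* names are
hypothetical, a 4 KiB page size is assumed, and the faulting address is
passed in directly instead of being derived from io_virt and mmkey.iova
as the driver does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed 4 KiB pages; the kernel takes these from the architecture. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((uint64_t)1 << DEMO_PAGE_SHIFT)

/* Round down to a page boundary (page size is a power of two). */
static uint64_t demo_align_down(uint64_t x)
{
	return x & ~(DEMO_PAGE_SIZE - 1);
}

/* Round up to a page boundary. */
static uint64_t demo_align_up(uint64_t x)
{
	return demo_align_down(x + DEMO_PAGE_SIZE - 1);
}

/*
 * Model of the checks in pagefault_dmabuf_mr(): the region covers
 * [start, start + length) and the fault covers [va, va + bcnt).
 * Returns the number of pages the fault touches, or -EFAULT if the
 * fault is not fully inside the region.
 */
static long demo_fault_npages(uint64_t start, uint64_t length,
			      uint64_t va, uint64_t bcnt)
{
	uint64_t end;

	/* The patch relies on this having been checked at umem creation. */
	assert(length <= UINT64_MAX - start);
	end = start + length;

	if (va < start || va >= end || end - va < bcnt)
		return -EFAULT;

	return (long)((demo_align_up(va + bcnt) - demo_align_down(va)) >>
		      DEMO_PAGE_SHIFT);
}
```

Note that a fault of only 0x20 bytes starting at 0x1ff0 still counts as two
pages, because it straddles a page boundary.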



* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 16:39 ` [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region Jianxin Xiong
@ 2020-10-23 16:49   ` Daniel Vetter
  2020-10-23 18:09     ` Xiong, Jianxin
  2020-10-23 18:20     ` Jason Gunthorpe
  2020-10-27 20:00   ` Jason Gunthorpe
  1 sibling, 2 replies; 25+ messages in thread
From: Daniel Vetter @ 2020-10-23 16:49 UTC (permalink / raw)
  To: Jianxin Xiong
  Cc: linux-rdma, dri-devel, Leon Romanovsky, Jason Gunthorpe,
	Doug Ledford, Daniel Vetter, Christian Koenig

On Fri, Oct 23, 2020 at 09:39:58AM -0700, Jianxin Xiong wrote:
> Dma-buf is a standard cross-driver buffer sharing mechanism that can be
> used to support peer-to-peer access from RDMA devices.
> 
> Device memory exported via dma-buf is associated with a file descriptor.
> This is passed to the user space as a property associated with the
> buffer allocation. When the buffer is registered as a memory region,
> the file descriptor is passed to the RDMA driver along with other
> parameters.
> 
> Implement the common code for importing dma-buf object and mapping
> dma-buf pages.
> 
> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
> Reviewed-by: Sean Hefty <sean.hefty@intel.com>
> Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Acked-by: Christian Koenig <christian.koenig@amd.com>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/infiniband/core/Makefile      |   2 +-
>  drivers/infiniband/core/umem.c        |   4 +
>  drivers/infiniband/core/umem_dmabuf.c | 197 ++++++++++++++++++++++++++++++++++
>  drivers/infiniband/core/umem_dmabuf.h |  11 ++
>  include/rdma/ib_umem.h                |  35 +++++-
>  5 files changed, 247 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> 
> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> index ccf2670..8ab4eea 100644
> --- a/drivers/infiniband/core/Makefile
> +++ b/drivers/infiniband/core/Makefile
> @@ -40,5 +40,5 @@ ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
>  				uverbs_std_types_srq.o \
>  				uverbs_std_types_wq.o \
>  				uverbs_std_types_qp.o
> -ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
> +ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o umem_dmabuf.o
>  ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index e9fecbd..2c45525 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2005 Topspin Communications.  All rights reserved.
>   * Copyright (c) 2005 Cisco Systems.  All rights reserved.
>   * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -43,6 +44,7 @@
>  #include <rdma/ib_umem_odp.h>
>  
>  #include "uverbs.h"
> +#include "umem_dmabuf.h"
>  
>  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>  {
> @@ -269,6 +271,8 @@ void ib_umem_release(struct ib_umem *umem)
>  {
>  	if (!umem)
>  		return;
> +	if (umem->is_dmabuf)
> +		return ib_umem_dmabuf_release(to_ib_umem_dmabuf(umem));
>  	if (umem->is_odp)
>  		return ib_umem_odp_release(to_ib_umem_odp(umem));
>  
> diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> new file mode 100644
> index 0000000..66b234d
> --- /dev/null
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -0,0 +1,197 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +/*
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> + */
> +
> +#include <linux/dma-buf.h>
> +#include <linux/dma-resv.h>
> +#include <linux/dma-mapping.h>
> +
> +#include "uverbs.h"
> +#include "umem_dmabuf.h"
> +
> +/*
> + * Generate a new dma sg list from a sub range of an existing dma sg list.
> + * Both the input and output have their entries page aligned.
> + */
> +static int ib_umem_dmabuf_sgt_slice(struct sg_table *sgt, u64 offset,
> +				    u64 length, struct sg_table *new_sgt)
> +{
> +	struct scatterlist *sg, *new_sg;
> +	u64 start, end, off, addr, len;
> +	unsigned int new_nents;
> +	int err;
> +	int i;
> +
> +	start = ALIGN_DOWN(offset, PAGE_SIZE);
> +	end = ALIGN(offset + length, PAGE_SIZE);
> +
> +	offset = start;
> +	length = end - start;
> +	new_nents = 0;
> +	for_each_sgtable_dma_sg(sgt, sg, i) {
> +		len = sg_dma_len(sg);
> +		off = min(len, offset);
> +		len -= off;
> +		len = min(len, length);
> +		if (len)
> +			new_nents++;
> +		length -= len;
> +		offset -= off;
> +	}
> +
> +	err = sg_alloc_table(new_sgt, new_nents, GFP_KERNEL);
> +	if (err)
> +		return err;
> +
> +	offset = start;
> +	length = end - start;
> +	new_sg = new_sgt->sgl;
> +	for_each_sgtable_dma_sg(sgt, sg, i) {
> +		addr = sg_dma_address(sg);
> +		len = sg_dma_len(sg);
> +		off = min(len, offset);
> +		addr += off;
> +		len -= off;
> +		len = min(len, length);
> +		if (len) {
> +			sg_dma_address(new_sg) = addr;
> +			sg_dma_len(new_sg) = len;
> +			new_sg = sg_next(new_sg);
> +		}
> +		length -= len;
> +		offset -= off;
> +	}
> +
> +	return 0;
> +}
> +
> +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
> +{
> +	struct sg_table *sgt;
> +	struct dma_fence *fence;
> +	int err;
> +
> +	dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> +
> +	sgt = dma_buf_map_attachment(umem_dmabuf->attach,
> +				     DMA_BIDIRECTIONAL);
> +
> +	if (IS_ERR(sgt))
> +		return PTR_ERR(sgt);
> +
> +	err = ib_umem_dmabuf_sgt_slice(sgt, umem_dmabuf->umem.address,
> +				       umem_dmabuf->umem.length,
> +				       &umem_dmabuf->umem.sg_head);
> +	if (err) {
> +		dma_buf_unmap_attachment(umem_dmabuf->attach, sgt,
> +					 DMA_BIDIRECTIONAL);
> +		return err;
> +	}
> +
> +	umem_dmabuf->umem.nmap = umem_dmabuf->umem.sg_head.nents;
> +	umem_dmabuf->sgt = sgt;
> +
> +	/*
> +	 * Although the sg list is valid now, the content of the pages
> +	 * may be not up-to-date. Wait for the exporter to finish
> +	 * the migration.
> +	 */
> +	fence = dma_resv_get_excl(umem_dmabuf->attach->dmabuf->resv);
> +	if (fence)
> +		dma_fence_wait(fence, false);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(ib_umem_dmabuf_map_pages);
> +
> +void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf)
> +{
> +	dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> +
> +	if (!umem_dmabuf->sgt)
> +		return;
> +
> +	sg_free_table(&umem_dmabuf->umem.sg_head);
> +	dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt,
> +				 DMA_BIDIRECTIONAL);
> +	umem_dmabuf->sgt = NULL;
> +}
> +EXPORT_SYMBOL(ib_umem_dmabuf_unmap_pages);
> +
> +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> +				   unsigned long offset, size_t size,
> +				   int fd, int access,
> +				   const struct dma_buf_attach_ops *ops)
> +{
> +	struct dma_buf *dmabuf;
> +	struct ib_umem_dmabuf *umem_dmabuf;
> +	struct ib_umem *umem;
> +	unsigned long end;
> +	long ret;
> +
> +	if (check_add_overflow(offset, (unsigned long)size, &end))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (unlikely(PAGE_ALIGN(end) < PAGE_SIZE))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (unlikely(!ops || !ops->move_notify))
> +		return ERR_PTR(-EINVAL);
> +
> +#ifdef CONFIG_DMA_VIRT_OPS
> +	if (device->dma_device->dma_ops == &dma_virt_ops)
> +		return ERR_PTR(-EINVAL);
> +#endif

Maybe I'm confused, but should we have this check in dma_buf_attach, or at
least in dma_buf_dynamic_attach? The p2pdma functions use that too, and I
can't imagine how zerocopy should work (which is like the entire point of
dma-buf) when we have dma_virt_ops.

A similar problem exists for swiotlb bounce buffers, not sure how that's
solved.
-Daniel

> +
> +	umem_dmabuf = kzalloc(sizeof(*umem_dmabuf), GFP_KERNEL);
> +	if (!umem_dmabuf)
> +		return ERR_PTR(-ENOMEM);
> +
> +	umem = &umem_dmabuf->umem;
> +	umem->ibdev = device;
> +	umem->length = size;
> +	umem->address = offset;
> +	umem->iova = offset;
> +	umem->writable = ib_access_writable(access);
> +	umem->is_dmabuf = 1;
> +
> +	dmabuf = dma_buf_get(fd);
> +	if (IS_ERR(dmabuf)) {
> +		ret = PTR_ERR(dmabuf);
> +		goto out_free_umem;
> +	}
> +
> +	umem_dmabuf->attach = dma_buf_dynamic_attach(
> +					dmabuf,
> +					device->dma_device,
> +					ops,
> +					umem_dmabuf);
> +	if (IS_ERR(umem_dmabuf->attach)) {
> +		ret = PTR_ERR(umem_dmabuf->attach);
> +		goto out_release_dmabuf;
> +	}
> +
> +	return umem;
> +
> +out_release_dmabuf:
> +	dma_buf_put(dmabuf);
> +
> +out_free_umem:
> +	kfree(umem_dmabuf);
> +	return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(ib_umem_dmabuf_get);
> +
> +void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf)
> +{
> +	struct dma_buf *dmabuf = umem_dmabuf->attach->dmabuf;
> +
> +	dma_resv_lock(dmabuf->resv, NULL);
> +	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
> +	dma_resv_unlock(dmabuf->resv);
> +
> +	dma_buf_detach(dmabuf, umem_dmabuf->attach);
> +	dma_buf_put(dmabuf);
> +	kfree(umem_dmabuf);
> +}
> diff --git a/drivers/infiniband/core/umem_dmabuf.h b/drivers/infiniband/core/umem_dmabuf.h
> new file mode 100644
> index 0000000..13acf55
> --- /dev/null
> +++ b/drivers/infiniband/core/umem_dmabuf.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +/*
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> + */
> +
> +#ifndef UMEM_DMABUF_H
> +#define UMEM_DMABUF_H
> +
> +void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf);
> +
> +#endif /* UMEM_DMABUF_H */
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 7059750..73a7b19 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -1,6 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
>  /*
>   * Copyright (c) 2007 Cisco Systems.  All rights reserved.
> + * Copyright (c) 2020 Intel Corporation.  All rights reserved.
>   */
>  
>  #ifndef IB_UMEM_H
> @@ -13,6 +14,7 @@
>  
>  struct ib_ucontext;
>  struct ib_umem_odp;
> +struct dma_buf_attach_ops;
>  
>  struct ib_umem {
>  	struct ib_device       *ibdev;
> @@ -22,12 +24,25 @@ struct ib_umem {
>  	unsigned long		address;
>  	u32 writable : 1;
>  	u32 is_odp : 1;
> +	u32 is_dmabuf : 1;
>  	struct work_struct	work;
>  	struct sg_table sg_head;
>  	int             nmap;
>  	unsigned int    sg_nents;
>  };
>  
> +struct ib_umem_dmabuf {
> +	struct ib_umem umem;
> +	struct dma_buf_attachment *attach;
> +	struct sg_table *sgt;
> +	void *device_context;
> +};
> +
> +static inline struct ib_umem_dmabuf *to_ib_umem_dmabuf(struct ib_umem *umem)
> +{
> +	return container_of(umem, struct ib_umem_dmabuf, umem);
> +}
> +
>  /* Returns the offset of the umem start relative to the first page. */
>  static inline int ib_umem_offset(struct ib_umem *umem)
>  {
> @@ -79,6 +94,12 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  				     unsigned long pgsz_bitmap,
>  				     unsigned long virt);
> +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> +				   unsigned long offset, size_t size,
> +				   int fd, int access,
> +				   const struct dma_buf_attach_ops *ops);
> +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf);
> +void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf);
>  
>  #else /* CONFIG_INFINIBAND_USER_MEM */
>  
> @@ -101,7 +122,19 @@ static inline unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  {
>  	return 0;
>  }
> +static inline struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> +						 unsigned long offset,
> +						 size_t size, int fd,
> +						 int access,
> +						 struct dma_buf_attach_ops *ops)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +static inline int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
> +{
> +	return -EINVAL;
> +}
> +static inline void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf) { }
>  
>  #endif /* CONFIG_INFINIBAND_USER_MEM */
> -
>  #endif /* IB_UMEM_H */
> -- 
> 1.8.3.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
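
For readers following the quoted patch, the two-pass slicing done by
ib_umem_dmabuf_sgt_slice() can be modelled in userspace with a plain array
of segments. This is a sketch, not the kernel code: struct demo_seg and the
demo_* helpers are hypothetical stand-ins for the scatterlist API, and a
4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE ((uint64_t)4096)	/* assumed page size */

/* Stand-in for one DMA-mapped scatterlist entry. */
struct demo_seg {
	uint64_t addr;
	uint64_t len;
};

static uint64_t demo_min(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Copy the part of in[0..n) covering the page-aligned range around
 * [offset, offset + length) into out[]. With out == NULL only the
 * number of resulting segments is counted, which mirrors the first
 * loop of the patch (used there to size the new sg table).
 */
static size_t demo_slice(const struct demo_seg *in, size_t n,
			 uint64_t offset, uint64_t length,
			 struct demo_seg *out)
{
	uint64_t start = offset & ~(DEMO_PAGE_SIZE - 1);
	uint64_t end = (offset + length + DEMO_PAGE_SIZE - 1) &
		       ~(DEMO_PAGE_SIZE - 1);
	size_t count = 0;
	size_t i;

	assert(in != NULL || n == 0);

	offset = start;
	length = end - start;
	for (i = 0; i < n; i++) {
		uint64_t len = in[i].len;
		uint64_t off = demo_min(len, offset);	/* bytes to skip */

		len = demo_min(len - off, length);	/* bytes to keep */
		if (len) {
			if (out) {
				out[count].addr = in[i].addr + off;
				out[count].len = len;
			}
			count++;
		}
		offset -= off;
		length -= len;
	}
	return count;
}
```

A caller would typically invoke demo_slice() once with out == NULL to learn
how many entries to allocate, then a second time to fill them in.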


* RE: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 16:49   ` Daniel Vetter
@ 2020-10-23 18:09     ` Xiong, Jianxin
  2020-10-23 18:13       ` Daniel Vetter
  2020-10-23 18:20     ` Jason Gunthorpe
  1 sibling, 1 reply; 25+ messages in thread
From: Xiong, Jianxin @ 2020-10-23 18:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: linux-rdma, dri-devel, Leon Romanovsky, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Christian Koenig


> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Friday, October 23, 2020 9:49 AM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Leon Romanovsky <leon@kernel.org>; Jason Gunthorpe <jgg@ziepe.ca>;
> Doug Ledford <dledford@redhat.com>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
> 
> On Fri, Oct 23, 2020 at 09:39:58AM -0700, Jianxin Xiong wrote:
> > Dma-buf is a standard cross-driver buffer sharing mechanism that can
> > be used to support peer-to-peer access from RDMA devices.
> >
> > Device memory exported via dma-buf is associated with a file descriptor.
> > This is passed to the user space as a property associated with the
> > buffer allocation. When the buffer is registered as a memory region,
> > the file descriptor is passed to the RDMA driver along with other
> > parameters.
> >
> > Implement the common code for importing dma-buf object and mapping
> > dma-buf pages.
> >
> > Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
> > Reviewed-by: Sean Hefty <sean.hefty@intel.com>
> > Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> > Acked-by: Christian Koenig <christian.koenig@amd.com>
> > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > ---
> >  drivers/infiniband/core/Makefile      |   2 +-
> >  drivers/infiniband/core/umem.c        |   4 +
> >  drivers/infiniband/core/umem_dmabuf.c | 197
> > ++++++++++++++++++++++++++++++++++
> >  drivers/infiniband/core/umem_dmabuf.h |  11 ++
> >  include/rdma/ib_umem.h                |  35 +++++-
> >  5 files changed, 247 insertions(+), 2 deletions(-)  create mode
> > 100644 drivers/infiniband/core/umem_dmabuf.c
> >  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> >
> > diff --git a/drivers/infiniband/core/Makefile
> > b/drivers/infiniband/core/Makefile
> > index ccf2670..8ab4eea 100644
> > --- a/drivers/infiniband/core/Makefile
> > +++ b/drivers/infiniband/core/Makefile
> > @@ -40,5 +40,5 @@ ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
> >  				uverbs_std_types_srq.o \
> >  				uverbs_std_types_wq.o \
> >  				uverbs_std_types_qp.o
> > -ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
> > +ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o umem_dmabuf.o
> >  ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o diff
> > --git a/drivers/infiniband/core/umem.c
> > b/drivers/infiniband/core/umem.c index e9fecbd..2c45525 100644
> > --- a/drivers/infiniband/core/umem.c
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -2,6 +2,7 @@
> >   * Copyright (c) 2005 Topspin Communications.  All rights reserved.
> >   * Copyright (c) 2005 Cisco Systems.  All rights reserved.
> >   * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
> > + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the
> > GNU @@ -43,6 +44,7 @@  #include <rdma/ib_umem_odp.h>
> >
> >  #include "uverbs.h"
> > +#include "umem_dmabuf.h"
> >
> >  static void __ib_umem_release(struct ib_device *dev, struct ib_umem
> > *umem, int dirty)  { @@ -269,6 +271,8 @@ void ib_umem_release(struct
> > ib_umem *umem)  {
> >  	if (!umem)
> >  		return;
> > +	if (umem->is_dmabuf)
> > +		return ib_umem_dmabuf_release(to_ib_umem_dmabuf(umem));
> >  	if (umem->is_odp)
> >  		return ib_umem_odp_release(to_ib_umem_odp(umem));
> >
> > diff --git a/drivers/infiniband/core/umem_dmabuf.c
> > b/drivers/infiniband/core/umem_dmabuf.c
> > new file mode 100644
> > index 0000000..66b234d
> > --- /dev/null
> > +++ b/drivers/infiniband/core/umem_dmabuf.c
> > @@ -0,0 +1,197 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > +/*
> > + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> > + */
> > +
> > +#include <linux/dma-buf.h>
> > +#include <linux/dma-resv.h>
> > +#include <linux/dma-mapping.h>
> > +
> > +#include "uverbs.h"
> > +#include "umem_dmabuf.h"
> > +
> > +/*
> > + * Generate a new dma sg list from a sub range of an existing dma sg list.
> > + * Both the input and output have their entries page aligned.
> > + */
> > +static int ib_umem_dmabuf_sgt_slice(struct sg_table *sgt, u64 offset,
> > +				    u64 length, struct sg_table *new_sgt) {
> > +	struct scatterlist *sg, *new_sg;
> > +	u64 start, end, off, addr, len;
> > +	unsigned int new_nents;
> > +	int err;
> > +	int i;
> > +
> > +	start = ALIGN_DOWN(offset, PAGE_SIZE);
> > +	end = ALIGN(offset + length, PAGE_SIZE);
> > +
> > +	offset = start;
> > +	length = end - start;
> > +	new_nents = 0;
> > +	for_each_sgtable_dma_sg(sgt, sg, i) {
> > +		len = sg_dma_len(sg);
> > +		off = min(len, offset);
> > +		len -= off;
> > +		len = min(len, length);
> > +		if (len)
> > +			new_nents++;
> > +		length -= len;
> > +		offset -= off;
> > +	}
> > +
> > +	err = sg_alloc_table(new_sgt, new_nents, GFP_KERNEL);
> > +	if (err)
> > +		return err;
> > +
> > +	offset = start;
> > +	length = end - start;
> > +	new_sg = new_sgt->sgl;
> > +	for_each_sgtable_dma_sg(sgt, sg, i) {
> > +		addr = sg_dma_address(sg);
> > +		len = sg_dma_len(sg);
> > +		off = min(len, offset);
> > +		addr += off;
> > +		len -= off;
> > +		len = min(len, length);
> > +		if (len) {
> > +			sg_dma_address(new_sg) = addr;
> > +			sg_dma_len(new_sg) = len;
> > +			new_sg = sg_next(new_sg);
> > +		}
> > +		length -= len;
> > +		offset -= off;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf) {
> > +	struct sg_table *sgt;
> > +	struct dma_fence *fence;
> > +	int err;
> > +
> > +	dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> > +
> > +	sgt = dma_buf_map_attachment(umem_dmabuf->attach,
> > +				     DMA_BIDIRECTIONAL);
> > +
> > +	if (IS_ERR(sgt))
> > +		return PTR_ERR(sgt);
> > +
> > +	err = ib_umem_dmabuf_sgt_slice(sgt, umem_dmabuf->umem.address,
> > +				       umem_dmabuf->umem.length,
> > +				       &umem_dmabuf->umem.sg_head);
> > +	if (err) {
> > +		dma_buf_unmap_attachment(umem_dmabuf->attach, sgt,
> > +					 DMA_BIDIRECTIONAL);
> > +		return err;
> > +	}
> > +
> > +	umem_dmabuf->umem.nmap = umem_dmabuf->umem.sg_head.nents;
> > +	umem_dmabuf->sgt = sgt;
> > +
> > +	/*
> > +	 * Although the sg list is valid now, the content of the pages
> > +	 * may be not up-to-date. Wait for the exporter to finish
> > +	 * the migration.
> > +	 */
> > +	fence = dma_resv_get_excl(umem_dmabuf->attach->dmabuf->resv);
> > +	if (fence)
> > +		dma_fence_wait(fence, false);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(ib_umem_dmabuf_map_pages);
> > +
> > +void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf) {
> > +	dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> > +
> > +	if (!umem_dmabuf->sgt)
> > +		return;
> > +
> > +	sg_free_table(&umem_dmabuf->umem.sg_head);
> > +	dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt,
> > +				 DMA_BIDIRECTIONAL);
> > +	umem_dmabuf->sgt = NULL;
> > +}
> > +EXPORT_SYMBOL(ib_umem_dmabuf_unmap_pages);
> > +
> > +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > +				   unsigned long offset, size_t size,
> > +				   int fd, int access,
> > +				   const struct dma_buf_attach_ops *ops) {
> > +	struct dma_buf *dmabuf;
> > +	struct ib_umem_dmabuf *umem_dmabuf;
> > +	struct ib_umem *umem;
> > +	unsigned long end;
> > +	long ret;
> > +
> > +	if (check_add_overflow(offset, (unsigned long)size, &end))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	if (unlikely(PAGE_ALIGN(end) < PAGE_SIZE))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	if (unlikely(!ops || !ops->move_notify))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +#ifdef CONFIG_DMA_VIRT_OPS
> > +	if (device->dma_device->dma_ops == &dma_virt_ops)
> > +		return ERR_PTR(-EINVAL);
> > +#endif
> 
> Maybe I'm confused, but should we have this check in dma_buf_attach, or at least in dma_buf_dynamic_attach? The p2pdma functions use
> that too, and I can't imagine how zerocopy should work (which is like the entire point of
> dma-buf) when we have dma_virt_ops.
> 
> A similar problem exists for swiotlb bounce buffers, not sure how that's solved.
> -Daniel
> 

This is already checked during dma_buf_dynamic_attach(): not in the common
code, but in the exporter's attach() method. For example, the attach method
of 'amdgpu' calls pci_p2pdma_distance_many() and disables p2p if
dma_virt_ops is detected.

Here we could instead check the peer2peer flag from the returned 'attach' structure.
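
A rough userspace model of that idea is shown below. It is only a sketch
under assumptions: the demo_* types and functions are hypothetical stand-ins
for the dma-buf structures, and returning -EINVAL when p2p has been turned
off is an assumed policy, not something settled in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the importing device. */
struct demo_device {
	bool uses_dma_virt_ops;
};

/* Stand-in for the one field of the attachment that matters here. */
struct demo_attachment {
	bool peer2peer;
};

/*
 * Model of an exporter attach() callback: clear the flag when
 * peer-to-peer DMA cannot work for this importer.
 */
static void demo_exporter_attach(const struct demo_device *dev,
				 struct demo_attachment *attach)
{
	assert(dev != NULL && attach != NULL);
	if (dev->uses_dma_virt_ops)
		attach->peer2peer = false;
}

/*
 * Importer side: request p2p, let the exporter decide, then refuse
 * the import if the exporter turned p2p off.
 */
static int demo_import(const struct demo_device *dev)
{
	struct demo_attachment attach = { .peer2peer = true };

	demo_exporter_attach(dev, &attach);
	return attach.peer2peer ? 0 : -EINVAL;
}
```

With this shape the importer needs no dma_virt_ops test of its own; it only
looks at the flag the exporter left in the attachment.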
 
> > +
> > +	umem_dmabuf = kzalloc(sizeof(*umem_dmabuf), GFP_KERNEL);
> > +	if (!umem_dmabuf)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	umem = &umem_dmabuf->umem;
> > +	umem->ibdev = device;
> > +	umem->length = size;
> > +	umem->address = offset;
> > +	umem->iova = offset;
> > +	umem->writable = ib_access_writable(access);
> > +	umem->is_dmabuf = 1;
> > +
> > +	dmabuf = dma_buf_get(fd);
> > +	if (IS_ERR(dmabuf)) {
> > +		ret = PTR_ERR(dmabuf);
> > +		goto out_free_umem;
> > +	}
> > +
> > +	umem_dmabuf->attach = dma_buf_dynamic_attach(
> > +					dmabuf,
> > +					device->dma_device,
> > +					ops,
> > +					umem_dmabuf);
> > +	if (IS_ERR(umem_dmabuf->attach)) {
> > +		ret = PTR_ERR(umem_dmabuf->attach);
> > +		goto out_release_dmabuf;
> > +	}
> > +
> > +	return umem;
> > +
> > +out_release_dmabuf:
> > +	dma_buf_put(dmabuf);
> > +
> > +out_free_umem:
> > +	kfree(umem_dmabuf);
> > +	return ERR_PTR(ret);
> > +}
> > +EXPORT_SYMBOL(ib_umem_dmabuf_get);
> > +
> > +void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf)
> > +{
> > +	struct dma_buf *dmabuf = umem_dmabuf->attach->dmabuf;
> > +
> > +	dma_resv_lock(dmabuf->resv, NULL);
> > +	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
> > +	dma_resv_unlock(dmabuf->resv);
> > +
> > +	dma_buf_detach(dmabuf, umem_dmabuf->attach);
> > +	dma_buf_put(dmabuf);
> > +	kfree(umem_dmabuf);
> > +}
> > diff --git a/drivers/infiniband/core/umem_dmabuf.h b/drivers/infiniband/core/umem_dmabuf.h
> > new file mode 100644
> > index 0000000..13acf55
> > --- /dev/null
> > +++ b/drivers/infiniband/core/umem_dmabuf.h
> > @@ -0,0 +1,11 @@
> > +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> > +/*
> > + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> > + */
> > +
> > +#ifndef UMEM_DMABUF_H
> > +#define UMEM_DMABUF_H
> > +
> > +void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf);
> > +
> > +#endif /* UMEM_DMABUF_H */
> > diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> > index 7059750..73a7b19 100644
> > --- a/include/rdma/ib_umem.h
> > +++ b/include/rdma/ib_umem.h
> > @@ -1,6 +1,7 @@
> >  /* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> >  /*
> >   * Copyright (c) 2007 Cisco Systems.  All rights reserved.
> > + * Copyright (c) 2020 Intel Corporation.  All rights reserved.
> >   */
> >
> >  #ifndef IB_UMEM_H
> > @@ -13,6 +14,7 @@
> >
> >  struct ib_ucontext;
> >  struct ib_umem_odp;
> > +struct dma_buf_attach_ops;
> >
> >  struct ib_umem {
> >  	struct ib_device       *ibdev;
> > @@ -22,12 +24,25 @@ struct ib_umem {
> >  	unsigned long		address;
> >  	u32 writable : 1;
> >  	u32 is_odp : 1;
> > +	u32 is_dmabuf : 1;
> >  	struct work_struct	work;
> >  	struct sg_table sg_head;
> >  	int             nmap;
> >  	unsigned int    sg_nents;
> >  };
> >
> > +struct ib_umem_dmabuf {
> > +	struct ib_umem umem;
> > +	struct dma_buf_attachment *attach;
> > +	struct sg_table *sgt;
> > +	void *device_context;
> > +};
> > +
> > +static inline struct ib_umem_dmabuf *to_ib_umem_dmabuf(struct ib_umem *umem)
> > +{
> > +	return container_of(umem, struct ib_umem_dmabuf, umem);
> > +}
> > +
> >  /* Returns the offset of the umem start relative to the first page. */
> >  static inline int ib_umem_offset(struct ib_umem *umem)
> >  {
> > @@ -79,6 +94,12 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
> >  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> >  				     unsigned long pgsz_bitmap,
> >  				     unsigned long virt);
> > +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > +				   unsigned long offset, size_t size,
> > +				   int fd, int access,
> > +				   const struct dma_buf_attach_ops *ops);
> > +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf);
> > +void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf);
> >
> >  #else /* CONFIG_INFINIBAND_USER_MEM */
> >
> > @@ -101,7 +122,19 @@ static inline unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> >  {
> >  	return 0;
> >  }
> > +static inline struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > +						 unsigned long offset,
> > +						 size_t size, int fd,
> > +						 int access,
> > +						 struct dma_buf_attach_ops *ops)
> > +{
> > +	return ERR_PTR(-EINVAL);
> > +}
> > +static inline int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
> > +{
> > +	return -EINVAL;
> > +}
> > +static inline void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf) { }
> >
> >  #endif /* CONFIG_INFINIBAND_USER_MEM */
> > -
> >  #endif /* IB_UMEM_H */
> > --
> > 1.8.3.1
> >
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 18:09     ` Xiong, Jianxin
@ 2020-10-23 18:13       ` Daniel Vetter
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Vetter @ 2020-10-23 18:13 UTC (permalink / raw)
  To: Xiong, Jianxin
  Cc: linux-rdma, dri-devel, Leon Romanovsky, Jason Gunthorpe,
	Doug Ledford, Vetter, Daniel, Christian Koenig

On Fri, Oct 23, 2020 at 8:09 PM Xiong, Jianxin <jianxin.xiong@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Daniel Vetter <daniel@ffwll.ch>
> > Sent: Friday, October 23, 2020 9:49 AM
> > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > Cc: linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Leon Romanovsky <leon@kernel.org>; Jason Gunthorpe <jgg@ziepe.ca>;
> > Doug Ledford <dledford@redhat.com>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
> >
> > On Fri, Oct 23, 2020 at 09:39:58AM -0700, Jianxin Xiong wrote:
> > > Dma-buf is a standard cross-driver buffer sharing mechanism that can
> > > be used to support peer-to-peer access from RDMA devices.
> > >
> > > Device memory exported via dma-buf is associated with a file descriptor.
> > > This is passed to the user space as a property associated with the
> > > buffer allocation. When the buffer is registered as a memory region,
> > > the file descriptor is passed to the RDMA driver along with other
> > > parameters.
> > >
> > > Implement the common code for importing dma-buf object and mapping
> > > dma-buf pages.
> > >
> > > Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
> > > Reviewed-by: Sean Hefty <sean.hefty@intel.com>
> > > Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> > > Acked-by: Christian Koenig <christian.koenig@amd.com>
> > > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > ---
> > >  drivers/infiniband/core/Makefile      |   2 +-
> > >  drivers/infiniband/core/umem.c        |   4 +
> > >  drivers/infiniband/core/umem_dmabuf.c | 197 ++++++++++++++++++++++++++++++++++
> > >  drivers/infiniband/core/umem_dmabuf.h |  11 ++
> > >  include/rdma/ib_umem.h                |  35 +++++-
> > >  5 files changed, 247 insertions(+), 2 deletions(-)
> > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
> > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> > >
> > > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> > > index ccf2670..8ab4eea 100644
> > > --- a/drivers/infiniband/core/Makefile
> > > +++ b/drivers/infiniband/core/Makefile
> > > @@ -40,5 +40,5 @@ ib_uverbs-y :=                    uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
> > >                             uverbs_std_types_srq.o \
> > >                             uverbs_std_types_wq.o \
> > >                             uverbs_std_types_qp.o
> > > -ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
> > > +ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o umem_dmabuf.o
> > >  ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
> > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > > index e9fecbd..2c45525 100644
> > > --- a/drivers/infiniband/core/umem.c
> > > +++ b/drivers/infiniband/core/umem.c
> > > @@ -2,6 +2,7 @@
> > >   * Copyright (c) 2005 Topspin Communications.  All rights reserved.
> > >   * Copyright (c) 2005 Cisco Systems.  All rights reserved.
> > >   * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
> > > + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> > >   *
> > >   * This software is available to you under a choice of one of two
> > >   * licenses.  You may choose to be licensed under the terms of the GNU
> > > @@ -43,6 +44,7 @@
> > >  #include <rdma/ib_umem_odp.h>
> > >
> > >  #include "uverbs.h"
> > > +#include "umem_dmabuf.h"
> > >
> > >  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
> > >  {
> > > @@ -269,6 +271,8 @@ void ib_umem_release(struct ib_umem *umem)
> > >  {
> > >     if (!umem)
> > >             return;
> > > +   if (umem->is_dmabuf)
> > > +           return ib_umem_dmabuf_release(to_ib_umem_dmabuf(umem));
> > >     if (umem->is_odp)
> > >             return ib_umem_odp_release(to_ib_umem_odp(umem));
> > >
> > > diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> > > new file mode 100644
> > > index 0000000..66b234d
> > > --- /dev/null
> > > +++ b/drivers/infiniband/core/umem_dmabuf.c
> > > @@ -0,0 +1,197 @@
> > > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > > +/*
> > > + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> > > + */
> > > +
> > > +#include <linux/dma-buf.h>
> > > +#include <linux/dma-resv.h>
> > > +#include <linux/dma-mapping.h>
> > > +
> > > +#include "uverbs.h"
> > > +#include "umem_dmabuf.h"
> > > +
> > > +/*
> > > + * Generate a new dma sg list from a sub range of an existing dma sg list.
> > > + * Both the input and output have their entries page aligned.
> > > + */
> > > +static int ib_umem_dmabuf_sgt_slice(struct sg_table *sgt, u64 offset,
> > > +                               u64 length, struct sg_table *new_sgt)
> > > +{
> > > +   struct scatterlist *sg, *new_sg;
> > > +   u64 start, end, off, addr, len;
> > > +   unsigned int new_nents;
> > > +   int err;
> > > +   int i;
> > > +
> > > +   start = ALIGN_DOWN(offset, PAGE_SIZE);
> > > +   end = ALIGN(offset + length, PAGE_SIZE);
> > > +
> > > +   offset = start;
> > > +   length = end - start;
> > > +   new_nents = 0;
> > > +   for_each_sgtable_dma_sg(sgt, sg, i) {
> > > +           len = sg_dma_len(sg);
> > > +           off = min(len, offset);
> > > +           len -= off;
> > > +           len = min(len, length);
> > > +           if (len)
> > > +                   new_nents++;
> > > +           length -= len;
> > > +           offset -= off;
> > > +   }
> > > +
> > > +   err = sg_alloc_table(new_sgt, new_nents, GFP_KERNEL);
> > > +   if (err)
> > > +           return err;
> > > +
> > > +   offset = start;
> > > +   length = end - start;
> > > +   new_sg = new_sgt->sgl;
> > > +   for_each_sgtable_dma_sg(sgt, sg, i) {
> > > +           addr = sg_dma_address(sg);
> > > +           len = sg_dma_len(sg);
> > > +           off = min(len, offset);
> > > +           addr += off;
> > > +           len -= off;
> > > +           len = min(len, length);
> > > +           if (len) {
> > > +                   sg_dma_address(new_sg) = addr;
> > > +                   sg_dma_len(new_sg) = len;
> > > +                   new_sg = sg_next(new_sg);
> > > +           }
> > > +           length -= len;
> > > +           offset -= off;
> > > +   }
> > > +
> > > +   return 0;
> > > +}
> > > +
> > > +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
> > > +{
> > > +   struct sg_table *sgt;
> > > +   struct dma_fence *fence;
> > > +   int err;
> > > +
> > > +   dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> > > +
> > > +   sgt = dma_buf_map_attachment(umem_dmabuf->attach,
> > > +                                DMA_BIDIRECTIONAL);
> > > +
> > > +   if (IS_ERR(sgt))
> > > +           return PTR_ERR(sgt);
> > > +
> > > +   err = ib_umem_dmabuf_sgt_slice(sgt, umem_dmabuf->umem.address,
> > > +                                  umem_dmabuf->umem.length,
> > > +                                  &umem_dmabuf->umem.sg_head);
> > > +   if (err) {
> > > +           dma_buf_unmap_attachment(umem_dmabuf->attach, sgt,
> > > +                                    DMA_BIDIRECTIONAL);
> > > +           return err;
> > > +   }
> > > +
> > > +   umem_dmabuf->umem.nmap = umem_dmabuf->umem.sg_head.nents;
> > > +   umem_dmabuf->sgt = sgt;
> > > +
> > > +   /*
> > > +    * Although the sg list is valid now, the content of the pages
> > > +    * may be not up-to-date. Wait for the exporter to finish
> > > +    * the migration.
> > > +    */
> > > +   fence = dma_resv_get_excl(umem_dmabuf->attach->dmabuf->resv);
> > > +   if (fence)
> > > +           dma_fence_wait(fence, false);
> > > +
> > > +   return 0;
> > > +}
> > > +EXPORT_SYMBOL(ib_umem_dmabuf_map_pages);
> > > +
> > > +void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf)
> > > +{
> > > +   dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> > > +
> > > +   if (!umem_dmabuf->sgt)
> > > +           return;
> > > +
> > > +   sg_free_table(&umem_dmabuf->umem.sg_head);
> > > +   dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt,
> > > +                            DMA_BIDIRECTIONAL);
> > > +   umem_dmabuf->sgt = NULL;
> > > +}
> > > +EXPORT_SYMBOL(ib_umem_dmabuf_unmap_pages);
> > > +
> > > +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > > +                              unsigned long offset, size_t size,
> > > +                              int fd, int access,
> > > +                              const struct dma_buf_attach_ops *ops)
> > > +{
> > > +   struct dma_buf *dmabuf;
> > > +   struct ib_umem_dmabuf *umem_dmabuf;
> > > +   struct ib_umem *umem;
> > > +   unsigned long end;
> > > +   long ret;
> > > +
> > > +   if (check_add_overflow(offset, (unsigned long)size, &end))
> > > +           return ERR_PTR(-EINVAL);
> > > +
> > > +   if (unlikely(PAGE_ALIGN(end) < PAGE_SIZE))
> > > +           return ERR_PTR(-EINVAL);
> > > +
> > > +   if (unlikely(!ops || !ops->move_notify))
> > > +           return ERR_PTR(-EINVAL);
> > > +
> > > +#ifdef CONFIG_DMA_VIRT_OPS
> > > +   if (device->dma_device->dma_ops == &dma_virt_ops)
> > > +           return ERR_PTR(-EINVAL);
> > > +#endif
> >
> > Maybe I'm confused, but should we have this check in dma_buf_attach, or at
> > least in dma_buf_dynamic_attach? The p2pdma functions use that too, and I
> > can't imagine how zerocopy should work (which is like the entire point of
> > dma-buf) when we have dma_virt_ops.
> >
> > A similar problem exists for swiotlb bounce buffers, not sure how that's solved.
> > -Daniel
> >
>
> This is also checked by dma_buf_dynamic_attach(), not in the common code, but
> in the exporter's attach() method. For example, the attach method of 'amdgpu'
> calls pci_p2pdma_distance_many() and would disable p2p if dma_virt_ops is
> detected.
>
> Here we could instead check the peer2peer flag from the returned 'attach' structure.

The thing is, if you're a virtual device, there are CPU access functions
for you in dma_buf. So this should not happen, irrespective of p2p or
not. Or I'm totally missing the point here.

And in general I'd say if importers expect certain invariants for
stuff the exporter gives them, then the dma-buf layer should enforce
that. At least with debugging enabled, like we've done for the page
alignment.
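
To make the "enforce invariants under debugging" idea concrete, here is a
minimal userspace sketch of a page-alignment check over a list of DMA
segments. It is only an illustration, not the dma-buf code: 'struct dma_seg',
'SKETCH_PAGE_SIZE' and 'segs_page_aligned()' are hypothetical names, and a
fixed 4 KiB page size is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size, for illustration only. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct dma_seg {
	uint64_t addr;	/* DMA address of the segment */
	uint64_t len;	/* length of the segment in bytes */
};

/*
 * Return true only if every segment starts on a page boundary and covers
 * a whole number of pages. A debug build could warn when this fails.
 */
static bool segs_page_aligned(const struct dma_seg *segs, size_t nsegs)
{
	for (size_t i = 0; i < nsegs; i++) {
		if ((segs[i].addr | segs[i].len) & (SKETCH_PAGE_SIZE - 1))
			return false;
	}
	return true;
}
```

A check of this shape is cheap enough to run once per mapping, which is why
it fits a debug-only option rather than a per-importer test.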
-Daniel

>
> > > +
> > > +   umem_dmabuf = kzalloc(sizeof(*umem_dmabuf), GFP_KERNEL);
> > > +   if (!umem_dmabuf)
> > > +           return ERR_PTR(-ENOMEM);
> > > +
> > > +   umem = &umem_dmabuf->umem;
> > > +   umem->ibdev = device;
> > > +   umem->length = size;
> > > +   umem->address = offset;
> > > +   umem->iova = offset;
> > > +   umem->writable = ib_access_writable(access);
> > > +   umem->is_dmabuf = 1;
> > > +
> > > +   dmabuf = dma_buf_get(fd);
> > > +   if (IS_ERR(dmabuf)) {
> > > +           ret = PTR_ERR(dmabuf);
> > > +           goto out_free_umem;
> > > +   }
> > > +
> > > +   umem_dmabuf->attach = dma_buf_dynamic_attach(
> > > +                                   dmabuf,
> > > +                                   device->dma_device,
> > > +                                   ops,
> > > +                                   umem_dmabuf);
> > > +   if (IS_ERR(umem_dmabuf->attach)) {
> > > +           ret = PTR_ERR(umem_dmabuf->attach);
> > > +           goto out_release_dmabuf;
> > > +   }
> > > +
> > > +   return umem;
> > > +
> > > +out_release_dmabuf:
> > > +   dma_buf_put(dmabuf);
> > > +
> > > +out_free_umem:
> > > +   kfree(umem_dmabuf);
> > > +   return ERR_PTR(ret);
> > > +}
> > > +EXPORT_SYMBOL(ib_umem_dmabuf_get);
> > > +
> > > +void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf)
> > > +{
> > > +   struct dma_buf *dmabuf = umem_dmabuf->attach->dmabuf;
> > > +
> > > +   dma_resv_lock(dmabuf->resv, NULL);
> > > +   ib_umem_dmabuf_unmap_pages(umem_dmabuf);
> > > +   dma_resv_unlock(dmabuf->resv);
> > > +
> > > +   dma_buf_detach(dmabuf, umem_dmabuf->attach);
> > > +   dma_buf_put(dmabuf);
> > > +   kfree(umem_dmabuf);
> > > +}
> > > diff --git a/drivers/infiniband/core/umem_dmabuf.h b/drivers/infiniband/core/umem_dmabuf.h
> > > new file mode 100644
> > > index 0000000..13acf55
> > > --- /dev/null
> > > +++ b/drivers/infiniband/core/umem_dmabuf.h
> > > @@ -0,0 +1,11 @@
> > > +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> > > +/*
> > > + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> > > + */
> > > +
> > > +#ifndef UMEM_DMABUF_H
> > > +#define UMEM_DMABUF_H
> > > +
> > > +void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf);
> > > +
> > > +#endif /* UMEM_DMABUF_H */
> > > diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> > > index 7059750..73a7b19 100644
> > > --- a/include/rdma/ib_umem.h
> > > +++ b/include/rdma/ib_umem.h
> > > @@ -1,6 +1,7 @@
> > >  /* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> > >  /*
> > >   * Copyright (c) 2007 Cisco Systems.  All rights reserved.
> > > + * Copyright (c) 2020 Intel Corporation.  All rights reserved.
> > >   */
> > >
> > >  #ifndef IB_UMEM_H
> > > @@ -13,6 +14,7 @@
> > >
> > >  struct ib_ucontext;
> > >  struct ib_umem_odp;
> > > +struct dma_buf_attach_ops;
> > >
> > >  struct ib_umem {
> > >     struct ib_device       *ibdev;
> > > @@ -22,12 +24,25 @@ struct ib_umem {
> > >     unsigned long           address;
> > >     u32 writable : 1;
> > >     u32 is_odp : 1;
> > > +   u32 is_dmabuf : 1;
> > >     struct work_struct      work;
> > >     struct sg_table sg_head;
> > >     int             nmap;
> > >     unsigned int    sg_nents;
> > >  };
> > >
> > > +struct ib_umem_dmabuf {
> > > +   struct ib_umem umem;
> > > +   struct dma_buf_attachment *attach;
> > > +   struct sg_table *sgt;
> > > +   void *device_context;
> > > +};
> > > +
> > > +static inline struct ib_umem_dmabuf *to_ib_umem_dmabuf(struct ib_umem *umem)
> > > +{
> > > +   return container_of(umem, struct ib_umem_dmabuf, umem);
> > > +}
> > > +
> > >  /* Returns the offset of the umem start relative to the first page. */
> > >  static inline int ib_umem_offset(struct ib_umem *umem)
> > >  {
> > > @@ -79,6 +94,12 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
> > >  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > >                                  unsigned long pgsz_bitmap,
> > >                                  unsigned long virt);
> > > +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > > +                              unsigned long offset, size_t size,
> > > +                              int fd, int access,
> > > +                              const struct dma_buf_attach_ops *ops);
> > > +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf);
> > > +void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf);
> > >
> > >  #else /* CONFIG_INFINIBAND_USER_MEM */
> > >
> > > @@ -101,7 +122,19 @@ static inline unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > >  {
> > >     return 0;
> > >  }
> > > +static inline struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > > +                                            unsigned long offset,
> > > +                                            size_t size, int fd,
> > > +                                            int access,
> > > +                                            struct dma_buf_attach_ops *ops)
> > > +{
> > > +   return ERR_PTR(-EINVAL);
> > > +}
> > > +static inline int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
> > > +{
> > > +   return -EINVAL;
> > > +}
> > > +static inline void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf) { }
> > >
> > >  #endif /* CONFIG_INFINIBAND_USER_MEM */
> > > -
> > >  #endif /* IB_UMEM_H */
> > > --
> > > 1.8.3.1
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 16:49   ` Daniel Vetter
  2020-10-23 18:09     ` Xiong, Jianxin
@ 2020-10-23 18:20     ` Jason Gunthorpe
  2020-10-24  1:45       ` Daniel Vetter
  2020-10-24  7:48       ` Christoph Hellwig
  1 sibling, 2 replies; 25+ messages in thread
From: Jason Gunthorpe @ 2020-10-23 18:20 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jianxin Xiong, linux-rdma, dri-devel, Leon Romanovsky,
	Doug Ledford, Daniel Vetter, Christian Koenig

On Fri, Oct 23, 2020 at 06:49:11PM +0200, Daniel Vetter wrote:
> > +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > +				   unsigned long offset, size_t size,
> > +				   int fd, int access,
> > +				   const struct dma_buf_attach_ops *ops)
> > +{
> > +	struct dma_buf *dmabuf;
> > +	struct ib_umem_dmabuf *umem_dmabuf;
> > +	struct ib_umem *umem;
> > +	unsigned long end;
> > +	long ret;
> > +
> > +	if (check_add_overflow(offset, (unsigned long)size, &end))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	if (unlikely(PAGE_ALIGN(end) < PAGE_SIZE))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	if (unlikely(!ops || !ops->move_notify))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +#ifdef CONFIG_DMA_VIRT_OPS
> > +	if (device->dma_device->dma_ops == &dma_virt_ops)
> > +		return ERR_PTR(-EINVAL);
> > +#endif
> 
> Maybe I'm confused, but should we have this check in dma_buf_attach, or at
> least in dma_buf_dynamic_attach? The p2pdma functions use that too, and I
> can't imagine how zerocopy should work (which is like the entire point of
> dma-buf) when we have dma_virt_ops.

The problem is we have RDMA drivers that assume SGL's have a valid
struct page, and these hacky/wrong P2P sgls that DMABUF creates cannot
be passed into those drivers.

But maybe this is just a 'drivers are using it wrong' if they call
this function and expect struct pages..

The check in the p2p stuff was done to avoid this too, but it was on a
different flow.

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 18:20     ` Jason Gunthorpe
@ 2020-10-24  1:45       ` Daniel Vetter
  2020-11-03 17:36         ` Xiong, Jianxin
  2020-10-24  7:48       ` Christoph Hellwig
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel Vetter @ 2020-10-24  1:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jianxin Xiong, linux-rdma, dri-devel, Leon Romanovsky,
	Doug Ledford, Daniel Vetter, Christian Koenig

On Fri, Oct 23, 2020 at 8:20 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 23, 2020 at 06:49:11PM +0200, Daniel Vetter wrote:
> > > +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > > +                              unsigned long offset, size_t size,
> > > +                              int fd, int access,
> > > +                              const struct dma_buf_attach_ops *ops)
> > > +{
> > > +   struct dma_buf *dmabuf;
> > > +   struct ib_umem_dmabuf *umem_dmabuf;
> > > +   struct ib_umem *umem;
> > > +   unsigned long end;
> > > +   long ret;
> > > +
> > > +   if (check_add_overflow(offset, (unsigned long)size, &end))
> > > +           return ERR_PTR(-EINVAL);
> > > +
> > > +   if (unlikely(PAGE_ALIGN(end) < PAGE_SIZE))
> > > +           return ERR_PTR(-EINVAL);
> > > +
> > > +   if (unlikely(!ops || !ops->move_notify))
> > > +           return ERR_PTR(-EINVAL);
> > > +
> > > +#ifdef CONFIG_DMA_VIRT_OPS
> > > +   if (device->dma_device->dma_ops == &dma_virt_ops)
> > > +           return ERR_PTR(-EINVAL);
> > > +#endif
> >
> > Maybe I'm confused, but should we have this check in dma_buf_attach, or at
> > least in dma_buf_dynamic_attach? The p2pdma functions use that too, and I
> > can't imagine how zerocopy should work (which is like the entire point of
> > dma-buf) when we have dma_virt_ops.
>
> The problem is we have RDMA drivers that assume SGL's have a valid
> struct page, and these hacky/wrong P2P sgls that DMABUF creates cannot
> be passed into those drivers.
>
> But maybe this is just a 'drivers are using it wrong' if they call
> this function and expect struct pages..
>
> The check in the p2p stuff was done to avoid this too, but it was on a
> different flow.

Yeah definitely don't call dma_buf_map_attachment and expect a page
back. In practice you'll get a page back fairly often, but I don't
think we want to bake that in, maybe we eventually get to non-hacky
dma_addr_t only sgl.

What I'm wondering is whether dma_buf_attach shouldn't reject such
devices directly, instead of each importer having to do that.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 18:20     ` Jason Gunthorpe
  2020-10-24  1:45       ` Daniel Vetter
@ 2020-10-24  7:48       ` Christoph Hellwig
  2020-10-26 12:26         ` Jason Gunthorpe
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-10-24  7:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Jianxin Xiong, linux-rdma, dri-devel,
	Leon Romanovsky, Doug Ledford, Daniel Vetter, Christian Koenig

On Fri, Oct 23, 2020 at 03:20:05PM -0300, Jason Gunthorpe wrote:
> The problem is we have RDMA drivers that assume SGL's have a valid
> struct page, and these hacky/wrong P2P sgls that DMABUF creates cannot
> be passed into those drivers.

RDMA drivers do not assume scatterlists have a valid struct page;
scatterlists are defined to have a valid struct page.  Any scatterlist
without a struct page is completely buggy.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-24  7:48       ` Christoph Hellwig
@ 2020-10-26 12:26         ` Jason Gunthorpe
  2020-10-27  8:08           ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2020-10-26 12:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, Jianxin Xiong, linux-rdma, dri-devel,
	Leon Romanovsky, Doug Ledford, Daniel Vetter, Christian Koenig

On Sat, Oct 24, 2020 at 08:48:07AM +0100, Christoph Hellwig wrote:
> On Fri, Oct 23, 2020 at 03:20:05PM -0300, Jason Gunthorpe wrote:
> > The problem is we have RDMA drivers that assume SGL's have a valid
> > struct page, and these hacky/wrong P2P sgls that DMABUF creates cannot
> > be passed into those drivers.
> 
> RDMA drivers do not assume scatterlist have a valid struct page,
> scatterlists are defined to have a valid struct page.  Any scatterlist
> without a struct page is completely buggy.

It is not just having the struct page, it needs to be a CPU accessible
one for memcpy/etc. They aren't correct with the
MEMORY_DEVICE_PCI_P2PDMA SGLs either.

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-26 12:26         ` Jason Gunthorpe
@ 2020-10-27  8:08           ` Christoph Hellwig
  2020-10-27 17:32             ` Xiong, Jianxin
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-10-27  8:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Daniel Vetter, Jianxin Xiong, linux-rdma,
	dri-devel, Leon Romanovsky, Doug Ledford, Daniel Vetter,
	Christian Koenig

On Mon, Oct 26, 2020 at 09:26:37AM -0300, Jason Gunthorpe wrote:
> On Sat, Oct 24, 2020 at 08:48:07AM +0100, Christoph Hellwig wrote:
> > On Fri, Oct 23, 2020 at 03:20:05PM -0300, Jason Gunthorpe wrote:
> > > The problem is we have RDMA drivers that assume SGL's have a valid
> > > struct page, and these hacky/wrong P2P sgls that DMABUF creates cannot
> > > be passed into those drivers.
> > 
> > RDMA drivers do not assume scatterlist have a valid struct page,
> > scatterlists are defined to have a valid struct page.  Any scatterlist
> > without a struct page is completely buggy.
> 
> It is not just having the struct page, it needs to be a CPU accessible
> one for memcpy/etc. They aren't correct with the
> MEMORY_DEVICE_PCI_P2PDMA SGLs either.

Exactly.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-27  8:08           ` Christoph Hellwig
@ 2020-10-27 17:32             ` Xiong, Jianxin
  2020-10-27 19:51               ` Jason Gunthorpe
  0 siblings, 1 reply; 25+ messages in thread
From: Xiong, Jianxin @ 2020-10-27 17:32 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: Daniel Vetter, linux-rdma, dri-devel, Leon Romanovsky,
	Doug Ledford, Vetter, Daniel, Christian Koenig

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, October 27, 2020 1:08 AM
> To: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Christoph Hellwig <hch@infradead.org>; Daniel Vetter <daniel@ffwll.ch>; Xiong, Jianxin <jianxin.xiong@intel.com>;
> linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Leon Romanovsky <leon@kernel.org>; Doug Ledford <dledford@redhat.com>;
> Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
> 
> On Mon, Oct 26, 2020 at 09:26:37AM -0300, Jason Gunthorpe wrote:
> > On Sat, Oct 24, 2020 at 08:48:07AM +0100, Christoph Hellwig wrote:
> > > On Fri, Oct 23, 2020 at 03:20:05PM -0300, Jason Gunthorpe wrote:
> > > > The problem is we have RDMA drivers that assume SGL's have a valid
> > > > struct page, and these hacky/wrong P2P sgls that DMABUF creates
> > > > cannot be passed into those drivers.
> > >
> > > RDMA drivers do not assume scatterlist have a valid struct page,
> > > scatterlists are defined to have a valid struct page.  Any
> > > scatterlist without a struct page is completely buggy.
> >
> > It is not just having the struct page, it needs to be a CPU accessible
> > one for memcpy/etc. They aren't correct with the
> > MEMORY_DEVICE_PCI_P2PDMA SGLs either.
> 
> Exactly.

In the function ib_umem_dmabuf_sgt_slice() (part of this patch) we could generate
a dma address array instead of filling the scatterlist 'umem->sg_head'. The array
would be handled similarly to 'umem_odp->dma_list'. With such a change, the RDMA
drivers wouldn't see an incorrectly formed scatterlist. The check for dma_virt_ops
here wouldn't be needed either.

Would such a proposal address the concern here?
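
For illustration only, a rough userspace sketch of what such a slice could
look like: it walks a list of page-aligned DMA segments and emits one DMA
address per page for the page-aligned window covering [offset, offset +
length). This is not the patch code; 'struct dma_seg', 'SKETCH_PAGE_SIZE' and
'slice_to_dma_list()' are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed page size, for illustration only. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct dma_seg {
	uint64_t addr;	/* DMA address, expected to be page aligned */
	uint64_t len;	/* length in bytes, expected to be page aligned */
};

/*
 * Build a malloc'ed array holding one DMA address per page of the
 * page-aligned window around [offset, offset + length). The page count is
 * stored in *npages. Returns NULL if the window is empty, a segment is not
 * page aligned, the window runs past the segments, or allocation fails.
 */
static uint64_t *slice_to_dma_list(const struct dma_seg *segs, size_t nsegs,
				   uint64_t offset, uint64_t length,
				   size_t *npages)
{
	const uint64_t mask = SKETCH_PAGE_SIZE - 1;
	uint64_t start = offset & ~mask;
	uint64_t end = (offset + length + mask) & ~mask;
	uint64_t skip = start;			/* bytes still to skip */
	uint64_t remaining = end - start;	/* bytes still to emit */
	size_t total = (size_t)(remaining / SKETCH_PAGE_SIZE);
	size_t n = 0;
	uint64_t *list;

	if (!total)
		return NULL;
	list = calloc(total, sizeof(*list));
	if (!list)
		return NULL;

	for (size_t i = 0; i < nsegs && remaining; i++) {
		uint64_t len = segs[i].len;
		uint64_t off, addr;

		if ((segs[i].addr | len) & mask)
			goto fail;	/* input violates the alignment rule */

		off = len < skip ? len : skip;
		addr = segs[i].addr + off;
		skip -= off;
		len -= off;
		if (len > remaining)
			len = remaining;

		/* Emit one entry per page of the usable part of the segment. */
		for (uint64_t done = 0; done < len; done += SKETCH_PAGE_SIZE)
			list[n++] = addr + done;
		remaining -= len;
	}

	if (remaining)
		goto fail;	/* window extends past the last segment */

	*npages = n;
	return list;

fail:
	free(list);
	return NULL;
}
```

The skip/clamp arithmetic mirrors the two loops in ib_umem_dmabuf_sgt_slice();
the only difference in this sketch is that the output is a flat per-page array
rather than a new scatterlist.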

-Jianxin
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-27 17:32             ` Xiong, Jianxin
@ 2020-10-27 19:51               ` Jason Gunthorpe
  0 siblings, 0 replies; 25+ messages in thread
From: Jason Gunthorpe @ 2020-10-27 19:51 UTC (permalink / raw)
  To: Xiong, Jianxin
  Cc: Christoph Hellwig, Daniel Vetter, linux-rdma, dri-devel,
	Leon Romanovsky, Doug Ledford, Vetter, Daniel, Christian Koenig

On Tue, Oct 27, 2020 at 05:32:26PM +0000, Xiong, Jianxin wrote:
> > On Mon, Oct 26, 2020 at 09:26:37AM -0300, Jason Gunthorpe wrote:
> > > On Sat, Oct 24, 2020 at 08:48:07AM +0100, Christoph Hellwig wrote:
> > > > On Fri, Oct 23, 2020 at 03:20:05PM -0300, Jason Gunthorpe wrote:
> > > > > The problem is we have RDMA drivers that assume SGL's have a valid
> > > > > struct page, and these hacky/wrong P2P sgls that DMABUF creates
> > > > > cannot be passed into those drivers.
> > > >
> > > > RDMA drivers do not assume scatterlist have a valid struct page,
> > > > scatterlists are defined to have a valid struct page.  Any
> > > > scatterlist without a struct page is completely buggy.
> > >
> > > It is not just having the struct page, it needs to be a CPU accessible
> > > one for memcpy/etc. They aren't correct with the
> > > MEMORY_DEVICE_PCI_P2PDMA SGLs either.
> > 
> > Exactly.
> 
> In the function ib_umem_dmabuf_sgt_slice() (part of this patch) we could generate
> a dma address array instead of filling the scatterlist
> 'umem->sg_head'. 

I don't think we should change the format, the SGL comes out of the
dmabuf and all the umem code is able to process it like that. Adding
another data structure just for this one case is going to be trouble.

Ultimately I'd like to see some 'dma only sgl', CH has been talking
about this for a while. When we have that settled just change
everything connected to umem.

I think in the meantime the answer for this patch is drivers just
can't call these APIs and use the struct page side, just like they
can't call the DMA buf API and use the struct page side.

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-23 16:39 ` [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region Jianxin Xiong
  2020-10-23 16:49   ` Daniel Vetter
@ 2020-10-27 20:00   ` Jason Gunthorpe
  2020-10-27 20:11     ` Xiong, Jianxin
  1 sibling, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2020-10-27 20:00 UTC (permalink / raw)
  To: Jianxin Xiong
  Cc: linux-rdma, dri-devel, Doug Ledford, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Daniel Vetter

On Fri, Oct 23, 2020 at 09:39:58AM -0700, Jianxin Xiong wrote:
> +/*
> + * Generate a new dma sg list from a sub range of an existing dma sg list.
> + * Both the input and output have their entries page aligned.
> + */
> +static int ib_umem_dmabuf_sgt_slice(struct sg_table *sgt, u64 offset,
> +				    u64 length, struct sg_table *new_sgt)
> +{
> +	struct scatterlist *sg, *new_sg;
> +	u64 start, end, off, addr, len;
> +	unsigned int new_nents;
> +	int err;
> +	int i;
> +
> +	start = ALIGN_DOWN(offset, PAGE_SIZE);
> +	end = ALIGN(offset + length, PAGE_SIZE);
> +
> +	offset = start;
> +	length = end - start;
> +	new_nents = 0;
> +	for_each_sgtable_dma_sg(sgt, sg, i) {
> +		len = sg_dma_len(sg);
> +		off = min(len, offset);
> +		len -= off;
> +		len = min(len, length);
> +		if (len)
> +			new_nents++;
> +		length -= len;
> +		offset -= off;
> +	}
> +
> +	err = sg_alloc_table(new_sgt, new_nents, GFP_KERNEL);
> +	if (err)
> +		return err;

I would really rather not allocate an entirely new table just to take
a slice of an existing SGT. Ideally the exporter API from DMA buf would
prepare the SGL slice properly instead of always giving a whole
buffer.

Alternatively, making some small edit to rdma_umem_for_each_dma_block()
and ib_umem_find_best_pgsz() would let it slice the SGL at runtime.
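
As a rough illustration of what slicing at runtime could look like, here is a
userspace sketch under stated assumptions: 'struct dma_seg' and the slice_iter
helpers are hypothetical names, not kernel APIs. The iterator walks the existing
list and yields only the pieces inside [offset, offset + length), using the same
skip/clamp arithmetic as the counting loop in the patch, without allocating a
second table:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct dma_seg {
	uint64_t addr;
	uint64_t len;
};

/* Iterator state for walking a sub-range of an existing segment list. */
struct slice_iter {
	const struct dma_seg *segs;
	size_t nsegs;
	size_t idx;		/* next segment to look at */
	uint64_t skip;		/* bytes still to skip before the slice starts */
	uint64_t remaining;	/* bytes of the slice not yet returned */
};

static void slice_iter_init(struct slice_iter *it, const struct dma_seg *segs,
			    size_t nsegs, uint64_t offset, uint64_t length)
{
	it->segs = segs;
	it->nsegs = nsegs;
	it->idx = 0;
	it->skip = offset;
	it->remaining = length;
}

/* Returns 1 and fills *addr/*len with the next piece, or 0 when done. */
static int slice_iter_next(struct slice_iter *it, uint64_t *addr, uint64_t *len)
{
	assert(it && addr && len);

	while (it->idx < it->nsegs && it->remaining) {
		const struct dma_seg *s = &it->segs[it->idx++];
		uint64_t off = s->len < it->skip ? s->len : it->skip;
		uint64_t n = s->len - off;

		it->skip -= off;
		if (n > it->remaining)
			n = it->remaining;
		if (!n)
			continue;	/* segment lies entirely before the slice */

		*addr = s->addr + off;
		*len = n;
		it->remaining -= n;
		return 1;
	}
	return 0;
}
```

The sketch does not round the range to page boundaries; the patch does that
before walking the list, and a real version would need to as well.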

You need to rebase on top of this series:

https://patchwork.kernel.org/project/linux-rdma/list/?series=370437

Which makes mlx5 use those new APIs

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region
  2020-10-23 16:40 ` [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region Jianxin Xiong
@ 2020-10-27 20:08   ` Jason Gunthorpe
  2020-10-27 20:33     ` Xiong, Jianxin
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2020-10-27 20:08 UTC (permalink / raw)
  To: Jianxin Xiong
  Cc: linux-rdma, dri-devel, Doug Ledford, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Daniel Vetter

On Fri, Oct 23, 2020 at 09:40:01AM -0700, Jianxin Xiong wrote:

> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index b261797..3bc412b 100644
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1,5 +1,6 @@
>  /*
>   * Copyright (c) 2013-2015, Mellanox Technologies. All rights reserved.
> + * Copyright (c) 2020, Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -36,6 +37,8 @@
>  #include <linux/debugfs.h>
>  #include <linux/export.h>
>  #include <linux/delay.h>
> +#include <linux/dma-buf.h>
> +#include <linux/dma-resv.h>
>  #include <rdma/ib_umem.h>
>  #include <rdma/ib_umem_odp.h>
>  #include <rdma/ib_verbs.h>
> @@ -1113,6 +1116,8 @@ int mlx5_ib_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
>  		dma_sync_single_for_cpu(ddev, dma, size, DMA_TO_DEVICE);
>  		if (mr->umem->is_odp) {
>  			mlx5_odp_populate_xlt(xlt, idx, npages, mr, flags);
> +		} else if (mr->umem->is_dmabuf && (flags & MLX5_IB_UPD_XLT_ZAP)) {
> +			memset(xlt, 0, size);
>  		} else {
>  			__mlx5_ib_populate_pas(dev, mr->umem, page_shift, idx,
>  					       npages, xlt,
> @@ -1462,6 +1467,111 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  	return ERR_PTR(err);
>  }
>  
> +static void mlx5_ib_dmabuf_invalidate_cb(struct dma_buf_attachment *attach)
> +{
> +	struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
> +	struct mlx5_ib_mr *mr = umem_dmabuf->device_context;
> +
> +	mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT, MLX5_IB_UPD_XLT_ZAP);
> +	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
> +}
> +
> +static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
> +	.allow_peer2peer = 1,
> +	.move_notify = mlx5_ib_dmabuf_invalidate_cb,
> +};
> +
> +struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset,
> +					 u64 length, u64 virt_addr,
> +					 int fd, int access_flags,
> +					 struct ib_udata *udata)
> +{
> +	struct mlx5_ib_dev *dev = to_mdev(pd->device);
> +	struct mlx5_ib_mr *mr = NULL;
> +	struct ib_umem *umem;
> +	struct ib_umem_dmabuf *umem_dmabuf;
> +	int npages;
> +	int order;
> +	int err;
> +
> +	if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	mlx5_ib_dbg(dev,
> +		    "offset 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x\n",
> +		    offset, virt_addr, length, fd, access_flags);
> +
> +	if (!mlx5_ib_can_load_pas_with_umr(dev, length))
> +		return ERR_PTR(-EINVAL);
> +
> +	umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags,
> +				  &mlx5_ib_dmabuf_attach_ops);
> +	if (IS_ERR(umem)) {
> +		mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
> +		return ERR_PTR(PTR_ERR(umem));
> +	}
> +
> +	npages = ib_umem_num_pages(umem);
> +	if (!npages) {

ib_umem_get should reject invalid umems like this

> +		mlx5_ib_warn(dev, "avoid zero region\n");
> +		ib_umem_release(umem);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	order = ilog2(roundup_pow_of_two(npages));

Must always call ib_umem_find_best_pgsz(), specify PAGE_SIZE as the
argument for this scenario
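
For readers unfamiliar with what a best-page-size search checks, here is a
simplified userspace model. It is an assumption-laden sketch, not the kernel's
ib_umem_find_best_pgsz() implementation: 'struct dma_seg' and best_pgsz_model()
are hypothetical names. It picks the largest size in a bitmap of supported page
sizes for which every DMA address is congruent to its iova and every internal
segment boundary is aligned:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct dma_seg {
	uint64_t addr;
	uint64_t len;
};

/*
 * Return the largest page size from pgsz_bitmap (one bit per supported
 * size) usable for mapping segs[] at I/O virtual address iova, or 0 if
 * none of the supported sizes fits.
 */
static uint64_t best_pgsz_model(const struct dma_seg *segs, size_t nsegs,
				uint64_t pgsz_bitmap, uint64_t iova)
{
	uint64_t mask = 0;
	uint64_t va = iova;
	uint64_t candidates;

	assert(nsegs > 0);

	for (size_t i = 0; i < nsegs; i++) {
		/* Bits where DMA address and iova differ limit the page size. */
		mask |= segs[i].addr ^ va;
		va += segs[i].len;
		/* Internal boundaries must start a new page in iova space. */
		if (i != nsegs - 1)
			mask |= va;
	}

	if (mask) {
		uint64_t lowbit = mask & (~mask + 1);

		/* Keep only page sizes no larger than the lowest set bit. */
		candidates = pgsz_bitmap & ((lowbit << 1) - 1);
	} else {
		candidates = pgsz_bitmap;
	}

	/* Clear low bits until only the highest candidate remains. */
	while (candidates & (candidates - 1))
		candidates &= candidates - 1;
	return candidates;
}
```

With only PAGE_SIZE in the bitmap, as suggested above, the model returns either
PAGE_SIZE or 0, so it acts as a validity check rather than an optimization.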

> +	mlx5_ib_dbg(dev, "npages %d, ncont %d, order %d, page_shift %d\n",
> +		    npages, npages, order, PAGE_SHIFT);
> +
> +	mr = alloc_mr_from_cache(pd, umem, virt_addr, length, npages,
> +				 PAGE_SHIFT, order, access_flags);
> +	if (IS_ERR(mr))
> +		mr = NULL;
> +
> +	if (!mr) {
> +		mutex_lock(&dev->slow_path_mutex);
> +		mr = reg_create(NULL, pd, virt_addr, length, umem, npages,
> +				PAGE_SHIFT, access_flags, false);
> +		mutex_unlock(&dev->slow_path_mutex);
> +	}

Rebase on the mlx5 series just posted and use their version of this
code sequence, this is just not quite right


> +	err = mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT,
> +				 MLX5_IB_UPD_XLT_ENABLE | MLX5_IB_UPD_XLT_ZAP);
> +
> +	if (err) {
> +		dereg_mr(dev, mr);
> +		return ERR_PTR(err);
> +	}

The current mlx5 code preloads the buffer with the right data, zapping
is fairly expensive, mapping and loading is the same cost

> @@ -1536,7 +1646,7 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
>  	if (!mr->umem)
>  		return -EINVAL;
>  
> -	if (is_odp_mr(mr))
> +	if (is_odp_mr(mr) || is_dmabuf_mr(mr))
>  		return -EOPNOTSUPP;
>  
>  	if (flags & IB_MR_REREG_TRANS) {
> @@ -1695,7 +1805,7 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
>  	struct ib_umem *umem = mr->umem;
>  
>  	/* Stop all DMA */
> -	if (is_odp_mr(mr))
> +	if (is_odp_mr(mr) || is_dmabuf_mr(mr))
>  		mlx5_ib_fence_odp_mr(mr);

Make a dma buf specific function

I have another series touching this area too, but I think 

> @@ -801,6 +816,52 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
>   * Returns:
>   *  -EFAULT: The io_virt->bcnt is not within the MR, it covers pages that are
>   *           not accessible, or the MR is no longer valid.
> + *  -EAGAIN: The operation should be retried
> + *
> + *  >0: Number of pages mapped
> + */
> +static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_umem *umem,
> +			       u64 io_virt, size_t bcnt, u32 *bytes_mapped,
> +			       u32 flags)
> +{
> +	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(umem);
> +	u64 user_va;
> +	u64 end;
> +	int npages;
> +	int err;
> +
> +	if (unlikely(io_virt < mr->mmkey.iova))
> +		return -EFAULT;
> +	if (check_add_overflow(io_virt - mr->mmkey.iova,
> +			       (u64)umem->address, &user_va))
> +		return -EFAULT;
> +	/* Overflow has already been checked at the umem creation time */
> +	end = umem->address + umem->length;
> +	if (unlikely(user_va >= end || end  - user_va < bcnt))
> +		return -EFAULT;

Why duplicate this sequence? Caller does it

> @@ -811,6 +872,10 @@ static int pagefault_mr(struct mlx5_ib_mr *mr, u64 io_virt, size_t bcnt,
>  {
>  	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
>  
> +	if (is_dmabuf_mr(mr))
> +		return pagefault_dmabuf_mr(mr, mr->umem, io_virt, bcnt,
> +					   bytes_mapped, flags);
> +
>  	lockdep_assert_held(&mr->dev->odp_srcu);
>  	if (unlikely(io_virt < mr->mmkey.iova))
>  		return -EFAULT;
> @@ -1747,7 +1812,6 @@ static void destroy_prefetch_work(struct prefetch_mr_work *work)
>  {
>  	struct mlx5_ib_dev *dev = to_mdev(pd->device);
>  	struct mlx5_core_mkey *mmkey;
> -	struct ib_umem_odp *odp;
>  	struct mlx5_ib_mr *mr;
>  
>  	lockdep_assert_held(&dev->odp_srcu);
> @@ -1761,11 +1825,9 @@ static void destroy_prefetch_work(struct prefetch_mr_work *work)
>  	if (mr->ibmr.pd != pd)
>  		return NULL;
>  
> -	odp = to_ib_umem_odp(mr->umem);
> -
>  	/* prefetch with write-access must be supported by the MR */
>  	if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE &&
> -	    !odp->umem.writable)
> +	    !mr->umem->writable)

??

This does look basically right though. I think a little more polishing
and it can be merged. It does need to go after the mlx5 MR series
though..

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-27 20:00   ` Jason Gunthorpe
@ 2020-10-27 20:11     ` Xiong, Jianxin
  0 siblings, 0 replies; 25+ messages in thread
From: Xiong, Jianxin @ 2020-10-27 20:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, dri-devel, Doug Ledford, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Vetter, Daniel

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, October 27, 2020 1:00 PM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford <dledford@redhat.com>; Leon Romanovsky
> <leon@kernel.org>; Sumit Semwal <sumit.semwal@linaro.org>; Christian Koenig <christian.koenig@amd.com>; Vetter, Daniel
> <daniel.vetter@intel.com>
> Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
> 
> On Fri, Oct 23, 2020 at 09:39:58AM -0700, Jianxin Xiong wrote:
> > +/*
> > + * Generate a new dma sg list from a sub range of an existing dma sg list.
> > + * Both the input and output have their entries page aligned.
> > + */
> > +static int ib_umem_dmabuf_sgt_slice(struct sg_table *sgt, u64 offset,
> > +				    u64 length, struct sg_table *new_sgt)
> > +{
> > +	struct scatterlist *sg, *new_sg;
> > +	u64 start, end, off, addr, len;
> > +	unsigned int new_nents;
> > +	int err;
> > +	int i;
> > +
> > +	start = ALIGN_DOWN(offset, PAGE_SIZE);
> > +	end = ALIGN(offset + length, PAGE_SIZE);
> > +
> > +	offset = start;
> > +	length = end - start;
> > +	new_nents = 0;
> > +	for_each_sgtable_dma_sg(sgt, sg, i) {
> > +		len = sg_dma_len(sg);
> > +		off = min(len, offset);
> > +		len -= off;
> > +		len = min(len, length);
> > +		if (len)
> > +			new_nents++;
> > +		length -= len;
> > +		offset -= off;
> > +	}
> > +
> > +	err = sg_alloc_table(new_sgt, new_nents, GFP_KERNEL);
> > +	if (err)
> > +		return err;
> 
> I would really rather not allocate an entirely new table just to take a slice of an existing SGT. Ideally the exporter API from DMA buf would
> prepare the SGL slice properly instead of always giving a whole buffer.
> 
> Alternatively making some small edit to rdma_umem_for_each_dma_block() and ib_umem_find_best_pgsz() would let it slice the SGL at
> runtime
> 
> You need to rebase on top of this series:
> 
> https://patchwork.kernel.org/project/linux-rdma/list/?series=370437
> 
> Which makes mlx5 use those new APIs
> 
> Jason

Thanks. Will rebase and work on the runtime slicing.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region
  2020-10-27 20:08   ` Jason Gunthorpe
@ 2020-10-27 20:33     ` Xiong, Jianxin
  2020-10-28 16:35       ` Jason Gunthorpe
  0 siblings, 1 reply; 25+ messages in thread
From: Xiong, Jianxin @ 2020-10-27 20:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, dri-devel, Doug Ledford, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Vetter, Daniel

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, October 27, 2020 1:08 PM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford <dledford@redhat.com>; Leon Romanovsky
> <leon@kernel.org>; Sumit Semwal <sumit.semwal@linaro.org>; Christian Koenig <christian.koenig@amd.com>; Vetter, Daniel
> <daniel.vetter@intel.com>
> Subject: Re: [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region
> 
> On Fri, Oct 23, 2020 at 09:40:01AM -0700, Jianxin Xiong wrote:
> 
> > diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> > index b261797..3bc412b 100644
> > +++ b/drivers/infiniband/hw/mlx5/mr.c
> > @@ -1,5 +1,6 @@
> >  /*
> >   * Copyright (c) 2013-2015, Mellanox Technologies. All rights reserved.
> > + * Copyright (c) 2020, Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -36,6 +37,8 @@
> >  #include <linux/debugfs.h>
> >  #include <linux/export.h>
> >  #include <linux/delay.h>
> > +#include <linux/dma-buf.h>
> > +#include <linux/dma-resv.h>
> >  #include <rdma/ib_umem.h>
> >  #include <rdma/ib_umem_odp.h>
> >  #include <rdma/ib_verbs.h>
> > @@ -1113,6 +1116,8 @@ int mlx5_ib_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
> >  		dma_sync_single_for_cpu(ddev, dma, size, DMA_TO_DEVICE);
> >  		if (mr->umem->is_odp) {
> >  			mlx5_odp_populate_xlt(xlt, idx, npages, mr, flags);
> > +		} else if (mr->umem->is_dmabuf && (flags & MLX5_IB_UPD_XLT_ZAP)) {
> > +			memset(xlt, 0, size);
> >  		} else {
> >  			__mlx5_ib_populate_pas(dev, mr->umem, page_shift, idx,
> >  					       npages, xlt,
> > @@ -1462,6 +1467,111 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> >  	return ERR_PTR(err);
> >  }
> >
> > +static void mlx5_ib_dmabuf_invalidate_cb(struct dma_buf_attachment *attach)
> > +{
> > +	struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
> > +	struct mlx5_ib_mr *mr = umem_dmabuf->device_context;
> > +
> > +	mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT, MLX5_IB_UPD_XLT_ZAP);
> > +	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
> > +}
> > +
> > +static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
> > +	.allow_peer2peer = 1,
> > +	.move_notify = mlx5_ib_dmabuf_invalidate_cb,
> > +};
> > +
> > +struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset,
> > +					 u64 length, u64 virt_addr,
> > +					 int fd, int access_flags,
> > +					 struct ib_udata *udata)
> > +{
> > +	struct mlx5_ib_dev *dev = to_mdev(pd->device);
> > +	struct mlx5_ib_mr *mr = NULL;
> > +	struct ib_umem *umem;
> > +	struct ib_umem_dmabuf *umem_dmabuf;
> > +	int npages;
> > +	int order;
> > +	int err;
> > +
> > +	if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM))
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +
> > +	mlx5_ib_dbg(dev,
> > +		    "offset 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x\n",
> > +		    offset, virt_addr, length, fd, access_flags);
> > +
> > +	if (!mlx5_ib_can_load_pas_with_umr(dev, length))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags,
> > +				  &mlx5_ib_dmabuf_attach_ops);
> > +	if (IS_ERR(umem)) {
> > +		mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
> > +		return ERR_PTR(PTR_ERR(umem));
> > +	}
> > +
> > +	npages = ib_umem_num_pages(umem);
> > +	if (!npages) {
> 
> ib_umem_get should reject invalid umems like this

Will move the check to ib_umem_dmabuf_get().

> 
> > +		mlx5_ib_warn(dev, "avoid zero region\n");
> > +		ib_umem_release(umem);
> > +		return ERR_PTR(-EINVAL);
> > +	}
> > +
> > +	order = ilog2(roundup_pow_of_two(npages));
> 
> Must always call ib_umem_find_best_pgsz(), specify PAGE_SIZE as the argument for this scenario

Will fix.

> 
> > +	mlx5_ib_dbg(dev, "npages %d, ncont %d, order %d, page_shift %d\n",
> > +		    npages, npages, order, PAGE_SHIFT);
> > +
> > +	mr = alloc_mr_from_cache(pd, umem, virt_addr, length, npages,
> > +				 PAGE_SHIFT, order, access_flags);
> > +	if (IS_ERR(mr))
> > +		mr = NULL;
> > +
> > +	if (!mr) {
> > +		mutex_lock(&dev->slow_path_mutex);
> > +		mr = reg_create(NULL, pd, virt_addr, length, umem, npages,
> > +				PAGE_SHIFT, access_flags, false);
> > +		mutex_unlock(&dev->slow_path_mutex);
> > +	}
> 
> Rebase on the mlx5 series just posted and use their version of this code sequence, this is just not quite right

Will do.

> 
> 
> > +	err = mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT,
> > +				 MLX5_IB_UPD_XLT_ENABLE | MLX5_IB_UPD_XLT_ZAP);
> > +
> > +	if (err) {
> > +		dereg_mr(dev, mr);
> > +		return ERR_PTR(err);
> > +	}
> 
> The current mlx5 code preloads the buffer with the right data, zapping is fairly expensive, mapping and loading is the same cost

Could probably do the same here. Will double check.

> 
> > @@ -1536,7 +1646,7 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
> >  	if (!mr->umem)
> >  		return -EINVAL;
> >
> > -	if (is_odp_mr(mr))
> > +	if (is_odp_mr(mr) || is_dmabuf_mr(mr))
> >  		return -EOPNOTSUPP;
> >
> >  	if (flags & IB_MR_REREG_TRANS) {
> > @@ -1695,7 +1805,7 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
> >  	struct ib_umem *umem = mr->umem;
> >
> >  	/* Stop all DMA */
> > -	if (is_odp_mr(mr))
> > +	if (is_odp_mr(mr) || is_dmabuf_mr(mr))
> >  		mlx5_ib_fence_odp_mr(mr);
> 
> Make a dma buf specific function

Ok.

> 
> I have another series touching this area too, but I think
> 
> > @@ -801,6 +816,52 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
> >   * Returns:
> >   *  -EFAULT: The io_virt->bcnt is not within the MR, it covers pages that are
> >   *           not accessible, or the MR is no longer valid.
> > + *  -EAGAIN: The operation should be retried
> > + *
> > + *  >0: Number of pages mapped
> > + */
> > +static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_umem *umem,
> > +			       u64 io_virt, size_t bcnt, u32 *bytes_mapped,
> > +			       u32 flags)
> > +{
> > +	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(umem);
> > +	u64 user_va;
> > +	u64 end;
> > +	int npages;
> > +	int err;
> > +
> > +	if (unlikely(io_virt < mr->mmkey.iova))
> > +		return -EFAULT;
> > +	if (check_add_overflow(io_virt - mr->mmkey.iova,
> > +			       (u64)umem->address, &user_va))
> > +		return -EFAULT;
> > +	/* Overflow has already been checked at the umem creation time */
> > +	end = umem->address + umem->length;
> > +	if (unlikely(user_va >= end || end  - user_va < bcnt))
> > +		return -EFAULT;
> 
> Why duplicate this sequence? Caller does it

The sequence in the caller is for umem_odp only.

> 
> > @@ -811,6 +872,10 @@ static int pagefault_mr(struct mlx5_ib_mr *mr, u64 io_virt, size_t bcnt,
> >  {
> >  	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
> >
> > +	if (is_dmabuf_mr(mr))
> > +		return pagefault_dmabuf_mr(mr, mr->umem, io_virt, bcnt,
> > +					   bytes_mapped, flags);
> > +
> >  	lockdep_assert_held(&mr->dev->odp_srcu);
> >  	if (unlikely(io_virt < mr->mmkey.iova))
> >  		return -EFAULT;
> > @@ -1747,7 +1812,6 @@ static void destroy_prefetch_work(struct prefetch_mr_work *work)
> >  {
> >  	struct mlx5_ib_dev *dev = to_mdev(pd->device);
> >  	struct mlx5_core_mkey *mmkey;
> > -	struct ib_umem_odp *odp;
> >  	struct mlx5_ib_mr *mr;
> >
> >  	lockdep_assert_held(&dev->odp_srcu);
> > @@ -1761,11 +1825,9 @@ static void destroy_prefetch_work(struct prefetch_mr_work *work)
> >  	if (mr->ibmr.pd != pd)
> >  		return NULL;
> >
> > -	odp = to_ib_umem_odp(mr->umem);
> > -
> >  	/* prefetch with write-access must be supported by the MR */
> >  	if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE &&
> > -	    !odp->umem.writable)
> > +	    !mr->umem->writable)
> 
> ??
There is no need to use umem_odp here, mr->umem is the same as &odp->umem. 
This change makes the code work for both umem_odp and umem_dmabuf.
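
A small userspace sketch of why the two expressions are equivalent, using
hypothetical cut-down types rather than the real ib_umem structures: the ODP
object embeds the generic umem, and the MR points at that embedded member, so
reading 'writable' through the MR needs no ODP-specific conversion:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down versions of the real structures. */
struct umem {
	int writable;
	int is_odp;
};

struct umem_odp {
	struct umem umem;	/* generic part embedded in the ODP object */
	int page_shift;
};

struct mr {
	struct umem *umem;	/* points at the generic part */
};

/* container_of-style conversion, valid only for ODP umems. */
static struct umem_odp *to_umem_odp(struct umem *umem)
{
	assert(umem->is_odp);
	return (struct umem_odp *)((char *)umem -
				   offsetof(struct umem_odp, umem));
}

/* Works for any umem type because it only touches the generic part. */
static int mr_is_writable(const struct mr *mr)
{
	return mr->umem->writable;
}
```

Going through to_umem_odp() and back to '.umem' lands on the same object, which
is why the generic accessor is enough for the prefetch check.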

> 
> This does look basically right though. I think a little more polishing and it can be merged. It does need to go after the mlx5 MR series
> though..
> 
> Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region
  2020-10-27 20:33     ` Xiong, Jianxin
@ 2020-10-28 16:35       ` Jason Gunthorpe
  2020-10-28 17:29         ` Xiong, Jianxin
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2020-10-28 16:35 UTC (permalink / raw)
  To: Xiong, Jianxin
  Cc: linux-rdma, dri-devel, Doug Ledford, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Vetter, Daniel

On Tue, Oct 27, 2020 at 08:33:52PM +0000, Xiong, Jianxin wrote:
> > > @@ -801,6 +816,52 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
> > >   * Returns:
> > >   *  -EFAULT: The io_virt->bcnt is not within the MR, it covers pages that are
> > >   *           not accessible, or the MR is no longer valid.
> > > + *  -EAGAIN: The operation should be retried
> > > + *
> > > + *  >0: Number of pages mapped
> > > + */
> > > +static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_umem *umem,
> > > +			       u64 io_virt, size_t bcnt, u32 *bytes_mapped,
> > > +			       u32 flags)
> > > +{
> > > +	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(umem);
> > > +	u64 user_va;
> > > +	u64 end;
> > > +	int npages;
> > > +	int err;
> > > +
> > > +	if (unlikely(io_virt < mr->mmkey.iova))
> > > +		return -EFAULT;
> > > +	if (check_add_overflow(io_virt - mr->mmkey.iova,
> > > +			       (u64)umem->address, &user_va))
> > > +		return -EFAULT;
> > > +	/* Overflow has already been checked at the umem creation time */
> > > +	end = umem->address + umem->length;
> > > +	if (unlikely(user_va >= end || end  - user_va < bcnt))
> > > +		return -EFAULT;
> > 
> > Why duplicate this sequence? Caller does it
> 
> The sequence in the caller is for umem_odp only.

Nothing about umem_odp in this code though??

> > >  	/* prefetch with write-access must be supported by the MR */
> > >  	if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE &&
> > > -	    !odp->umem.writable)
> > > +	    !mr->umem->writable)
> > 
> > ??

> There is no need to use umem_odp here, mr->umem is the same as &odp->umem. 
> This change makes the code work for both umem_odp and umem_dmabuf.

Ok

Can you please also think about how to test this? I very much prefer
to see new pyverbs tests for new APIs. 

Distros are running the rdma-core test suite, if you want this to work
widely we need a public test for it.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region
  2020-10-28 16:35       ` Jason Gunthorpe
@ 2020-10-28 17:29         ` Xiong, Jianxin
  0 siblings, 0 replies; 25+ messages in thread
From: Xiong, Jianxin @ 2020-10-28 17:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, dri-devel, Doug Ledford, Leon Romanovsky,
	Sumit Semwal, Christian Koenig, Vetter, Daniel

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Wednesday, October 28, 2020 9:36 AM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford <dledford@redhat.com>; Leon Romanovsky
> <leon@kernel.org>; Sumit Semwal <sumit.semwal@linaro.org>; Christian Koenig <christian.koenig@amd.com>; Vetter, Daniel
> <daniel.vetter@intel.com>
> Subject: Re: [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region
> 
> On Tue, Oct 27, 2020 at 08:33:52PM +0000, Xiong, Jianxin wrote:
> > > > @@ -801,6 +816,52 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
> > > >   * Returns:
> > > >   *  -EFAULT: The io_virt->bcnt is not within the MR, it covers pages that are
> > > >   *           not accessible, or the MR is no longer valid.
> > > > + *  -EAGAIN: The operation should be retried
> > > > + *
> > > > + *  >0: Number of pages mapped
> > > > + */
> > > > +static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_umem *umem,
> > > > +			       u64 io_virt, size_t bcnt, u32 *bytes_mapped,
> > > > +			       u32 flags)
> > > > +{
> > > > +	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(umem);
> > > > +	u64 user_va;
> > > > +	u64 end;
> > > > +	int npages;
> > > > +	int err;
> > > > +
> > > > +	if (unlikely(io_virt < mr->mmkey.iova))
> > > > +		return -EFAULT;
> > > > +	if (check_add_overflow(io_virt - mr->mmkey.iova,
> > > > +			       (u64)umem->address, &user_va))
> > > > +		return -EFAULT;
> > > > +	/* Overflow has already been checked at the umem creation time */
> > > > +	end = umem->address + umem->length;
> > > > +	if (unlikely(user_va >= end || end  - user_va < bcnt))
> > > > +		return -EFAULT;
> > >
> > > Why duplicate this sequence? Caller does it
> >
> > The sequence in the caller is for umem_odp only.
> 
> Nothing about umem_odp in this code though??

The code in the caller uses ib_umem_end(odp) instead of the 'end' here, but we
can consolidate that with some minor changes.
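
One possible shape for a consolidated check, as a userspace sketch only: the
helper name fault_range_to_user_va() and its flat parameter list are
hypothetical, and __builtin_add_overflow() (a GCC/Clang builtin) stands in for
the kernel's check_add_overflow(). It takes the umem start and length directly,
so the same code could serve both the ODP and the dma-buf paths:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Translate a faulting I/O virtual range into a user VA and verify that
 * [user_va, user_va + bcnt) lies inside the umem. Returns 0 on success
 * or -EFAULT if the range falls outside the MR.
 */
static int fault_range_to_user_va(uint64_t mr_iova, uint64_t umem_addr,
				  uint64_t umem_len, uint64_t io_virt,
				  uint64_t bcnt, uint64_t *user_va)
{
	uint64_t end;

	assert(user_va);

	if (io_virt < mr_iova)
		return -EFAULT;
	if (__builtin_add_overflow(io_virt - mr_iova, umem_addr, user_va))
		return -EFAULT;

	/* Assumed not to overflow: checked when the umem was created. */
	end = umem_addr + umem_len;
	if (*user_va >= end || end - *user_va < bcnt)
		return -EFAULT;
	return 0;
}
```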
  
> 
> > > >  	/* prefetch with write-access must be supported by the MR */
> > > >  	if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE &&
> > > > -	    !odp->umem.writable)
> > > > +	    !mr->umem->writable)
> > >
> > > ??
> 
> > There is no need to use umem_odp here, mr->umem is the same as &odp->umem.
> > > This change makes the code work for both umem_odp and umem_dmabuf.
> 
> Ok
> 
> Can you please also think about how to test this? I very much prefer to see new pyverbs tests for new APIs.
> 
> Distros are running the rdma-core test suite, if you want this to work widely we need a public test for it.
> 

Will look into that.

> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-10-24  1:45       ` Daniel Vetter
@ 2020-11-03 17:36         ` Xiong, Jianxin
  2020-11-03 20:43           ` Daniel Vetter
  0 siblings, 1 reply; 25+ messages in thread
From: Xiong, Jianxin @ 2020-11-03 17:36 UTC (permalink / raw)
  To: Daniel Vetter, Jason Gunthorpe
  Cc: linux-rdma, dri-devel, Leon Romanovsky, Doug Ledford, Vetter,
	Daniel, Christian Koenig


> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Friday, October 23, 2020 6:45 PM
> To: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; linux-rdma <linux-rdma@vger.kernel.org>; dri-devel <dri-devel@lists.freedesktop.org>; Leon
> Romanovsky <leon@kernel.org>; Doug Ledford <dledford@redhat.com>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig
> <christian.koenig@amd.com>
> Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
> 
> > > > +
> > > > +#ifdef CONFIG_DMA_VIRT_OPS
> > > > +   if (device->dma_device->dma_ops == &dma_virt_ops)
> > > > +           return ERR_PTR(-EINVAL);
> > > > +#endif
> > >
> > > Maybe I'm confused, but should we have this check in dma_buf_attach,
> > > or at least in dma_buf_dynamic_attach? The p2pdma functions use that
> > > too, and I can't imagine how zerocopy should work (which is like the
> > > entire point of dma-buf) when we have dma_virt_ops.
> >
> > The problem is we have RDMA drivers that assume SGL's have a valid
> > struct page, and these hacky/wrong P2P sgls that DMABUF creates cannot
> > be passed into those drivers.
> >
> > But maybe this is just a 'drivers are using it wrong' if they call
> > this function and expect struct pages..
> >
> > The check in the p2p stuff was done to avoid this too, but it was on a
> > different flow.
> 
> Yeah definitely don't call dma_buf_map_attachment and expect a page back. In practice you'll get a page back fairly often, but I don't think
> we want to bake that in, maybe we eventually get to non-hacky dma_addr_t only sgl.
> 
> What I'm wondering is whether dma_buf_attach shouldn't reject such devices directly, instead of each importer having to do that.

Coming back to this to see whether consensus can be reached on who should do the
check. My thinking is that it could be overly restrictive for dma_buf_attach to
always reject dma_virt_ops. According to the dma-buf documentation, the backing
storage is moved to system memory upon mapping unless p2p is requested and can be
supported by the exporter. The sg list for system memory would have struct pages
present.

-Jianxin
  
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-11-03 17:36         ` Xiong, Jianxin
@ 2020-11-03 20:43           ` Daniel Vetter
  2020-11-04  0:01             ` Xiong, Jianxin
  2020-11-05 15:02             ` Jason Gunthorpe
  0 siblings, 2 replies; 25+ messages in thread
From: Daniel Vetter @ 2020-11-03 20:43 UTC (permalink / raw)
  To: Xiong, Jianxin
  Cc: Jason Gunthorpe, linux-rdma, dri-devel, Leon Romanovsky,
	Doug Ledford, Vetter, Daniel, Christian Koenig

On Tue, Nov 3, 2020 at 6:36 PM Xiong, Jianxin <jianxin.xiong@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Daniel Vetter <daniel@ffwll.ch>
> > Sent: Friday, October 23, 2020 6:45 PM
> > To: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; linux-rdma <linux-rdma@vger.kernel.org>; dri-devel <dri-devel@lists.freedesktop.org>; Leon
> > Romanovsky <leon@kernel.org>; Doug Ledford <dledford@redhat.com>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig
> > <christian.koenig@amd.com>
> > Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
> >
> > > > > +
> > > > > +#ifdef CONFIG_DMA_VIRT_OPS
> > > > > +   if (device->dma_device->dma_ops == &dma_virt_ops)
> > > > > +           return ERR_PTR(-EINVAL);
> > > > > +#endif
> > > >
> > > > Maybe I'm confused, but should we have this check in dma_buf_attach,
> > > > or at least in dma_buf_dynamic_attach? The p2pdma functions use that
> > > > too, and I can't imagine how zerocopy should work (which is like the
> > > > entire point of
> > > > dma-buf) when we have dma_virt_ops.
> > >
> > > The problem is we have RDMA drivers that assume SGL's have a valid
> > > struct page, and these hacky/wrong P2P sgls that DMABUF creates cannot
> > > be passed into those drivers.
> > >
> > > But maybe this is just a 'drivers are using it wrong' if they call
> > > this function and expect struct pages..
> > >
> > > The check in the p2p stuff was done to avoid this too, but it was on a
> > > different flow.
> >
> > Yeah definitely don't call dma_buf_map_attachment and expect a page back. In practice you'll get a page back fairly often, but I don't think
> > we want to bake that in, maybe we eventually get to non-hacky dma_addr_t only sgl.
> >
> > What I'm wondering is whether dma_buf_attach shouldn't reject such devices directly, instead of each importer having to do that.
>
> Come back here to see if consensus can be reached on who should do the check. My
> thinking is that it could be over restrictive for dma_buf_attach to always reject
> dma_virt_ops. According to dma-buf documentation the back storage would be
> moved to system area upon mapping unless p2p is requested and can be supported
> by the exporter. The sg_list for system memory would have struct page present.

So I'm not clear on what this dma_virt_ops stuff is for, but if it's
an entirely virtual device with cpu access, then you shouldn't do
dma_buf_map_attachment, and then peek at the struct page in the sgl.
Instead you need to use dma_buf_vmap/vunmap and
dma_buf_begin/end_cpu_access. Otherwise the coherency management is all
potentially busted. Also note that cpu access from the kernel to
dma-buf is a rather niche feature (only some usb device drivers use
it), so expect warts.
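
To make the ordering concrete, here is a minimal user-space sketch of the
CPU-access path described above: map the buffer, bracket the access with
begin/end, then unmap. Everything prefixed with mock_ is a hypothetical
stand-in written for illustration; these are not the kernel's
dma_buf_vmap() or dma_buf_begin_cpu_access() signatures, and the real
calls may fail in ways this sketch does not model.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dma-buf; not the kernel's struct dma_buf. */
struct mock_dmabuf {
	unsigned char backing[64];	/* pretend exporter storage */
	int vmapped;			/* 1 while a CPU mapping exists */
	int cpu_access_open;		/* 1 between begin and end of CPU access */
};

/* Stand-in for vmap: hand back a CPU pointer to the storage. */
static void *mock_vmap(struct mock_dmabuf *buf)
{
	buf->vmapped = 1;
	return buf->backing;
}

static void mock_vunmap(struct mock_dmabuf *buf)
{
	buf->vmapped = 0;
}

/* Stand-in for begin_cpu_access: refuse a nested bracket. */
static int mock_begin_cpu_access(struct mock_dmabuf *buf)
{
	if (buf->cpu_access_open)
		return -1;
	buf->cpu_access_open = 1;
	return 0;
}

static void mock_end_cpu_access(struct mock_dmabuf *buf)
{
	buf->cpu_access_open = 0;
}

/*
 * What a software ("virtual") device would do instead of mapping the
 * attachment and peeking at struct page: copy through a CPU mapping,
 * with the access bracketed so the exporter can manage coherency.
 */
static int mock_cpu_read(struct mock_dmabuf *buf, size_t off,
			 void *dst, size_t len)
{
	unsigned char *va;

	/* Reject ranges that fall outside the buffer. */
	if (off > sizeof(buf->backing) || len > sizeof(buf->backing) - off)
		return -1;

	va = mock_vmap(buf);
	if (mock_begin_cpu_access(buf)) {
		mock_vunmap(buf);
		return -1;
	}

	memcpy(dst, va + off, len);

	mock_end_cpu_access(buf);
	mock_vunmap(buf);
	return 0;
}
```

The point of the sketch is only the bracketing: the copy happens strictly
between begin and end, and the mapping is released on every exit path.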

If this is the case, then I think dma_buf_attach should check for this
and reject such imports, since that's an importer bug.

If it's otoh something rdma specific, then I guess rdma checking for this is ok.

As a third option, if it's something about the connectivity between
the importing and exporting device, then this should be checked in the
->attach callback the exporter can provide, like the p2p check. The
idea here is that device-specific remapping units (mostly found
on SoCs, and not something like a standard iommu managed by the
dma-api), some of which can even do funny stuff like rotation of 2d
images, can be accessed by some devices but not others. And only the exporter is
aware of these restrictions.
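
As an illustration of this third option, the following user-space sketch
shows a core attach path that defers to an optional exporter callback, so
that only the exporter decides which importers can reach its remapping
unit. All names here (mock_importer, mock_exporter_ops, mock_core_attach,
the reaches_remap_unit flag) are hypothetical, and the callback signature
is deliberately simplified; the kernel's ->attach callback takes different
arguments.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical importer description; the flag is an assumption for this sketch. */
struct mock_importer {
	int reaches_remap_unit;	/* 1 if wired to the exporter's remapping unit */
};

/* Simplified exporter ops: attach is optional, as in the discussion above. */
struct mock_exporter_ops {
	int (*attach)(const struct mock_importer *imp);
};

/* Exporter-side policy: only the exporter knows the connectivity. */
static int mock_exporter_attach(const struct mock_importer *imp)
{
	return imp->reaches_remap_unit ? 0 : -ENODEV;
}

/* Core path: run the exporter's check if one was provided. */
static int mock_core_attach(const struct mock_exporter_ops *ops,
			    const struct mock_importer *imp)
{
	if (!ops || !imp)
		return -EINVAL;
	if (ops->attach)
		return ops->attach(imp);
	return 0;	/* no exporter restriction registered */
}
```

With this shape the core stays ignorant of device topology, which matches
the observation that only the exporter is aware of these restrictions.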

Now I dunno which case this one here is.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* RE: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-11-03 20:43           ` Daniel Vetter
@ 2020-11-04  0:01             ` Xiong, Jianxin
  2020-11-05 15:02             ` Jason Gunthorpe
  1 sibling, 0 replies; 25+ messages in thread
From: Xiong, Jianxin @ 2020-11-04  0:01 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jason Gunthorpe, linux-rdma, dri-devel, Leon Romanovsky,
	Doug Ledford, Vetter, Daniel, Christian Koenig

> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Tuesday, November 03, 2020 12:43 PM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>; linux-rdma <linux-rdma@vger.kernel.org>; dri-devel <dri-devel@lists.freedesktop.org>; Leon
> Romanovsky <leon@kernel.org>; Doug Ledford <dledford@redhat.com>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig
> <christian.koenig@amd.com>
> Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
> 
> On Tue, Nov 3, 2020 at 6:36 PM Xiong, Jianxin <jianxin.xiong@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Daniel Vetter <daniel@ffwll.ch>
> > > Sent: Friday, October 23, 2020 6:45 PM
> > > To: Jason Gunthorpe <jgg@ziepe.ca>
> > > Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; linux-rdma
> > > <linux-rdma@vger.kernel.org>; dri-devel
> > > <dri-devel@lists.freedesktop.org>; Leon Romanovsky
> > > <leon@kernel.org>; Doug Ledford <dledford@redhat.com>; Vetter,
> > > Daniel <daniel.vetter@intel.com>; Christian Koenig
> > > <christian.koenig@amd.com>
> > > Subject: Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as
> > > user memory region
> > >
> > > > > > +
> > > > > > +#ifdef CONFIG_DMA_VIRT_OPS
> > > > > > +   if (device->dma_device->dma_ops == &dma_virt_ops)
> > > > > > +           return ERR_PTR(-EINVAL);
> > > > > > +#endif
> > > > >
> > > > > Maybe I'm confused, but should we have this check in
> > > > > dma_buf_attach, or at least in dma_buf_dynamic_attach? The
> > > > > p2pdma functions use that too, and I can't imagine how zerocopy
> > > > > should work (which is like the entire point of
> > > > > dma-buf) when we have dma_virt_ops.
> > > >
> > > > The problem is we have RDMA drivers that assume SGL's have a valid
> > > > struct page, and these hacky/wrong P2P sgls that DMABUF creates
> > > > cannot be passed into those drivers.
> > > >
> > > > But maybe this is just a 'drivers are using it wrong' if they call
> > > > this function and expect struct pages..
> > > >
> > > > The check in the p2p stuff was done to avoid this too, but it was
> > > > on a different flow.
> > >
> > > > Yeah definitely don't call dma_buf_map_attachment and expect a page
> > > > back. In practice you'll get a page back fairly often, but I don't think we want to bake that in, maybe we eventually get to non-hacky
> > > > dma_addr_t only sgl.
> > >
> > > What I'm wondering is whether dma_buf_attach shouldn't reject such devices directly, instead of each importer having to do that.
> >
> > Come back here to see if consensus can be reached on who should do the
> > check. My thinking is that it could be over restrictive for
> > dma_buf_attach to always reject dma_virt_ops. According to dma-buf
> > documentation the back storage would be moved to system area upon
> > mapping unless p2p is requested and can be supported by the exporter. The sg_list for system memory would have struct page present.
> 
> So I'm not clear on what this dma_virt_ops stuff is for, but if it's an entirely virtual device with cpu access, then you shouldn't do
> dma_buf_map_attachment, and then peek at the struct page in the sgl.

This is the key, thanks for pointing that out. I was assuming the importer could
use the struct page if it exists. 

> Instead you need to use dma_buf_vmap/vunmap and dma_buf_begin/end_cpu_access. Otherwise the coherency management is all potentially
> busted. Also note that cpu access from the kernel to dma-buf is a rather niche feature (only some usb device drivers use it), so expect warts.
> 

dma_virt_ops is a set of dma mapping operations that map page/sgl to virtual addresses
instead of dma addresses. Drivers that use dma_virt_ops would use the mapping
result for cpu access (to emulate DMA) instead of real DMA, and thus the dma mapping
returned from dma-buf is not compatible with the expectation of such drivers. If these
drivers are not allowed to peek into the struct page of the sgl, they have no way to
correctly use the sgl. In this sense I agree that drivers that use dma_virt_ops should not
call dma_buf_attach(). They should use dma_buf_vmap() et al to get cpu access. 

> If this is the case, then I think dma_buf_attach should check for this and reject such imports, since that's an importer bug.

So here we go. I will move the check to dma_buf_dynamic_attach (and dma_buf_attach
is a wrapper over that).
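
A rough user-space sketch of what moving the check would look like: the
rejection lives in the dynamic attach path, and the plain attach wrapper
inherits it. The mock_ types and functions are assumptions made for this
example and do not reproduce the kernel's structures or the final patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for dma_map_ops instances. */
struct mock_dma_ops {
	const char *name;
};

static const struct mock_dma_ops mock_dma_virt_ops = { "virt" };
static const struct mock_dma_ops mock_dma_direct_ops = { "direct" };

/* Hypothetical importing device; only the ops pointer matters here. */
struct mock_device {
	const struct mock_dma_ops *dma_ops;
};

/*
 * Stand-in for the dynamic attach path: a device that emulates DMA with
 * CPU accesses cannot use the dma addresses in the returned sg list, so
 * the import is rejected up front.
 */
static int mock_dynamic_attach(const struct mock_device *dev)
{
	if (!dev)
		return -EINVAL;
	if (dev->dma_ops == &mock_dma_virt_ops)
		return -EINVAL;
	return 0;
}

/* Stand-in for the plain attach wrapper: it gets the check for free. */
static int mock_attach(const struct mock_device *dev)
{
	return mock_dynamic_attach(dev);
}
```

Placing the test in the common path means each importer no longer has to
repeat it, which was the concern raised earlier in the thread.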

> 
> If it's otoh something rdma specific, then I guess rdma checking for this is ok.
> 
> As a third option, if it's something about the connectivity between the importing and exporting device, then this should be checked in the
> ->attach callback the exporter can provide, like the p2p check. The
> idea here is that device-specific remapping units (mostly found on SoCs, and not something like a standard iommu managed by the
> dma-api), some of which can even do funny stuff like rotation of 2d images, can be accessed by some devices but not others. And only the exporter is
> aware of these restrictions.
> 
> Now I dunno which case this one here is.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


* Re: [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region
  2020-11-03 20:43           ` Daniel Vetter
  2020-11-04  0:01             ` Xiong, Jianxin
@ 2020-11-05 15:02             ` Jason Gunthorpe
  1 sibling, 0 replies; 25+ messages in thread
From: Jason Gunthorpe @ 2020-11-05 15:02 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Xiong, Jianxin, linux-rdma, dri-devel, Leon Romanovsky,
	Doug Ledford, Vetter, Daniel, Christian Koenig

On Tue, Nov 03, 2020 at 09:43:17PM +0100, Daniel Vetter wrote:

> > > Yeah definitely don't call dma_buf_map_attachment and expect a page back. In practice you'll get a page back fairly often, but I don't think
> > > we want to bake that in, maybe we eventually get to non-hacky dma_addr_t only sgl.
> > >
> > > What I'm wondering is whether dma_buf_attach shouldn't reject such devices directly, instead of each importer having to do that.
> >
> > Come back here to see if consensus can be reached on who should do the check. My
> > thinking is that it could be over restrictive for dma_buf_attach to always reject
> > dma_virt_ops. According to dma-buf documentation the back storage would be
> > moved to system area upon mapping unless p2p is requested and can be supported
> > by the exporter. The sg_list for system memory would have struct page present.
> 
> So I'm not clear on what this dma_virt_ops stuff is for, but if it's
> an entirely virtual device with cpu access, then you shouldn't do
> dma_buf_map_attachment, and then peek at the struct page in the sgl.

Yes, so I think the answer is that it is an rdma device driver error to call these
new APIs and touch the struct page side of the SGL.

After Christoph's series removing dma_virt_ops we could make that more
explicit, like was done for the pci p2p case.


> As a third option, if it's something about the connectivity between
> the importing and exporting device, then this should be checked in the
> ->attach callback the exporter can provide, like the p2p check. The

Drivers doing p2p are supposed to be calling the p2p distance stuff
and p2p dma map stuff which already has these checks.

Doing p2p and skipping all that in the dma buf side we already knew
was a hacky thing.

Jason


Thread overview: 25+ messages
2020-10-23 16:39 [PATCH v6 0/4] RDMA: Add dma-buf support Jianxin Xiong
2020-10-23 16:39 ` [PATCH v6 1/4] RDMA/umem: Support importing dma-buf as user memory region Jianxin Xiong
2020-10-23 16:49   ` Daniel Vetter
2020-10-23 18:09     ` Xiong, Jianxin
2020-10-23 18:13       ` Daniel Vetter
2020-10-23 18:20     ` Jason Gunthorpe
2020-10-24  1:45       ` Daniel Vetter
2020-11-03 17:36         ` Xiong, Jianxin
2020-11-03 20:43           ` Daniel Vetter
2020-11-04  0:01             ` Xiong, Jianxin
2020-11-05 15:02             ` Jason Gunthorpe
2020-10-24  7:48       ` Christoph Hellwig
2020-10-26 12:26         ` Jason Gunthorpe
2020-10-27  8:08           ` Christoph Hellwig
2020-10-27 17:32             ` Xiong, Jianxin
2020-10-27 19:51               ` Jason Gunthorpe
2020-10-27 20:00   ` Jason Gunthorpe
2020-10-27 20:11     ` Xiong, Jianxin
2020-10-23 16:39 ` [PATCH v6 2/4] RDMA/core: Add device method for registering dma-buf base " Jianxin Xiong
2020-10-23 16:40 ` [PATCH v6 3/4] RDMA/uverbs: Add uverbs command for dma-buf based MR registration Jianxin Xiong
2020-10-23 16:40 ` [PATCH v6 4/4] RDMA/mlx5: Support dma-buf based userspace memory region Jianxin Xiong
2020-10-27 20:08   ` Jason Gunthorpe
2020-10-27 20:33     ` Xiong, Jianxin
2020-10-28 16:35       ` Jason Gunthorpe
2020-10-28 17:29         ` Xiong, Jianxin
