linux-rdma.vger.kernel.org archive mirror
* [PATCH v1 0/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
@ 2024-04-22  6:30 Vivek Kasireddy
  2024-04-22  6:30 ` [PATCH v1 1/2] vfio: Export vfio device get and put registration helpers Vivek Kasireddy
  2024-04-22  6:30 ` [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
  0 siblings, 2 replies; 10+ messages in thread
From: Vivek Kasireddy @ 2024-04-22  6:30 UTC (permalink / raw)
  To: dri-devel, kvm, linux-rdma
  Cc: Vivek Kasireddy, Jason Gunthorpe, Christoph Hellwig,
	Robin Murphy, Christian König, Daniel Vetter, Oded Gabbay,
	Gerd Hoffmann, Alex Williamson, Kevin Tian

This is an attempt to revive the patches posted by Jason Gunthorpe at:
https://patchwork.kernel.org/project/linux-media/cover/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/

Here is the cover letter text from Jason's original series:
"dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.

This series supports a use case for SPDK where a NVMe device will be owned
by SPDK through VFIO but interacting with a RDMA device. The RDMA device
may directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.

However, as a general mechanism, it can support many other scenarios with
VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
generic and safe P2P mappings.

This series goes after the "Break up ioctl dispatch functions to one
function per ioctl" series."

In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two GPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.

The main difference between this series and the original one is the use of
the P2P DMA APIs to create struct pages (ZONE_DEVICE) to populate the
scatterlist, instead of using raw DMA addresses. Other additions include an
mmap handler to provide CPU access to the dmabuf and support for
creating the dmabuf from multiple areas (or ranges).

This series is available at:
https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/vfio_dmabuf_v1

along with additional patches for Qemu and Spice here:
https://gitlab.freedesktop.org/Vivek/qemu/-/commits/vfio_dmabuf_1
https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4 

This series is tested using the following method:
- Run Qemu with the following relevant options:
  qemu-system-x86_64 -m 4096m ....
  -device vfio-pci,host=0000:03:00.0
  -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
  -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
  -object memory-backend-memfd,id=mem1,size=4096M
  -machine memory-backend=mem1 ...
- Run upstream Weston with the following options in the Guest VM:
  ./weston --drm-device=card1 --additional-devices=card0

where card1 is a DG2 dGPU (assigned to vfio-pci) and card0 is
virtio-gpu.

Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Kevin Tian <kevin.tian@intel.com>

Vivek Kasireddy (2):
  vfio: Export vfio device get and put registration helpers
  vfio/pci: Allow MMIO regions to be exported through dma-buf

 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 348 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  28 ++-
 drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
 drivers/vfio/vfio_main.c           |   2 +
 include/linux/vfio.h               |   2 +
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  25 +++
 9 files changed, 430 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c

-- 
2.43.0



* [PATCH v1 1/2] vfio: Export vfio device get and put registration helpers
  2024-04-22  6:30 [PATCH v1 0/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
@ 2024-04-22  6:30 ` Vivek Kasireddy
  2024-04-22  6:30 ` [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
  1 sibling, 0 replies; 10+ messages in thread
From: Vivek Kasireddy @ 2024-04-22  6:30 UTC (permalink / raw)
  To: dri-devel, kvm, linux-rdma; +Cc: Vivek Kasireddy, Jason Gunthorpe

These helpers are useful for managing additional references taken
on the device from other associated VFIO modules.

Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/vfio/vfio_main.c | 2 ++
 include/linux/vfio.h     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index e97d796a54fb..7434461fc1b3 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -165,11 +165,13 @@ void vfio_device_put_registration(struct vfio_device *device)
 	if (refcount_dec_and_test(&device->refcount))
 		complete(&device->comp);
 }
+EXPORT_SYMBOL_GPL(vfio_device_put_registration);
 
 bool vfio_device_try_get_registration(struct vfio_device *device)
 {
 	return refcount_inc_not_zero(&device->refcount);
 }
+EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
 
 /*
  * VFIO driver API
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 8b1a29820409..a6302fca8acd 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -278,6 +278,8 @@ static inline void vfio_put_device(struct vfio_device *device)
 int vfio_register_group_dev(struct vfio_device *device);
 int vfio_register_emulated_iommu_dev(struct vfio_device *device);
 void vfio_unregister_group_dev(struct vfio_device *device);
+bool vfio_device_try_get_registration(struct vfio_device *device);
+void vfio_device_put_registration(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
-- 
2.43.0



* [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-04-22  6:30 [PATCH v1 0/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
  2024-04-22  6:30 ` [PATCH v1 1/2] vfio: Export vfio device get and put registration helpers Vivek Kasireddy
@ 2024-04-22  6:30 ` Vivek Kasireddy
  2024-04-22 14:44   ` Zhu Yanjun
  2024-04-30 22:24   ` Alex Williamson
  1 sibling, 2 replies; 10+ messages in thread
From: Vivek Kasireddy @ 2024-04-22  6:30 UTC (permalink / raw)
  To: dri-devel, kvm, linux-rdma; +Cc: Vivek Kasireddy, Jason Gunthorpe

From Jason Gunthorpe:
"dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.

The patch design loosely follows the pattern in commit
db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
does not support pinning.

Instead, this implements what, in the past, we've called a revocable
attachment using move. In normal situations the attachment is pinned, as a
BAR does not change physical address. However when the VFIO device is
closed, or a PCI reset is issued, access to the MMIO memory is revoked.

Revoked means that move occurs, but an attempt to immediately re-map the
memory will fail. In the reset case a future move will be triggered when
MMIO access returns. As both close and reset are under userspace control
it is expected that userspace will suspend use of the dma-buf before doing
these operations, the revoke is purely for kernel self-defense against a
hostile userspace."

The following enhancements are made to the original patch:
- Use the P2P DMA APIs to create pages (ZONE_DEVICE) instead of DMA addrs
- Add an mmap handler to provide CPU access to the dmabuf
- Add support for creating the dmabuf from multiple areas (or ranges)

Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 348 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  28 ++-
 drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  25 +++
 7 files changed, 426 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index ce7a61f1d912..b2374856ff62 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
new file mode 100644
index 000000000000..7bf00fdee69b
--- /dev/null
+++ b/drivers/vfio/pci/dma_buf.c
@@ -0,0 +1,348 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS(DMA_BUF);
+
+struct vfio_pci_dma_buf {
+	struct dma_buf *dmabuf;
+	struct vfio_pci_core_device *vdev;
+	struct list_head dmabufs_elm;
+	struct page **pages;
+	struct sg_table *sg;
+	unsigned int nr_pages;
+	bool revoked;
+};
+
+static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
+	pgoff_t pgoff = vmf->pgoff;
+
+	if (pgoff >= priv->nr_pages)
+		return VM_FAULT_SIGBUS;
+
+	return vmf_insert_pfn(vma, vmf->address,
+			      page_to_pfn(priv->pages[pgoff]));
+}
+
+static const struct vm_operations_struct vfio_pci_dma_buf_vmops = {
+	.fault = vfio_pci_dma_buf_fault,
+};
+
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf,
+				 struct vm_area_struct *vma)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	vma->vm_ops = &vfio_pci_dma_buf_vmops;
+	vma->vm_private_data = priv;
+	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+	return 0;
+}
+
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+				   struct dma_buf_attachment *attachment)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+	int rc;
+
+	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
+				      true);
+	if (rc < 0)
+		attachment->peer2peer = false;
+	return 0;
+}
+
+static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
+{
+}
+
+static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
+{
+	/*
+	 * Uses the dynamic interface but must always allow for
+	 * dma_buf_move_notify() to do revoke
+	 */
+	return -EINVAL;
+}
+
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+		     enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct scatterlist *sgl;
+	struct sg_table *sgt;
+	unsigned int i = 0;
+	int ret;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
+	if (!attachment->peer2peer)
+		return ERR_PTR(-EPERM);
+
+	if (priv->revoked)
+		return ERR_PTR(-ENODEV);
+
+	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
+	if (!sgt)
+		return ERR_PTR(-ENOMEM);
+
+	ret = sg_alloc_table(sgt, priv->nr_pages, GFP_KERNEL);
+	if (ret)
+		goto err_kfree_sgt;
+
+	for_each_sg(sgt->sgl, sgl, priv->nr_pages, i)
+		sg_set_page(sgl, priv->pages[i], PAGE_SIZE, 0);
+
+	ret = dma_map_sgtable(attachment->dev, sgt, dir, 0);
+	if (ret < 0)
+		goto err_free_sgt;
+
+	return sgt;
+
+err_free_sgt:
+	sg_free_table(sgt);
+err_kfree_sgt:
+	kfree(sgt);
+	return ERR_PTR(ret);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+				   struct sg_table *sgt,
+				   enum dma_data_direction dir)
+{
+	dma_unmap_sgtable(attachment->dev, sgt, dir, 0);
+	sg_free_table(sgt);
+	kfree(sgt);
+}
+
+static void release_p2p_pages(struct vfio_pci_dma_buf *priv,
+			      unsigned int nr_pages)
+{
+	while (nr_pages > 0 && priv->pages[--nr_pages])
+		pci_free_p2pmem(priv->vdev->pdev,
+				page_to_virt(priv->pages[nr_pages]),
+				PAGE_SIZE);
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	/*
+	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
+	 * The refcount prevents both.
+	 */
+	if (priv->vdev) {
+		release_p2p_pages(priv, priv->nr_pages);
+		kfree(priv->pages);
+		down_write(&priv->vdev->memory_lock);
+		list_del_init(&priv->dmabufs_elm);
+		up_write(&priv->vdev->memory_lock);
+		vfio_device_put_registration(&priv->vdev->vdev);
+	}
+	kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+	.attach = vfio_pci_dma_buf_attach,
+	.map_dma_buf = vfio_pci_dma_buf_map,
+	.pin = vfio_pci_dma_buf_pin,
+	.unpin = vfio_pci_dma_buf_unpin,
+	.release = vfio_pci_dma_buf_release,
+	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
+	.mmap = vfio_pci_dma_buf_mmap,
+};
+
+static int create_p2p_pages(struct vfio_pci_dma_buf *priv, uint32_t nr_areas,
+			    struct vfio_region_p2p_area *p2p_areas)
+{
+	struct pci_dev *pdev = priv->vdev->pdev;
+	resource_size_t bar_size;
+	unsigned int pg = 0;
+	void *vaddr;
+	size_t size;
+	int i, ret;
+
+	for (i = 0; i < nr_areas; i++) {
+		bar_size = pci_resource_len(pdev, p2p_areas[i].region_index);
+		if (p2p_areas[i].offset > bar_size ||
+		    p2p_areas[i].offset + p2p_areas[i].length > bar_size) {
+			ret = -ERANGE;
+			goto err;
+		}
+
+		ret = pci_p2pdma_add_resource(pdev,
+					      p2p_areas[i].region_index,
+					      p2p_areas[i].length,
+					      p2p_areas[i].offset);
+		if (ret)
+			goto err;
+
+		vaddr = pci_alloc_p2pmem(pdev, p2p_areas[i].length);
+		if (!vaddr) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		for (size = 0; size < p2p_areas[i].length;) {
+			priv->pages[pg++] = virt_to_page(vaddr + size);
+			size += PAGE_SIZE;
+		}
+	}
+
+	return 0;
+
+err:
+	release_p2p_pages(priv, pg);
+	return ret;
+}
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz)
+{
+	struct vfio_device_feature_dma_buf get_dma_buf;
+	struct vfio_region_p2p_area *p2p_areas;
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct vfio_pci_dma_buf *priv;
+	int i, ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(get_dma_buf));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+		return -EFAULT;
+
+	p2p_areas = memdup_array_user(&arg->p2p_areas,
+				      get_dma_buf.nr_areas,
+				      sizeof(*p2p_areas));
+	if (IS_ERR(p2p_areas))
+		return PTR_ERR(p2p_areas);
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	ret = -ERANGE;
+	for (i = 0; i < get_dma_buf.nr_areas; i++) {
+		/*
+		 * For PCI the region_index is the BAR number like
+		 * everything else.
+		 */
+		if (p2p_areas[i].region_index >= VFIO_PCI_ROM_REGION_INDEX) {
+			goto err_free_priv;
+		}
+
+		if (!IS_ALIGNED(p2p_areas[i].offset, PAGE_SIZE) ||
+		    !IS_ALIGNED(p2p_areas[i].length, PAGE_SIZE))
+			goto err_free_priv;
+
+		priv->nr_pages += p2p_areas[i].length >> PAGE_SHIFT;
+	}
+
+	if (!priv->nr_pages)
+		goto err_free_priv;
+
+	priv->pages = kmalloc_array(priv->nr_pages,
+				    sizeof(*priv->pages), GFP_KERNEL);
+	if (!priv->pages) {
+		ret = -ENOMEM;
+		goto err_free_priv;
+	}
+
+	priv->vdev = vdev;
+	ret = create_p2p_pages(priv, get_dma_buf.nr_areas, p2p_areas);
+	if (ret)
+		goto err_free_priv;
+
+	exp_info.ops = &vfio_pci_dmabuf_ops;
+	exp_info.size = priv->nr_pages << PAGE_SHIFT;
+	exp_info.flags = get_dma_buf.open_flags;
+	exp_info.priv = priv;
+
+	priv->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(priv->dmabuf)) {
+		ret = PTR_ERR(priv->dmabuf);
+		goto err_free_pages;
+	}
+
+	/* dma_buf_put() now frees priv */
+	INIT_LIST_HEAD(&priv->dmabufs_elm);
+	down_write(&vdev->memory_lock);
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	priv->revoked = !__vfio_pci_memory_enabled(vdev);
+	vfio_device_try_get_registration(&vdev->vdev);
+	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+	dma_resv_unlock(priv->dmabuf->resv);
+	up_write(&vdev->memory_lock);
+	kfree(p2p_areas);
+
+	/*
+	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+	 * will be released.
+	 */
+	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+
+err_free_pages:
+	release_p2p_pages(priv, priv->nr_pages);
+err_free_priv:
+	kfree(p2p_areas);
+	kfree(priv->pages);
+	kfree(priv);
+	return ret;
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!get_file_rcu(&priv->dmabuf->file))
+			continue;
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);
+	}
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	down_write(&vdev->memory_lock);
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!get_file_rcu(&priv->dmabuf->file))
+			continue;
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		list_del_init(&priv->dmabufs_elm);
+		priv->vdev = NULL;
+		priv->revoked = true;
+		dma_buf_move_notify(priv->dmabuf);
+		dma_resv_unlock(priv->dmabuf->resv);
+		vfio_device_put_registration(&vdev->vdev);
+		dma_buf_put(priv->dmabuf);
+	}
+	up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 97422aafaa7b..c605c5cb0078 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -585,10 +585,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-		if (!new_mem)
+		if (!new_mem) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-		else
+			vfio_pci_dma_buf_move(vdev, true);
+		} else {
 			down_write(&vdev->memory_lock);
+		}
 
 		/*
 		 * If the user is writing mem/io enable (new_mem/io) and we
@@ -623,6 +625,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		*virt_cmd &= cpu_to_le16(~mask);
 		*virt_cmd |= cpu_to_le16(new_cmd & mask);
 
+		if (__vfio_pci_memory_enabled(vdev))
+			vfio_pci_dma_buf_move(vdev, false);
 		up_write(&vdev->memory_lock);
 	}
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d94d61b92c1a..d2ef982bdb3e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -700,6 +700,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 #endif
 	vfio_pci_core_disable(vdev);
 
+	vfio_pci_dma_buf_cleanup(vdev);
+
 	mutex_lock(&vdev->igate);
 	if (vdev->err_trigger) {
 		eventfd_ctx_put(vdev->err_trigger);
@@ -1246,7 +1248,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	 */
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
+	vfio_pci_dma_buf_move(vdev, true);
 	ret = pci_try_reset_function(vdev->pdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 
 	return ret;
@@ -1467,11 +1472,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
-				       uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+				       u32 flags, uuid_t __user *arg,
+				       size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	uuid_t uuid;
 	int ret;
 
@@ -1498,6 +1502,9 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
+	struct vfio_pci_core_device *vdev =
+			container_of(device, struct vfio_pci_core_device, vdev);
+
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
 		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
@@ -1507,7 +1514,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF:
+		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -2182,6 +2191,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	mutex_init(&vdev->vma_lock);
 	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
+	INIT_LIST_HEAD(&vdev->dmabufs);
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
 
@@ -2576,11 +2586,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		vfio_pci_dma_buf_move(cur, true);
 		vfio_pci_set_power_state(cur, PCI_D0);
+	}
 
 	ret = pci_reset_bus(pdev);
 
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		if (__vfio_pci_memory_enabled(cur))
+			vfio_pci_dma_buf_move(cur, false);
+
 err_undo:
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
 		if (cur == cur_mem)
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5e4fa69aee16..09d3c300918c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -101,4 +101,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#ifdef CONFIG_DMA_SHARED_BUFFER
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static inline int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+			      struct vfio_device_feature_dma_buf __user *arg,
+			      size_t argsz)
+{
+	return -ENOTTY;
+}
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+					 bool revoked)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index a2c8b8bba711..387cce561dad 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -96,6 +96,7 @@ struct vfio_pci_core_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	struct list_head	dmabufs;
 };
 
 /* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..47d230c5df25 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1458,6 +1458,31 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET, create a dma_buf fd for the
+ * areas selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. nr_areas is the number of entries in the p2p_areas array; each entry
+ * selects a slice (offset/length) of the region named by region_index.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF 11
+
+struct vfio_region_p2p_area {
+	__u32	region_index;
+	__u32	__pad;
+	__u64	offset;
+	__u64	length;
+};
+
+struct vfio_device_feature_dma_buf {
+	__u32	open_flags;
+	__u32	nr_areas;
+	struct vfio_region_p2p_area p2p_areas[];
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.43.0



* Re: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-04-22  6:30 ` [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
@ 2024-04-22 14:44   ` Zhu Yanjun
  2024-04-30 22:24   ` Alex Williamson
  1 sibling, 0 replies; 10+ messages in thread
From: Zhu Yanjun @ 2024-04-22 14:44 UTC (permalink / raw)
  To: Vivek Kasireddy, dri-devel, kvm, linux-rdma; +Cc: Jason Gunthorpe

On 22.04.24 08:30, Vivek Kasireddy wrote:
>  From Jason Gunthorpe:
> "dma-buf has become a way to safely acquire a handle to non-struct page
> memory that can still have lifetime controlled by the exporter. Notably
> RDMA can now import dma-buf FDs and build them into MRs which allows for
> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> from PCI device BARs.
> 
> The patch design loosely follows the pattern in commit
> db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
> does not support pinning.
> 
> Instead, this implements what, in the past, we've called a revocable
> attachment using move. In normal situations the attachment is pinned, as a
> BAR does not change physical address. However when the VFIO device is
> closed, or a PCI reset is issued, access to the MMIO memory is revoked.
> 
> Revoked means that move occurs, but an attempt to immediately re-map the
> memory will fail. In the reset case a future move will be triggered when
> MMIO access returns. As both close and reset are under userspace control
> it is expected that userspace will suspend use of the dma-buf before doing
> these operations, the revoke is purely for kernel self-defense against a
> hostile userspace."
> 
> Following enhancements are made to the original patch:
> - Use P2P DMA APIs to create pages (ZONE_DEVICE) instead of DMA addrs
> - Add a mmap handler to provide CPU access to the dmabuf
> - Add support for creating dmabuf from multiple areas (or ranges)
> 
> Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> ---
>   drivers/vfio/pci/Makefile          |   1 +
>   drivers/vfio/pci/dma_buf.c         | 348 +++++++++++++++++++++++++++++
>   drivers/vfio/pci/vfio_pci_config.c |   8 +-
>   drivers/vfio/pci/vfio_pci_core.c   |  28 ++-
>   drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
>   include/linux/vfio_pci_core.h      |   1 +
>   include/uapi/linux/vfio.h          |  25 +++
>   7 files changed, 426 insertions(+), 8 deletions(-)
>   create mode 100644 drivers/vfio/pci/dma_buf.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index ce7a61f1d912..b2374856ff62 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -2,6 +2,7 @@
>   
>   vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>   vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
>   obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>   
>   vfio-pci-y := vfio_pci.o
> diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
> new file mode 100644
> index 000000000000..7bf00fdee69b
> --- /dev/null
> +++ b/drivers/vfio/pci/dma_buf.c
> @@ -0,0 +1,348 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.

Not sure if this Copyright (c) is 2022 or 2024.

Zhu Yanjun

> + */
> +#include <linux/dma-buf.h>
> +#include <linux/pci-p2pdma.h>
> +#include <linux/dma-resv.h>
> +
> +#include "vfio_pci_priv.h"
> +
> +MODULE_IMPORT_NS(DMA_BUF);
> +
> +struct vfio_pci_dma_buf {
> +	struct dma_buf *dmabuf;
> +	struct vfio_pci_core_device *vdev;
> +	struct list_head dmabufs_elm;
> +	struct page **pages;
> +	struct sg_table *sg;
> +	unsigned int nr_pages;
> +	bool revoked;
> +};
> +
> +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> +	pgoff_t pgoff = vmf->pgoff;
> +
> +	if (pgoff >= priv->nr_pages)
> +		return VM_FAULT_SIGBUS;
> +
> +	return vmf_insert_pfn(vma, vmf->address,
> +			      page_to_pfn(priv->pages[pgoff]));
> +}
> +
> +static const struct vm_operations_struct vfio_pci_dma_buf_vmops = {
> +	.fault = vfio_pci_dma_buf_fault,
> +};
> +
> +static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf,
> +				 struct vm_area_struct *vma)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> +		return -EINVAL;
> +
> +	vma->vm_ops = &vfio_pci_dma_buf_vmops;
> +	vma->vm_private_data = priv;
> +	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> +	return 0;
> +}
> +
> +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> +				   struct dma_buf_attachment *attachment)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +	int rc;
> +
> +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> +				      true);
> +	if (rc < 0)
> +		attachment->peer2peer = false;
> +	return 0;
> +}
> +
> +static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
> +{
> +}
> +
> +static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
> +{
> +	/*
> +	 * Uses the dynamic interface but must always allow for
> +	 * dma_buf_move_notify() to do revoke
> +	 */
> +	return -EINVAL;
> +}
> +
> +static struct sg_table *
> +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> +		     enum dma_data_direction dir)
> +{
> +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> +	struct scatterlist *sgl;
> +	struct sg_table *sgt;
> +	unsigned int i = 0;
> +	int ret;
> +
> +	dma_resv_assert_held(priv->dmabuf->resv);
> +
> +	if (!attachment->peer2peer)
> +		return ERR_PTR(-EPERM);
> +
> +	if (priv->revoked)
> +		return ERR_PTR(-ENODEV);
> +
> +	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
> +	if (!sgt)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = sg_alloc_table(sgt, priv->nr_pages, GFP_KERNEL);
> +	if (ret)
> +		goto err_kfree_sgt;
> +
> +	for_each_sg(sgt->sgl, sgl, priv->nr_pages, i)
> +		sg_set_page(sgl, priv->pages[i], PAGE_SIZE, 0);
> +
> +	ret = dma_map_sgtable(attachment->dev, sgt, dir, 0);
> +	if (ret < 0)
> +		goto err_free_sgt;
> +
> +	return sgt;
> +
> +err_free_sgt:
> +	sg_free_table(sgt);
> +err_kfree_sgt:
> +	kfree(sgt);
> +	return ERR_PTR(ret);
> +}
> +
> +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> +				   struct sg_table *sgt,
> +				   enum dma_data_direction dir)
> +{
> +	dma_unmap_sgtable(attachment->dev, sgt, dir, 0);
> +	sg_free_table(sgt);
> +	kfree(sgt);
> +}
> +
> +static void release_p2p_pages(struct vfio_pci_dma_buf *priv,
> +			      unsigned int nr_pages)
> +{
> +	while (nr_pages > 0 && priv->pages[--nr_pages])
> +		pci_free_p2pmem(priv->vdev->pdev,
> +				page_to_virt(priv->pages[nr_pages]),
> +				PAGE_SIZE);
> +}
> +
> +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	/*
> +	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> +	 * The refcount prevents both.
> +	 */
> +	if (priv->vdev) {
> +		release_p2p_pages(priv, priv->nr_pages);
> +		kfree(priv->pages);
> +		down_write(&priv->vdev->memory_lock);
> +		list_del_init(&priv->dmabufs_elm);
> +		up_write(&priv->vdev->memory_lock);
> +		vfio_device_put_registration(&priv->vdev->vdev);
> +	}
> +	kfree(priv);
> +}
> +
> +static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> +	.attach = vfio_pci_dma_buf_attach,
> +	.map_dma_buf = vfio_pci_dma_buf_map,
> +	.pin = vfio_pci_dma_buf_pin,
> +	.unpin = vfio_pci_dma_buf_unpin,
> +	.release = vfio_pci_dma_buf_release,
> +	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
> +	.mmap = vfio_pci_dma_buf_mmap,
> +};
> +
> +static int create_p2p_pages(struct vfio_pci_dma_buf *priv, uint32_t nr_areas,
> +			    struct vfio_region_p2p_area *p2p_areas)
> +{
> +	struct pci_dev *pdev = priv->vdev->pdev;
> +	resource_size_t bar_size;
> +	unsigned int pg = 0;
> +	void *vaddr;
> +	size_t size;
> +	int i, ret;
> +
> +	for (i = 0; i < nr_areas; i++) {
> +		bar_size = pci_resource_len(pdev, p2p_areas[i].region_index);
> +		if (p2p_areas[i].offset > bar_size ||
> +		    p2p_areas[i].offset + p2p_areas[i].length > bar_size) {
> +			ret = -ERANGE;
> +			goto err;
> +		}
> +
> +		ret = pci_p2pdma_add_resource(pdev,
> +					      p2p_areas[i].region_index,
> +					      p2p_areas[i].length,
> +					      p2p_areas[i].offset);
> +		if (ret)
> +			goto err;
> +
> +		vaddr = pci_alloc_p2pmem(pdev, p2p_areas[i].length);
> +		if (!vaddr) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +
> +		for (size = 0; size < p2p_areas[i].length;) {
> +			priv->pages[pg++] = virt_to_page(vaddr + size);
> +			size += PAGE_SIZE;
> +		}
> +	}
> +
> +	return 0;
> +
> +err:
> +	release_p2p_pages(priv, pg);
> +	return ret;
> +}
> +
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf get_dma_buf;
> +	struct vfio_region_p2p_area *p2p_areas;
> +	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> +	struct vfio_pci_dma_buf *priv;
> +	int i, ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(get_dma_buf));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> +		return -EFAULT;
> +
> +	p2p_areas = memdup_array_user(&arg->p2p_areas,
> +				      get_dma_buf.nr_areas,
> +				      sizeof(*p2p_areas));
> +	if (IS_ERR(p2p_areas))
> +		return PTR_ERR(p2p_areas);
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	ret = -ERANGE;
> +	for (i = 0; i < get_dma_buf.nr_areas; i++) {
> +		/*
> +		 * For PCI the region_index is the BAR number like
> +		 * everything else.
> +		 */
> +		if (p2p_areas[i].region_index >= VFIO_PCI_ROM_REGION_INDEX) {
> +			goto err_free_priv;
> +		}
> +
> +		if (!IS_ALIGNED(p2p_areas[i].offset, PAGE_SIZE) ||
> +		    !IS_ALIGNED(p2p_areas[i].length, PAGE_SIZE))
> +			goto err_free_priv;
> +
> +		priv->nr_pages += p2p_areas[i].length >> PAGE_SHIFT;
> +	}
> +
> +	if (!priv->nr_pages)
> +		goto err_free_priv;
> +
> +	priv->pages = kmalloc_array(priv->nr_pages,
> +				    sizeof(*priv->pages), GFP_KERNEL);
> +	if (!priv->pages) {
> +		ret = -ENOMEM;
> +		goto err_free_priv;
> +	}
> +
> +	priv->vdev = vdev;
> +	ret = create_p2p_pages(priv, get_dma_buf.nr_areas, p2p_areas);
> +	if (ret)
> +		goto err_free_priv;
> +
> +	exp_info.ops = &vfio_pci_dmabuf_ops;
> +	exp_info.size = priv->nr_pages << PAGE_SHIFT;
> +	exp_info.flags = get_dma_buf.open_flags;
> +	exp_info.priv = priv;
> +
> +	priv->dmabuf = dma_buf_export(&exp_info);
> +	if (IS_ERR(priv->dmabuf)) {
> +		ret = PTR_ERR(priv->dmabuf);
> +		goto err_free_pages;
> +	}
> +
> +	/* dma_buf_put() now frees priv */
> +	INIT_LIST_HEAD(&priv->dmabufs_elm);
> +	down_write(&vdev->memory_lock);
> +	dma_resv_lock(priv->dmabuf->resv, NULL);
> +	priv->revoked = !__vfio_pci_memory_enabled(vdev);
> +	vfio_device_try_get_registration(&vdev->vdev);
> +	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> +	dma_resv_unlock(priv->dmabuf->resv);
> +	up_write(&vdev->memory_lock);
> +	kfree(p2p_areas);
> +
> +	/*
> +	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
> +	 * will be released.
> +	 */
> +	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
> +
> +err_free_pages:
> +	release_p2p_pages(priv, priv->nr_pages);
> +err_free_priv:
> +	kfree(p2p_areas);
> +	kfree(priv->pages);
> +	kfree(priv);
> +	return ret;
> +}
> +
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	lockdep_assert_held_write(&vdev->memory_lock);
> +
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;
> +		if (priv->revoked != revoked) {
> +			dma_resv_lock(priv->dmabuf->resv, NULL);
> +			priv->revoked = revoked;
> +			dma_buf_move_notify(priv->dmabuf);
> +			dma_resv_unlock(priv->dmabuf->resv);
> +		}
> +		dma_buf_put(priv->dmabuf);
> +	}
> +}
> +
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	down_write(&vdev->memory_lock);
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;
> +		dma_resv_lock(priv->dmabuf->resv, NULL);
> +		list_del_init(&priv->dmabufs_elm);
> +		priv->vdev = NULL;
> +		priv->revoked = true;
> +		dma_buf_move_notify(priv->dmabuf);
> +		dma_resv_unlock(priv->dmabuf->resv);
> +		vfio_device_put_registration(&vdev->vdev);
> +		dma_buf_put(priv->dmabuf);
> +	}
> +	up_write(&vdev->memory_lock);
> +}
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 97422aafaa7b..c605c5cb0078 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -585,10 +585,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>   		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
>   		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
>   
> -		if (!new_mem)
> +		if (!new_mem) {
>   			vfio_pci_zap_and_down_write_memory_lock(vdev);
> -		else
> +			vfio_pci_dma_buf_move(vdev, true);
> +		} else {
>   			down_write(&vdev->memory_lock);
> +		}
>   
>   		/*
>   		 * If the user is writing mem/io enable (new_mem/io) and we
> @@ -623,6 +625,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>   		*virt_cmd &= cpu_to_le16(~mask);
>   		*virt_cmd |= cpu_to_le16(new_cmd & mask);
>   
> +		if (__vfio_pci_memory_enabled(vdev))
> +			vfio_pci_dma_buf_move(vdev, false);
>   		up_write(&vdev->memory_lock);
>   	}
>   
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d94d61b92c1a..d2ef982bdb3e 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -700,6 +700,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>   #endif
>   	vfio_pci_core_disable(vdev);
>   
> +	vfio_pci_dma_buf_cleanup(vdev);
> +
>   	mutex_lock(&vdev->igate);
>   	if (vdev->err_trigger) {
>   		eventfd_ctx_put(vdev->err_trigger);
> @@ -1246,7 +1248,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
>   	 */
>   	vfio_pci_set_power_state(vdev, PCI_D0);
>   
> +	vfio_pci_dma_buf_move(vdev, true);
>   	ret = pci_try_reset_function(vdev->pdev);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>   	up_write(&vdev->memory_lock);
>   
>   	return ret;
> @@ -1467,11 +1472,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>   }
>   EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>   
> -static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
> -				       uuid_t __user *arg, size_t argsz)
> +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> +				       u32 flags, uuid_t __user *arg,
> +				       size_t argsz)
>   {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>   	uuid_t uuid;
>   	int ret;
>   
> @@ -1498,6 +1502,9 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>   int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>   				void __user *arg, size_t argsz)
>   {
> +	struct vfio_pci_core_device *vdev =
> +			container_of(device, struct vfio_pci_core_device, vdev);
> +
>   	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>   	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
>   		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
> @@ -1507,7 +1514,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>   	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
>   		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>   	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> -		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF:
> +		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
>   	default:
>   		return -ENOTTY;
>   	}
> @@ -2182,6 +2191,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
>   	mutex_init(&vdev->vma_lock);
>   	INIT_LIST_HEAD(&vdev->vma_list);
>   	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> +	INIT_LIST_HEAD(&vdev->dmabufs);
>   	init_rwsem(&vdev->memory_lock);
>   	xa_init(&vdev->ctx);
>   
> @@ -2576,11 +2586,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>   	 * cause the PCI config space reset without restoring the original
>   	 * state (saved locally in 'vdev->pm_save').
>   	 */
> -	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> +		vfio_pci_dma_buf_move(cur, true);
>   		vfio_pci_set_power_state(cur, PCI_D0);
> +	}
>   
>   	ret = pci_reset_bus(pdev);
>   
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		if (__vfio_pci_memory_enabled(cur))
> +			vfio_pci_dma_buf_move(cur, false);
> +
>   err_undo:
>   	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>   		if (cur == cur_mem)
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 5e4fa69aee16..09d3c300918c 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -101,4 +101,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>   }
>   
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz);
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
> +#else
> +static inline int
> +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +			      struct vfio_device_feature_dma_buf __user *arg,
> +			      size_t argsz)
> +{
> +	return -ENOTTY;
> +}
> +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +}
> +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
> +					 bool revoked)
> +{
> +}
> +#endif
> +
>   #endif
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index a2c8b8bba711..387cce561dad 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -96,6 +96,7 @@ struct vfio_pci_core_device {
>   	struct mutex		vma_lock;
>   	struct list_head	vma_list;
>   	struct rw_semaphore	memory_lock;
> +	struct list_head	dmabufs;
>   };
>   
>   /* Will be exported for vfio pci drivers usage */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2b68e6cdf190..47d230c5df25 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1458,6 +1458,31 @@ struct vfio_device_feature_bus_master {
>   };
>   #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>   
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET, create a dma_buf fd backed by the
> + * selected areas of the device's regions.
> + *
> + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> + * etc. nr_areas is the number of entries in the p2p_areas array; each entry's
> + * region_index/offset/length selects a page-aligned slice of a BAR region to
> + * back the dmabuf.
> + *
> + * Return: The new fd number on success; on failure, -1 is returned and errno
> + * is set.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> +
> +struct vfio_region_p2p_area {
> +	__u32	region_index;
> +	__u32	__pad;
> +	__u64	offset;
> +	__u64	length;
> +};
> +
> +struct vfio_device_feature_dma_buf {
> +	__u32	open_flags;
> +	__u32	nr_areas;
> +	struct vfio_region_p2p_area p2p_areas[];
> +};
> +
>   /* -------- API for Type1 VFIO IOMMU -------- */
>   
>   /**



* Re: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-04-22  6:30 ` [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
  2024-04-22 14:44   ` Zhu Yanjun
@ 2024-04-30 22:24   ` Alex Williamson
  2024-05-01 12:53     ` Jason Gunthorpe
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Williamson @ 2024-04-30 22:24 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: dri-devel, kvm, linux-rdma, Jason Gunthorpe

On Sun, 21 Apr 2024 23:30:33 -0700
Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:

> From Jason Gunthorpe:
> "dma-buf has become a way to safely acquire a handle to non-struct page
> memory that can still have lifetime controlled by the exporter. Notably
> RDMA can now import dma-buf FDs and build them into MRs which allows for
> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> from PCI device BARs.
> 
> The patch design loosely follows the pattern in commit
> db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
> does not support pinning.
> 
> Instead, this implements what, in the past, we've called a revocable
> attachment using move. In normal situations the attachment is pinned, as a
> BAR does not change physical address. However when the VFIO device is
> closed, or a PCI reset is issued, access to the MMIO memory is revoked.
> 
> Revoked means that move occurs, but an attempt to immediately re-map the
> memory will fail. In the reset case a future move will be triggered when
> MMIO access returns. As both close and reset are under userspace control
> it is expected that userspace will suspend use of the dma-buf before doing
> these operations, the revoke is purely for kernel self-defense against a
> hostile userspace."
> 
> The following enhancements were made to the original patch:
> - Use P2P DMA APIs to create pages (ZONE_DEVICE) instead of DMA addrs
> - Add a mmap handler to provide CPU access to the dmabuf
> - Add support for creating dmabuf from multiple areas (or ranges)
> 
> Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> ---
>  drivers/vfio/pci/Makefile          |   1 +
>  drivers/vfio/pci/dma_buf.c         | 348 +++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_config.c |   8 +-
>  drivers/vfio/pci/vfio_pci_core.c   |  28 ++-
>  drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
>  include/linux/vfio_pci_core.h      |   1 +
>  include/uapi/linux/vfio.h          |  25 +++
>  7 files changed, 426 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/vfio/pci/dma_buf.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index ce7a61f1d912..b2374856ff62 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -2,6 +2,7 @@
>  
>  vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
>  obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>  
>  vfio-pci-y := vfio_pci.o
> diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
> new file mode 100644
> index 000000000000..7bf00fdee69b
> --- /dev/null
> +++ b/drivers/vfio/pci/dma_buf.c
> @@ -0,0 +1,348 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
> + */
> +#include <linux/dma-buf.h>
> +#include <linux/pci-p2pdma.h>
> +#include <linux/dma-resv.h>
> +
> +#include "vfio_pci_priv.h"
> +
> +MODULE_IMPORT_NS(DMA_BUF);
> +
> +struct vfio_pci_dma_buf {
> +	struct dma_buf *dmabuf;
> +	struct vfio_pci_core_device *vdev;
> +	struct list_head dmabufs_elm;
> +	struct page **pages;
> +	struct sg_table *sg;
> +	unsigned int nr_pages;
> +	bool revoked;
> +};
> +
> +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> +	pgoff_t pgoff = vmf->pgoff;
> +
> +	if (pgoff >= priv->nr_pages)
> +		return VM_FAULT_SIGBUS;
> +
> +	return vmf_insert_pfn(vma, vmf->address,
> +			      page_to_pfn(priv->pages[pgoff]));
> +}

How does this prevent the MMIO space from being mmap'd when disabled at
the device?  How is the mmap revoked when the MMIO becomes disabled?
Is it part of the move protocol?

> +
> +static const struct vm_operations_struct vfio_pci_dma_buf_vmops = {
> +	.fault = vfio_pci_dma_buf_fault,
> +};
> +
> +static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf,
> +				 struct vm_area_struct *vma)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> +		return -EINVAL;
> +
> +	vma->vm_ops = &vfio_pci_dma_buf_vmops;
> +	vma->vm_private_data = priv;
> +	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> +	return 0;
> +}

Does this need to include the new VM_ALLOW_ANY_UNCACHED flag?

> +
> +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> +				   struct dma_buf_attachment *attachment)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +	int rc;
> +
> +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> +				      true);
> +	if (rc < 0)
> +		attachment->peer2peer = false;
> +	return 0;
> +}
> +
> +static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
> +{
> +}
> +
> +static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
> +{
> +	/*
> +	 * Uses the dynamic interface but must always allow for
> +	 * dma_buf_move_notify() to do revoke
> +	 */
> +	return -EINVAL;
> +}
> +
> +static struct sg_table *
> +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> +		     enum dma_data_direction dir)
> +{
> +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> +	struct scatterlist *sgl;
> +	struct sg_table *sgt;
> +	unsigned int i = 0;
> +	int ret;
> +
> +	dma_resv_assert_held(priv->dmabuf->resv);
> +
> +	if (!attachment->peer2peer)
> +		return ERR_PTR(-EPERM);
> +
> +	if (priv->revoked)
> +		return ERR_PTR(-ENODEV);
> +
> +	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
> +	if (!sgt)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = sg_alloc_table(sgt, priv->nr_pages, GFP_KERNEL);
> +	if (ret)
> +		goto err_kfree_sgt;
> +
> +	for_each_sg(sgt->sgl, sgl, priv->nr_pages, i)
> +		sg_set_page(sgl, priv->pages[i], PAGE_SIZE, 0);
> +
> +	ret = dma_map_sgtable(attachment->dev, sgt, dir, 0);
> +	if (ret < 0)
> +		goto err_free_sgt;
> +
> +	return sgt;
> +
> +err_free_sgt:
> +	sg_free_table(sgt);
> +err_kfree_sgt:
> +	kfree(sgt);
> +	return ERR_PTR(ret);
> +}
> +
> +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> +				   struct sg_table *sgt,
> +				   enum dma_data_direction dir)
> +{
> +	dma_unmap_sgtable(attachment->dev, sgt, dir, 0);
> +	sg_free_table(sgt);
> +	kfree(sgt);
> +}
> +
> +static void release_p2p_pages(struct vfio_pci_dma_buf *priv,
> +			      unsigned int nr_pages)
> +{
> +	while (nr_pages > 0 && priv->pages[--nr_pages])
> +		pci_free_p2pmem(priv->vdev->pdev,
> +				page_to_virt(priv->pages[nr_pages]),
> +				PAGE_SIZE);
> +}
> +
> +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	/*
> +	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> +	 * The refcount prevents both.
> +	 */
> +	if (priv->vdev) {
> +		release_p2p_pages(priv, priv->nr_pages);
> +		kfree(priv->pages);
> +		down_write(&priv->vdev->memory_lock);
> +		list_del_init(&priv->dmabufs_elm);
> +		up_write(&priv->vdev->memory_lock);

Why are we acquiring and releasing the memory_lock write lock
throughout when we're not modifying the device memory enable state?
Ugh, we're using it to implicitly lock dmabufs_elm/dmabufs aren't we...

> +		vfio_device_put_registration(&priv->vdev->vdev);
> +	}
> +	kfree(priv);
> +}
> +
> +static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> +	.attach = vfio_pci_dma_buf_attach,
> +	.map_dma_buf = vfio_pci_dma_buf_map,
> +	.pin = vfio_pci_dma_buf_pin,
> +	.unpin = vfio_pci_dma_buf_unpin,
> +	.release = vfio_pci_dma_buf_release,
> +	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
> +	.mmap = vfio_pci_dma_buf_mmap,
> +};
> +
> +static int create_p2p_pages(struct vfio_pci_dma_buf *priv, uint32_t nr_areas,
> +			    struct vfio_region_p2p_area *p2p_areas)
> +{
> +	struct pci_dev *pdev = priv->vdev->pdev;
> +	resource_size_t bar_size;
> +	unsigned int pg = 0;
> +	void *vaddr;
> +	size_t size;
> +	int i, ret;
> +
> +	for (i = 0; i < nr_areas; i++) {
> +		bar_size = pci_resource_len(pdev, p2p_areas[i].region_index);
> +		if (p2p_areas[i].offset > bar_size ||
> +		    p2p_areas[i].offset + p2p_areas[i].length > bar_size) {
> +			ret = -ERANGE;
> +			goto err;
> +		}
> +
> +		ret = pci_p2pdma_add_resource(pdev,
> +					      p2p_areas[i].region_index,
> +					      p2p_areas[i].length,
> +					      p2p_areas[i].offset);
> +		if (ret)
> +			goto err;
> +
> +		vaddr = pci_alloc_p2pmem(pdev, p2p_areas[i].length);
> +		if (!vaddr) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +
> +		for (size = 0; size < p2p_areas[i].length;) {
> +			priv->pages[pg++] = virt_to_page(vaddr + size);
> +			size += PAGE_SIZE;
> +		}
> +	}
> +
> +	return 0;
> +
> +err:
> +	release_p2p_pages(priv, pg);
> +	return ret;
> +}
> +
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf get_dma_buf;
> +	struct vfio_region_p2p_area *p2p_areas;
> +	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> +	struct vfio_pci_dma_buf *priv;
> +	int i, ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(get_dma_buf));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> +		return -EFAULT;
> +
> +	p2p_areas = memdup_array_user(&arg->p2p_areas,
> +				      get_dma_buf.nr_areas,
> +				      sizeof(*p2p_areas));
> +	if (IS_ERR(p2p_areas))
> +		return PTR_ERR(p2p_areas);
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;

p2p_areas is leaked.

> +
> +	ret = -ERANGE;
> +	for (i = 0; i < get_dma_buf.nr_areas; i++) {
> +		/*
> +		 * For PCI the region_index is the BAR number like
> +		 * everything else.
> +		 */
> +		if (p2p_areas[i].region_index >= VFIO_PCI_ROM_REGION_INDEX) {
> +			goto err_free_priv;
> +		}
> +
> +		if (!IS_ALIGNED(p2p_areas[i].offset, PAGE_SIZE) ||
> +		    !IS_ALIGNED(p2p_areas[i].length, PAGE_SIZE))
> +			goto err_free_priv;
> +
> +		priv->nr_pages += p2p_areas[i].length >> PAGE_SHIFT;
> +	}
> +
> +	if (!priv->nr_pages)
> +		goto err_free_priv;
> +
> +	priv->pages = kmalloc_array(priv->nr_pages,
> +				    sizeof(*priv->pages), GFP_KERNEL);
> +	if (!priv->pages) {
> +		ret = -ENOMEM;
> +		goto err_free_priv;
> +	}
> +
> +	priv->vdev = vdev;
> +	ret = create_p2p_pages(priv, get_dma_buf.nr_areas, p2p_areas);
> +	if (ret)
> +		goto err_free_priv;
> +
> +	exp_info.ops = &vfio_pci_dmabuf_ops;
> +	exp_info.size = priv->nr_pages << PAGE_SHIFT;
> +	exp_info.flags = get_dma_buf.open_flags;

open_flags from userspace are unchecked.

> +	exp_info.priv = priv;
> +
> +	priv->dmabuf = dma_buf_export(&exp_info);
> +	if (IS_ERR(priv->dmabuf)) {
> +		ret = PTR_ERR(priv->dmabuf);
> +		goto err_free_pages;
> +	}
> +
> +	/* dma_buf_put() now frees priv */
> +	INIT_LIST_HEAD(&priv->dmabufs_elm);
> +	down_write(&vdev->memory_lock);
> +	dma_resv_lock(priv->dmabuf->resv, NULL);
> +	priv->revoked = !__vfio_pci_memory_enabled(vdev);
> +	vfio_device_try_get_registration(&vdev->vdev);

I guess we're assuming this can't fail in the ioctl path of an open
device?

> +	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> +	dma_resv_unlock(priv->dmabuf->resv);

What was the purpose of locking this?

> +	up_write(&vdev->memory_lock);
> +	kfree(p2p_areas);
> +
> +	/*
> +	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
> +	 * will be released.
> +	 */
> +	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
> +
> +err_free_pages:
> +	release_p2p_pages(priv, priv->nr_pages);
> +err_free_priv:
> +	kfree(p2p_areas);
> +	kfree(priv->pages);
> +	kfree(priv);
> +	return ret;
> +}
> +
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	lockdep_assert_held_write(&vdev->memory_lock);
> +
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;

Does this indicate the file was closed?

> +		if (priv->revoked != revoked) {
> +			dma_resv_lock(priv->dmabuf->resv, NULL);
> +			priv->revoked = revoked;
> +			dma_buf_move_notify(priv->dmabuf);
> +			dma_resv_unlock(priv->dmabuf->resv);
> +		}
> +		dma_buf_put(priv->dmabuf);
> +	}
> +}
> +
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	down_write(&vdev->memory_lock);
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;
> +		dma_resv_lock(priv->dmabuf->resv, NULL);
> +		list_del_init(&priv->dmabufs_elm);
> +		priv->vdev = NULL;
> +		priv->revoked = true;
> +		dma_buf_move_notify(priv->dmabuf);
> +		dma_resv_unlock(priv->dmabuf->resv);
> +		vfio_device_put_registration(&vdev->vdev);
> +		dma_buf_put(priv->dmabuf);
> +	}
> +	up_write(&vdev->memory_lock);
> +}
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 97422aafaa7b..c605c5cb0078 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -585,10 +585,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>  		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
>  		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
>  
> -		if (!new_mem)
> +		if (!new_mem) {
>  			vfio_pci_zap_and_down_write_memory_lock(vdev);
> -		else
> +			vfio_pci_dma_buf_move(vdev, true);
> +		} else {
>  			down_write(&vdev->memory_lock);
> +		}
>  
>  		/*
>  		 * If the user is writing mem/io enable (new_mem/io) and we
> @@ -623,6 +625,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>  		*virt_cmd &= cpu_to_le16(~mask);
>  		*virt_cmd |= cpu_to_le16(new_cmd & mask);
>  
> +		if (__vfio_pci_memory_enabled(vdev))
> +			vfio_pci_dma_buf_move(vdev, false);
>  		up_write(&vdev->memory_lock);
>  	}
>  

FLR is also accessible through config space.

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d94d61b92c1a..d2ef982bdb3e 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -700,6 +700,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>  #endif
>  	vfio_pci_core_disable(vdev);
>  
> +	vfio_pci_dma_buf_cleanup(vdev);
> +
>  	mutex_lock(&vdev->igate);
>  	if (vdev->err_trigger) {
>  		eventfd_ctx_put(vdev->err_trigger);
> @@ -1246,7 +1248,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
>  	 */
>  	vfio_pci_set_power_state(vdev, PCI_D0);
>  
> +	vfio_pci_dma_buf_move(vdev, true);
>  	ret = pci_try_reset_function(vdev->pdev);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>  	up_write(&vdev->memory_lock);
>  

What about runtime power management?


>  	return ret;
> @@ -1467,11 +1472,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> -static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
> -				       uuid_t __user *arg, size_t argsz)
> +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> +				       u32 flags, uuid_t __user *arg,
> +				       size_t argsz)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);

Why is only this existing function updated?  If the core device deref
is common then apply it to all and do so in a separate patch.  Thanks,

Alex

>  	uuid_t uuid;
>  	int ret;
>  
> @@ -1498,6 +1502,9 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  				void __user *arg, size_t argsz)
>  {
> +	struct vfio_pci_core_device *vdev =
> +			container_of(device, struct vfio_pci_core_device, vdev);
> +
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
>  		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
> @@ -1507,7 +1514,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
>  		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> -		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF:
> +		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -2182,6 +2191,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
>  	mutex_init(&vdev->vma_lock);
>  	INIT_LIST_HEAD(&vdev->vma_list);
>  	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> +	INIT_LIST_HEAD(&vdev->dmabufs);
>  	init_rwsem(&vdev->memory_lock);
>  	xa_init(&vdev->ctx);
>  
> @@ -2576,11 +2586,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  	 * cause the PCI config space reset without restoring the original
>  	 * state (saved locally in 'vdev->pm_save').
>  	 */
> -	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> +		vfio_pci_dma_buf_move(cur, true);
>  		vfio_pci_set_power_state(cur, PCI_D0);
> +	}
>  
>  	ret = pci_reset_bus(pdev);
>  
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		if (__vfio_pci_memory_enabled(cur))
> +			vfio_pci_dma_buf_move(cur, false);
> +
>  err_undo:
>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>  		if (cur == cur_mem)
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 5e4fa69aee16..09d3c300918c 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -101,4 +101,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz);
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
> +#else
> +static int
> +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +			      struct vfio_device_feature_dma_buf __user *arg,
> +			      size_t argsz)
> +{
> +	return -ENOTTY;
> +}
> +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +}
> +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
> +					 bool revoked)
> +{
> +}
> +#endif
> +
>  #endif
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index a2c8b8bba711..387cce561dad 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -96,6 +96,7 @@ struct vfio_pci_core_device {
>  	struct mutex		vma_lock;
>  	struct list_head	vma_list;
>  	struct rw_semaphore	memory_lock;
> +	struct list_head	dmabufs;
>  };
>  
>  /* Will be exported for vfio pci drivers usage */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2b68e6cdf190..47d230c5df25 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1458,6 +1458,31 @@ struct vfio_device_feature_bus_master {
>  };
>  #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>  
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> + * region selected.
> + *
> + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> + * etc. offset/length specify a slice of the region to create the dmabuf from.
> + * If both are 0 then the whole region is used.
> + *
> + * Return: The fd number on success, -1 and errno is set on failure.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> +
> +struct vfio_region_p2p_area {
> +	__u32	region_index;
> +	__u32	__pad;
> +	__u64	offset;
> +	__u64	length;
> +};
> +
> +struct vfio_device_feature_dma_buf {
> +	__u32	open_flags;
> +	__u32	nr_areas;
> +	struct vfio_region_p2p_area p2p_areas[];
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-04-30 22:24   ` Alex Williamson
@ 2024-05-01 12:53     ` Jason Gunthorpe
  2024-05-02  7:50       ` Kasireddy, Vivek
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2024-05-01 12:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Vivek Kasireddy, dri-devel, kvm, linux-rdma

On Tue, Apr 30, 2024 at 04:24:50PM -0600, Alex Williamson wrote:
> > +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> > +	pgoff_t pgoff = vmf->pgoff;
> > +
> > +	if (pgoff >= priv->nr_pages)
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	return vmf_insert_pfn(vma, vmf->address,
> > +			      page_to_pfn(priv->pages[pgoff]));
> > +}
> 
> How does this prevent the MMIO space from being mmap'd when disabled at
> the device?  How is the mmap revoked when the MMIO becomes disabled?
> Is it part of the move protocol?

Yes, we should not have a mmap handler for dmabuf. vfio memory must be
mmapped in the normal way.

> > +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> > +{
> > +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > +
> > +	/*
> > +	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> > +	 * The refcount prevents both.
> > +	 */
> > +	if (priv->vdev) {
> > +		release_p2p_pages(priv, priv->nr_pages);
> > +		kfree(priv->pages);
> > +		down_write(&priv->vdev->memory_lock);
> > +		list_del_init(&priv->dmabufs_elm);
> > +		up_write(&priv->vdev->memory_lock);
> 
> Why are we acquiring and releasing the memory_lock write lock
> throughout when we're not modifying the device memory enable state?
> Ugh, we're using it to implicitly lock dmabufs_elm/dmabufs aren't we...

Not really implicitly, but yes the dmabufs list is locked by the
memory_lock.

> > +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> > +				  struct vfio_device_feature_dma_buf __user *arg,
> > +				  size_t argsz)
> > +{
> > +	struct vfio_device_feature_dma_buf get_dma_buf;
> > +	struct vfio_region_p2p_area *p2p_areas;
> > +	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> > +	struct vfio_pci_dma_buf *priv;
> > +	int i, ret;
> > +
> > +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> > +				 sizeof(get_dma_buf));
> > +	if (ret != 1)
> > +		return ret;
> > +
> > +	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> > +		return -EFAULT;
> > +
> > +	p2p_areas = memdup_array_user(&arg->p2p_areas,
> > +				      get_dma_buf.nr_areas,
> > +				      sizeof(*p2p_areas));
> > +	if (IS_ERR(p2p_areas))
> > +		return PTR_ERR(p2p_areas);
> > +
> > +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > +	if (!priv)
> > +		return -ENOMEM;
> 
> p2p_areas is leaked.

What is this new p2p_areas thing? It wasn't in my patch..

> > +	exp_info.ops = &vfio_pci_dmabuf_ops;
> > +	exp_info.size = priv->nr_pages << PAGE_SHIFT;
> > +	exp_info.flags = get_dma_buf.open_flags;
> 
> open_flags from userspace are unchecked.

Huh. That seems to be a dmabuf pattern. :\

> > +	exp_info.priv = priv;
> > +
> > +	priv->dmabuf = dma_buf_export(&exp_info);
> > +	if (IS_ERR(priv->dmabuf)) {
> > +		ret = PTR_ERR(priv->dmabuf);
> > +		goto err_free_pages;
> > +	}
> > +
> > +	/* dma_buf_put() now frees priv */
> > +	INIT_LIST_HEAD(&priv->dmabufs_elm);
> > +	down_write(&vdev->memory_lock);
> > +	dma_resv_lock(priv->dmabuf->resv, NULL);
> > +	priv->revoked = !__vfio_pci_memory_enabled(vdev);
> > +	vfio_device_try_get_registration(&vdev->vdev);
> 
> I guess we're assuming this can't fail in the ioctl path of an open
> device?

Seems like a bug added here.. My version had this as
vfio_device_get(). This stuff has probably changed since I wrote it.

> > +	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> > +	dma_resv_unlock(priv->dmabuf->resv);
> 
> What was the purpose of locking this?

?

> > +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> > +{
> > +	struct vfio_pci_dma_buf *priv;
> > +	struct vfio_pci_dma_buf *tmp;
> > +
> > +	lockdep_assert_held_write(&vdev->memory_lock);
> > +
> > +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> > +		if (!get_file_rcu(&priv->dmabuf->file))
> > +			continue;
> 
> Does this indicate the file was closed?

Yes.. The original patch was clearer, Christian asked to open
code it:

+ * Returns true if a reference was successfully obtained. The caller must
+ * interlock with the dmabuf's release function in some way, such as RCU, to
+ * ensure that this is not called on freed memory.

A description of how the locking is working should be put in a comment
above that code.

> > @@ -623,6 +625,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
> >  		*virt_cmd &= cpu_to_le16(~mask);
> >  		*virt_cmd |= cpu_to_le16(new_cmd & mask);
> >  
> > +		if (__vfio_pci_memory_enabled(vdev))
> > +			vfio_pci_dma_buf_move(vdev, false);
> >  		up_write(&vdev->memory_lock);
> >  	}
> 
> FLR is also accessible through config space.

That needs fixing up

> > @@ -1246,7 +1248,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
> >  	 */
> >  	vfio_pci_set_power_state(vdev, PCI_D0);
> >  
> > +	vfio_pci_dma_buf_move(vdev, true);
> >  	ret = pci_try_reset_function(vdev->pdev);
> > +	if (__vfio_pci_memory_enabled(vdev))
> > +		vfio_pci_dma_buf_move(vdev, false);
> >  	up_write(&vdev->memory_lock);
> >  
> 
> What about runtime power management?

Yes

Yes, I somehow think it was added

> > -static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
> > -				       uuid_t __user *arg, size_t argsz)
> > +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> > +				       u32 flags, uuid_t __user *arg,
> > +				       size_t argsz)
> >  {
> > -	struct vfio_pci_core_device *vdev =
> > -		container_of(device, struct vfio_pci_core_device, vdev);
> 
> Why is only this existing function updated?  If the core device deref
> is common then apply it to all and do so in a separate patch.  Thanks,

Hm, I think that was some rebasing issue.

> > +/**
> > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > + * region selected.
> > + *
> > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> > + * etc. offset/length specify a slice of the region to create the dmabuf from.
> > + * If both are 0 then the whole region is used.
> > + *
> > + * Return: The fd number on success, -1 and errno is set on failure.
> > + */
> > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > +
> > +struct vfio_region_p2p_area {
> > +	__u32	region_index;
> > +	__u32	__pad;
> > +	__u64	offset;
> > +	__u64	length;
> > +};
> > +
> > +struct vfio_device_feature_dma_buf {
> > +	__u32	open_flags;
> > +	__u32	nr_areas;
> > +	struct vfio_region_p2p_area p2p_areas[];
> > +};

Still have no clue what this p2p areas is. You want to create a dmabuf
out of a scatterlist? Why??

I'm also not sure of the use of the pci_p2pdma family of functions, it
is a bold step to make struct pages, and that isn't going to work in
quite a lot of cases. It is really hacky/wrong to immediately call
pci_alloc_p2pmem() to defeat the internal genalloc.

I'd rather we stick with the original design. Leon is working on DMA
API changes that should address half the issue.

Jason


* RE: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-05-01 12:53     ` Jason Gunthorpe
@ 2024-05-02  7:50       ` Kasireddy, Vivek
  2024-05-02 12:57         ` Leon Romanovsky
  2024-05-08  0:31         ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: Kasireddy, Vivek @ 2024-05-02  7:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: dri-devel, kvm, linux-rdma, Christoph Hellwig

Hi Jason,

> 
> On Tue, Apr 30, 2024 at 04:24:50PM -0600, Alex Williamson wrote:
> > > +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> > > +{
> > > +	struct vm_area_struct *vma = vmf->vma;
> > > +	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> > > +	pgoff_t pgoff = vmf->pgoff;
> > > +
> > > +	if (pgoff >= priv->nr_pages)
> > > +		return VM_FAULT_SIGBUS;
> > > +
> > > +	return vmf_insert_pfn(vma, vmf->address,
> > > +			      page_to_pfn(priv->pages[pgoff]));
> > > +}
> >
> > How does this prevent the MMIO space from being mmap'd when disabled
> at
> > the device?  How is the mmap revoked when the MMIO becomes disabled?
> > Is it part of the move protocol?
In this case, I think the importers that mmap'd the dmabuf need to be tracked
separately and their VMA PTEs need to be zapped when MMIO access is revoked.

> 
> Yes, we should not have a mmap handler for dmabuf. vfio memory must be
> mmapped in the normal way.
Although optional, I think most dmabuf exporters (drm ones) provide a mmap
handler. Otherwise, there is no easy way to provide CPU access (backup slow path)
to the dmabuf for the importer.

> 
> > > +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> > > +{
> > > +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > +
> > > +	/*
> > > +	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the
> list.
> > > +	 * The refcount prevents both.
> > > +	 */
> > > +	if (priv->vdev) {
> > > +		release_p2p_pages(priv, priv->nr_pages);
> > > +		kfree(priv->pages);
> > > +		down_write(&priv->vdev->memory_lock);
> > > +		list_del_init(&priv->dmabufs_elm);
> > > +		up_write(&priv->vdev->memory_lock);
> >
> > Why are we acquiring and releasing the memory_lock write lock
> > throughout when we're not modifying the device memory enable state?
> > Ugh, we're using it to implicitly lock dmabufs_elm/dmabufs aren't we...
> 
> Not really implicitly, but yes the dmabufs list is locked by the
> memory_lock.
> 
> > > +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev,
> u32 flags,
> > > +				  struct vfio_device_feature_dma_buf __user
> *arg,
> > > +				  size_t argsz)
> > > +{
> > > +	struct vfio_device_feature_dma_buf get_dma_buf;
> > > +	struct vfio_region_p2p_area *p2p_areas;
> > > +	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> > > +	struct vfio_pci_dma_buf *priv;
> > > +	int i, ret;
> > > +
> > > +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> > > +				 sizeof(get_dma_buf));
> > > +	if (ret != 1)
> > > +		return ret;
> > > +
> > > +	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> > > +		return -EFAULT;
> > > +
> > > +	p2p_areas = memdup_array_user(&arg->p2p_areas,
> > > +				      get_dma_buf.nr_areas,
> > > +				      sizeof(*p2p_areas));
> > > +	if (IS_ERR(p2p_areas))
> > > +		return PTR_ERR(p2p_areas);
> > > +
> > > +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > > +	if (!priv)
> > > +		return -ENOMEM;
> >
> > p2p_areas is leaked.
> 
> What is this new p2p_areas thing? It wasn't in my patch..
As noted in the commit message, this is one of the things I added to
your original patch.

> 
> > > +	exp_info.ops = &vfio_pci_dmabuf_ops;
> > > +	exp_info.size = priv->nr_pages << PAGE_SHIFT;
> > > +	exp_info.flags = get_dma_buf.open_flags;
> >
> > open_flags from userspace are unchecked.
> 
> Huh. That seems to be a dmabuf pattern. :\
> 
> > > +	exp_info.priv = priv;
> > > +
> > > +	priv->dmabuf = dma_buf_export(&exp_info);
> > > +	if (IS_ERR(priv->dmabuf)) {
> > > +		ret = PTR_ERR(priv->dmabuf);
> > > +		goto err_free_pages;
> > > +	}
> > > +
> > > +	/* dma_buf_put() now frees priv */
> > > +	INIT_LIST_HEAD(&priv->dmabufs_elm);
> > > +	down_write(&vdev->memory_lock);
> > > +	dma_resv_lock(priv->dmabuf->resv, NULL);
> > > +	priv->revoked = !__vfio_pci_memory_enabled(vdev);
> > > +	vfio_device_try_get_registration(&vdev->vdev);
> >
> > I guess we're assuming this can't fail in the ioctl path of an open
> > device?
> 
> Seems like a bug added here.. My version had this as
> vfio_device_get(). This stuff has probably changed since I wrote it.
vfio_device_try_get_registration() is essentially doing the same thing as
vfio_device_get() except that we need to check the return value of
vfio_device_try_get_registration(), which I plan to do in v2.

> 
> > > +	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> > > +	dma_resv_unlock(priv->dmabuf->resv);
> >
> > What was the purpose of locking this?
> 
> ?
> 
> > > +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool
> revoked)
> > > +{
> > > +	struct vfio_pci_dma_buf *priv;
> > > +	struct vfio_pci_dma_buf *tmp;
> > > +
> > > +	lockdep_assert_held_write(&vdev->memory_lock);
> > > +
> > > +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm)
> {
> > > +		if (!get_file_rcu(&priv->dmabuf->file))
> > > +			continue;
> >
> > Does this indicate the file was closed?
> 
> Yes.. The original patch was clearer, Christian asked to open
> code it:
> 
> + * Returns true if a reference was successfully obtained. The caller must
> + * interlock with the dmabuf's release function in some way, such as RCU, to
> + * ensure that this is not called on freed memory.
> 
> A description of how the locking is working should be put in a comment
> above that code.
Sure, will add it in v2.

> 
> > > @@ -623,6 +625,8 @@ static int vfio_basic_config_write(struct
> vfio_pci_core_device *vdev, int pos,
> > >  		*virt_cmd &= cpu_to_le16(~mask);
> > >  		*virt_cmd |= cpu_to_le16(new_cmd & mask);
> > >
> > > +		if (__vfio_pci_memory_enabled(vdev))
> > > +			vfio_pci_dma_buf_move(vdev, false);
> > >  		up_write(&vdev->memory_lock);
> > >  	}
> >
> > FLR is also accessible through config space.
> 
> That needs fixing up
> 
> > > @@ -1246,7 +1248,10 @@ static int vfio_pci_ioctl_reset(struct
> vfio_pci_core_device *vdev,
> > >  	 */
> > >  	vfio_pci_set_power_state(vdev, PCI_D0);
> > >
> > > +	vfio_pci_dma_buf_move(vdev, true);
> > >  	ret = pci_try_reset_function(vdev->pdev);
> > > +	if (__vfio_pci_memory_enabled(vdev))
> > > +		vfio_pci_dma_buf_move(vdev, false);
> > >  	up_write(&vdev->memory_lock);
> > >
> >
> > What about runtime power management?
> 
> Yes
> 
> Yes, I somehow think it was added
Ok, I'll handle runtime PM and FLR cases in v2.

> 
> > > -static int vfio_pci_core_feature_token(struct vfio_device *device, u32
> flags,
> > > -				       uuid_t __user *arg, size_t argsz)
> > > +static int vfio_pci_core_feature_token(struct vfio_pci_core_device
> *vdev,
> > > +				       u32 flags, uuid_t __user *arg,
> > > +				       size_t argsz)
> > >  {
> > > -	struct vfio_pci_core_device *vdev =
> > > -		container_of(device, struct vfio_pci_core_device, vdev);
> >
> > Why is only this existing function updated?  If the core device deref
> > is common then apply it to all and do so in a separate patch.  Thanks,
> 
> Hm, I think that was some rebasing issue.
Yeah, looks like the above change may not be needed.

> 
> > > +/**
> > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > + * region selected.
> > > + *
> > > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> O_CLOEXEC,
> > > + * etc. offset/length specify a slice of the region to create the dmabuf
> from.
> > > + * If both are 0 then the whole region is used.
> > > + *
> > > + * Return: The fd number on success, -1 and errno is set on failure.
> > > + */
> > > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > > +
> > > +struct vfio_region_p2p_area {
> > > +	__u32	region_index;
> > > +	__u32	__pad;
> > > +	__u64	offset;
> > > +	__u64	length;
> > > +};
> > > +
> > > +struct vfio_device_feature_dma_buf {
> > > +	__u32	open_flags;
> > > +	__u32	nr_areas;
> > > +	struct vfio_region_p2p_area p2p_areas[];
> > > +};
> 
> Still have no clue what this p2p areas is. You want to create a dmabuf
> out of a scatterlist? Why??
Because the data associated with a buffer that needs to be shared can
come from multiple ranges. I probably should have used the terms ranges
or slices or chunks to make it more clear instead of p2p areas.

In my use-case, GPU A (in a guest VM and bound to vfio-pci on Host) writes
to a buffer (framebuffer in device mem/VRAM in this case) that needs to be
shared with GPU B on the Host. Since the framebuffer can be at least 8 MB
(assuming 1920x1080) or more in size, it is not reasonable to expect that it
would be allocated as one big contiguous chunk in device memory/VRAM.

> 
> I'm also not sure of the use of the pci_p2pdma family of functions, it
> is a bold step to make struct pages, that isn't going to work in quite
I guess things may have changed since the last discussion on this topic,
or maybe I misunderstood, but I thought Christoph's suggestion was to use
struct pages to populate the scatterlist instead of using DMA addresses,
and I figured the pci_p2pdma APIs could easily help with that.

Do you see any significant drawback in using pci_p2pdma APIs? Or, could
you please explain why this approach would not work in a lot of cases?

> alot of cases. It is really hacky/wrong to immediately call
> pci_alloc_p2pmem() to defeat the internal genalloc.
In my use-case, I need to use all the pages from the pool and I don't see any
better way to do it.

> 
> I'd rather we stick with the original design. Leon is working on DMA
> API changes that should address half the issue.
Ok, I'll keep an eye out for Leon's work.

Thanks,
Vivek

> 
> Jason


* Re: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-05-02  7:50       ` Kasireddy, Vivek
@ 2024-05-02 12:57         ` Leon Romanovsky
  2024-05-08  0:31         ` Jason Gunthorpe
  1 sibling, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-05-02 12:57 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Jason Gunthorpe, Alex Williamson, dri-devel, kvm, linux-rdma,
	Christoph Hellwig

On Thu, May 02, 2024 at 07:50:36AM +0000, Kasireddy, Vivek wrote:
> Hi Jason,

<...>

> > I'd rather we stick with the original design. Leon is working on DMA
> > API changes that should address half the issue.
> Ok, I'll keep an eye out for Leon's work.

The code for v1 is here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1
It is constantly rebased until we are ready to submit it.

v0 is here:
https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/

Thanks

> 
> Thanks,
> Vivek
> 
> > 
> > Jason
> 


* Re: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-05-02  7:50       ` Kasireddy, Vivek
  2024-05-02 12:57         ` Leon Romanovsky
@ 2024-05-08  0:31         ` Jason Gunthorpe
  2024-05-08  8:23           ` Daniel Vetter
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2024-05-08  0:31 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Alex Williamson, dri-devel, kvm, linux-rdma, Christoph Hellwig

On Thu, May 02, 2024 at 07:50:36AM +0000, Kasireddy, Vivek wrote:
> Hi Jason,
> 
> > 
> > On Tue, Apr 30, 2024 at 04:24:50PM -0600, Alex Williamson wrote:
> > > > +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> > > > +{
> > > > +	struct vm_area_struct *vma = vmf->vma;
> > > > +	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> > > > +	pgoff_t pgoff = vmf->pgoff;
> > > > +
> > > > +	if (pgoff >= priv->nr_pages)
> > > > +		return VM_FAULT_SIGBUS;
> > > > +
> > > > +	return vmf_insert_pfn(vma, vmf->address,
> > > > +			      page_to_pfn(priv->pages[pgoff]));
> > > > +}
> > >
> > > How does this prevent the MMIO space from being mmap'd when disabled
> > at
> > > the device?  How is the mmap revoked when the MMIO becomes disabled?
> > > Is it part of the move protocol?
> In this case, I think the importers that mmap'd the dmabuf need to be tracked
> separately and their VMA PTEs need to be zapped when MMIO access is revoked.

Which, as we know, is quite hard.

> > Yes, we should not have a mmap handler for dmabuf. vfio memory must be
> > mmapped in the normal way.
> Although optional, I think most dmabuf exporters (drm ones) provide a mmap
> handler. Otherwise, there is no easy way to provide CPU access (backup slow path)
> to the dmabuf for the importer.

Here we should not; there is no reason to, since VFIO already provides
a mmap mechanism itself. Anything using this API should just call the
native VFIO function instead of trying to mmap the DMABUF. Yes, it
will be inconvenient for the scatterlist case you have, but the kernel
side implementation is much easier.


> > > > +/**
> > > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > > + * region selected.
> > > > + *
> > > > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> > O_CLOEXEC,
> > > > + * etc. offset/length specify a slice of the region to create the dmabuf
> > from.
> > > > + * If both are 0 then the whole region is used.
> > > > + *
> > > > + * Return: The fd number on success, -1 and errno is set on failure.
> > > > + */
> > > > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > > > +
> > > > +struct vfio_region_p2p_area {
> > > > +	__u32	region_index;
> > > > +	__u32	__pad;
> > > > +	__u64	offset;
> > > > +	__u64	length;
> > > > +};
> > > > +
> > > > +struct vfio_device_feature_dma_buf {
> > > > +	__u32	open_flags;
> > > > +	__u32	nr_areas;
> > > > +	struct vfio_region_p2p_area p2p_areas[];
> > > > +};
> > 
> > Still have no clue what this p2p areas is. You want to create a dmabuf
> > out of a scatterlist? Why??

> Because the data associated with a buffer that needs to be shared can
> come from multiple ranges. I probably should have used the terms ranges
> or slices or chunks to make it more clear instead of p2p areas.

Yes, please pick a better name.

> > I'm also not sure of the use of the pci_p2pdma family of functions, it
> > is a bold step to make struct pages, that isn't going to work in quite

> I guess things may have changed since the last discussion on this topic or
> maybe I misunderstood but I thought Christoph's suggestion was to use
> struct pages to populate the scatterlist instead of using DMA addresses
> and, I figured pci_p2pdma APIs can easily help with that.

It was, but it doesn't really work for VFIO. You can only create
struct pages for large MMIO spaces, and only on some architectures.

Requiring them will make VFIO unusable in a lot of places. Requiring
them just for DMABUF will make that feature unusable.

Creating them on first use, and then ignoring how broken it is, is
perhaps reasonable for now, but I'm unhappy to see it so poorly
usable.
> 
> > a lot of cases. It is really hacky/wrong to immediately call
> > pci_alloc_p2pmem() to defeat the internal genalloc.

> In my use-case, I need to use all the pages from the pool and I don't see any
> better way to do it.

You have to fix the P2P API to work properly for this kind of use
case.

There should be no genalloc at all.

> > I'd rather we stick with the original design. Leon is working on DMA
> > API changes that should address half the issue.
>
> Ok, I'll keep an eye out for Leon's work.

It saves from messing with the P2P stuff, but you get the other issues
back.

Jason


* Re: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2024-05-08  0:31         ` Jason Gunthorpe
@ 2024-05-08  8:23           ` Daniel Vetter
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Vetter @ 2024-05-08  8:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kasireddy, Vivek, Alex Williamson, dri-devel, kvm, linux-rdma,
	Christoph Hellwig

On Tue, May 07, 2024 at 09:31:53PM -0300, Jason Gunthorpe wrote:
> On Thu, May 02, 2024 at 07:50:36AM +0000, Kasireddy, Vivek wrote:
> > Hi Jason,
> > 
> > > 
> > > On Tue, Apr 30, 2024 at 04:24:50PM -0600, Alex Williamson wrote:
> > > > > +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> > > > > +{
> > > > > +	struct vm_area_struct *vma = vmf->vma;
> > > > > +	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> > > > > +	pgoff_t pgoff = vmf->pgoff;
> > > > > +
> > > > > +	if (pgoff >= priv->nr_pages)
> > > > > +		return VM_FAULT_SIGBUS;
> > > > > +
> > > > > +	return vmf_insert_pfn(vma, vmf->address,
> > > > > +			      page_to_pfn(priv->pages[pgoff]));
> > > > > +}
> > > >
> > > > How does this prevent the MMIO space from being mmap'd when disabled
> > > at
> > > > the device?  How is the mmap revoked when the MMIO becomes disabled?
> > > > Is it part of the move protocol?
> > In this case, I think the importers that mmap'd the dmabuf need to be tracked
> > separately and their VMA PTEs need to be zapped when MMIO access is revoked.
> 
> Which, as we know, is quite hard.
> 
> > > Yes, we should not have a mmap handler for dmabuf. vfio memory must be
> > > mmapped in the normal way.
> > Although optional, I think most dmabuf exporters (drm ones) provide a mmap
> > handler. Otherwise, there is no easy way to provide CPU access (backup slow path)
> > to the dmabuf for the importer.
> 
> Here we should not, there is no reason since VFIO already provides a
> mmap mechanism itself. Anything using this API should just call the
> native VFIO function instead of trying to mmap the DMABUF. Yes, it
> will be inconvient for the scatterlist case you have, but the kernel
> side implementation is much easier ..

Just wanted to confirm that it's entirely legit to not implement dma-buf
mmap. Same for the in-kernel vmap functions. Especially for really funny
buffers like these it's just not a good idea, and the dma-buf interfaces
are intentionally "everything is optional".

Similarly you can (and should) reject any dma_buf_attach to devices where
p2p connectivity isn't there, or really for any other reason that
makes stuff complicated and is out of scope for your use-case. It's better
to reject strictly than to accidentally support something really horrible
(we've been there).

The only real rule with all the interfaces is that when attach() worked,
then map must too (except when you're in OOM). Because at least for some
drivers/subsystems, that's how userspace figures out whether a buffer can
be shared.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


end of thread, other threads:[~2024-05-08  8:23 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-22  6:30 [PATCH v1 0/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
2024-04-22  6:30 ` [PATCH v1 1/2] vfio: Export vfio device get and put registration helpers Vivek Kasireddy
2024-04-22  6:30 ` [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf Vivek Kasireddy
2024-04-22 14:44   ` Zhu Yanjun
2024-04-30 22:24   ` Alex Williamson
2024-05-01 12:53     ` Jason Gunthorpe
2024-05-02  7:50       ` Kasireddy, Vivek
2024-05-02 12:57         ` Leon Romanovsky
2024-05-08  0:31         ` Jason Gunthorpe
2024-05-08  8:23           ` Daniel Vetter
