All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer
@ 2022-06-29  8:25 Xie Yongji
  2022-06-29  8:25 ` [PATCH 1/6] vduse: Remove unnecessary spin lock protection Xie Yongji
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Xie Yongji @ 2022-06-29  8:25 UTC (permalink / raw)
  To: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha
  Cc: virtualization, linux-kernel

Hi all,

This series introduces some new ioctls: VDUSE_IOTLB_GET_INFO,
VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM to support
registering and de-registering userspace memory for IOTLB
as bounce buffer in virtio-vdpa case.

The VDUSE_IOTLB_GET_INFO ioctl can help user to query IOLTB
information such as bounce buffer size. Then user can use
those information on VDUSE_IOTLB_REG_UMEM and
VDUSE_IOTLB_DEREG_UMEM ioctls to register and de-register
userspace memory for IOTLB.

During registering and de-registering, the DMA data in use
would be copied from kernel bounce pages to userspace bounce
pages and back.

With this feature, some existing application such as SPDK
and DPDK can leverage the datapath of VDUSE directly and
efficiently as discussed before [1]. They can register some
preallocated hugepages to VDUSE to avoid an extra memcpy
from bounce-buffer to hugepages.

The kernel and userspace codes could be found in github:

https://github.com/bytedance/linux/tree/vduse-umem
https://github.com/bytedance/qemu/tree/vduse-umem

To test it with qemu-storage-daemon:

$ qemu-storage-daemon \
    --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
    --monitor chardev=charmonitor \
    --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
    --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on

[1] https://lkml.org/lkml/2021/6/27/318

Please review, thanks!

Xie Yongji (6):
  vduse: Remove unnecessary spin lock protection
  vduse: Use memcpy_{to,from}_page() in do_bounce()
  vduse: Support using userspace pages as bounce buffer
  vduse: Support querying IOLTB information
  vduse: Support registering userspace memory for IOTLB
  vduse: Update api version to 1

 drivers/vdpa/vdpa_user/iova_domain.c | 134 +++++++++++++++++++---
 drivers/vdpa/vdpa_user/iova_domain.h |   9 ++
 drivers/vdpa/vdpa_user/vduse_dev.c   | 163 +++++++++++++++++++++++++++
 include/uapi/linux/vduse.h           |  53 ++++++++-
 4 files changed, 345 insertions(+), 14 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] vduse: Remove unnecessary spin lock protection
  2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
@ 2022-06-29  8:25 ` Xie Yongji
  2022-06-29  8:25 ` [PATCH 2/6] vduse: Use memcpy_{to,from}_page() in do_bounce() Xie Yongji
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Xie Yongji @ 2022-06-29  8:25 UTC (permalink / raw)
  To: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha
  Cc: virtualization, linux-kernel

Taking iotlb lock to access bounce page in page fault
handler is meaningless since vduse_domain_free_bounce_pages()
would only be called during file release.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 6daa3978d290..bca1f0b8850c 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -211,17 +211,14 @@ static struct page *
 vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
 {
 	struct vduse_bounce_map *map;
-	struct page *page = NULL;
+	struct page *page;
 
-	spin_lock(&domain->iotlb_lock);
 	map = &domain->bounce_maps[iova >> PAGE_SHIFT];
 	if (!map->bounce_page)
-		goto out;
+		return NULL;
 
 	page = map->bounce_page;
 	get_page(page);
-out:
-	spin_unlock(&domain->iotlb_lock);
 
 	return page;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] vduse: Use memcpy_{to,from}_page() in do_bounce()
  2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
  2022-06-29  8:25 ` [PATCH 1/6] vduse: Remove unnecessary spin lock protection Xie Yongji
@ 2022-06-29  8:25 ` Xie Yongji
  2022-06-29  8:25 ` [PATCH 3/6] vduse: Support using userspace pages as bounce buffer Xie Yongji
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Xie Yongji @ 2022-06-29  8:25 UTC (permalink / raw)
  To: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha
  Cc: virtualization, linux-kernel

kmap_atomic() is being deprecated in favor of kmap_local_page().

The use of kmap_atomic() in do_bounce() is all thread local therefore
kmap_local_page() is a sufficient replacement.

Convert to kmap_local_page() but, instead of open coding it,
use the helpers memcpy_to_page() and memcpy_from_page().

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index bca1f0b8850c..50d7c08d5450 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -138,18 +138,17 @@ static void do_bounce(phys_addr_t orig, void *addr, size_t size,
 {
 	unsigned long pfn = PFN_DOWN(orig);
 	unsigned int offset = offset_in_page(orig);
-	char *buffer;
+	struct page *page;
 	unsigned int sz = 0;
 
 	while (size) {
 		sz = min_t(size_t, PAGE_SIZE - offset, size);
 
-		buffer = kmap_atomic(pfn_to_page(pfn));
+		page = pfn_to_page(pfn);
 		if (dir == DMA_TO_DEVICE)
-			memcpy(addr, buffer + offset, sz);
+			memcpy_from_page(addr, page, offset, sz);
 		else
-			memcpy(buffer + offset, addr, sz);
-		kunmap_atomic(buffer);
+			memcpy_to_page(page, offset, addr, sz);
 
 		size -= sz;
 		pfn++;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] vduse: Support using userspace pages as bounce buffer
  2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
  2022-06-29  8:25 ` [PATCH 1/6] vduse: Remove unnecessary spin lock protection Xie Yongji
  2022-06-29  8:25 ` [PATCH 2/6] vduse: Use memcpy_{to,from}_page() in do_bounce() Xie Yongji
@ 2022-06-29  8:25 ` Xie Yongji
  2022-06-29  8:25 ` [PATCH 4/6] vduse: Support querying IOLTB information Xie Yongji
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Xie Yongji @ 2022-06-29  8:25 UTC (permalink / raw)
  To: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha
  Cc: virtualization, linux-kernel

Introduce two APIs: vduse_domain_add_user_bounce_pages()
and vduse_domain_remove_user_bounce_pages() to support
adding and removing userspace pages for bounce buffers.
During adding and removing, the DMA data would be copied
from the kernel bounce pages to the userspace bounce pages
and back.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 128 +++++++++++++++++++++++++--
 drivers/vdpa/vdpa_user/iova_domain.h |   9 ++
 2 files changed, 129 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 50d7c08d5450..2ae29341228e 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -178,8 +178,9 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
 			    map->orig_phys == INVALID_PHYS_ADDR))
 			return;
 
-		addr = page_address(map->bounce_page) + offset;
-		do_bounce(map->orig_phys + offset, addr, sz, dir);
+		addr = kmap_local_page(map->bounce_page);
+		do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
+		kunmap_local(addr);
 		size -= sz;
 		iova += sz;
 	}
@@ -210,20 +211,23 @@ static struct page *
 vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
 {
 	struct vduse_bounce_map *map;
-	struct page *page;
+	struct page *page = NULL;
 
+	read_lock(&domain->bounce_lock);
 	map = &domain->bounce_maps[iova >> PAGE_SHIFT];
-	if (!map->bounce_page)
-		return NULL;
+	if (domain->user_bounce_pages || !map->bounce_page)
+		goto out;
 
 	page = map->bounce_page;
 	get_page(page);
+out:
+	read_unlock(&domain->bounce_lock);
 
 	return page;
 }
 
 static void
-vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain)
+vduse_domain_free_kernel_bounce_pages(struct vduse_iova_domain *domain)
 {
 	struct vduse_bounce_map *map;
 	unsigned long pfn, bounce_pfns;
@@ -243,6 +247,81 @@ vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain)
 	}
 }
 
+int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
+				       struct page **pages, int count)
+{
+	struct vduse_bounce_map *map;
+	int i, ret;
+
+	/* Now we don't support partial mapping */
+	if (count != (domain->bounce_size >> PAGE_SHIFT))
+		return -EINVAL;
+
+	write_lock(&domain->bounce_lock);
+	ret = -EEXIST;
+	if (domain->user_bounce_pages)
+		goto out;
+
+	ret = -EBUSY;
+	/*
+	 * Make sure nobody maps the kernel bounce pages,
+	 * then we can free them.
+	 */
+	if (domain->mapped)
+		goto out;
+
+	for (i = 0; i < count; i++) {
+		map = &domain->bounce_maps[i];
+		if (map->bounce_page) {
+			/* Copy kernel page to user page if it's in use */
+			if (map->orig_phys != INVALID_PHYS_ADDR)
+				memcpy_to_page(pages[i], 0,
+					       page_address(map->bounce_page),
+					       PAGE_SIZE);
+			__free_page(map->bounce_page);
+		}
+		map->bounce_page = pages[i];
+		get_page(pages[i]);
+	}
+	domain->user_bounce_pages = true;
+	ret = 0;
+out:
+	write_unlock(&domain->bounce_lock);
+
+	return ret;
+}
+
+void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
+{
+	struct vduse_bounce_map *map;
+	unsigned long i, count;
+
+	write_lock(&domain->bounce_lock);
+	if (!domain->user_bounce_pages)
+		goto out;
+
+	count = domain->bounce_size >> PAGE_SHIFT;
+	for (i = 0; i < count; i++) {
+		struct page *page = NULL;
+
+		map = &domain->bounce_maps[i];
+		if (WARN_ON(!map->bounce_page))
+			continue;
+
+		/* Copy user page to kernel page if it's in use */
+		if (map->orig_phys != INVALID_PHYS_ADDR) {
+			page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
+			memcpy_from_page(page_address(page),
+					 map->bounce_page, 0, PAGE_SIZE);
+		}
+		put_page(map->bounce_page);
+		map->bounce_page = page;
+	}
+	domain->user_bounce_pages = false;
+out:
+	write_unlock(&domain->bounce_lock);
+}
+
 void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
 {
 	if (!domain->bounce_map)
@@ -318,13 +397,18 @@ dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
 	if (vduse_domain_init_bounce_map(domain))
 		goto err;
 
+	read_lock(&domain->bounce_lock);
 	if (vduse_domain_map_bounce_page(domain, (u64)iova, (u64)size, pa))
-		goto err;
+		goto err_unlock;
 
 	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
 		vduse_domain_bounce(domain, iova, size, DMA_TO_DEVICE);
 
+	read_unlock(&domain->bounce_lock);
+
 	return iova;
+err_unlock:
+	read_unlock(&domain->bounce_lock);
 err:
 	vduse_domain_free_iova(iovad, iova, size);
 	return DMA_MAPPING_ERROR;
@@ -336,10 +420,12 @@ void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
 {
 	struct iova_domain *iovad = &domain->stream_iovad;
 
+	read_lock(&domain->bounce_lock);
 	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
 		vduse_domain_bounce(domain, dma_addr, size, DMA_FROM_DEVICE);
 
 	vduse_domain_unmap_bounce_page(domain, (u64)dma_addr, (u64)size);
+	read_unlock(&domain->bounce_lock);
 	vduse_domain_free_iova(iovad, dma_addr, size);
 }
 
@@ -404,6 +490,24 @@ void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
 	free_pages_exact(phys_to_virt(pa), size);
 }
 
+static void vduse_domain_mmap_open(struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+
+	write_lock(&domain->bounce_lock);
+	domain->mapped++;
+	write_unlock(&domain->bounce_lock);
+}
+
+static void vduse_domain_mmap_close(struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+
+	write_lock(&domain->bounce_lock);
+	domain->mapped--;
+	write_unlock(&domain->bounce_lock);
+}
+
 static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
 {
 	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
@@ -427,6 +531,8 @@ static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
 }
 
 static const struct vm_operations_struct vduse_domain_mmap_ops = {
+	.open =	vduse_domain_mmap_open,
+	.close = vduse_domain_mmap_close,
 	.fault = vduse_domain_mmap_fault,
 };
 
@@ -438,6 +544,10 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
 	vma->vm_private_data = domain;
 	vma->vm_ops = &vduse_domain_mmap_ops;
 
+	write_lock(&domain->bounce_lock);
+	domain->mapped++;
+	write_unlock(&domain->bounce_lock);
+
 	return 0;
 }
 
@@ -447,7 +557,8 @@ static int vduse_domain_release(struct inode *inode, struct file *file)
 
 	spin_lock(&domain->iotlb_lock);
 	vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
-	vduse_domain_free_bounce_pages(domain);
+	vduse_domain_remove_user_bounce_pages(domain);
+	vduse_domain_free_kernel_bounce_pages(domain);
 	spin_unlock(&domain->iotlb_lock);
 	put_iova_domain(&domain->stream_iovad);
 	put_iova_domain(&domain->consistent_iovad);
@@ -507,6 +618,7 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
 		goto err_file;
 
 	domain->file = file;
+	rwlock_init(&domain->bounce_lock);
 	spin_lock_init(&domain->iotlb_lock);
 	init_iova_domain(&domain->stream_iovad,
 			PAGE_SIZE, IOVA_START_PFN);
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index 2722d9b8e21a..4a47615346ac 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -14,6 +14,7 @@
 #include <linux/iova.h>
 #include <linux/dma-mapping.h>
 #include <linux/vhost_iotlb.h>
+#include <linux/rwlock.h>
 
 #define IOVA_START_PFN 1
 
@@ -34,6 +35,9 @@ struct vduse_iova_domain {
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
 	struct file *file;
+	int mapped;
+	bool user_bounce_pages;
+	rwlock_t bounce_lock;
 };
 
 int vduse_domain_set_map(struct vduse_iova_domain *domain,
@@ -61,6 +65,11 @@ void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
 
 void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
 
+int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
+				       struct page **pages, int count);
+
+void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain);
+
 void vduse_domain_destroy(struct vduse_iova_domain *domain);
 
 struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] vduse: Support querying IOLTB information
  2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
                   ` (2 preceding siblings ...)
  2022-06-29  8:25 ` [PATCH 3/6] vduse: Support using userspace pages as bounce buffer Xie Yongji
@ 2022-06-29  8:25 ` Xie Yongji
  2022-06-29  8:25 ` [PATCH 5/6] vduse: Support registering userspace memory for IOTLB Xie Yongji
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Xie Yongji @ 2022-06-29  8:25 UTC (permalink / raw)
  To: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha
  Cc: virtualization, linux-kernel

This introduces a new ioctl: VDUSE_IOTLB_GET_INFO to
support querying IOLTB information such as bounce
buffer size.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 13 +++++++++++++
 include/uapi/linux/vduse.h         | 17 +++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 3bc27de58f46..c47a5d9765cf 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1089,6 +1089,19 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 		ret = vduse_dev_queue_irq_work(dev, &dev->vqs[index].inject);
 		break;
 	}
+	case VDUSE_IOTLB_GET_INFO: {
+		struct vduse_iotlb_info iotlb;
+
+		iotlb.bounce_iova = 0;
+		iotlb.bounce_size = dev->domain->bounce_size;
+
+		ret = -EFAULT;
+		if (copy_to_user(argp, &iotlb, sizeof(iotlb)))
+			break;
+
+		ret = 0;
+		break;
+	}
 	default:
 		ret = -ENOIOCTLCMD;
 		break;
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 7cfe1c1280c0..c201b7a77c2c 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -210,6 +210,23 @@ struct vduse_vq_eventfd {
  */
 #define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x17, __u32)
 
+/**
+ * struct vduse_iotlb_info - IOTLB information
+ * @bounce_iova: start IOVA of bounce buffer
+ * @bounce_size: bounce buffer size
+ * @reserved: for future use, needs to be initialized to zero
+ *
+ * Structure used by VDUSE_IOTLB_GET_INFO ioctl to get IOTLB information.
+ */
+struct vduse_iotlb_info {
+	__u64 bounce_iova;
+	__u64 bounce_size;
+	__u64 reserved[2];
+};
+
+/* Get IOTLB information, e.g. bounce buffer size */
+#define VDUSE_IOTLB_GET_INFO    _IOR(VDUSE_BASE, 0x18, struct vduse_iotlb_info)
+
 /* The control messages definition for read(2)/write(2) on /dev/vduse/$NAME */
 
 /**
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
  2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
                   ` (3 preceding siblings ...)
  2022-06-29  8:25 ` [PATCH 4/6] vduse: Support querying IOLTB information Xie Yongji
@ 2022-06-29  8:25 ` Xie Yongji
  2022-06-29  8:42     ` Michael S. Tsirkin
  2022-06-29  8:25 ` [PATCH 6/6] vduse: Update api version to 1 Xie Yongji
  2022-07-04  9:26   ` Liu Xiaodong
  6 siblings, 1 reply; 24+ messages in thread
From: Xie Yongji @ 2022-06-29  8:25 UTC (permalink / raw)
  To: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha
  Cc: virtualization, linux-kernel

Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
VDUSE_IOTLB_DEREG_UMEM to support registering
and de-registering userspace memory for IOTLB
in virtio-vdpa case.

Now it only supports registering userspace memory
for IOTLB as bounce buffer.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 138 +++++++++++++++++++++++++++++
 include/uapi/linux/vduse.h         |  28 ++++++
 2 files changed, 166 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index c47a5d9765cf..7b2ea7612da9 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -21,6 +21,7 @@
 #include <linux/uio.h>
 #include <linux/vdpa.h>
 #include <linux/nospec.h>
+#include <linux/sched/mm.h>
 #include <uapi/linux/vduse.h>
 #include <uapi/linux/vdpa.h>
 #include <uapi/linux/virtio_config.h>
@@ -64,6 +65,13 @@ struct vduse_vdpa {
 	struct vduse_dev *dev;
 };
 
+struct vduse_iotlb_mem {
+	unsigned long iova;
+	unsigned long npages;
+	struct page **pages;
+	struct mm_struct *mm;
+};
+
 struct vduse_dev {
 	struct vduse_vdpa *vdev;
 	struct device *dev;
@@ -95,6 +103,8 @@ struct vduse_dev {
 	u8 status;
 	u32 vq_num;
 	u32 vq_align;
+	struct vduse_iotlb_mem *iotlb_mem;
+	struct mutex mem_lock;
 };
 
 struct vduse_dev_msg {
@@ -917,6 +927,100 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
 	return ret;
 }
 
+static int vduse_dev_dereg_iotlb_mem(struct vduse_dev *dev,
+				     u64 iova, u64 size)
+{
+	int ret;
+
+	mutex_lock(&dev->mem_lock);
+	ret = -ENOENT;
+	if (!dev->iotlb_mem)
+		goto unlock;
+
+	ret = -EINVAL;
+	if (dev->iotlb_mem->iova != iova || size != dev->domain->bounce_size)
+		goto unlock;
+
+	vduse_domain_remove_user_bounce_pages(dev->domain);
+	unpin_user_pages(dev->iotlb_mem->pages, dev->iotlb_mem->npages);
+	atomic64_sub(dev->iotlb_mem->npages, &dev->iotlb_mem->mm->pinned_vm);
+	mmdrop(dev->iotlb_mem->mm);
+	vfree(dev->iotlb_mem->pages);
+	kfree(dev->iotlb_mem);
+	dev->iotlb_mem = NULL;
+	ret = 0;
+unlock:
+	mutex_unlock(&dev->mem_lock);
+	return ret;
+}
+
+static int vduse_dev_reg_iotlb_mem(struct vduse_dev *dev,
+				   u64 iova, u64 uaddr, u64 size)
+{
+	struct page **page_list = NULL;
+	struct vduse_iotlb_mem *mem = NULL;
+	long pinned = 0;
+	unsigned long npages, lock_limit;
+	int ret;
+
+	if (size != dev->domain->bounce_size ||
+	    iova != 0 || uaddr & ~PAGE_MASK)
+		return -EINVAL;
+
+	mutex_lock(&dev->mem_lock);
+	ret = -EEXIST;
+	if (dev->iotlb_mem)
+		goto unlock;
+
+	ret = -ENOMEM;
+	npages = size >> PAGE_SHIFT;
+	page_list = vmalloc(array_size(npages,
+			    sizeof(struct page *)));
+	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	if (!page_list || !mem)
+		goto unlock;
+
+	mmap_read_lock(current->mm);
+
+	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
+	if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
+		goto out;
+
+	pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
+				page_list, NULL);
+	if (pinned != npages) {
+		ret = pinned < 0 ? pinned : -ENOMEM;
+		goto out;
+	}
+
+	ret = vduse_domain_add_user_bounce_pages(dev->domain,
+						 page_list, pinned);
+	if (ret)
+		goto out;
+
+	atomic64_add(npages, &current->mm->pinned_vm);
+
+	mem->pages = page_list;
+	mem->npages = pinned;
+	mem->iova = iova;
+	mem->mm = current->mm;
+	mmgrab(current->mm);
+
+	dev->iotlb_mem = mem;
+out:
+	if (ret && pinned > 0)
+		unpin_user_pages(page_list, pinned);
+
+	mmap_read_unlock(current->mm);
+unlock:
+	if (ret) {
+		vfree(page_list);
+		kfree(mem);
+	}
+	mutex_unlock(&dev->mem_lock);
+	return ret;
+}
+
 static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 			    unsigned long arg)
 {
@@ -943,6 +1047,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 		if (entry.start > entry.last)
 			break;
 
+		if (domain->bounce_map && dev->iotlb_mem) {
+			ret = -EEXIST;
+			if (entry.start >= 0 &&
+			    entry.last < domain->bounce_size)
+				break;
+
+			if (entry.start < domain->bounce_size)
+				entry.start = domain->bounce_size;
+		}
+
 		spin_lock(&domain->iotlb_lock);
 		map = vhost_iotlb_itree_first(domain->iotlb,
 					      entry.start, entry.last);
@@ -1102,6 +1216,28 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 		ret = 0;
 		break;
 	}
+	case VDUSE_IOTLB_REG_UMEM: {
+		struct vduse_iotlb_umem umem;
+
+		ret = -EFAULT;
+		if (copy_from_user(&umem, argp, sizeof(umem)))
+			break;
+
+		ret = vduse_dev_reg_iotlb_mem(dev, umem.iova,
+					      umem.uaddr, umem.size);
+		break;
+	}
+	case VDUSE_IOTLB_DEREG_UMEM: {
+		struct vduse_iotlb_umem umem;
+
+		ret = -EFAULT;
+		if (copy_from_user(&umem, argp, sizeof(umem)))
+			break;
+
+		ret = vduse_dev_dereg_iotlb_mem(dev, umem.iova,
+						umem.size);
+		break;
+	}
 	default:
 		ret = -ENOIOCTLCMD;
 		break;
@@ -1114,6 +1250,7 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
 {
 	struct vduse_dev *dev = file->private_data;
 
+	vduse_dev_dereg_iotlb_mem(dev, 0, dev->domain->bounce_size);
 	spin_lock(&dev->msg_lock);
 	/* Make sure the inflight messages can processed after reconncection */
 	list_splice_init(&dev->recv_list, &dev->send_list);
@@ -1176,6 +1313,7 @@ static struct vduse_dev *vduse_dev_create(void)
 		return NULL;
 
 	mutex_init(&dev->lock);
+	mutex_init(&dev->mem_lock);
 	spin_lock_init(&dev->msg_lock);
 	INIT_LIST_HEAD(&dev->send_list);
 	INIT_LIST_HEAD(&dev->recv_list);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index c201b7a77c2c..1b17391e228f 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -227,6 +227,34 @@ struct vduse_iotlb_info {
 /* Get IOTLB information, e.g. bounce buffer size */
 #define VDUSE_IOTLB_GET_INFO    _IOR(VDUSE_BASE, 0x18, struct vduse_iotlb_info)
 
+/**
+ * struct vduse_iotlb_umem - userspace memory configuration
+ * @uaddr: start address of userspace memory, it must be aligned to page size
+ * @iova: IOVA of userspace memory, it must be equal to bounce iova returned
+ *        by VDUSE_IOTLB_GET_INFO now
+ * @size: size of userspace memory, it must be equal to bounce size returned
+ *        by VDUSE_IOTLB_GET_INFO now
+ * @reserved: for future use, needs to be initialized to zero
+ *
+ * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
+ * ioctls to register/de-register userspace memory for IOTLB.
+ */
+struct vduse_iotlb_umem {
+	__u64 uaddr;
+	__u64 iova;
+	__u64 size;
+	__u64 reserved[3];
+};
+
+/*
+ * Register userspace memory for IOTLB. Now we only support registering
+ * userspace memory as bounce buffer.
+ */
+#define VDUSE_IOTLB_REG_UMEM	_IOW(VDUSE_BASE, 0x19, struct vduse_iotlb_umem)
+
+/* De-register the userspace memory. Caller should set iova and size field. */
+#define VDUSE_IOTLB_DEREG_UMEM	_IOW(VDUSE_BASE, 0x1a, struct vduse_iotlb_umem)
+
 /* The control messages definition for read(2)/write(2) on /dev/vduse/$NAME */
 
 /**
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] vduse: Update api version to 1
  2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
                   ` (4 preceding siblings ...)
  2022-06-29  8:25 ` [PATCH 5/6] vduse: Support registering userspace memory for IOTLB Xie Yongji
@ 2022-06-29  8:25 ` Xie Yongji
  2022-06-29  8:33     ` Michael S. Tsirkin
  2022-07-04  9:26   ` Liu Xiaodong
  6 siblings, 1 reply; 24+ messages in thread
From: Xie Yongji @ 2022-06-29  8:25 UTC (permalink / raw)
  To: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha
  Cc: virtualization, linux-kernel

Let's update api version to 1 since we introduced
some new ioctls to support registering userspace
memory for IOTLB.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 12 ++++++++++++
 include/uapi/linux/vduse.h         |  8 +++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 7b2ea7612da9..2795785ca6a2 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1206,6 +1206,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 	case VDUSE_IOTLB_GET_INFO: {
 		struct vduse_iotlb_info iotlb;
 
+		ret = -EPERM;
+		if (dev->api_version < 1)
+			break;
+
 		iotlb.bounce_iova = 0;
 		iotlb.bounce_size = dev->domain->bounce_size;
 
@@ -1219,6 +1223,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 	case VDUSE_IOTLB_REG_UMEM: {
 		struct vduse_iotlb_umem umem;
 
+		ret = -EPERM;
+		if (dev->api_version < 1)
+			break;
+
 		ret = -EFAULT;
 		if (copy_from_user(&umem, argp, sizeof(umem)))
 			break;
@@ -1230,6 +1238,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 	case VDUSE_IOTLB_DEREG_UMEM: {
 		struct vduse_iotlb_umem umem;
 
+		ret = -EPERM;
+		if (dev->api_version < 1)
+			break;
+
 		ret = -EFAULT;
 		if (copy_from_user(&umem, argp, sizeof(umem)))
 			break;
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 1b17391e228f..902ea19cd9e0 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -8,7 +8,13 @@
 
 /* The ioctls for control device (/dev/vduse/control) */
 
-#define VDUSE_API_VERSION	0
+/*
+ * v0 -> v1:
+ *  - Introduce VDUSE_IOTLB_GET_INFO ioctl
+ *  - Introduce VDUSE_VDUSE_IOTLB_REG_UMEM ioctl
+ *  - Introduce VDUSE_IOTLB_DEREG_UMEM ioctl
+ */
+#define VDUSE_API_VERSION	1
 
 /*
  * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] vduse: Update api version to 1
  2022-06-29  8:25 ` [PATCH 6/6] vduse: Update api version to 1 Xie Yongji
@ 2022-06-29  8:33     ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  8:33 UTC (permalink / raw)
  To: Xie Yongji
  Cc: jasowang, xiaodong.liu, maxime.coquelin, stefanha,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 04:25:41PM +0800, Xie Yongji wrote:
> Let's update api version to 1 since we introduced
> some new ioctls to support registering userspace
> memory for IOTLB.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Adding new ioctls does not justify things like this.

Besides, adding UAPI then changing it is not nice
since it makes git bisect behave incorrectly.

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 12 ++++++++++++
>  include/uapi/linux/vduse.h         |  8 +++++++-
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 7b2ea7612da9..2795785ca6a2 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1206,6 +1206,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  	case VDUSE_IOTLB_GET_INFO: {
>  		struct vduse_iotlb_info iotlb;
>  
> +		ret = -EPERM;


Almost for sure a wrong error code.

> +		if (dev->api_version < 1)
> +			break;
> +
>  		iotlb.bounce_iova = 0;
>  		iotlb.bounce_size = dev->domain->bounce_size;
>  


Wait a second. so you are intentionally breaking any userspace
that called VDUSE_SET_API_VERSION with version 0?

Please don't.


> @@ -1219,6 +1223,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  	case VDUSE_IOTLB_REG_UMEM: {
>  		struct vduse_iotlb_umem umem;
>  
> +		ret = -EPERM;
> +		if (dev->api_version < 1)
> +			break;
> +
>  		ret = -EFAULT;
>  		if (copy_from_user(&umem, argp, sizeof(umem)))
>  			break;
> @@ -1230,6 +1238,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  	case VDUSE_IOTLB_DEREG_UMEM: {
>  		struct vduse_iotlb_umem umem;
>  
> +		ret = -EPERM;
> +		if (dev->api_version < 1)
> +			break;
> +
>  		ret = -EFAULT;
>  		if (copy_from_user(&umem, argp, sizeof(umem)))
>  			break;
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 1b17391e228f..902ea19cd9e0 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -8,7 +8,13 @@
>  
>  /* The ioctls for control device (/dev/vduse/control) */
>  
> -#define VDUSE_API_VERSION	0
> +/*
> + * v0 -> v1:
> + *  - Introduce VDUSE_IOTLB_GET_INFO ioctl
> + *  - Introduce VDUSE_VDUSE_IOTLB_REG_UMEM ioctl
> + *  - Introduce VDUSE_IOTLB_DEREG_UMEM ioctl
> + */
> +#define VDUSE_API_VERSION	1
>  
>  /*
>   * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
> -- 
> 2.20.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] vduse: Update api version to 1
@ 2022-06-29  8:33     ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  8:33 UTC (permalink / raw)
  To: Xie Yongji
  Cc: xiaodong.liu, linux-kernel, virtualization, maxime.coquelin, stefanha

On Wed, Jun 29, 2022 at 04:25:41PM +0800, Xie Yongji wrote:
> Let's update api version to 1 since we introduced
> some new ioctls to support registering userspace
> memory for IOTLB.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Adding new ioctls does not justify things like this.

Besides, adding UAPI then changing it is not nice
since it makes git bisect behave incorrectly.

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 12 ++++++++++++
>  include/uapi/linux/vduse.h         |  8 +++++++-
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 7b2ea7612da9..2795785ca6a2 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1206,6 +1206,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  	case VDUSE_IOTLB_GET_INFO: {
>  		struct vduse_iotlb_info iotlb;
>  
> +		ret = -EPERM;


Almost for sure a wrong error code.

> +		if (dev->api_version < 1)
> +			break;
> +
>  		iotlb.bounce_iova = 0;
>  		iotlb.bounce_size = dev->domain->bounce_size;
>  


Wait a second. so you are intentionally breaking any userspace
that called VDUSE_SET_API_VERSION with version 0?

Please don't.


> @@ -1219,6 +1223,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  	case VDUSE_IOTLB_REG_UMEM: {
>  		struct vduse_iotlb_umem umem;
>  
> +		ret = -EPERM;
> +		if (dev->api_version < 1)
> +			break;
> +
>  		ret = -EFAULT;
>  		if (copy_from_user(&umem, argp, sizeof(umem)))
>  			break;
> @@ -1230,6 +1238,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  	case VDUSE_IOTLB_DEREG_UMEM: {
>  		struct vduse_iotlb_umem umem;
>  
> +		ret = -EPERM;
> +		if (dev->api_version < 1)
> +			break;
> +
>  		ret = -EFAULT;
>  		if (copy_from_user(&umem, argp, sizeof(umem)))
>  			break;
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 1b17391e228f..902ea19cd9e0 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -8,7 +8,13 @@
>  
>  /* The ioctls for control device (/dev/vduse/control) */
>  
> -#define VDUSE_API_VERSION	0
> +/*
> + * v0 -> v1:
> + *  - Introduce VDUSE_IOTLB_GET_INFO ioctl
> + *  - Introduce VDUSE_VDUSE_IOTLB_REG_UMEM ioctl
> + *  - Introduce VDUSE_IOTLB_DEREG_UMEM ioctl
> + */
> +#define VDUSE_API_VERSION	1
>  
>  /*
>   * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
> -- 
> 2.20.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
  2022-06-29  8:25 ` [PATCH 5/6] vduse: Support registering userspace memory for IOTLB Xie Yongji
@ 2022-06-29  8:42     ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  8:42 UTC (permalink / raw)
  To: Xie Yongji
  Cc: jasowang, xiaodong.liu, maxime.coquelin, stefanha,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 04:25:40PM +0800, Xie Yongji wrote:
> Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
> VDUSE_IOTLB_DEREG_UMEM to support registering
> and de-registering userspace memory for IOTLB
> in virtio-vdpa case.
> 
> Now it only supports registering userspace memory
> for IOTLB as bounce buffer.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 138 +++++++++++++++++++++++++++++
>  include/uapi/linux/vduse.h         |  28 ++++++
>  2 files changed, 166 insertions(+)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index c47a5d9765cf..7b2ea7612da9 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -21,6 +21,7 @@
>  #include <linux/uio.h>
>  #include <linux/vdpa.h>
>  #include <linux/nospec.h>
> +#include <linux/sched/mm.h>
>  #include <uapi/linux/vduse.h>
>  #include <uapi/linux/vdpa.h>
>  #include <uapi/linux/virtio_config.h>
> @@ -64,6 +65,13 @@ struct vduse_vdpa {
>  	struct vduse_dev *dev;
>  };
>  
> +struct vduse_iotlb_mem {
> +	unsigned long iova;
> +	unsigned long npages;
> +	struct page **pages;
> +	struct mm_struct *mm;
> +};
> +
>  struct vduse_dev {
>  	struct vduse_vdpa *vdev;
>  	struct device *dev;
> @@ -95,6 +103,8 @@ struct vduse_dev {
>  	u8 status;
>  	u32 vq_num;
>  	u32 vq_align;
> +	struct vduse_iotlb_mem *iotlb_mem;
> +	struct mutex mem_lock;
>  };
>  
>  struct vduse_dev_msg {
> @@ -917,6 +927,100 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
>  	return ret;
>  }
>  
> +static int vduse_dev_dereg_iotlb_mem(struct vduse_dev *dev,
> +				     u64 iova, u64 size)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->mem_lock);
> +	ret = -ENOENT;
> +	if (!dev->iotlb_mem)
> +		goto unlock;
> +
> +	ret = -EINVAL;
> +	if (dev->iotlb_mem->iova != iova || size != dev->domain->bounce_size)
> +		goto unlock;
> +
> +	vduse_domain_remove_user_bounce_pages(dev->domain);
> +	unpin_user_pages(dev->iotlb_mem->pages, dev->iotlb_mem->npages);

I notice you don't mark the pages dirty. This is going to be a problem.

> +	atomic64_sub(dev->iotlb_mem->npages, &dev->iotlb_mem->mm->pinned_vm);
> +	mmdrop(dev->iotlb_mem->mm);
> +	vfree(dev->iotlb_mem->pages);
> +	kfree(dev->iotlb_mem);
> +	dev->iotlb_mem = NULL;
> +	ret = 0;
> +unlock:
> +	mutex_unlock(&dev->mem_lock);
> +	return ret;
> +}
> +
> +static int vduse_dev_reg_iotlb_mem(struct vduse_dev *dev,
> +				   u64 iova, u64 uaddr, u64 size)
> +{
> +	struct page **page_list = NULL;
> +	struct vduse_iotlb_mem *mem = NULL;
> +	long pinned = 0;
> +	unsigned long npages, lock_limit;
> +	int ret;
> +
> +	if (size != dev->domain->bounce_size ||
> +	    iova != 0 || uaddr & ~PAGE_MASK)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->mem_lock);
> +	ret = -EEXIST;
> +	if (dev->iotlb_mem)
> +		goto unlock;
> +
> +	ret = -ENOMEM;
> +	npages = size >> PAGE_SHIFT;
> +	page_list = vmalloc(array_size(npages,
> +			    sizeof(struct page *)));

Is this basically trying to do a vmalloc with userspace-controlled size?
That's an easy DOS vector.

> +	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> +	if (!page_list || !mem)
> +		goto unlock;
> +
> +	mmap_read_lock(current->mm);
> +
> +	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> +	if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
> +		goto out;
> +
> +	pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
> +				page_list, NULL);
> +	if (pinned != npages) {
> +		ret = pinned < 0 ? pinned : -ENOMEM;
> +		goto out;
> +	}


This is a popular approach but it's problematic if multiple
devices try to pin the same page.
Can this happen here?

> +
> +	ret = vduse_domain_add_user_bounce_pages(dev->domain,
> +						 page_list, pinned);
> +	if (ret)
> +		goto out;
> +
> +	atomic64_add(npages, &current->mm->pinned_vm);
> +
> +	mem->pages = page_list;
> +	mem->npages = pinned;
> +	mem->iova = iova;
> +	mem->mm = current->mm;
> +	mmgrab(current->mm);
> +
> +	dev->iotlb_mem = mem;
> +out:
> +	if (ret && pinned > 0)
> +		unpin_user_pages(page_list, pinned);
> +
> +	mmap_read_unlock(current->mm);
> +unlock:
> +	if (ret) {
> +		vfree(page_list);
> +		kfree(mem);
> +	}
> +	mutex_unlock(&dev->mem_lock);
> +	return ret;
> +}
> +
>  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  			    unsigned long arg)
>  {
> @@ -943,6 +1047,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  		if (entry.start > entry.last)
>  			break;
>  
> +		if (domain->bounce_map && dev->iotlb_mem) {
> +			ret = -EEXIST;
> +			if (entry.start >= 0 &&
> +			    entry.last < domain->bounce_size)
> +				break;
> +
> +			if (entry.start < domain->bounce_size)
> +				entry.start = domain->bounce_size;
> +		}
> +
>  		spin_lock(&domain->iotlb_lock);
>  		map = vhost_iotlb_itree_first(domain->iotlb,
>  					      entry.start, entry.last);
> @@ -1102,6 +1216,28 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  		ret = 0;
>  		break;
>  	}
> +	case VDUSE_IOTLB_REG_UMEM: {
> +		struct vduse_iotlb_umem umem;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&umem, argp, sizeof(umem)))
> +			break;
> +
> +		ret = vduse_dev_reg_iotlb_mem(dev, umem.iova,
> +					      umem.uaddr, umem.size);
> +		break;
> +	}
> +	case VDUSE_IOTLB_DEREG_UMEM: {
> +		struct vduse_iotlb_umem umem;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&umem, argp, sizeof(umem)))
> +			break;
> +
> +		ret = vduse_dev_dereg_iotlb_mem(dev, umem.iova,
> +						umem.size);
> +		break;
> +	}
>  	default:
>  		ret = -ENOIOCTLCMD;
>  		break;
> @@ -1114,6 +1250,7 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
>  {
>  	struct vduse_dev *dev = file->private_data;
>  
> +	vduse_dev_dereg_iotlb_mem(dev, 0, dev->domain->bounce_size);
>  	spin_lock(&dev->msg_lock);
>  	/* Make sure the inflight messages can processed after reconncection */
>  	list_splice_init(&dev->recv_list, &dev->send_list);
> @@ -1176,6 +1313,7 @@ static struct vduse_dev *vduse_dev_create(void)
>  		return NULL;
>  
>  	mutex_init(&dev->lock);
> +	mutex_init(&dev->mem_lock);
>  	spin_lock_init(&dev->msg_lock);
>  	INIT_LIST_HEAD(&dev->send_list);
>  	INIT_LIST_HEAD(&dev->recv_list);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index c201b7a77c2c..1b17391e228f 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -227,6 +227,34 @@ struct vduse_iotlb_info {
>  /* Get IOTLB information, e.g. bounce buffer size */
>  #define VDUSE_IOTLB_GET_INFO    _IOR(VDUSE_BASE, 0x18, struct vduse_iotlb_info)
>  
> +/**
> + * struct vduse_iotlb_umem - userspace memory configuration
> + * @uaddr: start address of userspace memory, it must be aligned to page size
> + * @iova: IOVA of userspace memory, it must be equal to bounce iova returned
> + *        by VDUSE_IOTLB_GET_INFO now
> + * @size: size of userspace memory, it must be equal to bounce size returned
> + *        by VDUSE_IOTLB_GET_INFO now
> + * @reserved: for future use, needs to be initialized to zero

You should check that it's 0 in that case, otherwise userspace
will conveniently forget.

> + *
> + * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> + * ioctls to register/de-register userspace memory for IOTLB.
> + */
> +struct vduse_iotlb_umem {
> +	__u64 uaddr;
> +	__u64 iova;
> +	__u64 size;
> +	__u64 reserved[3];
> +};
> +
> +/*
> + * Register userspace memory for IOTLB. Now we only support registering
> + * userspace memory as bounce buffer.
> + */
> +#define VDUSE_IOTLB_REG_UMEM	_IOW(VDUSE_BASE, 0x19, struct vduse_iotlb_umem)
> +
> +/* De-register the userspace memory. Caller should set iova and size field. */
> +#define VDUSE_IOTLB_DEREG_UMEM	_IOW(VDUSE_BASE, 0x1a, struct vduse_iotlb_umem)
> +
>  /* The control messages definition for read(2)/write(2) on /dev/vduse/$NAME */
>  
>  /**
> -- 
> 2.20.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
@ 2022-06-29  8:42     ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  8:42 UTC (permalink / raw)
  To: Xie Yongji
  Cc: xiaodong.liu, linux-kernel, virtualization, maxime.coquelin, stefanha

On Wed, Jun 29, 2022 at 04:25:40PM +0800, Xie Yongji wrote:
> Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
> VDUSE_IOTLB_DEREG_UMEM to support registering
> and de-registering userspace memory for IOTLB
> in virtio-vdpa case.
> 
> Now it only supports registering userspace memory
> for IOTLB as bounce buffer.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 138 +++++++++++++++++++++++++++++
>  include/uapi/linux/vduse.h         |  28 ++++++
>  2 files changed, 166 insertions(+)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index c47a5d9765cf..7b2ea7612da9 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -21,6 +21,7 @@
>  #include <linux/uio.h>
>  #include <linux/vdpa.h>
>  #include <linux/nospec.h>
> +#include <linux/sched/mm.h>
>  #include <uapi/linux/vduse.h>
>  #include <uapi/linux/vdpa.h>
>  #include <uapi/linux/virtio_config.h>
> @@ -64,6 +65,13 @@ struct vduse_vdpa {
>  	struct vduse_dev *dev;
>  };
>  
> +struct vduse_iotlb_mem {
> +	unsigned long iova;
> +	unsigned long npages;
> +	struct page **pages;
> +	struct mm_struct *mm;
> +};
> +
>  struct vduse_dev {
>  	struct vduse_vdpa *vdev;
>  	struct device *dev;
> @@ -95,6 +103,8 @@ struct vduse_dev {
>  	u8 status;
>  	u32 vq_num;
>  	u32 vq_align;
> +	struct vduse_iotlb_mem *iotlb_mem;
> +	struct mutex mem_lock;
>  };
>  
>  struct vduse_dev_msg {
> @@ -917,6 +927,100 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
>  	return ret;
>  }
>  
> +static int vduse_dev_dereg_iotlb_mem(struct vduse_dev *dev,
> +				     u64 iova, u64 size)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->mem_lock);
> +	ret = -ENOENT;
> +	if (!dev->iotlb_mem)
> +		goto unlock;
> +
> +	ret = -EINVAL;
> +	if (dev->iotlb_mem->iova != iova || size != dev->domain->bounce_size)
> +		goto unlock;
> +
> +	vduse_domain_remove_user_bounce_pages(dev->domain);
> +	unpin_user_pages(dev->iotlb_mem->pages, dev->iotlb_mem->npages);

I notice you don't mark the pages dirty. This is going to be a problem.

> +	atomic64_sub(dev->iotlb_mem->npages, &dev->iotlb_mem->mm->pinned_vm);
> +	mmdrop(dev->iotlb_mem->mm);
> +	vfree(dev->iotlb_mem->pages);
> +	kfree(dev->iotlb_mem);
> +	dev->iotlb_mem = NULL;
> +	ret = 0;
> +unlock:
> +	mutex_unlock(&dev->mem_lock);
> +	return ret;
> +}
> +
> +static int vduse_dev_reg_iotlb_mem(struct vduse_dev *dev,
> +				   u64 iova, u64 uaddr, u64 size)
> +{
> +	struct page **page_list = NULL;
> +	struct vduse_iotlb_mem *mem = NULL;
> +	long pinned = 0;
> +	unsigned long npages, lock_limit;
> +	int ret;
> +
> +	if (size != dev->domain->bounce_size ||
> +	    iova != 0 || uaddr & ~PAGE_MASK)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->mem_lock);
> +	ret = -EEXIST;
> +	if (dev->iotlb_mem)
> +		goto unlock;
> +
> +	ret = -ENOMEM;
> +	npages = size >> PAGE_SHIFT;
> +	page_list = vmalloc(array_size(npages,
> +			    sizeof(struct page *)));

Is this basically trying to do a vmalloc with userspace-controlled size?
That's an easy DOS vector.

> +	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> +	if (!page_list || !mem)
> +		goto unlock;
> +
> +	mmap_read_lock(current->mm);
> +
> +	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> +	if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
> +		goto out;
> +
> +	pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
> +				page_list, NULL);
> +	if (pinned != npages) {
> +		ret = pinned < 0 ? pinned : -ENOMEM;
> +		goto out;
> +	}


This is a popular approach but it's problematic if multiple
devices try to pin the same page.
Can this happen here?

> +
> +	ret = vduse_domain_add_user_bounce_pages(dev->domain,
> +						 page_list, pinned);
> +	if (ret)
> +		goto out;
> +
> +	atomic64_add(npages, &current->mm->pinned_vm);
> +
> +	mem->pages = page_list;
> +	mem->npages = pinned;
> +	mem->iova = iova;
> +	mem->mm = current->mm;
> +	mmgrab(current->mm);
> +
> +	dev->iotlb_mem = mem;
> +out:
> +	if (ret && pinned > 0)
> +		unpin_user_pages(page_list, pinned);
> +
> +	mmap_read_unlock(current->mm);
> +unlock:
> +	if (ret) {
> +		vfree(page_list);
> +		kfree(mem);
> +	}
> +	mutex_unlock(&dev->mem_lock);
> +	return ret;
> +}
> +
>  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  			    unsigned long arg)
>  {
> @@ -943,6 +1047,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  		if (entry.start > entry.last)
>  			break;
>  
> +		if (domain->bounce_map && dev->iotlb_mem) {
> +			ret = -EEXIST;
> +			if (entry.start >= 0 &&
> +			    entry.last < domain->bounce_size)
> +				break;
> +
> +			if (entry.start < domain->bounce_size)
> +				entry.start = domain->bounce_size;
> +		}
> +
>  		spin_lock(&domain->iotlb_lock);
>  		map = vhost_iotlb_itree_first(domain->iotlb,
>  					      entry.start, entry.last);
> @@ -1102,6 +1216,28 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  		ret = 0;
>  		break;
>  	}
> +	case VDUSE_IOTLB_REG_UMEM: {
> +		struct vduse_iotlb_umem umem;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&umem, argp, sizeof(umem)))
> +			break;
> +
> +		ret = vduse_dev_reg_iotlb_mem(dev, umem.iova,
> +					      umem.uaddr, umem.size);
> +		break;
> +	}
> +	case VDUSE_IOTLB_DEREG_UMEM: {
> +		struct vduse_iotlb_umem umem;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&umem, argp, sizeof(umem)))
> +			break;
> +
> +		ret = vduse_dev_dereg_iotlb_mem(dev, umem.iova,
> +						umem.size);
> +		break;
> +	}
>  	default:
>  		ret = -ENOIOCTLCMD;
>  		break;
> @@ -1114,6 +1250,7 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
>  {
>  	struct vduse_dev *dev = file->private_data;
>  
> +	vduse_dev_dereg_iotlb_mem(dev, 0, dev->domain->bounce_size);
>  	spin_lock(&dev->msg_lock);
>  	/* Make sure the inflight messages can processed after reconncection */
>  	list_splice_init(&dev->recv_list, &dev->send_list);
> @@ -1176,6 +1313,7 @@ static struct vduse_dev *vduse_dev_create(void)
>  		return NULL;
>  
>  	mutex_init(&dev->lock);
> +	mutex_init(&dev->mem_lock);
>  	spin_lock_init(&dev->msg_lock);
>  	INIT_LIST_HEAD(&dev->send_list);
>  	INIT_LIST_HEAD(&dev->recv_list);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index c201b7a77c2c..1b17391e228f 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -227,6 +227,34 @@ struct vduse_iotlb_info {
>  /* Get IOTLB information, e.g. bounce buffer size */
>  #define VDUSE_IOTLB_GET_INFO    _IOR(VDUSE_BASE, 0x18, struct vduse_iotlb_info)
>  
> +/**
> + * struct vduse_iotlb_umem - userspace memory configuration
> + * @uaddr: start address of userspace memory, it must be aligned to page size
> + * @iova: IOVA of userspace memory, it must be equal to bounce iova returned
> + *        by VDUSE_IOTLB_GET_INFO now
> + * @size: size of userspace memory, it must be equal to bounce size returned
> + *        by VDUSE_IOTLB_GET_INFO now
> + * @reserved: for future use, needs to be initialized to zero

You should check that it's 0 in that case, otherwise userspace
will conveniently forget.

> + *
> + * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
> + * ioctls to register/de-register userspace memory for IOTLB.
> + */
> +struct vduse_iotlb_umem {
> +	__u64 uaddr;
> +	__u64 iova;
> +	__u64 size;
> +	__u64 reserved[3];
> +};
> +
> +/*
> + * Register userspace memory for IOTLB. Now we only support registering
> + * userspace memory as bounce buffer.
> + */
> +#define VDUSE_IOTLB_REG_UMEM	_IOW(VDUSE_BASE, 0x19, struct vduse_iotlb_umem)
> +
> +/* De-register the userspace memory. Caller should set iova and size field. */
> +#define VDUSE_IOTLB_DEREG_UMEM	_IOW(VDUSE_BASE, 0x1a, struct vduse_iotlb_umem)
> +
>  /* The control messages definition for read(2)/write(2) on /dev/vduse/$NAME */
>  
>  /**
> -- 
> 2.20.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] vduse: Update api version to 1
  2022-06-29  8:33     ` Michael S. Tsirkin
  (?)
@ 2022-06-29  9:02     ` Yongji Xie
  2022-06-29  9:22         ` Michael S. Tsirkin
  -1 siblings, 1 reply; 24+ messages in thread
From: Yongji Xie @ 2022-06-29  9:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Liu Xiaodong, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 4:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jun 29, 2022 at 04:25:41PM +0800, Xie Yongji wrote:
> > Let's update api version to 1 since we introduced
> > some new ioctls to support registering userspace
> > memory for IOTLB.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>
>
> Adding new ioctls does not justify things like this.
>

What I want to do here is make userspace know whether this feature is
supported or not in the kernel. So do you think we need to add
something like CHECK_EXTENSION ioctl here?

> Besides, adding UAPI then changing it is not nice
> since it makes git bisect behave incorrectly.
>
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 12 ++++++++++++
> >  include/uapi/linux/vduse.h         |  8 +++++++-
> >  2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index 7b2ea7612da9..2795785ca6a2 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -1206,6 +1206,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >       case VDUSE_IOTLB_GET_INFO: {
> >               struct vduse_iotlb_info iotlb;
> >
> > +             ret = -EPERM;
>
>
> Almost for sure a wrong error code.
>
> > +             if (dev->api_version < 1)
> > +                     break;
> > +
> >               iotlb.bounce_iova = 0;
> >               iotlb.bounce_size = dev->domain->bounce_size;
> >
>
>
> Wait a second. so you are intentionally breaking any userspace
> that called VDUSE_SET_API_VERSION with version 0?
>
> Please don't.
>

Yes, I'd like to let userspace know we don't support this feature.

Thanks.
Yongji

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] vduse: Update api version to 1
  2022-06-29  9:02     ` Yongji Xie
@ 2022-06-29  9:22         ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  9:22 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Liu Xiaodong, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 05:02:40PM +0800, Yongji Xie wrote:
> On Wed, Jun 29, 2022 at 4:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jun 29, 2022 at 04:25:41PM +0800, Xie Yongji wrote:
> > > Let's update api version to 1 since we introduced
> > > some new ioctls to support registering userspace
> > > memory for IOTLB.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >
> >
> > Adding new ioctls does not justify things like this.
> >
> 
> What I want to do here is make userspace know whether this feature is
> supported or not in the kernel. So do you think we need to add
> something like CHECK_EXTENSION ioctl here?

Why bother? unsupported ioctls just return an error code.

> > Besides, adding UAPI then changing it is not nice
> > since it makes git bisect behave incorrectly.
> >
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 12 ++++++++++++
> > >  include/uapi/linux/vduse.h         |  8 +++++++-
> > >  2 files changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index 7b2ea7612da9..2795785ca6a2 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -1206,6 +1206,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >       case VDUSE_IOTLB_GET_INFO: {
> > >               struct vduse_iotlb_info iotlb;
> > >
> > > +             ret = -EPERM;
> >
> >
> > Almost for sure a wrong error code.
> >
> > > +             if (dev->api_version < 1)
> > > +                     break;
> > > +
> > >               iotlb.bounce_iova = 0;
> > >               iotlb.bounce_size = dev->domain->bounce_size;
> > >
> >
> >
> > Wait a second. so you are intentionally breaking any userspace
> > that called VDUSE_SET_API_VERSION with version 0?
> >
> > Please don't.
> >
> 
> Yes, I'd like to let userspace know we don't support this feature.
> 
> Thanks.
> Yongji


-- 
MST


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] vduse: Update api version to 1
@ 2022-06-29  9:22         ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  9:22 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Liu Xiaodong, linux-kernel, virtualization, Maxime Coquelin,
	Stefan Hajnoczi

On Wed, Jun 29, 2022 at 05:02:40PM +0800, Yongji Xie wrote:
> On Wed, Jun 29, 2022 at 4:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jun 29, 2022 at 04:25:41PM +0800, Xie Yongji wrote:
> > > Let's update api version to 1 since we introduced
> > > some new ioctls to support registering userspace
> > > memory for IOTLB.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >
> >
> > Adding new ioctls does not justify things like this.
> >
> 
> What I want to do here is make userspace know whether this feature is
> supported or not in the kernel. So do you think we need to add
> something like CHECK_EXTENSION ioctl here?

Why bother? unsupported ioctls just return an error code.

> > Besides, adding UAPI then changing it is not nice
> > since it makes git bisect behave incorrectly.
> >
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 12 ++++++++++++
> > >  include/uapi/linux/vduse.h         |  8 +++++++-
> > >  2 files changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index 7b2ea7612da9..2795785ca6a2 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -1206,6 +1206,10 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >       case VDUSE_IOTLB_GET_INFO: {
> > >               struct vduse_iotlb_info iotlb;
> > >
> > > +             ret = -EPERM;
> >
> >
> > Almost for sure a wrong error code.
> >
> > > +             if (dev->api_version < 1)
> > > +                     break;
> > > +
> > >               iotlb.bounce_iova = 0;
> > >               iotlb.bounce_size = dev->domain->bounce_size;
> > >
> >
> >
> > Wait a second. so you are intentionally breaking any userspace
> > that called VDUSE_SET_API_VERSION with version 0?
> >
> > Please don't.
> >
> 
> Yes, I'd like to let userspace know we don't support this feature.
> 
> Thanks.
> Yongji


-- 
MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
  2022-06-29  8:42     ` Michael S. Tsirkin
  (?)
@ 2022-06-29  9:26     ` Yongji Xie
  2022-06-29  9:54         ` Michael S. Tsirkin
  -1 siblings, 1 reply; 24+ messages in thread
From: Yongji Xie @ 2022-06-29  9:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Liu Xiaodong, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 4:43 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jun 29, 2022 at 04:25:40PM +0800, Xie Yongji wrote:
> > Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
> > VDUSE_IOTLB_DEREG_UMEM to support registering
> > and de-registering userspace memory for IOTLB
> > in virtio-vdpa case.
> >
> > Now it only supports registering userspace memory
> > for IOTLB as bounce buffer.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 138 +++++++++++++++++++++++++++++
> >  include/uapi/linux/vduse.h         |  28 ++++++
> >  2 files changed, 166 insertions(+)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index c47a5d9765cf..7b2ea7612da9 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -21,6 +21,7 @@
> >  #include <linux/uio.h>
> >  #include <linux/vdpa.h>
> >  #include <linux/nospec.h>
> > +#include <linux/sched/mm.h>
> >  #include <uapi/linux/vduse.h>
> >  #include <uapi/linux/vdpa.h>
> >  #include <uapi/linux/virtio_config.h>
> > @@ -64,6 +65,13 @@ struct vduse_vdpa {
> >       struct vduse_dev *dev;
> >  };
> >
> > +struct vduse_iotlb_mem {
> > +     unsigned long iova;
> > +     unsigned long npages;
> > +     struct page **pages;
> > +     struct mm_struct *mm;
> > +};
> > +
> >  struct vduse_dev {
> >       struct vduse_vdpa *vdev;
> >       struct device *dev;
> > @@ -95,6 +103,8 @@ struct vduse_dev {
> >       u8 status;
> >       u32 vq_num;
> >       u32 vq_align;
> > +     struct vduse_iotlb_mem *iotlb_mem;
> > +     struct mutex mem_lock;
> >  };
> >
> >  struct vduse_dev_msg {
> > @@ -917,6 +927,100 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> >       return ret;
> >  }
> >
> > +static int vduse_dev_dereg_iotlb_mem(struct vduse_dev *dev,
> > +                                  u64 iova, u64 size)
> > +{
> > +     int ret;
> > +
> > +     mutex_lock(&dev->mem_lock);
> > +     ret = -ENOENT;
> > +     if (!dev->iotlb_mem)
> > +             goto unlock;
> > +
> > +     ret = -EINVAL;
> > +     if (dev->iotlb_mem->iova != iova || size != dev->domain->bounce_size)
> > +             goto unlock;
> > +
> > +     vduse_domain_remove_user_bounce_pages(dev->domain);
> > +     unpin_user_pages(dev->iotlb_mem->pages, dev->iotlb_mem->npages);
>
> I notice you don't mark the pages dirty. This is going to be a problem.
>

Thanks for pointing out this, I will use unpin_user_pages_dirty_lock() instead.

> > +     atomic64_sub(dev->iotlb_mem->npages, &dev->iotlb_mem->mm->pinned_vm);
> > +     mmdrop(dev->iotlb_mem->mm);
> > +     vfree(dev->iotlb_mem->pages);
> > +     kfree(dev->iotlb_mem);
> > +     dev->iotlb_mem = NULL;
> > +     ret = 0;
> > +unlock:
> > +     mutex_unlock(&dev->mem_lock);
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_reg_iotlb_mem(struct vduse_dev *dev,
> > +                                u64 iova, u64 uaddr, u64 size)
> > +{
> > +     struct page **page_list = NULL;
> > +     struct vduse_iotlb_mem *mem = NULL;
> > +     long pinned = 0;
> > +     unsigned long npages, lock_limit;
> > +     int ret;
> > +
> > +     if (size != dev->domain->bounce_size ||
> > +         iova != 0 || uaddr & ~PAGE_MASK)
> > +             return -EINVAL;
> > +
> > +     mutex_lock(&dev->mem_lock);
> > +     ret = -EEXIST;
> > +     if (dev->iotlb_mem)
> > +             goto unlock;
> > +
> > +     ret = -ENOMEM;
> > +     npages = size >> PAGE_SHIFT;
> > +     page_list = vmalloc(array_size(npages,
> > +                         sizeof(struct page *)));
>
> Is this basically trying to do a vmalloc with userspace-controlled size?
> That's an easy DOS vector.
>

We already checked the size before. The size must equal to (64MB >>
PAGE_SHIFT) now.

> > +     mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > +     if (!page_list || !mem)
> > +             goto unlock;
> > +
> > +     mmap_read_lock(current->mm);
> > +
> > +     lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > +     if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
> > +             goto out;
> > +
> > +     pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
> > +                             page_list, NULL);
> > +     if (pinned != npages) {
> > +             ret = pinned < 0 ? pinned : -ENOMEM;
> > +             goto out;
> > +     }
>
>
> This is a popular approach but it's problematic if multiple
> devices try to pin the same page.

Do you mean the data would be corrupted if multiple devices use the
same page as bounce buffer? This is indeed a problem.

> Can this happen here?
>

I think we can't prevent this case now. I will do it in v2.

> > +
> > +     ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > +                                              page_list, pinned);
> > +     if (ret)
> > +             goto out;
> > +
> > +     atomic64_add(npages, &current->mm->pinned_vm);
> > +
> > +     mem->pages = page_list;
> > +     mem->npages = pinned;
> > +     mem->iova = iova;
> > +     mem->mm = current->mm;
> > +     mmgrab(current->mm);
> > +
> > +     dev->iotlb_mem = mem;
> > +out:
> > +     if (ret && pinned > 0)
> > +             unpin_user_pages(page_list, pinned);
> > +
> > +     mmap_read_unlock(current->mm);
> > +unlock:
> > +     if (ret) {
> > +             vfree(page_list);
> > +             kfree(mem);
> > +     }
> > +     mutex_unlock(&dev->mem_lock);
> > +     return ret;
> > +}
> > +
> >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >                           unsigned long arg)
> >  {
> > @@ -943,6 +1047,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >               if (entry.start > entry.last)
> >                       break;
> >
> > +             if (domain->bounce_map && dev->iotlb_mem) {
> > +                     ret = -EEXIST;
> > +                     if (entry.start >= 0 &&
> > +                         entry.last < domain->bounce_size)
> > +                             break;
> > +
> > +                     if (entry.start < domain->bounce_size)
> > +                             entry.start = domain->bounce_size;
> > +             }
> > +
> >               spin_lock(&domain->iotlb_lock);
> >               map = vhost_iotlb_itree_first(domain->iotlb,
> >                                             entry.start, entry.last);
> > @@ -1102,6 +1216,28 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >               ret = 0;
> >               break;
> >       }
> > +     case VDUSE_IOTLB_REG_UMEM: {
> > +             struct vduse_iotlb_umem umem;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&umem, argp, sizeof(umem)))
> > +                     break;
> > +
> > +             ret = vduse_dev_reg_iotlb_mem(dev, umem.iova,
> > +                                           umem.uaddr, umem.size);
> > +             break;
> > +     }
> > +     case VDUSE_IOTLB_DEREG_UMEM: {
> > +             struct vduse_iotlb_umem umem;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&umem, argp, sizeof(umem)))
> > +                     break;
> > +
> > +             ret = vduse_dev_dereg_iotlb_mem(dev, umem.iova,
> > +                                             umem.size);
> > +             break;
> > +     }
> >       default:
> >               ret = -ENOIOCTLCMD;
> >               break;
> > @@ -1114,6 +1250,7 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> >  {
> >       struct vduse_dev *dev = file->private_data;
> >
> > +     vduse_dev_dereg_iotlb_mem(dev, 0, dev->domain->bounce_size);
> >       spin_lock(&dev->msg_lock);
> >       /* Make sure the inflight messages can processed after reconncection */
> >       list_splice_init(&dev->recv_list, &dev->send_list);
> > @@ -1176,6 +1313,7 @@ static struct vduse_dev *vduse_dev_create(void)
> >               return NULL;
> >
> >       mutex_init(&dev->lock);
> > +     mutex_init(&dev->mem_lock);
> >       spin_lock_init(&dev->msg_lock);
> >       INIT_LIST_HEAD(&dev->send_list);
> >       INIT_LIST_HEAD(&dev->recv_list);
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > index c201b7a77c2c..1b17391e228f 100644
> > --- a/include/uapi/linux/vduse.h
> > +++ b/include/uapi/linux/vduse.h
> > @@ -227,6 +227,34 @@ struct vduse_iotlb_info {
> >  /* Get IOTLB information, e.g. bounce buffer size */
> >  #define VDUSE_IOTLB_GET_INFO    _IOR(VDUSE_BASE, 0x18, struct vduse_iotlb_info)
> >
> > +/**
> > + * struct vduse_iotlb_umem - userspace memory configuration
> > + * @uaddr: start address of userspace memory, it must be aligned to page size
> > + * @iova: IOVA of userspace memory, it must be equal to bounce iova returned
> > + *        by VDUSE_IOTLB_GET_INFO now
> > + * @size: size of userspace memory, it must be equal to bounce size returned
> > + *        by VDUSE_IOTLB_GET_INFO now
> > + * @reserved: for future use, needs to be initialized to zero
>
> You should check that it's 0 in that case, otherwise userspace
> will conveniently forget.
>

OK, I will fix it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] vduse: Update api version to 1
  2022-06-29  9:22         ` Michael S. Tsirkin
  (?)
@ 2022-06-29  9:28         ` Yongji Xie
  -1 siblings, 0 replies; 24+ messages in thread
From: Yongji Xie @ 2022-06-29  9:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Liu Xiaodong, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 5:22 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jun 29, 2022 at 05:02:40PM +0800, Yongji Xie wrote:
> > On Wed, Jun 29, 2022 at 4:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Jun 29, 2022 at 04:25:41PM +0800, Xie Yongji wrote:
> > > > Let's update api version to 1 since we introduced
> > > > some new ioctls to support registering userspace
> > > > memory for IOTLB.
> > > >
> > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > >
> > >
> > > Adding new ioctls does not justify things like this.
> > >
> >
> > What I want to do here is make userspace know whether this feature is
> > supported or not in the kernel. So do you think we need to add
> > something like CHECK_EXTENSION ioctl here?
>
> Why bother? unsupported ioctls just return an error code.
>

Oh, yes. I forgot the error code of the ioctls.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
  2022-06-29  9:26     ` Yongji Xie
@ 2022-06-29  9:54         ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  9:54 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Liu Xiaodong, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 05:26:04PM +0800, Yongji Xie wrote:
> On Wed, Jun 29, 2022 at 4:43 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jun 29, 2022 at 04:25:40PM +0800, Xie Yongji wrote:
> > > Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
> > > VDUSE_IOTLB_DEREG_UMEM to support registering
> > > and de-registering userspace memory for IOTLB
> > > in virtio-vdpa case.
> > >
> > > Now it only supports registering userspace memory
> > > for IOTLB as bounce buffer.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 138 +++++++++++++++++++++++++++++
> > >  include/uapi/linux/vduse.h         |  28 ++++++
> > >  2 files changed, 166 insertions(+)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index c47a5d9765cf..7b2ea7612da9 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -21,6 +21,7 @@
> > >  #include <linux/uio.h>
> > >  #include <linux/vdpa.h>
> > >  #include <linux/nospec.h>
> > > +#include <linux/sched/mm.h>
> > >  #include <uapi/linux/vduse.h>
> > >  #include <uapi/linux/vdpa.h>
> > >  #include <uapi/linux/virtio_config.h>
> > > @@ -64,6 +65,13 @@ struct vduse_vdpa {
> > >       struct vduse_dev *dev;
> > >  };
> > >
> > > +struct vduse_iotlb_mem {
> > > +     unsigned long iova;
> > > +     unsigned long npages;
> > > +     struct page **pages;
> > > +     struct mm_struct *mm;
> > > +};
> > > +
> > >  struct vduse_dev {
> > >       struct vduse_vdpa *vdev;
> > >       struct device *dev;
> > > @@ -95,6 +103,8 @@ struct vduse_dev {
> > >       u8 status;
> > >       u32 vq_num;
> > >       u32 vq_align;
> > > +     struct vduse_iotlb_mem *iotlb_mem;
> > > +     struct mutex mem_lock;
> > >  };
> > >
> > >  struct vduse_dev_msg {
> > > @@ -917,6 +927,100 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > >       return ret;
> > >  }
> > >
> > > +static int vduse_dev_dereg_iotlb_mem(struct vduse_dev *dev,
> > > +                                  u64 iova, u64 size)
> > > +{
> > > +     int ret;
> > > +
> > > +     mutex_lock(&dev->mem_lock);
> > > +     ret = -ENOENT;
> > > +     if (!dev->iotlb_mem)
> > > +             goto unlock;
> > > +
> > > +     ret = -EINVAL;
> > > +     if (dev->iotlb_mem->iova != iova || size != dev->domain->bounce_size)
> > > +             goto unlock;
> > > +
> > > +     vduse_domain_remove_user_bounce_pages(dev->domain);
> > > +     unpin_user_pages(dev->iotlb_mem->pages, dev->iotlb_mem->npages);
> >
> > I notice you don't mark the pages dirty. This is going to be a problem.
> >
> 
> Thanks for pointing out this, I will use unpin_user_pages_dirty_lock() instead.
> 
> > > +     atomic64_sub(dev->iotlb_mem->npages, &dev->iotlb_mem->mm->pinned_vm);
> > > +     mmdrop(dev->iotlb_mem->mm);
> > > +     vfree(dev->iotlb_mem->pages);
> > > +     kfree(dev->iotlb_mem);
> > > +     dev->iotlb_mem = NULL;
> > > +     ret = 0;
> > > +unlock:
> > > +     mutex_unlock(&dev->mem_lock);
> > > +     return ret;
> > > +}
> > > +
> > > +static int vduse_dev_reg_iotlb_mem(struct vduse_dev *dev,
> > > +                                u64 iova, u64 uaddr, u64 size)
> > > +{
> > > +     struct page **page_list = NULL;
> > > +     struct vduse_iotlb_mem *mem = NULL;
> > > +     long pinned = 0;
> > > +     unsigned long npages, lock_limit;
> > > +     int ret;
> > > +
> > > +     if (size != dev->domain->bounce_size ||
> > > +         iova != 0 || uaddr & ~PAGE_MASK)
> > > +             return -EINVAL;
> > > +
> > > +     mutex_lock(&dev->mem_lock);
> > > +     ret = -EEXIST;
> > > +     if (dev->iotlb_mem)
> > > +             goto unlock;
> > > +
> > > +     ret = -ENOMEM;
> > > +     npages = size >> PAGE_SHIFT;
> > > +     page_list = vmalloc(array_size(npages,
> > > +                         sizeof(struct page *)));
> >
> > Is this basically trying to do a vmalloc with userspace-controlled size?
> > That's an easy DOS vector.
> >
> 
> We already checked the size before. The size must equal to (64MB >>
> PAGE_SHIFT) now.

That's not a small amount. Can this be accounted e.g. through cgroups at least?

> > > +     mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > > +     if (!page_list || !mem)
> > > +             goto unlock;
> > > +
> > > +     mmap_read_lock(current->mm);
> > > +
> > > +     lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > > +     if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
> > > +             goto out;
> > > +
> > > +     pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
> > > +                             page_list, NULL);
> > > +     if (pinned != npages) {
> > > +             ret = pinned < 0 ? pinned : -ENOMEM;
> > > +             goto out;
> > > +     }
> >
> >
> > This is a popular approach but it's problematic if multiple
> > devices try to pin the same page.
> 
> Do you mean the data would be corrupted if multiple devices use the
> same page as bounce buffer? This is indeed a problem.

No i mean you decrement the lock twice. Question is can two bounce
buffers share a page?

> > Can this happen here?
> >
> 
> I think we can't prevent this case now. I will do it in v2.
> 
> > > +
> > > +     ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > +                                              page_list, pinned);
> > > +     if (ret)
> > > +             goto out;
> > > +
> > > +     atomic64_add(npages, &current->mm->pinned_vm);
> > > +
> > > +     mem->pages = page_list;
> > > +     mem->npages = pinned;
> > > +     mem->iova = iova;
> > > +     mem->mm = current->mm;
> > > +     mmgrab(current->mm);
> > > +
> > > +     dev->iotlb_mem = mem;
> > > +out:
> > > +     if (ret && pinned > 0)
> > > +             unpin_user_pages(page_list, pinned);
> > > +
> > > +     mmap_read_unlock(current->mm);
> > > +unlock:
> > > +     if (ret) {
> > > +             vfree(page_list);
> > > +             kfree(mem);
> > > +     }
> > > +     mutex_unlock(&dev->mem_lock);
> > > +     return ret;
> > > +}
> > > +
> > >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >                           unsigned long arg)
> > >  {
> > > @@ -943,6 +1047,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >               if (entry.start > entry.last)
> > >                       break;
> > >
> > > +             if (domain->bounce_map && dev->iotlb_mem) {
> > > +                     ret = -EEXIST;
> > > +                     if (entry.start >= 0 &&
> > > +                         entry.last < domain->bounce_size)
> > > +                             break;
> > > +
> > > +                     if (entry.start < domain->bounce_size)
> > > +                             entry.start = domain->bounce_size;
> > > +             }
> > > +
> > >               spin_lock(&domain->iotlb_lock);
> > >               map = vhost_iotlb_itree_first(domain->iotlb,
> > >                                             entry.start, entry.last);
> > > @@ -1102,6 +1216,28 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >               ret = 0;
> > >               break;
> > >       }
> > > +     case VDUSE_IOTLB_REG_UMEM: {
> > > +             struct vduse_iotlb_umem umem;
> > > +
> > > +             ret = -EFAULT;
> > > +             if (copy_from_user(&umem, argp, sizeof(umem)))
> > > +                     break;
> > > +
> > > +             ret = vduse_dev_reg_iotlb_mem(dev, umem.iova,
> > > +                                           umem.uaddr, umem.size);
> > > +             break;
> > > +     }
> > > +     case VDUSE_IOTLB_DEREG_UMEM: {
> > > +             struct vduse_iotlb_umem umem;
> > > +
> > > +             ret = -EFAULT;
> > > +             if (copy_from_user(&umem, argp, sizeof(umem)))
> > > +                     break;
> > > +
> > > +             ret = vduse_dev_dereg_iotlb_mem(dev, umem.iova,
> > > +                                             umem.size);
> > > +             break;
> > > +     }
> > >       default:
> > >               ret = -ENOIOCTLCMD;
> > >               break;
> > > @@ -1114,6 +1250,7 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > >  {
> > >       struct vduse_dev *dev = file->private_data;
> > >
> > > +     vduse_dev_dereg_iotlb_mem(dev, 0, dev->domain->bounce_size);
> > >       spin_lock(&dev->msg_lock);
> > >       /* Make sure the inflight messages can processed after reconncection */
> > >       list_splice_init(&dev->recv_list, &dev->send_list);
> > > @@ -1176,6 +1313,7 @@ static struct vduse_dev *vduse_dev_create(void)
> > >               return NULL;
> > >
> > >       mutex_init(&dev->lock);
> > > +     mutex_init(&dev->mem_lock);
> > >       spin_lock_init(&dev->msg_lock);
> > >       INIT_LIST_HEAD(&dev->send_list);
> > >       INIT_LIST_HEAD(&dev->recv_list);
> > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > > index c201b7a77c2c..1b17391e228f 100644
> > > --- a/include/uapi/linux/vduse.h
> > > +++ b/include/uapi/linux/vduse.h
> > > @@ -227,6 +227,34 @@ struct vduse_iotlb_info {
> > >  /* Get IOTLB information, e.g. bounce buffer size */
> > >  #define VDUSE_IOTLB_GET_INFO    _IOR(VDUSE_BASE, 0x18, struct vduse_iotlb_info)
> > >
> > > +/**
> > > + * struct vduse_iotlb_umem - userspace memory configuration
> > > + * @uaddr: start address of userspace memory, it must be aligned to page size
> > > + * @iova: IOVA of userspace memory, it must be equal to bounce iova returned
> > > + *        by VDUSE_IOTLB_GET_INFO now
> > > + * @size: size of userspace memory, it must be equal to bounce size returned
> > > + *        by VDUSE_IOTLB_GET_INFO now
> > > + * @reserved: for future use, needs to be initialized to zero
> >
> > You should check that it's 0 in that case, otherwise userspace
> > will conveniently forget.
> >
> 
> OK, I will fix it.
> 
> Thanks,
> Yongji


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
@ 2022-06-29  9:54         ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29  9:54 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Liu Xiaodong, linux-kernel, virtualization, Maxime Coquelin,
	Stefan Hajnoczi

On Wed, Jun 29, 2022 at 05:26:04PM +0800, Yongji Xie wrote:
> On Wed, Jun 29, 2022 at 4:43 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jun 29, 2022 at 04:25:40PM +0800, Xie Yongji wrote:
> > > Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
> > > VDUSE_IOTLB_DEREG_UMEM to support registering
> > > and de-registering userspace memory for IOTLB
> > > in virtio-vdpa case.
> > >
> > > Now it only supports registering userspace memory
> > > for IOTLB as bounce buffer.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 138 +++++++++++++++++++++++++++++
> > >  include/uapi/linux/vduse.h         |  28 ++++++
> > >  2 files changed, 166 insertions(+)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index c47a5d9765cf..7b2ea7612da9 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -21,6 +21,7 @@
> > >  #include <linux/uio.h>
> > >  #include <linux/vdpa.h>
> > >  #include <linux/nospec.h>
> > > +#include <linux/sched/mm.h>
> > >  #include <uapi/linux/vduse.h>
> > >  #include <uapi/linux/vdpa.h>
> > >  #include <uapi/linux/virtio_config.h>
> > > @@ -64,6 +65,13 @@ struct vduse_vdpa {
> > >       struct vduse_dev *dev;
> > >  };
> > >
> > > +struct vduse_iotlb_mem {
> > > +     unsigned long iova;
> > > +     unsigned long npages;
> > > +     struct page **pages;
> > > +     struct mm_struct *mm;
> > > +};
> > > +
> > >  struct vduse_dev {
> > >       struct vduse_vdpa *vdev;
> > >       struct device *dev;
> > > @@ -95,6 +103,8 @@ struct vduse_dev {
> > >       u8 status;
> > >       u32 vq_num;
> > >       u32 vq_align;
> > > +     struct vduse_iotlb_mem *iotlb_mem;
> > > +     struct mutex mem_lock;
> > >  };
> > >
> > >  struct vduse_dev_msg {
> > > @@ -917,6 +927,100 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > >       return ret;
> > >  }
> > >
> > > +static int vduse_dev_dereg_iotlb_mem(struct vduse_dev *dev,
> > > +                                  u64 iova, u64 size)
> > > +{
> > > +     int ret;
> > > +
> > > +     mutex_lock(&dev->mem_lock);
> > > +     ret = -ENOENT;
> > > +     if (!dev->iotlb_mem)
> > > +             goto unlock;
> > > +
> > > +     ret = -EINVAL;
> > > +     if (dev->iotlb_mem->iova != iova || size != dev->domain->bounce_size)
> > > +             goto unlock;
> > > +
> > > +     vduse_domain_remove_user_bounce_pages(dev->domain);
> > > +     unpin_user_pages(dev->iotlb_mem->pages, dev->iotlb_mem->npages);
> >
> > I notice you don't mark the pages dirty. This is going to be a problem.
> >
> 
> Thanks for pointing out this, I will use unpin_user_pages_dirty_lock() instead.
> 
> > > +     atomic64_sub(dev->iotlb_mem->npages, &dev->iotlb_mem->mm->pinned_vm);
> > > +     mmdrop(dev->iotlb_mem->mm);
> > > +     vfree(dev->iotlb_mem->pages);
> > > +     kfree(dev->iotlb_mem);
> > > +     dev->iotlb_mem = NULL;
> > > +     ret = 0;
> > > +unlock:
> > > +     mutex_unlock(&dev->mem_lock);
> > > +     return ret;
> > > +}
> > > +
> > > +static int vduse_dev_reg_iotlb_mem(struct vduse_dev *dev,
> > > +                                u64 iova, u64 uaddr, u64 size)
> > > +{
> > > +     struct page **page_list = NULL;
> > > +     struct vduse_iotlb_mem *mem = NULL;
> > > +     long pinned = 0;
> > > +     unsigned long npages, lock_limit;
> > > +     int ret;
> > > +
> > > +     if (size != dev->domain->bounce_size ||
> > > +         iova != 0 || uaddr & ~PAGE_MASK)
> > > +             return -EINVAL;
> > > +
> > > +     mutex_lock(&dev->mem_lock);
> > > +     ret = -EEXIST;
> > > +     if (dev->iotlb_mem)
> > > +             goto unlock;
> > > +
> > > +     ret = -ENOMEM;
> > > +     npages = size >> PAGE_SHIFT;
> > > +     page_list = vmalloc(array_size(npages,
> > > +                         sizeof(struct page *)));
> >
> > Is this basically trying to do a vmalloc with userspace-controlled size?
> > That's an easy DOS vector.
> >
> 
> We already checked the size before. The size must equal to (64MB >>
> PAGE_SHIFT) now.

That's not a small amount. Can this be accounted e.g. through cgroups at least?

> > > +     mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > > +     if (!page_list || !mem)
> > > +             goto unlock;
> > > +
> > > +     mmap_read_lock(current->mm);
> > > +
> > > +     lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > > +     if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
> > > +             goto out;
> > > +
> > > +     pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
> > > +                             page_list, NULL);
> > > +     if (pinned != npages) {
> > > +             ret = pinned < 0 ? pinned : -ENOMEM;
> > > +             goto out;
> > > +     }
> >
> >
> > This is a popular approach but it's problematic if multiple
> > devices try to pin the same page.
> 
> Do you mean the data would be corrupted if multiple devices use the
> same page as bounce buffer? This is indeed a problem.

No i mean you decrement the lock twice. Question is can two bounce
buffers share a page?

> > Can this happen here?
> >
> 
> I think we can't prevent this case now. I will do it in v2.
> 
> > > +
> > > +     ret = vduse_domain_add_user_bounce_pages(dev->domain,
> > > +                                              page_list, pinned);
> > > +     if (ret)
> > > +             goto out;
> > > +
> > > +     atomic64_add(npages, &current->mm->pinned_vm);
> > > +
> > > +     mem->pages = page_list;
> > > +     mem->npages = pinned;
> > > +     mem->iova = iova;
> > > +     mem->mm = current->mm;
> > > +     mmgrab(current->mm);
> > > +
> > > +     dev->iotlb_mem = mem;
> > > +out:
> > > +     if (ret && pinned > 0)
> > > +             unpin_user_pages(page_list, pinned);
> > > +
> > > +     mmap_read_unlock(current->mm);
> > > +unlock:
> > > +     if (ret) {
> > > +             vfree(page_list);
> > > +             kfree(mem);
> > > +     }
> > > +     mutex_unlock(&dev->mem_lock);
> > > +     return ret;
> > > +}
> > > +
> > >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >                           unsigned long arg)
> > >  {
> > > @@ -943,6 +1047,16 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >               if (entry.start > entry.last)
> > >                       break;
> > >
> > > +             if (domain->bounce_map && dev->iotlb_mem) {
> > > +                     ret = -EEXIST;
> > > +                     if (entry.start >= 0 &&
> > > +                         entry.last < domain->bounce_size)
> > > +                             break;
> > > +
> > > +                     if (entry.start < domain->bounce_size)
> > > +                             entry.start = domain->bounce_size;
> > > +             }
> > > +
> > >               spin_lock(&domain->iotlb_lock);
> > >               map = vhost_iotlb_itree_first(domain->iotlb,
> > >                                             entry.start, entry.last);
> > > @@ -1102,6 +1216,28 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >               ret = 0;
> > >               break;
> > >       }
> > > +     case VDUSE_IOTLB_REG_UMEM: {
> > > +             struct vduse_iotlb_umem umem;
> > > +
> > > +             ret = -EFAULT;
> > > +             if (copy_from_user(&umem, argp, sizeof(umem)))
> > > +                     break;
> > > +
> > > +             ret = vduse_dev_reg_iotlb_mem(dev, umem.iova,
> > > +                                           umem.uaddr, umem.size);
> > > +             break;
> > > +     }
> > > +     case VDUSE_IOTLB_DEREG_UMEM: {
> > > +             struct vduse_iotlb_umem umem;
> > > +
> > > +             ret = -EFAULT;
> > > +             if (copy_from_user(&umem, argp, sizeof(umem)))
> > > +                     break;
> > > +
> > > +             ret = vduse_dev_dereg_iotlb_mem(dev, umem.iova,
> > > +                                             umem.size);
> > > +             break;
> > > +     }
> > >       default:
> > >               ret = -ENOIOCTLCMD;
> > >               break;
> > > @@ -1114,6 +1250,7 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
> > >  {
> > >       struct vduse_dev *dev = file->private_data;
> > >
> > > +     vduse_dev_dereg_iotlb_mem(dev, 0, dev->domain->bounce_size);
> > >       spin_lock(&dev->msg_lock);
> > >       /* Make sure the inflight messages can processed after reconncection */
> > >       list_splice_init(&dev->recv_list, &dev->send_list);
> > > @@ -1176,6 +1313,7 @@ static struct vduse_dev *vduse_dev_create(void)
> > >               return NULL;
> > >
> > >       mutex_init(&dev->lock);
> > > +     mutex_init(&dev->mem_lock);
> > >       spin_lock_init(&dev->msg_lock);
> > >       INIT_LIST_HEAD(&dev->send_list);
> > >       INIT_LIST_HEAD(&dev->recv_list);
> > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > > index c201b7a77c2c..1b17391e228f 100644
> > > --- a/include/uapi/linux/vduse.h
> > > +++ b/include/uapi/linux/vduse.h
> > > @@ -227,6 +227,34 @@ struct vduse_iotlb_info {
> > >  /* Get IOTLB information, e.g. bounce buffer size */
> > >  #define VDUSE_IOTLB_GET_INFO    _IOR(VDUSE_BASE, 0x18, struct vduse_iotlb_info)
> > >
> > > +/**
> > > + * struct vduse_iotlb_umem - userspace memory configuration
> > > + * @uaddr: start address of userspace memory, it must be aligned to page size
> > > + * @iova: IOVA of userspace memory, it must be equal to bounce iova returned
> > > + *        by VDUSE_IOTLB_GET_INFO now
> > > + * @size: size of userspace memory, it must be equal to bounce size returned
> > > + *        by VDUSE_IOTLB_GET_INFO now
> > > + * @reserved: for future use, needs to be initialized to zero
> >
> > You should check that it's 0 in that case, otherwise userspace
> > will conveniently forget.
> >
> 
> OK, I will fix it.
> 
> Thanks,
> Yongji

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
  2022-06-29  9:54         ` Michael S. Tsirkin
  (?)
@ 2022-06-29 10:19         ` Yongji Xie
  2022-06-29 11:28             ` Michael S. Tsirkin
  -1 siblings, 1 reply; 24+ messages in thread
From: Yongji Xie @ 2022-06-29 10:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Liu Xiaodong, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel, songmuchun

On Wed, Jun 29, 2022 at 5:54 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jun 29, 2022 at 05:26:04PM +0800, Yongji Xie wrote:
> > On Wed, Jun 29, 2022 at 4:43 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Jun 29, 2022 at 04:25:40PM +0800, Xie Yongji wrote:
> > > > Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
> > > > VDUSE_IOTLB_DEREG_UMEM to support registering
> > > > and de-registering userspace memory for IOTLB
> > > > in virtio-vdpa case.
> > > >
> > > > Now it only supports registering userspace memory
> > > > for IOTLB as bounce buffer.
> > > >
> > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > ---
> > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 138 +++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vduse.h         |  28 ++++++
> > > >  2 files changed, 166 insertions(+)
> > > >
> > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > index c47a5d9765cf..7b2ea7612da9 100644
> > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > @@ -21,6 +21,7 @@
> > > >  #include <linux/uio.h>
> > > >  #include <linux/vdpa.h>
> > > >  #include <linux/nospec.h>
> > > > +#include <linux/sched/mm.h>
> > > >  #include <uapi/linux/vduse.h>
> > > >  #include <uapi/linux/vdpa.h>
> > > >  #include <uapi/linux/virtio_config.h>
> > > > @@ -64,6 +65,13 @@ struct vduse_vdpa {
> > > >       struct vduse_dev *dev;
> > > >  };
> > > >
> > > > +struct vduse_iotlb_mem {
> > > > +     unsigned long iova;
> > > > +     unsigned long npages;
> > > > +     struct page **pages;
> > > > +     struct mm_struct *mm;
> > > > +};
> > > > +
> > > >  struct vduse_dev {
> > > >       struct vduse_vdpa *vdev;
> > > >       struct device *dev;
> > > > @@ -95,6 +103,8 @@ struct vduse_dev {
> > > >       u8 status;
> > > >       u32 vq_num;
> > > >       u32 vq_align;
> > > > +     struct vduse_iotlb_mem *iotlb_mem;
> > > > +     struct mutex mem_lock;
> > > >  };
> > > >
> > > >  struct vduse_dev_msg {
> > > > @@ -917,6 +927,100 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> > > >       return ret;
> > > >  }
> > > >
> > > > +static int vduse_dev_dereg_iotlb_mem(struct vduse_dev *dev,
> > > > +                                  u64 iova, u64 size)
> > > > +{
> > > > +     int ret;
> > > > +
> > > > +     mutex_lock(&dev->mem_lock);
> > > > +     ret = -ENOENT;
> > > > +     if (!dev->iotlb_mem)
> > > > +             goto unlock;
> > > > +
> > > > +     ret = -EINVAL;
> > > > +     if (dev->iotlb_mem->iova != iova || size != dev->domain->bounce_size)
> > > > +             goto unlock;
> > > > +
> > > > +     vduse_domain_remove_user_bounce_pages(dev->domain);
> > > > +     unpin_user_pages(dev->iotlb_mem->pages, dev->iotlb_mem->npages);
> > >
> > > I notice you don't mark the pages dirty. This is going to be a problem.
> > >
> >
> > Thanks for pointing out this, I will use unpin_user_pages_dirty_lock() instead.
> >
> > > > +     atomic64_sub(dev->iotlb_mem->npages, &dev->iotlb_mem->mm->pinned_vm);
> > > > +     mmdrop(dev->iotlb_mem->mm);
> > > > +     vfree(dev->iotlb_mem->pages);
> > > > +     kfree(dev->iotlb_mem);
> > > > +     dev->iotlb_mem = NULL;
> > > > +     ret = 0;
> > > > +unlock:
> > > > +     mutex_unlock(&dev->mem_lock);
> > > > +     return ret;
> > > > +}
> > > > +
> > > > +static int vduse_dev_reg_iotlb_mem(struct vduse_dev *dev,
> > > > +                                u64 iova, u64 uaddr, u64 size)
> > > > +{
> > > > +     struct page **page_list = NULL;
> > > > +     struct vduse_iotlb_mem *mem = NULL;
> > > > +     long pinned = 0;
> > > > +     unsigned long npages, lock_limit;
> > > > +     int ret;
> > > > +
> > > > +     if (size != dev->domain->bounce_size ||
> > > > +         iova != 0 || uaddr & ~PAGE_MASK)
> > > > +             return -EINVAL;
> > > > +
> > > > +     mutex_lock(&dev->mem_lock);
> > > > +     ret = -EEXIST;
> > > > +     if (dev->iotlb_mem)
> > > > +             goto unlock;
> > > > +
> > > > +     ret = -ENOMEM;
> > > > +     npages = size >> PAGE_SHIFT;
> > > > +     page_list = vmalloc(array_size(npages,
> > > > +                         sizeof(struct page *)));
> > >
> > > Is this basically trying to do a vmalloc with userspace-controlled size?
> > > That's an easy DOS vector.
> > >
> >
> > We already checked the size before. The size must equal to (64MB >>
> > PAGE_SHIFT) now.
>
> That's not a small amount. Can this be accounted e.g. through cgroups at least?
>

Make sense, will use __vmalloc(__GFP_ACCOUNT) instead.

> > > > +     mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > > > +     if (!page_list || !mem)
> > > > +             goto unlock;
> > > > +
> > > > +     mmap_read_lock(current->mm);
> > > > +
> > > > +     lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > > > +     if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
> > > > +             goto out;
> > > > +
> > > > +     pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
> > > > +                             page_list, NULL);
> > > > +     if (pinned != npages) {
> > > > +             ret = pinned < 0 ? pinned : -ENOMEM;
> > > > +             goto out;
> > > > +     }
> > >
> > >
> > > This is a popular approach but it's problematic if multiple
> > > devices try to pin the same page.
> >
> > Do you mean the data would be corrupted if multiple devices use the
> > same page as bounce buffer? This is indeed a problem.
>
> No i mean you decrement the lock twice. Question is can two bounce
> buffers share a page?
>

I think we can't. I will find a way to prevent it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
  2022-06-29 10:19         ` Yongji Xie
@ 2022-06-29 11:28             ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29 11:28 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Liu Xiaodong, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel, songmuchun

On Wed, Jun 29, 2022 at 06:19:31PM +0800, Yongji Xie wrote:
> > No i mean you decrement the lock twice. Question is can two bounce
> > buffers share a page?
> >
> 
> I think we can't. I will find a way to prevent it.
> 
> Thanks,
> Yongji

I guess it doesn't matter much then.

-- 
MST


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] vduse: Support registering userspace memory for IOTLB
@ 2022-06-29 11:28             ` Michael S. Tsirkin
  0 siblings, 0 replies; 24+ messages in thread
From: Michael S. Tsirkin @ 2022-06-29 11:28 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Liu Xiaodong, linux-kernel, virtualization, Maxime Coquelin,
	Stefan Hajnoczi, songmuchun

On Wed, Jun 29, 2022 at 06:19:31PM +0800, Yongji Xie wrote:
> > No i mean you decrement the lock twice. Question is can two bounce
> > buffers share a page?
> >
> 
> I think we can't. I will find a way to prevent it.
> 
> Thanks,
> Yongji

I guess it doesn't matter much then.

-- 
MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer
  2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
@ 2022-07-04  9:26   ` Liu Xiaodong
  2022-06-29  8:25 ` [PATCH 2/6] vduse: Use memcpy_{to,from}_page() in do_bounce() Xie Yongji
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Liu Xiaodong @ 2022-07-04  9:26 UTC (permalink / raw)
  To: Xie Yongji
  Cc: mst, jasowang, xiaodong.liu, maxime.coquelin, stefanha,
	virtualization, linux-kernel

On Wed, Jun 29, 2022 at 04:25:35PM +0800, Xie Yongji wrote:
> Hi all,
> 
> This series introduces some new ioctls: VDUSE_IOTLB_GET_INFO,
> VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM to support
> registering and de-registering userspace memory for IOTLB
> as bounce buffer in virtio-vdpa case.
> 
> The VDUSE_IOTLB_GET_INFO ioctl can help user to query IOLTB
> information such as bounce buffer size. Then user can use
> those information on VDUSE_IOTLB_REG_UMEM and
> VDUSE_IOTLB_DEREG_UMEM ioctls to register and de-register
> userspace memory for IOTLB.
> 
> During registering and de-registering, the DMA data in use
> would be copied from kernel bounce pages to userspace bounce
> pages and back.
> 
> With this feature, some existing application such as SPDK
> and DPDK can leverage the datapath of VDUSE directly and
> efficiently as discussed before [1]. They can register some
> preallocated hugepages to VDUSE to avoid an extra memcpy
> from bounce-buffer to hugepages.

Hi, Yongji

Very glad to see this enhancement in VDUSE. Thank you.
It is really helpful and essential to SPDK.
With this new feature, we can get VDUSE transferred data
accessed directly by userspace physical backends, like RDMA
and PCIe devices.

In SPDK roadmap, it's one important work to export block
services to local host, especially for container scenario.
This patch could help SPDK do that with its userspace
backend stacks while keeping high efficiency and performance.
So the whole SPDK ecosystem can get benefited.

Based on this enhancement, as discussed, I drafted a VDUSE
prototype module in SPDK for initial evaluation:
[TEST]vduse: prototype for initial draft
https://review.spdk.io/gerrit/c/spdk/spdk/+/13534

Running SPDK on single CPU core, configured with 2 P3700 NVMe,
and exported block devices to local host kernel via different
protocols. The randwrite IOPS through each protocol are:
NBD 		  121K
NVMf-tcp loopback 274K
VDUSE 		  463K

SPDK with RDMA backends should have a similar ratio.
VDUSE has a great performance advantage for SPDK.
We have kept investigating on this usage for years.
Originally, some SPDK users used NBD. Then NVMf-tcp loopback
is SPDK community accommended way. In future, VDUSE could be
the preferred way.

> The kernel and userspace codes could be found in github:
> 
> https://github.com/bytedance/linux/tree/vduse-umem
> https://github.com/bytedance/qemu/tree/vduse-umem
> 
> To test it with qemu-storage-daemon:
> 
> $ qemu-storage-daemon \
>     --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
>     --monitor chardev=charmonitor \
>     --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> \
>     --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> 
> [1] https://lkml.org/lkml/2021/6/27/318
> 
> Please review, thanks!

Waiting for its review process.

Thanks
Xiaodong

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer
@ 2022-07-04  9:26   ` Liu Xiaodong
  0 siblings, 0 replies; 24+ messages in thread
From: Liu Xiaodong @ 2022-07-04  9:26 UTC (permalink / raw)
  To: Xie Yongji
  Cc: mst, xiaodong.liu, linux-kernel, virtualization, maxime.coquelin,
	stefanha

On Wed, Jun 29, 2022 at 04:25:35PM +0800, Xie Yongji wrote:
> Hi all,
> 
> This series introduces some new ioctls: VDUSE_IOTLB_GET_INFO,
> VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM to support
> registering and de-registering userspace memory for IOTLB
> as bounce buffer in virtio-vdpa case.
> 
> The VDUSE_IOTLB_GET_INFO ioctl can help user to query IOLTB
> information such as bounce buffer size. Then user can use
> those information on VDUSE_IOTLB_REG_UMEM and
> VDUSE_IOTLB_DEREG_UMEM ioctls to register and de-register
> userspace memory for IOTLB.
> 
> During registering and de-registering, the DMA data in use
> would be copied from kernel bounce pages to userspace bounce
> pages and back.
> 
> With this feature, some existing application such as SPDK
> and DPDK can leverage the datapath of VDUSE directly and
> efficiently as discussed before [1]. They can register some
> preallocated hugepages to VDUSE to avoid an extra memcpy
> from bounce-buffer to hugepages.

Hi, Yongji

Very glad to see this enhancement in VDUSE. Thank you.
It is really helpful and essential to SPDK.
With this new feature, we can get VDUSE transferred data
accessed directly by userspace physical backends, like RDMA
and PCIe devices.

In SPDK roadmap, it's one important work to export block
services to local host, especially for container scenario.
This patch could help SPDK do that with its userspace
backend stacks while keeping high efficiency and performance.
So the whole SPDK ecosystem can get benefited.

Based on this enhancement, as discussed, I drafted a VDUSE
prototype module in SPDK for initial evaluation:
[TEST]vduse: prototype for initial draft
https://review.spdk.io/gerrit/c/spdk/spdk/+/13534

Running SPDK on single CPU core, configured with 2 P3700 NVMe,
and exported block devices to local host kernel via different
protocols. The randwrite IOPS through each protocol are:
NBD 		  121K
NVMf-tcp loopback 274K
VDUSE 		  463K

SPDK with RDMA backends should have a similar ratio.
VDUSE has a great performance advantage for SPDK.
We have kept investigating on this usage for years.
Originally, some SPDK users used NBD. Then NVMf-tcp loopback
is SPDK community accommended way. In future, VDUSE could be
the preferred way.

> The kernel and userspace codes could be found in github:
> 
> https://github.com/bytedance/linux/tree/vduse-umem
> https://github.com/bytedance/qemu/tree/vduse-umem
> 
> To test it with qemu-storage-daemon:
> 
> $ qemu-storage-daemon \
>     --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
>     --monitor chardev=charmonitor \
>     --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> \
>     --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> 
> [1] https://lkml.org/lkml/2021/6/27/318
> 
> Please review, thanks!

Waiting for its review process.

Thanks
Xiaodong
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer
  2022-07-04  9:26   ` Liu Xiaodong
  (?)
@ 2022-07-04 10:02   ` Yongji Xie
  -1 siblings, 0 replies; 24+ messages in thread
From: Yongji Xie @ 2022-07-04 10:02 UTC (permalink / raw)
  To: Liu Xiaodong
  Cc: Michael S. Tsirkin, Jason Wang, Maxime Coquelin, Stefan Hajnoczi,
	virtualization, linux-kernel

Hi Xiaodong,

On Mon, Jul 4, 2022 at 5:27 PM Liu Xiaodong <xiaodong.liu@intel.com> wrote:
>
> On Wed, Jun 29, 2022 at 04:25:35PM +0800, Xie Yongji wrote:
> > Hi all,
> >
> > This series introduces some new ioctls: VDUSE_IOTLB_GET_INFO,
> > VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM to support
> > registering and de-registering userspace memory for IOTLB
> > as bounce buffer in virtio-vdpa case.
> >
> > The VDUSE_IOTLB_GET_INFO ioctl can help user to query IOLTB
> > information such as bounce buffer size. Then user can use
> > those information on VDUSE_IOTLB_REG_UMEM and
> > VDUSE_IOTLB_DEREG_UMEM ioctls to register and de-register
> > userspace memory for IOTLB.
> >
> > During registering and de-registering, the DMA data in use
> > would be copied from kernel bounce pages to userspace bounce
> > pages and back.
> >
> > With this feature, some existing application such as SPDK
> > and DPDK can leverage the datapath of VDUSE directly and
> > efficiently as discussed before [1]. They can register some
> > preallocated hugepages to VDUSE to avoid an extra memcpy
> > from bounce-buffer to hugepages.
>
> Hi, Yongji
>
> Very glad to see this enhancement in VDUSE. Thank you.
> It is really helpful and essential to SPDK.
> With this new feature, we can get VDUSE transferred data
> accessed directly by userspace physical backends, like RDMA
> and PCIe devices.
>
> In SPDK roadmap, it's one important work to export block
> services to local host, especially for container scenario.
> This patch could help SPDK do that with its userspace
> backend stacks while keeping high efficiency and performance.
> So the whole SPDK ecosystem can get benefited.
>
> Based on this enhancement, as discussed, I drafted a VDUSE
> prototype module in SPDK for initial evaluation:
> [TEST]vduse: prototype for initial draft
> https://review.spdk.io/gerrit/c/spdk/spdk/+/13534
>

Thanks for this nice work!

> Running SPDK on single CPU core, configured with 2 P3700 NVMe,
> and exported block devices to local host kernel via different
> protocols. The randwrite IOPS through each protocol are:
> NBD               121K
> NVMf-tcp loopback 274K
> VDUSE             463K
>
> SPDK with RDMA backends should have a similar ratio.
> VDUSE has a great performance advantage for SPDK.
> We have kept investigating on this usage for years.
> Originally, some SPDK users used NBD. Then NVMf-tcp loopback
> is SPDK community accommended way. In future, VDUSE could be
> the preferred way.
>

Glad to see SPDK can benefit from this feature. I will continue to
improve this feature to make it available ASAP.

Thanks,
Yongji

> > The kernel and userspace codes could be found in github:
> >
> > https://github.com/bytedance/linux/tree/vduse-umem
> > https://github.com/bytedance/qemu/tree/vduse-umem
> >
> > To test it with qemu-storage-daemon:
> >
> > $ qemu-storage-daemon \
> >     --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> >     --monitor chardev=charmonitor \
> >     --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> > \
> >     --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> >
> > [1] https://lkml.org/lkml/2021/6/27/318
> >
> > Please review, thanks!
>
> Waiting for its review process.
>
> Thanks
> Xiaodong

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-07-04 10:03 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-29  8:25 [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Xie Yongji
2022-06-29  8:25 ` [PATCH 1/6] vduse: Remove unnecessary spin lock protection Xie Yongji
2022-06-29  8:25 ` [PATCH 2/6] vduse: Use memcpy_{to,from}_page() in do_bounce() Xie Yongji
2022-06-29  8:25 ` [PATCH 3/6] vduse: Support using userspace pages as bounce buffer Xie Yongji
2022-06-29  8:25 ` [PATCH 4/6] vduse: Support querying IOLTB information Xie Yongji
2022-06-29  8:25 ` [PATCH 5/6] vduse: Support registering userspace memory for IOTLB Xie Yongji
2022-06-29  8:42   ` Michael S. Tsirkin
2022-06-29  8:42     ` Michael S. Tsirkin
2022-06-29  9:26     ` Yongji Xie
2022-06-29  9:54       ` Michael S. Tsirkin
2022-06-29  9:54         ` Michael S. Tsirkin
2022-06-29 10:19         ` Yongji Xie
2022-06-29 11:28           ` Michael S. Tsirkin
2022-06-29 11:28             ` Michael S. Tsirkin
2022-06-29  8:25 ` [PATCH 6/6] vduse: Update api version to 1 Xie Yongji
2022-06-29  8:33   ` Michael S. Tsirkin
2022-06-29  8:33     ` Michael S. Tsirkin
2022-06-29  9:02     ` Yongji Xie
2022-06-29  9:22       ` Michael S. Tsirkin
2022-06-29  9:22         ` Michael S. Tsirkin
2022-06-29  9:28         ` Yongji Xie
2022-07-04  9:26 ` [PATCH 0/6] VDUSE: Support registering userspace memory as bounce buffer Liu Xiaodong
2022-07-04  9:26   ` Liu Xiaodong
2022-07-04 10:02   ` Yongji Xie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.