linux-mm.kvack.org archive mirror
* [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
@ 2020-10-19 14:56 Xie Yongji
  2020-10-19 14:56 ` [RFC 1/4] mm: export zap_page_range() for driver use Xie Yongji
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Xie Yongji @ 2020-10-19 14:56 UTC
  To: mst, jasowang, akpm; +Cc: linux-mm, virtualization

This series introduces a framework that can be used to implement
vDPA devices in a userspace program. The work consists of two
parts: control path emulation and data path offloading.

In the control path, the VDUSE driver uses a message mechanism to
forward actions (get/set features, get/set status, get/set config
space and set virtqueue states) from the virtio-vdpa driver to
userspace. Userspace can use read()/write() to receive and reply
to those control messages.
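
As a rough illustration of the control path described above, the sketch
below shows how a userspace program might service one request with
read()/write(). Everything prefixed with demo_/DEMO_ is hypothetical: the
message layout, the request types and the result convention are
assumptions made for this example, not the definitions from
include/uapi/linux/vduse.h in this series.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical request types; the real ones are defined by the series' uapi header. */
enum demo_req_type {
	DEMO_GET_FEATURES,
	DEMO_SET_FEATURES,
	DEMO_GET_STATUS,
	DEMO_SET_STATUS,
};

/* Hypothetical request layout (assumption for illustration). */
struct demo_request {
	uint32_t type;    /* one of enum demo_req_type */
	uint32_t pad;
	uint64_t unique;  /* echoed back so the sender can match the reply */
	uint64_t value;   /* payload for SET_* requests */
};

/* Hypothetical response layout (assumption for illustration). */
struct demo_response {
	uint64_t unique;
	int32_t  result;  /* 0 on success, -1 on failure */
	uint32_t pad;
	uint64_t value;   /* payload for GET_* requests */
};

/* Emulated device state kept by the userspace program. */
struct demo_dev_state {
	uint64_t offered_features;
	uint64_t acked_features;
	uint8_t  status;
};

/* Compute the reply for one control request and update the device state. */
void demo_handle(struct demo_dev_state *s, const struct demo_request *req,
		 struct demo_response *resp)
{
	memset(resp, 0, sizeof(*resp));
	resp->unique = req->unique;

	switch (req->type) {
	case DEMO_GET_FEATURES:
		resp->value = s->offered_features;
		break;
	case DEMO_SET_FEATURES:
		/* Refuse feature bits that were never offered. */
		if (req->value & ~s->offered_features)
			resp->result = -1;
		else
			s->acked_features = req->value;
		break;
	case DEMO_GET_STATUS:
		resp->value = s->status;
		break;
	case DEMO_SET_STATUS:
		s->status = (uint8_t)req->value;
		break;
	default:
		resp->result = -1;
		break;
	}
}

/* Read one request from fd and write the reply to the same fd. */
int demo_serve_one(int fd, struct demo_dev_state *s)
{
	struct demo_request req;
	struct demo_response resp;

	if (read(fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;
	demo_handle(s, &req, &resp);
	if (write(fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		return -1;
	return 0;
}
```

Keeping demo_handle() free of I/O makes the emulation logic easy to unit
test; a real daemon would presumably call something like demo_serve_one()
in a loop on the device file descriptor.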

In the data path, the VDUSE driver implements an MMU-based
on-chip IOMMU driver which supports both direct mapping and
indirect mapping with a bounce buffer. Userspace can then access
the IOVA space via mmap(). In addition, the eventfd mechanism is
used to trigger interrupts and forward virtqueue kicks.
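
For readers unfamiliar with eventfd, the sketch below shows the counter
semantics that the interrupt and kick notifications rely on. In this
series userspace would write to one eventfd to request an interrupt and
read (or poll) another to learn about virtqueue kicks; the example uses a
single descriptor for both directions only to stay self-contained. The
demo_* helper names are assumptions made for this example.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal the eventfd once; every write adds its value to a 64-bit counter. */
int demo_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Drain the eventfd. Returns the number of signals accumulated since the
 * last drain, or 0 if nothing is pending (the descriptor is assumed to be
 * non-blocking, so an empty counter makes read() fail with EAGAIN).
 */
uint64_t demo_drain(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

Because the counter coalesces signals, a consumer that drains it should
process all pending virtqueue work rather than assume one event per kick.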

The details and our use case are shown below:

------------------------     -----------------------------------------------------------
|                  APP |     |                          QEMU                           |
|       ---------      |     | --------------------    -------------------+<-->+------ |
|       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
------------+-----------     -----------+-----------------------+-----------------+-----
            |                           |                       |                 |
            |                           | emulating             | offloading      |
------------+---------------------------+-----------------------+-----------------+------
|    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
|    -------+--------           --------+--------        +------+-------     -----+---- |
|           |                           |                |      |                 |     |
|           |                           |                |      |                 |     |
| ----------+----------       ----------+-----------     |      |                 |     |
| | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
| ----------+----------       ----------+-----------     |      |                 |     |
|           |                           |                |      |                 |     |
|           |                           ------------------      |                 |     |
|           -----------------------------------------------------              ---+---  |
------------------------------------------------------------------------------ | NIC |---
                                                                               ---+---
                                                                                  |
                                                                         ---------+---------
                                                                         | Remote Storages |
                                                                         -------------------

We use it to implement a block device that connects to our
distributed storage and can be used in containers and on bare
metal. Compared with the qemu-nbd solution, this approach has
higher performance, and it gives us a unified technology stack
for remote storage in both VMs and containers.

To test it with a host disk (e.g. /dev/sdx):

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/sdx,node-name=disk0 \
      --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

Future work:
  - Improve performance (e.g. zero copy implementation in datapath)
  - Config interrupt support
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)

Xie Yongji (4):
  mm: export zap_page_range() for driver use
  vduse: Introduce VDUSE - vDPA Device in Userspace
  vduse: grab the module's references until there is no vduse device
  vduse: Add memory shrinker to reclaim bounce pages

 drivers/vdpa/Kconfig                 |    8 +
 drivers/vdpa/Makefile                |    1 +
 drivers/vdpa/vdpa_user/Makefile      |    5 +
 drivers/vdpa/vdpa_user/eventfd.c     |  221 ++++++
 drivers/vdpa/vdpa_user/eventfd.h     |   48 ++
 drivers/vdpa/vdpa_user/iova_domain.c |  488 ++++++++++++
 drivers/vdpa/vdpa_user/iova_domain.h |  104 +++
 drivers/vdpa/vdpa_user/vduse.h       |   66 ++
 drivers/vdpa/vdpa_user/vduse_dev.c   | 1081 ++++++++++++++++++++++++++
 include/uapi/linux/vduse.h           |   85 ++
 mm/memory.c                          |    1 +
 11 files changed, 2108 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

-- 
2.25.1




* [RFC 1/4] mm: export zap_page_range() for driver use
  2020-10-19 14:56 [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2020-10-19 14:56 ` Xie Yongji
  2020-10-19 15:14   ` Matthew Wilcox
  2020-10-19 14:56 ` [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Xie Yongji @ 2020-10-19 14:56 UTC
  To: mst, jasowang, akpm; +Cc: linux-mm, virtualization

Export zap_page_range() for use in VDUSE.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..5008dbbeb72f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1531,6 +1531,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb, start, range.end);
 }
+EXPORT_SYMBOL(zap_page_range);
 
 /**
  * zap_page_range_single - remove user pages in a given range
-- 
2.25.1




* [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 14:56 [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2020-10-19 14:56 ` [RFC 1/4] mm: export zap_page_range() for driver use Xie Yongji
@ 2020-10-19 14:56 ` Xie Yongji
  2020-10-19 15:08   ` Michael S. Tsirkin
  2020-10-19 14:56 ` [RFC 3/4] vduse: grab the module's references until there is no vduse device Xie Yongji
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Xie Yongji @ 2020-10-19 14:56 UTC
  To: mst, jasowang, akpm; +Cc: linux-mm, virtualization

The VDUSE driver makes it possible to implement vDPA devices in
userspace. Both the control path and the data path of a vDPA device
can be handled in userspace.

In the control path, the VDUSE driver uses a message mechanism to
forward actions (get/set features, get/set status, get/set config
space and set virtqueue states) from the virtio-vdpa driver to
userspace. Userspace can use read()/write() to receive and reply
to those control messages.

In the data path, the VDUSE driver implements an MMU-based
on-chip IOMMU driver which supports both direct mapping and
indirect mapping with a bounce buffer. Userspace can access the
IOVA space via mmap(). In addition, the eventfd mechanism is used
to trigger interrupts and forward virtqueue kicks.
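
To make the IOVA layout used by the bounce-buffer code easier to follow,
here is a small userspace sketch of the index arithmetic performed by
vduse_domain_get_bounce_page() in this patch. The chunk shift of 26
(64 MiB chunks) is taken from IOVA_CHUNK_SHIFT; the 4 KiB page size and
all demo_/DEMO_ names are assumptions made for this example.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12  /* assumption: 4 KiB pages */
#define DEMO_CHUNK_SHIFT  26  /* same value as IOVA_CHUNK_SHIFT in the patch */
#define DEMO_CHUNK_SIZE   (UINT64_C(1) << DEMO_CHUNK_SHIFT)
#define DEMO_CHUNK_MASK   (~(DEMO_CHUNK_SIZE - 1))

/* Position of an IOVA inside the chunked address space. */
struct demo_iova_pos {
	uint64_t chunk;   /* index into the array of chunks */
	uint64_t page;    /* index into that chunk's bounce page array */
	uint64_t offset;  /* byte offset inside the page */
};

/* Split an IOVA the same way the patch does: chunk index, then page index. */
struct demo_iova_pos demo_split_iova(uint64_t iova)
{
	struct demo_iova_pos pos;
	uint64_t chunkoff = iova & ~DEMO_CHUNK_MASK;

	pos.chunk = iova >> DEMO_CHUNK_SHIFT;
	pos.page = chunkoff >> DEMO_PAGE_SHIFT;
	pos.offset = iova & ((UINT64_C(1) << DEMO_PAGE_SHIFT) - 1);
	return pos;
}
```

Under these assumptions each chunk covers 16384 pages, which is the
number of entries the per-chunk bounce page array would need.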

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/Kconfig                 |    8 +
 drivers/vdpa/Makefile                |    1 +
 drivers/vdpa/vdpa_user/Makefile      |    5 +
 drivers/vdpa/vdpa_user/eventfd.c     |  221 ++++++
 drivers/vdpa/vdpa_user/eventfd.h     |   48 ++
 drivers/vdpa/vdpa_user/iova_domain.c |  413 +++++++++++
 drivers/vdpa/vdpa_user/iova_domain.h |   94 +++
 drivers/vdpa/vdpa_user/vduse.h       |   66 ++
 drivers/vdpa/vdpa_user/vduse_dev.c   | 1028 ++++++++++++++++++++++++++
 include/uapi/linux/vduse.h           |   85 +++
 10 files changed, 1969 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index d7d32b656102..c1e51a434809 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -19,6 +19,14 @@ config VDPA_SIM
 	  to RX. This device is used for testing, prototyping and
 	  development of vDPA.
 
+config VDPA_USER
+	tristate "VDUSE (vDPA Device in Userspace) support"
+	depends on EVENTFD && MMU && HAS_DMA
+	default n
+	help
+	  With VDUSE it is possible to emulate a vDPA Device
+	  in a userspace program.
+
 config IFCVF
 	tristate "Intel IFC VF vDPA driver"
 	depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index d160e9b63a66..66e97778ad03 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
 obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
 obj-$(CONFIG_IFCVF)    += ifcvf/
 obj-$(CONFIG_MLX5_VDPA) += mlx5/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..b7645e36992b
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o eventfd.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
new file mode 100644
index 000000000000..dbffddb08908
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <uapi/linux/vduse.h>
+
+#include "eventfd.h"
+
+static struct workqueue_struct *vduse_irqfd_cleanup_wq;
+
+static void vduse_virqfd_shutdown(struct work_struct *work)
+{
+	u64 cnt;
+	struct vduse_virqfd *virqfd = container_of(work,
+					struct vduse_virqfd, shutdown);
+
+	eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
+	flush_work(&virqfd->inject);
+	eventfd_ctx_put(virqfd->ctx);
+	kfree(virqfd);
+}
+
+static void vduse_virqfd_inject(struct work_struct *work)
+{
+	struct vduse_virqfd *virqfd = container_of(work,
+					struct vduse_virqfd, inject);
+	struct vduse_virtqueue *vq = virqfd->vq;
+
+	spin_lock_irq(&vq->irq_lock);
+	if (vq->ready && vq->cb)
+		vq->cb(vq->private);
+	spin_unlock_irq(&vq->irq_lock);
+}
+
+static void virqfd_deactivate(struct vduse_virqfd *virqfd)
+{
+	queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+				int sync, void *key)
+{
+	struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
+	struct vduse_virtqueue *vq = virqfd->vq;
+
+	__poll_t flags = key_to_poll(key);
+
+	if (flags & EPOLLIN)
+		schedule_work(&virqfd->inject);
+
+	if (flags & EPOLLHUP) {
+		spin_lock(&vq->irq_lock);
+		if (vq->virqfd == virqfd) {
+			vq->virqfd = NULL;
+			virqfd_deactivate(virqfd);
+		}
+		spin_unlock(&vq->irq_lock);
+	}
+
+	return 0;
+}
+
+static void vduse_virqfd_ptable_queue_proc(struct file *file,
+			wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
+
+	add_wait_queue(wqh, &virqfd->wait);
+}
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct vduse_virqfd *virqfd;
+	struct fd irqfd;
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+	__poll_t events;
+	int ret;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+	if (!virqfd)
+		return -ENOMEM;
+
+	INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
+	INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
+
+	ret = -EBADF;
+	irqfd = fdget(eventfd->fd);
+	if (!irqfd.file)
+		goto err_fd;
+
+	ctx = eventfd_ctx_fileget(irqfd.file);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
+		goto err_ctx;
+	}
+
+	virqfd->vq = vq;
+	virqfd->ctx = ctx;
+	spin_lock(&vq->irq_lock);
+	if (vq->virqfd)
+		virqfd_deactivate(vq->virqfd);
+	vq->virqfd = virqfd;
+	spin_unlock(&vq->irq_lock);
+
+	init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
+	init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
+
+	events = vfs_poll(irqfd.file, &virqfd->pt);
+
+	/*
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered and trigger it as if we didn't miss it.
+	 */
+	if (events & EPOLLIN)
+		schedule_work(&virqfd->inject);
+
+	fdput(irqfd);
+
+	return 0;
+err_ctx:
+	fdput(irqfd);
+err_fd:
+	kfree(virqfd);
+	return ret;
+}
+
+void vduse_virqfd_release(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		if (vq->virqfd) {
+			virqfd_deactivate(vq->virqfd);
+			vq->virqfd = NULL;
+		}
+		spin_unlock(&vq->irq_lock);
+	}
+	flush_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+int vduse_virqfd_init(void)
+{
+	vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
+						WQ_UNBOUND, 0);
+	if (!vduse_irqfd_cleanup_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void vduse_virqfd_exit(void)
+{
+	destroy_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+void vduse_vq_kick(struct vduse_virtqueue *vq)
+{
+	spin_lock(&vq->kick_lock);
+	if (vq->ready && vq->kickfd)
+		eventfd_signal(vq->kickfd, 1);
+	spin_unlock(&vq->kick_lock);
+}
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	ctx = eventfd_ctx_fdget(eventfd->fd);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	spin_lock(&vq->kick_lock);
+	if (vq->kickfd)
+		eventfd_ctx_put(vq->kickfd);
+	vq->kickfd = ctx;
+	spin_unlock(&vq->kick_lock);
+
+	return 0;
+}
+
+void vduse_kickfd_release(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->kick_lock);
+		if (vq->kickfd) {
+			eventfd_ctx_put(vq->kickfd);
+			vq->kickfd = NULL;
+		}
+		spin_unlock(&vq->kick_lock);
+	}
+}
diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
new file mode 100644
index 000000000000..14269ff27f47
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_EVENTFD_H
+#define _VDUSE_EVENTFD_H
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <uapi/linux/vduse.h>
+
+#include "vduse.h"
+
+struct vduse_dev;
+
+struct vduse_virqfd {
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+	struct work_struct inject;
+	struct work_struct shutdown;
+	wait_queue_entry_t wait;
+	poll_table pt;
+};
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd);
+
+void vduse_virqfd_release(struct vduse_dev *dev);
+
+int vduse_virqfd_init(void);
+
+void vduse_virqfd_exit(void);
+
+void vduse_vq_kick(struct vduse_virtqueue *vq);
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd);
+
+void vduse_kickfd_release(struct vduse_dev *dev);
+
+#endif /* _VDUSE_EVENTFD_H */
diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..a274f78f00d2
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,413 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/interval_tree.h>
+#include <linux/genalloc.h>
+#include <linux/dma-mapping.h>
+
+#include "iova_domain.h"
+
+#define IOVA_CHUNK_SHIFT 26
+#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
+#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
+
+#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE >> 1)
+
+#define IOVA_ALLOC_ORDER 9
+#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
+
+struct vduse_mmap_vma {
+	struct vm_area_struct *vma;
+	struct list_head list;
+};
+
+static inline struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
+
+	return domain->chunks[index].bounce_pages[pgindex];
+}
+
+static inline void
+vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
+				unsigned long iova, struct page *page)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
+
+	domain->chunks[index].bounce_pages[pgindex] = page;
+}
+
+static int
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size)
+{
+	struct page *page;
+	size_t walk_sz = 0;
+	int frees = 0;
+
+	while (walk_sz < size) {
+		page = vduse_domain_get_bounce_page(domain, iova);
+		if (page) {
+			vduse_domain_set_bounce_page(domain, iova, NULL);
+			put_page(page);
+			frees++;
+		}
+		iova += PAGE_SIZE;
+		walk_sz += PAGE_SIZE;
+	}
+
+	return frees;
+}
+
+int vduse_domain_add_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma)
+{
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct vduse_mmap_vma *mmap_vma;
+
+	if (WARN_ON(size != domain->size))
+		return -EINVAL;
+
+	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+	if (!mmap_vma)
+		return -ENOMEM;
+
+	mmap_vma->vma = vma;
+	mutex_lock(&domain->vma_lock);
+	list_add(&mmap_vma->list, &domain->vma_list);
+	mutex_unlock(&domain->vma_lock);
+
+	return 0;
+}
+
+void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma)
+{
+	struct vduse_mmap_vma *mmap_vma;
+
+	mutex_lock(&domain->vma_lock);
+	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+		if (mmap_vma->vma == vma) {
+			list_del(&mmap_vma->list);
+			kfree(mmap_vma);
+			break;
+		}
+	}
+	mutex_unlock(&domain->vma_lock);
+}
+
+int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
+				unsigned long iova, unsigned long orig,
+				size_t size, enum dma_data_direction dir)
+{
+	struct vduse_iova_map *map;
+
+	map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
+	if (!map)
+		return -ENOMEM;
+
+	map->iova.start = iova;
+	map->iova.last = iova + size - 1;
+	map->orig = orig;
+	map->size = size;
+	map->dir = dir;
+
+	spin_lock(&domain->map_lock);
+	interval_tree_insert(&map->iova, &domain->mappings);
+	spin_unlock(&domain->map_lock);
+
+	return 0;
+}
+
+struct vduse_iova_map *
+vduse_domain_get_mapping(struct vduse_iova_domain *domain,
+			unsigned long iova, size_t size)
+{
+	struct interval_tree_node *node;
+	struct vduse_iova_map *map = NULL;
+	unsigned long last = iova + size - 1;
+
+	spin_lock(&domain->map_lock);
+	node = interval_tree_iter_first(&domain->mappings, iova, last);
+	if (node)
+		map = container_of(node, struct vduse_iova_map, iova);
+	spin_unlock(&domain->map_lock);
+
+	return map;
+}
+
+void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
+				struct vduse_iova_map *map)
+{
+	spin_lock(&domain->map_lock);
+	interval_tree_remove(&map->iova, &domain->mappings);
+	spin_unlock(&domain->map_lock);
+}
+
+void vduse_domain_unmap(struct vduse_iova_domain *domain,
+			unsigned long iova, size_t size)
+{
+	struct vduse_mmap_vma *mmap_vma;
+	unsigned long uaddr;
+
+	mutex_lock(&domain->vma_lock);
+	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+		mmap_read_lock(mmap_vma->vma->vm_mm);
+		uaddr = iova + mmap_vma->vma->vm_start;
+		zap_page_range(mmap_vma->vma, uaddr, size);
+		mmap_read_unlock(mmap_vma->vma->vm_mm);
+	}
+	mutex_unlock(&domain->vma_lock);
+}
+
+int vduse_domain_direct_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova)
+{
+	unsigned long uaddr = iova + vma->vm_start;
+	unsigned long start = iova & PAGE_MASK;
+	unsigned long last = start + PAGE_SIZE - 1;
+	unsigned long offset;
+	struct vduse_iova_map *map;
+	struct interval_tree_node *node;
+	struct page *page = NULL;
+
+	spin_lock(&domain->map_lock);
+	node = interval_tree_iter_first(&domain->mappings, start, last);
+	if (node) {
+		map = container_of(node, struct vduse_iova_map, iova);
+		offset = last - map->iova.start;
+		page = virt_to_page(map->orig + offset);
+	}
+	spin_unlock(&domain->map_lock);
+
+	return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
+}
+
+void vduse_domain_bounce(struct vduse_iova_domain *domain,
+			unsigned long iova, unsigned long orig,
+			size_t size, enum dma_data_direction dir)
+{
+	unsigned int offset = offset_in_page(iova);
+
+	while (size) {
+		struct page *p = vduse_domain_get_bounce_page(domain, iova);
+		size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
+		void *addr;
+
+		if (p) {
+			addr = page_address(p) + offset;
+			if (dir == DMA_TO_DEVICE)
+				memcpy(addr, (void *)orig, copy_len);
+			else if (dir == DMA_FROM_DEVICE)
+				memcpy((void *)orig, addr, copy_len);
+		}
+		size -= copy_len;
+		orig += copy_len;
+		iova += copy_len;
+		offset = 0;
+	}
+}
+
+int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova)
+{
+	unsigned long uaddr = iova + vma->vm_start;
+	unsigned long start = iova & PAGE_MASK;
+	unsigned long last = start + PAGE_SIZE - 1;
+	struct vduse_iova_map *map;
+	struct interval_tree_node *node;
+	struct page *page;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	spin_lock(&domain->map_lock);
+	if (vduse_domain_get_bounce_page(domain, iova))
+		goto unlock;
+
+	node = interval_tree_iter_first(&domain->mappings, start, last);
+	if (!node)
+		goto unlock;
+
+	while (node) {
+		unsigned int src_offset = 0, dst_offset = 0;
+		void *src, *dst;
+		size_t copy_len;
+
+		map = container_of(node, struct vduse_iova_map, iova);
+		node = interval_tree_iter_next(node, start, last);
+		if (map->dir == DMA_FROM_DEVICE)
+			continue;
+
+		if (start > map->iova.start)
+			src_offset = start - map->iova.start;
+		else
+			dst_offset = map->iova.start - start;
+		src = (void *)(map->orig + src_offset);
+		dst = page_address(page) + dst_offset;
+		copy_len = min_t(size_t, map->size - src_offset,
+				PAGE_SIZE - dst_offset);
+		memcpy(dst, src, copy_len);
+	}
+	get_page(page);
+	vduse_domain_set_bounce_page(domain, iova, page);
+unlock:
+	put_page(page);
+	page = vduse_domain_get_bounce_page(domain, iova);
+	spin_unlock(&domain->map_lock);
+
+	return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
+}
+
+bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	struct vduse_iova_chunk *chunk = &domain->chunks[index];
+
+	return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
+}
+
+unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
+					size_t size, enum iova_map_type type)
+{
+	struct vduse_iova_chunk *chunk;
+	unsigned long iova = 0;
+	int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
+	struct genpool_data_align data = { .align = align };
+	int i;
+
+	for (i = 0; i < domain->chunk_num; i++) {
+		chunk = &domain->chunks[i];
+		if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
+			atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
+
+		if (atomic_read(&chunk->map_type) != type)
+			continue;
+
+		iova = gen_pool_alloc_algo(chunk->pool, size,
+					gen_pool_first_fit_align, &data);
+		if (iova)
+			break;
+	}
+
+	return iova;
+}
+
+void vduse_domain_free_iova(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	struct vduse_iova_chunk *chunk = &domain->chunks[index];
+
+	gen_pool_free(chunk->pool, iova, size);
+}
+
+static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk)
+{
+	vfree(chunk->bounce_pages);
+	gen_pool_destroy(chunk->pool);
+}
+
+void vduse_iova_domain_destroy(struct vduse_iova_domain *domain)
+{
+	struct vduse_iova_chunk *chunk;
+	int i;
+
+	for (i = 0; i < domain->chunk_num; i++) {
+		chunk = &domain->chunks[i];
+		vduse_domain_free_bounce_pages(domain,
+					chunk->start, IOVA_CHUNK_SIZE);
+		vduse_iova_chunk_cleanup(chunk);
+	}
+
+	mutex_destroy(&domain->vma_lock);
+	kfree(domain->chunks);
+	kfree(domain);
+}
+
+static int vduse_iova_chunk_init(struct vduse_iova_chunk *chunk,
+				unsigned long addr, size_t size)
+{
+	int ret;
+	int pages = size >> PAGE_SHIFT;
+
+	chunk->pool = gen_pool_create(IOVA_ALLOC_ORDER, -1);
+	if (!chunk->pool)
+		return -ENOMEM;
+
+	/* addr 0 is used in allocation failure case */
+	if (addr == 0)
+		addr += IOVA_ALLOC_SIZE;
+
+	ret = gen_pool_add(chunk->pool, addr, size, -1);
+	if (ret)
+		goto err;
+
+	ret = -ENOMEM;
+	chunk->bounce_pages = vzalloc(pages * sizeof(struct page *));
+	if (!chunk->bounce_pages)
+		goto err;
+
+	chunk->start = addr;
+	atomic_set(&chunk->map_type, TYPE_NONE);
+
+	return 0;
+err:
+	gen_pool_destroy(chunk->pool);
+	return ret;
+}
+
+struct vduse_iova_domain *vduse_iova_domain_create(size_t size)
+{
+	int j, i = 0;
+	struct vduse_iova_domain *domain;
+	unsigned long num = size >> IOVA_CHUNK_SHIFT;
+	unsigned long addr = 0;
+
+	if (size < IOVA_MIN_SIZE || size & ~IOVA_CHUNK_MASK)
+		return NULL;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->chunks = kcalloc(num, sizeof(struct vduse_iova_chunk), GFP_KERNEL);
+	if (!domain->chunks)
+		goto err;
+
+	for (i = 0; i < num; i++, addr += IOVA_CHUNK_SIZE)
+		if (vduse_iova_chunk_init(&domain->chunks[i], addr,
+					IOVA_CHUNK_SIZE))
+			goto err;
+
+	domain->chunk_num = num;
+	domain->size = size;
+	domain->mappings = RB_ROOT_CACHED;
+	INIT_LIST_HEAD(&domain->vma_list);
+	mutex_init(&domain->vma_lock);
+
+	return domain;
+err:
+	for (j = 0; j < i; j++)
+		vduse_iova_chunk_cleanup(&domain->chunks[j]);
+	kfree(domain);
+
+	return NULL;
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
new file mode 100644
index 000000000000..7ae60c0e50ec
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_IOVA_DOMAIN_H
+#define _VDUSE_IOVA_DOMAIN_H
+
+#include <linux/interval_tree.h>
+#include <linux/genalloc.h>
+#include <linux/dma-mapping.h>
+
+enum iova_map_type {
+	TYPE_NONE,
+	TYPE_DIRECT_MAP,
+	TYPE_BOUNCE_MAP,
+};
+
+struct vduse_iova_chunk {
+	struct gen_pool *pool;
+	struct page **bounce_pages;
+	unsigned long start;
+	atomic_t map_type;
+};
+
+struct vduse_iova_domain {
+	struct vduse_iova_chunk *chunks;
+	int chunk_num;
+	size_t size;
+	struct rb_root_cached mappings;
+	spinlock_t map_lock;
+	struct mutex vma_lock;
+	struct list_head vma_list;
+};
+
+struct vduse_iova_map {
+	struct interval_tree_node iova;
+	unsigned long orig;
+	size_t size;
+	enum dma_data_direction dir;
+};
+
+int vduse_domain_add_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma);
+
+void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma);
+
+int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
+				unsigned long iova, unsigned long orig,
+				size_t size, enum dma_data_direction dir);
+
+struct vduse_iova_map *
+vduse_domain_get_mapping(struct vduse_iova_domain *domain,
+			unsigned long iova, size_t size);
+
+void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
+				struct vduse_iova_map *map);
+
+void vduse_domain_unmap(struct vduse_iova_domain *domain,
+			unsigned long iova, size_t size);
+
+int vduse_domain_direct_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova);
+
+void vduse_domain_bounce(struct vduse_iova_domain *domain,
+			unsigned long iova, unsigned long orig,
+			size_t size, enum dma_data_direction dir);
+
+int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova);
+
+bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
+				unsigned long iova);
+
+unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
+					size_t size, enum iova_map_type type);
+
+void vduse_domain_free_iova(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size);
+
+bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
+				unsigned long iova);
+
+void vduse_iova_domain_destroy(struct vduse_iova_domain *domain);
+
+struct vduse_iova_domain *vduse_iova_domain_create(size_t size);
+
+#endif /* _VDUSE_IOVA_DOMAIN_H */
diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
new file mode 100644
index 000000000000..3bd52f668ab3
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_H
+#define _VDUSE_H
+
+#include <linux/eventfd.h>
+#include <linux/wait.h>
+#include <linux/vdpa.h>
+
+#include "iova_domain.h"
+#include "eventfd.h"
+
+struct vduse_virtqueue {
+	u16 index;
+	u32 num;
+	u64 desc_addr;
+	u64 device_addr;
+	u64 driver_addr;
+	bool ready;
+	spinlock_t kick_lock;
+	spinlock_t irq_lock;
+	struct eventfd_ctx *kickfd;
+	struct vduse_virqfd *virqfd;
+	void *private;
+	irqreturn_t (*cb)(void *data);
+};
+
+struct vduse_dev;
+
+struct vduse_vdpa {
+	struct vdpa_device vdpa;
+	struct vduse_dev *dev;
+	struct work_struct start_work;
+	struct work_struct stop_work;
+	bool stopping;
+};
+
+struct vduse_dev {
+	struct vduse_vdpa *vdev;
+	struct vduse_virtqueue *vqs;
+	struct vduse_iova_domain *domain;
+	struct mutex lock;
+	spinlock_t msg_lock;
+	atomic64_t msg_unique;
+	wait_queue_head_t waitq;
+	struct list_head send_list;
+	struct list_head recv_list;
+	struct list_head list;
+	bool connected;
+	u32 id;
+	u16 vq_size_max;
+	u16 vq_num;
+	u32 vq_align;
+	u32 device_id;
+	u32 vendor_id;
+};
+
+#endif /* _VDUSE_H */
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index 000000000000..6787ba66725c
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1028 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/dma-mapping.h>
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/vdpa.h>
+#include <uapi/linux/vduse.h>
+#include <uapi/linux/virtio_config.h>
+
+#include "vduse.h"
+
+#define DRV_VERSION  "1.0"
+#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
+#define DRV_DESC     "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+struct vduse_dev_msg {
+	struct vduse_dev_request req;
+	struct vduse_dev_response resp;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	bool completed;
+	refcount_t refcnt;
+};
+
+static struct workqueue_struct *vduse_vdpa_wq;
+static DEFINE_MUTEX(vduse_lock);
+static LIST_HEAD(vduse_devs);
+
+static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
+{
+	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
+
+	return vdev->dev;
+}
+
+static inline struct vduse_dev *dev_to_vduse(struct device *dev)
+{
+	struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+	return vdpa_to_vduse(vdpa);
+}
+
+static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
+{
+	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
+					GFP_KERNEL | __GFP_NOFAIL);
+
+	msg->req.type = type;
+	msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	init_waitqueue_head(&msg->waitq);
+	refcount_set(&msg->refcnt, 1);
+
+	return msg;
+}
+
+static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
+{
+	refcount_inc(&msg->refcnt);
+}
+
+static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
+{
+	if (refcount_dec_and_test(&msg->refcnt))
+		kfree(msg);
+}
+
+static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
+						struct list_head *head,
+						uint32_t unique)
+{
+	struct vduse_dev_msg *tmp, *msg = NULL;
+
+	spin_lock(&dev->msg_lock);
+	list_for_each_entry(tmp, head, list) {
+		if (tmp->req.unique == unique) {
+			msg = tmp;
+			list_del(&tmp->list);
+			break;
+		}
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return msg;
+}
+
+static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
+						struct list_head *head)
+{
+	struct vduse_dev_msg *msg = NULL;
+
+	spin_lock(&dev->msg_lock);
+	if (!list_empty(head)) {
+		msg = list_first_entry(head, struct vduse_dev_msg, list);
+		list_del(&msg->list);
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return msg;
+}
+
+static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
+			struct vduse_dev_msg *msg, struct list_head *head)
+{
+	spin_lock(&dev->msg_lock);
+	list_add_tail(&msg->list, head);
+	spin_unlock(&dev->msg_lock);
+}
+
+static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
+{
+	int ret;
+
+	vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+	wake_up(&dev->waitq);
+	wait_event(msg->waitq, msg->completed);
+	/* coupled with smp_wmb() in vduse_dev_msg_complete() */
+	smp_rmb();
+	ret = msg->resp.result;
+
+	return ret;
+}
+
+static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
+					struct vduse_dev_response *resp)
+{
+	vduse_dev_msg_get(msg);
+	memcpy(&msg->resp, resp, sizeof(*resp));
+	/* coupled with smp_rmb() in vduse_dev_msg_sync() */
+	smp_wmb();
+	msg->completed = 1;
+	wake_up(&msg->waitq);
+	vduse_dev_msg_put(msg);
+}
+
+static u64 vduse_dev_get_features(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
+	u64 features;
+
+	vduse_dev_msg_sync(dev, msg);
+	features = msg->resp.features;
+	vduse_dev_msg_put(msg);
+
+	return features;
+}
+
+static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
+	int ret;
+
+	msg->req.size = sizeof(features);
+	msg->req.features = features;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static u8 vduse_dev_get_status(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
+	u8 status;
+
+	vduse_dev_msg_sync(dev, msg);
+	status = msg->resp.status;
+	vduse_dev_msg_put(msg);
+
+	return status;
+}
+
+static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
+
+	msg->req.size = sizeof(status);
+	msg->req.status = status;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
+					void *buf, unsigned int len)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
+
+	WARN_ON(len > sizeof(msg->req.config.data));
+
+	msg->req.size = sizeof(struct vduse_dev_config_data);
+	msg->req.config.offset = offset;
+	msg->req.config.len = len;
+	vduse_dev_msg_sync(dev, msg);
+	memcpy(buf, msg->resp.config.data, len);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
+					const void *buf, unsigned int len)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
+
+	WARN_ON(len > sizeof(msg->req.config.data));
+
+	msg->req.size = sizeof(struct vduse_dev_config_data);
+	msg->req.config.offset = offset;
+	msg->req.config.len = len;
+	memcpy(msg->req.config.data, buf, len);
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_set_vq_state(struct vduse_dev *dev,
+				   struct vduse_virtqueue *vq)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE);
+
+	msg->req.size = sizeof(struct vduse_vq_state);
+	msg->req.vq_state.index = vq->index;
+	msg->req.vq_state.num = vq->num;
+	msg->req.vq_state.desc_addr = vq->desc_addr;
+	msg->req.vq_state.driver_addr = vq->driver_addr;
+	msg->req.vq_state.device_addr = vq->device_addr;
+	msg->req.vq_state.ready = vq->ready;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int size = sizeof(struct vduse_dev_request);
+	ssize_t ret = 0;
+
+	if (iov_iter_count(to) < size)
+		return 0;
+
+	while (1) {
+		if (!dev->vdev)
+			return -EPERM;
+
+		msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
+		if (msg)
+			break;
+
+		if (file->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+
+		ret = wait_event_interruptible_exclusive(dev->waitq,
+			!dev->vdev || !list_empty(&dev->send_list));
+		if (ret)
+			return ret;
+	}
+	ret = copy_to_iter(&msg->req, size, to);
+	if (ret != size) {
+		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+		return -EFAULT;
+	}
+	vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_response resp;
+	struct vduse_dev_msg *msg;
+	size_t ret;
+
+	if (!dev->vdev)
+		return -EPERM;
+
+	ret = copy_from_iter(&resp, sizeof(resp), from);
+	if (ret != sizeof(resp))
+		return -EINVAL;
+
+	msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
+	if (!msg)
+		return -EINVAL;
+
+	vduse_dev_msg_complete(msg, &resp);
+
+	return ret;
+}
+
+static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
+{
+	struct vduse_dev *dev = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &dev->waitq, wait);
+
+	if (!dev->vdev)
+		mask = EPOLLERR;
+	else if (!list_empty(&dev->send_list))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static void vduse_dev_reset(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		vq->ready = false;
+		vq->desc_addr = 0;
+		vq->driver_addr = 0;
+		vq->device_addr = 0;
+		vq->cb = NULL;
+		vq->private = NULL;
+		spin_unlock(&vq->irq_lock);
+	}
+}
+
+static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+				u64 desc_area, u64 driver_area,
+				u64 device_area)
+{
+	struct vduse_dev *vduse_dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &vduse_dev->vqs[idx];
+
+	vq->desc_addr = desc_area;
+	vq->driver_addr = driver_area;
+	vq->device_addr = device_area;
+
+	return 0;
+}
+
+static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_vq_kick(vq);
+}
+
+static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+			      struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->cb = cb->callback;
+	vq->private = cb->private;
+}
+
+static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->num = num;
+}
+
+static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
+					u16 idx, bool ready)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->ready = ready;
+	vduse_dev_set_vq_state(dev, vq);
+}
+
+static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vq->ready;
+}
+
+static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_align;
+}
+
+static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
+
+	return (vduse_dev_get_features(dev) | fixed);
+}
+
+static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_set_features(dev, features);
+}
+
+static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
+				  struct vdpa_callback *cb)
+{
+	/* We don't support config interrupt */
+}
+
+static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_size_max;
+}
+
+static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->device_id;
+}
+
+static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vendor_id;
+}
+
+static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_get_status(dev);
+}
+
+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	if (status == 0)
+		vduse_dev_reset(dev);
+
+	vduse_dev_set_status(dev, status);
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
+			     void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_get_config(dev, offset, buf, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
+			const void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_set_config(dev, offset, buf, len);
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_kickfd_release(dev);
+	vduse_virqfd_release(dev);
+
+	WARN_ON(!list_empty(&dev->send_list));
+	WARN_ON(!list_empty(&dev->recv_list));
+	dev->vdev = NULL;
+	wake_up_all(&dev->waitq);
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_features		= vduse_vdpa_get_features,
+	.set_features		= vduse_vdpa_set_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.free			= vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+					unsigned long offset, size_t size,
+					enum dma_data_direction dir,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = vduse_domain_alloc_iova(domain, size,
+							TYPE_BOUNCE_MAP);
+	unsigned long orig = (unsigned long)page_address(page) + offset;
+
+	if (!iova)
+		return DMA_MAPPING_ERROR;
+
+	if (vduse_domain_add_mapping(domain, iova, orig, size, dir)) {
+		vduse_domain_free_iova(domain, iova, size);
+		return DMA_MAPPING_ERROR;
+	}
+
+	if (dir == DMA_TO_DEVICE)
+		vduse_domain_bounce(domain, iova, orig, size, dir);
+
+	return (dma_addr_t)iova;
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = (unsigned long)dma_addr;
+	struct vduse_iova_map *map = vduse_domain_get_mapping(domain,
+							iova, size);
+	if (WARN_ON(!map))
+		return;
+
+	if (dir == DMA_FROM_DEVICE)
+		vduse_domain_bounce(domain, iova, map->orig, size, dir);
+	vduse_domain_remove_mapping(domain, map);
+	vduse_domain_free_iova(domain, iova, size);
+	kfree(map);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+					dma_addr_t *dma_addr, gfp_t flag,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = vduse_domain_alloc_iova(domain, size,
+							TYPE_DIRECT_MAP);
+	void *orig = alloc_pages_exact(size, flag);
+
+	if (!iova || !orig)
+		goto err;
+
+	if (vduse_domain_add_mapping(domain, iova,
+				(unsigned long)orig, size, DMA_BIDIRECTIONAL))
+		goto err;
+
+	*dma_addr = (dma_addr_t)iova;
+
+	return orig;
+err:
+	*dma_addr = DMA_MAPPING_ERROR;
+	if (orig)
+		free_pages_exact(orig, size);
+	if (iova)
+		vduse_domain_free_iova(domain, iova, size);
+
+	return NULL;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = (unsigned long)dma_addr;
+	struct vduse_iova_map *map = vduse_domain_get_mapping(domain,
+							iova, size);
+	if (WARN_ON(!map))
+		return;
+
+	vduse_domain_remove_mapping(domain, map);
+	vduse_domain_unmap(domain, map->iova.start, PAGE_ALIGN(map->size));
+	free_pages_exact((void *)map->orig, map->size);
+	vduse_domain_free_iova(domain, map->iova.start, map->size);
+	kfree(map);
+}
+
+static const struct dma_map_ops vduse_dev_dma_ops = {
+	.map_page = vduse_dev_map_page,
+	.unmap_page = vduse_dev_unmap_page,
+	.alloc = vduse_dev_alloc_coherent,
+	.free = vduse_dev_free_coherent,
+};
+
+static void vduse_dev_mmap_open(struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+
+	if (!vduse_domain_add_vma(domain, vma))
+		return;
+
+	vma->vm_private_data = NULL;
+}
+
+static void vduse_dev_mmap_close(struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+
+	if (!domain)
+		return;
+
+	vduse_domain_remove_vma(domain, vma);
+}
+
+static int vduse_dev_mmap_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	return -EPERM;
+}
+
+static vm_fault_t vduse_dev_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+	unsigned long iova = vmf->address - vma->vm_start;
+	int ret;
+
+	if (!domain)
+		return VM_FAULT_SIGBUS;
+
+	if (vduse_domain_is_direct_map(domain, iova))
+		ret = vduse_domain_direct_map(domain, vma, iova);
+	else
+		ret = vduse_domain_bounce_map(domain, vma, iova);
+
+	if (ret == -ENOMEM)
+		return VM_FAULT_OOM;
+
+	return (ret == 0) ? VM_FAULT_NOPAGE : VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct vduse_dev_mmap_ops = {
+	.open = vduse_dev_mmap_open,
+	.close = vduse_dev_mmap_close,
+	.split = vduse_dev_mmap_split,
+	.fault = vduse_dev_mmap_fault,
+};
+
+static int vduse_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_iova_domain *domain = dev->domain;
+	unsigned long size = vma->vm_end - vma->vm_start;
+	int ret;
+
+	if (domain->size != size || vma->vm_pgoff)
+		return -EINVAL;
+
+	ret = vduse_domain_add_vma(domain, vma);
+	if (ret)
+		return ret;
+
+	vma->vm_flags |= VM_MIXEDMAP | VM_DONTCOPY |
+				VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_private_data = domain;
+	vma->vm_ops = &vduse_dev_mmap_ops;
+
+	return 0;
+}
+
+static void vduse_vdpa_start_work(struct work_struct *work)
+{
+	struct vduse_vdpa *vdev = container_of(work,
+					struct vduse_vdpa, start_work);
+	struct vduse_dev *dev = vdev->dev;
+
+	if (vdpa_register_device(&vdev->vdpa)) {
+		dev->vdev = NULL;
+		put_device(&vdev->vdpa.dev);
+		wake_up_all(&dev->waitq);
+	}
+}
+
+static void vduse_vdpa_stop_work(struct work_struct *work)
+{
+	struct vduse_vdpa *vdev = container_of(work,
+					struct vduse_vdpa, stop_work);
+	vdpa_unregister_device(&vdev->vdpa);
+}
+
+static int vduse_dev_start(struct vduse_dev *dev)
+{
+	struct vduse_vdpa *vdev = dev->vdev;
+
+	if (vdev)
+		return -EBUSY;
+
+	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
+				&vduse_vdpa_config_ops, dev->vq_num);
+	if (!vdev)
+		return -ENOMEM;
+
+	dev->vdev = vdev;
+	vdev->dev = dev;
+	vdev->vdpa.dev.coherent_dma_mask = DMA_BIT_MASK(64);
+	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
+	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
+
+	INIT_WORK(&vdev->start_work, vduse_vdpa_start_work);
+	queue_work(vduse_vdpa_wq, &vdev->start_work);
+
+	return 0;
+}
+
+static int vduse_dev_stop(struct vduse_dev *dev)
+{
+	struct vduse_vdpa *vdev = dev->vdev;
+
+	if (!vdev)
+		return -EPERM;
+
+	if (vdev->stopping)
+		return -EBUSY;
+
+	vdev->stopping = true;
+	INIT_WORK(&vdev->stop_work, vduse_vdpa_stop_work);
+	queue_work(vduse_vdpa_wq, &vdev->stop_work);
+
+	return 0;
+}
+
+static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	struct vduse_dev *dev = file->private_data;
+	void __user *argp = (void __user *)arg;
+	int ret = -EINVAL;
+
+	mutex_lock(&dev->lock);
+	switch (cmd) {
+	case VDUSE_DEV_START:
+		ret = vduse_dev_start(dev);
+		break;
+	case VDUSE_DEV_STOP:
+		ret = vduse_dev_stop(dev);
+		break;
+	case VDUSE_VQ_SETUP_KICKFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_kickfd_setup(dev, &eventfd);
+		break;
+	}
+	case VDUSE_VQ_SETUP_IRQFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_virqfd_setup(dev, &eventfd);
+		break;
+	}
+	}
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static int vduse_dev_release(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+
+	while ((msg = vduse_dev_dequeue_msg(dev, &dev->recv_list)))
+		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+
+	dev->connected = false;
+
+	return 0;
+}
+
+static const struct file_operations vduse_dev_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vduse_dev_release,
+	.read_iter	= vduse_dev_read_iter,
+	.write_iter	= vduse_dev_write_iter,
+	.poll		= vduse_dev_poll,
+	.mmap		= vduse_dev_mmap,
+	.unlocked_ioctl	= vduse_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct vduse_dev *vduse_dev_create(void)
+{
+	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+
+	if (!dev)
+		return NULL;
+
+	mutex_init(&dev->lock);
+	spin_lock_init(&dev->msg_lock);
+	INIT_LIST_HEAD(&dev->send_list);
+	INIT_LIST_HEAD(&dev->recv_list);
+	atomic64_set(&dev->msg_unique, 0);
+	init_waitqueue_head(&dev->waitq);
+
+	return dev;
+}
+
+static void vduse_dev_destroy(struct vduse_dev *dev)
+{
+	mutex_destroy(&dev->lock);
+	kfree(dev);
+}
+
+static struct vduse_dev *vduse_find_dev(u32 id)
+{
+	struct vduse_dev *tmp, *dev = NULL;
+
+	list_for_each_entry(tmp, &vduse_devs, list) {
+		if (tmp->id == id) {
+			dev = tmp;
+			break;
+		}
+	}
+	return dev;
+}
+
+static int vduse_get_dev(u32 id)
+{
+	int fd;
+	char name[64];
+	struct vduse_dev *dev = vduse_find_dev(id);
+
+	if (!dev)
+		return -EINVAL;
+
+	if (dev->connected)
+		return -EBUSY;
+
+	snprintf(name, sizeof(name), "vduse-dev:%u", dev->id);
+	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	dev->connected = true;
+
+	return fd;
+}
+
+static int vduse_destroy_dev(u32 id)
+{
+	struct vduse_dev *dev = vduse_find_dev(id);
+
+	if (!dev)
+		return -EINVAL;
+
+	if (dev->connected || dev->vdev)
+		return -EBUSY;
+
+	list_del(&dev->list);
+	kfree(dev->vqs);
+	vduse_iova_domain_destroy(dev->domain);
+	vduse_dev_destroy(dev);
+
+	return 0;
+}
+
+static int vduse_create_dev(struct vduse_dev_config *config)
+{
+	int i, fd = -ENOMEM;
+	struct vduse_dev *dev;
+	char name[64];
+
+	if (vduse_find_dev(config->id))
+		return -EEXIST;
+
+	dev = vduse_dev_create();
+	if (!dev)
+		return -ENOMEM;
+
+	dev->id = config->id;
+	dev->device_id = config->device_id;
+	dev->vendor_id = config->vendor_id;
+	dev->domain = vduse_iova_domain_create(config->iova_size);
+	if (!dev->domain)
+		goto err_domain;
+
+	dev->vq_align = config->vq_align;
+	dev->vq_size_max = config->vq_size_max;
+	dev->vq_num = config->vq_num;
+	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
+	if (!dev->vqs)
+		goto err_vqs;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		dev->vqs[i].index = i;
+		spin_lock_init(&dev->vqs[i].kick_lock);
+		spin_lock_init(&dev->vqs[i].irq_lock);
+	}
+
+	snprintf(name, sizeof(name), "vduse-dev:%u", config->id);
+	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		goto err_fd;
+
+	dev->connected = true;
+	list_add(&dev->list, &vduse_devs);
+
+	return fd;
+err_fd:
+	kfree(dev->vqs);
+err_vqs:
+	vduse_iova_domain_destroy(dev->domain);
+err_domain:
+	vduse_dev_destroy(dev);
+	return fd;
+}
+
+static long vduse_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret;
+	void __user *argp = (void __user *)arg;
+
+	mutex_lock(&vduse_lock);
+	switch (cmd) {
+	case VDUSE_CREATE_DEV: {
+		struct vduse_dev_config config;
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, sizeof(config)))
+			break;
+
+		ret = vduse_create_dev(&config);
+		break;
+	}
+	case VDUSE_GET_DEV:
+		ret = vduse_get_dev(arg);
+		break;
+	case VDUSE_DESTROY_DEV:
+		ret = vduse_destroy_dev(arg);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vduse_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct miscdevice vduse_misc = {
+	.fops = &vduse_fops,
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vduse",
+};
+
+static int vduse_init(void)
+{
+	int ret;
+
+	ret = misc_register(&vduse_misc);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
+	vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
+	if (!vduse_vdpa_wq)
+		goto err_vdpa_wq;
+
+	ret = vduse_virqfd_init();
+	if (ret)
+		goto err_irqfd;
+
+	return 0;
+err_irqfd:
+	destroy_workqueue(vduse_vdpa_wq);
+err_vdpa_wq:
+	misc_deregister(&vduse_misc);
+	return ret;
+}
+module_init(vduse_init);
+
+static void vduse_exit(void)
+{
+	misc_deregister(&vduse_misc);
+	destroy_workqueue(vduse_vdpa_wq);
+	vduse_virqfd_exit();
+}
+module_exit(vduse_exit);
+
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
new file mode 100644
index 000000000000..855d2116b3a6
--- /dev/null
+++ b/include/uapi/linux/vduse.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_VDUSE_H_
+#define _UAPI_VDUSE_H_
+
+#include <linux/types.h>
+
+#define VDUSE_CONFIG_DATA_LEN	8
+
+enum vduse_req_type {
+	VDUSE_SET_VQ_STATE,
+	VDUSE_SET_FEATURES,
+	VDUSE_GET_FEATURES,
+	VDUSE_SET_STATUS,
+	VDUSE_GET_STATUS,
+	VDUSE_SET_CONFIG,
+	VDUSE_GET_CONFIG,
+};
+
+struct vduse_vq_state {
+	__u32 index;
+	__u32 num;
+	__u64 desc_addr;
+	__u64 driver_addr;
+	__u64 device_addr;
+	__u8 ready;
+};
+
+struct vduse_dev_config_data {
+	__u32 offset;
+	__u32 len;
+	__u8 data[VDUSE_CONFIG_DATA_LEN];
+};
+
+struct vduse_dev_request {
+	__u32 type;
+	__u32 unique;
+	__u32 flags;
+	__u32 size;
+	union {
+		struct vduse_vq_state vq_state;
+		struct vduse_dev_config_data config;
+		__u64 features;
+		__u8 status;
+	};
+};
+
+struct vduse_dev_response {
+	__u32 unique;
+	__s32 result;
+	union {
+		struct vduse_dev_config_data config;
+		__u64 features;
+		__u8 status;
+	};
+};
+
+/* ioctl */
+
+struct vduse_dev_config {
+	__u32 id;
+	__u32 vendor_id;
+	__u32 device_id;
+	__u64 iova_size;
+	__u16 vq_num;
+	__u16 vq_size_max;
+	__u32 vq_align;
+};
+
+struct vduse_vq_eventfd {
+	__u32 index;
+	__u32 fd;
+};
+
+#define VDUSE_BASE	'V'
+
+#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
+#define VDUSE_GET_DEV		_IO(VDUSE_BASE, 0x02)
+#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x03)
+
+#define VDUSE_DEV_START		_IO(VDUSE_BASE, 0x04)
+#define VDUSE_DEV_STOP		_IO(VDUSE_BASE, 0x05)
+#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
+#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x07, struct vduse_vq_eventfd)
+
+#endif /* _UAPI_VDUSE_H_ */
-- 
2.25.1
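
For readers following the control path in this patch: read() hands a
struct vduse_dev_request to userspace, and write() must return a
struct vduse_dev_response whose "unique" field matches, because
vduse_dev_write_iter() uses it to look up the pending message. Below is a
minimal userspace sketch of that reply step, not code from this series.
The structs are redeclared locally with <stdint.h> types so the sketch
builds without the patched headers; struct demo_dev and
demo_handle_request() are hypothetical names, and the 64-byte config
space is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define VDUSE_CONFIG_DATA_LEN 8

enum vduse_req_type {
	VDUSE_SET_VQ_STATE, VDUSE_SET_FEATURES, VDUSE_GET_FEATURES,
	VDUSE_SET_STATUS, VDUSE_GET_STATUS, VDUSE_SET_CONFIG, VDUSE_GET_CONFIG,
};

struct vduse_vq_state {
	uint32_t index, num;
	uint64_t desc_addr, driver_addr, device_addr;
	uint8_t ready;
};

struct vduse_dev_config_data {
	uint32_t offset, len;
	uint8_t data[VDUSE_CONFIG_DATA_LEN];
};

struct vduse_dev_request {
	uint32_t type, unique, flags, size;
	union {
		struct vduse_vq_state vq_state;
		struct vduse_dev_config_data config;
		uint64_t features;
		uint8_t status;
	};
};

struct vduse_dev_response {
	uint32_t unique;
	int32_t result;
	union {
		struct vduse_dev_config_data config;
		uint64_t features;
		uint8_t status;
	};
};

/* Hypothetical device state kept by the userspace daemon. */
struct demo_dev {
	uint64_t features;		/* features the device offers */
	uint64_t driver_features;	/* features the driver accepted */
	uint8_t status;
	uint8_t config[64];		/* assumed config space size */
};

/* Fill *resp for one request; the caller would then write() it back. */
void demo_handle_request(struct demo_dev *dev,
			 const struct vduse_dev_request *req,
			 struct vduse_dev_response *resp)
{
	const struct vduse_dev_config_data *cfg;

	assert(dev && req && resp);
	cfg = &req->config;
	memset(resp, 0, sizeof(*resp));
	resp->unique = req->unique;	/* echoed so the kernel can match it */

	switch (req->type) {
	case VDUSE_GET_FEATURES:
		resp->features = dev->features;
		break;
	case VDUSE_SET_FEATURES:
		dev->driver_features = req->features;
		break;
	case VDUSE_GET_STATUS:
		resp->status = dev->status;
		break;
	case VDUSE_SET_STATUS:
		dev->status = req->status;
		break;
	case VDUSE_GET_CONFIG:
		if (cfg->len > VDUSE_CONFIG_DATA_LEN ||
		    cfg->offset > sizeof(dev->config) - cfg->len) {
			resp->result = -EINVAL;
			break;
		}
		memcpy(resp->config.data, dev->config + cfg->offset, cfg->len);
		break;
	default:
		/* SET_CONFIG and SET_VQ_STATE are left out of this sketch. */
		resp->result = -EINVAL;
		break;
	}
}
```

A real daemon would wrap this in a loop that read()s one request from the
device fd, calls the handler, and write()s the response back.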




* [RFC 3/4] vduse: grab the module's references until there is no vduse device
  2020-10-19 14:56 [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2020-10-19 14:56 ` [RFC 1/4] mm: export zap_page_range() for driver use Xie Yongji
  2020-10-19 14:56 ` [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2020-10-19 14:56 ` Xie Yongji
  2020-10-19 15:05   ` Michael S. Tsirkin
  2020-10-19 14:56 ` [RFC 4/4] vduse: Add memory shrinker to reclaim bounce pages Xie Yongji
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Xie Yongji @ 2020-10-19 14:56 UTC (permalink / raw)
  To: mst, jasowang, akpm; +Cc: linux-mm, virtualization

The module must not be unloaded while any vduse device exists.
Take a reference on the module when a vduse device is created
and hold it until the device is destroyed.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
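The pairing described in the commit message can be modelled in plain
userspace C. This is only an illustrative sketch: demo_module_refs
stands in for the module's reference count, and the demo_* functions are
hypothetical stand-ins for vduse_create_dev(), vduse_destroy_dev() and
the unload check done by the module loader.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the module's reference count. */
static unsigned int demo_module_refs;

/* Models vduse_create_dev(): the reference is taken only on success. */
int demo_create_dev(bool setup_ok)
{
	if (!setup_ok)
		return -1;		/* error path: no reference taken */
	demo_module_refs++;		/* plays the role of __module_get() */
	return 0;
}

/* Models vduse_destroy_dev(): drops the reference taken at creation. */
void demo_destroy_dev(void)
{
	assert(demo_module_refs > 0);
	demo_module_refs--;		/* plays the role of module_put() */
}

/* Unloading is only allowed once every device has been destroyed. */
bool demo_can_unload(void)
{
	return demo_module_refs == 0;
}
```

The point of the model is that the get sits after the last failure point
of creation, so the error paths need no matching put.
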
 drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 6787ba66725c..f04aa02de8c1 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
 	kfree(dev->vqs);
 	vduse_iova_domain_destroy(dev->domain);
 	vduse_dev_destroy(dev);
+	module_put(THIS_MODULE);
 
 	return 0;
 }
@@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
 
 	dev->connected = true;
 	list_add(&dev->list, &vduse_devs);
+	__module_get(THIS_MODULE);
 
 	return fd;
 err_fd:
-- 
2.25.1




* [RFC 4/4] vduse: Add memory shrinker to reclaim bounce pages
  2020-10-19 14:56 [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (2 preceding siblings ...)
  2020-10-19 14:56 ` [RFC 3/4] vduse: grab the module's references until there is no vduse device Xie Yongji
@ 2020-10-19 14:56 ` Xie Yongji
  2020-10-19 17:16 ` [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
  2020-10-20  3:20 ` Jason Wang
  5 siblings, 0 replies; 28+ messages in thread
From: Xie Yongji @ 2020-10-19 14:56 UTC (permalink / raw)
  To: mst, jasowang, akpm; +Cc: linux-mm, virtualization

Add a shrinker that reclaims pages used by the bounce buffer to
relieve memory pressure. Reclaim proceeds chunk by chunk, and
only iova chunks that currently have no users are reclaimed.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
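The "only reclaim chunks that no one uses" rule rests on the per-chunk
state counter added below: allocators increment it, and the reclaimer
swaps it from 0 to INT_MIN so that concurrent allocators see a negative
value and back off. The following is a userspace model of that scheme
using C11 atomics, offered as a sketch rather than the kernel code;
struct demo_chunk and the demo_chunk_* helpers are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-chunk state word. */
struct demo_chunk {
	atomic_int state;	/* >= 0: number of users; < 0: being reclaimed */
};

/* Allocation side: take a reference unless a reclaim is in progress. */
bool demo_chunk_get(struct demo_chunk *c)
{
	if (atomic_fetch_add(&c->state, 1) >= 0)
		return true;
	atomic_fetch_sub(&c->state, 1);	/* undo: the reclaimer owns the chunk */
	return false;
}

/* Free side: drop a reference taken by a successful demo_chunk_get(). */
void demo_chunk_put(struct demo_chunk *c)
{
	int old = atomic_fetch_sub(&c->state, 1);

	assert(old > 0);	/* a holder always sees a positive count */
}

/* Reclaim side: succeeds only when the chunk has no users at all. */
bool demo_chunk_reclaim_begin(struct demo_chunk *c)
{
	int idle = 0;

	return atomic_compare_exchange_strong(&c->state, &idle, INT_MIN);
}

/*
 * Release the chunk after reclaim. Subtracting INT_MIN (instead of storing
 * 0) preserves any transient increment from a racing demo_chunk_get();
 * C11 atomic arithmetic on signed types wraps, so this is well defined.
 */
void demo_chunk_reclaim_end(struct demo_chunk *c)
{
	atomic_fetch_sub(&c->state, INT_MIN);
}
```

In this model demo_chunk_get() and demo_chunk_put() correspond to the
counter updates in vduse_domain_alloc_iova() and vduse_domain_free_iova(),
and the reclaim pair corresponds to vduse_domain_reclaim().
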
 drivers/vdpa/vdpa_user/iova_domain.c | 83 ++++++++++++++++++++++++++--
 drivers/vdpa/vdpa_user/iova_domain.h | 10 ++++
 drivers/vdpa/vdpa_user/vduse_dev.c   | 51 +++++++++++++++++
 3 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index a274f78f00d2..9e3d4686de4f 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -30,6 +30,8 @@ struct vduse_mmap_vma {
 	struct list_head list;
 };
 
+struct percpu_counter vduse_total_bounce_pages;
+
 static inline struct page *
 vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
 				unsigned long iova)
@@ -49,6 +51,13 @@ vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
 	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
 	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
 
+	if (page) {
+		domain->chunks[index].used_bounce_pages++;
+		percpu_counter_inc(&vduse_total_bounce_pages);
+	} else {
+		domain->chunks[index].used_bounce_pages--;
+		percpu_counter_dec(&vduse_total_bounce_pages);
+	}
 	domain->chunks[index].bounce_pages[pgindex] = page;
 }
 
@@ -159,6 +168,29 @@ void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
 	spin_unlock(&domain->map_lock);
 }
 
+static bool vduse_domain_try_unmap(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size)
+{
+	struct vduse_mmap_vma *mmap_vma;
+	unsigned long uaddr;
+	bool unmap = true;
+
+	mutex_lock(&domain->vma_lock);
+	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+		if (!mmap_read_trylock(mmap_vma->vma->vm_mm)) {
+			unmap = false;
+			break;
+		}
+
+		uaddr = iova + mmap_vma->vma->vm_start;
+		zap_page_range(mmap_vma->vma, uaddr, size);
+		mmap_read_unlock(mmap_vma->vma->vm_mm);
+	}
+	mutex_unlock(&domain->vma_lock);
+
+	return unmap;
+}
+
 void vduse_domain_unmap(struct vduse_iova_domain *domain,
 			unsigned long iova, size_t size)
 {
@@ -284,6 +316,32 @@ bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
 	return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
 }
 
+int vduse_domain_reclaim(struct vduse_iova_domain *domain)
+{
+	struct vduse_iova_chunk *chunk;
+	int i, freed = 0;
+
+	for (i = domain->chunk_num - 1; i >= 0; i--) {
+		chunk = &domain->chunks[i];
+		if (!chunk->used_bounce_pages)
+			continue;
+
+		if (atomic_cmpxchg(&chunk->state, 0, INT_MIN) != 0)
+			continue;
+
+		if (!vduse_domain_try_unmap(domain,
+				chunk->start, IOVA_CHUNK_SIZE)) {
+			atomic_sub(INT_MIN, &chunk->state);
+			break;
+		}
+		freed += vduse_domain_free_bounce_pages(domain,
+				chunk->start, IOVA_CHUNK_SIZE);
+		atomic_sub(INT_MIN, &chunk->state);
+	}
+
+	return freed;
+}
+
 unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
 					size_t size, enum iova_map_type type)
 {
@@ -301,10 +359,13 @@ unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
 		if (atomic_read(&chunk->map_type) != type)
 			continue;
 
-		iova = gen_pool_alloc_algo(chunk->pool, size,
+		if (atomic_fetch_inc(&chunk->state) >= 0) {
+			iova = gen_pool_alloc_algo(chunk->pool, size,
 					gen_pool_first_fit_align, &data);
-		if (iova)
-			break;
+			if (iova)
+				break;
+		}
+		atomic_dec(&chunk->state);
 	}
 
 	return iova;
@@ -317,6 +378,7 @@ void vduse_domain_free_iova(struct vduse_iova_domain *domain,
 	struct vduse_iova_chunk *chunk = &domain->chunks[index];
 
 	gen_pool_free(chunk->pool, iova, size);
+	atomic_dec(&chunk->state);
 }
 
 static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk)
@@ -332,7 +394,8 @@ void vduse_iova_domain_destroy(struct vduse_iova_domain *domain)
 
 	for (i = 0; i < domain->chunk_num; i++) {
 		chunk = &domain->chunks[i];
-		vduse_domain_free_bounce_pages(domain,
+		if (chunk->used_bounce_pages)
+			vduse_domain_free_bounce_pages(domain,
 					chunk->start, IOVA_CHUNK_SIZE);
 		vduse_iova_chunk_cleanup(chunk);
 	}
@@ -365,8 +428,10 @@ static int vduse_iova_chunk_init(struct vduse_iova_chunk *chunk,
 	if (!chunk->bounce_pages)
 		goto err;
 
+	chunk->used_bounce_pages = 0;
 	chunk->start = addr;
 	atomic_set(&chunk->map_type, TYPE_NONE);
+	atomic_set(&chunk->state, 0);
 
 	return 0;
 err:
@@ -411,3 +476,13 @@ struct vduse_iova_domain *vduse_iova_domain_create(size_t size)
 
 	return NULL;
 }
+
+int vduse_domain_init(void)
+{
+	return percpu_counter_init(&vduse_total_bounce_pages, 0, GFP_KERNEL);
+}
+
+void vduse_domain_exit(void)
+{
+	percpu_counter_destroy(&vduse_total_bounce_pages);
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index 7ae60c0e50ec..016f84d4bef2 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -24,8 +24,10 @@ enum iova_map_type {
 struct vduse_iova_chunk {
 	struct gen_pool *pool;
 	struct page **bounce_pages;
+	int used_bounce_pages;
 	unsigned long start;
 	atomic_t map_type;
+	atomic_t state;
 };
 
 struct vduse_iova_domain {
@@ -45,6 +47,8 @@ struct vduse_iova_map {
 	enum dma_data_direction dir;
 };
 
+extern struct percpu_counter vduse_total_bounce_pages;
+
 int vduse_domain_add_vma(struct vduse_iova_domain *domain,
 				struct vm_area_struct *vma);
 
@@ -78,6 +82,8 @@ int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
 bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
 				unsigned long iova);
 
+int vduse_domain_reclaim(struct vduse_iova_domain *domain);
+
 unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
 					size_t size, enum iova_map_type type);
 
@@ -91,4 +97,8 @@ void vduse_iova_domain_destroy(struct vduse_iova_domain *domain);
 
 struct vduse_iova_domain *vduse_iova_domain_create(size_t size);
 
+int vduse_domain_init(void);
+
+void vduse_domain_exit(void);
+
 #endif /* _VDUSE_IOVA_DOMAIN_H */
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index f04aa02de8c1..1163209ffff3 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -977,6 +977,43 @@ static long vduse_ioctl(struct file *file, unsigned int cmd,
 	return ret;
 }
 
+static unsigned long vduse_shrink_scan(struct shrinker *shrinker,
+					struct shrink_control *sc)
+{
+	unsigned long freed = 0;
+	struct vduse_dev *dev;
+
+	if (!mutex_trylock(&vduse_lock))
+		return SHRINK_STOP;
+
+	list_for_each_entry(dev, &vduse_devs, list) {
+		if (!dev->domain)
+			continue;
+
+		freed = vduse_domain_reclaim(dev->domain);
+		if (!freed)
+			continue;
+
+		list_move_tail(&dev->list, &vduse_devs);
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return freed ? freed : SHRINK_STOP;
+}
+
+static unsigned long vduse_shrink_count(struct shrinker *shrink,
+					struct shrink_control *sc)
+{
+	return percpu_counter_read_positive(&vduse_total_bounce_pages);
+}
+
+static struct shrinker vduse_bounce_pages_shrinker = {
+	.count_objects = vduse_shrink_count,
+	.scan_objects = vduse_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+};
+
 static const struct file_operations vduse_fops = {
 	.owner		= THIS_MODULE,
 	.unlocked_ioctl	= vduse_ioctl,
@@ -1007,7 +1044,19 @@ static int vduse_init(void)
 	if (ret)
 		goto err_irqfd;
 
+	ret = vduse_domain_init();
+	if (ret)
+		goto err_domain;
+
+	ret = register_shrinker(&vduse_bounce_pages_shrinker);
+	if (ret)
+		goto err_shrinker;
+
 	return 0;
+err_shrinker:
+	vduse_domain_exit();
+err_domain:
+	vduse_virqfd_exit();
 err_irqfd:
 	destroy_workqueue(vduse_vdpa_wq);
 err_vdpa_wq:
@@ -1018,8 +1067,10 @@ module_init(vduse_init);
 
 static void vduse_exit(void)
 {
+	unregister_shrinker(&vduse_bounce_pages_shrinker);
 	misc_deregister(&vduse_misc);
 	destroy_workqueue(vduse_vdpa_wq);
+	vduse_domain_exit();
 	vduse_virqfd_exit();
 }
 module_exit(vduse_exit);
-- 
2.25.1
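
Editorial note: the chunk->state counter in the patch above appears to act as a small gate between IOVA allocation and reclaim. A non-negative value counts live allocations, and the reclaimer swaps 0 for INT_MIN so that concurrent allocators see a negative value and back off. The userspace sketch below models that idea with C11 atomics. The names (struct chunk, chunk_try_get and so on) are hypothetical and are not part of the patch; it is an illustration under those assumptions, not the driver's implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the 'state' field of struct vduse_iova_chunk. */
struct chunk {
	atomic_int state;	/* >= 0: live allocations, < 0: reclaim running */
};

static void chunk_init(struct chunk *c)
{
	atomic_init(&c->state, 0);
}

/* Allocation side: take a reference unless a reclaim is in progress. */
static bool chunk_try_get(struct chunk *c)
{
	if (atomic_fetch_add(&c->state, 1) >= 0)
		return true;
	/* A reclaimer owns the chunk: undo the increment and back off. */
	atomic_fetch_sub(&c->state, 1);
	return false;
}

/* Free side: drop the reference taken by chunk_try_get(). */
static void chunk_put(struct chunk *c)
{
	atomic_fetch_sub(&c->state, 1);
}

/* Reclaim side: succeeds only when no allocation is outstanding. */
static bool chunk_reclaim_begin(struct chunk *c)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&c->state, &expected, INT_MIN);
}

/* Reopen the chunk; transient increments from losers are preserved. */
static void chunk_reclaim_end(struct chunk *c)
{
	assert(atomic_load(&c->state) < 0);
	atomic_fetch_sub(&c->state, INT_MIN);
}
```

Subtracting INT_MIN rather than storing 0 keeps any increment from an allocator that raced with the reclaimer and has not yet undone it, which seems to be why the patch uses atomic_sub(INT_MIN, ...) instead of atomic_set().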



* Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
  2020-10-19 14:56 ` [RFC 3/4] vduse: grab the module's references until there is no vduse device Xie Yongji
@ 2020-10-19 15:05   ` Michael S. Tsirkin
  2020-10-19 15:44     ` [External] " 谢永吉
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 15:05 UTC (permalink / raw)
  To: Xie Yongji; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
> The module should not be unloaded if any vduse device exists.
> So increase the module's reference count when creating vduse
> device. And the reference count is kept until the device is
> destroyed.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 6787ba66725c..f04aa02de8c1 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
>  	kfree(dev->vqs);
>  	vduse_iova_domain_destroy(dev->domain);
>  	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
>  
>  	return 0;
>  }
> @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
>  
>  	dev->connected = true;
>  	list_add(&dev->list, &vduse_devs);
> +	__module_get(THIS_MODULE);
>  
>  	return fd;
>  err_fd:

This kind of thing is usually an indicator of a bug. E.g.
if the refcount drops to 0 on module_put(THIS_MODULE) it
will be unloaded and the following return will not run.



> -- 
> 2.25.1



* Re: [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 14:56 ` [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2020-10-19 15:08   ` Michael S. Tsirkin
  2020-10-19 15:24     ` Randy Dunlap
  2020-10-19 15:48     ` 谢永吉
  0 siblings, 2 replies; 28+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 15:08 UTC (permalink / raw)
  To: Xie Yongji; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 10:56:21PM +0800, Xie Yongji wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..855d2116b3a6
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,85 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_CONFIG_DATA_LEN	8
> +
> +enum vduse_req_type {
> +	VDUSE_SET_VQ_STATE,
> +	VDUSE_SET_FEATURES,
> +	VDUSE_GET_FEATURES,
> +	VDUSE_SET_STATUS,
> +	VDUSE_GET_STATUS,
> +	VDUSE_SET_CONFIG,
> +	VDUSE_GET_CONFIG,
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index;
> +	__u32 num;
> +	__u64 desc_addr;
> +	__u64 driver_addr;
> +	__u64 device_addr;
> +	__u8 ready;
> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset;
> +	__u32 len;
> +	__u8 data[VDUSE_CONFIG_DATA_LEN];
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type;
> +	__u32 unique;
> +	__u32 flags;
> +	__u32 size;
> +	union {
> +		struct vduse_vq_state vq_state;
> +		struct vduse_dev_config_data config;
> +		__u64 features;
> +		__u8 status;
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 unique;
> +	__s32 result;
> +	union {
> +		struct vduse_dev_config_data config;
> +		__u64 features;
> +		__u8 status;
> +	};
> +};
> +
> +/* ioctl */
> +
> +struct vduse_dev_config {
> +	__u32 id;
> +	__u32 vendor_id;
> +	__u32 device_id;
> +	__u64 iova_size;
> +	__u16 vq_num;
> +	__u16 vq_size_max;
> +	__u32 vq_align;
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index;
> +	__u32 fd;
> +};
> +
> +#define VDUSE_BASE	'V'
> +
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> +#define VDUSE_GET_DEV		_IO(VDUSE_BASE, 0x02)
> +#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x03)
> +
> +#define VDUSE_DEV_START		_IO(VDUSE_BASE, 0x04)
> +#define VDUSE_DEV_STOP		_IO(VDUSE_BASE, 0x05)
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
> +#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x07, struct vduse_vq_eventfd)
> +
> +#endif /* _UAPI_VDUSE_H_ */


Could we see some documentation about the user interface of this module please?

> -- 
> 2.25.1



* Re: [RFC 1/4] mm: export zap_page_range() for driver use
  2020-10-19 14:56 ` [RFC 1/4] mm: export zap_page_range() for driver use Xie Yongji
@ 2020-10-19 15:14   ` Matthew Wilcox
  2020-10-19 15:36     ` [External] " 谢永吉
  0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2020-10-19 15:14 UTC (permalink / raw)
  To: Xie Yongji; +Cc: mst, jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 10:56:20PM +0800, Xie Yongji wrote:
> Export zap_page_range() for use in VDUSE.

I think you're missing a lot of MMU notifier work by calling this
directly.  It probably works in every scenario you've tested, but won't
work for others.  I see you're using VM_MIXEDMAP -- would it make sense
to use VM_PFNMAP instead and use zap_vma_ptes()?  Or would it make sense
to change zap_vma_ptes() to handle VM_MIXEDMAP as well as VM_PFNMAP?



* Re: [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 15:08   ` Michael S. Tsirkin
@ 2020-10-19 15:24     ` Randy Dunlap
  2020-10-19 15:46       ` [External] " 谢永吉
  2020-10-19 15:48     ` 谢永吉
  1 sibling, 1 reply; 28+ messages in thread
From: Randy Dunlap @ 2020-10-19 15:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, Xie Yongji; +Cc: jasowang, akpm, linux-mm, virtualization

On 10/19/20 8:08 AM, Michael S. Tsirkin wrote:
> On Mon, Oct 19, 2020 at 10:56:21PM +0800, Xie Yongji wrote:
>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h

>> +#define VDUSE_BASE	'V'
>> +
> 
> Could we see some documentation about the user interface of this module please?
> 

Also, the VDUSE_BASE value should be documented in
Documentation/userspace-api/ioctl/ioctl-number.rst.

thanks.
-- 
~Randy
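
Editorial note: an entry in ioctl-number.rst records the code byte, the sequence-number range and the header that defines the commands. The sketch below decomposes the proposed VDUSE commands with the standard _IOC_* macros from <linux/ioctl.h> to show where those values come from. The two structures are copied from the quoted header with <stdint.h> types; the helpers vduse_cmd_nr() and vduse_cmd_payload_size() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <stdint.h>

struct vduse_dev_config {
	uint32_t id, vendor_id, device_id;
	uint64_t iova_size;
	uint16_t vq_num, vq_size_max;
	uint32_t vq_align;
};

struct vduse_vq_eventfd {
	uint32_t index, fd;
};

#define VDUSE_BASE		'V'

#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
#define VDUSE_GET_DEV		_IO(VDUSE_BASE, 0x02)
#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x03)
#define VDUSE_DEV_START		_IO(VDUSE_BASE, 0x04)
#define VDUSE_DEV_STOP		_IO(VDUSE_BASE, 0x05)
#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x07, struct vduse_vq_eventfd)

/* Sequence number of a VDUSE command; checks the code byte first. */
static unsigned int vduse_cmd_nr(unsigned int cmd)
{
	assert(_IOC_TYPE(cmd) == (unsigned int)VDUSE_BASE);
	return _IOC_NR(cmd);
}

/* Size of the argument structure encoded in the command, 0 for _IO(). */
static unsigned int vduse_cmd_payload_size(unsigned int cmd)
{
	return _IOC_SIZE(cmd);
}
```

With the header as posted, the sequence numbers run from 0x01 to 0x07 under code 'V', which is the range a table entry would need to state.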



* Re: [External] Re: [RFC 1/4] mm: export zap_page_range() for driver use
  2020-10-19 15:14   ` Matthew Wilcox
@ 2020-10-19 15:36     ` 谢永吉
  0 siblings, 0 replies; 28+ messages in thread
From: 谢永吉 @ 2020-10-19 15:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael S. Tsirkin, jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 11:14 PM Matthew Wilcox <willy@infradead.org> wrote:

> On Mon, Oct 19, 2020 at 10:56:20PM +0800, Xie Yongji wrote:
> > Export zap_page_range() for use in VDUSE.
>
> I think you're missing a lot of MMU notifier work by calling this
> directly.  It probably works in every scenario you've tested, but won't
> work for others.  I see you're using VM_MIXEDMAP -- would it make sense
> to use VM_PFNMAP instead and use zap_vma_ptes()?


But I don't see any difference between zap_vma_ptes() and zap_page_range()
with respect to MMU notifiers. Could you give me more details?

Thanks,
Yongji


> Or would it make sense
> to change zap_vma_ptes() to handle VM_MIXEDMAP as well as VM_PFNMAP?
>
>

* Re: [External] Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
  2020-10-19 15:05   ` Michael S. Tsirkin
@ 2020-10-19 15:44     ` 谢永吉
  2020-10-19 15:47       ` Michael S. Tsirkin
  0 siblings, 1 reply; 28+ messages in thread
From: 谢永吉 @ 2020-10-19 15:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 11:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
> > The module should not be unloaded if any vduse device exists.
> > So increase the module's reference count when creating vduse
> > device. And the reference count is kept until the device is
> > destroyed.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index 6787ba66725c..f04aa02de8c1 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
> >       kfree(dev->vqs);
> >       vduse_iova_domain_destroy(dev->domain);
> >       vduse_dev_destroy(dev);
> > +     module_put(THIS_MODULE);
> >
> >       return 0;
> >  }
> > @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
> >
> >       dev->connected = true;
> >       list_add(&dev->list, &vduse_devs);
> > +     __module_get(THIS_MODULE);
> >
> >       return fd;
> >  err_fd:
>
> This kind of thing is usually an indicator of a bug. E.g.
> if the refcount drops to 0 on module_put(THIS_MODULE) it
> will be unloaded and the following return will not run.
>
>
Can this actually happen? The refcount should only drop to 0 after the
misc device has been closed, shouldn't it?

Thanks,
Yongji

>
>

> > --
> > 2.25.1
>
>

* Re: [External] Re: [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 15:24     ` Randy Dunlap
@ 2020-10-19 15:46       ` 谢永吉
  0 siblings, 0 replies; 28+ messages in thread
From: 谢永吉 @ 2020-10-19 15:46 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Michael S. Tsirkin, jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 11:24 PM Randy Dunlap <rdunlap@infradead.org> wrote:

> On 10/19/20 8:08 AM, Michael S. Tsirkin wrote:
> > On Mon, Oct 19, 2020 at 10:56:21PM +0800, Xie Yongji wrote:
> >> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>
> >> +#define VDUSE_BASE  'V'
> >> +
> >
> > Could we see some documentation about the user interface of this module please?
> >
>
> Also, the VDUSE_BASE value should be documented in
> Documentation/userspace-api/ioctl/ioctl-number.rst.
>

Thanks for the reminder. Will add it.

Thanks,
Yongji

> thanks.
> --
> ~Randy
>
>

* Re: [External] Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
  2020-10-19 15:44     ` [External] " 谢永吉
@ 2020-10-19 15:47       ` Michael S. Tsirkin
  2020-10-19 15:56         ` 谢永吉
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 15:47 UTC (permalink / raw)
  To: 谢永吉; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 11:44:36PM +0800, 谢永吉 wrote:
> 
> 
> On Mon, Oct 19, 2020 at 11:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
>     > The module should not be unloaded if any vduse device exists.
>     > So increase the module's reference count when creating vduse
>     > device. And the reference count is kept until the device is
>     > destroyed.
>     >
>     > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>     > ---
>     >  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
>     >  1 file changed, 2 insertions(+)
>     >
>     > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>     > index 6787ba66725c..f04aa02de8c1 100644
>     > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>     > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>     > @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
>     >       kfree(dev->vqs);
>     >       vduse_iova_domain_destroy(dev->domain);
>     >       vduse_dev_destroy(dev);
>     > +     module_put(THIS_MODULE);
>     > 
>     >       return 0;
>     >  }
>     > @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
>     > 
>     >       dev->connected = true;
>     >       list_add(&dev->list, &vduse_devs);
>     > +     __module_get(THIS_MODULE);
>     > 
>     >       return fd;
>     >  err_fd:
> 
>     This kind of thing is usually an indicator of a bug. E.g.
>     if the refcount drops to 0 on module_put(THIS_MODULE) it
>     will be unloaded and the following return will not run.
> 
> 
> 
> Should this happen?  The refcount should be only decreased to 0 after the
> misc_device is closed?
> 
> Thanks,
> Yongji
> 

OTOH if it never drops to 0 anyway then why do you need to increase it?

> 
> 
>     > --
>     > 2.25.1
> 
> 



* Re: [External] Re: [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 15:08   ` Michael S. Tsirkin
  2020-10-19 15:24     ` Randy Dunlap
@ 2020-10-19 15:48     ` 谢永吉
  1 sibling, 0 replies; 28+ messages in thread
From: 谢永吉 @ 2020-10-19 15:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 11:09 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Oct 19, 2020 at 10:56:21PM +0800, Xie Yongji wrote:
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > new file mode 100644
> > index 000000000000..855d2116b3a6
> > --- /dev/null
> > +++ b/include/uapi/linux/vduse.h
> > @@ -0,0 +1,85 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _UAPI_VDUSE_H_
> > +#define _UAPI_VDUSE_H_
> > +
> > +#include <linux/types.h>
> > +
> > +#define VDUSE_CONFIG_DATA_LEN        8
> > +
> > +enum vduse_req_type {
> > +     VDUSE_SET_VQ_STATE,
> > +     VDUSE_SET_FEATURES,
> > +     VDUSE_GET_FEATURES,
> > +     VDUSE_SET_STATUS,
> > +     VDUSE_GET_STATUS,
> > +     VDUSE_SET_CONFIG,
> > +     VDUSE_GET_CONFIG,
> > +};
> > +
> > +struct vduse_vq_state {
> > +     __u32 index;
> > +     __u32 num;
> > +     __u64 desc_addr;
> > +     __u64 driver_addr;
> > +     __u64 device_addr;
> > +     __u8 ready;
> > +};
> > +
> > +struct vduse_dev_config_data {
> > +     __u32 offset;
> > +     __u32 len;
> > +     __u8 data[VDUSE_CONFIG_DATA_LEN];
> > +};
> > +
> > +struct vduse_dev_request {
> > +     __u32 type;
> > +     __u32 unique;
> > +     __u32 flags;
> > +     __u32 size;
> > +     union {
> > +             struct vduse_vq_state vq_state;
> > +             struct vduse_dev_config_data config;
> > +             __u64 features;
> > +             __u8 status;
> > +     };
> > +};
> > +
> > +struct vduse_dev_response {
> > +     __u32 unique;
> > +     __s32 result;
> > +     union {
> > +             struct vduse_dev_config_data config;
> > +             __u64 features;
> > +             __u8 status;
> > +     };
> > +};
> > +
> > +/* ioctl */
> > +
> > +struct vduse_dev_config {
> > +     __u32 id;
> > +     __u32 vendor_id;
> > +     __u32 device_id;
> > +     __u64 iova_size;
> > +     __u16 vq_num;
> > +     __u16 vq_size_max;
> > +     __u32 vq_align;
> > +};
> > +
> > +struct vduse_vq_eventfd {
> > +     __u32 index;
> > +     __u32 fd;
> > +};
> > +
> > +#define VDUSE_BASE   'V'
> > +
> > +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> > +#define VDUSE_GET_DEV                _IO(VDUSE_BASE, 0x02)
> > +#define VDUSE_DESTROY_DEV    _IO(VDUSE_BASE, 0x03)
> > +
> > +#define VDUSE_DEV_START              _IO(VDUSE_BASE, 0x04)
> > +#define VDUSE_DEV_STOP               _IO(VDUSE_BASE, 0x05)
> > +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
> > +#define VDUSE_VQ_SETUP_IRQFD _IOW(VDUSE_BASE, 0x07, struct vduse_vq_eventfd)
> > +
> > +#endif /* _UAPI_VDUSE_H_ */
>
>
> Could we see some documentation about the user interface of this module please?
>
>
Sure. Will do it!

Thanks,
Yongji


> > --
> > 2.25.1
>
>

* Re: [External] Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
  2020-10-19 15:47       ` Michael S. Tsirkin
@ 2020-10-19 15:56         ` 谢永吉
  2020-10-19 16:41           ` Michael S. Tsirkin
  0 siblings, 1 reply; 28+ messages in thread
From: 谢永吉 @ 2020-10-19 15:56 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 11:47 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Oct 19, 2020 at 11:44:36PM +0800, 谢永吉 wrote:
> >
> >
> > On Mon, Oct 19, 2020 at 11:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> >     On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
> >     > The module should not be unloaded if any vduse device exists.
> >     > So increase the module's reference count when creating vduse
> >     > device. And the reference count is kept until the device is
> >     > destroyed.
> >     >
> >     > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >     > ---
> >     >  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
> >     >  1 file changed, 2 insertions(+)
> >     >
> >     > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >     > index 6787ba66725c..f04aa02de8c1 100644
> >     > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >     > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >     > @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
> >     >       kfree(dev->vqs);
> >     >       vduse_iova_domain_destroy(dev->domain);
> >     >       vduse_dev_destroy(dev);
> >     > +     module_put(THIS_MODULE);
> >     >
> >     >       return 0;
> >     >  }
> >     > @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
> >     >
> >     >       dev->connected = true;
> >     >       list_add(&dev->list, &vduse_devs);
> >     > +     __module_get(THIS_MODULE);
> >     >
> >     >       return fd;
> >     >  err_fd:
> >
> >     This kind of thing is usually an indicator of a bug. E.g.
> >     if the refcount drops to 0 on module_put(THIS_MODULE) it
> >     will be unloaded and the following return will not run.
> >
> >
> >
> > Should this happen?  The refcount should be only decreased to 0 after the
> > misc_device is closed?
> >
> > Thanks,
> > Yongji
> >
>
> OTOH if it never drops to 0 anyway then why do you need to increase it?
>
>
To prevent the module from being unloaded when a device has been created but
no user process is using it (e.g. the user process crashed).

Thanks,
Yongji

>
> >
> >     > --
> >     > 2.25.1
> >
> >
>
>

* Re: [External] Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
  2020-10-19 15:56         ` 谢永吉
@ 2020-10-19 16:41           ` Michael S. Tsirkin
  2020-10-20  7:42             ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 16:41 UTC (permalink / raw)
  To: 谢永吉; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 11:56:35PM +0800, 谢永吉 wrote:
> 
> 
> 
> On Mon, Oct 19, 2020 at 11:47 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     On Mon, Oct 19, 2020 at 11:44:36PM +0800, 谢永吉 wrote:
>     >
>     >
>     > On Mon, Oct 19, 2020 at 11:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>     >
>     >     On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
>     >     > The module should not be unloaded if any vduse device exists.
>     >     > So increase the module's reference count when creating vduse
>     >     > device. And the reference count is kept until the device is
>     >     > destroyed.
>     >     >
>     >     > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>     >     > ---
>     >     >  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
>     >     >  1 file changed, 2 insertions(+)
>     >     >
>     >     > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>     >     > index 6787ba66725c..f04aa02de8c1 100644
>     >     > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>     >     > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>     >     > @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
>     >     >       kfree(dev->vqs);
>     >     >       vduse_iova_domain_destroy(dev->domain);
>     >     >       vduse_dev_destroy(dev);
>     >     > +     module_put(THIS_MODULE);
>     >     > 
>     >     >       return 0;
>     >     >  }
>     >     > @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
>     >     > 
>     >     >       dev->connected = true;
>     >     >       list_add(&dev->list, &vduse_devs);
>     >     > +     __module_get(THIS_MODULE);
>     >     > 
>     >     >       return fd;
>     >     >  err_fd:
>     >
>     >     This kind of thing is usually an indicator of a bug. E.g.
>     >     if the refcount drops to 0 on module_put(THIS_MODULE) it
>     >     will be unloaded and the following return will not run.
>     >
>     >
>     >
>     > Should this happen?  The refcount should be only decreased to 0 after the
>     > misc_device is closed?
>     >
>     > Thanks,
>     > Yongji
>     >
> 
>     OTOH if it never drops to 0 anyway then why do you need to increase it?
> 
> 
> 
> To prevent unloading the module in the case that the device is created, but no
> user process using it (e.g. the user process crashed). 
> 
> Thanks,
> Yongji

Looks like it can drop to 0 if that is the case then?


> 
>     >
>     >
>     >     > --
>     >     > 2.25.1
>     >
>     >
> 
> 



* Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 14:56 [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (3 preceding siblings ...)
  2020-10-19 14:56 ` [RFC 4/4] vduse: Add memory shrinker to reclaim bounce pages Xie Yongji
@ 2020-10-19 17:16 ` Michael S. Tsirkin
  2020-10-20  2:18   ` [External] " 谢永吉
  2020-10-20  3:20 ` Jason Wang
  5 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 17:16 UTC (permalink / raw)
  To: Xie Yongji; +Cc: jasowang, akpm, linux-mm, virtualization

On Mon, Oct 19, 2020 at 10:56:19PM +0800, Xie Yongji wrote:
> This series introduces a framework, which can be used to implement
> vDPA Devices in a userspace program. To implement it, the work
> consists of two parts: control path emulating and data path offloading.
> 
> In the control path, the VDUSE driver will make use of message
> mechanism to forward the actions (get/set features, get/set status,
> get/set config space and set virtqueue states) from virtio-vdpa
> driver to userspace. Userspace can use read()/write() to
> receive/reply to those control messages.
> 
> In the data path, the VDUSE driver implements a MMU-based
> on-chip IOMMU driver which supports both direct mapping and
> indirect mapping with bounce buffer. Then userspace can access
> those iova space via mmap(). Besides, eventfd mechanism is used to
> trigger interrupts and forward virtqueue kicks.
> 
> The details and our use case are shown below:
> 
> ------------------------     -----------------------------------------------------------
> |                  APP |     |                          QEMU                           |
> |       ---------      |     | --------------------    -------------------+<-->+------ |
> |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
> ------------+-----------     -----------+-----------------------+-----------------+-----
>             |                           |                       |                 |
>             |                           | emulating             | offloading      |
> ------------+---------------------------+-----------------------+-----------------+------
> |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
> |    -------+--------           --------+--------        +------+-------     -----+---- |
> |           |                           |                |      |                 |     |
> |           |                           |                |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> |           |                           |                |      |                 |     |
> |           |                           ------------------      |                 |     |
> |           -----------------------------------------------------              ---+---  |
> ------------------------------------------------------------------------------ | NIC |---
>                                                                                ---+---
>                                                                                   |
>                                                                          ---------+---------
>                                                                          | Remote Storages |
>                                                                          -------------------
> We make use of it to implement a block device connecting to
> our distributed storage, which can be used in containers and
> bare metal.

What is not exactly clear is what the APP above is doing.

Taking virtio blk requests and sending them over the network
in some proprietary way?

> Compared with the qemu-nbd solution, this solution has
> higher performance, and we can have a unified technology stack
> in VM and containers for remote storages.
> 
> To test it with a host disk (e.g. /dev/sdx):
> 
>   $ qemu-storage-daemon \
>       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>       --monitor chardev=charmonitor \
>       --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/sdx,node-name=disk0 \
>       --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128
> 
> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
> 
> Future work:
>   - Improve performance (e.g. zero copy implementation in datapath)
>   - Config interrupt support
>   - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)


How does this driver compare with vhost-user-blk (which doesn't need kernel support)?



> Xie Yongji (4):
>   mm: export zap_page_range() for driver use
>   vduse: Introduce VDUSE - vDPA Device in Userspace
>   vduse: grab the module's references until there is no vduse device
>   vduse: Add memory shrinker to reclaim bounce pages
> 
>  drivers/vdpa/Kconfig                 |    8 +
>  drivers/vdpa/Makefile                |    1 +
>  drivers/vdpa/vdpa_user/Makefile      |    5 +
>  drivers/vdpa/vdpa_user/eventfd.c     |  221 ++++++
>  drivers/vdpa/vdpa_user/eventfd.h     |   48 ++
>  drivers/vdpa/vdpa_user/iova_domain.c |  488 ++++++++++++
>  drivers/vdpa/vdpa_user/iova_domain.h |  104 +++
>  drivers/vdpa/vdpa_user/vduse.h       |   66 ++
>  drivers/vdpa/vdpa_user/vduse_dev.c   | 1081 ++++++++++++++++++++++++++
>  include/uapi/linux/vduse.h           |   85 ++
>  mm/memory.c                          |    1 +
>  11 files changed, 2108 insertions(+)
>  create mode 100644 drivers/vdpa/vdpa_user/Makefile
>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>  create mode 100644 include/uapi/linux/vduse.h
> 
> -- 
> 2.25.1




* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 17:16 ` [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
@ 2020-10-20  2:18   ` 谢永吉
  2020-10-20  2:20     ` Jason Wang
  0 siblings, 1 reply; 28+ messages in thread
From: 谢永吉 @ 2020-10-20  2:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: jasowang, akpm, linux-mm, virtualization

On Tue, Oct 20, 2020 at 1:16 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Oct 19, 2020 at 10:56:19PM +0800, Xie Yongji wrote:
> > This series introduces a framework, which can be used to implement
> > vDPA Devices in a userspace program. To implement it, the work
> > consist of two parts: control path emulating and data path offloading.
> >
> > In the control path, the VDUSE driver will make use of message
> > mechnism to forward the actions (get/set features, get/st status,
> > get/set config space and set virtqueue states) from virtio-vdpa
> > driver to userspace. Userspace can use read()/write() to
> > receive/reply to those control messages.
> >
> > In the data path, the VDUSE driver implements a MMU-based
> > on-chip IOMMU driver which supports both direct mapping and
> > indirect mapping with bounce buffer. Then userspace can access
> > those iova space via mmap(). Besides, eventfd mechnism is used to
> > trigger interrupts and forward virtqueue kicks.
> >
> > The details and our user case is shown below:
> >
> > ------------------------     -----------------------------------------------------------
> > |                  APP |     |                          QEMU                           |
> > |       ---------      |     | --------------------    -------------------+<-->+------ |
> > |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
> > ------------+-----------     -----------+-----------------------+-----------------+-----
> >             |                           |                       |                 |
> >             |                           | emulating             | offloading      |
> > ------------+---------------------------+-----------------------+-----------------+------
> > |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
> > |    -------+--------           --------+--------        +------+-------     -----+---- |
> > |           |                           |                |      |                 |     |
> > |           |                           |                |      |                 |     |
> > | ----------+----------       ----------+-----------     |      |                 |     |
> > | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
> > | ----------+----------       ----------+-----------     |      |                 |     |
> > |           |                           |                |      |                 |     |
> > |           |                           ------------------      |                 |     |
> > |           -----------------------------------------------------              ---+---  |
> > ------------------------------------------------------------------------------ | NIC |---
> >                                                                                ---+---
> >                                                                                   |
> >                                                                          ---------+---------
> >                                                                          | Remote Storages |
> >                                                                          -------------------
> > We make use of it to implement a block device connecting to
> > our distributed storage, which can be used in containers and
> > bare metal.
>
> What is not exactly clear is what is the APP above doing.
>
> Taking virtio blk requests and sending them over the network
> in some proprietary way?
>
>
No, the APP doesn't need to know the details of virtio-blk. Maybe replacing
"APP" with "Container" here would be clearer. Our purpose is to make virtio
devices available to containers and bare metal, so that we can reuse the
VM's technology stack to provide services, e.g. SPDK's remote bdev, ovs-dpdk
and so on.


> > Compared with qemu-nbd solution, this solution has
> > higher performance, and we can have an unified technology stack
> > in VM and containers for remote storages.
> >
> > To test it with a host disk (e.g. /dev/sdx):
> >
> >   $ qemu-storage-daemon \
> >       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
> >       --monitor chardev=charmonitor \
> >       --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/sdx,node-name=disk0 \
> >       --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128
> >
> > The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
> >
> > Future work:
> >   - Improve performance (e.g. zero copy implementation in datapath)
> >   - Config interrupt support
> >   - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
>
>
> How does this driver compare with vhost-user-blk (which doesn't need
> kernel support)?
>
>
We want to implement a block device rather than a virtio-blk dataplane. And
with this driver's help, the vhost-user-blk process could provide storage
service to all APPs in the host.

Thanks,
Yongji


* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-20  2:18   ` [External] " 谢永吉
@ 2020-10-20  2:20     ` Jason Wang
  2020-10-20  2:28       ` 谢永吉
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2020-10-20  2:20 UTC (permalink / raw)
  To: 谢永吉, Michael S. Tsirkin
  Cc: akpm, linux-mm, virtualization


On 2020/10/20 上午10:18, 谢永吉 wrote:
>
>
>
>     How does this driver compare with vhost-user-blk (which doesn't
>     need kernel support)?
>
>
> We want to implement a block device rather than a virtio-blk 
> dataplane. And with this driver's help, the vhost-user-blk process 
> could provide storage service to all APPs in the host.
>
> Thanks,
> Yongji


I guess the point is that, with the help of VDUSE, besides vhost-vDPA
for VMs, you can have a kernel virtio interface through virtio-vdpa, which
cannot be done with vhost-user-blk.

Thanks




* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-20  2:20     ` Jason Wang
@ 2020-10-20  2:28       ` 谢永吉
  0 siblings, 0 replies; 28+ messages in thread
From: 谢永吉 @ 2020-10-20  2:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, akpm, linux-mm, virtualization

On Tue, Oct 20, 2020 at 10:21 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/10/20 上午10:18, 谢永吉 wrote:
> >
> >
> >
> >     How does this driver compare with vhost-user-blk (which doesn't
> >     need kernel support)?
> >
> >
> > We want to implement a block device rather than a virtio-blk
> > dataplane. And with this driver's help, the vhost-user-blk process
> > could provide storage service to all APPs in the host.
> >
> > Thanks,
> > Yongji
>
>
> I guess the point is that, with the help of VDUSE, besides vhost-vDPA
> for VM, you can have a kernel virtio interface through virtio-vdpa which
> can not be done in vhost-user-blk.
>

Exactly correct!

 Thanks,
Yongji


* Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 14:56 [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (4 preceding siblings ...)
  2020-10-19 17:16 ` [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
@ 2020-10-20  3:20 ` Jason Wang
  2020-10-20  7:39   ` [External] " Yongji Xie
  5 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2020-10-20  3:20 UTC (permalink / raw)
  To: Xie Yongji, mst, akpm; +Cc: linux-mm, virtualization


On 2020/10/19 下午10:56, Xie Yongji wrote:
> This series introduces a framework, which can be used to implement
> vDPA Devices in a userspace program. To implement it, the work
> consist of two parts: control path emulating and data path offloading.
>
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the actions (get/set features, get/st status,
> get/set config space and set virtqueue states) from virtio-vdpa
> driver to userspace. Userspace can use read()/write() to
> receive/reply to those control messages.
>
> In the data path, the VDUSE driver implements a MMU-based
> on-chip IOMMU driver which supports both direct mapping and
> indirect mapping with bounce buffer. Then userspace can access
> those iova space via mmap(). Besides, eventfd mechnism is used to
> trigger interrupts and forward virtqueue kicks.


This is pretty interesting!

For vhost-vdpa, it should work, but for virtio-vdpa, I think we should
carefully deal with the IOMMU/DMA ops.

I notice that neither dma_map nor set_map is implemented in
vduse_vdpa_config_ops, which means you want to let vhost-vDPA deal
with the IOMMU domains.  Any reason for doing that?

The reasons for the question are:

1) You've implemented an on-chip IOMMU driver but don't expose it to
the generic IOMMU layer (or the generic IOMMU layer may need some
extension to support this)
2) We will probably remove the IOMMU domain management in vhost-vDPA
and move it to the device (parent).

So if possible, please implement either set_map() or
dma_map()/dma_unmap(); this may align with our future goal and may speed
up the development.

Btw, it would be helpful to give even more details on how the on-chip
IOMMU driver is implemented.

>
> The details and our user case is shown below:
>
> ------------------------     -----------------------------------------------------------
> |                  APP |     |                          QEMU                           |
> |       ---------      |     | --------------------    -------------------+<-->+------ |
> |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
> ------------+-----------     -----------+-----------------------+-----------------+-----
>              |                           |                       |                 |
>              |                           | emulating             | offloading      |
> ------------+---------------------------+-----------------------+-----------------+------
> |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
> |    -------+--------           --------+--------        +------+-------     -----+---- |
> |           |                           |                |      |                 |     |
> |           |                           |                |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> |           |                           |                |      |                 |     |
> |           |                           ------------------      |                 |     |
> |           -----------------------------------------------------              ---+---  |
> ------------------------------------------------------------------------------ | NIC |---
>                                                                                 ---+---
>                                                                                    |
>                                                                           ---------+---------
>                                                                           | Remote Storages |
>                                                                           -------------------


The figure is not very clear to me on the following points:

1) if the device emulation and the virtio dataplane are both implemented in
QEMU, what's the point of doing this? I thought the device should be a
remote process?
2) it would be better to draw a vDPA bus somewhere to help people
understand the architecture
3) for the "offloading", I guess it should be done via vhost-vDPA, so
it's better to draw a vhost-vDPA block there


> We make use of it to implement a block device connecting to
> our distributed storage, which can be used in containers and
> bare metal. Compared with qemu-nbd solution, this solution has
> higher performance, and we can have an unified technology stack
> in VM and containers for remote storages.
>
> To test it with a host disk (e.g. /dev/sdx):
>
>    $ qemu-storage-daemon \
>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>        --monitor chardev=charmonitor \
>        --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/sdx,node-name=disk0 \
>        --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128
>
> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
>
> Future work:
>    - Improve performance (e.g. zero copy implementation in datapath)
>    - Config interrupt support
>    - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)


Right, a library will be very useful.

Thanks


>
> Xie Yongji (4):
>    mm: export zap_page_range() for driver use
>    vduse: Introduce VDUSE - vDPA Device in Userspace
>    vduse: grab the module's references until there is no vduse device
>    vduse: Add memory shrinker to reclaim bounce pages
>
>   drivers/vdpa/Kconfig                 |    8 +
>   drivers/vdpa/Makefile                |    1 +
>   drivers/vdpa/vdpa_user/Makefile      |    5 +
>   drivers/vdpa/vdpa_user/eventfd.c     |  221 ++++++
>   drivers/vdpa/vdpa_user/eventfd.h     |   48 ++
>   drivers/vdpa/vdpa_user/iova_domain.c |  488 ++++++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h |  104 +++
>   drivers/vdpa/vdpa_user/vduse.h       |   66 ++
>   drivers/vdpa/vdpa_user/vduse_dev.c   | 1081 ++++++++++++++++++++++++++
>   include/uapi/linux/vduse.h           |   85 ++
>   mm/memory.c                          |    1 +
>   11 files changed, 2108 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>




* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-20  3:20 ` Jason Wang
@ 2020-10-20  7:39   ` Yongji Xie
  2020-10-20  8:01     ` Jason Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2020-10-20  7:39 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, akpm, linux-mm, virtualization

On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/10/19 下午10:56, Xie Yongji wrote:
> > This series introduces a framework, which can be used to implement
> > vDPA Devices in a userspace program. To implement it, the work
> > consist of two parts: control path emulating and data path offloading.
> >
> > In the control path, the VDUSE driver will make use of message
> > mechnism to forward the actions (get/set features, get/st status,
> > get/set config space and set virtqueue states) from virtio-vdpa
> > driver to userspace. Userspace can use read()/write() to
> > receive/reply to those control messages.
> >
> > In the data path, the VDUSE driver implements a MMU-based
> > on-chip IOMMU driver which supports both direct mapping and
> > indirect mapping with bounce buffer. Then userspace can access
> > those iova space via mmap(). Besides, eventfd mechnism is used to
> > trigger interrupts and forward virtqueue kicks.
>
>
> This is pretty interesting!
>
> For vhost-vdpa, it should work, but for virtio-vdpa, I think we should
> carefully deal with the IOMMU/DMA ops stuffs.


>
> I notice that neither dma_map nor set_map is implemented in
> vduse_vdpa_config_ops, this means you want to let vhost-vDPA to deal
> with IOMMU domains stuffs.  Any reason for doing that?
>
>
Actually, this series only focuses on the virtio-vdpa case for now. To support
vhost-vdpa, as you said, we need to implement dma_map/dma_unmap. But there
is a limitation: the VM's memory can't be anonymous pages, which are forbidden
in vm_insert_page(). Maybe we need to add some limits on vhost-vdpa?


> The reason for the questions are:
>
> 1) You've implemented a on-chip IOMMU driver but don't expose it to
> generic IOMMU layer (or generic IOMMU layer may need some extension to
> support this)
> 2) We will probably remove the IOMMU domain management in vhost-vDPA,
> and move it to the device(parent).
>
> So if it's possible, please implement either set_map() or
> dma_map()/dma_unmap(), this may align with our future goal and may speed
> up the development.
>
> Btw, it would be helpful to give even more details on how the on-chip
> IOMMU driver in implemented.
>

The basic idea is to treat the MMU (VA->PA) as an IOMMU (IOVA->PA), using
vm_insert_page()/zap_page_range() to do address mapping/unmapping. The
address mapping is done in the page fault handler because vm_insert_page()
can't be called in atomic context such as dma_map_ops->map_page().
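
To make the split between the two paths concrete, here is a small userspace
model in C. It is only a sketch and not the driver code: every name in it
(model_domain, model_map_page() and so on) is made up for illustration, and
the "page table" is a plain array. The map path only records IOVA->PA, as it
may run in atomic context; the fault path installs the mapping on first
access, standing in for vm_insert_page(); the unmap path drops it, standing
in for zap_page_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_NR_PAGES   16	/* size of the toy IOVA space, in pages */

/* One slot per IOVA page; all names here are hypothetical. */
struct model_entry {
	uintptr_t pa;		/* "physical" page recorded by the map path */
	bool recorded;		/* set in the (atomic) DMA-map path */
	bool installed;		/* set in the (sleepable) fault path */
};

struct model_domain {
	struct model_entry e[MODEL_NR_PAGES];
};

/* Atomic-context side: only record IOVA -> PA, install nothing yet. */
static int model_map_page(struct model_domain *d, uint64_t iova, uintptr_t pa)
{
	size_t idx = iova >> MODEL_PAGE_SHIFT;

	if (idx >= MODEL_NR_PAGES || d->e[idx].recorded)
		return -1;
	d->e[idx].pa = pa;
	d->e[idx].recorded = true;
	d->e[idx].installed = false;
	return 0;
}

/* Fault-handler side: install the mapping on first userspace access. */
static int model_fault(struct model_domain *d, uint64_t iova, uintptr_t *pa)
{
	size_t idx = iova >> MODEL_PAGE_SHIFT;

	if (idx >= MODEL_NR_PAGES || !d->e[idx].recorded)
		return -1;	/* nothing mapped at this IOVA */
	d->e[idx].installed = true;
	*pa = d->e[idx].pa;
	return 0;
}

/* Unmap side: forget the entry together with any installed mapping. */
static void model_unmap_page(struct model_domain *d, uint64_t iova)
{
	size_t idx = iova >> MODEL_PAGE_SHIFT;

	if (idx < MODEL_NR_PAGES)
		d->e[idx] = (struct model_entry){ 0 };
}
```

In this model a fault on an IOVA that was never mapped, or that was already
unmapped, simply fails; how the real driver handles that case is not covered
here.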

>
> > The details and our user case is shown below:
> >
> > ------------------------     -----------------------------------------------------------
> > |                  APP |     |                          QEMU                           |
> > |       ---------      |     | --------------------    -------------------+<-->+------ |
> > |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
> > ------------+-----------     -----------+-----------------------+-----------------+-----
> >             |                           |                       |                 |
> >             |                           | emulating             | offloading      |
> > ------------+---------------------------+-----------------------+-----------------+------
> > |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
> > |    -------+--------           --------+--------        +------+-------     -----+---- |
> > |           |                           |                |      |                 |     |
> > |           |                           |                |      |                 |     |
> > | ----------+----------       ----------+-----------     |      |                 |     |
> > | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
> > | ----------+----------       ----------+-----------     |      |                 |     |
> > |           |                           |                |      |                 |     |
> > |           |                           ------------------      |                 |     |
> > |           -----------------------------------------------------              ---+---  |
> > ------------------------------------------------------------------------------ | NIC |---
> >                                                                                ---+---
> >                                                                                   |
> >                                                                          ---------+---------
> >                                                                          | Remote Storages |
> >                                                                          -------------------
>
>
> The figure is not very clear to me in the following points:
>
> 1) if the device emulation and virtio dataplane is all implemented in
> QEMU, what's the point of doing this? I thought the device should be a
> remove process?
> 2) it would be better to draw a vDPA bus somewhere to help people to
> understand the architecture
> 3) for the "offloading" I guess it should be done virtio vhost-vDPA, so
> it's better to draw a vhost-vDPA block there
>
>
This figure only shows the virtio-vdpa case. I will take the vhost-vdpa case
into consideration in the next version.

Thanks,
Yongji


* Re: [External] Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
  2020-10-19 16:41           ` Michael S. Tsirkin
@ 2020-10-20  7:42             ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2020-10-20  7:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, akpm, linux-mm, virtualization

On Tue, Oct 20, 2020 at 12:41 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Oct 19, 2020 at 11:56:35PM +0800, 谢永吉 wrote:
> >
> >
> >
> > On Mon, Oct 19, 2020 at 11:47 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> >     On Mon, Oct 19, 2020 at 11:44:36PM +0800, 谢永吉 wrote:
> >     >
> >     >
> >     > On Mon, Oct 19, 2020 at 11:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >     >
> >     >     On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
> >     >     > The module should not be unloaded if any vduse device exists.
> >     >     > So increase the module's reference count when creating vduse
> >     >     > device. And the reference count is kept until the device is
> >     >     > destroyed.
> >     >     >
> >     >     > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >     >     > ---
> >     >     >  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
> >     >     >  1 file changed, 2 insertions(+)
> >     >     >
> >     >     > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >     >     > index 6787ba66725c..f04aa02de8c1 100644
> >     >     > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >     >     > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >     >     > @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
> >     >     >       kfree(dev->vqs);
> >     >     >       vduse_iova_domain_destroy(dev->domain);
> >     >     >       vduse_dev_destroy(dev);
> >     >     > +     module_put(THIS_MODULE);
> >     >     >
> >     >     >       return 0;
> >     >     >  }
> >     >     > @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
> >     >     >
> >     >     >       dev->connected = true;
> >     >     >       list_add(&dev->list, &vduse_devs);
> >     >     > +     __module_get(THIS_MODULE);
> >     >     >
> >     >     >       return fd;
> >     >     >  err_fd:
> >     >
> >     >     This kind of thing is usually an indicator of a bug. E.g.
> >     >     if the refcount drops to 0 on module_put(THIS_MODULE) it
> >     >     will be unloaded and the following return will not run.
> >     >
> >     >
> >     >
> >     > Should this happen?  The refcount should be only decreased to 0 after the
> >     > misc_device is closed?
> >     >
> >     > Thanks,
> >     > Yongji
> >     >
> >
> >     OTOH if it never drops to 0 anyway then why do you need to increase it?
> >
> >
> >
> > To prevent unloading the module in the case that the device is created, but no
> > user process using it (e.g. the user process crashed).
> >
> > Thanks,
> > Yongji
>
> Looks like it can drop to 0 if that is the case then?
>
Could you give me an example?  In my understanding, vduse_create_dev()
should be called only after we open the chardev, which will grab the
module's reference.
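
To spell out the counting I have in mind, here is a toy userspace model in
C. It is only a sketch under two assumptions that the patch itself does not
show: that opening the chardev pins the module (as happens when the
file_operations sets .owner), and that both the create and destroy paths are
reachable only through an already open chardev fd. The names
(model_module, model_create_dev() and so on) are invented for the example
and are not kernel APIs.

```c
#include <assert.h>

/* Toy stand-in for the module refcount; not a kernel API. */
struct model_module {
	int refcnt;
};

/* open() on the chardev: assumed to pin the owning module. */
static void model_chardev_open(struct model_module *m)
{
	m->refcnt++;
}

/* Last close of the chardev fd (also what a crashed process ends up doing). */
static void model_chardev_release(struct model_module *m)
{
	m->refcnt--;
}

/* Create path: runs on an open fd, then takes one more reference
 * (the __module_get() in the patch). Returns the resulting count. */
static int model_create_dev(struct model_module *m)
{
	assert(m->refcnt >= 1);
	return ++m->refcnt;
}

/* Destroy path: drops the device's reference (the module_put() in the
 * patch) while the caller's open fd still holds its own. */
static int model_destroy_dev(struct model_module *m)
{
	assert(m->refcnt >= 2);
	return --m->refcnt;
}
```

Walking through open, create, close (crash), open, destroy, close gives the
counts 1, 2, 1, 2, 1, 0: under these assumptions the count seen inside the
destroy path never reaches 0, and it only drops to 0 on the final close.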

Thanks,
Yongji


* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-20  7:39   ` [External] " Yongji Xie
@ 2020-10-20  8:01     ` Jason Wang
  2020-10-20  8:35       ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2020-10-20  8:01 UTC (permalink / raw)
  To: Yongji Xie; +Cc: Michael S. Tsirkin, akpm, linux-mm, virtualization


On 2020/10/20 下午3:39, Yongji Xie wrote:
>
>
> On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
>     On 2020/10/19 下午10:56, Xie Yongji wrote:
>     > This series introduces a framework, which can be used to implement
>     > vDPA Devices in a userspace program. To implement it, the work
>     > consist of two parts: control path emulating and data path
>     offloading.
>     >
>     > In the control path, the VDUSE driver will make use of message
>     > mechnism to forward the actions (get/set features, get/st status,
>     > get/set config space and set virtqueue states) from virtio-vdpa
>     > driver to userspace. Userspace can use read()/write() to
>     > receive/reply to those control messages.
>     >
>     > In the data path, the VDUSE driver implements a MMU-based
>     > on-chip IOMMU driver which supports both direct mapping and
>     > indirect mapping with bounce buffer. Then userspace can access
>     > those iova space via mmap(). Besides, eventfd mechnism is used to
>     > trigger interrupts and forward virtqueue kicks.
>
>
>     This is pretty interesting!
>
>     For vhost-vdpa, it should work, but for virtio-vdpa, I think we
>     should
>     carefully deal with the IOMMU/DMA ops stuffs.
>
>
>     I notice that neither dma_map nor set_map is implemented in
>     vduse_vdpa_config_ops, this means you want to let vhost-vDPA to deal
>     with IOMMU domains stuffs.  Any reason for doing that?
>
> Actually, this series only focus on virtio-vdpa case now. To support 
> vhost-vdpa,  as you said, we need to implement dma_map/dma_unmap. But 
> there is a limit that vm's memory can't be anonymous pages which are 
> forbidden in vm_insert_page(). Maybe we need to add some limits on 
> vhost-vdpa?


I'm not sure I get this. Is there any reason that you want to use
vm_insert_page() on the VM's memory? Or do you mean you want to implement
some kind of zero-copy?

I guess the software device implementation in user space only needs
to receive IOVA ranges and map them into its own address space.
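
Something like the following is what I mean, as a rough C sketch. The
structure and function names (iova_region, iova_to_va()) are hypothetical,
and how the ranges actually reach userspace is left out: the device keeps a
table of the IOVA ranges it was told about, each with the local address it
mapped that range at, and translates every descriptor address through the
table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One IOVA range the device was told about, plus where it is mapped
 * locally. Hypothetical layout, for illustration only. */
struct iova_region {
	uint64_t iova;	/* start of the range in IOVA space */
	uint64_t size;	/* length of the range in bytes */
	void *va;	/* local address the range is mapped at */
};

/* Translate [iova, iova + len) to a local pointer, or NULL if the
 * buffer is not fully contained in a single known range. */
static void *iova_to_va(const struct iova_region *tbl, size_t n,
			uint64_t iova, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		const struct iova_region *r = &tbl[i];

		/* Written this way so that iova + len cannot overflow. */
		if (iova >= r->iova && len <= r->size &&
		    iova - r->iova <= r->size - len)
			return (char *)r->va + (iova - r->iova);
	}
	return NULL;
}
```

A buffer that straddles the end of a range is rejected rather than split;
a real implementation would have to decide how to handle that case.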


>     The reason for the questions are:
>
>     1) You've implemented a on-chip IOMMU driver but don't expose it to
>     generic IOMMU layer (or generic IOMMU layer may need some
>     extension to
>     support this)
>     2) We will probably remove the IOMMU domain management in vhost-vDPA,
>     and move it to the device(parent).
>
>     So if it's possible, please implement either set_map() or
>     dma_map()/dma_unmap(), this may align with our future goal and may
>     speed
>     up the development.
>
>     Btw, it would be helpful to give even more details on how the on-chip
>     IOMMU driver in implemented.
>
>
> The basic idea is treating MMU (VA->PA) as IOMMU (IOVA->PA). And using 
> vm_insert_page()/zap_page_range() to do address mapping/unmapping. And 
> the address mapping will be done in page fault handler because 
> vm_insert_page() can't be called in atomic_context such 
> as dma_map_ops->map_page().


Ok, please add it in the cover letter or patch 2 in the next version.


>
>     >
>     > The details and our user case is shown below:
>     >
>     > ------------------------     -----------------------------------------------------------
>     > |                  APP |     |                          QEMU                           |
>     > |       ---------      |     | --------------------    -------------------+<-->+------ |
>     > |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
>     > ------------+-----------     -----------+-----------------------+-----------------+-----
>     >             |                           |                       |                 |
>     >             |                           | emulating             | offloading      |
>     > ------------+---------------------------+-----------------------+-----------------+------
>     > |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
>     > |    -------+--------           --------+--------        +------+-------     -----+---- |
>     > |           |                           |                |      |                 |     |
>     > |           |                           |                |      |                 |     |
>     > | ----------+----------       ----------+-----------     |      |                 |     |
>     > | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
>     > | ----------+----------       ----------+-----------     |      |                 |     |
>     > |           |                           |                |      |                 |     |
>     > |           |                           ------------------      |                 |     |
>     > |           -----------------------------------------------------              ---+---  |
>     > ------------------------------------------------------------------------------ | NIC |---
>     >                                                                                ---+---
>     >                                                                                   |
>     >                                                                          ---------+---------
>     >                                                                          | Remote Storages |
>     >                                                                          -------------------
>
>
>     The figure is not very clear to me in the following points:
>
>     1) if the device emulation and virtio dataplane is all implemented in
>     QEMU, what's the point of doing this? I thought the device should
>     be a
>     remove process?
>
>     2) it would be better to draw a vDPA bus somewhere to help people to
>     understand the architecture
>     3) for the "offloading" I guess it should be done virtio
>     vhost-vDPA, so
>     it's better to draw a vhost-vDPA block there
>
>
> This figure only shows virtio-vdpa case, I will take vhost-vdpa case 
> into consideration in next version.


Please do that, otherwise this proposal is incomplete.

Thanks


>
> Thanks,
> Yongji




* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-20  8:01     ` Jason Wang
@ 2020-10-20  8:35       ` Yongji Xie
  2020-10-20  9:12         ` Jason Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2020-10-20  8:35 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, akpm, linux-mm, virtualization


On Tue, Oct 20, 2020 at 4:01 PM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/10/20 下午3:39, Yongji Xie wrote:
> >
> >
> > On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com
> > <mailto:jasowang@redhat.com>> wrote:
> >
> >
> >     On 2020/10/19 下午10:56, Xie Yongji wrote:
> >     > This series introduces a framework, which can be used to implement
> >     > vDPA Devices in a userspace program. To implement it, the work
> >     > consist of two parts: control path emulating and data path
> >     offloading.
> >     >
> >     > In the control path, the VDUSE driver will make use of message
> >     > mechnism to forward the actions (get/set features, get/st status,
> >     > get/set config space and set virtqueue states) from virtio-vdpa
> >     > driver to userspace. Userspace can use read()/write() to
> >     > receive/reply to those control messages.
> >     >
> >     > In the data path, the VDUSE driver implements a MMU-based
> >     > on-chip IOMMU driver which supports both direct mapping and
> >     > indirect mapping with bounce buffer. Then userspace can access
> >     > those iova space via mmap(). Besides, eventfd mechnism is used to
> >     > trigger interrupts and forward virtqueue kicks.
> >
> >
> >     This is pretty interesting!
> >
> >     For vhost-vdpa, it should work, but for virtio-vdpa, I think we
> >     should
> >     carefully deal with the IOMMU/DMA ops stuffs.
> >
> >
> >     I notice that neither dma_map nor set_map is implemented in
> >     vduse_vdpa_config_ops, this means you want to let vhost-vDPA to deal
> >     with IOMMU domains stuffs.  Any reason for doing that?
> >
> > Actually, this series only focus on virtio-vdpa case now. To support
> > vhost-vdpa,  as you said, we need to implement dma_map/dma_unmap. But
> > there is a limit that vm's memory can't be anonymous pages which are
> > forbidden in vm_insert_page(). Maybe we need to add some limits on
> > vhost-vdpa?
>
>
> I'm not sure I get this, any reason that you want to use
> vm_insert_page() to VM's memory. Or do you mean you want to implement
> some kind of zero-copy?


>

If my understanding is right, the vhost-vdpa case has a QEMU (VM) process and
a separate device emulation process, right? When I/O happens, the virtio
driver in the VM puts the IOVA into the vring and the device emulation
process reads the IOVA from the vring. The device emulation process then
translates the IOVA to its own VA to access the DMA buffer, which resides in
the VM's memory. That means the device emulation process needs to access the
VM's memory, so we should use vm_insert_page() to build the page table of
the device emulation process.
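
For illustration, the IOVA-to-VA step described above could look like the
minimal sketch below. The struct iova_region table and iova_to_va() are
hypothetical names, not part of this series. How the regions end up mapped
into the device emulation process (the open question here) is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: one contiguous IOVA range that the device
 * emulation process has already mapped at local address `va`. */
struct iova_region {
    uint64_t iova;  /* start of the range in IOVA space */
    uint64_t size;  /* length of the range in bytes */
    void    *va;    /* local mapping of the start of the range */
};

/* Translate [iova, iova + len) to a local VA.
 * Returns NULL unless the whole buffer lies inside a single region. */
static void *iova_to_va(const struct iova_region *tbl, size_t n,
                        uint64_t iova, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        const struct iova_region *r = &tbl[i];

        /* Written so that no addition can overflow. */
        if (iova >= r->iova && len <= r->size &&
            iova - r->iova <= r->size - len)
            return (char *)r->va + (iova - r->iova);
    }
    return NULL;
}
```

A descriptor taken from the vring would be passed through iova_to_va()
before the buffer is touched; a NULL result means the buffer is not (fully)
mapped and the request has to be failed or deferred.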

> I guess from the software device implementation in user space it only needs
> to receive IOVA ranges and map them in its own address space.


How do we map them into its own address space if we don't use vm_insert_page()?


>

> >     The reason for the questions are:
> >
> >     1) You've implemented a on-chip IOMMU driver but don't expose it to
> >     generic IOMMU layer (or generic IOMMU layer may need some
> >     extension to
> >     support this)
> >     2) We will probably remove the IOMMU domain management in vhost-vDPA,
> >     and move it to the device(parent).
> >
> >     So if it's possible, please implement either set_map() or
> >     dma_map()/dma_unmap(), this may align with our future goal and may
> >     speed
> >     up the development.
> >
> >     Btw, it would be helpful to give even more details on how the on-chip
> >     IOMMU driver in implemented.
> >
> >
> > The basic idea is treating MMU (VA->PA) as IOMMU (IOVA->PA). And using
> > vm_insert_page()/zap_page_range() to do address mapping/unmapping. And
> > the address mapping will be done in page fault handler because
> > vm_insert_page() can't be called in atomic_context such
> > as dma_map_ops->map_page().
>
>
> Ok, please add it in the cover letter or patch 2 in the next version.
>

> >
> >     >
> >     > The details and our user case is shown below:
> >     >
> >     > [Figure: architecture diagram. Labels: APP (dev/vdx); QEMU (device
> >     > emulation, virtio dataplane <--> BDS); block device; vduse driver;
> >     > vdpa device; TCP/IP; virtio-blk driver; virtio-vdpa driver; NIC;
> >     > Remote Storages. Edge labels: emulating, offloading.]
> >
> >
> >     The figure is not very clear to me in the following points:
> >
> >     1) if the device emulation and virtio dataplane is all implemented in
> >     QEMU, what's the point of doing this? I thought the device should
> >     be a
> >     remove process?
> >
> >     2) it would be better to draw a vDPA bus somewhere to help people to
> >     understand the architecture
> >     3) for the "offloading" I guess it should be done virtio
> >     vhost-vDPA, so
> >     it's better to draw a vhost-vDPA block there
> >
> >
> > This figure only shows virtio-vdpa case, I will take vhost-vdpa case
> > into consideration in next version.
>
>
> Please do that, otherwise this proposal is incomplete.
>
>
Sure.

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-20  8:35       ` Yongji Xie
@ 2020-10-20  9:12         ` Jason Wang
  2020-10-23  2:55           ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Wang @ 2020-10-20  9:12 UTC (permalink / raw)
  To: Yongji Xie; +Cc: Michael S. Tsirkin, akpm, linux-mm, virtualization


On 2020/10/20 下午4:35, Yongji Xie wrote:
>
>
> On Tue, Oct 20, 2020 at 4:01 PM Jason Wang <jasowang@redhat.com 
> <mailto:jasowang@redhat.com>> wrote:
>
>
>     On 2020/10/20 下午3:39, Yongji Xie wrote:
>     >
>     >
>     > On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com
>     <mailto:jasowang@redhat.com>
>     > <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>> wrote:
>     >
>     >
>     >     On 2020/10/19 下午10:56, Xie Yongji wrote:
>     >     > This series introduces a framework, which can be used to
>     implement
>     >     > vDPA Devices in a userspace program. To implement it, the work
>     >     > consist of two parts: control path emulating and data path
>     >     offloading.
>     >     >
>     >     > In the control path, the VDUSE driver will make use of message
>     >     > mechnism to forward the actions (get/set features, get/st
>     status,
>     >     > get/set config space and set virtqueue states) from
>     virtio-vdpa
>     >     > driver to userspace. Userspace can use read()/write() to
>     >     > receive/reply to those control messages.
>     >     >
>     >     > In the data path, the VDUSE driver implements a MMU-based
>     >     > on-chip IOMMU driver which supports both direct mapping and
>     >     > indirect mapping with bounce buffer. Then userspace can access
>     >     > those iova space via mmap(). Besides, eventfd mechnism is
>     used to
>     >     > trigger interrupts and forward virtqueue kicks.
>     >
>     >
>     >     This is pretty interesting!
>     >
>     >     For vhost-vdpa, it should work, but for virtio-vdpa, I think we
>     >     should
>     >     carefully deal with the IOMMU/DMA ops stuffs.
>     >
>     >
>     >     I notice that neither dma_map nor set_map is implemented in
>     >     vduse_vdpa_config_ops, this means you want to let vhost-vDPA
>     to deal
>     >     with IOMMU domains stuffs.  Any reason for doing that?
>     >
>     > Actually, this series only focus on virtio-vdpa case now. To
>     support
>     > vhost-vdpa,  as you said, we need to implement
>     dma_map/dma_unmap. But
>     > there is a limit that vm's memory can't be anonymous pages which
>     are
>     > forbidden in vm_insert_page(). Maybe we need to add some limits on
>     > vhost-vdpa?
>
>
>     I'm not sure I get this, any reason that you want to use
>     vm_insert_page() to VM's memory. Or do you mean you want to implement
>     some kind of zero-copy? 
>
>
>
> If my understanding is right, we will have a QEMU (VM) process and a 
> device emulation process in the vhost-vdpa case, right? When I/O 
> happens, the virtio driver in VM will put the IOVA to vring and device 
> emulation process will get the IOVA from vring. Then the device 
> emulation process will translate the IOVA to its VA to access the dma 
> buffer which resides in VM's memory. That means the device emulation 
> process needs to access VM's memory, so we should use vm_insert_page() 
> to build the page table of the device emulation process.


Ok, I get you now. So it looks to me that the real issue is not the
limitation on anonymous pages; see the comment above vm_insert_page():

"

  * The page has to be a nice clean _individual_ kernel allocation.
"

So I doubt that using vm_insert_page() to share pages between
processes is legal. We need input from MM experts.

Thanks


>
>     I guess from the software device implemention in user space it
>     only need
>     to receive IOVA ranges and map them in its own address space.
>
>
> How to map them in its own address space if we don't use vm_insert_page()?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-20  9:12         ` Jason Wang
@ 2020-10-23  2:55           ` Yongji Xie
  2020-10-23  8:44             ` Jason Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2020-10-23  2:55 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, akpm, linux-mm, virtualization


On Tue, Oct 20, 2020 at 5:13 PM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/10/20 下午4:35, Yongji Xie wrote:
> >
> >
> > On Tue, Oct 20, 2020 at 4:01 PM Jason Wang <jasowang@redhat.com
> > <mailto:jasowang@redhat.com>> wrote:
> >
> >
> >     On 2020/10/20 下午3:39, Yongji Xie wrote:
> >     >
> >     >
> >     > On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com
> >     <mailto:jasowang@redhat.com>
> >     > <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>> wrote:
> >     >
> >     >
> >     >     On 2020/10/19 下午10:56, Xie Yongji wrote:
> >     >     > This series introduces a framework, which can be used to
> >     implement
> >     >     > vDPA Devices in a userspace program. To implement it, the
> work
> >     >     > consist of two parts: control path emulating and data path
> >     >     offloading.
> >     >     >
> >     >     > In the control path, the VDUSE driver will make use of
> message
> >     >     > mechnism to forward the actions (get/set features, get/st
> >     status,
> >     >     > get/set config space and set virtqueue states) from
> >     virtio-vdpa
> >     >     > driver to userspace. Userspace can use read()/write() to
> >     >     > receive/reply to those control messages.
> >     >     >
> >     >     > In the data path, the VDUSE driver implements a MMU-based
> >     >     > on-chip IOMMU driver which supports both direct mapping and
> >     >     > indirect mapping with bounce buffer. Then userspace can
> access
> >     >     > those iova space via mmap(). Besides, eventfd mechnism is
> >     used to
> >     >     > trigger interrupts and forward virtqueue kicks.
> >     >
> >     >
> >     >     This is pretty interesting!
> >     >
> >     >     For vhost-vdpa, it should work, but for virtio-vdpa, I think we
> >     >     should
> >     >     carefully deal with the IOMMU/DMA ops stuffs.
> >     >
> >     >
> >     >     I notice that neither dma_map nor set_map is implemented in
> >     >     vduse_vdpa_config_ops, this means you want to let vhost-vDPA
> >     to deal
> >     >     with IOMMU domains stuffs.  Any reason for doing that?
> >     >
> >     > Actually, this series only focus on virtio-vdpa case now. To
> >     support
> >     > vhost-vdpa,  as you said, we need to implement
> >     dma_map/dma_unmap. But
> >     > there is a limit that vm's memory can't be anonymous pages which
> >     are
> >     > forbidden in vm_insert_page(). Maybe we need to add some limits on
> >     > vhost-vdpa?
> >
> >
> >     I'm not sure I get this, any reason that you want to use
> >     vm_insert_page() to VM's memory. Or do you mean you want to implement
> >     some kind of zero-copy?
> >
> >
> >
> > If my understanding is right, we will have a QEMU (VM) process and a
> > device emulation process in the vhost-vdpa case, right? When I/O
> > happens, the virtio driver in VM will put the IOVA to vring and device
> > emulation process will get the IOVA from vring. Then the device
> > emulation process will translate the IOVA to its VA to access the dma
> > buffer which resides in VM's memory. That means the device emulation
> > process needs to access VM's memory, so we should use vm_insert_page()
> > to build the page table of the device emulation process.
>
>
> Ok, I get you now. So it looks to me the that the real issue is not the
> limitation to anonymous page but see the comments above vm_insert_page():
>
> "
>
>   * The page has to be a nice clean _individual_ kernel allocation.
> "
>
> So I suspect that using vm_insert_page() to share pages between
> processes is legal. We need inputs from MM experts.
>
>
Yes, vm_insert_page() can't be used in this case. So could we add the
shmfd to the vhost IOTLB message and pass it to the device emulation process
as a new iova_domain, just like vhost-user does?
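
As a rough sketch of the generic mechanism only: vhost-user hands memory to
the backend by sending file descriptors as SCM_RIGHTS ancillary data over a
Unix domain socket, and the receiver mmap()s them. The helper names below
(send_fd, recv_fd, map_shared) are made up for illustration, and nothing
here implies that the vhost IOTLB message can carry an fd today.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send `fd` over a Unix domain socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;  /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(); returns the new fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS ||
        c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/* Map `len` bytes of the shared file read/write; NULL on failure. */
static void *map_shared(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

The received fd refers to the same file as the sender's, so a MAP_SHARED
mapping on both sides gives the two processes a view of the same pages
without vm_insert_page().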

Thanks,
Yongji


>

>
> >
> >     I guess from the software device implemention in user space it
> >     only need
> >     to receive IOVA ranges and map them in its own address space.
> >
> >
> > How to map them in its own address space if we don't use
> vm_insert_page()?
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
  2020-10-23  2:55           ` Yongji Xie
@ 2020-10-23  8:44             ` Jason Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Wang @ 2020-10-23  8:44 UTC (permalink / raw)
  To: Yongji Xie; +Cc: Michael S. Tsirkin, akpm, linux-mm, virtualization


On 2020/10/23 上午10:55, Yongji Xie wrote:
>
>
> On Tue, Oct 20, 2020 at 5:13 PM Jason Wang <jasowang@redhat.com 
> <mailto:jasowang@redhat.com>> wrote:
>
>
>     On 2020/10/20 下午4:35, Yongji Xie wrote:
>     >
>     >
>     > On Tue, Oct 20, 2020 at 4:01 PM Jason Wang <jasowang@redhat.com
>     <mailto:jasowang@redhat.com>
>     > <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>> wrote:
>     >
>     >
>     >     On 2020/10/20 下午3:39, Yongji Xie wrote:
>     >     >
>     >     >
>     >     > On Tue, Oct 20, 2020 at 11:20 AM Jason Wang
>     <jasowang@redhat.com <mailto:jasowang@redhat.com>
>     >     <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>
>     >     > <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>
>     <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>>> wrote:
>     >     >
>     >     >
>     >     >     On 2020/10/19 下午10:56, Xie Yongji wrote:
>     >     >     > This series introduces a framework, which can be used to
>     >     implement
>     >     >     > vDPA Devices in a userspace program. To implement
>     it, the work
>     >     >     > consist of two parts: control path emulating and
>     data path
>     >     >     offloading.
>     >     >     >
>     >     >     > In the control path, the VDUSE driver will make use
>     of message
>     >     >     > mechnism to forward the actions (get/set features,
>     get/st
>     >     status,
>     >     >     > get/set config space and set virtqueue states) from
>     >     virtio-vdpa
>     >     >     > driver to userspace. Userspace can use read()/write() to
>     >     >     > receive/reply to those control messages.
>     >     >     >
>     >     >     > In the data path, the VDUSE driver implements a
>     MMU-based
>     >     >     > on-chip IOMMU driver which supports both direct
>     mapping and
>     >     >     > indirect mapping with bounce buffer. Then userspace
>     can access
>     >     >     > those iova space via mmap(). Besides, eventfd
>     mechnism is
>     >     used to
>     >     >     > trigger interrupts and forward virtqueue kicks.
>     >     >
>     >     >
>     >     >     This is pretty interesting!
>     >     >
>     >     >     For vhost-vdpa, it should work, but for virtio-vdpa, I
>     think we
>     >     >     should
>     >     >     carefully deal with the IOMMU/DMA ops stuffs.
>     >     >
>     >     >
>     >     >     I notice that neither dma_map nor set_map is
>     implemented in
>     >     >     vduse_vdpa_config_ops, this means you want to let
>     vhost-vDPA
>     >     to deal
>     >     >     with IOMMU domains stuffs.  Any reason for doing that?
>     >     >
>     >     > Actually, this series only focus on virtio-vdpa case now. To
>     >     support
>     >     > vhost-vdpa,  as you said, we need to implement
>     >     dma_map/dma_unmap. But
>     >     > there is a limit that vm's memory can't be anonymous pages
>     which
>     >     are
>     >     > forbidden in vm_insert_page(). Maybe we need to add some
>     limits on
>     >     > vhost-vdpa?
>     >
>     >
>     >     I'm not sure I get this, any reason that you want to use
>     >     vm_insert_page() to VM's memory. Or do you mean you want to
>     implement
>     >     some kind of zero-copy?
>     >
>     >
>     >
>     > If my understanding is right, we will have a QEMU (VM) process
>     and a
>     > device emulation process in the vhost-vdpa case, right? When I/O
>     > happens, the virtio driver in VM will put the IOVA to vring and
>     device
>     > emulation process will get the IOVA from vring. Then the device
>     > emulation process will translate the IOVA to its VA to access
>     the dma
>     > buffer which resides in VM's memory. That means the device
>     emulation
>     > process needs to access VM's memory, so we should use
>     vm_insert_page()
>     > to build the page table of the device emulation process.
>
>
>     Ok, I get you now. So it looks to me the that the real issue is
>     not the
>     limitation to anonymous page but see the comments above
>     vm_insert_page():
>
>     "
>
>       * The page has to be a nice clean _individual_ kernel allocation.
>     "
>
>     So I suspect that using vm_insert_page() to share pages between
>     processes is legal. We need inputs from MM experts.
>
>
> Yes,  vm_insert_page() can't be used in this case. So could we add the 
> shmfd into the vhost iotlb msg and pass it to the device emulation 
> process as a new iova_domain, just like vhost-user does.
>
> Thanks,
> Yongji


I think vhost-user does that via SET_MEM_TABLE, which is not supported by
vDPA. Note that the current IOTLB message will be used when vIOMMU is
enabled.

This needs more thought. I will come back if I have any ideas.

Thanks


>
>
>
>
>     >
>     >     I guess from the software device implemention in user space it
>     >     only need
>     >     to receive IOVA ranges and map them in its own address space.
>     >
>     >
>     > How to map them in its own address space if we don't use
>     vm_insert_page()?
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-10-23  8:45 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-19 14:56 [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2020-10-19 14:56 ` [RFC 1/4] mm: export zap_page_range() for driver use Xie Yongji
2020-10-19 15:14   ` Matthew Wilcox
2020-10-19 15:36     ` [External] " 谢永吉
2020-10-19 14:56 ` [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2020-10-19 15:08   ` Michael S. Tsirkin
2020-10-19 15:24     ` Randy Dunlap
2020-10-19 15:46       ` [External] " 谢永吉
2020-10-19 15:48     ` 谢永吉
2020-10-19 14:56 ` [RFC 3/4] vduse: grab the module's references until there is no vduse device Xie Yongji
2020-10-19 15:05   ` Michael S. Tsirkin
2020-10-19 15:44     ` [External] " 谢永吉
2020-10-19 15:47       ` Michael S. Tsirkin
2020-10-19 15:56         ` 谢永吉
2020-10-19 16:41           ` Michael S. Tsirkin
2020-10-20  7:42             ` Yongji Xie
2020-10-19 14:56 ` [RFC 4/4] vduse: Add memory shrinker to reclaim bounce pages Xie Yongji
2020-10-19 17:16 ` [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
2020-10-20  2:18   ` [External] " 谢永吉
2020-10-20  2:20     ` Jason Wang
2020-10-20  2:28       ` 谢永吉
2020-10-20  3:20 ` Jason Wang
2020-10-20  7:39   ` [External] " Yongji Xie
2020-10-20  8:01     ` Jason Wang
2020-10-20  8:35       ` Yongji Xie
2020-10-20  9:12         ` Jason Wang
2020-10-23  2:55           ` Yongji Xie
2020-10-23  8:44             ` Jason Wang
