linux-mm.kvack.org archive mirror
* [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
@ 2020-12-22 14:52 Xie Yongji
  2020-12-22 14:52 ` [RFC v2 01/13] mm: export zap_page_range() for driver use Xie Yongji
                   ` (13 more replies)
  0 siblings, 14 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

This series introduces a framework that can be used to implement
vDPA devices in a userspace program. The work consists of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver uses a message mechanism to
forward config operations from the vdpa bus driver to userspace.
Userspace can use read()/write() to receive and reply to those
control messages.
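
A minimal sketch of what such a receive/reply loop body could look like
in the daemon. The message layout, the DEMO_* names and the socketpair
standing in for the VDUSE device fd are all assumptions made for
illustration; the real layout is defined in include/uapi/linux/vduse.h.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical message types and layout, NOT the real VDUSE uapi. */
enum demo_type { DEMO_GET_STATUS = 1, DEMO_SET_STATUS = 2 };

struct demo_msg {
	uint32_t type;   /* which config operation is requested */
	uint32_t unique; /* request id, echoed back in the reply */
	uint64_t value;  /* request argument or reply result */
};

static uint8_t demo_status; /* emulated device status */

/* Receive one request from fd, apply it, and send the reply back. */
static int demo_handle_one(int fd)
{
	struct demo_msg msg;

	if (read(fd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		return -1;

	switch (msg.type) {
	case DEMO_SET_STATUS:
		demo_status = (uint8_t)msg.value;
		msg.value = 0;
		break;
	case DEMO_GET_STATUS:
		msg.value = demo_status;
		break;
	default:
		return -1; /* unknown request: no reply is sent */
	}

	return write(fd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg) ? 0 : -1;
}

/*
 * Stand-in for the kernel side: send one request over a socketpair,
 * let the handler serve it, and return the value carried by the reply
 * (or -1 on any failure).
 */
static int64_t demo_roundtrip(uint32_t type, uint64_t value)
{
	struct demo_msg msg = { .type = type, .unique = 1, .value = value };
	int64_t ret = -1;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
		return -1;
	if (write(sv[0], &msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    demo_handle_one(sv[1]) == 0 &&
	    read(sv[0], &msg, sizeof(msg)) == (ssize_t)sizeof(msg))
		ret = (int64_t)msg.value;
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

A real daemon would presumably call something like demo_handle_one() in
a loop on the fd returned by the device creation ioctl.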

In the data path, the core task is mapping the DMA buffer into the VDUSE
daemon's address space, which can be implemented in different ways
depending on the vdpa bus to which the vDPA device is attached.

In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
a bounce-buffering mechanism to achieve that. In the vhost-vdpa case, the DMA
buffer resides in a userspace memory region, which can be shared with the
VDUSE userspace process by transferring the shmfd.
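
To illustrate the bounce-buffering idea, here is a small userspace model
of the copy rules. It is only a sketch: the demo_* names, the fixed 4096
byte bounce page and the three direction values are assumptions that
mirror the usual DMA directions, not code from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical direction values, modelled on the kernel's DMA directions. */
enum demo_dir { DEMO_TO_DEVICE, DEMO_FROM_DEVICE, DEMO_BIDIRECTIONAL };

struct demo_bounce {
	void *orig;               /* original (kernel) buffer */
	unsigned char page[4096]; /* bounce page that userspace would see */
	size_t len;
	enum demo_dir dir;
};

/* Map: expose only a copy of the data, never the original buffer. */
static int demo_map(struct demo_bounce *b, void *orig, size_t len,
		    enum demo_dir dir)
{
	if (len > sizeof(b->page))
		return -1;

	b->orig = orig;
	b->len = len;
	b->dir = dir;
	/* Clear the page so no stale data is visible through the mapping. */
	memset(b->page, 0, sizeof(b->page));
	/* Data flowing towards the device must be copied in at map time. */
	if (dir != DEMO_FROM_DEVICE)
		memcpy(b->page, orig, len);
	return 0;
}

/* Unmap: data produced by the device is copied back to the original. */
static void demo_unmap(struct demo_bounce *b)
{
	if (b->dir != DEMO_TO_DEVICE)
		memcpy(b->orig, b->page, b->len);
}
```

With this split, a write by the daemon into a DEMO_TO_DEVICE mapping
never reaches the original buffer, which is the isolation property the
bounce buffer is there to provide.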

The details and our use case are shown below:

------------------------    -------------------------   ----------------------------------------------
|            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
|       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
|       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
------------+-----------     -----------+------------   -------------+----------------------+---------
            |                           |                            |                      |
            |                           |                            |                      |
------------+---------------------------+----------------------------+----------------------+---------
|    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
|    -------+--------           --------+--------            -------+--------          -----+----    |
|           |                           |                           |                       |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
| | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
|           |      virtio bus           |                           |                       |        |
|   --------+----+-----------           |                           |                       |        |
|                |                      |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|      | virtio-blk device |            |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|                |                      |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|     |  virtio-vdpa driver |           |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|                |                      |                           |    vdpa bus           |        |
|     -----------+----------------------+---------------------------+------------           |        |
|                                                                                        ---+---     |
-----------------------------------------------------------------------------------------| NIC |------
                                                                                         ---+---
                                                                                            |
                                                                                   ---------+---------
                                                                                   | Remote Storages |
                                                                                   -------------------

We use it to implement a block device that connects to our
distributed storage and can be used in both containers and VMs.
Thus, we can have a unified technology stack for these two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

Future work:
  - Improve performance (e.g. zero copy implementation in datapath)
  - Config interrupt support
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)

This series is now based on the series below:
https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/

V1 to V2:
- Add vhost-vdpa support
- Add some documents
- Based on the vdpa management tool
- Introduce a workqueue for irq injection
- Replace interval tree with array map to store the iova_map

Xie Yongji (13):
  mm: export zap_page_range() for driver use
  eventfd: track eventfd_signal() recursion depth separately in different cases
  eventfd: Increase the recursion depth of eventfd_signal()
  vdpa: Remove the restriction that only supports virtio-net devices
  vdpa: Pass the netlink attributes to ops.dev_add()
  vduse: Introduce VDUSE - vDPA Device in Userspace
  vduse: support get/set virtqueue state
  vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
  vduse: Add support for processing vhost iotlb message
  vduse: grab the module's references until there is no vduse device
  vduse/iova_domain: Support reclaiming bounce pages
  vduse: Add memory shrinker to reclaim bounce pages
  vduse: Introduce a workqueue for irq injection

 Documentation/driver-api/vduse.rst                 |   91 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |    8 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa.c                                |    2 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    3 +-
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/eventfd.c                   |  229 ++++
 drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  517 ++++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |  103 ++
 drivers/vdpa/vdpa_user/vduse.h                     |   59 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1373 ++++++++++++++++++++
 drivers/vhost/vdpa.c                               |   34 +-
 fs/aio.c                                           |    3 +-
 fs/eventfd.c                                       |   20 +-
 include/linux/eventfd.h                            |    5 +-
 include/linux/vdpa.h                               |   11 +-
 include/uapi/linux/vdpa.h                          |    1 +
 include/uapi/linux/vduse.h                         |  119 ++
 mm/memory.c                                        |    1 +
 21 files changed, 2598 insertions(+), 36 deletions(-)
 create mode 100644 Documentation/driver-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

-- 
2.11.0




* [RFC v2 01/13] mm: export zap_page_range() for driver use
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 15:44   ` Christoph Hellwig
  2020-12-22 14:52 ` [RFC v2 02/13] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

Export zap_page_range() for use in VDUSE.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..edd2d6497bb3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1542,6 +1542,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb, start, range.end);
 }
+EXPORT_SYMBOL(zap_page_range);
 
 /**
  * zap_page_range_single - remove user pages in a given range
-- 
2.11.0




* [RFC v2 02/13] eventfd: track eventfd_signal() recursion depth separately in different cases
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2020-12-22 14:52 ` [RFC v2 01/13] mm: export zap_page_range() for driver use Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 03/13] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

Currently a global percpu counter limits the recursion depth of
eventfd_signal(), which guards against both deadlock and stack
overflow. For the stack overflow case alone, it should be OK to
increase the recursion depth if needed. So add a percpu counter to
eventfd_ctx that limits the recursion depth for the deadlock case.
The global percpu limit can then be increased later.
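
The resulting check can be modelled in plain userspace C. This is a
single-threaded sketch with hypothetical demo_* names: it drops the
percpu storage, the locking, the overflow clamp and the WARN_ON_ONCE(),
and it uses a depth of 1 (the value patch 03 sets) rather than the 0
introduced here.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_WAKE_DEPTH 1 /* assumed value, as set by patch 03 */

struct demo_ctx {
	int wake_count;           /* models the per-context counter */
	unsigned long long count; /* models the eventfd counter */
	struct demo_ctx *chain;   /* context signalled from the wakeup, or NULL */
};

static int demo_global_wake_count; /* models the global counter */

/* Same shape as eventfd_signal_count(ctx) in the patch. */
static int demo_signal_blocked(const struct demo_ctx *ctx)
{
	return ctx->wake_count || demo_global_wake_count > DEMO_WAKE_DEPTH;
}

/* Returns n if the signal was delivered, 0 if it was refused. */
static unsigned long long demo_signal(struct demo_ctx *ctx,
				      unsigned long long n)
{
	if (demo_signal_blocked(ctx))
		return 0;

	demo_global_wake_count++;
	ctx->wake_count++;
	ctx->count += n;
	/* The wakeup callback may signal another eventfd in turn. */
	if (ctx->chain)
		demo_signal(ctx->chain, n);
	demo_global_wake_count--;
	ctx->wake_count--;

	return n;
}
```

In this model a chain a -> b -> c delivers to a and b but refuses c
(global depth exceeded), while a context that signals itself is refused
on the nested call regardless of the global depth (the deadlock case).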

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/aio.c                |  3 ++-
 fs/eventfd.c            | 20 +++++++++++++++++++-
 include/linux/eventfd.h |  5 +----
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..5d82903161f5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1698,7 +1698,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		list_del(&iocb->ki_list);
 		iocb->ki_res.res = mangle_poll(mask);
 		req->done = true;
-		if (iocb->ki_eventfd && eventfd_signal_count()) {
+		if (iocb->ki_eventfd &&
+			eventfd_signal_count(iocb->ki_eventfd)) {
 			iocb = NULL;
 			INIT_WORK(&req->work, aio_poll_put_work);
 			schedule_work(&req->work);
diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..2df24f9bada3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -25,6 +25,8 @@
 #include <linux/idr.h>
 #include <linux/uio.h>
 
+#define EVENTFD_WAKE_DEPTH 0
+
 DEFINE_PER_CPU(int, eventfd_wake_count);
 
 static DEFINE_IDA(eventfd_ida);
@@ -42,9 +44,17 @@ struct eventfd_ctx {
 	 */
 	__u64 count;
 	unsigned int flags;
+	int __percpu *wake_count;
 	int id;
 };
 
+bool eventfd_signal_count(struct eventfd_ctx *ctx)
+{
+	return (this_cpu_read(*ctx->wake_count) ||
+		this_cpu_read(eventfd_wake_count) > EVENTFD_WAKE_DEPTH);
+}
+EXPORT_SYMBOL_GPL(eventfd_signal_count);
+
 /**
  * eventfd_signal - Adds @n to the eventfd counter.
  * @ctx: [in] Pointer to the eventfd context.
@@ -71,17 +81,19 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(eventfd_signal_count(ctx)))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	this_cpu_inc(eventfd_wake_count);
+	this_cpu_inc(*ctx->wake_count);
 	if (ULLONG_MAX - ctx->count < n)
 		n = ULLONG_MAX - ctx->count;
 	ctx->count += n;
 	if (waitqueue_active(&ctx->wqh))
 		wake_up_locked_poll(&ctx->wqh, EPOLLIN);
 	this_cpu_dec(eventfd_wake_count);
+	this_cpu_dec(*ctx->wake_count);
 	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
 	return n;
@@ -92,6 +104,7 @@ static void eventfd_free_ctx(struct eventfd_ctx *ctx)
 {
 	if (ctx->id >= 0)
 		ida_simple_remove(&eventfd_ida, ctx->id);
+	free_percpu(ctx->wake_count);
 	kfree(ctx);
 }
 
@@ -423,6 +436,11 @@ static int do_eventfd(unsigned int count, int flags)
 
 	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
+	ctx->wake_count = alloc_percpu(int);
+	if (!ctx->wake_count) {
+		kfree(ctx);
+		return -ENOMEM;
+	}
 	ctx->count = count;
 	ctx->flags = flags;
 	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..1a11ebbd74a9 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -45,10 +45,7 @@ void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
 
 DECLARE_PER_CPU(int, eventfd_wake_count);
 
-static inline bool eventfd_signal_count(void)
-{
-	return this_cpu_read(eventfd_wake_count);
-}
+bool eventfd_signal_count(struct eventfd_ctx *ctx);
 
 #else /* CONFIG_EVENTFD */
 
-- 
2.11.0




* [RFC v2 03/13] eventfd: Increase the recursion depth of eventfd_signal()
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2020-12-22 14:52 ` [RFC v2 01/13] mm: export zap_page_range() for driver use Xie Yongji
  2020-12-22 14:52 ` [RFC v2 02/13] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 04/13] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

Increase the recursion depth of eventfd_signal() to 1. This
will be used by VDUSE later.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/eventfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 2df24f9bada3..478cdc175949 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -25,7 +25,7 @@
 #include <linux/idr.h>
 #include <linux/uio.h>
 
-#define EVENTFD_WAKE_DEPTH 0
+#define EVENTFD_WAKE_DEPTH 1
 
 DEFINE_PER_CPU(int, eventfd_wake_count);
 
-- 
2.11.0




* [RFC v2 04/13] vdpa: Remove the restriction that only supports virtio-net devices
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (2 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 03/13] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 05/13] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

With VDUSE, we should be able to support all kinds of virtio devices.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vhost/vdpa.c | 29 +++--------------------------
 1 file changed, 3 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 29ed4173f04e..448be7875b6d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -22,6 +22,7 @@
 #include <linux/nospec.h>
 #include <linux/vhost.h>
 #include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
 
 #include "vhost.h"
 
@@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
 	return 0;
 }
 
-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
-				      struct vhost_vdpa_config *c)
-{
-	long size = 0;
-
-	switch (v->virtio_id) {
-	case VIRTIO_ID_NET:
-		size = sizeof(struct virtio_net_config);
-		break;
-	}
-
-	if (c->len == 0)
-		return -EINVAL;
-
-	if (c->len > size - c->off)
-		return -E2BIG;
-
-	return 0;
-}
-
 static long vhost_vdpa_get_config(struct vhost_vdpa *v,
 				  struct vhost_vdpa_config __user *c)
 {
@@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v,
 
 	if (copy_from_user(&config, c, size))
 		return -EFAULT;
-	if (vhost_vdpa_config_validate(v, &config))
+	if (config.len == 0)
 		return -EINVAL;
 	buf = kvzalloc(config.len, GFP_KERNEL);
 	if (!buf)
@@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v,
 
 	if (copy_from_user(&config, c, size))
 		return -EFAULT;
-	if (vhost_vdpa_config_validate(v, &config))
+	if (config.len == 0)
 		return -EINVAL;
 	buf = kvzalloc(config.len, GFP_KERNEL);
 	if (!buf)
@@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
 	int minor;
 	int r;
 
-	/* Currently, we only accept the network devices. */
-	if (ops->get_device_id(vdpa) != VIRTIO_ID_NET)
-		return -ENOTSUPP;
-
 	v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!v)
 		return -ENOMEM;
-- 
2.11.0




* [RFC v2 05/13] vdpa: Pass the netlink attributes to ops.dev_add()
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (3 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 04/13] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

Pass the netlink attributes to ops.dev_add() so that we
can get device-specific attributes when creating
a vdpa device.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa.c              | 2 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
 include/linux/vdpa.h             | 4 +++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 32bd48baffab..f6ff81927694 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -440,7 +440,7 @@ static int vdpa_nl_cmd_dev_add_set_doit(struct sk_buff *skb, struct genl_info *i
 		goto err;
 	}
 
-	vdev = pdev->ops->dev_add(pdev, name, device_id);
+	vdev = pdev->ops->dev_add(pdev, name, device_id, info->attrs);
 	if (IS_ERR(vdev))
 		goto err;
 
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 85776e4e6749..cfc314f5403a 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -726,7 +726,8 @@ static const struct vdpa_config_ops vdpasim_net_batch_config_ops = {
 };
 
 static struct vdpa_device *
-vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name, u32 device_id)
+vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name,
+		u32 device_id, struct nlattr **attrs)
 {
 	struct vdpasim *simdev;
 
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index cb5a3d847af3..656fe264234e 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -6,6 +6,7 @@
 #include <linux/device.h>
 #include <linux/interrupt.h>
 #include <linux/vhost_iotlb.h>
+#include <net/genetlink.h>
 
 /**
  * vDPA callback definition.
@@ -349,6 +350,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
  *		@pdev: parent device to use for device addition
  *		@name: name of the new vdpa device
  *		@device_id: device id of the new vdpa device
+ *		@attrs: device specific attributes
  *		Driver need to add a new device using vdpa_register_device() after
  *		fully initializing the vdpa device. On successful addition driver
  *		must return a valid pointer of vdpa device or ERR_PTR for the error.
@@ -359,7 +361,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
  */
 struct vdpa_dev_ops {
 	struct vdpa_device* (*dev_add)(struct vdpa_parent_dev *pdev, const char *name,
-				       u32 device_id);
+				       u32 device_id, struct nlattr **attrs);
 	void (*dev_del)(struct vdpa_parent_dev *pdev, struct vdpa_device *dev);
 };
 
-- 
2.11.0




* [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (4 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 05/13] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-23  8:08   ` Jason Wang
  2021-01-08 13:32   ` Bob Liu
  2020-12-22 14:52 ` [RFC v2 07/13] vduse: support get/set virtqueue state Xie Yongji
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

This VDUSE driver enables implementing vDPA devices in userspace.
Both the control path and the data path of vDPA devices can be
handled in userspace.

In the control path, the VDUSE driver uses a message mechanism to
forward config operations from the vdpa bus driver to userspace.
Userspace can use read()/write() to receive and reply to those
control messages.

In the data path, the VDUSE driver implements an MMU-based on-chip
IOMMU driver which supports mapping the kernel DMA buffer to a
userspace iova region dynamically. Userspace can access those
iova regions via mmap(). In addition, the eventfd mechanism is used to
trigger interrupt callbacks and receive virtqueue kicks in userspace.
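
For readers less familiar with eventfd, the notification semantics this
relies on can be shown with plain eventfd(2) calls. This is a sketch
with hypothetical helper names; in VDUSE the kickfd would be signalled
by the kernel and drained by the daemon, and the irqfd the other way
round, whereas here both ends live in one process.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal the eventfd once (a "kick" or an interrupt request). */
static int demo_kick(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Drain the eventfd: returns how many signals were coalesced since the
 * last drain, or 0 if nothing is pending (fd opened with EFD_NONBLOCK).
 */
static uint64_t demo_drain(int efd)
{
	uint64_t n = 0;

	if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
		return 0;
	return n;
}
```

Because the counter coalesces signals, the consumer should process the
whole vring after one wakeup instead of assuming one signal per buffer.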

With this patch applied, only the virtio-vdpa bus driver is supported.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/driver-api/vduse.rst                 |   74 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |    8 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
 drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
 drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
 include/uapi/linux/vdpa.h                          |    1 +
 include/uapi/linux/vduse.h                         |   99 ++
 13 files changed, 2173 insertions(+)
 create mode 100644 Documentation/driver-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
new file mode 100644
index 000000000000..da9b3040f20a
--- /dev/null
+++ b/Documentation/driver-api/vduse.rst
@@ -0,0 +1,74 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+A vDPA (virtio data path acceleration) device is a device whose datapath
+complies with the virtio specifications but whose control path is vendor
+specific. vDPA devices can be either physically located on the hardware
+or emulated by software. VDUSE is a framework that makes it possible to
+implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+---------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the VDUSE character device (/dev/vduse). Then a file descriptor pointing
+to the new resources will be returned, which can be used to implement the
+userspace vDPA device's control path and data path.
+
+To implement control path, the read/write operations to the file descriptor
+will be used to receive/reply the control messages from/to VDUSE driver.
+Those control messages are based on the vdpa_config_ops which defines a
+unified interface to control different types of vDPA device.
+
+The following types of messages are provided by the VDUSE framework now:
+
+- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+Please see include/linux/vdpa.h for details.
+
+In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+driver which supports mapping the kernel dma buffer to a userspace iova
+region dynamically. The userspace iova region can be created by passing
+the userspace vDPA device fd to mmap(2).
+
+Besides, the eventfd mechanism is used to trigger interrupt callbacks and
+receive virtqueue kicks in userspace. The following ioctls on the userspace
+vDPA device fd are provided to support that:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE driver to notify userspace to consume the vring.
+
+- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
+  by userspace to notify VDUSE driver to trigger interrupt callbacks.
+
+MMU-based IOMMU Driver
+----------------------
+The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
+IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
+for the DMA transfer so that the userspace process is able to use its virtual
+address to access the dma buffer in kernel.
+
+To avoid security issues, a bounce-buffering mechanism is introduced to
+prevent userspace from directly accessing the original buffer, which may
+contain other kernel data. During mapping and unmapping, the driver copies
+the data from the original buffer to the bounce buffer and back, depending
+on the direction of the transfer. The bounce-buffer addresses are mapped
+into the user address space instead of the original ones.
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a4c75a28c839..71722e6f8f23 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
 'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
 '|'   00-7F  linux/media.h
 0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
 0x89  00-06  arch/x86/include/asm/sockios.h
 0x89  0B-DF  linux/sockios.h
 0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index 4be7be39be26..211cc449cbd3 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -21,6 +21,14 @@ config VDPA_SIM
 	  to RX. This device is used for testing, prototyping and
 	  development of vDPA.
 
+config VDPA_USER
+	tristate "VDUSE (vDPA Device in Userspace) support"
+	depends on EVENTFD && MMU && HAS_DMA
+	default n
+	help
+	  With VDUSE it is possible to emulate a vDPA Device
+	  in a userspace program.
+
 config IFCVF
 	tristate "Intel IFC VF vDPA driver"
 	depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index d160e9b63a66..66e97778ad03 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
 obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
 obj-$(CONFIG_IFCVF)    += ifcvf/
 obj-$(CONFIG_MLX5_VDPA) += mlx5/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..b7645e36992b
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o eventfd.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
new file mode 100644
index 000000000000..dbffddb08908
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <uapi/linux/vduse.h>
+
+#include "eventfd.h"
+
+static struct workqueue_struct *vduse_irqfd_cleanup_wq;
+
+static void vduse_virqfd_shutdown(struct work_struct *work)
+{
+	u64 cnt;
+	struct vduse_virqfd *virqfd = container_of(work,
+					struct vduse_virqfd, shutdown);
+
+	eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
+	flush_work(&virqfd->inject);
+	eventfd_ctx_put(virqfd->ctx);
+	kfree(virqfd);
+}
+
+static void vduse_virqfd_inject(struct work_struct *work)
+{
+	struct vduse_virqfd *virqfd = container_of(work,
+					struct vduse_virqfd, inject);
+	struct vduse_virtqueue *vq = virqfd->vq;
+
+	spin_lock_irq(&vq->irq_lock);
+	if (vq->ready && vq->cb)
+		vq->cb(vq->private);
+	spin_unlock_irq(&vq->irq_lock);
+}
+
+static void virqfd_deactivate(struct vduse_virqfd *virqfd)
+{
+	queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+				int sync, void *key)
+{
+	struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
+	struct vduse_virtqueue *vq = virqfd->vq;
+
+	__poll_t flags = key_to_poll(key);
+
+	if (flags & EPOLLIN)
+		schedule_work(&virqfd->inject);
+
+	if (flags & EPOLLHUP) {
+		spin_lock(&vq->irq_lock);
+		if (vq->virqfd == virqfd) {
+			vq->virqfd = NULL;
+			virqfd_deactivate(virqfd);
+		}
+		spin_unlock(&vq->irq_lock);
+	}
+
+	return 0;
+}
+
+static void vduse_virqfd_ptable_queue_proc(struct file *file,
+			wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
+
+	add_wait_queue(wqh, &virqfd->wait);
+}
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct vduse_virqfd *virqfd;
+	struct fd irqfd;
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+	__poll_t events;
+	int ret;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+	if (!virqfd)
+		return -ENOMEM;
+
+	INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
+	INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
+
+	ret = -EBADF;
+	irqfd = fdget(eventfd->fd);
+	if (!irqfd.file)
+		goto err_fd;
+
+	ctx = eventfd_ctx_fileget(irqfd.file);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
+		goto err_ctx;
+	}
+
+	virqfd->vq = vq;
+	virqfd->ctx = ctx;
+	spin_lock(&vq->irq_lock);
+	if (vq->virqfd)
+		virqfd_deactivate(vq->virqfd);
+	vq->virqfd = virqfd;
+	spin_unlock(&vq->irq_lock);
+
+	init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
+	init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
+
+	events = vfs_poll(irqfd.file, &virqfd->pt);
+
+	/*
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered and trigger it as if we didn't miss it.
+	 */
+	if (events & EPOLLIN)
+		schedule_work(&virqfd->inject);
+
+	fdput(irqfd);
+
+	return 0;
+err_ctx:
+	fdput(irqfd);
+err_fd:
+	kfree(virqfd);
+	return ret;
+}
+
+void vduse_virqfd_release(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		if (vq->virqfd) {
+			virqfd_deactivate(vq->virqfd);
+			vq->virqfd = NULL;
+		}
+		spin_unlock(&vq->irq_lock);
+	}
+	flush_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+int vduse_virqfd_init(void)
+{
+	vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
+						WQ_UNBOUND, 0);
+	if (!vduse_irqfd_cleanup_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void vduse_virqfd_exit(void)
+{
+	destroy_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+void vduse_vq_kick(struct vduse_virtqueue *vq)
+{
+	spin_lock(&vq->kick_lock);
+	if (vq->ready && vq->kickfd)
+		eventfd_signal(vq->kickfd, 1);
+	spin_unlock(&vq->kick_lock);
+}
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	ctx = eventfd_ctx_fdget(eventfd->fd);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	spin_lock(&vq->kick_lock);
+	if (vq->kickfd)
+		eventfd_ctx_put(vq->kickfd);
+	vq->kickfd = ctx;
+	spin_unlock(&vq->kick_lock);
+
+	return 0;
+}
+
+void vduse_kickfd_release(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->kick_lock);
+		if (vq->kickfd) {
+			eventfd_ctx_put(vq->kickfd);
+			vq->kickfd = NULL;
+		}
+		spin_unlock(&vq->kick_lock);
+	}
+}
diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
new file mode 100644
index 000000000000..14269ff27f47
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_EVENTFD_H
+#define _VDUSE_EVENTFD_H
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <uapi/linux/vduse.h>
+
+#include "vduse.h"
+
+struct vduse_dev;
+
+struct vduse_virqfd {
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+	struct work_struct inject;
+	struct work_struct shutdown;
+	wait_queue_entry_t wait;
+	poll_table pt;
+};
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd);
+
+void vduse_virqfd_release(struct vduse_dev *dev);
+
+int vduse_virqfd_init(void);
+
+void vduse_virqfd_exit(void);
+
+void vduse_vq_kick(struct vduse_virtqueue *vq);
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd);
+
+void vduse_kickfd_release(struct vduse_dev *dev);
+
+#endif /* _VDUSE_EVENTFD_H */
diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..27022157abc6
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,442 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/dma-mapping.h>
+
+#include "iova_domain.h"
+
+#define IOVA_CHUNK_SHIFT 26
+#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
+#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
+
+#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
+
+#define IOVA_ALLOC_ORDER 12
+#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
+
+struct vduse_mmap_vma {
+	struct vm_area_struct *vma;
+	struct list_head list;
+};
+
+static inline struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
+
+	return domain->chunks[index].bounce_pages[pgindex];
+}
+
+static inline void
+vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
+				unsigned long iova, struct page *page)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
+
+	domain->chunks[index].bounce_pages[pgindex] = page;
+}
+
+static inline struct vduse_iova_map *
+vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+	unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
+
+	return domain->chunks[index].iova_map[mapindex];
+}
+
+static inline void
+vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
+			unsigned long iova, struct vduse_iova_map *map)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+	unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
+
+	domain->chunks[index].iova_map[mapindex] = map;
+}
+
+static int
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size)
+{
+	struct page *page;
+	size_t walk_sz = 0;
+	int frees = 0;
+
+	while (walk_sz < size) {
+		page = vduse_domain_get_bounce_page(domain, iova);
+		if (page) {
+			vduse_domain_set_bounce_page(domain, iova, NULL);
+			put_page(page);
+			frees++;
+		}
+		iova += PAGE_SIZE;
+		walk_sz += PAGE_SIZE;
+	}
+
+	return frees;
+}
+
+int vduse_domain_add_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma)
+{
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct vduse_mmap_vma *mmap_vma;
+
+	if (WARN_ON(size != domain->size))
+		return -EINVAL;
+
+	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+	if (!mmap_vma)
+		return -ENOMEM;
+
+	mmap_vma->vma = vma;
+	mutex_lock(&domain->vma_lock);
+	list_add(&mmap_vma->list, &domain->vma_list);
+	mutex_unlock(&domain->vma_lock);
+
+	return 0;
+}
+
+void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma)
+{
+	struct vduse_mmap_vma *mmap_vma;
+
+	mutex_lock(&domain->vma_lock);
+	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+		if (mmap_vma->vma == vma) {
+			list_del(&mmap_vma->list);
+			kfree(mmap_vma);
+			break;
+		}
+	}
+	mutex_unlock(&domain->vma_lock);
+}
+
+int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
+				unsigned long iova, unsigned long orig,
+				size_t size, enum dma_data_direction dir)
+{
+	struct vduse_iova_map *map;
+	unsigned long last = iova + size;
+
+	map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
+	if (!map)
+		return -ENOMEM;
+
+	map->iova = iova;
+	map->orig = orig;
+	map->size = size;
+	map->dir = dir;
+
+	while (iova < last) {
+		vduse_domain_set_iova_map(domain, iova, map);
+		iova += IOVA_ALLOC_SIZE;
+	}
+
+	return 0;
+}
+
+struct vduse_iova_map *
+vduse_domain_get_mapping(struct vduse_iova_domain *domain,
+			unsigned long iova)
+{
+	return vduse_domain_get_iova_map(domain, iova);
+}
+
+void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
+				struct vduse_iova_map *map)
+{
+	unsigned long iova = map->iova;
+	unsigned long last = iova + map->size;
+
+	while (iova < last) {
+		vduse_domain_set_iova_map(domain, iova, NULL);
+		iova += IOVA_ALLOC_SIZE;
+	}
+}
+
+void vduse_domain_unmap(struct vduse_iova_domain *domain,
+			unsigned long iova, size_t size)
+{
+	struct vduse_mmap_vma *mmap_vma;
+	unsigned long uaddr;
+
+	mutex_lock(&domain->vma_lock);
+	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+		mmap_read_lock(mmap_vma->vma->vm_mm);
+		uaddr = iova + mmap_vma->vma->vm_start;
+		zap_page_range(mmap_vma->vma, uaddr, size);
+		mmap_read_unlock(mmap_vma->vma->vm_mm);
+	}
+	mutex_unlock(&domain->vma_lock);
+}
+
+int vduse_domain_direct_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova)
+{
+	unsigned long uaddr = iova + vma->vm_start;
+	unsigned long start = iova & PAGE_MASK;
+	unsigned long last = start + PAGE_SIZE - 1;
+	unsigned long offset;
+	struct vduse_iova_map *map;
+	struct page *page = NULL;
+
+	map = vduse_domain_get_iova_map(domain, iova);
+	if (map) {
+		offset = last - map->iova;
+		page = virt_to_page(map->orig + offset);
+	}
+
+	return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
+}
+
+void vduse_domain_bounce(struct vduse_iova_domain *domain,
+			unsigned long iova, unsigned long orig,
+			size_t size, enum dma_data_direction dir)
+{
+	unsigned int offset = offset_in_page(iova);
+
+	while (size) {
+		struct page *p = vduse_domain_get_bounce_page(domain, iova);
+		size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
+		void *addr;
+
+		if (p) {
+			addr = page_address(p) + offset;
+			if (dir == DMA_TO_DEVICE)
+				memcpy(addr, (void *)orig, copy_len);
+			else if (dir == DMA_FROM_DEVICE)
+				memcpy((void *)orig, addr, copy_len);
+		}
+		size -= copy_len;
+		orig += copy_len;
+		iova += copy_len;
+		offset = 0;
+	}
+}
+
+int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova)
+{
+	unsigned long uaddr = iova + vma->vm_start;
+	unsigned long start = iova & PAGE_MASK;
+	unsigned long offset = 0;
+	bool found = false;
+	struct vduse_iova_map *map;
+	struct page *page;
+
+	mutex_lock(&domain->map_lock);
+
+	page = vduse_domain_get_bounce_page(domain, iova);
+	if (page)
+		goto unlock;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		goto unlock;
+
+	while (offset < PAGE_SIZE) {
+		unsigned int src_offset = 0, dst_offset = 0;
+		void *src, *dst;
+		size_t copy_len;
+
+		map = vduse_domain_get_iova_map(domain, start + offset);
+		if (!map) {
+			offset += IOVA_ALLOC_SIZE;
+			continue;
+		}
+
+		found = true;
+		offset += map->size;
+		if (map->dir == DMA_FROM_DEVICE)
+			continue;
+
+		if (start > map->iova)
+			src_offset = start - map->iova;
+		else
+			dst_offset = map->iova - start;
+
+		src = (void *)(map->orig + src_offset);
+		dst = page_address(page) + dst_offset;
+		copy_len = min_t(size_t, map->size - src_offset,
+				PAGE_SIZE - dst_offset);
+		memcpy(dst, src, copy_len);
+	}
+	if (!found) {
+		put_page(page);
+		page = NULL;
+	}
+	vduse_domain_set_bounce_page(domain, iova, page);
+unlock:
+	mutex_unlock(&domain->map_lock);
+
+	return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
+}
+
+bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	struct vduse_iova_chunk *chunk = &domain->chunks[index];
+
+	return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
+}
+
+unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
+					size_t size, enum iova_map_type type)
+{
+	struct vduse_iova_chunk *chunk;
+	unsigned long iova = 0;
+	int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
+	struct genpool_data_align data = { .align = align };
+	int i;
+
+	for (i = 0; i < domain->chunk_num; i++) {
+		chunk = &domain->chunks[i];
+		if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
+			atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
+
+		if (atomic_read(&chunk->map_type) != type)
+			continue;
+
+		iova = gen_pool_alloc_algo(chunk->pool, size,
+					gen_pool_first_fit_align, &data);
+		if (iova)
+			break;
+	}
+
+	return iova;
+}
+
+void vduse_domain_free_iova(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size)
+{
+	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+	struct vduse_iova_chunk *chunk = &domain->chunks[index];
+
+	gen_pool_free(chunk->pool, iova, size);
+}
+
+static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk)
+{
+	vfree(chunk->bounce_pages);
+	vfree(chunk->iova_map);
+	gen_pool_destroy(chunk->pool);
+}
+
+void vduse_iova_domain_destroy(struct vduse_iova_domain *domain)
+{
+	struct vduse_iova_chunk *chunk;
+	int i;
+
+	for (i = 0; i < domain->chunk_num; i++) {
+		chunk = &domain->chunks[i];
+		vduse_domain_free_bounce_pages(domain,
+					chunk->start, IOVA_CHUNK_SIZE);
+		vduse_iova_chunk_cleanup(chunk);
+	}
+
+	mutex_destroy(&domain->map_lock);
+	mutex_destroy(&domain->vma_lock);
+	kfree(domain->chunks);
+	kfree(domain);
+}
+
+static int vduse_iova_chunk_init(struct vduse_iova_chunk *chunk,
+				unsigned long addr, size_t size)
+{
+	int ret;
+	int pages = size >> PAGE_SHIFT;
+
+	chunk->pool = gen_pool_create(IOVA_ALLOC_ORDER, -1);
+	if (!chunk->pool)
+		return -ENOMEM;
+
+	/* IOVA 0 is reserved: vduse_domain_alloc_iova() returns 0 on failure */
+	if (addr == 0)
+		addr += IOVA_ALLOC_SIZE;
+
+	ret = gen_pool_add(chunk->pool, addr, size, -1);
+	if (ret)
+		goto err;
+
+	ret = -ENOMEM;
+	chunk->bounce_pages = vzalloc(pages * sizeof(struct page *));
+	if (!chunk->bounce_pages)
+		goto err;
+
+	chunk->iova_map = vzalloc((size >> IOVA_ALLOC_ORDER) *
+				sizeof(struct vduse_iova_map *));
+	if (!chunk->iova_map)
+		goto err;
+
+	chunk->start = addr;
+	atomic_set(&chunk->map_type, TYPE_NONE);
+
+	return 0;
+err:
+	if (chunk->bounce_pages) {
+		vfree(chunk->bounce_pages);
+		chunk->bounce_pages = NULL;
+	}
+	gen_pool_destroy(chunk->pool);
+	return ret;
+}
+
+struct vduse_iova_domain *vduse_iova_domain_create(size_t size)
+{
+	int j, i = 0;
+	struct vduse_iova_domain *domain;
+	unsigned long num = size >> IOVA_CHUNK_SHIFT;
+	unsigned long addr = 0;
+
+	if (size < IOVA_MIN_SIZE || size & ~IOVA_CHUNK_MASK)
+		return NULL;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->chunks = kcalloc(num, sizeof(struct vduse_iova_chunk), GFP_KERNEL);
+	if (!domain->chunks)
+		goto err;
+
+	for (i = 0; i < num; i++, addr += IOVA_CHUNK_SIZE)
+		if (vduse_iova_chunk_init(&domain->chunks[i], addr,
+					IOVA_CHUNK_SIZE))
+			goto err;
+
+	domain->chunk_num = num;
+	domain->size = size;
+	INIT_LIST_HEAD(&domain->vma_list);
+	mutex_init(&domain->vma_lock);
+	mutex_init(&domain->map_lock);
+
+	return domain;
+err:
+	for (j = 0; j < i; j++)
+		vduse_iova_chunk_cleanup(&domain->chunks[j]);
+	kfree(domain->chunks);
+	kfree(domain);
+	return NULL;
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
new file mode 100644
index 000000000000..fe1816287f5f
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_IOVA_DOMAIN_H
+#define _VDUSE_IOVA_DOMAIN_H
+
+#include <linux/genalloc.h>
+#include <linux/dma-mapping.h>
+
+enum iova_map_type {
+	TYPE_NONE,
+	TYPE_DIRECT_MAP,
+	TYPE_BOUNCE_MAP,
+};
+
+struct vduse_iova_map {
+	unsigned long iova;
+	unsigned long orig;
+	size_t size;
+	enum dma_data_direction dir;
+};
+
+struct vduse_iova_chunk {
+	struct gen_pool *pool;
+	struct page **bounce_pages;
+	struct vduse_iova_map **iova_map;
+	unsigned long start;
+	atomic_t map_type;
+};
+
+struct vduse_iova_domain {
+	struct vduse_iova_chunk *chunks;
+	int chunk_num;
+	size_t size;
+	struct mutex map_lock;
+	struct mutex vma_lock;
+	struct list_head vma_list;
+};
+
+int vduse_domain_add_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma);
+
+void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
+				struct vm_area_struct *vma);
+
+int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
+				unsigned long iova, unsigned long orig,
+				size_t size, enum dma_data_direction dir);
+
+struct vduse_iova_map *
+vduse_domain_get_mapping(struct vduse_iova_domain *domain,
+			unsigned long iova);
+
+void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
+				struct vduse_iova_map *map);
+
+void vduse_domain_unmap(struct vduse_iova_domain *domain,
+			unsigned long iova, size_t size);
+
+int vduse_domain_direct_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova);
+
+void vduse_domain_bounce(struct vduse_iova_domain *domain,
+			unsigned long iova, unsigned long orig,
+			size_t size, enum dma_data_direction dir);
+
+int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
+			struct vm_area_struct *vma, unsigned long iova);
+
+bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
+				unsigned long iova);
+
+unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
+					size_t size, enum iova_map_type type);
+
+void vduse_domain_free_iova(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size);
+
+bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
+				unsigned long iova);
+
+void vduse_iova_domain_destroy(struct vduse_iova_domain *domain);
+
+struct vduse_iova_domain *vduse_iova_domain_create(size_t size);
+
+#endif /* _VDUSE_IOVA_DOMAIN_H */
diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
new file mode 100644
index 000000000000..1041ce7bddc4
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_H
+#define _VDUSE_H
+
+#include <linux/eventfd.h>
+#include <linux/wait.h>
+#include <linux/vdpa.h>
+
+#include "iova_domain.h"
+#include "eventfd.h"
+
+struct vduse_virtqueue {
+	u16 index;
+	bool ready;
+	spinlock_t kick_lock;
+	spinlock_t irq_lock;
+	struct eventfd_ctx *kickfd;
+	struct vduse_virqfd *virqfd;
+	void *private;
+	irqreturn_t (*cb)(void *data);
+};
+
+struct vduse_dev;
+
+struct vduse_vdpa {
+	struct vdpa_device vdpa;
+	struct vduse_dev *dev;
+};
+
+struct vduse_dev {
+	struct vduse_vdpa *vdev;
+	struct vduse_virtqueue *vqs;
+	struct vduse_iova_domain *domain;
+	struct mutex lock;
+	spinlock_t msg_lock;
+	atomic64_t msg_unique;
+	wait_queue_head_t waitq;
+	struct list_head send_list;
+	struct list_head recv_list;
+	struct list_head list;
+	refcount_t refcnt;
+	u32 id;
+	u16 vq_size_max;
+	u16 vq_num;
+	u32 vq_align;
+	u32 device_id;
+	u32 vendor_id;
+};
+
+#endif /* _VDUSE_H */
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index 000000000000..4a869b9698ef
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/dma-map-ops.h>
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/vdpa.h>
+#include <uapi/linux/vduse.h>
+#include <uapi/linux/vdpa.h>
+#include <uapi/linux/virtio_config.h>
+#include <linux/mod_devicetable.h>
+
+#include "vduse.h"
+
+#define DRV_VERSION  "1.0"
+#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
+#define DRV_DESC     "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+struct vduse_dev_msg {
+	struct vduse_dev_request req;
+	struct vduse_dev_response resp;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	bool completed;
+	refcount_t refcnt;
+};
+
+static struct workqueue_struct *vduse_vdpa_wq;
+static DEFINE_MUTEX(vduse_lock);
+static LIST_HEAD(vduse_devs);
+
+static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
+{
+	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
+
+	return vdev->dev;
+}
+
+static inline struct vduse_dev *dev_to_vduse(struct device *dev)
+{
+	struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+	return vdpa_to_vduse(vdpa);
+}
+
+static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
+{
+	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
+					GFP_KERNEL | __GFP_NOFAIL);
+
+	msg->req.type = type;
+	msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	init_waitqueue_head(&msg->waitq);
+	refcount_set(&msg->refcnt, 1);
+
+	return msg;
+}
+
+static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
+{
+	refcount_inc(&msg->refcnt);
+}
+
+static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
+{
+	if (refcount_dec_and_test(&msg->refcnt))
+		kfree(msg);
+}
+
+static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
+						struct list_head *head,
+						uint32_t unique)
+{
+	struct vduse_dev_msg *tmp, *msg = NULL;
+
+	spin_lock(&dev->msg_lock);
+	list_for_each_entry(tmp, head, list) {
+		if (tmp->req.unique == unique) {
+			msg = tmp;
+			list_del(&tmp->list);
+			break;
+		}
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return msg;
+}
+
+static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
+						struct list_head *head)
+{
+	struct vduse_dev_msg *msg = NULL;
+
+	spin_lock(&dev->msg_lock);
+	if (!list_empty(head)) {
+		msg = list_first_entry(head, struct vduse_dev_msg, list);
+		list_del(&msg->list);
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return msg;
+}
+
+static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
+			struct vduse_dev_msg *msg, struct list_head *head)
+{
+	spin_lock(&dev->msg_lock);
+	list_add_tail(&msg->list, head);
+	spin_unlock(&dev->msg_lock);
+}
+
+static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
+{
+	int ret;
+
+	vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+	wake_up(&dev->waitq);
+	wait_event(msg->waitq, msg->completed);
+	/* coupled with smp_wmb() in vduse_dev_msg_complete() */
+	smp_rmb();
+	ret = msg->resp.result;
+
+	return ret;
+}
+
+static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
+					struct vduse_dev_response *resp)
+{
+	vduse_dev_msg_get(msg);
+	memcpy(&msg->resp, resp, sizeof(*resp));
+	/* coupled with smp_rmb() in vduse_dev_msg_sync() */
+	smp_wmb();
+	msg->completed = true;
+	wake_up(&msg->waitq);
+	vduse_dev_msg_put(msg);
+}
+
+static u64 vduse_dev_get_features(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
+	u64 features;
+
+	vduse_dev_msg_sync(dev, msg);
+	features = msg->resp.features;
+	vduse_dev_msg_put(msg);
+
+	return features;
+}
+
+static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
+	int ret;
+
+	msg->req.size = sizeof(features);
+	msg->req.features = features;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static u8 vduse_dev_get_status(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
+	u8 status;
+
+	vduse_dev_msg_sync(dev, msg);
+	status = msg->resp.status;
+	vduse_dev_msg_put(msg);
+
+	return status;
+}
+
+static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
+
+	msg->req.size = sizeof(status);
+	msg->req.status = status;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
+					void *buf, unsigned int len)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
+
+	WARN_ON(len > sizeof(msg->req.config.data));
+
+	msg->req.size = sizeof(struct vduse_dev_config_data);
+	msg->req.config.offset = offset;
+	msg->req.config.len = len;
+	vduse_dev_msg_sync(dev, msg);
+	memcpy(buf, msg->resp.config.data, len);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
+					const void *buf, unsigned int len)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
+
+	WARN_ON(len > sizeof(msg->req.config.data));
+
+	msg->req.size = sizeof(struct vduse_dev_config_data);
+	msg->req.config.offset = offset;
+	msg->req.config.len = len;
+	memcpy(msg->req.config.data, buf, len);
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_set_vq_num(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, u32 num)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
+
+	msg->req.size = sizeof(struct vduse_vq_num);
+	msg->req.vq_num.index = vq->index;
+	msg->req.vq_num.num = num;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, u64 desc_addr,
+				u64 driver_addr, u64 device_addr)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
+	int ret;
+
+	msg->req.size = sizeof(struct vduse_vq_addr);
+	msg->req.vq_addr.index = vq->index;
+	msg->req.vq_addr.desc_addr = desc_addr;
+	msg->req.vq_addr.driver_addr = driver_addr;
+	msg->req.vq_addr.device_addr = device_addr;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, bool ready)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
+
+	msg->req.size = sizeof(struct vduse_vq_ready);
+	msg->req.vq_ready.index = vq->index;
+	msg->req.vq_ready.ready = ready;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
+				   struct vduse_virtqueue *vq)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
+	bool ready;
+
+	msg->req.size = sizeof(struct vduse_vq_ready);
+	msg->req.vq_ready.index = vq->index;
+
+	vduse_dev_msg_sync(dev, msg);
+	ready = msg->resp.vq_ready.ready;
+	vduse_dev_msg_put(msg);
+
+	return ready;
+}
+
+static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int size = sizeof(struct vduse_dev_request);
+	ssize_t ret = 0;
+
+	if (iov_iter_count(to) < size)
+		return 0;
+
+	while (1) {
+		msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
+		if (msg)
+			break;
+
+		if (file->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+
+		ret = wait_event_interruptible_exclusive(dev->waitq,
+					!list_empty(&dev->send_list));
+		if (ret)
+			return ret;
+	}
+	ret = copy_to_iter(&msg->req, size, to);
+	if (ret != size) {
+		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+		return -EFAULT;
+	}
+	vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_response resp;
+	struct vduse_dev_msg *msg;
+	size_t ret;
+
+	ret = copy_from_iter(&resp, sizeof(resp), from);
+	if (ret != sizeof(resp))
+		return -EINVAL;
+
+	msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
+	if (!msg)
+		return -EINVAL;
+
+	vduse_dev_msg_complete(msg, &resp);
+
+	return ret;
+}
+
+static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
+{
+	struct vduse_dev *dev = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &dev->waitq, wait);
+
+	if (!list_empty(&dev->send_list))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static void vduse_dev_reset(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		vq->ready = false;
+		vq->cb = NULL;
+		vq->private = NULL;
+		spin_unlock(&vq->irq_lock);
+	}
+}
+
+static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+				u64 desc_area, u64 driver_area,
+				u64 device_area)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_addr(dev, vq, desc_area,
+					driver_area, device_area);
+}
+
+static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_vq_kick(vq);
+}
+
+static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+			      struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->cb = cb->callback;
+	vq->private = cb->private;
+}
+
+static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_num(dev, vq, num);
+}
+
+static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
+					u16 idx, bool ready)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_ready(dev, vq, ready);
+	vq->ready = ready;
+}
+
+static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->ready = vduse_dev_get_vq_ready(dev, vq);
+
+	return vq->ready;
+}
+
+static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_align;
+}
+
+static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
+
+	return (vduse_dev_get_features(dev) | fixed);
+}
+
+static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_set_features(dev, features);
+}
+
+static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
+				  struct vdpa_callback *cb)
+{
+	/* We don't support config interrupt */
+}
+
+static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_size_max;
+}
+
+static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->device_id;
+}
+
+static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vendor_id;
+}
+
+static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_get_status(dev);
+}
+
+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	if (status == 0)
+		vduse_dev_reset(dev);
+
+	vduse_dev_set_status(dev, status);
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
+			     void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_get_config(dev, offset, buf, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
+			const void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_set_config(dev, offset, buf, len);
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_kickfd_release(dev);
+	vduse_virqfd_release(dev);
+
+	WARN_ON(!list_empty(&dev->send_list));
+	WARN_ON(!list_empty(&dev->recv_list));
+	dev->vdev = NULL;
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_features		= vduse_vdpa_get_features,
+	.set_features		= vduse_vdpa_set_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.free			= vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+					unsigned long offset, size_t size,
+					enum dma_data_direction dir,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = vduse_domain_alloc_iova(domain, size,
+							TYPE_BOUNCE_MAP);
+	unsigned long orig = (unsigned long)page_address(page) + offset;
+
+	if (!iova)
+		return DMA_MAPPING_ERROR;
+
+	if (vduse_domain_add_mapping(domain, iova, orig, size, dir)) {
+		vduse_domain_free_iova(domain, iova, size);
+		return DMA_MAPPING_ERROR;
+	}
+
+	if (dir == DMA_TO_DEVICE)
+		vduse_domain_bounce(domain, iova, orig, size, dir);
+
+	return (dma_addr_t)iova;
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = (unsigned long)dma_addr;
+	struct vduse_iova_map *map = vduse_domain_get_mapping(domain, iova);
+
+	if (WARN_ON(!map))
+		return;
+
+	if (dir == DMA_FROM_DEVICE)
+		vduse_domain_bounce(domain, iova, map->orig, size, dir);
+	vduse_domain_remove_mapping(domain, map);
+	vduse_domain_free_iova(domain, iova, size);
+	kfree(map);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+					dma_addr_t *dma_addr, gfp_t flag,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = vduse_domain_alloc_iova(domain, size,
+							TYPE_DIRECT_MAP);
+	void *orig = alloc_pages_exact(size, flag);
+
+	if (!iova || !orig)
+		goto err;
+
+	if (vduse_domain_add_mapping(domain, iova,
+				(unsigned long)orig, size, DMA_BIDIRECTIONAL))
+		goto err;
+
+	*dma_addr = (dma_addr_t)iova;
+
+	return orig;
+err:
+	*dma_addr = DMA_MAPPING_ERROR;
+	if (orig)
+		free_pages_exact(orig, size);
+	if (iova)
+		vduse_domain_free_iova(domain, iova, size);
+
+	return NULL;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova = (unsigned long)dma_addr;
+	struct vduse_iova_map *map = vduse_domain_get_mapping(domain, iova);
+
+	if (WARN_ON(!map))
+		return;
+
+	vduse_domain_remove_mapping(domain, map);
+	vduse_domain_unmap(domain, map->iova, PAGE_ALIGN(map->size));
+	free_pages_exact((void *)map->orig, map->size);
+	vduse_domain_free_iova(domain, map->iova, map->size);
+	kfree(map);
+}
+
+static const struct dma_map_ops vduse_dev_dma_ops = {
+	.map_page = vduse_dev_map_page,
+	.unmap_page = vduse_dev_unmap_page,
+	.alloc = vduse_dev_alloc_coherent,
+	.free = vduse_dev_free_coherent,
+};
+
+static void vduse_dev_mmap_open(struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+
+	if (!vduse_domain_add_vma(domain, vma))
+		return;
+
+	vma->vm_private_data = NULL;
+}
+
+static void vduse_dev_mmap_close(struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+
+	if (!domain)
+		return;
+
+	vduse_domain_remove_vma(domain, vma);
+}
+
+static int vduse_dev_mmap_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	return -EPERM;
+}
+
+static vm_fault_t vduse_dev_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vduse_iova_domain *domain = vma->vm_private_data;
+	unsigned long iova = vmf->address - vma->vm_start;
+	int ret;
+
+	if (!domain)
+		return VM_FAULT_SIGBUS;
+
+	if (vduse_domain_is_direct_map(domain, iova))
+		ret = vduse_domain_direct_map(domain, vma, iova);
+	else
+		ret = vduse_domain_bounce_map(domain, vma, iova);
+
+	if (ret == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (ret < 0 && ret != -EBUSY)
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vduse_dev_mmap_ops = {
+	.open = vduse_dev_mmap_open,
+	.close = vduse_dev_mmap_close,
+	.may_split = vduse_dev_mmap_split,
+	.fault = vduse_dev_mmap_fault,
+};
+
+static int vduse_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_iova_domain *domain = dev->domain;
+	unsigned long size = vma->vm_end - vma->vm_start;
+	int ret;
+
+	if (domain->size != size || vma->vm_pgoff)
+		return -EINVAL;
+
+	ret = vduse_domain_add_vma(domain, vma);
+	if (ret)
+		return ret;
+
+	vma->vm_flags |= VM_MIXEDMAP | VM_DONTCOPY |
+				VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_private_data = domain;
+	vma->vm_ops = &vduse_dev_mmap_ops;
+
+	return 0;
+}
+
+static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	struct vduse_dev *dev = file->private_data;
+	void __user *argp = (void __user *)arg;
+	int ret = -EINVAL;
+
+	mutex_lock(&dev->lock);
+	switch (cmd) {
+	case VDUSE_VQ_SETUP_KICKFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_kickfd_setup(dev, &eventfd);
+		break;
+	}
+	case VDUSE_VQ_SETUP_IRQFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_virqfd_setup(dev, &eventfd);
+		break;
+	}
+	}
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static int vduse_dev_release(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+
+	while ((msg = vduse_dev_dequeue_msg(dev, &dev->recv_list)))
+		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+
+	refcount_dec(&dev->refcnt);
+
+	return 0;
+}
+
+static const struct file_operations vduse_dev_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vduse_dev_release,
+	.read_iter	= vduse_dev_read_iter,
+	.write_iter	= vduse_dev_write_iter,
+	.poll		= vduse_dev_poll,
+	.mmap		= vduse_dev_mmap,
+	.unlocked_ioctl	= vduse_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct vduse_dev *vduse_dev_create(void)
+{
+	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+
+	if (!dev)
+		return NULL;
+
+	mutex_init(&dev->lock);
+	spin_lock_init(&dev->msg_lock);
+	INIT_LIST_HEAD(&dev->send_list);
+	INIT_LIST_HEAD(&dev->recv_list);
+	atomic64_set(&dev->msg_unique, 0);
+	init_waitqueue_head(&dev->waitq);
+	refcount_set(&dev->refcnt, 1);
+
+	return dev;
+}
+
+static void vduse_dev_destroy(struct vduse_dev *dev)
+{
+	mutex_destroy(&dev->lock);
+	kfree(dev);
+}
+
+static struct vduse_dev *vduse_find_dev(u32 id)
+{
+	struct vduse_dev *tmp, *dev = NULL;
+
+	list_for_each_entry(tmp, &vduse_devs, list) {
+		if (tmp->id == id) {
+			dev = tmp;
+			break;
+		}
+	}
+	return dev;
+}
+
+static int vduse_get_dev(u32 id)
+{
+	int fd;
+	char name[64];
+	struct vduse_dev *dev = vduse_find_dev(id);
+
+	if (!dev)
+		return -EINVAL;
+
+	snprintf(name, sizeof(name), "vduse-dev:%u", dev->id);
+	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	refcount_inc(&dev->refcnt);
+
+	return fd;
+}
+
+static int vduse_destroy_dev(u32 id)
+{
+	struct vduse_dev *dev = vduse_find_dev(id);
+
+	if (!dev)
+		return -EINVAL;
+
+	if (dev->vdev || refcount_read(&dev->refcnt) > 1)
+		return -EBUSY;
+
+	list_del(&dev->list);
+	kfree(dev->vqs);
+	vduse_iova_domain_destroy(dev->domain);
+	vduse_dev_destroy(dev);
+
+	return 0;
+}
+
+static int vduse_create_dev(struct vduse_dev_config *config)
+{
+	int i, fd = -ENOMEM;
+	struct vduse_dev *dev;
+	char name[64];
+
+	if (vduse_find_dev(config->id))
+		return -EEXIST;
+
+	dev = vduse_dev_create();
+	if (!dev)
+		return -ENOMEM;
+
+	dev->id = config->id;
+	dev->device_id = config->device_id;
+	dev->vendor_id = config->vendor_id;
+	dev->domain = vduse_iova_domain_create(config->iova_size);
+	if (!dev->domain)
+		goto err_domain;
+
+	dev->vq_align = config->vq_align;
+	dev->vq_size_max = config->vq_size_max;
+	dev->vq_num = config->vq_num;
+	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
+	if (!dev->vqs)
+		goto err_vqs;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		dev->vqs[i].index = i;
+		spin_lock_init(&dev->vqs[i].kick_lock);
+		spin_lock_init(&dev->vqs[i].irq_lock);
+	}
+
+	snprintf(name, sizeof(name), "vduse-dev:%u", config->id);
+	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		goto err_fd;
+
+	refcount_inc(&dev->refcnt);
+	list_add(&dev->list, &vduse_devs);
+
+	return fd;
+err_fd:
+	kfree(dev->vqs);
+err_vqs:
+	vduse_iova_domain_destroy(dev->domain);
+err_domain:
+	vduse_dev_destroy(dev);
+	return fd;
+}
+
+static long vduse_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret;
+	void __user *argp = (void __user *)arg;
+
+	mutex_lock(&vduse_lock);
+	switch (cmd) {
+	case VDUSE_CREATE_DEV: {
+		struct vduse_dev_config config;
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, sizeof(config)))
+			break;
+
+		ret = vduse_create_dev(&config);
+		break;
+	}
+	case VDUSE_GET_DEV:
+		ret = vduse_get_dev(arg);
+		break;
+	case VDUSE_DESTROY_DEV:
+		ret = vduse_destroy_dev(arg);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vduse_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct miscdevice vduse_misc = {
+	.fops = &vduse_fops,
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vduse",
+};
+
+static void vduse_parent_release(struct device *dev)
+{
+}
+
+static struct device vduse_parent = {
+	.init_name = "vduse",
+	.release = vduse_parent_release,
+};
+
+static struct vdpa_parent_dev parent_dev;
+
+static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
+{
+	struct vduse_vdpa *vdev = dev->vdev;
+	int ret;
+
+	if (vdev)
+		return -EEXIST;
+
+	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
+				&vduse_vdpa_config_ops, dev->vq_num, name);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->dev = dev;
+	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
+	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
+	if (ret)
+		goto err;
+
+	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
+	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
+	vdev->vdpa.pdev = &parent_dev;
+
+	ret = _vdpa_register_device(&vdev->vdpa);
+	if (ret)
+		goto err;
+
+	dev->vdev = vdev;
+
+	return 0;
+err:
+	put_device(&vdev->vdpa.dev);
+	return ret;
+}
+
+static struct vdpa_device *vdpa_dev_add(struct vdpa_parent_dev *pdev,
+					const char *name, u32 device_id,
+					struct nlattr **attrs)
+{
+	u32 vduse_id;
+	struct vduse_dev *dev;
+	int ret = -EINVAL;
+
+	if (!attrs[VDPA_ATTR_BACKEND_ID])
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&vduse_lock);
+	vduse_id = nla_get_u32(attrs[VDPA_ATTR_BACKEND_ID]);
+	dev = vduse_find_dev(vduse_id);
+	if (!dev)
+		goto unlock;
+
+	if (dev->device_id != device_id)
+		goto unlock;
+
+	ret = vduse_dev_add_vdpa(dev, name);
+unlock:
+	mutex_unlock(&vduse_lock);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return &dev->vdev->vdpa;
+}
+
+static void vdpa_dev_del(struct vdpa_parent_dev *pdev, struct vdpa_device *dev)
+{
+	_vdpa_unregister_device(dev);
+}
+
+static const struct vdpa_dev_ops vdpa_dev_parent_ops = {
+	.dev_add = vdpa_dev_add,
+	.dev_del = vdpa_dev_del
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct vdpa_parent_dev parent_dev = {
+	.device = &vduse_parent,
+	.id_table = id_table,
+	.ops = &vdpa_dev_parent_ops,
+};
+
+static int vduse_parentdev_init(void)
+{
+	int ret;
+
+	ret = device_register(&vduse_parent);
+	if (ret)
+		return ret;
+
+	ret = vdpa_parentdev_register(&parent_dev);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	device_unregister(&vduse_parent);
+	return ret;
+}
+
+static void vduse_parentdev_exit(void)
+{
+	vdpa_parentdev_unregister(&parent_dev);
+	device_unregister(&vduse_parent);
+}
+
+static int vduse_init(void)
+{
+	int ret;
+
+	ret = misc_register(&vduse_misc);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
+	vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
+	if (!vduse_vdpa_wq)
+		goto err_vdpa_wq;
+
+	ret = vduse_virqfd_init();
+	if (ret)
+		goto err_irqfd;
+
+	ret = vduse_parentdev_init();
+	if (ret)
+		goto err_parentdev;
+
+	return 0;
+err_parentdev:
+	vduse_virqfd_exit();
+err_irqfd:
+	destroy_workqueue(vduse_vdpa_wq);
+err_vdpa_wq:
+	misc_deregister(&vduse_misc);
+	return ret;
+}
+module_init(vduse_init);
+
+static void vduse_exit(void)
+{
+	misc_deregister(&vduse_misc);
+	destroy_workqueue(vduse_vdpa_wq);
+	vduse_virqfd_exit();
+	vduse_parentdev_exit();
+}
+module_exit(vduse_exit);
+
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
index bba8b83a94b5..a7a841e5ffc7 100644
--- a/include/uapi/linux/vdpa.h
+++ b/include/uapi/linux/vdpa.h
@@ -33,6 +33,7 @@ enum vdpa_attr {
 	VDPA_ATTR_DEV_VENDOR_ID,		/* u32 */
 	VDPA_ATTR_DEV_MAX_VQS,			/* u32 */
 	VDPA_ATTR_DEV_MAX_VQ_SIZE,		/* u16 */
+	VDPA_ATTR_BACKEND_ID,			/* u32 */
 
 	/* new attributes must be added above here */
 	VDPA_ATTR_MAX,
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
new file mode 100644
index 000000000000..f8579abdaa3b
--- /dev/null
+++ b/include/uapi/linux/vduse.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_VDUSE_H_
+#define _UAPI_VDUSE_H_
+
+#include <linux/types.h>
+
+/* the control messages definition for read/write */
+
+#define VDUSE_CONFIG_DATA_LEN	256
+
+enum vduse_req_type {
+	VDUSE_SET_VQ_NUM,
+	VDUSE_SET_VQ_ADDR,
+	VDUSE_SET_VQ_READY,
+	VDUSE_GET_VQ_READY,
+	VDUSE_SET_FEATURES,
+	VDUSE_GET_FEATURES,
+	VDUSE_SET_STATUS,
+	VDUSE_GET_STATUS,
+	VDUSE_SET_CONFIG,
+	VDUSE_GET_CONFIG,
+};
+
+struct vduse_vq_num {
+	__u32 index;
+	__u32 num;
+};
+
+struct vduse_vq_addr {
+	__u32 index;
+	__u64 desc_addr;
+	__u64 driver_addr;
+	__u64 device_addr;
+};
+
+struct vduse_vq_ready {
+	__u32 index;
+	__u8 ready;
+};
+
+struct vduse_dev_config_data {
+	__u32 offset;
+	__u32 len;
+	__u8 data[VDUSE_CONFIG_DATA_LEN];
+};
+
+struct vduse_dev_request {
+	__u32 type; /* request type */
+	__u32 unique; /* request id */
+	__u32 flags; /* request flags */
+	__u32 size; /* the payload size */
+	union {
+		struct vduse_vq_num vq_num; /* virtqueue num */
+		struct vduse_vq_addr vq_addr; /* virtqueue address */
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		__u64 features; /* virtio features */
+		__u8 status; /* device status */
+	};
+};
+
+struct vduse_dev_response {
+	__u32 unique; /* corresponding request id */
+	__s32 result; /* the result of request */
+	union {
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		__u64 features; /* virtio features */
+		__u8 status; /* device status */
+	};
+};
+
+/* ioctls */
+
+struct vduse_dev_config {
+	__u32 id; /* vduse device id */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 iova_size; /* iova space size, used for mmap(2) */
+	__u16 vq_num; /* the number of virtqueues */
+	__u16 vq_size_max; /* the max size of virtqueue */
+	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
+};
+
+struct vduse_vq_eventfd {
+	__u32 index; /* virtqueue index */
+	__u32 fd; /* eventfd */
+};
+
+#define VDUSE_BASE	0x81
+
+#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
+#define VDUSE_GET_DEV		_IO(VDUSE_BASE, 0x02)
+#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x03)
+
+#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
+#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
+
+#endif /* _UAPI_VDUSE_H_ */
-- 
2.11.0
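
For readers following the control path: the daemon reads one request from
the device fd, acts on it, and writes back a response that carries the same
"unique" id. Below is a minimal userspace sketch of that dispatch step. It
is not part of the patch. The demo_* types are trimmed local stand-ins for
the uapi structures above, and the device model (demo_device) and the error
policy (result = -1 on an unsupported request or feature bit) are
assumptions made for illustration only.

```c
#include <stdint.h>
#include <string.h>

/* Trimmed local stand-ins for the RFC's request types (hypothetical names). */
enum demo_req_type {
	DEMO_SET_FEATURES,
	DEMO_GET_FEATURES,
	DEMO_SET_STATUS,
	DEMO_GET_STATUS,
};

struct demo_request {
	uint32_t type;   /* request type */
	uint32_t unique; /* request id, echoed back in the response */
	union {
		uint64_t features;
		uint8_t status;
	};
};

struct demo_response {
	uint32_t unique; /* corresponding request id */
	int32_t result;  /* 0 on success, negative on failure (assumed) */
	union {
		uint64_t features;
		uint8_t status;
	};
};

/* Hypothetical state kept by the userspace device emulation. */
struct demo_device {
	uint64_t device_features; /* features offered by the device */
	uint64_t driver_features; /* features accepted from the driver */
	uint8_t status;
};

/* Build the response for one request; the caller would then write() it. */
static void demo_handle_request(struct demo_device *dev,
				const struct demo_request *req,
				struct demo_response *resp)
{
	memset(resp, 0, sizeof(*resp));
	resp->unique = req->unique;

	switch (req->type) {
	case DEMO_GET_FEATURES:
		resp->features = dev->device_features;
		break;
	case DEMO_SET_FEATURES:
		/* Reject feature bits the device never offered. */
		if (req->features & ~dev->device_features)
			resp->result = -1;
		else
			dev->driver_features = req->features;
		break;
	case DEMO_GET_STATUS:
		resp->status = dev->status;
		break;
	case DEMO_SET_STATUS:
		dev->status = req->status;
		break;
	default:
		resp->result = -1;
		break;
	}
}
```

In a real daemon this function would sit between a read() of
struct vduse_dev_request and a write() of struct vduse_dev_response on the
fd returned by VDUSE_CREATE_DEV or VDUSE_GET_DEV.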




* [RFC v2 07/13] vduse: support get/set virtqueue state
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (5 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops Xie Yongji
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

This patch allows the vhost-vdpa bus driver to get/set the virtqueue
state from the userspace VDUSE process.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/driver-api/vduse.rst |  4 +++
 drivers/vdpa/vdpa_user/vduse_dev.c | 54 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vduse.h         |  9 +++++++
 3 files changed, 67 insertions(+)

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
index da9b3040f20a..623f7b040ccf 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -30,6 +30,10 @@ The following types of messages are provided by the VDUSE framework now:
 
 - VDUSE_GET_VQ_READY: Get ready status of virtqueue
 
+- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
+
+- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue
+
 - VDUSE_SET_FEATURES: Set virtio features supported by the driver
 
 - VDUSE_GET_FEATURES: Get virtio features supported by the device
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 4a869b9698ef..b974333ed4e9 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -291,6 +291,40 @@ static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
 	return ready;
 }
 
+static int vduse_dev_get_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE);
+	int ret;
+
+	msg->req.size = sizeof(struct vduse_vq_state);
+	msg->req.vq_state.index = vq->index;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	state->avail_index = msg->resp.vq_state.avail_idx;
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static int vduse_dev_set_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE);
+	int ret;
+
+	msg->req.size = sizeof(struct vduse_vq_state);
+	msg->req.vq_state.index = vq->index;
+	msg->req.vq_state.avail_idx = state->avail_index;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
 static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
@@ -431,6 +465,24 @@ static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
 	return vq->ready;
 }
 
+static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_state(dev, vq, state);
+}
+
+static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_get_vq_state(dev, vq, state);
+}
+
 static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
 {
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
@@ -532,6 +584,8 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.set_vq_num             = vduse_vdpa_set_vq_num,
 	.set_vq_ready		= vduse_vdpa_set_vq_ready,
 	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
 	.get_vq_align		= vduse_vdpa_get_vq_align,
 	.get_features		= vduse_vdpa_get_features,
 	.set_features		= vduse_vdpa_set_features,
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index f8579abdaa3b..873305dfd93f 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -13,6 +13,8 @@ enum vduse_req_type {
 	VDUSE_SET_VQ_ADDR,
 	VDUSE_SET_VQ_READY,
 	VDUSE_GET_VQ_READY,
+	VDUSE_SET_VQ_STATE,
+	VDUSE_GET_VQ_STATE,
 	VDUSE_SET_FEATURES,
 	VDUSE_GET_FEATURES,
 	VDUSE_SET_STATUS,
@@ -38,6 +40,11 @@ struct vduse_vq_ready {
 	__u8 ready;
 };
 
+struct vduse_vq_state {
+	__u32 index;
+	__u16 avail_idx;
+};
+
 struct vduse_dev_config_data {
 	__u32 offset;
 	__u32 len;
@@ -53,6 +60,7 @@ struct vduse_dev_request {
 		struct vduse_vq_num vq_num; /* virtqueue num */
 		struct vduse_vq_addr vq_addr; /* virtqueue address */
 		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
 		struct vduse_dev_config_data config; /* virtio device config space */
 		__u64 features; /* virtio features */
 		__u8 status; /* device status */
@@ -64,6 +72,7 @@ struct vduse_dev_response {
 	__s32 result; /* the result of request */
 	union {
 		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
 		struct vduse_dev_config_data config; /* virtio device config space */
 		__u64 features; /* virtio features */
 		__u8 status; /* device status */
-- 
2.11.0
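
The state exchanged here is a single 16-bit last_avail_idx per virtqueue.
The sketch below shows how a userspace device might save and restore it,
and why the 16-bit width matters: split-ring indices wrap at 65536, so the
number of pending buffers has to be computed modulo 2^16. This is not part
of the patch. The demo_* names are hypothetical, and demo_vq_state only
mirrors the layout of struct vduse_vq_state above.

```c
#include <stdint.h>

/* Local mirror of the message payload (hypothetical name). */
struct demo_vq_state {
	uint32_t index;     /* virtqueue index */
	uint16_t avail_idx; /* last_avail_idx of that virtqueue */
};

/* Hypothetical per-queue bookkeeping in the userspace device. */
struct demo_vq {
	uint16_t last_avail_idx; /* next avail ring entry to consume */
};

/* Answer a "get state" request: report where the device stopped. */
static void demo_get_vq_state(const struct demo_vq *vq, uint32_t index,
			      struct demo_vq_state *out)
{
	out->index = index;
	out->avail_idx = vq->last_avail_idx;
}

/* Apply a "set state" request: resume from the given index. */
static void demo_set_vq_state(struct demo_vq *vq,
			      const struct demo_vq_state *in)
{
	vq->last_avail_idx = in->avail_idx;
}

/*
 * Buffers made available by the driver but not yet consumed.
 * The cast keeps the subtraction modulo 2^16, so it stays correct
 * across index wrap-around.
 */
static uint16_t demo_pending(const struct demo_vq *vq, uint16_t ring_avail_idx)
{
	return (uint16_t)(ring_avail_idx - vq->last_avail_idx);
}
```

Saving the index with demo_get_vq_state() before a stop and replaying it
with demo_set_vq_state() afterwards lets the device resume without
re-processing or skipping buffers.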




* [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (6 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 07/13] vduse: support get/set virtqueue state Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-23  8:36   ` Jason Wang
  2020-12-22 14:52 ` [RFC v2 09/13] vduse: Add support for processing vhost iotlb message Xie Yongji
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

This patch introduces a new method in the vdpa_config_ops to
support processing the raw vhost memory mapping message in the
vDPA device driver.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vhost/vdpa.c | 5 ++++-
 include/linux/vdpa.h | 7 +++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..ccbb391e38be 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 	if (r)
 		return r;
 
+	if (ops->process_iotlb_msg)
+		return ops->process_iotlb_msg(vdpa, msg);
+
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
 	int ret;
 
 	/* Device want to do DMA by itself */
-	if (ops->set_map || ops->dma_map)
+	if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
 		return 0;
 
 	bus = dma_dev->bus;
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 656fe264234e..7bccedf22f4b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -5,6 +5,7 @@
 #include <linux/kernel.h>
 #include <linux/device.h>
 #include <linux/interrupt.h>
+#include <linux/vhost_types.h>
 #include <linux/vhost_iotlb.h>
 #include <net/genetlink.h>
 
@@ -172,6 +173,10 @@ struct vdpa_iova_range {
  *				@vdev: vdpa device
  *				Returns the iova range supported by
  *				the device.
+ * @process_iotlb_msg:		Process vhost memory mapping message (optional)
+ *				Only used for VDUSE device now
+ *				@vdev: vdpa device
+ *				@msg: vhost memory mapping message
  * @set_map:			Set device memory mapping (optional)
  *				Needed for device that using device
  *				specific DMA translation (on-chip IOMMU)
@@ -240,6 +245,8 @@ struct vdpa_config_ops {
 	struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
 
 	/* DMA ops */
+	int (*process_iotlb_msg)(struct vdpa_device *vdev,
+				 struct vhost_iotlb_msg *msg);
 	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
 	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
 		       u64 pa, u32 perm);
-- 
2.11.0
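
The change to vhost_vdpa_process_iotlb_msg() follows the usual
optional-callback pattern: if the device driver provides the hook, the raw
message is handed over and the generic handling is skipped; otherwise the
built-in switch runs. The standalone sketch below illustrates only that
control flow. Every name in it (demo_ops, demo_dispatch, the message types,
the counting hook) is made up for illustration and is not vhost or vDPA API.

```c
#include <stddef.h>

enum { DEMO_MSG_UPDATE = 1, DEMO_MSG_INVALIDATE = 2 };

struct demo_msg {
	int type;
};

struct demo_ops {
	/* Optional: when set, the driver handles raw messages itself. */
	int (*process_msg)(void *dev, const struct demo_msg *msg);
};

/* Generic handling used when the driver supplies no hook. */
static int demo_default_process(const struct demo_msg *msg)
{
	switch (msg->type) {
	case DEMO_MSG_UPDATE:
	case DEMO_MSG_INVALIDATE:
		return 0;
	default:
		return -1; /* unknown message type */
	}
}

/* Prefer the driver hook; fall back to the generic path. */
static int demo_dispatch(const struct demo_ops *ops, void *dev,
			 const struct demo_msg *msg)
{
	if (ops->process_msg)
		return ops->process_msg(dev, msg);

	return demo_default_process(msg);
}

/* Example hook: counts the messages it sees via the opaque dev pointer. */
static int demo_counting_hook(void *dev, const struct demo_msg *msg)
{
	int *count = dev;

	(void)msg;
	(*count)++;
	return 0;
}
```

Note that once the hook is set, the generic path never sees the message,
which is why vhost_vdpa_alloc_domain() also treats the presence of the hook
as "the device does DMA translation by itself".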




* [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (7 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-23  9:05   ` Jason Wang
  2020-12-22 14:52 ` [RFC v2 10/13] vduse: grab the module's references until there is no vduse device Xie Yongji
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

To support the vhost-vdpa bus driver, we need a way to share the
vhost-vdpa backend process's memory with the userspace VDUSE process.

This patch tries to make use of the vhost iotlb message to achieve
that. We will get the shm file from the iotlb message and pass it
to the userspace VDUSE process.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/driver-api/vduse.rst |  15 +++-
 drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vduse.h         |  11 +++
 3 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
index 623f7b040ccf..48e4b1ba353f 100644
--- a/Documentation/driver-api/vduse.rst
+++ b/Documentation/driver-api/vduse.rst
@@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
 
 - VDUSE_GET_CONFIG: Read from device specific configuration space
 
+- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
+
+- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
+
 Please see include/linux/vdpa.h for details.
 
-In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+The data path of userspace vDPA device is implemented in different ways
+depending on the vdpa bus to which it is attached.
+
+In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
 driver which supports mapping the kernel dma buffer to a userspace iova
 region dynamically. The userspace iova region can be created by passing
 the userspace vDPA device fd to mmap(2).
 
+In the vhost-vdpa case, the dma buffer resides in a userspace memory region
+which will be shared with the VDUSE userspace process via the file
+descriptor in the VDUSE_UPDATE_IOTLB message. The corresponding address
+mapping (IOVA of dma buffer <-> VA of the memory region) is also included
+in this message.
+
 Besides, the eventfd mechanism is used to trigger interrupt callbacks and
 receive virtqueue kicks in userspace. The following ioctls on the userspace
 vDPA device fd are provided to support that:
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index b974333ed4e9..d24aaacb6008 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -34,6 +34,7 @@
 
 struct vduse_dev_msg {
 	struct vduse_dev_request req;
+	struct file *iotlb_file;
 	struct vduse_dev_response resp;
 	struct list_head list;
 	wait_queue_head_t waitq;
@@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
 	return ret;
 }
 
+static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
+				u64 offset, u64 iova, u64 size, u8 perm)
+{
+	struct vduse_dev_msg *msg;
+	int ret;
+
+	if (!size)
+		return -EINVAL;
+
+	msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+	msg->req.size = sizeof(struct vduse_iotlb);
+	msg->req.iotlb.offset = offset;
+	msg->req.iotlb.iova = iova;
+	msg->req.iotlb.size = size;
+	msg->req.iotlb.perm = perm;
+	msg->req.iotlb.fd = -1;
+	msg->iotlb_file = get_file(file);
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+	fput(file);
+
+	return ret;
+}
+
+static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
+					u64 iova, u64 size)
+{
+	struct vduse_dev_msg *msg;
+	int ret;
+
+	if (!size)
+		return -EINVAL;
+
+	msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
+	msg->req.size = sizeof(struct vduse_iotlb);
+	msg->req.iotlb.iova = iova;
+	msg->req.iotlb.size = size;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+	unsigned int flags = 0;
+
+	switch (perm) {
+	case VHOST_ACCESS_WO:
+		flags |= O_WRONLY;
+		break;
+	case VHOST_ACCESS_RO:
+		flags |= O_RDONLY;
+		break;
+	case VHOST_ACCESS_RW:
+		flags |= O_RDWR;
+		break;
+	default:
+		WARN(1, "invalid vhost IOTLB permission\n");
+		break;
+	}
+
+	return flags;
+}
+
 static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
 	struct vduse_dev *dev = file->private_data;
 	struct vduse_dev_msg *msg;
-	int size = sizeof(struct vduse_dev_request);
+	unsigned int flags;
+	int fd, size = sizeof(struct vduse_dev_request);
 	ssize_t ret = 0;
 
 	if (iov_iter_count(to) < size)
@@ -349,6 +418,18 @@ static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		if (ret)
 			return ret;
 	}
+
+	if (msg->req.type == VDUSE_UPDATE_IOTLB && msg->req.iotlb.fd == -1) {
+		flags = perm_to_file_flags(msg->req.iotlb.perm);
+		fd = get_unused_fd_flags(flags);
+		if (fd < 0) {
+			vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+			return fd;
+		}
+		fd_install(fd, get_file(msg->iotlb_file));
+		msg->req.iotlb.fd = fd;
+	}
+
 	ret = copy_to_iter(&msg->req, size, to);
 	if (ret != size) {
 		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
@@ -565,6 +646,69 @@ static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
 	vduse_dev_set_config(dev, offset, buf, len);
 }
 
+static void vduse_vdpa_invalidate_iotlb(struct vduse_dev *dev,
+					struct vhost_iotlb_msg *msg)
+{
+	vduse_dev_invalidate_iotlb(dev, msg->iova, msg->size);
+}
+
+static int vduse_vdpa_update_iotlb(struct vduse_dev *dev,
+					struct vhost_iotlb_msg *msg)
+{
+	u64 uaddr = msg->uaddr;
+	u64 iova = msg->iova;
+	u64 size = msg->size;
+	u64 offset;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (uaddr < msg->uaddr + msg->size) {
+		vma = find_vma(current->mm, uaddr);
+		ret = -EINVAL;
+		if (!vma)
+			goto err;
+
+		size = min(msg->size, vma->vm_end - uaddr);
+		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
+		if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
+			ret = vduse_dev_update_iotlb(dev, vma->vm_file, offset,
+							iova, size, msg->perm);
+			if (ret)
+				goto err;
+		}
+		iova += size;
+		uaddr += size;
+	}
+	return 0;
+err:
+	vduse_dev_invalidate_iotlb(dev, msg->iova, iova - msg->iova);
+	return ret;
+}
+
+static int vduse_vdpa_process_iotlb_msg(struct vdpa_device *vdpa,
+					struct vhost_iotlb_msg *msg)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	int ret = 0;
+
+	switch (msg->type) {
+	case VHOST_IOTLB_UPDATE:
+		ret = vduse_vdpa_update_iotlb(dev, msg);
+		break;
+	case VHOST_IOTLB_INVALIDATE:
+		vduse_vdpa_invalidate_iotlb(dev, msg);
+		break;
+	case VHOST_IOTLB_BATCH_BEGIN:
+	case VHOST_IOTLB_BATCH_END:
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 static void vduse_vdpa_free(struct vdpa_device *vdpa)
 {
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
@@ -597,6 +741,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.set_status		= vduse_vdpa_set_status,
 	.get_config		= vduse_vdpa_get_config,
 	.set_config		= vduse_vdpa_set_config,
+	.process_iotlb_msg	= vduse_vdpa_process_iotlb_msg,
 	.free			= vduse_vdpa_free,
 };
 
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 873305dfd93f..c5080851f140 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -21,6 +21,8 @@ enum vduse_req_type {
 	VDUSE_GET_STATUS,
 	VDUSE_SET_CONFIG,
 	VDUSE_GET_CONFIG,
+	VDUSE_UPDATE_IOTLB,
+	VDUSE_INVALIDATE_IOTLB,
 };
 
 struct vduse_vq_num {
@@ -51,6 +53,14 @@ struct vduse_dev_config_data {
 	__u8 data[VDUSE_CONFIG_DATA_LEN];
 };
 
+struct vduse_iotlb {
+	__u32 fd;
+	__u64 offset;
+	__u64 iova;
+	__u64 size;
+	__u8 perm;
+};
+
 struct vduse_dev_request {
 	__u32 type; /* request type */
 	__u32 unique; /* request id */
@@ -62,6 +72,7 @@ struct vduse_dev_request {
 		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
 		struct vduse_vq_state vq_state; /* virtqueue state */
 		struct vduse_dev_config_data config; /* virtio device config space */
+		struct vduse_iotlb iotlb; /* iotlb message */
 		__u64 features; /* virtio features */
 		__u8 status; /* device status */
 	};
-- 
2.11.0
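
For readers following the loop in vduse_vdpa_update_iotlb() above, here is
a minimal userspace sketch of splitting a [uaddr, uaddr + size) range
across mappings. It is an illustration only, not the patch's code:
struct region, find_region() and split_range() are hypothetical stand-ins
for vm_area_struct, find_vma() and the loop body. Unlike the patch, the
sketch clamps each piece to the bytes still remaining in the request and
rejects an address that falls in a hole before the next mapping; whether
the driver should do the same is a question for review. The patch's check
for file-backed, shared mappings is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a VMA: [start, end), backed at a byte offset. */
struct region {
	uint64_t start;
	uint64_t end;
	uint64_t pgoff_bytes;	/* plays the role of vm_pgoff << PAGE_SHIFT */
};

/* One piece that would be forwarded as a single IOTLB update. */
struct piece {
	uint64_t iova;
	uint64_t offset;
	uint64_t size;
};

/* Like find_vma(): first region whose end lies above addr (regions sorted). */
static const struct region *find_region(const struct region *r, size_t n,
					uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < r[i].end)
			return &r[i];
	return NULL;
}

/* Returns the number of pieces written to out, or -1 on any error. */
static int split_range(const struct region *r, size_t n,
		       uint64_t uaddr, uint64_t iova, uint64_t size,
		       struct piece *out, size_t max_out)
{
	uint64_t end = uaddr + size;
	int count = 0;

	assert(size > 0 && end > uaddr);	/* no empty or wrapping ranges */

	while (uaddr < end) {
		const struct region *reg = find_region(r, n, uaddr);
		uint64_t len;

		/* No mapping, a hole before the mapping, or no room left. */
		if (!reg || uaddr < reg->start || (size_t)count == max_out)
			return -1;

		/* Clamp to what is left of the request and of this region. */
		len = end - uaddr;
		if (len > reg->end - uaddr)
			len = reg->end - uaddr;

		out[count].iova = iova;
		out[count].offset = reg->pgoff_bytes + (uaddr - reg->start);
		out[count].size = len;
		count++;

		iova += len;
		uaddr += len;
	}
	return count;
}
```

A request that straddles two regions comes back as two pieces, each with
its own file offset, and a request that runs past the last region fails
as a whole.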



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC v2 10/13] vduse: grab the module's references until there is no vduse device
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (8 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 09/13] vduse: Add support for processing vhost iotlb message Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 11/13] vduse/iova_domain: Support reclaiming bounce pages Xie Yongji
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

The module must not be unloaded while any vduse device exists.
So take a reference on the module when a vduse device is created
and hold it until the device is destroyed.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index d24aaacb6008..c29b24a7e7e9 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1052,6 +1052,7 @@ static int vduse_destroy_dev(u32 id)
 	kfree(dev->vqs);
 	vduse_iova_domain_destroy(dev->domain);
 	vduse_dev_destroy(dev);
+	module_put(THIS_MODULE);
 
 	return 0;
 }
@@ -1096,6 +1097,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
 
 	refcount_inc(&dev->refcnt);
 	list_add(&dev->list, &vduse_devs);
+	__module_get(THIS_MODULE);
 
 	return fd;
 err_fd:
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC v2 11/13] vduse/iova_domain: Support reclaiming bounce pages
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (9 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 10/13] vduse: grab the module's references until there is no vduse device Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 12/13] vduse: Add memory shrinker to reclaim " Xie Yongji
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

Introduce vduse_domain_reclaim() to support reclaiming bounce pages
when necessary. Reclaiming is done chunk by chunk, and only iova
chunks that are not currently in use are reclaimed.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 83 ++++++++++++++++++++++++++++++++++--
 drivers/vdpa/vdpa_user/iova_domain.h | 10 +++++
 2 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 27022157abc6..c438cc85d33d 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -29,6 +29,8 @@ struct vduse_mmap_vma {
 	struct list_head list;
 };
 
+struct percpu_counter vduse_total_bounce_pages;
+
 static inline struct page *
 vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
 				unsigned long iova)
@@ -48,6 +50,13 @@ vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
 	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
 	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
 
+	if (page) {
+		domain->chunks[index].used_bounce_pages++;
+		percpu_counter_inc(&vduse_total_bounce_pages);
+	} else {
+		domain->chunks[index].used_bounce_pages--;
+		percpu_counter_dec(&vduse_total_bounce_pages);
+	}
 	domain->chunks[index].bounce_pages[pgindex] = page;
 }
 
@@ -175,6 +184,29 @@ void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
 	}
 }
 
+static bool vduse_domain_try_unmap(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size)
+{
+	struct vduse_mmap_vma *mmap_vma;
+	unsigned long uaddr;
+	bool unmap = true;
+
+	mutex_lock(&domain->vma_lock);
+	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+		if (!mmap_read_trylock(mmap_vma->vma->vm_mm)) {
+			unmap = false;
+			break;
+		}
+
+		uaddr = iova + mmap_vma->vma->vm_start;
+		zap_page_range(mmap_vma->vma, uaddr, size);
+		mmap_read_unlock(mmap_vma->vma->vm_mm);
+	}
+	mutex_unlock(&domain->vma_lock);
+
+	return unmap;
+}
+
 void vduse_domain_unmap(struct vduse_iova_domain *domain,
 			unsigned long iova, size_t size)
 {
@@ -302,6 +334,32 @@ bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
 	return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
 }
 
+int vduse_domain_reclaim(struct vduse_iova_domain *domain)
+{
+	struct vduse_iova_chunk *chunk;
+	int i, freed = 0;
+
+	for (i = domain->chunk_num - 1; i >= 0; i--) {
+		chunk = &domain->chunks[i];
+		if (!chunk->used_bounce_pages)
+			continue;
+
+		if (atomic_cmpxchg(&chunk->state, 0, INT_MIN) != 0)
+			continue;
+
+		if (!vduse_domain_try_unmap(domain,
+				chunk->start, IOVA_CHUNK_SIZE)) {
+			atomic_sub(INT_MIN, &chunk->state);
+			break;
+		}
+		freed += vduse_domain_free_bounce_pages(domain,
+				chunk->start, IOVA_CHUNK_SIZE);
+		atomic_sub(INT_MIN, &chunk->state);
+	}
+
+	return freed;
+}
+
 unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
 					size_t size, enum iova_map_type type)
 {
@@ -319,10 +377,13 @@ unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
 		if (atomic_read(&chunk->map_type) != type)
 			continue;
 
-		iova = gen_pool_alloc_algo(chunk->pool, size,
+		if (atomic_fetch_inc(&chunk->state) >= 0) {
+			iova = gen_pool_alloc_algo(chunk->pool, size,
 					gen_pool_first_fit_align, &data);
-		if (iova)
-			break;
+			if (iova)
+				break;
+		}
+		atomic_dec(&chunk->state);
 	}
 
 	return iova;
@@ -335,6 +396,7 @@ void vduse_domain_free_iova(struct vduse_iova_domain *domain,
 	struct vduse_iova_chunk *chunk = &domain->chunks[index];
 
 	gen_pool_free(chunk->pool, iova, size);
+	atomic_dec(&chunk->state);
 }
 
 static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk)
@@ -351,7 +413,8 @@ void vduse_iova_domain_destroy(struct vduse_iova_domain *domain)
 
 	for (i = 0; i < domain->chunk_num; i++) {
 		chunk = &domain->chunks[i];
-		vduse_domain_free_bounce_pages(domain,
+		if (chunk->used_bounce_pages)
+			vduse_domain_free_bounce_pages(domain,
 					chunk->start, IOVA_CHUNK_SIZE);
 		vduse_iova_chunk_cleanup(chunk);
 	}
@@ -390,8 +453,10 @@ static int vduse_iova_chunk_init(struct vduse_iova_chunk *chunk,
 	if (!chunk->iova_map)
 		goto err;
 
+	chunk->used_bounce_pages = 0;
 	chunk->start = addr;
 	atomic_set(&chunk->map_type, TYPE_NONE);
+	atomic_set(&chunk->state, 0);
 
 	return 0;
 err:
@@ -440,3 +505,13 @@ struct vduse_iova_domain *vduse_iova_domain_create(size_t size)
 
 	return NULL;
 }
+
+int vduse_domain_init(void)
+{
+	return percpu_counter_init(&vduse_total_bounce_pages, 0, GFP_KERNEL);
+}
+
+void vduse_domain_exit(void)
+{
+	percpu_counter_destroy(&vduse_total_bounce_pages);
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index fe1816287f5f..6815b00629d2 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -31,8 +31,10 @@ struct vduse_iova_chunk {
 	struct gen_pool *pool;
 	struct page **bounce_pages;
 	struct vduse_iova_map **iova_map;
+	int used_bounce_pages;
 	unsigned long start;
 	atomic_t map_type;
+	atomic_t state;
 };
 
 struct vduse_iova_domain {
@@ -44,6 +46,8 @@ struct vduse_iova_domain {
 	struct list_head vma_list;
 };
 
+extern struct percpu_counter vduse_total_bounce_pages;
+
 int vduse_domain_add_vma(struct vduse_iova_domain *domain,
 				struct vm_area_struct *vma);
 
@@ -77,6 +81,8 @@ int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
 bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
 				unsigned long iova);
 
+int vduse_domain_reclaim(struct vduse_iova_domain *domain);
+
 unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
 					size_t size, enum iova_map_type type);
 
@@ -90,4 +96,8 @@ void vduse_iova_domain_destroy(struct vduse_iova_domain *domain);
 
 struct vduse_iova_domain *vduse_iova_domain_create(size_t size);
 
+int vduse_domain_init(void);
+
+void vduse_domain_exit(void);
+
 #endif /* _VDUSE_IOVA_DOMAIN_H */
-- 
2.11.0
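
The chunk->state counter in the patch above doubles as a user count and a
reclaim lock. The following is a minimal userspace sketch of that scheme
using C11 atomics instead of the kernel's atomic_t. The names chunk_state,
chunk_get(), chunk_put(), chunk_reclaim_begin() and chunk_reclaim_end()
are hypothetical and do not appear in the series; the sketch only models
the counter, not the gen_pool allocation or the page freeing.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model of chunk->state:
 *   >= 0  number of users currently holding the chunk
 *   <  0  a reclaimer owns the chunk (biased by INT_MIN)
 */
struct chunk_state {
	atomic_int state;
};

/* Allocation side: take a reference unless a reclaim is in progress. */
static bool chunk_get(struct chunk_state *c)
{
	if (atomic_fetch_add(&c->state, 1) >= 0)
		return true;
	/* Reclaim in progress: undo the transient increment. */
	atomic_fetch_sub(&c->state, 1);
	return false;
}

/* Free side: drop a reference taken by a successful chunk_get(). */
static void chunk_put(struct chunk_state *c)
{
	int old = atomic_fetch_sub(&c->state, 1);

	assert(old > 0);	/* only holders may put */
}

/* Reclaim side: succeeds only when nobody uses the chunk. */
static bool chunk_reclaim_begin(struct chunk_state *c)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&c->state, &expected, INT_MIN);
}

/*
 * Subtracting INT_MIN removes the bias. C11 atomic arithmetic on signed
 * types wraps, so transient increments from failed chunk_get() calls
 * that have not yet been undone are preserved.
 */
static void chunk_reclaim_end(struct chunk_state *c)
{
	atomic_fetch_sub(&c->state, INT_MIN);
}
```

In this model a reclaimer can never race with a holder: it only starts
from a count of zero, and any allocator arriving afterwards sees a
negative value and backs off.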



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC v2 12/13] vduse: Add memory shrinker to reclaim bounce pages
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (10 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 11/13] vduse/iova_domain: Support reclaiming bounce pages Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-22 14:52 ` [RFC v2 13/13] vduse: Introduce a workqueue for irq injection Xie Yongji
  2020-12-23  6:38 ` [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Jason Wang
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

Add a shrinker that reclaims pages used by the bounce buffer
in order to relieve memory pressure.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 51 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index c29b24a7e7e9..1bc2e627c476 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1142,6 +1142,43 @@ static long vduse_ioctl(struct file *file, unsigned int cmd,
 	return ret;
 }
 
+static unsigned long vduse_shrink_scan(struct shrinker *shrinker,
+					struct shrink_control *sc)
+{
+	unsigned long freed = 0;
+	struct vduse_dev *dev;
+
+	if (!mutex_trylock(&vduse_lock))
+		return SHRINK_STOP;
+
+	list_for_each_entry(dev, &vduse_devs, list) {
+		if (!dev->domain)
+			continue;
+
+		freed = vduse_domain_reclaim(dev->domain);
+		if (!freed)
+			continue;
+
+		list_move_tail(&dev->list, &vduse_devs);
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return freed ? freed : SHRINK_STOP;
+}
+
+static unsigned long vduse_shrink_count(struct shrinker *shrink,
+					struct shrink_control *sc)
+{
+	return percpu_counter_read_positive(&vduse_total_bounce_pages);
+}
+
+static struct shrinker vduse_bounce_pages_shrinker = {
+	.count_objects = vduse_shrink_count,
+	.scan_objects = vduse_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+};
+
 static const struct file_operations vduse_fops = {
 	.owner		= THIS_MODULE,
 	.unlocked_ioctl	= vduse_ioctl,
@@ -1292,12 +1329,24 @@ static int vduse_init(void)
 	if (ret)
 		goto err_irqfd;
 
+	ret = vduse_domain_init();
+	if (ret)
+		goto err_domain;
+
+	ret = register_shrinker(&vduse_bounce_pages_shrinker);
+	if (ret)
+		goto err_shrinker;
+
 	ret = vduse_parentdev_init();
 	if (ret)
 		goto err_parentdev;
 
 	return 0;
 err_parentdev:
+	unregister_shrinker(&vduse_bounce_pages_shrinker);
+err_shrinker:
+	vduse_domain_exit();
+err_domain:
 	vduse_virqfd_exit();
 err_irqfd:
 	destroy_workqueue(vduse_vdpa_wq);
@@ -1309,8 +1358,10 @@ module_init(vduse_init);
 
 static void vduse_exit(void)
 {
+	unregister_shrinker(&vduse_bounce_pages_shrinker);
 	misc_deregister(&vduse_misc);
 	destroy_workqueue(vduse_vdpa_wq);
+	vduse_domain_exit();
 	vduse_virqfd_exit();
 	vduse_parentdev_exit();
 }
-- 
2.11.0
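
The scan callback above reclaims from the first device that yields any
pages and then moves that device to the tail of the list, so successive
scans rotate through the devices. A minimal userspace sketch of that
policy follows. struct dev_model, scan_once() and SKETCH_SHRINK_STOP are
hypothetical names used only for illustration; the array of pointers
stands in for the vduse_devs list, and each device is assumed to give up
all of its reclaimable pages in one go, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the kernel's SHRINK_STOP value, (~0UL). */
#define SKETCH_SHRINK_STOP (~0UL)

/* Hypothetical model of a device with some reclaimable bounce pages. */
struct dev_model {
	unsigned long reclaimable;
};

/*
 * One scan pass: reclaim from the first device that has anything to
 * give, rotate it to the tail, and report the number of pages freed.
 */
static unsigned long scan_once(struct dev_model **devs, size_t n)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct dev_model *d = devs[i];
		unsigned long freed = d->reclaimable;

		if (!freed)
			continue;

		d->reclaimable = 0;

		/* Equivalent of list_move_tail(): shift the rest forward. */
		for (size_t j = i; j + 1 < n; j++)
			devs[j] = devs[j + 1];
		devs[n - 1] = d;

		return freed;
	}
	return SKETCH_SHRINK_STOP;
}
```

Devices with nothing to reclaim keep their position, so only devices that
actually gave up pages are pushed behind the others.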



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC v2 13/13] vduse: Introduce a workqueue for irq injection
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (11 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 12/13] vduse: Add memory shrinker to reclaim " Xie Yongji
@ 2020-12-22 14:52 ` Xie Yongji
  2020-12-23  6:38 ` [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Jason Wang
  13 siblings, 0 replies; 55+ messages in thread
From: Xie Yongji @ 2020-12-22 14:52 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

This patch introduces a dedicated workqueue for irq injection
so that we are able to do some performance tuning for it.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/eventfd.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
index dbffddb08908..caf7d8d68ac0 100644
--- a/drivers/vdpa/vdpa_user/eventfd.c
+++ b/drivers/vdpa/vdpa_user/eventfd.c
@@ -18,6 +18,7 @@
 #include "eventfd.h"
 
 static struct workqueue_struct *vduse_irqfd_cleanup_wq;
+static struct workqueue_struct *vduse_irq_wq;
 
 static void vduse_virqfd_shutdown(struct work_struct *work)
 {
@@ -57,7 +58,7 @@ static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
 	__poll_t flags = key_to_poll(key);
 
 	if (flags & EPOLLIN)
-		schedule_work(&virqfd->inject);
+		queue_work(vduse_irq_wq, &virqfd->inject);
 
 	if (flags & EPOLLHUP) {
 		spin_lock(&vq->irq_lock);
@@ -165,11 +166,18 @@ int vduse_virqfd_init(void)
 	if (!vduse_irqfd_cleanup_wq)
 		return -ENOMEM;
 
+	vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_SYSFS | WQ_UNBOUND, 0);
+	if (!vduse_irq_wq) {
+		destroy_workqueue(vduse_irqfd_cleanup_wq);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 void vduse_virqfd_exit(void)
 {
+	destroy_workqueue(vduse_irq_wq);
 	destroy_workqueue(vduse_irqfd_cleanup_wq);
 }
 
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [RFC v2 01/13] mm: export zap_page_range() for driver use
  2020-12-22 14:52 ` [RFC v2 01/13] mm: export zap_page_range() for driver use Xie Yongji
@ 2020-12-22 15:44   ` Christoph Hellwig
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2020-12-22 15:44 UTC (permalink / raw)
  To: Xie Yongji
  Cc: mst, jasowang, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel, linux-mm

On Tue, Dec 22, 2020 at 10:52:09PM +0800, Xie Yongji wrote:
> Export zap_page_range() for use in VDUSE.

Err, no.  This has absolutely no business being used by drivers.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
  2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (12 preceding siblings ...)
  2020-12-22 14:52 ` [RFC v2 13/13] vduse: Introduce a workqueue for irq injection Xie Yongji
@ 2020-12-23  6:38 ` Jason Wang
  2020-12-23  8:14   ` Jason Wang
  2020-12-23 10:59   ` Yongji Xie
  13 siblings, 2 replies; 55+ messages in thread
From: Jason Wang @ 2020-12-23  6:38 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/22 下午10:52, Xie Yongji wrote:
> This series introduces a framework, which can be used to implement
> vDPA Devices in a userspace program. The work consists of two parts:
> control path forwarding and data path offloading.
>
> In the control path, the VDUSE driver will make use of message
> mechanism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, the core is mapping dma buffer into VDUSE
> daemon's address space, which can be implemented in different ways
> depending on the vdpa bus to which the vDPA device is attached.
>
> In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
> bounce-buffering mechanism to achieve that.


Rethinking the bounce buffer: instead of using kernel pages with
mmap(), how about just using userspace pages, like vhost does?

It would mean we need a worker to do the bouncing, but then we would
not need to care about annoying things like page reclaim?


> And in the vhost-vdpa case, the dma
> buffer resides in a userspace memory region which can be shared with the
> VDUSE userspace process by transferring the shmfd.
>
> The details and our use case are shown below:
>
> ------------------------    -------------------------   ----------------------------------------------
> |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
> |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
> |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
> ------------+-----------     -----------+------------   -------------+----------------------+---------
>              |                           |                            |                      |
>              |                           |                            |                      |
> ------------+---------------------------+----------------------------+----------------------+---------
> |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
> |    -------+--------           --------+--------            -------+--------          -----+----    |
> |           |                           |                           |                       |        |
> | ----------+----------       ----------+-----------         -------+-------                |        |
> | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
> | ----------+----------       ----------+-----------         -------+-------                |        |
> |           |      virtio bus           |                           |                       |        |
> |   --------+----+-----------           |                           |                       |        |
> |                |                      |                           |                       |        |
> |      ----------+----------            |                           |                       |        |
> |      | virtio-blk device |            |                           |                       |        |
> |      ----------+----------            |                           |                       |        |
> |                |                      |                           |                       |        |
> |     -----------+-----------           |                           |                       |        |
> |     |  virtio-vdpa driver |           |                           |                       |        |
> |     -----------+-----------           |                           |                       |        |
> |                |                      |                           |    vdpa bus           |        |
> |     -----------+----------------------+---------------------------+------------           |        |
> |                                                                                        ---+---     |
> -----------------------------------------------------------------------------------------| NIC |------
>                                                                                           ---+---
>                                                                                              |
>                                                                                     ---------+---------
>                                                                                     | Remote Storages |
>                                                                                     -------------------
>
> We make use of it to implement a block device connecting to
> our distributed storage, which can be used both in containers and
> VMs. Thus, we can have a unified technology stack in these two cases.
>
> To test it with null-blk:
>
>    $ qemu-storage-daemon \
>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>        --monitor chardev=charmonitor \
>        --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
>        --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128
>
> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
>
> Future work:
>    - Improve performance (e.g. zero copy implementation in datapath)
>    - Config interrupt support
>    - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
>
> This is now based on below series:
> https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
>
> V1 to V2:
> - Add vhost-vdpa support


I may be missing something, but I don't see any code to support that.
E.g. neither set_map nor dma_map/unmap is implemented in the config ops.

Thanks


> - Add some documents
> - Based on the vdpa management tool
> - Introduce a workqueue for irq injection
> - Replace interval tree with array map to store the iova_map
>
> Xie Yongji (13):
>    mm: export zap_page_range() for driver use
>    eventfd: track eventfd_signal() recursion depth separately in different cases
>    eventfd: Increase the recursion depth of eventfd_signal()
>    vdpa: Remove the restriction that only supports virtio-net devices
>    vdpa: Pass the netlink attributes to ops.dev_add()
>    vduse: Introduce VDUSE - vDPA Device in Userspace
>    vduse: support get/set virtqueue state
>    vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
>    vduse: Add support for processing vhost iotlb message
>    vduse: grab the module's references until there is no vduse device
>    vduse/iova_domain: Support reclaiming bounce pages
>    vduse: Add memory shrinker to reclaim bounce pages
>    vduse: Introduce a workqueue for irq injection
>
>   Documentation/driver-api/vduse.rst                 |   91 ++
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |    8 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa.c                                |    2 +-
>   drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    3 +-
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/eventfd.c                   |  229 ++++
>   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>   drivers/vdpa/vdpa_user/iova_domain.c               |  517 ++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h               |  103 ++
>   drivers/vdpa/vdpa_user/vduse.h                     |   59 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1373 ++++++++++++++++++++
>   drivers/vhost/vdpa.c                               |   34 +-
>   fs/aio.c                                           |    3 +-
>   fs/eventfd.c                                       |   20 +-
>   include/linux/eventfd.h                            |    5 +-
>   include/linux/vdpa.h                               |   11 +-
>   include/uapi/linux/vdpa.h                          |    1 +
>   include/uapi/linux/vduse.h                         |  119 ++
>   mm/memory.c                                        |    1 +
>   21 files changed, 2598 insertions(+), 36 deletions(-)
>   create mode 100644 Documentation/driver-api/vduse.rst
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-12-22 14:52 ` [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2020-12-23  8:08   ` Jason Wang
  2020-12-23 14:17     ` Yongji Xie
  2021-01-08 13:32   ` Bob Liu
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-23  8:08 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/22 下午10:52, Xie Yongji wrote:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
>
> In the control path, the VDUSE driver will make use of message
> mechanism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, the VDUSE driver implements a MMU-based on-chip
> IOMMU driver which supports mapping the kernel dma buffer to a
> userspace iova region dynamically. Userspace can access those
> iova region via mmap(). Besides, the eventfd mechanism is used to
> trigger interrupt callbacks and receive virtqueue kicks in userspace
>
> Now we only support virtio-vdpa bus driver with this patch applied.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/driver-api/vduse.rst                 |   74 ++
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |    8 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>   drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
>   drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
>   include/uapi/linux/vdpa.h                          |    1 +
>   include/uapi/linux/vduse.h                         |   99 ++
>   13 files changed, 2173 insertions(+)
>   create mode 100644 Documentation/driver-api/vduse.rst
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> new file mode 100644
> index 000000000000..da9b3040f20a
> --- /dev/null
> +++ b/Documentation/driver-api/vduse.rst
> @@ -0,0 +1,74 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +
> +How VDUSE works
> +---------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> +to the new resources will be returned, which can be used to implement the
> +userspace vDPA device's control path and data path.
> +
> +To implement control path, the read/write operations to the file descriptor
> +will be used to receive/reply the control messages from/to VDUSE driver.
> +Those control messages are based on the vdpa_config_ops which defines a
> +unified interface to control different types of vDPA device.
> +
> +The following types of messages are provided by the VDUSE framework now:
> +
> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
> +
> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> +
> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> +
> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> +
> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> +
> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> +
> +- VDUSE_SET_STATUS: Set the device status
> +
> +- VDUSE_GET_STATUS: Get the device status
> +
> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> +
> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> +
> +Please see include/linux/vdpa.h for details.
> +
> +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> +driver which supports mapping the kernel dma buffer to a userspace iova
> +region dynamically. The userspace iova region can be created by passing
> +the userspace vDPA device fd to mmap(2).
> +
> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> +receive virtqueue kicks in userspace. The following ioctls on the userspace
> +vDPA device fd are provided to support that:
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE driver to notify userspace to consume the vring.
> +
> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
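
For anyone following along who has not used eventfd directly, here is a
minimal userspace sketch of the counter semantics that both the kickfd and
the irqfd rely on. It is not taken from the patch: the helper names are
made up for illustration, and handing the fd to the kernel through
VDUSE_VQ_SETUP_KICKFD or VDUSE_VQ_SETUP_IRQFD is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: create an eventfd whose counter starts at zero. */
int make_eventfd(void)
{
	return eventfd(0, EFD_CLOEXEC);
}

/* Notifier side: add one to the counter, waking any reader or poller. */
int signal_eventfd(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: block until the counter is non-zero, then return its
 * value. The read resets the counter to zero, so several signals that
 * arrive before the read are coalesced into a single wakeup.
 */
int64_t drain_eventfd(int fd)
{
	uint64_t cnt;

	if (read(fd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt))
		return -1;
	return (int64_t)cnt;
}
```

For the kickfd the VDUSE driver is the notifier and the daemon is the
consumer; for the irqfd the roles are reversed. Because signals coalesce,
the consumer should process the whole vring after each wakeup rather than
assume one wakeup per buffer.
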
> +
> +MMU-based IOMMU Driver
> +----------------------
> +The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
> +IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
> +for the DMA transfer so that the userspace process is able to use its virtual
> +address to access the dma buffer in kernel.
> +
> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> +prevent userspace accessing the original buffer directly which may contain other
> +kernel data. During the mapping, unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> +space instead of the original one.
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a28c839..71722e6f8f23 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index 4be7be39be26..211cc449cbd3 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -21,6 +21,14 @@ config VDPA_SIM
>   	  to RX. This device is used for testing, prototyping and
>   	  development of vDPA.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	default n


The "default n" is unnecessary; n is already the default in Kconfig.


> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index d160e9b63a66..66e97778ad03 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..b7645e36992b
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o eventfd.o


Do we really need eventfd.o here, considering we've already selected it?


> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> new file mode 100644
> index 000000000000..dbffddb08908
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/eventfd.c
> @@ -0,0 +1,221 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Eventfd support for VDUSE
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/eventfd.h>
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <uapi/linux/vduse.h>
> +
> +#include "eventfd.h"
> +
> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> +
> +static void vduse_virqfd_shutdown(struct work_struct *work)
> +{
> +	u64 cnt;
> +	struct vduse_virqfd *virqfd = container_of(work,
> +					struct vduse_virqfd, shutdown);
> +
> +	eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
> +	flush_work(&virqfd->inject);
> +	eventfd_ctx_put(virqfd->ctx);
> +	kfree(virqfd);
> +}
> +
> +static void vduse_virqfd_inject(struct work_struct *work)
> +{
> +	struct vduse_virqfd *virqfd = container_of(work,
> +					struct vduse_virqfd, inject);
> +	struct vduse_virtqueue *vq = virqfd->vq;
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb)
> +		vq->cb(vq->private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
> +{
> +	queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
> +}
> +
> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> +				int sync, void *key)
> +{
> +	struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
> +	struct vduse_virtqueue *vq = virqfd->vq;
> +
> +	__poll_t flags = key_to_poll(key);
> +
> +	if (flags & EPOLLIN)
> +		schedule_work(&virqfd->inject);
> +
> +	if (flags & EPOLLHUP) {
> +		spin_lock(&vq->irq_lock);
> +		if (vq->virqfd == virqfd) {
> +			vq->virqfd = NULL;
> +			virqfd_deactivate(virqfd);
> +		}
> +		spin_unlock(&vq->irq_lock);
> +	}
> +
> +	return 0;
> +}
> +
> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
> +			wait_queue_head_t *wqh, poll_table *pt)
> +{
> +	struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
> +
> +	add_wait_queue(wqh, &virqfd->wait);
> +}
> +
> +int vduse_virqfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct vduse_virqfd *virqfd;
> +	struct fd irqfd;
> +	struct eventfd_ctx *ctx;
> +	struct vduse_virtqueue *vq;
> +	__poll_t events;
> +	int ret;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
> +	if (!virqfd)
> +		return -ENOMEM;
> +
> +	INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
> +	INIT_WORK(&virqfd->inject, vduse_virqfd_inject);


Is there any reason a workqueue is a must here?


> +
> +	ret = -EBADF;
> +	irqfd = fdget(eventfd->fd);
> +	if (!irqfd.file)
> +		goto err_fd;
> +
> +	ctx = eventfd_ctx_fileget(irqfd.file);
> +	if (IS_ERR(ctx)) {
> +		ret = PTR_ERR(ctx);
> +		goto err_ctx;
> +	}
> +
> +	virqfd->vq = vq;
> +	virqfd->ctx = ctx;
> +	spin_lock(&vq->irq_lock);
> +	if (vq->virqfd)
> +		virqfd_deactivate(virqfd);
> +	vq->virqfd = virqfd;
> +	spin_unlock(&vq->irq_lock);
> +
> +	init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
> +	init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
> +
> +	events = vfs_poll(irqfd.file, &virqfd->pt);
> +
> +	/*
> +	 * Check if there was an event already pending on the eventfd
> +	 * before we registered and trigger it as if we didn't miss it.
> +	 */
> +	if (events & EPOLLIN)
> +		schedule_work(&virqfd->inject);
> +
> +	fdput(irqfd);
> +
> +	return 0;
> +err_ctx:
> +	fdput(irqfd);
> +err_fd:
> +	kfree(virqfd);
> +	return ret;
> +}
> +
> +void vduse_virqfd_release(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		if (vq->virqfd) {
> +			virqfd_deactivate(vq->virqfd);
> +			vq->virqfd = NULL;
> +		}
> +		spin_unlock(&vq->irq_lock);
> +	}
> +	flush_workqueue(vduse_irqfd_cleanup_wq);
> +}
> +
> +int vduse_virqfd_init(void)
> +{
> +	vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
> +						WQ_UNBOUND, 0);
> +	if (!vduse_irqfd_cleanup_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void vduse_virqfd_exit(void)
> +{
> +	destroy_workqueue(vduse_irqfd_cleanup_wq);
> +}
> +
> +void vduse_vq_kick(struct vduse_virtqueue *vq)
> +{
> +	spin_lock(&vq->kick_lock);
> +	if (vq->ready && vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx;
> +	struct vduse_virtqueue *vq;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	ctx = eventfd_ctx_fdget(eventfd->fd);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +void vduse_kickfd_release(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->kick_lock);
> +		if (vq->kickfd) {
> +			eventfd_ctx_put(vq->kickfd);
> +			vq->kickfd = NULL;
> +		}
> +		spin_unlock(&vq->kick_lock);
> +	}
> +}
> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
> new file mode 100644
> index 000000000000..14269ff27f47
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/eventfd.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Eventfd support for VDUSE
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_EVENTFD_H
> +#define _VDUSE_EVENTFD_H
> +
> +#include <linux/eventfd.h>
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <uapi/linux/vduse.h>
> +
> +#include "vduse.h"
> +
> +struct vduse_dev;
> +
> +struct vduse_virqfd {
> +	struct eventfd_ctx *ctx;
> +	struct vduse_virtqueue *vq;
> +	struct work_struct inject;
> +	struct work_struct shutdown;
> +	wait_queue_entry_t wait;
> +	poll_table pt;
> +};
> +
> +int vduse_virqfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd);
> +
> +void vduse_virqfd_release(struct vduse_dev *dev);
> +
> +int vduse_virqfd_init(void);
> +
> +void vduse_virqfd_exit(void);
> +
> +void vduse_vq_kick(struct vduse_virtqueue *vq);
> +
> +int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd);
> +
> +void vduse_kickfd_release(struct vduse_dev *dev);
> +
> +#endif /* _VDUSE_EVENTFD_H */
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> new file mode 100644
> index 000000000000..27022157abc6
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -0,0 +1,442 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/wait.h>
> +#include <linux/slab.h>
> +#include <linux/genalloc.h>
> +#include <linux/dma-mapping.h>
> +
> +#include "iova_domain.h"
> +
> +#define IOVA_CHUNK_SHIFT 26
> +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
> +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
> +
> +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
> +
> +#define IOVA_ALLOC_ORDER 12
> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> +
> +struct vduse_mmap_vma {
> +	struct vm_area_struct *vma;
> +	struct list_head list;
> +};
> +
> +static inline struct page *
> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
> +				unsigned long iova)
> +{
> +	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> +	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> +	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> +
> +	return domain->chunks[index].bounce_pages[pgindex];
> +}
> +
> +static inline void
> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> +				unsigned long iova, struct page *page)
> +{
> +	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> +	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> +	unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> +
> +	domain->chunks[index].bounce_pages[pgindex] = page;
> +}
> +
> +static inline struct vduse_iova_map *
> +vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
> +				unsigned long iova)
> +{
> +	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> +	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> +	unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> +
> +	return domain->chunks[index].iova_map[mapindex];
> +}
> +
> +static inline void
> +vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
> +			unsigned long iova, struct vduse_iova_map *map)
> +{
> +	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> +	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> +	unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> +
> +	domain->chunks[index].iova_map[mapindex] = map;
> +}
> +
> +static int
> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> +				unsigned long iova, size_t size)
> +{
> +	struct page *page;
> +	size_t walk_sz = 0;
> +	int frees = 0;
> +
> +	while (walk_sz < size) {
> +		page = vduse_domain_get_bounce_page(domain, iova);
> +		if (page) {
> +			vduse_domain_set_bounce_page(domain, iova, NULL);
> +			put_page(page);
> +			frees++;
> +		}
> +		iova += PAGE_SIZE;
> +		walk_sz += PAGE_SIZE;
> +	}
> +
> +	return frees;
> +}
> +
> +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
> +				struct vm_area_struct *vma)
> +{
> +	unsigned long size = vma->vm_end - vma->vm_start;
> +	struct vduse_mmap_vma *mmap_vma;
> +
> +	if (WARN_ON(size != domain->size))
> +		return -EINVAL;
> +
> +	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> +	if (!mmap_vma)
> +		return -ENOMEM;
> +
> +	mmap_vma->vma = vma;
> +	mutex_lock(&domain->vma_lock);
> +	list_add(&mmap_vma->list, &domain->vma_list);
> +	mutex_unlock(&domain->vma_lock);
> +
> +	return 0;
> +}
> +
> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
> +				struct vm_area_struct *vma)
> +{
> +	struct vduse_mmap_vma *mmap_vma;
> +
> +	mutex_lock(&domain->vma_lock);
> +	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> +		if (mmap_vma->vma == vma) {
> +			list_del(&mmap_vma->list);
> +			kfree(mmap_vma);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&domain->vma_lock);
> +}
> +
> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
> +				unsigned long iova, unsigned long orig,
> +				size_t size, enum dma_data_direction dir)
> +{
> +	struct vduse_iova_map *map;
> +	unsigned long last = iova + size;
> +
> +	map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
> +	if (!map)
> +		return -ENOMEM;
> +
> +	map->iova = iova;
> +	map->orig = orig;
> +	map->size = size;
> +	map->dir = dir;
> +
> +	while (iova < last) {
> +		vduse_domain_set_iova_map(domain, iova, map);
> +		iova += IOVA_ALLOC_SIZE;
> +	}
> +
> +	return 0;
> +}
> +
> +struct vduse_iova_map *
> +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
> +			unsigned long iova)
> +{
> +	return vduse_domain_get_iova_map(domain, iova);
> +}
> +
> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
> +				struct vduse_iova_map *map)
> +{
> +	unsigned long iova = map->iova;
> +	unsigned long last = iova + map->size;
> +
> +	while (iova < last) {
> +		vduse_domain_set_iova_map(domain, iova, NULL);
> +		iova += IOVA_ALLOC_SIZE;
> +	}
> +}
> +
> +void vduse_domain_unmap(struct vduse_iova_domain *domain,
> +			unsigned long iova, size_t size)
> +{
> +	struct vduse_mmap_vma *mmap_vma;
> +	unsigned long uaddr;
> +
> +	mutex_lock(&domain->vma_lock);
> +	list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> +		mmap_read_lock(mmap_vma->vma->vm_mm);
> +		uaddr = iova + mmap_vma->vma->vm_start;
> +		zap_page_range(mmap_vma->vma, uaddr, size);
> +		mmap_read_unlock(mmap_vma->vma->vm_mm);
> +	}
> +	mutex_unlock(&domain->vma_lock);
> +}
> +
> +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
> +			struct vm_area_struct *vma, unsigned long iova)
> +{
> +	unsigned long uaddr = iova + vma->vm_start;
> +	unsigned long start = iova & PAGE_MASK;
> +	unsigned long last = start + PAGE_SIZE - 1;
> +	unsigned long offset;
> +	struct vduse_iova_map *map;
> +	struct page *page = NULL;
> +
> +	map = vduse_domain_get_iova_map(domain, iova);
> +	if (map) {
> +		offset = last - map->iova;
> +		page = virt_to_page(map->orig + offset);
> +	}
> +
> +	return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> +}


So, as we discussed before, we need to find a way to make vhost work, and
it's better to make vhost transparent to VDUSE. One idea is to implement
a shadow virtqueue here: instead of trying to insert the pages into the
VDUSE userspace, we use the shadow virtqueue to relay the descriptors to
userspace. With this, we don't need things like shmfd.


> +
> +void vduse_domain_bounce(struct vduse_iova_domain *domain,
> +			unsigned long iova, unsigned long orig,
> +			size_t size, enum dma_data_direction dir)
> +{
> +	unsigned int offset = offset_in_page(iova);
> +
> +	while (size) {
> +		struct page *p = vduse_domain_get_bounce_page(domain, iova);
> +		size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
> +		void *addr;
> +
> +		if (p) {
> +			addr = page_address(p) + offset;
> +			if (dir == DMA_TO_DEVICE)
> +				memcpy(addr, (void *)orig, copy_len);
> +			else if (dir == DMA_FROM_DEVICE)
> +				memcpy((void *)orig, addr, copy_len);
> +		}


I think I'm missing something: for DMA_FROM_DEVICE, if p doesn't exist,
how is this expected to work? Or do we need to warn in that case?
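
To make the question concrete, here is a small userspace model of the loop
above. It is only a sketch under stated assumptions: plain byte arrays
stand in for struct page, a 4 KiB page size is assumed, and all sketch_*
names are hypothetical. Instead of silently skipping a missing page, it
reports how many bytes could not be copied back for DMA_FROM_DEVICE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

enum sketch_dir { SKETCH_TO_DEVICE, SKETCH_FROM_DEVICE };

/*
 * pages[i] backs the IOVA range [i * PAGE_SIZE, (i + 1) * PAGE_SIZE);
 * NULL means userspace never faulted that bounce page in.
 * Returns the number of FROM_DEVICE bytes that had no bounce page.
 */
size_t sketch_bounce(unsigned char **pages, unsigned long iova,
		     unsigned char *orig, size_t size, enum sketch_dir dir)
{
	size_t offset = iova % SKETCH_PAGE_SIZE;
	size_t missed = 0;

	while (size) {
		unsigned char *p = pages[iova / SKETCH_PAGE_SIZE];
		size_t room = SKETCH_PAGE_SIZE - offset;
		size_t copy_len = size < room ? size : room;

		if (p) {
			if (dir == SKETCH_TO_DEVICE)
				memcpy(p + offset, orig, copy_len);
			else
				memcpy(orig, p + offset, copy_len);
		} else if (dir == SKETCH_FROM_DEVICE) {
			/* Nothing was written by the device side here. */
			missed += copy_len;
		}
		size -= copy_len;
		orig += copy_len;
		iova += copy_len;
		offset = 0;
	}

	return missed;
}
```

A non-zero return value corresponds to the case I am asking about. Whether
the driver should warn, fail the unmap, or treat it as legitimate (the
device simply wrote nothing to that page) is the open question.
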


> +		size -= copy_len;
> +		orig += copy_len;
> +		iova += copy_len;
> +		offset = 0;
> +	}
> +}
> +
> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
> +			struct vm_area_struct *vma, unsigned long iova)
> +{
> +	unsigned long uaddr = iova + vma->vm_start;
> +	unsigned long start = iova & PAGE_MASK;
> +	unsigned long offset = 0;
> +	bool found = false;
> +	struct vduse_iova_map *map;
> +	struct page *page;
> +
> +	mutex_lock(&domain->map_lock);
> +
> +	page = vduse_domain_get_bounce_page(domain, iova);
> +	if (page)
> +		goto unlock;
> +
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		goto unlock;
> +
> +	while (offset < PAGE_SIZE) {
> +		unsigned int src_offset = 0, dst_offset = 0;
> +		void *src, *dst;
> +		size_t copy_len;
> +
> +		map = vduse_domain_get_iova_map(domain, start + offset);
> +		if (!map) {
> +			offset += IOVA_ALLOC_SIZE;
> +			continue;
> +		}
> +
> +		found = true;
> +		offset += map->size;
> +		if (map->dir == DMA_FROM_DEVICE)
> +			continue;
> +
> +		if (start > map->iova)
> +			src_offset = start - map->iova;
> +		else
> +			dst_offset = map->iova - start;
> +
> +		src = (void *)(map->orig + src_offset);
> +		dst = page_address(page) + dst_offset;
> +		copy_len = min_t(size_t, map->size - src_offset,
> +				PAGE_SIZE - dst_offset);
> +		memcpy(dst, src, copy_len);
> +	}
> +	if (!found) {
> +		put_page(page);
> +		page = NULL;
> +	}
> +	vduse_domain_set_bounce_page(domain, iova, page);
> +unlock:
> +	mutex_unlock(&domain->map_lock);
> +
> +	return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> +}
> +
> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> +				unsigned long iova)
> +{
> +	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> +	struct vduse_iova_chunk *chunk = &domain->chunks[index];
> +
> +	return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
> +}
> +
> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
> +					size_t size, enum iova_map_type type)
> +{
> +	struct vduse_iova_chunk *chunk;
> +	unsigned long iova = 0;
> +	int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
> +	struct genpool_data_align data = { .align = align };
> +	int i;
> +
> +	for (i = 0; i < domain->chunk_num; i++) {
> +		chunk = &domain->chunks[i];
> +		if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
> +			atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
> +
> +		if (atomic_read(&chunk->map_type) != type)
> +			continue;
> +
> +		iova = gen_pool_alloc_algo(chunk->pool, size,
> +					gen_pool_first_fit_align, &data);
> +		if (iova)
> +			break;
> +	}
> +
> +	return iova;


I wonder why we don't just reuse the IOVA domain implementation in
drivers/iommu/iova.c.


> +}
> +
> +void vduse_domain_free_iova(struct vduse_iova_domain *domain,
> +				unsigned long iova, size_t size)
> +{
> +	unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> +	struct vduse_iova_chunk *chunk = &domain->chunks[index];
> +
> +	gen_pool_free(chunk->pool, iova, size);
> +}
> +
> +static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk)
> +{
> +	vfree(chunk->bounce_pages);
> +	vfree(chunk->iova_map);
> +	gen_pool_destroy(chunk->pool);
> +}
> +
> +void vduse_iova_domain_destroy(struct vduse_iova_domain *domain)
> +{
> +	struct vduse_iova_chunk *chunk;
> +	int i;
> +
> +	for (i = 0; i < domain->chunk_num; i++) {
> +		chunk = &domain->chunks[i];
> +		vduse_domain_free_bounce_pages(domain,
> +					chunk->start, IOVA_CHUNK_SIZE);
> +		vduse_iova_chunk_cleanup(chunk);
> +	}
> +
> +	mutex_destroy(&domain->map_lock);
> +	mutex_destroy(&domain->vma_lock);
> +	kfree(domain->chunks);
> +	kfree(domain);
> +}
> +
> +static int vduse_iova_chunk_init(struct vduse_iova_chunk *chunk,
> +				unsigned long addr, size_t size)
> +{
> +	int ret;
> +	int pages = size >> PAGE_SHIFT;
> +
> +	chunk->pool = gen_pool_create(IOVA_ALLOC_ORDER, -1);
> +	if (!chunk->pool)
> +		return -ENOMEM;
> +
> +	/* addr 0 is used in allocation failure case */
> +	if (addr == 0)
> +		addr += IOVA_ALLOC_SIZE;
> +
> +	ret = gen_pool_add(chunk->pool, addr, size, -1);
> +	if (ret)
> +		goto err;
> +
> +	ret = -ENOMEM;
> +	chunk->bounce_pages = vzalloc(pages * sizeof(struct page *));
> +	if (!chunk->bounce_pages)
> +		goto err;
> +
> +	chunk->iova_map = vzalloc((size >> IOVA_ALLOC_ORDER) *
> +				sizeof(struct vduse_iova_map *));
> +	if (!chunk->iova_map)
> +		goto err;
> +
> +	chunk->start = addr;
> +	atomic_set(&chunk->map_type, TYPE_NONE);
> +
> +	return 0;
> +err:
> +	if (chunk->bounce_pages) {
> +		vfree(chunk->bounce_pages);
> +		chunk->bounce_pages = NULL;
> +	}
> +	gen_pool_destroy(chunk->pool);
> +	return ret;
> +}
> +
> +struct vduse_iova_domain *vduse_iova_domain_create(size_t size)
> +{
> +	int j, i = 0;
> +	struct vduse_iova_domain *domain;
> +	unsigned long num = size >> IOVA_CHUNK_SHIFT;
> +	unsigned long addr = 0;
> +
> +	if (size < IOVA_MIN_SIZE || size & ~IOVA_CHUNK_MASK)
> +		return NULL;
> +
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!domain)
> +		return NULL;
> +
> +	domain->chunks = kcalloc(num, sizeof(struct vduse_iova_chunk), GFP_KERNEL);
> +	if (!domain->chunks)
> +		goto err;
> +
> +	for (i = 0; i < num; i++, addr += IOVA_CHUNK_SIZE)
> +		if (vduse_iova_chunk_init(&domain->chunks[i], addr,
> +					IOVA_CHUNK_SIZE))
> +			goto err;
> +
> +	domain->chunk_num = num;
> +	domain->size = size;
> +	INIT_LIST_HEAD(&domain->vma_list);
> +	mutex_init(&domain->vma_lock);
> +	mutex_init(&domain->map_lock);
> +
> +	return domain;
> +err:
> +	for (j = 0; j < i; j++)
> +		vduse_iova_chunk_cleanup(&domain->chunks[j]);
> +	kfree(domain);
> +
> +	return NULL;
> +}
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> new file mode 100644
> index 000000000000..fe1816287f5f
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -0,0 +1,93 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_IOVA_DOMAIN_H
> +#define _VDUSE_IOVA_DOMAIN_H
> +
> +#include <linux/genalloc.h>
> +#include <linux/dma-mapping.h>
> +
> +enum iova_map_type {
> +	TYPE_NONE,
> +	TYPE_DIRECT_MAP,
> +	TYPE_BOUNCE_MAP,
> +};
> +
> +struct vduse_iova_map {
> +	unsigned long iova;
> +	unsigned long orig;
> +	size_t size;
> +	enum dma_data_direction dir;
> +};
> +
> +struct vduse_iova_chunk {
> +	struct gen_pool *pool;
> +	struct page **bounce_pages;
> +	struct vduse_iova_map **iova_map;
> +	unsigned long start;
> +	atomic_t map_type;
> +};
> +
> +struct vduse_iova_domain {
> +	struct vduse_iova_chunk *chunks;
> +	int chunk_num;
> +	size_t size;
> +	struct mutex map_lock;
> +	struct mutex vma_lock;
> +	struct list_head vma_list;
> +};


It would be better to explain why the bounce buffer needs to be organized
in chunks, either in a comment above or in the commit log. Is this
because you want O(1) lookup of the page for a specific IOVA?
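
If that is the intent, the lookup amounts to the two shifts below. This is
a standalone sketch of the index arithmetic in
vduse_domain_get_bounce_page(), not kernel code: the SKETCH_* macros and
struct iova_slot are hypothetical, and a 4 KiB page size is assumed.

```c
#include <assert.h>

#define SKETCH_CHUNK_SHIFT 26	/* 64 MiB chunks, as IOVA_CHUNK_SHIFT */
#define SKETCH_CHUNK_SIZE  (1UL << SKETCH_CHUNK_SHIFT)
#define SKETCH_CHUNK_MASK  (~(SKETCH_CHUNK_SIZE - 1))
#define SKETCH_PAGE_SHIFT  12	/* assumption: 4 KiB pages */

/* Position of an IOVA inside the two-level chunk/page table. */
struct iova_slot {
	unsigned long chunk;	/* index into domain->chunks[] */
	unsigned long page;	/* index into chunk->bounce_pages[] */
};

/* Constant-time translation: no tree walk or list search is needed. */
struct iova_slot iova_to_slot(unsigned long iova)
{
	struct iova_slot s;

	s.chunk = iova >> SKETCH_CHUNK_SHIFT;
	s.page = (iova & ~SKETCH_CHUNK_MASK) >> SKETCH_PAGE_SHIFT;

	return s;
}
```

With these constants each chunk carries a 16384-entry page pointer array,
which is the memory cost paid for the constant-time lookup. A sentence
along these lines in the commit log would answer the question.
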


> +
> +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
> +				struct vm_area_struct *vma);
> +
> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
> +				struct vm_area_struct *vma);
> +
> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
> +				unsigned long iova, unsigned long orig,
> +				size_t size, enum dma_data_direction dir);
> +
> +struct vduse_iova_map *
> +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
> +			unsigned long iova);
> +
> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
> +				struct vduse_iova_map *map);
> +
> +void vduse_domain_unmap(struct vduse_iova_domain *domain,
> +			unsigned long iova, size_t size);
> +
> +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
> +			struct vm_area_struct *vma, unsigned long iova);
> +
> +void vduse_domain_bounce(struct vduse_iova_domain *domain,
> +			unsigned long iova, unsigned long orig,
> +			size_t size, enum dma_data_direction dir);
> +
> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
> +			struct vm_area_struct *vma, unsigned long iova);
> +
> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> +				unsigned long iova);
> +
> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
> +					size_t size, enum iova_map_type type);
> +
> +void vduse_domain_free_iova(struct vduse_iova_domain *domain,
> +				unsigned long iova, size_t size);
> +
> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> +				unsigned long iova);
> +
> +void vduse_iova_domain_destroy(struct vduse_iova_domain *domain);
> +
> +struct vduse_iova_domain *vduse_iova_domain_create(size_t size);
> +
> +#endif /* _VDUSE_IOVA_DOMAIN_H */
> diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
> new file mode 100644
> index 000000000000..1041ce7bddc4
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse.h
> @@ -0,0 +1,59 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_H
> +#define _VDUSE_H
> +
> +#include <linux/eventfd.h>
> +#include <linux/wait.h>
> +#include <linux/vdpa.h>
> +
> +#include "iova_domain.h"
> +#include "eventfd.h"
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	bool ready;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vduse_virqfd *virqfd;
> +	void *private;
> +	irqreturn_t (*cb)(void *data);
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	struct mutex lock;
> +	spinlock_t msg_lock;
> +	atomic64_t msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct list_head list;
> +	refcount_t refcnt;
> +	u32 id;
> +	u16 vq_size_max;
> +	u16 vq_num;
> +	u32 vq_align;
> +	u32 device_id;
> +	u32 vendor_id;
> +};
> +
> +#endif /* _VDUSE_H_ */
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..4a869b9698ef
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1121 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/miscdevice.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "vduse.h"
> +
> +#define DRV_VERSION  "1.0"
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +	refcount_t refcnt;
> +};
> +
> +static struct workqueue_struct *vduse_vdpa_wq;
> +static DEFINE_MUTEX(vduse_lock);
> +static LIST_HEAD(vduse_devs);
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = type;
> +	msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	init_waitqueue_head(&msg->waitq);
> +	refcount_set(&msg->refcnt, 1);
> +
> +	return msg;
> +}
> +
> +static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
> +{
> +	refcount_inc(&msg->refcnt);
> +}
> +
> +static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
> +{
> +	if (refcount_dec_and_test(&msg->refcnt))
> +		kfree(msg);
> +}
> +
> +static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
> +						struct list_head *head,
> +						uint32_t unique)
> +{
> +	struct vduse_dev_msg *tmp, *msg = NULL;
> +
> +	spin_lock(&dev->msg_lock);
> +	list_for_each_entry(tmp, head, list) {
> +		if (tmp->req.unique == unique) {
> +			msg = tmp;
> +			list_del(&tmp->list);
> +			break;
> +		}
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return msg;
> +}
> +
> +static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
> +						struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	spin_lock(&dev->msg_lock);
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return msg;
> +}
> +
> +static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
> +			struct vduse_dev_msg *msg, struct list_head *head)
> +{
> +	spin_lock(&dev->msg_lock);
> +	list_add_tail(&msg->list, head);
> +	spin_unlock(&dev->msg_lock);
> +}
> +
> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> +{
> +	int ret;
> +
> +	vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> +	wake_up(&dev->waitq);
> +	wait_event(msg->waitq, msg->completed);
> +	/* coupled with smp_wmb() in vduse_dev_msg_complete() */
> +	smp_rmb();
> +	ret = msg->resp.result;
> +
> +	return ret;
> +}
> +
> +static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
> +					struct vduse_dev_response *resp)
> +{
> +	vduse_dev_msg_get(msg);
> +	memcpy(&msg->resp, resp, sizeof(*resp));
> +	/* coupled with smp_rmb() in vduse_dev_msg_sync() */
> +	smp_wmb();
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
> +	u64 features;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	features = msg->resp.features;
> +	vduse_dev_msg_put(msg);
> +
> +	return features;
> +}
> +
> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
> +	int ret;
> +
> +	msg->req.size = sizeof(features);
> +	msg->req.features = features;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
> +	u8 status;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	status = msg->resp.status;
> +	vduse_dev_msg_put(msg);
> +
> +	return status;
> +}
> +
> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
> +
> +	msg->req.size = sizeof(status);
> +	msg->req.status = status;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> +					void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
> +
> +	WARN_ON(len > sizeof(msg->req.config.data));
> +
> +	msg->req.size = sizeof(struct vduse_dev_config_data);
> +	msg->req.config.offset = offset;
> +	msg->req.config.len = len;
> +	vduse_dev_msg_sync(dev, msg);
> +	memcpy(buf, msg->resp.config.data, len);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> +					const void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
> +
> +	WARN_ON(len > sizeof(msg->req.config.data));
> +
> +	msg->req.size = sizeof(struct vduse_dev_config_data);
> +	msg->req.config.offset = offset;
> +	msg->req.config.len = len;
> +	memcpy(msg->req.config.data, buf, len);
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u32 num)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
> +
> +	msg->req.size = sizeof(struct vduse_vq_num);
> +	msg->req.vq_num.index = vq->index;
> +	msg->req.vq_num.num = num;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u64 desc_addr,
> +				u64 driver_addr, u64 device_addr)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
> +	int ret;
> +
> +	msg->req.size = sizeof(struct vduse_vq_addr);
> +	msg->req.vq_addr.index = vq->index;
> +	msg->req.vq_addr.desc_addr = desc_addr;
> +	msg->req.vq_addr.driver_addr = driver_addr;
> +	msg->req.vq_addr.device_addr = device_addr;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, bool ready)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
> +
> +	msg->req.size = sizeof(struct vduse_vq_ready);
> +	msg->req.vq_ready.index = vq->index;
> +	msg->req.vq_ready.ready = ready;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> +				   struct vduse_virtqueue *vq)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
> +	bool ready;
> +
> +	msg->req.size = sizeof(struct vduse_vq_ready);
> +	msg->req.vq_ready.index = vq->index;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	ready = msg->resp.vq_ready.ready;
> +	vduse_dev_msg_put(msg);
> +
> +	return ready;
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret = 0;
> +
> +	if (iov_iter_count(to) < size)
> +		return 0;
> +
> +	while (1) {
> +		msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
> +		if (msg)
> +			break;
> +
> +		if (file->f_flags & O_NONBLOCK)
> +			return -EAGAIN;
> +
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +	}
> +	ret = copy_to_iter(&msg->req, size, to);
> +	if (ret != size) {
> +		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> +		return -EFAULT;
> +	}
> +	vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
> +	if (!msg)
> +		return -EINVAL;
> +
> +	vduse_dev_msg_complete(msg, &resp);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->ready = false;
> +		vq->cb = NULL;
> +		vq->private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_addr(dev, vq, desc_area,
> +					driver_area, device_area);
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_vq_kick(vq);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->cb = cb->callback;
> +	vq->private = cb->private;
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_num(dev, vq, num);
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_ready(dev, vq, ready);
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = vduse_dev_get_vq_ready(dev, vq);
> +
> +	return vq->ready;
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> +
> +	return (vduse_dev_get_features(dev) | fixed);
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_set_features(dev, features);
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	/* We don't support config interrupt */
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_status(dev);
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	if (status == 0)
> +		vduse_dev_reset(dev);
> +
> +	vduse_dev_set_status(dev, status);
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +			     void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_get_config(dev, offset, buf, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_set_config(dev, offset, buf, len);
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_kickfd_release(dev);
> +	vduse_virqfd_release(dev);
> +
> +	WARN_ON(!list_empty(&dev->send_list));
> +	WARN_ON(!list_empty(&dev->recv_list));
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +					unsigned long offset, size_t size,
> +					enum dma_data_direction dir,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova = vduse_domain_alloc_iova(domain, size,
> +							TYPE_BOUNCE_MAP);
> +	unsigned long orig = (unsigned long)page_address(page) + offset;
> +
> +	if (!iova)
> +		return DMA_MAPPING_ERROR;
> +
> +	if (vduse_domain_add_mapping(domain, iova, orig, size, dir)) {
> +		vduse_domain_free_iova(domain, iova, size);
> +		return DMA_MAPPING_ERROR;
> +	}
> +
> +	if (dir == DMA_TO_DEVICE)


How about bidirectional mapping?
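
To make the question concrete: with a bounce buffer, a DMA_BIDIRECTIONAL mapping presumably needs the copy on both sides, into the bounce page at map time and back to the original page at unmap time, while the quoted code only tests for DMA_TO_DEVICE and DMA_FROM_DEVICE. A minimal userspace sketch of the two predicates follows. The enum mirrors include/linux/dma-direction.h; the helper names bounce_on_map() and bounce_on_unmap() are hypothetical and not part of the patch.

```c
#include <stdbool.h>

/* Mirrors enum dma_data_direction from include/linux/dma-direction.h. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
};

/*
 * Hypothetical helper: does the original data have to be copied into
 * the bounce page when the mapping is created? True whenever the
 * device may read the buffer.
 */
static bool bounce_on_map(enum dma_data_direction dir)
{
	return dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL;
}

/*
 * Hypothetical helper: does the bounce page have to be copied back to
 * the original page when the mapping is torn down? True whenever the
 * device may have written the buffer.
 */
static bool bounce_on_unmap(enum dma_data_direction dir)
{
	return dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL;
}
```

With helpers like these, the map and unmap paths would each check one predicate instead of comparing against a single direction.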


> +		vduse_domain_bounce(domain, iova, orig, size, dir);
> +
> +	return (dma_addr_t)iova;
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova = (unsigned long)dma_addr;
> +	struct vduse_iova_map *map = vduse_domain_get_mapping(domain, iova);
> +
> +	if (WARN_ON(!map))
> +		return;
> +
> +	if (dir == DMA_FROM_DEVICE)
> +		vduse_domain_bounce(domain, iova, map->orig, size, dir);
> +	vduse_domain_remove_mapping(domain, map);
> +	vduse_domain_free_iova(domain, iova, size);
> +	kfree(map);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova = vduse_domain_alloc_iova(domain, size,
> +							TYPE_DIRECT_MAP);
> +	void *orig = alloc_pages_exact(size, flag);
> +
> +	if (!iova || !orig)
> +		goto err;
> +
> +	if (vduse_domain_add_mapping(domain, iova,
> +				(unsigned long)orig, size, DMA_BIDIRECTIONAL))
> +		goto err;
> +
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return orig;
> +err:
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	if (orig)
> +		free_pages_exact(orig, size);
> +	if (iova)
> +		vduse_domain_free_iova(domain, iova, size);
> +
> +	return NULL;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova = (unsigned long)dma_addr;
> +	struct vduse_iova_map *map = vduse_domain_get_mapping(domain, iova);
> +
> +	if (WARN_ON(!map))
> +		return;
> +
> +	vduse_domain_remove_mapping(domain, map);
> +	vduse_domain_unmap(domain, map->iova, PAGE_ALIGN(map->size));
> +	free_pages_exact((void *)map->orig, map->size);
> +	vduse_domain_free_iova(domain, map->iova, map->size);
> +	kfree(map);
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +};
> +
> +static void vduse_dev_mmap_open(struct vm_area_struct *vma)
> +{
> +	struct vduse_iova_domain *domain = vma->vm_private_data;
> +
> +	if (!vduse_domain_add_vma(domain, vma))
> +		return;
> +
> +	vma->vm_private_data = NULL;
> +}
> +
> +static void vduse_dev_mmap_close(struct vm_area_struct *vma)
> +{
> +	struct vduse_iova_domain *domain = vma->vm_private_data;
> +
> +	if (!domain)
> +		return;
> +
> +	vduse_domain_remove_vma(domain, vma);
> +}
> +
> +static int vduse_dev_mmap_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	return -EPERM;
> +}
> +
> +static vm_fault_t vduse_dev_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vduse_iova_domain *domain = vma->vm_private_data;
> +	unsigned long iova = vmf->address - vma->vm_start;
> +	int ret;
> +
> +	if (!domain)
> +		return VM_FAULT_SIGBUS;
> +
> +	if (vduse_domain_is_direct_map(domain, iova))
> +		ret = vduse_domain_direct_map(domain, vma, iova);
> +	else
> +		ret = vduse_domain_bounce_map(domain, vma, iova);
> +
> +	if (ret == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	if (ret < 0 && ret != -EBUSY)
> +		return VM_FAULT_SIGBUS;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vduse_dev_mmap_ops = {
> +	.open = vduse_dev_mmap_open,
> +	.close = vduse_dev_mmap_close,
> +	.may_split = vduse_dev_mmap_split,
> +	.fault = vduse_dev_mmap_fault,
> +};
> +
> +static int vduse_dev_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_iova_domain *domain = dev->domain;
> +	unsigned long size = vma->vm_end - vma->vm_start;
> +	int ret;
> +
> +	if (domain->size != size || vma->vm_pgoff)
> +		return -EINVAL;
> +
> +	ret = vduse_domain_add_vma(domain, vma);
> +	if (ret)
> +		return ret;
> +
> +	vma->vm_flags |= VM_MIXEDMAP | VM_DONTCOPY |
> +				VM_DONTDUMP | VM_DONTEXPAND;
> +	vma->vm_private_data = domain;
> +	vma->vm_ops = &vduse_dev_mmap_ops;
> +
> +	return 0;
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	mutex_lock(&dev->lock);
> +	switch (cmd) {
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_IRQFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_virqfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	}
> +	mutex_unlock(&dev->lock);
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +
> +	while ((msg = vduse_dev_dequeue_msg(dev, &dev->recv_list)))
> +		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> +
> +	refcount_dec(&dev->refcnt);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.mmap		= vduse_dev_mmap,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	mutex_init(&dev->lock);
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	atomic64_set(&dev->msg_unique, 0);
> +	init_waitqueue_head(&dev->waitq);
> +	refcount_set(&dev->refcnt, 1);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	mutex_destroy(&dev->lock);
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(u32 id)
> +{
> +	struct vduse_dev *tmp, *dev = NULL;
> +
> +	list_for_each_entry(tmp, &vduse_devs, list) {
> +		if (tmp->id == id) {
> +			dev = tmp;
> +			break;
> +		}
> +	}
> +	return dev;
> +}
> +
> +static int vduse_get_dev(u32 id)
> +{
> +	int fd;
> +	char name[64];
> +	struct vduse_dev *dev = vduse_find_dev(id);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	snprintf(name, sizeof(name), "vduse-dev:%u", dev->id);
> +	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
> +	if (fd < 0)
> +		return fd;
> +
> +	refcount_inc(&dev->refcnt);
> +
> +	return fd;
> +}
> +
> +static int vduse_destroy_dev(u32 id)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(id);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	if (dev->vdev || refcount_read(&dev->refcnt) > 1)
> +		return -EBUSY;
> +
> +	list_del(&dev->list);
> +	kfree(dev->vqs);
> +	vduse_iova_domain_destroy(dev->domain);
> +	vduse_dev_destroy(dev);
> +
> +	return 0;
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config)
> +{
> +	int i, fd;
> +	struct vduse_dev *dev;
> +	char name[64];
> +
> +	if (vduse_find_dev(config->id))
> +		return -EEXIST;
> +
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->id = config->id;
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->domain = vduse_iova_domain_create(config->iova_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	snprintf(name, sizeof(name), "vduse-dev:%u", config->id);
> +	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
> +	if (fd < 0)
> +		goto err_fd;
> +
> +	refcount_inc(&dev->refcnt);
> +	list_add(&dev->list, &vduse_devs);
> +
> +	return fd;
> +err_fd:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_iova_domain_destroy(dev->domain);
> +err_domain:
> +	vduse_dev_destroy(dev);
> +	return fd;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, sizeof(config)))
> +			break;
> +
> +		ret = vduse_create_dev(&config);
> +		break;
> +	}
> +	case VDUSE_GET_DEV:
> +		ret = vduse_get_dev(arg);
> +		break;


What's the use case of VDUSE_GET_DEV? (Need to document this)


> +	case VDUSE_DESTROY_DEV:
> +		ret = vduse_destroy_dev(arg);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct miscdevice vduse_misc = {
> +	.fops = &vduse_fops,
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "vduse",
> +};
> +
> +static void vduse_parent_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_parent = {
> +	.init_name = "vduse",
> +	.release = vduse_parent_release,
> +};
> +
> +static struct vdpa_parent_dev parent_dev;
> +
> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev = dev->vdev;
> +	int ret;
> +
> +	if (vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
> +				&vduse_vdpa_config_ops, dev->vq_num, name);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret)
> +		goto err;
> +
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.pdev = &parent_dev;
> +
> +	ret = _vdpa_register_device(&vdev->vdpa);
> +	if (ret)
> +		goto err;
> +
> +	dev->vdev = vdev;
> +
> +	return 0;
> +err:
> +	put_device(&vdev->vdpa.dev);
> +	return ret;
> +}
> +
> +static struct vdpa_device *vdpa_dev_add(struct vdpa_parent_dev *pdev,
> +					const char *name, u32 device_id,
> +					struct nlattr **attrs)
> +{
> +	u32 vduse_id;
> +	struct vduse_dev *dev;
> +	int ret = -EINVAL;
> +
> +	if (!attrs[VDPA_ATTR_BACKEND_ID])
> +		return ERR_PTR(-EINVAL);
> +
> +	mutex_lock(&vduse_lock);
> +	vduse_id = nla_get_u32(attrs[VDPA_ATTR_BACKEND_ID]);
> +	dev = vduse_find_dev(vduse_id);
> +	if (!dev)
> +		goto unlock;
> +
> +	if (dev->device_id != device_id)
> +		goto unlock;
> +
> +	ret = vduse_dev_add_vdpa(dev, name);
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return &dev->vdev->vdpa;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_parent_dev *pdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_dev_ops vdpa_dev_parent_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_parent_dev parent_dev = {
> +	.device = &vduse_parent,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_parent_ops,
> +};
> +
> +static int vduse_parentdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_parent);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_parentdev_register(&parent_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_parent);
> +	return ret;
> +}
> +
> +static void vduse_parentdev_exit(void)
> +{
> +	vdpa_parentdev_unregister(&parent_dev);
> +	device_unregister(&vduse_parent);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +
> +	ret = misc_register(&vduse_misc);
> +	if (ret)
> +		return ret;
> +
> +	ret = -ENOMEM;
> +	vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
> +	if (!vduse_vdpa_wq)
> +		goto err_vdpa_wq;
> +
> +	ret = vduse_virqfd_init();
> +	if (ret)
> +		goto err_irqfd;
> +
> +	ret = vduse_parentdev_init();
> +	if (ret)
> +		goto err_parentdev;
> +
> +	return 0;
> +err_parentdev:
> +	vduse_virqfd_exit();
> +err_irqfd:
> +	destroy_workqueue(vduse_vdpa_wq);
> +err_vdpa_wq:
> +	misc_deregister(&vduse_misc);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	misc_deregister(&vduse_misc);
> +	destroy_workqueue(vduse_vdpa_wq);
> +	vduse_virqfd_exit();
> +	vduse_parentdev_exit();
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
> index bba8b83a94b5..a7a841e5ffc7 100644
> --- a/include/uapi/linux/vdpa.h
> +++ b/include/uapi/linux/vdpa.h
> @@ -33,6 +33,7 @@ enum vdpa_attr {
>   	VDPA_ATTR_DEV_VENDOR_ID,		/* u32 */
>   	VDPA_ATTR_DEV_MAX_VQS,			/* u32 */
>   	VDPA_ATTR_DEV_MAX_VQ_SIZE,		/* u16 */
> +	VDPA_ATTR_BACKEND_ID,			/* u32 */
>   
>   	/* new attributes must be added above here */
>   	VDPA_ATTR_MAX,
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..f8579abdaa3b
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,99 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +/* the control messages definition for read/write */
> +
> +#define VDUSE_CONFIG_DATA_LEN	256
> +
> +enum vduse_req_type {
> +	VDUSE_SET_VQ_NUM,
> +	VDUSE_SET_VQ_ADDR,
> +	VDUSE_SET_VQ_READY,
> +	VDUSE_GET_VQ_READY,
> +	VDUSE_SET_FEATURES,
> +	VDUSE_GET_FEATURES,
> +	VDUSE_SET_STATUS,
> +	VDUSE_GET_STATUS,
> +	VDUSE_SET_CONFIG,
> +	VDUSE_GET_CONFIG,
> +};
> +
> +struct vduse_vq_num {
> +	__u32 index;
> +	__u32 num;
> +};
> +
> +struct vduse_vq_addr {
> +	__u32 index;
> +	__u64 desc_addr;
> +	__u64 driver_addr;
> +	__u64 device_addr;
> +};
> +
> +struct vduse_vq_ready {
> +	__u32 index;
> +	__u8 ready;
> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset;
> +	__u32 len;
> +	__u8 data[VDUSE_CONFIG_DATA_LEN];
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 unique; /* request id */
> +	__u32 flags; /* request flags */
> +	__u32 size; /* the payload size */
> +	union {
> +		struct vduse_vq_num vq_num; /* virtqueue num */
> +		struct vduse_vq_addr vq_addr; /* virtqueue address */
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 unique; /* corresponding request id */
> +	__s32 result; /* the result of request */
> +	union {
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	__u32 id; /* vduse device id */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 iova_size; /* iova space size, used for mmap(2) */
> +	__u16 vq_num; /* the number of virtqueues */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +	__u32 fd; /* eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> +#define VDUSE_GET_DEV		_IO(VDUSE_BASE, 0x02)
> +#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x03)
> +
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
> +#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
> +
> +#endif /* _UAPI_VDUSE_H_ */




* Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
  2020-12-23  6:38 ` [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Jason Wang
@ 2020-12-23  8:14   ` Jason Wang
  2020-12-23 10:59   ` Yongji Xie
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Wang @ 2020-12-23  8:14 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: linux-aio, kvm, netdev, virtualization, linux-mm, linux-fsdevel


On 2020/12/23 2:38 PM, Jason Wang wrote:
>>
>> V1 to V2:
>> - Add vhost-vdpa support
>
>
> I may be missing something, but I don't see any code to support that. E.g.,
> neither set_map nor dma_map/unmap is implemented in the config ops.
>
> Thanks 


I spoke too soon :(

I see that a new config op was introduced.

Let me dive into that.

Thanks




* Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
  2020-12-22 14:52 ` [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops Xie Yongji
@ 2020-12-23  8:36   ` Jason Wang
  2020-12-23 11:06     ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-23  8:36 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/22 10:52 PM, Xie Yongji wrote:
> This patch introduces a new method in the vdpa_config_ops to
> support processing the raw vhost memory mapping message in the
> vDPA device driver.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vhost/vdpa.c | 5 ++++-
>   include/linux/vdpa.h | 7 +++++++
>   2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 448be7875b6d..ccbb391e38be 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   	if (r)
>   		return r;
>   
> +	if (ops->process_iotlb_msg)
> +		return ops->process_iotlb_msg(vdpa, msg);
> +
>   	switch (msg->type) {
>   	case VHOST_IOTLB_UPDATE:
>   		r = vhost_vdpa_process_iotlb_update(v, msg);
> @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
>   	int ret;
>   
>   	/* Device want to do DMA by itself */
> -	if (ops->set_map || ops->dma_map)
> +	if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
>   		return 0;
>   
>   	bus = dma_dev->bus;
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index 656fe264234e..7bccedf22f4b 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -5,6 +5,7 @@
>   #include <linux/kernel.h>
>   #include <linux/device.h>
>   #include <linux/interrupt.h>
> +#include <linux/vhost_types.h>
>   #include <linux/vhost_iotlb.h>
>   #include <net/genetlink.h>
>   
> @@ -172,6 +173,10 @@ struct vdpa_iova_range {
>    *				@vdev: vdpa device
>    *				Returns the iova range supported by
>    *				the device.
> + * @process_iotlb_msg:		Process vhost memory mapping message (optional)
> + *				Only used for VDUSE device now
> + *				@vdev: vdpa device
> + *				@msg: vhost memory mapping message
>    * @set_map:			Set device memory mapping (optional)
>    *				Needed for device that using device
>    *				specific DMA translation (on-chip IOMMU)
> @@ -240,6 +245,8 @@ struct vdpa_config_ops {
>   	struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
>   
>   	/* DMA ops */
> +	int (*process_iotlb_msg)(struct vdpa_device *vdev,
> +				 struct vhost_iotlb_msg *msg);
>   	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
>   	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
>   		       u64 pa, u32 perm);


Is there any reason that it can't be done via dma_map/dma_unmap or set_map?

Thanks




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-22 14:52 ` [RFC v2 09/13] vduse: Add support for processing vhost iotlb message Xie Yongji
@ 2020-12-23  9:05   ` Jason Wang
  2020-12-23 12:14     ` [External] " Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-23  9:05 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, akpm, rdunlap, willy,
	viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/22 10:52 PM, Xie Yongji wrote:
> To support vhost-vdpa bus driver, we need a way to share the
> vhost-vdpa backend process's memory with the userspace VDUSE process.
>
> This patch tries to make use of the vhost iotlb message to achieve
> that. We will get the shm file from the iotlb message and pass it
> to the userspace VDUSE process.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/driver-api/vduse.rst |  15 +++-
>   drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
>   include/uapi/linux/vduse.h         |  11 +++
>   3 files changed, 171 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> index 623f7b040ccf..48e4b1ba353f 100644
> --- a/Documentation/driver-api/vduse.rst
> +++ b/Documentation/driver-api/vduse.rst
> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
>   
>   - VDUSE_GET_CONFIG: Read from device specific configuration space
>   
> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
> +
> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
> +
>   Please see include/linux/vdpa.h for details.
>   
> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> +The data path of userspace vDPA device is implemented in different ways
> +depending on the vdpa bus to which it is attached.
> +
> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>   driver which supports mapping the kernel dma buffer to a userspace iova
>   region dynamically. The userspace iova region can be created by passing
>   the userspace vDPA device fd to mmap(2).
>   
> +In the vhost-vdpa case, the dma buffer resides in a userspace memory region
> +which will be shared with the VDUSE userspace process via the file
> +descriptor in the VDUSE_UPDATE_IOTLB message. The corresponding address
> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
> +in this message.
> +
>   Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>   receive virtqueue kicks in userspace. The following ioctls on the userspace
>   vDPA device fd are provided to support that:
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b974333ed4e9..d24aaacb6008 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -34,6 +34,7 @@
>   
>   struct vduse_dev_msg {
>   	struct vduse_dev_request req;
> +	struct file *iotlb_file;
>   	struct vduse_dev_response resp;
>   	struct list_head list;
>   	wait_queue_head_t waitq;
> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>   	return ret;
>   }
>   
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
> +				u64 offset, u64 iova, u64 size, u8 perm)
> +{
> +	struct vduse_dev_msg *msg;
> +	int ret;
> +
> +	if (!size)
> +		return -EINVAL;
> +
> +	msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> +	msg->req.size = sizeof(struct vduse_iotlb);
> +	msg->req.iotlb.offset = offset;
> +	msg->req.iotlb.iova = iova;
> +	msg->req.iotlb.size = size;
> +	msg->req.iotlb.perm = perm;
> +	msg->req.iotlb.fd = -1;
> +	msg->iotlb_file = get_file(file);
> +
> +	ret = vduse_dev_msg_sync(dev, msg);


My feeling is that we should provide a consistent API for the userspace
device to use.

E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.

It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
message by using the VDUSE file as msg->iotlb_file here.


> +	vduse_dev_msg_put(msg);
> +	fput(file);
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
> +					u64 iova, u64 size)
> +{
> +	struct vduse_dev_msg *msg;
> +	int ret;
> +
> +	if (!size)
> +		return -EINVAL;
> +
> +	msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
> +	msg->req.size = sizeof(struct vduse_iotlb);
> +	msg->req.iotlb.iova = iova;
> +	msg->req.iotlb.size = size;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VHOST_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VHOST_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VHOST_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalid vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
>   static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>   {
>   	struct file *file = iocb->ki_filp;
>   	struct vduse_dev *dev = file->private_data;
>   	struct vduse_dev_msg *msg;
> -	int size = sizeof(struct vduse_dev_request);
> +	unsigned int flags;
> +	int fd, size = sizeof(struct vduse_dev_request);
>   	ssize_t ret = 0;
>   
>   	if (iov_iter_count(to) < size)
> @@ -349,6 +418,18 @@ static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>   		if (ret)
>   			return ret;
>   	}
> +
> +	if (msg->req.type == VDUSE_UPDATE_IOTLB && msg->req.iotlb.fd == -1) {
> +		flags = perm_to_file_flags(msg->req.iotlb.perm);
> +		fd = get_unused_fd_flags(flags);
> +		if (fd < 0) {
> +			vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> +			return fd;
> +		}
> +		fd_install(fd, get_file(msg->iotlb_file));
> +		msg->req.iotlb.fd = fd;
> +	}
> +
>   	ret = copy_to_iter(&msg->req, size, to);
>   	if (ret != size) {
>   		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> @@ -565,6 +646,69 @@ static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>   	vduse_dev_set_config(dev, offset, buf, len);
>   }
>   
> +static void vduse_vdpa_invalidate_iotlb(struct vduse_dev *dev,
> +					struct vhost_iotlb_msg *msg)
> +{
> +	vduse_dev_invalidate_iotlb(dev, msg->iova, msg->size);
> +}
> +
> +static int vduse_vdpa_update_iotlb(struct vduse_dev *dev,
> +					struct vhost_iotlb_msg *msg)
> +{
> +	u64 uaddr = msg->uaddr;
> +	u64 iova = msg->iova;
> +	u64 size = msg->size;
> +	u64 offset;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	while (uaddr < msg->uaddr + msg->size) {
> +		vma = find_vma(current->mm, uaddr);
> +		ret = -EINVAL;
> +		if (!vma)
> +			goto err;
> +
> +		size = min(msg->size, vma->vm_end - uaddr);
> +		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> +		if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
> +			ret = vduse_dev_update_iotlb(dev, vma->vm_file, offset,
> +							iova, size, msg->perm);
> +			if (ret)
> +				goto err;


My understanding is that a vma is something a device should not know
about. So I suggest moving the above processing to vhost-vdpa.c.

Thanks


> +		}
> +		iova += size;
> +		uaddr += size;
> +	}
> +	return 0;
> +err:
> +	vduse_dev_invalidate_iotlb(dev, msg->iova, iova - msg->iova);
> +	return ret;
> +}
> +
> +static int vduse_vdpa_process_iotlb_msg(struct vdpa_device *vdpa,
> +					struct vhost_iotlb_msg *msg)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	int ret = 0;
> +
> +	switch (msg->type) {
> +	case VHOST_IOTLB_UPDATE:
> +		ret = vduse_vdpa_update_iotlb(dev, msg);
> +		break;
> +	case VHOST_IOTLB_INVALIDATE:
> +		vduse_vdpa_invalidate_iotlb(dev, msg);
> +		break;
> +	case VHOST_IOTLB_BATCH_BEGIN:
> +	case VHOST_IOTLB_BATCH_END:
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>   static void vduse_vdpa_free(struct vdpa_device *vdpa)
>   {
>   	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> @@ -597,6 +741,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>   	.set_status		= vduse_vdpa_set_status,
>   	.get_config		= vduse_vdpa_get_config,
>   	.set_config		= vduse_vdpa_set_config,
> +	.process_iotlb_msg	= vduse_vdpa_process_iotlb_msg,
>   	.free			= vduse_vdpa_free,
>   };
>   
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 873305dfd93f..c5080851f140 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -21,6 +21,8 @@ enum vduse_req_type {
>   	VDUSE_GET_STATUS,
>   	VDUSE_SET_CONFIG,
>   	VDUSE_GET_CONFIG,
> +	VDUSE_UPDATE_IOTLB,
> +	VDUSE_INVALIDATE_IOTLB,
>   };
>   
>   struct vduse_vq_num {
> @@ -51,6 +53,14 @@ struct vduse_dev_config_data {
>   	__u8 data[VDUSE_CONFIG_DATA_LEN];
>   };
>   
> +struct vduse_iotlb {
> +	__u32 fd;
> +	__u64 offset;
> +	__u64 iova;
> +	__u64 size;
> +	__u8 perm;
> +};
> +
>   struct vduse_dev_request {
>   	__u32 type; /* request type */
>   	__u32 unique; /* request id */
> @@ -62,6 +72,7 @@ struct vduse_dev_request {
>   		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>   		struct vduse_vq_state vq_state; /* virtqueue state */
>   		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_iotlb iotlb; /* iotlb message */
>   		__u64 features; /* virtio features */
>   		__u8 status; /* device status */
>   	};



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
  2020-12-23  6:38 ` [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Jason Wang
  2020-12-23  8:14   ` Jason Wang
@ 2020-12-23 10:59   ` Yongji Xie
  2020-12-24  2:24     ` Jason Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-23 10:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Wed, Dec 23, 2020 at 2:38 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> > This series introduces a framework, which can be used to implement
> > vDPA Devices in a userspace program. The work consists of two parts:
> > control path forwarding and data path offloading.
> >
> > In the control path, the VDUSE driver will make use of a message
> > mechanism to forward the config operation from vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive/reply
> > those control messages.
> >
> > In the data path, the core is mapping dma buffer into VDUSE
> > daemon's address space, which can be implemented in different ways
> > depending on the vdpa bus to which the vDPA device is attached.
> >
> > In virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
> > bounce-buffering mechanism to achieve that.
>
>
> Rethinking the bounce buffer approach: instead of using kernel pages
> with mmap(), how about just using userspace pages like vhost does?
>
> That means we need a worker to do the bouncing, but we don't need to care
> about annoying things like page reclaim?
>

Currently the I/O bouncing is done in the streaming DMA mapping routines,
which can be called from interrupt context. If we moved it into a
kworker, we would need to synchronize with that kworker from interrupt
context. I don't think that can work.

Thanks,
Yongji
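
To make the constraint concrete, here is a minimal userspace model of
direction-dependent bouncing done synchronously at map/unmap time, which
is how the series' documentation describes the virtio-vdpa data path.
It is only a sketch under stated assumptions: `struct bounce_slot`,
`enum dir`, `bounce_map()` and `bounce_unmap()` are hypothetical names,
not the VDUSE code, and plain memcpy() stands in for the kernel's copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the DMA transfer directions. */
enum dir { TO_DEVICE, FROM_DEVICE, BIDIRECTIONAL };

/* One bounce slot: the only memory the userspace device would see. */
struct bounce_slot {
	unsigned char buf[4096];
	void *orig;	/* original buffer, never exposed to the device */
	size_t len;
	enum dir dir;
};

/* Map: copy driver data into the slot if the device is going to read it. */
static int bounce_map(struct bounce_slot *s, void *orig, size_t len,
		      enum dir dir)
{
	if (len > sizeof(s->buf))
		return -1;

	s->orig = orig;
	s->len = len;
	s->dir = dir;
	if (dir == TO_DEVICE || dir == BIDIRECTIONAL)
		memcpy(s->buf, orig, len);
	else
		memset(s->buf, 0, len);	/* sketch choice: hide stale bytes */
	return 0;
}

/* Unmap: copy device-written data back if the driver is going to read it. */
static void bounce_unmap(struct bounce_slot *s)
{
	assert(s->orig != NULL && s->len <= sizeof(s->buf));

	if (s->dir == FROM_DEVICE || s->dir == BIDIRECTIONAL)
		memcpy(s->orig, s->buf, s->len);
	s->orig = NULL;
	s->len = 0;
}
```

Both copies complete before map/unmap return, so the caller never has
to wait for anything. Deferring the copy to a worker would add exactly
that wait, which is the problem when the caller is in interrupt context.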


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
  2020-12-23  8:36   ` Jason Wang
@ 2020-12-23 11:06     ` Yongji Xie
  2020-12-24  2:36       ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-23 11:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Wed, Dec 23, 2020 at 4:37 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> > This patch introduces a new method in the vdpa_config_ops to
> > support processing the raw vhost memory mapping message in the
> > vDPA device driver.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vhost/vdpa.c | 5 ++++-
> >   include/linux/vdpa.h | 7 +++++++
> >   2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index 448be7875b6d..ccbb391e38be 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
> >       if (r)
> >               return r;
> >
> > +     if (ops->process_iotlb_msg)
> > +             return ops->process_iotlb_msg(vdpa, msg);
> > +
> >       switch (msg->type) {
> >       case VHOST_IOTLB_UPDATE:
> >               r = vhost_vdpa_process_iotlb_update(v, msg);
> > @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
> >       int ret;
> >
> >       /* Device want to do DMA by itself */
> > -     if (ops->set_map || ops->dma_map)
> > +     if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
> >               return 0;
> >
> >       bus = dma_dev->bus;
> > diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> > index 656fe264234e..7bccedf22f4b 100644
> > --- a/include/linux/vdpa.h
> > +++ b/include/linux/vdpa.h
> > @@ -5,6 +5,7 @@
> >   #include <linux/kernel.h>
> >   #include <linux/device.h>
> >   #include <linux/interrupt.h>
> > +#include <linux/vhost_types.h>
> >   #include <linux/vhost_iotlb.h>
> >   #include <net/genetlink.h>
> >
> > @@ -172,6 +173,10 @@ struct vdpa_iova_range {
> >    *                          @vdev: vdpa device
> >    *                          Returns the iova range supported by
> >    *                          the device.
> > + * @process_iotlb_msg:               Process vhost memory mapping message (optional)
> > + *                           Only used for VDUSE device now
> > + *                           @vdev: vdpa device
> > + *                           @msg: vhost memory mapping message
> >    * @set_map:                        Set device memory mapping (optional)
> >    *                          Needed for device that using device
> >    *                          specific DMA translation (on-chip IOMMU)
> > @@ -240,6 +245,8 @@ struct vdpa_config_ops {
> >       struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
> >
> >       /* DMA ops */
> > +     int (*process_iotlb_msg)(struct vdpa_device *vdev,
> > +                              struct vhost_iotlb_msg *msg);
> >       int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
> >       int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
> >                      u64 pa, u32 perm);
>
>
> Is there any reason that it can't be done via dma_map/dma_unmap or set_map?
>

To get the shmfd, we need the vma rather than the physical address. And
it's not necessary to pin the user pages in the VDUSE case.

Thanks,
Yongji
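
For context, the shmfd reaches the daemon in the VDUSE_UPDATE_IOTLB
request of patch 09, together with offset, iova, size and perm. A daemon
could consume such a request roughly as below. This is a hedged sketch,
not code from the series: `struct iotlb_req` only mirrors the fields of
the RFC's `struct vduse_iotlb`, the `ACCESS_*` constants are local
stand-ins for the vhost permission values (assumed to be 1/2/3), and
`iotlb_map_region()` / `iova_to_va()` are hypothetical helpers. It also
assumes the file offset is page aligned.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Local stand-ins for VHOST_ACCESS_RO/WO/RW (values assumed). */
enum { ACCESS_RO = 1, ACCESS_WO = 2, ACCESS_RW = 3 };

/* Mirrors the fields carried by the RFC's struct vduse_iotlb. */
struct iotlb_req {
	int fd;
	uint64_t offset, iova, size;
	uint8_t perm;
};

/* One established mapping: IOVA range -> daemon virtual address. */
struct iotlb_map {
	void *va;
	uint64_t iova, size;
};

static int perm_to_prot(uint8_t perm)
{
	switch (perm) {
	case ACCESS_RO: return PROT_READ;
	case ACCESS_WO: return PROT_WRITE;
	case ACCESS_RW: return PROT_READ | PROT_WRITE;
	default:        return PROT_NONE;
	}
}

/* Map the shared file region named by an UPDATE_IOTLB-style request. */
static int iotlb_map_region(const struct iotlb_req *req, struct iotlb_map *m)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	void *va;

	assert(req != NULL && m != NULL);
	memset(m, 0, sizeof(*m));
	if (pagesz <= 0 || !req->size || req->offset % (uint64_t)pagesz)
		return -1;	/* assumption: offsets are page aligned */

	va = mmap(NULL, (size_t)req->size, perm_to_prot(req->perm),
		  MAP_SHARED, req->fd, (off_t)req->offset);
	if (va == MAP_FAILED) {
		perror("mmap");
		return -1;
	}
	m->va = va;
	m->iova = req->iova;
	m->size = req->size;
	return 0;
}

/* Translate [iova, iova + len) to a local pointer, or NULL if outside. */
static void *iova_to_va(const struct iotlb_map *m, uint64_t iova, uint64_t len)
{
	if (iova < m->iova || len > m->size ||
	    iova - m->iova > m->size - len)
		return NULL;
	return (char *)m->va + (iova - m->iova);
}
```

Note that nothing here needs a physical address or pinned pages: the
file, the offset and the length are enough for the daemon to reach the
same memory.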


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [External] Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-23  9:05   ` Jason Wang
@ 2020-12-23 12:14     ` Yongji Xie
  2020-12-24  2:41       ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-23 12:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> > To support vhost-vdpa bus driver, we need a way to share the
> > vhost-vdpa backend process's memory with the userspace VDUSE process.
> >
> > This patch tries to make use of the vhost iotlb message to achieve
> > that. We will get the shm file from the iotlb message and pass it
> > to the userspace VDUSE process.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/driver-api/vduse.rst |  15 +++-
> >   drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
> >   include/uapi/linux/vduse.h         |  11 +++
> >   3 files changed, 171 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> > index 623f7b040ccf..48e4b1ba353f 100644
> > --- a/Documentation/driver-api/vduse.rst
> > +++ b/Documentation/driver-api/vduse.rst
> > @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
> >
> >   - VDUSE_GET_CONFIG: Read from device specific configuration space
> >
> > +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
> > +
> > +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
> > +
> >   Please see include/linux/vdpa.h for details.
> >
> > -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> > +The data path of userspace vDPA device is implemented in different ways
> > +depending on the vdpa bus to which it is attached.
> > +
> > +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> >   driver which supports mapping the kernel dma buffer to a userspace iova
> >   region dynamically. The userspace iova region can be created by passing
> >   the userspace vDPA device fd to mmap(2).
> >
> > +In the vhost-vdpa case, the dma buffer resides in a userspace memory region
> > +which will be shared with the VDUSE userspace process via the file
> > +descriptor in the VDUSE_UPDATE_IOTLB message. The corresponding address
> > +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
> > +in this message.
> > +
> >   Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >   receive virtqueue kicks in userspace. The following ioctls on the userspace
> >   vDPA device fd are provided to support that:
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index b974333ed4e9..d24aaacb6008 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -34,6 +34,7 @@
> >
> >   struct vduse_dev_msg {
> >       struct vduse_dev_request req;
> > +     struct file *iotlb_file;
> >       struct vduse_dev_response resp;
> >       struct list_head list;
> >       wait_queue_head_t waitq;
> > @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >       return ret;
> >   }
> >
> > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
> > +                             u64 offset, u64 iova, u64 size, u8 perm)
> > +{
> > +     struct vduse_dev_msg *msg;
> > +     int ret;
> > +
> > +     if (!size)
> > +             return -EINVAL;
> > +
> > +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> > +     msg->req.size = sizeof(struct vduse_iotlb);
> > +     msg->req.iotlb.offset = offset;
> > +     msg->req.iotlb.iova = iova;
> > +     msg->req.iotlb.size = size;
> > +     msg->req.iotlb.perm = perm;
> > +     msg->req.iotlb.fd = -1;
> > +     msg->iotlb_file = get_file(file);
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
>
>
> My feeling is that we should provide a consistent API for the userspace
> device to use.
>
> E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.
>
> It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
> message by using the VDUSE file as msg->iotlb_file here.
>

That's OK with me. One open question is when to send the UPDATE_IOTLB
message in the virtio case.

>
> > +     vduse_dev_msg_put(msg);
> > +     fput(file);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
> > +                                     u64 iova, u64 size)
> > +{
> > +     struct vduse_dev_msg *msg;
> > +     int ret;
> > +
> > +     if (!size)
> > +             return -EINVAL;
> > +
> > +     msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
> > +     msg->req.size = sizeof(struct vduse_iotlb);
> > +     msg->req.iotlb.iova = iova;
> > +     msg->req.iotlb.size = size;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static unsigned int perm_to_file_flags(u8 perm)
> > +{
> > +     unsigned int flags = 0;
> > +
> > +     switch (perm) {
> > +     case VHOST_ACCESS_WO:
> > +             flags |= O_WRONLY;
> > +             break;
> > +     case VHOST_ACCESS_RO:
> > +             flags |= O_RDONLY;
> > +             break;
> > +     case VHOST_ACCESS_RW:
> > +             flags |= O_RDWR;
> > +             break;
> > +     default:
> > +             WARN(1, "invalid vhost IOTLB permission\n");
> > +             break;
> > +     }
> > +
> > +     return flags;
> > +}
> > +
> >   static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >   {
> >       struct file *file = iocb->ki_filp;
> >       struct vduse_dev *dev = file->private_data;
> >       struct vduse_dev_msg *msg;
> > -     int size = sizeof(struct vduse_dev_request);
> > +     unsigned int flags;
> > +     int fd, size = sizeof(struct vduse_dev_request);
> >       ssize_t ret = 0;
> >
> >       if (iov_iter_count(to) < size)
> > @@ -349,6 +418,18 @@ static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >               if (ret)
> >                       return ret;
> >       }
> > +
> > +     if (msg->req.type == VDUSE_UPDATE_IOTLB && msg->req.iotlb.fd == -1) {
> > +             flags = perm_to_file_flags(msg->req.iotlb.perm);
> > +             fd = get_unused_fd_flags(flags);
> > +             if (fd < 0) {
> > +                     vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> > +                     return fd;
> > +             }
> > +             fd_install(fd, get_file(msg->iotlb_file));
> > +             msg->req.iotlb.fd = fd;
> > +     }
> > +
> >       ret = copy_to_iter(&msg->req, size, to);
> >       if (ret != size) {
> >               vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> > @@ -565,6 +646,69 @@ static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> >       vduse_dev_set_config(dev, offset, buf, len);
> >   }
> >
> > +static void vduse_vdpa_invalidate_iotlb(struct vduse_dev *dev,
> > +                                     struct vhost_iotlb_msg *msg)
> > +{
> > +     vduse_dev_invalidate_iotlb(dev, msg->iova, msg->size);
> > +}
> > +
> > +static int vduse_vdpa_update_iotlb(struct vduse_dev *dev,
> > +                                     struct vhost_iotlb_msg *msg)
> > +{
> > +     u64 uaddr = msg->uaddr;
> > +     u64 iova = msg->iova;
> > +     u64 size = msg->size;
> > +     u64 offset;
> > +     struct vm_area_struct *vma;
> > +     int ret;
> > +
> > +     while (uaddr < msg->uaddr + msg->size) {
> > +             vma = find_vma(current->mm, uaddr);
> > +             ret = -EINVAL;
> > +             if (!vma)
> > +                     goto err;
> > +
> > +             size = min(msg->size, vma->vm_end - uaddr);
> > +             offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> > +             if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
> > +                     ret = vduse_dev_update_iotlb(dev, vma->vm_file, offset,
> > +                                                     iova, size, msg->perm);
> > +                     if (ret)
> > +                             goto err;
>
>
> My understanding is that a vma is something a device should not know
> about. So I suggest moving the above processing to vhost-vdpa.c.
>

Will do it.

Thanks,
Yongji
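
Wherever the loop ends up living, its job is to split the range
[uaddr, uaddr + size) at mapping boundaries and compute a file offset
for each shared, file-backed piece. The userspace model below sketches
that arithmetic. All names are hypothetical: `struct region` stands in
for a VMA (with `file_off` playing the role of vm_pgoff << PAGE_SHIFT),
and `find_region()` imitates find_vma() by returning the first region
whose end lies above the address. Two choices are the sketch's own and
differ from the RFC loop as posted: each chunk is clamped to the length
still remaining, and an address that falls in a hole before the
returned region is rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a VMA; regions are sorted and disjoint. */
struct region {
	uint64_t start, end;	/* covers [start, end) */
	uint64_t file_off;	/* file offset that 'start' maps to */
	int shared_file;	/* nonzero: file-backed and shared */
};

/* One piece to announce: IOVA range plus offset into the backing file. */
struct chunk {
	uint64_t iova, size, offset;
	size_t region;
};

/* Like find_vma(): first region whose end is above addr, or -1. */
static int find_region(const struct region *r, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < r[i].end)
			return (int)i;
	return -1;
}

/* Returns the number of chunks written to out[], or -1 on error. */
static int split_range(const struct region *r, size_t n,
		       uint64_t uaddr, uint64_t iova, uint64_t size,
		       struct chunk *out, size_t max_out)
{
	uint64_t end = uaddr + size;
	size_t cnt = 0;

	if (!size || end < uaddr)
		return -1;

	while (uaddr < end) {
		int i = find_region(r, n, uaddr);
		uint64_t len;

		/* No region, or uaddr sits in a hole below the region. */
		if (i < 0 || r[i].start > uaddr)
			return -1;

		len = r[i].end - uaddr;
		if (len > end - uaddr)
			len = end - uaddr;	/* clamp to what remains */
		assert(len > 0);		/* guarantees progress */

		if (r[i].shared_file) {
			if (cnt == max_out)
				return -1;
			out[cnt].iova = iova;
			out[cnt].size = len;
			out[cnt].offset = r[i].file_off + (uaddr - r[i].start);
			out[cnt].region = (size_t)i;
			cnt++;
		}
		iova += len;
		uaddr += len;
	}
	return (int)cnt;
}
```

As in the RFC loop, regions that are not shared and file-backed are
skipped without error; only the address bookkeeping advances past them.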


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-12-23  8:08   ` Jason Wang
@ 2020-12-23 14:17     ` Yongji Xie
  2020-12-24  3:01       ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-23 14:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > Both control path and data path of vDPA devices will be able to
> > be handled in userspace.
> >
> > In the control path, the VDUSE driver will make use of a message
> > mechanism to forward the config operation from vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive/reply
> > those control messages.
> >
> > In the data path, the VDUSE driver implements a MMU-based on-chip
> > IOMMU driver which supports mapping the kernel dma buffer to a
> > userspace iova region dynamically. Userspace can access those
> > iova region via mmap(). Besides, the eventfd mechanism is used to
> > trigger interrupt callbacks and receive virtqueue kicks in userspace
> >
> > Now we only support virtio-vdpa bus driver with this patch applied.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/driver-api/vduse.rst                 |   74 ++
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >   drivers/vdpa/Kconfig                               |    8 +
> >   drivers/vdpa/Makefile                              |    1 +
> >   drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >   drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
> >   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
> >   drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
> >   drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
> >   drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
> >   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
> >   include/uapi/linux/vdpa.h                          |    1 +
> >   include/uapi/linux/vduse.h                         |   99 ++
> >   13 files changed, 2173 insertions(+)
> >   create mode 100644 Documentation/driver-api/vduse.rst
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
> >   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> > new file mode 100644
> > index 000000000000..da9b3040f20a
> > --- /dev/null
> > +++ b/Documentation/driver-api/vduse.rst
> > @@ -0,0 +1,74 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +A vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with a vendor
> > +specific control path. vDPA devices can be either physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace.
> > +
> > +How VDUSE works
> > +---------------
> > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> > +to the new resources will be returned, which can be used to implement the
> > +userspace vDPA device's control path and data path.
> > +
> > +To implement control path, the read/write operations to the file descriptor
> > +will be used to receive/reply the control messages from/to VDUSE driver.
> > +Those control messages are based on the vdpa_config_ops which defines a
> > +unified interface to control different types of vDPA device.
> > +
> > +The following types of messages are provided by the VDUSE framework now:
> > +
> > +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
> > +
> > +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> > +
> > +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> > +
> > +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> > +
> > +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> > +
> > +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> > +
> > +- VDUSE_SET_STATUS: Set the device status
> > +
> > +- VDUSE_GET_STATUS: Get the device status
> > +
> > +- VDUSE_SET_CONFIG: Write to device specific configuration space
> > +
> > +- VDUSE_GET_CONFIG: Read from device specific configuration space
> > +
> > +Please see include/linux/vdpa.h for details.
> > +
> > +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> > +driver which supports mapping the kernel dma buffer to a userspace iova
> > +region dynamically. The userspace iova region can be created by passing
> > +the userspace vDPA device fd to mmap(2).
> > +
> > +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> > +receive virtqueue kicks in userspace. The following ioctls on the userspace
> > +vDPA device fd are provided to support that:
> > +
> > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> > +  by VDUSE driver to notify userspace to consume the vring.
> > +
> > +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> > +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
> > +
> > +MMU-based IOMMU Driver
> > +----------------------
> > +The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
> > +IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
> > +for the DMA transfer so that the userspace process is able to use its virtual
> > +address to access the dma buffer in kernel.
> > +
> > +To avoid security issues, a bounce-buffering mechanism is introduced to
> > +prevent userspace from directly accessing the original buffer, which may contain
> > +other kernel data. During mapping and unmapping, the driver copies the data
> > +between the original buffer and the bounce buffer, depending on the direction of
> > +the transfer. The bounce-buffer addresses are mapped into the user address
> > +space instead of the original ones.
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index a4c75a28c839..71722e6f8f23 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >   '|'   00-7F  linux/media.h
> >   0x80  00-1F  linux/fb.h
> > +0x81  00-1F  linux/vduse.h
> >   0x89  00-06  arch/x86/include/asm/sockios.h
> >   0x89  0B-DF  linux/sockios.h
> >   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> > diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> > index 4be7be39be26..211cc449cbd3 100644
> > --- a/drivers/vdpa/Kconfig
> > +++ b/drivers/vdpa/Kconfig
> > @@ -21,6 +21,14 @@ config VDPA_SIM
> >         to RX. This device is used for testing, prototyping and
> >         development of vDPA.
> >
> > +config VDPA_USER
> > +     tristate "VDUSE (vDPA Device in Userspace) support"
> > +     depends on EVENTFD && MMU && HAS_DMA
> > +     default n
>
>
> The "default n" is not necessary.
>

OK.
>
> > +     help
> > +       With VDUSE it is possible to emulate a vDPA Device
> > +       in a userspace program.
> > +
> >   config IFCVF
> >       tristate "Intel IFC VF vDPA driver"
> >       depends on PCI_MSI
> > diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> > index d160e9b63a66..66e97778ad03 100644
> > --- a/drivers/vdpa/Makefile
> > +++ b/drivers/vdpa/Makefile
> > @@ -1,5 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_VDPA) += vdpa.o
> >   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> > +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >   obj-$(CONFIG_IFCVF)    += ifcvf/
> >   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> > diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> > new file mode 100644
> > index 000000000000..b7645e36992b
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +vduse-y := vduse_dev.o iova_domain.o eventfd.o
>
>
> Do we really need eventfd.o here, considering we've selected it?
>

Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"?

>
> > +
> > +obj-$(CONFIG_VDPA_USER) += vduse.o
> > diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> > new file mode 100644
> > index 000000000000..dbffddb08908
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/eventfd.c
> > @@ -0,0 +1,221 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Eventfd support for VDUSE
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/eventfd.h>
> > +#include <linux/poll.h>
> > +#include <linux/wait.h>
> > +#include <linux/slab.h>
> > +#include <linux/file.h>
> > +#include <uapi/linux/vduse.h>
> > +
> > +#include "eventfd.h"
> > +
> > +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> > +
> > +static void vduse_virqfd_shutdown(struct work_struct *work)
> > +{
> > +     u64 cnt;
> > +     struct vduse_virqfd *virqfd = container_of(work,
> > +                                     struct vduse_virqfd, shutdown);
> > +
> > +     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
> > +     flush_work(&virqfd->inject);
> > +     eventfd_ctx_put(virqfd->ctx);
> > +     kfree(virqfd);
> > +}
> > +
> > +static void vduse_virqfd_inject(struct work_struct *work)
> > +{
> > +     struct vduse_virqfd *virqfd = container_of(work,
> > +                                     struct vduse_virqfd, inject);
> > +     struct vduse_virtqueue *vq = virqfd->vq;
> > +
> > +     spin_lock_irq(&vq->irq_lock);
> > +     if (vq->ready && vq->cb)
> > +             vq->cb(vq->private);
> > +     spin_unlock_irq(&vq->irq_lock);
> > +}
> > +
> > +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
> > +{
> > +     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
> > +}
> > +
> > +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> > +                             int sync, void *key)
> > +{
> > +     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
> > +     struct vduse_virtqueue *vq = virqfd->vq;
> > +
> > +     __poll_t flags = key_to_poll(key);
> > +
> > +     if (flags & EPOLLIN)
> > +             schedule_work(&virqfd->inject);
> > +
> > +     if (flags & EPOLLHUP) {
> > +             spin_lock(&vq->irq_lock);
> > +             if (vq->virqfd == virqfd) {
> > +                     vq->virqfd = NULL;
> > +                     virqfd_deactivate(virqfd);
> > +             }
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_virqfd_ptable_queue_proc(struct file *file,
> > +                     wait_queue_head_t *wqh, poll_table *pt)
> > +{
> > +     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
> > +
> > +     add_wait_queue(wqh, &virqfd->wait);
> > +}
> > +
> > +int vduse_virqfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct vduse_virqfd *virqfd;
> > +     struct fd irqfd;
> > +     struct eventfd_ctx *ctx;
> > +     struct vduse_virtqueue *vq;
> > +     __poll_t events;
> > +     int ret;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     vq = &dev->vqs[eventfd->index];
> > +     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
> > +     if (!virqfd)
> > +             return -ENOMEM;
> > +
> > +     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
> > +     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
>
>
> Any reason that a workqueue is a must here?
>

Mainly for performance. The workqueue lets the push() and pop() on the
used vring run asynchronously.
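
To illustrate the daemon's side of this: signalling the irqfd is just an
8-byte write to an eventfd, which returns without waiting for the interrupt
callback to run. The sketch below is plain userspace C using eventfd(2); the
helper names sk_signal() and sk_drain() are made up for illustration and are
not part of the patch. It only shows the eventfd counter semantics, not how
the kernel side schedules the callback.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: signal the eventfd n times, one write per signal. */
static int sk_signal(int efd, int n)
{
	uint64_t one = 1;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
			return -1;
	return 0;
}

/*
 * Hypothetical helper: read and reset the eventfd counter.
 * Returns the number of signals accumulated since the last drain,
 * or 0 if nothing is pending (non-blocking fd) or the read failed.
 */
static uint64_t sk_drain(int efd)
{
	uint64_t cnt = 0;

	if (read(efd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt))
		return 0;
	return cnt;
}
```

With an fd from eventfd(0, EFD_NONBLOCK), three sk_signal() writes followed by
one sk_drain() yield 3, and a second sk_drain() yields 0: the writer never
blocks on the consumer, and repeated signals collapse into one counter value.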

> > +
> > +     ret = -EBADF;
> > +     irqfd = fdget(eventfd->fd);
> > +     if (!irqfd.file)
> > +             goto err_fd;
> > +
> > +     ctx = eventfd_ctx_fileget(irqfd.file);
> > +     if (IS_ERR(ctx)) {
> > +             ret = PTR_ERR(ctx);
> > +             goto err_ctx;
> > +     }
> > +
> > +     virqfd->vq = vq;
> > +     virqfd->ctx = ctx;
> > +     spin_lock(&vq->irq_lock);
> > +     if (vq->virqfd)
> > +             virqfd_deactivate(virqfd);
> > +     vq->virqfd = virqfd;
> > +     spin_unlock(&vq->irq_lock);
> > +
> > +     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
> > +     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
> > +
> > +     events = vfs_poll(irqfd.file, &virqfd->pt);
> > +
> > +     /*
> > +      * Check if there was an event already pending on the eventfd
> > +      * before we registered and trigger it as if we didn't miss it.
> > +      */
> > +     if (events & EPOLLIN)
> > +             schedule_work(&virqfd->inject);
> > +
> > +     fdput(irqfd);
> > +
> > +     return 0;
> > +err_ctx:
> > +     fdput(irqfd);
> > +err_fd:
> > +     kfree(virqfd);
> > +     return ret;
> > +}
> > +
> > +void vduse_virqfd_release(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             if (vq->virqfd) {
> > +                     virqfd_deactivate(vq->virqfd);
> > +                     vq->virqfd = NULL;
> > +             }
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +     flush_workqueue(vduse_irqfd_cleanup_wq);
> > +}
> > +
> > +int vduse_virqfd_init(void)
> > +{
> > +     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
> > +                                             WQ_UNBOUND, 0);
> > +     if (!vduse_irqfd_cleanup_wq)
> > +             return -ENOMEM;
> > +
> > +     return 0;
> > +}
> > +
> > +void vduse_virqfd_exit(void)
> > +{
> > +     destroy_workqueue(vduse_irqfd_cleanup_wq);
> > +}
> > +
> > +void vduse_vq_kick(struct vduse_virtqueue *vq)
> > +{
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->ready && vq->kickfd)
> > +             eventfd_signal(vq->kickfd, 1);
> > +     spin_unlock(&vq->kick_lock);
> > +}
> > +
> > +int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct eventfd_ctx *ctx;
> > +     struct vduse_virtqueue *vq;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     vq = &dev->vqs[eventfd->index];
> > +     ctx = eventfd_ctx_fdget(eventfd->fd);
> > +     if (IS_ERR(ctx))
> > +             return PTR_ERR(ctx);
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->kickfd)
> > +             eventfd_ctx_put(vq->kickfd);
> > +     vq->kickfd = ctx;
> > +     spin_unlock(&vq->kick_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +void vduse_kickfd_release(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->kick_lock);
> > +             if (vq->kickfd) {
> > +                     eventfd_ctx_put(vq->kickfd);
> > +                     vq->kickfd = NULL;
> > +             }
> > +             spin_unlock(&vq->kick_lock);
> > +     }
> > +}
> > diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
> > new file mode 100644
> > index 000000000000..14269ff27f47
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/eventfd.h
> > @@ -0,0 +1,48 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Eventfd support for VDUSE
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#ifndef _VDUSE_EVENTFD_H
> > +#define _VDUSE_EVENTFD_H
> > +
> > +#include <linux/eventfd.h>
> > +#include <linux/poll.h>
> > +#include <linux/wait.h>
> > +#include <uapi/linux/vduse.h>
> > +
> > +#include "vduse.h"
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_virqfd {
> > +     struct eventfd_ctx *ctx;
> > +     struct vduse_virtqueue *vq;
> > +     struct work_struct inject;
> > +     struct work_struct shutdown;
> > +     wait_queue_entry_t wait;
> > +     poll_table pt;
> > +};
> > +
> > +int vduse_virqfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd);
> > +
> > +void vduse_virqfd_release(struct vduse_dev *dev);
> > +
> > +int vduse_virqfd_init(void);
> > +
> > +void vduse_virqfd_exit(void);
> > +
> > +void vduse_vq_kick(struct vduse_virtqueue *vq);
> > +
> > +int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd);
> > +
> > +void vduse_kickfd_release(struct vduse_dev *dev);
> > +
> > +#endif /* _VDUSE_EVENTFD_H */
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > new file mode 100644
> > index 000000000000..27022157abc6
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > @@ -0,0 +1,442 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * MMU-based IOMMU implementation
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/wait.h>
> > +#include <linux/slab.h>
> > +#include <linux/genalloc.h>
> > +#include <linux/dma-mapping.h>
> > +
> > +#include "iova_domain.h"
> > +
> > +#define IOVA_CHUNK_SHIFT 26
> > +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
> > +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
> > +
> > +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
> > +
> > +#define IOVA_ALLOC_ORDER 12
> > +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> > +
> > +struct vduse_mmap_vma {
> > +     struct vm_area_struct *vma;
> > +     struct list_head list;
> > +};
> > +
> > +static inline struct page *
> > +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
> > +                             unsigned long iova)
> > +{
> > +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> > +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> > +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> > +
> > +     return domain->chunks[index].bounce_pages[pgindex];
> > +}
> > +
> > +static inline void
> > +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, struct page *page)
> > +{
> > +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> > +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> > +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> > +
> > +     domain->chunks[index].bounce_pages[pgindex] = page;
> > +}
> > +
> > +static inline struct vduse_iova_map *
> > +vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
> > +                             unsigned long iova)
> > +{
> > +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> > +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> > +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> > +
> > +     return domain->chunks[index].iova_map[mapindex];
> > +}
> > +
> > +static inline void
> > +vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
> > +                     unsigned long iova, struct vduse_iova_map *map)
> > +{
> > +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> > +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> > +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> > +
> > +     domain->chunks[index].iova_map[mapindex] = map;
> > +}
> > +
> > +static int
> > +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, size_t size)
> > +{
> > +     struct page *page;
> > +     size_t walk_sz = 0;
> > +     int frees = 0;
> > +
> > +     while (walk_sz < size) {
> > +             page = vduse_domain_get_bounce_page(domain, iova);
> > +             if (page) {
> > +                     vduse_domain_set_bounce_page(domain, iova, NULL);
> > +                     put_page(page);
> > +                     frees++;
> > +             }
> > +             iova += PAGE_SIZE;
> > +             walk_sz += PAGE_SIZE;
> > +     }
> > +
> > +     return frees;
> > +}
> > +
> > +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
> > +                             struct vm_area_struct *vma)
> > +{
> > +     unsigned long size = vma->vm_end - vma->vm_start;
> > +     struct vduse_mmap_vma *mmap_vma;
> > +
> > +     if (WARN_ON(size != domain->size))
> > +             return -EINVAL;
> > +
> > +     mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> > +     if (!mmap_vma)
> > +             return -ENOMEM;
> > +
> > +     mmap_vma->vma = vma;
> > +     mutex_lock(&domain->vma_lock);
> > +     list_add(&mmap_vma->list, &domain->vma_list);
> > +     mutex_unlock(&domain->vma_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
> > +                             struct vm_area_struct *vma)
> > +{
> > +     struct vduse_mmap_vma *mmap_vma;
> > +
> > +     mutex_lock(&domain->vma_lock);
> > +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> > +             if (mmap_vma->vma == vma) {
> > +                     list_del(&mmap_vma->list);
> > +                     kfree(mmap_vma);
> > +                     break;
> > +             }
> > +     }
> > +     mutex_unlock(&domain->vma_lock);
> > +}
> > +
> > +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, unsigned long orig,
> > +                             size_t size, enum dma_data_direction dir)
> > +{
> > +     struct vduse_iova_map *map;
> > +     unsigned long last = iova + size;
> > +
> > +     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
> > +     if (!map)
> > +             return -ENOMEM;
> > +
> > +     map->iova = iova;
> > +     map->orig = orig;
> > +     map->size = size;
> > +     map->dir = dir;
> > +
> > +     while (iova < last) {
> > +             vduse_domain_set_iova_map(domain, iova, map);
> > +             iova += IOVA_ALLOC_SIZE;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +struct vduse_iova_map *
> > +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
> > +                     unsigned long iova)
> > +{
> > +     return vduse_domain_get_iova_map(domain, iova);
> > +}
> > +
> > +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
> > +                             struct vduse_iova_map *map)
> > +{
> > +     unsigned long iova = map->iova;
> > +     unsigned long last = iova + map->size;
> > +
> > +     while (iova < last) {
> > +             vduse_domain_set_iova_map(domain, iova, NULL);
> > +             iova += IOVA_ALLOC_SIZE;
> > +     }
> > +}
> > +
> > +void vduse_domain_unmap(struct vduse_iova_domain *domain,
> > +                     unsigned long iova, size_t size)
> > +{
> > +     struct vduse_mmap_vma *mmap_vma;
> > +     unsigned long uaddr;
> > +
> > +     mutex_lock(&domain->vma_lock);
> > +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> > +             mmap_read_lock(mmap_vma->vma->vm_mm);
> > +             uaddr = iova + mmap_vma->vma->vm_start;
> > +             zap_page_range(mmap_vma->vma, uaddr, size);
> > +             mmap_read_unlock(mmap_vma->vma->vm_mm);
> > +     }
> > +     mutex_unlock(&domain->vma_lock);
> > +}
> > +
> > +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
> > +                     struct vm_area_struct *vma, unsigned long iova)
> > +{
> > +     unsigned long uaddr = iova + vma->vm_start;
> > +     unsigned long start = iova & PAGE_MASK;
> > +     unsigned long last = start + PAGE_SIZE - 1;
> > +     unsigned long offset;
> > +     struct vduse_iova_map *map;
> > +     struct page *page = NULL;
> > +
> > +     map = vduse_domain_get_iova_map(domain, iova);
> > +     if (map) {
> > +             offset = last - map->iova;
> > +             page = virt_to_page(map->orig + offset);
> > +     }
> > +
> > +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> > +}
>
>
> So as we discussed before, we need to find a way to make vhost work. And
> it's better to make vhost transparent to VDUSE. One idea is to implement
> a shadow virtqueue here, that is, instead of trying to insert the pages into
> VDUSE userspace, we use the shadow virtqueue to relay the descriptors to
> userspace. With this, we don't need things like shmfd etc.
>

Good idea! The disadvantage is that performance will go down (one more
thread switch, and the vhost-like kworker will become a bottleneck
without multi-thread support). I think I can try this in v3. The
MMU-based IOMMU implementation can then be a future optimization for the
virtio-vdpa case. What's your opinion?

>
> > +
> > +void vduse_domain_bounce(struct vduse_iova_domain *domain,
> > +                     unsigned long iova, unsigned long orig,
> > +                     size_t size, enum dma_data_direction dir)
> > +{
> > +     unsigned int offset = offset_in_page(iova);
> > +
> > +     while (size) {
> > +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
> > +             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
> > +             void *addr;
> > +
> > +             if (p) {
> > +                     addr = page_address(p) + offset;
> > +                     if (dir == DMA_TO_DEVICE)
> > +                             memcpy(addr, (void *)orig, copy_len);
> > +                     else if (dir == DMA_FROM_DEVICE)
> > +                             memcpy((void *)orig, addr, copy_len);
> > +             }
>
>
> I think I'm missing something: for DMA_FROM_DEVICE, if p doesn't exist, how
> is it expected to work? Or do we need to warn in this case?
>

Yes, I think we need a WARN_ON here.
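
Roughly what I have in mind, as a userspace model rather than the real
change: pages[] stands in for the bounce-page table (NULL means the page was
never faulted in), and a non-zero return marks the spot where the kernel code
would WARN_ON. The sk_ names, the 4 KiB page size, and the exact placement of
the check are assumptions for this sketch; the final patch may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

enum sk_dir { SK_TO_DEVICE, SK_FROM_DEVICE };

/*
 * Model of the bounce copy loop. Returns 0 on success, or -1 if a
 * FROM_DEVICE copy hits an iova whose bounce page does not exist.
 */
static int sk_bounce(unsigned char **pages, size_t npages, unsigned long iova,
		     unsigned char *orig, size_t size, enum sk_dir dir)
{
	size_t offset = iova % SK_PAGE_SIZE;
	int ret = 0;

	while (size) {
		size_t idx = iova / SK_PAGE_SIZE;
		size_t copy_len = SK_PAGE_SIZE - offset;
		unsigned char *p;

		if (copy_len > size)
			copy_len = size;
		assert(idx < npages);
		p = pages[idx];

		if (p) {
			if (dir == SK_TO_DEVICE)
				memcpy(p + offset, orig, copy_len);
			else
				memcpy(orig, p + offset, copy_len);
		} else if (dir == SK_FROM_DEVICE) {
			/* The device claims to have written data, but there
			 * is no page to read it from: this is the case that
			 * would get the WARN_ON. */
			ret = -1;
		}

		size -= copy_len;
		orig += copy_len;
		iova += copy_len;
		offset = 0;
	}

	return ret;
}
```

A TO_DEVICE bounce with a missing page stays silent in this sketch, since the
page is only populated later when userspace faults on it.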


>
> > +             size -= copy_len;
> > +             orig += copy_len;
> > +             iova += copy_len;
> > +             offset = 0;
> > +     }
> > +}
> > +
> > +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
> > +                     struct vm_area_struct *vma, unsigned long iova)
> > +{
> > +     unsigned long uaddr = iova + vma->vm_start;
> > +     unsigned long start = iova & PAGE_MASK;
> > +     unsigned long offset = 0;
> > +     bool found = false;
> > +     struct vduse_iova_map *map;
> > +     struct page *page;
> > +
> > +     mutex_lock(&domain->map_lock);
> > +
> > +     page = vduse_domain_get_bounce_page(domain, iova);
> > +     if (page)
> > +             goto unlock;
> > +
> > +     page = alloc_page(GFP_KERNEL);
> > +     if (!page)
> > +             goto unlock;
> > +
> > +     while (offset < PAGE_SIZE) {
> > +             unsigned int src_offset = 0, dst_offset = 0;
> > +             void *src, *dst;
> > +             size_t copy_len;
> > +
> > +             map = vduse_domain_get_iova_map(domain, start + offset);
> > +             if (!map) {
> > +                     offset += IOVA_ALLOC_SIZE;
> > +                     continue;
> > +             }
> > +
> > +             found = true;
> > +             offset += map->size;
> > +             if (map->dir == DMA_FROM_DEVICE)
> > +                     continue;
> > +
> > +             if (start > map->iova)
> > +                     src_offset = start - map->iova;
> > +             else
> > +                     dst_offset = map->iova - start;
> > +
> > +             src = (void *)(map->orig + src_offset);
> > +             dst = page_address(page) + dst_offset;
> > +             copy_len = min_t(size_t, map->size - src_offset,
> > +                             PAGE_SIZE - dst_offset);
> > +             memcpy(dst, src, copy_len);
> > +     }
> > +     if (!found) {
> > +             put_page(page);
> > +             page = NULL;
> > +     }
> > +     vduse_domain_set_bounce_page(domain, iova, page);
> > +unlock:
> > +     mutex_unlock(&domain->map_lock);
> > +
> > +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> > +}
> > +
> > +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> > +                             unsigned long iova)
> > +{
> > +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> > +     struct vduse_iova_chunk *chunk = &domain->chunks[index];
> > +
> > +     return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
> > +}
> > +
> > +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
> > +                                     size_t size, enum iova_map_type type)
> > +{
> > +     struct vduse_iova_chunk *chunk;
> > +     unsigned long iova = 0;
> > +     int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
> > +     struct genpool_data_align data = { .align = align };
> > +     int i;
> > +
> > +     for (i = 0; i < domain->chunk_num; i++) {
> > +             chunk = &domain->chunks[i];
> > +             if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
> > +                     atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
> > +
> > +             if (atomic_read(&chunk->map_type) != type)
> > +                     continue;
> > +
> > +             iova = gen_pool_alloc_algo(chunk->pool, size,
> > +                                     gen_pool_first_fit_align, &data);
> > +             if (iova)
> > +                     break;
> > +     }
> > +
> > +     return iova;
>
>
> I wonder why not just reuse the iova domain implemented in
> drivers/iommu/iova.c
>

The iova domain in drivers/iommu/iova.c is only an iova allocator; in our
case that role is filled by the genpool memory allocator. The other
parts of our iova domain are chunk management and iova_map management.
We need different chunks to distinguish the different DMA mapping types:
consistent mapping or streaming mapping. We can only use the
bounce-buffering mechanism in the streaming mapping case.
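
For reference, the chunk layout means every per-iova lookup is plain index
arithmetic. Below is a standalone sketch of how an iova splits into a chunk
index, a bounce-page index and an iova_map index, mirroring the helpers in
iova_domain.c. The struct iova_pos and iova_decompose() names are made up for
this sketch, 1UL replaces the kernel's _AC(1, UL), and a 4 KiB page size is
assumed.

```c
#include <assert.h>

/* Constants as defined in the patch's iova_domain.c. */
#define IOVA_CHUNK_SHIFT 26
#define IOVA_CHUNK_SIZE  (1UL << IOVA_CHUNK_SHIFT)
#define IOVA_CHUNK_MASK  (~(IOVA_CHUNK_SIZE - 1))
#define IOVA_ALLOC_ORDER 12

/* Assumption for this sketch: 4 KiB pages (the kernel uses PAGE_SHIFT). */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical result type: where an iova lands inside the domain. */
struct iova_pos {
	unsigned long chunk;	/* index into domain->chunks[] */
	unsigned long pgindex;	/* index into chunk->bounce_pages[] */
	unsigned long mapindex;	/* index into chunk->iova_map[] */
};

static struct iova_pos iova_decompose(unsigned long iova,
				      unsigned long chunk_num)
{
	struct iova_pos pos;
	/* Offset of the iova inside its 64 MiB chunk. */
	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;

	pos.chunk = iova >> IOVA_CHUNK_SHIFT;
	assert(pos.chunk < chunk_num);
	pos.pgindex = chunkoff >> SKETCH_PAGE_SHIFT;
	pos.mapindex = chunkoff >> IOVA_ALLOC_ORDER;

	return pos;
}
```

For example, with four chunks, iova (3UL << 26) + 0x5123 lands in chunk 3 at
page index 5 and map index 5.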

>
> > +}
> > +
> > +void vduse_domain_free_iova(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, size_t size)
> > +{
> > +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> > +     struct vduse_iova_chunk *chunk = &domain->chunks[index];
> > +
> > +     gen_pool_free(chunk->pool, iova, size);
> > +}
> > +
> > +static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk)
> > +{
> > +     vfree(chunk->bounce_pages);
> > +     vfree(chunk->iova_map);
> > +     gen_pool_destroy(chunk->pool);
> > +}
> > +
> > +void vduse_iova_domain_destroy(struct vduse_iova_domain *domain)
> > +{
> > +     struct vduse_iova_chunk *chunk;
> > +     int i;
> > +
> > +     for (i = 0; i < domain->chunk_num; i++) {
> > +             chunk = &domain->chunks[i];
> > +             vduse_domain_free_bounce_pages(domain,
> > +                                     chunk->start, IOVA_CHUNK_SIZE);
> > +             vduse_iova_chunk_cleanup(chunk);
> > +     }
> > +
> > +     mutex_destroy(&domain->map_lock);
> > +     mutex_destroy(&domain->vma_lock);
> > +     kfree(domain->chunks);
> > +     kfree(domain);
> > +}
> > +
> > +static int vduse_iova_chunk_init(struct vduse_iova_chunk *chunk,
> > +                             unsigned long addr, size_t size)
> > +{
> > +     int ret;
> > +     int pages = size >> PAGE_SHIFT;
> > +
> > +     chunk->pool = gen_pool_create(IOVA_ALLOC_ORDER, -1);
> > +     if (!chunk->pool)
> > +             return -ENOMEM;
> > +
> > +     /* addr 0 is used in allocation failure case */
> > +     if (addr == 0)
> > +             addr += IOVA_ALLOC_SIZE;
> > +
> > +     ret = gen_pool_add(chunk->pool, addr, size, -1);
> > +     if (ret)
> > +             goto err;
> > +
> > +     ret = -ENOMEM;
> > +     chunk->bounce_pages = vzalloc(pages * sizeof(struct page *));
> > +     if (!chunk->bounce_pages)
> > +             goto err;
> > +
> > +     chunk->iova_map = vzalloc((size >> IOVA_ALLOC_ORDER) *
> > +                             sizeof(struct vduse_iova_map *));
> > +     if (!chunk->iova_map)
> > +             goto err;
> > +
> > +     chunk->start = addr;
> > +     atomic_set(&chunk->map_type, TYPE_NONE);
> > +
> > +     return 0;
> > +err:
> > +     if (chunk->bounce_pages) {
> > +             vfree(chunk->bounce_pages);
> > +             chunk->bounce_pages = NULL;
> > +     }
> > +     gen_pool_destroy(chunk->pool);
> > +     return ret;
> > +}
> > +
> > +struct vduse_iova_domain *vduse_iova_domain_create(size_t size)
> > +{
> > +     int j, i = 0;
> > +     struct vduse_iova_domain *domain;
> > +     unsigned long num = size >> IOVA_CHUNK_SHIFT;
> > +     unsigned long addr = 0;
> > +
> > +     if (size < IOVA_MIN_SIZE || size & ~IOVA_CHUNK_MASK)
> > +             return NULL;
> > +
> > +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > +     if (!domain)
> > +             return NULL;
> > +
> > +     domain->chunks = kcalloc(num, sizeof(struct vduse_iova_chunk), GFP_KERNEL);
> > +     if (!domain->chunks)
> > +             goto err;
> > +
> > +     for (i = 0; i < num; i++, addr += IOVA_CHUNK_SIZE)
> > +             if (vduse_iova_chunk_init(&domain->chunks[i], addr,
> > +                                     IOVA_CHUNK_SIZE))
> > +                     goto err;
> > +
> > +     domain->chunk_num = num;
> > +     domain->size = size;
> > +     INIT_LIST_HEAD(&domain->vma_list);
> > +     mutex_init(&domain->vma_lock);
> > +     mutex_init(&domain->map_lock);
> > +
> > +     return domain;
> > +err:
> > +     for (j = 0; j < i; j++)
> > +             vduse_iova_chunk_cleanup(&domain->chunks[j]);
> > +     kfree(domain);
> > +
> > +     return NULL;
> > +}
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > new file mode 100644
> > index 000000000000..fe1816287f5f
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > @@ -0,0 +1,93 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * MMU-based IOMMU implementation
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#ifndef _VDUSE_IOVA_DOMAIN_H
> > +#define _VDUSE_IOVA_DOMAIN_H
> > +
> > +#include <linux/genalloc.h>
> > +#include <linux/dma-mapping.h>
> > +
> > +enum iova_map_type {
> > +     TYPE_NONE,
> > +     TYPE_DIRECT_MAP,
> > +     TYPE_BOUNCE_MAP,
> > +};
> > +
> > +struct vduse_iova_map {
> > +     unsigned long iova;
> > +     unsigned long orig;
> > +     size_t size;
> > +     enum dma_data_direction dir;
> > +};
> > +
> > +struct vduse_iova_chunk {
> > +     struct gen_pool *pool;
> > +     struct page **bounce_pages;
> > +     struct vduse_iova_map **iova_map;
> > +     unsigned long start;
> > +     atomic_t map_type;
> > +};
> > +
> > +struct vduse_iova_domain {
> > +     struct vduse_iova_chunk *chunks;
> > +     int chunk_num;
> > +     size_t size;
> > +     struct mutex map_lock;
> > +     struct mutex vma_lock;
> > +     struct list_head vma_list;
> > +};
>
>
> It's better to explain why you need to organize the bounce buffer with
> chunks by adding some comments above or in the commit log. Is this
> because you want to have O(1) for finding the page for a specific IOVA?
>

It is used to distinguish the different DMA mapping types, as said above.
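
To make that concrete, here is a simplified userspace sketch of how a chunk
gets bound to one mapping type on first use, modelled on the loop in
vduse_domain_alloc_iova(). It uses C11 atomics in place of the kernel's
atomic_t, leaves out the genpool allocation itself (so it never falls through
to the next chunk when one is full), and the sketch_ names are made up for
illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum iova_map_type {
	TYPE_NONE,
	TYPE_DIRECT_MAP,
	TYPE_BOUNCE_MAP,
};

/* Hypothetical, stripped-down chunk: only the type tag is modelled. */
struct sketch_chunk {
	atomic_int map_type;
};

/*
 * Return the index of the first chunk that already has the requested
 * type or could be claimed for it, or -1 if every chunk belongs to
 * another type.
 */
static int sketch_pick_chunk(struct sketch_chunk *chunks, int n,
			     enum iova_map_type type)
{
	int i;

	assert(chunks != NULL && n >= 0);
	for (i = 0; i < n; i++) {
		int expected = TYPE_NONE;

		/* Claim an unused chunk; does nothing if it is already typed. */
		atomic_compare_exchange_strong(&chunks[i].map_type,
					       &expected, (int)type);

		if (atomic_load(&chunks[i].map_type) == (int)type)
			return i;
	}

	return -1;
}
```

Once a chunk is tagged, every iova inside it is known to be either a direct
(consistent) or a bounce (streaming) mapping from the chunk index alone, which
is what vduse_domain_is_direct_map() relies on.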


>
> > +
> > +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
> > +                             struct vm_area_struct *vma);
> > +
> > +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
> > +                             struct vm_area_struct *vma);
> > +
> > +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, unsigned long orig,
> > +                             size_t size, enum dma_data_direction dir);
> > +
> > +struct vduse_iova_map *
> > +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
> > +                     unsigned long iova);
> > +
> > +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
> > +                             struct vduse_iova_map *map);
> > +
> > +void vduse_domain_unmap(struct vduse_iova_domain *domain,
> > +                     unsigned long iova, size_t size);
> > +
> > +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
> > +                     struct vm_area_struct *vma, unsigned long iova);
> > +
> > +void vduse_domain_bounce(struct vduse_iova_domain *domain,
> > +                     unsigned long iova, unsigned long orig,
> > +                     size_t size, enum dma_data_direction dir);
> > +
> > +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
> > +                     struct vm_area_struct *vma, unsigned long iova);
> > +
> > +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> > +                             unsigned long iova);
> > +
> > +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
> > +                                     size_t size, enum iova_map_type type);
> > +
> > +void vduse_domain_free_iova(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, size_t size);
> > +
> > +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> > +                             unsigned long iova);
> > +
> > +void vduse_iova_domain_destroy(struct vduse_iova_domain *domain);
> > +
> > +struct vduse_iova_domain *vduse_iova_domain_create(size_t size);
> > +
> > +#endif /* _VDUSE_IOVA_DOMAIN_H */
> > diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
> > new file mode 100644
> > index 000000000000..1041ce7bddc4
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse.h
> > @@ -0,0 +1,59 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#ifndef _VDUSE_H
> > +#define _VDUSE_H
> > +
> > +#include <linux/eventfd.h>
> > +#include <linux/wait.h>
> > +#include <linux/vdpa.h>
> > +
> > +#include "iova_domain.h"
> > +#include "eventfd.h"
> > +
> > +struct vduse_virtqueue {
> > +     u16 index;
> > +     bool ready;
> > +     spinlock_t kick_lock;
> > +     spinlock_t irq_lock;
> > +     struct eventfd_ctx *kickfd;
> > +     struct vduse_virqfd *virqfd;
> > +     void *private;
> > +     irqreturn_t (*cb)(void *data);
> > +};
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_vdpa {
> > +     struct vdpa_device vdpa;
> > +     struct vduse_dev *dev;
> > +};
> > +
> > +struct vduse_dev {
> > +     struct vduse_vdpa *vdev;
> > +     struct vduse_virtqueue *vqs;
> > +     struct vduse_iova_domain *domain;
> > +     struct mutex lock;
> > +     spinlock_t msg_lock;
> > +     atomic64_t msg_unique;
> > +     wait_queue_head_t waitq;
> > +     struct list_head send_list;
> > +     struct list_head recv_list;
> > +     struct list_head list;
> > +     refcount_t refcnt;
> > +     u32 id;
> > +     u16 vq_size_max;
> > +     u16 vq_num;
> > +     u32 vq_align;
> > +     u32 device_id;
> > +     u32 vendor_id;
> > +};
> > +
> > +#endif /* _VDUSE_H_ */
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > new file mode 100644
> > index 000000000000..4a869b9698ef
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -0,0 +1,1121 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/slab.h>
> > +#include <linux/wait.h>
> > +#include <linux/dma-map-ops.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/file.h>
> > +#include <linux/uio.h>
> > +#include <linux/vdpa.h>
> > +#include <uapi/linux/vduse.h>
> > +#include <uapi/linux/vdpa.h>
> > +#include <uapi/linux/virtio_config.h>
> > +#include <linux/mod_devicetable.h>
> > +
> > +#include "vduse.h"
> > +
> > +#define DRV_VERSION  "1.0"
> > +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> > +#define DRV_DESC     "vDPA Device in Userspace"
> > +#define DRV_LICENSE  "GPL v2"
> > +
> > +struct vduse_dev_msg {
> > +     struct vduse_dev_request req;
> > +     struct vduse_dev_response resp;
> > +     struct list_head list;
> > +     wait_queue_head_t waitq;
> > +     bool completed;
> > +     refcount_t refcnt;
> > +};
> > +
> > +static struct workqueue_struct *vduse_vdpa_wq;
> > +static DEFINE_MUTEX(vduse_lock);
> > +static LIST_HEAD(vduse_devs);
> > +
> > +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> > +
> > +     return vdev->dev;
> > +}
> > +
> > +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> > +{
> > +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> > +
> > +     return vdpa_to_vduse(vdpa);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
> > +{
> > +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> > +                                     GFP_KERNEL | __GFP_NOFAIL);
> > +
> > +     msg->req.type = type;
> > +     msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     init_waitqueue_head(&msg->waitq);
> > +     refcount_set(&msg->refcnt, 1);
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
> > +{
> > +     refcount_inc(&msg->refcnt);
> > +}
> > +
> > +static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
> > +{
> > +     if (refcount_dec_and_test(&msg->refcnt))
> > +             kfree(msg);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
> > +                                             struct list_head *head,
> > +                                             uint32_t unique)
> > +{
> > +     struct vduse_dev_msg *tmp, *msg = NULL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     list_for_each_entry(tmp, head, list) {
> > +             if (tmp->req.unique == unique) {
> > +                     msg = tmp;
> > +                     list_del(&tmp->list);
> > +                     break;
> > +             }
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return msg;
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
> > +                                             struct list_head *head)
> > +{
> > +     struct vduse_dev_msg *msg = NULL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     if (!list_empty(head)) {
> > +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> > +             list_del(&msg->list);
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
> > +                     struct vduse_dev_msg *msg, struct list_head *head)
> > +{
> > +     spin_lock(&dev->msg_lock);
> > +     list_add_tail(&msg->list, head);
> > +     spin_unlock(&dev->msg_lock);
> > +}
> > +
> > +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> > +{
> > +     int ret;
> > +
> > +     vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> > +     wake_up(&dev->waitq);
> > +     wait_event(msg->waitq, msg->completed);
> > +     /* coupled with smp_wmb() in vduse_dev_msg_complete() */
> > +     smp_rmb();
> > +     ret = msg->resp.result;
> > +
> > +     return ret;
> > +}
> > +
> > +static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
> > +                                     struct vduse_dev_response *resp)
> > +{
> > +     vduse_dev_msg_get(msg);
> > +     memcpy(&msg->resp, resp, sizeof(*resp));
> > +     /* coupled with smp_rmb() in vduse_dev_msg_sync() */
> > +     smp_wmb();
> > +     msg->completed = 1;
> > +     wake_up(&msg->waitq);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
> > +     u64 features;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     features = msg->resp.features;
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return features;
> > +}
> > +
> > +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
> > +     int ret;
> > +
> > +     msg->req.size = sizeof(features);
> > +     msg->req.features = features;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
> > +     u8 status;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     status = msg->resp.status;
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return status;
> > +}
> > +
> > +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
> > +
> > +     msg->req.size = sizeof(status);
> > +     msg->req.status = status;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> > +                                     void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
> > +
> > +     WARN_ON(len > sizeof(msg->req.config.data));
> > +
> > +     msg->req.size = sizeof(struct vduse_dev_config_data);
> > +     msg->req.config.offset = offset;
> > +     msg->req.config.len = len;
> > +     vduse_dev_msg_sync(dev, msg);
> > +     memcpy(buf, msg->resp.config.data, len);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> > +                                     const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
> > +
> > +     WARN_ON(len > sizeof(msg->req.config.data));
> > +
> > +     msg->req.size = sizeof(struct vduse_dev_config_data);
> > +     msg->req.config.offset = offset;
> > +     msg->req.config.len = len;
> > +     memcpy(msg->req.config.data, buf, len);
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, u32 num)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_num);
> > +     msg->req.vq_num.index = vq->index;
> > +     msg->req.vq_num.num = num;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, u64 desc_addr,
> > +                             u64 driver_addr, u64 device_addr)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
> > +     int ret;
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_addr);
> > +     msg->req.vq_addr.index = vq->index;
> > +     msg->req.vq_addr.desc_addr = desc_addr;
> > +     msg->req.vq_addr.driver_addr = driver_addr;
> > +     msg->req.vq_addr.device_addr = device_addr;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, bool ready)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_ready);
> > +     msg->req.vq_ready.index = vq->index;
> > +     msg->req.vq_ready.ready = ready;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> > +                                struct vduse_virtqueue *vq)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
> > +     bool ready;
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_ready);
> > +     msg->req.vq_ready.index = vq->index;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     ready = msg->resp.vq_ready.ready;
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ready;
> > +}
> > +
> > +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int size = sizeof(struct vduse_dev_request);
> > +     ssize_t ret = 0;
> > +
> > +     if (iov_iter_count(to) < size)
> > +             return 0;
> > +
> > +     while (1) {
> > +             msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
> > +             if (msg)
> > +                     break;
> > +
> > +             if (file->f_flags & O_NONBLOCK)
> > +                     return -EAGAIN;
> > +
> > +             ret = wait_event_interruptible_exclusive(dev->waitq,
> > +                                     !list_empty(&dev->send_list));
> > +             if (ret)
> > +                     return ret;
> > +     }
> > +     ret = copy_to_iter(&msg->req, size, to);
> > +     if (ret != size) {
> > +             vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> > +             return -EFAULT;
> > +     }
> > +     vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
> > +
> > +     return ret;
> > +}
> > +
> > +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_response resp;
> > +     struct vduse_dev_msg *msg;
> > +     size_t ret;
> > +
> > +     ret = copy_from_iter(&resp, sizeof(resp), from);
> > +     if (ret != sizeof(resp))
> > +             return -EINVAL;
> > +
> > +     msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
> > +     if (!msg)
> > +             return -EINVAL;
> > +
> > +     vduse_dev_msg_complete(msg, &resp);
> > +
> > +     return ret;
> > +}
> > +
> > +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     __poll_t mask = 0;
> > +
> > +     poll_wait(file, &dev->waitq, wait);
> > +
> > +     if (!list_empty(&dev->send_list))
> > +             mask |= EPOLLIN | EPOLLRDNORM;
> > +
> > +     return mask;
> > +}
> > +
> > +static void vduse_dev_reset(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             vq->ready = false;
> > +             vq->cb = NULL;
> > +             vq->private = NULL;
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +}
> > +
> > +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > +                             u64 desc_area, u64 driver_area,
> > +                             u64 device_area)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
> > +                                     driver_area, device_area);
> > +}
> > +
> > +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_vq_kick(vq);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> > +                           struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->cb = cb->callback;
> > +     vq->private = cb->private;
> > +}
> > +
> > +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_num(dev, vq, num);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> > +                                     u16 idx, bool ready)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_ready(dev, vq, ready);
> > +     vq->ready = ready;
> > +}
> > +
> > +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
> > +
> > +     return vq->ready;
> > +}
> > +
> > +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_align;
> > +}
> > +
> > +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> > +
> > +     return (vduse_dev_get_features(dev) | fixed);
> > +}
> > +
> > +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_set_features(dev, features);
> > +}
> > +
> > +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> > +                               struct vdpa_callback *cb)
> > +{
> > +     /* We don't support config interrupt */
> > +}
> > +
> > +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_size_max;
> > +}
> > +
> > +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->device_id;
> > +}
> > +
> > +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vendor_id;
> > +}
> > +
> > +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_get_status(dev);
> > +}
> > +
> > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     if (status == 0)
> > +             vduse_dev_reset(dev);
> > +
> > +     vduse_dev_set_status(dev, status);
> > +}
> > +
> > +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                          void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_get_config(dev, offset, buf, len);
> > +}
> > +
> > +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                     const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_set_config(dev, offset, buf, len);
> > +}
> > +
> > +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_kickfd_release(dev);
> > +     vduse_virqfd_release(dev);
> > +
> > +     WARN_ON(!list_empty(&dev->send_list));
> > +     WARN_ON(!list_empty(&dev->recv_list));
> > +     dev->vdev = NULL;
> > +}
> > +
> > +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > +     .set_vq_address         = vduse_vdpa_set_vq_address,
> > +     .kick_vq                = vduse_vdpa_kick_vq,
> > +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> > +     .set_vq_num             = vduse_vdpa_set_vq_num,
> > +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> > +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> > +     .get_vq_align           = vduse_vdpa_get_vq_align,
> > +     .get_features           = vduse_vdpa_get_features,
> > +     .set_features           = vduse_vdpa_set_features,
> > +     .set_config_cb          = vduse_vdpa_set_config_cb,
> > +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> > +     .get_device_id          = vduse_vdpa_get_device_id,
> > +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> > +     .get_status             = vduse_vdpa_get_status,
> > +     .set_status             = vduse_vdpa_set_status,
> > +     .get_config             = vduse_vdpa_get_config,
> > +     .set_config             = vduse_vdpa_set_config,
> > +     .free                   = vduse_vdpa_free,
> > +};
> > +
> > +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > +                                     unsigned long offset, size_t size,
> > +                                     enum dma_data_direction dir,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova = vduse_domain_alloc_iova(domain, size,
> > +                                                     TYPE_BOUNCE_MAP);
> > +     unsigned long orig = (unsigned long)page_address(page) + offset;
> > +
> > +     if (!iova)
> > +             return DMA_MAPPING_ERROR;
> > +
> > +     if (vduse_domain_add_mapping(domain, iova, orig, size, dir)) {
> > +             vduse_domain_free_iova(domain, iova, size);
> > +             return DMA_MAPPING_ERROR;
> > +     }
> > +
> > +     if (dir == DMA_TO_DEVICE)
>
>
> How about bidirectional mapping?
>

Will fix it.
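
One possible shape for the fix, sketched in userspace: a DMA_BIDIRECTIONAL buffer needs the copy in both places, into the bounce buffer at map time and back to the original at unmap time. The enum below is a local stand-in that mirrors the kernel's values, and the demo_* helpers and plain char buffers are assumptions for illustration only, not the code that will be posted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Local stand-in for the kernel's enum dma_data_direction. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
};

/* The device will read the buffer: copy original -> bounce when mapping. */
static bool bounce_on_map(enum dma_data_direction dir)
{
	return dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL;
}

/* The device may have written: copy bounce -> original when unmapping. */
static bool bounce_on_unmap(enum dma_data_direction dir)
{
	return dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL;
}

/* Hypothetical model of the map path on ordinary memory. */
static void demo_map(char *bounce, const char *orig, size_t size,
		     enum dma_data_direction dir)
{
	assert(dir != DMA_NONE);
	if (bounce_on_map(dir))
		memcpy(bounce, orig, size);
}

/* Hypothetical model of the unmap path on ordinary memory. */
static void demo_unmap(char *orig, const char *bounce, size_t size,
		       enum dma_data_direction dir)
{
	assert(dir != DMA_NONE);
	if (bounce_on_unmap(dir))
		memcpy(orig, bounce, size);
}
```

The same two predicates would replace the `dir == DMA_TO_DEVICE` and `dir == DMA_FROM_DEVICE` tests in the map and unmap callbacks.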

>
> > +             vduse_domain_bounce(domain, iova, orig, size, dir);
> > +
> > +     return (dma_addr_t)iova;
> > +}
> > +
> > +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova = (unsigned long)dma_addr;
> > +     struct vduse_iova_map *map = vduse_domain_get_mapping(domain, iova);
> > +
> > +     if (WARN_ON(!map))
> > +             return;
> > +
> > +     if (dir == DMA_FROM_DEVICE)
> > +             vduse_domain_bounce(domain, iova, map->orig, size, dir);
> > +     vduse_domain_remove_mapping(domain, map);
> > +     vduse_domain_free_iova(domain, iova, size);
> > +     kfree(map);
> > +}
> > +
> > +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > +                                     dma_addr_t *dma_addr, gfp_t flag,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova = vduse_domain_alloc_iova(domain, size,
> > +                                                     TYPE_DIRECT_MAP);
> > +     void *orig = alloc_pages_exact(size, flag);
> > +
> > +     if (!iova || !orig)
> > +             goto err;
> > +
> > +     if (vduse_domain_add_mapping(domain, iova,
> > +                             (unsigned long)orig, size, DMA_BIDIRECTIONAL))
> > +             goto err;
> > +
> > +     *dma_addr = (dma_addr_t)iova;
> > +
> > +     return orig;
> > +err:
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     if (orig)
> > +             free_pages_exact(orig, size);
> > +     if (iova)
> > +             vduse_domain_free_iova(domain, iova, size);
> > +
> > +     return NULL;
> > +}
> > +
> > +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > +                                     void *vaddr, dma_addr_t dma_addr,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova = (unsigned long)dma_addr;
> > +     struct vduse_iova_map *map = vduse_domain_get_mapping(domain, iova);
> > +
> > +     if (WARN_ON(!map))
> > +             return;
> > +
> > +     vduse_domain_remove_mapping(domain, map);
> > +     vduse_domain_unmap(domain, map->iova, PAGE_ALIGN(map->size));
> > +     free_pages_exact((void *)map->orig, map->size);
> > +     vduse_domain_free_iova(domain, map->iova, map->size);
> > +     kfree(map);
> > +}
> > +
> > +static const struct dma_map_ops vduse_dev_dma_ops = {
> > +     .map_page = vduse_dev_map_page,
> > +     .unmap_page = vduse_dev_unmap_page,
> > +     .alloc = vduse_dev_alloc_coherent,
> > +     .free = vduse_dev_free_coherent,
> > +};
> > +
> > +static void vduse_dev_mmap_open(struct vm_area_struct *vma)
> > +{
> > +     struct vduse_iova_domain *domain = vma->vm_private_data;
> > +
> > +     if (!vduse_domain_add_vma(domain, vma))
> > +             return;
> > +
> > +     vma->vm_private_data = NULL;
> > +}
> > +
> > +static void vduse_dev_mmap_close(struct vm_area_struct *vma)
> > +{
> > +     struct vduse_iova_domain *domain = vma->vm_private_data;
> > +
> > +     if (!domain)
> > +             return;
> > +
> > +     vduse_domain_remove_vma(domain, vma);
> > +}
> > +
> > +static int vduse_dev_mmap_split(struct vm_area_struct *vma, unsigned long addr)
> > +{
> > +     return -EPERM;
> > +}
> > +
> > +static vm_fault_t vduse_dev_mmap_fault(struct vm_fault *vmf)
> > +{
> > +     struct vm_area_struct *vma = vmf->vma;
> > +     struct vduse_iova_domain *domain = vma->vm_private_data;
> > +     unsigned long iova = vmf->address - vma->vm_start;
> > +     int ret;
> > +
> > +     if (!domain)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     if (vduse_domain_is_direct_map(domain, iova))
> > +             ret = vduse_domain_direct_map(domain, vma, iova);
> > +     else
> > +             ret = vduse_domain_bounce_map(domain, vma, iova);
> > +
> > +     if (ret == -ENOMEM)
> > +             return VM_FAULT_OOM;
> > +     if (ret < 0 && ret != -EBUSY)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     return VM_FAULT_NOPAGE;
> > +}
> > +
> > +static const struct vm_operations_struct vduse_dev_mmap_ops = {
> > +     .open = vduse_dev_mmap_open,
> > +     .close = vduse_dev_mmap_close,
> > +     .may_split = vduse_dev_mmap_split,
> > +     .fault = vduse_dev_mmap_fault,
> > +};
> > +
> > +static int vduse_dev_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_iova_domain *domain = dev->domain;
> > +     unsigned long size = vma->vm_end - vma->vm_start;
> > +     int ret;
> > +
> > +     if (domain->size != size || vma->vm_pgoff)
> > +             return -EINVAL;
> > +
> > +     ret = vduse_domain_add_vma(domain, vma);
> > +     if (ret)
> > +             return ret;
> > +
> > +     vma->vm_flags |= VM_MIXEDMAP | VM_DONTCOPY |
> > +                             VM_DONTDUMP | VM_DONTEXPAND;
> > +     vma->vm_private_data = domain;
> > +     vma->vm_ops = &vduse_dev_mmap_ops;
> > +
> > +     return 0;
> > +}
> > +
> > +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     void __user *argp = (void __user *)arg;
> > +     int ret;
> > +
> > +     mutex_lock(&dev->lock);
> > +     switch (cmd) {
> > +     case VDUSE_VQ_SETUP_KICKFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_kickfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     case VDUSE_VQ_SETUP_IRQFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_virqfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     }
> > +     mutex_unlock(&dev->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +
> > +     while ((msg = vduse_dev_dequeue_msg(dev, &dev->recv_list)))
> > +             vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> > +
> > +     refcount_dec(&dev->refcnt);
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations vduse_dev_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .release        = vduse_dev_release,
> > +     .read_iter      = vduse_dev_read_iter,
> > +     .write_iter     = vduse_dev_write_iter,
> > +     .poll           = vduse_dev_poll,
> > +     .mmap           = vduse_dev_mmap,
> > +     .unlocked_ioctl = vduse_dev_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static struct vduse_dev *vduse_dev_create(void)
> > +{
> > +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +
> > +     if (!dev)
> > +             return NULL;
> > +
> > +     mutex_init(&dev->lock);
> > +     spin_lock_init(&dev->msg_lock);
> > +     INIT_LIST_HEAD(&dev->send_list);
> > +     INIT_LIST_HEAD(&dev->recv_list);
> > +     atomic64_set(&dev->msg_unique, 0);
> > +     init_waitqueue_head(&dev->waitq);
> > +     refcount_set(&dev->refcnt, 1);
> > +
> > +     return dev;
> > +}
> > +
> > +static void vduse_dev_destroy(struct vduse_dev *dev)
> > +{
> > +     mutex_destroy(&dev->lock);
> > +     kfree(dev);
> > +}
> > +
> > +static struct vduse_dev *vduse_find_dev(u32 id)
> > +{
> > +     struct vduse_dev *tmp, *dev = NULL;
> > +
> > +     list_for_each_entry(tmp, &vduse_devs, list) {
> > +             if (tmp->id == id) {
> > +                     dev = tmp;
> > +                     break;
> > +             }
> > +     }
> > +     return dev;
> > +}
> > +
> > +static int vduse_get_dev(u32 id)
> > +{
> > +     int fd;
> > +     char name[64];
> > +     struct vduse_dev *dev = vduse_find_dev(id);
> > +
> > +     if (!dev)
> > +             return -EINVAL;
> > +
> > +     snprintf(name, sizeof(name), "vduse-dev:%u", dev->id);
> > +     fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
> > +     if (fd < 0)
> > +             return fd;
> > +
> > +     refcount_inc(&dev->refcnt);
> > +
> > +     return fd;
> > +}
> > +
> > +static int vduse_destroy_dev(u32 id)
> > +{
> > +     struct vduse_dev *dev = vduse_find_dev(id);
> > +
> > +     if (!dev)
> > +             return -EINVAL;
> > +
> > +     if (dev->vdev || refcount_read(&dev->refcnt) > 1)
> > +             return -EBUSY;
> > +
> > +     list_del(&dev->list);
> > +     kfree(dev->vqs);
> > +     vduse_iova_domain_destroy(dev->domain);
> > +     vduse_dev_destroy(dev);
> > +
> > +     return 0;
> > +}
> > +
> > +static int vduse_create_dev(struct vduse_dev_config *config)
> > +{
> > +     int i, fd;
> > +     struct vduse_dev *dev;
> > +     char name[64];
> > +
> > +     if (vduse_find_dev(config->id))
> > +             return -EEXIST;
> > +
> > +     dev = vduse_dev_create();
> > +     if (!dev)
> > +             return -ENOMEM;
> > +
> > +     dev->id = config->id;
> > +     dev->device_id = config->device_id;
> > +     dev->vendor_id = config->vendor_id;
> > +     dev->domain = vduse_iova_domain_create(config->iova_size);
> > +     if (!dev->domain)
> > +             goto err_domain;
> > +
> > +     dev->vq_align = config->vq_align;
> > +     dev->vq_size_max = config->vq_size_max;
> > +     dev->vq_num = config->vq_num;
> > +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> > +     if (!dev->vqs)
> > +             goto err_vqs;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             dev->vqs[i].index = i;
> > +             spin_lock_init(&dev->vqs[i].kick_lock);
> > +             spin_lock_init(&dev->vqs[i].irq_lock);
> > +     }
> > +
> > +     snprintf(name, sizeof(name), "vduse-dev:%u", config->id);
> > +     fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
> > +     if (fd < 0)
> > +             goto err_fd;
> > +
> > +     refcount_inc(&dev->refcnt);
> > +     list_add(&dev->list, &vduse_devs);
> > +
> > +     return fd;
> > +err_fd:
> > +     kfree(dev->vqs);
> > +err_vqs:
> > +     vduse_iova_domain_destroy(dev->domain);
> > +err_domain:
> > +     vduse_dev_destroy(dev);
> > +     return fd;
> > +}
> > +
> > +static long vduse_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     int ret;
> > +     void __user *argp = (void __user *)arg;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     switch (cmd) {
> > +     case VDUSE_CREATE_DEV: {
> > +             struct vduse_dev_config config;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, sizeof(config)))
> > +                     break;
> > +
> > +             ret = vduse_create_dev(&config);
> > +             break;
> > +     }
> > +     case VDUSE_GET_DEV:
> > +             ret = vduse_get_dev(arg);
> > +             break;
>
>
> What's the use case of VDUSE_GET_DEV? (Need to document this)
>

It is used to get the device fd again after the VDUSE daemon restarts or
crashes. I will split this into a separate patch and document it.
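
To make the intended lifecycle concrete, here is a rough userspace model of
the reference counting that the quoted create/get/destroy paths appear to
rely on. It is only a sketch, not the driver code: the model_* names are made
up for illustration, and it assumes a device starts with one reference for
the device list, gains one per open fd, and drops one when an fd is closed
(the release path is not part of the quoted hunk).

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 8

/* Hypothetical stand-in for struct vduse_dev; only the fields needed here. */
struct model_dev {
	int in_use;          /* slot holds a live device */
	unsigned int id;     /* device id chosen by the daemon */
	unsigned int refcnt; /* assumed: 1 for the list + 1 per open fd */
};

static struct model_dev model_devs[MODEL_MAX_DEVS];

static struct model_dev *model_find_dev(unsigned int id)
{
	for (size_t i = 0; i < MODEL_MAX_DEVS; i++)
		if (model_devs[i].in_use && model_devs[i].id == id)
			return &model_devs[i];
	return NULL;
}

/* Models VDUSE_CREATE_DEV: the caller ends up holding one fd. */
int model_create_dev(unsigned int id)
{
	if (model_find_dev(id))
		return -EEXIST;
	for (size_t i = 0; i < MODEL_MAX_DEVS; i++) {
		if (!model_devs[i].in_use) {
			model_devs[i] = (struct model_dev){
				.in_use = 1, .id = id, .refcnt = 2,
			};
			return 0;
		}
	}
	return -ENOMEM;
}

/* Models VDUSE_GET_DEV: a new fd for a device that already exists. */
int model_get_dev(unsigned int id)
{
	struct model_dev *dev = model_find_dev(id);

	if (!dev)
		return -EINVAL;
	dev->refcnt++;
	return 0;
}

/* Models closing an fd, e.g. because the daemon exited or crashed. */
int model_put_dev(unsigned int id)
{
	struct model_dev *dev = model_find_dev(id);

	if (!dev || dev->refcnt <= 1)
		return -EINVAL;
	dev->refcnt--;
	return 0;
}

/* Models VDUSE_DESTROY_DEV: refused while any fd is still open. */
int model_destroy_dev(unsigned int id)
{
	struct model_dev *dev = model_find_dev(id);

	if (!dev)
		return -EINVAL;
	if (dev->refcnt > 1)
		return -EBUSY;
	dev->in_use = 0;
	return 0;
}
```

In this model a destroy attempt fails with -EBUSY while a daemon holds the
fd. Once a crash drops that reference, a restarted daemon can re-attach
through the get path instead of recreating the device.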

Thanks,
Yongji



* Re: [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
  2020-12-23 10:59   ` Yongji Xie
@ 2020-12-24  2:24     ` Jason Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Wang @ 2020-12-24  2:24 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/23 6:59 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 2:38 PM Jason Wang <jasowang@redhat.com> wrote:
>>
> >> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> This series introduces a framework, which can be used to implement
> >>> vDPA Devices in a userspace program. The work consists of two parts:
>>> control path forwarding and data path offloading.
>>>
>>> In the control path, the VDUSE driver will make use of message
> >>> mechanism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, the core is mapping dma buffer into VDUSE
>>> daemon's address space, which can be implemented in different ways
>>> depending on the vdpa bus to which the vDPA device is attached.
>>>
> >>> In virtio-vdpa case, we implement a MMU-based on-chip IOMMU driver with
>>> bounce-buffering mechanism to achieve that.
>>
> >> Rethinking the bounce buffer approach: instead of using kernel pages
> >> with mmap(), how about just using userspace pages, like vhost does?
> >>
> >> It means we need a worker to do the bouncing, but we don't need to care
> >> about annoying things like page reclaiming?
>>
> Now the I/O bouncing is done in the streaming DMA mapping routines
> which can be called from interrupt context. If we put this into a
> kworker, that means we need to synchronize with a kworker in an
> interrupt context. I think it can't work.


We just need to make sure the buffers are ready before userspace tries
to access them.

But I admit it would be tricky (it would require a shadow virtqueue, etc.),
so it is probably not a good idea.

Thanks


>
> Thanks,
> Yongji
>




* Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
  2020-12-23 11:06     ` Yongji Xie
@ 2020-12-24  2:36       ` Jason Wang
  2020-12-24  7:24         ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-24  2:36 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/23 7:06 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 4:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>
> >> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> This patch introduces a new method in the vdpa_config_ops to
>>> support processing the raw vhost memory mapping message in the
>>> vDPA device driver.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    drivers/vhost/vdpa.c | 5 ++++-
>>>    include/linux/vdpa.h | 7 +++++++
>>>    2 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>> index 448be7875b6d..ccbb391e38be 100644
>>> --- a/drivers/vhost/vdpa.c
>>> +++ b/drivers/vhost/vdpa.c
>>> @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>>>        if (r)
>>>                return r;
>>>
>>> +     if (ops->process_iotlb_msg)
>>> +             return ops->process_iotlb_msg(vdpa, msg);
>>> +
>>>        switch (msg->type) {
>>>        case VHOST_IOTLB_UPDATE:
>>>                r = vhost_vdpa_process_iotlb_update(v, msg);
>>> @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
>>>        int ret;
>>>
>>>        /* Device want to do DMA by itself */
>>> -     if (ops->set_map || ops->dma_map)
>>> +     if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
>>>                return 0;
>>>
>>>        bus = dma_dev->bus;
>>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
>>> index 656fe264234e..7bccedf22f4b 100644
>>> --- a/include/linux/vdpa.h
>>> +++ b/include/linux/vdpa.h
>>> @@ -5,6 +5,7 @@
>>>    #include <linux/kernel.h>
>>>    #include <linux/device.h>
>>>    #include <linux/interrupt.h>
>>> +#include <linux/vhost_types.h>
>>>    #include <linux/vhost_iotlb.h>
>>>    #include <net/genetlink.h>
>>>
>>> @@ -172,6 +173,10 @@ struct vdpa_iova_range {
>>>     *                          @vdev: vdpa device
>>>     *                          Returns the iova range supported by
>>>     *                          the device.
>>> + * @process_iotlb_msg:               Process vhost memory mapping message (optional)
>>> + *                           Only used for VDUSE device now
>>> + *                           @vdev: vdpa device
>>> + *                           @msg: vhost memory mapping message
>>>     * @set_map:                        Set device memory mapping (optional)
>>>     *                          Needed for device that using device
>>>     *                          specific DMA translation (on-chip IOMMU)
>>> @@ -240,6 +245,8 @@ struct vdpa_config_ops {
>>>        struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
>>>
>>>        /* DMA ops */
>>> +     int (*process_iotlb_msg)(struct vdpa_device *vdev,
>>> +                              struct vhost_iotlb_msg *msg);
>>>        int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
>>>        int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
>>>                       u64 pa, u32 perm);
>>
>> Is there any reason that it can't be done via dma_map/dma_unmap or set_map?
>>
> To get the shmfd, we need the vma rather than physical address. And
> it's not necessary to pin the user pages in VDUSE case.


Right. Actually, vhost-vDPA is planning to support a shared virtual
address space.

So let's try to reuse the existing config ops. How about just introducing
an attribute on the vdpa device that tells the bus it can do shared
virtual memory? Then, when the device is probed by vhost-vDPA, user pages
won't be pinned, and we will use a VA->VA mapping as the IOVA->PA mapping
in the vhost IOTLB and the config ops. The vhost IOTLB needs to be extended
to accept an opaque pointer to store the file, and the file would be passed
via the config ops as well.
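
Roughly, the idea could be sketched as below. This is a userspace
illustration under assumptions, not existing kernel API: the use_va
attribute, the opaque field, and every sketch_* name are hypothetical, and
pa_if_pinned stands in for whatever address pinning the user pages would
have produced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device attribute; not an existing vdpa field. */
struct sketch_vdpa_dev {
	bool use_va; /* device can work on shared virtual addresses */
};

/* Hypothetical IOTLB entry extended with an opaque pointer. */
struct sketch_iotlb_entry {
	uint64_t iova;
	uint64_t size;
	uint64_t addr; /* VA when use_va is set, PA otherwise */
	void *opaque;  /* e.g. the backing file; NULL in the PA case */
	bool pinned;   /* whether the user pages had to be pinned */
};

/*
 * Build the entry the bus driver would hand to the device.
 * With use_va set, nothing is pinned and the userspace address is
 * stored directly together with the file; otherwise the entry
 * carries the physical address obtained by pinning.
 */
struct sketch_iotlb_entry
sketch_make_entry(const struct sketch_vdpa_dev *dev, uint64_t iova,
		  uint64_t size, uint64_t uaddr, uint64_t pa_if_pinned,
		  void *file)
{
	struct sketch_iotlb_entry e = {
		.iova = iova,
		.size = size,
	};

	if (dev->use_va) {
		e.addr = uaddr;
		e.opaque = file;
		e.pinned = false;
	} else {
		e.addr = pa_if_pinned;
		e.opaque = NULL;
		e.pinned = true;
	}
	return e;
}
```

The point of the sketch is only that the same entry layout and the same
config ops can serve both cases, with the attribute choosing what the
address field means.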

Thanks



>
> Thanks,
> Yongji
>




* Re: [External] Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-23 12:14     ` [External] " Yongji Xie
@ 2020-12-24  2:41       ` Jason Wang
  2020-12-24  7:37         ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-24  2:41 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/23 8:14 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>
> >> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> To support vhost-vdpa bus driver, we need a way to share the
>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
>>>
>>> This patch tries to make use of the vhost iotlb message to achieve
>>> that. We will get the shm file from the iotlb message and pass it
>>> to the userspace VDUSE process.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/driver-api/vduse.rst |  15 +++-
>>>    drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
>>>    include/uapi/linux/vduse.h         |  11 +++
>>>    3 files changed, 171 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>> index 623f7b040ccf..48e4b1ba353f 100644
>>> --- a/Documentation/driver-api/vduse.rst
>>> +++ b/Documentation/driver-api/vduse.rst
>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
>>>
>>>    - VDUSE_GET_CONFIG: Read from device specific configuration space
>>>
>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
>>> +
>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
>>> +
>>>    Please see include/linux/vdpa.h for details.
>>>
>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>> +The data path of userspace vDPA device is implemented in different ways
>>> +depending on the vdpa bus to which it is attached.
>>> +
>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>>>    driver which supports mapping the kernel dma buffer to a userspace iova
>>>    region dynamically. The userspace iova region can be created by passing
>>>    the userspace vDPA device fd to mmap(2).
>>>
> >>> +In vhost-vdpa case, the dma buffer resides in a userspace memory region
> >>> +which will be shared to the VDUSE userspace process via the file
>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
>>> +in this message.
>>> +
>>>    Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>>    receive virtqueue kicks in userspace. The following ioctls on the userspace
>>>    vDPA device fd are provided to support that:
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> index b974333ed4e9..d24aaacb6008 100644
>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -34,6 +34,7 @@
>>>
>>>    struct vduse_dev_msg {
>>>        struct vduse_dev_request req;
>>> +     struct file *iotlb_file;
>>>        struct vduse_dev_response resp;
>>>        struct list_head list;
>>>        wait_queue_head_t waitq;
>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>>        return ret;
>>>    }
>>>
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
>>> +                             u64 offset, u64 iova, u64 size, u8 perm)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +     int ret;
>>> +
>>> +     if (!size)
>>> +             return -EINVAL;
>>> +
>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
>>> +     msg->req.size = sizeof(struct vduse_iotlb);
>>> +     msg->req.iotlb.offset = offset;
>>> +     msg->req.iotlb.iova = iova;
>>> +     msg->req.iotlb.size = size;
>>> +     msg->req.iotlb.perm = perm;
>>> +     msg->req.iotlb.fd = -1;
>>> +     msg->iotlb_file = get_file(file);
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>
>> My feeling is that we should provide consistent API for the userspace
>> device to use.
>>
>> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
>>
> >> It looks to me like for virtio drivers we can still use the UPDATE_IOTLB
> >> message by using the VDUSE file as msg->iotlb_file here.
>>
> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
> message in virtio cases.


Instead of generating IOTLB messages for userspace, how about recording
the mappings (which is the common case for devices that have an on-chip
IOMMU, e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl
for userspace to query them.
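
As a rough illustration of what I mean, here is a userspace sketch of a
recorded-mapping table with a lookup that such a query ioctl could be built
on. All names (sketch_map, sketch_record_map, sketch_query_map) and the
fixed-size table are assumptions for illustration only; a real version
would more likely sit on top of the vhost IOTLB interval tree.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_MAPS 16

/* One recorded mapping; a slot with size == 0 is free. */
struct sketch_map {
	uint64_t iova;   /* start of the IOVA range */
	uint64_t size;   /* length in bytes */
	uint64_t offset; /* offset into the backing file */
	uint8_t perm;    /* access permission bits */
};

static struct sketch_map sketch_maps[SKETCH_MAX_MAPS];

/* Record a mapping when the bus driver sets it up. */
int sketch_record_map(uint64_t iova, uint64_t size, uint64_t offset,
		      uint8_t perm)
{
	size_t i;

	/* Reject empty ranges and ranges that wrap around. */
	if (!size || iova + size < iova)
		return -EINVAL;

	/* Reject ranges that overlap an existing mapping. */
	for (i = 0; i < SKETCH_MAX_MAPS; i++) {
		const struct sketch_map *m = &sketch_maps[i];

		if (m->size && iova < m->iova + m->size &&
		    m->iova < iova + size)
			return -EEXIST;
	}

	for (i = 0; i < SKETCH_MAX_MAPS; i++) {
		if (!sketch_maps[i].size) {
			sketch_maps[i] = (struct sketch_map){
				.iova = iova, .size = size,
				.offset = offset, .perm = perm,
			};
			return 0;
		}
	}
	return -ENOSPC;
}

/* What a query ioctl would do: find the mapping covering @iova. */
int sketch_query_map(uint64_t iova, struct sketch_map *out)
{
	for (size_t i = 0; i < SKETCH_MAX_MAPS; i++) {
		const struct sketch_map *m = &sketch_maps[i];

		if (m->size && iova >= m->iova && iova - m->iova < m->size) {
			*out = *m;
			return 0;
		}
	}
	return -ENOENT;
}
```

With something like this, userspace pulls the mapping when it needs it,
so the question of when to send UPDATE_IOTLB in the virtio case goes away.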

Thanks


>
>>> +     vduse_dev_msg_put(msg);
>>> +     fput(file);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev,
>>> +                                     u64 iova, u64 size)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +     int ret;
>>> +
>>> +     if (!size)
>>> +             return -EINVAL;
>>> +
>>> +     msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB);
>>> +     msg->req.size = sizeof(struct vduse_iotlb);
>>> +     msg->req.iotlb.iova = iova;
>>> +     msg->req.iotlb.size = size;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VHOST_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VHOST_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VHOST_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>>    static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>>    {
>>>        struct file *file = iocb->ki_filp;
>>>        struct vduse_dev *dev = file->private_data;
>>>        struct vduse_dev_msg *msg;
>>> -     int size = sizeof(struct vduse_dev_request);
>>> +     unsigned int flags;
>>> +     int fd, size = sizeof(struct vduse_dev_request);
>>>        ssize_t ret = 0;
>>>
>>>        if (iov_iter_count(to) < size)
>>> @@ -349,6 +418,18 @@ static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>>                if (ret)
>>>                        return ret;
>>>        }
>>> +
>>> +     if (msg->req.type == VDUSE_UPDATE_IOTLB && msg->req.iotlb.fd == -1) {
>>> +             flags = perm_to_file_flags(msg->req.iotlb.perm);
>>> +             fd = get_unused_fd_flags(flags);
>>> +             if (fd < 0) {
>>> +                     vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
>>> +                     return fd;
>>> +             }
>>> +             fd_install(fd, get_file(msg->iotlb_file));
>>> +             msg->req.iotlb.fd = fd;
>>> +     }
>>> +
>>>        ret = copy_to_iter(&msg->req, size, to);
>>>        if (ret != size) {
>>>                vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
>>> @@ -565,6 +646,69 @@ static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>>        vduse_dev_set_config(dev, offset, buf, len);
>>>    }
>>>
>>> +static void vduse_vdpa_invalidate_iotlb(struct vduse_dev *dev,
>>> +                                     struct vhost_iotlb_msg *msg)
>>> +{
>>> +     vduse_dev_invalidate_iotlb(dev, msg->iova, msg->size);
>>> +}
>>> +
>>> +static int vduse_vdpa_update_iotlb(struct vduse_dev *dev,
>>> +                                     struct vhost_iotlb_msg *msg)
>>> +{
>>> +     u64 uaddr = msg->uaddr;
>>> +     u64 iova = msg->iova;
>>> +     u64 size = msg->size;
>>> +     u64 offset;
>>> +     struct vm_area_struct *vma;
>>> +     int ret;
>>> +
>>> +     while (uaddr < msg->uaddr + msg->size) {
>>> +             vma = find_vma(current->mm, uaddr);
>>> +             ret = -EINVAL;
>>> +             if (!vma)
>>> +                     goto err;
>>> +
>>> +             size = min(msg->size, vma->vm_end - uaddr);
>>> +             offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
>>> +             if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
>>> +                     ret = vduse_dev_update_iotlb(dev, vma->vm_file, offset,
>>> +                                                     iova, size, msg->perm);
>>> +                     if (ret)
>>> +                             goto err;
>>
>> My understanding is that vma is something that should not be known by a
>> device. So I suggest to move the above processing to vhost-vdpa.c.
>>
> Will do it.
>
> Thanks,
> Yongji
>




* Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-12-23 14:17     ` Yongji Xie
@ 2020-12-24  3:01       ` Jason Wang
  2020-12-24  8:34         ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-24  3:01 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/23 10:17 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang@redhat.com> wrote:
>>
> >> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of message
> >>> mechanism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, the VDUSE driver implements a MMU-based on-chip
>>> IOMMU driver which supports mapping the kernel dma buffer to a
>>> userspace iova region dynamically. Userspace can access those
>>> iova region via mmap(). Besides, the eventfd mechanism is used to
>>> trigger interrupt callbacks and receive virtqueue kicks in userspace
>>>
>>> Now we only support virtio-vdpa bus driver with this patch applied.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/driver-api/vduse.rst                 |   74 ++
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |    8 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>>>    drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>>>    drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
>>>    drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
>>>    drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
>>>    include/uapi/linux/vdpa.h                          |    1 +
>>>    include/uapi/linux/vduse.h                         |   99 ++
>>>    13 files changed, 2173 insertions(+)
>>>    create mode 100644 Documentation/driver-api/vduse.rst
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>> new file mode 100644
>>> index 000000000000..da9b3040f20a
>>> --- /dev/null
>>> +++ b/Documentation/driver-api/vduse.rst
>>> @@ -0,0 +1,74 @@
>>> +==================================
>>> +VDUSE - "vDPA Device in Userspace"
>>> +==================================
>>> +
>>> +vDPA (virtio data path acceleration) device is a device that uses a
>>> +datapath which complies with the virtio specifications with vendor
> >>> +specific control path. vDPA devices can be either physically located on
>>> +the hardware or emulated by software. VDUSE is a framework that makes it
>>> +possible to implement software-emulated vDPA devices in userspace.
>>> +
>>> +How VDUSE works
>>> +------------
>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
>>> +to the new resources will be returned, which can be used to implement the
>>> +userspace vDPA device's control path and data path.
>>> +
>>> +To implement control path, the read/write operations to the file descriptor
>>> +will be used to receive/reply the control messages from/to VDUSE driver.
>>> +Those control messages are based on the vdpa_config_ops which defines a
>>> +unified interface to control different types of vDPA device.
>>> +
>>> +The following types of messages are provided by the VDUSE framework now:
>>> +
>>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
>>> +
>>> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
>>> +
>>> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
>>> +
>>> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
>>> +
>>> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
>>> +
>>> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
>>> +
>>> +- VDUSE_SET_STATUS: Set the device status
>>> +
>>> +- VDUSE_GET_STATUS: Get the device status
>>> +
>>> +- VDUSE_SET_CONFIG: Write to device specific configuration space
>>> +
>>> +- VDUSE_GET_CONFIG: Read from device specific configuration space
>>> +
>>> +Please see include/linux/vdpa.h for details.
>>> +
>>> +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>> +driver which supports mapping the kernel dma buffer to a userspace iova
>>> +region dynamically. The userspace iova region can be created by passing
>>> +the userspace vDPA device fd to mmap(2).
>>> +
>>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
>>> +vDPA device fd are provided to support that:
>>> +
>>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
>>> +  by VDUSE driver to notify userspace to consume the vring.
>>> +
>>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
>>> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
>>> +
>>> +MMU-based IOMMU Driver
>>> +----------------------
>>> +The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
>>> +IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
>>> +for the DMA transfer so that the userspace process is able to use its virtual
>>> +address to access the dma buffer in kernel.
>>> +
>>> +And to avoid security issue, a bounce-buffering mechanism is introduced to
>>> +prevent userspace accessing the original buffer directly which may contain other
> >>> +kernel data. During mapping and unmapping, the driver will copy the data from
>>> +the original buffer to the bounce buffer and back, depending on the direction of
>>> +the transfer. And the bounce-buffer addresses will be mapped into the user address
>>> +space instead of the original one.
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index 4be7be39be26..211cc449cbd3 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -21,6 +21,14 @@ config VDPA_SIM
>>>          to RX. This device is used for testing, prototyping and
>>>          development of vDPA.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     default n
>>
>> The "default n" is not necessary.
>>
> OK.
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index d160e9b63a66..66e97778ad03 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,5 +1,6 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..b7645e36992b
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
>>
> >> Do we really need eventfd.o here, considering we've selected it?
>>
> Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"?


My bad, I confused this with the common eventfd. So the code is fine here.


>
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
>>> new file mode 100644
>>> index 000000000000..dbffddb08908
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.c
>>> @@ -0,0 +1,221 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "eventfd.h"
>>> +
>>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
>>> +
>>> +static void vduse_virqfd_shutdown(struct work_struct *work)
>>> +{
>>> +     u64 cnt;
>>> +     struct vduse_virqfd *virqfd = container_of(work,
>>> +                                     struct vduse_virqfd, shutdown);
>>> +
>>> +     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
>>> +     flush_work(&virqfd->inject);
>>> +     eventfd_ctx_put(virqfd->ctx);
>>> +     kfree(virqfd);
>>> +}
>>> +
>>> +static void vduse_virqfd_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virqfd *virqfd = container_of(work,
>>> +                                     struct vduse_virqfd, inject);
>>> +     struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb)
>>> +             vq->cb(vq->private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
>>> +{
>>> +     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
>>> +}
>>> +
>>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
>>> +                             int sync, void *key)
>>> +{
>>> +     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
>>> +     struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> +     __poll_t flags = key_to_poll(key);
>>> +
>>> +     if (flags & EPOLLIN)
>>> +             schedule_work(&virqfd->inject);
>>> +
>>> +     if (flags & EPOLLHUP) {
>>> +             spin_lock(&vq->irq_lock);
>>> +             if (vq->virqfd == virqfd) {
>>> +                     vq->virqfd = NULL;
>>> +                     virqfd_deactivate(virqfd);
>>> +             }
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
>>> +                     wait_queue_head_t *wqh, poll_table *pt)
>>> +{
>>> +     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
>>> +
>>> +     add_wait_queue(wqh, &virqfd->wait);
>>> +}
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct vduse_virqfd *virqfd;
>>> +     struct fd irqfd;
>>> +     struct eventfd_ctx *ctx;
>>> +     struct vduse_virtqueue *vq;
>>> +     __poll_t events;
>>> +     int ret;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
>>> +     if (!virqfd)
>>> +             return -ENOMEM;
>>> +
>>> +     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
>>> +     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
>>
> >> Any reason that a workqueue is a must here?
>>
> Mainly for performance considerations. Make sure the push() and pop()
> for used vring can be asynchronous.


I see.


>
>>> +
>>> +     ret = -EBADF;
>>> +     irqfd = fdget(eventfd->fd);
>>> +     if (!irqfd.file)
>>> +             goto err_fd;
>>> +
>>> +     ctx = eventfd_ctx_fileget(irqfd.file);
>>> +     if (IS_ERR(ctx)) {
>>> +             ret = PTR_ERR(ctx);
>>> +             goto err_ctx;
>>> +     }
>>> +
>>> +     virqfd->vq = vq;
>>> +     virqfd->ctx = ctx;
>>> +     spin_lock(&vq->irq_lock);
>>> +     if (vq->virqfd)
>>> +             virqfd_deactivate(virqfd);
>>> +     vq->virqfd = virqfd;
>>> +     spin_unlock(&vq->irq_lock);
>>> +
>>> +     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
>>> +     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
>>> +
>>> +     events = vfs_poll(irqfd.file, &virqfd->pt);
>>> +
>>> +     /*
>>> +      * Check if there was an event already pending on the eventfd
>>> +      * before we registered and trigger it as if we didn't miss it.
>>> +      */
>>> +     if (events & EPOLLIN)
>>> +             schedule_work(&virqfd->inject);
>>> +
>>> +     fdput(irqfd);
>>> +
>>> +     return 0;
>>> +err_ctx:
>>> +     fdput(irqfd);
>>> +err_fd:
>>> +     kfree(virqfd);
>>> +     return ret;
>>> +}
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             if (vq->virqfd) {
>>> +                     virqfd_deactivate(vq->virqfd);
>>> +                     vq->virqfd = NULL;
>>> +             }
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +     flush_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +int vduse_virqfd_init(void)
>>> +{
>>> +     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
>>> +                                             WQ_UNBOUND, 0);
>>> +     if (!vduse_irqfd_cleanup_wq)
>>> +             return -ENOMEM;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +void vduse_virqfd_exit(void)
>>> +{
>>> +     destroy_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq)
>>> +{
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->ready && vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx;
>>> +     struct vduse_virtqueue *vq;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +     if (IS_ERR(ctx))
>>> +             return PTR_ERR(ctx);
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             if (vq->kickfd) {
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +                     vq->kickfd = NULL;
>>> +             }
>>> +             spin_unlock(&vq->kick_lock);
>>> +     }
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
>>> new file mode 100644
>>> index 000000000000..14269ff27f47
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.h
>>> @@ -0,0 +1,48 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_EVENTFD_H
>>> +#define _VDUSE_EVENTFD_H
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "vduse.h"
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_virqfd {
>>> +     struct eventfd_ctx *ctx;
>>> +     struct vduse_virtqueue *vq;
>>> +     struct work_struct inject;
>>> +     struct work_struct shutdown;
>>> +     wait_queue_entry_t wait;
>>> +     poll_table pt;
>>> +};
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev);
>>> +
>>> +int vduse_virqfd_init(void);
>>> +
>>> +void vduse_virqfd_exit(void);
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq);
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev);
>>> +
>>> +#endif /* _VDUSE_EVENTFD_H */
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>> new file mode 100644
>>> index 000000000000..27022157abc6
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>> @@ -0,0 +1,442 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/wait.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/genalloc.h>
>>> +#include <linux/dma-mapping.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define IOVA_CHUNK_SHIFT 26
>>> +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
>>> +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
>>> +
>>> +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
>>> +
>>> +#define IOVA_ALLOC_ORDER 12
>>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>>> +
>>> +struct vduse_mmap_vma {
>>> +     struct vm_area_struct *vma;
>>> +     struct list_head list;
>>> +};
>>> +
>>> +static inline struct page *
>>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova)
>>> +{
>>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
>>> +
>>> +     return domain->chunks[index].bounce_pages[pgindex];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova, struct page *page)
>>> +{
>>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
>>> +
>>> +     domain->chunks[index].bounce_pages[pgindex] = page;
>>> +}
>>> +
>>> +static inline struct vduse_iova_map *
>>> +vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova)
>>> +{
>>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
>>> +
>>> +     return domain->chunks[index].iova_map[mapindex];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
>>> +                     unsigned long iova, struct vduse_iova_map *map)
>>> +{
>>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
>>> +
>>> +     domain->chunks[index].iova_map[mapindex] = map;
>>> +}
>>> +
>>> +static int
>>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova, size_t size)
>>> +{
>>> +     struct page *page;
>>> +     size_t walk_sz = 0;
>>> +     int frees = 0;
>>> +
>>> +     while (walk_sz < size) {
>>> +             page = vduse_domain_get_bounce_page(domain, iova);
>>> +             if (page) {
>>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
>>> +                     put_page(page);
>>> +                     frees++;
>>> +             }
>>> +             iova += PAGE_SIZE;
>>> +             walk_sz += PAGE_SIZE;
>>> +     }
>>> +
>>> +     return frees;
>>> +}
>>> +
>>> +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
>>> +                             struct vm_area_struct *vma)
>>> +{
>>> +     unsigned long size = vma->vm_end - vma->vm_start;
>>> +     struct vduse_mmap_vma *mmap_vma;
>>> +
>>> +     if (WARN_ON(size != domain->size))
>>> +             return -EINVAL;
>>> +
>>> +     mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>>> +     if (!mmap_vma)
>>> +             return -ENOMEM;
>>> +
>>> +     mmap_vma->vma = vma;
>>> +     mutex_lock(&domain->vma_lock);
>>> +     list_add(&mmap_vma->list, &domain->vma_list);
>>> +     mutex_unlock(&domain->vma_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
>>> +                             struct vm_area_struct *vma)
>>> +{
>>> +     struct vduse_mmap_vma *mmap_vma;
>>> +
>>> +     mutex_lock(&domain->vma_lock);
>>> +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
>>> +             if (mmap_vma->vma == vma) {
>>> +                     list_del(&mmap_vma->list);
>>> +                     kfree(mmap_vma);
>>> +                     break;
>>> +             }
>>> +     }
>>> +     mutex_unlock(&domain->vma_lock);
>>> +}
>>> +
>>> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova, unsigned long orig,
>>> +                             size_t size, enum dma_data_direction dir)
>>> +{
>>> +     struct vduse_iova_map *map;
>>> +     unsigned long last = iova + size;
>>> +
>>> +     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
>>> +     if (!map)
>>> +             return -ENOMEM;
>>> +
>>> +     map->iova = iova;
>>> +     map->orig = orig;
>>> +     map->size = size;
>>> +     map->dir = dir;
>>> +
>>> +     while (iova < last) {
>>> +             vduse_domain_set_iova_map(domain, iova, map);
>>> +             iova += IOVA_ALLOC_SIZE;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +struct vduse_iova_map *
>>> +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
>>> +                     unsigned long iova)
>>> +{
>>> +     return vduse_domain_get_iova_map(domain, iova);
>>> +}
>>> +
>>> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
>>> +                             struct vduse_iova_map *map)
>>> +{
>>> +     unsigned long iova = map->iova;
>>> +     unsigned long last = iova + map->size;
>>> +
>>> +     while (iova < last) {
>>> +             vduse_domain_set_iova_map(domain, iova, NULL);
>>> +             iova += IOVA_ALLOC_SIZE;
>>> +     }
>>> +}
>>> +
>>> +void vduse_domain_unmap(struct vduse_iova_domain *domain,
>>> +                     unsigned long iova, size_t size)
>>> +{
>>> +     struct vduse_mmap_vma *mmap_vma;
>>> +     unsigned long uaddr;
>>> +
>>> +     mutex_lock(&domain->vma_lock);
>>> +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
>>> +             mmap_read_lock(mmap_vma->vma->vm_mm);
>>> +             uaddr = iova + mmap_vma->vma->vm_start;
>>> +             zap_page_range(mmap_vma->vma, uaddr, size);
>>> +             mmap_read_unlock(mmap_vma->vma->vm_mm);
>>> +     }
>>> +     mutex_unlock(&domain->vma_lock);
>>> +}
>>> +
>>> +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
>>> +                     struct vm_area_struct *vma, unsigned long iova)
>>> +{
>>> +     unsigned long uaddr = iova + vma->vm_start;
>>> +     unsigned long start = iova & PAGE_MASK;
>>> +     unsigned long last = start + PAGE_SIZE - 1;
>>> +     unsigned long offset;
>>> +     struct vduse_iova_map *map;
>>> +     struct page *page = NULL;
>>> +
>>> +     map = vduse_domain_get_iova_map(domain, iova);
>>> +     if (map) {
>>> +             offset = last - map->iova;
>>> +             page = virt_to_page(map->orig + offset);
>>> +     }
>>> +
>>> +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
>>> +}
>>
>> So as we discussed before, we need to find a way to make vhost work. And
>> it's better to make vhost transparent to VDUSE. One idea is to implement
>> a shadow virtqueue here; that is, instead of trying to insert the pages
>> into VDUSE userspace, we use the shadow virtqueue to relay the descriptors
>> to userspace. With this, we don't need things like shmfd etc.
>>
> Good idea! The disadvantage is that performance will go down (one more
> thread switch of overhead, and the vhost-like kworker will become a
> bottleneck without multi-thread support).


Yes, the disadvantage is the performance. But it should be simpler (I 
guess) and we know it can succeed.



> I think I can try this in v3. And the
> MMU-based IOMMU implementation can be a future optimization in the
> virtio-vdpa case. What's your opinion?


I may be wrong, but I think we can first try what has been proposed here
and use the shadow virtqueue as a backup plan if we fail.


>
>>> +
>>> +void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>> +                     unsigned long iova, unsigned long orig,
>>> +                     size_t size, enum dma_data_direction dir)
>>> +{
>>> +     unsigned int offset = offset_in_page(iova);
>>> +
>>> +     while (size) {
>>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
>>> +             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
>>> +             void *addr;
>>> +
>>> +             if (p) {
>>> +                     addr = page_address(p) + offset;
>>> +                     if (dir == DMA_TO_DEVICE)
>>> +                             memcpy(addr, (void *)orig, copy_len);
>>> +                     else if (dir == DMA_FROM_DEVICE)
>>> +                             memcpy((void *)orig, addr, copy_len);
>>> +             }
>>
>> I think I'm missing something: for DMA_FROM_DEVICE, if p doesn't exist, how
>> is it expected to work? Or do we need to warn here in this case?
>>
> Yes, I think we need a WARN_ON here.


Ok.
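
For illustration only, here is a userspace model of the page-by-page copy
loop in vduse_domain_bounce() with the missing-page check discussed above.
It is a sketch, not the v3 code: the names model_bounce, model_dir and
MODEL_PAGE_SIZE are made up, the page table is a plain array, a 4 KiB page
size is assumed, and fprintf() stands in for WARN_ON(). It also assumes
that only the DMA_FROM_DEVICE case is worth flagging, since a missing page
in the DMA_TO_DEVICE case appears to be filled in later on page fault.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

enum model_dir { MODEL_TO_DEVICE, MODEL_FROM_DEVICE };

/*
 * pages[] stands in for the per-chunk bounce_pages table; a NULL slot
 * means no bounce page exists for that page-sized window.
 * Returns the number of FROM_DEVICE bytes that could not be copied back
 * because the bounce page was missing.
 */
static size_t model_bounce(unsigned char **pages, unsigned long iova,
			   unsigned char *orig, size_t size,
			   enum model_dir dir)
{
	size_t offset = iova % MODEL_PAGE_SIZE;
	size_t skipped = 0;

	while (size) {
		unsigned char *p = pages[iova / MODEL_PAGE_SIZE];
		size_t room = MODEL_PAGE_SIZE - offset;
		size_t copy_len = room < size ? room : size;

		if (p) {
			if (dir == MODEL_TO_DEVICE)
				memcpy(p + offset, orig, copy_len);
			else
				memcpy(orig, p + offset, copy_len);
		} else if (dir == MODEL_FROM_DEVICE) {
			/* userspace stand-in for the proposed WARN_ON */
			fprintf(stderr, "no bounce page for iova 0x%lx\n", iova);
			skipped += copy_len;
		}
		size -= copy_len;
		orig += copy_len;
		iova += copy_len;
		offset = 0;	/* later pages are copied from their start */
	}
	return skipped;
}
```

A copy that starts 6 bytes before a page boundary is split into a 6-byte
and a 2-byte step; if the second page has no bounce page, the model
reports 2 skipped bytes for the FROM_DEVICE direction.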


>
>
>>> +             size -= copy_len;
>>> +             orig += copy_len;
>>> +             iova += copy_len;
>>> +             offset = 0;
>>> +     }
>>> +}
>>> +
>>> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
>>> +                     struct vm_area_struct *vma, unsigned long iova)
>>> +{
>>> +     unsigned long uaddr = iova + vma->vm_start;
>>> +     unsigned long start = iova & PAGE_MASK;
>>> +     unsigned long offset = 0;
>>> +     bool found = false;
>>> +     struct vduse_iova_map *map;
>>> +     struct page *page;
>>> +
>>> +     mutex_lock(&domain->map_lock);
>>> +
>>> +     page = vduse_domain_get_bounce_page(domain, iova);
>>> +     if (page)
>>> +             goto unlock;
>>> +
>>> +     page = alloc_page(GFP_KERNEL);
>>> +     if (!page)
>>> +             goto unlock;
>>> +
>>> +     while (offset < PAGE_SIZE) {
>>> +             unsigned int src_offset = 0, dst_offset = 0;
>>> +             void *src, *dst;
>>> +             size_t copy_len;
>>> +
>>> +             map = vduse_domain_get_iova_map(domain, start + offset);
>>> +             if (!map) {
>>> +                     offset += IOVA_ALLOC_SIZE;
>>> +                     continue;
>>> +             }
>>> +
>>> +             found = true;
>>> +             offset += map->size;
>>> +             if (map->dir == DMA_FROM_DEVICE)
>>> +                     continue;
>>> +
>>> +             if (start > map->iova)
>>> +                     src_offset = start - map->iova;
>>> +             else
>>> +                     dst_offset = map->iova - start;
>>> +
>>> +             src = (void *)(map->orig + src_offset);
>>> +             dst = page_address(page) + dst_offset;
>>> +             copy_len = min_t(size_t, map->size - src_offset,
>>> +                             PAGE_SIZE - dst_offset);
>>> +             memcpy(dst, src, copy_len);
>>> +     }
>>> +     if (!found) {
>>> +             put_page(page);
>>> +             page = NULL;
>>> +     }
>>> +     vduse_domain_set_bounce_page(domain, iova, page);
>>> +unlock:
>>> +     mutex_unlock(&domain->map_lock);
>>> +
>>> +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
>>> +}
>>> +
>>> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova)
>>> +{
>>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +     struct vduse_iova_chunk *chunk = &domain->chunks[index];
>>> +
>>> +     return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
>>> +}
>>> +
>>> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
>>> +                                     size_t size, enum iova_map_type type)
>>> +{
>>> +     struct vduse_iova_chunk *chunk;
>>> +     unsigned long iova = 0;
>>> +     int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
>>> +     struct genpool_data_align data = { .align = align };
>>> +     int i;
>>> +
>>> +     for (i = 0; i < domain->chunk_num; i++) {
>>> +             chunk = &domain->chunks[i];
>>> +             if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
>>> +                     atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
>>> +
>>> +             if (atomic_read(&chunk->map_type) != type)
>>> +                     continue;
>>> +
>>> +             iova = gen_pool_alloc_algo(chunk->pool, size,
>>> +                                     gen_pool_first_fit_align, &data);
>>> +             if (iova)
>>> +                     break;
>>> +     }
>>> +
>>> +     return iova;
>>
>> I wonder why not just reuse the iova domain implementation in
>> drivers/iommu/iova.c
>>
> The iova domain in drivers/iommu/iova.c is only an iova allocator, a role
> filled by the genpool memory allocator in our case. The other
> parts of our iova domain are chunk management and iova_map management.
> We need different chunks to distinguish different dma mapping types:
> consistent mapping or streaming mapping. We can only use the
> bouncing mechanism in the streaming mapping case.


To differentiate dma mappings, you can use two iova domains with different
ranges. It looks simpler than the gen_pool. (AFAIK most IOMMU drivers
use the iova domain).

Thanks
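
As a reference for the chunk discussion, the lookup helpers quoted earlier
in this message split an IOVA into a chunk index, a bounce-page index and
an iova_map index. The sketch below reproduces that arithmetic in
userspace. The constants are copied from the patch; struct iova_index,
iova_split() and MODEL_PAGE_SHIFT (assumed to be 12, i.e. 4 KiB pages) are
illustrative names that do not appear in the series.

```c
#define IOVA_CHUNK_SHIFT 26
#define IOVA_CHUNK_SIZE (1UL << IOVA_CHUNK_SHIFT)
#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
#define IOVA_ALLOC_ORDER 12
#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical container for the three indices the helpers compute. */
struct iova_index {
	unsigned long chunk;	/* which 64 MiB chunk the IOVA falls in */
	unsigned long pgindex;	/* slot in that chunk's bounce_pages[] */
	unsigned long mapindex;	/* slot in that chunk's iova_map[] */
};

static struct iova_index iova_split(unsigned long iova)
{
	/* offset of the IOVA inside its chunk */
	unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
	struct iova_index idx;

	idx.chunk = iova >> IOVA_CHUNK_SHIFT;
	idx.pgindex = chunkoff >> MODEL_PAGE_SHIFT;
	idx.mapindex = chunkoff >> IOVA_ALLOC_ORDER;
	return idx;
}
```

With these values each chunk covers 64 MiB and holds 16384 page slots; when
the page size equals IOVA_ALLOC_SIZE, as assumed here, pgindex and mapindex
coincide.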




* Re: Re: [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
  2020-12-24  2:36       ` Jason Wang
@ 2020-12-24  7:24         ` Yongji Xie
  0 siblings, 0 replies; 55+ messages in thread
From: Yongji Xie @ 2020-12-24  7:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Thu, Dec 24, 2020 at 10:37 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/23 7:06 PM, Yongji Xie wrote:
> > On Wed, Dec 23, 2020 at 4:37 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> >>> This patch introduces a new method in the vdpa_config_ops to
> >>> support processing the raw vhost memory mapping message in the
> >>> vDPA device driver.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    drivers/vhost/vdpa.c | 5 ++++-
> >>>    include/linux/vdpa.h | 7 +++++++
> >>>    2 files changed, 11 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>> index 448be7875b6d..ccbb391e38be 100644
> >>> --- a/drivers/vhost/vdpa.c
> >>> +++ b/drivers/vhost/vdpa.c
> >>> @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
> >>>        if (r)
> >>>                return r;
> >>>
> >>> +     if (ops->process_iotlb_msg)
> >>> +             return ops->process_iotlb_msg(vdpa, msg);
> >>> +
> >>>        switch (msg->type) {
> >>>        case VHOST_IOTLB_UPDATE:
> >>>                r = vhost_vdpa_process_iotlb_update(v, msg);
> >>> @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
> >>>        int ret;
> >>>
> >>>        /* Device want to do DMA by itself */
> >>> -     if (ops->set_map || ops->dma_map)
> >>> +     if (ops->set_map || ops->dma_map || ops->process_iotlb_msg)
> >>>                return 0;
> >>>
> >>>        bus = dma_dev->bus;
> >>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> >>> index 656fe264234e..7bccedf22f4b 100644
> >>> --- a/include/linux/vdpa.h
> >>> +++ b/include/linux/vdpa.h
> >>> @@ -5,6 +5,7 @@
> >>>    #include <linux/kernel.h>
> >>>    #include <linux/device.h>
> >>>    #include <linux/interrupt.h>
> >>> +#include <linux/vhost_types.h>
> >>>    #include <linux/vhost_iotlb.h>
> >>>    #include <net/genetlink.h>
> >>>
> >>> @@ -172,6 +173,10 @@ struct vdpa_iova_range {
> >>>     *                          @vdev: vdpa device
> >>>     *                          Returns the iova range supported by
> >>>     *                          the device.
> >>> + * @process_iotlb_msg:               Process vhost memory mapping message (optional)
> >>> + *                           Only used for VDUSE device now
> >>> + *                           @vdev: vdpa device
> >>> + *                           @msg: vhost memory mapping message
> >>>     * @set_map:                        Set device memory mapping (optional)
> >>>     *                          Needed for device that using device
> >>>     *                          specific DMA translation (on-chip IOMMU)
> >>> @@ -240,6 +245,8 @@ struct vdpa_config_ops {
> >>>        struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
> >>>
> >>>        /* DMA ops */
> >>> +     int (*process_iotlb_msg)(struct vdpa_device *vdev,
> >>> +                              struct vhost_iotlb_msg *msg);
> >>>        int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
> >>>        int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
> >>>                       u64 pa, u32 perm);
> >>
> >> Is there any reason that it can't be done via dma_map/dma_unmap or set_map?
> >>
> > To get the shmfd, we need the vma rather than the physical address. And
> > it's not necessary to pin the user pages in the VDUSE case.
>
>
> Right, actually, vhost-vDPA is planning to support shared virtual
> address space.
>
> So let's try to reuse the existing config ops. How about just introducing
> an attribute on the vdpa device that tells the bus it can do
> shared virtual memory? Then, when the device is probed by vhost-vDPA, user
> pages won't be pinned and we will do VA->VA mapping as IOVA->PA mapping
> in the vhost IOTLB and the config ops. The vhost IOTLB needs to be extended
> to accept an opaque pointer to store the file. And the file is passed via
> the config ops as well.
>

OK, I see. Will try it in v3.

Thanks,
Yongji
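
For reference, the dispatch that the quoted hunk adds to
vhost_vdpa_process_iotlb_msg() follows the usual optional-callback
pattern: if the device supplies the hook, the raw message goes to it,
otherwise the bus driver falls back to its own handling. The sketch below
is a userspace model of that pattern only. All model_* names, the message
type values and the return codes are invented for illustration and are not
the kernel definitions.

```c
#include <stddef.h>

/* Hypothetical message types; not the VHOST_IOTLB_* values. */
enum model_msg_type { MODEL_IOTLB_UPDATE = 1, MODEL_IOTLB_INVALIDATE = 2 };

struct model_msg {
	int type;
};

/* Stand-in for the config ops: the hook is optional and may be NULL. */
struct model_ops {
	int (*process_iotlb_msg)(const struct model_msg *msg);
};

/* Example device hook: tags the result so the caller can tell it ran. */
static int model_raw_hook(const struct model_msg *msg)
{
	return 100 + msg->type;
}

static int model_process(const struct model_ops *ops,
			 const struct model_msg *msg)
{
	/* A device that wants the raw message short-circuits the default. */
	if (ops->process_iotlb_msg)
		return ops->process_iotlb_msg(msg);

	/* Default path taken when the hook is absent. */
	switch (msg->type) {
	case MODEL_IOTLB_UPDATE:
		return 1;
	case MODEL_IOTLB_INVALIDATE:
		return 2;
	default:
		return -1;
	}
}
```

The review above argues for reusing dma_map/dma_unmap or set_map instead
of adding such a hook, so this models the v2 behaviour under discussion
rather than the direction agreed for v3.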



* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-24  2:41       ` Jason Wang
@ 2020-12-24  7:37         ` Yongji Xie
  2020-12-25  2:37           ` Yongji Xie
  2020-12-25  6:57           ` Jason Wang
  0 siblings, 2 replies; 55+ messages in thread
From: Yongji Xie @ 2020-12-24  7:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/23 8:14 PM, Yongji Xie wrote:
> > On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> >>> To support vhost-vdpa bus driver, we need a way to share the
> >>> vhost-vdpa backend process's memory with the userspace VDUSE process.
> >>>
> >>> This patch tries to make use of the vhost iotlb message to achieve
> >>> that. We will get the shm file from the iotlb message and pass it
> >>> to the userspace VDUSE process.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    Documentation/driver-api/vduse.rst |  15 +++-
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
> >>>    include/uapi/linux/vduse.h         |  11 +++
> >>>    3 files changed, 171 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> >>> index 623f7b040ccf..48e4b1ba353f 100644
> >>> --- a/Documentation/driver-api/vduse.rst
> >>> +++ b/Documentation/driver-api/vduse.rst
> >>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
> >>>
> >>>    - VDUSE_GET_CONFIG: Read from device specific configuration space
> >>>
> >>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
> >>> +
> >>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
> >>> +
> >>>    Please see include/linux/vdpa.h for details.
> >>>
> >>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> >>> +The data path of userspace vDPA device is implemented in different ways
> >>> +depending on the vdpa bus to which it is attached.
> >>> +
> >>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> >>>    driver which supports mapping the kernel dma buffer to a userspace iova
> >>>    region dynamically. The userspace iova region can be created by passing
> >>>    the userspace vDPA device fd to mmap(2).
> >>>
> >>> +In vhost-vdpa case, the dma buffer is reside in a userspace memory region
> >>> +which will be shared to the VDUSE userspace processs via the file
> >>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
> >>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
> >>> +in this message.
> >>> +
> >>>    Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >>>    receive virtqueue kicks in userspace. The following ioctls on the userspace
> >>>    vDPA device fd are provided to support that:
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> index b974333ed4e9..d24aaacb6008 100644
> >>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> @@ -34,6 +34,7 @@
> >>>
> >>>    struct vduse_dev_msg {
> >>>        struct vduse_dev_request req;
> >>> +     struct file *iotlb_file;
> >>>        struct vduse_dev_response resp;
> >>>        struct list_head list;
> >>>        wait_queue_head_t waitq;
> >>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >>>        return ret;
> >>>    }
> >>>
> >>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
> >>> +                             u64 offset, u64 iova, u64 size, u8 perm)
> >>> +{
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int ret;
> >>> +
> >>> +     if (!size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> >>> +     msg->req.size = sizeof(struct vduse_iotlb);
> >>> +     msg->req.iotlb.offset = offset;
> >>> +     msg->req.iotlb.iova = iova;
> >>> +     msg->req.iotlb.size = size;
> >>> +     msg->req.iotlb.perm = perm;
> >>> +     msg->req.iotlb.fd = -1;
> >>> +     msg->iotlb_file = get_file(file);
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>
> >> My feeling is that we should provide a consistent API for the userspace
> >> device to use.
> >>
> >> E.g. we'd better carry the IOTLB message for both virtio/vhost drivers.
> >>
> >> It looks to me that for virtio drivers we can still use the UPDATE_IOTLB
> >> message by using the VDUSE file as msg->iotlb_file here.
> >>
> > It's OK for me. One problem is when to transfer the UPDATE_IOTLB
> > message in virtio cases.
>
>
> Instead of generating IOTLB messages for userspace, how about recording
> the mappings (which is a common case for devices that have an on-chip
> IOMMU, e.g. mlx5e and the vdpa simulator)? Then we can introduce an ioctl
> for userspace to query them.
>

If so, the IOTLB UPDATE is actually triggered by an ioctl, but
IOTLB_INVALIDATE is triggered by the message. Isn't that a little odd? Or
how about triggering it when userspace calls mmap() on the device fd?

Thanks,
Yongji



* Re: Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-12-24  3:01       ` Jason Wang
@ 2020-12-24  8:34         ` Yongji Xie
  2020-12-25  6:59           ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-24  8:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Thu, Dec 24, 2020 at 11:01 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/23 10:17 PM, Yongji Xie wrote:
> > On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> >>> This VDUSE driver enables implementing vDPA devices in userspace.
> >>> Both control path and data path of vDPA devices will be able to
> >>> be handled in userspace.
> >>>
> >>> In the control path, the VDUSE driver will make use of a message
> >>> mechanism to forward the config operations from the vdpa bus driver
> >>> to userspace. Userspace can use read()/write() to receive/reply to
> >>> those control messages.
> >>>
> >>> In the data path, the VDUSE driver implements an MMU-based on-chip
> >>> IOMMU driver which supports mapping the kernel dma buffer to a
> >>> userspace iova region dynamically. Userspace can access that
> >>> iova region via mmap(). Besides, the eventfd mechanism is used to
> >>> trigger interrupt callbacks and receive virtqueue kicks in userspace.
> >>>
> >>> Now we only support virtio-vdpa bus driver with this patch applied.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    Documentation/driver-api/vduse.rst                 |   74 ++
> >>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >>>    drivers/vdpa/Kconfig                               |    8 +
> >>>    drivers/vdpa/Makefile                              |    1 +
> >>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >>>    drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
> >>>    drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
> >>>    drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
> >>>    drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
> >>>    drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
> >>>    include/uapi/linux/vdpa.h                          |    1 +
> >>>    include/uapi/linux/vduse.h                         |   99 ++
> >>>    13 files changed, 2173 insertions(+)
> >>>    create mode 100644 Documentation/driver-api/vduse.rst
> >>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
> >>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >>>    create mode 100644 include/uapi/linux/vduse.h
> >>>
> >>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> >>> new file mode 100644
> >>> index 000000000000..da9b3040f20a
> >>> --- /dev/null
> >>> +++ b/Documentation/driver-api/vduse.rst
> >>> @@ -0,0 +1,74 @@
> >>> +==================================
> >>> +VDUSE - "vDPA Device in Userspace"
> >>> +==================================
> >>> +
> >>> +vDPA (virtio data path acceleration) device is a device that uses a
> >>> +datapath which complies with the virtio specifications with vendor
> >>> +specific control path. vDPA devices can be both physically located on
> >>> +the hardware or emulated by software. VDUSE is a framework that makes it
> >>> +possible to implement software-emulated vDPA devices in userspace.
> >>> +
> >>> +How VDUSE works
> >>> +------------
> >>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> >>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> >>> +to the new resources will be returned, which can be used to implement the
> >>> +userspace vDPA device's control path and data path.
> >>> +
> >>> +To implement control path, the read/write operations to the file descriptor
> >>> +will be used to receive/reply the control messages from/to VDUSE driver.
> >>> +Those control messages are based on the vdpa_config_ops which defines a
> >>> +unified interface to control different types of vDPA device.
> >>> +
> >>> +The following types of messages are provided by the VDUSE framework now:
> >>> +
> >>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
> >>> +
> >>> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> >>> +
> >>> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> >>> +
> >>> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> >>> +
> >>> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> >>> +
> >>> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> >>> +
> >>> +- VDUSE_SET_STATUS: Set the device status
> >>> +
> >>> +- VDUSE_GET_STATUS: Get the device status
> >>> +
> >>> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> >>> +
> >>> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> >>> +
> >>> +Please see include/linux/vdpa.h for details.
> >>> +
> >>> +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> >>> +driver which supports mapping the kernel dma buffer to a userspace iova
> >>> +region dynamically. The userspace iova region can be created by passing
> >>> +the userspace vDPA device fd to mmap(2).
> >>> +
> >>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
> >>> +vDPA device fd are provided to support that:
> >>> +
> >>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> >>> +  by VDUSE driver to notify userspace to consume the vring.
> >>> +
> >>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> >>> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
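
For readers unfamiliar with eventfd counters, here is a minimal userspace
sketch of the signal/consume semantics that the kickfd and irqfd rely on.
It is not taken from this patch: it only uses eventfd(2), read(2) and
write(2), and the helper names sketch_signal() and sketch_drain() are made
up for illustration. How the fd is passed to the VDUSE ioctls is
deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: add 1 to the eventfd counter, waking any poller. */
static int sketch_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical helper: read and reset the counter. Returns how many
 * signals were coalesced since the last drain, or 0 if nothing was
 * pending (the fd is assumed to be opened with EFD_NONBLOCK).
 */
static uint64_t sketch_drain(int efd)
{
	uint64_t cnt = 0;

	if (read(efd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt))
		return 0;
	return cnt;
}
```

Note that several signals collapse into a single readable event, so a
consumer woken by a kick should keep processing the vring until it is
empty rather than assume one request per wakeup.
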
> >>> +
> >>> +MMU-based IOMMU Driver
> >>> +----------------------
> >>> +The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
> >>> +IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
> >>> +for the DMA transfer so that the userspace process is able to use its virtual
> >>> +address to access the dma buffer in kernel.
> >>> +
> >>> +To avoid security issues, a bounce-buffering mechanism is introduced to
> >>> +prevent userspace from directly accessing the original buffer, which may contain
> >>> +other kernel data. During mapping and unmapping, the driver copies the data
> >>> +between the original buffer and the bounce buffer, depending on the direction of
> >>> +the transfer. The bounce-buffer addresses are mapped into the user address
> >>> +space instead of the original ones.
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index a4c75a28c839..71722e6f8f23 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >>>    '|'   00-7F  linux/media.h
> >>>    0x80  00-1F  linux/fb.h
> >>> +0x81  00-1F  linux/vduse.h
> >>>    0x89  00-06  arch/x86/include/asm/sockios.h
> >>>    0x89  0B-DF  linux/sockios.h
> >>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> >>> index 4be7be39be26..211cc449cbd3 100644
> >>> --- a/drivers/vdpa/Kconfig
> >>> +++ b/drivers/vdpa/Kconfig
> >>> @@ -21,6 +21,14 @@ config VDPA_SIM
> >>>          to RX. This device is used for testing, prototyping and
> >>>          development of vDPA.
> >>>
> >>> +config VDPA_USER
> >>> +     tristate "VDUSE (vDPA Device in Userspace) support"
> >>> +     depends on EVENTFD && MMU && HAS_DMA
> >>> +     default n
> >>
> >> The "default n" is not necessary.
> >>
> > OK.
> >>> +     help
> >>> +       With VDUSE it is possible to emulate a vDPA Device
> >>> +       in a userspace program.
> >>> +
> >>>    config IFCVF
> >>>        tristate "Intel IFC VF vDPA driver"
> >>>        depends on PCI_MSI
> >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> >>> index d160e9b63a66..66e97778ad03 100644
> >>> --- a/drivers/vdpa/Makefile
> >>> +++ b/drivers/vdpa/Makefile
> >>> @@ -1,5 +1,6 @@
> >>>    # SPDX-License-Identifier: GPL-2.0
> >>>    obj-$(CONFIG_VDPA) += vdpa.o
> >>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >>>    obj-$(CONFIG_IFCVF)    += ifcvf/
> >>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> >>> new file mode 100644
> >>> index 000000000000..b7645e36992b
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/Makefile
> >>> @@ -0,0 +1,5 @@
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
> >>
> >> Do we really need eventfd.o here, considering we've selected it?
> >>
> > Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"?
>
>
> My bad, I confused this with the common eventfd. So the code is fine here.
>
>
> >
> >>> +
> >>> +obj-$(CONFIG_VDPA_USER) += vduse.o
> >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> >>> new file mode 100644
> >>> index 000000000000..dbffddb08908
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/eventfd.c
> >>> @@ -0,0 +1,221 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * Eventfd support for VDUSE
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/file.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +
> >>> +#include "eventfd.h"
> >>> +
> >>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> >>> +
> >>> +static void vduse_virqfd_shutdown(struct work_struct *work)
> >>> +{
> >>> +     u64 cnt;
> >>> +     struct vduse_virqfd *virqfd = container_of(work,
> >>> +                                     struct vduse_virqfd, shutdown);
> >>> +
> >>> +     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
> >>> +     flush_work(&virqfd->inject);
> >>> +     eventfd_ctx_put(virqfd->ctx);
> >>> +     kfree(virqfd);
> >>> +}
> >>> +
> >>> +static void vduse_virqfd_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(work,
> >>> +                                     struct vduse_virqfd, inject);
> >>> +     struct vduse_virtqueue *vq = virqfd->vq;
> >>> +
> >>> +     spin_lock_irq(&vq->irq_lock);
> >>> +     if (vq->ready && vq->cb)
> >>> +             vq->cb(vq->private);
> >>> +     spin_unlock_irq(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
> >>> +{
> >>> +     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
> >>> +}
> >>> +
> >>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> >>> +                             int sync, void *key)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
> >>> +     struct vduse_virtqueue *vq = virqfd->vq;
> >>> +
> >>> +     __poll_t flags = key_to_poll(key);
> >>> +
> >>> +     if (flags & EPOLLIN)
> >>> +             schedule_work(&virqfd->inject);
> >>> +
> >>> +     if (flags & EPOLLHUP) {
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             if (vq->virqfd == virqfd) {
> >>> +                     vq->virqfd = NULL;
> >>> +                     virqfd_deactivate(virqfd);
> >>> +             }
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
> >>> +                     wait_queue_head_t *wqh, poll_table *pt)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
> >>> +
> >>> +     add_wait_queue(wqh, &virqfd->wait);
> >>> +}
> >>> +
> >>> +int vduse_virqfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd;
> >>> +     struct fd irqfd;
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     __poll_t events;
> >>> +     int ret;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
> >>> +     if (!virqfd)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
> >>> +     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
> >>
> >> Is there any reason a workqueue is a must here?
> >>
> > Mainly for performance: it lets the push() and pop() on the used
> > vring run asynchronously.
>
>
> I see.
>
>
> >
> >>> +
> >>> +     ret = -EBADF;
> >>> +     irqfd = fdget(eventfd->fd);
> >>> +     if (!irqfd.file)
> >>> +             goto err_fd;
> >>> +
> >>> +     ctx = eventfd_ctx_fileget(irqfd.file);
> >>> +     if (IS_ERR(ctx)) {
> >>> +             ret = PTR_ERR(ctx);
> >>> +             goto err_ctx;
> >>> +     }
> >>> +
> >>> +     virqfd->vq = vq;
> >>> +     virqfd->ctx = ctx;
> >>> +     spin_lock(&vq->irq_lock);
> >>> +     if (vq->virqfd)
> >>> +             virqfd_deactivate(vq->virqfd);
> >>> +     vq->virqfd = virqfd;
> >>> +     spin_unlock(&vq->irq_lock);
> >>> +
> >>> +     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
> >>> +     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
> >>> +
> >>> +     events = vfs_poll(irqfd.file, &virqfd->pt);
> >>> +
> >>> +     /*
> >>> +      * Check if there was an event already pending on the eventfd
> >>> +      * before we registered and trigger it as if we didn't miss it.
> >>> +      */
> >>> +     if (events & EPOLLIN)
> >>> +             schedule_work(&virqfd->inject);
> >>> +
> >>> +     fdput(irqfd);
> >>> +
> >>> +     return 0;
> >>> +err_ctx:
> >>> +     fdput(irqfd);
> >>> +err_fd:
> >>> +     kfree(virqfd);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +void vduse_virqfd_release(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             if (vq->virqfd) {
> >>> +                     virqfd_deactivate(vq->virqfd);
> >>> +                     vq->virqfd = NULL;
> >>> +             }
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +     flush_workqueue(vduse_irqfd_cleanup_wq);
> >>> +}
> >>> +
> >>> +int vduse_virqfd_init(void)
> >>> +{
> >>> +     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
> >>> +                                             WQ_UNBOUND, 0);
> >>> +     if (!vduse_irqfd_cleanup_wq)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_virqfd_exit(void)
> >>> +{
> >>> +     destroy_workqueue(vduse_irqfd_cleanup_wq);
> >>> +}
> >>> +
> >>> +void vduse_vq_kick(struct vduse_virtqueue *vq)
> >>> +{
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->ready && vq->kickfd)
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +}
> >>> +
> >>> +int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     ctx = eventfd_ctx_fdget(eventfd->fd);
> >>> +     if (IS_ERR(ctx))
> >>> +             return PTR_ERR(ctx);
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->kickfd)
> >>> +             eventfd_ctx_put(vq->kickfd);
> >>> +     vq->kickfd = ctx;
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_kickfd_release(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->kick_lock);
> >>> +             if (vq->kickfd) {
> >>> +                     eventfd_ctx_put(vq->kickfd);
> >>> +                     vq->kickfd = NULL;
> >>> +             }
> >>> +             spin_unlock(&vq->kick_lock);
> >>> +     }
> >>> +}
> >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
> >>> new file mode 100644
> >>> index 000000000000..14269ff27f47
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/eventfd.h
> >>> @@ -0,0 +1,48 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>> +/*
> >>> + * Eventfd support for VDUSE
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#ifndef _VDUSE_EVENTFD_H
> >>> +#define _VDUSE_EVENTFD_H
> >>> +
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/wait.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +
> >>> +#include "vduse.h"
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_virqfd {
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     struct work_struct inject;
> >>> +     struct work_struct shutdown;
> >>> +     wait_queue_entry_t wait;
> >>> +     poll_table pt;
> >>> +};
> >>> +
> >>> +int vduse_virqfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd);
> >>> +
> >>> +void vduse_virqfd_release(struct vduse_dev *dev);
> >>> +
> >>> +int vduse_virqfd_init(void);
> >>> +
> >>> +void vduse_virqfd_exit(void);
> >>> +
> >>> +void vduse_vq_kick(struct vduse_virtqueue *vq);
> >>> +
> >>> +int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd);
> >>> +
> >>> +void vduse_kickfd_release(struct vduse_dev *dev);
> >>> +
> >>> +#endif /* _VDUSE_EVENTFD_H */
> >>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> new file mode 100644
> >>> index 000000000000..27022157abc6
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> @@ -0,0 +1,442 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * MMU-based IOMMU implementation
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/wait.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/genalloc.h>
> >>> +#include <linux/dma-mapping.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define IOVA_CHUNK_SHIFT 26
> >>> +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
> >>> +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
> >>> +
> >>> +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
> >>> +
> >>> +#define IOVA_ALLOC_ORDER 12
> >>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> >>> +
> >>> +struct vduse_mmap_vma {
> >>> +     struct vm_area_struct *vma;
> >>> +     struct list_head list;
> >>> +};
> >>> +
> >>> +static inline struct page *
> >>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> >>> +
> >>> +     return domain->chunks[index].bounce_pages[pgindex];
> >>> +}
> >>> +
> >>> +static inline void
> >>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, struct page *page)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> >>> +
> >>> +     domain->chunks[index].bounce_pages[pgindex] = page;
> >>> +}
> >>> +
> >>> +static inline struct vduse_iova_map *
> >>> +vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> >>> +
> >>> +     return domain->chunks[index].iova_map[mapindex];
> >>> +}
> >>> +
> >>> +static inline void
> >>> +vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova, struct vduse_iova_map *map)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> >>> +
> >>> +     domain->chunks[index].iova_map[mapindex] = map;
> >>> +}
> >>> +
> >>> +static int
> >>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, size_t size)
> >>> +{
> >>> +     struct page *page;
> >>> +     size_t walk_sz = 0;
> >>> +     int frees = 0;
> >>> +
> >>> +     while (walk_sz < size) {
> >>> +             page = vduse_domain_get_bounce_page(domain, iova);
> >>> +             if (page) {
> >>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
> >>> +                     put_page(page);
> >>> +                     frees++;
> >>> +             }
> >>> +             iova += PAGE_SIZE;
> >>> +             walk_sz += PAGE_SIZE;
> >>> +     }
> >>> +
> >>> +     return frees;
> >>> +}
> >>> +
> >>> +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
> >>> +                             struct vm_area_struct *vma)
> >>> +{
> >>> +     unsigned long size = vma->vm_end - vma->vm_start;
> >>> +     struct vduse_mmap_vma *mmap_vma;
> >>> +
> >>> +     if (WARN_ON(size != domain->size))
> >>> +             return -EINVAL;
> >>> +
> >>> +     mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> >>> +     if (!mmap_vma)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     mmap_vma->vma = vma;
> >>> +     mutex_lock(&domain->vma_lock);
> >>> +     list_add(&mmap_vma->list, &domain->vma_list);
> >>> +     mutex_unlock(&domain->vma_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
> >>> +                             struct vm_area_struct *vma)
> >>> +{
> >>> +     struct vduse_mmap_vma *mmap_vma;
> >>> +
> >>> +     mutex_lock(&domain->vma_lock);
> >>> +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> >>> +             if (mmap_vma->vma == vma) {
> >>> +                     list_del(&mmap_vma->list);
> >>> +                     kfree(mmap_vma);
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +     mutex_unlock(&domain->vma_lock);
> >>> +}
> >>> +
> >>> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, unsigned long orig,
> >>> +                             size_t size, enum dma_data_direction dir)
> >>> +{
> >>> +     struct vduse_iova_map *map;
> >>> +     unsigned long last = iova + size;
> >>> +
> >>> +     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
> >>> +     if (!map)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     map->iova = iova;
> >>> +     map->orig = orig;
> >>> +     map->size = size;
> >>> +     map->dir = dir;
> >>> +
> >>> +     while (iova < last) {
> >>> +             vduse_domain_set_iova_map(domain, iova, map);
> >>> +             iova += IOVA_ALLOC_SIZE;
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +struct vduse_iova_map *
> >>> +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova)
> >>> +{
> >>> +     return vduse_domain_get_iova_map(domain, iova);
> >>> +}
> >>> +
> >>> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
> >>> +                             struct vduse_iova_map *map)
> >>> +{
> >>> +     unsigned long iova = map->iova;
> >>> +     unsigned long last = iova + map->size;
> >>> +
> >>> +     while (iova < last) {
> >>> +             vduse_domain_set_iova_map(domain, iova, NULL);
> >>> +             iova += IOVA_ALLOC_SIZE;
> >>> +     }
> >>> +}
> >>> +
> >>> +void vduse_domain_unmap(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova, size_t size)
> >>> +{
> >>> +     struct vduse_mmap_vma *mmap_vma;
> >>> +     unsigned long uaddr;
> >>> +
> >>> +     mutex_lock(&domain->vma_lock);
> >>> +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> >>> +             mmap_read_lock(mmap_vma->vma->vm_mm);
> >>> +             uaddr = iova + mmap_vma->vma->vm_start;
> >>> +             zap_page_range(mmap_vma->vma, uaddr, size);
> >>> +             mmap_read_unlock(mmap_vma->vma->vm_mm);
> >>> +     }
> >>> +     mutex_unlock(&domain->vma_lock);
> >>> +}
> >>> +
> >>> +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
> >>> +                     struct vm_area_struct *vma, unsigned long iova)
> >>> +{
> >>> +     unsigned long uaddr = iova + vma->vm_start;
> >>> +     unsigned long start = iova & PAGE_MASK;
> >>> +     unsigned long last = start + PAGE_SIZE - 1;
> >>> +     unsigned long offset;
> >>> +     struct vduse_iova_map *map;
> >>> +     struct page *page = NULL;
> >>> +
> >>> +     map = vduse_domain_get_iova_map(domain, iova);
> >>> +     if (map) {
> >>> +             offset = last - map->iova;
> >>> +             page = virt_to_page(map->orig + offset);
> >>> +     }
> >>> +
> >>> +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> >>> +}
> >>
> >> So as we discussed before, we need to find way to make vhost work. And
> >> it's better to make vhost transparent to VDUSE. One idea is to implement
> >> shadow virtqueue here, that is, instead of trying to insert the pages to
> >> VDUSE userspace, we use the shadow virtqueue to relay the descriptors to
> >> userspace. With this, we don't need stuffs like shmfd etc.
> >>
> > Good idea! The disadvantage is that performance will go down (one more
> > thread switch, and the vhost-like kworker will become a bottleneck
> > without multi-thread support).
>
>
> Yes, the disadvantage is the performance. But it should be simpler (I
> guess) and we know it can succeed.
>

Yes, another advantage is that we can support the VM using anonymous memory.

>
>
> > I think I can try this in v3. And the
> > MMU-based IOMMU implementation can be a future optimization in the
> > virtio-vdpa case. What's your opinion?
>
>
> Maybe I'm wrong, but I think we can first try what has been proposed here
> and use the shadow virtqueue as a backup plan if we fail.
>

OK, I will continue to work on this proposal.

>
> >
> >>> +
> >>> +void vduse_domain_bounce(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova, unsigned long orig,
> >>> +                     size_t size, enum dma_data_direction dir)
> >>> +{
> >>> +     unsigned int offset = offset_in_page(iova);
> >>> +
> >>> +     while (size) {
> >>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
> >>> +             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
> >>> +             void *addr;
> >>> +
> >>> +             if (p) {
> >>> +                     addr = page_address(p) + offset;
> >>> +                     if (dir == DMA_TO_DEVICE)
> >>> +                             memcpy(addr, (void *)orig, copy_len);
> >>> +                     else if (dir == DMA_FROM_DEVICE)
> >>> +                             memcpy((void *)orig, addr, copy_len);
> >>> +             }
> >>
> >> I think I'm missing something: for DMA_FROM_DEVICE, if p doesn't exist, how
> >> is it expected to work? Or do we need to warn here in this case?
> >>
> > Yes, I think we need a WARN_ON here.
>
>
> Ok.
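
To make the missing-page case concrete, here is a small userspace model of
the bounce copy loop. It is only a sketch under my own assumptions, not the
driver code: sketch_bounce(), enum sketch_dir and the flat pages[] array
are hypothetical stand-ins for vduse_domain_bounce() and the per-chunk
bounce_pages table, and the return value stands in for where a WARN_ON
would fire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL

enum sketch_dir { SKETCH_TO_DEVICE, SKETCH_FROM_DEVICE };

/*
 * Copy between the original buffer and the bounce pages, one page at a
 * time. pages[i] backs IOVA range [i * PAGE_SIZE, (i + 1) * PAGE_SIZE)
 * and may be NULL if it was never populated. Returns the number of
 * per-page steps that were skipped because the bounce page was missing.
 */
static int sketch_bounce(unsigned char **pages, size_t npages,
			 uint64_t iova, unsigned char *orig, size_t size,
			 enum sketch_dir dir)
{
	size_t offset = iova % SKETCH_PAGE_SIZE;
	int missing = 0;

	while (size) {
		size_t idx = iova / SKETCH_PAGE_SIZE;
		size_t copy_len = SKETCH_PAGE_SIZE - offset;
		unsigned char *p;

		if (copy_len > size)
			copy_len = size;

		p = idx < npages ? pages[idx] : NULL;
		if (!p)
			missing++;	/* the kernel version would warn here */
		else if (dir == SKETCH_TO_DEVICE)
			memcpy(p + offset, orig, copy_len);
		else
			memcpy(orig, p + offset, copy_len);

		size -= copy_len;
		orig += copy_len;
		iova += copy_len;
		offset = 0;	/* only the first page can start mid-page */
	}

	return missing;
}
```

With this shape, a DMA_FROM_DEVICE unmap over a page that userspace never
faulted in reports a non-zero count instead of silently leaving the
original buffer untouched.
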
>
>
> >
> >
> >>> +             size -= copy_len;
> >>> +             orig += copy_len;
> >>> +             iova += copy_len;
> >>> +             offset = 0;
> >>> +     }
> >>> +}
> >>> +
> >>> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
> >>> +                     struct vm_area_struct *vma, unsigned long iova)
> >>> +{
> >>> +     unsigned long uaddr = iova + vma->vm_start;
> >>> +     unsigned long start = iova & PAGE_MASK;
> >>> +     unsigned long offset = 0;
> >>> +     bool found = false;
> >>> +     struct vduse_iova_map *map;
> >>> +     struct page *page;
> >>> +
> >>> +     mutex_lock(&domain->map_lock);
> >>> +
> >>> +     page = vduse_domain_get_bounce_page(domain, iova);
> >>> +     if (page)
> >>> +             goto unlock;
> >>> +
> >>> +     page = alloc_page(GFP_KERNEL);
> >>> +     if (!page)
> >>> +             goto unlock;
> >>> +
> >>> +     while (offset < PAGE_SIZE) {
> >>> +             unsigned int src_offset = 0, dst_offset = 0;
> >>> +             void *src, *dst;
> >>> +             size_t copy_len;
> >>> +
> >>> +             map = vduse_domain_get_iova_map(domain, start + offset);
> >>> +             if (!map) {
> >>> +                     offset += IOVA_ALLOC_SIZE;
> >>> +                     continue;
> >>> +             }
> >>> +
> >>> +             found = true;
> >>> +             offset += map->size;
> >>> +             if (map->dir == DMA_FROM_DEVICE)
> >>> +                     continue;
> >>> +
> >>> +             if (start > map->iova)
> >>> +                     src_offset = start - map->iova;
> >>> +             else
> >>> +                     dst_offset = map->iova - start;
> >>> +
> >>> +             src = (void *)(map->orig + src_offset);
> >>> +             dst = page_address(page) + dst_offset;
> >>> +             copy_len = min_t(size_t, map->size - src_offset,
> >>> +                             PAGE_SIZE - dst_offset);
> >>> +             memcpy(dst, src, copy_len);
> >>> +     }
> >>> +     if (!found) {
> >>> +             put_page(page);
> >>> +             page = NULL;
> >>> +     }
> >>> +     vduse_domain_set_bounce_page(domain, iova, page);
> >>> +unlock:
> >>> +     mutex_unlock(&domain->map_lock);
> >>> +
> >>> +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> >>> +}
> >>> +
> >>> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     struct vduse_iova_chunk *chunk = &domain->chunks[index];
> >>> +
> >>> +     return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
> >>> +}
> >>> +
> >>> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
> >>> +                                     size_t size, enum iova_map_type type)
> >>> +{
> >>> +     struct vduse_iova_chunk *chunk;
> >>> +     unsigned long iova = 0;
> >>> +     int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
> >>> +     struct genpool_data_align data = { .align = align };
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < domain->chunk_num; i++) {
> >>> +             chunk = &domain->chunks[i];
> >>> +             if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
> >>> +                     atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
> >>> +
> >>> +             if (atomic_read(&chunk->map_type) != type)
> >>> +                     continue;
> >>> +
> >>> +             iova = gen_pool_alloc_algo(chunk->pool, size,
> >>> +                                     gen_pool_first_fit_align, &data);
> >>> +             if (iova)
> >>> +                     break;
> >>> +     }
> >>> +
> >>> +     return iova;
> >>
> >> I wonder why not just reuse the iova domain implemented in
> >> drivers/iommu/iova.c
> >>
> > The iova domain in drivers/iommu/iova.c is only an iova allocator, a role
> > that is filled by the genpool memory allocator in our case. The other
> > parts of our iova domain are chunk management and iova_map management.
> > We need different chunks to distinguish different dma mapping types:
> > consistent mapping or streaming mapping. We can only use the
> > bouncing mechanism in the streaming mapping case.
>
>
> To differentiate dma mappings, you can use two iova domains with different
> ranges. It looks simpler than the gen_pool. (AFAIK most IOMMU drivers
> use the iova domain.)
>

OK, I see. Will do it in v3.
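
To check my understanding, here is a rough userspace sketch of the
two-range idea. It does not use the real drivers/iommu/iova.c API. Every
name below (sketch_iova_range, sketch_iova_domain, sketch_range_alloc and
so on) is hypothetical, and the allocator is a plain bump pointer with no
free path. It only shows that the mapping type can be recovered from the
IOVA by a range check, without per-chunk type bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

/* Hands out page-aligned IOVAs from [start, limit); start must be non-zero. */
struct sketch_iova_range {
	uint64_t start;
	uint64_t limit;
	uint64_t next;	/* bump pointer, never rewound in this sketch */
};

/* One range per dma mapping type, with disjoint IOVA intervals. */
struct sketch_iova_domain {
	struct sketch_iova_range streaming;	/* bounce-buffered */
	struct sketch_iova_range consistent;	/* mapped directly */
};

static void sketch_range_init(struct sketch_iova_range *r,
			      uint64_t start, uint64_t limit)
{
	r->start = start;
	r->limit = limit;
	r->next = start;
}

/* Returns the allocated IOVA, or 0 if size is 0 or the range is exhausted. */
static uint64_t sketch_range_alloc(struct sketch_iova_range *r, size_t size)
{
	uint64_t len = ((uint64_t)size + SKETCH_PAGE_SIZE - 1) &
		       ~(SKETCH_PAGE_SIZE - 1);
	uint64_t iova;

	if (len == 0 || len > r->limit - r->next)
		return 0;

	iova = r->next;
	r->next += len;
	return iova;
}

/* The mapping type falls out of the address itself. */
static bool sketch_iova_is_streaming(const struct sketch_iova_domain *d,
				     uint64_t iova)
{
	return iova >= d->streaming.start && iova < d->streaming.limit;
}
```

The page fault handler could then pick the bounce path or the direct path
from the faulting IOVA alone.
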

Thanks,
Yongji



* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-24  7:37         ` Yongji Xie
@ 2020-12-25  2:37           ` Yongji Xie
  2020-12-25  7:02             ` Jason Wang
  2020-12-25  6:57           ` Jason Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-25  2:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Thu, Dec 24, 2020 at 3:37 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2020/12/23 下午8:14, Yongji Xie wrote:
> > > On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> > >>
> > >> On 2020/12/22 下午10:52, Xie Yongji wrote:
> > >>> To support vhost-vdpa bus driver, we need a way to share the
> > >>> vhost-vdpa backend process's memory with the userspace VDUSE process.
> > >>>
> > >>> This patch tries to make use of the vhost iotlb message to achieve
> > >>> that. We will get the shm file from the iotlb message and pass it
> > >>> to the userspace VDUSE process.
> > >>>
> > >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > >>> ---
> > >>>    Documentation/driver-api/vduse.rst |  15 +++-
> > >>>    drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
> > >>>    include/uapi/linux/vduse.h         |  11 +++
> > >>>    3 files changed, 171 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> > >>> index 623f7b040ccf..48e4b1ba353f 100644
> > >>> --- a/Documentation/driver-api/vduse.rst
> > >>> +++ b/Documentation/driver-api/vduse.rst
> > >>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
> > >>>
> > >>>    - VDUSE_GET_CONFIG: Read from device specific configuration space
> > >>>
> > >>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
> > >>> +
> > >>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
> > >>> +
> > >>>    Please see include/linux/vdpa.h for details.
> > >>>
> > >>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> > >>> +The data path of userspace vDPA device is implemented in different ways
> > >>> +depending on the vdpa bus to which it is attached.
> > >>> +
> > >>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> > >>>    driver which supports mapping the kernel dma buffer to a userspace iova
> > >>>    region dynamically. The userspace iova region can be created by passing
> > >>>    the userspace vDPA device fd to mmap(2).
> > >>>
> > >>> +In vhost-vdpa case, the dma buffer resides in a userspace memory region
> > >>> +which will be shared to the VDUSE userspace process via the file
> > >>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
> > >>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
> > >>> +in this message.
> > >>> +
> > >>>    Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> > >>>    receive virtqueue kicks in userspace. The following ioctls on the userspace
> > >>>    vDPA device fd are provided to support that:
> > >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > >>> index b974333ed4e9..d24aaacb6008 100644
> > >>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > >>> @@ -34,6 +34,7 @@
> > >>>
> > >>>    struct vduse_dev_msg {
> > >>>        struct vduse_dev_request req;
> > >>> +     struct file *iotlb_file;
> > >>>        struct vduse_dev_response resp;
> > >>>        struct list_head list;
> > >>>        wait_queue_head_t waitq;
> > >>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> > >>>        return ret;
> > >>>    }
> > >>>
> > >>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
> > >>> +                             u64 offset, u64 iova, u64 size, u8 perm)
> > >>> +{
> > >>> +     struct vduse_dev_msg *msg;
> > >>> +     int ret;
> > >>> +
> > >>> +     if (!size)
> > >>> +             return -EINVAL;
> > >>> +
> > >>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> > >>> +     msg->req.size = sizeof(struct vduse_iotlb);
> > >>> +     msg->req.iotlb.offset = offset;
> > >>> +     msg->req.iotlb.iova = iova;
> > >>> +     msg->req.iotlb.size = size;
> > >>> +     msg->req.iotlb.perm = perm;
> > >>> +     msg->req.iotlb.fd = -1;
> > >>> +     msg->iotlb_file = get_file(file);
> > >>> +
> > >>> +     ret = vduse_dev_msg_sync(dev, msg);
> > >>
> > >> My feeling is that we should provide consistent API for the userspace
> > >> device to use.
> > >>
> > >> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
> > >>
> > >> It looks to me for virtio drivers we can still use UPDAT_IOTLB message
> > >> by using VDUSE file as msg->iotlb_file here.
> > >>
> > > It's OK for me. One problem is when to transfer the UPDATE_IOTLB
> > > message in virtio cases.
> >
> >
> > Instead of generating IOTLB messages for userspace.
> >
> > How about record the mappings (which is a common case for device have
> > on-chip IOMMU e.g mlx5e and vdpa simlator), then we can introduce ioctl
> > for userspace to query?
> >
>
> If so, the IOTLB UPDATE is actually triggered by ioctl, but
> IOTLB_INVALIDATE is triggered by the message. Is it a little odd? Or
> how about trigger it when userspace call mmap() on the device fd?
>

Oh sorry, it looks like mmap() needs to be called in the IOTLB UPDATE
message handler. Is it possible for the vdpa device to know which vdpa
bus it is attached to, so that we can generate this message during
probing? Otherwise, we don't know whether the iova domain of the
MMU-based IOMMU is needed.

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-24  7:37         ` Yongji Xie
  2020-12-25  2:37           ` Yongji Xie
@ 2020-12-25  6:57           ` Jason Wang
  2020-12-25 10:31             ` Yongji Xie
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-25  6:57 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/24 3:37 PM, Yongji Xie wrote:
> On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/12/23 8:14 PM, Yongji Xie wrote:
>>> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>>>> To support vhost-vdpa bus driver, we need a way to share the
>>>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
>>>>>
>>>>> This patch tries to make use of the vhost iotlb message to achieve
>>>>> that. We will get the shm file from the iotlb message and pass it
>>>>> to the userspace VDUSE process.
>>>>>
>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>> ---
>>>>>     Documentation/driver-api/vduse.rst |  15 +++-
>>>>>     drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
>>>>>     include/uapi/linux/vduse.h         |  11 +++
>>>>>     3 files changed, 171 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>>>> index 623f7b040ccf..48e4b1ba353f 100644
>>>>> --- a/Documentation/driver-api/vduse.rst
>>>>> +++ b/Documentation/driver-api/vduse.rst
>>>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
>>>>>
>>>>>     - VDUSE_GET_CONFIG: Read from device specific configuration space
>>>>>
>>>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
>>>>> +
>>>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
>>>>> +
>>>>>     Please see include/linux/vdpa.h for details.
>>>>>
>>>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>>>> +The data path of userspace vDPA device is implemented in different ways
>>>>> +depending on the vdpa bus to which it is attached.
>>>>> +
>>>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>>>>>     driver which supports mapping the kernel dma buffer to a userspace iova
>>>>>     region dynamically. The userspace iova region can be created by passing
>>>>>     the userspace vDPA device fd to mmap(2).
>>>>>
>>>>> +In vhost-vdpa case, the dma buffer is reside in a userspace memory region
>>>>> +which will be shared to the VDUSE userspace processs via the file
>>>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
>>>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
>>>>> +in this message.
>>>>> +
>>>>>     Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>>>>     receive virtqueue kicks in userspace. The following ioctls on the userspace
>>>>>     vDPA device fd are provided to support that:
>>>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>> index b974333ed4e9..d24aaacb6008 100644
>>>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>> @@ -34,6 +34,7 @@
>>>>>
>>>>>     struct vduse_dev_msg {
>>>>>         struct vduse_dev_request req;
>>>>> +     struct file *iotlb_file;
>>>>>         struct vduse_dev_response resp;
>>>>>         struct list_head list;
>>>>>         wait_queue_head_t waitq;
>>>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>>>>         return ret;
>>>>>     }
>>>>>
>>>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
>>>>> +                             u64 offset, u64 iova, u64 size, u8 perm)
>>>>> +{
>>>>> +     struct vduse_dev_msg *msg;
>>>>> +     int ret;
>>>>> +
>>>>> +     if (!size)
>>>>> +             return -EINVAL;
>>>>> +
>>>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
>>>>> +     msg->req.size = sizeof(struct vduse_iotlb);
>>>>> +     msg->req.iotlb.offset = offset;
>>>>> +     msg->req.iotlb.iova = iova;
>>>>> +     msg->req.iotlb.size = size;
>>>>> +     msg->req.iotlb.perm = perm;
>>>>> +     msg->req.iotlb.fd = -1;
>>>>> +     msg->iotlb_file = get_file(file);
>>>>> +
>>>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>>> My feeling is that we should provide consistent API for the userspace
>>>> device to use.
>>>>
>>>> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
>>>>
>>>> It looks to me for virtio drivers we can still use UPDAT_IOTLB message
>>>> by using VDUSE file as msg->iotlb_file here.
>>>>
>>> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
>>> message in virtio cases.
>>
>> Instead of generating IOTLB messages for userspace.
>>
>> How about record the mappings (which is a common case for device have
>> on-chip IOMMU e.g mlx5e and vdpa simlator), then we can introduce ioctl
>> for userspace to query?
>>
> If so, the IOTLB UPDATE is actually triggered by ioctl, but
> IOTLB_INVALIDATE is triggered by the message. Is it a little odd?


Good point.

Some questions here:

1) Userspace thinks the IOTLB has been flushed once the IOTLB_INVALIDATE
syscall returns. But this patch doesn't provide that guarantee. Can this
lead to issues?
2) I think that after VDUSE userspace receives IOTLB_INVALIDATE, it needs
to issue a munmap(). What if it doesn't do that?


>   Or
> how about trigger it when userspace call mmap() on the device fd?


One possible issue is that IOTLB_UPDATE/INVALIDATE might come after
mmap():

1) When vIOMMU is enabled
2) When the guest memory topology has changed (memory hotplug).

Thanks


>
> Thanks,
> Yongji
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-12-24  8:34         ` Yongji Xie
@ 2020-12-25  6:59           ` Jason Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Wang @ 2020-12-25  6:59 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/24 4:34 PM, Yongji Xie wrote:
>> Yes, the disadvantage is the performance. But it should be simpler (I
>> guess) and we know it can succeed.
>>
> Yes, another advantage is that we can support the VM using anonymous memory.


Exactly.


>
>>> I think I can try this in v3. And the
>>> MMU-based IOMMU implementation can be a future optimization in the
>>> virtio-vdpa case. What's your opinion?
>> Maybe I was wrong, but I think we can try as what has been proposed here
>> first and use shadow virtqueue as backup plan if we fail.
>>
> OK, I will continue to work on this proposal.


Thanks

>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-25  2:37           ` Yongji Xie
@ 2020-12-25  7:02             ` Jason Wang
  2020-12-25 11:36               ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-25  7:02 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/25 10:37 AM, Yongji Xie wrote:
> On Thu, Dec 24, 2020 at 3:37 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>> On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On 2020/12/23 8:14 PM, Yongji Xie wrote:
>>>> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>>>>> To support vhost-vdpa bus driver, we need a way to share the
>>>>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
>>>>>>
>>>>>> This patch tries to make use of the vhost iotlb message to achieve
>>>>>> that. We will get the shm file from the iotlb message and pass it
>>>>>> to the userspace VDUSE process.
>>>>>>
>>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>>> ---
>>>>>>     Documentation/driver-api/vduse.rst |  15 +++-
>>>>>>     drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
>>>>>>     include/uapi/linux/vduse.h         |  11 +++
>>>>>>     3 files changed, 171 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>>>>> index 623f7b040ccf..48e4b1ba353f 100644
>>>>>> --- a/Documentation/driver-api/vduse.rst
>>>>>> +++ b/Documentation/driver-api/vduse.rst
>>>>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
>>>>>>
>>>>>>     - VDUSE_GET_CONFIG: Read from device specific configuration space
>>>>>>
>>>>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
>>>>>> +
>>>>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
>>>>>> +
>>>>>>     Please see include/linux/vdpa.h for details.
>>>>>>
>>>>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>>>>> +The data path of userspace vDPA device is implemented in different ways
>>>>>> +depending on the vdpa bus to which it is attached.
>>>>>> +
>>>>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>>>>>>     driver which supports mapping the kernel dma buffer to a userspace iova
>>>>>>     region dynamically. The userspace iova region can be created by passing
>>>>>>     the userspace vDPA device fd to mmap(2).
>>>>>>
>>>>>> +In vhost-vdpa case, the dma buffer is reside in a userspace memory region
>>>>>> +which will be shared to the VDUSE userspace processs via the file
>>>>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
>>>>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
>>>>>> +in this message.
>>>>>> +
>>>>>>     Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>>>>>     receive virtqueue kicks in userspace. The following ioctls on the userspace
>>>>>>     vDPA device fd are provided to support that:
>>>>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>>> index b974333ed4e9..d24aaacb6008 100644
>>>>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>>> @@ -34,6 +34,7 @@
>>>>>>
>>>>>>     struct vduse_dev_msg {
>>>>>>         struct vduse_dev_request req;
>>>>>> +     struct file *iotlb_file;
>>>>>>         struct vduse_dev_response resp;
>>>>>>         struct list_head list;
>>>>>>         wait_queue_head_t waitq;
>>>>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>>>>>         return ret;
>>>>>>     }
>>>>>>
>>>>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
>>>>>> +                             u64 offset, u64 iova, u64 size, u8 perm)
>>>>>> +{
>>>>>> +     struct vduse_dev_msg *msg;
>>>>>> +     int ret;
>>>>>> +
>>>>>> +     if (!size)
>>>>>> +             return -EINVAL;
>>>>>> +
>>>>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
>>>>>> +     msg->req.size = sizeof(struct vduse_iotlb);
>>>>>> +     msg->req.iotlb.offset = offset;
>>>>>> +     msg->req.iotlb.iova = iova;
>>>>>> +     msg->req.iotlb.size = size;
>>>>>> +     msg->req.iotlb.perm = perm;
>>>>>> +     msg->req.iotlb.fd = -1;
>>>>>> +     msg->iotlb_file = get_file(file);
>>>>>> +
>>>>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>>>> My feeling is that we should provide consistent API for the userspace
>>>>> device to use.
>>>>>
>>>>> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
>>>>>
>>>>> It looks to me for virtio drivers we can still use UPDAT_IOTLB message
>>>>> by using VDUSE file as msg->iotlb_file here.
>>>>>
>>>> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
>>>> message in virtio cases.
>>>
>>> Instead of generating IOTLB messages for userspace.
>>>
>>> How about record the mappings (which is a common case for device have
>>> on-chip IOMMU e.g mlx5e and vdpa simlator), then we can introduce ioctl
>>> for userspace to query?
>>>
>> If so, the IOTLB UPDATE is actually triggered by ioctl, but
>> IOTLB_INVALIDATE is triggered by the message. Is it a little odd? Or
>> how about trigger it when userspace call mmap() on the device fd?
>>
> Oh sorry, looks like mmap() needs to be called in IOTLB UPDATE message
> handler. Is it possible for the vdpa device to know which vdpa bus it
> is attached to?


We'd better not. It's kind of a layering violation.

Thanks


>   So that we can generate this message during probing.
> Otherwise, we don't know whether the iova domain of MMU-based IOMMU is
> needed.
>
> Thanks,
> Yongji
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-25  6:57           ` Jason Wang
@ 2020-12-25 10:31             ` Yongji Xie
  2020-12-28  7:43               ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-25 10:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Fri, Dec 25, 2020 at 2:58 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/24 3:37 PM, Yongji Xie wrote:
> > On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/23 8:14 PM, Yongji Xie wrote:
> >>> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> >>>>> To support vhost-vdpa bus driver, we need a way to share the
> >>>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
> >>>>>
> >>>>> This patch tries to make use of the vhost iotlb message to achieve
> >>>>> that. We will get the shm file from the iotlb message and pass it
> >>>>> to the userspace VDUSE process.
> >>>>>
> >>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>>>> ---
> >>>>>     Documentation/driver-api/vduse.rst |  15 +++-
> >>>>>     drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
> >>>>>     include/uapi/linux/vduse.h         |  11 +++
> >>>>>     3 files changed, 171 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> >>>>> index 623f7b040ccf..48e4b1ba353f 100644
> >>>>> --- a/Documentation/driver-api/vduse.rst
> >>>>> +++ b/Documentation/driver-api/vduse.rst
> >>>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
> >>>>>
> >>>>>     - VDUSE_GET_CONFIG: Read from device specific configuration space
> >>>>>
> >>>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
> >>>>> +
> >>>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
> >>>>> +
> >>>>>     Please see include/linux/vdpa.h for details.
> >>>>>
> >>>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> >>>>> +The data path of userspace vDPA device is implemented in different ways
> >>>>> +depending on the vdpa bus to which it is attached.
> >>>>> +
> >>>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> >>>>>     driver which supports mapping the kernel dma buffer to a userspace iova
> >>>>>     region dynamically. The userspace iova region can be created by passing
> >>>>>     the userspace vDPA device fd to mmap(2).
> >>>>>
> >>>>> +In vhost-vdpa case, the dma buffer is reside in a userspace memory region
> >>>>> +which will be shared to the VDUSE userspace processs via the file
> >>>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
> >>>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
> >>>>> +in this message.
> >>>>> +
> >>>>>     Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >>>>>     receive virtqueue kicks in userspace. The following ioctls on the userspace
> >>>>>     vDPA device fd are provided to support that:
> >>>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>> index b974333ed4e9..d24aaacb6008 100644
> >>>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>> @@ -34,6 +34,7 @@
> >>>>>
> >>>>>     struct vduse_dev_msg {
> >>>>>         struct vduse_dev_request req;
> >>>>> +     struct file *iotlb_file;
> >>>>>         struct vduse_dev_response resp;
> >>>>>         struct list_head list;
> >>>>>         wait_queue_head_t waitq;
> >>>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >>>>>         return ret;
> >>>>>     }
> >>>>>
> >>>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
> >>>>> +                             u64 offset, u64 iova, u64 size, u8 perm)
> >>>>> +{
> >>>>> +     struct vduse_dev_msg *msg;
> >>>>> +     int ret;
> >>>>> +
> >>>>> +     if (!size)
> >>>>> +             return -EINVAL;
> >>>>> +
> >>>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> >>>>> +     msg->req.size = sizeof(struct vduse_iotlb);
> >>>>> +     msg->req.iotlb.offset = offset;
> >>>>> +     msg->req.iotlb.iova = iova;
> >>>>> +     msg->req.iotlb.size = size;
> >>>>> +     msg->req.iotlb.perm = perm;
> >>>>> +     msg->req.iotlb.fd = -1;
> >>>>> +     msg->iotlb_file = get_file(file);
> >>>>> +
> >>>>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>>> My feeling is that we should provide consistent API for the userspace
> >>>> device to use.
> >>>>
> >>>> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
> >>>>
> >>>> It looks to me for virtio drivers we can still use UPDAT_IOTLB message
> >>>> by using VDUSE file as msg->iotlb_file here.
> >>>>
> >>> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
> >>> message in virtio cases.
> >>
> >> Instead of generating IOTLB messages for userspace.
> >>
> >> How about record the mappings (which is a common case for device have
> >> on-chip IOMMU e.g mlx5e and vdpa simlator), then we can introduce ioctl
> >> for userspace to query?
> >>
> > If so, the IOTLB UPDATE is actually triggered by ioctl, but
> > IOTLB_INVALIDATE is triggered by the message. Is it a little odd?
>
>
> Good point.
>
> Some questions here:
>
> 1) Userspace think the IOTLB was flushed after IOTLB_INVALIDATE syscall
> is returned. But this patch doesn't have this guarantee. Can this lead
> some issues?

I'm not sure. But should it be guaranteed in VDUSE userspace, just
like what the vhost-user backend process does?
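
To illustrate what "guaranteed in VDUSE userspace" could mean: a minimal
sketch, assuming the daemon receives control messages with read() and
replies with write() as the cover letter describes. The struct layouts,
the DEMO_* names and demo_serve_one() are hypothetical and are not the
definitions from include/uapi/linux/vduse.h. The point is only the
ordering: the reply is written after the handler has finished, so a
sender waiting for the reply never sees an INVALIDATE as complete before
the daemon has acted on it.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical message types, for illustration only. */
enum demo_type { DEMO_UPDATE_IOTLB = 1, DEMO_INVALIDATE_IOTLB = 2 };

/* Hypothetical wire format; explicit padding keeps the layout fixed. */
struct demo_request {
	uint32_t type;
	uint32_t reserved;
	uint64_t iova;
	uint64_t size;
};

struct demo_response {
	uint32_t type;
	int32_t result;
};

/* Hypothetical daemon state: counts how many ranges were torn down. */
struct demo_state {
	int invalidated;
};

/* Handle one request; a real daemon would munmap() the range here. */
static int demo_handle(struct demo_state *st, const struct demo_request *req)
{
	if (req->type == DEMO_INVALIDATE_IOTLB)
		st->invalidated++;
	return 0;
}

/* Read one request, finish handling it, and only then send the reply. */
static int demo_serve_one(int fd, struct demo_state *st)
{
	struct demo_request req;
	struct demo_response resp;

	assert(fd >= 0);
	if (read(fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;

	resp.type = req.type;
	resp.result = demo_handle(st, &req);	/* work happens first */

	if (write(fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		return -1;			/* reply goes out last */
	return 0;
}
```

In this sketch a socketpair() can stand in for the device fd when trying
it out: write a demo_request on one end, call demo_serve_one() on the
other, then read the demo_response back.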

> 2) I think after VDUSE userspace receives IOTLB_INVALIDATE, it needs to
> issue a munmap(). What if it doesn't do that?
>

Yes, the munmap() is needed. If it doesn't do that, VDUSE userspace
could still access the corresponding guest memory region.
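
A rough sketch of the userspace side, assuming the daemon keeps its own
table of IOVA ranges: an UPDATE maps the shared file with mmap(), and an
INVALIDATE drops every overlapping range with munmap(), after which
translation fails. All demo_* names are hypothetical, and demo_shared_fd()
only stands in for the fd that would arrive with the message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_MAX_MAPS 16

/* Hypothetical bookkeeping: one IOVA range and where it is mapped locally. */
struct demo_map {
	uint64_t iova;
	uint64_t size;
	void *va;		/* NULL marks a free slot */
};

struct demo_iotlb {
	struct demo_map maps[DEMO_MAX_MAPS];
};

/* Stand-in for the shared-memory fd a daemon would receive; -1 on error. */
static int demo_shared_fd(off_t size)
{
	FILE *f = tmpfile();
	int fd;

	if (!f)
		return -1;
	fd = dup(fileno(f));
	fclose(f);
	if (fd >= 0 && ftruncate(fd, size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* True if [a, a + a_size) and [b, b + b_size) overlap, without overflow. */
static int demo_overlaps(uint64_t a, uint64_t a_size, uint64_t b, uint64_t b_size)
{
	return a <= b ? b - a < a_size : a - b < b_size;
}

/* UPDATE: mmap the shared file and remember iova -> va. */
static int demo_iotlb_update(struct demo_iotlb *t, int fd, off_t offset,
			     uint64_t iova, uint64_t size, int prot)
{
	assert(size != 0);
	for (int i = 0; i < DEMO_MAX_MAPS; i++) {
		void *va;

		if (t->maps[i].va)
			continue;
		va = mmap(NULL, (size_t)size, prot, MAP_SHARED, fd, offset);
		if (va == MAP_FAILED)
			return -1;
		t->maps[i] = (struct demo_map){ .iova = iova, .size = size, .va = va };
		return 0;
	}
	return -1;	/* table full */
}

/* INVALIDATE: munmap every overlapping range; returns how many were dropped. */
static int demo_iotlb_invalidate(struct demo_iotlb *t, uint64_t iova, uint64_t size)
{
	int dropped = 0;

	for (int i = 0; i < DEMO_MAX_MAPS; i++) {
		struct demo_map *m = &t->maps[i];

		if (!m->va || !demo_overlaps(m->iova, m->size, iova, size))
			continue;
		munmap(m->va, (size_t)m->size);
		m->va = NULL;
		dropped++;
	}
	return dropped;
}

/* Translate an IOVA to a local pointer, or NULL if it is not mapped. */
static void *demo_iotlb_translate(const struct demo_iotlb *t, uint64_t iova)
{
	for (int i = 0; i < DEMO_MAX_MAPS; i++) {
		const struct demo_map *m = &t->maps[i];

		if (m->va && iova >= m->iova && iova - m->iova < m->size)
			return (char *)m->va + (iova - m->iova);
	}
	return NULL;
}
```

Without the munmap() in demo_iotlb_invalidate(), the old pointer would
stay valid in the daemon's address space, which is exactly the concern
raised above.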

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-25  7:02             ` Jason Wang
@ 2020-12-25 11:36               ` Yongji Xie
  0 siblings, 0 replies; 55+ messages in thread
From: Yongji Xie @ 2020-12-25 11:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Fri, Dec 25, 2020 at 3:02 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/25 10:37 AM, Yongji Xie wrote:
> > On Thu, Dec 24, 2020 at 3:37 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> >> On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>
> >>> On 2020/12/23 8:14 PM, Yongji Xie wrote:
> >>>> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> >>>>>> To support vhost-vdpa bus driver, we need a way to share the
> >>>>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
> >>>>>>
> >>>>>> This patch tries to make use of the vhost iotlb message to achieve
> >>>>>> that. We will get the shm file from the iotlb message and pass it
> >>>>>> to the userspace VDUSE process.
> >>>>>>
> >>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>>>>> ---
> >>>>>>     Documentation/driver-api/vduse.rst |  15 +++-
> >>>>>>     drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
> >>>>>>     include/uapi/linux/vduse.h         |  11 +++
> >>>>>>     3 files changed, 171 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> >>>>>> index 623f7b040ccf..48e4b1ba353f 100644
> >>>>>> --- a/Documentation/driver-api/vduse.rst
> >>>>>> +++ b/Documentation/driver-api/vduse.rst
> >>>>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
> >>>>>>
> >>>>>>     - VDUSE_GET_CONFIG: Read from device specific configuration space
> >>>>>>
> >>>>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
> >>>>>> +
> >>>>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
> >>>>>> +
> >>>>>>     Please see include/linux/vdpa.h for details.
> >>>>>>
> >>>>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> >>>>>> +The data path of userspace vDPA device is implemented in different ways
> >>>>>> +depending on the vdpa bus to which it is attached.
> >>>>>> +
> >>>>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> >>>>>>     driver which supports mapping the kernel dma buffer to a userspace iova
> >>>>>>     region dynamically. The userspace iova region can be created by passing
> >>>>>>     the userspace vDPA device fd to mmap(2).
> >>>>>>
> >>>>>> +In vhost-vdpa case, the dma buffer is reside in a userspace memory region
> >>>>>> +which will be shared to the VDUSE userspace processs via the file
> >>>>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
> >>>>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
> >>>>>> +in this message.
> >>>>>> +
> >>>>>>     Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >>>>>>     receive virtqueue kicks in userspace. The following ioctls on the userspace
> >>>>>>     vDPA device fd are provided to support that:
> >>>>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>>> index b974333ed4e9..d24aaacb6008 100644
> >>>>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>>> @@ -34,6 +34,7 @@
> >>>>>>
> >>>>>>     struct vduse_dev_msg {
> >>>>>>         struct vduse_dev_request req;
> >>>>>> +     struct file *iotlb_file;
> >>>>>>         struct vduse_dev_response resp;
> >>>>>>         struct list_head list;
> >>>>>>         wait_queue_head_t waitq;
> >>>>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >>>>>>         return ret;
> >>>>>>     }
> >>>>>>
> >>>>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
> >>>>>> +                             u64 offset, u64 iova, u64 size, u8 perm)
> >>>>>> +{
> >>>>>> +     struct vduse_dev_msg *msg;
> >>>>>> +     int ret;
> >>>>>> +
> >>>>>> +     if (!size)
> >>>>>> +             return -EINVAL;
> >>>>>> +
> >>>>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> >>>>>> +     msg->req.size = sizeof(struct vduse_iotlb);
> >>>>>> +     msg->req.iotlb.offset = offset;
> >>>>>> +     msg->req.iotlb.iova = iova;
> >>>>>> +     msg->req.iotlb.size = size;
> >>>>>> +     msg->req.iotlb.perm = perm;
> >>>>>> +     msg->req.iotlb.fd = -1;
> >>>>>> +     msg->iotlb_file = get_file(file);
> >>>>>> +
> >>>>>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>>>> My feeling is that we should provide consistent API for the userspace
> >>>>> device to use.
> >>>>>
> >>>>> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
> >>>>>
> >>>>> It looks to me for virtio drivers we can still use UPDAT_IOTLB message
> >>>>> by using VDUSE file as msg->iotlb_file here.
> >>>>>
> >>>> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
> >>>> message in virtio cases.
> >>>
> >>> Instead of generating IOTLB messages for userspace.
> >>>
> >>> How about record the mappings (which is a common case for device have
> >>> on-chip IOMMU e.g mlx5e and vdpa simlator), then we can introduce ioctl
> >>> for userspace to query?
> >>>
> >> If so, the IOTLB UPDATE is actually triggered by ioctl, but
> >> IOTLB_INVALIDATE is triggered by the message. Is it a little odd? Or
> >> how about trigger it when userspace call mmap() on the device fd?
> >>
> > Oh sorry, looks like mmap() needs to be called in IOTLB UPDATE message
> > handler. Is it possible for the vdpa device to know which vdpa bus it
> > is attached to?
>
>
> We'd better not. It's kind of layer violation.
>

OK. Now I think both the ioctl and the message are needed. The ioctl is
useful when the VDUSE userspace daemon restarts. And the IOTLB_UPDATE
message could be generated during the first DMA mapping in the
virtio-vdpa case.
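
To make the idea concrete, here is a small model in plain C, not driver
code: mappings are recorded as they are created, the add path reports
whether it was the first mapping (the point where an update message could
be emitted), and a query helper plays the role of the proposed ioctl so a
restarted daemon can walk what is already mapped. Every demo_* name and
the fixed-size table are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_RECORDS 8

/* Hypothetical record of one mapping kept on the "kernel" side. */
struct demo_record {
	uint64_t iova;
	uint64_t size;
	uint64_t offset;
	int used;
};

struct demo_domain {
	struct demo_record rec[DEMO_MAX_RECORDS];
};

/*
 * Record a mapping. Returns 1 if this is the first mapping in the domain,
 * 0 if other mappings already exist, and -1 if the table is full.
 */
static int demo_record_add(struct demo_domain *d, uint64_t iova,
			   uint64_t size, uint64_t offset)
{
	int first = 1, slot = -1;

	assert(size != 0);
	for (int i = 0; i < DEMO_MAX_RECORDS; i++) {
		if (d->rec[i].used)
			first = 0;
		else if (slot < 0)
			slot = i;
	}
	if (slot < 0)
		return -1;
	d->rec[slot] = (struct demo_record){ iova, size, offset, 1 };
	return first;
}

/*
 * Query model: copy out the recorded mapping with the lowest iova that is
 * >= start. Returns 1 if one was found, 0 otherwise.
 */
static int demo_query_next(const struct demo_domain *d, uint64_t start,
			   struct demo_record *out)
{
	const struct demo_record *best = 0;

	for (int i = 0; i < DEMO_MAX_RECORDS; i++) {
		const struct demo_record *r = &d->rec[i];

		if (r->used && r->iova >= start && (!best || r->iova < best->iova))
			best = r;
	}
	if (!best)
		return 0;
	*out = *best;
	return 1;
}
```

A restarted daemon in this model would call demo_query_next() starting
from 0 and continue from iova + size of each result until it returns 0.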

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-25 10:31             ` Yongji Xie
@ 2020-12-28  7:43               ` Jason Wang
  2020-12-28  8:14                 ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-28  7:43 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/25 6:31 PM, Yongji Xie wrote:
> On Fri, Dec 25, 2020 at 2:58 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/12/24 3:37 PM, Yongji Xie wrote:
>>> On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/12/23 8:14 PM, Yongji Xie wrote:
>>>>> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>>>>>> To support vhost-vdpa bus driver, we need a way to share the
>>>>>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
>>>>>>>
>>>>>>> This patch tries to make use of the vhost iotlb message to achieve
>>>>>>> that. We will get the shm file from the iotlb message and pass it
>>>>>>> to the userspace VDUSE process.
>>>>>>>
>>>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>>>> ---
>>>>>>>      Documentation/driver-api/vduse.rst |  15 +++-
>>>>>>>      drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
>>>>>>>      include/uapi/linux/vduse.h         |  11 +++
>>>>>>>      3 files changed, 171 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>>>>>> index 623f7b040ccf..48e4b1ba353f 100644
>>>>>>> --- a/Documentation/driver-api/vduse.rst
>>>>>>> +++ b/Documentation/driver-api/vduse.rst
>>>>>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
>>>>>>>
>>>>>>>      - VDUSE_GET_CONFIG: Read from device specific configuration space
>>>>>>>
>>>>>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
>>>>>>> +
>>>>>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
>>>>>>> +
>>>>>>>      Please see include/linux/vdpa.h for details.
>>>>>>>
>>>>>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>>>>>> +The data path of userspace vDPA device is implemented in different ways
>>>>>>> +depending on the vdpa bus to which it is attached.
>>>>>>> +
>>>>>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>>>>>>>      driver which supports mapping the kernel dma buffer to a userspace iova
>>>>>>>      region dynamically. The userspace iova region can be created by passing
>>>>>>>      the userspace vDPA device fd to mmap(2).
>>>>>>>
>>>>>>> +In vhost-vdpa case, the dma buffer is reside in a userspace memory region
>>>>>>> +which will be shared to the VDUSE userspace processs via the file
>>>>>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
>>>>>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
>>>>>>> +in this message.
>>>>>>> +
>>>>>>>      Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>>>>>>      receive virtqueue kicks in userspace. The following ioctls on the userspace
>>>>>>>      vDPA device fd are provided to support that:
>>>>>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>>>> index b974333ed4e9..d24aaacb6008 100644
>>>>>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>>>>>> @@ -34,6 +34,7 @@
>>>>>>>
>>>>>>>      struct vduse_dev_msg {
>>>>>>>          struct vduse_dev_request req;
>>>>>>> +     struct file *iotlb_file;
>>>>>>>          struct vduse_dev_response resp;
>>>>>>>          struct list_head list;
>>>>>>>          wait_queue_head_t waitq;
>>>>>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>>>>>>          return ret;
>>>>>>>      }
>>>>>>>
>>>>>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
>>>>>>> +                             u64 offset, u64 iova, u64 size, u8 perm)
>>>>>>> +{
>>>>>>> +     struct vduse_dev_msg *msg;
>>>>>>> +     int ret;
>>>>>>> +
>>>>>>> +     if (!size)
>>>>>>> +             return -EINVAL;
>>>>>>> +
>>>>>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
>>>>>>> +     msg->req.size = sizeof(struct vduse_iotlb);
>>>>>>> +     msg->req.iotlb.offset = offset;
>>>>>>> +     msg->req.iotlb.iova = iova;
>>>>>>> +     msg->req.iotlb.size = size;
>>>>>>> +     msg->req.iotlb.perm = perm;
>>>>>>> +     msg->req.iotlb.fd = -1;
>>>>>>> +     msg->iotlb_file = get_file(file);
>>>>>>> +
>>>>>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>>>>> My feeling is that we should provide consistent API for the userspace
>>>>>> device to use.
>>>>>>
>>>>>> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
>>>>>>
>>>>>> It looks to me for virtio drivers we can still use UPDAT_IOTLB message
>>>>>> by using VDUSE file as msg->iotlb_file here.
>>>>>>
>>>>> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
>>>>> message in virtio cases.
>>>> Instead of generating IOTLB messages for userspace.
>>>>
>>>> How about record the mappings (which is a common case for device have
>>>> on-chip IOMMU e.g mlx5e and vdpa simlator), then we can introduce ioctl
>>>> for userspace to query?
>>>>
>>> If so, the IOTLB UPDATE is actually triggered by ioctl, but
>>> IOTLB_INVALIDATE is triggered by the message. Is it a little odd?
>>
>> Good point.
>>
>> Some questions here:
>>
>> 1) Userspace think the IOTLB was flushed after IOTLB_INVALIDATE syscall
>> is returned. But this patch doesn't have this guarantee. Can this lead
>> some issues?
> I'm not sure. But should it be guaranteed in VDUSE userspace? Just
> like what vhost-user backend process does.


I think so. This is because the userspace device needs a way to
synchronize invalidation with its datapath. Otherwise, the guest may think
the buffer has been freed for use by other parts while it can actually
still be used by the VDUSE device. This may cause security issues.


>
>> 2) I think after VDUSE userspace receives IOTLB_INVALIDATE, it needs to
>> issue a munmap(). What if it doesn't do that?
>>
> Yes, the munmap() is needed. If it doesn't do that, VDUSE userspace
> could still access the corresponding guest memory region.


I see. So both of the above questions arise because VHOST_IOTLB_INVALIDATE
is expected to be synchronous. This needs to be solved by tweaking the
current VDUSE API, or we can revisit going with descriptor relaying first.
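
To make the ordering requirement concrete, here is a rough userspace
sketch of what a daemon would have to do when it receives an
invalidation: drop every overlapping mapping first, and only then send
its reply. The table and the helpers iotlb_add(), iotlb_invalidate() and
iotlb_translate() are hypothetical names made up for illustration; they
are not part of the VDUSE API, and the sketch assumes IOVA ranges do not
wrap around.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping for regions the daemon has mmap()ed. */
struct iotlb_entry {
	uint64_t iova;	/* start of the IOVA range */
	uint64_t size;	/* length in bytes; 0 marks a free slot */
	void *va;	/* address returned by mmap() */
};

#define MAX_ENTRIES 16
static struct iotlb_entry table[MAX_ENTRIES];

/* Record a mapping. Returns 0 on success, -1 on bad input or full table. */
int iotlb_add(uint64_t iova, uint64_t size, void *va)
{
	if (size == 0 || va == NULL)
		return -1;
	for (int i = 0; i < MAX_ENTRIES; i++) {
		if (table[i].size == 0) {
			table[i] = (struct iotlb_entry){ iova, size, va };
			return 0;
		}
	}
	return -1;
}

/*
 * Handle an invalidation of [iova, iova + size): munmap() every recorded
 * region overlapping that range. Returns the number of regions unmapped,
 * or -1 if munmap() fails. The reply must be sent only after this returns.
 */
int iotlb_invalidate(uint64_t iova, uint64_t size)
{
	int unmapped = 0;

	for (int i = 0; i < MAX_ENTRIES; i++) {
		struct iotlb_entry *e = &table[i];

		if (e->size == 0)
			continue;
		/* Overlap test on half-open intervals. */
		if (e->iova < iova + size && iova < e->iova + e->size) {
			assert(e->va != NULL);
			if (munmap(e->va, e->size) != 0)
				return -1;
			e->size = 0;
			unmapped++;
		}
	}
	return unmapped;
}

/* Translate an IOVA to a local pointer, or NULL if nothing maps it. */
void *iotlb_translate(uint64_t iova)
{
	for (int i = 0; i < MAX_ENTRIES; i++) {
		struct iotlb_entry *e = &table[i];

		if (e->size != 0 && iova >= e->iova && iova - e->iova < e->size)
			return (char *)e->va + (iova - e->iova);
	}
	return NULL;
}
```

If the datapath only ever reaches guest memory through something like
iotlb_translate(), then a lookup after the invalidation fails instead of
touching memory the guest believes it has taken back.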

Thanks


>
> Thanks,
> Yongji
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-28  7:43               ` Jason Wang
@ 2020-12-28  8:14                 ` Yongji Xie
  2020-12-28  8:43                   ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-28  8:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Mon, Dec 28, 2020 at 3:43 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/25 6:31 PM, Yongji Xie wrote:
> > On Fri, Dec 25, 2020 at 2:58 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/24 3:37 PM, Yongji Xie wrote:
> >>> On Thu, Dec 24, 2020 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2020/12/23 8:14 PM, Yongji Xie wrote:
> >>>>> On Wed, Dec 23, 2020 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> >>>>>>> To support vhost-vdpa bus driver, we need a way to share the
> >>>>>>> vhost-vdpa backend process's memory with the userspace VDUSE process.
> >>>>>>>
> >>>>>>> This patch tries to make use of the vhost iotlb message to achieve
> >>>>>>> that. We will get the shm file from the iotlb message and pass it
> >>>>>>> to the userspace VDUSE process.
> >>>>>>>
> >>>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>>>>>> ---
> >>>>>>>      Documentation/driver-api/vduse.rst |  15 +++-
> >>>>>>>      drivers/vdpa/vdpa_user/vduse_dev.c | 147 ++++++++++++++++++++++++++++++++++++-
> >>>>>>>      include/uapi/linux/vduse.h         |  11 +++
> >>>>>>>      3 files changed, 171 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> >>>>>>> index 623f7b040ccf..48e4b1ba353f 100644
> >>>>>>> --- a/Documentation/driver-api/vduse.rst
> >>>>>>> +++ b/Documentation/driver-api/vduse.rst
> >>>>>>> @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now:
> >>>>>>>
> >>>>>>>      - VDUSE_GET_CONFIG: Read from device specific configuration space
> >>>>>>>
> >>>>>>> +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB
> >>>>>>> +
> >>>>>>> +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB
> >>>>>>> +
> >>>>>>>      Please see include/linux/vdpa.h for details.
> >>>>>>>
> >>>>>>> -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> >>>>>>> +The data path of userspace vDPA device is implemented in different ways
> >>>>>>> +depending on the vdpa bus to which it is attached.
> >>>>>>> +
> >>>>>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> >>>>>>>      driver which supports mapping the kernel dma buffer to a userspace iova
> >>>>>>>      region dynamically. The userspace iova region can be created by passing
> >>>>>>>      the userspace vDPA device fd to mmap(2).
> >>>>>>>
> >>>>>>> +In vhost-vdpa case, the dma buffer is reside in a userspace memory region
> >>>>>>> +which will be shared to the VDUSE userspace processs via the file
> >>>>>>> +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address
> >>>>>>> +mapping (IOVA of dma buffer <-> VA of the memory region) is also included
> >>>>>>> +in this message.
> >>>>>>> +
> >>>>>>>      Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >>>>>>>      receive virtqueue kicks in userspace. The following ioctls on the userspace
> >>>>>>>      vDPA device fd are provided to support that:
> >>>>>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>>>> index b974333ed4e9..d24aaacb6008 100644
> >>>>>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>>>>> @@ -34,6 +34,7 @@
> >>>>>>>
> >>>>>>>      struct vduse_dev_msg {
> >>>>>>>          struct vduse_dev_request req;
> >>>>>>> +     struct file *iotlb_file;
> >>>>>>>          struct vduse_dev_response resp;
> >>>>>>>          struct list_head list;
> >>>>>>>          wait_queue_head_t waitq;
> >>>>>>> @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >>>>>>>          return ret;
> >>>>>>>      }
> >>>>>>>
> >>>>>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file,
> >>>>>>> +                             u64 offset, u64 iova, u64 size, u8 perm)
> >>>>>>> +{
> >>>>>>> +     struct vduse_dev_msg *msg;
> >>>>>>> +     int ret;
> >>>>>>> +
> >>>>>>> +     if (!size)
> >>>>>>> +             return -EINVAL;
> >>>>>>> +
> >>>>>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> >>>>>>> +     msg->req.size = sizeof(struct vduse_iotlb);
> >>>>>>> +     msg->req.iotlb.offset = offset;
> >>>>>>> +     msg->req.iotlb.iova = iova;
> >>>>>>> +     msg->req.iotlb.size = size;
> >>>>>>> +     msg->req.iotlb.perm = perm;
> >>>>>>> +     msg->req.iotlb.fd = -1;
> >>>>>>> +     msg->iotlb_file = get_file(file);
> >>>>>>> +
> >>>>>>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>>>>> My feeling is that we should provide consistent API for the userspace
> >>>>>> device to use.
> >>>>>>
> >>>>>> E.g we'd better carry the IOTLB message for both virtio/vhost drivers.
> >>>>>>
> >>>>>> It looks to me for virtio drivers we can still use UPDAT_IOTLB message
> >>>>>> by using VDUSE file as msg->iotlb_file here.
> >>>>>>
> >>>>> It's OK for me. One problem is when to transfer the UPDATE_IOTLB
> >>>>> message in virtio cases.
> >>>> Instead of generating IOTLB messages for userspace.
> >>>>
> >>>> How about record the mappings (which is a common case for device have
> >>>> on-chip IOMMU e.g mlx5e and vdpa simlator), then we can introduce ioctl
> >>>> for userspace to query?
> >>>>
> >>> If so, the IOTLB UPDATE is actually triggered by ioctl, but
> >>> IOTLB_INVALIDATE is triggered by the message. Is it a little odd?
> >>
> >> Good point.
> >>
> >> Some questions here:
> >>
> >> 1) Userspace think the IOTLB was flushed after IOTLB_INVALIDATE syscall
> >> is returned. But this patch doesn't have this guarantee. Can this lead
> >> some issues?
> > I'm not sure. But should it be guaranteed in VDUSE userspace? Just
> > like what vhost-user backend process does.
>
>
> I think so. This is because the userspace device needs a way to
> synchronize invalidation with its datapath. Otherwise, guest may thing
> the buffer is freed to be used by other parts but the it actually can be
> used by the VDUSE device. The may cause security issues.
>
>
> >
> >> 2) I think after VDUSE userspace receives IOTLB_INVALIDATE, it needs to
> >> issue a munmap(). What if it doesn't do that?
> >>
> > Yes, the munmap() is needed. If it doesn't do that, VDUSE userspace
> > could still access the corresponding guest memory region.
>
>
> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> is expected to be synchronous. This need to be solved by tweaking the
> current VDUSE API or we can re-visit to go with descriptors relaying first.
>

Actually, all vdpa-related operations are synchronous in the current
implementation. The ops.set_map/dma_map/dma_unmap should not return
until userspace has replied to the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB
message. Could that solve this problem?
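
As a rough illustration of what "synchronous" means here, the sketch
below models the round trip in plain userspace C: the requester writes a
message and then blocks in read() until the reply comes back, so it
cannot return before the other side has finished its work. This is only
an analogy and not the kernel implementation; the structs, the message
type values and the helpers daemon_serve_one() and send_msg_sync() are
all made up for the example, and forking one "daemon" per message is a
simplification.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative message types; the values are invented for this sketch. */
enum { MSG_UPDATE_IOTLB = 1, MSG_INVALIDATE_IOTLB = 2 };

struct request  { uint32_t type; uint64_t iova; uint64_t size; };
struct response { uint32_t type; int32_t result; };

/* "Daemon" side: serve exactly one request, reply, and exit. */
void daemon_serve_one(int fd)
{
	struct request req;
	struct response resp;

	if (read(fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		_exit(1);
	/* A real daemon would mmap()/munmap() here, before replying. */
	memset(&resp, 0, sizeof(resp));
	resp.type = req.type;
	resp.result = req.size ? 0 : -EINVAL;
	if (write(fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		_exit(1);
	_exit(0);
}

/*
 * Requester side: send one message and block until the reply arrives.
 * Returns the daemon's result, or -EIO on a transport error.
 */
int send_msg_sync(uint32_t type, uint64_t iova, uint64_t size)
{
	struct request req;
	struct response resp;
	int sv[2], ret = -EIO;
	pid_t pid;

	assert(type == MSG_UPDATE_IOTLB || type == MSG_INVALIDATE_IOTLB);
	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) != 0)
		return -EIO;
	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		return -EIO;
	}
	if (pid == 0) {
		close(sv[0]);
		daemon_serve_one(sv[1]);	/* never returns */
	}
	close(sv[1]);

	memset(&req, 0, sizeof(req));
	req.type = type;
	req.iova = iova;
	req.size = size;
	/* The blocking read() is what makes the operation synchronous. */
	if (write(sv[0], &req, sizeof(req)) == (ssize_t)sizeof(req) &&
	    read(sv[0], &resp, sizeof(resp)) == (ssize_t)sizeof(resp) &&
	    resp.type == type)
		ret = resp.result;

	close(sv[0]);
	waitpid(pid, NULL, 0);
	return ret;
}
```

The point is only the ordering: whatever the daemon does between its
read() and its write() has completed by the time send_msg_sync()
returns to its caller.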

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-28  8:14                 ` Yongji Xie
@ 2020-12-28  8:43                   ` Jason Wang
  2020-12-28  9:12                     ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-28  8:43 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/28 4:14 PM, Yongji Xie wrote:
>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
>> is expected to be synchronous. This need to be solved by tweaking the
>> current VDUSE API or we can re-visit to go with descriptors relaying first.
>>
> Actually all vdpa related operations are synchronous in current
> implementation. The ops.set_map/dma_map/dma_unmap should not return
> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> by userspace. Could it solve this problem?


I was wondering whether or not we need to send an IOTLB_INVALIDATE
message to VDUSE during dma_unmap (vduse_dev_unmap_page).

If we don't, we're probably fine.

Thanks


>
> Thanks,
> Yongji
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-28  8:43                   ` Jason Wang
@ 2020-12-28  9:12                     ` Yongji Xie
  2020-12-29  9:11                       ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-28  9:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/28 4:14 PM, Yongji Xie wrote:
> >> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> >> is expected to be synchronous. This need to be solved by tweaking the
> >> current VDUSE API or we can re-visit to go with descriptors relaying first.
> >>
> > Actually all vdpa related operations are synchronous in current
> > implementation. The ops.set_map/dma_map/dma_unmap should not return
> > until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> > by userspace. Could it solve this problem?
>
>
>   I was thinking whether or not we need to generate IOTLB_INVALIDATE
> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
>
> If we don't, we're probably fine.
>

That does not seem feasible. This message will also be used in the
virtio-vdpa case to notify userspace to unmap some pages during
consistent dma unmapping. Maybe we can document it to make sure that
users handle the message correctly.

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-28  9:12                     ` Yongji Xie
@ 2020-12-29  9:11                       ` Jason Wang
  2020-12-29 10:26                         ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-29  9:11 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm



----- Original Message -----
> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2020/12/28 4:14 PM, Yongji Xie wrote:
> > >> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> > >> is expected to be synchronous. This need to be solved by tweaking the
> > >> current VDUSE API or we can re-visit to go with descriptors relaying
> > >> first.
> > >>
> > > Actually all vdpa related operations are synchronous in current
> > > implementation. The ops.set_map/dma_map/dma_unmap should not return
> > > until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> > > by userspace. Could it solve this problem?
> >
> >
> >   I was thinking whether or not we need to generate IOTLB_INVALIDATE
> > message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> >
> > If we don't, we're probably fine.
> >
> 
> It seems not feasible. This message will be also used in the
> virtio-vdpa case to notify userspace to unmap some pages during
> consistent dma unmapping. Maybe we can document it to make sure the
> users can handle the message correctly.

Just to make sure I understand your point.

Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
coherent DMA?

For 1) you probably need a workqueue to do that, since dma unmap can
be done in irq or bh context. And if userspace doesn't do the unmap, it
can still access the bounce buffer (if you don't zap the pte)?

Thanks


> 
> Thanks,
> Yongji
> 
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-29  9:11                       ` Jason Wang
@ 2020-12-29 10:26                         ` Yongji Xie
  2020-12-30  6:10                           ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-29 10:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> ----- Original Message -----
> > On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > >
> > > On 2020/12/28 4:14 PM, Yongji Xie wrote:
> > > >> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> > > >> is expected to be synchronous. This need to be solved by tweaking the
> > > >> current VDUSE API or we can re-visit to go with descriptors relaying
> > > >> first.
> > > >>
> > > > Actually all vdpa related operations are synchronous in current
> > > > implementation. The ops.set_map/dma_map/dma_unmap should not return
> > > > until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> > > > by userspace. Could it solve this problem?
> > >
> > >
> > >   I was thinking whether or not we need to generate IOTLB_INVALIDATE
> > > message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> > >
> > > If we don't, we're probably fine.
> > >
> >
> > It seems not feasible. This message will be also used in the
> > virtio-vdpa case to notify userspace to unmap some pages during
> > consistent dma unmapping. Maybe we can document it to make sure the
> > users can handle the message correctly.
>
> Just to make sure I understand your point.
>
> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
> coherent DMA?
>
> For 1) you probably need a workqueue to do that since dma unmap can
> be done in irq or bh context. And if usrspace does't do the unmap, it
> can still access the bounce buffer (if you don't zap pte)?
>

I plan to do it in the coherent DMA case. It's true that userspace can
still access the dma buffer if it doesn't do the unmap. But the dma
pages will not be freed and reused until userspace has called munmap()
for them.

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-29 10:26                         ` Yongji Xie
@ 2020-12-30  6:10                           ` Jason Wang
  2020-12-30  7:09                             ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-30  6:10 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/29 6:26 PM, Yongji Xie wrote:
> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>>
>> ----- Original Message -----
>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>> On 2020/12/28 4:14 PM, Yongji Xie wrote:
>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
>>>>>> is expected to be synchronous. This need to be solved by tweaking the
>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
>>>>>> first.
>>>>>>
>>>>> Actually all vdpa related operations are synchronous in current
>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
>>>>> by userspace. Could it solve this problem?
>>>>
>>>>    I was thinking whether or not we need to generate IOTLB_INVALIDATE
>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
>>>>
>>>> If we don't, we're probably fine.
>>>>
>>> It seems not feasible. This message will be also used in the
>>> virtio-vdpa case to notify userspace to unmap some pages during
>>> consistent dma unmapping. Maybe we can document it to make sure the
>>> users can handle the message correctly.
>> Just to make sure I understand your point.
>>
>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
>> coherent DMA?
>>
>> For 1) you probably need a workqueue to do that since dma unmap can
>> be done in irq or bh context. And if usrspace does't do the unmap, it
>> can still access the bounce buffer (if you don't zap pte)?
>>
> I plan to do it in the coherent DMA case.


Any reason for treating coherent DMA differently?


> It's true that userspace can
> access the dma buffer if userspace doesn't do the unmap. But the dma
> pages would not be freed and reused unless user space called munmap()
> for them.


I wonder whether or not we could recycle the IOVA in this case to avoid
the IOTLB_UNMAP message.

Thanks


>
> Thanks,
> Yongji
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-30  6:10                           ` Jason Wang
@ 2020-12-30  7:09                             ` Yongji Xie
  2020-12-30  8:41                               ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-30  7:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/29 6:26 PM, Yongji Xie wrote:
> > On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >>
> >> ----- Original Message -----
> >>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>
> >>>> On 2020/12/28 4:14 PM, Yongji Xie wrote:
> >>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> >>>>>> is expected to be synchronous. This need to be solved by tweaking the
> >>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
> >>>>>> first.
> >>>>>>
> >>>>> Actually all vdpa related operations are synchronous in current
> >>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
> >>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> >>>>> by userspace. Could it solve this problem?
> >>>>
> >>>>    I was thinking whether or not we need to generate IOTLB_INVALIDATE
> >>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> >>>>
> >>>> If we don't, we're probably fine.
> >>>>
> >>> It seems not feasible. This message will be also used in the
> >>> virtio-vdpa case to notify userspace to unmap some pages during
> >>> consistent dma unmapping. Maybe we can document it to make sure the
> >>> users can handle the message correctly.
> >> Just to make sure I understand your point.
> >>
> >> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
> >> coherent DMA?
> >>
> >> For 1) you probably need a workqueue to do that since dma unmap can
> >> be done in irq or bh context. And if usrspace does't do the unmap, it
> >> can still access the bounce buffer (if you don't zap pte)?
> >>
> > I plan to do it in the coherent DMA case.
>
>
> Any reason for treating coherent DMA differently?
>

Currently the memory of the bounce buffer is allocated page by page in
the page fault handler. So it can't be used in the coherent DMA mapping
case, which needs memory with contiguous virtual addresses. I can use
vmalloc() to allocate the bounce buffer instead, but that might waste
some memory. Any suggestions?

>
> > It's true that userspace can
> > access the dma buffer if userspace doesn't do the unmap. But the dma
> > pages would not be freed and reused unless user space called munmap()
> > for them.
>
>
> I wonder whether or not we could recycle IOVA in this case to avoid the
> IOTLB_UMAP message.
>

We can achieve that if we use vmalloc() to allocate the bounce buffer,
which can then be used in the coherent DMA mapping case. But it looks
like we still have no way to avoid the IOTLB_UNMAP message in the
vhost-vdpa case.

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-30  7:09                             ` Yongji Xie
@ 2020-12-30  8:41                               ` Jason Wang
  2020-12-30 10:12                                 ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-30  8:41 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/30 3:09 PM, Yongji Xie wrote:
> On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/12/29 6:26 PM, Yongji Xie wrote:
>>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>> ----- Original Message -----
>>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2020/12/28 4:14 PM, Yongji Xie wrote:
>>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
>>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
>>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
>>>>>>>> first.
>>>>>>>>
>>>>>>> Actually all vdpa related operations are synchronous in current
>>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
>>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
>>>>>>> by userspace. Could it solve this problem?
>>>>>>     I was thinking whether or not we need to generate IOTLB_INVALIDATE
>>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
>>>>>>
>>>>>> If we don't, we're probably fine.
>>>>>>
>>>>> It seems not feasible. This message will be also used in the
>>>>> virtio-vdpa case to notify userspace to unmap some pages during
>>>>> consistent dma unmapping. Maybe we can document it to make sure the
>>>>> users can handle the message correctly.
>>>> Just to make sure I understand your point.
>>>>
>>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
>>>> coherent DMA?
>>>>
>>>> For 1) you probably need a workqueue to do that since dma unmap can
>>>> be done in irq or bh context. And if usrspace does't do the unmap, it
>>>> can still access the bounce buffer (if you don't zap pte)?
>>>>
>>> I plan to do it in the coherent DMA case.
>>
>> Any reason for treating coherent DMA differently?
>>
> Now the memory of the bounce buffer is allocated page by page in the
> page fault handler. So it can't be used in coherent DMA mapping case
> which needs some memory with contiguous virtual addresses. I can use
> vmalloc() to do allocation for the bounce buffer instead. But it might
> cause some memory waste. Any suggestion?


I may be missing something, but I don't see a relationship between
IOTLB_UNMAP and vmalloc().


>
>>> It's true that userspace can
>>> access the dma buffer if userspace doesn't do the unmap. But the dma
>>> pages would not be freed and reused unless user space called munmap()
>>> for them.
>>
>> I wonder whether or not we could recycle IOVA in this case to avoid the
>> IOTLB_UMAP message.
>>
> We can achieve that if we use vmalloc() to do allocation for the
> bounce buffer which can be used in coherent DMA mapping case. But
> looks like we still have no way to avoid the IOTLB_UMAP message in
> vhost-vdpa case.


I think that's fine. For virtio-vdpa, from the VDUSE userspace perspective,
it works like a driver that is using SWIOTLB in this case.

Thanks


>
> Thanks,
> Yongji
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-30  8:41                               ` Jason Wang
@ 2020-12-30 10:12                                 ` Yongji Xie
  2020-12-31  2:49                                   ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-30 10:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Wed, Dec 30, 2020 at 4:41 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/30 3:09 PM, Yongji Xie wrote:
> > On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/29 6:26 PM, Yongji Xie wrote:
> >>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>
> >>>> ----- Original Message -----
> >>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2020/12/28 4:14 PM, Yongji Xie wrote:
> >>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> >>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
> >>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
> >>>>>>>> first.
> >>>>>>>>
> >>>>>>> Actually all vdpa related operations are synchronous in current
> >>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
> >>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> >>>>>>> by userspace. Could it solve this problem?
> >>>>>>     I was thinking whether or not we need to generate IOTLB_INVALIDATE
> >>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> >>>>>>
> >>>>>> If we don't, we're probably fine.
> >>>>>>
> >>>>> It seems not feasible. This message will be also used in the
> >>>>> virtio-vdpa case to notify userspace to unmap some pages during
> >>>>> consistent dma unmapping. Maybe we can document it to make sure the
> >>>>> users can handle the message correctly.
> >>>> Just to make sure I understand your point.
> >>>>
> >>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
> >>>> coherent DMA?
> >>>>
> >>>> For 1) you probably need a workqueue to do that since dma unmap can
> >>>> be done in irq or bh context. And if usrspace does't do the unmap, it
> >>>> can still access the bounce buffer (if you don't zap pte)?
> >>>>
> >>> I plan to do it in the coherent DMA case.
> >>
> >> Any reason for treating coherent DMA differently?
> >>
> > Now the memory of the bounce buffer is allocated page by page in the
> > page fault handler. So it can't be used in coherent DMA mapping case
> > which needs some memory with contiguous virtual addresses. I can use
> > vmalloc() to do allocation for the bounce buffer instead. But it might
> > cause some memory waste. Any suggestion?
>
>
> I may miss something. But I don't see a relationship between the
> IOTLB_UNMAP and vmalloc().
>

In the vmalloc() case, the coherent DMA pages will be taken from the
memory allocated by vmalloc(). So IOTLB_UNMAP is no longer needed
during coherent DMA unmapping, because the vmalloc'ed memory, which
has been mapped into the userspace address space during initialization,
can be reused. And userspace should not unmap the region until we
destroy the device.
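
The reuse idea can be sketched as a tiny allocator over a pool that is
set up once and never unmapped: freeing a coherent buffer just marks its
pages free again, so the next allocation recycles the same range and no
unmap notification is involved. This is a userspace toy under my own
assumptions (fixed pool size, first-fit search, the names pool_alloc()
and pool_free()), not the planned kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_PAGES 64	/* assumed pool size, in pages */

/* One byte per page: 1 if the page is handed out, 0 if it is free. */
static unsigned char used[POOL_PAGES];

/*
 * Allocate n contiguous pages (first fit). Returns the index of the
 * first page, or -1 if no free run of n pages exists.
 */
int pool_alloc(size_t n)
{
	size_t run = 0;

	if (n == 0 || n > POOL_PAGES)
		return -1;
	for (size_t i = 0; i < POOL_PAGES; i++) {
		run = used[i] ? 0 : run + 1;
		if (run == n) {
			size_t start = i + 1 - n;

			memset(&used[start], 1, n);
			return (int)start;
		}
	}
	return -1;
}

/* Return n pages starting at index start to the pool for later reuse. */
void pool_free(int start, size_t n)
{
	assert(start >= 0 && (size_t)start + n <= POOL_PAGES);
	memset(&used[start], 0, n);
}
```

For example, allocating 4 pages, freeing them, and then allocating 3
pages hands back the same starting index, which is the recycling
behaviour described above.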

>
> >
> >>> It's true that userspace can
> >>> access the dma buffer if userspace doesn't do the unmap. But the dma
> >>> pages would not be freed and reused unless user space called munmap()
> >>> for them.
> >>
> >> I wonder whether or not we could recycle IOVA in this case to avoid the
> >> IOTLB_UMAP message.
> >>
> > We can achieve that if we use vmalloc() to do allocation for the
> > bounce buffer which can be used in coherent DMA mapping case. But
> > looks like we still have no way to avoid the IOTLB_UMAP message in
> > vhost-vdpa case.
>
>
> I think that's fine. For virtio-vdpa, from VDUSE userspace perspective,
> it works like a driver that is using SWIOTLB in this case.
>

OK, will do it in v3.

Thanks,
Yongji


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-30 10:12                                 ` Yongji Xie
@ 2020-12-31  2:49                                   ` Jason Wang
  2020-12-31  5:15                                     ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-31  2:49 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/30 6:12 PM, Yongji Xie wrote:
> On Wed, Dec 30, 2020 at 4:41 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/12/30 下午3:09, Yongji Xie wrote:
>>> On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/12/29 下午6:26, Yongji Xie wrote:
>>>>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> ----- Original Message -----
>>>>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2020/12/28 下午4:14, Yongji Xie wrote:
>>>>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
>>>>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
>>>>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
>>>>>>>>>> first.
>>>>>>>>>>
>>>>>>>>> Actually all vdpa related operations are synchronous in current
>>>>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
>>>>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
>>>>>>>>> by userspace. Could it solve this problem?
>>>>>>>>      I was thinking whether or not we need to generate IOTLB_INVALIDATE
>>>>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
>>>>>>>>
>>>>>>>> If we don't, we're probably fine.
>>>>>>>>
>>>>>>> It seems not feasible. This message will be also used in the
>>>>>>> virtio-vdpa case to notify userspace to unmap some pages during
>>>>>>> consistent dma unmapping. Maybe we can document it to make sure the
>>>>>>> users can handle the message correctly.
>>>>>> Just to make sure I understand your point.
>>>>>>
>>>>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
>>>>>> coherent DMA?
>>>>>>
>>>>>> For 1) you probably need a workqueue to do that since dma unmap can
>>>>>> be done in irq or bh context. And if usrspace does't do the unmap, it
>>>>>> can still access the bounce buffer (if you don't zap pte)?
>>>>>>
>>>>> I plan to do it in the coherent DMA case.
>>>> Any reason for treating coherent DMA differently?
>>>>
>>> Now the memory of the bounce buffer is allocated page by page in the
>>> page fault handler. So it can't be used in coherent DMA mapping case
>>> which needs some memory with contiguous virtual addresses. I can use
>>> vmalloc() to do allocation for the bounce buffer instead. But it might
>>> cause some memory waste. Any suggestion?
>>
>> I may miss something. But I don't see a relationship between the
>> IOTLB_UNMAP and vmalloc().
>>
> In the vmalloc() case, the coherent DMA page will be taken from the
> memory allocated by vmalloc(). So IOTLB_UNMAP is not needed anymore
> during coherent DMA unmapping because those vmalloc'ed memory which
> has been mapped into userspace address space during initialization can
> be reused. And userspace should not unmap the region until we destroy
> the device.


Just to make sure I understand. My understanding is that IOTLB_UNMAP is
only needed when there's a change in the mapping from IOVA to page.

So if we stick to the mapping, e.g. during dma_unmap we just put the IOVA
on a free list to be used by the next IOVA allocation, then IOTLB_UNMAP
could be avoided.

So we are not limited by how the pages are actually allocated?
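
To make the recycling idea concrete, here is a minimal userspace sketch in
C. It is not code from the series: iova_pool, pool_alloc() and pool_free()
are hypothetical names, and it assumes fixed page-sized slots whose
IOVA-to-page mapping never changes after setup.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_SHIFT 12   /* 4 KiB slots, an assumption of this sketch */
#define NR_SLOTS   8    /* arbitrary pool size for illustration */

/* Hypothetical pool: each slot keeps its backing page for the device lifetime. */
struct iova_pool {
	int free_stack[NR_SLOTS];   /* indices of currently free slots */
	int nr_free;
};

static void pool_init(struct iova_pool *p)
{
	p->nr_free = 0;
	/* Push high indices first so that slot 0 is handed out first. */
	for (int i = NR_SLOTS - 1; i >= 0; i--)
		p->free_stack[p->nr_free++] = i;
}

/* Returns a slot-aligned IOVA, or UINT64_MAX when the pool is exhausted. */
static uint64_t pool_alloc(struct iova_pool *p)
{
	if (p->nr_free == 0)
		return UINT64_MAX;
	return (uint64_t)p->free_stack[--p->nr_free] << SLOT_SHIFT;
}

/*
 * dma_unmap analogue: the IOVA goes back on the free list. The IOVA-to-page
 * mapping is untouched, so nothing has to be sent to userspace.
 */
static void pool_free(struct iova_pool *p, uint64_t iova)
{
	assert(p->nr_free < NR_SLOTS);
	p->free_stack[p->nr_free++] = (int)(iova >> SLOT_SHIFT);
}
```

With single-page slots like this, a freed IOVA can always serve the next
single-page request. The sketch says nothing about multi-page requests.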

Thanks


>
>>>>> It's true that userspace can
>>>>> access the dma buffer if userspace doesn't do the unmap. But the dma
>>>>> pages would not be freed and reused unless user space called munmap()
>>>>> for them.
>>>> I wonder whether or not we could recycle IOVA in this case to avoid the
>>>> IOTLB_UMAP message.
>>>>
>>> We can achieve that if we use vmalloc() to do allocation for the
>>> bounce buffer which can be used in coherent DMA mapping case. But
>>> looks like we still have no way to avoid the IOTLB_UMAP message in
>>> vhost-vdpa case.
>>
>> I think that's fine. For virtio-vdpa, from VDUSE userspace perspective,
>> it works like a driver that is using SWIOTLB in this case.
>>
> OK, will do it in v3.
>
> Thanks,
> Yongji
>




* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-31  2:49                                   ` Jason Wang
@ 2020-12-31  5:15                                     ` Yongji Xie
  2020-12-31  5:49                                       ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-31  5:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Thu, Dec 31, 2020 at 10:49 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/30 下午6:12, Yongji Xie wrote:
> > On Wed, Dec 30, 2020 at 4:41 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/30 下午3:09, Yongji Xie wrote:
> >>> On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2020/12/29 下午6:26, Yongji Xie wrote:
> >>>>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> ----- Original Message -----
> >>>>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> On 2020/12/28 下午4:14, Yongji Xie wrote:
> >>>>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> >>>>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
> >>>>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
> >>>>>>>>>> first.
> >>>>>>>>>>
> >>>>>>>>> Actually all vdpa related operations are synchronous in current
> >>>>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
> >>>>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> >>>>>>>>> by userspace. Could it solve this problem?
> >>>>>>>>      I was thinking whether or not we need to generate IOTLB_INVALIDATE
> >>>>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> >>>>>>>>
> >>>>>>>> If we don't, we're probably fine.
> >>>>>>>>
> >>>>>>> It seems not feasible. This message will be also used in the
> >>>>>>> virtio-vdpa case to notify userspace to unmap some pages during
> >>>>>>> consistent dma unmapping. Maybe we can document it to make sure the
> >>>>>>> users can handle the message correctly.
> >>>>>> Just to make sure I understand your point.
> >>>>>>
> >>>>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
> >>>>>> coherent DMA?
> >>>>>>
> >>>>>> For 1) you probably need a workqueue to do that since dma unmap can
> >>>>>> be done in irq or bh context. And if usrspace does't do the unmap, it
> >>>>>> can still access the bounce buffer (if you don't zap pte)?
> >>>>>>
> >>>>> I plan to do it in the coherent DMA case.
> >>>> Any reason for treating coherent DMA differently?
> >>>>
> >>> Now the memory of the bounce buffer is allocated page by page in the
> >>> page fault handler. So it can't be used in coherent DMA mapping case
> >>> which needs some memory with contiguous virtual addresses. I can use
> >>> vmalloc() to do allocation for the bounce buffer instead. But it might
> >>> cause some memory waste. Any suggestion?
> >>
> >> I may miss something. But I don't see a relationship between the
> >> IOTLB_UNMAP and vmalloc().
> >>
> > In the vmalloc() case, the coherent DMA page will be taken from the
> > memory allocated by vmalloc(). So IOTLB_UNMAP is not needed anymore
> > during coherent DMA unmapping because those vmalloc'ed memory which
> > has been mapped into userspace address space during initialization can
> > be reused. And userspace should not unmap the region until we destroy
> > the device.
>
>
> Just to make sure I understand. My understanding is that IOTLB_UNMAP is
> only needed when there's a change the mapping from IOVA to page.
>

Yes, that's true.

> So if we stick to the mapping, e.g during dma_unmap, we just put IOVA to
> free list to be used by the next IOVA allocating. IOTLB_UNMAP could be
> avoided.
>
> So we are not limited by how the pages are actually allocated?
>

In coherent DMA cases, we need to return some memory with contiguous
kernel virtual addresses. That is the reason why we need vmalloc()
here. If we allocate the memory page by page, the corresponding kernel
virtual addresses in a contiguous IOVA range might not be contiguous.
And in streaming DMA cases, there is no such restriction. So another
choice is to use vmalloc'ed memory only for coherent DMA cases.
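
As a rough illustration of the contiguity requirement, the check below
walks the per-page kernel virtual addresses backing an IOVA range. This is
a userspace sketch only: kva_contiguous() and the 4 KiB PAGE_SZ are
assumptions made for the example, not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u   /* assumed page size for this sketch */

/*
 * kva[i] is the (hypothetical) kernel virtual address backing page i of a
 * contiguous IOVA range. A coherent allocation has to hand back one linear
 * buffer, so each page must directly follow the previous one.
 */
static bool kva_contiguous(const uintptr_t *kva, size_t nr_pages)
{
	for (size_t i = 1; i < nr_pages; i++)
		if (kva[i] != kva[i - 1] + PAGE_SZ)
			return false;
	return true;
}
```

Pages allocated one by one in a fault handler would typically fail this
check for ranges longer than one page, while a vmalloc()'ed area passes it
by construction.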

Not sure if this is clear to you.

Thanks,
Yongji



* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-31  5:15                                     ` Yongji Xie
@ 2020-12-31  5:49                                       ` Jason Wang
  2020-12-31  6:52                                         ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-31  5:49 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/31 1:15 PM, Yongji Xie wrote:
> On Thu, Dec 31, 2020 at 10:49 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/12/30 下午6:12, Yongji Xie wrote:
>>> On Wed, Dec 30, 2020 at 4:41 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/12/30 下午3:09, Yongji Xie wrote:
>>>>> On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2020/12/29 下午6:26, Yongji Xie wrote:
>>>>>>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> ----- Original Message -----
>>>>>>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>> On 2020/12/28 下午4:14, Yongji Xie wrote:
>>>>>>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
>>>>>>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
>>>>>>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
>>>>>>>>>>>> first.
>>>>>>>>>>>>
>>>>>>>>>>> Actually all vdpa related operations are synchronous in current
>>>>>>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
>>>>>>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
>>>>>>>>>>> by userspace. Could it solve this problem?
>>>>>>>>>>       I was thinking whether or not we need to generate IOTLB_INVALIDATE
>>>>>>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
>>>>>>>>>>
>>>>>>>>>> If we don't, we're probably fine.
>>>>>>>>>>
>>>>>>>>> It seems not feasible. This message will be also used in the
>>>>>>>>> virtio-vdpa case to notify userspace to unmap some pages during
>>>>>>>>> consistent dma unmapping. Maybe we can document it to make sure the
>>>>>>>>> users can handle the message correctly.
>>>>>>>> Just to make sure I understand your point.
>>>>>>>>
>>>>>>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
>>>>>>>> coherent DMA?
>>>>>>>>
>>>>>>>> For 1) you probably need a workqueue to do that since dma unmap can
>>>>>>>> be done in irq or bh context. And if usrspace does't do the unmap, it
>>>>>>>> can still access the bounce buffer (if you don't zap pte)?
>>>>>>>>
>>>>>>> I plan to do it in the coherent DMA case.
>>>>>> Any reason for treating coherent DMA differently?
>>>>>>
>>>>> Now the memory of the bounce buffer is allocated page by page in the
>>>>> page fault handler. So it can't be used in coherent DMA mapping case
>>>>> which needs some memory with contiguous virtual addresses. I can use
>>>>> vmalloc() to do allocation for the bounce buffer instead. But it might
>>>>> cause some memory waste. Any suggestion?
>>>> I may miss something. But I don't see a relationship between the
>>>> IOTLB_UNMAP and vmalloc().
>>>>
>>> In the vmalloc() case, the coherent DMA page will be taken from the
>>> memory allocated by vmalloc(). So IOTLB_UNMAP is not needed anymore
>>> during coherent DMA unmapping because those vmalloc'ed memory which
>>> has been mapped into userspace address space during initialization can
>>> be reused. And userspace should not unmap the region until we destroy
>>> the device.
>>
>> Just to make sure I understand. My understanding is that IOTLB_UNMAP is
>> only needed when there's a change the mapping from IOVA to page.
>>
> Yes, that's true.
>
>> So if we stick to the mapping, e.g during dma_unmap, we just put IOVA to
>> free list to be used by the next IOVA allocating. IOTLB_UNMAP could be
>> avoided.
>>
>> So we are not limited by how the pages are actually allocated?
>>
> In coherent DMA cases, we need to return some memory with contiguous
> kernel virtual addresses. That is the reason why we need vmalloc()
> here. If we allocate the memory page by page, the corresponding kernel
> virtual addresses in a contiguous IOVA range might not be contiguous.


Yes, but we can do that the way it has been done in the series
(alloc_pages_exact()). Or do you mean it would be a little bit hard to
recycle IOVA/pages here?

Thanks


> And in streaming DMA cases, there is no limit. So another choice is
> using vmalloc'ed memory only for coherent DMA cases.
>
> Not sure if this is clear for you.
>
> Thanks,
> Yongji
>




* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-31  5:49                                       ` Jason Wang
@ 2020-12-31  6:52                                         ` Yongji Xie
  2020-12-31  7:11                                           ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Yongji Xie @ 2020-12-31  6:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Thu, Dec 31, 2020 at 1:50 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/31 下午1:15, Yongji Xie wrote:
> > On Thu, Dec 31, 2020 at 10:49 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/30 下午6:12, Yongji Xie wrote:
> >>> On Wed, Dec 30, 2020 at 4:41 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2020/12/30 下午3:09, Yongji Xie wrote:
> >>>>> On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2020/12/29 下午6:26, Yongji Xie wrote:
> >>>>>>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> ----- Original Message -----
> >>>>>>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>> On 2020/12/28 下午4:14, Yongji Xie wrote:
> >>>>>>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> >>>>>>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
> >>>>>>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
> >>>>>>>>>>>> first.
> >>>>>>>>>>>>
> >>>>>>>>>>> Actually all vdpa related operations are synchronous in current
> >>>>>>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
> >>>>>>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> >>>>>>>>>>> by userspace. Could it solve this problem?
> >>>>>>>>>>       I was thinking whether or not we need to generate IOTLB_INVALIDATE
> >>>>>>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> >>>>>>>>>>
> >>>>>>>>>> If we don't, we're probably fine.
> >>>>>>>>>>
> >>>>>>>>> It seems not feasible. This message will be also used in the
> >>>>>>>>> virtio-vdpa case to notify userspace to unmap some pages during
> >>>>>>>>> consistent dma unmapping. Maybe we can document it to make sure the
> >>>>>>>>> users can handle the message correctly.
> >>>>>>>> Just to make sure I understand your point.
> >>>>>>>>
> >>>>>>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
> >>>>>>>> coherent DMA?
> >>>>>>>>
> >>>>>>>> For 1) you probably need a workqueue to do that since dma unmap can
> >>>>>>>> be done in irq or bh context. And if usrspace does't do the unmap, it
> >>>>>>>> can still access the bounce buffer (if you don't zap pte)?
> >>>>>>>>
> >>>>>>> I plan to do it in the coherent DMA case.
> >>>>>> Any reason for treating coherent DMA differently?
> >>>>>>
> >>>>> Now the memory of the bounce buffer is allocated page by page in the
> >>>>> page fault handler. So it can't be used in coherent DMA mapping case
> >>>>> which needs some memory with contiguous virtual addresses. I can use
> >>>>> vmalloc() to do allocation for the bounce buffer instead. But it might
> >>>>> cause some memory waste. Any suggestion?
> >>>> I may miss something. But I don't see a relationship between the
> >>>> IOTLB_UNMAP and vmalloc().
> >>>>
> >>> In the vmalloc() case, the coherent DMA page will be taken from the
> >>> memory allocated by vmalloc(). So IOTLB_UNMAP is not needed anymore
> >>> during coherent DMA unmapping because those vmalloc'ed memory which
> >>> has been mapped into userspace address space during initialization can
> >>> be reused. And userspace should not unmap the region until we destroy
> >>> the device.
> >>
> >> Just to make sure I understand. My understanding is that IOTLB_UNMAP is
> >> only needed when there's a change the mapping from IOVA to page.
> >>
> > Yes, that's true.
> >
> >> So if we stick to the mapping, e.g during dma_unmap, we just put IOVA to
> >> free list to be used by the next IOVA allocating. IOTLB_UNMAP could be
> >> avoided.
> >>
> >> So we are not limited by how the pages are actually allocated?
> >>
> > In coherent DMA cases, we need to return some memory with contiguous
> > kernel virtual addresses. That is the reason why we need vmalloc()
> > here. If we allocate the memory page by page, the corresponding kernel
> > virtual addresses in a contiguous IOVA range might not be contiguous.
>
>
> Yes, but we can do that as what has been done in the series
> (alloc_pages_exact()). Or do you mean it would be a little bit hard to
> recycle IOVA/pages here?
>

Yes, it might be hard to reuse the memory. For example, we first
allocate one IOVA/page during dma_map, then the IOVA is freed during
dma_unmap. We can't reuse this single page if we need a two-page
area in the next IOVA allocation. So the best way is to use
IOTLB_UNMAP to free this single page during dma_unmap too.
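
The reuse problem can be sketched in plain C as follows. This is a
simplified userspace model, not the driver code: area_cache, cache_put()
and cache_get() are hypothetical names, and the model assumes that areas
allocated separately are never contiguous with each other.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHED 4   /* arbitrary cache size for illustration */

/* Hypothetical record of a freed area that is still mapped into userspace. */
struct cached_area {
	size_t nr_pages;   /* size of the contiguous area */
	int busy;          /* nonzero while handed out again */
};

struct area_cache {
	struct cached_area area[MAX_CACHED];
	size_t n;
};

/* dma_unmap analogue: remember the area instead of unmapping it. */
static int cache_put(struct area_cache *c, size_t nr_pages)
{
	if (c->n == MAX_CACHED)
		return -1;
	c->area[c->n].nr_pages = nr_pages;
	c->area[c->n].busy = 0;
	return (int)c->n++;
}

/*
 * Next allocation: only a single cached area that is large enough can be
 * reused, because separately allocated areas cannot be joined together.
 */
static int cache_get(struct area_cache *c, size_t nr_pages)
{
	for (size_t i = 0; i < c->n; i++) {
		if (!c->area[i].busy && c->area[i].nr_pages >= nr_pages) {
			c->area[i].busy = 1;
			return (int)i;
		}
	}
	return -1;
}
```

In this model a cached one-page area is useless for a two-page request, so
it would sit idle but still mapped. Freeing it on dma_unmap avoids that.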

Thanks,
Yongji



* Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-31  6:52                                         ` Yongji Xie
@ 2020-12-31  7:11                                           ` Jason Wang
  2020-12-31  8:00                                             ` Yongji Xie
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-12-31  7:11 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm


On 2020/12/31 2:52 PM, Yongji Xie wrote:
> On Thu, Dec 31, 2020 at 1:50 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/12/31 下午1:15, Yongji Xie wrote:
>>> On Thu, Dec 31, 2020 at 10:49 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/12/30 下午6:12, Yongji Xie wrote:
>>>>> On Wed, Dec 30, 2020 at 4:41 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2020/12/30 下午3:09, Yongji Xie wrote:
>>>>>>> On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2020/12/29 下午6:26, Yongji Xie wrote:
>>>>>>>>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>> On 2020/12/28 下午4:14, Yongji Xie wrote:
>>>>>>>>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
>>>>>>>>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
>>>>>>>>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
>>>>>>>>>>>>>> first.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Actually all vdpa related operations are synchronous in current
>>>>>>>>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
>>>>>>>>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
>>>>>>>>>>>>> by userspace. Could it solve this problem?
>>>>>>>>>>>>        I was thinking whether or not we need to generate IOTLB_INVALIDATE
>>>>>>>>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
>>>>>>>>>>>>
>>>>>>>>>>>> If we don't, we're probably fine.
>>>>>>>>>>>>
>>>>>>>>>>> It seems not feasible. This message will be also used in the
>>>>>>>>>>> virtio-vdpa case to notify userspace to unmap some pages during
>>>>>>>>>>> consistent dma unmapping. Maybe we can document it to make sure the
>>>>>>>>>>> users can handle the message correctly.
>>>>>>>>>> Just to make sure I understand your point.
>>>>>>>>>>
>>>>>>>>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
>>>>>>>>>> coherent DMA?
>>>>>>>>>>
>>>>>>>>>> For 1) you probably need a workqueue to do that since dma unmap can
>>>>>>>>>> be done in irq or bh context. And if usrspace does't do the unmap, it
>>>>>>>>>> can still access the bounce buffer (if you don't zap pte)?
>>>>>>>>>>
>>>>>>>>> I plan to do it in the coherent DMA case.
>>>>>>>> Any reason for treating coherent DMA differently?
>>>>>>>>
>>>>>>> Now the memory of the bounce buffer is allocated page by page in the
>>>>>>> page fault handler. So it can't be used in coherent DMA mapping case
>>>>>>> which needs some memory with contiguous virtual addresses. I can use
>>>>>>> vmalloc() to do allocation for the bounce buffer instead. But it might
>>>>>>> cause some memory waste. Any suggestion?
>>>>>> I may miss something. But I don't see a relationship between the
>>>>>> IOTLB_UNMAP and vmalloc().
>>>>>>
>>>>> In the vmalloc() case, the coherent DMA page will be taken from the
>>>>> memory allocated by vmalloc(). So IOTLB_UNMAP is not needed anymore
>>>>> during coherent DMA unmapping because those vmalloc'ed memory which
>>>>> has been mapped into userspace address space during initialization can
>>>>> be reused. And userspace should not unmap the region until we destroy
>>>>> the device.
>>>> Just to make sure I understand. My understanding is that IOTLB_UNMAP is
>>>> only needed when there's a change the mapping from IOVA to page.
>>>>
>>> Yes, that's true.
>>>
>>>> So if we stick to the mapping, e.g during dma_unmap, we just put IOVA to
>>>> free list to be used by the next IOVA allocating. IOTLB_UNMAP could be
>>>> avoided.
>>>>
>>>> So we are not limited by how the pages are actually allocated?
>>>>
>>> In coherent DMA cases, we need to return some memory with contiguous
>>> kernel virtual addresses. That is the reason why we need vmalloc()
>>> here. If we allocate the memory page by page, the corresponding kernel
>>> virtual addresses in a contiguous IOVA range might not be contiguous.
>>
>> Yes, but we can do that as what has been done in the series
>> (alloc_pages_exact()). Or do you mean it would be a little bit hard to
>> recycle IOVA/pages here?
>>
> Yes, it might be hard to reuse the memory. For example, we firstly
> allocate 1 IOVA/page during dma_map, then the IOVA is freed during
> dma_unmap. Actually we can't reuse this single page if we need a
> two-pages area in the next IOVA allocating. So the best way is using
> IOTLB_UNMAP to free this single page during dma_unmap too.
>
> Thanks,
> Yongji


I get you now. Then I agree, let's go with IOTLB_UNMAP.

Thanks


>




* Re: Re: [RFC v2 09/13] vduse: Add support for processing vhost iotlb message
  2020-12-31  7:11                                           ` Jason Wang
@ 2020-12-31  8:00                                             ` Yongji Xie
  0 siblings, 0 replies; 55+ messages in thread
From: Yongji Xie @ 2020-12-31  8:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	akpm, Randy Dunlap, Matthew Wilcox, viro, axboe, bcrl, corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On Thu, Dec 31, 2020 at 3:12 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/31 下午2:52, Yongji Xie wrote:
> > On Thu, Dec 31, 2020 at 1:50 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2020/12/31 下午1:15, Yongji Xie wrote:
> >>> On Thu, Dec 31, 2020 at 10:49 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2020/12/30 下午6:12, Yongji Xie wrote:
> >>>>> On Wed, Dec 30, 2020 at 4:41 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2020/12/30 下午3:09, Yongji Xie wrote:
> >>>>>>> On Wed, Dec 30, 2020 at 2:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> On 2020/12/29 下午6:26, Yongji Xie wrote:
> >>>>>>>>> On Tue, Dec 29, 2020 at 5:11 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>> ----- Original Message -----
> >>>>>>>>>>> On Mon, Dec 28, 2020 at 4:43 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>> On 2020/12/28 下午4:14, Yongji Xie wrote:
> >>>>>>>>>>>>>> I see. So all the above two questions are because VHOST_IOTLB_INVALIDATE
> >>>>>>>>>>>>>> is expected to be synchronous. This need to be solved by tweaking the
> >>>>>>>>>>>>>> current VDUSE API or we can re-visit to go with descriptors relaying
> >>>>>>>>>>>>>> first.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Actually all vdpa related operations are synchronous in current
> >>>>>>>>>>>>> implementation. The ops.set_map/dma_map/dma_unmap should not return
> >>>>>>>>>>>>> until the VDUSE_UPDATE_IOTLB/VDUSE_INVALIDATE_IOTLB message is replied
> >>>>>>>>>>>>> by userspace. Could it solve this problem?
> >>>>>>>>>>>>        I was thinking whether or not we need to generate IOTLB_INVALIDATE
> >>>>>>>>>>>> message to VDUSE during dma_unmap (vduse_dev_unmap_page).
> >>>>>>>>>>>>
> >>>>>>>>>>>> If we don't, we're probably fine.
> >>>>>>>>>>>>
> >>>>>>>>>>> It seems not feasible. This message will be also used in the
> >>>>>>>>>>> virtio-vdpa case to notify userspace to unmap some pages during
> >>>>>>>>>>> consistent dma unmapping. Maybe we can document it to make sure the
> >>>>>>>>>>> users can handle the message correctly.
> >>>>>>>>>> Just to make sure I understand your point.
> >>>>>>>>>>
> >>>>>>>>>> Do you mean you plan to notify the unmap of 1) streaming DMA or 2)
> >>>>>>>>>> coherent DMA?
> >>>>>>>>>>
> >>>>>>>>>> For 1) you probably need a workqueue to do that since dma unmap can
> >>>>>>>>>> be done in irq or bh context. And if usrspace does't do the unmap, it
> >>>>>>>>>> can still access the bounce buffer (if you don't zap pte)?
> >>>>>>>>>>
> >>>>>>>>> I plan to do it in the coherent DMA case.
> >>>>>>>> Any reason for treating coherent DMA differently?
> >>>>>>>>
> >>>>>>> Now the memory of the bounce buffer is allocated page by page in the
> >>>>>>> page fault handler. So it can't be used in coherent DMA mapping case
> >>>>>>> which needs some memory with contiguous virtual addresses. I can use
> >>>>>>> vmalloc() to do allocation for the bounce buffer instead. But it might
> >>>>>>> cause some memory waste. Any suggestion?
> >>>>>> I may miss something. But I don't see a relationship between the
> >>>>>> IOTLB_UNMAP and vmalloc().
> >>>>>>
> >>>>> In the vmalloc() case, the coherent DMA page will be taken from the
> >>>>> memory allocated by vmalloc(). So IOTLB_UNMAP is not needed anymore
> >>>>> during coherent DMA unmapping because those vmalloc'ed memory which
> >>>>> has been mapped into userspace address space during initialization can
> >>>>> be reused. And userspace should not unmap the region until we destroy
> >>>>> the device.
> >>>> Just to make sure I understand. My understanding is that IOTLB_UNMAP is
> >>>> only needed when there's a change the mapping from IOVA to page.
> >>>>
> >>> Yes, that's true.
> >>>
> >>>> So if we stick to the mapping, e.g during dma_unmap, we just put IOVA to
> >>>> free list to be used by the next IOVA allocating. IOTLB_UNMAP could be
> >>>> avoided.
> >>>>
> >>>> So we are not limited by how the pages are actually allocated?
> >>>>
> >>> In coherent DMA cases, we need to return some memory with contiguous
> >>> kernel virtual addresses. That is the reason why we need vmalloc()
> >>> here. If we allocate the memory page by page, the corresponding kernel
> >>> virtual addresses in a contiguous IOVA range might not be contiguous.
> >>
> >> Yes, but we can do that as what has been done in the series
> >> (alloc_pages_exact()). Or do you mean it would be a little bit hard to
> >> recycle IOVA/pages here?
> >>
> > Yes, it might be hard to reuse the memory. For example, we firstly
> > allocate 1 IOVA/page during dma_map, then the IOVA is freed during
> > dma_unmap. Actually we can't reuse this single page if we need a
> > two-pages area in the next IOVA allocating. So the best way is using
> > IOTLB_UNMAP to free this single page during dma_unmap too.
> >
> > Thanks,
> > Yongji
>
>
> I get you now. Then I agree that let's go with IOTLB_UNMAP.
>

Fine, will do it.

Thanks,
Yongji



* Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-12-22 14:52 ` [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2020-12-23  8:08   ` Jason Wang
@ 2021-01-08 13:32   ` Bob Liu
  2021-01-10 10:03     ` Yongji Xie
  1 sibling, 1 reply; 55+ messages in thread
From: Bob Liu @ 2021-01-08 13:32 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, akpm,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel, linux-mm

On 12/22/20 10:52 PM, Xie Yongji wrote:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
> 
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
> 
> In the data path, the VDUSE driver implements a MMU-based on-chip
> IOMMU driver which supports mapping the kernel dma buffer to a
> userspace iova region dynamically. Userspace can access those
> iova region via mmap(). Besides, the eventfd mechanism is used to
> trigger interrupt callbacks and receive virtqueue kicks in userspace
> 
> Now we only support virtio-vdpa bus driver with this patch applied.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  Documentation/driver-api/vduse.rst                 |   74 ++
>  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>  drivers/vdpa/Kconfig                               |    8 +
>  drivers/vdpa/Makefile                              |    1 +
>  drivers/vdpa/vdpa_user/Makefile                    |    5 +
>  drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>  drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>  drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
>  drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
>  drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
>  drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
>  include/uapi/linux/vdpa.h                          |    1 +
>  include/uapi/linux/vduse.h                         |   99 ++
>  13 files changed, 2173 insertions(+)
>  create mode 100644 Documentation/driver-api/vduse.rst
>  create mode 100644 drivers/vdpa/vdpa_user/Makefile
>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>  create mode 100644 include/uapi/linux/vduse.h
> 
> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> new file mode 100644
> index 000000000000..da9b3040f20a
> --- /dev/null
> +++ b/Documentation/driver-api/vduse.rst
> @@ -0,0 +1,74 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +A vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with a vendor
> +specific control path. vDPA devices can be either physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +

Could you explain a bit more why a VDUSE framework is needed?
Software-emulated vDPA devices are more likely to be used only for
debugging, when real hardware is not available.
Do you think doing the emulation in kernel space is not enough?

Thanks,
Bob

> +How VDUSE works
> +---------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> +to the new resources will be returned, which can be used to implement the
> +userspace vDPA device's control path and data path.
> +
> +To implement the control path, read/write operations on the file descriptor
> +are used to receive control messages from, and reply to, the VDUSE driver.
> +Those control messages are based on the vdpa_config_ops, which defines a
> +unified interface to control different types of vDPA devices.
> +
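
The read()/write() control path described in the quoted documentation can
be pictured with a minimal sketch. Everything named demo_* below is
hypothetical: the message layout is invented for illustration and is not
the uapi from include/uapi/linux/vduse.h in this patch, and a socketpair
stands in for the file descriptor that VDUSE_CREATE_DEV would return. The
daemon side reads one request, applies it to its device state, and writes
back a reply carrying the same request id.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical message types; NOT the real VDUSE uapi. */
enum demo_msg_type { DEMO_GET_STATUS = 1, DEMO_SET_STATUS = 2 };

/* Hypothetical request/reply layouts, assumed for this sketch only. */
struct demo_request {
	uint32_t type;
	uint32_t request_id;
	uint8_t value;
};

struct demo_response {
	uint32_t request_id;
	int32_t result;		/* 0 on success, -1 for an unknown type */
	uint8_t value;
};

struct demo_device {
	uint8_t status;
};

/*
 * Daemon side: read one control message from fd, apply it to dev and
 * write the reply. Returns 0 on success, -1 on error or short I/O.
 */
static int demo_handle_one(int fd, struct demo_device *dev)
{
	struct demo_request req;
	struct demo_response resp;

	if (read(fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;

	memset(&resp, 0, sizeof(resp));
	resp.request_id = req.request_id;

	switch (req.type) {
	case DEMO_SET_STATUS:
		dev->status = req.value;
		break;
	case DEMO_GET_STATUS:
		resp.value = dev->status;
		break;
	default:
		resp.result = -1;
		break;
	}

	if (write(fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		return -1;
	return 0;
}

/* Stand-in for the driver side: send one request, wait for its reply. */
static int demo_exchange(int drv_fd, int daemon_fd, struct demo_device *dev,
			 const struct demo_request *req,
			 struct demo_response *resp)
{
	if (write(drv_fd, req, sizeof(*req)) != (ssize_t)sizeof(*req))
		return -1;
	if (demo_handle_one(daemon_fd, dev) < 0)
		return -1;
	if (read(drv_fd, resp, sizeof(*resp)) != (ssize_t)sizeof(*resp))
		return -1;
	/* A reply must answer the request that was just sent. */
	assert(resp->request_id == req->request_id);
	return 0;
}

/*
 * Set the status and read it back over a socketpair.
 * Returns the status seen by the "driver", or -1 on failure.
 */
static int demo_roundtrip(uint8_t status)
{
	struct demo_device dev = { 0 };
	struct demo_request req;
	struct demo_response resp;
	int sv[2];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	memset(&req, 0, sizeof(req));
	req.type = DEMO_SET_STATUS;
	req.request_id = 1;
	req.value = status;
	if (demo_exchange(sv[0], sv[1], &dev, &req, &resp) < 0)
		goto out;

	req.type = DEMO_GET_STATUS;
	req.request_id = 2;
	req.value = 0;
	if (demo_exchange(sv[0], sv[1], &dev, &req, &resp) < 0)
		goto out;

	ret = resp.value;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Calling demo_roundtrip() with a status value sets it and then reads it
back through the same request/reply loop. A real daemon would run
demo_handle_one()-style handling in a loop on the VDUSE file descriptor
and use the message definitions from the patch instead.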


* Re: Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-08 13:32   ` Bob Liu
@ 2021-01-10 10:03     ` Yongji Xie
  0 siblings, 0 replies; 55+ messages in thread
From: Yongji Xie @ 2021-01-10 10:03 UTC (permalink / raw)
  To: Bob Liu
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, sgarzare,
	Parav Pandit, akpm, Randy Dunlap, Matthew Wilcox, viro, axboe,
	bcrl, corbet, virtualization, netdev, kvm, linux-aio,
	linux-fsdevel, linux-mm

On Fri, Jan 8, 2021 at 9:32 PM Bob Liu <bob.liu@oracle.com> wrote:
>
> On 12/22/20 10:52 PM, Xie Yongji wrote:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > Both control path and data path of vDPA devices will be able to
> > be handled in userspace.
> >
> > In the control path, the VDUSE driver will make use of a message
> > mechanism to forward config operations from the vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive and reply to
> > those control messages.
> >
> > In the data path, the VDUSE driver implements an MMU-based on-chip
> > IOMMU driver which supports mapping the kernel dma buffer to a
> > userspace iova region dynamically. Userspace can access that
> > iova region via mmap(). Besides, the eventfd mechanism is used to
> > trigger interrupt callbacks and receive virtqueue kicks in userspace.
> >
> > Now we only support virtio-vdpa bus driver with this patch applied.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  Documentation/driver-api/vduse.rst                 |   74 ++
> >  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >  drivers/vdpa/Kconfig                               |    8 +
> >  drivers/vdpa/Makefile                              |    1 +
> >  drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >  drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
> >  drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
> >  drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
> >  drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
> >  drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
> >  drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
> >  include/uapi/linux/vdpa.h                          |    1 +
> >  include/uapi/linux/vduse.h                         |   99 ++
> >  13 files changed, 2173 insertions(+)
> >  create mode 100644 Documentation/driver-api/vduse.rst
> >  create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
> >  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
> >  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
> >  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >  create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> > new file mode 100644
> > index 000000000000..da9b3040f20a
> > --- /dev/null
> > +++ b/Documentation/driver-api/vduse.rst
> > @@ -0,0 +1,74 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +A vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with a vendor
> > +specific control path. vDPA devices can be either physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace.
> > +
>
> Could you explain a bit more why a VDUSE framework is needed?

This can be used to implement a virtio-based userspace I/O solution
(storage, network, and so on) for both containers and VMs.

> Software-emulated vDPA devices are more likely to be used only for
> debugging, when real hardware is not available.

I think software-emulated vDPA devices should also be useful in other
cases, just as FUSE is.

> Do you think doing the emulation in kernel space is not enough?
>

Doing the emulation in userspace should be more flexible.

Thanks,
Yongji


end of thread, other threads:[~2021-01-10 10:04 UTC | newest]

Thread overview: 55+ messages
2020-12-22 14:52 [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2020-12-22 14:52 ` [RFC v2 01/13] mm: export zap_page_range() for driver use Xie Yongji
2020-12-22 15:44   ` Christoph Hellwig
2020-12-22 14:52 ` [RFC v2 02/13] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
2020-12-22 14:52 ` [RFC v2 03/13] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
2020-12-22 14:52 ` [RFC v2 04/13] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
2020-12-22 14:52 ` [RFC v2 05/13] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
2020-12-22 14:52 ` [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2020-12-23  8:08   ` Jason Wang
2020-12-23 14:17     ` Yongji Xie
2020-12-24  3:01       ` Jason Wang
2020-12-24  8:34         ` Yongji Xie
2020-12-25  6:59           ` Jason Wang
2021-01-08 13:32   ` Bob Liu
2021-01-10 10:03     ` Yongji Xie
2020-12-22 14:52 ` [RFC v2 07/13] vduse: support get/set virtqueue state Xie Yongji
2020-12-22 14:52 ` [RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops Xie Yongji
2020-12-23  8:36   ` Jason Wang
2020-12-23 11:06     ` Yongji Xie
2020-12-24  2:36       ` Jason Wang
2020-12-24  7:24         ` Yongji Xie
2020-12-22 14:52 ` [RFC v2 09/13] vduse: Add support for processing vhost iotlb message Xie Yongji
2020-12-23  9:05   ` Jason Wang
2020-12-23 12:14     ` [External] " Yongji Xie
2020-12-24  2:41       ` Jason Wang
2020-12-24  7:37         ` Yongji Xie
2020-12-25  2:37           ` Yongji Xie
2020-12-25  7:02             ` Jason Wang
2020-12-25 11:36               ` Yongji Xie
2020-12-25  6:57           ` Jason Wang
2020-12-25 10:31             ` Yongji Xie
2020-12-28  7:43               ` Jason Wang
2020-12-28  8:14                 ` Yongji Xie
2020-12-28  8:43                   ` Jason Wang
2020-12-28  9:12                     ` Yongji Xie
2020-12-29  9:11                       ` Jason Wang
2020-12-29 10:26                         ` Yongji Xie
2020-12-30  6:10                           ` Jason Wang
2020-12-30  7:09                             ` Yongji Xie
2020-12-30  8:41                               ` Jason Wang
2020-12-30 10:12                                 ` Yongji Xie
2020-12-31  2:49                                   ` Jason Wang
2020-12-31  5:15                                     ` Yongji Xie
2020-12-31  5:49                                       ` Jason Wang
2020-12-31  6:52                                         ` Yongji Xie
2020-12-31  7:11                                           ` Jason Wang
2020-12-31  8:00                                             ` Yongji Xie
2020-12-22 14:52 ` [RFC v2 10/13] vduse: grab the module's references until there is no vduse device Xie Yongji
2020-12-22 14:52 ` [RFC v2 11/13] vduse/iova_domain: Support reclaiming bounce pages Xie Yongji
2020-12-22 14:52 ` [RFC v2 12/13] vduse: Add memory shrinker to reclaim " Xie Yongji
2020-12-22 14:52 ` [RFC v2 13/13] vduse: Introduce a workqueue for irq injection Xie Yongji
2020-12-23  6:38 ` [RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace Jason Wang
2020-12-23  8:14   ` Jason Wang
2020-12-23 10:59   ` Yongji Xie
2020-12-24  2:24     ` Jason Wang
