kvm.vger.kernel.org archive mirror
* [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace
@ 2021-01-19  4:59 Xie Yongji
  2021-01-19  4:59 ` [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
                   ` (6 more replies)
  0 siblings, 7 replies; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  4:59 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This series introduces a framework that can be used to implement
vDPA devices in a userspace program. The work consists of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver uses a message
mechanism to forward config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive and
reply to those control messages.

In the data path, the core task is mapping the DMA buffer into the
VDUSE daemon's address space, which can be implemented in different
ways depending on the vdpa bus to which the vDPA device is attached.

In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver
with a bounce-buffering mechanism to achieve that. In the vhost-vdpa case,
the DMA buffer resides in a userspace memory region which can be shared
with the VDUSE userspace process by transferring the shmfd.
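
As a rough userspace model of the bounce-buffering idea mentioned above
(an illustration only, not the iova_domain.c code in this series): data
is copied into a shared bounce region when a buffer is mapped for the
device, and copied back when a device-written buffer is unmapped. All
names, the fixed pool size and the bump allocator are assumptions made
for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Direction of the transfer, from the driver's point of view. */
enum demo_dir { DEMO_TO_DEVICE, DEMO_FROM_DEVICE };

#define DEMO_BOUNCE_SIZE 4096

/* Hypothetical bounce region; in the real design it is shared with the daemon. */
struct demo_bounce {
	unsigned char pool[DEMO_BOUNCE_SIZE];
	size_t used; /* simple bump allocator, never reclaims space */
};

/*
 * Map a buffer: reserve space in the bounce pool and, for data going to
 * the device, copy the original contents in. Returns the offset inside
 * the pool (playing the role of the IOVA) or -1 if the pool is full.
 */
long demo_bounce_map(struct demo_bounce *b, const void *orig, size_t len,
		     enum demo_dir dir)
{
	size_t iova;

	if (len > DEMO_BOUNCE_SIZE - b->used)
		return -1;

	iova = b->used;
	b->used += len;
	if (dir == DEMO_TO_DEVICE)
		memcpy(b->pool + iova, orig, len);
	return (long)iova;
}

/*
 * Unmap a buffer: for data produced by the device, copy it from the
 * bounce pool back into the original buffer.
 */
void demo_bounce_unmap(struct demo_bounce *b, long iova, void *orig,
		       size_t len, enum demo_dir dir)
{
	assert(iova >= 0 && (size_t)iova + len <= b->used);
	if (dir == DEMO_FROM_DEVICE)
		memcpy(orig, b->pool + iova, len);
}
```

The extra copy in each direction is the cost of this approach, which is
presumably why a zero-copy datapath is listed under future work.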

The details and our use case are shown below:

------------------------    -------------------------   ----------------------------------------------
|            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
|       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
|       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
------------+-----------     -----------+------------   -------------+----------------------+---------
            |                           |                            |                      |
            |                           |                            |                      |
------------+---------------------------+----------------------------+----------------------+---------
|    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
|    -------+--------           --------+--------            -------+--------          -----+----    |
|           |                           |                           |                       |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
| | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
|           |      virtio bus           |                           |                       |        |
|   --------+----+-----------           |                           |                       |        |
|                |                      |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|      | virtio-blk device |            |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|                |                      |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|     |  virtio-vdpa driver |           |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|                |                      |                           |    vdpa bus           |        |
|     -----------+----------------------+---------------------------+------------           |        |
|                                                                                        ---+---     |
-----------------------------------------------------------------------------------------| NIC |------
                                                                                         ---+---
                                                                                            |
                                                                                   ---------+---------
                                                                                   | Remote Storages |
                                                                                   -------------------

We use it to implement a block device connected to
our distributed storage, which can be used in both containers and
VMs. Thus, we can have a unified technology stack for these two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

Future work:
  - Improve performance (e.g. zero copy implementation in datapath)
  - Config interrupt support
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)

This series is now based on the series below:
https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/

V2 to V3:
- Rework the MMU-based IOMMU driver
- Use the iova domain as iova allocator instead of genpool
- Support transferring vma->vm_file in vhost-vdpa
- Add SVA support in vhost-vdpa
- Remove the patches on bounce pages reclaim

V1 to V2:
- Add vhost-vdpa support
- Add some documents
- Based on the vdpa management tool
- Introduce a workqueue for irq injection
- Replace interval tree with array map to store the iova_map

Xie Yongji (11):
  eventfd: track eventfd_signal() recursion depth separately in different cases
  eventfd: Increase the recursion depth of eventfd_signal()
  vdpa: Remove the restriction that only supports virtio-net devices
  vhost-vdpa: protect concurrent access to vhost device iotlb
  vdpa: shared virtual addressing support
  vhost-vdpa: Add an opaque pointer for vhost IOTLB
  vdpa: Pass the netlink attributes to ops.dev_add()
  vduse: Introduce VDUSE - vDPA Device in Userspace
  vduse: Add VDUSE_GET_DEV ioctl
  vduse: grab the module's references until there is no vduse device
  vduse: Introduce a workqueue for irq injection

 Documentation/driver-api/vduse.rst                 |   85 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |    7 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
 drivers/vdpa/vdpa.c                                |    7 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c                   |   17 +-
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/eventfd.c                   |  229 ++++
 drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
 drivers/vdpa/vdpa_user/vduse.h                     |   62 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1249 ++++++++++++++++++++
 drivers/vhost/iotlb.c                              |    5 +-
 drivers/vhost/vdpa.c                               |  130 +-
 drivers/vhost/vhost.c                              |    4 +-
 fs/aio.c                                           |    3 +-
 fs/eventfd.c                                       |   20 +-
 include/linux/eventfd.h                            |    5 +-
 include/linux/vdpa.h                               |   17 +-
 include/linux/vhost_iotlb.h                        |    8 +-
 include/uapi/linux/vdpa.h                          |    1 +
 include/uapi/linux/vduse.h                         |  126 ++
 25 files changed, 2458 insertions(+), 70 deletions(-)
 create mode 100644 Documentation/driver-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

-- 
2.11.0



* [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-01-19  4:59 ` Xie Yongji
  2021-01-20  4:24   ` Jason Wang
  2021-01-19  4:59 ` [RFC v3 02/11] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  4:59 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Currently a global percpu counter limits the recursion depth
of eventfd_signal(). This avoids deadlock and stack overflow.
For the stack overflow case, however, it should be OK to increase
the recursion depth if needed. So add a percpu counter in eventfd_ctx
to limit the recursion depth for the deadlock case. The limit on the
global percpu counter can then be increased later.
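
For illustration only, here is a single-threaded userspace model of the
check described above. The per-context counter rejects re-entry on the
same context (the deadlock case), while the global counter bounds the
total nesting across contexts (the stack overflow case). The demo_*
names, the forward_to field and the runtime-adjustable depth are
assumptions of this sketch; the kernel code uses percpu counters and a
compile-time limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of an eventfd context with its own wake counter. */
struct demo_ctx {
	unsigned long long count;         /* accumulated signal count */
	int wake_count;                   /* non-zero while this ctx is being signalled */
	struct demo_ctx *forward_to;      /* if set, waking this ctx signals that one */
	unsigned long long nested_result; /* return value of the nested signal */
};

/* Model of the global counter and its allowed nesting depth. */
static int demo_global_wake_count;
int demo_wake_depth = 0;

/* True if signalling ctx now must be refused or deferred. */
bool demo_signal_blocked(const struct demo_ctx *ctx)
{
	return ctx->wake_count ||
	       demo_global_wake_count > demo_wake_depth;
}

/* Add n to ctx and run its wakeup; returns n, or 0 if refused. */
unsigned long long demo_signal(struct demo_ctx *ctx, unsigned long long n)
{
	assert(ctx != NULL);

	if (demo_signal_blocked(ctx))
		return 0;

	demo_global_wake_count++;
	ctx->wake_count++;
	ctx->count += n;
	/* The wakeup handler may itself signal another context. */
	if (ctx->forward_to)
		ctx->nested_result = demo_signal(ctx->forward_to, 1);
	ctx->wake_count--;
	demo_global_wake_count--;

	return n;
}
```

With demo_wake_depth at 0, a wakeup on one context cannot signal any
other context. With a depth of 1, one level of nesting across different
contexts is allowed, while signalling the same context again is still
refused.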

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/aio.c                |  3 ++-
 fs/eventfd.c            | 20 +++++++++++++++++++-
 include/linux/eventfd.h |  5 +----
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..5d82903161f5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1698,7 +1698,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		list_del(&iocb->ki_list);
 		iocb->ki_res.res = mangle_poll(mask);
 		req->done = true;
-		if (iocb->ki_eventfd && eventfd_signal_count()) {
+		if (iocb->ki_eventfd &&
+			eventfd_signal_count(iocb->ki_eventfd)) {
 			iocb = NULL;
 			INIT_WORK(&req->work, aio_poll_put_work);
 			schedule_work(&req->work);
diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..2df24f9bada3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -25,6 +25,8 @@
 #include <linux/idr.h>
 #include <linux/uio.h>
 
+#define EVENTFD_WAKE_DEPTH 0
+
 DEFINE_PER_CPU(int, eventfd_wake_count);
 
 static DEFINE_IDA(eventfd_ida);
@@ -42,9 +44,17 @@ struct eventfd_ctx {
 	 */
 	__u64 count;
 	unsigned int flags;
+	int __percpu *wake_count;
 	int id;
 };
 
+bool eventfd_signal_count(struct eventfd_ctx *ctx)
+{
+	return (this_cpu_read(*ctx->wake_count) ||
+		this_cpu_read(eventfd_wake_count) > EVENTFD_WAKE_DEPTH);
+}
+EXPORT_SYMBOL_GPL(eventfd_signal_count);
+
 /**
  * eventfd_signal - Adds @n to the eventfd counter.
  * @ctx: [in] Pointer to the eventfd context.
@@ -71,17 +81,19 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(eventfd_signal_count(ctx)))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	this_cpu_inc(eventfd_wake_count);
+	this_cpu_inc(*ctx->wake_count);
 	if (ULLONG_MAX - ctx->count < n)
 		n = ULLONG_MAX - ctx->count;
 	ctx->count += n;
 	if (waitqueue_active(&ctx->wqh))
 		wake_up_locked_poll(&ctx->wqh, EPOLLIN);
 	this_cpu_dec(eventfd_wake_count);
+	this_cpu_dec(*ctx->wake_count);
 	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
 	return n;
@@ -92,6 +104,7 @@ static void eventfd_free_ctx(struct eventfd_ctx *ctx)
 {
 	if (ctx->id >= 0)
 		ida_simple_remove(&eventfd_ida, ctx->id);
+	free_percpu(ctx->wake_count);
 	kfree(ctx);
 }
 
@@ -423,6 +436,11 @@ static int do_eventfd(unsigned int count, int flags)
 
 	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
+	ctx->wake_count = alloc_percpu(int);
+	if (!ctx->wake_count) {
+		kfree(ctx);
+		return -ENOMEM;
+	}
 	ctx->count = count;
 	ctx->flags = flags;
 	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..1a11ebbd74a9 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -45,10 +45,7 @@ void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
 
 DECLARE_PER_CPU(int, eventfd_wake_count);
 
-static inline bool eventfd_signal_count(void)
-{
-	return this_cpu_read(eventfd_wake_count);
-}
+bool eventfd_signal_count(struct eventfd_ctx *ctx);
 
 #else /* CONFIG_EVENTFD */
 
-- 
2.11.0



* [RFC v3 02/11] eventfd: Increase the recursion depth of eventfd_signal()
  2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-01-19  4:59 ` [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
@ 2021-01-19  4:59 ` Xie Yongji
  2021-01-19  4:59 ` [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  4:59 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Increase the recursion depth of eventfd_signal() to 1. This
will be used in the VDUSE case later.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/eventfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 2df24f9bada3..478cdc175949 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -25,7 +25,7 @@
 #include <linux/idr.h>
 #include <linux/uio.h>
 
-#define EVENTFD_WAKE_DEPTH 0
+#define EVENTFD_WAKE_DEPTH 1
 
 DEFINE_PER_CPU(int, eventfd_wake_count);
 
-- 
2.11.0



* [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-01-19  4:59 ` [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
  2021-01-19  4:59 ` [RFC v3 02/11] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
@ 2021-01-19  4:59 ` Xie Yongji
  2021-01-20  3:46   ` Jason Wang
  2021-01-27  8:59   ` Stefano Garzarella
  2021-01-19  4:59 ` [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  4:59 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

With VDUSE, we should be able to support all kinds of virtio devices.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vhost/vdpa.c | 29 +++--------------------------
 1 file changed, 3 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 29ed4173f04e..448be7875b6d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -22,6 +22,7 @@
 #include <linux/nospec.h>
 #include <linux/vhost.h>
 #include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
 
 #include "vhost.h"
 
@@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
 	return 0;
 }
 
-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
-				      struct vhost_vdpa_config *c)
-{
-	long size = 0;
-
-	switch (v->virtio_id) {
-	case VIRTIO_ID_NET:
-		size = sizeof(struct virtio_net_config);
-		break;
-	}
-
-	if (c->len == 0)
-		return -EINVAL;
-
-	if (c->len > size - c->off)
-		return -E2BIG;
-
-	return 0;
-}
-
 static long vhost_vdpa_get_config(struct vhost_vdpa *v,
 				  struct vhost_vdpa_config __user *c)
 {
@@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v,
 
 	if (copy_from_user(&config, c, size))
 		return -EFAULT;
-	if (vhost_vdpa_config_validate(v, &config))
+	if (config.len == 0)
 		return -EINVAL;
 	buf = kvzalloc(config.len, GFP_KERNEL);
 	if (!buf)
@@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v,
 
 	if (copy_from_user(&config, c, size))
 		return -EFAULT;
-	if (vhost_vdpa_config_validate(v, &config))
+	if (config.len == 0)
 		return -EINVAL;
 	buf = kvzalloc(config.len, GFP_KERNEL);
 	if (!buf)
@@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
 	int minor;
 	int r;
 
-	/* Currently, we only accept the network devices. */
-	if (ops->get_device_id(vdpa) != VIRTIO_ID_NET)
-		return -ENOTSUPP;
-
 	v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!v)
 		return -ENOMEM;
-- 
2.11.0



* [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (2 preceding siblings ...)
  2021-01-19  4:59 ` [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
@ 2021-01-19  4:59 ` Xie Yongji
  2021-01-20  3:44   ` Jason Wang
  2021-01-19  4:59 ` [RFC v3 05/11] vdpa: shared virtual addressing support Xie Yongji
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  4:59 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Introduce a mutex to protect vhost device iotlb from
concurrent access.

Fixes: 4c8cf318 ("vhost: introduce vDPA-based backend")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vhost/vdpa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 448be7875b6d..4a241d380c40 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -49,6 +49,7 @@ struct vhost_vdpa {
 	struct eventfd_ctx *config_ctx;
 	int in_batch;
 	struct vdpa_iova_range range;
+	struct mutex mutex;
 };
 
 static DEFINE_IDA(vhost_vdpa_ida);
@@ -728,6 +729,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 	if (r)
 		return r;
 
+	mutex_lock(&v->mutex);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -747,6 +749,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 		r = -EINVAL;
 		break;
 	}
+	mutex_unlock(&v->mutex);
 
 	return r;
 }
@@ -1017,6 +1020,7 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
 		return minor;
 	}
 
+	mutex_init(&v->mutex);
 	atomic_set(&v->opened, 0);
 	v->minor = minor;
 	v->vdpa = vdpa;
-- 
2.11.0



* [RFC v3 05/11] vdpa: shared virtual addressing support
  2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (3 preceding siblings ...)
  2021-01-19  4:59 ` [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
@ 2021-01-19  4:59 ` Xie Yongji
  2021-01-20  5:55   ` Jason Wang
  2021-01-19  4:59 ` [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB Xie Yongji
  2021-01-19  5:07 ` [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
  6 siblings, 1 reply; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  4:59 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This patch introduces SVA (Shared Virtual Addressing)
support for vDPA devices. During vDPA device allocation,
the vDPA device driver needs to indicate whether SVA is
supported by the device. If it is, the vhost-vdpa bus driver
will not pin user pages and will pass the userspace virtual
address instead of the physical address during DMA mapping.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
 drivers/vdpa/vdpa.c               |  5 ++++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  |  3 ++-
 drivers/vhost/vdpa.c              | 35 +++++++++++++++++++++++------------
 include/linux/vdpa.h              | 10 +++++++---
 6 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 23474af7da40..95c4601f82f5 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -439,7 +439,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
 				    dev, &ifc_vdpa_ops,
-				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
+				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
 	if (adapter == NULL) {
 		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
 		return -ENOMEM;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 77595c81488d..05988d6907f2 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1959,7 +1959,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
 	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
 
 	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
-				 2 * mlx5_vdpa_max_qps(max_vqs), NULL);
+				 2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
 	if (IS_ERR(ndev))
 		return PTR_ERR(ndev);
 
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 32bd48baffab..50cab930b2e5 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
  * @nvqs: number of virtqueues supported by this device
  * @size: size of the parent structure that contains private data
  * @name: name of the vdpa device; optional.
+ * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
  *
  * Driver should use vdpa_alloc_device() wrapper macro instead of
  * using this directly.
@@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
  */
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					int nvqs, size_t size, const char *name)
+					int nvqs, size_t size, const char *name,
+					bool sva)
 {
 	struct vdpa_device *vdev;
 	int err = -EINVAL;
@@ -108,6 +110,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	vdev->config = config;
 	vdev->features_valid = false;
 	vdev->nvqs = nvqs;
+	vdev->sva = sva;
 
 	if (name)
 		err = dev_set_name(&vdev->dev, "%s", name);
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 85776e4e6749..03c796873a6b 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -367,7 +367,8 @@ static struct vdpasim *vdpasim_create(const char *name)
 	else
 		ops = &vdpasim_net_config_ops;
 
-	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, VDPASIM_VQ_NUM, name);
+	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
+				VDPASIM_VQ_NUM, name, false);
 	if (!vdpasim)
 		goto err_alloc;
 
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 4a241d380c40..36b6950ba37f 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -486,21 +486,25 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
 	struct vhost_dev *dev = &v->vdev;
+	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
 	struct vhost_iotlb_map *map;
 	struct page *page;
 	unsigned long pfn, pinned;
 
 	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
-		pinned = map->size >> PAGE_SHIFT;
-		for (pfn = map->addr >> PAGE_SHIFT;
-		     pinned > 0; pfn++, pinned--) {
-			page = pfn_to_page(pfn);
-			if (map->perm & VHOST_ACCESS_WO)
-				set_page_dirty_lock(page);
-			unpin_user_page(page);
+		if (!vdpa->sva) {
+			pinned = map->size >> PAGE_SHIFT;
+			for (pfn = map->addr >> PAGE_SHIFT;
+			     pinned > 0; pfn++, pinned--) {
+				page = pfn_to_page(pfn);
+				if (map->perm & VHOST_ACCESS_WO)
+					set_page_dirty_lock(page);
+				unpin_user_page(page);
+			}
+			atomic64_sub(map->size >> PAGE_SHIFT,
+					&dev->mm->pinned_vm);
 		}
-		atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 		vhost_iotlb_map_free(iotlb, map);
 	}
 }
@@ -558,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		r = iommu_map(v->domain, iova, pa, size,
 			      perm_to_iommu_flags(perm));
 	}
-
-	if (r)
+	if (r) {
 		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
-	else
+		return r;
+	}
+
+	if (!vdpa->sva)
 		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 
-	return r;
+	return 0;
 }
 
 static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
@@ -589,6 +595,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 					   struct vhost_iotlb_msg *msg)
 {
 	struct vhost_dev *dev = &v->vdev;
+	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
 	struct page **page_list;
 	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
@@ -607,6 +614,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				    msg->iova + msg->size - 1))
 		return -EEXIST;
 
+	if (vdpa->sva)
+		return vhost_vdpa_map(v, msg->iova, msg->size,
+				      msg->uaddr, msg->perm);
+
 	/* Limit the use of memory for bookkeeping */
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
 	if (!page_list)
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index cb5a3d847af3..f86869651614 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -44,6 +44,7 @@ struct vdpa_parent_dev;
  * @config: the configuration ops for this device.
  * @index: device index
  * @features_valid: were features initialized? for legacy guests
+ * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
  * @nvqs: maximum number of supported virtqueues
  * @pdev: parent device pointer; caller must setup when registering device as part
  *	  of dev_add() parentdev ops callback before invoking _vdpa_register_device().
@@ -54,6 +55,7 @@ struct vdpa_device {
 	const struct vdpa_config_ops *config;
 	unsigned int index;
 	bool features_valid;
+	bool sva;
 	int nvqs;
 	struct vdpa_parent_dev *pdev;
 };
@@ -250,14 +252,16 @@ struct vdpa_config_ops {
 
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					int nvqs, size_t size, const char *name);
+					int nvqs, size_t size,
+					const char *name, bool sva);
 
-#define vdpa_alloc_device(dev_struct, member, parent, config, nvqs, name)   \
+#define vdpa_alloc_device(dev_struct, member, parent, config, \
+			  nvqs, name, sva) \
 			  container_of(__vdpa_alloc_device( \
 				       parent, config, nvqs, \
 				       sizeof(dev_struct) + \
 				       BUILD_BUG_ON_ZERO(offsetof( \
-				       dev_struct, member)), name), \
+				       dev_struct, member)), name, sva), \
 				       dev_struct, member)
 
 int vdpa_register_device(struct vdpa_device *vdev);
-- 
2.11.0



* [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB
  2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (4 preceding siblings ...)
  2021-01-19  4:59 ` [RFC v3 05/11] vdpa: shared virtual addressing support Xie Yongji
@ 2021-01-19  4:59 ` Xie Yongji
  2021-01-20  6:24   ` Jason Wang
  2021-01-19  5:07 ` [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
  6 siblings, 1 reply; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  4:59 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Add an opaque pointer to the vhost IOTLB to store the
corresponding vma->vm_file and offset for the DMA mapping.

It will be used in the VDUSE case later.
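
For illustration only, the idea can be modelled in userspace as below.
Each mapping carries an untyped pointer that the generic code stores but
never interprets; the owner attaches a small record holding a file
reference and an offset, and releases it when the mapping is removed.
The demo_* types are hypothetical, and a plain reference count stands in
for the file reference.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a reference-counted file object. */
struct demo_file {
	int refcount;
};

/* Hypothetical record attached to a mapping through the opaque pointer. */
struct demo_iotlb_file {
	struct demo_file *file;
	uint64_t offset;
};

/* One IOTLB mapping; the generic code never looks inside opaque. */
struct demo_iotlb_map {
	uint64_t start, last, addr;
	unsigned int perm;
	void *opaque;
};

/* Take a reference on file and wrap it, with its offset, in a cookie. */
void *demo_make_opaque(struct demo_file *file, uint64_t offset)
{
	struct demo_iotlb_file *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	file->refcount++;
	f->file = file;
	f->offset = offset;
	return f;
}

/* Fill in a mapping; returns -1 if the range is inverted. */
int demo_add_range(struct demo_iotlb_map *map, uint64_t start, uint64_t last,
		   uint64_t addr, unsigned int perm, void *opaque)
{
	assert(map != NULL);

	if (last < start)
		return -1;
	map->start = start;
	map->last = last;
	map->addr = addr;
	map->perm = perm;
	map->opaque = opaque; /* stored as-is, may be NULL */
	return 0;
}

/* Tear down a mapping, dropping the file reference if one was attached. */
void demo_del_range(struct demo_iotlb_map *map)
{
	struct demo_iotlb_file *f = map->opaque;

	if (f) {
		f->file->refcount--;
		free(f);
		map->opaque = NULL;
	}
}
```

Callers with nothing to attach simply pass NULL, as the vdpa_sim call
sites do in the diff below.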

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++++---
 drivers/vhost/iotlb.c            |  5 ++-
 drivers/vhost/vdpa.c             | 66 +++++++++++++++++++++++++++++++++++-----
 drivers/vhost/vhost.c            |  4 +--
 include/linux/vdpa.h             |  3 +-
 include/linux/vhost_iotlb.h      |  8 ++++-
 6 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 03c796873a6b..1ffcef67954f 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -279,7 +279,7 @@ static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
 	 */
 	spin_lock(&vdpasim->iommu_lock);
 	ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
-				    pa, dir_to_perm(dir));
+				    pa, dir_to_perm(dir), NULL);
 	spin_unlock(&vdpasim->iommu_lock);
 	if (ret)
 		return DMA_MAPPING_ERROR;
@@ -317,7 +317,7 @@ static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
 
 		ret = vhost_iotlb_add_range(iommu, (u64)pa,
 					    (u64)pa + size - 1,
-					    pa, VHOST_MAP_RW);
+					    pa, VHOST_MAP_RW, NULL);
 		if (ret) {
 			*dma_addr = DMA_MAPPING_ERROR;
 			kfree(addr);
@@ -625,7 +625,8 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
 	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
 	     map = vhost_iotlb_itree_next(map, start, last)) {
 		ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
-					    map->last, map->addr, map->perm);
+					    map->last, map->addr,
+					    map->perm, NULL);
 		if (ret)
 			goto err;
 	}
@@ -639,14 +640,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
 }
 
 static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
-			   u64 pa, u32 perm)
+			   u64 pa, u32 perm, void *opaque)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
 	int ret;
 
 	spin_lock(&vdpasim->iommu_lock);
 	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
-				    perm);
+				    perm, NULL);
 	spin_unlock(&vdpasim->iommu_lock);
 
 	return ret;
diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..3bd5bd06cdbc 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -42,13 +42,15 @@ EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
  * @last: last of IOVA range
  * @addr: the address that is mapped to @start
  * @perm: access permission of this range
+ * @opaque: the opaque pointer for the IOTLB mapping
  *
  * Returns an error last is smaller than start or memory allocation
  * fails
  */
 int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 			  u64 start, u64 last,
-			  u64 addr, unsigned int perm)
+			  u64 addr, unsigned int perm,
+			  void *opaque)
 {
 	struct vhost_iotlb_map *map;
 
@@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 	map->last = last;
 	map->addr = addr;
 	map->perm = perm;
+	map->opaque = opaque;
 
 	iotlb->nmaps++;
 	vhost_iotlb_itree_insert(map, &iotlb->root);
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 36b6950ba37f..e83e5be7cec8 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -488,6 +488,7 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 	struct vhost_dev *dev = &v->vdev;
 	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
+	struct vhost_iotlb_file *iotlb_file;
 	struct vhost_iotlb_map *map;
 	struct page *page;
 	unsigned long pfn, pinned;
@@ -504,6 +505,10 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 			}
 			atomic64_sub(map->size >> PAGE_SHIFT,
 					&dev->mm->pinned_vm);
+		} else if (map->opaque) {
+			iotlb_file = (struct vhost_iotlb_file *)map->opaque;
+			fput(iotlb_file->file);
+			kfree(iotlb_file);
 		}
 		vhost_iotlb_map_free(iotlb, map);
 	}
@@ -540,8 +545,8 @@ static int perm_to_iommu_flags(u32 perm)
 	return flags | IOMMU_CACHE;
 }
 
-static int vhost_vdpa_map(struct vhost_vdpa *v,
-			  u64 iova, u64 size, u64 pa, u32 perm)
+static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
+			  u64 size, u64 pa, u32 perm, void *opaque)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vdpa_device *vdpa = v->vdpa;
@@ -549,12 +554,12 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 	int r = 0;
 
 	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
-				  pa, perm);
+				  pa, perm, opaque);
 	if (r)
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
@@ -591,6 +596,51 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
+static int vhost_vdpa_sva_map(struct vhost_vdpa *v,
+			      u64 iova, u64 size, u64 uaddr, u32 perm)
+{
+	u64 offset, map_size, map_iova = iova;
+	struct vhost_iotlb_file *iotlb_file;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (size) {
+		vma = find_vma(current->mm, uaddr);
+		if (!vma) {
+			ret = -EINVAL;
+			goto err;
+		}
+		map_size = min(size, vma->vm_end - uaddr);
+		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
+		iotlb_file = NULL;
+		if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
+			iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_KERNEL);
+			if (!iotlb_file) {
+				ret = -ENOMEM;
+				goto err;
+			}
+			iotlb_file->file = get_file(vma->vm_file);
+			iotlb_file->offset = offset;
+		}
+		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
+					perm, iotlb_file);
+		if (ret) {
+			if (iotlb_file) {
+				fput(iotlb_file->file);
+				kfree(iotlb_file);
+			}
+			goto err;
+		}
+		size -= map_size;
+		uaddr += map_size;
+		map_iova += map_size;
+	}
+	return 0;
+err:
+	vhost_vdpa_unmap(v, iova, map_iova - iova);
+	return ret;
+}
+
 static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 					   struct vhost_iotlb_msg *msg)
 {
@@ -615,8 +665,8 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 		return -EEXIST;
 
 	if (vdpa->sva)
-		return vhost_vdpa_map(v, msg->iova, msg->size,
-				      msg->uaddr, msg->perm);
+		return vhost_vdpa_sva_map(v, msg->iova, msg->size,
+					  msg->uaddr, msg->perm);
 
 	/* Limit the use of memory for bookkeeping */
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
@@ -671,7 +721,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     msg->perm);
+						     msg->perm, NULL);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -700,7 +750,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, msg->perm);
+			     map_pfn << PAGE_SHIFT, msg->perm, NULL);
 out:
 	if (ret) {
 		if (nchunks) {
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a262e12c6dc2..120dd5b3c119 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1104,7 +1104,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 		vhost_vq_meta_reset(dev);
 		if (vhost_iotlb_add_range(dev->iotlb, msg->iova,
 					  msg->iova + msg->size - 1,
-					  msg->uaddr, msg->perm)) {
+					  msg->uaddr, msg->perm, NULL)) {
 			ret = -ENOMEM;
 			break;
 		}
@@ -1450,7 +1450,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 					  region->guest_phys_addr +
 					  region->memory_size - 1,
 					  region->userspace_addr,
-					  VHOST_MAP_RW))
+					  VHOST_MAP_RW, NULL))
 			goto err;
 	}
 
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index f86869651614..b264c627e94b 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -189,6 +189,7 @@ struct vdpa_iova_range {
  *				@size: size of the area
  *				@pa: physical address for the map
  *				@perm: device access permission (VHOST_MAP_XX)
+ *				@opaque: the opaque pointer for the mapping
  *				Returns integer: success (0) or error (< 0)
  * @dma_unmap:			Unmap an area of IOVA (optional but
  *				must be implemented with dma_map)
@@ -243,7 +244,7 @@ struct vdpa_config_ops {
 	/* DMA ops */
 	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
 	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
-		       u64 pa, u32 perm);
+		       u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
 
 	/* Free device resources */
diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
index 6b09b786a762..66a50c11c8ca 100644
--- a/include/linux/vhost_iotlb.h
+++ b/include/linux/vhost_iotlb.h
@@ -4,6 +4,11 @@
 
 #include <linux/interval_tree_generic.h>
 
+struct vhost_iotlb_file {
+	struct file *file;
+	u64 offset;
+};
+
 struct vhost_iotlb_map {
 	struct rb_node rb;
 	struct list_head link;
@@ -17,6 +22,7 @@ struct vhost_iotlb_map {
 	u32 perm;
 	u32 flags_padding;
 	u64 __subtree_last;
+	void *opaque;
 };
 
 #define VHOST_IOTLB_FLAG_RETIRE 0x1
@@ -30,7 +36,7 @@ struct vhost_iotlb {
 };
 
 int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
-			  u64 addr, unsigned int perm);
+			  u64 addr, unsigned int perm, void *opaque);
 void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
 
 struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags);
-- 
2.11.0
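
The range-splitting loop in vhost_vdpa_sva_map() above can be modelled in
userspace. The following is a minimal sketch, not kernel code: the
sketch_* names, the 4 KiB page size and the sorted array standing in for
the process's VMA list are all assumptions made for illustration. Only
the offset arithmetic, (vm_pgoff << PAGE_SHIFT) + uaddr - vm_start, is
taken from the patch. One deliberate difference: the sketch rejects an
address that falls into a gap before a region, whereas find_vma() may
return a region that starts above the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the vm_area_struct fields the patch reads. */
struct sketch_vma {
	uint64_t vm_start;	/* first byte of the region */
	uint64_t vm_end;	/* one past the last byte */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* One piece of the split range, as it would be added to the IOTLB. */
struct sketch_chunk {
	uint64_t iova;
	uint64_t size;
	uint64_t file_offset;
};

/*
 * Find the region containing addr in an array sorted by address.
 * Unlike find_vma(), this returns NULL when addr falls into a gap.
 */
static const struct sketch_vma *
sketch_find_vma(const struct sketch_vma *vmas, size_t nr, uint64_t addr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (addr < vmas[i].vm_end)
			return addr >= vmas[i].vm_start ? &vmas[i] : NULL;
	}
	return NULL;
}

/* Returns the number of chunks written to out, or -1 on failure. */
static int sketch_split(const struct sketch_vma *vmas, size_t nr,
			uint64_t iova, uint64_t uaddr, uint64_t size,
			struct sketch_chunk *out, size_t max_out)
{
	size_t count = 0;

	assert(out != NULL);
	while (size) {
		const struct sketch_vma *vma = sketch_find_vma(vmas, nr, uaddr);
		uint64_t map_size;

		if (!vma || count == max_out)
			return -1;
		/* Never map past the end of the current region. */
		map_size = vma->vm_end - uaddr;
		if (map_size > size)
			map_size = size;
		out[count].iova = iova;
		out[count].size = map_size;
		out[count].file_offset = (vma->vm_pgoff << SKETCH_PAGE_SHIFT) +
					 uaddr - vma->vm_start;
		count++;
		size -= map_size;
		uaddr += map_size;
		iova += map_size;
	}
	return (int)count;
}
```

A range that straddles two regions yields two chunks, each carrying the
offset into its own backing file. In the patch, that offset is what gets
stored in struct vhost_iotlb_file.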



* [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add()
  2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (5 preceding siblings ...)
  2021-01-19  4:59 ` [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB Xie Yongji
@ 2021-01-19  5:07 ` Xie Yongji
  2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                     ` (3 more replies)
  6 siblings, 4 replies; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  5:07 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Pass the netlink attributes to ops.dev_add() so that
device-specific attributes are available when creating
a vdpa device.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa.c              | 2 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
 include/linux/vdpa.h             | 4 +++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 50cab930b2e5..81a099ec390e 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -443,7 +443,7 @@ static int vdpa_nl_cmd_dev_add_set_doit(struct sk_buff *skb, struct genl_info *i
 		goto err;
 	}
 
-	vdev = pdev->ops->dev_add(pdev, name, device_id);
+	vdev = pdev->ops->dev_add(pdev, name, device_id, info->attrs);
 	if (IS_ERR(vdev))
 		goto err;
 
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 1ffcef67954f..ce24a40f5b00 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -728,7 +728,8 @@ static const struct vdpa_config_ops vdpasim_net_batch_config_ops = {
 };
 
 static struct vdpa_device *
-vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name, u32 device_id)
+vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name,
+		u32 device_id, struct nlattr **attrs)
 {
 	struct vdpasim *simdev;
 
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index b264c627e94b..7b84badc6741 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -6,6 +6,7 @@
 #include <linux/device.h>
 #include <linux/interrupt.h>
 #include <linux/vhost_iotlb.h>
+#include <net/genetlink.h>
 
 /**
  * vDPA callback definition.
@@ -354,6 +355,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
  *		@pdev: parent device to use for device addition
  *		@name: name of the new vdpa device
  *		@device_id: device id of the new vdpa device
+ *		@attrs: device specific attributes
  *		Driver need to add a new device using vdpa_register_device() after
  *		fully initializing the vdpa device. On successful addition driver
  *		must return a valid pointer of vdpa device or ERR_PTR for the error.
@@ -364,7 +366,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
  */
 struct vdpa_dev_ops {
 	struct vdpa_device* (*dev_add)(struct vdpa_parent_dev *pdev, const char *name,
-				       u32 device_id);
+				       u32 device_id, struct nlattr **attrs);
 	void (*dev_del)(struct vdpa_parent_dev *pdev, struct vdpa_device *dev);
 };
 
-- 
2.11.0



* [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-19  5:07 ` [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
@ 2021-01-19  5:07   ` Xie Yongji
  2021-01-19 14:53     ` Jonathan Corbet
                       ` (3 more replies)
  2021-01-19  5:07   ` [RFC v3 09/11] vduse: Add VDUSE_GET_DEV ioctl Xie Yongji
                     ` (2 subsequent siblings)
  3 siblings, 4 replies; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  5:07 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This VDUSE driver enables implementing vDPA devices in userspace.
Both the control path and the data path of vDPA devices can be
handled in userspace.

In the control path, the VDUSE driver uses a message mechanism
to forward config operations from the vdpa bus driver to
userspace. Userspace can use read()/write() to receive and
reply to those control messages.

In the data path, the VDUSE_IOTLB_GET_FD ioctl is used to get
the file descriptors referring to the vDPA device's iova regions.
Userspace can then use mmap() to access those iova regions. In
addition, the eventfd mechanism is used to trigger interrupt
callbacks and receive virtqueue kicks in userspace.
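
To illustrate the kick side of the eventfd mechanism: the driver calls
eventfd_signal(vq->kickfd, 1), which adds 1 to the eventfd counter, and
the daemon reads the counter to learn that the vring needs servicing.
The sketch below shows only the userspace semantics of eventfd(2). The
sketch_kick() and sketch_drain() helpers are hypothetical names, and the
write() stands in for the kernel-side signal.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Add 1 to the eventfd counter; stands in for the kernel-side signal. */
static int sketch_kick(int fd)
{
	uint64_t one = 1;

	assert(fd >= 0);
	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Read and reset the counter of a non-blocking eventfd.
 * Returns the number of pending kicks, 0 if none, -1 on error.
 */
static int64_t sketch_drain(int fd)
{
	uint64_t count;
	ssize_t n;

	assert(fd >= 0);
	n = read(fd, &count, sizeof(count));
	if (n == (ssize_t)sizeof(count))
		return (int64_t)count;
	return (n < 0 && errno == EAGAIN) ? 0 : -1;
}
```

Because a plain (non-semaphore) eventfd accumulates writes into one
counter, several kicks that arrive before the daemon runs are seen as a
single read, so the daemon should process the vring until it is empty
rather than handle one request per wakeup.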

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
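The bounce-buffering rule used by vduse_domain_map_page() and
vduse_domain_unmap_page() in this patch can be summarised in a small
userspace model. This is a sketch under simplifying assumptions: the
sketch_* names are hypothetical, the bounce buffer is supplied by the
caller instead of being allocated lazily page by page, and no locking is
modelled. Only the copy directions follow the patch: copy in at map time
for TO_DEVICE or BIDIRECTIONAL, copy back at unmap time for FROM_DEVICE
or BIDIRECTIONAL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_dir {
	SKETCH_TO_DEVICE,
	SKETCH_FROM_DEVICE,
	SKETCH_BIDIRECTIONAL,
};

struct sketch_mapping {
	unsigned char *orig;	/* original buffer, never exposed */
	unsigned char *bounce;	/* copy that the daemon may access */
	size_t size;
	enum sketch_dir dir;
};

/* Map: data headed for the device is copied into the bounce buffer. */
static void sketch_map(struct sketch_mapping *m, unsigned char *orig,
		       unsigned char *bounce, size_t size,
		       enum sketch_dir dir)
{
	assert(m && orig && bounce);
	m->orig = orig;
	m->bounce = bounce;
	m->size = size;
	m->dir = dir;
	if (dir == SKETCH_TO_DEVICE || dir == SKETCH_BIDIRECTIONAL)
		memcpy(bounce, orig, size);
}

/* Unmap: data produced by the device is copied back to the original. */
static void sketch_unmap(struct sketch_mapping *m)
{
	assert(m && m->orig && m->bounce);
	if (m->dir == SKETCH_FROM_DEVICE || m->dir == SKETCH_BIDIRECTIONAL)
		memcpy(m->orig, m->bounce, m->size);
}
```

With this rule, writes that the daemon makes to a TO_DEVICE mapping never
reach the original buffer, and the daemon only ever sees the bytes that
were explicitly copied into the bounce buffer.
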
 Documentation/driver-api/vduse.rst                 |   85 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |    7 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
 drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
 drivers/vdpa/vdpa_user/vduse.h                     |   62 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1217 ++++++++++++++++++++
 include/uapi/linux/vdpa.h                          |    1 +
 include/uapi/linux/vduse.h                         |  125 ++
 13 files changed, 2267 insertions(+)
 create mode 100644 Documentation/driver-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
new file mode 100644
index 000000000000..9418a7f6646b
--- /dev/null
+++ b/Documentation/driver-api/vduse.rst
@@ -0,0 +1,85 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+A vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications, with a
+vendor-specific control path. vDPA devices can be either physically
+located on the hardware or emulated by software. VDUSE is a framework that
+makes it possible to implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+---------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the VDUSE character device (/dev/vduse). Then a file descriptor pointing
+to the new resources will be returned, which can be used to implement the
+userspace vDPA device's control path and data path.
+
+To implement the control path, read/write operations on the file descriptor
+are used to receive control messages from, and reply to, the VDUSE driver.
+Those control messages are mostly based on the vdpa_config_ops, which defines
+a unified interface to control different types of vDPA device.
+
+The following types of messages are provided by the VDUSE framework now:
+
+- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
+
+- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
+
+Please see include/linux/vdpa.h for details.
+
+In the data path, vDPA device's iova regions will be mapped into userspace with
+the help of VDUSE_IOTLB_GET_FD ioctl on the userspace vDPA device fd:
+
+- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
+  access this iova region by passing the fd to mmap(2).
+
+Besides, the eventfd mechanism is used to trigger interrupt callbacks and
+receive virtqueue kicks in userspace. The following ioctls on the userspace
+vDPA device fd are provided to support that:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue; this eventfd is used
+  by the VDUSE driver to notify userspace to consume the vring.
+
+- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue; this eventfd is used
+  by userspace to notify the VDUSE driver to trigger interrupt callbacks.
+
+MMU-based IOMMU Driver
+----------------------
+In the virtio-vdpa case, the VDUSE framework implements an MMU-based on-chip
+IOMMU driver to support mapping the kernel dma buffer into the userspace iova
+region dynamically.
+
+The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
+The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
+so that the userspace process is able to use its virtual address to access
+the dma buffer in kernel.
+
+To avoid security issues, a bounce-buffering mechanism is introduced to
+prevent userspace from directly accessing the original buffer, which may
+contain other kernel data. During mapping and unmapping, the driver copies
+data from the original buffer to the bounce buffer and back, depending on the
+direction of the transfer. The bounce-buffer addresses are mapped into the
+user address space instead of the original ones.
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a4c75a28c839..71722e6f8f23 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
 'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
 '|'   00-7F  linux/media.h
 0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
 0x89  00-06  arch/x86/include/asm/sockios.h
 0x89  0B-DF  linux/sockios.h
 0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index 4be7be39be26..667354309bf4 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -21,6 +21,13 @@ config VDPA_SIM
 	  to RX. This device is used for testing, prototyping and
 	  development of vDPA.
 
+config VDPA_USER
+	tristate "VDUSE (vDPA Device in Userspace) support"
+	depends on EVENTFD && MMU && HAS_DMA
+	help
+	  With VDUSE it is possible to emulate a vDPA Device
+	  in a userspace program.
+
 config IFCVF
 	tristate "Intel IFC VF vDPA driver"
 	depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index d160e9b63a66..66e97778ad03 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
 obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
 obj-$(CONFIG_IFCVF)    += ifcvf/
 obj-$(CONFIG_MLX5_VDPA) += mlx5/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..b7645e36992b
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o eventfd.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
new file mode 100644
index 000000000000..dbffddb08908
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <uapi/linux/vduse.h>
+
+#include "eventfd.h"
+
+static struct workqueue_struct *vduse_irqfd_cleanup_wq;
+
+static void vduse_virqfd_shutdown(struct work_struct *work)
+{
+	u64 cnt;
+	struct vduse_virqfd *virqfd = container_of(work,
+					struct vduse_virqfd, shutdown);
+
+	eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
+	flush_work(&virqfd->inject);
+	eventfd_ctx_put(virqfd->ctx);
+	kfree(virqfd);
+}
+
+static void vduse_virqfd_inject(struct work_struct *work)
+{
+	struct vduse_virqfd *virqfd = container_of(work,
+					struct vduse_virqfd, inject);
+	struct vduse_virtqueue *vq = virqfd->vq;
+
+	spin_lock_irq(&vq->irq_lock);
+	if (vq->ready && vq->cb)
+		vq->cb(vq->private);
+	spin_unlock_irq(&vq->irq_lock);
+}
+
+static void virqfd_deactivate(struct vduse_virqfd *virqfd)
+{
+	queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+				int sync, void *key)
+{
+	struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
+	struct vduse_virtqueue *vq = virqfd->vq;
+
+	__poll_t flags = key_to_poll(key);
+
+	if (flags & EPOLLIN)
+		schedule_work(&virqfd->inject);
+
+	if (flags & EPOLLHUP) {
+		spin_lock(&vq->irq_lock);
+		if (vq->virqfd == virqfd) {
+			vq->virqfd = NULL;
+			virqfd_deactivate(virqfd);
+		}
+		spin_unlock(&vq->irq_lock);
+	}
+
+	return 0;
+}
+
+static void vduse_virqfd_ptable_queue_proc(struct file *file,
+			wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
+
+	add_wait_queue(wqh, &virqfd->wait);
+}
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct vduse_virqfd *virqfd;
+	struct fd irqfd;
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+	__poll_t events;
+	int ret;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+	if (!virqfd)
+		return -ENOMEM;
+
+	INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
+	INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
+
+	ret = -EBADF;
+	irqfd = fdget(eventfd->fd);
+	if (!irqfd.file)
+		goto err_fd;
+
+	ctx = eventfd_ctx_fileget(irqfd.file);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
+		goto err_ctx;
+	}
+
+	virqfd->vq = vq;
+	virqfd->ctx = ctx;
+	spin_lock(&vq->irq_lock);
+	if (vq->virqfd)
+		virqfd_deactivate(vq->virqfd);
+	vq->virqfd = virqfd;
+	spin_unlock(&vq->irq_lock);
+
+	init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
+	init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
+
+	events = vfs_poll(irqfd.file, &virqfd->pt);
+
+	/*
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered and trigger it as if we didn't miss it.
+	 */
+	if (events & EPOLLIN)
+		schedule_work(&virqfd->inject);
+
+	fdput(irqfd);
+
+	return 0;
+err_ctx:
+	fdput(irqfd);
+err_fd:
+	kfree(virqfd);
+	return ret;
+}
+
+void vduse_virqfd_release(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		if (vq->virqfd) {
+			virqfd_deactivate(vq->virqfd);
+			vq->virqfd = NULL;
+		}
+		spin_unlock(&vq->irq_lock);
+	}
+	flush_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+int vduse_virqfd_init(void)
+{
+	vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
+						WQ_UNBOUND, 0);
+	if (!vduse_irqfd_cleanup_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void vduse_virqfd_exit(void)
+{
+	destroy_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+void vduse_vq_kick(struct vduse_virtqueue *vq)
+{
+	spin_lock(&vq->kick_lock);
+	if (vq->ready && vq->kickfd)
+		eventfd_signal(vq->kickfd, 1);
+	spin_unlock(&vq->kick_lock);
+}
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	ctx = eventfd_ctx_fdget(eventfd->fd);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	spin_lock(&vq->kick_lock);
+	if (vq->kickfd)
+		eventfd_ctx_put(vq->kickfd);
+	vq->kickfd = ctx;
+	spin_unlock(&vq->kick_lock);
+
+	return 0;
+}
+
+void vduse_kickfd_release(struct vduse_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->kick_lock);
+		if (vq->kickfd) {
+			eventfd_ctx_put(vq->kickfd);
+			vq->kickfd = NULL;
+		}
+		spin_unlock(&vq->kick_lock);
+	}
+}
diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
new file mode 100644
index 000000000000..14269ff27f47
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_EVENTFD_H
+#define _VDUSE_EVENTFD_H
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <uapi/linux/vduse.h>
+
+#include "vduse.h"
+
+struct vduse_dev;
+
+struct vduse_virqfd {
+	struct eventfd_ctx *ctx;
+	struct vduse_virtqueue *vq;
+	struct work_struct inject;
+	struct work_struct shutdown;
+	wait_queue_entry_t wait;
+	poll_table pt;
+};
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd);
+
+void vduse_virqfd_release(struct vduse_dev *dev);
+
+int vduse_virqfd_init(void);
+
+void vduse_virqfd_exit(void);
+
+void vduse_vq_kick(struct vduse_virtqueue *vq);
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd);
+
+void vduse_kickfd_release(struct vduse_dev *dev);
+
+#endif /* _VDUSE_EVENTFD_H */
diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..cdfef8e9f9d6
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,426 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+
+#include "iova_domain.h"
+
+#define IOVA_START_PFN 1
+#define IOVA_ALLOC_ORDER 12
+#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
+
+#define CONSISTENT_DMA_SIZE (1024 * 1024 * 1024)
+
+static inline struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long index = iova >> PAGE_SHIFT;
+
+	return domain->bounce_pages[index];
+}
+
+static inline void
+vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
+				unsigned long iova, struct page *page)
+{
+	unsigned long index = iova >> PAGE_SHIFT;
+
+	domain->bounce_pages[index] = page;
+}
+
+static struct vduse_iova_map *
+vduse_domain_alloc_iova_map(struct vduse_iova_domain *domain,
+			unsigned long iova, unsigned long orig,
+			size_t size, enum dma_data_direction dir)
+{
+	struct vduse_iova_map *map;
+
+	map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
+	if (!map)
+		return NULL;
+
+	map->iova.start = iova;
+	map->iova.last = iova + size - 1;
+	map->orig = orig;
+	map->size = size;
+	map->dir = dir;
+
+	return map;
+}
+
+static struct page *
+vduse_domain_get_mapping_page(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long start = iova & PAGE_MASK;
+	unsigned long last = start + PAGE_SIZE - 1;
+	struct vduse_iova_map *map;
+	struct interval_tree_node *node;
+	struct page *page = NULL;
+
+	spin_lock(&domain->map_lock);
+	node = interval_tree_iter_first(&domain->mappings, start, last);
+	if (!node)
+		goto out;
+
+	map = container_of(node, struct vduse_iova_map, iova);
+	page = virt_to_page(map->orig + iova - map->iova.start);
+	get_page(page);
+out:
+	spin_unlock(&domain->map_lock);
+
+	return page;
+}
+
+static struct page *
+vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain,
+				unsigned long iova)
+{
+	unsigned long start = iova & PAGE_MASK;
+	unsigned long last = start + PAGE_SIZE - 1;
+	struct vduse_iova_map *map;
+	struct interval_tree_node *node;
+	struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
+
+	if (!new_page)
+		return NULL;
+
+	spin_lock(&domain->map_lock);
+	node = interval_tree_iter_first(&domain->mappings, start, last);
+	if (!node) {
+		__free_page(new_page);
+		goto out;
+	}
+	page = vduse_domain_get_bounce_page(domain, iova);
+	if (page) {
+		get_page(page);
+		__free_page(new_page);
+		goto out;
+	}
+	vduse_domain_set_bounce_page(domain, iova, new_page);
+	get_page(new_page);
+	page = new_page;
+
+	while (node) {
+		unsigned int src_offset = 0, dst_offset = 0;
+		void *src, *dst;
+		size_t copy_len;
+
+		map = container_of(node, struct vduse_iova_map, iova);
+		node = interval_tree_iter_next(node, start, last);
+		if (map->dir == DMA_FROM_DEVICE)
+			continue;
+
+		if (start > map->iova.start)
+			src_offset = start - map->iova.start;
+		else
+			dst_offset = map->iova.start - start;
+
+		src = (void *)(map->orig + src_offset);
+		dst = page_address(page) + dst_offset;
+		copy_len = min_t(size_t, map->size - src_offset,
+				PAGE_SIZE - dst_offset);
+		memcpy(dst, src, copy_len);
+	}
+out:
+	spin_unlock(&domain->map_lock);
+
+	return page;
+}
+
+static void
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
+				unsigned long iova, size_t size)
+{
+	struct page *page;
+	struct interval_tree_node *node;
+	unsigned long last = iova + size - 1;
+
+	spin_lock(&domain->map_lock);
+	node = interval_tree_iter_first(&domain->mappings, iova, last);
+	if (WARN_ON(node))
+		goto out;
+
+	while (size > 0) {
+		page = vduse_domain_get_bounce_page(domain, iova);
+		if (page) {
+			vduse_domain_set_bounce_page(domain, iova, NULL);
+			__free_page(page);
+		}
+		size -= PAGE_SIZE;
+		iova += PAGE_SIZE;
+	}
+out:
+	spin_unlock(&domain->map_lock);
+}
+
+static void vduse_domain_bounce(struct vduse_iova_domain *domain,
+				unsigned long iova, unsigned long orig,
+				size_t size, enum dma_data_direction dir)
+{
+	unsigned int offset = offset_in_page(iova);
+
+	while (size) {
+		struct page *p = vduse_domain_get_bounce_page(domain, iova);
+		size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
+		void *addr;
+
+		WARN_ON(!p && dir == DMA_FROM_DEVICE);
+
+		if (p) {
+			addr = page_address(p) + offset;
+			if (dir == DMA_TO_DEVICE)
+				memcpy(addr, (void *)orig, copy_len);
+			else if (dir == DMA_FROM_DEVICE)
+				memcpy((void *)orig, addr, copy_len);
+		}
+
+		size -= copy_len;
+		orig += copy_len;
+		iova += copy_len;
+		offset = 0;
+	}
+}
+
+static unsigned long vduse_domain_alloc_iova(struct iova_domain *iovad,
+				unsigned long size, unsigned long limit)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+	unsigned long iova_pfn;
+
+	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+		iova_len = roundup_pow_of_two(iova_len);
+	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
+
+	return iova_pfn << shift;
+}
+
+static void vduse_domain_free_iova(struct iova_domain *iovad,
+				unsigned long iova, size_t size)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+
+	free_iova_fast(iovad, iova >> shift, iova_len);
+}
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				struct page *page, unsigned long offset,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+	unsigned long limit = domain->bounce_size - 1;
+	unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
+	unsigned long orig = (unsigned long)page_address(page) + offset;
+	struct vduse_iova_map *map;
+
+	if (!iova)
+		return DMA_MAPPING_ERROR;
+
+	map = vduse_domain_alloc_iova_map(domain, iova, orig, size, dir);
+	if (!map) {
+		vduse_domain_free_iova(iovad, iova, size);
+		return DMA_MAPPING_ERROR;
+	}
+
+	spin_lock(&domain->map_lock);
+	interval_tree_insert(&map->iova, &domain->mappings);
+	spin_unlock(&domain->map_lock);
+
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, iova, orig, size, DMA_TO_DEVICE);
+
+	return (dma_addr_t)iova;
+}
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			dma_addr_t dma_addr, size_t size,
+			enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+	unsigned long iova = (unsigned long)dma_addr;
+	struct interval_tree_node *node;
+	struct vduse_iova_map *map;
+
+	spin_lock(&domain->map_lock);
+	node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
+	if (WARN_ON(!node)) {
+		spin_unlock(&domain->map_lock);
+		return;
+	}
+	interval_tree_remove(node, &domain->mappings);
+	spin_unlock(&domain->map_lock);
+
+	map = container_of(node, struct vduse_iova_map, iova);
+	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, iova, map->orig,
+					size, DMA_FROM_DEVICE);
+	vduse_domain_free_iova(iovad, iova, size);
+	kfree(map);
+}
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				size_t size, dma_addr_t *dma_addr,
+				gfp_t flag, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	unsigned long limit = domain->bounce_size + CONSISTENT_DMA_SIZE - 1;
+	unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
+	void *orig = alloc_pages_exact(size, flag);
+	struct vduse_iova_map *map;
+
+	if (!iova || !orig)
+		goto err;
+
+	map = vduse_domain_alloc_iova_map(domain, iova, (unsigned long)orig,
+					size, DMA_BIDIRECTIONAL);
+	if (!map)
+		goto err;
+
+	spin_lock(&domain->map_lock);
+	interval_tree_insert(&map->iova, &domain->mappings);
+	spin_unlock(&domain->map_lock);
+	*dma_addr = (dma_addr_t)iova;
+
+	return orig;
+err:
+	*dma_addr = DMA_MAPPING_ERROR;
+	if (orig)
+		free_pages_exact(orig, size);
+	if (iova)
+		vduse_domain_free_iova(iovad, iova, size);
+
+	return NULL;
+}
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	unsigned long iova = (unsigned long)dma_addr;
+	struct interval_tree_node *node;
+	struct vduse_iova_map *map;
+
+	spin_lock(&domain->map_lock);
+	node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
+	if (WARN_ON(!node)) {
+		spin_unlock(&domain->map_lock);
+		return;
+	}
+	interval_tree_remove(node, &domain->mappings);
+	spin_unlock(&domain->map_lock);
+
+	map = container_of(node, struct vduse_iova_map, iova);
+	vduse_domain_free_iova(iovad, iova, size);
+	free_pages_exact(vaddr, size);
+	kfree(map);
+}
+
+static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
+{
+	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
+	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
+	struct page *page;
+
+	if (!domain)
+		return VM_FAULT_SIGBUS;
+
+	if (iova < domain->bounce_size)
+		page = vduse_domain_alloc_bounce_page(domain, iova);
+	else
+		page = vduse_domain_get_mapping_page(domain, iova);
+
+	if (!page)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = page;
+
+	return 0;
+}
+
+static const struct vm_operations_struct vduse_domain_mmap_ops = {
+	.fault = vduse_domain_mmap_fault,
+};
+
+static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_private_data = domain;
+	vma->vm_ops = &vduse_domain_mmap_ops;
+
+	return 0;
+}
+
+static int vduse_domain_release(struct inode *inode, struct file *file)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
+	put_iova_domain(&domain->stream_iovad);
+	put_iova_domain(&domain->consistent_iovad);
+	vfree(domain->bounce_pages);
+	kfree(domain);
+
+	return 0;
+}
+
+static const struct file_operations vduse_domain_fops = {
+	.mmap = vduse_domain_mmap,
+	.release = vduse_domain_release,
+};
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain)
+{
+	fput(domain->file);
+}
+
+struct vduse_iova_domain *vduse_domain_create(size_t bounce_size)
+{
+	struct vduse_iova_domain *domain;
+	struct file *file;
+	unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->bounce_size = PAGE_ALIGN(bounce_size);
+	domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
+	if (!domain->bounce_pages)
+		goto err_page;
+
+	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
+				domain, O_RDWR);
+	if (IS_ERR(file))
+		goto err_file;
+
+	domain->file = file;
+	spin_lock_init(&domain->map_lock);
+	domain->mappings = RB_ROOT_CACHED;
+	init_iova_domain(&domain->stream_iovad,
+			IOVA_ALLOC_SIZE, IOVA_START_PFN);
+	init_iova_domain(&domain->consistent_iovad,
+			PAGE_SIZE, bounce_pfns);
+
+	return domain;
+err_file:
+	vfree(domain->bounce_pages);
+err_page:
+	kfree(domain);
+	return NULL;
+}
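
Reviewer note on vduse_domain_alloc_iova() above: the size handed to alloc_iova_fast() is not the raw byte count. It is first aligned to the domain granule and converted to a granule count, and small requests are then rounded up to a power of two so they fit the per-size range caches. The stand-alone sketch below mirrors only that arithmetic in userspace C. The names sketch_iova_len(), sketch_roundup_pow2() and SKETCH_RANGE_CACHE_MAX_SIZE are hypothetical, and the value 6 is an assumption about the kernel's IOVA_RANGE_CACHE_MAX_SIZE that should be checked against include/linux/iova.h.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed stand-in for the kernel's IOVA_RANGE_CACHE_MAX_SIZE (log2 of the
 * largest cached range, in granules). Hypothetical name; verify the value.
 */
#define SKETCH_RANGE_CACHE_MAX_SIZE 6

/* Smallest power of two that is >= n (n must be non-zero). */
static uint64_t sketch_roundup_pow2(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Number of granules requested for a mapping of `size` bytes when the
 * granule is (1 << shift) bytes. Mirrors the length computation in
 * vduse_domain_alloc_iova(); it does not allocate anything.
 */
static uint64_t sketch_iova_len(uint64_t size, unsigned int shift)
{
	uint64_t granule, len;

	assert(shift < 32 && size > 0);
	granule = 1ULL << shift;

	/* Align up to the granule, then convert bytes to granules. */
	len = (size + granule - 1) >> shift;

	/* Small requests are rounded up to a power of two. */
	if (len < (1ULL << (SKETCH_RANGE_CACHE_MAX_SIZE - 1)))
		len = sketch_roundup_pow2(len);

	return len;
}
```

With a 4 KiB granule (shift 12), a three-page request is rounded up to four granules under these assumptions, while a 33-page request is left at 33. The free path must compute the same aligned length from the same size for the allocation and release to match.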
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
new file mode 100644
index 000000000000..cc61866acb56
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_IOVA_DOMAIN_H
+#define _VDUSE_IOVA_DOMAIN_H
+
+#include <linux/iova.h>
+#include <linux/interval_tree.h>
+#include <linux/dma-mapping.h>
+
+struct vduse_iova_map {
+	struct interval_tree_node iova;
+	unsigned long orig;
+	size_t size;
+	enum dma_data_direction dir;
+};
+
+struct vduse_iova_domain {
+	struct iova_domain stream_iovad;
+	struct iova_domain consistent_iovad;
+	struct page **bounce_pages;
+	size_t bounce_size;
+	struct rb_root_cached mappings;
+	spinlock_t map_lock;
+	struct file *file;
+};
+
+static inline struct file *
+vduse_domain_file(struct vduse_iova_domain *domain)
+{
+	return domain->file;
+}
+
+static inline unsigned long
+vduse_domain_get_offset(struct vduse_iova_domain *domain, unsigned long iova)
+{
+	return iova;
+}
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				struct page *page, unsigned long offset,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs);
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			dma_addr_t dma_addr, size_t size,
+			enum dma_data_direction dir, unsigned long attrs);
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				size_t size, dma_addr_t *dma_addr,
+				gfp_t flag, unsigned long attrs);
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs);
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain);
+
+struct vduse_iova_domain *vduse_domain_create(size_t bounce_size);
+
+#endif /* _VDUSE_IOVA_DOMAIN_H */
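
Reviewer note on the mmap layout: vduse_domain_get_offset() is the identity, so the file offset of the domain file equals the IOVA. The fault handler in iova_domain.c then treats offsets below bounce_size as bounce pages and everything above as a lookup in the coherent mappings. A minimal userspace sketch of that decision follows. The enum and function names are hypothetical, and the page shift is passed in as a parameter rather than taken from PAGE_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the two regions of the domain file. */
enum sketch_region {
	SKETCH_REGION_BOUNCE,	/* served from the bounce page array */
	SKETCH_REGION_COHERENT,	/* served from a coherent mapping lookup */
};

/*
 * Classify a faulting page offset the way vduse_domain_mmap_fault() does:
 * convert the page offset to an IOVA and compare it with the size of the
 * bounce region.
 */
static enum sketch_region sketch_classify(uint64_t pgoff,
					  unsigned int page_shift,
					  uint64_t bounce_size)
{
	uint64_t iova;

	assert(page_shift < 32);
	iova = pgoff << page_shift;

	return iova < bounce_size ? SKETCH_REGION_BOUNCE
				  : SKETCH_REGION_COHERENT;
}
```

For a 1 MiB bounce region with 4 KiB pages, page offsets 0 through 255 land in the bounce region and offset 256 is the first one handled by the coherent lookup.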
diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
new file mode 100644
index 000000000000..3566d229382e
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_H
+#define _VDUSE_H
+
+#include <linux/eventfd.h>
+#include <linux/wait.h>
+#include <linux/vdpa.h>
+
+#include "iova_domain.h"
+#include "eventfd.h"
+
+struct vduse_virtqueue {
+	u16 index;
+	bool ready;
+	spinlock_t kick_lock;
+	spinlock_t irq_lock;
+	struct eventfd_ctx *kickfd;
+	struct vduse_virqfd *virqfd;
+	void *private;
+	irqreturn_t (*cb)(void *data);
+};
+
+struct vduse_dev;
+
+struct vduse_vdpa {
+	struct vdpa_device vdpa;
+	struct vduse_dev *dev;
+};
+
+struct vduse_dev {
+	struct vduse_vdpa *vdev;
+	struct mutex lock;
+	struct vduse_virtqueue *vqs;
+	struct vduse_iova_domain *domain;
+	struct vhost_iotlb *iommu;
+	spinlock_t iommu_lock;
+	atomic_t bounce_map;
+	spinlock_t msg_lock;
+	atomic64_t msg_unique;
+	wait_queue_head_t waitq;
+	struct list_head send_list;
+	struct list_head recv_list;
+	struct list_head list;
+	bool connected;
+	u32 id;
+	u16 vq_size_max;
+	u16 vq_num;
+	u32 vq_align;
+	u32 device_id;
+	u32 vendor_id;
+};
+
+#endif /* _VDUSE_H */
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index 000000000000..1cf759bc5914
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1217 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/dma-map-ops.h>
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/vdpa.h>
+#include <uapi/linux/vduse.h>
+#include <uapi/linux/vdpa.h>
+#include <uapi/linux/virtio_config.h>
+#include <linux/mod_devicetable.h>
+
+#include "vduse.h"
+
+#define DRV_VERSION  "1.0"
+#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
+#define DRV_DESC     "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+struct vduse_dev_msg {
+	struct vduse_dev_request req;
+	struct vduse_dev_response resp;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	bool completed;
+	refcount_t refcnt;
+};
+
+static struct workqueue_struct *vduse_vdpa_wq;
+static DEFINE_MUTEX(vduse_lock);
+static LIST_HEAD(vduse_devs);
+
+static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
+{
+	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
+
+	return vdev->dev;
+}
+
+static inline struct vduse_dev *dev_to_vduse(struct device *dev)
+{
+	struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+	return vdpa_to_vduse(vdpa);
+}
+
+static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
+{
+	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
+					GFP_KERNEL | __GFP_NOFAIL);
+
+	msg->req.type = type;
+	msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	init_waitqueue_head(&msg->waitq);
+	refcount_set(&msg->refcnt, 1);
+
+	return msg;
+}
+
+static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
+{
+	refcount_inc(&msg->refcnt);
+}
+
+static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
+{
+	if (refcount_dec_and_test(&msg->refcnt))
+		kfree(msg);
+}
+
+static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
+						struct list_head *head,
+						uint32_t unique)
+{
+	struct vduse_dev_msg *tmp, *msg = NULL;
+
+	spin_lock(&dev->msg_lock);
+	list_for_each_entry(tmp, head, list) {
+		if (tmp->req.unique == unique) {
+			msg = tmp;
+			list_del(&tmp->list);
+			break;
+		}
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return msg;
+}
+
+static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
+						struct list_head *head)
+{
+	struct vduse_dev_msg *msg = NULL;
+
+	spin_lock(&dev->msg_lock);
+	if (!list_empty(head)) {
+		msg = list_first_entry(head, struct vduse_dev_msg, list);
+		list_del(&msg->list);
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return msg;
+}
+
+static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
+			struct vduse_dev_msg *msg, struct list_head *head)
+{
+	spin_lock(&dev->msg_lock);
+	list_add_tail(&msg->list, head);
+	spin_unlock(&dev->msg_lock);
+}
+
+static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
+{
+	int ret;
+
+	vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+	wake_up(&dev->waitq);
+	wait_event(msg->waitq, msg->completed);
+	/* coupled with smp_wmb() in vduse_dev_msg_complete() */
+	smp_rmb();
+	ret = msg->resp.result;
+
+	return ret;
+}
+
+static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
+					struct vduse_dev_response *resp)
+{
+	vduse_dev_msg_get(msg);
+	memcpy(&msg->resp, resp, sizeof(*resp));
+	/* coupled with smp_rmb() in vduse_dev_msg_sync() */
+	smp_wmb();
+	msg->completed = true;
+	wake_up(&msg->waitq);
+	vduse_dev_msg_put(msg);
+}
+
+static u64 vduse_dev_get_features(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
+	u64 features;
+
+	vduse_dev_msg_sync(dev, msg);
+	features = msg->resp.features;
+	vduse_dev_msg_put(msg);
+
+	return features;
+}
+
+static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
+	int ret;
+
+	msg->req.size = sizeof(features);
+	msg->req.features = features;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static u8 vduse_dev_get_status(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
+	u8 status;
+
+	vduse_dev_msg_sync(dev, msg);
+	status = msg->resp.status;
+	vduse_dev_msg_put(msg);
+
+	return status;
+}
+
+static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
+
+	msg->req.size = sizeof(status);
+	msg->req.status = status;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
+					void *buf, unsigned int len)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
+
+	WARN_ON(len > sizeof(msg->req.config.data));
+
+	msg->req.size = sizeof(struct vduse_dev_config_data);
+	msg->req.config.offset = offset;
+	msg->req.config.len = len;
+	vduse_dev_msg_sync(dev, msg);
+	memcpy(buf, msg->resp.config.data, len);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
+					const void *buf, unsigned int len)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
+
+	WARN_ON(len > sizeof(msg->req.config.data));
+
+	msg->req.size = sizeof(struct vduse_dev_config_data);
+	msg->req.config.offset = offset;
+	msg->req.config.len = len;
+	memcpy(msg->req.config.data, buf, len);
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static void vduse_dev_set_vq_num(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, u32 num)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
+
+	msg->req.size = sizeof(struct vduse_vq_num);
+	msg->req.vq_num.index = vq->index;
+	msg->req.vq_num.num = num;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, u64 desc_addr,
+				u64 driver_addr, u64 device_addr)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
+	int ret;
+
+	msg->req.size = sizeof(struct vduse_vq_addr);
+	msg->req.vq_addr.index = vq->index;
+	msg->req.vq_addr.desc_addr = desc_addr;
+	msg->req.vq_addr.driver_addr = driver_addr;
+	msg->req.vq_addr.device_addr = device_addr;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, bool ready)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
+
+	msg->req.size = sizeof(struct vduse_vq_ready);
+	msg->req.vq_ready.index = vq->index;
+	msg->req.vq_ready.ready = ready;
+
+	vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+}
+
+static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
+				   struct vduse_virtqueue *vq)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
+	bool ready;
+
+	msg->req.size = sizeof(struct vduse_vq_ready);
+	msg->req.vq_ready.index = vq->index;
+
+	vduse_dev_msg_sync(dev, msg);
+	ready = msg->resp.vq_ready.ready;
+	vduse_dev_msg_put(msg);
+
+	return ready;
+}
+
+static int vduse_dev_get_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE);
+	int ret;
+
+	msg->req.size = sizeof(struct vduse_vq_state);
+	msg->req.vq_state.index = vq->index;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	state->avail_index = msg->resp.vq_state.avail_idx;
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static int vduse_dev_set_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE);
+	int ret;
+
+	msg->req.size = sizeof(struct vduse_vq_state);
+	msg->req.vq_state.index = vq->index;
+	msg->req.vq_state.avail_idx = state->avail_index;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+					u64 start, u64 last)
+{
+	struct vduse_dev_msg *msg;
+	int ret;
+
+	if (last < start)
+		return -EINVAL;
+
+	msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
+	msg->req.size = sizeof(struct vduse_iova_range);
+	msg->req.iova.start = start;
+	msg->req.iova.last = last;
+
+	ret = vduse_dev_msg_sync(dev, msg);
+	vduse_dev_msg_put(msg);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int size = sizeof(struct vduse_dev_request);
+	ssize_t ret = 0;
+
+	if (iov_iter_count(to) < size)
+		return 0;
+
+	while (1) {
+		msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
+		if (msg)
+			break;
+
+		if (file->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+
+		ret = wait_event_interruptible_exclusive(dev->waitq,
+					!list_empty(&dev->send_list));
+		if (ret)
+			return ret;
+	}
+	ret = copy_to_iter(&msg->req, size, to);
+	if (ret != size) {
+		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+		return -EFAULT;
+	}
+	vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_response resp;
+	struct vduse_dev_msg *msg;
+	size_t ret;
+
+	ret = copy_from_iter(&resp, sizeof(resp), from);
+	if (ret != sizeof(resp))
+		return -EINVAL;
+
+	msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
+	if (!msg)
+		return -EINVAL;
+
+	vduse_dev_msg_complete(msg, &resp);
+
+	return ret;
+}
+
+static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
+{
+	struct vduse_dev *dev = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &dev->waitq, wait);
+
+	if (!list_empty(&dev->send_list))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int vduse_iotlb_add_range(struct vduse_dev *dev,
+				 u64 start, u64 last,
+				 u64 addr, unsigned int perm,
+				 struct file *file, u64 offset)
+{
+	struct vhost_iotlb_file *iotlb_file;
+	int ret;
+
+	iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_ATOMIC);
+	if (!iotlb_file)
+		return -ENOMEM;
+
+	iotlb_file->file = get_file(file);
+	iotlb_file->offset = offset;
+
+	spin_lock(&dev->iommu_lock);
+	ret = vhost_iotlb_add_range(dev->iommu, start, last,
+					addr, perm, iotlb_file);
+	spin_unlock(&dev->iommu_lock);
+	if (ret) {
+		fput(iotlb_file->file);
+		kfree(iotlb_file);
+		return ret;
+	}
+	return 0;
+}
+
+static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
+{
+	struct vhost_iotlb_file *iotlb_file;
+	struct vhost_iotlb_map *map;
+
+	spin_lock(&dev->iommu_lock);
+	while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
+		iotlb_file = (struct vhost_iotlb_file *)map->opaque;
+		fput(iotlb_file->file);
+		kfree(iotlb_file);
+		vhost_iotlb_map_free(dev->iommu, map);
+	}
+	spin_unlock(&dev->iommu_lock);
+}
+
+static void vduse_dev_reset(struct vduse_dev *dev)
+{
+	int i;
+
+	atomic_set(&dev->bounce_map, 0);
+	vduse_iotlb_del_range(dev, 0ULL, 0ULL - 1);
+	vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		vq->ready = false;
+		vq->cb = NULL;
+		vq->private = NULL;
+		spin_unlock(&vq->irq_lock);
+	}
+}
+
+static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+				u64 desc_area, u64 driver_area,
+				u64 device_area)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_addr(dev, vq, desc_area,
+					driver_area, device_area);
+}
+
+static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_vq_kick(vq);
+}
+
+static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+			      struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->cb = cb->callback;
+	vq->private = cb->private;
+}
+
+static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_num(dev, vq, num);
+}
+
+static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
+					u16 idx, bool ready)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_ready(dev, vq, ready);
+	vq->ready = ready;
+}
+
+static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->ready = vduse_dev_get_vq_ready(dev, vq);
+
+	return vq->ready;
+}
+
+static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_state(dev, vq, state);
+}
+
+static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_get_vq_state(dev, vq, state);
+}
+
+static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_align;
+}
+
+static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
+
+	return (vduse_dev_get_features(dev) | fixed);
+}
+
+static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_set_features(dev, features);
+}
+
+static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
+				  struct vdpa_callback *cb)
+{
+	/* We don't support config interrupt */
+}
+
+static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_size_max;
+}
+
+static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->device_id;
+}
+
+static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vendor_id;
+}
+
+static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_get_status(dev);
+}
+
+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	if (status == 0)
+		vduse_dev_reset(dev);
+	else
+		vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
+
+	vduse_dev_set_status(dev, status);
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
+			     void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_get_config(dev, offset, buf, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
+			const void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_set_config(dev, offset, buf, len);
+}
+
+static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
+				struct vhost_iotlb *iotlb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vhost_iotlb_map *map;
+	struct vhost_iotlb_file *iotlb_file;
+	u64 start = 0ULL, last = 0ULL - 1;
+	int ret = 0;
+
+	vduse_iotlb_del_range(dev, start, last);
+
+	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+		map = vhost_iotlb_itree_next(map, start, last)) {
+		if (!map->opaque)
+			continue;
+
+		iotlb_file = (struct vhost_iotlb_file *)map->opaque;
+		ret = vduse_iotlb_add_range(dev, map->start, map->last,
+					    map->addr, map->perm,
+					    iotlb_file->file,
+					    iotlb_file->offset);
+		if (ret)
+			break;
+	}
+	vduse_dev_update_iotlb(dev, start, last);
+
+	return ret;
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	WARN_ON(!list_empty(&dev->send_list));
+	WARN_ON(!list_empty(&dev->recv_list));
+	dev->vdev = NULL;
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_features		= vduse_vdpa_get_features,
+	.set_features		= vduse_vdpa_set_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.set_map		= vduse_vdpa_set_map,
+	.free			= vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+					unsigned long offset, size_t size,
+					enum dma_data_direction dir,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
+		vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
+				      0, VDUSE_ACCESS_RW,
+				      vduse_domain_file(domain),
+				      vduse_domain_get_offset(domain, 0))) {
+		atomic_set(&vdev->bounce_map, 0);
+		return DMA_MAPPING_ERROR;
+	}
+
+	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+					dma_addr_t *dma_addr, gfp_t flag,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	dma_addr_t iova;
+	void *addr;
+
+	*dma_addr = DMA_MAPPING_ERROR;
+	addr = vduse_domain_alloc_coherent(domain, size,
+				(dma_addr_t *)&iova, flag, attrs);
+	if (!addr)
+		return NULL;
+
+	if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
+				  iova, VDUSE_ACCESS_RW,
+				  vduse_domain_file(domain),
+				  vduse_domain_get_offset(domain, iova))) {
+		vduse_domain_free_coherent(domain, size, addr, iova, attrs);
+		return NULL;
+	}
+	*dma_addr = (dma_addr_t)iova;
+
+	return addr;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long start = (unsigned long)dma_addr;
+	unsigned long last = start + size - 1;
+
+	vduse_iotlb_del_range(vdev, start, last);
+	vduse_dev_update_iotlb(vdev, start, last);
+	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
+}
+
+static const struct dma_map_ops vduse_dev_dma_ops = {
+	.map_page = vduse_dev_map_page,
+	.unmap_page = vduse_dev_unmap_page,
+	.alloc = vduse_dev_alloc_coherent,
+	.free = vduse_dev_free_coherent,
+};
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+	unsigned int flags = 0;
+
+	switch (perm) {
+	case VDUSE_ACCESS_WO:
+		flags |= O_WRONLY;
+		break;
+	case VDUSE_ACCESS_RO:
+		flags |= O_RDONLY;
+		break;
+	case VDUSE_ACCESS_RW:
+		flags |= O_RDWR;
+		break;
+	default:
+		WARN(1, "invalid vhost IOTLB permission\n");
+		break;
+	}
+
+	return flags;
+}
+
+static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	struct vduse_dev *dev = file->private_data;
+	void __user *argp = (void __user *)arg;
+	int ret = -EINVAL;
+
+	mutex_lock(&dev->lock);
+	switch (cmd) {
+	case VDUSE_IOTLB_GET_FD: {
+		struct vduse_iotlb_entry entry;
+		struct vhost_iotlb_map *map;
+		struct vhost_iotlb_file *iotlb_file;
+		struct file *f = NULL;
+
+		ret = -EFAULT;
+		if (copy_from_user(&entry, argp, sizeof(entry)))
+			break;
+
+		spin_lock(&dev->iommu_lock);
+		map = vhost_iotlb_itree_first(dev->iommu, entry.start,
+					      entry.last);
+		if (map) {
+			iotlb_file = (struct vhost_iotlb_file *)map->opaque;
+			f = get_file(iotlb_file->file);
+			entry.offset = iotlb_file->offset;
+			entry.start = map->start;
+			entry.last = map->last;
+			entry.perm = map->perm;
+		}
+		spin_unlock(&dev->iommu_lock);
+		if (!f) {
+			ret = -EINVAL;
+			break;
+		}
+		if (copy_to_user(argp, &entry, sizeof(entry))) {
+			fput(f);
+			ret = -EFAULT;
+			break;
+		}
+		ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
+		if (ret < 0) {
+			fput(f);
+			break;
+		}
+		fd_install(ret, f);
+		break;
+	}
+	case VDUSE_VQ_SETUP_KICKFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_kickfd_setup(dev, &eventfd);
+		break;
+	}
+	case VDUSE_VQ_SETUP_IRQFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_virqfd_setup(dev, &eventfd);
+		break;
+	}
+	}
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static int vduse_dev_release(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = file->private_data;
+
+	vduse_kickfd_release(dev);
+	vduse_virqfd_release(dev);
+	dev->connected = false;
+
+	return 0;
+}
+
+static const struct file_operations vduse_dev_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vduse_dev_release,
+	.read_iter	= vduse_dev_read_iter,
+	.write_iter	= vduse_dev_write_iter,
+	.poll		= vduse_dev_poll,
+	.unlocked_ioctl	= vduse_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct vduse_dev *vduse_dev_create(void)
+{
+	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+
+	if (!dev)
+		return NULL;
+
+	dev->iommu = vhost_iotlb_alloc(2048, 0);
+	if (!dev->iommu) {
+		kfree(dev);
+		return NULL;
+	}
+
+	mutex_init(&dev->lock);
+	spin_lock_init(&dev->msg_lock);
+	INIT_LIST_HEAD(&dev->send_list);
+	INIT_LIST_HEAD(&dev->recv_list);
+	atomic64_set(&dev->msg_unique, 0);
+	spin_lock_init(&dev->iommu_lock);
+	atomic_set(&dev->bounce_map, 0);
+
+	init_waitqueue_head(&dev->waitq);
+
+	return dev;
+}
+
+static void vduse_dev_destroy(struct vduse_dev *dev)
+{
+	vhost_iotlb_free(dev->iommu);
+	mutex_destroy(&dev->lock);
+	kfree(dev);
+}
+
+static struct vduse_dev *vduse_find_dev(u32 id)
+{
+	struct vduse_dev *tmp, *dev = NULL;
+
+	list_for_each_entry(tmp, &vduse_devs, list) {
+		if (tmp->id == id) {
+			dev = tmp;
+			break;
+		}
+	}
+	return dev;
+}
+
+static int vduse_destroy_dev(u32 id)
+{
+	struct vduse_dev *dev = vduse_find_dev(id);
+
+	if (!dev)
+		return -EINVAL;
+
+	if (dev->vdev || dev->connected)
+		return -EBUSY;
+
+	list_del(&dev->list);
+	kfree(dev->vqs);
+	vduse_domain_destroy(dev->domain);
+	vduse_dev_destroy(dev);
+
+	return 0;
+}
+
+static int vduse_create_dev(struct vduse_dev_config *config)
+{
+	int i, fd = -ENOMEM;
+	struct vduse_dev *dev;
+	char name[64];
+
+	if (vduse_find_dev(config->id))
+		return -EEXIST;
+
+	dev = vduse_dev_create();
+	if (!dev)
+		return -ENOMEM;
+
+	dev->id = config->id;
+	dev->device_id = config->device_id;
+	dev->vendor_id = config->vendor_id;
+	dev->domain = vduse_domain_create(config->bounce_size);
+	if (!dev->domain)
+		goto err_domain;
+
+	dev->vq_align = config->vq_align;
+	dev->vq_size_max = config->vq_size_max;
+	dev->vq_num = config->vq_num;
+	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
+	if (!dev->vqs)
+		goto err_vqs;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		dev->vqs[i].index = i;
+		spin_lock_init(&dev->vqs[i].kick_lock);
+		spin_lock_init(&dev->vqs[i].irq_lock);
+	}
+
+	snprintf(name, sizeof(name), "[vduse-dev:%u]", config->id);
+	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		goto err_fd;
+
+	dev->connected = true;
+	list_add(&dev->list, &vduse_devs);
+
+	return fd;
+err_fd:
+	kfree(dev->vqs);
+err_vqs:
+	vduse_domain_destroy(dev->domain);
+err_domain:
+	vduse_dev_destroy(dev);
+	return fd;
+}
+
+static long vduse_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret;
+	void __user *argp = (void __user *)arg;
+
+	mutex_lock(&vduse_lock);
+	switch (cmd) {
+	case VDUSE_CREATE_DEV: {
+		struct vduse_dev_config config;
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, sizeof(config)))
+			break;
+
+		ret = vduse_create_dev(&config);
+		break;
+	}
+	case VDUSE_DESTROY_DEV:
+		ret = vduse_destroy_dev(arg);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vduse_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct miscdevice vduse_misc = {
+	.fops = &vduse_fops,
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vduse",
+};
+
+static void vduse_parent_release(struct device *dev)
+{
+}
+
+static struct device vduse_parent = {
+	.init_name = "vduse",
+	.release = vduse_parent_release,
+};
+
+static struct vdpa_parent_dev parent_dev;
+
+static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
+{
+	struct vduse_vdpa *vdev = dev->vdev;
+	int ret;
+
+	if (vdev)
+		return -EEXIST;
+
+	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
+				 &vduse_vdpa_config_ops,
+				 dev->vq_num, name, true);
+	if (IS_ERR(vdev))
+		return PTR_ERR(vdev);
+
+	vdev->dev = dev;
+	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
+	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
+	if (ret)
+		goto err;
+
+	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
+	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
+	vdev->vdpa.pdev = &parent_dev;
+
+	ret = _vdpa_register_device(&vdev->vdpa);
+	if (ret)
+		goto err;
+
+	dev->vdev = vdev;
+
+	return 0;
+err:
+	put_device(&vdev->vdpa.dev);
+	return ret;
+}
+
+static struct vdpa_device *vdpa_dev_add(struct vdpa_parent_dev *pdev,
+					const char *name, u32 device_id,
+					struct nlattr **attrs)
+{
+	u32 vduse_id;
+	struct vduse_dev *dev;
+	int ret = -EINVAL;
+
+	if (!attrs[VDPA_ATTR_BACKEND_ID])
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&vduse_lock);
+	vduse_id = nla_get_u32(attrs[VDPA_ATTR_BACKEND_ID]);
+	dev = vduse_find_dev(vduse_id);
+	if (!dev)
+		goto unlock;
+
+	if (dev->device_id != device_id)
+		goto unlock;
+
+	ret = vduse_dev_add_vdpa(dev, name);
+unlock:
+	mutex_unlock(&vduse_lock);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return &dev->vdev->vdpa;
+}
+
+static void vdpa_dev_del(struct vdpa_parent_dev *pdev, struct vdpa_device *dev)
+{
+	_vdpa_unregister_device(dev);
+}
+
+static const struct vdpa_dev_ops vdpa_dev_parent_ops = {
+	.dev_add = vdpa_dev_add,
+	.dev_del = vdpa_dev_del
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct vdpa_parent_dev parent_dev = {
+	.device = &vduse_parent,
+	.id_table = id_table,
+	.ops = &vdpa_dev_parent_ops,
+};
+
+static int vduse_parentdev_init(void)
+{
+	int ret;
+
+	ret = device_register(&vduse_parent);
+	if (ret)
+		return ret;
+
+	ret = vdpa_parentdev_register(&parent_dev);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	device_unregister(&vduse_parent);
+	return ret;
+}
+
+static void vduse_parentdev_exit(void)
+{
+	vdpa_parentdev_unregister(&parent_dev);
+	device_unregister(&vduse_parent);
+}
+
+static int vduse_init(void)
+{
+	int ret;
+
+	ret = misc_register(&vduse_misc);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
+	vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
+	if (!vduse_vdpa_wq)
+		goto err_vdpa_wq;
+
+	ret = vduse_virqfd_init();
+	if (ret)
+		goto err_irqfd;
+
+	ret = vduse_parentdev_init();
+	if (ret)
+		goto err_parentdev;
+
+	return 0;
+err_parentdev:
+	vduse_virqfd_exit();
+err_irqfd:
+	destroy_workqueue(vduse_vdpa_wq);
+err_vdpa_wq:
+	misc_deregister(&vduse_misc);
+	return ret;
+}
+module_init(vduse_init);
+
+static void vduse_exit(void)
+{
+	vduse_parentdev_exit();
+	vduse_virqfd_exit();
+	destroy_workqueue(vduse_vdpa_wq);
+	misc_deregister(&vduse_misc);
+}
+module_exit(vduse_exit);
+
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
index bba8b83a94b5..a7a841e5ffc7 100644
--- a/include/uapi/linux/vdpa.h
+++ b/include/uapi/linux/vdpa.h
@@ -33,6 +33,7 @@ enum vdpa_attr {
 	VDPA_ATTR_DEV_VENDOR_ID,		/* u32 */
 	VDPA_ATTR_DEV_MAX_VQS,			/* u32 */
 	VDPA_ATTR_DEV_MAX_VQ_SIZE,		/* u16 */
+	VDPA_ATTR_BACKEND_ID,			/* u32 */
 
 	/* new attributes must be added above here */
 	VDPA_ATTR_MAX,
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
new file mode 100644
index 000000000000..9fb555ddcfbd
--- /dev/null
+++ b/include/uapi/linux/vduse.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_VDUSE_H_
+#define _UAPI_VDUSE_H_
+
+#include <linux/types.h>
+
+/* control message definitions for read()/write() */
+
+#define VDUSE_CONFIG_DATA_LEN	256
+
+enum vduse_req_type {
+	VDUSE_SET_VQ_NUM,
+	VDUSE_SET_VQ_ADDR,
+	VDUSE_SET_VQ_READY,
+	VDUSE_GET_VQ_READY,
+	VDUSE_SET_VQ_STATE,
+	VDUSE_GET_VQ_STATE,
+	VDUSE_SET_FEATURES,
+	VDUSE_GET_FEATURES,
+	VDUSE_SET_STATUS,
+	VDUSE_GET_STATUS,
+	VDUSE_SET_CONFIG,
+	VDUSE_GET_CONFIG,
+	VDUSE_UPDATE_IOTLB,
+};
+
+struct vduse_vq_num {
+	__u32 index;
+	__u32 num;
+};
+
+struct vduse_vq_addr {
+	__u32 index;
+	__u64 desc_addr;
+	__u64 driver_addr;
+	__u64 device_addr;
+};
+
+struct vduse_vq_ready {
+	__u32 index;
+	__u8 ready;
+};
+
+struct vduse_vq_state {
+	__u32 index;
+	__u16 avail_idx;
+};
+
+struct vduse_dev_config_data {
+	__u32 offset;
+	__u32 len;
+	__u8 data[VDUSE_CONFIG_DATA_LEN];
+};
+
+struct vduse_iova_range {
+	__u64 start;
+	__u64 last;
+};
+
+struct vduse_dev_request {
+	__u32 type; /* request type */
+	__u32 unique; /* request id */
+	__u32 flags; /* request flags */
+	__u32 size; /* the payload size */
+	union {
+		struct vduse_vq_num vq_num; /* virtqueue num */
+		struct vduse_vq_addr vq_addr; /* virtqueue address */
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		struct vduse_iova_range iova; /* iova range for updating */
+		__u64 features; /* virtio features */
+		__u8 status; /* device status */
+	};
+};
+
+struct vduse_dev_response {
+	__u32 unique; /* corresponding request id */
+	__s32 result; /* the result of request */
+	union {
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		__u64 features; /* virtio features */
+		__u8 status; /* device status */
+	};
+};
+
+/* ioctls */
+
+struct vduse_dev_config {
+	__u32 id; /* vduse device id */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 bounce_size; /* bounce buffer size for iommu */
+	__u16 vq_num; /* the number of virtqueues */
+	__u16 vq_size_max; /* the max size of virtqueue */
+	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
+};
+
+struct vduse_iotlb_entry {
+	__u64 offset; /* the mmap offset on fd */
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* last of the IOVA range */
+#define VDUSE_ACCESS_RO 0x1
+#define VDUSE_ACCESS_WO 0x2
+#define VDUSE_ACCESS_RW 0x3
+	__u8 perm; /* access permission of this range */
+};
+
+struct vduse_vq_eventfd {
+	__u32 index; /* virtqueue index */
+	__u32 fd; /* eventfd */
+};
+
+#define VDUSE_BASE	0x81
+
+#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
+#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x02)
+
+#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
+#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
+#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
+
+#endif /* _UAPI_VDUSE_H_ */
-- 
2.11.0
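
A rough sketch of how a daemon might dispatch the control messages
defined above, in case it helps review. The demo_* types and
demo_handle() are hypothetical, trimmed stand-ins for struct
vduse_dev_request and struct vduse_dev_response (they are not
ABI-compatible with them), and the device state is invented for
illustration. Only the request type values are taken from
enum vduse_req_type in this patch.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Request types used here, with the values they have in enum vduse_req_type. */
enum {
	VDUSE_SET_FEATURES = 6,
	VDUSE_GET_FEATURES = 7,
	VDUSE_SET_STATUS = 8,
	VDUSE_GET_STATUS = 9,
};

/* Hypothetical, trimmed stand-in for struct vduse_dev_request. */
struct demo_request {
	uint32_t type;
	uint32_t unique;
	union {
		uint64_t features;
		uint8_t status;
	};
};

/* Hypothetical, trimmed stand-in for struct vduse_dev_response. */
struct demo_response {
	uint32_t unique;
	int32_t result;
	union {
		uint64_t features;
		uint8_t status;
	};
};

/* Hypothetical daemon-side device state. */
struct demo_dev {
	uint64_t dev_features;	/* features offered by the device */
	uint64_t drv_features;	/* features accepted from the driver */
	uint8_t status;
};

/* Fill @rsp for one request; errors are reported as a negative errno in result. */
static void demo_handle(struct demo_dev *d, const struct demo_request *req,
			struct demo_response *rsp)
{
	memset(rsp, 0, sizeof(*rsp));
	rsp->unique = req->unique;	/* echo the id so the reply can be matched */

	switch (req->type) {
	case VDUSE_GET_FEATURES:
		rsp->features = d->dev_features;
		break;
	case VDUSE_SET_FEATURES:
		/* Reject feature bits the device never offered. */
		if (req->features & ~d->dev_features)
			rsp->result = -EINVAL;
		else
			d->drv_features = req->features;
		break;
	case VDUSE_SET_STATUS:
		d->status = req->status;
		break;
	case VDUSE_GET_STATUS:
		rsp->status = d->status;
		break;
	default:
		rsp->result = -EINVAL;
		break;
	}
}
```

A real daemon would read() one request from the device fd, run a handler
like this on it, and write() the response back, using the structures from
the patched <linux/vduse.h>.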


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC v3 09/11] vduse: Add VDUSE_GET_DEV ioctl
  2021-01-19  5:07 ` [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
  2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-01-19  5:07   ` Xie Yongji
  2021-01-19  5:07   ` [RFC v3 10/11] vduse: grab the module's references until there is no vduse device Xie Yongji
  2021-01-19  5:07   ` [RFC v3 11/11] vduse: Introduce a workqueue for irq injection Xie Yongji
  3 siblings, 0 replies; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  5:07 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

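This new ioctl retrieves a file descriptor referring to an existing
userspace vDPA device, so that a daemon can reconnect to it. On
release, messages still awaiting a reply are moved back to the send
list so that the next connection receives them again.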
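
For illustration, a reconnecting daemon might use the ioctl roughly as
follows. This is only a sketch: vduse_reconnect() is a hypothetical
helper, and the two macros mirror the definitions in this series so that
the snippet builds without the patched <linux/vduse.h>.

```c
#include <errno.h>
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* Mirrors the definitions in this series; a real daemon would include
 * the patched <linux/vduse.h> instead. */
#ifndef VDUSE_BASE
#define VDUSE_BASE	0x81
#endif
#ifndef VDUSE_GET_DEV
#define VDUSE_GET_DEV	_IO(VDUSE_BASE, 0x03)
#endif

/*
 * Ask the VDUSE control fd (an open /dev/vduse) for a new fd referring
 * to device @id. Returns the new fd, or a negative errno. With this
 * patch the kernel side fails with EINVAL for an unknown id and with
 * EBUSY while another fd is still connected to the device.
 */
static int vduse_reconnect(int ctrl_fd, uint32_t id)
{
	int fd = ioctl(ctrl_fd, VDUSE_GET_DEV, (unsigned long)id);

	return fd < 0 ? -errno : fd;
}
```

The device id is passed by value as the ioctl argument, matching
vduse_get_dev(arg) in the hunk below.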

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 30 ++++++++++++++++++++++++++++++
 include/uapi/linux/vduse.h         |  1 +
 2 files changed, 31 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 1cf759bc5914..4d21203da5b6 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -872,9 +872,14 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 static int vduse_dev_release(struct inode *inode, struct file *file)
 {
 	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
 
 	vduse_kickfd_release(dev);
 	vduse_virqfd_release(dev);
+
+	while ((msg = vduse_dev_dequeue_msg(dev, &dev->recv_list)))
+		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
+
 	dev->connected = false;
 
 	return 0;
@@ -937,6 +942,28 @@ static struct vduse_dev *vduse_find_dev(u32 id)
 	return dev;
 }
 
+static int vduse_get_dev(u32 id)
+{
+	int fd;
+	char name[64];
+	struct vduse_dev *dev = vduse_find_dev(id);
+
+	if (!dev)
+		return -EINVAL;
+
+	if (dev->connected)
+		return -EBUSY;
+
+	snprintf(name, sizeof(name), "[vduse-dev:%u]", dev->id);
+	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	dev->connected = true;
+
+	return fd;
+}
+
 static int vduse_destroy_dev(u32 id)
 {
 	struct vduse_dev *dev = vduse_find_dev(id);
@@ -1024,6 +1051,9 @@ static long vduse_ioctl(struct file *file, unsigned int cmd,
 		ret = vduse_create_dev(&config);
 		break;
 	}
+	case VDUSE_GET_DEV:
+		ret = vduse_get_dev(arg);
+		break;
 	case VDUSE_DESTROY_DEV:
 		ret = vduse_destroy_dev(arg);
 		break;
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 9fb555ddcfbd..b3694f0d3b77 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -117,6 +117,7 @@ struct vduse_vq_eventfd {
 
 #define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
 #define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x02)
+#define VDUSE_GET_DEV		_IO(VDUSE_BASE, 0x03)
 
 #define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
 #define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
-- 
2.11.0


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC v3 10/11] vduse: grab the module's references until there is no vduse device
  2021-01-19  5:07 ` [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
  2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-01-19  5:07   ` [RFC v3 09/11] vduse: Add VDUSE_GET_DEV ioctl Xie Yongji
@ 2021-01-19  5:07   ` Xie Yongji
  2021-01-26  8:09     ` Jason Wang
  2021-01-19  5:07   ` [RFC v3 11/11] vduse: Introduce a workqueue for irq injection Xie Yongji
  3 siblings, 1 reply; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  5:07 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

The module must not be unloaded while any vduse device exists.
Take a reference on the module when a vduse device is created
and hold it until the device is destroyed.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 4d21203da5b6..003aeb281bce 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -978,6 +978,7 @@ static int vduse_destroy_dev(u32 id)
 	kfree(dev->vqs);
 	vduse_domain_destroy(dev->domain);
 	vduse_dev_destroy(dev);
+	module_put(THIS_MODULE);
 
 	return 0;
 }
@@ -1022,6 +1023,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
 
 	dev->connected = true;
 	list_add(&dev->list, &vduse_devs);
+	__module_get(THIS_MODULE);
 
 	return fd;
 err_fd:
-- 
2.11.0


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC v3 11/11] vduse: Introduce a workqueue for irq injection
  2021-01-19  5:07 ` [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
                     ` (2 preceding siblings ...)
  2021-01-19  5:07   ` [RFC v3 10/11] vduse: grab the module's references until there is no vduse device Xie Yongji
@ 2021-01-19  5:07   ` Xie Yongji
  2021-01-26  8:17     ` Jason Wang
  3 siblings, 1 reply; 57+ messages in thread
From: Xie Yongji @ 2021-01-19  5:07 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This patch introduces a dedicated workqueue for irq injection.
It is allocated with WQ_SYSFS | WQ_UNBOUND so that its attributes
are visible in sysfs and can be tuned for performance.
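
As a sketch of what such tuning could look like from userspace:
write_attr() below is a hypothetical helper that writes a string to an
attribute file. The sysfs path in the usage note is an assumption based
on where WQ_SYSFS workqueues are normally exposed; it is not something
this patch defines.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write @value to the attribute file at @path; 0 on success, -errno on error. */
static int write_attr(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t n;
	int fd, ret = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, value, len);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;	/* short write: treat as an I/O error */

	close(fd);
	return ret;
}
```

For example, write_attr("/sys/devices/virtual/workqueue/vduse-irq/cpumask", "3")
would restrict the injection work to CPUs 0-1, assuming the workqueue
shows up under that path.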

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/eventfd.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
index dbffddb08908..caf7d8d68ac0 100644
--- a/drivers/vdpa/vdpa_user/eventfd.c
+++ b/drivers/vdpa/vdpa_user/eventfd.c
@@ -18,6 +18,7 @@
 #include "eventfd.h"
 
 static struct workqueue_struct *vduse_irqfd_cleanup_wq;
+static struct workqueue_struct *vduse_irq_wq;
 
 static void vduse_virqfd_shutdown(struct work_struct *work)
 {
@@ -57,7 +58,7 @@ static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
 	__poll_t flags = key_to_poll(key);
 
 	if (flags & EPOLLIN)
-		schedule_work(&virqfd->inject);
+		queue_work(vduse_irq_wq, &virqfd->inject);
 
 	if (flags & EPOLLHUP) {
 		spin_lock(&vq->irq_lock);
@@ -165,11 +166,18 @@ int vduse_virqfd_init(void)
 	if (!vduse_irqfd_cleanup_wq)
 		return -ENOMEM;
 
+	vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_SYSFS | WQ_UNBOUND, 0);
+	if (!vduse_irq_wq) {
+		destroy_workqueue(vduse_irqfd_cleanup_wq);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 void vduse_virqfd_exit(void)
 {
+	destroy_workqueue(vduse_irq_wq);
 	destroy_workqueue(vduse_irqfd_cleanup_wq);
 }
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-01-19 14:53     ` Jonathan Corbet
  2021-01-20  2:25       ` Yongji Xie
  2021-01-19 17:53     ` Randy Dunlap
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Jonathan Corbet @ 2021-01-19 14:53 UTC (permalink / raw)
  To: Xie Yongji
  Cc: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, virtualization, netdev, kvm, linux-aio,
	linux-fsdevel

On Tue, 19 Jan 2021 13:07:53 +0800
Xie Yongji <xieyongji@bytedance.com> wrote:

> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> new file mode 100644
> index 000000000000..9418a7f6646b
> --- /dev/null
> +++ b/Documentation/driver-api/vduse.rst
> @@ -0,0 +1,85 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================

Thanks for documenting this feature!  You will, though, need to add this
new document to Documentation/driver-api/index.rst for it to be included
in the docs build.

That said, this would appear to be documentation for user space, right?
So the userspace-api manual is probably a more appropriate place for it.

Thanks,

jon

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-01-19 14:53     ` Jonathan Corbet
@ 2021-01-19 17:53     ` Randy Dunlap
  2021-01-20  2:42       ` Yongji Xie
  2021-01-26  8:08     ` Jason Wang
  2021-01-26  8:19     ` Jason Wang
  3 siblings, 1 reply; 57+ messages in thread
From: Randy Dunlap @ 2021-01-19 17:53 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, bob.liu,
	hch, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Hi,

Documentation comments only:

On 1/18/21 9:07 PM, Xie Yongji wrote:
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  Documentation/driver-api/vduse.rst                 |   85 ++
> 
> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> new file mode 100644
> index 000000000000..9418a7f6646b
> --- /dev/null
> +++ b/Documentation/driver-api/vduse.rst
> @@ -0,0 +1,85 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +
> +How VDUSE works
> +------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> +to the new resources will be returned, which can be used to implement the
> +userspace vDPA device's control path and data path.
> +
> +To implement control path, the read/write operations to the file descriptor
> +will be used to receive/reply the control messages from/to VDUSE driver.
> +Those control messages are mostly based on the vdpa_config_ops which defines
> +a unified interface to control different types of vDPA device.
> +
> +The following types of messages are provided by the VDUSE framework now:
> +
> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
> +
> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> +
> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> +
> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> +
> +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
> +
> +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue
> +
> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> +
> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> +
> +- VDUSE_SET_STATUS: Set the device status
> +
> +- VDUSE_GET_STATUS: Get the device status
> +
> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> +
> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Please see include/linux/vdpa.h for details.
> +
> +In the data path, vDPA device's iova regions will be mapped into userspace with
> +the help of VDUSE_IOTLB_GET_FD ioctl on the userspace vDPA device fd:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> +  access this iova region by passing the fd to mmap(2).
> +
> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> +receive virtqueue kicks in userspace. The following ioctls on the userspace
> +vDPA device fd are provided to support that:
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE driver to notify userspace to consume the vring.
> +
> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
> +
> +MMU-based IOMMU Driver
> +----------------------
> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU

                                                   an MMU-based

> +driver to support mapping the kernel dma buffer into the userspace iova

                                        DMA

> +region dynamically.
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the dma buffer in kernel.

       DMA

> +
> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> +prevent userspace accessing the original buffer directly which may contain other
> +kernel data. During the mapping, unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> +space instead of the original one.


thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-19 14:53     ` Jonathan Corbet
@ 2021-01-20  2:25       ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-20  2:25 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, sgarzare,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, axboe, bcrl, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Tue, Jan 19, 2021 at 10:54 PM Jonathan Corbet <corbet@lwn.net> wrote:
>
> On Tue, 19 Jan 2021 13:07:53 +0800
> Xie Yongji <xieyongji@bytedance.com> wrote:
>
> > diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> > new file mode 100644
> > index 000000000000..9418a7f6646b
> > --- /dev/null
> > +++ b/Documentation/driver-api/vduse.rst
> > @@ -0,0 +1,85 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
>
> Thanks for documenting this feature!  You will, though, need to add this
> new document to Documentation/driver-api/index.rst for it to be included
> in the docs build.
>
> That said, this would appear to be documentation for user space, right?
> So the userspace-api manual is probably a more appropriate place for it.
>

Will do it. Thanks for the reminder!

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-19 17:53     ` Randy Dunlap
@ 2021-01-20  2:42       ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-20  2:42 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, sgarzare,
	Parav Pandit, Bob Liu, Christoph Hellwig, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 20, 2021 at 1:54 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Hi,
>
> Documentation comments only:
>

Will fix it.

Thanks,
Yongji


> On 1/18/21 9:07 PM, Xie Yongji wrote:
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  Documentation/driver-api/vduse.rst                 |   85 ++
> >
> > diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> > new file mode 100644
> > index 000000000000..9418a7f6646b
> > --- /dev/null
> > +++ b/Documentation/driver-api/vduse.rst
> > @@ -0,0 +1,85 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace.
> > +
> > +How VDUSE works
> > +------------
> > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> > +to the new resources will be returned, which can be used to implement the
> > +userspace vDPA device's control path and data path.
> > +
> > +To implement control path, the read/write operations to the file descriptor
> > +will be used to receive/reply the control messages from/to VDUSE driver.
> > +Those control messages are mostly based on the vdpa_config_ops which defines
> > +a unified interface to control different types of vDPA device.
> > +
> > +The following types of messages are provided by the VDUSE framework now:
> > +
> > +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
> > +
> > +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> > +
> > +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> > +
> > +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> > +
> > +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
> > +
> > +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue
> > +
> > +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> > +
> > +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> > +
> > +- VDUSE_SET_STATUS: Set the device status
> > +
> > +- VDUSE_GET_STATUS: Get the device status
> > +
> > +- VDUSE_SET_CONFIG: Write to device specific configuration space
> > +
> > +- VDUSE_GET_CONFIG: Read from device specific configuration space
> > +
> > +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> > +
> > +Please see include/linux/vdpa.h for details.
> > +
> > +In the data path, vDPA device's iova regions will be mapped into userspace with
> > +the help of VDUSE_IOTLB_GET_FD ioctl on the userspace vDPA device fd:
> > +
> > +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> > +  access this iova region by passing the fd to mmap(2).
> > +
> > +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> > +receive virtqueue kicks in userspace. The following ioctls on the userspace
> > +vDPA device fd are provided to support that:
> > +
> > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> > +  by VDUSE driver to notify userspace to consume the vring.
> > +
> > +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> > +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
> > +
> > +MMU-based IOMMU Driver
> > +----------------------
> > +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>
>                                                    an MMU-based
>
> > +driver to support mapping the kernel dma buffer into the userspace iova
>
>                                         DMA
>
> > +region dynamically.
> > +
> > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > +so that the userspace process is able to use its virtual address to access
> > +the dma buffer in kernel.
>
>        DMA
>
> > +
> > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > +prevent userspace accessing the original buffer directly which may contain other
> > +kernel data. During the mapping, unmapping, the driver will copy the data from
> > +the original buffer to the bounce buffer and back, depending on the direction of
> > +the transfer. And the bounce-buffer addresses will be mapped into the user address
> > +space instead of the original one.
>
>
> thanks.
> --
> ~Randy
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-01-19  4:59 ` [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
@ 2021-01-20  3:44   ` Jason Wang
  2021-01-20  6:44     ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-20  3:44 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 12:59 PM, Xie Yongji wrote:
> Introduce a mutex to protect vhost device iotlb from
> concurrent access.
>
> Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vhost/vdpa.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 448be7875b6d..4a241d380c40 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -49,6 +49,7 @@ struct vhost_vdpa {
>   	struct eventfd_ctx *config_ctx;
>   	int in_batch;
>   	struct vdpa_iova_range range;
> +	struct mutex mutex;


Let's use the device mutex (dev->mutex), as vhost_process_iotlb_msg() does.

Thanks


>   };
>   
>   static DEFINE_IDA(vhost_vdpa_ida);
> @@ -728,6 +729,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   	if (r)
>   		return r;
>   
> +	mutex_lock(&v->mutex);
>   	switch (msg->type) {
>   	case VHOST_IOTLB_UPDATE:
>   		r = vhost_vdpa_process_iotlb_update(v, msg);
> @@ -747,6 +749,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   		r = -EINVAL;
>   		break;
>   	}
> +	mutex_unlock(&v->mutex);
>   
>   	return r;
>   }
> @@ -1017,6 +1020,7 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
>   		return minor;
>   	}
>   
> +	mutex_init(&v->mutex);
>   	atomic_set(&v->opened, 0);
>   	v->minor = minor;
>   	v->vdpa = vdpa;


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-19  4:59 ` [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
@ 2021-01-20  3:46   ` Jason Wang
  2021-01-20  6:46     ` Yongji Xie
  2021-01-20 11:08     ` Stefano Garzarella
  2021-01-27  8:59   ` Stefano Garzarella
  1 sibling, 2 replies; 57+ messages in thread
From: Jason Wang @ 2021-01-20  3:46 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 12:59 PM, Xie Yongji wrote:
> With VDUSE, we should be able to support all kinds of virtio devices.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vhost/vdpa.c | 29 +++--------------------------
>   1 file changed, 3 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 29ed4173f04e..448be7875b6d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -22,6 +22,7 @@
>   #include <linux/nospec.h>
>   #include <linux/vhost.h>
>   #include <linux/virtio_net.h>
> +#include <linux/virtio_blk.h>
>   
>   #include "vhost.h"
>   
> @@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
>   	return 0;
>   }
>   
> -static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
> -				      struct vhost_vdpa_config *c)
> -{
> -	long size = 0;
> -
> -	switch (v->virtio_id) {
> -	case VIRTIO_ID_NET:
> -		size = sizeof(struct virtio_net_config);
> -		break;
> -	}
> -
> -	if (c->len == 0)
> -		return -EINVAL;
> -
> -	if (c->len > size - c->off)
> -		return -E2BIG;
> -
> -	return 0;
> -}


I think we should use a separate patch for this.

Thanks


> -
>   static long vhost_vdpa_get_config(struct vhost_vdpa *v,
>   				  struct vhost_vdpa_config __user *c)
>   {
> @@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v,
>   
>   	if (copy_from_user(&config, c, size))
>   		return -EFAULT;
> -	if (vhost_vdpa_config_validate(v, &config))
> +	if (config.len == 0)
>   		return -EINVAL;
>   	buf = kvzalloc(config.len, GFP_KERNEL);
>   	if (!buf)
> @@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v,
>   
>   	if (copy_from_user(&config, c, size))
>   		return -EFAULT;
> -	if (vhost_vdpa_config_validate(v, &config))
> +	if (config.len == 0)
>   		return -EINVAL;
>   	buf = kvzalloc(config.len, GFP_KERNEL);
>   	if (!buf)
> @@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
>   	int minor;
>   	int r;
>   
> -	/* Currently, we only accept the network devices. */
> -	if (ops->get_device_id(vdpa) != VIRTIO_ID_NET)
> -		return -ENOTSUPP;
> -
>   	v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>   	if (!v)
>   		return -ENOMEM;


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-19  4:59 ` [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
@ 2021-01-20  4:24   ` Jason Wang
  2021-01-20  6:52     ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-20  4:24 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 12:59 PM, Xie Yongji wrote:
> Now we have a global percpu counter to limit the recursion depth
> of eventfd_signal(). This can avoid deadlock or stack overflow.
> But in stack overflow case, it should be OK to increase the
> recursion depth if needed. So we add a percpu counter in eventfd_ctx
> to limit the recursion depth for deadlock case. Then it could be
> fine to increase the global percpu counter later.


I wonder whether it's worth introducing a percpu counter for each eventfd.

How about simply checking whether eventfd_signal_count() is greater than 2?

Thanks


>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   fs/aio.c                |  3 ++-
>   fs/eventfd.c            | 20 +++++++++++++++++++-
>   include/linux/eventfd.h |  5 +----
>   3 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 1f32da13d39e..5d82903161f5 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1698,7 +1698,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>   		list_del(&iocb->ki_list);
>   		iocb->ki_res.res = mangle_poll(mask);
>   		req->done = true;
> -		if (iocb->ki_eventfd && eventfd_signal_count()) {
> +		if (iocb->ki_eventfd &&
> +			eventfd_signal_count(iocb->ki_eventfd)) {
>   			iocb = NULL;
>   			INIT_WORK(&req->work, aio_poll_put_work);
>   			schedule_work(&req->work);
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..2df24f9bada3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -25,6 +25,8 @@
>   #include <linux/idr.h>
>   #include <linux/uio.h>
>   
> +#define EVENTFD_WAKE_DEPTH 0
> +
>   DEFINE_PER_CPU(int, eventfd_wake_count);
>   
>   static DEFINE_IDA(eventfd_ida);
> @@ -42,9 +44,17 @@ struct eventfd_ctx {
>   	 */
>   	__u64 count;
>   	unsigned int flags;
> +	int __percpu *wake_count;
>   	int id;
>   };
>   
> +bool eventfd_signal_count(struct eventfd_ctx *ctx)
> +{
> +	return (this_cpu_read(*ctx->wake_count) ||
> +		this_cpu_read(eventfd_wake_count) > EVENTFD_WAKE_DEPTH);
> +}
> +EXPORT_SYMBOL_GPL(eventfd_signal_count);
> +
>   /**
>    * eventfd_signal - Adds @n to the eventfd counter.
>    * @ctx: [in] Pointer to the eventfd context.
> @@ -71,17 +81,19 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>   	 * it returns true, the eventfd_signal() call should be deferred to a
>   	 * safe context.
>   	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(eventfd_signal_count(ctx)))
>   		return 0;
>   
>   	spin_lock_irqsave(&ctx->wqh.lock, flags);
>   	this_cpu_inc(eventfd_wake_count);
> +	this_cpu_inc(*ctx->wake_count);
>   	if (ULLONG_MAX - ctx->count < n)
>   		n = ULLONG_MAX - ctx->count;
>   	ctx->count += n;
>   	if (waitqueue_active(&ctx->wqh))
>   		wake_up_locked_poll(&ctx->wqh, EPOLLIN);
>   	this_cpu_dec(eventfd_wake_count);
> +	this_cpu_dec(*ctx->wake_count);
>   	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
>   
>   	return n;
> @@ -92,6 +104,7 @@ static void eventfd_free_ctx(struct eventfd_ctx *ctx)
>   {
>   	if (ctx->id >= 0)
>   		ida_simple_remove(&eventfd_ida, ctx->id);
> +	free_percpu(ctx->wake_count);
>   	kfree(ctx);
>   }
>   
> @@ -423,6 +436,11 @@ static int do_eventfd(unsigned int count, int flags)
>   
>   	kref_init(&ctx->kref);
>   	init_waitqueue_head(&ctx->wqh);
> +	ctx->wake_count = alloc_percpu(int);
> +	if (!ctx->wake_count) {
> +		kfree(ctx);
> +		return -ENOMEM;
> +	}
>   	ctx->count = count;
>   	ctx->flags = flags;
>   	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..1a11ebbd74a9 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -45,10 +45,7 @@ void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
>   
>   DECLARE_PER_CPU(int, eventfd_wake_count);
>   
> -static inline bool eventfd_signal_count(void)
> -{
> -	return this_cpu_read(eventfd_wake_count);
> -}
> +bool eventfd_signal_count(struct eventfd_ctx *ctx);
>   
>   #else /* CONFIG_EVENTFD */
>   


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v3 05/11] vdpa: shared virtual addressing support
  2021-01-19  4:59 ` [RFC v3 05/11] vdpa: shared virtual addressing support Xie Yongji
@ 2021-01-20  5:55   ` Jason Wang
  2021-01-20  7:10     ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-20  5:55 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 12:59 PM, Xie Yongji wrote:
> This patch introduces SVA (Shared Virtual Addressing)
> support for vDPA device. During vDPA device allocation,
> vDPA device driver needs to indicate whether SVA is
> supported by the device. Then vhost-vdpa bus driver
> will not pin user page and transfer userspace virtual
> address instead of physical address during DMA mapping.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
>   drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
>   drivers/vdpa/vdpa.c               |  5 ++++-
>   drivers/vdpa/vdpa_sim/vdpa_sim.c  |  3 ++-
>   drivers/vhost/vdpa.c              | 35 +++++++++++++++++++++++------------
>   include/linux/vdpa.h              | 10 +++++++---
>   6 files changed, 38 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> index 23474af7da40..95c4601f82f5 100644
> --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> @@ -439,7 +439,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   
>   	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
>   				    dev, &ifc_vdpa_ops,
> -				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
> +				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
>   	if (adapter == NULL) {
>   		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
>   		return -ENOMEM;
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 77595c81488d..05988d6907f2 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1959,7 +1959,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
>   	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
>   
>   	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> -				 2 * mlx5_vdpa_max_qps(max_vqs), NULL);
> +				 2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
>   	if (IS_ERR(ndev))
>   		return PTR_ERR(ndev);
>   
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index 32bd48baffab..50cab930b2e5 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
>    * @nvqs: number of virtqueues supported by this device
>    * @size: size of the parent structure that contains private data
>    * @name: name of the vdpa device; optional.
> + * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
>    *
>    * Driver should use vdpa_alloc_device() wrapper macro instead of
>    * using this directly.
> @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
>    */
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name)
> +					int nvqs, size_t size, const char *name,
> +					bool sva)
>   {
>   	struct vdpa_device *vdev;
>   	int err = -EINVAL;
> @@ -108,6 +110,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	vdev->config = config;
>   	vdev->features_valid = false;
>   	vdev->nvqs = nvqs;
> +	vdev->sva = sva;
>   
>   	if (name)
>   		err = dev_set_name(&vdev->dev, "%s", name);
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 85776e4e6749..03c796873a6b 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -367,7 +367,8 @@ static struct vdpasim *vdpasim_create(const char *name)
>   	else
>   		ops = &vdpasim_net_config_ops;
>   
> -	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, VDPASIM_VQ_NUM, name);
> +	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> +				VDPASIM_VQ_NUM, name, false);
>   	if (!vdpasim)
>   		goto err_alloc;
>   
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 4a241d380c40..36b6950ba37f 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -486,21 +486,25 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct vhost_iotlb_map *map;
>   	struct page *page;
>   	unsigned long pfn, pinned;
>   
>   	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> -		pinned = map->size >> PAGE_SHIFT;
> -		for (pfn = map->addr >> PAGE_SHIFT;
> -		     pinned > 0; pfn++, pinned--) {
> -			page = pfn_to_page(pfn);
> -			if (map->perm & VHOST_ACCESS_WO)
> -				set_page_dirty_lock(page);
> -			unpin_user_page(page);
> +		if (!vdpa->sva) {
> +			pinned = map->size >> PAGE_SHIFT;
> +			for (pfn = map->addr >> PAGE_SHIFT;
> +			     pinned > 0; pfn++, pinned--) {
> +				page = pfn_to_page(pfn);
> +				if (map->perm & VHOST_ACCESS_WO)
> +					set_page_dirty_lock(page);
> +				unpin_user_page(page);
> +			}
> +			atomic64_sub(map->size >> PAGE_SHIFT,
> +					&dev->mm->pinned_vm);
>   		}
> -		atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   		vhost_iotlb_map_free(iotlb, map);
>   	}
>   }
> @@ -558,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		r = iommu_map(v->domain, iova, pa, size,
>   			      perm_to_iommu_flags(perm));
>   	}
> -
> -	if (r)
> +	if (r) {
>   		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> -	else
> +		return r;
> +	}
> +
> +	if (!vdpa->sva)
>   		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   
> -	return r;
> +	return 0;
>   }
>   
>   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> @@ -589,6 +595,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   					   struct vhost_iotlb_msg *msg)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct page **page_list;
>   	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
> @@ -607,6 +614,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				    msg->iova + msg->size - 1))
>   		return -EEXIST;
>   
> +	if (vdpa->sva)
> +		return vhost_vdpa_map(v, msg->iova, msg->size,
> +				      msg->uaddr, msg->perm);
> +
>   	/* Limit the use of memory for bookkeeping */
>   	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>   	if (!page_list)
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index cb5a3d847af3..f86869651614 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -44,6 +44,7 @@ struct vdpa_parent_dev;
>    * @config: the configuration ops for this device.
>    * @index: device index
>    * @features_valid: were features initialized? for legacy guests
> + * @sva: indicate whether SVA (Shared Virtual Addressing) is supported


Thinking about this again: we probably need a better name than "sva",
since the kernel already uses that term for a shared virtual address space,
and here we don't actually share the whole virtual address space.

I also guess this cannot work for devices that use the platform IOMMU, so
we should check and fail if sva && !(dma_map || set_map).
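
The check suggested above could be sketched roughly as follows. This is a
userspace model only, not kernel code: struct model_config_ops is a
hypothetical cut-down stand-in for struct vdpa_config_ops, and
model_check_sva() and model_dma_map() are invented names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct vdpa_config_ops. */
struct model_config_ops {
	int (*set_map)(void *dev);
	int (*dma_map)(void *dev);
};

/*
 * Hypothetical helper: a device that asks for virtual-address based
 * mapping must translate by itself, i.e. provide dma_map or set_map.
 * Without either it would rely on the platform IOMMU, which cannot
 * consume userspace virtual addresses in this scheme.
 */
int model_check_sva(const struct model_config_ops *ops, bool sva)
{
	assert(ops != NULL);

	if (sva && !(ops->dma_map || ops->set_map))
		return -EINVAL;

	return 0;
}

/* Dummy callback so that a device "with its own translation" exists. */
int model_dma_map(void *dev)
{
	(void)dev;
	return 0;
}
```

Where such a check would live (device allocation or vhost-vdpa probe) is
left open here; the model only shows the condition itself.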

Thanks


>    * @nvqs: maximum number of supported virtqueues
>    * @pdev: parent device pointer; caller must setup when registering device as part
>    *	  of dev_add() parentdev ops callback before invoking _vdpa_register_device().
> @@ -54,6 +55,7 @@ struct vdpa_device {
>   	const struct vdpa_config_ops *config;
>   	unsigned int index;
>   	bool features_valid;
> +	bool sva;
>   	int nvqs;
>   	struct vdpa_parent_dev *pdev;
>   };
> @@ -250,14 +252,16 @@ struct vdpa_config_ops {
>   
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name);
> +					int nvqs, size_t size,
> +					const char *name, bool sva);
>   
> -#define vdpa_alloc_device(dev_struct, member, parent, config, nvqs, name)   \
> +#define vdpa_alloc_device(dev_struct, member, parent, config, \
> +			  nvqs, name, sva) \
>   			  container_of(__vdpa_alloc_device( \
>   				       parent, config, nvqs, \
>   				       sizeof(dev_struct) + \
>   				       BUILD_BUG_ON_ZERO(offsetof( \
> -				       dev_struct, member)), name), \
> +				       dev_struct, member)), name, sva), \
>   				       dev_struct, member)
>   
>   int vdpa_register_device(struct vdpa_device *vdev);


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB
  2021-01-19  4:59 ` [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB Xie Yongji
@ 2021-01-20  6:24   ` Jason Wang
  2021-01-20  7:52     ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-20  6:24 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 12:59 PM, Xie Yongji wrote:
> Add an opaque pointer for vhost IOTLB to store the
> corresponding vma->vm_file and offset on the DMA mapping.


Let's split the patch into two.

1) opaque pointer
2) vma stuffs


>
> It will be used in VDUSE case later.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++++---
>   drivers/vhost/iotlb.c            |  5 ++-
>   drivers/vhost/vdpa.c             | 66 +++++++++++++++++++++++++++++++++++-----
>   drivers/vhost/vhost.c            |  4 +--
>   include/linux/vdpa.h             |  3 +-
>   include/linux/vhost_iotlb.h      |  8 ++++-
>   6 files changed, 79 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 03c796873a6b..1ffcef67954f 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -279,7 +279,7 @@ static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
>   	 */
>   	spin_lock(&vdpasim->iommu_lock);
>   	ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
> -				    pa, dir_to_perm(dir));
> +				    pa, dir_to_perm(dir), NULL);


Maybe it's better to introduce

vhost_iotlb_add_range_ctx(), which accepts the opaque pointer (context),
and let vhost_iotlb_add_range() just call that.
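
A minimal userspace sketch of that wrapper pattern, assuming a simplified
linked list in place of the real interval tree. All model_* names are
hypothetical and only mirror the shape of the vhost IOTLB API; the error
codes are the model's own choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for struct vhost_iotlb_map (list, not rb-tree). */
struct model_map {
	uint64_t start, last, addr;
	unsigned int perm;
	void *opaque;		/* caller-owned context, may be NULL */
	struct model_map *next;
};

struct model_iotlb {
	struct model_map *head;
	unsigned int nmaps;
};

/* Context-aware variant: stores the opaque pointer with the mapping. */
int model_add_range_ctx(struct model_iotlb *iotlb, uint64_t start,
			uint64_t last, uint64_t addr, unsigned int perm,
			void *opaque)
{
	struct model_map *map;

	assert(iotlb != NULL);
	if (last < start)
		return -EINVAL;

	map = malloc(sizeof(*map));
	if (!map)
		return -ENOMEM;

	map->start = start;
	map->last = last;
	map->addr = addr;
	map->perm = perm;
	map->opaque = opaque;
	map->next = iotlb->head;
	iotlb->head = map;
	iotlb->nmaps++;

	return 0;
}

/* Existing entry point keeps its signature and forwards with no context. */
int model_add_range(struct model_iotlb *iotlb, uint64_t start,
		    uint64_t last, uint64_t addr, unsigned int perm)
{
	return model_add_range_ctx(iotlb, start, last, addr, perm, NULL);
}

/* Free every mapping; the opaque pointers stay owned by the caller. */
void model_iotlb_reset(struct model_iotlb *iotlb)
{
	while (iotlb->head) {
		struct model_map *map = iotlb->head;

		iotlb->head = map->next;
		free(map);
	}
	iotlb->nmaps = 0;
}
```

With this split, existing callers of the plain function would not need to
grow a trailing NULL argument; only callers that carry a context would
switch to the _ctx variant.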


>   	spin_unlock(&vdpasim->iommu_lock);
>   	if (ret)
>   		return DMA_MAPPING_ERROR;
> @@ -317,7 +317,7 @@ static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
>   
>   		ret = vhost_iotlb_add_range(iommu, (u64)pa,
>   					    (u64)pa + size - 1,
> -					    pa, VHOST_MAP_RW);
> +					    pa, VHOST_MAP_RW, NULL);
>   		if (ret) {
>   			*dma_addr = DMA_MAPPING_ERROR;
>   			kfree(addr);
> @@ -625,7 +625,8 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
>   	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
>   	     map = vhost_iotlb_itree_next(map, start, last)) {
>   		ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
> -					    map->last, map->addr, map->perm);
> +					    map->last, map->addr,
> +					    map->perm, NULL);
>   		if (ret)
>   			goto err;
>   	}
> @@ -639,14 +640,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
>   }
>   
>   static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
> -			   u64 pa, u32 perm)
> +			   u64 pa, u32 perm, void *opaque)
>   {
>   	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
>   	int ret;
>   
>   	spin_lock(&vdpasim->iommu_lock);
>   	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
> -				    perm);
> +				    perm, NULL);
>   	spin_unlock(&vdpasim->iommu_lock);
>   
>   	return ret;
> diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
> index 0fd3f87e913c..3bd5bd06cdbc 100644
> --- a/drivers/vhost/iotlb.c
> +++ b/drivers/vhost/iotlb.c
> @@ -42,13 +42,15 @@ EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
>    * @last: last of IOVA range
>    * @addr: the address that is mapped to @start
>    * @perm: access permission of this range
> + * @opaque: the opaque pointer for the IOTLB mapping
>    *
>    * Returns an error last is smaller than start or memory allocation
>    * fails
>    */
>   int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>   			  u64 start, u64 last,
> -			  u64 addr, unsigned int perm)
> +			  u64 addr, unsigned int perm,
> +			  void *opaque)
>   {
>   	struct vhost_iotlb_map *map;
>   
> @@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>   	map->last = last;
>   	map->addr = addr;
>   	map->perm = perm;
> +	map->opaque = opaque;
>   
>   	iotlb->nmaps++;
>   	vhost_iotlb_itree_insert(map, &iotlb->root);
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 36b6950ba37f..e83e5be7cec8 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -488,6 +488,7 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   	struct vhost_dev *dev = &v->vdev;
>   	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
> +	struct vhost_iotlb_file *iotlb_file;
>   	struct vhost_iotlb_map *map;
>   	struct page *page;
>   	unsigned long pfn, pinned;
> @@ -504,6 +505,10 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   			}
>   			atomic64_sub(map->size >> PAGE_SHIFT,
>   					&dev->mm->pinned_vm);
> +		} else if (map->opaque) {
> +			iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> +			fput(iotlb_file->file);
> +			kfree(iotlb_file);
>   		}
>   		vhost_iotlb_map_free(iotlb, map);
>   	}
> @@ -540,8 +545,8 @@ static int perm_to_iommu_flags(u32 perm)
>   	return flags | IOMMU_CACHE;
>   }
>   
> -static int vhost_vdpa_map(struct vhost_vdpa *v,
> -			  u64 iova, u64 size, u64 pa, u32 perm)
> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> +			  u64 size, u64 pa, u32 perm, void *opaque)
>   {
>   	struct vhost_dev *dev = &v->vdev;
>   	struct vdpa_device *vdpa = v->vdpa;
> @@ -549,12 +554,12 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   	int r = 0;
>   
>   	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> -				  pa, perm);
> +				  pa, perm, opaque);
>   	if (r)
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> @@ -591,6 +596,51 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>   	}
>   }
>   
> +static int vhost_vdpa_sva_map(struct vhost_vdpa *v,
> +			      u64 iova, u64 size, u64 uaddr, u32 perm)
> +{
> +	u64 offset, map_size, map_iova = iova;
> +	struct vhost_iotlb_file *iotlb_file;
> +	struct vm_area_struct *vma;
> +	int ret;


mmap_read_lock() is missing here; find_vma() needs the mmap lock held.


> +
> +	while (size) {
> +		vma = find_vma(current->mm, uaddr);
> +		if (!vma) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +		map_size = min(size, vma->vm_end - uaddr);
> +		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> +		iotlb_file = NULL;
> +		if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {


I wonder if we need a stricter check here. When developing vhost-vdpa,
I tried hard to make sure the map can only work for user pages.

So the question is: do we need to exclude MMIO areas, or only allow shmem
to work here?
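
To make the question concrete, here is a userspace model of the walk with
one possible stricter policy: only shared, file-backed, non-I/O regions
are accepted. Everything below is an assumption for illustration: the
model_* names, the flag values and the file_id field are invented, and
the real code would additionally have to run under the mmap lock. The
model also rejects a range that starts in a hole, since a find_vma()-style
lookup may return a region that begins above the address.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_VM_SHARED  0x1u	/* hypothetical flag values */
#define MODEL_VM_IO      0x2u

struct model_vma {
	uint64_t vm_start, vm_end;	/* [vm_start, vm_end) */
	uint64_t vm_pgoff;		/* file offset in pages */
	unsigned int vm_flags;
	int file_id;			/* 0 means anonymous memory */
};

struct model_chunk {
	uint64_t uaddr, size, offset;
	int file_id;
};

/* First region whose end lies above addr; vmas[] is sorted by address. */
const struct model_vma *model_find_vma(const struct model_vma *vmas,
				       size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/* One possible strict policy: shared, file-backed and not I/O memory. */
bool model_vma_mappable(const struct model_vma *vma)
{
	return vma->file_id != 0 &&
	       (vma->vm_flags & MODEL_VM_SHARED) &&
	       !(vma->vm_flags & MODEL_VM_IO);
}

/*
 * Split [uaddr, uaddr + size) into per-region chunks.
 * Returns the number of chunks written, or a negative errno.
 */
int model_split_range(const struct model_vma *vmas, size_t n,
		      uint64_t uaddr, uint64_t size,
		      struct model_chunk *out, size_t max_out)
{
	size_t used = 0;

	while (size) {
		const struct model_vma *vma = model_find_vma(vmas, n, uaddr);
		uint64_t len;

		/* No region at all, or uaddr falls into a hole below it. */
		if (!vma || uaddr < vma->vm_start)
			return -EINVAL;
		if (!model_vma_mappable(vma))
			return -EINVAL;
		if (used == max_out)
			return -ENOSPC;

		len = vma->vm_end - uaddr;
		if (len > size)
			len = size;

		out[used].uaddr = uaddr;
		out[used].size = len;
		out[used].offset = (vma->vm_pgoff << MODEL_PAGE_SHIFT) +
				   (uaddr - vma->vm_start);
		out[used].file_id = vma->file_id;
		used++;

		size -= len;
		uaddr += len;
	}

	return (int)used;
}
```

Note that this policy is stricter than the patch as posted, which still
maps regions without a backing file and passes a NULL opaque pointer for
them.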



> +			iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_KERNEL);
> +			if (!iotlb_file) {
> +				ret = -ENOMEM;
> +				goto err;
> +			}
> +			iotlb_file->file = get_file(vma->vm_file);
> +			iotlb_file->offset = offset;
> +		}


I wonder if it's better to always allocate iotlb_file and set
iotlb_file->file = NULL and iotlb_file->offset = 0 when there is no
backing file. This would force consistent code in the vDPA parents.

Or we can simply fail the map when there is no file as the backend.


> +		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> +					perm, iotlb_file);
> +		if (ret) {
> +			if (iotlb_file) {
> +				fput(iotlb_file->file);
> +				kfree(iotlb_file);
> +			}
> +			goto err;
> +		}
> +		size -= map_size;
> +		uaddr += map_size;
> +		map_iova += map_size;
> +	}
> +	return 0;
> +err:
> +	vhost_vdpa_unmap(v, iova, map_iova - iova);
> +	return ret;
> +}
> +
>   static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   					   struct vhost_iotlb_msg *msg)
>   {
> @@ -615,8 +665,8 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   		return -EEXIST;
>   
>   	if (vdpa->sva)
> -		return vhost_vdpa_map(v, msg->iova, msg->size,
> -				      msg->uaddr, msg->perm);
> +		return vhost_vdpa_sva_map(v, msg->iova, msg->size,
> +					  msg->uaddr, msg->perm);


So I think it's better to squash vhost_vdpa_sva_map() and the related
changes into the previous patch.


>   
>   	/* Limit the use of memory for bookkeeping */
>   	page_list = (struct page **) __get_free_page(GFP_KERNEL);
> @@ -671,7 +721,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>   				ret = vhost_vdpa_map(v, iova, csize,
>   						     map_pfn << PAGE_SHIFT,
> -						     msg->perm);
> +						     msg->perm, NULL);
>   				if (ret) {
>   					/*
>   					 * Unpin the pages that are left unmapped
> @@ -700,7 +750,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   
>   	/* Pin the rest chunk */
>   	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> -			     map_pfn << PAGE_SHIFT, msg->perm);
> +			     map_pfn << PAGE_SHIFT, msg->perm, NULL);
>   out:
>   	if (ret) {
>   		if (nchunks) {
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a262e12c6dc2..120dd5b3c119 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1104,7 +1104,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>   		vhost_vq_meta_reset(dev);
>   		if (vhost_iotlb_add_range(dev->iotlb, msg->iova,
>   					  msg->iova + msg->size - 1,
> -					  msg->uaddr, msg->perm)) {
> +					  msg->uaddr, msg->perm, NULL)) {
>   			ret = -ENOMEM;
>   			break;
>   		}
> @@ -1450,7 +1450,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>   					  region->guest_phys_addr +
>   					  region->memory_size - 1,
>   					  region->userspace_addr,
> -					  VHOST_MAP_RW))
> +					  VHOST_MAP_RW, NULL))
>   			goto err;
>   	}
>   
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index f86869651614..b264c627e94b 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -189,6 +189,7 @@ struct vdpa_iova_range {
>    *				@size: size of the area
>    *				@pa: physical address for the map
>    *				@perm: device access permission (VHOST_MAP_XX)
> + *				@opaque: the opaque pointer for the mapping
>    *				Returns integer: success (0) or error (< 0)
>    * @dma_unmap:			Unmap an area of IOVA (optional but
>    *				must be implemented with dma_map)
> @@ -243,7 +244,7 @@ struct vdpa_config_ops {
>   	/* DMA ops */
>   	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
>   	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
> -		       u64 pa, u32 perm);
> +		       u64 pa, u32 perm, void *opaque);
>   	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
>   
>   	/* Free device resources */
> diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
> index 6b09b786a762..66a50c11c8ca 100644
> --- a/include/linux/vhost_iotlb.h
> +++ b/include/linux/vhost_iotlb.h
> @@ -4,6 +4,11 @@
>   
>   #include <linux/interval_tree_generic.h>
>   
> +struct vhost_iotlb_file {
> +	struct file *file;
> +	u64 offset;
> +};


I think we'd better either:

1) simply use struct vhost_iotlb_file * instead of void *opaque for
vhost_iotlb_map

or

2) rename vhost_iotlb_file and move it to vdpa

2) looks better, since we want to let the vhost iotlb carry any type of
context (opaque pointer).

And if we do this, the modification of vdpa_config_ops deserves a
separate patch.

Thanks


> +
>   struct vhost_iotlb_map {
>   	struct rb_node rb;
>   	struct list_head link;
> @@ -17,6 +22,7 @@ struct vhost_iotlb_map {
>   	u32 perm;
>   	u32 flags_padding;
>   	u64 __subtree_last;
> +	void *opaque;
>   };
>   
>   #define VHOST_IOTLB_FLAG_RETIRE 0x1
> @@ -30,7 +36,7 @@ struct vhost_iotlb {
>   };
>   
>   int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
> -			  u64 addr, unsigned int perm);
> +			  u64 addr, unsigned int perm, void *opaque);
>   void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
>   
>   struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags);


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-01-20  3:44   ` Jason Wang
@ 2021-01-20  6:44     ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-20  6:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 20, 2021 at 11:44 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> > Introduce a mutex to protect vhost device iotlb from
> > concurrent access.
> >
> > Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vhost/vdpa.c | 4 ++++
> >   1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index 448be7875b6d..4a241d380c40 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> >       struct eventfd_ctx *config_ctx;
> >       int in_batch;
> >       struct vdpa_iova_range range;
> > +     struct mutex mutex;
>
>
> Let's use the device mutex, as vhost_process_iotlb_msg() does.
>

Looks fine.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-20  3:46   ` Jason Wang
@ 2021-01-20  6:46     ` Yongji Xie
  2021-01-20 11:08     ` Stefano Garzarella
  1 sibling, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-20  6:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 20, 2021 at 11:47 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> > With VDUSE, we should be able to support all kinds of virtio devices.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vhost/vdpa.c | 29 +++--------------------------
> >   1 file changed, 3 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index 29ed4173f04e..448be7875b6d 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -22,6 +22,7 @@
> >   #include <linux/nospec.h>
> >   #include <linux/vhost.h>
> >   #include <linux/virtio_net.h>
> > +#include <linux/virtio_blk.h>
> >
> >   #include "vhost.h"
> >
> > @@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
> >       return 0;
> >   }
> >
> > -static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
> > -                                   struct vhost_vdpa_config *c)
> > -{
> > -     long size = 0;
> > -
> > -     switch (v->virtio_id) {
> > -     case VIRTIO_ID_NET:
> > -             size = sizeof(struct virtio_net_config);
> > -             break;
> > -     }
> > -
> > -     if (c->len == 0)
> > -             return -EINVAL;
> > -
> > -     if (c->len > size - c->off)
> > -             return -E2BIG;
> > -
> > -     return 0;
> > -}
>
>
> I think we should use a separate patch for this.
>

Will do it.

Thanks,
Yongji


* Re: Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-20  4:24   ` Jason Wang
@ 2021-01-20  6:52     ` Yongji Xie
  2021-01-27  3:37       ` Jason Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-20  6:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> > Now we have a global percpu counter to limit the recursion depth
> > of eventfd_signal(). This can avoid deadlock or stack overflow.
> > But in stack overflow case, it should be OK to increase the
> > recursion depth if needed. So we add a percpu counter in eventfd_ctx
> > to limit the recursion depth for deadlock case. Then it could be
> > fine to increase the global percpu counter later.
>
>
> I wonder whether or not it's worth to introduce percpu for each eventfd.
>
> How about simply check if eventfd_signal_count() is greater than 2?
>

That check alone can't avoid the deadlock. We need a percpu counter in
each eventfd to limit the recursion depth for the deadlock case, and
the global percpu counter to avoid stack overflow.
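
A minimal userspace sketch of the two-counter idea, for illustration
only. The names (sketch_ctx, sketch_signal) and both limits are
hypothetical and do not come from the patch, and plain ints stand in
for the percpu counters the kernel would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits, chosen only for illustration. */
#define GLOBAL_DEPTH_MAX 4   /* guards against stack overflow */
#define CTX_DEPTH_MAX    1   /* guards against re-entering one context */

struct sketch_ctx {
	int depth;                               /* stands in for a percpu, per-eventfd counter */
	void (*wakeup)(struct sketch_ctx *ctx);  /* callback that may signal again */
	struct sketch_ctx *next;                 /* context the callback signals */
};

static int global_depth;   /* stands in for the global percpu counter */
static int refused;        /* number of signals refused by either limit */

/* Returns false when the signal is refused because a limit was hit. */
static bool sketch_signal(struct sketch_ctx *ctx)
{
	assert(ctx != NULL);

	if (ctx->depth >= CTX_DEPTH_MAX)
		return false;   /* same context re-entered: deadlock case */
	if (global_depth >= GLOBAL_DEPTH_MAX)
		return false;   /* chain too deep: stack overflow case */

	ctx->depth++;
	global_depth++;
	if (ctx->wakeup)
		ctx->wakeup(ctx);
	global_depth--;
	ctx->depth--;
	return true;
}

/* Wakeup callback that forwards the signal to the next context. */
static void chain_wakeup(struct sketch_ctx *ctx)
{
	if (ctx->next && !sketch_signal(ctx->next))
		refused++;
}
```

With two contexts that signal each other, the per-context counter
stops the loop on the first re-entry, long before the global limit is
reached. That is why the global limit can be raised later without
reopening the deadlock case.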

Thanks,
Yongji


* Re: Re: [RFC v3 05/11] vdpa: shared virtual addressing support
  2021-01-20  5:55   ` Jason Wang
@ 2021-01-20  7:10     ` Yongji Xie
  2021-01-27  3:43       ` Jason Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-20  7:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 20, 2021 at 1:55 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> > This patches introduces SVA (Shared Virtual Addressing)
> > support for vDPA device. During vDPA device allocation,
> > vDPA device driver needs to indicate whether SVA is
> > supported by the device. Then vhost-vdpa bus driver
> > will not pin user page and transfer userspace virtual
> > address instead of physical address during DMA mapping.
> >
> > Suggested-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
> >   drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
> >   drivers/vdpa/vdpa.c               |  5 ++++-
> >   drivers/vdpa/vdpa_sim/vdpa_sim.c  |  3 ++-
> >   drivers/vhost/vdpa.c              | 35 +++++++++++++++++++++++------------
> >   include/linux/vdpa.h              | 10 +++++++---
> >   6 files changed, 38 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> > index 23474af7da40..95c4601f82f5 100644
> > --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> > +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> > @@ -439,7 +439,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >
> >       adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
> >                                   dev, &ifc_vdpa_ops,
> > -                                 IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
> > +                                 IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
> >       if (adapter == NULL) {
> >               IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
> >               return -ENOMEM;
> > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > index 77595c81488d..05988d6907f2 100644
> > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > @@ -1959,7 +1959,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
> >       max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
> >
> >       ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> > -                              2 * mlx5_vdpa_max_qps(max_vqs), NULL);
> > +                              2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
> >       if (IS_ERR(ndev))
> >               return PTR_ERR(ndev);
> >
> > diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> > index 32bd48baffab..50cab930b2e5 100644
> > --- a/drivers/vdpa/vdpa.c
> > +++ b/drivers/vdpa/vdpa.c
> > @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
> >    * @nvqs: number of virtqueues supported by this device
> >    * @size: size of the parent structure that contains private data
> >    * @name: name of the vdpa device; optional.
> > + * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
> >    *
> >    * Driver should use vdpa_alloc_device() wrapper macro instead of
> >    * using this directly.
> > @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
> >    */
> >   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
> >                                       const struct vdpa_config_ops *config,
> > -                                     int nvqs, size_t size, const char *name)
> > +                                     int nvqs, size_t size, const char *name,
> > +                                     bool sva)
> >   {
> >       struct vdpa_device *vdev;
> >       int err = -EINVAL;
> > @@ -108,6 +110,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
> >       vdev->config = config;
> >       vdev->features_valid = false;
> >       vdev->nvqs = nvqs;
> > +     vdev->sva = sva;
> >
> >       if (name)
> >               err = dev_set_name(&vdev->dev, "%s", name);
> > diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > index 85776e4e6749..03c796873a6b 100644
> > --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > @@ -367,7 +367,8 @@ static struct vdpasim *vdpasim_create(const char *name)
> >       else
> >               ops = &vdpasim_net_config_ops;
> >
> > -     vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, VDPASIM_VQ_NUM, name);
> > +     vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> > +                             VDPASIM_VQ_NUM, name, false);
> >       if (!vdpasim)
> >               goto err_alloc;
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index 4a241d380c40..36b6950ba37f 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -486,21 +486,25 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> >   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
> >   {
> >       struct vhost_dev *dev = &v->vdev;
> > +     struct vdpa_device *vdpa = v->vdpa;
> >       struct vhost_iotlb *iotlb = dev->iotlb;
> >       struct vhost_iotlb_map *map;
> >       struct page *page;
> >       unsigned long pfn, pinned;
> >
> >       while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> > -             pinned = map->size >> PAGE_SHIFT;
> > -             for (pfn = map->addr >> PAGE_SHIFT;
> > -                  pinned > 0; pfn++, pinned--) {
> > -                     page = pfn_to_page(pfn);
> > -                     if (map->perm & VHOST_ACCESS_WO)
> > -                             set_page_dirty_lock(page);
> > -                     unpin_user_page(page);
> > +             if (!vdpa->sva) {
> > +                     pinned = map->size >> PAGE_SHIFT;
> > +                     for (pfn = map->addr >> PAGE_SHIFT;
> > +                          pinned > 0; pfn++, pinned--) {
> > +                             page = pfn_to_page(pfn);
> > +                             if (map->perm & VHOST_ACCESS_WO)
> > +                                     set_page_dirty_lock(page);
> > +                             unpin_user_page(page);
> > +                     }
> > +                     atomic64_sub(map->size >> PAGE_SHIFT,
> > +                                     &dev->mm->pinned_vm);
> >               }
> > -             atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
> >               vhost_iotlb_map_free(iotlb, map);
> >       }
> >   }
> > @@ -558,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
> >               r = iommu_map(v->domain, iova, pa, size,
> >                             perm_to_iommu_flags(perm));
> >       }
> > -
> > -     if (r)
> > +     if (r) {
> >               vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> > -     else
> > +             return r;
> > +     }
> > +
> > +     if (!vdpa->sva)
> >               atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
> >
> > -     return r;
> > +     return 0;
> >   }
> >
> >   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> > @@ -589,6 +595,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >                                          struct vhost_iotlb_msg *msg)
> >   {
> >       struct vhost_dev *dev = &v->vdev;
> > +     struct vdpa_device *vdpa = v->vdpa;
> >       struct vhost_iotlb *iotlb = dev->iotlb;
> >       struct page **page_list;
> >       unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
> > @@ -607,6 +614,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >                                   msg->iova + msg->size - 1))
> >               return -EEXIST;
> >
> > +     if (vdpa->sva)
> > +             return vhost_vdpa_map(v, msg->iova, msg->size,
> > +                                   msg->uaddr, msg->perm);
> > +
> >       /* Limit the use of memory for bookkeeping */
> >       page_list = (struct page **) __get_free_page(GFP_KERNEL);
> >       if (!page_list)
> > diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> > index cb5a3d847af3..f86869651614 100644
> > --- a/include/linux/vdpa.h
> > +++ b/include/linux/vdpa.h
> > @@ -44,6 +44,7 @@ struct vdpa_parent_dev;
> >    * @config: the configuration ops for this device.
> >    * @index: device index
> >    * @features_valid: were features initialized? for legacy guests
> > + * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
>
>
> Thinking about this again, I think we probably need a better name than
> "sva", since the kernel already uses that for shared virtual address space.
> But actually we don't share the whole virtual address space.
>

This flag is used to tell vhost-vdpa bus driver to transfer virtual
addresses instead of physical addresses. So how about "use_va",
"need_va" or "va"?

> And I guess this can not work for the device that use platform IOMMU, so
> we should check and fail if sva && !(dma_map || set_map).
>

Agree.
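
A minimal sketch of that check, using stand-in types rather than the
real struct vdpa_config_ops. The helper name sketch_check_sva() and
the choice of -EINVAL are assumptions, not something decided in this
thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the DMA-related callbacks of a vDPA device (illustrative only). */
struct sketch_config_ops {
	int (*set_map)(void *dev, void *iotlb);
	int (*dma_map)(void *dev, unsigned long long iova,
		       unsigned long long size, unsigned long long addr,
		       unsigned int perm);
};

/*
 * A device that asks for virtual addresses has to translate them
 * itself, so it must provide dma_map or set_map. Without either
 * callback the mapping would go through the platform IOMMU, which
 * cannot work with userspace virtual addresses.
 */
static int sketch_check_sva(const struct sketch_config_ops *ops, bool sva)
{
	assert(ops != NULL);

	if (sva && !(ops->dma_map || ops->set_map))
		return -EINVAL;
	return 0;
}

/* Dummy callback used only to build an ops table with set_map present. */
static int sketch_set_map(void *dev, void *iotlb)
{
	(void)dev;
	(void)iotlb;
	return 0;
}
```

Such a check would most naturally run once, when the device is
allocated or when vhost-vdpa binds to it, so that a misconfigured
device fails early instead of at the first IOTLB update.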

Thanks,
Yongji


* Re: Re: [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB
  2021-01-20  6:24   ` Jason Wang
@ 2021-01-20  7:52     ` Yongji Xie
  2021-01-27  3:51       ` Jason Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-20  7:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 20, 2021 at 2:24 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> > Add an opaque pointer for vhost IOTLB to store the
> > corresponding vma->vm_file and offset on the DMA mapping.
>
>
> Let's split the patch into two.
>
> 1) opaque pointer
> 2) vma stuffs
>

OK.

>
> >
> > It will be used in VDUSE case later.
> >
> > Suggested-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++++---
> >   drivers/vhost/iotlb.c            |  5 ++-
> >   drivers/vhost/vdpa.c             | 66 +++++++++++++++++++++++++++++++++++-----
> >   drivers/vhost/vhost.c            |  4 +--
> >   include/linux/vdpa.h             |  3 +-
> >   include/linux/vhost_iotlb.h      |  8 ++++-
> >   6 files changed, 79 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > index 03c796873a6b..1ffcef67954f 100644
> > --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > @@ -279,7 +279,7 @@ static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
> >        */
> >       spin_lock(&vdpasim->iommu_lock);
> >       ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
> > -                                 pa, dir_to_perm(dir));
> > +                                 pa, dir_to_perm(dir), NULL);
>
>
> Maybe its better to introduce
>
> vhost_iotlb_add_range_ctx() which can accepts the opaque (context). And
> let vhost_iotlb_add_range() just call that.
>

If so, we need to export both vhost_iotlb_add_range() and
vhost_iotlb_add_range_ctx(), since the latter will be used in the
VDUSE driver. Isn't that a bit redundant?
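
For reference, the suggested layering could look roughly like the
sketch below. It uses a simplified stand-in for one IOTLB entry, not
the real struct vhost_iotlb, and the sketch_ prefixed names and the
error code are placeholders.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one IOTLB entry (illustrative only). */
struct sketch_map {
	uint64_t start;
	uint64_t last;
	uint64_t addr;
	unsigned int perm;
	void *opaque;   /* caller-defined context, may be NULL */
};

/* Variant that accepts an opaque context pointer. */
static int sketch_add_range_ctx(struct sketch_map *map,
				uint64_t start, uint64_t last,
				uint64_t addr, unsigned int perm,
				void *opaque)
{
	assert(map != NULL);

	if (last < start)
		return -EINVAL;   /* placeholder error code for an inverted range */

	map->start = start;
	map->last = last;
	map->addr = addr;
	map->perm = perm;
	map->opaque = opaque;
	return 0;
}

/* Existing callers keep their signature and simply pass no context. */
static int sketch_add_range(struct sketch_map *map,
			    uint64_t start, uint64_t last,
			    uint64_t addr, unsigned int perm)
{
	return sketch_add_range_ctx(map, start, last, addr, perm, NULL);
}
```

The cost is the one extra exported symbol raised above. The benefit
is that callers which never need a context, such as vdpa_sim and
vhost.c, would not have to be touched at all.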

>
> >       spin_unlock(&vdpasim->iommu_lock);
> >       if (ret)
> >               return DMA_MAPPING_ERROR;
> > @@ -317,7 +317,7 @@ static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
> >
> >               ret = vhost_iotlb_add_range(iommu, (u64)pa,
> >                                           (u64)pa + size - 1,
> > -                                         pa, VHOST_MAP_RW);
> > +                                         pa, VHOST_MAP_RW, NULL);
> >               if (ret) {
> >                       *dma_addr = DMA_MAPPING_ERROR;
> >                       kfree(addr);
> > @@ -625,7 +625,8 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
> >       for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> >            map = vhost_iotlb_itree_next(map, start, last)) {
> >               ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
> > -                                         map->last, map->addr, map->perm);
> > +                                         map->last, map->addr,
> > +                                         map->perm, NULL);
> >               if (ret)
> >                       goto err;
> >       }
> > @@ -639,14 +640,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
> >   }
> >
> >   static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
> > -                        u64 pa, u32 perm)
> > +                        u64 pa, u32 perm, void *opaque)
> >   {
> >       struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> >       int ret;
> >
> >       spin_lock(&vdpasim->iommu_lock);
> >       ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
> > -                                 perm);
> > +                                 perm, NULL);
> >       spin_unlock(&vdpasim->iommu_lock);
> >
> >       return ret;
> > diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
> > index 0fd3f87e913c..3bd5bd06cdbc 100644
> > --- a/drivers/vhost/iotlb.c
> > +++ b/drivers/vhost/iotlb.c
> > @@ -42,13 +42,15 @@ EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
> >    * @last: last of IOVA range
> >    * @addr: the address that is mapped to @start
> >    * @perm: access permission of this range
> > + * @opaque: the opaque pointer for the IOTLB mapping
> >    *
> >    * Returns an error last is smaller than start or memory allocation
> >    * fails
> >    */
> >   int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> >                         u64 start, u64 last,
> > -                       u64 addr, unsigned int perm)
> > +                       u64 addr, unsigned int perm,
> > +                       void *opaque)
> >   {
> >       struct vhost_iotlb_map *map;
> >
> > @@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> >       map->last = last;
> >       map->addr = addr;
> >       map->perm = perm;
> > +     map->opaque = opaque;
> >
> >       iotlb->nmaps++;
> >       vhost_iotlb_itree_insert(map, &iotlb->root);
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index 36b6950ba37f..e83e5be7cec8 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -488,6 +488,7 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
> >       struct vhost_dev *dev = &v->vdev;
> >       struct vdpa_device *vdpa = v->vdpa;
> >       struct vhost_iotlb *iotlb = dev->iotlb;
> > +     struct vhost_iotlb_file *iotlb_file;
> >       struct vhost_iotlb_map *map;
> >       struct page *page;
> >       unsigned long pfn, pinned;
> > @@ -504,6 +505,10 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
> >                       }
> >                       atomic64_sub(map->size >> PAGE_SHIFT,
> >                                       &dev->mm->pinned_vm);
> > +             } else if (map->opaque) {
> > +                     iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> > +                     fput(iotlb_file->file);
> > +                     kfree(iotlb_file);
> >               }
> >               vhost_iotlb_map_free(iotlb, map);
> >       }
> > @@ -540,8 +545,8 @@ static int perm_to_iommu_flags(u32 perm)
> >       return flags | IOMMU_CACHE;
> >   }
> >
> > -static int vhost_vdpa_map(struct vhost_vdpa *v,
> > -                       u64 iova, u64 size, u64 pa, u32 perm)
> > +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> > +                       u64 size, u64 pa, u32 perm, void *opaque)
> >   {
> >       struct vhost_dev *dev = &v->vdev;
> >       struct vdpa_device *vdpa = v->vdpa;
> > @@ -549,12 +554,12 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
> >       int r = 0;
> >
> >       r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> > -                               pa, perm);
> > +                               pa, perm, opaque);
> >       if (r)
> >               return r;
> >
> >       if (ops->dma_map) {
> > -             r = ops->dma_map(vdpa, iova, size, pa, perm);
> > +             r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
> >       } else if (ops->set_map) {
> >               if (!v->in_batch)
> >                       r = ops->set_map(vdpa, dev->iotlb);
> > @@ -591,6 +596,51 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> >       }
> >   }
> >
> > +static int vhost_vdpa_sva_map(struct vhost_vdpa *v,
> > +                           u64 iova, u64 size, u64 uaddr, u32 perm)
> > +{
> > +     u64 offset, map_size, map_iova = iova;
> > +     struct vhost_iotlb_file *iotlb_file;
> > +     struct vm_area_struct *vma;
> > +     int ret;
>
>
> Lacking mmap_read_lock().
>

Good catch! Will fix it.

>
> > +
> > +     while (size) {
> > +             vma = find_vma(current->mm, uaddr);
> > +             if (!vma) {
> > +                     ret = -EINVAL;
> > +                     goto err;
> > +             }
> > +             map_size = min(size, vma->vm_end - uaddr);
> > +             offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> > +             iotlb_file = NULL;
> > +             if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
>
>
> I wonder if we need more strict check here. When developing vhost-vdpa,
> I try hard to make sure the map can only work for user pages.
>
> So the question is: do we need to exclude MMIO area or only allow shmem
> to work here?
>

Do you mean we need to check VM_MIXEDMAP | VM_PFNMAP here?

It makes sense to me.
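
Something along these lines, as a rough userspace sketch. The flag
values and struct below are stand-ins (the real VM_* bits are defined
in include/linux/mm.h), and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bits; these are not the kernel's actual values. */
#define SKETCH_VM_SHARED   0x1UL
#define SKETCH_VM_PFNMAP   0x2UL
#define SKETCH_VM_MIXEDMAP 0x4UL

/* Stand-in for the two vm_area_struct fields the check looks at. */
struct sketch_vma {
	unsigned long vm_flags;
	const void *vm_file;   /* non-NULL when the mapping is file-backed */
};

/* Accept only shared, file-backed mappings that are not PFN or mixed maps. */
static bool sketch_vma_has_usable_file(const struct sketch_vma *vma)
{
	assert(vma != NULL);

	if (vma->vm_flags & (SKETCH_VM_PFNMAP | SKETCH_VM_MIXEDMAP))
		return false;   /* may cover MMIO or raw PFNs, not ordinary pages */
	return vma->vm_file != NULL && (vma->vm_flags & SKETCH_VM_SHARED) != 0;
}
```

Whether a rejected vma should fail the whole map or just be mapped
without a file is a separate question.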

>
>
> > +                     iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_KERNEL);
> > +                     if (!iotlb_file) {
> > +                             ret = -ENOMEM;
> > +                             goto err;
> > +                     }
> > +                     iotlb_file->file = get_file(vma->vm_file);
> > +                     iotlb_file->offset = offset;
> > +             }
>
>
> I wonder if it's better to allocate iotlb_file and make iotlb_file->file
> = NULL && iotlb_file->offset = 0. This can force a consistent code for
> the vDPA parents.
>

Looks fine to me.

> Or we can simply fail the map without a file as backend.
>

Actually, some vmas have no vm_file while the VM is booting.

>
> > +             ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> > +                                     perm, iotlb_file);
> > +             if (ret) {
> > +                     if (iotlb_file) {
> > +                             fput(iotlb_file->file);
> > +                             kfree(iotlb_file);
> > +                     }
> > +                     goto err;
> > +             }
> > +             size -= map_size;
> > +             uaddr += map_size;
> > +             map_iova += map_size;
> > +     }
> > +     return 0;
> > +err:
> > +     vhost_vdpa_unmap(v, iova, map_iova - iova);
> > +     return ret;
> > +}
> > +
> >   static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >                                          struct vhost_iotlb_msg *msg)
> >   {
> > @@ -615,8 +665,8 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >               return -EEXIST;
> >
> >       if (vdpa->sva)
> > -             return vhost_vdpa_map(v, msg->iova, msg->size,
> > -                                   msg->uaddr, msg->perm);
> > +             return vhost_vdpa_sva_map(v, msg->iova, msg->size,
> > +                                       msg->uaddr, msg->perm);
>
>
> So I think it's better squash vhost_vdpa_sva_map() and related changes
> into previous patch.
>

OK, so the order of the patches is:
1) opaque pointer
2) va support + vma stuffs?

Is it OK?

>
> >
> >       /* Limit the use of memory for bookkeeping */
> >       page_list = (struct page **) __get_free_page(GFP_KERNEL);
> > @@ -671,7 +721,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >                               csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
> >                               ret = vhost_vdpa_map(v, iova, csize,
> >                                                    map_pfn << PAGE_SHIFT,
> > -                                                  msg->perm);
> > +                                                  msg->perm, NULL);
> >                               if (ret) {
> >                                       /*
> >                                        * Unpin the pages that are left unmapped
> > @@ -700,7 +750,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >
> >       /* Pin the rest chunk */
> >       ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> > -                          map_pfn << PAGE_SHIFT, msg->perm);
> > +                          map_pfn << PAGE_SHIFT, msg->perm, NULL);
> >   out:
> >       if (ret) {
> >               if (nchunks) {
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index a262e12c6dc2..120dd5b3c119 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1104,7 +1104,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
> >               vhost_vq_meta_reset(dev);
> >               if (vhost_iotlb_add_range(dev->iotlb, msg->iova,
> >                                         msg->iova + msg->size - 1,
> > -                                       msg->uaddr, msg->perm)) {
> > +                                       msg->uaddr, msg->perm, NULL)) {
> >                       ret = -ENOMEM;
> >                       break;
> >               }
> > @@ -1450,7 +1450,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> >                                         region->guest_phys_addr +
> >                                         region->memory_size - 1,
> >                                         region->userspace_addr,
> > -                                       VHOST_MAP_RW))
> > +                                       VHOST_MAP_RW, NULL))
> >                       goto err;
> >       }
> >
> > diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> > index f86869651614..b264c627e94b 100644
> > --- a/include/linux/vdpa.h
> > +++ b/include/linux/vdpa.h
> > @@ -189,6 +189,7 @@ struct vdpa_iova_range {
> >    *                          @size: size of the area
> >    *                          @pa: physical address for the map
> >    *                          @perm: device access permission (VHOST_MAP_XX)
> > + *                           @opaque: the opaque pointer for the mapping
> >    *                          Returns integer: success (0) or error (< 0)
> >    * @dma_unmap:                      Unmap an area of IOVA (optional but
> >    *                          must be implemented with dma_map)
> > @@ -243,7 +244,7 @@ struct vdpa_config_ops {
> >       /* DMA ops */
> >       int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
> >       int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
> > -                    u64 pa, u32 perm);
> > +                    u64 pa, u32 perm, void *opaque);
> >       int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
> >
> >       /* Free device resources */
> > diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
> > index 6b09b786a762..66a50c11c8ca 100644
> > --- a/include/linux/vhost_iotlb.h
> > +++ b/include/linux/vhost_iotlb.h
> > @@ -4,6 +4,11 @@
> >
> >   #include <linux/interval_tree_generic.h>
> >
> > +struct vhost_iotlb_file {
> > +     struct file *file;
> > +     u64 offset;
> > +};
>
>
> I think we'd better either:
>
> 1) simply use struct vhost_iotlb_file * instead of void *opaque for
> vhost_iotlb_map
>
> or
>
> 2)rename and move the vhost_iotlb_file to vdpa
>
> 2) looks better since we want to let vhost iotlb to carry any type of
> context (opaque pointer)
>

I agree. So we need to introduce struct vdpa_iotlb_file in
include/linux/vdpa.h, right?

> And if we do this, the modification of vdpa_config_ops deserves a
> separate patch.
>

Sorry, I didn't get you here. What do you mean by the modification of
vdpa_config_ops? Do you mean adding an opaque pointer to ops.dma_map?

Thanks,
Yongji


* Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-20  3:46   ` Jason Wang
  2021-01-20  6:46     ` Yongji Xie
@ 2021-01-20 11:08     ` Stefano Garzarella
  2021-01-27  3:33       ` Jason Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Stefano Garzarella @ 2021-01-20 11:08 UTC (permalink / raw)
  To: Jason Wang, Xie Yongji
  Cc: mst, stefanha, parav, bob.liu, hch, rdunlap, willy, viro, axboe,
	bcrl, corbet, virtualization, netdev, kvm, linux-aio,
	linux-fsdevel

On Wed, Jan 20, 2021 at 11:46:38AM +0800, Jason Wang wrote:
>
>On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>With VDUSE, we should be able to support all kinds of virtio devices.
>>
>>Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>---
>>  drivers/vhost/vdpa.c | 29 +++--------------------------
>>  1 file changed, 3 insertions(+), 26 deletions(-)
>>
>>diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>index 29ed4173f04e..448be7875b6d 100644
>>--- a/drivers/vhost/vdpa.c
>>+++ b/drivers/vhost/vdpa.c
>>@@ -22,6 +22,7 @@
>>  #include <linux/nospec.h>
>>  #include <linux/vhost.h>
>>  #include <linux/virtio_net.h>
>>+#include <linux/virtio_blk.h>
>>  #include "vhost.h"
>>@@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
>>  	return 0;
>>  }
>>-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
>>-				      struct vhost_vdpa_config *c)
>>-{
>>-	long size = 0;
>>-
>>-	switch (v->virtio_id) {
>>-	case VIRTIO_ID_NET:
>>-		size = sizeof(struct virtio_net_config);
>>-		break;
>>-	}
>>-
>>-	if (c->len == 0)
>>-		return -EINVAL;
>>-
>>-	if (c->len > size - c->off)
>>-		return -E2BIG;
>>-
>>-	return 0;
>>-}
>
>
>I think we should use a separate patch for this.

For the vdpa-blk simulator I had the same issue, and I'm adding a
.get_config_size() callback to vdpa devices.

Do you think that makes sense, or is it better to remove this check in
vhost/vdpa and delegate the boundary checks to the get_config/set_config
callbacks?
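
To make the first option concrete, here is a rough userspace sketch of
a validate helper driven by such a callback. Only the callback name
.get_config_size() is taken from the description above. The types,
the helper name and the error codes are stand-ins modelled on the
removed vhost_vdpa_config_validate().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_dev;

/* Stand-in ops table with the proposed size callback (illustrative only). */
struct sketch_ops {
	size_t (*get_config_size)(const struct sketch_dev *dev);
};

struct sketch_dev {
	const struct sketch_ops *ops;
	size_t config_size;   /* size of the device-specific config space */
};

/* Example callback: the device reports the size of its own config space. */
static size_t sketch_get_config_size(const struct sketch_dev *dev)
{
	return dev->config_size;
}

/* Validate an (off, len) access against the size reported by the device. */
static int sketch_config_validate(const struct sketch_dev *dev,
				  uint32_t off, uint32_t len)
{
	size_t size;

	assert(dev != NULL && dev->ops != NULL &&
	       dev->ops->get_config_size != NULL);
	size = dev->ops->get_config_size(dev);

	if (len == 0)
		return -EINVAL;
	/* Check off first so that size - off cannot wrap around. */
	if (off > size || len > size - off)
		return -E2BIG;
	return 0;
}
```

This keeps one device-independent check in vhost/vdpa, and each device
type only has to report the size of its own config structure.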

Thanks,
Stefano



* Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-01-19 14:53     ` Jonathan Corbet
  2021-01-19 17:53     ` Randy Dunlap
@ 2021-01-26  8:08     ` Jason Wang
  2021-01-27  8:50       ` Yongji Xie
  2021-01-26  8:19     ` Jason Wang
  3 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-26  8:08 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 1:07 PM, Xie Yongji wrote:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
>
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> the file descriptors referring to vDPA device's iova regions. Then
> userspace can use mmap() to access those iova regions. Besides,
> the eventfd mechanism is used to trigger interrupt callbacks and
> receive virtqueue kicks in userspace.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/driver-api/vduse.rst                 |   85 ++
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |    7 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>   drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
>   drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
>   drivers/vdpa/vdpa_user/vduse.h                     |   62 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1217 ++++++++++++++++++++
>   include/uapi/linux/vdpa.h                          |    1 +
>   include/uapi/linux/vduse.h                         |  125 ++
>   13 files changed, 2267 insertions(+)
>   create mode 100644 Documentation/driver-api/vduse.rst
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> new file mode 100644
> index 000000000000..9418a7f6646b
> --- /dev/null
> +++ b/Documentation/driver-api/vduse.rst
> @@ -0,0 +1,85 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +
> +How VDUSE works
> +------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> +to the new resources will be returned, which can be used to implement the
> +userspace vDPA device's control path and data path.
> +
> +To implement control path, the read/write operations to the file descriptor
> +will be used to receive/reply the control messages from/to VDUSE driver.


It would be better to document the protocol here, e.g. how the message
identifier is used.
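
As a rough illustration of what the identifier part of such documentation
could describe, here is a minimal userspace sketch of matching a reply to
its request. The names demo_request, demo_response, demo_make_reply and
demo_reply_matches are hypothetical. Only the type, unique and result
fields are taken from how vduse_dev.c uses them in this patch; that the
response echoes "unique" back is an assumption, and the real layouts are
the ones in include/uapi/linux/vduse.h.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the uapi request/response structures. */
struct demo_request {
	uint32_t type;    /* message type, e.g. a VDUSE_GET_STATUS value */
	uint64_t unique;  /* identifier chosen by the kernel side */
};

struct demo_response {
	uint64_t unique;  /* assumed: echoes the identifier of the request */
	int32_t result;   /* 0 on success, negative errno-style value otherwise */
};

/* Build the reply for one request: copy the identifier back unchanged. */
static struct demo_response demo_make_reply(const struct demo_request *req,
					    int32_t result)
{
	struct demo_response resp;

	memset(&resp, 0, sizeof(resp));
	resp.unique = req->unique;
	resp.result = result;
	return resp;
}

/* Receiver-side check in miniature: does this reply answer that request? */
static int demo_reply_matches(const struct demo_request *req,
			      const struct demo_response *resp)
{
	return req->unique == resp->unique;
}
```

The point worth documenting is that userspace must never invent or reuse
an identifier: it only copies back the one it received with read().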


> +Those control messages are mostly based on the vdpa_config_ops which defines
> +a unified interface to control different types of vDPA device.
> +
> +The following types of messages are provided by the VDUSE framework now:
> +
> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.


"Set the vring address of a virtqueue" might be better here.


> +
> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> +
> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> +
> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> +
> +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
> +
> +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue


It's better not to mention layout-specific details (last_avail_idx) here,
considering that we should support packed virtqueues in the future.


> +
> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> +
> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> +
> +- VDUSE_SET_STATUS: Set the device status
> +
> +- VDUSE_GET_STATUS: Get the device status
> +
> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> +
> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Please see include/linux/vdpa.h for details.
> +
> +In the data path, vDPA device's iova regions will be mapped into userspace with
> +the help of VDUSE_IOTLB_GET_FD ioctl on the userspace vDPA device fd:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> +  access this iova region by passing the fd to mmap(2).
> +
> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> +receive virtqueue kicks in userspace. The following ioctls on the userspace
> +vDPA device fd are provided to support that:
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE driver to notify userspace to consume the vring.
> +
> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
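

For readers less familiar with eventfd, the counter semantics that both
the kickfd and the irqfd rely on can be sketched in plain userspace C.
This is only an illustration of eventfd(2) itself: the demo_* helpers are
hypothetical names, and the VDUSE_VQ_SETUP_KICKFD / VDUSE_VQ_SETUP_IRQFD
ioctls are deliberately not called, since their argument layout is
defined in the uapi header and not in this document.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a non-blocking eventfd; returns the fd, or -1 on error. */
static int demo_eventfd_create(void)
{
	return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Signal the eventfd once; this is what the notifying side does. */
static int demo_eventfd_signal(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consume pending notifications. Returns the number of signals that
 * were coalesced into the counter, or 0 if nothing was pending.
 */
static uint64_t demo_eventfd_drain(int fd)
{
	uint64_t count = 0;

	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

For a kickfd, the driver is the signaling side (eventfd_signal() in
vduse_vq_kick()) and the daemon drains it; for an irqfd the roles are
reversed. Several signals may be coalesced into one wakeup, so the
consumer should process the whole vring rather than one entry per read.
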
> +
> +MMU-based IOMMU Driver
> +----------------------
> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> +driver to support mapping the kernel dma buffer into the userspace iova
> +region dynamically.
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the dma buffer in kernel.
> +
> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> +prevent userspace accessing the original buffer directly which may contain other
> +kernel data. During the mapping, unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> +space instead of the original one.
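

The direction rule described above can be shown with a small userspace
sketch. It mirrors what vduse_domain_map_page() and
vduse_domain_unmap_page() do later in this patch, but the enum and the
demo_* functions are local, hypothetical stand-ins, and plain arrays take
the place of kernel pages.

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for the kernel's enum dma_data_direction. */
enum demo_dma_dir {
	DEMO_DMA_BIDIRECTIONAL,
	DEMO_DMA_TO_DEVICE,
	DEMO_DMA_FROM_DEVICE,
};

/* On map: data the device will read is copied into the bounce buffer. */
static void demo_bounce_map(void *bounce, const void *orig, size_t len,
			    enum demo_dma_dir dir)
{
	if (dir == DEMO_DMA_TO_DEVICE || dir == DEMO_DMA_BIDIRECTIONAL)
		memcpy(bounce, orig, len);
}

/* On unmap: data the device wrote is copied back to the original buffer. */
static void demo_bounce_unmap(void *orig, const void *bounce, size_t len,
			      enum demo_dma_dir dir)
{
	if (dir == DEMO_DMA_FROM_DEVICE || dir == DEMO_DMA_BIDIRECTIONAL)
		memcpy(orig, bounce, len);
}
```

Userspace only ever sees the bounce buffer, so a DMA_TO_DEVICE mapping
cannot be used to modify the original kernel buffer: nothing is copied
back on unmap.
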
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a28c839..71722e6f8f23 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index 4be7be39be26..667354309bf4 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -21,6 +21,13 @@ config VDPA_SIM
>   	  to RX. This device is used for testing, prototyping and
>   	  development of vDPA.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA


This needs to select VHOST_IOTLB, since struct vduse_dev below holds a
struct vhost_iotlb pointer.


> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index d160e9b63a66..66e97778ad03 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..b7645e36992b
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> new file mode 100644
> index 000000000000..dbffddb08908
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/eventfd.c
> @@ -0,0 +1,221 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Eventfd support for VDUSE
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/eventfd.h>
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <uapi/linux/vduse.h>
> +
> +#include "eventfd.h"
> +
> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> +
> +static void vduse_virqfd_shutdown(struct work_struct *work)
> +{
> +	u64 cnt;
> +	struct vduse_virqfd *virqfd = container_of(work,
> +					struct vduse_virqfd, shutdown);
> +
> +	eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
> +	flush_work(&virqfd->inject);
> +	eventfd_ctx_put(virqfd->ctx);
> +	kfree(virqfd);
> +}
> +
> +static void vduse_virqfd_inject(struct work_struct *work)
> +{
> +	struct vduse_virqfd *virqfd = container_of(work,
> +					struct vduse_virqfd, inject);
> +	struct vduse_virtqueue *vq = virqfd->vq;
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb)
> +		vq->cb(vq->private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
> +{
> +	queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
> +}
> +
> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> +				int sync, void *key)
> +{
> +	struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
> +	struct vduse_virtqueue *vq = virqfd->vq;
> +
> +	__poll_t flags = key_to_poll(key);
> +
> +	if (flags & EPOLLIN)
> +		schedule_work(&virqfd->inject);
> +
> +	if (flags & EPOLLHUP) {
> +		spin_lock(&vq->irq_lock);
> +		if (vq->virqfd == virqfd) {
> +			vq->virqfd = NULL;
> +			virqfd_deactivate(virqfd);
> +		}
> +		spin_unlock(&vq->irq_lock);
> +	}
> +
> +	return 0;
> +}
> +
> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
> +			wait_queue_head_t *wqh, poll_table *pt)
> +{
> +	struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
> +
> +	add_wait_queue(wqh, &virqfd->wait);
> +}
> +
> +int vduse_virqfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct vduse_virqfd *virqfd;
> +	struct fd irqfd;
> +	struct eventfd_ctx *ctx;
> +	struct vduse_virtqueue *vq;
> +	__poll_t events;
> +	int ret;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
> +	if (!virqfd)
> +		return -ENOMEM;
> +
> +	INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
> +	INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
> +
> +	ret = -EBADF;
> +	irqfd = fdget(eventfd->fd);
> +	if (!irqfd.file)
> +		goto err_fd;
> +
> +	ctx = eventfd_ctx_fileget(irqfd.file);
> +	if (IS_ERR(ctx)) {
> +		ret = PTR_ERR(ctx);
> +		goto err_ctx;
> +	}
> +
> +	virqfd->vq = vq;
> +	virqfd->ctx = ctx;
> +	spin_lock(&vq->irq_lock);
> +	if (vq->virqfd)
> +		virqfd_deactivate(virqfd);
> +	vq->virqfd = virqfd;
> +	spin_unlock(&vq->irq_lock);
> +
> +	init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
> +	init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
> +
> +	events = vfs_poll(irqfd.file, &virqfd->pt);
> +
> +	/*
> +	 * Check if there was an event already pending on the eventfd
> +	 * before we registered and trigger it as if we didn't miss it.
> +	 */
> +	if (events & EPOLLIN)
> +		schedule_work(&virqfd->inject);
> +
> +	fdput(irqfd);
> +
> +	return 0;
> +err_ctx:
> +	fdput(irqfd);
> +err_fd:
> +	kfree(virqfd);
> +	return ret;
> +}
> +
> +void vduse_virqfd_release(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		if (vq->virqfd) {
> +			virqfd_deactivate(vq->virqfd);
> +			vq->virqfd = NULL;
> +		}
> +		spin_unlock(&vq->irq_lock);
> +	}
> +	flush_workqueue(vduse_irqfd_cleanup_wq);
> +}
> +
> +int vduse_virqfd_init(void)
> +{
> +	vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
> +						WQ_UNBOUND, 0);
> +	if (!vduse_irqfd_cleanup_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void vduse_virqfd_exit(void)
> +{
> +	destroy_workqueue(vduse_irqfd_cleanup_wq);
> +}
> +
> +void vduse_vq_kick(struct vduse_virtqueue *vq)
> +{
> +	spin_lock(&vq->kick_lock);
> +	if (vq->ready && vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx;
> +	struct vduse_virtqueue *vq;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	ctx = eventfd_ctx_fdget(eventfd->fd);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +void vduse_kickfd_release(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->kick_lock);
> +		if (vq->kickfd) {
> +			eventfd_ctx_put(vq->kickfd);
> +			vq->kickfd = NULL;
> +		}
> +		spin_unlock(&vq->kick_lock);
> +	}
> +}
> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
> new file mode 100644
> index 000000000000..14269ff27f47
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/eventfd.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Eventfd support for VDUSE
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_EVENTFD_H
> +#define _VDUSE_EVENTFD_H
> +
> +#include <linux/eventfd.h>
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <uapi/linux/vduse.h>
> +
> +#include "vduse.h"
> +
> +struct vduse_dev;
> +
> +struct vduse_virqfd {
> +	struct eventfd_ctx *ctx;
> +	struct vduse_virtqueue *vq;
> +	struct work_struct inject;
> +	struct work_struct shutdown;
> +	wait_queue_entry_t wait;
> +	poll_table pt;
> +};
> +
> +int vduse_virqfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd);
> +
> +void vduse_virqfd_release(struct vduse_dev *dev);
> +
> +int vduse_virqfd_init(void);
> +
> +void vduse_virqfd_exit(void);
> +
> +void vduse_vq_kick(struct vduse_virtqueue *vq);
> +
> +int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd);
> +
> +void vduse_kickfd_release(struct vduse_dev *dev);
> +
> +#endif /* _VDUSE_EVENTFD_H */
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> new file mode 100644
> index 000000000000..cdfef8e9f9d6
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -0,0 +1,426 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +
> +#include "iova_domain.h"
> +
> +#define IOVA_START_PFN 1
> +#define IOVA_ALLOC_ORDER 12
> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)


Can this work for all architectures (e.g. why not use PAGE_SIZE)?


> +
> +#define CONSISTENT_DMA_SIZE (1024 * 1024 * 1024)
> +
> +static inline struct page *
> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
> +				unsigned long iova)
> +{
> +	unsigned long index = iova >> PAGE_SHIFT;
> +
> +	return domain->bounce_pages[index];
> +}
> +
> +static inline void
> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> +				unsigned long iova, struct page *page)
> +{
> +	unsigned long index = iova >> PAGE_SHIFT;
> +
> +	domain->bounce_pages[index] = page;
> +}
> +
> +static struct vduse_iova_map *
> +vduse_domain_alloc_iova_map(struct vduse_iova_domain *domain,
> +			unsigned long iova, unsigned long orig,
> +			size_t size, enum dma_data_direction dir)
> +{
> +	struct vduse_iova_map *map;
> +
> +	map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
> +	if (!map)
> +		return NULL;
> +
> +	map->iova.start = iova;
> +	map->iova.last = iova + size - 1;
> +	map->orig = orig;
> +	map->size = size;
> +	map->dir = dir;
> +
> +	return map;
> +}
> +
> +static struct page *
> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain,
> +				unsigned long iova)
> +{
> +	unsigned long start = iova & PAGE_MASK;
> +	unsigned long last = start + PAGE_SIZE - 1;
> +	struct vduse_iova_map *map;
> +	struct interval_tree_node *node;
> +	struct page *page = NULL;
> +
> +	spin_lock(&domain->map_lock);
> +	node = interval_tree_iter_first(&domain->mappings, start, last);
> +	if (!node)
> +		goto out;
> +
> +	map = container_of(node, struct vduse_iova_map, iova);
> +	page = virt_to_page(map->orig + iova - map->iova.start);
> +	get_page(page);
> +out:
> +	spin_unlock(&domain->map_lock);
> +
> +	return page;
> +}
> +
> +static struct page *
> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain,
> +				unsigned long iova)
> +{
> +	unsigned long start = iova & PAGE_MASK;
> +	unsigned long last = start + PAGE_SIZE - 1;
> +	struct vduse_iova_map *map;
> +	struct interval_tree_node *node;
> +	struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
> +
> +	if (!new_page)
> +		return NULL;
> +
> +	spin_lock(&domain->map_lock);
> +	node = interval_tree_iter_first(&domain->mappings, start, last);
> +	if (!node) {
> +		__free_page(new_page);
> +		goto out;
> +	}
> +	page = vduse_domain_get_bounce_page(domain, iova);
> +	if (page) {
> +		get_page(page);
> +		__free_page(new_page);


Let's delay the allocation of new_page until it is really required.


> +		goto out;
> +	}
> +	vduse_domain_set_bounce_page(domain, iova, new_page);
> +	get_page(new_page);
> +	page = new_page;
> +
> +	while (node) {


I may be missing something, but in which case do we need the loop here?


> +		unsigned int src_offset = 0, dst_offset = 0;
> +		void *src, *dst;
> +		size_t copy_len;
> +
> +		map = container_of(node, struct vduse_iova_map, iova);
> +		node = interval_tree_iter_next(node, start, last);
> +		if (map->dir == DMA_FROM_DEVICE)
> +			continue;
> +
> +		if (start > map->iova.start)
> +			src_offset = start - map->iova.start;
> +		else
> +			dst_offset = map->iova.start - start;
> +
> +		src = (void *)(map->orig + src_offset);
> +		dst = page_address(page) + dst_offset;
> +		copy_len = min_t(size_t, map->size - src_offset,
> +				PAGE_SIZE - dst_offset);
> +		memcpy(dst, src, copy_len);
> +	}
> +out:
> +	spin_unlock(&domain->map_lock);
> +
> +	return page;
> +}
> +
> +static void
> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> +				unsigned long iova, size_t size)
> +{
> +	struct page *page;
> +	struct interval_tree_node *node;
> +	unsigned long last = iova + size - 1;
> +
> +	spin_lock(&domain->map_lock);
> +	node = interval_tree_iter_first(&domain->mappings, iova, last);
> +	if (WARN_ON(node))
> +		goto out;
> +
> +	while (size > 0) {
> +		page = vduse_domain_get_bounce_page(domain, iova);
> +		if (page) {
> +			vduse_domain_set_bounce_page(domain, iova, NULL);
> +			__free_page(page);
> +		}
> +		size -= PAGE_SIZE;
> +		iova += PAGE_SIZE;
> +	}
> +out:
> +	spin_unlock(&domain->map_lock);
> +}
> +
> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> +				unsigned long iova, unsigned long orig,
> +				size_t size, enum dma_data_direction dir)
> +{
> +	unsigned int offset = offset_in_page(iova);
> +
> +	while (size) {
> +		struct page *p = vduse_domain_get_bounce_page(domain, iova);
> +		size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
> +		void *addr;
> +
> +		WARN_ON(!p && dir == DMA_FROM_DEVICE);
> +
> +		if (p) {
> +			addr = page_address(p) + offset;
> +			if (dir == DMA_TO_DEVICE)
> +				memcpy(addr, (void *)orig, copy_len);
> +			else if (dir == DMA_FROM_DEVICE)
> +				memcpy((void *)orig, addr, copy_len);
> +		}
> +
> +		size -= copy_len;
> +		orig += copy_len;
> +		iova += copy_len;
> +		offset = 0;
> +	}
> +}
> +
> +static unsigned long vduse_domain_alloc_iova(struct iova_domain *iovad,
> +				unsigned long size, unsigned long limit)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +	unsigned long iova_pfn;
> +
> +	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +		iova_len = roundup_pow_of_two(iova_len);
> +	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> +
> +	return iova_pfn << shift;
> +}
> +
> +static void vduse_domain_free_iova(struct iova_domain *iovad,
> +				unsigned long iova, size_t size)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +
> +	free_iova_fast(iovad, iova >> shift, iova_len);
> +}
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				struct page *page, unsigned long offset,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +	unsigned long limit = domain->bounce_size - 1;
> +	unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
> +	unsigned long orig = (unsigned long)page_address(page) + offset;
> +	struct vduse_iova_map *map;
> +
> +	if (!iova)
> +		return DMA_MAPPING_ERROR;
> +
> +	map = vduse_domain_alloc_iova_map(domain, iova, orig, size, dir);
> +	if (!map) {
> +		vduse_domain_free_iova(iovad, iova, size);
> +		return DMA_MAPPING_ERROR;
> +	}
> +
> +	spin_lock(&domain->map_lock);
> +	interval_tree_insert(&map->iova, &domain->mappings);
> +	spin_unlock(&domain->map_lock);
> +
> +	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, iova, orig, size, DMA_TO_DEVICE);
> +
> +	return (dma_addr_t)iova;
> +}
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			dma_addr_t dma_addr, size_t size,
> +			enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +	unsigned long iova = (unsigned long)dma_addr;
> +	struct interval_tree_node *node;
> +	struct vduse_iova_map *map;
> +
> +	spin_lock(&domain->map_lock);
> +	node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
> +	if (WARN_ON(!node)) {
> +		spin_unlock(&domain->map_lock);
> +		return;
> +	}
> +	interval_tree_remove(node, &domain->mappings);
> +	spin_unlock(&domain->map_lock);
> +
> +	map = container_of(node, struct vduse_iova_map, iova);
> +	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, iova, map->orig,
> +					size, DMA_FROM_DEVICE);
> +	vduse_domain_free_iova(iovad, iova, size);
> +	kfree(map);
> +}
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				size_t size, dma_addr_t *dma_addr,
> +				gfp_t flag, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	unsigned long limit = domain->bounce_size + CONSISTENT_DMA_SIZE - 1;
> +	unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
> +	void *orig = alloc_pages_exact(size, flag);
> +	struct vduse_iova_map *map;
> +
> +	if (!iova || !orig)
> +		goto err;
> +
> +	map = vduse_domain_alloc_iova_map(domain, iova, (unsigned long)orig,
> +					size, DMA_BIDIRECTIONAL);
> +	if (!map)
> +		goto err;
> +
> +	spin_lock(&domain->map_lock);
> +	interval_tree_insert(&map->iova, &domain->mappings);
> +	spin_unlock(&domain->map_lock);
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return orig;
> +err:
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	if (orig)
> +		free_pages_exact(orig, size);
> +	if (iova)
> +		vduse_domain_free_iova(iovad, iova, size);
> +
> +	return NULL;
> +}
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	unsigned long iova = (unsigned long)dma_addr;
> +	struct interval_tree_node *node;
> +	struct vduse_iova_map *map;
> +
> +	spin_lock(&domain->map_lock);
> +	node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
> +	if (WARN_ON(!node)) {
> +		spin_unlock(&domain->map_lock);
> +		return;
> +	}
> +	interval_tree_remove(node, &domain->mappings);
> +	spin_unlock(&domain->map_lock);
> +
> +	map = container_of(node, struct vduse_iova_map, iova);
> +	vduse_domain_free_iova(iovad, iova, size);
> +	free_pages_exact(vaddr, size);
> +	kfree(map);
> +}
> +
> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> +	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> +	struct page *page;
> +
> +	if (!domain)
> +		return VM_FAULT_SIGBUS;
> +
> +	if (iova < domain->bounce_size)
> +		page = vduse_domain_alloc_bounce_page(domain, iova);
> +	else
> +		page = vduse_domain_get_mapping_page(domain, iova);
> +
> +	if (!page)
> +		return VM_FAULT_SIGBUS;
> +
> +	vmf->page = page;
> +
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> +	.fault = vduse_domain_mmap_fault,
> +};
> +
> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
> +	vma->vm_private_data = domain;
> +	vma->vm_ops = &vduse_domain_mmap_ops;
> +
> +	return 0;
> +}
> +
> +static int vduse_domain_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
> +	put_iova_domain(&domain->stream_iovad);
> +	put_iova_domain(&domain->consistent_iovad);
> +	vfree(domain->bounce_pages);
> +	kfree(domain);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_domain_fops = {
> +	.mmap = vduse_domain_mmap,
> +	.release = vduse_domain_release,
> +};


It's better to explain the reason for introducing a dedicated file for 
mmap() here.


> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> +{
> +	fput(domain->file);
> +}
> +
> +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size)
> +{
> +	struct vduse_iova_domain *domain;
> +	struct file *file;
> +	unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> +
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!domain)
> +		return NULL;
> +
> +	domain->bounce_size = PAGE_ALIGN(bounce_size);
> +	domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
> +	if (!domain->bounce_pages)
> +		goto err_page;
> +
> +	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> +				domain, O_RDWR);
> +	if (IS_ERR(file))
> +		goto err_file;
> +
> +	domain->file = file;
> +	spin_lock_init(&domain->map_lock);
> +	domain->mappings = RB_ROOT_CACHED;
> +	init_iova_domain(&domain->stream_iovad,
> +			IOVA_ALLOC_SIZE, IOVA_START_PFN);
> +	init_iova_domain(&domain->consistent_iovad,
> +			PAGE_SIZE, bounce_pfns);
> +
> +	return domain;
> +err_file:
> +	vfree(domain->bounce_pages);
> +err_page:
> +	kfree(domain);
> +	return NULL;
> +}
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> new file mode 100644
> index 000000000000..cc61866acb56
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_IOVA_DOMAIN_H
> +#define _VDUSE_IOVA_DOMAIN_H
> +
> +#include <linux/iova.h>
> +#include <linux/interval_tree.h>
> +#include <linux/dma-mapping.h>
> +
> +struct vduse_iova_map {
> +	struct interval_tree_node iova;
> +	unsigned long orig;


Need a better name, probably "va"?


> +	size_t size;
> +	enum dma_data_direction dir;
> +};
> +
> +struct vduse_iova_domain {
> +	struct iova_domain stream_iovad;
> +	struct iova_domain consistent_iovad;
> +	struct page **bounce_pages;
> +	size_t bounce_size;
> +	struct rb_root_cached mappings;


We already have the IOTLB; is there any reason for these extra mappings here?


> +	spinlock_t map_lock;
> +	struct file *file;
> +};
> +
> +static inline struct file *
> +vduse_domain_file(struct vduse_iova_domain *domain)
> +{
> +	return domain->file;
> +}
> +
> +static inline unsigned long
> +vduse_domain_get_offset(struct vduse_iova_domain *domain, unsigned long iova)
> +{
> +	return iova;
> +}
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				struct page *page, unsigned long offset,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs);
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			dma_addr_t dma_addr, size_t size,
> +			enum dma_data_direction dir, unsigned long attrs);
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				size_t size, dma_addr_t *dma_addr,
> +				gfp_t flag, unsigned long attrs);
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs);
> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain);
> +
> +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size);
> +
> +#endif /* _VDUSE_IOVA_DOMAIN_H */
> diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
> new file mode 100644
> index 000000000000..3566d229382e
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_H
> +#define _VDUSE_H
> +
> +#include <linux/eventfd.h>
> +#include <linux/wait.h>
> +#include <linux/vdpa.h>
> +
> +#include "iova_domain.h"
> +#include "eventfd.h"
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	bool ready;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vduse_virqfd *virqfd;
> +	void *private;
> +	irqreturn_t (*cb)(void *data);
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct mutex lock;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	struct vhost_iotlb *iommu;
> +	spinlock_t iommu_lock;
> +	atomic_t bounce_map;
> +	spinlock_t msg_lock;
> +	atomic64_t msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct list_head list;
> +	bool connected;
> +	u32 id;
> +	u16 vq_size_max;
> +	u16 vq_num;
> +	u32 vq_align;
> +	u32 device_id;
> +	u32 vendor_id;
> +};
> +
> +#endif /* _VDUSE_H_ */
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..1cf759bc5914
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1217 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/miscdevice.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "vduse.h"
> +
> +#define DRV_VERSION  "1.0"
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +	refcount_t refcnt;


The reference count here will bring extra complexity. I think we can 
sync through msg_lock.



> +};
> +
> +static struct workqueue_struct *vduse_vdpa_wq;
> +static DEFINE_MUTEX(vduse_lock);
> +static LIST_HEAD(vduse_devs);
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = type;
> +	msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);


This doesn't look safe; let's use an IDR here.


> +	init_waitqueue_head(&msg->waitq);
> +	refcount_set(&msg->refcnt, 1);
> +
> +	return msg;
> +}
> +
> +static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
> +{
> +	refcount_inc(&msg->refcnt);
> +}
> +
> +static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
> +{
> +	if (refcount_dec_and_test(&msg->refcnt))
> +		kfree(msg);
> +}
> +
> +static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
> +						struct list_head *head,
> +						uint32_t unique)
> +{
> +	struct vduse_dev_msg *tmp, *msg = NULL;
> +
> +	spin_lock(&dev->msg_lock);
> +	list_for_each_entry(tmp, head, list) {
> +		if (tmp->req.unique == unique) {
> +			msg = tmp;
> +			list_del(&tmp->list);
> +			break;
> +		}
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return msg;
> +}
> +
> +static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
> +						struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	spin_lock(&dev->msg_lock);
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return msg;
> +}
> +
> +static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
> +			struct vduse_dev_msg *msg, struct list_head *head)
> +{
> +	spin_lock(&dev->msg_lock);
> +	list_add_tail(&msg->list, head);
> +	spin_unlock(&dev->msg_lock);
> +}
> +
> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> +{
> +	int ret;
> +
> +	vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> +	wake_up(&dev->waitq);
> +	wait_event(msg->waitq, msg->completed);


This is an uninterruptible wait, which means that if userspace forgets 
to process the command, we will be stuck here forever.
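
To make the failure mode concrete, here is a small single-threaded model of a bounded wait (a sketch only; in the kernel this would presumably be one of the interruptible, killable, or timeout variants of wait_event()). The `demo_*` names and the "tick" callbacks are hypothetical stand-ins for wakeups.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_msg {
	bool completed;
	int result;
};

/* Stand-in for one wakeup: in this model userspace never replies. */
static void demo_tick_silent(struct demo_msg *msg)
{
	(void)msg;
}

/* Stand-in for one wakeup: userspace replies with result 0. */
static void demo_tick_reply(struct demo_msg *msg)
{
	msg->result = 0;
	msg->completed = true;
}

/*
 * Wait for completion, but give up after max_ticks wakeups instead of
 * blocking forever. Returns the reply result, or -ETIMEDOUT.
 */
static int demo_wait_bounded(struct demo_msg *msg, unsigned int max_ticks,
			     void (*tick)(struct demo_msg *))
{
	assert(msg != NULL && tick != NULL);

	for (unsigned int i = 0; i < max_ticks && !msg->completed; i++)
		tick(msg);

	return msg->completed ? msg->result : -ETIMEDOUT;
}
```

With a bound like this, a daemon that never answers turns into an error the caller can handle rather than a task that can never be woken.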


> +	/* coupled with smp_wmb() in vduse_dev_msg_complete() */
> +	smp_rmb();


Instead of using barriers, why not simply use msg_lock here?
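
A rough userspace analogue of what I mean, using a pthread mutex as a stand-in for msg_lock (a sketch under that assumption; the `demo_*` names are hypothetical). Both sides touch `completed` and the result only while holding the lock, so the lock's acquire/release ordering replaces the explicit smp_wmb()/smp_rmb() pair.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct demo_msg {
	pthread_mutex_t lock;	/* stand-in for dev->msg_lock */
	bool completed;
	int result;
};

/* Reply path: publish the result and the completed flag under the lock. */
static void demo_msg_complete(struct demo_msg *msg, int result)
{
	pthread_mutex_lock(&msg->lock);
	msg->result = result;
	msg->completed = true;
	pthread_mutex_unlock(&msg->lock);
}

/* Waiter path: returns true and stores the result if the reply arrived. */
static bool demo_msg_fetch(struct demo_msg *msg, int *result)
{
	bool done;

	pthread_mutex_lock(&msg->lock);
	done = msg->completed;
	if (done)
		*result = msg->result;
	pthread_mutex_unlock(&msg->lock);

	return done;
}
```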


> +	ret = msg->resp.result;
> +
> +	return ret;
> +}
> +
> +static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
> +					struct vduse_dev_response *resp)
> +{
> +	vduse_dev_msg_get(msg);
> +	memcpy(&msg->resp, resp, sizeof(*resp));
> +	/* coupled with smp_rmb() in vduse_dev_msg_sync() */
> +	smp_wmb();
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
> +	u64 features;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	features = msg->resp.features;
> +	vduse_dev_msg_put(msg);
> +
> +	return features;
> +}
> +
> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
> +	int ret;
> +
> +	msg->req.size = sizeof(features);
> +	msg->req.features = features;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
> +	u8 status;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	status = msg->resp.status;
> +	vduse_dev_msg_put(msg);
> +
> +	return status;
> +}
> +
> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
> +
> +	msg->req.size = sizeof(status);
> +	msg->req.status = status;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> +					void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
> +
> +	WARN_ON(len > sizeof(msg->req.config.data));
> +
> +	msg->req.size = sizeof(struct vduse_dev_config_data);
> +	msg->req.config.offset = offset;
> +	msg->req.config.len = len;
> +	vduse_dev_msg_sync(dev, msg);
> +	memcpy(buf, msg->resp.config.data, len);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> +					const void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
> +
> +	WARN_ON(len > sizeof(msg->req.config.data));
> +
> +	msg->req.size = sizeof(struct vduse_dev_config_data);
> +	msg->req.config.offset = offset;
> +	msg->req.config.len = len;
> +	memcpy(msg->req.config.data, buf, len);
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u32 num)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
> +
> +	msg->req.size = sizeof(struct vduse_vq_num);
> +	msg->req.vq_num.index = vq->index;
> +	msg->req.vq_num.num = num;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u64 desc_addr,
> +				u64 driver_addr, u64 device_addr)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
> +	int ret;
> +
> +	msg->req.size = sizeof(struct vduse_vq_addr);
> +	msg->req.vq_addr.index = vq->index;
> +	msg->req.vq_addr.desc_addr = desc_addr;
> +	msg->req.vq_addr.driver_addr = driver_addr;
> +	msg->req.vq_addr.device_addr = device_addr;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, bool ready)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
> +
> +	msg->req.size = sizeof(struct vduse_vq_ready);
> +	msg->req.vq_ready.index = vq->index;
> +	msg->req.vq_ready.ready = ready;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +}
> +
> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> +				   struct vduse_virtqueue *vq)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
> +	bool ready;
> +
> +	msg->req.size = sizeof(struct vduse_vq_ready);
> +	msg->req.vq_ready.index = vq->index;
> +
> +	vduse_dev_msg_sync(dev, msg);
> +	ready = msg->resp.vq_ready.ready;
> +	vduse_dev_msg_put(msg);
> +
> +	return ready;
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE);
> +	int ret;
> +
> +	msg->req.size = sizeof(struct vduse_vq_state);
> +	msg->req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	state->avail_index = msg->resp.vq_state.avail_idx;
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE);
> +	int ret;
> +
> +	msg->req.size = sizeof(struct vduse_vq_state);
> +	msg->req.vq_state.index = vq->index;
> +	msg->req.vq_state.avail_idx = state->avail_index;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +					u64 start, u64 last)
> +{
> +	struct vduse_dev_msg *msg;
> +	int ret;
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);


This is actually an IOTLB invalidation, so let's rename the function 
and the message type.


> +	msg->req.size = sizeof(struct vduse_iova_range);
> +	msg->req.iova.start = start;
> +	msg->req.iova.last = last;
> +
> +	ret = vduse_dev_msg_sync(dev, msg);
> +	vduse_dev_msg_put(msg);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret = 0;
> +
> +	if (iov_iter_count(to) < size)
> +		return 0;
> +
> +	while (1) {
> +		msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
> +		if (msg)
> +			break;
> +
> +		if (file->f_flags & O_NONBLOCK)
> +			return -EAGAIN;
> +
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +	}
> +	ret = copy_to_iter(&msg->req, size, to);
> +	if (ret != size) {
> +		vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> +		return -EFAULT;
> +	}
> +	vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
> +	if (!msg)
> +		return -EINVAL;
> +
> +	vduse_dev_msg_complete(msg, &resp);


So we have multiple types of requests/responses. Would it be better to 
introduce a queue-based admin interface rather than ioctl?


> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
> +				 u64 start, u64 last,
> +				 u64 addr, unsigned int perm,
> +				 struct file *file, u64 offset)
> +{
> +	struct vhost_iotlb_file *iotlb_file;
> +	int ret;
> +
> +	iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_ATOMIC);
> +	if (!iotlb_file)
> +		return -ENOMEM;
> +
> +	iotlb_file->file = get_file(file);
> +	iotlb_file->offset = offset;
> +
> +	spin_lock(&dev->iommu_lock);
> +	ret = vhost_iotlb_add_range(dev->iommu, start, last,
> +					addr, perm, iotlb_file);
> +	spin_unlock(&dev->iommu_lock);
> +	if (ret) {
> +		fput(iotlb_file->file);
> +		kfree(iotlb_file);
> +		return ret;
> +	}
> +	return 0;
> +}
> +
> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
> +{
> +	struct vhost_iotlb_file *iotlb_file;
> +	struct vhost_iotlb_map *map;
> +
> +	spin_lock(&dev->iommu_lock);
> +	while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
> +		iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> +		fput(iotlb_file->file);
> +		kfree(iotlb_file);
> +		vhost_iotlb_map_free(dev->iommu, map);
> +	}
> +	spin_unlock(&dev->iommu_lock);
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	atomic_set(&dev->bounce_map, 0);
> +	vduse_iotlb_del_range(dev, 0ULL, 0ULL - 1);
> +	vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);


ULLONG_MAX please.
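
For reference, the two spellings have the same value because unsigned arithmetic wraps; the named constant just states the intent directly. A tiny userspace sketch (the `DEMO_*` names are hypothetical):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical names for the bounds of the whole IOVA space. */
#define DEMO_IOVA_FIRST	0ULL
#define DEMO_IOVA_LAST	ULLONG_MAX

/* Unsigned wraparound makes the two spellings identical in value. */
static_assert(0ULL - 1 == ULLONG_MAX, "0ULL - 1 must equal ULLONG_MAX");

/* True when [start, last] covers the entire address space. */
static bool demo_is_full_range(unsigned long long start,
			       unsigned long long last)
{
	return start == DEMO_IOVA_FIRST && last == DEMO_IOVA_LAST;
}
```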


> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->ready = false;
> +		vq->cb = NULL;
> +		vq->private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_addr(dev, vq, desc_area,
> +					driver_area, device_area);
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_vq_kick(vq);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->cb = cb->callback;
> +	vq->private = cb->private;
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_num(dev, vq, num);
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_ready(dev, vq, ready);
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = vduse_dev_get_vq_ready(dev, vq);
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_state(dev, vq, state);
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> +
> +	return (vduse_dev_get_features(dev) | fixed);
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_set_features(dev, features);
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	/* We don't support config interrupt */


If it's not hard, let's add this. Otherwise we need a per-device feature 
blacklist to filter out all features that depend on the config interrupt.
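
A minimal sketch of what such a blacklist could look like, written as standalone C so it can be tried in userspace. The table content is only illustrative: I am assuming the net device's link-status feature (bit 16, VIRTIO_NET_F_STATUS) as one example of a feature that relies on the config interrupt, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_VIRTIO_ID_NET		1	/* same value as VIRTIO_ID_NET */
#define DEMO_VIRTIO_NET_F_STATUS	16	/* same value as VIRTIO_NET_F_STATUS */

/* Per-device mask of features that need a config interrupt (illustrative). */
static uint64_t demo_feature_blacklist(uint32_t device_id)
{
	switch (device_id) {
	case DEMO_VIRTIO_ID_NET:
		return 1ULL << DEMO_VIRTIO_NET_F_STATUS;
	default:
		return 0;
	}
}

/* Drop blacklisted bits from the features reported by userspace. */
static uint64_t demo_filter_features(uint32_t device_id, uint64_t features)
{
	return features & ~demo_feature_blacklist(device_id);
}
```

The filter would sit where the features from userspace are OR-ed with the fixed bits, so the driver never negotiates something the device cannot signal.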


> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_status(dev);
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	if (status == 0)
> +		vduse_dev_reset(dev);
> +	else
> +		vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);


Any reason for such IOTLB invalidation here?


> +
> +	vduse_dev_set_status(dev, status);
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +			     void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_get_config(dev, offset, buf, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_set_config(dev, offset, buf, len);
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vhost_iotlb_map *map;
> +	struct vhost_iotlb_file *iotlb_file;
> +	u64 start = 0ULL, last = 0ULL - 1;
> +	int ret = 0;
> +
> +	vduse_iotlb_del_range(dev, start, last);
> +
> +	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> +		map = vhost_iotlb_itree_next(map, start, last)) {
> +		if (!map->opaque)
> +			continue;


What happens if we simply accept a NULL opaque here?


> +
> +		iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> +		ret = vduse_iotlb_add_range(dev, map->start, map->last,
> +					    map->addr, map->perm,
> +					    iotlb_file->file,
> +					    iotlb_file->offset);
> +		if (ret)
> +			break;
> +	}
> +	vduse_dev_update_iotlb(dev, start, last);
> +
> +	return ret;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	WARN_ON(!list_empty(&dev->send_list));
> +	WARN_ON(!list_empty(&dev->recv_list));
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +					unsigned long offset, size_t size,
> +					enum dma_data_direction dir,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
> +		vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
> +				      0, VDUSE_ACCESS_RW,


Is it safe to use VDUSE_ACCESS_RW here, considering that we might have 
device read-only mappings?
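
As an illustration of the distinction, here is a userspace sketch that derives the permission from the DMA direction. The enum mirrors the values of the kernel's enum dma_data_direction, and the permission bits are assumed to follow the usual RO/WO/RW pattern; all `DEMO_*` names are hypothetical. Whether this can be applied to a single mapping that covers the whole bounce buffer is a separate question.

```c
#include <assert.h>

/* Mirrors enum dma_data_direction; redefined to keep the sketch standalone. */
enum demo_dma_dir {
	DEMO_DMA_BIDIRECTIONAL = 0,
	DEMO_DMA_TO_DEVICE = 1,
	DEMO_DMA_FROM_DEVICE = 2,
};

/* Hypothetical permission bits, assumed to follow the RO/WO/RW pattern. */
#define DEMO_ACCESS_RO	0x1
#define DEMO_ACCESS_WO	0x2
#define DEMO_ACCESS_RW	0x3

static unsigned int demo_dir_to_perm(enum demo_dma_dir dir)
{
	switch (dir) {
	case DEMO_DMA_TO_DEVICE:
		return DEMO_ACCESS_RO;	/* device only reads the buffer */
	case DEMO_DMA_FROM_DEVICE:
		return DEMO_ACCESS_WO;	/* device only writes the buffer */
	default:
		return DEMO_ACCESS_RW;	/* bidirectional or unknown */
	}
}
```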


> +				      vduse_domain_file(domain),
> +				      vduse_domain_get_offset(domain, 0))) {
> +		atomic_set(&vdev->bounce_map, 0);
> +		return DMA_MAPPING_ERROR;
> +	}
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
> +				  iova, VDUSE_ACCESS_RW,
> +				  vduse_domain_file(domain),
> +				  vduse_domain_get_offset(domain, iova))) {
> +		vduse_domain_free_coherent(domain, size, addr, iova, attrs);
> +		return NULL;
> +	}
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long start = (unsigned long)dma_addr;
> +	unsigned long last = start + size - 1;
> +
> +	vduse_iotlb_del_range(vdev, start, last);
> +	vduse_dev_update_iotlb(vdev, start, last);
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	mutex_lock(&dev->lock);
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vhost_iotlb_file *iotlb_file;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		spin_lock(&dev->iommu_lock);
> +		map = vhost_iotlb_itree_first(dev->iommu, entry.start,
> +					      entry.last);
> +		if (map) {
> +			iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> +			f = get_file(iotlb_file->file);
> +			entry.offset = iotlb_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&dev->iommu_lock);
> +		if (!f) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			ret = -EFAULT;
> +			break;
> +		}
> +		ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
> +		if (ret < 0) {
> +			fput(f);
> +			break;
> +		}
> +		fd_install(ret, f);
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_IRQFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_virqfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	}
> +	mutex_unlock(&dev->lock);
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +
> +	vduse_kickfd_release(dev);
> +	vduse_virqfd_release(dev);
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	dev->iommu = vhost_iotlb_alloc(2048, 0);


Is 2048 sufficient here?


> +	if (!dev->iommu) {
> +		kfree(dev);
> +		return NULL;
> +	}
> +
> +	mutex_init(&dev->lock);
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	atomic64_set(&dev->msg_unique, 0);
> +	spin_lock_init(&dev->iommu_lock);
> +	atomic_set(&dev->bounce_map, 0);
> +
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	vhost_iotlb_free(dev->iommu);
> +	mutex_destroy(&dev->lock);
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(u32 id)
> +{
> +	struct vduse_dev *tmp, *dev = NULL;
> +
> +	list_for_each_entry(tmp, &vduse_devs, list) {
> +		if (tmp->id == id) {
> +			dev = tmp;
> +			break;
> +		}
> +	}
> +	return dev;
> +}
> +
> +static int vduse_destroy_dev(u32 id)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(id);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	if (dev->vdev || dev->connected)
> +		return -EBUSY;
> +
> +	list_del(&dev->list);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	vduse_dev_destroy(dev);
> +
> +	return 0;
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config)
> +{
> +	int i, fd;
> +	struct vduse_dev *dev;
> +	char name[64];
> +
> +	if (vduse_find_dev(config->id))
> +		return -EEXIST;
> +
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->id = config->id;
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->domain = vduse_domain_create(config->bounce_size);


Do we need an upper limit on bounce_size?
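
Something along these lines, as a standalone sketch. The 64 MiB cap is an arbitrary value picked for the example, not a proposal, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_BOUNCE_SIZE_MAX	(64ULL << 20)	/* assumed cap: 64 MiB */

/* Reject a zero-sized or oversized bounce buffer before allocating it. */
static int demo_check_bounce_size(uint64_t bounce_size)
{
	if (bounce_size == 0 || bounce_size > DEMO_BOUNCE_SIZE_MAX)
		return -EINVAL;

	return 0;
}
```

Since the value comes straight from userspace through the ioctl, checking it before the domain is created keeps one process from requesting an arbitrarily large allocation.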


> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	snprintf(name, sizeof(name), "[vduse-dev:%u]", config->id);
> +	fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);


Any reason for closing on exec here?


> +	if (fd < 0)
> +		goto err_fd;
> +
> +	dev->connected = true;
> +	list_add(&dev->list, &vduse_devs);
> +
> +	return fd;
> +err_fd:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	vduse_dev_destroy(dev);
> +	return fd;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, sizeof(config)))
> +			break;
> +
> +		ret = vduse_create_dev(&config);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV:
> +		ret = vduse_destroy_dev(arg);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct miscdevice vduse_misc = {
> +	.fops = &vduse_fops,
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "vduse",
> +};
> +
> +static void vduse_parent_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_parent = {
> +	.init_name = "vduse",
> +	.release = vduse_parent_release,
> +};
> +
> +static struct vdpa_parent_dev parent_dev;
> +
> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev = dev->vdev;
> +	int ret;
> +
> +	if (vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
> +				 &vduse_vdpa_config_ops,
> +				 dev->vq_num, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret)
> +		goto err;
> +
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.pdev = &parent_dev;
> +
> +	ret = _vdpa_register_device(&vdev->vdpa);
> +	if (ret)
> +		goto err;
> +
> +	dev->vdev = vdev;
> +
> +	return 0;
> +err:
> +	put_device(&vdev->vdpa.dev);
> +	return ret;
> +}
> +
> +static struct vdpa_device *vdpa_dev_add(struct vdpa_parent_dev *pdev,
> +					const char *name, u32 device_id,
> +					struct nlattr **attrs)
> +{
> +	u32 vduse_id;
> +	struct vduse_dev *dev;
> +	int ret = -EINVAL;
> +
> +	if (!attrs[VDPA_ATTR_BACKEND_ID])
> +		return ERR_PTR(-EINVAL);
> +
> +	mutex_lock(&vduse_lock);
> +	vduse_id = nla_get_u32(attrs[VDPA_ATTR_BACKEND_ID]);


Why not use the name here?

It also looks to me like it would be easier to create a char device per 
vduse device. That would make device addressing more robust than 
silently passing an ID among processes.


> +	dev = vduse_find_dev(vduse_id);
> +	if (!dev)
> +		goto unlock;
> +
> +	if (dev->device_id != device_id)
> +		goto unlock;
> +
> +	ret = vduse_dev_add_vdpa(dev, name);
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return &dev->vdev->vdpa;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_parent_dev *pdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_dev_ops vdpa_dev_parent_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_parent_dev parent_dev = {
> +	.device = &vduse_parent,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_parent_ops,
> +};
> +
> +static int vduse_parentdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_parent);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_parentdev_register(&parent_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_parent);
> +	return ret;
> +}
> +
> +static void vduse_parentdev_exit(void)
> +{
> +	vdpa_parentdev_unregister(&parent_dev);
> +	device_unregister(&vduse_parent);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +
> +	ret = misc_register(&vduse_misc);
> +	if (ret)
> +		return ret;
> +
> +	ret = -ENOMEM;
> +	vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
> +	if (!vduse_vdpa_wq)
> +		goto err_vdpa_wq;
> +
> +	ret = vduse_virqfd_init();
> +	if (ret)
> +		goto err_irqfd;
> +
> +	ret = vduse_parentdev_init();
> +	if (ret)
> +		goto err_parentdev;
> +
> +	return 0;
> +err_parentdev:
> +	vduse_virqfd_exit();
> +err_irqfd:
> +	destroy_workqueue(vduse_vdpa_wq);
> +err_vdpa_wq:
> +	misc_deregister(&vduse_misc);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	misc_deregister(&vduse_misc);
> +	destroy_workqueue(vduse_vdpa_wq);
> +	vduse_virqfd_exit();
> +	vduse_parentdev_exit();
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
> index bba8b83a94b5..a7a841e5ffc7 100644
> --- a/include/uapi/linux/vdpa.h
> +++ b/include/uapi/linux/vdpa.h
> @@ -33,6 +33,7 @@ enum vdpa_attr {
>   	VDPA_ATTR_DEV_VENDOR_ID,		/* u32 */
>   	VDPA_ATTR_DEV_MAX_VQS,			/* u32 */
>   	VDPA_ATTR_DEV_MAX_VQ_SIZE,		/* u16 */
> +	VDPA_ATTR_BACKEND_ID,			/* u32 */


As discussed, this needs more thought. But if necessary, we need a 
separate patch for this.


>   
>   	/* new attributes must be added above here */
>   	VDPA_ATTR_MAX,
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..9fb555ddcfbd
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +/* the control messages definition for read/write */
> +
> +#define VDUSE_CONFIG_DATA_LEN	256
> +
> +enum vduse_req_type {
> +	VDUSE_SET_VQ_NUM,
> +	VDUSE_SET_VQ_ADDR,
> +	VDUSE_SET_VQ_READY,
> +	VDUSE_GET_VQ_READY,
> +	VDUSE_SET_VQ_STATE,
> +	VDUSE_GET_VQ_STATE,
> +	VDUSE_SET_FEATURES,
> +	VDUSE_GET_FEATURES,
> +	VDUSE_SET_STATUS,
> +	VDUSE_GET_STATUS,
> +	VDUSE_SET_CONFIG,
> +	VDUSE_GET_CONFIG,
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_num {
> +	__u32 index;
> +	__u32 num;
> +};
> +
> +struct vduse_vq_addr {
> +	__u32 index;
> +	__u64 desc_addr;
> +	__u64 driver_addr;
> +	__u64 device_addr;
> +};
> +
> +struct vduse_vq_ready {
> +	__u32 index;
> +	__u8 ready;
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index;
> +	__u16 avail_idx;
> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset;
> +	__u32 len;
> +	__u8 data[VDUSE_CONFIG_DATA_LEN];


There is no guarantee that 256 is sufficient here.


> +};
> +
> +struct vduse_iova_range {
> +	__u64 start;
> +	__u64 last;
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 unique; /* request id */
> +	__u32 flags; /* request flags */


Seems unused in this series.


> +	__u32 size; /* the payload size */


Unused.


> +	union {
> +		struct vduse_vq_num vq_num; /* virtqueue num */
> +		struct vduse_vq_addr vq_addr; /* virtqueue address */
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */


Let's add some padding for future extensions.


> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 unique; /* corresponding request id */


Let's use request id.


> +	__s32 result; /* the result of request */


Let's use macro or enum to define the success and failure value.


> +	union {
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	__u32 id; /* vduse device id */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_num; /* the number of virtqueues */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +	__u32 fd; /* eventfd */


Any reason for not using int here?


> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> +#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x02)
> +
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
> +#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)


Better with documentation to explain those ioctls.
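
As a rough illustration of what such documentation could cover, here is a minimal userspace sketch of how a daemon might consume the result of VDUSE_IOTLB_GET_FD. It assumes the ioctl has already filled in an entry and returned a file descriptor; the ioctl call itself is omitted. The struct and the access bits mirror vduse_iotlb_entry and VDUSE_ACCESS_* from the quoted header, but every example_* name is hypothetical and not part of the series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Same bit values as VDUSE_ACCESS_RO/WO/RW in the quoted header. */
#define EXAMPLE_ACCESS_RO 0x1
#define EXAMPLE_ACCESS_WO 0x2
#define EXAMPLE_ACCESS_RW 0x3

/* Hypothetical mirror of struct vduse_iotlb_entry. */
struct example_iotlb_entry {
	uint64_t offset;	/* the mmap offset on fd */
	uint64_t start;		/* start of the IOVA range */
	uint64_t last;		/* last of the IOVA range (inclusive) */
	uint8_t perm;		/* access permission of this range */
};

/* Translate the access bits into mmap() protection flags. */
static int example_perm_to_prot(uint8_t perm)
{
	int prot = PROT_NONE;

	if (perm & EXAMPLE_ACCESS_RO)
		prot |= PROT_READ;
	if (perm & EXAMPLE_ACCESS_WO)
		prot |= PROT_WRITE;
	return prot;
}

/* Map the IOVA range described by @e; returns MAP_FAILED on error. */
static void *example_map_iova_range(int fd, const struct example_iotlb_entry *e)
{
	size_t len;

	if (e->last < e->start)
		return MAP_FAILED;
	len = (size_t)(e->last - e->start) + 1;	/* "last" is inclusive */
	return mmap(NULL, len, example_perm_to_prot(e->perm), MAP_SHARED,
		    fd, (off_t)e->offset);
}
```

The daemon would then translate an IOVA inside [start, last] to a pointer by adding (iova - start) to the returned address.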

Thanks


> +
> +#endif /* _UAPI_VDUSE_H_ */
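
To make the comments on struct vduse_dev_response concrete (use "request id", define the result values, leave padding for extensions), a revised layout might look roughly like the sketch below. All example_* names, the unsigned result field, and the 64-byte reservation are assumptions for illustration only; they are not taken from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes; the names are illustrative. */
enum example_req_result {
	EXAMPLE_REQ_RESULT_OK = 0,
	EXAMPLE_REQ_RESULT_FAILED = 1,
};

struct example_dev_response {
	uint32_t request_id;	/* id of the request being answered */
	uint32_t result;	/* one of enum example_req_result */
	union {
		uint64_t features;	/* virtio features */
		uint8_t status;		/* device status */
		uint32_t padding[16];	/* reserves 64 bytes for future payloads */
	};
};

/* Catch accidental layout changes at compile time. */
_Static_assert(sizeof(struct example_dev_response) == 72,
	       "unexpected response size");

/* Zero the whole message so that reserved bytes never carry stale data. */
static void example_response_init(struct example_dev_response *resp,
				  uint32_t request_id,
				  enum example_req_result result)
{
	memset(resp, 0, sizeof(*resp));
	resp->request_id = request_id;
	resp->result = result;
}
```

Zeroing the reserved area up front means a later kernel can give those bytes a meaning without misreading garbage from older daemons.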



* Re: [RFC v3 10/11] vduse: grab the module's references until there is no vduse device
  2021-01-19  5:07   ` [RFC v3 10/11] vduse: grab the module's references until there is no vduse device Xie Yongji
@ 2021-01-26  8:09     ` Jason Wang
  2021-01-27  8:51       ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-26  8:09 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 1:07 PM, Xie Yongji wrote:
> The module should not be unloaded if any vduse device exists.
> So increase the module's reference count when creating vduse
> device. And the reference count is kept until the device is
> destroyed.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Looks like a bug fix. If yes, let's squash this into patch 8.

Thanks


> ---
>   drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 4d21203da5b6..003aeb281bce 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -978,6 +978,7 @@ static int vduse_destroy_dev(u32 id)
>   	kfree(dev->vqs);
>   	vduse_domain_destroy(dev->domain);
>   	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
>   
>   	return 0;
>   }
> @@ -1022,6 +1023,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
>   
>   	dev->connected = true;
>   	list_add(&dev->list, &vduse_devs);
> +	__module_get(THIS_MODULE);
>   
>   	return fd;
>   err_fd:



* Re: [RFC v3 11/11] vduse: Introduce a workqueue for irq injection
  2021-01-19  5:07   ` [RFC v3 11/11] vduse: Introduce a workqueue for irq injection Xie Yongji
@ 2021-01-26  8:17     ` Jason Wang
  2021-01-27  9:00       ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-26  8:17 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 1:07 PM, Xie Yongji wrote:
> This patch introduces a dedicated workqueue for irq injection
> so that we are able to do some performance tuning for it.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


If we want to split the series like this, it might be better to:

1) implement a simple irq injection in the ioctl context in patch 8
2) add the dedicated workqueue injection in this patch

My reasoning is that:

1) the function looks more isolated for readers
2) the difference between ioctl vs workqueue should be more obvious
than system wq vs dedicated wq
3) it gives a chance to describe why the workqueue is needed in the
commit log of this patch

Thanks


> ---
>   drivers/vdpa/vdpa_user/eventfd.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> index dbffddb08908..caf7d8d68ac0 100644
> --- a/drivers/vdpa/vdpa_user/eventfd.c
> +++ b/drivers/vdpa/vdpa_user/eventfd.c
> @@ -18,6 +18,7 @@
>   #include "eventfd.h"
>   
>   static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> +static struct workqueue_struct *vduse_irq_wq;
>   
>   static void vduse_virqfd_shutdown(struct work_struct *work)
>   {
> @@ -57,7 +58,7 @@ static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
>   	__poll_t flags = key_to_poll(key);
>   
>   	if (flags & EPOLLIN)
> -		schedule_work(&virqfd->inject);
> +		queue_work(vduse_irq_wq, &virqfd->inject);
>   
>   	if (flags & EPOLLHUP) {
>   		spin_lock(&vq->irq_lock);
> @@ -165,11 +166,18 @@ int vduse_virqfd_init(void)
>   	if (!vduse_irqfd_cleanup_wq)
>   		return -ENOMEM;
>   
> +	vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq) {
> +		destroy_workqueue(vduse_irqfd_cleanup_wq);
> +		return -ENOMEM;
> +	}
> +
>   	return 0;
>   }
>   
>   void vduse_virqfd_exit(void)
>   {
> +	destroy_workqueue(vduse_irq_wq);
>   	destroy_workqueue(vduse_irqfd_cleanup_wq);
>   }
>   



* Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                       ` (2 preceding siblings ...)
  2021-01-26  8:08     ` Jason Wang
@ 2021-01-26  8:19     ` Jason Wang
  2021-01-27  8:59       ` Yongji Xie
  3 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-26  8:19 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/19 1:07 PM, Xie Yongji wrote:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
>
> In the control path, the VDUSE driver will make use of message
> mechanism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> the file descriptors referring to vDPA device's iova regions. Then
> userspace can use mmap() to access those iova regions. Besides,
> the eventfd mechanism is used to trigger interrupt callbacks and
> receive virtqueue kicks in userspace.
>
> Signed-off-by: Xie Yongji<xieyongji@bytedance.com>
> ---
>   Documentation/driver-api/vduse.rst                 |   85 ++
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |    7 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>   drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
>   drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
>   drivers/vdpa/vdpa_user/vduse.h                     |   62 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1217 ++++++++++++++++++++
>   include/uapi/linux/vdpa.h                          |    1 +
>   include/uapi/linux/vduse.h                         |  125 ++
>   13 files changed, 2267 insertions(+)
>   create mode 100644 Documentation/driver-api/vduse.rst
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h


Btw, if you could split this into three parts:

1) iova domain
2) vduse device
3) doc

It would be easier for the reviewers.

Thanks



* Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-20 11:08     ` Stefano Garzarella
@ 2021-01-27  3:33       ` Jason Wang
  2021-01-27  8:57         ` Stefano Garzarella
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-27  3:33 UTC (permalink / raw)
  To: Stefano Garzarella, Xie Yongji
  Cc: mst, stefanha, parav, bob.liu, hch, rdunlap, willy, viro, axboe,
	bcrl, corbet, virtualization, netdev, kvm, linux-aio,
	linux-fsdevel


On 2021/1/20 7:08 PM, Stefano Garzarella wrote:
> On Wed, Jan 20, 2021 at 11:46:38AM +0800, Jason Wang wrote:
>>
>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>> With VDUSE, we should be able to support all kinds of virtio devices.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>  drivers/vhost/vdpa.c | 29 +++--------------------------
>>>  1 file changed, 3 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>> index 29ed4173f04e..448be7875b6d 100644
>>> --- a/drivers/vhost/vdpa.c
>>> +++ b/drivers/vhost/vdpa.c
>>> @@ -22,6 +22,7 @@
>>>  #include <linux/nospec.h>
>>>  #include <linux/vhost.h>
>>>  #include <linux/virtio_net.h>
>>> +#include <linux/virtio_blk.h>
>>>  #include "vhost.h"
>>> @@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct 
>>> vhost_vdpa *v, u8 __user *statusp)
>>>      return 0;
>>>  }
>>> -static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
>>> -                      struct vhost_vdpa_config *c)
>>> -{
>>> -    long size = 0;
>>> -
>>> -    switch (v->virtio_id) {
>>> -    case VIRTIO_ID_NET:
>>> -        size = sizeof(struct virtio_net_config);
>>> -        break;
>>> -    }
>>> -
>>> -    if (c->len == 0)
>>> -        return -EINVAL;
>>> -
>>> -    if (c->len > size - c->off)
>>> -        return -E2BIG;
>>> -
>>> -    return 0;
>>> -}
>>
>>
>> I think we should use a separate patch for this.
>
> For the vdpa-blk simulator I had the same issues and I'm adding a 
> .get_config_size() callback to vdpa devices.
>
> Do you think that makes sense, or is it better to remove this check in
> vhost/vdpa, delegating the boundary checks to the get_config/set_config
> callbacks?


A question here: how much value would we gain from get_config_size(),
considering that we can let the vDPA parent validate the length in its
get_config()?
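
If the check is delegated to the parent, a minimal userspace sketch of it could look like the following. The helper name and the idea of passing the device's config size as a parameter are assumptions for illustration; the error codes follow the vhost_vdpa_config_validate() being removed above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 0 if [off, off + len) lies inside a
 * config space of config_size bytes, or a negative errno otherwise.
 */
static int example_config_validate(size_t config_size, uint32_t off,
				   uint32_t len)
{
	if (len == 0)
		return -EINVAL;
	/* Checking off first keeps config_size - off from wrapping around. */
	if (off > config_size || len > config_size - off)
		return -E2BIG;
	return 0;
}
```

A parent's get_config()/set_config() could call such a helper with sizeof() of its own config struct before touching the buffer.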

Thanks


>
> Thanks,
> Stefano
>



* Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-20  6:52     ` Yongji Xie
@ 2021-01-27  3:37       ` Jason Wang
  2021-01-27  9:11         ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-27  3:37 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel


On 2021/1/20 2:52 PM, Yongji Xie wrote:
> On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>> Now we have a global percpu counter to limit the recursion depth
>>> of eventfd_signal(). This can avoid deadlock or stack overflow.
>>> But in stack overflow case, it should be OK to increase the
>>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
>>> to limit the recursion depth for deadlock case. Then it could be
>>> fine to increase the global percpu counter later.
>>
>> I wonder whether or not it's worth to introduce percpu for each eventfd.
>>
>> How about simply check if eventfd_signal_count() is greater than 2?
>>
> It can't avoid deadlock in this way.


I may be missing something, but the count is there to avoid recursive
eventfd calls. So for VDUSE, what we suffer from is e.g. the interrupt
injection path:

userspace write IRQFD -> vq->cb() -> another IRQFD.

It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
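
For reference, the single-limit idea can be modelled in userspace as below. This is only a sketch: a thread-local counter stands in for the kernel's per-CPU counter, the limit of 2 is an arbitrary value, and all example_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_WAKEUP_DEPTH 2	/* arbitrary limit for this sketch */

/* Stand-in for the per-CPU recursion counter. */
static _Thread_local int example_signal_depth;

typedef void (*example_wakeup_fn)(void *ctx);

struct example_eventfd {
	unsigned long long count;
	example_wakeup_fn wakeup;	/* callback run on signal, may be NULL */
	void *wakeup_ctx;
};

/* Returns false if the nesting limit was hit and the signal was dropped. */
static bool example_signal(struct example_eventfd *efd)
{
	if (example_signal_depth >= EXAMPLE_WAKEUP_DEPTH)
		return false;

	example_signal_depth++;
	efd->count++;
	if (efd->wakeup)
		efd->wakeup(efd->wakeup_ctx);
	example_signal_depth--;
	return true;
}

/* A wakeup callback that signals another eventfd, like vq->cb() -> IRQFD. */
static void example_chain_wakeup(void *ctx)
{
	example_signal(ctx);
}
```

With a limit of 2, a chain of a -> b -> c signals a and b, and drops the third nested call.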

Thanks


> So we need a percpu counter for
> each eventfd to limit the recursion depth for deadlock cases. And
> using a global percpu counter to avoid stack overflow.
>
> Thanks,
> Yongji
>



* Re: [RFC v3 05/11] vdpa: shared virtual addressing support
  2021-01-20  7:10     ` Yongji Xie
@ 2021-01-27  3:43       ` Jason Wang
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Wang @ 2021-01-27  3:43 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel


On 2021/1/20 3:10 PM, Yongji Xie wrote:
> On Wed, Jan 20, 2021 at 1:55 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>> This patches introduces SVA (Shared Virtual Addressing)
>>> support for vDPA device. During vDPA device allocation,
>>> vDPA device driver needs to indicate whether SVA is
>>> supported by the device. Then vhost-vdpa bus driver
>>> will not pin user page and transfer userspace virtual
>>> address instead of physical address during DMA mapping.
>>>
>>> Suggested-by: Jason Wang <jasowang@redhat.com>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
>>>    drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
>>>    drivers/vdpa/vdpa.c               |  5 ++++-
>>>    drivers/vdpa/vdpa_sim/vdpa_sim.c  |  3 ++-
>>>    drivers/vhost/vdpa.c              | 35 +++++++++++++++++++++++------------
>>>    include/linux/vdpa.h              | 10 +++++++---
>>>    6 files changed, 38 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
>>> index 23474af7da40..95c4601f82f5 100644
>>> --- a/drivers/vdpa/ifcvf/ifcvf_main.c
>>> +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
>>> @@ -439,7 +439,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>>
>>>        adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
>>>                                    dev, &ifc_vdpa_ops,
>>> -                                 IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
>>> +                                 IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
>>>        if (adapter == NULL) {
>>>                IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
>>>                return -ENOMEM;
>>> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>> index 77595c81488d..05988d6907f2 100644
>>> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>> @@ -1959,7 +1959,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
>>>        max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
>>>
>>>        ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
>>> -                              2 * mlx5_vdpa_max_qps(max_vqs), NULL);
>>> +                              2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
>>>        if (IS_ERR(ndev))
>>>                return PTR_ERR(ndev);
>>>
>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>> index 32bd48baffab..50cab930b2e5 100644
>>> --- a/drivers/vdpa/vdpa.c
>>> +++ b/drivers/vdpa/vdpa.c
>>> @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
>>>     * @nvqs: number of virtqueues supported by this device
>>>     * @size: size of the parent structure that contains private data
>>>     * @name: name of the vdpa device; optional.
>>> + * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
>>>     *
>>>     * Driver should use vdpa_alloc_device() wrapper macro instead of
>>>     * using this directly.
>>> @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
>>>     */
>>>    struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>>>                                        const struct vdpa_config_ops *config,
>>> -                                     int nvqs, size_t size, const char *name)
>>> +                                     int nvqs, size_t size, const char *name,
>>> +                                     bool sva)
>>>    {
>>>        struct vdpa_device *vdev;
>>>        int err = -EINVAL;
>>> @@ -108,6 +110,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>>>        vdev->config = config;
>>>        vdev->features_valid = false;
>>>        vdev->nvqs = nvqs;
>>> +     vdev->sva = sva;
>>>
>>>        if (name)
>>>                err = dev_set_name(&vdev->dev, "%s", name);
>>> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> index 85776e4e6749..03c796873a6b 100644
>>> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> @@ -367,7 +367,8 @@ static struct vdpasim *vdpasim_create(const char *name)
>>>        else
>>>                ops = &vdpasim_net_config_ops;
>>>
>>> -     vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops, VDPASIM_VQ_NUM, name);
>>> +     vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
>>> +                             VDPASIM_VQ_NUM, name, false);
>>>        if (!vdpasim)
>>>                goto err_alloc;
>>>
>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>> index 4a241d380c40..36b6950ba37f 100644
>>> --- a/drivers/vhost/vdpa.c
>>> +++ b/drivers/vhost/vdpa.c
>>> @@ -486,21 +486,25 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>>>    static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>>>    {
>>>        struct vhost_dev *dev = &v->vdev;
>>> +     struct vdpa_device *vdpa = v->vdpa;
>>>        struct vhost_iotlb *iotlb = dev->iotlb;
>>>        struct vhost_iotlb_map *map;
>>>        struct page *page;
>>>        unsigned long pfn, pinned;
>>>
>>>        while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
>>> -             pinned = map->size >> PAGE_SHIFT;
>>> -             for (pfn = map->addr >> PAGE_SHIFT;
>>> -                  pinned > 0; pfn++, pinned--) {
>>> -                     page = pfn_to_page(pfn);
>>> -                     if (map->perm & VHOST_ACCESS_WO)
>>> -                             set_page_dirty_lock(page);
>>> -                     unpin_user_page(page);
>>> +             if (!vdpa->sva) {
>>> +                     pinned = map->size >> PAGE_SHIFT;
>>> +                     for (pfn = map->addr >> PAGE_SHIFT;
>>> +                          pinned > 0; pfn++, pinned--) {
>>> +                             page = pfn_to_page(pfn);
>>> +                             if (map->perm & VHOST_ACCESS_WO)
>>> +                                     set_page_dirty_lock(page);
>>> +                             unpin_user_page(page);
>>> +                     }
>>> +                     atomic64_sub(map->size >> PAGE_SHIFT,
>>> +                                     &dev->mm->pinned_vm);
>>>                }
>>> -             atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>>>                vhost_iotlb_map_free(iotlb, map);
>>>        }
>>>    }
>>> @@ -558,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>>>                r = iommu_map(v->domain, iova, pa, size,
>>>                              perm_to_iommu_flags(perm));
>>>        }
>>> -
>>> -     if (r)
>>> +     if (r) {
>>>                vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
>>> -     else
>>> +             return r;
>>> +     }
>>> +
>>> +     if (!vdpa->sva)
>>>                atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>>>
>>> -     return r;
>>> +     return 0;
>>>    }
>>>
>>>    static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>>> @@ -589,6 +595,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>>                                           struct vhost_iotlb_msg *msg)
>>>    {
>>>        struct vhost_dev *dev = &v->vdev;
>>> +     struct vdpa_device *vdpa = v->vdpa;
>>>        struct vhost_iotlb *iotlb = dev->iotlb;
>>>        struct page **page_list;
>>>        unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
>>> @@ -607,6 +614,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>>                                    msg->iova + msg->size - 1))
>>>                return -EEXIST;
>>>
>>> +     if (vdpa->sva)
>>> +             return vhost_vdpa_map(v, msg->iova, msg->size,
>>> +                                   msg->uaddr, msg->perm);
>>> +
>>>        /* Limit the use of memory for bookkeeping */
>>>        page_list = (struct page **) __get_free_page(GFP_KERNEL);
>>>        if (!page_list)
>>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
>>> index cb5a3d847af3..f86869651614 100644
>>> --- a/include/linux/vdpa.h
>>> +++ b/include/linux/vdpa.h
>>> @@ -44,6 +44,7 @@ struct vdpa_parent_dev;
>>>     * @config: the configuration ops for this device.
>>>     * @index: device index
>>>     * @features_valid: were features initialized? for legacy guests
>>> + * @sva: indicate whether SVA (Shared Virtual Addressing) is supported
>>
>> Rethink about this. I think we probably need a better name other than
>> "sva" since kernel already use that for shared virtual address space.
>> But actually we don't the whole virtual address space.
>>
> This flag is used to tell vhost-vdpa bus driver to transfer virtual
> addresses instead of physical addresses. So how about "use_va",
> "need_va" or "va"?


I think "use_va" or "need_va" should be fine.
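
Combined with the platform IOMMU check suggested earlier in this thread (fail if the flag is set but neither dma_map nor set_map is provided), the validation could be sketched as follows. This is a userspace model: the ops struct is a simplified stand-in for vdpa_config_ops, and the example_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the two mapping callbacks of vdpa_config_ops. */
struct example_config_ops {
	int (*dma_map)(void);
	int (*set_map)(void);
};

/*
 * A device that wants virtual addresses must translate them itself,
 * so it needs at least one of the mapping callbacks.
 */
static int example_check_use_va(const struct example_config_ops *ops,
				bool use_va)
{
	if (use_va && !(ops->dma_map || ops->set_map))
		return -EINVAL;
	return 0;
}

/* Dummy callback used to populate the ops in the usage example. */
static int example_noop_map(void)
{
	return 0;
}
```

Such a check would naturally sit in the device allocation path, before the flag is stored.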

Thanks


>
>> And I guess this can not work for the device that use platform IOMMU, so
>> we should check and fail if sva && !(dma_map || set_map).
>>
> Agree.
>
> Thanks,
> Yongji
>



* Re: [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB
  2021-01-20  7:52     ` Yongji Xie
@ 2021-01-27  3:51       ` Jason Wang
  2021-01-27  9:27         ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-27  3:51 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel


On 2021/1/20 3:52 PM, Yongji Xie wrote:
> On Wed, Jan 20, 2021 at 2:24 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>> Add an opaque pointer for vhost IOTLB to store the
>>> corresponding vma->vm_file and offset on the DMA mapping.
>>
>> Let's split the patch into two.
>>
>> 1) opaque pointer
>> 2) vma stuffs
>>
> OK.
>
>>> It will be used in VDUSE case later.
>>>
>>> Suggested-by: Jason Wang <jasowang@redhat.com>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++++---
>>>    drivers/vhost/iotlb.c            |  5 ++-
>>>    drivers/vhost/vdpa.c             | 66 +++++++++++++++++++++++++++++++++++-----
>>>    drivers/vhost/vhost.c            |  4 +--
>>>    include/linux/vdpa.h             |  3 +-
>>>    include/linux/vhost_iotlb.h      |  8 ++++-
>>>    6 files changed, 79 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> index 03c796873a6b..1ffcef67954f 100644
>>> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> @@ -279,7 +279,7 @@ static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
>>>         */
>>>        spin_lock(&vdpasim->iommu_lock);
>>>        ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
>>> -                                 pa, dir_to_perm(dir));
>>> +                                 pa, dir_to_perm(dir), NULL);
>>
>> Maybe it's better to introduce
>>
>> vhost_iotlb_add_range_ctx() which can accept the opaque (context). And
>> let vhost_iotlb_add_range() just call that.
>>
> If so, we need export both vhost_iotlb_add_range() and
> vhost_iotlb_add_range_ctx() which will be used in VDUSE driver. Is it
> a bit redundant?


Probably not, we do something similar in virtio core:

void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
                 void **ctx)
{
     struct vring_virtqueue *vq = to_vvq(_vq);

     return vq->packed_ring ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
                  virtqueue_get_buf_ctx_split(_vq, len, ctx);
}
EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);

void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
{
     return virtqueue_get_buf_ctx(_vq, len, NULL);
}
EXPORT_SYMBOL_GPL(virtqueue_get_buf);
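
Applied to the IOTLB, the same pattern could look roughly like the userspace model below. The fixed-size array replaces the kernel's interval tree and allocation, -ENOSPC replaces the allocation failure, and all example_* names are hypothetical; only the wrapper relationship between the two functions is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_IOTLB_MAX_MAPS 8

struct example_iotlb_map {
	uint64_t start;
	uint64_t last;
	uint64_t addr;
	unsigned int perm;
	void *opaque;
};

struct example_iotlb {
	struct example_iotlb_map maps[EXAMPLE_IOTLB_MAX_MAPS];
	unsigned int nmaps;
};

/* Full variant: records an opaque context with the mapping. */
static int example_iotlb_add_range_ctx(struct example_iotlb *iotlb,
				       uint64_t start, uint64_t last,
				       uint64_t addr, unsigned int perm,
				       void *opaque)
{
	struct example_iotlb_map *map;

	if (last < start)
		return -EINVAL;
	if (iotlb->nmaps == EXAMPLE_IOTLB_MAX_MAPS)
		return -ENOSPC;

	map = &iotlb->maps[iotlb->nmaps++];
	map->start = start;
	map->last = last;
	map->addr = addr;
	map->perm = perm;
	map->opaque = opaque;
	return 0;
}

/* Existing callers keep their signature and pass no context. */
static int example_iotlb_add_range(struct example_iotlb *iotlb,
				   uint64_t start, uint64_t last,
				   uint64_t addr, unsigned int perm)
{
	return example_iotlb_add_range_ctx(iotlb, start, last, addr, perm,
					   NULL);
}
```

With this split, the vdpa_sim call sites in the patch would not need the extra NULL argument.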


>
>>>        spin_unlock(&vdpasim->iommu_lock);
>>>        if (ret)
>>>                return DMA_MAPPING_ERROR;
>>> @@ -317,7 +317,7 @@ static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
>>>
>>>                ret = vhost_iotlb_add_range(iommu, (u64)pa,
>>>                                            (u64)pa + size - 1,
>>> -                                         pa, VHOST_MAP_RW);
>>> +                                         pa, VHOST_MAP_RW, NULL);
>>>                if (ret) {
>>>                        *dma_addr = DMA_MAPPING_ERROR;
>>>                        kfree(addr);
>>> @@ -625,7 +625,8 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
>>>        for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
>>>             map = vhost_iotlb_itree_next(map, start, last)) {
>>>                ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
>>> -                                         map->last, map->addr, map->perm);
>>> +                                         map->last, map->addr,
>>> +                                         map->perm, NULL);
>>>                if (ret)
>>>                        goto err;
>>>        }
>>> @@ -639,14 +640,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
>>>    }
>>>
>>>    static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
>>> -                        u64 pa, u32 perm)
>>> +                        u64 pa, u32 perm, void *opaque)
>>>    {
>>>        struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
>>>        int ret;
>>>
>>>        spin_lock(&vdpasim->iommu_lock);
>>>        ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
>>> -                                 perm);
>>> +                                 perm, NULL);
>>>        spin_unlock(&vdpasim->iommu_lock);
>>>
>>>        return ret;
>>> diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
>>> index 0fd3f87e913c..3bd5bd06cdbc 100644
>>> --- a/drivers/vhost/iotlb.c
>>> +++ b/drivers/vhost/iotlb.c
>>> @@ -42,13 +42,15 @@ EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
>>>     * @last: last of IOVA range
>>>     * @addr: the address that is mapped to @start
>>>     * @perm: access permission of this range
>>> + * @opaque: the opaque pointer for the IOTLB mapping
>>>     *
>>>     * Returns an error last is smaller than start or memory allocation
>>>     * fails
>>>     */
>>>    int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>>>                          u64 start, u64 last,
>>> -                       u64 addr, unsigned int perm)
>>> +                       u64 addr, unsigned int perm,
>>> +                       void *opaque)
>>>    {
>>>        struct vhost_iotlb_map *map;
>>>
>>> @@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>>>        map->last = last;
>>>        map->addr = addr;
>>>        map->perm = perm;
>>> +     map->opaque = opaque;
>>>
>>>        iotlb->nmaps++;
>>>        vhost_iotlb_itree_insert(map, &iotlb->root);
>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>> index 36b6950ba37f..e83e5be7cec8 100644
>>> --- a/drivers/vhost/vdpa.c
>>> +++ b/drivers/vhost/vdpa.c
>>> @@ -488,6 +488,7 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>>>        struct vhost_dev *dev = &v->vdev;
>>>        struct vdpa_device *vdpa = v->vdpa;
>>>        struct vhost_iotlb *iotlb = dev->iotlb;
>>> +     struct vhost_iotlb_file *iotlb_file;
>>>        struct vhost_iotlb_map *map;
>>>        struct page *page;
>>>        unsigned long pfn, pinned;
>>> @@ -504,6 +505,10 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>>>                        }
>>>                        atomic64_sub(map->size >> PAGE_SHIFT,
>>>                                        &dev->mm->pinned_vm);
>>> +             } else if (map->opaque) {
>>> +                     iotlb_file = (struct vhost_iotlb_file *)map->opaque;
>>> +                     fput(iotlb_file->file);
>>> +                     kfree(iotlb_file);
>>>                }
>>>                vhost_iotlb_map_free(iotlb, map);
>>>        }
>>> @@ -540,8 +545,8 @@ static int perm_to_iommu_flags(u32 perm)
>>>        return flags | IOMMU_CACHE;
>>>    }
>>>
>>> -static int vhost_vdpa_map(struct vhost_vdpa *v,
>>> -                       u64 iova, u64 size, u64 pa, u32 perm)
>>> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
>>> +                       u64 size, u64 pa, u32 perm, void *opaque)
>>>    {
>>>        struct vhost_dev *dev = &v->vdev;
>>>        struct vdpa_device *vdpa = v->vdpa;
>>> @@ -549,12 +554,12 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>>>        int r = 0;
>>>
>>>        r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
>>> -                               pa, perm);
>>> +                               pa, perm, opaque);
>>>        if (r)
>>>                return r;
>>>
>>>        if (ops->dma_map) {
>>> -             r = ops->dma_map(vdpa, iova, size, pa, perm);
>>> +             r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
>>>        } else if (ops->set_map) {
>>>                if (!v->in_batch)
>>>                        r = ops->set_map(vdpa, dev->iotlb);
>>> @@ -591,6 +596,51 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>>>        }
>>>    }
>>>
>>> +static int vhost_vdpa_sva_map(struct vhost_vdpa *v,
>>> +                           u64 iova, u64 size, u64 uaddr, u32 perm)
>>> +{
>>> +     u64 offset, map_size, map_iova = iova;
>>> +     struct vhost_iotlb_file *iotlb_file;
>>> +     struct vm_area_struct *vma;
>>> +     int ret;
>>
>> Lacking mmap_read_lock().
>>
> Good catch! Will fix it.
>
>>> +
>>> +     while (size) {
>>> +             vma = find_vma(current->mm, uaddr);
>>> +             if (!vma) {
>>> +                     ret = -EINVAL;
>>> +                     goto err;
>>> +             }
>>> +             map_size = min(size, vma->vm_end - uaddr);
>>> +             offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
>>> +             iotlb_file = NULL;
>>> +             if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
>>
>> I wonder if we need a stricter check here. When developing vhost-vdpa,
>> I tried hard to make sure the map can only work for user pages.
>>
>> So the question is: do we need to exclude MMIO areas, or only allow shmem
>> to work here?
>>
> Do you mean we need to check VM_MIXEDMAP | VM_PFNMAP here?


I meant: do we need to allow VM_IO here? (We don't allow such a case in
vhost-vdpa now.)
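
To make the question concrete, here is a minimal userspace sketch of the
kind of check being discussed. It is an illustration only, not kernel code
and not part of the patch: the SK_* bits and struct sk_vma are hypothetical
stand-ins for the kernel's VM_* flags and struct vm_area_struct, and the
exact set of flags to reject is still the open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bits for this sketch only; the real VM_* values differ. */
#define SK_VM_SHARED   0x1u
#define SK_VM_IO       0x2u
#define SK_VM_PFNMAP   0x4u
#define SK_VM_MIXEDMAP 0x8u

/* Hypothetical, reduced view of a VMA: only what the check needs. */
struct sk_vma {
	unsigned int flags;
	const void *file;	/* non-NULL when the mapping is file-backed */
};

/*
 * Accept only shared, file-backed mappings of ordinary memory.
 * MMIO and raw-PFN style mappings are rejected before looking at the file.
 */
static bool sk_vma_file_shareable(const struct sk_vma *vma)
{
	assert(vma != NULL);

	if (vma->flags & (SK_VM_IO | SK_VM_PFNMAP | SK_VM_MIXEDMAP))
		return false;

	return vma->file != NULL && (vma->flags & SK_VM_SHARED) != 0;
}
```

With a check like this, a shared shmem mapping passes, while a shared
mapping that also carries the I/O bit is refused.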


>
> It makes sense to me.
>
>>
>>> +                     iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_KERNEL);
>>> +                     if (!iotlb_file) {
>>> +                             ret = -ENOMEM;
>>> +                             goto err;
>>> +                     }
>>> +                     iotlb_file->file = get_file(vma->vm_file);
>>> +                     iotlb_file->offset = offset;
>>> +             }
>>
>> I wonder if it's better to always allocate iotlb_file and set
>> iotlb_file->file = NULL and iotlb_file->offset = 0 when there is no
>> backing file. This would force consistent code for the vDPA parents.
>>
> Looks fine to me.
>
>> Or we can simply fail the map without a file as backend.
>>
> Actually, there will be some VMAs without vm_file during VM boot.


Yes, e.g. the BIOS or other ROMs. Vhost-user has a similar issue, and they
filter them out in QEMU.

For vhost-vDPA, considering that it can support various different backends,
we can't do that.
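
As a rough illustration of the "always hand the parent a descriptor"
idea, the following userspace sketch splits a user address range at VMA
boundaries and emits one chunk per VMA, with file == NULL for regions
that have no backing file. Everything here is hypothetical (struct
sk_region, struct sk_chunk, sk_split() and the 4 KiB page shift are made
up for the example); it models the loop in vhost_vdpa_sva_map() but is
not the patch's code. Unlike the quoted loop, the sketch also rejects
holes between regions, which is an assumption on my side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12	/* assumed 4 KiB pages for the sketch */

/* Hypothetical reduced VMA: [start, end) plus backing-file information. */
struct sk_region {
	uint64_t start, end;
	uint64_t pgoff;		/* file offset of 'start', in pages */
	const char *file;	/* NULL for memory without a backing file */
};

/* One chunk handed to the parent; file == NULL means "no file backend". */
struct sk_chunk {
	uint64_t iova, size, offset;
	const char *file;
};

/* Like find_vma(): first region (sorted by address) that ends above addr. */
static const struct sk_region *
sk_find_region(const struct sk_region *r, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < r[i].end)
			return &r[i];
	return NULL;
}

/* Split [uaddr, uaddr + size) at region boundaries; returns count or -1. */
static int sk_split(const struct sk_region *r, size_t n, uint64_t iova,
		    uint64_t uaddr, uint64_t size,
		    struct sk_chunk *out, size_t max_out)
{
	size_t k = 0;

	while (size) {
		const struct sk_region *reg = sk_find_region(r, n, uaddr);
		uint64_t len;

		/* Fail on unmapped addresses, holes, or a full output array. */
		if (!reg || uaddr < reg->start || k == max_out)
			return -1;

		len = reg->end - uaddr;
		if (len > size)
			len = size;

		out[k].iova = iova;
		out[k].size = len;
		out[k].file = reg->file;
		out[k].offset = reg->file ?
			(reg->pgoff << SK_PAGE_SHIFT) + (uaddr - reg->start) : 0;
		k++;

		size -= len;
		uaddr += len;
		iova += len;
	}
	return (int)k;
}
```

A parent driver consuming such chunks would only need one code path and
could test chunk->file to decide whether the region can be shared by fd.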


>
>>> +             ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
>>> +                                     perm, iotlb_file);
>>> +             if (ret) {
>>> +                     if (iotlb_file) {
>>> +                             fput(iotlb_file->file);
>>> +                             kfree(iotlb_file);
>>> +                     }
>>> +                     goto err;
>>> +             }
>>> +             size -= map_size;
>>> +             uaddr += map_size;
>>> +             map_iova += map_size;
>>> +     }
>>> +     return 0;
>>> +err:
>>> +     vhost_vdpa_unmap(v, iova, map_iova - iova);
>>> +     return ret;
>>> +}
>>> +
>>>    static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>>                                           struct vhost_iotlb_msg *msg)
>>>    {
>>> @@ -615,8 +665,8 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>>                return -EEXIST;
>>>
>>>        if (vdpa->sva)
>>> -             return vhost_vdpa_map(v, msg->iova, msg->size,
>>> -                                   msg->uaddr, msg->perm);
>>> +             return vhost_vdpa_sva_map(v, msg->iova, msg->size,
>>> +                                       msg->uaddr, msg->perm);
>>
>> So I think it's better to squash vhost_vdpa_sva_map() and related changes
>> into the previous patch.
>>
> OK, so the order of the patches is:
> 1) opaque pointer
> 2) va support + vma stuffs?
>
> Is it OK?


Fine with me.


>
>>>        /* Limit the use of memory for bookkeeping */
>>>        page_list = (struct page **) __get_free_page(GFP_KERNEL);
>>> @@ -671,7 +721,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>>                                csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>>>                                ret = vhost_vdpa_map(v, iova, csize,
>>>                                                     map_pfn << PAGE_SHIFT,
>>> -                                                  msg->perm);
>>> +                                                  msg->perm, NULL);
>>>                                if (ret) {
>>>                                        /*
>>>                                         * Unpin the pages that are left unmapped
>>> @@ -700,7 +750,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>>
>>>        /* Pin the rest chunk */
>>>        ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
>>> -                          map_pfn << PAGE_SHIFT, msg->perm);
>>> +                          map_pfn << PAGE_SHIFT, msg->perm, NULL);
>>>    out:
>>>        if (ret) {
>>>                if (nchunks) {
>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>> index a262e12c6dc2..120dd5b3c119 100644
>>> --- a/drivers/vhost/vhost.c
>>> +++ b/drivers/vhost/vhost.c
>>> @@ -1104,7 +1104,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>>>                vhost_vq_meta_reset(dev);
>>>                if (vhost_iotlb_add_range(dev->iotlb, msg->iova,
>>>                                          msg->iova + msg->size - 1,
>>> -                                       msg->uaddr, msg->perm)) {
>>> +                                       msg->uaddr, msg->perm, NULL)) {
>>>                        ret = -ENOMEM;
>>>                        break;
>>>                }
>>> @@ -1450,7 +1450,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>>>                                          region->guest_phys_addr +
>>>                                          region->memory_size - 1,
>>>                                          region->userspace_addr,
>>> -                                       VHOST_MAP_RW))
>>> +                                       VHOST_MAP_RW, NULL))
>>>                        goto err;
>>>        }
>>>
>>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
>>> index f86869651614..b264c627e94b 100644
>>> --- a/include/linux/vdpa.h
>>> +++ b/include/linux/vdpa.h
>>> @@ -189,6 +189,7 @@ struct vdpa_iova_range {
>>>     *                          @size: size of the area
>>>     *                          @pa: physical address for the map
>>>     *                          @perm: device access permission (VHOST_MAP_XX)
>>> + *                           @opaque: the opaque pointer for the mapping
>>>     *                          Returns integer: success (0) or error (< 0)
>>>     * @dma_unmap:                      Unmap an area of IOVA (optional but
>>>     *                          must be implemented with dma_map)
>>> @@ -243,7 +244,7 @@ struct vdpa_config_ops {
>>>        /* DMA ops */
>>>        int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
>>>        int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
>>> -                    u64 pa, u32 perm);
>>> +                    u64 pa, u32 perm, void *opaque);
>>>        int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
>>>
>>>        /* Free device resources */
>>> diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
>>> index 6b09b786a762..66a50c11c8ca 100644
>>> --- a/include/linux/vhost_iotlb.h
>>> +++ b/include/linux/vhost_iotlb.h
>>> @@ -4,6 +4,11 @@
>>>
>>>    #include <linux/interval_tree_generic.h>
>>>
>>> +struct vhost_iotlb_file {
>>> +     struct file *file;
>>> +     u64 offset;
>>> +};
>>
>> I think we'd better either:
>>
>> 1) simply use struct vhost_iotlb_file * instead of void *opaque for
>> vhost_iotlb_map
>>
>> or
>>
>> 2) rename and move the vhost_iotlb_file to vdpa
>>
>> 2) looks better since we want to let the vhost iotlb carry any type of
>> context (opaque pointer)
>>
> I agree. So we need to introduce struct vdpa_iotlb_file in
> include/linux/vdpa.h, right?


Yes.


>
>> And if we do this, the modification of vdpa_config_ops deserves a
>> separate patch.
>>
> Sorry, I didn't get you here. What do you mean by the modification of
> vdpa_config_ops? Do you mean adding an opaque pointer to ops.dma_map?


Yes.

Thanks


>
> Thanks,
> Yongji
>



* Re: Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-26  8:08     ` Jason Wang
@ 2021-01-27  8:50       ` Yongji Xie
  2021-01-28  4:27         ` Jason Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-27  8:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Tue, Jan 26, 2021 at 4:09 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 1:07 PM, Xie Yongji wrote:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > Both control path and data path of vDPA devices will be able to
> > be handled in userspace.
> >
> > In the control path, the VDUSE driver will make use of message
> > mechanism to forward the config operation from vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive/reply
> > those control messages.
> >
> > In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> > the file descriptors referring to vDPA device's iova regions. Then
> > userspace can use mmap() to access those iova regions. Besides,
> > the eventfd mechanism is used to trigger interrupt callbacks and
> > receive virtqueue kicks in userspace.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/driver-api/vduse.rst                 |   85 ++
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >   drivers/vdpa/Kconfig                               |    7 +
> >   drivers/vdpa/Makefile                              |    1 +
> >   drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >   drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
> >   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
> >   drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
> >   drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
> >   drivers/vdpa/vdpa_user/vduse.h                     |   62 +
> >   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1217 ++++++++++++++++++++
> >   include/uapi/linux/vdpa.h                          |    1 +
> >   include/uapi/linux/vduse.h                         |  125 ++
> >   13 files changed, 2267 insertions(+)
> >   create mode 100644 Documentation/driver-api/vduse.rst
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
> >   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> > new file mode 100644
> > index 000000000000..9418a7f6646b
> > --- /dev/null
> > +++ b/Documentation/driver-api/vduse.rst
> > @@ -0,0 +1,85 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace.
> > +
> > +How VDUSE works
> > +---------------
> > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> > +to the new resources will be returned, which can be used to implement the
> > +userspace vDPA device's control path and data path.
> > +
> > +To implement control path, the read/write operations to the file descriptor
> > +will be used to receive/reply the control messages from/to VDUSE driver.
>
>
> It's better to document the protocol here, e.g. the identifier details.
>

I have documented those details in include/uapi/linux/vduse.h. Is that
OK? Or should I add something like "Please see include/uapi/linux/vduse.h
for details."?

>
> > +Those control messages are mostly based on the vdpa_config_ops which defines
> > +a unified interface to control different types of vDPA device.
> > +
> > +The following types of messages are provided by the VDUSE framework now:
> > +
> > +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
>
>
> "Set the vring address of a virtqueue" might be better here.
>

OK.

>
> > +
> > +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> > +
> > +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> > +
> > +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> > +
> > +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
> > +
> > +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue
>
>
> It's better not to mention layout-specific details here (last_avail_idx).
> Consider that we should support packed virtqueues in the future.
>

I see.

>
> > +
> > +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> > +
> > +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> > +
> > +- VDUSE_SET_STATUS: Set the device status
> > +
> > +- VDUSE_GET_STATUS: Get the device status
> > +
> > +- VDUSE_SET_CONFIG: Write to device specific configuration space
> > +
> > +- VDUSE_GET_CONFIG: Read from device specific configuration space
> > +
> > +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> > +
> > +Please see include/linux/vdpa.h for details.
> > +
> > +In the data path, vDPA device's iova regions will be mapped into userspace with
> > +the help of VDUSE_IOTLB_GET_FD ioctl on the userspace vDPA device fd:
> > +
> > +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> > +  access this iova region by passing the fd to mmap(2).
> > +
> > +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> > +receive virtqueue kicks in userspace. The following ioctls on the userspace
> > +vDPA device fd are provided to support that:
> > +
> > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> > +  by VDUSE driver to notify userspace to consume the vring.
> > +
> > +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> > +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
> > +
> > +MMU-based IOMMU Driver
> > +----------------------
> > +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> > +driver to support mapping the kernel dma buffer into the userspace iova
> > +region dynamically.
> > +
> > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > +so that the userspace process is able to use its virtual address to access
> > +the dma buffer in kernel.
> > +
> > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > +prevent userspace accessing the original buffer directly which may contain other
> > +kernel data. During the mapping, unmapping, the driver will copy the data from
> > +the original buffer to the bounce buffer and back, depending on the direction of
> > +the transfer. And the bounce-buffer addresses will be mapped into the user address
> > +space instead of the original one.
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index a4c75a28c839..71722e6f8f23 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >   '|'   00-7F  linux/media.h
> >   0x80  00-1F  linux/fb.h
> > +0x81  00-1F  linux/vduse.h
> >   0x89  00-06  arch/x86/include/asm/sockios.h
> >   0x89  0B-DF  linux/sockios.h
> >   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> > diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> > index 4be7be39be26..667354309bf4 100644
> > --- a/drivers/vdpa/Kconfig
> > +++ b/drivers/vdpa/Kconfig
> > @@ -21,6 +21,13 @@ config VDPA_SIM
> >         to RX. This device is used for testing, prototyping and
> >         development of vDPA.
> >
> > +config VDPA_USER
> > +     tristate "VDUSE (vDPA Device in Userspace) support"
> > +     depends on EVENTFD && MMU && HAS_DMA
>
>
> This needs to select VHOST_IOTLB.
>

OK.

>
> > +     help
> > +       With VDUSE it is possible to emulate a vDPA Device
> > +       in a userspace program.
> > +
> >   config IFCVF
> >       tristate "Intel IFC VF vDPA driver"
> >       depends on PCI_MSI
> > diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> > index d160e9b63a66..66e97778ad03 100644
> > --- a/drivers/vdpa/Makefile
> > +++ b/drivers/vdpa/Makefile
> > @@ -1,5 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_VDPA) += vdpa.o
> >   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> > +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >   obj-$(CONFIG_IFCVF)    += ifcvf/
> >   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> > diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> > new file mode 100644
> > index 000000000000..b7645e36992b
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +vduse-y := vduse_dev.o iova_domain.o eventfd.o
> > +
> > +obj-$(CONFIG_VDPA_USER) += vduse.o
> > diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> > new file mode 100644
> > index 000000000000..dbffddb08908
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/eventfd.c
> > @@ -0,0 +1,221 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Eventfd support for VDUSE
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/eventfd.h>
> > +#include <linux/poll.h>
> > +#include <linux/wait.h>
> > +#include <linux/slab.h>
> > +#include <linux/file.h>
> > +#include <uapi/linux/vduse.h>
> > +
> > +#include "eventfd.h"
> > +
> > +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> > +
> > +static void vduse_virqfd_shutdown(struct work_struct *work)
> > +{
> > +     u64 cnt;
> > +     struct vduse_virqfd *virqfd = container_of(work,
> > +                                     struct vduse_virqfd, shutdown);
> > +
> > +     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
> > +     flush_work(&virqfd->inject);
> > +     eventfd_ctx_put(virqfd->ctx);
> > +     kfree(virqfd);
> > +}
> > +
> > +static void vduse_virqfd_inject(struct work_struct *work)
> > +{
> > +     struct vduse_virqfd *virqfd = container_of(work,
> > +                                     struct vduse_virqfd, inject);
> > +     struct vduse_virtqueue *vq = virqfd->vq;
> > +
> > +     spin_lock_irq(&vq->irq_lock);
> > +     if (vq->ready && vq->cb)
> > +             vq->cb(vq->private);
> > +     spin_unlock_irq(&vq->irq_lock);
> > +}
> > +
> > +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
> > +{
> > +     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
> > +}
> > +
> > +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> > +                             int sync, void *key)
> > +{
> > +     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
> > +     struct vduse_virtqueue *vq = virqfd->vq;
> > +
> > +     __poll_t flags = key_to_poll(key);
> > +
> > +     if (flags & EPOLLIN)
> > +             schedule_work(&virqfd->inject);
> > +
> > +     if (flags & EPOLLHUP) {
> > +             spin_lock(&vq->irq_lock);
> > +             if (vq->virqfd == virqfd) {
> > +                     vq->virqfd = NULL;
> > +                     virqfd_deactivate(virqfd);
> > +             }
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_virqfd_ptable_queue_proc(struct file *file,
> > +                     wait_queue_head_t *wqh, poll_table *pt)
> > +{
> > +     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
> > +
> > +     add_wait_queue(wqh, &virqfd->wait);
> > +}
> > +
> > +int vduse_virqfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct vduse_virqfd *virqfd;
> > +     struct fd irqfd;
> > +     struct eventfd_ctx *ctx;
> > +     struct vduse_virtqueue *vq;
> > +     __poll_t events;
> > +     int ret;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     vq = &dev->vqs[eventfd->index];
> > +     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
> > +     if (!virqfd)
> > +             return -ENOMEM;
> > +
> > +     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
> > +     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
> > +
> > +     ret = -EBADF;
> > +     irqfd = fdget(eventfd->fd);
> > +     if (!irqfd.file)
> > +             goto err_fd;
> > +
> > +     ctx = eventfd_ctx_fileget(irqfd.file);
> > +     if (IS_ERR(ctx)) {
> > +             ret = PTR_ERR(ctx);
> > +             goto err_ctx;
> > +     }
> > +
> > +     virqfd->vq = vq;
> > +     virqfd->ctx = ctx;
> > +     spin_lock(&vq->irq_lock);
> > +     if (vq->virqfd)
> > +             virqfd_deactivate(virqfd);
> > +     vq->virqfd = virqfd;
> > +     spin_unlock(&vq->irq_lock);
> > +
> > +     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
> > +     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
> > +
> > +     events = vfs_poll(irqfd.file, &virqfd->pt);
> > +
> > +     /*
> > +      * Check if there was an event already pending on the eventfd
> > +      * before we registered and trigger it as if we didn't miss it.
> > +      */
> > +     if (events & EPOLLIN)
> > +             schedule_work(&virqfd->inject);
> > +
> > +     fdput(irqfd);
> > +
> > +     return 0;
> > +err_ctx:
> > +     fdput(irqfd);
> > +err_fd:
> > +     kfree(virqfd);
> > +     return ret;
> > +}
> > +
> > +void vduse_virqfd_release(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             if (vq->virqfd) {
> > +                     virqfd_deactivate(vq->virqfd);
> > +                     vq->virqfd = NULL;
> > +             }
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +     flush_workqueue(vduse_irqfd_cleanup_wq);
> > +}
> > +
> > +int vduse_virqfd_init(void)
> > +{
> > +     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
> > +                                             WQ_UNBOUND, 0);
> > +     if (!vduse_irqfd_cleanup_wq)
> > +             return -ENOMEM;
> > +
> > +     return 0;
> > +}
> > +
> > +void vduse_virqfd_exit(void)
> > +{
> > +     destroy_workqueue(vduse_irqfd_cleanup_wq);
> > +}
> > +
> > +void vduse_vq_kick(struct vduse_virtqueue *vq)
> > +{
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->ready && vq->kickfd)
> > +             eventfd_signal(vq->kickfd, 1);
> > +     spin_unlock(&vq->kick_lock);
> > +}
> > +
> > +int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct eventfd_ctx *ctx;
> > +     struct vduse_virtqueue *vq;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     vq = &dev->vqs[eventfd->index];
> > +     ctx = eventfd_ctx_fdget(eventfd->fd);
> > +     if (IS_ERR(ctx))
> > +             return PTR_ERR(ctx);
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->kickfd)
> > +             eventfd_ctx_put(vq->kickfd);
> > +     vq->kickfd = ctx;
> > +     spin_unlock(&vq->kick_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +void vduse_kickfd_release(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->kick_lock);
> > +             if (vq->kickfd) {
> > +                     eventfd_ctx_put(vq->kickfd);
> > +                     vq->kickfd = NULL;
> > +             }
> > +             spin_unlock(&vq->kick_lock);
> > +     }
> > +}
> > diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
> > new file mode 100644
> > index 000000000000..14269ff27f47
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/eventfd.h
> > @@ -0,0 +1,48 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Eventfd support for VDUSE
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#ifndef _VDUSE_EVENTFD_H
> > +#define _VDUSE_EVENTFD_H
> > +
> > +#include <linux/eventfd.h>
> > +#include <linux/poll.h>
> > +#include <linux/wait.h>
> > +#include <uapi/linux/vduse.h>
> > +
> > +#include "vduse.h"
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_virqfd {
> > +     struct eventfd_ctx *ctx;
> > +     struct vduse_virtqueue *vq;
> > +     struct work_struct inject;
> > +     struct work_struct shutdown;
> > +     wait_queue_entry_t wait;
> > +     poll_table pt;
> > +};
> > +
> > +int vduse_virqfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd);
> > +
> > +void vduse_virqfd_release(struct vduse_dev *dev);
> > +
> > +int vduse_virqfd_init(void);
> > +
> > +void vduse_virqfd_exit(void);
> > +
> > +void vduse_vq_kick(struct vduse_virtqueue *vq);
> > +
> > +int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd);
> > +
> > +void vduse_kickfd_release(struct vduse_dev *dev);
> > +
> > +#endif /* _VDUSE_EVENTFD_H */
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > new file mode 100644
> > index 000000000000..cdfef8e9f9d6
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > @@ -0,0 +1,426 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * MMU-based IOMMU implementation
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/slab.h>
> > +#include <linux/file.h>
> > +#include <linux/anon_inodes.h>
> > +
> > +#include "iova_domain.h"
> > +
> > +#define IOVA_START_PFN 1
> > +#define IOVA_ALLOC_ORDER 12
> > +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>
>
> Can this work for all archs (e.g. why not use PAGE_SIZE)?
>

It can work for all archs. Using IOVA_ALLOC_SIZE might save some space
in some cases/archs (e.g. PAGE_SIZE = 64K) when we have lots of
small-size I/Os.
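
To put rough numbers on that, here is a small standalone sketch of the
rounding arithmetic. The helper names (sk_round_up, sk_iova_space) are
made up for illustration and do not appear in the patch; the sketch only
shows how much IOVA range is consumed when every mapping is rounded up
to the allocation granule.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to a multiple of granule (granule must be a power of two). */
static size_t sk_round_up(size_t size, size_t granule)
{
	assert(granule != 0 && (granule & (granule - 1)) == 0);
	return (size + granule - 1) & ~(granule - 1);
}

/* IOVA range consumed by 'count' mappings of 'io_size' bytes each. */
static size_t sk_iova_space(size_t io_size, size_t count, size_t granule)
{
	return sk_round_up(io_size, granule) * count;
}
```

For example, 1000 outstanding 512-byte I/Os consume 1000 * 4096 bytes of
IOVA range with a 4K granule, but 1000 * 65536 bytes if the granule were
a 64K PAGE_SIZE, i.e. 16 times more.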

>
> > +
> > +#define CONSISTENT_DMA_SIZE (1024 * 1024 * 1024)
> > +
> > +static inline struct page *
> > +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
> > +                             unsigned long iova)
> > +{
> > +     unsigned long index = iova >> PAGE_SHIFT;
> > +
> > +     return domain->bounce_pages[index];
> > +}
> > +
> > +static inline void
> > +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, struct page *page)
> > +{
> > +     unsigned long index = iova >> PAGE_SHIFT;
> > +
> > +     domain->bounce_pages[index] = page;
> > +}
> > +
> > +static struct vduse_iova_map *
> > +vduse_domain_alloc_iova_map(struct vduse_iova_domain *domain,
> > +                     unsigned long iova, unsigned long orig,
> > +                     size_t size, enum dma_data_direction dir)
> > +{
> > +     struct vduse_iova_map *map;
> > +
> > +     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
> > +     if (!map)
> > +             return NULL;
> > +
> > +     map->iova.start = iova;
> > +     map->iova.last = iova + size - 1;
> > +     map->orig = orig;
> > +     map->size = size;
> > +     map->dir = dir;
> > +
> > +     return map;
> > +}
> > +
> > +static struct page *
> > +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain,
> > +                             unsigned long iova)
> > +{
> > +     unsigned long start = iova & PAGE_MASK;
> > +     unsigned long last = start + PAGE_SIZE - 1;
> > +     struct vduse_iova_map *map;
> > +     struct interval_tree_node *node;
> > +     struct page *page = NULL;
> > +
> > +     spin_lock(&domain->map_lock);
> > +     node = interval_tree_iter_first(&domain->mappings, start, last);
> > +     if (!node)
> > +             goto out;
> > +
> > +     map = container_of(node, struct vduse_iova_map, iova);
> > +     page = virt_to_page(map->orig + iova - map->iova.start);
> > +     get_page(page);
> > +out:
> > +     spin_unlock(&domain->map_lock);
> > +
> > +     return page;
> > +}
> > +
> > +static struct page *
> > +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain,
> > +                             unsigned long iova)
> > +{
> > +     unsigned long start = iova & PAGE_MASK;
> > +     unsigned long last = start + PAGE_SIZE - 1;
> > +     struct vduse_iova_map *map;
> > +     struct interval_tree_node *node;
> > +     struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
> > +
> > +     if (!new_page)
> > +             return NULL;
> > +
> > +     spin_lock(&domain->map_lock);
> > +     node = interval_tree_iter_first(&domain->mappings, start, last);
> > +     if (!node) {
> > +             __free_page(new_page);
> > +             goto out;
> > +     }
> > +     page = vduse_domain_get_bounce_page(domain, iova);
> > +     if (page) {
> > +             get_page(page);
> > +             __free_page(new_page);
>
>
> Let's delay the allocation of new_page until it is really required.

If so, we need to allocate the page in atomic context.

>
> > +             goto out;
> > +     }
> > +     vduse_domain_set_bounce_page(domain, iova, new_page);
> > +     get_page(new_page);
> > +     page = new_page;
> > +
> > +     while (node) {
>
>
> I may be missing something, but in which case do we need the loop here?
>

When IOVA_ALLOC_SIZE != PAGE_SIZE, more than one mapping can overlap the
same page.
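
A standalone sketch of the per-mapping offset computation may help here.
It mirrors the src_offset/dst_offset/copy_len logic of the quoted loop in
plain userspace C; struct sk_map, struct sk_copy and sk_copy_for_page()
are hypothetical names for the example, and the caller is assumed to pass
only mappings that overlap the page (the interval tree lookup guarantees
that in the patch).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced mapping: IOVA start and length in bytes. */
struct sk_map {
	unsigned long iova_start;
	size_t size;
};

/* Where to copy from (offset into the mapping) and to (offset into page). */
struct sk_copy {
	size_t src_offset, dst_offset, len;
};

/*
 * Compute the copy window for one mapping that overlaps the page
 * [page_start, page_start + page_size).
 */
static struct sk_copy sk_copy_for_page(const struct sk_map *map,
				       unsigned long page_start,
				       size_t page_size)
{
	struct sk_copy c = { 0, 0, 0 };
	size_t src_left, dst_left;

	assert(map != NULL);

	if (page_start > map->iova_start)
		c.src_offset = page_start - map->iova_start; /* map began earlier */
	else
		c.dst_offset = map->iova_start - page_start; /* map begins inside */

	src_left = map->size - c.src_offset;
	dst_left = page_size - c.dst_offset;
	c.len = src_left < dst_left ? src_left : dst_left;

	return c;
}
```

With a 64K page at IOVA 0x10000, a 4K mapping at 0x11000 and a 2K mapping
at 0x13000 both land in the same bounce page at different dst_offset
values, so the loop has to visit each of them.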

>
> > +             unsigned int src_offset = 0, dst_offset = 0;
> > +             void *src, *dst;
> > +             size_t copy_len;
> > +
> > +             map = container_of(node, struct vduse_iova_map, iova);
> > +             node = interval_tree_iter_next(node, start, last);
> > +             if (map->dir == DMA_FROM_DEVICE)
> > +                     continue;
> > +
> > +             if (start > map->iova.start)
> > +                     src_offset = start - map->iova.start;
> > +             else
> > +                     dst_offset = map->iova.start - start;
> > +
> > +             src = (void *)(map->orig + src_offset);
> > +             dst = page_address(page) + dst_offset;
> > +             copy_len = min_t(size_t, map->size - src_offset,
> > +                             PAGE_SIZE - dst_offset);
> > +             memcpy(dst, src, copy_len);
> > +     }
> > +out:
> > +     spin_unlock(&domain->map_lock);
> > +
> > +     return page;
> > +}
> > +
> > +static void
> > +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, size_t size)
> > +{
> > +     struct page *page;
> > +     struct interval_tree_node *node;
> > +     unsigned long last = iova + size - 1;
> > +
> > +     spin_lock(&domain->map_lock);
> > +     node = interval_tree_iter_first(&domain->mappings, iova, last);
> > +     if (WARN_ON(node))
> > +             goto out;
> > +
> > +     while (size > 0) {
> > +             page = vduse_domain_get_bounce_page(domain, iova);
> > +             if (page) {
> > +                     vduse_domain_set_bounce_page(domain, iova, NULL);
> > +                     __free_page(page);
> > +             }
> > +             size -= PAGE_SIZE;
> > +             iova += PAGE_SIZE;
> > +     }
> > +out:
> > +     spin_unlock(&domain->map_lock);
> > +}
> > +
> > +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> > +                             unsigned long iova, unsigned long orig,
> > +                             size_t size, enum dma_data_direction dir)
> > +{
> > +     unsigned int offset = offset_in_page(iova);
> > +
> > +     while (size) {
> > +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
> > +             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
> > +             void *addr;
> > +
> > +             WARN_ON(!p && dir == DMA_FROM_DEVICE);
> > +
> > +             if (p) {
> > +                     addr = page_address(p) + offset;
> > +                     if (dir == DMA_TO_DEVICE)
> > +                             memcpy(addr, (void *)orig, copy_len);
> > +                     else if (dir == DMA_FROM_DEVICE)
> > +                             memcpy((void *)orig, addr, copy_len);
> > +             }
> > +
> > +             size -= copy_len;
> > +             orig += copy_len;
> > +             iova += copy_len;
> > +             offset = 0;
> > +     }
> > +}
> > +
> > +static unsigned long vduse_domain_alloc_iova(struct iova_domain *iovad,
> > +                             unsigned long size, unsigned long limit)
> > +{
> > +     unsigned long shift = iova_shift(iovad);
> > +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> > +     unsigned long iova_pfn;
> > +
> > +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> > +             iova_len = roundup_pow_of_two(iova_len);
> > +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> > +
> > +     return iova_pfn << shift;
> > +}
> > +
> > +static void vduse_domain_free_iova(struct iova_domain *iovad,
> > +                             unsigned long iova, size_t size)
> > +{
> > +     unsigned long shift = iova_shift(iovad);
> > +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> > +
> > +     free_iova_fast(iovad, iova >> shift, iova_len);
> > +}
> > +
> > +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> > +                             struct page *page, unsigned long offset,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->stream_iovad;
> > +     unsigned long limit = domain->bounce_size - 1;
> > +     unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
> > +     unsigned long orig = (unsigned long)page_address(page) + offset;
> > +     struct vduse_iova_map *map;
> > +
> > +     if (!iova)
> > +             return DMA_MAPPING_ERROR;
> > +
> > +     map = vduse_domain_alloc_iova_map(domain, iova, orig, size, dir);
> > +     if (!map) {
> > +             vduse_domain_free_iova(iovad, iova, size);
> > +             return DMA_MAPPING_ERROR;
> > +     }
> > +
> > +     spin_lock(&domain->map_lock);
> > +     interval_tree_insert(&map->iova, &domain->mappings);
> > +     spin_unlock(&domain->map_lock);
> > +
> > +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> > +             vduse_domain_bounce(domain, iova, orig, size, DMA_TO_DEVICE);
> > +
> > +     return (dma_addr_t)iova;
> > +}
> > +
> > +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> > +                     dma_addr_t dma_addr, size_t size,
> > +                     enum dma_data_direction dir, unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->stream_iovad;
> > +     unsigned long iova = (unsigned long)dma_addr;
> > +     struct interval_tree_node *node;
> > +     struct vduse_iova_map *map;
> > +
> > +     spin_lock(&domain->map_lock);
> > +     node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
> > +     if (WARN_ON(!node)) {
> > +             spin_unlock(&domain->map_lock);
> > +             return;
> > +     }
> > +     interval_tree_remove(node, &domain->mappings);
> > +     spin_unlock(&domain->map_lock);
> > +
> > +     map = container_of(node, struct vduse_iova_map, iova);
> > +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> > +             vduse_domain_bounce(domain, iova, map->orig,
> > +                                     size, DMA_FROM_DEVICE);
> > +     vduse_domain_free_iova(iovad, iova, size);
> > +     kfree(map);
> > +}
> > +
> > +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> > +                             size_t size, dma_addr_t *dma_addr,
> > +                             gfp_t flag, unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->consistent_iovad;
> > +     unsigned long limit = domain->bounce_size + CONSISTENT_DMA_SIZE - 1;
> > +     unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
> > +     void *orig = alloc_pages_exact(size, flag);
> > +     struct vduse_iova_map *map;
> > +
> > +     if (!iova || !orig)
> > +             goto err;
> > +
> > +     map = vduse_domain_alloc_iova_map(domain, iova, (unsigned long)orig,
> > +                                     size, DMA_BIDIRECTIONAL);
> > +     if (!map)
> > +             goto err;
> > +
> > +     spin_lock(&domain->map_lock);
> > +     interval_tree_insert(&map->iova, &domain->mappings);
> > +     spin_unlock(&domain->map_lock);
> > +     *dma_addr = (dma_addr_t)iova;
> > +
> > +     return orig;
> > +err:
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     if (orig)
> > +             free_pages_exact(orig, size);
> > +     if (iova)
> > +             vduse_domain_free_iova(iovad, iova, size);
> > +
> > +     return NULL;
> > +}
> > +
> > +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> > +                             void *vaddr, dma_addr_t dma_addr,
> > +                             unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->consistent_iovad;
> > +     unsigned long iova = (unsigned long)dma_addr;
> > +     struct interval_tree_node *node;
> > +     struct vduse_iova_map *map;
> > +
> > +     spin_lock(&domain->map_lock);
> > +     node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
> > +     if (WARN_ON(!node)) {
> > +             spin_unlock(&domain->map_lock);
> > +             return;
> > +     }
> > +     interval_tree_remove(node, &domain->mappings);
> > +     spin_unlock(&domain->map_lock);
> > +
> > +     map = container_of(node, struct vduse_iova_map, iova);
> > +     vduse_domain_free_iova(iovad, iova, size);
> > +     free_pages_exact(vaddr, size);
> > +     kfree(map);
> > +}
> > +
> > +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> > +{
> > +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> > +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> > +     struct page *page;
> > +
> > +     if (!domain)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     if (iova < domain->bounce_size)
> > +             page = vduse_domain_alloc_bounce_page(domain, iova);
> > +     else
> > +             page = vduse_domain_get_mapping_page(domain, iova);
> > +
> > +     if (!page)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     vmf->page = page;
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> > +     .fault = vduse_domain_mmap_fault,
> > +};
> > +
> > +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +     struct vduse_iova_domain *domain = file->private_data;
> > +
> > +     vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
> > +     vma->vm_private_data = domain;
> > +     vma->vm_ops = &vduse_domain_mmap_ops;
> > +
> > +     return 0;
> > +}
> > +
> > +static int vduse_domain_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_iova_domain *domain = file->private_data;
> > +
> > +     vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
> > +     put_iova_domain(&domain->stream_iovad);
> > +     put_iova_domain(&domain->consistent_iovad);
> > +     vfree(domain->bounce_pages);
> > +     kfree(domain);
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations vduse_domain_fops = {
> > +     .mmap = vduse_domain_mmap,
> > +     .release = vduse_domain_release,
> > +};
>
>
> It's better to explain the reason for introducing a dedicated file for
> mmap() here.
>

To make the implementation of iova_domain independent of vduse_dev.

>
> > +
> > +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> > +{
> > +     fput(domain->file);
> > +}
> > +
> > +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size)
> > +{
> > +     struct vduse_iova_domain *domain;
> > +     struct file *file;
> > +     unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> > +
> > +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > +     if (!domain)
> > +             return NULL;
> > +
> > +     domain->bounce_size = PAGE_ALIGN(bounce_size);
> > +     domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
> > +     if (!domain->bounce_pages)
> > +             goto err_page;
> > +
> > +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> > +                             domain, O_RDWR);
> > +     if (IS_ERR(file))
> > +             goto err_file;
> > +
> > +     domain->file = file;
> > +     spin_lock_init(&domain->map_lock);
> > +     domain->mappings = RB_ROOT_CACHED;
> > +     init_iova_domain(&domain->stream_iovad,
> > +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
> > +     init_iova_domain(&domain->consistent_iovad,
> > +                     PAGE_SIZE, bounce_pfns);
> > +
> > +     return domain;
> > +err_file:
> > +     vfree(domain->bounce_pages);
> > +err_page:
> > +     kfree(domain);
> > +     return NULL;
> > +}
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > new file mode 100644
> > index 000000000000..cc61866acb56
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > @@ -0,0 +1,68 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * MMU-based IOMMU implementation
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#ifndef _VDUSE_IOVA_DOMAIN_H
> > +#define _VDUSE_IOVA_DOMAIN_H
> > +
> > +#include <linux/iova.h>
> > +#include <linux/interval_tree.h>
> > +#include <linux/dma-mapping.h>
> > +
> > +struct vduse_iova_map {
> > +     struct interval_tree_node iova;
> > +     unsigned long orig;
>
>
> Need a better name, probably "va"?
>

Fine.

>
> > +     size_t size;
> > +     enum dma_data_direction dir;
> > +};
> > +
> > +struct vduse_iova_domain {
> > +     struct iova_domain stream_iovad;
> > +     struct iova_domain consistent_iovad;
> > +     struct page **bounce_pages;
> > +     size_t bounce_size;
> > +     struct rb_root_cached mappings;
>
>
> We already have the IOTLB; is there any reason for these extra mappings here?
>

It is used to store the iova <-> vduse_iova_map (vaddr, size, dir)
mapping. We need it to know how to do DMA bouncing during DMA
unmapping.
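
A minimal userspace sketch of why the record is needed at unmap time
(not the driver code). All sketch_* names are hypothetical, the bounce
area is a flat array, and the lookup is a linear scan instead of an
interval tree. In this simplification the unmap side is only given the
iova and recovers everything else from the stored record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_dir { SKETCH_BIDIRECTIONAL, SKETCH_TO_DEVICE, SKETCH_FROM_DEVICE };

/* Hypothetical per-mapping record: iova -> (va, size, dir). */
struct sketch_iova_map {
    unsigned long iova;
    unsigned char *va;
    size_t size;
    enum sketch_dir dir;
    int in_use;
};

#define SKETCH_BOUNCE_SIZE 256
#define SKETCH_MAX_MAPS 8

/* Must be zero-initialized before first use. */
struct sketch_domain {
    unsigned char bounce[SKETCH_BOUNCE_SIZE];
    struct sketch_iova_map maps[SKETCH_MAX_MAPS];
};

/* Record the mapping and bounce the data towards the device if needed. */
static int sketch_map(struct sketch_domain *d, unsigned long iova,
                      unsigned char *va, size_t size, enum sketch_dir dir)
{
    size_t i;

    assert(d && va);
    if (iova >= SKETCH_BOUNCE_SIZE || size > SKETCH_BOUNCE_SIZE - iova)
        return -1;

    for (i = 0; i < SKETCH_MAX_MAPS; i++) {
        struct sketch_iova_map *m = &d->maps[i];

        if (m->in_use)
            continue;
        m->iova = iova;
        m->va = va;
        m->size = size;
        m->dir = dir;
        m->in_use = 1;
        if (dir != SKETCH_FROM_DEVICE)
            memcpy(d->bounce + iova, va, size);
        return 0;
    }
    return -1;
}

/* Look the record up by iova and bounce the data back if needed. */
static int sketch_unmap(struct sketch_domain *d, unsigned long iova)
{
    size_t i;

    assert(d);
    for (i = 0; i < SKETCH_MAX_MAPS; i++) {
        struct sketch_iova_map *m = &d->maps[i];

        if (!m->in_use || m->iova != iova)
            continue;
        if (m->dir != SKETCH_TO_DEVICE)
            memcpy(m->va, d->bounce + m->iova, m->size);
        m->in_use = 0;
        return 0;
    }
    return -1;
}
```

Without the stored va, sketch_unmap() would have no destination for the
copy back, which is the information the mappings tree keeps per iova.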

>
> > +     spinlock_t map_lock;
> > +     struct file *file;
> > +};
> > +
> > +static inline struct file *
> > +vduse_domain_file(struct vduse_iova_domain *domain)
> > +{
> > +     return domain->file;
> > +}
> > +
> > +static inline unsigned long
> > +vduse_domain_get_offset(struct vduse_iova_domain *domain, unsigned long iova)
> > +{
> > +     return iova;
> > +}
> > +
> > +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> > +                             struct page *page, unsigned long offset,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs);
> > +
> > +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> > +                     dma_addr_t dma_addr, size_t size,
> > +                     enum dma_data_direction dir, unsigned long attrs);
> > +
> > +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> > +                             size_t size, dma_addr_t *dma_addr,
> > +                             gfp_t flag, unsigned long attrs);
> > +
> > +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> > +                             void *vaddr, dma_addr_t dma_addr,
> > +                             unsigned long attrs);
> > +
> > +void vduse_domain_destroy(struct vduse_iova_domain *domain);
> > +
> > +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size);
> > +
> > +#endif /* _VDUSE_IOVA_DOMAIN_H */
> > diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
> > new file mode 100644
> > index 000000000000..3566d229382e
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#ifndef _VDUSE_H
> > +#define _VDUSE_H
> > +
> > +#include <linux/eventfd.h>
> > +#include <linux/wait.h>
> > +#include <linux/vdpa.h>
> > +
> > +#include "iova_domain.h"
> > +#include "eventfd.h"
> > +
> > +struct vduse_virtqueue {
> > +     u16 index;
> > +     bool ready;
> > +     spinlock_t kick_lock;
> > +     spinlock_t irq_lock;
> > +     struct eventfd_ctx *kickfd;
> > +     struct vduse_virqfd *virqfd;
> > +     void *private;
> > +     irqreturn_t (*cb)(void *data);
> > +};
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_vdpa {
> > +     struct vdpa_device vdpa;
> > +     struct vduse_dev *dev;
> > +};
> > +
> > +struct vduse_dev {
> > +     struct vduse_vdpa *vdev;
> > +     struct mutex lock;
> > +     struct vduse_virtqueue *vqs;
> > +     struct vduse_iova_domain *domain;
> > +     struct vhost_iotlb *iommu;
> > +     spinlock_t iommu_lock;
> > +     atomic_t bounce_map;
> > +     spinlock_t msg_lock;
> > +     atomic64_t msg_unique;
> > +     wait_queue_head_t waitq;
> > +     struct list_head send_list;
> > +     struct list_head recv_list;
> > +     struct list_head list;
> > +     bool connected;
> > +     u32 id;
> > +     u16 vq_size_max;
> > +     u16 vq_num;
> > +     u32 vq_align;
> > +     u32 device_id;
> > +     u32 vendor_id;
> > +};
> > +
> > +#endif /* _VDUSE_H_ */
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > new file mode 100644
> > index 000000000000..1cf759bc5914
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -0,0 +1,1217 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/slab.h>
> > +#include <linux/wait.h>
> > +#include <linux/dma-map-ops.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/file.h>
> > +#include <linux/uio.h>
> > +#include <linux/vdpa.h>
> > +#include <uapi/linux/vduse.h>
> > +#include <uapi/linux/vdpa.h>
> > +#include <uapi/linux/virtio_config.h>
> > +#include <linux/mod_devicetable.h>
> > +
> > +#include "vduse.h"
> > +
> > +#define DRV_VERSION  "1.0"
> > +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> > +#define DRV_DESC     "vDPA Device in Userspace"
> > +#define DRV_LICENSE  "GPL v2"
> > +
> > +struct vduse_dev_msg {
> > +     struct vduse_dev_request req;
> > +     struct vduse_dev_response resp;
> > +     struct list_head list;
> > +     wait_queue_head_t waitq;
> > +     bool completed;
> > +     refcount_t refcnt;
>
>
> The reference count here will bring extra complexity. I think we can
> sync through msg_lock.
>

Do you mean using wait_event_interruptible_locked() and
wake_up_locked()? I think it works.

>
>
> > +};
> > +
> > +static struct workqueue_struct *vduse_vdpa_wq;
> > +static DEFINE_MUTEX(vduse_lock);
> > +static LIST_HEAD(vduse_devs);
> > +
> > +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> > +
> > +     return vdev->dev;
> > +}
> > +
> > +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> > +{
> > +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> > +
> > +     return vdpa_to_vduse(vdpa);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
> > +{
> > +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> > +                                     GFP_KERNEL | __GFP_NOFAIL);
> > +
> > +     msg->req.type = type;
> > +     msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
>
>
> This does not look safe; let's use idr here.
>

Could you give more details? It looks like idr should not be used in this
case, which cannot tolerate failure. And using a list to store the msg
is better than using idr when the msg needs to be re-inserted in some
cases.
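
A minimal userspace sketch of the list-based scheme (not the driver
code): messages move from a send list to a recv list and are matched by
their unique value when the response arrives. The sketch_* names are
hypothetical, the list is a plain singly linked queue, and locking is
left out; a caller would have to hold the equivalent of msg_lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message: only the fields needed for matching. */
struct sketch_msg {
    uint64_t unique;
    struct sketch_msg *next;
};

struct sketch_list {
    struct sketch_msg *head;
    struct sketch_msg *tail;
};

/* Append at the tail; this cannot fail, so re-inserting is always safe. */
static void sketch_enqueue(struct sketch_list *l, struct sketch_msg *m)
{
    assert(l && m);
    m->next = NULL;
    if (l->tail)
        l->tail->next = m;
    else
        l->head = m;
    l->tail = m;
}

/* Remove and return the oldest message, or NULL if the list is empty. */
static struct sketch_msg *sketch_dequeue(struct sketch_list *l)
{
    struct sketch_msg *m;

    assert(l);
    m = l->head;
    if (!m)
        return NULL;
    l->head = m->next;
    if (!l->head)
        l->tail = NULL;
    m->next = NULL;
    return m;
}

/* Remove and return the message with the given unique value, or NULL. */
static struct sketch_msg *sketch_find(struct sketch_list *l, uint64_t unique)
{
    struct sketch_msg *prev = NULL, *m;

    assert(l);
    for (m = l->head; m; prev = m, m = m->next) {
        if (m->unique != unique)
            continue;
        if (prev)
            prev->next = m->next;
        else
            l->head = m->next;
        if (l->tail == m)
            l->tail = prev;
        m->next = NULL;
        return m;
    }
    return NULL;
}
```

Neither enqueue nor re-insert allocates memory here, so neither can fail,
which is the property the reply above relies on.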

>
> > +     init_waitqueue_head(&msg->waitq);
> > +     refcount_set(&msg->refcnt, 1);
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
> > +{
> > +     refcount_inc(&msg->refcnt);
> > +}
> > +
> > +static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
> > +{
> > +     if (refcount_dec_and_test(&msg->refcnt))
> > +             kfree(msg);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
> > +                                             struct list_head *head,
> > +                                             uint32_t unique)
> > +{
> > +     struct vduse_dev_msg *tmp, *msg = NULL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     list_for_each_entry(tmp, head, list) {
> > +             if (tmp->req.unique == unique) {
> > +                     msg = tmp;
> > +                     list_del(&tmp->list);
> > +                     break;
> > +             }
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return msg;
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
> > +                                             struct list_head *head)
> > +{
> > +     struct vduse_dev_msg *msg = NULL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     if (!list_empty(head)) {
> > +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> > +             list_del(&msg->list);
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
> > +                     struct vduse_dev_msg *msg, struct list_head *head)
> > +{
> > +     spin_lock(&dev->msg_lock);
> > +     list_add_tail(&msg->list, head);
> > +     spin_unlock(&dev->msg_lock);
> > +}
> > +
> > +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> > +{
> > +     int ret;
> > +
> > +     vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> > +     wake_up(&dev->waitq);
> > +     wait_event(msg->waitq, msg->completed);
>
>
> This is an uninterruptible wait, which means that if userspace forgets to
> process the command, we will be stuck here forever.
>

Yes, wait_event_interruptible() should be better here.

>
> > +     /* coupled with smp_wmb() in vduse_dev_msg_complete() */
> > +     smp_rmb();
>
>
> Instead of using barriers, I wonder why not simply use msg lock here?
>

As mentioned above, using
wait_event_interruptible_locked()/wake_up_locked() is OK with me.

>
> > +     ret = msg->resp.result;
> > +
> > +     return ret;
> > +}
> > +
> > +static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
> > +                                     struct vduse_dev_response *resp)
> > +{
> > +     vduse_dev_msg_get(msg);
> > +     memcpy(&msg->resp, resp, sizeof(*resp));
> > +     /* coupled with smp_rmb() in vduse_dev_msg_sync() */
> > +     smp_wmb();
> > +     msg->completed = 1;
> > +     wake_up(&msg->waitq);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
> > +     u64 features;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     features = msg->resp.features;
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return features;
> > +}
> > +
> > +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
> > +     int ret;
> > +
> > +     msg->req.size = sizeof(features);
> > +     msg->req.features = features;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
> > +     u8 status;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     status = msg->resp.status;
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return status;
> > +}
> > +
> > +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
> > +
> > +     msg->req.size = sizeof(status);
> > +     msg->req.status = status;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> > +                                     void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
> > +
> > +     WARN_ON(len > sizeof(msg->req.config.data));
> > +
> > +     msg->req.size = sizeof(struct vduse_dev_config_data);
> > +     msg->req.config.offset = offset;
> > +     msg->req.config.len = len;
> > +     vduse_dev_msg_sync(dev, msg);
> > +     memcpy(buf, msg->resp.config.data, len);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> > +                                     const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
> > +
> > +     WARN_ON(len > sizeof(msg->req.config.data));
> > +
> > +     msg->req.size = sizeof(struct vduse_dev_config_data);
> > +     msg->req.config.offset = offset;
> > +     msg->req.config.len = len;
> > +     memcpy(msg->req.config.data, buf, len);
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, u32 num)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_num);
> > +     msg->req.vq_num.index = vq->index;
> > +     msg->req.vq_num.num = num;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, u64 desc_addr,
> > +                             u64 driver_addr, u64 device_addr)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
> > +     int ret;
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_addr);
> > +     msg->req.vq_addr.index = vq->index;
> > +     msg->req.vq_addr.desc_addr = desc_addr;
> > +     msg->req.vq_addr.driver_addr = driver_addr;
> > +     msg->req.vq_addr.device_addr = device_addr;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, bool ready)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_ready);
> > +     msg->req.vq_ready.index = vq->index;
> > +     msg->req.vq_ready.ready = ready;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +}
> > +
> > +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> > +                                struct vduse_virtqueue *vq)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
> > +     bool ready;
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_ready);
> > +     msg->req.vq_ready.index = vq->index;
> > +
> > +     vduse_dev_msg_sync(dev, msg);
> > +     ready = msg->resp.vq_ready.ready;
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ready;
> > +}
> > +
> > +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE);
> > +     int ret;
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_state);
> > +     msg->req.vq_state.index = vq->index;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     state->avail_index = msg->resp.vq_state.avail_idx;
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE);
> > +     int ret;
> > +
> > +     msg->req.size = sizeof(struct vduse_vq_state);
> > +     msg->req.vq_state.index = vq->index;
> > +     msg->req.vq_state.avail_idx = state->avail_index;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > +                                     u64 start, u64 last)
> > +{
> > +     struct vduse_dev_msg *msg;
> > +     int ret;
> > +
> > +     if (last < start)
> > +             return -EINVAL;
> > +
> > +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
>
>
> This is actually an IOTLB invalidation. So let's rename the function and
> message type.
>

Actually, VDUSE_UPDATE_IOTLB is now used to notify userspace that the IOTLB
has changed, rather than that the IOTLB needs to be invalidated. Userspace
can then use the GET_IOTLB ioctl to fetch the change. This seems to be more
friendly to userspace.
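
A minimal sketch of how a daemon might react to such a notification,
under the assumption that it keeps its own cache of IOTLB entries. The
sketch_* names and the cache layout are hypothetical and not part of
this series. Entries overlapping the reported [start, last] range are
only marked stale; refetching them through the GET_IOTLB ioctl is not
shown, since its argument layout is not part of this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical daemon-side cache entry for one IOTLB range. */
struct sketch_iotlb_entry {
    uint64_t start;
    uint64_t last;   /* inclusive, like the start/last pair in the request */
    int valid;
};

/*
 * Mark every valid cached entry that overlaps [start, last] as stale.
 * Returns the number of entries that have to be fetched again.
 */
static size_t sketch_iotlb_changed(struct sketch_iotlb_entry *cache, size_t n,
                                   uint64_t start, uint64_t last)
{
    size_t i, stale = 0;

    assert(cache || n == 0);
    assert(start <= last);

    for (i = 0; i < n; i++) {
        if (!cache[i].valid)
            continue;
        /* Two inclusive ranges overlap unless one ends before the other
         * begins. */
        if (cache[i].start <= last && cache[i].last >= start) {
            cache[i].valid = 0;
            stale++;
        }
    }
    return stale;
}
```

A notification covering 0 to UINT64_MAX, as sent from the reset path,
marks the whole cache stale in one call.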

>
> > +     msg->req.size = sizeof(struct vduse_iova_range);
> > +     msg->req.iova.start = start;
> > +     msg->req.iova.last = last;
> > +
> > +     ret = vduse_dev_msg_sync(dev, msg);
> > +     vduse_dev_msg_put(msg);
> > +
> > +     return ret;
> > +}
> > +
> > +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int size = sizeof(struct vduse_dev_request);
> > +     ssize_t ret = 0;
> > +
> > +     if (iov_iter_count(to) < size)
> > +             return 0;
> > +
> > +     while (1) {
> > +             msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
> > +             if (msg)
> > +                     break;
> > +
> > +             if (file->f_flags & O_NONBLOCK)
> > +                     return -EAGAIN;
> > +
> > +             ret = wait_event_interruptible_exclusive(dev->waitq,
> > +                                     !list_empty(&dev->send_list));
> > +             if (ret)
> > +                     return ret;
> > +     }
> > +     ret = copy_to_iter(&msg->req, size, to);
> > +     if (ret != size) {
> > +             vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> > +             return -EFAULT;
> > +     }
> > +     vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
> > +
> > +     return ret;
> > +}
> > +
> > +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_response resp;
> > +     struct vduse_dev_msg *msg;
> > +     size_t ret;
> > +
> > +     ret = copy_from_iter(&resp, sizeof(resp), from);
> > +     if (ret != sizeof(resp))
> > +             return -EINVAL;
> > +
> > +     msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
> > +     if (!msg)
> > +             return -EINVAL;
> > +
> > +     vduse_dev_msg_complete(msg, &resp);
>
>
> So we have multiple types of requests/responses; would it be better to
> introduce a queue-based admin interface other than ioctl?
>

Sorry, I didn't get your point. What do you mean by queue-based admin
interface? Virtqueue-based?

>
> > +
> > +     return ret;
> > +}
> > +
> > +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     __poll_t mask = 0;
> > +
> > +     poll_wait(file, &dev->waitq, wait);
> > +
> > +     if (!list_empty(&dev->send_list))
> > +             mask |= EPOLLIN | EPOLLRDNORM;
> > +
> > +     return mask;
> > +}
> > +
> > +static int vduse_iotlb_add_range(struct vduse_dev *dev,
> > +                              u64 start, u64 last,
> > +                              u64 addr, unsigned int perm,
> > +                              struct file *file, u64 offset)
> > +{
> > +     struct vhost_iotlb_file *iotlb_file;
> > +     int ret;
> > +
> > +     iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_ATOMIC);
> > +     if (!iotlb_file)
> > +             return -ENOMEM;
> > +
> > +     iotlb_file->file = get_file(file);
> > +     iotlb_file->offset = offset;
> > +
> > +     spin_lock(&dev->iommu_lock);
> > +     ret = vhost_iotlb_add_range(dev->iommu, start, last,
> > +                                     addr, perm, iotlb_file);
> > +     spin_unlock(&dev->iommu_lock);
> > +     if (ret) {
> > +             fput(iotlb_file->file);
> > +             kfree(iotlb_file);
> > +             return ret;
> > +     }
> > +     return 0;
> > +}
> > +
> > +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
> > +{
> > +     struct vhost_iotlb_file *iotlb_file;
> > +     struct vhost_iotlb_map *map;
> > +
> > +     spin_lock(&dev->iommu_lock);
> > +     while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
> > +             iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> > +             fput(iotlb_file->file);
> > +             kfree(iotlb_file);
> > +             vhost_iotlb_map_free(dev->iommu, map);
> > +     }
> > +     spin_unlock(&dev->iommu_lock);
> > +}
> > +
> > +static void vduse_dev_reset(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     atomic_set(&dev->bounce_map, 0);
> > +     vduse_iotlb_del_range(dev, 0ULL, 0ULL - 1);
> > +     vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
>
>
> ULLONG_MAX please.
>

OK.
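
For reference, both spellings denote the same value, so the change is purely about readability. A minimal userspace sketch of that, where the `example_iova_range` type and the `example_full_range()` helper are made up for illustration and are not part of the patch:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical type for illustration; the patch passes start/last as two u64s. */
struct example_iova_range {
	unsigned long long start;
	unsigned long long last;	/* inclusive */
};

/* Returns the range that covers the whole IOVA space. */
static struct example_iova_range example_full_range(void)
{
	struct example_iova_range r = { .start = 0ULL, .last = ULLONG_MAX };

	/* Unsigned arithmetic wraps, so 0ULL - 1 is the same value. */
	assert(r.last == 0ULL - 1);
	return r;
}
```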

>
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             vq->ready = false;
> > +             vq->cb = NULL;
> > +             vq->private = NULL;
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +}
> > +
> > +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > +                             u64 desc_area, u64 driver_area,
> > +                             u64 device_area)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
> > +                                     driver_area, device_area);
> > +}
> > +
> > +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_vq_kick(vq);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> > +                           struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->cb = cb->callback;
> > +     vq->private = cb->private;
> > +}
> > +
> > +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_num(dev, vq, num);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> > +                                     u16 idx, bool ready)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_ready(dev, vq, ready);
> > +     vq->ready = ready;
> > +}
> > +
> > +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
> > +
> > +     return vq->ready;
> > +}
> > +
> > +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_set_vq_state(dev, vq, state);
> > +}
> > +
> > +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_get_vq_state(dev, vq, state);
> > +}
> > +
> > +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_align;
> > +}
> > +
> > +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> > +
> > +     return (vduse_dev_get_features(dev) | fixed);
> > +}
> > +
> > +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_set_features(dev, features);
> > +}
> > +
> > +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> > +                               struct vdpa_callback *cb)
> > +{
> > +     /* We don't support config interrupt */
>
>
> If it's not hard, let's add this. Otherwise we need a per-device feature
> blacklist to filter out all features that depend on the config interrupt.
>

Will do it.

>
> > +}
> > +
> > +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_size_max;
> > +}
> > +
> > +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->device_id;
> > +}
> > +
> > +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vendor_id;
> > +}
> > +
> > +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_get_status(dev);
> > +}
> > +
> > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     if (status == 0)
> > +             vduse_dev_reset(dev);
> > +     else
> > +             vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
>
>
> Any reason for such IOTLB invalidation here?
>

As I mentioned before, this is used to notify userspace to update the
IOTLB. Mainly for virtio-vdpa case.
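
To make the userspace side a bit more concrete: on a VDUSE_UPDATE_IOTLB message the daemon would presumably re-query the affected range region by region, since VDUSE_IOTLB_GET_FD returns the first mapping that overlaps the range it is given. Below is a rough sketch of only the walking logic. The static table and `example_lookup_first()` are stand-ins for the ioctl (assumptions for illustration), and a real daemon would mmap() the returned fd where the counter is incremented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct example_region {
	uint64_t start;
	uint64_t last;	/* inclusive */
};

/* Fake, sorted IOTLB contents standing in for the kernel's interval tree. */
static const struct example_region example_table[] = {
	{ 0x1000, 0x1fff },
	{ 0x4000, 0x7fff },
	{ 0x9000, UINT64_MAX },
};

/* Stand-in for VDUSE_IOTLB_GET_FD: first region overlapping [start, last]. */
static const struct example_region *example_lookup_first(uint64_t start,
							 uint64_t last)
{
	size_t i;

	assert(start <= last);
	for (i = 0; i < sizeof(example_table) / sizeof(example_table[0]); i++) {
		if (example_table[i].start <= last &&
		    example_table[i].last >= start)
			return &example_table[i];
	}
	return NULL;
}

/* Visits every region overlapping [start, last]; returns how many were seen. */
static unsigned int example_walk_range(uint64_t start, uint64_t last)
{
	unsigned int n = 0;

	for (;;) {
		const struct example_region *r = example_lookup_first(start, last);

		if (!r)
			break;
		n++;	/* a real daemon would mmap() the region here */
		if (r->last >= last)
			break;	/* done; also avoids wrapping past UINT64_MAX */
		start = r->last + 1;
	}
	return n;
}
```

The early exit on `r->last >= last` matters because the full-range update uses ULLONG_MAX as the last address, so `r->last + 1` could otherwise wrap to zero.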

>
> > +
> > +     vduse_dev_set_status(dev, status);
> > +}
> > +
> > +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                          void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_get_config(dev, offset, buf, len);
> > +}
> > +
> > +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                     const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_set_config(dev, offset, buf, len);
> > +}
> > +
> > +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > +                             struct vhost_iotlb *iotlb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vhost_iotlb_map *map;
> > +     struct vhost_iotlb_file *iotlb_file;
> > +     u64 start = 0ULL, last = 0ULL - 1;
> > +     int ret = 0;
> > +
> > +     vduse_iotlb_del_range(dev, start, last);
> > +
> > +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> > +             map = vhost_iotlb_itree_next(map, start, last)) {
> > +             if (!map->opaque)
> > +                     continue;
>
>
> What will happen if we simply accept NULL opaque here?
>

Then there would be no file for userspace to mmap, so the entry is useless.

>
> > +
> > +             iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> > +             ret = vduse_iotlb_add_range(dev, map->start, map->last,
> > +                                         map->addr, map->perm,
> > +                                         iotlb_file->file,
> > +                                         iotlb_file->offset);
> > +             if (ret)
> > +                     break;
> > +     }
> > +     vduse_dev_update_iotlb(dev, start, last);
> > +
> > +     return ret;
> > +}
> > +
> > +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     WARN_ON(!list_empty(&dev->send_list));
> > +     WARN_ON(!list_empty(&dev->recv_list));
> > +     dev->vdev = NULL;
> > +}
> > +
> > +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > +     .set_vq_address         = vduse_vdpa_set_vq_address,
> > +     .kick_vq                = vduse_vdpa_kick_vq,
> > +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> > +     .set_vq_num             = vduse_vdpa_set_vq_num,
> > +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> > +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> > +     .set_vq_state           = vduse_vdpa_set_vq_state,
> > +     .get_vq_state           = vduse_vdpa_get_vq_state,
> > +     .get_vq_align           = vduse_vdpa_get_vq_align,
> > +     .get_features           = vduse_vdpa_get_features,
> > +     .set_features           = vduse_vdpa_set_features,
> > +     .set_config_cb          = vduse_vdpa_set_config_cb,
> > +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> > +     .get_device_id          = vduse_vdpa_get_device_id,
> > +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> > +     .get_status             = vduse_vdpa_get_status,
> > +     .set_status             = vduse_vdpa_set_status,
> > +     .get_config             = vduse_vdpa_get_config,
> > +     .set_config             = vduse_vdpa_set_config,
> > +     .set_map                = vduse_vdpa_set_map,
> > +     .free                   = vduse_vdpa_free,
> > +};
> > +
> > +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > +                                     unsigned long offset, size_t size,
> > +                                     enum dma_data_direction dir,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
> > +             vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
> > +                                   0, VDUSE_ACCESS_RW,
>
>
> Is it safe to use VDUSE_ACCESS_RW here, considering we might have device
> read-only mappings?
>

This mapping is for the whole bounce buffer. Maybe userspace needs to
tell us if it only supports read-only mappings.

>
> > +                                   vduse_domain_file(domain),
> > +                                   vduse_domain_get_offset(domain, 0))) {
> > +             atomic_set(&vdev->bounce_map, 0);
> > +             return DMA_MAPPING_ERROR;
> > +     }
> > +
> > +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > +}
> > +
> > +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > +}
> > +
> > +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > +                                     dma_addr_t *dma_addr, gfp_t flag,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova;
> > +     void *addr;
> > +
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     addr = vduse_domain_alloc_coherent(domain, size,
> > +                             (dma_addr_t *)&iova, flag, attrs);
> > +     if (!addr)
> > +             return NULL;
> > +
> > +     if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
> > +                               iova, VDUSE_ACCESS_RW,
> > +                               vduse_domain_file(domain),
> > +                               vduse_domain_get_offset(domain, iova))) {
> > +             vduse_domain_free_coherent(domain, size, addr, iova, attrs);
> > +             return NULL;
> > +     }
> > +     *dma_addr = (dma_addr_t)iova;
> > +
> > +     return addr;
> > +}
> > +
> > +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > +                                     void *vaddr, dma_addr_t dma_addr,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long start = (unsigned long)dma_addr;
> > +     unsigned long last = start + size - 1;
> > +
> > +     vduse_iotlb_del_range(vdev, start, last);
> > +     vduse_dev_update_iotlb(vdev, start, last);
> > +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > +}
> > +
> > +static const struct dma_map_ops vduse_dev_dma_ops = {
> > +     .map_page = vduse_dev_map_page,
> > +     .unmap_page = vduse_dev_unmap_page,
> > +     .alloc = vduse_dev_alloc_coherent,
> > +     .free = vduse_dev_free_coherent,
> > +};
> > +
> > +static unsigned int perm_to_file_flags(u8 perm)
> > +{
> > +     unsigned int flags = 0;
> > +
> > +     switch (perm) {
> > +     case VDUSE_ACCESS_WO:
> > +             flags |= O_WRONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RO:
> > +             flags |= O_RDONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RW:
> > +             flags |= O_RDWR;
> > +             break;
> > +     default:
> > +             WARN(1, "invalidate vhost IOTLB permission\n");
> > +             break;
> > +     }
> > +
> > +     return flags;
> > +}
> > +
> > +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     void __user *argp = (void __user *)arg;
> > +     int ret;
> > +
> > +     mutex_lock(&dev->lock);
> > +     switch (cmd) {
> > +     case VDUSE_IOTLB_GET_FD: {
> > +             struct vduse_iotlb_entry entry;
> > +             struct vhost_iotlb_map *map;
> > +             struct vhost_iotlb_file *iotlb_file;
> > +             struct file *f = NULL;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&entry, argp, sizeof(entry)))
> > +                     break;
> > +
> > +             spin_lock(&dev->iommu_lock);
> > +             map = vhost_iotlb_itree_first(dev->iommu, entry.start,
> > +                                           entry.last);
> > +             if (map) {
> > +                     iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> > +                     f = get_file(iotlb_file->file);
> > +                     entry.offset = iotlb_file->offset;
> > +                     entry.start = map->start;
> > +                     entry.last = map->last;
> > +                     entry.perm = map->perm;
> > +             }
> > +             spin_unlock(&dev->iommu_lock);
> > +             if (!f) {
> > +                     ret = -EINVAL;
> > +                     break;
> > +             }
> > +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> > +                     fput(f);
> > +                     ret = -EFAULT;
> > +                     break;
> > +             }
> > +             ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
> > +             if (ret < 0) {
> > +                     fput(f);
> > +                     break;
> > +             }
> > +             fd_install(ret, f);
> > +             break;
> > +     }
> > +     case VDUSE_VQ_SETUP_KICKFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_kickfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     case VDUSE_VQ_SETUP_IRQFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_virqfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     }
> > +     mutex_unlock(&dev->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +
> > +     vduse_kickfd_release(dev);
> > +     vduse_virqfd_release(dev);
> > +     dev->connected = false;
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations vduse_dev_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .release        = vduse_dev_release,
> > +     .read_iter      = vduse_dev_read_iter,
> > +     .write_iter     = vduse_dev_write_iter,
> > +     .poll           = vduse_dev_poll,
> > +     .unlocked_ioctl = vduse_dev_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static struct vduse_dev *vduse_dev_create(void)
> > +{
> > +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +
> > +     if (!dev)
> > +             return NULL;
> > +
> > +     dev->iommu = vhost_iotlb_alloc(2048, 0);
>
>
> Is 2048 sufficient here?
>

How about letting userspace define it?


>
> > +     if (!dev->iommu) {
> > +             kfree(dev);
> > +             return NULL;
> > +     }
> > +
> > +     mutex_init(&dev->lock);
> > +     spin_lock_init(&dev->msg_lock);
> > +     INIT_LIST_HEAD(&dev->send_list);
> > +     INIT_LIST_HEAD(&dev->recv_list);
> > +     atomic64_set(&dev->msg_unique, 0);
> > +     spin_lock_init(&dev->iommu_lock);
> > +     atomic_set(&dev->bounce_map, 0);
> > +
> > +     init_waitqueue_head(&dev->waitq);
> > +
> > +     return dev;
> > +}
> > +
> > +static void vduse_dev_destroy(struct vduse_dev *dev)
> > +{
> > +     vhost_iotlb_free(dev->iommu);
> > +     mutex_destroy(&dev->lock);
> > +     kfree(dev);
> > +}
> > +
> > +static struct vduse_dev *vduse_find_dev(u32 id)
> > +{
> > +     struct vduse_dev *tmp, *dev = NULL;
> > +
> > +     list_for_each_entry(tmp, &vduse_devs, list) {
> > +             if (tmp->id == id) {
> > +                     dev = tmp;
> > +                     break;
> > +             }
> > +     }
> > +     return dev;
> > +}
> > +
> > +static int vduse_destroy_dev(u32 id)
> > +{
> > +     struct vduse_dev *dev = vduse_find_dev(id);
> > +
> > +     if (!dev)
> > +             return -EINVAL;
> > +
> > +     if (dev->vdev || dev->connected)
> > +             return -EBUSY;
> > +
> > +     list_del(&dev->list);
> > +     kfree(dev->vqs);
> > +     vduse_domain_destroy(dev->domain);
> > +     vduse_dev_destroy(dev);
> > +
> > +     return 0;
> > +}
> > +
> > +static int vduse_create_dev(struct vduse_dev_config *config)
> > +{
> > +     int i, fd;
> > +     struct vduse_dev *dev;
> > +     char name[64];
> > +
> > +     if (vduse_find_dev(config->id))
> > +             return -EEXIST;
> > +
> > +     dev = vduse_dev_create();
> > +     if (!dev)
> > +             return -ENOMEM;
> > +
> > +     dev->id = config->id;
> > +     dev->device_id = config->device_id;
> > +     dev->vendor_id = config->vendor_id;
> > +     dev->domain = vduse_domain_create(config->bounce_size);
>
>
> Do we need an upper limit on bounce_size?
>

I agree. Any suggestion for the value?
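
Whatever value we settle on, the check itself could look roughly like the sketch below. Both constants are placeholders picked only to make the example runnable (they are not a proposal for the real limit), and `example_check_bounce_size()` is a hypothetical helper, not something in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SIZE	4096ULL		/* assumed page size */
#define EXAMPLE_BOUNCE_MAX	(64ULL << 20)	/* placeholder upper limit */

/* Rejects empty, oversized, or non-page-aligned bounce buffer sizes. */
static int example_check_bounce_size(uint64_t size)
{
	/* The alignment test below relies on a power-of-two page size. */
	assert((EXAMPLE_PAGE_SIZE & (EXAMPLE_PAGE_SIZE - 1)) == 0);

	if (size == 0 || size > EXAMPLE_BOUNCE_MAX)
		return -EINVAL;
	if (size & (EXAMPLE_PAGE_SIZE - 1))
		return -EINVAL;
	return 0;
}
```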

>
> > +     if (!dev->domain)
> > +             goto err_domain;
> > +
> > +     dev->vq_align = config->vq_align;
> > +     dev->vq_size_max = config->vq_size_max;
> > +     dev->vq_num = config->vq_num;
> > +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> > +     if (!dev->vqs)
> > +             goto err_vqs;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             dev->vqs[i].index = i;
> > +             spin_lock_init(&dev->vqs[i].kick_lock);
> > +             spin_lock_init(&dev->vqs[i].irq_lock);
> > +     }
> > +
> > +     snprintf(name, sizeof(name), "[vduse-dev:%u]", config->id);
> > +     fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
>
>
> Any reason for closing on exec here?
>

Looks like we can remove this flag.

>
> > +     if (fd < 0)
> > +             goto err_fd;
> > +
> > +     dev->connected = true;
> > +     list_add(&dev->list, &vduse_devs);
> > +
> > +     return fd;
> > +err_fd:
> > +     kfree(dev->vqs);
> > +err_vqs:
> > +     vduse_domain_destroy(dev->domain);
> > +err_domain:
> > +     vduse_dev_destroy(dev);
> > +     return fd;
> > +}
> > +
> > +static long vduse_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     int ret;
> > +     void __user *argp = (void __user *)arg;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     switch (cmd) {
> > +     case VDUSE_CREATE_DEV: {
> > +             struct vduse_dev_config config;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, sizeof(config)))
> > +                     break;
> > +
> > +             ret = vduse_create_dev(&config);
> > +             break;
> > +     }
> > +     case VDUSE_DESTROY_DEV:
> > +             ret = vduse_destroy_dev(arg);
> > +             break;
> > +     default:
> > +             ret = -EINVAL;
> > +             break;
> > +     }
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct file_operations vduse_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .unlocked_ioctl = vduse_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static struct miscdevice vduse_misc = {
> > +     .fops = &vduse_fops,
> > +     .minor = MISC_DYNAMIC_MINOR,
> > +     .name = "vduse",
> > +};
> > +
> > +static void vduse_parent_release(struct device *dev)
> > +{
> > +}
> > +
> > +static struct device vduse_parent = {
> > +     .init_name = "vduse",
> > +     .release = vduse_parent_release,
> > +};
> > +
> > +static struct vdpa_parent_dev parent_dev;
> > +
> > +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> > +{
> > +     struct vduse_vdpa *vdev = dev->vdev;
> > +     int ret;
> > +
> > +     if (vdev)
> > +             return -EEXIST;
> > +
> > +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
> > +                              &vduse_vdpa_config_ops,
> > +                              dev->vq_num, name, true);
> > +     if (!vdev)
> > +             return -ENOMEM;
> > +
> > +     vdev->dev = dev;
> > +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> > +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> > +     if (ret)
> > +             goto err;
> > +
> > +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> > +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> > +     vdev->vdpa.pdev = &parent_dev;
> > +
> > +     ret = _vdpa_register_device(&vdev->vdpa);
> > +     if (ret)
> > +             goto err;
> > +
> > +     dev->vdev = vdev;
> > +
> > +     return 0;
> > +err:
> > +     put_device(&vdev->vdpa.dev);
> > +     return ret;
> > +}
> > +
> > +static struct vdpa_device *vdpa_dev_add(struct vdpa_parent_dev *pdev,
> > +                                     const char *name, u32 device_id,
> > +                                     struct nlattr **attrs)
> > +{
> > +     u32 vduse_id;
> > +     struct vduse_dev *dev;
> > +     int ret = -EINVAL;
> > +
> > +     if (!attrs[VDPA_ATTR_BACKEND_ID])
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     mutex_lock(&vduse_lock);
> > +     vduse_id = nla_get_u32(attrs[VDPA_ATTR_BACKEND_ID]);
>
>
> I wonder why we don't use the name here?
>

Do you mean use the same name for both backend and frontend? If so, we
need to add a name for vduse device or replace id with name to
identify a vduse device. Which way do you prefer?

> And it looks to me it would be easier if we create a char device per
> vduse. This makes the device addressing more robust than passing id
> silently among processes.
>

That's OK with me.

>
> > +     dev = vduse_find_dev(vduse_id);
> > +     if (!dev)
> > +             goto unlock;
> > +
> > +     if (dev->device_id != device_id)
> > +             goto unlock;
> > +
> > +     ret = vduse_dev_add_vdpa(dev, name);
> > +unlock:
> > +     mutex_unlock(&vduse_lock);
> > +     if (ret)
> > +             return ERR_PTR(ret);
> > +
> > +     return &dev->vdev->vdpa;
> > +}
> > +
> > +static void vdpa_dev_del(struct vdpa_parent_dev *pdev, struct vdpa_device *dev)
> > +{
> > +     _vdpa_unregister_device(dev);
> > +}
> > +
> > +static const struct vdpa_dev_ops vdpa_dev_parent_ops = {
> > +     .dev_add = vdpa_dev_add,
> > +     .dev_del = vdpa_dev_del
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> > +     { 0 },
> > +};
> > +
> > +static struct vdpa_parent_dev parent_dev = {
> > +     .device = &vduse_parent,
> > +     .id_table = id_table,
> > +     .ops = &vdpa_dev_parent_ops,
> > +};
> > +
> > +static int vduse_parentdev_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = device_register(&vduse_parent);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = vdpa_parentdev_register(&parent_dev);
> > +     if (ret)
> > +             goto err;
> > +
> > +     return 0;
> > +err:
> > +     device_unregister(&vduse_parent);
> > +     return ret;
> > +}
> > +
> > +static void vduse_parentdev_exit(void)
> > +{
> > +     vdpa_parentdev_unregister(&parent_dev);
> > +     device_unregister(&vduse_parent);
> > +}
> > +
> > +static int vduse_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = misc_register(&vduse_misc);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = -ENOMEM;
> > +     vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
> > +     if (!vduse_vdpa_wq)
> > +             goto err_vdpa_wq;
> > +
> > +     ret = vduse_virqfd_init();
> > +     if (ret)
> > +             goto err_irqfd;
> > +
> > +     ret = vduse_parentdev_init();
> > +     if (ret)
> > +             goto err_parentdev;
> > +
> > +     return 0;
> > +err_parentdev:
> > +     vduse_virqfd_exit();
> > +err_irqfd:
> > +     destroy_workqueue(vduse_vdpa_wq);
> > +err_vdpa_wq:
> > +     misc_deregister(&vduse_misc);
> > +     return ret;
> > +}
> > +module_init(vduse_init);
> > +
> > +static void vduse_exit(void)
> > +{
> > +     misc_deregister(&vduse_misc);
> > +     destroy_workqueue(vduse_vdpa_wq);
> > +     vduse_virqfd_exit();
> > +     vduse_parentdev_exit();
> > +}
> > +module_exit(vduse_exit);
> > +
> > +MODULE_VERSION(DRV_VERSION);
> > +MODULE_LICENSE(DRV_LICENSE);
> > +MODULE_AUTHOR(DRV_AUTHOR);
> > +MODULE_DESCRIPTION(DRV_DESC);
> > diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
> > index bba8b83a94b5..a7a841e5ffc7 100644
> > --- a/include/uapi/linux/vdpa.h
> > +++ b/include/uapi/linux/vdpa.h
> > @@ -33,6 +33,7 @@ enum vdpa_attr {
> >       VDPA_ATTR_DEV_VENDOR_ID,                /* u32 */
> >       VDPA_ATTR_DEV_MAX_VQS,                  /* u32 */
> >       VDPA_ATTR_DEV_MAX_VQ_SIZE,              /* u16 */
> > +     VDPA_ATTR_BACKEND_ID,                   /* u32 */
>
>
> As discussed, this needs more thought. But if necessary, we need a
> separate patch for this.
>

OK.

>
> >
> >       /* new attributes must be added above here */
> >       VDPA_ATTR_MAX,
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > new file mode 100644
> > index 000000000000..9fb555ddcfbd
> > --- /dev/null
> > +++ b/include/uapi/linux/vduse.h
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_VDUSE_H_
> > +#define _UAPI_VDUSE_H_
> > +
> > +#include <linux/types.h>
> > +
> > +/* the control messages definition for read/write */
> > +
> > +#define VDUSE_CONFIG_DATA_LEN        256
> > +
> > +enum vduse_req_type {
> > +     VDUSE_SET_VQ_NUM,
> > +     VDUSE_SET_VQ_ADDR,
> > +     VDUSE_SET_VQ_READY,
> > +     VDUSE_GET_VQ_READY,
> > +     VDUSE_SET_VQ_STATE,
> > +     VDUSE_GET_VQ_STATE,
> > +     VDUSE_SET_FEATURES,
> > +     VDUSE_GET_FEATURES,
> > +     VDUSE_SET_STATUS,
> > +     VDUSE_GET_STATUS,
> > +     VDUSE_SET_CONFIG,
> > +     VDUSE_GET_CONFIG,
> > +     VDUSE_UPDATE_IOTLB,
> > +};
> > +
> > +struct vduse_vq_num {
> > +     __u32 index;
> > +     __u32 num;
> > +};
> > +
> > +struct vduse_vq_addr {
> > +     __u32 index;
> > +     __u64 desc_addr;
> > +     __u64 driver_addr;
> > +     __u64 device_addr;
> > +};
> > +
> > +struct vduse_vq_ready {
> > +     __u32 index;
> > +     __u8 ready;
> > +};
> > +
> > +struct vduse_vq_state {
> > +     __u32 index;
> > +     __u16 avail_idx;
> > +};
> > +
> > +struct vduse_dev_config_data {
> > +     __u32 offset;
> > +     __u32 len;
> > +     __u8 data[VDUSE_CONFIG_DATA_LEN];
>
>
> There is no guarantee that 256 is sufficient here.
>

If the size is larger than 256, we can try to split the original request.
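
Roughly what I have in mind for the splitting, as a userspace-testable sketch. The `example_chunk_fn` callback stands in for one request/response round trip, and the fake 600-byte device exists only to exercise the loop; both are assumptions for illustration, not code from the series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VDUSE_CONFIG_DATA_LEN	256	/* payload limit from the patch */

/* Hypothetical stand-in for one GET_CONFIG round trip; 0 means success. */
typedef int (*example_chunk_fn)(uint32_t offset, void *buf, uint32_t len,
				void *opaque);

/* Splits one large read into requests of at most VDUSE_CONFIG_DATA_LEN bytes. */
static int example_get_config(uint32_t offset, void *buf, uint32_t len,
			      example_chunk_fn fn, void *opaque)
{
	uint8_t *p = buf;

	while (len > 0) {
		uint32_t n = len < VDUSE_CONFIG_DATA_LEN ?
			     len : VDUSE_CONFIG_DATA_LEN;
		int ret = fn(offset, p, n, opaque);

		if (ret)
			return ret;
		offset += n;
		p += n;
		len -= n;
	}
	return 0;
}

/* Fake device used only to exercise the loop: 600 bytes of config space. */
struct example_dev {
	uint8_t space[600];
	unsigned int calls;
};

static int example_dev_read(uint32_t offset, void *buf, uint32_t len,
			    void *opaque)
{
	struct example_dev *d = opaque;

	assert(len <= VDUSE_CONFIG_DATA_LEN);	/* no chunk exceeds the payload */
	if (offset > sizeof(d->space) || len > sizeof(d->space) - offset)
		return -1;
	memcpy(buf, d->space + offset, len);
	d->calls++;
	return 0;
}

/* Reads the whole fake config space; returns the round-trip count or -1. */
static int example_roundtrips(void)
{
	struct example_dev d = { .calls = 0 };
	uint8_t out[sizeof(d.space)];
	size_t i;

	for (i = 0; i < sizeof(d.space); i++)
		d.space[i] = (uint8_t)i;
	if (example_get_config(0, out, sizeof(out), example_dev_read, &d))
		return -1;
	if (memcmp(out, d.space, sizeof(out)))
		return -1;
	return (int)d.calls;
}
```

One caveat with this approach: a split read is no longer a single snapshot of the config space, so the device could change it between two chunks.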

>
> > +};
> > +
> > +struct vduse_iova_range {
> > +     __u64 start;
> > +     __u64 last;
> > +};
> > +
> > +struct vduse_dev_request {
> > +     __u32 type; /* request type */
> > +     __u32 unique; /* request id */
> > +     __u32 flags; /* request flags */
>
>
> Seems unused in this series.
>

This is for future use.

>
> > +     __u32 size; /* the payload size */
>
>
> Unused.
>

Will remove it.

>
> > +     union {
> > +             struct vduse_vq_num vq_num; /* virtqueue num */
> > +             struct vduse_vq_addr vq_addr; /* virtqueue address */
> > +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> > +             struct vduse_vq_state vq_state; /* virtqueue state */
> > +             struct vduse_dev_config_data config; /* virtio device config space */
> > +             struct vduse_iova_range iova; /* iova range for updating */
> > +             __u64 features; /* virtio features */
> > +             __u8 status; /* device status */
>
>
> Let's add some padding for future extensions.
>

Is sizeof(vduse_dev_config_data) ok? Or char[1024]?
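
To illustrate the second option, here is a sketch of a request whose union is pinned to a fixed size by a padding member, so that later payload types do not change the message size seen by userspace. The struct name, the `reserved` header field, and the 1024-byte figure are placeholders for discussion, not the final layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_PAYLOAD_SIZE	1024	/* placeholder, as in char[1024] */

struct example_request {
	uint32_t type;
	uint32_t request_id;
	uint32_t flags;
	uint32_t reserved;	/* assumed field, keeps the header at 16 bytes */
	union {
		uint64_t features;
		uint8_t status;
		/* Pins the union size; smaller members can be added later. */
		uint8_t padding[EXAMPLE_PAYLOAD_SIZE];
	};
};

/* Returns the message size after checking where the payload starts. */
static size_t example_request_size(void)
{
	assert(offsetof(struct example_request, padding) == 16);
	return sizeof(struct example_request);
}
```

New payload members only keep the size stable as long as none of them is larger than the padding.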

>
> > +     };
> > +};
> > +
> > +struct vduse_dev_response {
> > +     __u32 unique; /* corresponding request id */
>
>
> Let's use request id.
>

Fine.

>
> > +     __s32 result; /* the result of request */
>
>
> Let's use macro or enum to define the success and failure value.
>

Will do it.

>
> > +     union {
> > +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> > +             struct vduse_vq_state vq_state; /* virtqueue state */
> > +             struct vduse_dev_config_data config; /* virtio device config space */
> > +             __u64 features; /* virtio features */
> > +             __u8 status; /* device status */
> > +     };
> > +};
> > +
> > +/* ioctls */
> > +
> > +struct vduse_dev_config {
> > +     __u32 id; /* vduse device id */
> > +     __u32 vendor_id; /* virtio vendor id */
> > +     __u32 device_id; /* virtio device id */
> > +     __u64 bounce_size; /* bounce buffer size for iommu */
> > +     __u16 vq_num; /* the number of virtqueues */
> > +     __u16 vq_size_max; /* the max size of virtqueue */
> > +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> > +};
> > +
> > +struct vduse_iotlb_entry {
> > +     __u64 offset; /* the mmap offset on fd */
> > +     __u64 start; /* start of the IOVA range */
> > +     __u64 last; /* last of the IOVA range */
> > +#define VDUSE_ACCESS_RO 0x1
> > +#define VDUSE_ACCESS_WO 0x2
> > +#define VDUSE_ACCESS_RW 0x3
> > +     __u8 perm; /* access permission of this range */
> > +};
> > +
> > +struct vduse_vq_eventfd {
> > +     __u32 index; /* virtqueue index */
> > +     __u32 fd; /* eventfd */
>
>
> Any reason for not using int here?
>

Will use __s32 instead.
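
A signed field also leaves room for negative values to carry meaning. The sketch below assumes, purely for illustration, that -1 could be used to de-assign an eventfd; that sentinel, the enum, and `example_classify_eventfd()` are hypothetical and not part of this series.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_EVENTFD_DEASSIGN	(-1)	/* hypothetical sentinel */

enum example_eventfd_action {
	EXAMPLE_EVENTFD_ASSIGN,
	EXAMPLE_EVENTFD_REMOVE,
	EXAMPLE_EVENTFD_REJECT,
};

/* Decides what to do with an (index, fd) pair for a device with vq_num queues. */
static enum example_eventfd_action
example_classify_eventfd(uint32_t index, int32_t fd, uint32_t vq_num)
{
	assert(vq_num > 0);	/* the caller is expected to have queues */

	if (index >= vq_num)
		return EXAMPLE_EVENTFD_REJECT;
	if (fd == EXAMPLE_EVENTFD_DEASSIGN)
		return EXAMPLE_EVENTFD_REMOVE;
	if (fd < 0)
		return EXAMPLE_EVENTFD_REJECT;
	return EXAMPLE_EVENTFD_ASSIGN;
}
```

With `__u32`, the same -1 would arrive as 4294967295 and could only be told apart from a valid descriptor by a cast.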

>
> > +};
> > +
> > +#define VDUSE_BASE   0x81
> > +
> > +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> > +#define VDUSE_DESTROY_DEV    _IO(VDUSE_BASE, 0x02)
> > +
> > +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> > +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
> > +#define VDUSE_VQ_SETUP_IRQFD _IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
>
>
> Better with documentation to explain those ioctls.
>

Will do it.

Thanks,
Yongji

* Re: Re: [RFC v3 10/11] vduse: grab the module's references until there is no vduse device
  2021-01-26  8:09     ` Jason Wang
@ 2021-01-27  8:51       ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-27  8:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Tue, Jan 26, 2021 at 4:10 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 1:07 PM, Xie Yongji wrote:
> > The module should not be unloaded if any vduse device exists.
> > So increase the module's reference count when creating vduse
> > device. And the reference count is kept until the device is
> > destroyed.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>
>
> Looks like a bug fix. If yes, let's squash this into patch 8.
>

Will do it.

Thanks,
Yongji

* Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-27  3:33       ` Jason Wang
@ 2021-01-27  8:57         ` Stefano Garzarella
  2021-01-28  3:11           ` Jason Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Garzarella @ 2021-01-27  8:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Xie Yongji, mst, stefanha, parav, bob.liu, hch, rdunlap, willy,
	viro, axboe, bcrl, corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 27, 2021 at 11:33:03AM +0800, Jason Wang wrote:
>
>On 2021/1/20 7:08 PM, Stefano Garzarella wrote:
>>On Wed, Jan 20, 2021 at 11:46:38AM +0800, Jason Wang wrote:
>>>
>>>On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>>>With VDUSE, we should be able to support all kinds of virtio devices.
>>>>
>>>>Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>---
>>>> drivers/vhost/vdpa.c | 29 +++--------------------------
>>>> 1 file changed, 3 insertions(+), 26 deletions(-)
>>>>
>>>>diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>>index 29ed4173f04e..448be7875b6d 100644
>>>>--- a/drivers/vhost/vdpa.c
>>>>+++ b/drivers/vhost/vdpa.c
>>>>@@ -22,6 +22,7 @@
>>>> #include <linux/nospec.h>
>>>> #include <linux/vhost.h>
>>>> #include <linux/virtio_net.h>
>>>>+#include <linux/virtio_blk.h>
>>>> #include "vhost.h"
>>>>@@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct 
>>>>vhost_vdpa *v, u8 __user *statusp)
>>>>     return 0;
>>>> }
>>>>-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
>>>>-                      struct vhost_vdpa_config *c)
>>>>-{
>>>>-    long size = 0;
>>>>-
>>>>-    switch (v->virtio_id) {
>>>>-    case VIRTIO_ID_NET:
>>>>-        size = sizeof(struct virtio_net_config);
>>>>-        break;
>>>>-    }
>>>>-
>>>>-    if (c->len == 0)
>>>>-        return -EINVAL;
>>>>-
>>>>-    if (c->len > size - c->off)
>>>>-        return -E2BIG;
>>>>-
>>>>-    return 0;
>>>>-}
>>>
>>>
>>>I think we should use a separate patch for this.
>>
>>For the vdpa-blk simulator I had the same issues and I'm adding a 
>>.get_config_size() callback to vdpa devices.
>>
>>Do you think that makes sense, or is it better to remove this check in
>>vhost/vdpa and delegate the bounds checks to the
>>get_config/set_config callbacks?
>
>
>A question here: how much value would we gain from get_config_size(),
>considering we can let the vDPA parent validate the length in its
>get_config()?
>

I agree. Most of the implementations already validate the length; the
only gain from get_config_size() would be a returned error, since
get_config() is void, but eventually we can add a return value to it.

Thanks,
Stefano


* Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-19  4:59 ` [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
  2021-01-20  3:46   ` Jason Wang
@ 2021-01-27  8:59   ` Stefano Garzarella
  2021-01-27  9:05     ` Yongji Xie
  1 sibling, 1 reply; 57+ messages in thread
From: Stefano Garzarella @ 2021-01-27  8:59 UTC (permalink / raw)
  To: Xie Yongji
  Cc: mst, jasowang, stefanha, parav, bob.liu, hch, rdunlap, willy,
	viro, axboe, bcrl, corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Tue, Jan 19, 2021 at 12:59:12PM +0800, Xie Yongji wrote:
>With VDUSE, we should be able to support all kinds of virtio devices.
>
>Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>---
> drivers/vhost/vdpa.c | 29 +++--------------------------
> 1 file changed, 3 insertions(+), 26 deletions(-)
>
>diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>index 29ed4173f04e..448be7875b6d 100644
>--- a/drivers/vhost/vdpa.c
>+++ b/drivers/vhost/vdpa.c
>@@ -22,6 +22,7 @@
> #include <linux/nospec.h>
> #include <linux/vhost.h>
> #include <linux/virtio_net.h>
>+#include <linux/virtio_blk.h>

Is this inclusion necessary?

Maybe we can remove virtio_net.h as well.

Thanks,
Stefano

>
> #include "vhost.h"
>
>@@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
> 	return 0;
> }
>
>-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
>-				      struct vhost_vdpa_config *c)
>-{
>-	long size = 0;
>-
>-	switch (v->virtio_id) {
>-	case VIRTIO_ID_NET:
>-		size = sizeof(struct virtio_net_config);
>-		break;
>-	}
>-
>-	if (c->len == 0)
>-		return -EINVAL;
>-
>-	if (c->len > size - c->off)
>-		return -E2BIG;
>-
>-	return 0;
>-}
>-
> static long vhost_vdpa_get_config(struct vhost_vdpa *v,
> 				  struct vhost_vdpa_config __user *c)
> {
>@@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v,
>
> 	if (copy_from_user(&config, c, size))
> 		return -EFAULT;
>-	if (vhost_vdpa_config_validate(v, &config))
>+	if (config.len == 0)
> 		return -EINVAL;
> 	buf = kvzalloc(config.len, GFP_KERNEL);
> 	if (!buf)
>@@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v,
>
> 	if (copy_from_user(&config, c, size))
> 		return -EFAULT;
>-	if (vhost_vdpa_config_validate(v, &config))
>+	if (config.len == 0)
> 		return -EINVAL;
> 	buf = kvzalloc(config.len, GFP_KERNEL);
> 	if (!buf)
>@@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
> 	int minor;
> 	int r;
>
>-	/* Currently, we only accept the network devices. */
>-	if (ops->get_device_id(vdpa) != VIRTIO_ID_NET)
>-		return -ENOTSUPP;
>-
> 	v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> 	if (!v)
> 		return -ENOMEM;
>-- 
>2.11.0
>


* Re: Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-26  8:19     ` Jason Wang
@ 2021-01-27  8:59       ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-27  8:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Tue, Jan 26, 2021 at 4:19 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 1:07 PM, Xie Yongji wrote:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > Both control path and data path of vDPA devices will be able to
> > be handled in userspace.
> >
> > In the control path, the VDUSE driver will make use of a message
> > mechanism to forward config operations from the vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive and reply
> > to those control messages.
> >
> > In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> > the file descriptors referring to vDPA device's iova regions. Then
> > userspace can use mmap() to access those iova regions. Besides,
> > the eventfd mechanism is used to trigger interrupt callbacks and
> > receive virtqueue kicks in userspace.
> >
> > Signed-off-by: Xie Yongji<xieyongji@bytedance.com>
> > ---
> >   Documentation/driver-api/vduse.rst                 |   85 ++
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >   drivers/vdpa/Kconfig                               |    7 +
> >   drivers/vdpa/Makefile                              |    1 +
> >   drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >   drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
> >   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
> >   drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
> >   drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
> >   drivers/vdpa/vdpa_user/vduse.h                     |   62 +
> >   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1217 ++++++++++++++++++++
> >   include/uapi/linux/vdpa.h                          |    1 +
> >   include/uapi/linux/vduse.h                         |  125 ++
> >   13 files changed, 2267 insertions(+)
> >   create mode 100644 Documentation/driver-api/vduse.rst
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
> >   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
>
>
> Btw, if you could split this into three parts:
>
> 1) iova domain
> 2) vduse device
> 3) doc
>
> It would be easier for the reviewers.
>

Makes sense to me. Will do it in v4.

Thanks,
Yongji

* Re: Re: [RFC v3 11/11] vduse: Introduce a workqueue for irq injection
  2021-01-26  8:17     ` Jason Wang
@ 2021-01-27  9:00       ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-27  9:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Tue, Jan 26, 2021 at 4:17 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/19 1:07 PM, Xie Yongji wrote:
> > This patch introduces a dedicated workqueue for irq injection
> > so that we are able to do some performance tuning for it.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>
>
> If we want to split it like this, it might be better to:
>
> 1) implement a simple irq injection in the ioctl context in patch 8
> 2) add the dedicated workqueue injection in this patch
>
> My reasoning is that:
>
> 1) the function looks more isolated for readers
> 2) the difference between ioctl vs workqueue should be more obvious
> than system wq vs dedicated wq
> 3) it gives a chance to describe why a workqueue is needed in the
> commit log of this patch
>

OK, I will try to do it in v4.

Thanks,
Yongji

* Re: Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-27  8:59   ` Stefano Garzarella
@ 2021-01-27  9:05     ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-27  9:05 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Wed, Jan 27, 2021 at 4:59 PM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Tue, Jan 19, 2021 at 12:59:12PM +0800, Xie Yongji wrote:
> >With VDUSE, we should be able to support all kinds of virtio devices.
> >
> >Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >---
> > drivers/vhost/vdpa.c | 29 +++--------------------------
> > 1 file changed, 3 insertions(+), 26 deletions(-)
> >
> >diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >index 29ed4173f04e..448be7875b6d 100644
> >--- a/drivers/vhost/vdpa.c
> >+++ b/drivers/vhost/vdpa.c
> >@@ -22,6 +22,7 @@
> > #include <linux/nospec.h>
> > #include <linux/vhost.h>
> > #include <linux/virtio_net.h>
> >+#include <linux/virtio_blk.h>
>
> Is this inclusion necessary?
>

My mistake...

> Maybe we can remove virtio_net.h as well.
>

Agree.

Thanks,
Yongji

* Re: Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-27  3:37       ` Jason Wang
@ 2021-01-27  9:11         ` Yongji Xie
  2021-01-28  3:04           ` Jason Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-27  9:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Wed, Jan 27, 2021 at 11:38 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/20 2:52 PM, Yongji Xie wrote:
> > On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> >>> Now we have a global percpu counter to limit the recursion depth
> >>> of eventfd_signal(). This can avoid deadlock or stack overflow.
> >>> But in stack overflow case, it should be OK to increase the
> >>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
> >>> to limit the recursion depth for deadlock case. Then it could be
> >>> fine to increase the global percpu counter later.
> >>
> >> I wonder whether or not it's worth to introduce percpu for each eventfd.
> >>
> >> How about simply check if eventfd_signal_count() is greater than 2?
> >>
> > It can't avoid deadlock in this way.
>
>
> I may be missing something, but the count is there to avoid recursive
> eventfd calls. So for VDUSE, what we suffer from is e.g. the interrupt
> injection path:
>
> userspace write IRQFD -> vq->cb() -> another IRQFD.
>
> It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
>

Actually, I mean the deadlock described in commit f0b493e ("io_uring:
prevent potential eventfd recursion on poll"). Just increasing
EVENTFD_WAKEUP_DEPTH could break that bug fix.

Thanks,
Yongji

* Re: Re: [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB
  2021-01-27  3:51       ` Jason Wang
@ 2021-01-27  9:27         ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-27  9:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Wed, Jan 27, 2021 at 11:51 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/20 3:52 PM, Yongji Xie wrote:
> > On Wed, Jan 20, 2021 at 2:24 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> >>> Add an opaque pointer for vhost IOTLB to store the
> >>> corresponding vma->vm_file and offset on the DMA mapping.
> >>
> >> Let's split the patch into two.
> >>
> >> 1) opaque pointer
> >> 2) vma stuffs
> >>
> > OK.
> >
> >>> It will be used in VDUSE case later.
> >>>
> >>> Suggested-by: Jason Wang <jasowang@redhat.com>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    drivers/vdpa/vdpa_sim/vdpa_sim.c | 11 ++++---
> >>>    drivers/vhost/iotlb.c            |  5 ++-
> >>>    drivers/vhost/vdpa.c             | 66 +++++++++++++++++++++++++++++++++++-----
> >>>    drivers/vhost/vhost.c            |  4 +--
> >>>    include/linux/vdpa.h             |  3 +-
> >>>    include/linux/vhost_iotlb.h      |  8 ++++-
> >>>    6 files changed, 79 insertions(+), 18 deletions(-)
> >>>
> >>> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> >>> index 03c796873a6b..1ffcef67954f 100644
> >>> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> >>> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> >>> @@ -279,7 +279,7 @@ static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
> >>>         */
> >>>        spin_lock(&vdpasim->iommu_lock);
> >>>        ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
> >>> -                                 pa, dir_to_perm(dir));
> >>> +                                 pa, dir_to_perm(dir), NULL);
> >>
> >> Maybe it's better to introduce
> >>
> >> vhost_iotlb_add_range_ctx(), which accepts the opaque (context), and
> >> let vhost_iotlb_add_range() just call that.
> >>
> > If so, we need to export both vhost_iotlb_add_range() and
> > vhost_iotlb_add_range_ctx(), which will be used in the VDUSE driver.
> > Isn't that a bit redundant?
>
>
> Probably not, we do something similar in virtio core:
>
> void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
>                  void **ctx)
> {
>      struct vring_virtqueue *vq = to_vvq(_vq);
>
>      return vq->packed_ring ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
>                   virtqueue_get_buf_ctx_split(_vq, len, ctx);
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
>
> void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
> {
>      return virtqueue_get_buf_ctx(_vq, len, NULL);
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_buf);
>

I see. Will do it in the next version.
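
A minimal userspace sketch of that wrapper pattern, for illustration
only: struct iotlb_map_sketch and both function names are simplified
stand-ins rather than the kernel definitions, the interval tree is left
out, and the error value is chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct vhost_iotlb_map; not the kernel layout. */
struct iotlb_map_sketch {
	uint64_t start;
	uint64_t last;
	uint64_t addr;
	unsigned int perm;
	void *opaque;	/* per-mapping context, may be NULL */
};

/* Context-aware variant: the only place that fills in a mapping. */
static int iotlb_add_range_ctx(struct iotlb_map_sketch *map,
			       uint64_t start, uint64_t last,
			       uint64_t addr, unsigned int perm,
			       void *opaque)
{
	assert(map != NULL);
	if (last < start)
		return -EINVAL;	/* error value chosen for this sketch */

	map->start = start;
	map->last = last;
	map->addr = addr;
	map->perm = perm;
	map->opaque = opaque;
	return 0;
}

/* Existing entry point keeps its signature and passes no context. */
static int iotlb_add_range(struct iotlb_map_sketch *map,
			   uint64_t start, uint64_t last,
			   uint64_t addr, unsigned int perm)
{
	return iotlb_add_range_ctx(map, start, last, addr, perm, NULL);
}
```

Existing callers such as vdpa_sim would then stay untouched, and only
users that need the context would call the _ctx variant.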

>
> >
> >>>        spin_unlock(&vdpasim->iommu_lock);
> >>>        if (ret)
> >>>                return DMA_MAPPING_ERROR;
> >>> @@ -317,7 +317,7 @@ static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
> >>>
> >>>                ret = vhost_iotlb_add_range(iommu, (u64)pa,
> >>>                                            (u64)pa + size - 1,
> >>> -                                         pa, VHOST_MAP_RW);
> >>> +                                         pa, VHOST_MAP_RW, NULL);
> >>>                if (ret) {
> >>>                        *dma_addr = DMA_MAPPING_ERROR;
> >>>                        kfree(addr);
> >>> @@ -625,7 +625,8 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
> >>>        for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> >>>             map = vhost_iotlb_itree_next(map, start, last)) {
> >>>                ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
> >>> -                                         map->last, map->addr, map->perm);
> >>> +                                         map->last, map->addr,
> >>> +                                         map->perm, NULL);
> >>>                if (ret)
> >>>                        goto err;
> >>>        }
> >>> @@ -639,14 +640,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
> >>>    }
> >>>
> >>>    static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
> >>> -                        u64 pa, u32 perm)
> >>> +                        u64 pa, u32 perm, void *opaque)
> >>>    {
> >>>        struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> >>>        int ret;
> >>>
> >>>        spin_lock(&vdpasim->iommu_lock);
> >>>        ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
> >>> -                                 perm);
> >>> +                                 perm, NULL);
> >>>        spin_unlock(&vdpasim->iommu_lock);
> >>>
> >>>        return ret;
> >>> diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
> >>> index 0fd3f87e913c..3bd5bd06cdbc 100644
> >>> --- a/drivers/vhost/iotlb.c
> >>> +++ b/drivers/vhost/iotlb.c
> >>> @@ -42,13 +42,15 @@ EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
> >>>     * @last: last of IOVA range
> >>>     * @addr: the address that is mapped to @start
> >>>     * @perm: access permission of this range
> >>> + * @opaque: the opaque pointer for the IOTLB mapping
> >>>     *
> >>>     * Returns an error last is smaller than start or memory allocation
> >>>     * fails
> >>>     */
> >>>    int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> >>>                          u64 start, u64 last,
> >>> -                       u64 addr, unsigned int perm)
> >>> +                       u64 addr, unsigned int perm,
> >>> +                       void *opaque)
> >>>    {
> >>>        struct vhost_iotlb_map *map;
> >>>
> >>> @@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> >>>        map->last = last;
> >>>        map->addr = addr;
> >>>        map->perm = perm;
> >>> +     map->opaque = opaque;
> >>>
> >>>        iotlb->nmaps++;
> >>>        vhost_iotlb_itree_insert(map, &iotlb->root);
> >>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>> index 36b6950ba37f..e83e5be7cec8 100644
> >>> --- a/drivers/vhost/vdpa.c
> >>> +++ b/drivers/vhost/vdpa.c
> >>> @@ -488,6 +488,7 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
> >>>        struct vhost_dev *dev = &v->vdev;
> >>>        struct vdpa_device *vdpa = v->vdpa;
> >>>        struct vhost_iotlb *iotlb = dev->iotlb;
> >>> +     struct vhost_iotlb_file *iotlb_file;
> >>>        struct vhost_iotlb_map *map;
> >>>        struct page *page;
> >>>        unsigned long pfn, pinned;
> >>> @@ -504,6 +505,10 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
> >>>                        }
> >>>                        atomic64_sub(map->size >> PAGE_SHIFT,
> >>>                                        &dev->mm->pinned_vm);
> >>> +             } else if (map->opaque) {
> >>> +                     iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> >>> +                     fput(iotlb_file->file);
> >>> +                     kfree(iotlb_file);
> >>>                }
> >>>                vhost_iotlb_map_free(iotlb, map);
> >>>        }
> >>> @@ -540,8 +545,8 @@ static int perm_to_iommu_flags(u32 perm)
> >>>        return flags | IOMMU_CACHE;
> >>>    }
> >>>
> >>> -static int vhost_vdpa_map(struct vhost_vdpa *v,
> >>> -                       u64 iova, u64 size, u64 pa, u32 perm)
> >>> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> >>> +                       u64 size, u64 pa, u32 perm, void *opaque)
> >>>    {
> >>>        struct vhost_dev *dev = &v->vdev;
> >>>        struct vdpa_device *vdpa = v->vdpa;
> >>> @@ -549,12 +554,12 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
> >>>        int r = 0;
> >>>
> >>>        r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> >>> -                               pa, perm);
> >>> +                               pa, perm, opaque);
> >>>        if (r)
> >>>                return r;
> >>>
> >>>        if (ops->dma_map) {
> >>> -             r = ops->dma_map(vdpa, iova, size, pa, perm);
> >>> +             r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
> >>>        } else if (ops->set_map) {
> >>>                if (!v->in_batch)
> >>>                        r = ops->set_map(vdpa, dev->iotlb);
> >>> @@ -591,6 +596,51 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> >>>        }
> >>>    }
> >>>
> >>> +static int vhost_vdpa_sva_map(struct vhost_vdpa *v,
> >>> +                           u64 iova, u64 size, u64 uaddr, u32 perm)
> >>> +{
> >>> +     u64 offset, map_size, map_iova = iova;
> >>> +     struct vhost_iotlb_file *iotlb_file;
> >>> +     struct vm_area_struct *vma;
> >>> +     int ret;
> >>
> >> Lacking mmap_read_lock().
> >>
> > Good catch! Will fix it.
> >
> >>> +
> >>> +     while (size) {
> >>> +             vma = find_vma(current->mm, uaddr);
> >>> +             if (!vma) {
> >>> +                     ret = -EINVAL;
> >>> +                     goto err;
> >>> +             }
> >>> +             map_size = min(size, vma->vm_end - uaddr);
> >>> +             offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> >>> +             iotlb_file = NULL;
> >>> +             if (vma->vm_file && (vma->vm_flags & VM_SHARED)) {
> >>
> >> I wonder if we need a stricter check here. When developing vhost-vdpa,
> >> I tried hard to make sure the map can only work for user pages.
> >>
> >> So the question is: do we need to exclude MMIO areas, or only allow
> >> shmem to work here?
> >>
> > Do you mean we need to check VM_MIXEDMAP | VM_PFNMAP here?
>
>
> I meant: do we need to allow VM_IO here? (We don't allow such a case in
> vhost-vdpa now.)
>

OK, let's exclude the vma with VM_IO | VM_PFNMAP.
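
Roughly, the check could then look like the sketch below. It is a
userspace illustration, not the patch: struct vma_sketch and
vma_is_shareable() are hypothetical names, and the flag values are
copied here only so the example builds outside the kernel (in-kernel
code would take them from <linux/mm.h>).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied for illustration only. */
#define VM_SHARED	0x00000008UL
#define VM_PFNMAP	0x00000400UL
#define VM_IO		0x00004000UL

/* Minimal stand-in for the two vm_area_struct fields the check reads. */
struct vma_sketch {
	unsigned long vm_flags;
	const void *vm_file;	/* non-NULL when the VMA is file-backed */
};

/* True when the VMA is a shared, file-backed mapping of ordinary pages. */
static bool vma_is_shareable(const struct vma_sketch *vma)
{
	assert(vma != NULL);
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return false;	/* MMIO or raw PFN mapping: reject */
	return vma->vm_file != NULL && (vma->vm_flags & VM_SHARED);
}
```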

>
> >
> > It makes sense to me.
> >
> >>
> >>> +                     iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_KERNEL);
> >>> +                     if (!iotlb_file) {
> >>> +                             ret = -ENOMEM;
> >>> +                             goto err;
> >>> +                     }
> >>> +                     iotlb_file->file = get_file(vma->vm_file);
> >>> +                     iotlb_file->offset = offset;
> >>> +             }
> >>
> >> I wonder if it's better to allocate iotlb_file and make iotlb_file->file
> >> = NULL && iotlb_file->offset = 0. This can force a consistent code for
> >> the vDPA parents.
> >>
> > Looks fine to me.
> >
> >> Or we can simply fail the map without a file as backend.
> >>
> > Actually there will be some vma without vm_file during vm booting.
>
>
> Yes, e.g. BIOS or other ROMs. Vhost-user has a similar issue, and they
> filter them out in QEMU.
>
> For vhost-vDPA, considering it can support various different backends,
> we can't do that.
>

OK, I will pass an iotlb_file with a NULL file and let the backend do
the filtering.
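
As a rough userspace sketch of that idea (all names here are
hypothetical stand-ins, with malloc() in place of kmalloc() and a plain
pointer in place of struct file *): the map side always hands over an
iotlb_file, and the backend skips the ranges it cannot share.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for struct vhost_iotlb_file; 'file' is an opaque handle here. */
struct iotlb_file_sketch {
	void *file;		/* NULL when the range has no file backing */
	uint64_t offset;	/* mmap offset, 0 when file is NULL */
};

/* Map side: always allocate an iotlb_file, even without a backing file. */
static struct iotlb_file_sketch *iotlb_file_alloc(void *file, uint64_t offset)
{
	struct iotlb_file_sketch *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->file = file;
	f->offset = file ? offset : 0;
	return f;
}

/* Backend side: skip ranges it cannot mmap() instead of failing the map. */
static bool backend_can_share(const struct iotlb_file_sketch *f)
{
	assert(f != NULL);
	return f->file != NULL;
}
```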

Thanks,
Yongji

* Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-27  9:11         ` Yongji Xie
@ 2021-01-28  3:04           ` Jason Wang
  2021-01-28  3:08             ` Jens Axboe
  2021-01-28  3:52             ` Yongji Xie
  0 siblings, 2 replies; 57+ messages in thread
From: Jason Wang @ 2021-01-28  3:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/27 5:11 PM, Yongji Xie wrote:
> On Wed, Jan 27, 2021 at 11:38 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/1/20 2:52 PM, Yongji Xie wrote:
>>> On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>>>> Now we have a global percpu counter to limit the recursion depth
>>>>> of eventfd_signal(). This can avoid deadlock or stack overflow.
>>>>> But in stack overflow case, it should be OK to increase the
>>>>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
>>>>> to limit the recursion depth for deadlock case. Then it could be
>>>>> fine to increase the global percpu counter later.
>>>> I wonder whether or not it's worth to introduce percpu for each eventfd.
>>>>
>>>> How about simply check if eventfd_signal_count() is greater than 2?
>>>>
>>> It can't avoid deadlock in this way.
>>
>> I may be missing something, but the count is there to avoid recursive
>> eventfd calls. So for VDUSE, what we suffer from is e.g. the interrupt
>> injection path:
>>
>> userspace write IRQFD -> vq->cb() -> another IRQFD.
>>
>> It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
>>
> Actually, I mean the deadlock described in commit f0b493e ("io_uring:
> prevent potential eventfd recursion on poll"). Just increasing
> EVENTFD_WAKEUP_DEPTH could break that bug fix.


OK, so can we do something similar to that commit (using async stuff
like a wq)?

Thanks


>
> Thanks,
> Yongji
>


* Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-28  3:04           ` Jason Wang
@ 2021-01-28  3:08             ` Jens Axboe
  2021-01-28  5:12               ` Yongji Xie
  2021-01-28  3:52             ` Yongji Xie
  1 sibling, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2021-01-28  3:08 UTC (permalink / raw)
  To: Jason Wang, Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, bcrl, Jonathan Corbet, virtualization,
	netdev, kvm, linux-aio, linux-fsdevel

On 1/27/21 8:04 PM, Jason Wang wrote:
> 
> On 2021/1/27 5:11 PM, Yongji Xie wrote:
>> On Wed, Jan 27, 2021 at 11:38 AM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On 2021/1/20 2:52 PM, Yongji Xie wrote:
>>>> On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>>>>> Now we have a global percpu counter to limit the recursion depth
>>>>>> of eventfd_signal(). This can avoid deadlock or stack overflow.
>>>>>> But in stack overflow case, it should be OK to increase the
>>>>>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
>>>>>> to limit the recursion depth for deadlock case. Then it could be
>>>>>> fine to increase the global percpu counter later.
>>>>> I wonder whether or not it's worth to introduce percpu for each eventfd.
>>>>>
>>>>> How about simply check if eventfd_signal_count() is greater than 2?
>>>>>
>>>> It can't avoid deadlock in this way.
>>>
>>> I may be missing something, but the count is there to avoid recursive
>>> eventfd calls. So for VDUSE, what we suffer from is e.g. the interrupt
>>> injection path:
>>>
>>> userspace write IRQFD -> vq->cb() -> another IRQFD.
>>>
>>> It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
>>>
>> Actually, I mean the deadlock described in commit f0b493e ("io_uring:
>> prevent potential eventfd recursion on poll"). Just increasing
>> EVENTFD_WAKEUP_DEPTH could break that bug fix.
>
>
> OK, so can we do something similar to that commit (using async stuff
> like a wq)?

io_uring should be fine in current kernels, but aio would still be
affected by this. But just in terms of recursion, bumping it one more
should probably still be fine.

-- 
Jens Axboe


* Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-27  8:57         ` Stefano Garzarella
@ 2021-01-28  3:11           ` Jason Wang
       [not found]             ` <20210129150359.caitcskrfhqed73z@steredhat>
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-28  3:11 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Xie Yongji, mst, stefanha, parav, bob.liu, hch, rdunlap, willy,
	viro, axboe, bcrl, corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel


On 2021/1/27 4:57 PM, Stefano Garzarella wrote:
> On Wed, Jan 27, 2021 at 11:33:03AM +0800, Jason Wang wrote:
>>
>> On 2021/1/20 7:08 PM, Stefano Garzarella wrote:
>>> On Wed, Jan 20, 2021 at 11:46:38AM +0800, Jason Wang wrote:
>>>>
>>>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>>>> With VDUSE, we should be able to support all kinds of virtio devices.
>>>>>
>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>> ---
>>>>>  drivers/vhost/vdpa.c | 29 +++--------------------------
>>>>>  1 file changed, 3 insertions(+), 26 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>>>> index 29ed4173f04e..448be7875b6d 100644
>>>>> --- a/drivers/vhost/vdpa.c
>>>>> +++ b/drivers/vhost/vdpa.c
>>>>> @@ -22,6 +22,7 @@
>>>>>  #include <linux/nospec.h>
>>>>>  #include <linux/vhost.h>
>>>>>  #include <linux/virtio_net.h>
>>>>> +#include <linux/virtio_blk.h>
>>>>>  #include "vhost.h"
>>>>> @@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct 
>>>>> vhost_vdpa *v, u8 __user *statusp)
>>>>>      return 0;
>>>>>  }
>>>>> -static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
>>>>> -                      struct vhost_vdpa_config *c)
>>>>> -{
>>>>> -    long size = 0;
>>>>> -
>>>>> -    switch (v->virtio_id) {
>>>>> -    case VIRTIO_ID_NET:
>>>>> -        size = sizeof(struct virtio_net_config);
>>>>> -        break;
>>>>> -    }
>>>>> -
>>>>> -    if (c->len == 0)
>>>>> -        return -EINVAL;
>>>>> -
>>>>> -    if (c->len > size - c->off)
>>>>> -        return -E2BIG;
>>>>> -
>>>>> -    return 0;
>>>>> -}
>>>>
>>>>
>>>> I think we should use a separate patch for this.
>>>
>>> For the vdpa-blk simulator I had the same issues and I'm adding a 
>>> .get_config_size() callback to vdpa devices.
>>>
>>> Do you think make sense or is better to remove this check in 
>>> vhost/vdpa, delegating the boundaries checks to 
>>> get_config/set_config callbacks.
>>
>>
>> A question here. How much value could we gain from get_config_size() 
>> consider we can let vDPA parent to validate the length in its 
>> get_config().
>>
>
> I agree, most of the implementations already validate the length, the 
> only gain is an error returned since get_config() is void, but 
> eventually we can add a return value to it.


Right, one problem here is that, for the virtio path, get_config() returns
void, so we cannot propagate errors to virtio drivers. But that might not be
a big issue since we trust the kernel virtio driver.

So I think it makes sense to change the return value in the vdpa config ops.
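
A minimal sketch of what an int-returning get_config() could look like
when the parent validates the range itself. This is a userspace
illustration only: struct demo_parent and demo_get_config() are
hypothetical names, not part of the vdpa config ops.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a vDPA parent device; not a kernel structure. */
struct demo_parent {
	uint8_t config[16];	/* device-specific configuration space */
};

/*
 * Hypothetical int-returning variant of a get_config() callback.
 * The parent checks offset and length and reports an error to the
 * caller instead of silently ignoring an out-of-range request.
 */
static int demo_get_config(const struct demo_parent *p, unsigned int offset,
			   void *buf, unsigned int len)
{
	assert(p && buf);

	if (len == 0)
		return -EINVAL;
	/* Compare against the remaining space so offset + len cannot overflow. */
	if (offset > sizeof(p->config) || len > sizeof(p->config) - offset)
		return -E2BIG;

	memcpy(buf, p->config + offset, len);
	return 0;
}
```

With this shape, vhost-vdpa could pass the error back to userspace, while
the virtio-vdpa path could still ignore it.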

Thanks


>
> Thanks,
> Stefano
>



* Re: Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-28  3:04           ` Jason Wang
  2021-01-28  3:08             ` Jens Axboe
@ 2021-01-28  3:52             ` Yongji Xie
  2021-01-28  4:31               ` Jason Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-28  3:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Jan 28, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/27 5:11 PM, Yongji Xie wrote:
> > On Wed, Jan 27, 2021 at 11:38 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/1/20 2:52 PM, Yongji Xie wrote:
> >>> On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> >>>>> Now we have a global percpu counter to limit the recursion depth
> >>>>> of eventfd_signal(). This can avoid deadlock or stack overflow.
> >>>>> But in stack overflow case, it should be OK to increase the
> >>>>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
> >>>>> to limit the recursion depth for deadlock case. Then it could be
> >>>>> fine to increase the global percpu counter later.
> >>>> I wonder whether or not it's worth to introduce percpu for each eventfd.
> >>>>
> >>>> How about simply check if eventfd_signal_count() is greater than 2?
> >>>>
> >>> It can't avoid deadlock in this way.
> >>
> >> I may miss something but the count is to avoid recursive eventfd call.
> >> So for VDUSE what we suffers is e.g the interrupt injection path:
> >>
> >> userspace write IRQFD -> vq->cb() -> another IRQFD.
> >>
> >> It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
> >>
> > Actually I mean the deadlock described in commit f0b493e ("io_uring:
> > prevent potential eventfd recursion on poll"). It can break this bug
> > fix if we just increase EVENTFD_WAKEUP_DEPTH.
>
>
> Ok, so can we do something similar in that commit? (using async stuffs
> like wq).
>

We can do that, but it will reduce performance, because the eventfd
recursion will be triggered every time KVM kicks the eventfd in the
vhost-vdpa case:

KVM write KICKFD -> ops->kick_vq -> VDUSE write KICKFD
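
The chain above can be modelled in userspace roughly as follows. This is
a simplified sketch, not the kernel implementation: struct demo_eventfd,
demo_signal() and the thread-local counter (standing in for the per-CPU
one) are hypothetical names, and the depth limit is passed as a
parameter purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_eventfd {
	unsigned long count;		/* pending event counter */
	struct demo_eventfd *chained;	/* signalled from the wakeup path, or NULL */
};

/* The kernel tracks nesting per CPU; a thread-local counter plays that role here. */
static _Thread_local int demo_signal_depth;

/*
 * Signal an eventfd whose wakeup path may signal another eventfd.
 * Returns false if the nesting limit stopped part of the chain.
 */
static bool demo_signal(struct demo_eventfd *efd, int max_depth)
{
	bool ok = true;

	if (demo_signal_depth >= max_depth)
		return false;	/* refuse to nest any deeper */

	demo_signal_depth++;
	efd->count++;
	if (efd->chained)	/* e.g. kick_vq forwarding the kick */
		ok = demo_signal(efd->chained, max_depth);
	demo_signal_depth--;

	return ok;
}
```

With a limit of 1 the forwarded kick is dropped; with a limit of 2 the
two-level chain completes, which is the case a deferred (workqueue)
kick would otherwise have to cover.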

Thanks,
Yongji


* Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-27  8:50       ` Yongji Xie
@ 2021-01-28  4:27         ` Jason Wang
  2021-01-28  6:03           ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-28  4:27 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, sgarzare, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel


On 2021/1/27 4:50 PM, Yongji Xie wrote:
>   On Tue, Jan 26, 2021 at 4:09 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/1/19 1:07 PM, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of message
>>> mechanism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
>>> the file descriptors referring to vDPA device's iova regions. Then
>>> userspace can use mmap() to access those iova regions. Besides,
>>> the eventfd mechanism is used to trigger interrupt callbacks and
>>> receive virtqueue kicks in userspace.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/driver-api/vduse.rst                 |   85 ++
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |    7 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>>>    drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>>>    drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
>>>    drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
>>>    drivers/vdpa/vdpa_user/vduse.h                     |   62 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1217 ++++++++++++++++++++
>>>    include/uapi/linux/vdpa.h                          |    1 +
>>>    include/uapi/linux/vduse.h                         |  125 ++
>>>    13 files changed, 2267 insertions(+)
>>>    create mode 100644 Documentation/driver-api/vduse.rst
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>> new file mode 100644
>>> index 000000000000..9418a7f6646b
>>> --- /dev/null
>>> +++ b/Documentation/driver-api/vduse.rst
>>> @@ -0,0 +1,85 @@
>>> +==================================
>>> +VDUSE - "vDPA Device in Userspace"
>>> +==================================
>>> +
>>> +vDPA (virtio data path acceleration) device is a device that uses a
>>> +datapath which complies with the virtio specifications with vendor
>>> +specific control path. vDPA devices can be both physically located on
>>> +the hardware or emulated by software. VDUSE is a framework that makes it
>>> +possible to implement software-emulated vDPA devices in userspace.
>>> +
>>> +How VDUSE works
>>> +---------------
>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
>>> +to the new resources will be returned, which can be used to implement the
>>> +userspace vDPA device's control path and data path.
>>> +
>>> +To implement control path, the read/write operations to the file descriptor
>>> +will be used to receive/reply the control messages from/to VDUSE driver.
>>
>> It's better to document the protocol here. E.g the identifier stuffs.
>>
> I have documented those stuffs in include/uapi/linux/vduse.h, is it
> OK? Or add something like "Please see include/uapi/linux/vduse.h for
> details."


It might be better if we add some userspace sample code to demonstrate 
how the protocol works.
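
As a starting point for such sample code, here is a minimal sketch of a
read()/write() request/reply handler. Everything in it is an assumption
made for illustration: struct demo_msg, the DEMO_* message types and
demo_handle_one() are hypothetical, and the real message layout is the
one defined in include/uapi/linux/vduse.h of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>	/* socketpair(), handy as a stand-in for the device fd */
#include <unistd.h>

/* Hypothetical message types; the real ones are the VDUSE_* messages. */
enum demo_msg_type { DEMO_GET_FEATURES = 1, DEMO_SET_STATUS = 2 };

/* Hypothetical message layout, used for both request and reply. */
struct demo_msg {
	uint32_t type;		/* one of enum demo_msg_type */
	uint32_t request_id;	/* echoed back so the reply can be matched */
	int32_t  result;	/* 0 on success, negative on failure (reply only) */
	uint32_t pad;
	uint64_t payload;	/* features or status, depending on type */
};

struct demo_device {
	uint64_t features;
	uint8_t status;
};

/* Receive one request from fd, handle it, and write the reply back. */
static int demo_handle_one(int fd, struct demo_device *dev)
{
	struct demo_msg msg;

	if (read(fd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		return -1;

	msg.result = 0;
	switch (msg.type) {
	case DEMO_GET_FEATURES:
		msg.payload = dev->features;
		break;
	case DEMO_SET_STATUS:
		dev->status = (uint8_t)msg.payload;
		break;
	default:
		msg.result = -1;	/* unknown request */
		break;
	}

	/* request_id is left untouched so the reply matches the request. */
	if (write(fd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		return -1;
	return 0;
}
```

To try it without the driver, one end of a socketpair(AF_UNIX,
SOCK_STREAM, 0, sv) can play the kernel side: write a request to sv[0],
call demo_handle_one(sv[1], &dev), then read the reply from sv[0].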


>
>>> +Those control messages are mostly based on the vdpa_config_ops which defines
>>> +a unified interface to control different types of vDPA device.
>>> +
>>> +The following types of messages are provided by the VDUSE framework now:
>>> +
>>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
>>
>> "Set the vring address of a virtqueue" might be better here.
>>
> OK.
>
>>> +
>>> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
>>> +
>>> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
>>> +
>>> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
>>> +
>>> +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
>>> +
>>> +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue
>>
>> It's better not to mention layout specific stuffs here (last_avail_idx).
>> Consider we should support packed virtqueue in the future.
>>
> I see.
>
>>> +
>>> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
>>> +
>>> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
>>> +
>>> +- VDUSE_SET_STATUS: Set the device status
>>> +
>>> +- VDUSE_GET_STATUS: Get the device status
>>> +
>>> +- VDUSE_SET_CONFIG: Write to device specific configuration space
>>> +
>>> +- VDUSE_GET_CONFIG: Read from device specific configuration space
>>> +
>>> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
>>> +
>>> +Please see include/linux/vdpa.h for details.
>>> +
>>> +In the data path, vDPA device's iova regions will be mapped into userspace with
>>> +the help of VDUSE_IOTLB_GET_FD ioctl on the userspace vDPA device fd:
>>> +
>>> +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
>>> +  access this iova region by passing the fd to mmap(2).
>>> +
>>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
>>> +vDPA device fd are provided to support that:
>>> +
>>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
>>> +  by VDUSE driver to notify userspace to consume the vring.
>>> +
>>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
>>> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
>>> +
>>> +MMU-based IOMMU Driver
>>> +----------------------
>>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
>>> +driver to support mapping the kernel dma buffer into the userspace iova
>>> +region dynamically.
>>> +
>>> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
>>> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
>>> +so that the userspace process is able to use its virtual address to access
>>> +the dma buffer in kernel.
>>> +
>>> +And to avoid security issue, a bounce-buffering mechanism is introduced to
>>> +prevent userspace accessing the original buffer directly which may contain other
>>> +kernel data. During the mapping, unmapping, the driver will copy the data from
>>> +the original buffer to the bounce buffer and back, depending on the direction of
>>> +the transfer. And the bounce-buffer addresses will be mapped into the user address
>>> +space instead of the original one.
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index 4be7be39be26..667354309bf4 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -21,6 +21,13 @@ config VDPA_SIM
>>>          to RX. This device is used for testing, prototyping and
>>>          development of vDPA.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>
>> Need select VHOST_IOTLB.
>>
> OK.
>
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index d160e9b63a66..66e97778ad03 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,5 +1,6 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..b7645e36992b
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
>>> new file mode 100644
>>> index 000000000000..dbffddb08908
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.c
>>> @@ -0,0 +1,221 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "eventfd.h"
>>> +
>>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
>>> +
>>> +static void vduse_virqfd_shutdown(struct work_struct *work)
>>> +{
>>> +     u64 cnt;
>>> +     struct vduse_virqfd *virqfd = container_of(work,
>>> +                                     struct vduse_virqfd, shutdown);
>>> +
>>> +     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
>>> +     flush_work(&virqfd->inject);
>>> +     eventfd_ctx_put(virqfd->ctx);
>>> +     kfree(virqfd);
>>> +}
>>> +
>>> +static void vduse_virqfd_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virqfd *virqfd = container_of(work,
>>> +                                     struct vduse_virqfd, inject);
>>> +     struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb)
>>> +             vq->cb(vq->private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
>>> +{
>>> +     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
>>> +}
>>> +
>>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
>>> +                             int sync, void *key)
>>> +{
>>> +     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
>>> +     struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> +     __poll_t flags = key_to_poll(key);
>>> +
>>> +     if (flags & EPOLLIN)
>>> +             schedule_work(&virqfd->inject);
>>> +
>>> +     if (flags & EPOLLHUP) {
>>> +             spin_lock(&vq->irq_lock);
>>> +             if (vq->virqfd == virqfd) {
>>> +                     vq->virqfd = NULL;
>>> +                     virqfd_deactivate(virqfd);
>>> +             }
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
>>> +                     wait_queue_head_t *wqh, poll_table *pt)
>>> +{
>>> +     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
>>> +
>>> +     add_wait_queue(wqh, &virqfd->wait);
>>> +}
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct vduse_virqfd *virqfd;
>>> +     struct fd irqfd;
>>> +     struct eventfd_ctx *ctx;
>>> +     struct vduse_virtqueue *vq;
>>> +     __poll_t events;
>>> +     int ret;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
>>> +     if (!virqfd)
>>> +             return -ENOMEM;
>>> +
>>> +     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
>>> +     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
>>> +
>>> +     ret = -EBADF;
>>> +     irqfd = fdget(eventfd->fd);
>>> +     if (!irqfd.file)
>>> +             goto err_fd;
>>> +
>>> +     ctx = eventfd_ctx_fileget(irqfd.file);
>>> +     if (IS_ERR(ctx)) {
>>> +             ret = PTR_ERR(ctx);
>>> +             goto err_ctx;
>>> +     }
>>> +
>>> +     virqfd->vq = vq;
>>> +     virqfd->ctx = ctx;
>>> +     spin_lock(&vq->irq_lock);
>>> +     if (vq->virqfd)
>>> +             virqfd_deactivate(virqfd);
>>> +     vq->virqfd = virqfd;
>>> +     spin_unlock(&vq->irq_lock);
>>> +
>>> +     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
>>> +     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
>>> +
>>> +     events = vfs_poll(irqfd.file, &virqfd->pt);
>>> +
>>> +     /*
>>> +      * Check if there was an event already pending on the eventfd
>>> +      * before we registered and trigger it as if we didn't miss it.
>>> +      */
>>> +     if (events & EPOLLIN)
>>> +             schedule_work(&virqfd->inject);
>>> +
>>> +     fdput(irqfd);
>>> +
>>> +     return 0;
>>> +err_ctx:
>>> +     fdput(irqfd);
>>> +err_fd:
>>> +     kfree(virqfd);
>>> +     return ret;
>>> +}
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             if (vq->virqfd) {
>>> +                     virqfd_deactivate(vq->virqfd);
>>> +                     vq->virqfd = NULL;
>>> +             }
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +     flush_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +int vduse_virqfd_init(void)
>>> +{
>>> +     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
>>> +                                             WQ_UNBOUND, 0);
>>> +     if (!vduse_irqfd_cleanup_wq)
>>> +             return -ENOMEM;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +void vduse_virqfd_exit(void)
>>> +{
>>> +     destroy_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq)
>>> +{
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->ready && vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx;
>>> +     struct vduse_virtqueue *vq;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +     if (IS_ERR(ctx))
>>> +             return PTR_ERR(ctx);
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             if (vq->kickfd) {
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +                     vq->kickfd = NULL;
>>> +             }
>>> +             spin_unlock(&vq->kick_lock);
>>> +     }
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
>>> new file mode 100644
>>> index 000000000000..14269ff27f47
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.h
>>> @@ -0,0 +1,48 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_EVENTFD_H
>>> +#define _VDUSE_EVENTFD_H
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "vduse.h"
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_virqfd {
>>> +     struct eventfd_ctx *ctx;
>>> +     struct vduse_virtqueue *vq;
>>> +     struct work_struct inject;
>>> +     struct work_struct shutdown;
>>> +     wait_queue_entry_t wait;
>>> +     poll_table pt;
>>> +};
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev);
>>> +
>>> +int vduse_virqfd_init(void);
>>> +
>>> +void vduse_virqfd_exit(void);
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq);
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev);
>>> +
>>> +#endif /* _VDUSE_EVENTFD_H */
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>> new file mode 100644
>>> index 000000000000..cdfef8e9f9d6
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>> @@ -0,0 +1,426 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <linux/anon_inodes.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define IOVA_START_PFN 1
>>> +#define IOVA_ALLOC_ORDER 12
>>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>>
>> Can this work for all archs (e.g why not use PAGE_SIZE)?
>>
> It can work for all archs. Use IOVA_ALLOC_SIZE might save some space
> in some cases/archs (e.g. PAGE_SIZE = 64K) when we have lots of
> small-size I/Os.


OK, so if I understand correctly, you want to share pages among those 
small I/Os.
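
To make the sharing concrete, here is a small arithmetic sketch,
assuming a 64 KiB PAGE_SIZE and the 4 KiB IOVA granule from the patch.
The DEMO_* constants and both helpers are hypothetical names; only the
"iova >> PAGE_SHIFT" indexing mirrors the patch.

```c
#include <assert.h>

#define DEMO_IOVA_ALLOC_ORDER 12	/* 4 KiB IOVA granule, as in the patch */
#define DEMO_PAGE_SHIFT 16		/* assumed 64 KiB PAGE_SIZE */

/* Index of the bounce page backing an IOVA (same shift as the patch helper). */
static unsigned long demo_bounce_index(unsigned long iova)
{
	return iova >> DEMO_PAGE_SHIFT;
}

/*
 * Number of bounce pages touched by `count` back-to-back granule-sized
 * I/Os starting at `iova`.
 */
static unsigned long demo_pages_used(unsigned long iova, unsigned long count)
{
	unsigned long last;

	assert(count > 0);
	last = iova + (count << DEMO_IOVA_ALLOC_ORDER) - 1;
	return demo_bounce_index(last) - demo_bounce_index(iova) + 1;
}
```

Under these assumptions sixteen 4 KiB I/Os starting on a page boundary
share a single bounce page, whereas a PAGE_SIZE granule would give each
of them its own 64 KiB page.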


>
>>> +
>>> +#define CONSISTENT_DMA_SIZE (1024 * 1024 * 1024)
>>> +
>>> +static inline struct page *
>>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova)
>>> +{
>>> +     unsigned long index = iova >> PAGE_SHIFT;
>>> +
>>> +     return domain->bounce_pages[index];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova, struct page *page)
>>> +{
>>> +     unsigned long index = iova >> PAGE_SHIFT;
>>> +
>>> +     domain->bounce_pages[index] = page;
>>> +}
>>> +
>>> +static struct vduse_iova_map *
>>> +vduse_domain_alloc_iova_map(struct vduse_iova_domain *domain,
>>> +                     unsigned long iova, unsigned long orig,
>>> +                     size_t size, enum dma_data_direction dir)
>>> +{
>>> +     struct vduse_iova_map *map;
>>> +
>>> +     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
>>> +     if (!map)
>>> +             return NULL;
>>> +
>>> +     map->iova.start = iova;
>>> +     map->iova.last = iova + size - 1;
>>> +     map->orig = orig;
>>> +     map->size = size;
>>> +     map->dir = dir;
>>> +
>>> +     return map;
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova)
>>> +{
>>> +     unsigned long start = iova & PAGE_MASK;
>>> +     unsigned long last = start + PAGE_SIZE - 1;
>>> +     struct vduse_iova_map *map;
>>> +     struct interval_tree_node *node;
>>> +     struct page *page = NULL;
>>> +
>>> +     spin_lock(&domain->map_lock);
>>> +     node = interval_tree_iter_first(&domain->mappings, start, last);
>>> +     if (!node)
>>> +             goto out;
>>> +
>>> +     map = container_of(node, struct vduse_iova_map, iova);
>>> +     page = virt_to_page(map->orig + iova - map->iova.start);
>>> +     get_page(page);
>>> +out:
>>> +     spin_unlock(&domain->map_lock);
>>> +
>>> +     return page;
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova)
>>> +{
>>> +     unsigned long start = iova & PAGE_MASK;
>>> +     unsigned long last = start + PAGE_SIZE - 1;
>>> +     struct vduse_iova_map *map;
>>> +     struct interval_tree_node *node;
>>> +     struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
>>> +
>>> +     if (!new_page)
>>> +             return NULL;
>>> +
>>> +     spin_lock(&domain->map_lock);
>>> +     node = interval_tree_iter_first(&domain->mappings, start, last);
>>> +     if (!node) {
>>> +             __free_page(new_page);
>>> +             goto out;
>>> +     }
>>> +     page = vduse_domain_get_bounce_page(domain, iova);
>>> +     if (page) {
>>> +             get_page(page);
>>> +             __free_page(new_page);
>>
>> Let's delay the allocation of new_page until it is really required.
> If so, we need to allocate the page in atomic context.


Right, I see.
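
For reference, the pattern being kept here (allocate first, take the
lock, free the spare if it turns out not to be needed) can be sketched
in userspace like this. All names are hypothetical; a C11 atomic_flag
stands in for the kernel spinlock and malloc() for alloc_page().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define DEMO_SLOTS 4

struct demo_domain {
	atomic_flag lock;		/* stand-in for the kernel spinlock */
	void *bounce[DEMO_SLOTS];	/* stand-in for domain->bounce_pages */
};

static void demo_lock(struct demo_domain *d)
{
	while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
		;	/* spin */
}

static void demo_unlock(struct demo_domain *d)
{
	atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Return the bounce buffer for a slot, installing a fresh one if needed. */
static void *demo_get_bounce(struct demo_domain *d, unsigned int slot)
{
	void *fresh, *page;

	assert(slot < DEMO_SLOTS);

	/* Allocate before locking: allocation may block, a lock holder must not. */
	fresh = malloc(4096);
	if (!fresh)
		return NULL;

	demo_lock(d);
	page = d->bounce[slot];
	if (!page) {
		d->bounce[slot] = fresh;	/* first user installs the new buffer */
		page = fresh;
		fresh = NULL;
	}
	demo_unlock(d);

	free(fresh);	/* slot was already populated; free(NULL) is a no-op */
	return page;
}
```

The cost is one wasted allocation whenever the slot is already
populated, which is the trade-off against allocating in atomic context.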


>>> +             goto out;
>>> +     }
>>> +     vduse_domain_set_bounce_page(domain, iova, new_page);
>>> +     get_page(new_page);
>>> +     page = new_page;
>>> +
>>> +     while (node) {
>>
>> I may miss something but which case should we do the loop here?
>>
> When IOVA_ALLOC_SIZE != PAGE_SIZE
>
>>> +             unsigned int src_offset = 0, dst_offset = 0;
>>> +             void *src, *dst;
>>> +             size_t copy_len;
>>> +
>>> +             map = container_of(node, struct vduse_iova_map, iova);
>>> +             node = interval_tree_iter_next(node, start, last);
>>> +             if (map->dir == DMA_FROM_DEVICE)
>>> +                     continue;
>>> +
>>> +             if (start > map->iova.start)
>>> +                     src_offset = start - map->iova.start;
>>> +             else
>>> +                     dst_offset = map->iova.start - start;
>>> +
>>> +             src = (void *)(map->orig + src_offset);
>>> +             dst = page_address(page) + dst_offset;
>>> +             copy_len = min_t(size_t, map->size - src_offset,
>>> +                             PAGE_SIZE - dst_offset);
>>> +             memcpy(dst, src, copy_len);
>>> +     }
>>> +out:
>>> +     spin_unlock(&domain->map_lock);
>>> +
>>> +     return page;
>>> +}
>>> +
>>> +static void
>>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova, size_t size)
>>> +{
>>> +     struct page *page;
>>> +     struct interval_tree_node *node;
>>> +     unsigned long last = iova + size - 1;
>>> +
>>> +     spin_lock(&domain->map_lock);
>>> +     node = interval_tree_iter_first(&domain->mappings, iova, last);
>>> +     if (WARN_ON(node))
>>> +             goto out;
>>> +
>>> +     while (size > 0) {
>>> +             page = vduse_domain_get_bounce_page(domain, iova);
>>> +             if (page) {
>>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
>>> +                     __free_page(page);
>>> +             }
>>> +             size -= PAGE_SIZE;
>>> +             iova += PAGE_SIZE;
>>> +     }
>>> +out:
>>> +     spin_unlock(&domain->map_lock);
>>> +}
>>> +
>>> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>> +                             unsigned long iova, unsigned long orig,
>>> +                             size_t size, enum dma_data_direction dir)
>>> +{
>>> +     unsigned int offset = offset_in_page(iova);
>>> +
>>> +     while (size) {
>>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
>>> +             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
>>> +             void *addr;
>>> +
>>> +             WARN_ON(!p && dir == DMA_FROM_DEVICE);
>>> +
>>> +             if (p) {
>>> +                     addr = page_address(p) + offset;
>>> +                     if (dir == DMA_TO_DEVICE)
>>> +                             memcpy(addr, (void *)orig, copy_len);
>>> +                     else if (dir == DMA_FROM_DEVICE)
>>> +                             memcpy((void *)orig, addr, copy_len);
>>> +             }
>>> +
>>> +             size -= copy_len;
>>> +             orig += copy_len;
>>> +             iova += copy_len;
>>> +             offset = 0;
>>> +     }
>>> +}
>>> +
>>> +static unsigned long vduse_domain_alloc_iova(struct iova_domain *iovad,
>>> +                             unsigned long size, unsigned long limit)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +     unsigned long iova_pfn;
>>> +
>>> +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>>> +             iova_len = roundup_pow_of_two(iova_len);
>>> +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
>>> +
>>> +     return iova_pfn << shift;
>>> +}
>>> +
>>> +static void vduse_domain_free_iova(struct iova_domain *iovad,
>>> +                             unsigned long iova, size_t size)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +
>>> +     free_iova_fast(iovad, iova >> shift, iova_len);
>>> +}
>>> +
>>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
>>> +                             struct page *page, unsigned long offset,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +     unsigned long limit = domain->bounce_size - 1;
>>> +     unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +     unsigned long orig = (unsigned long)page_address(page) + offset;
>>> +     struct vduse_iova_map *map;
>>> +
>>> +     if (!iova)
>>> +             return DMA_MAPPING_ERROR;
>>> +
>>> +     map = vduse_domain_alloc_iova_map(domain, iova, orig, size, dir);
>>> +     if (!map) {
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +             return DMA_MAPPING_ERROR;
>>> +     }
>>> +
>>> +     spin_lock(&domain->map_lock);
>>> +     interval_tree_insert(&map->iova, &domain->mappings);
>>> +     spin_unlock(&domain->map_lock);
>>> +
>>> +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, iova, orig, size, DMA_TO_DEVICE);
>>> +
>>> +     return (dma_addr_t)iova;
>>> +}
>>> +
>>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
>>> +                     dma_addr_t dma_addr, size_t size,
>>> +                     enum dma_data_direction dir, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +     unsigned long iova = (unsigned long)dma_addr;
>>> +     struct interval_tree_node *node;
>>> +     struct vduse_iova_map *map;
>>> +
>>> +     spin_lock(&domain->map_lock);
>>> +     node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
>>> +     if (WARN_ON(!node)) {
>>> +             spin_unlock(&domain->map_lock);
>>> +             return;
>>> +     }
>>> +     interval_tree_remove(node, &domain->mappings);
>>> +     spin_unlock(&domain->map_lock);
>>> +
>>> +     map = container_of(node, struct vduse_iova_map, iova);
>>> +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, iova, map->orig,
>>> +                                     size, DMA_FROM_DEVICE);
>>> +     vduse_domain_free_iova(iovad, iova, size);
>>> +     kfree(map);
>>> +}
>>> +
>>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>>> +                             size_t size, dma_addr_t *dma_addr,
>>> +                             gfp_t flag, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     unsigned long limit = domain->bounce_size + CONSISTENT_DMA_SIZE - 1;
>>> +     unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +     void *orig = alloc_pages_exact(size, flag);
>>> +     struct vduse_iova_map *map;
>>> +
>>> +     if (!iova || !orig)
>>> +             goto err;
>>> +
>>> +     map = vduse_domain_alloc_iova_map(domain, iova, (unsigned long)orig,
>>> +                                     size, DMA_BIDIRECTIONAL);
>>> +     if (!map)
>>> +             goto err;
>>> +
>>> +     spin_lock(&domain->map_lock);
>>> +     interval_tree_insert(&map->iova, &domain->mappings);
>>> +     spin_unlock(&domain->map_lock);
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return orig;
>>> +err:
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     if (orig)
>>> +             free_pages_exact(orig, size);
>>> +     if (iova)
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
>>> +                             void *vaddr, dma_addr_t dma_addr,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     unsigned long iova = (unsigned long)dma_addr;
>>> +     struct interval_tree_node *node;
>>> +     struct vduse_iova_map *map;
>>> +
>>> +     spin_lock(&domain->map_lock);
>>> +     node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
>>> +     if (WARN_ON(!node)) {
>>> +             spin_unlock(&domain->map_lock);
>>> +             return;
>>> +     }
>>> +     interval_tree_remove(node, &domain->mappings);
>>> +     spin_unlock(&domain->map_lock);
>>> +
>>> +     map = container_of(node, struct vduse_iova_map, iova);
>>> +     vduse_domain_free_iova(iovad, iova, size);
>>> +     free_pages_exact(vaddr, size);
>>> +     kfree(map);
>>> +}
>>> +
>>> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
>>> +{
>>> +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
>>> +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
>>> +     struct page *page;
>>> +
>>> +     if (!domain)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     if (iova < domain->bounce_size)
>>> +             page = vduse_domain_alloc_bounce_page(domain, iova);
>>> +     else
>>> +             page = vduse_domain_get_mapping_page(domain, iova);
>>> +
>>> +     if (!page)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     vmf->page = page;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
>>> +     .fault = vduse_domain_mmap_fault,
>>> +};
>>> +
>>> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
>>> +     vma->vm_private_data = domain;
>>> +     vma->vm_ops = &vduse_domain_mmap_ops;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_domain_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
>>> +     put_iova_domain(&domain->stream_iovad);
>>> +     put_iova_domain(&domain->consistent_iovad);
>>> +     vfree(domain->bounce_pages);
>>> +     kfree(domain);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_domain_fops = {
>>> +     .mmap = vduse_domain_mmap,
>>> +     .release = vduse_domain_release,
>>> +};
>>
>> It's better to explain the reason for introducing a dedicated file for
>> mmap() here.
>>
> To make the implementation of iova_domain independent of vduse_dev.


My understanding is that the only use for this is to:

1) support different types of IOVA mappings, or
2) switch between IOVA domain mappings.

But I can't think of a need for either.


>
>>> +
>>> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
>>> +{
>>> +     fput(domain->file);
>>> +}
>>> +
>>> +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size)
>>> +{
>>> +     struct vduse_iova_domain *domain;
>>> +     struct file *file;
>>> +     unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
>>> +
>>> +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>>> +     if (!domain)
>>> +             return NULL;
>>> +
>>> +     domain->bounce_size = PAGE_ALIGN(bounce_size);
>>> +     domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
>>> +     if (!domain->bounce_pages)
>>> +             goto err_page;
>>> +
>>> +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
>>> +                             domain, O_RDWR);
>>> +     if (IS_ERR(file))
>>> +             goto err_file;
>>> +
>>> +     domain->file = file;
>>> +     spin_lock_init(&domain->map_lock);
>>> +     domain->mappings = RB_ROOT_CACHED;
>>> +     init_iova_domain(&domain->stream_iovad,
>>> +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
>>> +     init_iova_domain(&domain->consistent_iovad,
>>> +                     PAGE_SIZE, bounce_pfns);
>>> +
>>> +     return domain;
>>> +err_file:
>>> +     vfree(domain->bounce_pages);
>>> +err_page:
>>> +     kfree(domain);
>>> +     return NULL;
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
>>> new file mode 100644
>>> index 000000000000..cc61866acb56
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
>>> @@ -0,0 +1,68 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_IOVA_DOMAIN_H
>>> +#define _VDUSE_IOVA_DOMAIN_H
>>> +
>>> +#include <linux/iova.h>
>>> +#include <linux/interval_tree.h>
>>> +#include <linux/dma-mapping.h>
>>> +
>>> +struct vduse_iova_map {
>>> +     struct interval_tree_node iova;
>>> +     unsigned long orig;
>>
>> Need a better name, probably "va"?
>>
> Fine.
>
>>> +     size_t size;
>>> +     enum dma_data_direction dir;
>>> +};
>>> +
>>> +struct vduse_iova_domain {
>>> +     struct iova_domain stream_iovad;
>>> +     struct iova_domain consistent_iovad;
>>> +     struct page **bounce_pages;
>>> +     size_t bounce_size;
>>> +     struct rb_root_cached mappings;
>>
>> We already have the IOTLB. Any reason for the extra mappings here?
>>
> It is used to store the iova <-> vduse_iova_map (vaddr, size, dir)
> mapping. We need it to know how to do DMA bouncing during DMA
> unmapping.


Right. What I meant is: since the vhost IOTLB supports opaque data, it
looks to me like we can piggyback "orig" in the opaque field. Then
there's no need for an extra interval tree?
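
To make the suggestion concrete, here is a rough userspace model of the
idea, not kernel code. Every name in it (tlb_entry, bounce_info,
map_buf, unmap_buf, the fixed-size table) is hypothetical and stands in
for the real vhost IOTLB; it only shows that an opaque pointer stored
with the range is enough to find the original buffer and direction at
unmap time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel's vhost_iotlb types. */
enum dir { TO_DEVICE, FROM_DEVICE, BIDIRECTIONAL };

struct bounce_info {            /* what the patch keeps in vduse_iova_map */
	void *orig;             /* original buffer */
	enum dir dir;
};

struct tlb_entry {              /* one IOTLB range with an opaque slot */
	uint64_t start, last;   /* inclusive IOVA range */
	void *opaque;
	int used;
};

#define TLB_SLOTS 8
static struct tlb_entry tlb[TLB_SLOTS];
static unsigned char bounce[4096]; /* models the bounce buffer, by IOVA */

static struct tlb_entry *tlb_find(uint64_t iova)
{
	for (size_t i = 0; i < TLB_SLOTS; i++)
		if (tlb[i].used && iova >= tlb[i].start && iova <= tlb[i].last)
			return &tlb[i];
	return NULL;
}

/* Map: record the range; copy in if the device will read the data. */
static int map_buf(uint64_t iova, void *orig, size_t size, enum dir dir,
		   struct bounce_info *info)
{
	assert(info != NULL);
	if (size == 0 || iova + size > sizeof(bounce))
		return -1;
	for (size_t i = 0; i < TLB_SLOTS; i++) {
		if (tlb[i].used)
			continue;
		info->orig = orig;
		info->dir = dir;
		tlb[i] = (struct tlb_entry){ iova, iova + size - 1, info, 1 };
		if (dir != FROM_DEVICE)
			memcpy(bounce + iova, orig, size);
		return 0;
	}
	return -1; /* table full */
}

/* Unmap: the opaque pointer alone says where to copy device writes. */
static int unmap_buf(uint64_t iova, size_t size)
{
	struct tlb_entry *e = tlb_find(iova);
	struct bounce_info *info;

	if (!e || e->start != iova || size != e->last - e->start + 1)
		return -1;
	info = e->opaque;
	if (info->dir != TO_DEVICE)
		memcpy(info->orig, bounce + iova, size);
	e->used = 0;
	return 0;
}
```

Whether this works in the driver depends on the IOTLB being usable from
the DMA mapping paths, which this sketch does not address.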


>
>>> +     spinlock_t map_lock;
>>> +     struct file *file;
>>> +};
>>> +
>>> +static inline struct file *
>>> +vduse_domain_file(struct vduse_iova_domain *domain)
>>> +{
>>> +     return domain->file;
>>> +}
>>> +
>>> +static inline unsigned long
>>> +vduse_domain_get_offset(struct vduse_iova_domain *domain, unsigned long iova)
>>> +{
>>> +     return iova;
>>> +}
>>> +
>>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
>>> +                             struct page *page, unsigned long offset,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs);
>>> +
>>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
>>> +                     dma_addr_t dma_addr, size_t size,
>>> +                     enum dma_data_direction dir, unsigned long attrs);
>>> +
>>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>>> +                             size_t size, dma_addr_t *dma_addr,
>>> +                             gfp_t flag, unsigned long attrs);
>>> +
>>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
>>> +                             void *vaddr, dma_addr_t dma_addr,
>>> +                             unsigned long attrs);
>>> +
>>> +void vduse_domain_destroy(struct vduse_iova_domain *domain);
>>> +
>>> +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size);
>>> +
>>> +#endif /* _VDUSE_IOVA_DOMAIN_H */
>>> diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
>>> new file mode 100644
>>> index 000000000000..3566d229382e
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse.h
>>> @@ -0,0 +1,62 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_H
>>> +#define _VDUSE_H
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/vdpa.h>
>>> +
>>> +#include "iova_domain.h"
>>> +#include "eventfd.h"
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     bool ready;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vduse_virqfd *virqfd;
>>> +     void *private;
>>> +     irqreturn_t (*cb)(void *data);
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct mutex lock;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     struct vhost_iotlb *iommu;
>>> +     spinlock_t iommu_lock;
>>> +     atomic_t bounce_map;
>>> +     spinlock_t msg_lock;
>>> +     atomic64_t msg_unique;
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct list_head list;
>>> +     bool connected;
>>> +     u32 id;
>>> +     u16 vq_size_max;
>>> +     u16 vq_num;
>>> +     u32 vq_align;
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +};
>>> +
>>> +#endif /* _VDUSE_H_ */
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..1cf759bc5914
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1217 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/miscdevice.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/anon_inodes.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "vduse.h"
>>> +
>>> +#define DRV_VERSION  "1.0"
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +     refcount_t refcnt;
>>
>> The reference count here will bring extra complexity. I think we can
>> sync through msg_lock.
>>
> Do you mean using wait_event_interruptible_locked() and
> wake_up_locked()? I think it works.


Right.


>
>>
>>> +};
>>> +
>>> +static struct workqueue_struct *vduse_vdpa_wq;
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static LIST_HEAD(vduse_devs);
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                     GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = type;
>>> +     msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>
>> This does not look safe; let's use an IDR here.
>>
> Could you give more details? It looks like an IDR should not be used in
> this case, which cannot tolerate failure. And using a list to store the
> msg is better than using an IDR when the msg needs to be re-inserted in
> some cases.


My understanding is that "unique" (it probably needs a better name) is a
token used to uniquely identify a message. The reply from userspace must
be written with exactly the same token. An IDR seems better, but
considering we can hardly hit a 64-bit overflow, an atomic counter might
be OK as well.

Btw, in what case do we need to re-insert a message?
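
As an illustration of the token scheme, here is a small userspace
sketch. The names (msg, msg_list, msg_send, msg_find) are hypothetical
and there is no locking; it only models a free-running 64-bit counter
and a lookup by token. One thing it makes visible: the quoted
vduse_dev_find_msg() takes a uint32_t unique while the counter comes
from atomic64_fetch_inc(), so if req.unique is a 64-bit field the
lookup would stop matching once the counter passes 2^32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; names are not taken from the patch. */
struct msg {
	uint64_t unique;        /* token echoed back by the responder */
	struct msg *next;
};

struct msg_list {
	struct msg *head;
	uint64_t next_unique;   /* monotonic counter, never reused */
};

/* Assign the next token and push the message on the pending list. */
static void msg_send(struct msg_list *l, struct msg *m)
{
	m->unique = l->next_unique++;
	m->next = l->head;
	l->head = m;
}

/* Find and unlink the message whose token matches; NULL if none does. */
static struct msg *msg_find(struct msg_list *l, uint64_t unique)
{
	for (struct msg **pp = &l->head; *pp; pp = &(*pp)->next) {
		if ((*pp)->unique == unique) {
			struct msg *m = *pp;

			*pp = m->next;
			m->next = NULL;
			return m;
		}
	}
	return NULL;
}
```

Passing a token truncated to 32 bits, e.g. msg_find(&l, (uint32_t)tok),
finds nothing once tok is above UINT32_MAX, which is the mismatch
described above.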


>
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     refcount_set(&msg->refcnt, 1);
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
>>> +{
>>> +     refcount_inc(&msg->refcnt);
>>> +}
>>> +
>>> +static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
>>> +{
>>> +     if (refcount_dec_and_test(&msg->refcnt))
>>> +             kfree(msg);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
>>> +                                             struct list_head *head,
>>> +                                             uint32_t unique)
>>> +{
>>> +     struct vduse_dev_msg *tmp, *msg = NULL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     list_for_each_entry(tmp, head, list) {
>>> +             if (tmp->req.unique == unique) {
>>> +                     msg = tmp;
>>> +                     list_del(&tmp->list);
>>> +                     break;
>>> +             }
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
>>> +                                             struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
>>> +                     struct vduse_dev_msg *msg, struct list_head *head)
>>> +{
>>> +     spin_lock(&dev->msg_lock);
>>> +     list_add_tail(&msg->list, head);
>>> +     spin_unlock(&dev->msg_lock);
>>> +}
>>> +
>>> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
>>> +{
>>> +     int ret;
>>> +
>>> +     vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
>>> +     wake_up(&dev->waitq);
>>> +     wait_event(msg->waitq, msg->completed);
>>
>> This is an uninterruptible wait, which means that if userspace forgets
>> to process the command, we will be stuck here forever.
>>
> Yes, wait_event_interruptible() should be better here.
>
>>> +     /* coupled with smp_wmb() in vduse_dev_msg_complete() */
>>> +     smp_rmb();
>>
>> Instead of using barriers, I wonder why not simply use msg lock here?
>>
> As mentioned above, using
> wait_event_interruptible_locked()/wake_up_locked() is OK with me.
>
>>> +     ret = msg->resp.result;
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
>>> +                                     struct vduse_dev_response *resp)
>>> +{
>>> +     vduse_dev_msg_get(msg);
>>> +     memcpy(&msg->resp, resp, sizeof(*resp));
>>> +     /* coupled with smp_rmb() in vduse_dev_msg_sync() */
>>> +     smp_wmb();
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +     vduse_dev_msg_put(msg);
>>> +}
>>> +
>>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
>>> +     u64 features;
>>> +
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     features = msg->resp.features;
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return features;
>>> +}
>>> +
>>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
>>> +     int ret;
>>> +
>>> +     msg->req.size = sizeof(features);
>>> +     msg->req.features = features;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
>>> +     u8 status;
>>> +
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     status = msg->resp.status;
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return status;
>>> +}
>>> +
>>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
>>> +
>>> +     msg->req.size = sizeof(status);
>>> +     msg->req.status = status;
>>> +
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +}
>>> +
>>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
>>> +                                     void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
>>> +
>>> +     WARN_ON(len > sizeof(msg->req.config.data));
>>> +
>>> +     msg->req.size = sizeof(struct vduse_dev_config_data);
>>> +     msg->req.config.offset = offset;
>>> +     msg->req.config.len = len;
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     memcpy(buf, msg->resp.config.data, len);
>>> +     vduse_dev_msg_put(msg);
>>> +}
>>> +
>>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
>>> +                                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
>>> +
>>> +     WARN_ON(len > sizeof(msg->req.config.data));
>>> +
>>> +     msg->req.size = sizeof(struct vduse_dev_config_data);
>>> +     msg->req.config.offset = offset;
>>> +     msg->req.config.len = len;
>>> +     memcpy(msg->req.config.data, buf, len);
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, u32 num)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
>>> +
>>> +     msg->req.size = sizeof(struct vduse_vq_num);
>>> +     msg->req.vq_num.index = vq->index;
>>> +     msg->req.vq_num.num = num;
>>> +
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, u64 desc_addr,
>>> +                             u64 driver_addr, u64 device_addr)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
>>> +     int ret;
>>> +
>>> +     msg->req.size = sizeof(struct vduse_vq_addr);
>>> +     msg->req.vq_addr.index = vq->index;
>>> +     msg->req.vq_addr.desc_addr = desc_addr;
>>> +     msg->req.vq_addr.driver_addr = driver_addr;
>>> +     msg->req.vq_addr.device_addr = device_addr;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, bool ready)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
>>> +
>>> +     msg->req.size = sizeof(struct vduse_vq_ready);
>>> +     msg->req.vq_ready.index = vq->index;
>>> +     msg->req.vq_ready.ready = ready;
>>> +
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +}
>>> +
>>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
>>> +                                struct vduse_virtqueue *vq)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
>>> +     bool ready;
>>> +
>>> +     msg->req.size = sizeof(struct vduse_vq_ready);
>>> +     msg->req.vq_ready.index = vq->index;
>>> +
>>> +     vduse_dev_msg_sync(dev, msg);
>>> +     ready = msg->resp.vq_ready.ready;
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return ready;
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE);
>>> +     int ret;
>>> +
>>> +     msg->req.size = sizeof(struct vduse_vq_state);
>>> +     msg->req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>> +     state->avail_index = msg->resp.vq_state.avail_idx;
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE);
>>> +     int ret;
>>> +
>>> +     msg->req.size = sizeof(struct vduse_vq_state);
>>> +     msg->req.vq_state.index = vq->index;
>>> +     msg->req.vq_state.avail_idx = state->avail_index;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                                     u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +     int ret;
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
>>
>> This is actually an IOTLB invalidation. So let's rename the function and
>> message type.
>>
> Actually, VDUSE_UPDATE_IOTLB is now used to notify userspace that the
> IOTLB has changed, rather than that the IOTLB needs to be invalidated.
> Userspace can then use the GET_IOTLB ioctl to fetch the change. This
> seems friendlier to userspace.


Ok.


>
>>> +     msg->req.size = sizeof(struct vduse_iova_range);
>>> +     msg->req.iova.start = start;
>>> +     msg->req.iova.last = last;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, msg);
>>> +     vduse_dev_msg_put(msg);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret = 0;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return 0;
>>> +
>>> +     while (1) {
>>> +             msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     return -EAGAIN;
>>> +
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +     }
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     if (ret != size) {
>>> +             vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
>>> +             return -EFAULT;
>>> +     }
>>> +     vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
>>> +     if (!msg)
>>> +             return -EINVAL;
>>> +
>>> +     vduse_dev_msg_complete(msg, &resp);
>>
>> So we have multiple types of requests/responses. Would it be better to
>> introduce a queue-based admin interface instead of ioctl?
>>
> Sorry, I didn't get your point. What do you mean by queue-based admin
> interface? Virtqueue-based?


Yes, a queue (virtqueue). The commands could be passed through the queue.
(Just an idea; I'm not sure it's worth it.)
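
Very roughly, and only to show the shape of a queue-based command
path, something like the ring below. It is a deliberately simplified
single-threaded model with hypothetical names (cmd, cmd_ring,
ring_push, ring_pop). It is not a virtqueue: there is no descriptor
table, no avail/used split, and no memory barriers or notifications.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 4u    /* power of two so index wrap is a mask */

struct cmd {            /* hypothetical command layout */
	uint32_t type;
	uint64_t arg;
};

struct cmd_ring {
	struct cmd slots[RING_SIZE];
	uint32_t head;  /* next slot the producer fills; free-running */
	uint32_t tail;  /* next slot the consumer reads; free-running */
};

/* Producer side: returns -1 when the ring is full. */
static int ring_push(struct cmd_ring *r, struct cmd c)
{
	if (r->head - r->tail == RING_SIZE)
		return -1;
	r->slots[r->head++ & (RING_SIZE - 1)] = c;
	return 0;
}

/* Consumer side: returns -1 when the ring is empty. */
static int ring_pop(struct cmd_ring *r, struct cmd *out)
{
	if (r->head == r->tail)
		return -1;
	*out = r->slots[r->tail++ & (RING_SIZE - 1)];
	return 0;
}
```

Commands come out in the order they were pushed, and a full ring gives
the producer back-pressure instead of an unbounded list.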


>
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
>>> +                              u64 start, u64 last,
>>> +                              u64 addr, unsigned int perm,
>>> +                              struct file *file, u64 offset)
>>> +{
>>> +     struct vhost_iotlb_file *iotlb_file;
>>> +     int ret;
>>> +
>>> +     iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_ATOMIC);
>>> +     if (!iotlb_file)
>>> +             return -ENOMEM;
>>> +
>>> +     iotlb_file->file = get_file(file);
>>> +     iotlb_file->offset = offset;
>>> +
>>> +     spin_lock(&dev->iommu_lock);
>>> +     ret = vhost_iotlb_add_range(dev->iommu, start, last,
>>> +                                     addr, perm, iotlb_file);
>>> +     spin_unlock(&dev->iommu_lock);
>>> +     if (ret) {
>>> +             fput(iotlb_file->file);
>>> +             kfree(iotlb_file);
>>> +             return ret;
>>> +     }
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
>>> +{
>>> +     struct vhost_iotlb_file *iotlb_file;
>>> +     struct vhost_iotlb_map *map;
>>> +
>>> +     spin_lock(&dev->iommu_lock);
>>> +     while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
>>> +             iotlb_file = (struct vhost_iotlb_file *)map->opaque;
>>> +             fput(iotlb_file->file);
>>> +             kfree(iotlb_file);
>>> +             vhost_iotlb_map_free(dev->iommu, map);
>>> +     }
>>> +     spin_unlock(&dev->iommu_lock);
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     atomic_set(&dev->bounce_map, 0);
>>> +     vduse_iotlb_del_range(dev, 0ULL, 0ULL - 1);
>>> +     vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
>>
>> ULLONG_MAX please.
>>
> OK.
>
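
For the record, the two spellings denote the same value, since unsigned
arithmetic wraps; ULLONG_MAX just states the intent. A tiny userspace
check, where covers_all() is a hypothetical helper that exists only for
this illustration:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: true if [start, last] spans the whole 64-bit space. */
static int covers_all(unsigned long long start, unsigned long long last)
{
	return start == 0 && last == ULLONG_MAX;
}
```

Calling covers_all(0ULL, 0ULL - 1) and covers_all(0ULL, ULLONG_MAX)
gives the same answer, so the change is purely about readability.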
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->ready = false;
>>> +             vq->cb = NULL;
>>> +             vq->private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
>>> +                                     driver_area, device_area);
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_vq_kick(vq);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->cb = cb->callback;
>>> +     vq->private = cb->private;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_num(dev, vq, num);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_ready(dev, vq, ready);
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
>>> +
>>> +     return (vduse_dev_get_features(dev) | fixed);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_set_features(dev, features);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     /* We don't support config interrupt */
>>
>> If it's not hard, let's add this. Otherwise we need a per-device feature
>> blacklist to filter out all features that depend on the config interrupt.
>>
> Will do it.
>
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_get_status(dev);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     if (status == 0)
>>> +             vduse_dev_reset(dev);
>>> +     else
>>> +             vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
>>
>> Any reason for such IOTLB invalidation here?
>>
> As I mentioned before, this is used to notify userspace to update the
> IOTLB, mainly for the virtio-vdpa case.


So the question is: the status is usually set several times during
driver initialization. Do we really need to update the IOTLB every
time?


>
>>> +
>>> +     vduse_dev_set_status(dev, status);
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                          void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_get_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_set_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vhost_iotlb_map *map;
>>> +     struct vhost_iotlb_file *iotlb_file;
>>> +     u64 start = 0ULL, last = 0ULL - 1;
>>> +     int ret = 0;
>>> +
>>> +     vduse_iotlb_del_range(dev, start, last);
>>> +
>>> +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
>>> +             map = vhost_iotlb_itree_next(map, start, last)) {
>>> +             if (!map->opaque)
>>> +                     continue;
>>
>> What will happen if we simply accept NULL opaque here?
>>
> No file to mmap in userspace. So it's useless.
>
>>> +
>>> +             iotlb_file = (struct vhost_iotlb_file *)map->opaque;
>>> +             ret = vduse_iotlb_add_range(dev, map->start, map->last,
>>> +                                         map->addr, map->perm,
>>> +                                         iotlb_file->file,
>>> +                                         iotlb_file->offset);
>>> +             if (ret)
>>> +                     break;
>>> +     }
>>> +     vduse_dev_update_iotlb(dev, start, last);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     WARN_ON(!list_empty(&dev->send_list));
>>> +     WARN_ON(!list_empty(&dev->recv_list));
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                     unsigned long offset, size_t size,
>>> +                                     enum dma_data_direction dir,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
>>> +             vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
>>> +                                   0, VDUSE_ACCESS_RW,
>>
>> Is it safe to use VDUSE_ACCESS_RW here, considering we might have
>> device read-only mappings?
>>
> This mapping is for the whole bounce buffer. Maybe userspace needs to
> tell us if it only supports read-only mappings.


Right, so I think we don't need to care about this.


>
>>> +                                   vduse_domain_file(domain),
>>> +                                   vduse_domain_get_offset(domain, 0))) {
>>> +             atomic_set(&vdev->bounce_map, 0);
>>> +             return DMA_MAPPING_ERROR;
>>> +     }
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
>>> +                               iova, VDUSE_ACCESS_RW,
>>> +                               vduse_domain_file(domain),
>>> +                               vduse_domain_get_offset(domain, iova))) {
>>> +             vduse_domain_free_coherent(domain, size, addr, iova, attrs);
>>> +             return NULL;
>>> +     }
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long start = (unsigned long)dma_addr;
>>> +     unsigned long last = start + size - 1;
>>> +
>>> +     vduse_iotlb_del_range(vdev, start, last);
>>> +     vduse_dev_update_iotlb(vdev, start, last);
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     mutex_lock(&dev->lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vhost_iotlb_file *iotlb_file;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             spin_lock(&dev->iommu_lock);
>>> +             map = vhost_iotlb_itree_first(dev->iommu, entry.start,
>>> +                                           entry.last);
>>> +             if (map) {
>>> +                     iotlb_file = (struct vhost_iotlb_file *)map->opaque;
>>> +                     f = get_file(iotlb_file->file);
>>> +                     entry.offset = iotlb_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&dev->iommu_lock);
>>> +             if (!f) {
>>> +                     ret = -EINVAL;
>>> +                     break;
>>> +             }
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     ret = -EFAULT;
>>> +                     break;
>>> +             }
>>> +             ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
>>> +             if (ret < 0) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             fd_install(ret, f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_IRQFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_virqfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     }
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +
>>> +     vduse_kickfd_release(dev);
>>> +     vduse_virqfd_release(dev);
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     dev->iommu = vhost_iotlb_alloc(2048, 0);
>>
>> Is 2048 sufficient here?
>>
> How about letting userspace define it?


Fine with me.


>
>
>>> +     if (!dev->iommu) {
>>> +             kfree(dev);
>>> +             return NULL;
>>> +     }
>>> +
>>> +     mutex_init(&dev->lock);
>>> +     spin_lock_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     atomic64_set(&dev->msg_unique, 0);
>>> +     spin_lock_init(&dev->iommu_lock);
>>> +     atomic_set(&dev->bounce_map, 0);
>>> +
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     vhost_iotlb_free(dev->iommu);
>>> +     mutex_destroy(&dev->lock);
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(u32 id)
>>> +{
>>> +     struct vduse_dev *tmp, *dev = NULL;
>>> +
>>> +     list_for_each_entry(tmp, &vduse_devs, list) {
>>> +             if (tmp->id == id) {
>>> +                     dev = tmp;
>>> +                     break;
>>> +             }
>>> +     }
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(u32 id)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(id);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     if (dev->vdev || dev->connected)
>>> +             return -EBUSY;
>>> +
>>> +     list_del(&dev->list);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     vduse_dev_destroy(dev);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config)
>>> +{
>>> +     int i, fd;
>>> +     struct vduse_dev *dev;
>>> +     char name[64];
>>> +
>>> +     if (vduse_find_dev(config->id))
>>> +             return -EEXIST;
>>> +
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->id = config->id;
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->domain = vduse_domain_create(config->bounce_size);
>>
>> Do we need an upper limit on bounce_size?
>>
> I agree. Any suggestion for the value?


Something like the swiotlb default value (64M)?
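
A minimal sketch of such a limit check, assuming a 64 MiB cap modelled on the swiotlb default mentioned above. The macro VDUSE_MAX_BOUNCE_SIZE, the page-alignment requirement and the helper vduse_check_bounce_size() are hypothetical and not part of this patch:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical cap: 64 MiB, mirroring the swiotlb default size. */
#define VDUSE_MAX_BOUNCE_SIZE (64ULL << 20)
/* Assumed page size used for the alignment check. */
#define VDUSE_SKETCH_PAGE_SIZE 4096ULL

/*
 * Returns 0 if the requested bounce buffer size is acceptable,
 * -EINVAL if it is zero, too large, or not page aligned.
 */
static int vduse_check_bounce_size(uint64_t bounce_size)
{
	if (bounce_size == 0 || bounce_size > VDUSE_MAX_BOUNCE_SIZE)
		return -EINVAL;
	if (bounce_size % VDUSE_SKETCH_PAGE_SIZE)
		return -EINVAL;
	return 0;
}
```

Such a check would presumably run in vduse_create_dev() before vduse_domain_create() is called, so that an oversized request fails early.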


>
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     snprintf(name, sizeof(name), "[vduse-dev:%u]", config->id);
>>> +     fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
>>
>> Any reason for closing on exec here?
>>
> Looks like we can remove this flag.
>
>>> +     if (fd < 0)
>>> +             goto err_fd;
>>> +
>>> +     dev->connected = true;
>>> +     list_add(&dev->list, &vduse_devs);
>>> +
>>> +     return fd;
>>> +err_fd:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     vduse_dev_destroy(dev);
>>> +     return fd;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, sizeof(config)))
>>> +                     break;
>>> +
>>> +             ret = vduse_create_dev(&config);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV:
>>> +             ret = vduse_destroy_dev(arg);
>>> +             break;
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct miscdevice vduse_misc = {
>>> +     .fops = &vduse_fops,
>>> +     .minor = MISC_DYNAMIC_MINOR,
>>> +     .name = "vduse",
>>> +};
>>> +
>>> +static void vduse_parent_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_parent = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_parent_release,
>>> +};
>>> +
>>> +static struct vdpa_parent_dev parent_dev;
>>> +
>>> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev = dev->vdev;
>>> +     int ret;
>>> +
>>> +     if (vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
>>> +                              &vduse_vdpa_config_ops,
>>> +                              dev->vq_num, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.pdev = &parent_dev;
>>> +
>>> +     ret = _vdpa_register_device(&vdev->vdpa);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     dev->vdev = vdev;
>>> +
>>> +     return 0;
>>> +err:
>>> +     put_device(&vdev->vdpa.dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static struct vdpa_device *vdpa_dev_add(struct vdpa_parent_dev *pdev,
>>> +                                     const char *name, u32 device_id,
>>> +                                     struct nlattr **attrs)
>>> +{
>>> +     u32 vduse_id;
>>> +     struct vduse_dev *dev;
>>> +     int ret = -EINVAL;
>>> +
>>> +     if (!attrs[VDPA_ATTR_BACKEND_ID])
>>> +             return ERR_PTR(-EINVAL);
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     vduse_id = nla_get_u32(attrs[VDPA_ATTR_BACKEND_ID]);
>>
>> I wonder why we don't use the name here?
>>
> Do you mean use the same name for both backend and frontend? If so, we
> need to add a name for vduse device or replace id with name to
> identify a vduse device. Which way do you prefer?


I think if we can do it with the name, it's better not to introduce
anything else like "id". It will complicate the management software.


>
>> And it looks to me it would be easier if we create a char device per
>> vduse. This makes the device addressing more robust than passing id
>> silently among processes.
>>
> It's OK with me.
>
>>> +     dev = vduse_find_dev(vduse_id);
>>> +     if (!dev)
>>> +             goto unlock;
>>> +
>>> +     if (dev->device_id != device_id)
>>> +             goto unlock;
>>> +
>>> +     ret = vduse_dev_add_vdpa(dev, name);
>>> +unlock:
>>> +     mutex_unlock(&vduse_lock);
>>> +     if (ret)
>>> +             return ERR_PTR(ret);
>>> +
>>> +     return &dev->vdev->vdpa;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_parent_dev *pdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_dev_ops vdpa_dev_parent_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_parent_dev parent_dev = {
>>> +     .device = &vduse_parent,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_parent_ops,
>>> +};
>>> +
>>> +static int vduse_parentdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_parent);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_parentdev_register(&parent_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_parent);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_parentdev_exit(void)
>>> +{
>>> +     vdpa_parentdev_unregister(&parent_dev);
>>> +     device_unregister(&vduse_parent);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = misc_register(&vduse_misc);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = -ENOMEM;
>>> +     vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
>>> +     if (!vduse_vdpa_wq)
>>> +             goto err_vdpa_wq;
>>> +
>>> +     ret = vduse_virqfd_init();
>>> +     if (ret)
>>> +             goto err_irqfd;
>>> +
>>> +     ret = vduse_parentdev_init();
>>> +     if (ret)
>>> +             goto err_parentdev;
>>> +
>>> +     return 0;
>>> +err_parentdev:
>>> +     vduse_virqfd_exit();
>>> +err_irqfd:
>>> +     destroy_workqueue(vduse_vdpa_wq);
>>> +err_vdpa_wq:
>>> +     misc_deregister(&vduse_misc);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     misc_deregister(&vduse_misc);
>>> +     destroy_workqueue(vduse_vdpa_wq);
>>> +     vduse_virqfd_exit();
>>> +     vduse_parentdev_exit();
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_VERSION(DRV_VERSION);
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
>>> index bba8b83a94b5..a7a841e5ffc7 100644
>>> --- a/include/uapi/linux/vdpa.h
>>> +++ b/include/uapi/linux/vdpa.h
>>> @@ -33,6 +33,7 @@ enum vdpa_attr {
>>>        VDPA_ATTR_DEV_VENDOR_ID,                /* u32 */
>>>        VDPA_ATTR_DEV_MAX_VQS,                  /* u32 */
>>>        VDPA_ATTR_DEV_MAX_VQ_SIZE,              /* u16 */
>>> +     VDPA_ATTR_BACKEND_ID,                   /* u32 */
>>
>> As discussed, this needs more thought. But if necessary, we need a
>> separate patch for this.
>>
> OK.
>
>>>        /* new attributes must be added above here */
>>>        VDPA_ATTR_MAX,
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..9fb555ddcfbd
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,125 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +#define VDUSE_CONFIG_DATA_LEN        256
>>> +
>>> +enum vduse_req_type {
>>> +     VDUSE_SET_VQ_NUM,
>>> +     VDUSE_SET_VQ_ADDR,
>>> +     VDUSE_SET_VQ_READY,
>>> +     VDUSE_GET_VQ_READY,
>>> +     VDUSE_SET_VQ_STATE,
>>> +     VDUSE_GET_VQ_STATE,
>>> +     VDUSE_SET_FEATURES,
>>> +     VDUSE_GET_FEATURES,
>>> +     VDUSE_SET_STATUS,
>>> +     VDUSE_GET_STATUS,
>>> +     VDUSE_SET_CONFIG,
>>> +     VDUSE_GET_CONFIG,
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_num {
>>> +     __u32 index;
>>> +     __u32 num;
>>> +};
>>> +
>>> +struct vduse_vq_addr {
>>> +     __u32 index;
>>> +     __u64 desc_addr;
>>> +     __u64 driver_addr;
>>> +     __u64 device_addr;
>>> +};
>>> +
>>> +struct vduse_vq_ready {
>>> +     __u32 index;
>>> +     __u8 ready;
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index;
>>> +     __u16 avail_idx;
>>> +};
>>> +
>>> +struct vduse_dev_config_data {
>>> +     __u32 offset;
>>> +     __u32 len;
>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN];
>>
>> There is no guarantee that 256 is sufficient here.
>>
> If the size is larger than 256, we can try to split the original request.


Fine, then we need to document that here or in the doc.
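
A minimal sketch of how such splitting could work, assuming the caller walks the config space in chunks of at most VDUSE_CONFIG_DATA_LEN bytes. The helper vduse_split_config(), the callback type and the example callback are hypothetical; they only illustrate the chunking arithmetic, not the actual message path:

```c
#include <stddef.h>
#include <stdint.h>

#define VDUSE_CONFIG_DATA_LEN 256

/* Hypothetical callback: handles one chunk, returns 0 or a negative errno. */
typedef int (*vduse_chunk_fn)(uint32_t offset, uint32_t len, void *opaque);

/*
 * Split the range [offset, offset + len) into requests of at most
 * VDUSE_CONFIG_DATA_LEN bytes each. Returns the number of chunks issued,
 * or the first negative value returned by the callback.
 */
static int vduse_split_config(uint32_t offset, uint32_t len,
			      vduse_chunk_fn fn, void *opaque)
{
	int chunks = 0;

	while (len > 0) {
		uint32_t n = len < VDUSE_CONFIG_DATA_LEN ? len : VDUSE_CONFIG_DATA_LEN;
		int ret = fn(offset, n, opaque);

		if (ret < 0)
			return ret;
		offset += n;
		len -= n;
		chunks++;
	}
	return chunks;
}

/* Example callback: accumulates the total number of bytes it was given. */
static int vduse_count_bytes(uint32_t offset, uint32_t len, void *opaque)
{
	(void)offset;
	*(uint32_t *)opaque += len;
	return 0;
}
```

With this shape, a 600-byte access becomes three requests of 256, 256 and 88 bytes.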


>
>>> +};
>>> +
>>> +struct vduse_iova_range {
>>> +     __u64 start;
>>> +     __u64 last;
>>> +};
>>> +
>>> +struct vduse_dev_request {
>>> +     __u32 type; /* request type */
>>> +     __u32 unique; /* request id */
>>> +     __u32 flags; /* request flags */
>>
>> Seems unused in this series.
>>
> This is for future use.


So let's use "pad" or some other name.


>
>>> +     __u32 size; /* the payload size */
>>
>> Unused.
>>
> Will remove it.
>
>>> +     union {
>>> +             struct vduse_vq_num vq_num; /* virtqueue num */
>>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             struct vduse_iova_range iova; /* iova range for updating */
>>> +             __u64 features; /* virtio features */
>>> +             __u8 status; /* device status */
>>
>> Let's add some padding for future extensions.
>>
> Is sizeof(vduse_dev_config_data) ok? Or char[1024]?


1024 seems too large, 128 or 256 looks better.
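
A sketch of what such padding might look like, assuming a 256-byte reserved area inside the payload union so that the message size stays fixed when new payload types are added. The names below are illustrative only, and just a few of the payload members are reproduced:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of one small payload type. */
struct vq_num_sketch {
	uint32_t index;
	uint32_t num;
};

/* Hypothetical request layout with a reserved area in the union. */
struct vduse_request_sketch {
	uint32_t type;       /* request type */
	uint32_t request_id; /* request id */
	union {
		struct vq_num_sketch vq_num; /* virtqueue num */
		uint64_t features;           /* virtio features */
		uint8_t status;              /* device status */
		uint8_t padding[256];        /* reserved for future payloads */
	};
};

/* The union, not its current members, determines the message size (C11). */
_Static_assert(offsetof(struct vduse_request_sketch, padding) == 8,
	       "payload starts right after the 8-byte header");
_Static_assert(sizeof(struct vduse_request_sketch) == 8 + 256,
	       "message size is fixed by the padding");
```

Any payload added later that fits in 256 bytes leaves sizeof() unchanged, so older userspace keeps reading messages of the same length.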


>
>>> +     };
>>> +};
>>> +
>>> +struct vduse_dev_response {
>>> +     __u32 unique; /* corresponding request id */
>>
>> Let's use request id.
>>
> Fine.
>
>>> +     __s32 result; /* the result of request */
>>
>> Let's use a macro or enum to define the success and failure values.
>>
> Will do it.
>
>>> +     union {
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             __u64 features; /* virtio features */
>>> +             __u8 status; /* device status */
>>> +     };
>>> +};
>>> +
>>> +/* ioctls */
>>> +
>>> +struct vduse_dev_config {
>>> +     __u32 id; /* vduse device id */
>>> +     __u32 vendor_id; /* virtio vendor id */
>>> +     __u32 device_id; /* virtio device id */
>>> +     __u64 bounce_size; /* bounce buffer size for iommu */
>>> +     __u16 vq_num; /* the number of virtqueues */
>>> +     __u16 vq_size_max; /* the max size of virtqueue */
>>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
>>> +};
>>> +
>>> +struct vduse_iotlb_entry {
>>> +     __u64 offset; /* the mmap offset on fd */
>>> +     __u64 start; /* start of the IOVA range */
>>> +     __u64 last; /* last of the IOVA range */
>>> +#define VDUSE_ACCESS_RO 0x1
>>> +#define VDUSE_ACCESS_WO 0x2
>>> +#define VDUSE_ACCESS_RW 0x3
>>> +     __u8 perm; /* access permission of this range */
>>> +};
>>> +
>>> +struct vduse_vq_eventfd {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u32 fd; /* eventfd */
>>
>> Any reason for not using int here?
>>
> Will use __s32 instead.


Let's use "int" here, so -1 can be used for de-assigning the eventfd.
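
A userspace model of that convention, assuming fd == -1 means "de-assign" and any other non-negative value is an eventfd to install. The struct, the table and sketch_kickfd_setup() are hypothetical stand-ins and do not touch real file descriptors:

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_VQ_NUM 4
#define SKETCH_EVENTFD_UNASSIGN (-1)

/* Hypothetical variant of struct vduse_vq_eventfd with a signed fd. */
struct vq_eventfd_sketch {
	uint32_t index; /* virtqueue index */
	int fd;         /* eventfd, or -1 to de-assign */
};

/* Per-queue table of assigned fds; -1 means nothing is assigned. */
static int sketch_kickfd[SKETCH_VQ_NUM] = { -1, -1, -1, -1 };

/* Record or clear the kick eventfd of one virtqueue. */
static int sketch_kickfd_setup(const struct vq_eventfd_sketch *e)
{
	if (e->index >= SKETCH_VQ_NUM)
		return -EINVAL;
	if (e->fd < SKETCH_EVENTFD_UNASSIGN)
		return -EINVAL;

	/* Both assignment and de-assignment reduce to storing the value. */
	sketch_kickfd[e->index] = e->fd;
	return 0;
}
```

With an unsigned fd field there would be no in-band value left to express "remove the eventfd", which is the point of the suggestion.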


>
>>> +};
>>> +
>>> +#define VDUSE_BASE   0x81
>>> +
>>> +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
>>> +#define VDUSE_DESTROY_DEV    _IO(VDUSE_BASE, 0x02)
>>> +
>>> +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
>>> +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
>>> +#define VDUSE_VQ_SETUP_IRQFD _IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
>>
>> Better with documentation to explain those ioctls.
>>
> Will do it.
>
> Thanks,
> Yongji


Thanks

>



* Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-28  3:52             ` Yongji Xie
@ 2021-01-28  4:31               ` Jason Wang
  2021-01-28  6:08                 ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-28  4:31 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/28 11:52 AM, Yongji Xie wrote:
> On Thu, Jan 28, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/1/27 5:11 PM, Yongji Xie wrote:
>>> On Wed, Jan 27, 2021 at 11:38 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/1/20 2:52 PM, Yongji Xie wrote:
>>>>> On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
>>>>>>> Now we have a global percpu counter to limit the recursion depth
>>>>>>> of eventfd_signal(). This can avoid deadlock or stack overflow.
>>>>>>> But in stack overflow case, it should be OK to increase the
>>>>>>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
>>>>>>> to limit the recursion depth for deadlock case. Then it could be
>>>>>>> fine to increase the global percpu counter later.
>>>>>> I wonder whether it's worth introducing a percpu counter for each eventfd.
>>>>>>
>>>>>> How about simply checking whether eventfd_signal_count() is greater than 2?
>>>>>>
>>>>> It can't avoid deadlock in this way.
>>>> I may be missing something, but the count is there to avoid recursive
>>>> eventfd calls. So for VDUSE, what we suffer from is e.g. the interrupt
>>>> injection path:
>>>>
>>>> userspace write IRQFD -> vq->cb() -> another IRQFD.
>>>>
>>>> It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
>>>>
>>> Actually I mean the deadlock described in commit f0b493e ("io_uring:
>>> prevent potential eventfd recursion on poll"). Just increasing
>>> EVENTFD_WAKEUP_DEPTH could break that bug fix.
>>
>> Ok, so can we do something similar to that commit (using async stuff
>> like a wq)?
>>
> We can do that, but it will reduce performance, because the
> eventfd recursion will be triggered every time KVM kicks the eventfd
> in the vhost-vdpa case:
>
> KVM write KICKFD -> ops->kick_vq -> VDUSE write KICKFD
>
> Thanks,
> Yongji


Right, I think in the future we need to find a way to let KVM wake up
VDUSE directly.

I haven't thought about it deeply, but it might work like the irq bypass manager.

Thanks




* Re: Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-28  3:08             ` Jens Axboe
@ 2021-01-28  5:12               ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-28  5:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, bcrl, Jonathan Corbet, virtualization,
	netdev, kvm, linux-aio, linux-fsdevel

On Thu, Jan 28, 2021 at 11:08 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/27/21 8:04 PM, Jason Wang wrote:
> >
> > On 2021/1/27 5:11 PM, Yongji Xie wrote:
> >> On Wed, Jan 27, 2021 at 11:38 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>
> >>> On 2021/1/20 2:52 PM, Yongji Xie wrote:
> >>>> On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> >>>>>> Now we have a global percpu counter to limit the recursion depth
> >>>>>> of eventfd_signal(). This can avoid deadlock or stack overflow.
> >>>>>> But in stack overflow case, it should be OK to increase the
> >>>>>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
> >>>>>> to limit the recursion depth for deadlock case. Then it could be
> >>>>>> fine to increase the global percpu counter later.
> >>>>> I wonder whether it's worth introducing a percpu counter for each eventfd.
> >>>>>
> >>>>> How about simply checking whether eventfd_signal_count() is greater than 2?
> >>>>>
> >>>> It can't avoid the deadlock that way.
> >>>
> >>> I may be missing something, but the count is there to avoid recursive eventfd
> >>> calls. So for VDUSE, what we suffer from is e.g. the interrupt injection path:
> >>>
> >>> userspace write IRQFD -> vq->cb() -> another IRQFD.
> >>>
> >>> It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
> >>>
> >> Actually I mean the deadlock described in commit f0b493e ("io_uring:
> >> prevent potential eventfd recursion on poll"). Just increasing
> >> EVENTFD_WAKEUP_DEPTH could break that bug fix.
> >
> >
> > OK, so can we do something similar to that commit? (using async mechanisms
> > like a wq).
>
> io_uring should be fine in current kernels, but aio would still be
> affected by this. But just in terms of recursion, bumping it one more
> should probably still be fine.
>

OK, I see. It should be easy to avoid the A-A deadlock during coding.

Thanks,
Yongji
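
The recursion limit discussed in this subthread can be modelled in plain
userspace C. The sketch below is only an illustration under stated
assumptions: model_eventfd, model_signal(), model_forward() and
MODEL_WAKEUP_DEPTH are hypothetical names, a thread-local counter stands in
for the kernel's percpu counter, and no locking is modelled, so it shows
the depth check and not the deadlock itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed limit for this sketch only; not the kernel's value. */
#define MODEL_WAKEUP_DEPTH 1

/* Userspace stand-in for a per-CPU counter: one counter per thread. */
static _Thread_local unsigned int model_depth;

struct model_eventfd {
    unsigned long long count;                      /* delivered signals */
    void (*on_signal)(struct model_eventfd *self); /* wakeup callback, may signal again */
    struct model_eventfd *next;                    /* eventfd the callback forwards to */
};

/* Deliver one signal; refuse it when the nesting is already too deep. */
static bool model_signal(struct model_eventfd *efd)
{
    assert(efd != NULL);
    if (model_depth > MODEL_WAKEUP_DEPTH)
        return false;

    model_depth++;
    efd->count++;
    if (efd->on_signal)
        efd->on_signal(efd);
    model_depth--;
    return true;
}

/* Callback that forwards the wakeup to the next eventfd in the chain. */
static void model_forward(struct model_eventfd *self)
{
    if (self->next)
        model_signal(self->next);
}
```

With a chain A -> B -> C and a limit of 1, A and B are delivered and C is
refused. A two-level chain such as "KVM write KICKFD -> ops->kick_vq ->
VDUSE write KICKFD" therefore needs a limit that allows one nested signal.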


* Re: Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-28  4:27         ` Jason Wang
@ 2021-01-28  6:03           ` Yongji Xie
  2021-01-28  6:14             ` Jason Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-28  6:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Jan 28, 2021 at 12:27 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/27 4:50 PM, Yongji Xie wrote:
> > On Tue, Jan 26, 2021 at 4:09 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/1/19 1:07 PM, Xie Yongji wrote:
> >>> This VDUSE driver enables implementing vDPA devices in userspace.
> >>> Both control path and data path of vDPA devices will be able to
> >>> be handled in userspace.
> >>>
> >>> In the control path, the VDUSE driver will make use of a message
> >>> mechanism to forward config operations from the vdpa bus driver
> >>> to userspace. Userspace can use read()/write() to receive/reply
> >>> those control messages.
> >>>
> >>> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> >>> the file descriptors referring to vDPA device's iova regions. Then
> >>> userspace can use mmap() to access those iova regions. Besides,
> >>> the eventfd mechanism is used to trigger interrupt callbacks and
> >>> receive virtqueue kicks in userspace.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    Documentation/driver-api/vduse.rst                 |   85 ++
> >>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >>>    drivers/vdpa/Kconfig                               |    7 +
> >>>    drivers/vdpa/Makefile                              |    1 +
> >>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >>>    drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
> >>>    drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
> >>>    drivers/vdpa/vdpa_user/iova_domain.c               |  426 +++++++
> >>>    drivers/vdpa/vdpa_user/iova_domain.h               |   68 ++
> >>>    drivers/vdpa/vdpa_user/vduse.h                     |   62 +
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1217 ++++++++++++++++++++
> >>>    include/uapi/linux/vdpa.h                          |    1 +
> >>>    include/uapi/linux/vduse.h                         |  125 ++
> >>>    13 files changed, 2267 insertions(+)
> >>>    create mode 100644 Documentation/driver-api/vduse.rst
> >>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
> >>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >>>    create mode 100644 include/uapi/linux/vduse.h
> >>>
> >>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> >>> new file mode 100644
> >>> index 000000000000..9418a7f6646b
> >>> --- /dev/null
> >>> +++ b/Documentation/driver-api/vduse.rst
> >>> @@ -0,0 +1,85 @@
> >>> +==================================
> >>> +VDUSE - "vDPA Device in Userspace"
> >>> +==================================
> >>> +
> >>> +vDPA (virtio data path acceleration) device is a device that uses a
> >>> +datapath which complies with the virtio specifications with vendor
> >>> +specific control path. vDPA devices can be both physically located on
> >>> +the hardware or emulated by software. VDUSE is a framework that makes it
> >>> +possible to implement software-emulated vDPA devices in userspace.
> >>> +
> >>> +How VDUSE works
> >>> +---------------
> >>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> >>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> >>> +to the new resources will be returned, which can be used to implement the
> >>> +userspace vDPA device's control path and data path.
> >>> +
> >>> +To implement control path, the read/write operations to the file descriptor
> >>> +will be used to receive/reply the control messages from/to VDUSE driver.
> >>
> >> It's better to document the protocol here, e.g. the identifier stuff.
> >>
> > I have documented that in include/uapi/linux/vduse.h. Is that
> > OK? Or should I add something like "Please see include/uapi/linux/vduse.h
> > for details."?
>
>
> It might be better if we add some userspace sample code to demonstrate
> how the protocol works.
>

Makes sense to me.
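
For the eventfd half of the interface (the kickfd and irqfd described in
the quoted documentation), the userspace side comes down to 8-byte reads
and writes on an eventfd. The sketch below uses only eventfd(2), read(2)
and write(2); the notify_fd_*() helpers are hypothetical names, and it
deliberately does not open /dev/vduse or issue any VDUSE ioctl, since the
ioctl argument layout lives in include/uapi/linux/vduse.h and is not
reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a non-blocking eventfd with its counter at zero; -1 on error. */
static int notify_fd_create(void)
{
    return eventfd(0, EFD_NONBLOCK);
}

/* Producer side: add 1 to the counter, as a kick would. 0 on success. */
static int notify_fd_kick(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: fetch and reset the counter.
 * Returns the number of kicks coalesced since the last drain,
 * 0 if nothing is pending, or -1 on any other error.
 */
static int64_t notify_fd_drain(int fd)
{
    uint64_t count;

    assert(fd >= 0);
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return errno == EAGAIN ? 0 : -1;
    return (int64_t)count;
}
```

A daemon would typically poll() the kick eventfd and call something like
notify_fd_drain() before consuming the vring, because several kicks can
collapse into a single readable event.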

>
> >
> >>> +Those control messages are mostly based on the vdpa_config_ops which defines
> >>> +a unified interface to control different types of vDPA device.
> >>> +
> >>> +The following types of messages are provided by the VDUSE framework now:
> >>> +
> >>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
> >>
> >> "Set the vring address of a virtqueue" might be better here.
> >>
> > OK.
> >
> >>> +
> >>> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> >>> +
> >>> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> >>> +
> >>> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> >>> +
> >>> +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue
> >>> +
> >>> +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue
> >>
> >> It's better not to mention layout-specific details here (last_avail_idx),
> >> considering that we should support packed virtqueues in the future.
> >>
> > I see.
> >
> >>> +
> >>> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> >>> +
> >>> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> >>> +
> >>> +- VDUSE_SET_STATUS: Set the device status
> >>> +
> >>> +- VDUSE_GET_STATUS: Get the device status
> >>> +
> >>> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> >>> +
> >>> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> >>> +
> >>> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> >>> +
> >>> +Please see include/linux/vdpa.h for details.
> >>> +
> >>> +In the data path, vDPA device's iova regions will be mapped into userspace with
> >>> +the help of VDUSE_IOTLB_GET_FD ioctl on the userspace vDPA device fd:
> >>> +
> >>> +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> >>> +  access this iova region by passing the fd to mmap(2).
> >>> +
> >>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
> >>> +vDPA device fd are provided to support that:
> >>> +
> >>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> >>> +  by VDUSE driver to notify userspace to consume the vring.
> >>> +
> >>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> >>> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
> >>> +
> >>> +MMU-based IOMMU Driver
> >>> +----------------------
> >>> +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU
> >>> +driver to support mapping the kernel dma buffer into the userspace iova
> >>> +region dynamically.
> >>> +
> >>> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> >>> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> >>> +so that the userspace process is able to use its virtual address to access
> >>> +the dma buffer in kernel.
> >>> +
> >>> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> >>> +prevent userspace accessing the original buffer directly which may contain other
> >>> +kernel data. During the mapping, unmapping, the driver will copy the data from
> >>> +the original buffer to the bounce buffer and back, depending on the direction of
> >>> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> >>> +space instead of the original one.
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index a4c75a28c839..71722e6f8f23 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >>>    '|'   00-7F  linux/media.h
> >>>    0x80  00-1F  linux/fb.h
> >>> +0x81  00-1F  linux/vduse.h
> >>>    0x89  00-06  arch/x86/include/asm/sockios.h
> >>>    0x89  0B-DF  linux/sockios.h
> >>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> >>> index 4be7be39be26..667354309bf4 100644
> >>> --- a/drivers/vdpa/Kconfig
> >>> +++ b/drivers/vdpa/Kconfig
> >>> @@ -21,6 +21,13 @@ config VDPA_SIM
> >>>          to RX. This device is used for testing, prototyping and
> >>>          development of vDPA.
> >>>
> >>> +config VDPA_USER
> >>> +     tristate "VDUSE (vDPA Device in Userspace) support"
> >>> +     depends on EVENTFD && MMU && HAS_DMA
> >>
> >> This needs to select VHOST_IOTLB.
> >>
> > OK.
> >
> >>> +     help
> >>> +       With VDUSE it is possible to emulate a vDPA Device
> >>> +       in a userspace program.
> >>> +
> >>>    config IFCVF
> >>>        tristate "Intel IFC VF vDPA driver"
> >>>        depends on PCI_MSI
> >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> >>> index d160e9b63a66..66e97778ad03 100644
> >>> --- a/drivers/vdpa/Makefile
> >>> +++ b/drivers/vdpa/Makefile
> >>> @@ -1,5 +1,6 @@
> >>>    # SPDX-License-Identifier: GPL-2.0
> >>>    obj-$(CONFIG_VDPA) += vdpa.o
> >>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >>>    obj-$(CONFIG_IFCVF)    += ifcvf/
> >>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> >>> new file mode 100644
> >>> index 000000000000..b7645e36992b
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/Makefile
> >>> @@ -0,0 +1,5 @@
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
> >>> +
> >>> +obj-$(CONFIG_VDPA_USER) += vduse.o
> >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> >>> new file mode 100644
> >>> index 000000000000..dbffddb08908
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/eventfd.c
> >>> @@ -0,0 +1,221 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * Eventfd support for VDUSE
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/file.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +
> >>> +#include "eventfd.h"
> >>> +
> >>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> >>> +
> >>> +static void vduse_virqfd_shutdown(struct work_struct *work)
> >>> +{
> >>> +     u64 cnt;
> >>> +     struct vduse_virqfd *virqfd = container_of(work,
> >>> +                                     struct vduse_virqfd, shutdown);
> >>> +
> >>> +     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
> >>> +     flush_work(&virqfd->inject);
> >>> +     eventfd_ctx_put(virqfd->ctx);
> >>> +     kfree(virqfd);
> >>> +}
> >>> +
> >>> +static void vduse_virqfd_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(work,
> >>> +                                     struct vduse_virqfd, inject);
> >>> +     struct vduse_virtqueue *vq = virqfd->vq;
> >>> +
> >>> +     spin_lock_irq(&vq->irq_lock);
> >>> +     if (vq->ready && vq->cb)
> >>> +             vq->cb(vq->private);
> >>> +     spin_unlock_irq(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
> >>> +{
> >>> +     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
> >>> +}
> >>> +
> >>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> >>> +                             int sync, void *key)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
> >>> +     struct vduse_virtqueue *vq = virqfd->vq;
> >>> +
> >>> +     __poll_t flags = key_to_poll(key);
> >>> +
> >>> +     if (flags & EPOLLIN)
> >>> +             schedule_work(&virqfd->inject);
> >>> +
> >>> +     if (flags & EPOLLHUP) {
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             if (vq->virqfd == virqfd) {
> >>> +                     vq->virqfd = NULL;
> >>> +                     virqfd_deactivate(virqfd);
> >>> +             }
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
> >>> +                     wait_queue_head_t *wqh, poll_table *pt)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
> >>> +
> >>> +     add_wait_queue(wqh, &virqfd->wait);
> >>> +}
> >>> +
> >>> +int vduse_virqfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd;
> >>> +     struct fd irqfd;
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     __poll_t events;
> >>> +     int ret;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
> >>> +     if (!virqfd)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
> >>> +     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
> >>> +
> >>> +     ret = -EBADF;
> >>> +     irqfd = fdget(eventfd->fd);
> >>> +     if (!irqfd.file)
> >>> +             goto err_fd;
> >>> +
> >>> +     ctx = eventfd_ctx_fileget(irqfd.file);
> >>> +     if (IS_ERR(ctx)) {
> >>> +             ret = PTR_ERR(ctx);
> >>> +             goto err_ctx;
> >>> +     }
> >>> +
> >>> +     virqfd->vq = vq;
> >>> +     virqfd->ctx = ctx;
> >>> +     spin_lock(&vq->irq_lock);
> >>> +     if (vq->virqfd)
> >>> +             virqfd_deactivate(vq->virqfd);
> >>> +     vq->virqfd = virqfd;
> >>> +     spin_unlock(&vq->irq_lock);
> >>> +
> >>> +     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
> >>> +     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
> >>> +
> >>> +     events = vfs_poll(irqfd.file, &virqfd->pt);
> >>> +
> >>> +     /*
> >>> +      * Check if there was an event already pending on the eventfd
> >>> +      * before we registered and trigger it as if we didn't miss it.
> >>> +      */
> >>> +     if (events & EPOLLIN)
> >>> +             schedule_work(&virqfd->inject);
> >>> +
> >>> +     fdput(irqfd);
> >>> +
> >>> +     return 0;
> >>> +err_ctx:
> >>> +     fdput(irqfd);
> >>> +err_fd:
> >>> +     kfree(virqfd);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +void vduse_virqfd_release(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             if (vq->virqfd) {
> >>> +                     virqfd_deactivate(vq->virqfd);
> >>> +                     vq->virqfd = NULL;
> >>> +             }
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +     flush_workqueue(vduse_irqfd_cleanup_wq);
> >>> +}
> >>> +
> >>> +int vduse_virqfd_init(void)
> >>> +{
> >>> +     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
> >>> +                                             WQ_UNBOUND, 0);
> >>> +     if (!vduse_irqfd_cleanup_wq)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_virqfd_exit(void)
> >>> +{
> >>> +     destroy_workqueue(vduse_irqfd_cleanup_wq);
> >>> +}
> >>> +
> >>> +void vduse_vq_kick(struct vduse_virtqueue *vq)
> >>> +{
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->ready && vq->kickfd)
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +}
> >>> +
> >>> +int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     ctx = eventfd_ctx_fdget(eventfd->fd);
> >>> +     if (IS_ERR(ctx))
> >>> +             return PTR_ERR(ctx);
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->kickfd)
> >>> +             eventfd_ctx_put(vq->kickfd);
> >>> +     vq->kickfd = ctx;
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_kickfd_release(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->kick_lock);
> >>> +             if (vq->kickfd) {
> >>> +                     eventfd_ctx_put(vq->kickfd);
> >>> +                     vq->kickfd = NULL;
> >>> +             }
> >>> +             spin_unlock(&vq->kick_lock);
> >>> +     }
> >>> +}
> >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
> >>> new file mode 100644
> >>> index 000000000000..14269ff27f47
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/eventfd.h
> >>> @@ -0,0 +1,48 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>> +/*
> >>> + * Eventfd support for VDUSE
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#ifndef _VDUSE_EVENTFD_H
> >>> +#define _VDUSE_EVENTFD_H
> >>> +
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/wait.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +
> >>> +#include "vduse.h"
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_virqfd {
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     struct work_struct inject;
> >>> +     struct work_struct shutdown;
> >>> +     wait_queue_entry_t wait;
> >>> +     poll_table pt;
> >>> +};
> >>> +
> >>> +int vduse_virqfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd);
> >>> +
> >>> +void vduse_virqfd_release(struct vduse_dev *dev);
> >>> +
> >>> +int vduse_virqfd_init(void);
> >>> +
> >>> +void vduse_virqfd_exit(void);
> >>> +
> >>> +void vduse_vq_kick(struct vduse_virtqueue *vq);
> >>> +
> >>> +int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd);
> >>> +
> >>> +void vduse_kickfd_release(struct vduse_dev *dev);
> >>> +
> >>> +#endif /* _VDUSE_EVENTFD_H */
> >>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> new file mode 100644
> >>> index 000000000000..cdfef8e9f9d6
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> @@ -0,0 +1,426 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * MMU-based IOMMU implementation
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/slab.h>
> >>> +#include <linux/file.h>
> >>> +#include <linux/anon_inodes.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define IOVA_START_PFN 1
> >>> +#define IOVA_ALLOC_ORDER 12
> >>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> >>
> >> Can this work for all archs (e.g. why not use PAGE_SIZE)?
> >>
> > It can work for all archs. Using IOVA_ALLOC_SIZE might save some space
> > in some cases/archs (e.g. PAGE_SIZE = 64K) when we have lots of
> > small-size I/Os.
>
>
> OK, if I understand correctly, you want to share pages among those
> small I/Os.
>

Correct.
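
The space argument can be checked with a little arithmetic. The sketch
below is an illustration only: align_up() and iova_space_used() are
hypothetical helpers, and they model just the alignment of each allocation
to the IOVA granule. The quoted vduse_domain_alloc_iova() additionally
rounds small allocations up to a power-of-two number of granules, which
does not change the result for sub-granule I/Os.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to a multiple of granule; granule must be a power of two. */
static size_t align_up(size_t size, size_t granule)
{
    assert(granule != 0 && (granule & (granule - 1)) == 0);
    return (size + granule - 1) & ~(granule - 1);
}

/* IOVA space consumed by n I/Os of io_size bytes, one allocation each. */
static size_t iova_space_used(size_t n, size_t io_size, size_t granule)
{
    return n * align_up(io_size, granule);
}
```

For sixteen 512-byte I/Os, a 4 KiB granule reserves 64 KiB of IOVA space,
which fits in a single 64 KiB bounce page, while a 64 KiB granule reserves
1 MiB, i.e. one bounce page per I/O.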

>
> >
> >>> +
> >>> +#define CONSISTENT_DMA_SIZE (1024 * 1024 * 1024)
> >>> +
> >>> +static inline struct page *
> >>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long index = iova >> PAGE_SHIFT;
> >>> +
> >>> +     return domain->bounce_pages[index];
> >>> +}
> >>> +
> >>> +static inline void
> >>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, struct page *page)
> >>> +{
> >>> +     unsigned long index = iova >> PAGE_SHIFT;
> >>> +
> >>> +     domain->bounce_pages[index] = page;
> >>> +}
> >>> +
> >>> +static struct vduse_iova_map *
> >>> +vduse_domain_alloc_iova_map(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova, unsigned long orig,
> >>> +                     size_t size, enum dma_data_direction dir)
> >>> +{
> >>> +     struct vduse_iova_map *map;
> >>> +
> >>> +     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
> >>> +     if (!map)
> >>> +             return NULL;
> >>> +
> >>> +     map->iova.start = iova;
> >>> +     map->iova.last = iova + size - 1;
> >>> +     map->orig = orig;
> >>> +     map->size = size;
> >>> +     map->dir = dir;
> >>> +
> >>> +     return map;
> >>> +}
> >>> +
> >>> +static struct page *
> >>> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long start = iova & PAGE_MASK;
> >>> +     unsigned long last = start + PAGE_SIZE - 1;
> >>> +     struct vduse_iova_map *map;
> >>> +     struct interval_tree_node *node;
> >>> +     struct page *page = NULL;
> >>> +
> >>> +     spin_lock(&domain->map_lock);
> >>> +     node = interval_tree_iter_first(&domain->mappings, start, last);
> >>> +     if (!node)
> >>> +             goto out;
> >>> +
> >>> +     map = container_of(node, struct vduse_iova_map, iova);
> >>> +     page = virt_to_page(map->orig + iova - map->iova.start);
> >>> +     get_page(page);
> >>> +out:
> >>> +     spin_unlock(&domain->map_lock);
> >>> +
> >>> +     return page;
> >>> +}
> >>> +
> >>> +static struct page *
> >>> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long start = iova & PAGE_MASK;
> >>> +     unsigned long last = start + PAGE_SIZE - 1;
> >>> +     struct vduse_iova_map *map;
> >>> +     struct interval_tree_node *node;
> >>> +     struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
> >>> +
> >>> +     if (!new_page)
> >>> +             return NULL;
> >>> +
> >>> +     spin_lock(&domain->map_lock);
> >>> +     node = interval_tree_iter_first(&domain->mappings, start, last);
> >>> +     if (!node) {
> >>> +             __free_page(new_page);
> >>> +             goto out;
> >>> +     }
> >>> +     page = vduse_domain_get_bounce_page(domain, iova);
> >>> +     if (page) {
> >>> +             get_page(page);
> >>> +             __free_page(new_page);
> >>
> >> Let's delay the allocation of new_page until it is really required.
> > If so, we need to allocate the page in atomic context.
>
>
> Right, I see.
>
>
> >>> +             goto out;
> >>> +     }
> >>> +     vduse_domain_set_bounce_page(domain, iova, new_page);
> >>> +     get_page(new_page);
> >>> +     page = new_page;
> >>> +
> >>> +     while (node) {
> >>
> >> I may be missing something, but in which case do we need the loop here?
> >>
> > When IOVA_ALLOC_SIZE != PAGE_SIZE
> >
> >>> +             unsigned int src_offset = 0, dst_offset = 0;
> >>> +             void *src, *dst;
> >>> +             size_t copy_len;
> >>> +
> >>> +             map = container_of(node, struct vduse_iova_map, iova);
> >>> +             node = interval_tree_iter_next(node, start, last);
> >>> +             if (map->dir == DMA_FROM_DEVICE)
> >>> +                     continue;
> >>> +
> >>> +             if (start > map->iova.start)
> >>> +                     src_offset = start - map->iova.start;
> >>> +             else
> >>> +                     dst_offset = map->iova.start - start;
> >>> +
> >>> +             src = (void *)(map->orig + src_offset);
> >>> +             dst = page_address(page) + dst_offset;
> >>> +             copy_len = min_t(size_t, map->size - src_offset,
> >>> +                             PAGE_SIZE - dst_offset);
> >>> +             memcpy(dst, src, copy_len);
> >>> +     }
> >>> +out:
> >>> +     spin_unlock(&domain->map_lock);
> >>> +
> >>> +     return page;
> >>> +}
> >>> +
> >>> +static void
> >>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, size_t size)
> >>> +{
> >>> +     struct page *page;
> >>> +     struct interval_tree_node *node;
> >>> +     unsigned long last = iova + size - 1;
> >>> +
> >>> +     spin_lock(&domain->map_lock);
> >>> +     node = interval_tree_iter_first(&domain->mappings, iova, last);
> >>> +     if (WARN_ON(node))
> >>> +             goto out;
> >>> +
> >>> +     while (size > 0) {
> >>> +             page = vduse_domain_get_bounce_page(domain, iova);
> >>> +             if (page) {
> >>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
> >>> +                     __free_page(page);
> >>> +             }
> >>> +             size -= PAGE_SIZE;
> >>> +             iova += PAGE_SIZE;
> >>> +     }
> >>> +out:
> >>> +     spin_unlock(&domain->map_lock);
> >>> +}
> >>> +
> >>> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, unsigned long orig,
> >>> +                             size_t size, enum dma_data_direction dir)
> >>> +{
> >>> +     unsigned int offset = offset_in_page(iova);
> >>> +
> >>> +     while (size) {
> >>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
> >>> +             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
> >>> +             void *addr;
> >>> +
> >>> +             WARN_ON(!p && dir == DMA_FROM_DEVICE);
> >>> +
> >>> +             if (p) {
> >>> +                     addr = page_address(p) + offset;
> >>> +                     if (dir == DMA_TO_DEVICE)
> >>> +                             memcpy(addr, (void *)orig, copy_len);
> >>> +                     else if (dir == DMA_FROM_DEVICE)
> >>> +                             memcpy((void *)orig, addr, copy_len);
> >>> +             }
> >>> +
> >>> +             size -= copy_len;
> >>> +             orig += copy_len;
> >>> +             iova += copy_len;
> >>> +             offset = 0;
> >>> +     }
> >>> +}
> >>> +
> >>> +static unsigned long vduse_domain_alloc_iova(struct iova_domain *iovad,
> >>> +                             unsigned long size, unsigned long limit)
> >>> +{
> >>> +     unsigned long shift = iova_shift(iovad);
> >>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> >>> +     unsigned long iova_pfn;
> >>> +
> >>> +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> >>> +             iova_len = roundup_pow_of_two(iova_len);
> >>> +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> >>> +
> >>> +     return iova_pfn << shift;
> >>> +}
> >>> +
> >>> +static void vduse_domain_free_iova(struct iova_domain *iovad,
> >>> +                             unsigned long iova, size_t size)
> >>> +{
> >>> +     unsigned long shift = iova_shift(iovad);
> >>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> >>> +
> >>> +     free_iova_fast(iovad, iova >> shift, iova_len);
> >>> +}
> >>> +
> >>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> >>> +                             struct page *page, unsigned long offset,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->stream_iovad;
> >>> +     unsigned long limit = domain->bounce_size - 1;
> >>> +     unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
> >>> +     unsigned long orig = (unsigned long)page_address(page) + offset;
> >>> +     struct vduse_iova_map *map;
> >>> +
> >>> +     if (!iova)
> >>> +             return DMA_MAPPING_ERROR;
> >>> +
> >>> +     map = vduse_domain_alloc_iova_map(domain, iova, orig, size, dir);
> >>> +     if (!map) {
> >>> +             vduse_domain_free_iova(iovad, iova, size);
> >>> +             return DMA_MAPPING_ERROR;
> >>> +     }
> >>> +
> >>> +     spin_lock(&domain->map_lock);
> >>> +     interval_tree_insert(&map->iova, &domain->mappings);
> >>> +     spin_unlock(&domain->map_lock);
> >>> +
> >>> +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> >>> +             vduse_domain_bounce(domain, iova, orig, size, DMA_TO_DEVICE);
> >>> +
> >>> +     return (dma_addr_t)iova;
> >>> +}
> >>> +
> >>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> >>> +                     dma_addr_t dma_addr, size_t size,
> >>> +                     enum dma_data_direction dir, unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->stream_iovad;
> >>> +     unsigned long iova = (unsigned long)dma_addr;
> >>> +     struct interval_tree_node *node;
> >>> +     struct vduse_iova_map *map;
> >>> +
> >>> +     spin_lock(&domain->map_lock);
> >>> +     node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
> >>> +     if (WARN_ON(!node)) {
> >>> +             spin_unlock(&domain->map_lock);
> >>> +             return;
> >>> +     }
> >>> +     interval_tree_remove(node, &domain->mappings);
> >>> +     spin_unlock(&domain->map_lock);
> >>> +
> >>> +     map = container_of(node, struct vduse_iova_map, iova);
> >>> +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> >>> +             vduse_domain_bounce(domain, iova, map->orig,
> >>> +                                     size, DMA_FROM_DEVICE);
> >>> +     vduse_domain_free_iova(iovad, iova, size);
> >>> +     kfree(map);
> >>> +}
> >>> +
> >>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> >>> +                             size_t size, dma_addr_t *dma_addr,
> >>> +                             gfp_t flag, unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->consistent_iovad;
> >>> +     unsigned long limit = domain->bounce_size + CONSISTENT_DMA_SIZE - 1;
> >>> +     unsigned long iova = vduse_domain_alloc_iova(iovad, size, limit);
> >>> +     void *orig = alloc_pages_exact(size, flag);
> >>> +     struct vduse_iova_map *map;
> >>> +
> >>> +     if (!iova || !orig)
> >>> +             goto err;
> >>> +
> >>> +     map = vduse_domain_alloc_iova_map(domain, iova, (unsigned long)orig,
> >>> +                                     size, DMA_BIDIRECTIONAL);
> >>> +     if (!map)
> >>> +             goto err;
> >>> +
> >>> +     spin_lock(&domain->map_lock);
> >>> +     interval_tree_insert(&map->iova, &domain->mappings);
> >>> +     spin_unlock(&domain->map_lock);
> >>> +     *dma_addr = (dma_addr_t)iova;
> >>> +
> >>> +     return orig;
> >>> +err:
> >>> +     *dma_addr = DMA_MAPPING_ERROR;
> >>> +     if (orig)
> >>> +             free_pages_exact(orig, size);
> >>> +     if (iova)
> >>> +             vduse_domain_free_iova(iovad, iova, size);
> >>> +
> >>> +     return NULL;
> >>> +}
> >>> +
> >>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> >>> +                             void *vaddr, dma_addr_t dma_addr,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->consistent_iovad;
> >>> +     unsigned long iova = (unsigned long)dma_addr;
> >>> +     struct interval_tree_node *node;
> >>> +     struct vduse_iova_map *map;
> >>> +
> >>> +     spin_lock(&domain->map_lock);
> >>> +     node = interval_tree_iter_first(&domain->mappings, iova, iova + 1);
> >>> +     if (WARN_ON(!node)) {
> >>> +             spin_unlock(&domain->map_lock);
> >>> +             return;
> >>> +     }
> >>> +     interval_tree_remove(node, &domain->mappings);
> >>> +     spin_unlock(&domain->map_lock);
> >>> +
> >>> +     map = container_of(node, struct vduse_iova_map, iova);
> >>> +     vduse_domain_free_iova(iovad, iova, size);
> >>> +     free_pages_exact(vaddr, size);
> >>> +     kfree(map);
> >>> +}
> >>> +
> >>> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> >>> +{
> >>> +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> >>> +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> >>> +     struct page *page;
> >>> +
> >>> +     if (!domain)
> >>> +             return VM_FAULT_SIGBUS;
> >>> +
> >>> +     if (iova < domain->bounce_size)
> >>> +             page = vduse_domain_alloc_bounce_page(domain, iova);
> >>> +     else
> >>> +             page = vduse_domain_get_mapping_page(domain, iova);
> >>> +
> >>> +     if (!page)
> >>> +             return VM_FAULT_SIGBUS;
> >>> +
> >>> +     vmf->page = page;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> >>> +     .fault = vduse_domain_mmap_fault,
> >>> +};
> >>> +
> >>> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> >>> +{
> >>> +     struct vduse_iova_domain *domain = file->private_data;
> >>> +
> >>> +     vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
> >>> +     vma->vm_private_data = domain;
> >>> +     vma->vm_ops = &vduse_domain_mmap_ops;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_domain_release(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_iova_domain *domain = file->private_data;
> >>> +
> >>> +     vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
> >>> +     put_iova_domain(&domain->stream_iovad);
> >>> +     put_iova_domain(&domain->consistent_iovad);
> >>> +     vfree(domain->bounce_pages);
> >>> +     kfree(domain);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_domain_fops = {
> >>> +     .mmap = vduse_domain_mmap,
> >>> +     .release = vduse_domain_release,
> >>> +};
> >>
> >> It's better to explain the reason for introducing a dedicated file for
> >> mmap() here.
> >>
> > To make the implementation of iova_domain independent of vduse_dev.
>
>
> My understanding is that the only uses for this are to:
>
> 1) support different types of iova mappings, or
> 2) switch between iova domain mappings.
>
> But I can't think of a need for this.
>

For example, to share one iova_domain between several vduse devices.

It will also help if we want to split this patch into an iova domain
part and a vduse device part, because the page fault handler should be
paired with dma_map/dma_unmap.
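
The pairing can be pictured with a small userspace model. This is only a
sketch under stated assumptions: the names bounce_map, bounce_map_buf and
bounce_unmap_buf are hypothetical and do not appear in the patch; the copy
directions mirror vduse_domain_map_page()/vduse_domain_unmap_page() as
quoted above. The bounce buffer is the memory the fault handler would hand
to the daemon, so whoever owns map/unmap also has to own the fault path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Named after the kernel's enum dma_data_direction, values are arbitrary. */
enum dma_dir { DIR_BIDIRECTIONAL, DIR_TO_DEVICE, DIR_FROM_DEVICE };

/* Hypothetical stand-in for one mapping: driver buffer plus bounce copy. */
struct bounce_map {
	unsigned char *orig;   /* the driver's own buffer */
	unsigned char *bounce; /* region the daemon would see through mmap() */
	size_t size;
	enum dma_dir dir;
};

/* "map": publish the data to the bounce region if the device will read it. */
static void bounce_map_buf(struct bounce_map *m)
{
	assert(m->orig && m->bounce);
	if (m->dir == DIR_TO_DEVICE || m->dir == DIR_BIDIRECTIONAL)
		memcpy(m->bounce, m->orig, m->size);
}

/* "unmap": copy back whatever the device wrote into the bounce region. */
static void bounce_unmap_buf(struct bounce_map *m)
{
	assert(m->orig && m->bounce);
	if (m->dir == DIR_FROM_DEVICE || m->dir == DIR_BIDIRECTIONAL)
		memcpy(m->orig, m->bounce, m->size);
}
```

A DIR_TO_DEVICE mapping is never copied back on unmap, which is why the
direction has to be remembered per mapping.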

>
> >
> >>> +
> >>> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> >>> +{
> >>> +     fput(domain->file);
> >>> +}
> >>> +
> >>> +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size)
> >>> +{
> >>> +     struct vduse_iova_domain *domain;
> >>> +     struct file *file;
> >>> +     unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> >>> +
> >>> +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> >>> +     if (!domain)
> >>> +             return NULL;
> >>> +
> >>> +     domain->bounce_size = PAGE_ALIGN(bounce_size);
> >>> +     domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
> >>> +     if (!domain->bounce_pages)
> >>> +             goto err_page;
> >>> +
> >>> +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> >>> +                             domain, O_RDWR);
> >>> +     if (IS_ERR(file))
> >>> +             goto err_file;
> >>> +
> >>> +     domain->file = file;
> >>> +     spin_lock_init(&domain->map_lock);
> >>> +     domain->mappings = RB_ROOT_CACHED;
> >>> +     init_iova_domain(&domain->stream_iovad,
> >>> +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
> >>> +     init_iova_domain(&domain->consistent_iovad,
> >>> +                     PAGE_SIZE, bounce_pfns);
> >>> +
> >>> +     return domain;
> >>> +err_file:
> >>> +     vfree(domain->bounce_pages);
> >>> +err_page:
> >>> +     kfree(domain);
> >>> +     return NULL;
> >>> +}
> >>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> >>> new file mode 100644
> >>> index 000000000000..cc61866acb56
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> >>> @@ -0,0 +1,68 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>> +/*
> >>> + * MMU-based IOMMU implementation
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#ifndef _VDUSE_IOVA_DOMAIN_H
> >>> +#define _VDUSE_IOVA_DOMAIN_H
> >>> +
> >>> +#include <linux/iova.h>
> >>> +#include <linux/interval_tree.h>
> >>> +#include <linux/dma-mapping.h>
> >>> +
> >>> +struct vduse_iova_map {
> >>> +     struct interval_tree_node iova;
> >>> +     unsigned long orig;
> >>
> >> Need a better name, probably "va"?
> >>
> > Fine.
> >
> >>> +     size_t size;
> >>> +     enum dma_data_direction dir;
> >>> +};
> >>> +
> >>> +struct vduse_iova_domain {
> >>> +     struct iova_domain stream_iovad;
> >>> +     struct iova_domain consistent_iovad;
> >>> +     struct page **bounce_pages;
> >>> +     size_t bounce_size;
> >>> +     struct rb_root_cached mappings;
> >>
> >> We have the IOTLB; any reason for these extra mappings here?
> >>
> > It is used to store the iova <-> vduse_iova_map (vaddr, size, dir)
> > mapping. We need it to know how to do DMA bouncing during DMA
> > unmapping.
>
>
> Right, so what I meant is: considering that we support opaque data in the
> vhost IOTLB, it looks to me like we can piggyback the "orig" in the opaque.
> Then there's no need for an extra interval tree?
>

OK, I see. Will do it in v4.
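
A rough userspace sketch of that idea, under stated assumptions: the toy_*
names below are hypothetical, the table is a linear array rather than the
interval tree the real vhost IOTLB uses, and only the start/last/opaque
field names are borrowed from struct vhost_iotlb_map. The point is just
that the per-mapping bounce data can ride in the opaque pointer, so a
lookup by iova returns everything needed at unmap time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy IOTLB entry; field names loosely follow struct vhost_iotlb_map. */
struct toy_iotlb_map {
	uint64_t start;
	uint64_t last;  /* inclusive end of the range */
	void *opaque;   /* here: points at the per-mapping bounce info */
};

/* Hypothetical data that would be carried in the opaque pointer. */
struct toy_bounce_info {
	unsigned long orig; /* original kernel virtual address */
	int dir;            /* DMA direction recorded at map time */
};

#define TOY_IOTLB_MAX 8

struct toy_iotlb {
	struct toy_iotlb_map maps[TOY_IOTLB_MAX];
	size_t nmaps;
};

/* Record [start, last] together with its opaque payload. */
static int toy_iotlb_add(struct toy_iotlb *tlb, uint64_t start,
			 uint64_t last, void *opaque)
{
	assert(start <= last);
	if (tlb->nmaps == TOY_IOTLB_MAX)
		return -1;
	tlb->maps[tlb->nmaps++] = (struct toy_iotlb_map){ start, last, opaque };
	return 0;
}

/* Return the bounce info of the range containing iova, or NULL. */
static struct toy_bounce_info *toy_iotlb_lookup(const struct toy_iotlb *tlb,
						uint64_t iova)
{
	for (size_t i = 0; i < tlb->nmaps; i++) {
		const struct toy_iotlb_map *m = &tlb->maps[i];

		if (iova >= m->start && iova <= m->last)
			return m->opaque;
	}
	return NULL;
}
```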

>
> >
> >>> +     spinlock_t map_lock;
> >>> +     struct file *file;
> >>> +};
> >>> +
> >>> +static inline struct file *
> >>> +vduse_domain_file(struct vduse_iova_domain *domain)
> >>> +{
> >>> +     return domain->file;
> >>> +}
> >>> +
> >>> +static inline unsigned long
> >>> +vduse_domain_get_offset(struct vduse_iova_domain *domain, unsigned long iova)
> >>> +{
> >>> +     return iova;
> >>> +}
> >>> +
> >>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> >>> +                             struct page *page, unsigned long offset,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs);
> >>> +
> >>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> >>> +                     dma_addr_t dma_addr, size_t size,
> >>> +                     enum dma_data_direction dir, unsigned long attrs);
> >>> +
> >>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> >>> +                             size_t size, dma_addr_t *dma_addr,
> >>> +                             gfp_t flag, unsigned long attrs);
> >>> +
> >>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> >>> +                             void *vaddr, dma_addr_t dma_addr,
> >>> +                             unsigned long attrs);
> >>> +
> >>> +void vduse_domain_destroy(struct vduse_iova_domain *domain);
> >>> +
> >>> +struct vduse_iova_domain *vduse_domain_create(size_t bounce_size);
> >>> +
> >>> +#endif /* _VDUSE_IOVA_DOMAIN_H */
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse.h b/drivers/vdpa/vdpa_user/vduse.h
> >>> new file mode 100644
> >>> index 000000000000..3566d229382e
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/vduse.h
> >>> @@ -0,0 +1,62 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>> +/*
> >>> + * VDUSE: vDPA Device in Userspace
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#ifndef _VDUSE_H
> >>> +#define _VDUSE_H
> >>> +
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/vdpa.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +#include "eventfd.h"
> >>> +
> >>> +struct vduse_virtqueue {
> >>> +     u16 index;
> >>> +     bool ready;
> >>> +     spinlock_t kick_lock;
> >>> +     spinlock_t irq_lock;
> >>> +     struct eventfd_ctx *kickfd;
> >>> +     struct vduse_virqfd *virqfd;
> >>> +     void *private;
> >>> +     irqreturn_t (*cb)(void *data);
> >>> +};
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_vdpa {
> >>> +     struct vdpa_device vdpa;
> >>> +     struct vduse_dev *dev;
> >>> +};
> >>> +
> >>> +struct vduse_dev {
> >>> +     struct vduse_vdpa *vdev;
> >>> +     struct mutex lock;
> >>> +     struct vduse_virtqueue *vqs;
> >>> +     struct vduse_iova_domain *domain;
> >>> +     struct vhost_iotlb *iommu;
> >>> +     spinlock_t iommu_lock;
> >>> +     atomic_t bounce_map;
> >>> +     spinlock_t msg_lock;
> >>> +     atomic64_t msg_unique;
> >>> +     wait_queue_head_t waitq;
> >>> +     struct list_head send_list;
> >>> +     struct list_head recv_list;
> >>> +     struct list_head list;
> >>> +     bool connected;
> >>> +     u32 id;
> >>> +     u16 vq_size_max;
> >>> +     u16 vq_num;
> >>> +     u32 vq_align;
> >>> +     u32 device_id;
> >>> +     u32 vendor_id;
> >>> +};
> >>> +
> >>> +#endif /* _VDUSE_H_ */
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> new file mode 100644
> >>> index 000000000000..1cf759bc5914
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> @@ -0,0 +1,1217 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * VDUSE: vDPA Device in Userspace
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/init.h>
> >>> +#include <linux/module.h>
> >>> +#include <linux/miscdevice.h>
> >>> +#include <linux/device.h>
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/dma-map-ops.h>
> >>> +#include <linux/anon_inodes.h>
> >>> +#include <linux/file.h>
> >>> +#include <linux/uio.h>
> >>> +#include <linux/vdpa.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +#include <uapi/linux/vdpa.h>
> >>> +#include <uapi/linux/virtio_config.h>
> >>> +#include <linux/mod_devicetable.h>
> >>> +
> >>> +#include "vduse.h"
> >>> +
> >>> +#define DRV_VERSION  "1.0"
> >>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> >>> +#define DRV_DESC     "vDPA Device in Userspace"
> >>> +#define DRV_LICENSE  "GPL v2"
> >>> +
> >>> +struct vduse_dev_msg {
> >>> +     struct vduse_dev_request req;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct list_head list;
> >>> +     wait_queue_head_t waitq;
> >>> +     bool completed;
> >>> +     refcount_t refcnt;
> >>
> >> The reference count here will bring extra complexity. I think we can
> >> sync through msg_lock.
> >>
> > Do you mean using wait_event_interruptible_locked() and
> > wake_up_locked()? I think it works.
>
>
> Right.
>

Looks like msg_lock is also OK. I will try using it first.
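
As a userspace analogue of syncing through the lock (a sketch only, with
pthread primitives standing in for the kernel's
wait_event_interruptible_locked()/wake_up_locked(); all toy_* names are
hypothetical): the completer fills in the result and sets the flag while
holding the lock, and the waiter re-checks the flag under the same lock.
The waiter cannot return, and so cannot free the message, until the
completer has dropped the lock, and the completer never touches the
message after unlocking. Under those assumptions neither the reference
count nor the explicit smp_wmb()/smp_rmb() pair is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct toy_msg {
	int result;
	bool completed;
};

struct toy_dev {
	pthread_mutex_t msg_lock; /* protects every toy_msg of this device */
	pthread_cond_t waitq;     /* signalled whenever a message completes */
};

static void toy_dev_init(struct toy_dev *dev)
{
	assert(dev);
	pthread_mutex_init(&dev->msg_lock, NULL);
	pthread_cond_init(&dev->waitq, NULL);
}

/* Completer side: publish the result and wake waiters under the lock. */
static void toy_msg_complete(struct toy_dev *dev, struct toy_msg *msg,
			     int result)
{
	pthread_mutex_lock(&dev->msg_lock);
	msg->result = result;
	msg->completed = true;
	pthread_cond_broadcast(&dev->waitq);
	pthread_mutex_unlock(&dev->msg_lock);
	/* msg must not be touched past this point; the waiter may free it. */
}

/* Waiter side: sleep until completed, then read the result under the lock. */
static int toy_msg_wait(struct toy_dev *dev, struct toy_msg *msg)
{
	int result;

	pthread_mutex_lock(&dev->msg_lock);
	while (!msg->completed)
		pthread_cond_wait(&dev->waitq, &dev->msg_lock);
	result = msg->result;
	pthread_mutex_unlock(&dev->msg_lock);

	return result;
}
```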

>
> >
> >>
> >>> +};
> >>> +
> >>> +static struct workqueue_struct *vduse_vdpa_wq;
> >>> +static DEFINE_MUTEX(vduse_lock);
> >>> +static LIST_HEAD(vduse_devs);
> >>> +
> >>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> >>> +
> >>> +     return vdev->dev;
> >>> +}
> >>> +
> >>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> >>> +{
> >>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> >>> +
> >>> +     return vdpa_to_vduse(vdpa);
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_dev_new_msg(struct vduse_dev *dev, int type)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> >>> +                                     GFP_KERNEL | __GFP_NOFAIL);
> >>> +
> >>> +     msg->req.type = type;
> >>> +     msg->req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>
> >> This doesn't look safe; let's use idr here.
> >>
> > Could you give more details? It looks like idr should not be used in
> > this case, which cannot tolerate failure. And using a list to store the
> > msg is better than using idr when the msg needs to be re-inserted in
> > some cases.
>
>
> My understanding is that the "unique" (it probably needs a better name)
> is a token used to uniquely identify a message. The reply from userspace
> must be written with exactly the same token (unique). IDR seems better,
> but considering that we can hardly hit a 64-bit overflow, atomic might be
> OK as well.
>
> Btw, in what case do we need to "re-insert" a message?
>

When the userspace daemon receives a message but crashes before replying to it.
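
A minimal sketch of that flow, assuming a plain singly linked queue in
place of the kernel's list_head and with hypothetical toy_* names: a
message moves from the send queue to the recv queue when the daemon reads
it, is removed from the recv queue when a reply with the matching token
arrives, and anything still on the recv queue is put back at the front of
the send queue if the daemon goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_msg {
	uint64_t unique;      /* token echoed back by the daemon's reply */
	struct toy_msg *next;
};

struct toy_queue {
	struct toy_msg *head;
	struct toy_msg *tail;
};

static void toy_enqueue(struct toy_queue *q, struct toy_msg *msg)
{
	msg->next = NULL;
	if (q->tail)
		q->tail->next = msg;
	else
		q->head = msg;
	q->tail = msg;
}

static struct toy_msg *toy_dequeue(struct toy_queue *q)
{
	struct toy_msg *msg = q->head;

	if (!msg)
		return NULL;
	q->head = msg->next;
	if (!q->head)
		q->tail = NULL;
	msg->next = NULL;
	return msg;
}

/* Unlink and return the message whose token matches, or NULL. */
static struct toy_msg *toy_find_remove(struct toy_queue *q, uint64_t unique)
{
	struct toy_msg *prev = NULL, *cur;

	for (cur = q->head; cur; prev = cur, cur = cur->next) {
		if (cur->unique != unique)
			continue;
		if (prev)
			prev->next = cur->next;
		else
			q->head = cur->next;
		if (q->tail == cur)
			q->tail = prev;
		cur->next = NULL;
		return cur;
	}
	return NULL;
}

/* Daemon went away: move unanswered requests to the front of the send queue. */
static void toy_requeue_unanswered(struct toy_queue *send,
				   struct toy_queue *recv)
{
	if (!recv->head)
		return;
	assert(recv->tail);
	recv->tail->next = send->head;
	if (!send->tail)
		send->tail = recv->tail;
	send->head = recv->head;
	recv->head = recv->tail = NULL;
}
```

Putting the unanswered requests in front keeps the original submission
order for a restarted daemon; that ordering is a design choice of this
sketch, not something the patch specifies.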

>
> >
> >>> +     init_waitqueue_head(&msg->waitq);
> >>> +     refcount_set(&msg->refcnt, 1);
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static void vduse_dev_msg_get(struct vduse_dev_msg *msg)
> >>> +{
> >>> +     refcount_inc(&msg->refcnt);
> >>> +}
> >>> +
> >>> +static void vduse_dev_msg_put(struct vduse_dev_msg *msg)
> >>> +{
> >>> +     if (refcount_dec_and_test(&msg->refcnt))
> >>> +             kfree(msg);
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_dev_find_msg(struct vduse_dev *dev,
> >>> +                                             struct list_head *head,
> >>> +                                             uint32_t unique)
> >>> +{
> >>> +     struct vduse_dev_msg *tmp, *msg = NULL;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     list_for_each_entry(tmp, head, list) {
> >>> +             if (tmp->req.unique == unique) {
> >>> +                     msg = tmp;
> >>> +                     list_del(&tmp->list);
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_dev_dequeue_msg(struct vduse_dev *dev,
> >>> +                                             struct list_head *head)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = NULL;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     if (!list_empty(head)) {
> >>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> >>> +             list_del(&msg->list);
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static void vduse_dev_enqueue_msg(struct vduse_dev *dev,
> >>> +                     struct vduse_dev_msg *msg, struct list_head *head)
> >>> +{
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     list_add_tail(&msg->list, head);
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +}
> >>> +
> >>> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> >>> +     wake_up(&dev->waitq);
> >>> +     wait_event(msg->waitq, msg->completed);
> >>
> >> This is an uninterruptible wait, which means that if userspace forgets
> >> to process the command, we will be stuck here forever.
> >>
> > Yes, wait_event_interruptible() should be better here.
> >
> >>> +     /* coupled with smp_wmb() in vduse_dev_msg_complete() */
> >>> +     smp_rmb();
> >>
> >> Instead of using barriers, I wonder why not simply use msg lock here?
> >>
> > As mentioned above, using
> > wait_event_interruptible_locked()/wake_up_locked() is OK with me.
> >
> >>> +     ret = msg->resp.result;
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_dev_msg_complete(struct vduse_dev_msg *msg,
> >>> +                                     struct vduse_dev_response *resp)
> >>> +{
> >>> +     vduse_dev_msg_get(msg);
> >>> +     memcpy(&msg->resp, resp, sizeof(*resp));
> >>> +     /* coupled with smp_rmb() in vduse_dev_msg_sync() */
> >>> +     smp_wmb();
> >>> +     msg->completed = 1;
> >>> +     wake_up(&msg->waitq);
> >>> +     vduse_dev_msg_put(msg);
> >>> +}
> >>> +
> >>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_FEATURES);
> >>> +     u64 features;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     features = msg->resp.features;
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return features;
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_FEATURES);
> >>> +     int ret;
> >>> +
> >>> +     msg->req.size = sizeof(features);
> >>> +     msg->req.features = features;
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_STATUS);
> >>> +     u8 status;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     status = msg->resp.status;
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return status;
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_STATUS);
> >>> +
> >>> +     msg->req.size = sizeof(status);
> >>> +     msg->req.status = status;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +}
> >>> +
> >>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> >>> +                                     void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_CONFIG);
> >>> +
> >>> +     WARN_ON(len > sizeof(msg->req.config.data));
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_dev_config_data);
> >>> +     msg->req.config.offset = offset;
> >>> +     msg->req.config.len = len;
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     memcpy(buf, msg->resp.config.data, len);
> >>> +     vduse_dev_msg_put(msg);
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> >>> +                                     const void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_CONFIG);
> >>> +
> >>> +     WARN_ON(len > sizeof(msg->req.config.data));
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_dev_config_data);
> >>> +     msg->req.config.offset = offset;
> >>> +     msg->req.config.len = len;
> >>> +     memcpy(msg->req.config.data, buf, len);
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq, u32 num)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_NUM);
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_vq_num);
> >>> +     msg->req.vq_num.index = vq->index;
> >>> +     msg->req.vq_num.num = num;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq, u64 desc_addr,
> >>> +                             u64 driver_addr, u64 device_addr)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_ADDR);
> >>> +     int ret;
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_vq_addr);
> >>> +     msg->req.vq_addr.index = vq->index;
> >>> +     msg->req.vq_addr.desc_addr = desc_addr;
> >>> +     msg->req.vq_addr.driver_addr = driver_addr;
> >>> +     msg->req.vq_addr.device_addr = device_addr;
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq, bool ready)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_READY);
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_vq_ready);
> >>> +     msg->req.vq_ready.index = vq->index;
> >>> +     msg->req.vq_ready.ready = ready;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +}
> >>> +
> >>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> >>> +                                struct vduse_virtqueue *vq)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_READY);
> >>> +     bool ready;
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_vq_ready);
> >>> +     msg->req.vq_ready.index = vq->index;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, msg);
> >>> +     ready = msg->resp.vq_ready.ready;
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return ready;
> >>> +}
> >>> +
> >>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE);
> >>> +     int ret;
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_vq_state);
> >>> +     msg->req.vq_state.index = vq->index;
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>> +     state->avail_index = msg->resp.vq_state.avail_idx;
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE);
> >>> +     int ret;
> >>> +
> >>> +     msg->req.size = sizeof(struct vduse_vq_state);
> >>> +     msg->req.vq_state.index = vq->index;
> >>> +     msg->req.vq_state.avail_idx = state->avail_index;
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> >>> +                                     u64 start, u64 last)
> >>> +{
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int ret;
> >>> +
> >>> +     if (last < start)
> >>> +             return -EINVAL;
> >>> +
> >>> +     msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB);
> >>
> >> This is actually an IOTLB invalidation. So let's rename the function and
> >> message type.
> >>
> > Actually, VDUSE_UPDATE_IOTLB is now used to notify userspace that the
> > IOTLB has changed, rather than that the IOTLB needs to be invalidated.
> > Userspace can then use the GET_IOTLB ioctl to fetch the change. That
> > seems friendlier to userspace.
>
>
> Ok.
>
>
> >
> >>> +     msg->req.size = sizeof(struct vduse_iova_range);
> >>> +     msg->req.iova.start = start;
> >>> +     msg->req.iova.last = last;
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, msg);
> >>> +     vduse_dev_msg_put(msg);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int size = sizeof(struct vduse_dev_request);
> >>> +     ssize_t ret = 0;
> >>> +
> >>> +     if (iov_iter_count(to) < size)
> >>> +             return 0;
> >>> +
> >>> +     while (1) {
> >>> +             msg = vduse_dev_dequeue_msg(dev, &dev->send_list);
> >>> +             if (msg)
> >>> +                     break;
> >>> +
> >>> +             if (file->f_flags & O_NONBLOCK)
> >>> +                     return -EAGAIN;
> >>> +
> >>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
> >>> +                                     !list_empty(&dev->send_list));
> >>> +             if (ret)
> >>> +                     return ret;
> >>> +     }
> >>> +     ret = copy_to_iter(&msg->req, size, to);
> >>> +     if (ret != size) {
> >>> +             vduse_dev_enqueue_msg(dev, msg, &dev->send_list);
> >>> +             return -EFAULT;
> >>> +     }
> >>> +     vduse_dev_enqueue_msg(dev, msg, &dev->recv_list);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     size_t ret;
> >>> +
> >>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
> >>> +     if (ret != sizeof(resp))
> >>> +             return -EINVAL;
> >>> +
> >>> +     msg = vduse_dev_find_msg(dev, &dev->recv_list, resp.unique);
> >>> +     if (!msg)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vduse_dev_msg_complete(msg, &resp);
> >>
> >> So we have multiple types of requests/responses. Would it be better to
> >> introduce a queue-based admin interface other than ioctl?
> >>
> > Sorry, I didn't get your point. What do you mean by queue-based admin
> > interface? Virtqueue-based?
>
>
> Yes, a queue (virtqueue). The commands could be passed through the queue.
> (Just an idea, not sure it's worth it.)
>

I considered it before, but found that it still needs some extra work
(setting up an eventfd, setting the vring base and so on) to set up the
admin virtqueue before it can be used for communication. So I went with
this simpler approach.

>
> >
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     __poll_t mask = 0;
> >>> +
> >>> +     poll_wait(file, &dev->waitq, wait);
> >>> +
> >>> +     if (!list_empty(&dev->send_list))
> >>> +             mask |= EPOLLIN | EPOLLRDNORM;
> >>> +
> >>> +     return mask;
> >>> +}
> >>> +
> >>> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
> >>> +                              u64 start, u64 last,
> >>> +                              u64 addr, unsigned int perm,
> >>> +                              struct file *file, u64 offset)
> >>> +{
> >>> +     struct vhost_iotlb_file *iotlb_file;
> >>> +     int ret;
> >>> +
> >>> +     iotlb_file = kmalloc(sizeof(*iotlb_file), GFP_ATOMIC);
> >>> +     if (!iotlb_file)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     iotlb_file->file = get_file(file);
> >>> +     iotlb_file->offset = offset;
> >>> +
> >>> +     spin_lock(&dev->iommu_lock);
> >>> +     ret = vhost_iotlb_add_range(dev->iommu, start, last,
> >>> +                                     addr, perm, iotlb_file);
> >>> +     spin_unlock(&dev->iommu_lock);
> >>> +     if (ret) {
> >>> +             fput(iotlb_file->file);
> >>> +             kfree(iotlb_file);
> >>> +             return ret;
> >>> +     }
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
> >>> +{
> >>> +     struct vhost_iotlb_file *iotlb_file;
> >>> +     struct vhost_iotlb_map *map;
> >>> +
> >>> +     spin_lock(&dev->iommu_lock);
> >>> +     while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
> >>> +             iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> >>> +             fput(iotlb_file->file);
> >>> +             kfree(iotlb_file);
> >>> +             vhost_iotlb_map_free(dev->iommu, map);
> >>> +     }
> >>> +     spin_unlock(&dev->iommu_lock);
> >>> +}
> >>> +
> >>> +static void vduse_dev_reset(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     atomic_set(&dev->bounce_map, 0);
> >>> +     vduse_iotlb_del_range(dev, 0ULL, 0ULL - 1);
> >>> +     vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
> >>
> >> ULLONG_MAX please.
> >>
> > OK.
> >
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             vq->ready = false;
> >>> +             vq->cb = NULL;
> >>> +             vq->private = NULL;
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >>> +                             u64 desc_area, u64 driver_area,
> >>> +                             u64 device_area)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
> >>> +                                     driver_area, device_area);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vduse_vq_kick(vq);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> >>> +                           struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->cb = cb->callback;
> >>> +     vq->private = cb->private;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vduse_dev_set_vq_num(dev, vq, num);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> >>> +                                     u16 idx, bool ready)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vduse_dev_set_vq_ready(dev, vq, ready);
> >>> +     vq->ready = ready;
> >>> +}
> >>> +
> >>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
> >>> +
> >>> +     return vq->ready;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_set_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_get_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_align;
> >>> +}
> >>> +
> >>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> >>> +
> >>> +     return (vduse_dev_get_features(dev) | fixed);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return vduse_dev_set_features(dev, features);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> >>> +                               struct vdpa_callback *cb)
> >>> +{
> >>> +     /* We don't support config interrupt */
> >>
> >> If it's not hard, let's add this. Otherwise we need a per-device feature
> >> blacklist to filter out all features that depend on the config interrupt.
> >>
> > Will do it.
> >
> >>> +}
> >>> +
> >>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_size_max;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->device_id;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vendor_id;
> >>> +}
> >>> +
> >>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return vduse_dev_get_status(dev);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     if (status == 0)
> >>> +             vduse_dev_reset(dev);
> >>> +     else
> >>> +             vduse_dev_update_iotlb(dev, 0ULL, 0ULL - 1);
> >>
> >> Any reason for such IOTLB invalidation here?
> >>
> > As I mentioned before, this is used to notify userspace to update the
> > IOTLB, mainly for the virtio-vdpa case.
>
>
> So the question is: the status is usually set several times during
> driver initialization. Do we really need to update the IOTLB every
> time?
>

I think we can check here whether anything has changed since the last
IOTLB update.
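
Roughly, such a check could look like the userspace-compilable sketch
below. It is only an illustration: the iotlb_tracker structure and both
helpers are hypothetical and do not exist in the posted patch. The idea
is to bump a counter on every mapping change and only notify userspace
when the counter has moved since the last notification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping; neither field exists in the posted patch. */
struct iotlb_tracker {
	uint64_t generation;    /* bumped on every add/del of a mapping */
	uint64_t last_notified; /* generation at the last userspace update */
};

/* Call whenever a mapping is added to or removed from the IOTLB. */
static void iotlb_mark_changed(struct iotlb_tracker *t)
{
	assert(t != NULL);
	t->generation++;
}

/*
 * Returns true if userspace should be asked to refresh its view, and
 * records that the notification was issued. Returns false when nothing
 * changed since the previous call.
 */
static bool iotlb_needs_update(struct iotlb_tracker *t)
{
	assert(t != NULL);
	if (t->generation == t->last_notified)
		return false;
	t->last_notified = t->generation;
	return true;
}
```

With something like this, set_status() would call iotlb_needs_update()
and skip the update message when it returns false, so repeated status
writes during driver initialization would not trigger redundant updates.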

>
> >
> >>> +
> >>> +     vduse_dev_set_status(dev, status);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                          void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     vduse_dev_get_config(dev, offset, buf, len);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                     const void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     vduse_dev_set_config(dev, offset, buf, len);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> >>> +                             struct vhost_iotlb *iotlb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vhost_iotlb_map *map;
> >>> +     struct vhost_iotlb_file *iotlb_file;
> >>> +     u64 start = 0ULL, last = 0ULL - 1;
> >>> +     int ret = 0;
> >>> +
> >>> +     vduse_iotlb_del_range(dev, start, last);
> >>> +
> >>> +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> >>> +             map = vhost_iotlb_itree_next(map, start, last)) {
> >>> +             if (!map->opaque)
> >>> +                     continue;
> >>
> >> What will happen if we simply accept NULL opaque here?
> >>
> > No file to mmap in userspace. So it's useless.
> >
> >>> +
> >>> +             iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> >>> +             ret = vduse_iotlb_add_range(dev, map->start, map->last,
> >>> +                                         map->addr, map->perm,
> >>> +                                         iotlb_file->file,
> >>> +                                         iotlb_file->offset);
> >>> +             if (ret)
> >>> +                     break;
> >>> +     }
> >>> +     vduse_dev_update_iotlb(dev, start, last);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     WARN_ON(!list_empty(&dev->send_list));
> >>> +     WARN_ON(!list_empty(&dev->recv_list));
> >>> +     dev->vdev = NULL;
> >>> +}
> >>> +
> >>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
> >>> +     .kick_vq                = vduse_vdpa_kick_vq,
> >>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> >>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
> >>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> >>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> >>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
> >>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
> >>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
> >>> +     .get_features           = vduse_vdpa_get_features,
> >>> +     .set_features           = vduse_vdpa_set_features,
> >>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
> >>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> >>> +     .get_device_id          = vduse_vdpa_get_device_id,
> >>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> >>> +     .get_status             = vduse_vdpa_get_status,
> >>> +     .set_status             = vduse_vdpa_set_status,
> >>> +     .get_config             = vduse_vdpa_get_config,
> >>> +     .set_config             = vduse_vdpa_set_config,
> >>> +     .set_map                = vduse_vdpa_set_map,
> >>> +     .free                   = vduse_vdpa_free,
> >>> +};
> >>> +
> >>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> >>> +                                     unsigned long offset, size_t size,
> >>> +                                     enum dma_data_direction dir,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
> >>> +             vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
> >>> +                                   0, VDUSE_ACCESS_RW,
> >>
> >> Is it safe to use VDUSE_ACCESS_RW here, considering we might have
> >> device read-only mappings?
> >>
> > This mapping is for the whole bounce buffer. Maybe userspace needs to
> > tell us if it only supports read-only mappings.
>
>
> Right, so I think we don't need to care about this.
>
>
> >
> >>> +                                   vduse_domain_file(domain),
> >>> +                                   vduse_domain_get_offset(domain, 0))) {
> >>> +             atomic_set(&vdev->bounce_map, 0);
> >>> +             return DMA_MAPPING_ERROR;
> >>> +     }
> >>> +
> >>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> >>> +                                     dma_addr_t *dma_addr, gfp_t flag,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long iova;
> >>> +     void *addr;
> >>> +
> >>> +     *dma_addr = DMA_MAPPING_ERROR;
> >>> +     addr = vduse_domain_alloc_coherent(domain, size,
> >>> +                             (dma_addr_t *)&iova, flag, attrs);
> >>> +     if (!addr)
> >>> +             return NULL;
> >>> +
> >>> +     if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
> >>> +                               iova, VDUSE_ACCESS_RW,
> >>> +                               vduse_domain_file(domain),
> >>> +                               vduse_domain_get_offset(domain, iova))) {
> >>> +             vduse_domain_free_coherent(domain, size, addr, iova, attrs);
> >>> +             return NULL;
> >>> +     }
> >>> +     *dma_addr = (dma_addr_t)iova;
> >>> +
> >>> +     return addr;
> >>> +}
> >>> +
> >>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> >>> +                                     void *vaddr, dma_addr_t dma_addr,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long start = (unsigned long)dma_addr;
> >>> +     unsigned long last = start + size - 1;
> >>> +
> >>> +     vduse_iotlb_del_range(vdev, start, last);
> >>> +     vduse_dev_update_iotlb(vdev, start, last);
> >>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> >>> +}
> >>> +
> >>> +static const struct dma_map_ops vduse_dev_dma_ops = {
> >>> +     .map_page = vduse_dev_map_page,
> >>> +     .unmap_page = vduse_dev_unmap_page,
> >>> +     .alloc = vduse_dev_alloc_coherent,
> >>> +     .free = vduse_dev_free_coherent,
> >>> +};
> >>> +
> >>> +static unsigned int perm_to_file_flags(u8 perm)
> >>> +{
> >>> +     unsigned int flags = 0;
> >>> +
> >>> +     switch (perm) {
> >>> +     case VDUSE_ACCESS_WO:
> >>> +             flags |= O_WRONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RO:
> >>> +             flags |= O_RDONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RW:
> >>> +             flags |= O_RDWR;
> >>> +             break;
> >>> +     default:
> >>> +             WARN(1, "invalidate vhost IOTLB permission\n");
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return flags;
> >>> +}
> >>> +
> >>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >>> +                     unsigned long arg)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +     int ret;
> >>> +
> >>> +     mutex_lock(&dev->lock);
> >>> +     switch (cmd) {
> >>> +     case VDUSE_IOTLB_GET_FD: {
> >>> +             struct vduse_iotlb_entry entry;
> >>> +             struct vhost_iotlb_map *map;
> >>> +             struct vhost_iotlb_file *iotlb_file;
> >>> +             struct file *f = NULL;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
> >>> +                     break;
> >>> +
> >>> +             spin_lock(&dev->iommu_lock);
> >>> +             map = vhost_iotlb_itree_first(dev->iommu, entry.start,
> >>> +                                           entry.last);
> >>> +             if (map) {
> >>> +                     iotlb_file = (struct vhost_iotlb_file *)map->opaque;
> >>> +                     f = get_file(iotlb_file->file);
> >>> +                     entry.offset = iotlb_file->offset;
> >>> +                     entry.start = map->start;
> >>> +                     entry.last = map->last;
> >>> +                     entry.perm = map->perm;
> >>> +             }
> >>> +             spin_unlock(&dev->iommu_lock);
> >>> +             if (!f) {
> >>> +                     ret = -EINVAL;
> >>> +                     break;
> >>> +             }
> >>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> >>> +                     fput(f);
> >>> +                     ret = -EFAULT;
> >>> +                     break;
> >>> +             }
> >>> +             ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
> >>> +             if (ret < 0) {
> >>> +                     fput(f);
> >>> +                     break;
> >>> +             }
> >>> +             fd_install(ret, f);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_VQ_SETUP_KICKFD: {
> >>> +             struct vduse_vq_eventfd eventfd;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_kickfd_setup(dev, &eventfd);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_VQ_SETUP_IRQFD: {
> >>> +             struct vduse_vq_eventfd eventfd;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_virqfd_setup(dev, &eventfd);
> >>> +             break;
> >>> +     }
> >>> +     }
> >>> +     mutex_unlock(&dev->lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_dev_release(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +
> >>> +     vduse_kickfd_release(dev);
> >>> +     vduse_virqfd_release(dev);
> >>> +     dev->connected = false;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_dev_fops = {
> >>> +     .owner          = THIS_MODULE,
> >>> +     .release        = vduse_dev_release,
> >>> +     .read_iter      = vduse_dev_read_iter,
> >>> +     .write_iter     = vduse_dev_write_iter,
> >>> +     .poll           = vduse_dev_poll,
> >>> +     .unlocked_ioctl = vduse_dev_ioctl,
> >>> +     .compat_ioctl   = compat_ptr_ioctl,
> >>> +     .llseek         = noop_llseek,
> >>> +};
> >>> +
> >>> +static struct vduse_dev *vduse_dev_create(void)
> >>> +{
> >>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> >>> +
> >>> +     if (!dev)
> >>> +             return NULL;
> >>> +
> >>> +     dev->iommu = vhost_iotlb_alloc(2048, 0);
> >>
> >> Is 2048 sufficient here?
> >>
> > How about letting userspace define it?
>
>
> Fine with me.
>
>
> >
> >
> >>> +     if (!dev->iommu) {
> >>> +             kfree(dev);
> >>> +             return NULL;
> >>> +     }
> >>> +
> >>> +     mutex_init(&dev->lock);
> >>> +     spin_lock_init(&dev->msg_lock);
> >>> +     INIT_LIST_HEAD(&dev->send_list);
> >>> +     INIT_LIST_HEAD(&dev->recv_list);
> >>> +     atomic64_set(&dev->msg_unique, 0);
> >>> +     spin_lock_init(&dev->iommu_lock);
> >>> +     atomic_set(&dev->bounce_map, 0);
> >>> +
> >>> +     init_waitqueue_head(&dev->waitq);
> >>> +
> >>> +     return dev;
> >>> +}
> >>> +
> >>> +static void vduse_dev_destroy(struct vduse_dev *dev)
> >>> +{
> >>> +     vhost_iotlb_free(dev->iommu);
> >>> +     mutex_destroy(&dev->lock);
> >>> +     kfree(dev);
> >>> +}
> >>> +
> >>> +static struct vduse_dev *vduse_find_dev(u32 id)
> >>> +{
> >>> +     struct vduse_dev *tmp, *dev = NULL;
> >>> +
> >>> +     list_for_each_entry(tmp, &vduse_devs, list) {
> >>> +             if (tmp->id == id) {
> >>> +                     dev = tmp;
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +     return dev;
> >>> +}
> >>> +
> >>> +static int vduse_destroy_dev(u32 id)
> >>> +{
> >>> +     struct vduse_dev *dev = vduse_find_dev(id);
> >>> +
> >>> +     if (!dev)
> >>> +             return -EINVAL;
> >>> +
> >>> +     if (dev->vdev || dev->connected)
> >>> +             return -EBUSY;
> >>> +
> >>> +     list_del(&dev->list);
> >>> +     kfree(dev->vqs);
> >>> +     vduse_domain_destroy(dev->domain);
> >>> +     vduse_dev_destroy(dev);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_create_dev(struct vduse_dev_config *config)
> >>> +{
> >>> +     int i, fd;
> >>> +     struct vduse_dev *dev;
> >>> +     char name[64];
> >>> +
> >>> +     if (vduse_find_dev(config->id))
> >>> +             return -EEXIST;
> >>> +
> >>> +     dev = vduse_dev_create();
> >>> +     if (!dev)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     dev->id = config->id;
> >>> +     dev->device_id = config->device_id;
> >>> +     dev->vendor_id = config->vendor_id;
> >>> +     dev->domain = vduse_domain_create(config->bounce_size);
> >>
> >> Do we need an upper limit on bounce_size?
> >>
> > I agree. Any comment for the value?
>
>
> Something like swiotlb default value (64M)?
>

Do we need a module parameter to change it?
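
For illustration, the check could be as simple as the sketch below. The
names are hypothetical, and the limit is passed in as an argument to
stand in for whatever a module parameter would hold; 64 MiB is only the
default suggested above, not a value taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed default cap (64 MiB), following the swiotlb default mentioned above. */
#define BOUNCE_SIZE_MAX_DEFAULT (64ULL << 20)

/*
 * Validate a bounce buffer size requested by userspace against a limit.
 * Returns 0 if the size is acceptable, -EINVAL otherwise.
 */
static int check_bounce_size(uint64_t bounce_size, uint64_t limit)
{
	assert(limit > 0);
	if (bounce_size == 0 || bounce_size > limit)
		return -EINVAL;
	return 0;
}
```

vduse_create_dev() would then reject the request before calling
vduse_domain_create() when the check fails.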

>
> >
> >>> +     if (!dev->domain)
> >>> +             goto err_domain;
> >>> +
> >>> +     dev->vq_align = config->vq_align;
> >>> +     dev->vq_size_max = config->vq_size_max;
> >>> +     dev->vq_num = config->vq_num;
> >>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> >>> +     if (!dev->vqs)
> >>> +             goto err_vqs;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             dev->vqs[i].index = i;
> >>> +             spin_lock_init(&dev->vqs[i].kick_lock);
> >>> +             spin_lock_init(&dev->vqs[i].irq_lock);
> >>> +     }
> >>> +
> >>> +     snprintf(name, sizeof(name), "[vduse-dev:%u]", config->id);
> >>> +     fd = anon_inode_getfd(name, &vduse_dev_fops, dev, O_RDWR | O_CLOEXEC);
> >>
> >> Any reason for closing on exec here?
> >>
> > Looks like we can remove this flag.
> >
> >>> +     if (fd < 0)
> >>> +             goto err_fd;
> >>> +
> >>> +     dev->connected = true;
> >>> +     list_add(&dev->list, &vduse_devs);
> >>> +
> >>> +     return fd;
> >>> +err_fd:
> >>> +     kfree(dev->vqs);
> >>> +err_vqs:
> >>> +     vduse_domain_destroy(dev->domain);
> >>> +err_domain:
> >>> +     vduse_dev_destroy(dev);
> >>> +     return fd;
> >>> +}
> >>> +
> >>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> >>> +                     unsigned long arg)
> >>> +{
> >>> +     int ret;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +
> >>> +     mutex_lock(&vduse_lock);
> >>> +     switch (cmd) {
> >>> +     case VDUSE_CREATE_DEV: {
> >>> +             struct vduse_dev_config config;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&config, argp, sizeof(config)))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_create_dev(&config);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_DESTROY_DEV:
> >>> +             ret = vduse_destroy_dev(arg);
> >>> +             break;
> >>> +     default:
> >>> +             ret = -EINVAL;
> >>> +             break;
> >>> +     }
> >>> +     mutex_unlock(&vduse_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_fops = {
> >>> +     .owner          = THIS_MODULE,
> >>> +     .unlocked_ioctl = vduse_ioctl,
> >>> +     .compat_ioctl   = compat_ptr_ioctl,
> >>> +     .llseek         = noop_llseek,
> >>> +};
> >>> +
> >>> +static struct miscdevice vduse_misc = {
> >>> +     .fops = &vduse_fops,
> >>> +     .minor = MISC_DYNAMIC_MINOR,
> >>> +     .name = "vduse",
> >>> +};
> >>> +
> >>> +static void vduse_parent_release(struct device *dev)
> >>> +{
> >>> +}
> >>> +
> >>> +static struct device vduse_parent = {
> >>> +     .init_name = "vduse",
> >>> +     .release = vduse_parent_release,
> >>> +};
> >>> +
> >>> +static struct vdpa_parent_dev parent_dev;
> >>> +
> >>> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> >>> +{
> >>> +     struct vduse_vdpa *vdev = dev->vdev;
> >>> +     int ret;
> >>> +
> >>> +     if (vdev)
> >>> +             return -EEXIST;
> >>> +
> >>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
> >>> +                              &vduse_vdpa_config_ops,
> >>> +                              dev->vq_num, name, true);
> >>> +     if (!vdev)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     vdev->dev = dev;
> >>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> >>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> >>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> >>> +     vdev->vdpa.pdev = &parent_dev;
> >>> +
> >>> +     ret = _vdpa_register_device(&vdev->vdpa);
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     dev->vdev = vdev;
> >>> +
> >>> +     return 0;
> >>> +err:
> >>> +     put_device(&vdev->vdpa.dev);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static struct vdpa_device *vdpa_dev_add(struct vdpa_parent_dev *pdev,
> >>> +                                     const char *name, u32 device_id,
> >>> +                                     struct nlattr **attrs)
> >>> +{
> >>> +     u32 vduse_id;
> >>> +     struct vduse_dev *dev;
> >>> +     int ret = -EINVAL;
> >>> +
> >>> +     if (!attrs[VDPA_ATTR_BACKEND_ID])
> >>> +             return ERR_PTR(-EINVAL);
> >>> +
> >>> +     mutex_lock(&vduse_lock);
> >>> +     vduse_id = nla_get_u32(attrs[VDPA_ATTR_BACKEND_ID]);
> >>
> >> I wonder why we don't use the name here?
> >>
> > Do you mean using the same name for both backend and frontend? If so,
> > we need to add a name for the vduse device or replace the id with a
> > name to identify a vduse device. Which way do you prefer?
>
>
> I think if we can do it by name, it's better not to introduce anything
> else like "id". It will complicate the management software.
>

Fine with me.

>
> >
> >> And it looks to me it would be easier if we create a char device per
> >> vduse. This makes the device addressing more robust than passing id
> >> silently among processes.
> >>
> > It's OK to me.
> >
> >>> +     dev = vduse_find_dev(vduse_id);
> >>> +     if (!dev)
> >>> +             goto unlock;
> >>> +
> >>> +     if (dev->device_id != device_id)
> >>> +             goto unlock;
> >>> +
> >>> +     ret = vduse_dev_add_vdpa(dev, name);
> >>> +unlock:
> >>> +     mutex_unlock(&vduse_lock);
> >>> +     if (ret)
> >>> +             return ERR_PTR(ret);
> >>> +
> >>> +     return &dev->vdev->vdpa;
> >>> +}
> >>> +
> >>> +static void vdpa_dev_del(struct vdpa_parent_dev *pdev, struct vdpa_device *dev)
> >>> +{
> >>> +     _vdpa_unregister_device(dev);
> >>> +}
> >>> +
> >>> +static const struct vdpa_dev_ops vdpa_dev_parent_ops = {
> >>> +     .dev_add = vdpa_dev_add,
> >>> +     .dev_del = vdpa_dev_del
> >>> +};
> >>> +
> >>> +static struct virtio_device_id id_table[] = {
> >>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> >>> +     { 0 },
> >>> +};
> >>> +
> >>> +static struct vdpa_parent_dev parent_dev = {
> >>> +     .device = &vduse_parent,
> >>> +     .id_table = id_table,
> >>> +     .ops = &vdpa_dev_parent_ops,
> >>> +};
> >>> +
> >>> +static int vduse_parentdev_init(void)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     ret = device_register(&vduse_parent);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     ret = vdpa_parentdev_register(&parent_dev);
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     return 0;
> >>> +err:
> >>> +     device_unregister(&vduse_parent);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_parentdev_exit(void)
> >>> +{
> >>> +     vdpa_parentdev_unregister(&parent_dev);
> >>> +     device_unregister(&vduse_parent);
> >>> +}
> >>> +
> >>> +static int vduse_init(void)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     ret = misc_register(&vduse_misc);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     ret = -ENOMEM;
> >>> +     vduse_vdpa_wq = alloc_workqueue("vduse-vdpa", WQ_UNBOUND, 1);
> >>> +     if (!vduse_vdpa_wq)
> >>> +             goto err_vdpa_wq;
> >>> +
> >>> +     ret = vduse_virqfd_init();
> >>> +     if (ret)
> >>> +             goto err_irqfd;
> >>> +
> >>> +     ret = vduse_parentdev_init();
> >>> +     if (ret)
> >>> +             goto err_parentdev;
> >>> +
> >>> +     return 0;
> >>> +err_parentdev:
> >>> +     vduse_virqfd_exit();
> >>> +err_irqfd:
> >>> +     destroy_workqueue(vduse_vdpa_wq);
> >>> +err_vdpa_wq:
> >>> +     misc_deregister(&vduse_misc);
> >>> +     return ret;
> >>> +}
> >>> +module_init(vduse_init);
> >>> +
> >>> +static void vduse_exit(void)
> >>> +{
> >>> +     misc_deregister(&vduse_misc);
> >>> +     destroy_workqueue(vduse_vdpa_wq);
> >>> +     vduse_virqfd_exit();
> >>> +     vduse_parentdev_exit();
> >>> +}
> >>> +module_exit(vduse_exit);
> >>> +
> >>> +MODULE_VERSION(DRV_VERSION);
> >>> +MODULE_LICENSE(DRV_LICENSE);
> >>> +MODULE_AUTHOR(DRV_AUTHOR);
> >>> +MODULE_DESCRIPTION(DRV_DESC);
> >>> diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
> >>> index bba8b83a94b5..a7a841e5ffc7 100644
> >>> --- a/include/uapi/linux/vdpa.h
> >>> +++ b/include/uapi/linux/vdpa.h
> >>> @@ -33,6 +33,7 @@ enum vdpa_attr {
> >>>        VDPA_ATTR_DEV_VENDOR_ID,                /* u32 */
> >>>        VDPA_ATTR_DEV_MAX_VQS,                  /* u32 */
> >>>        VDPA_ATTR_DEV_MAX_VQ_SIZE,              /* u16 */
> >>> +     VDPA_ATTR_BACKEND_ID,                   /* u32 */
> >>
> >> As discussed, this needs more thought. But if necessary, we need a
> >> separate patch for this.
> >>
> > OK.
> >
> >>>        /* new attributes must be added above here */
> >>>        VDPA_ATTR_MAX,
> >>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> >>> new file mode 100644
> >>> index 000000000000..9fb555ddcfbd
> >>> --- /dev/null
> >>> +++ b/include/uapi/linux/vduse.h
> >>> @@ -0,0 +1,125 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >>> +#ifndef _UAPI_VDUSE_H_
> >>> +#define _UAPI_VDUSE_H_
> >>> +
> >>> +#include <linux/types.h>
> >>> +
> >>> +/* the control messages definition for read/write */
> >>> +
> >>> +#define VDUSE_CONFIG_DATA_LEN        256
> >>> +
> >>> +enum vduse_req_type {
> >>> +     VDUSE_SET_VQ_NUM,
> >>> +     VDUSE_SET_VQ_ADDR,
> >>> +     VDUSE_SET_VQ_READY,
> >>> +     VDUSE_GET_VQ_READY,
> >>> +     VDUSE_SET_VQ_STATE,
> >>> +     VDUSE_GET_VQ_STATE,
> >>> +     VDUSE_SET_FEATURES,
> >>> +     VDUSE_GET_FEATURES,
> >>> +     VDUSE_SET_STATUS,
> >>> +     VDUSE_GET_STATUS,
> >>> +     VDUSE_SET_CONFIG,
> >>> +     VDUSE_GET_CONFIG,
> >>> +     VDUSE_UPDATE_IOTLB,
> >>> +};
> >>> +
> >>> +struct vduse_vq_num {
> >>> +     __u32 index;
> >>> +     __u32 num;
> >>> +};
> >>> +
> >>> +struct vduse_vq_addr {
> >>> +     __u32 index;
> >>> +     __u64 desc_addr;
> >>> +     __u64 driver_addr;
> >>> +     __u64 device_addr;
> >>> +};
> >>> +
> >>> +struct vduse_vq_ready {
> >>> +     __u32 index;
> >>> +     __u8 ready;
> >>> +};
> >>> +
> >>> +struct vduse_vq_state {
> >>> +     __u32 index;
> >>> +     __u16 avail_idx;
> >>> +};
> >>> +
> >>> +struct vduse_dev_config_data {
> >>> +     __u32 offset;
> >>> +     __u32 len;
> >>> +     __u8 data[VDUSE_CONFIG_DATA_LEN];
> >>
> >> There is no guarantee that 256 is sufficient here.
> >>
> > If the size is larger than 256, we can try to split the original request.
>
>
> Fine, then we need to document it here or in the doc.
>

Will do it.
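
To make the splitting concrete, here is a rough sketch. The chunk
callback and the stats structure are hypothetical stand-ins for sending
one VDUSE_GET_CONFIG/VDUSE_SET_CONFIG message to userspace; only
VDUSE_CONFIG_DATA_LEN comes from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VDUSE_CONFIG_DATA_LEN 256

/* Hypothetical callback standing in for "send one request to userspace". */
typedef int (*chunk_fn)(uint32_t offset, uint32_t len, void *ctx);

/* Example bookkeeping used by record_chunk() below. */
struct chunk_stats {
	unsigned int chunks; /* number of requests issued */
	uint32_t bytes;      /* total bytes covered */
};

/* Example callback: just records what it was asked to transfer. */
static int record_chunk(uint32_t offset, uint32_t len, void *ctx)
{
	struct chunk_stats *s = ctx;

	(void)offset;
	s->chunks++;
	s->bytes += len;
	return 0;
}

/*
 * Split a config space access of arbitrary length into requests that each
 * carry at most VDUSE_CONFIG_DATA_LEN bytes. Stops at the first failure.
 */
static int config_access_split(uint32_t offset, uint32_t len,
			       chunk_fn fn, void *ctx)
{
	assert(fn != NULL);
	while (len > 0) {
		uint32_t n = len < VDUSE_CONFIG_DATA_LEN ?
			     len : VDUSE_CONFIG_DATA_LEN;
		int ret = fn(offset, n, ctx);

		if (ret)
			return ret;
		offset += n;
		len -= n;
	}
	return 0;
}
```

A 600-byte access starting at offset 0 would then be sent as three
requests of 256, 256 and 88 bytes.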

>
> >
> >>> +};
> >>> +
> >>> +struct vduse_iova_range {
> >>> +     __u64 start;
> >>> +     __u64 last;
> >>> +};
> >>> +
> >>> +struct vduse_dev_request {
> >>> +     __u32 type; /* request type */
> >>> +     __u32 unique; /* request id */
> >>> +     __u32 flags; /* request flags */
> >>
> >> Seems unused in this series.
> >>
> > This is for future use.
>
>
> So let's use pad or some other name.
>

Fine.

>
> >
> >>> +     __u32 size; /* the payload size */
> >>
> >> Unused.
> >>
> > Will remove it.
> >
> >>> +     union {
> >>> +             struct vduse_vq_num vq_num; /* virtqueue num */
> >>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
> >>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> >>> +             struct vduse_vq_state vq_state; /* virtqueue state */
> >>> +             struct vduse_dev_config_data config; /* virtio device config space */
> >>> +             struct vduse_iova_range iova; /* iova range for updating */
> >>> +             __u64 features; /* virtio features */
> >>> +             __u8 status; /* device status */
> >>
> >> Let's add some padding for future extensions.
> >>
> > Is sizeof(vduse_dev_config_data) ok? Or char[1024]?
>
>
> 1024 seems too large, 128 or 256 looks better.
>

If so, sizeof(vduse_dev_config_data) is enough.
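
The arithmetic behind that, as a userspace sketch: the structures below
mirror the layout in the patch with stdint types, and the *_sketch names
are only for illustration. The config data member is 4 + 4 + 256 = 264
bytes, so the union is already at least that large without any explicit
padding member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VDUSE_CONFIG_DATA_LEN 256

/* Mirrors struct vduse_dev_config_data from the patch. */
struct config_data_sketch {
	uint32_t offset;
	uint32_t len;
	uint8_t data[VDUSE_CONFIG_DATA_LEN];
};

/* Mirrors struct vduse_iova_range from the patch. */
struct iova_range_sketch {
	uint64_t start;
	uint64_t last;
};

/* A subset of the request payload union; the largest member sets the size. */
union request_payload_sketch {
	struct config_data_sketch config;
	struct iova_range_sketch iova;
	uint64_t features;
	uint8_t status;
};

/* Compile-time check that the payload leaves at least 256 bytes of room. */
_Static_assert(sizeof(union request_payload_sketch) >= 256,
	       "payload smaller than the suggested padding");

/* Helper so the size can also be inspected at run time. */
static size_t payload_size(void)
{
	return sizeof(union request_payload_sketch);
}
```

Any future member up to that size can be added to the union without
changing the size of the request structure.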

>
> >
> >>> +     };
> >>> +};
> >>> +
> >>> +struct vduse_dev_response {
> >>> +     __u32 unique; /* corresponding request id */
> >>
> >> Let's use request id.
> >>
> > Fine.
> >
> >>> +     __s32 result; /* the result of request */
> >>
> >> Let's use macro or enum to define the success and failure value.
> >>
> > Will do it.
> >
> >>> +     union {
> >>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> >>> +             struct vduse_vq_state vq_state; /* virtqueue state */
> >>> +             struct vduse_dev_config_data config; /* virtio device config space */
> >>> +             __u64 features; /* virtio features */
> >>> +             __u8 status; /* device status */
> >>> +     };
> >>> +};
> >>> +
> >>> +/* ioctls */
> >>> +
> >>> +struct vduse_dev_config {
> >>> +     __u32 id; /* vduse device id */
> >>> +     __u32 vendor_id; /* virtio vendor id */
> >>> +     __u32 device_id; /* virtio device id */
> >>> +     __u64 bounce_size; /* bounce buffer size for iommu */
> >>> +     __u16 vq_num; /* the number of virtqueues */
> >>> +     __u16 vq_size_max; /* the max size of virtqueue */
> >>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> >>> +};
> >>> +
> >>> +struct vduse_iotlb_entry {
> >>> +     __u64 offset; /* the mmap offset on fd */
> >>> +     __u64 start; /* start of the IOVA range */
> >>> +     __u64 last; /* last of the IOVA range */
> >>> +#define VDUSE_ACCESS_RO 0x1
> >>> +#define VDUSE_ACCESS_WO 0x2
> >>> +#define VDUSE_ACCESS_RW 0x3
> >>> +     __u8 perm; /* access permission of this range */
> >>> +};
> >>> +
> >>> +struct vduse_vq_eventfd {
> >>> +     __u32 index; /* virtqueue index */
> >>> +     __u32 fd; /* eventfd */
> >>
> >> Any reason for not using int here?
> >>
> > Will use __s32 instead.
>
>
> Let's use "int" here, so -1 can be used for de-assigning the eventfd.
>

OK.
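
A small sketch of what the signed fd allows, under the assumption that -1
means "de-assign". The kick_table type, MAX_VQS and the helper names are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VQS 4 /* hypothetical limit for this sketch */

struct vduse_vq_eventfd {
	uint32_t index; /* virtqueue index */
	int fd;         /* eventfd, or -1 to de-assign */
};

/* Hypothetical per-device table of kick eventfds; -1 means "none". */
struct kick_table {
	int fds[MAX_VQS];
};

static void kick_table_init(struct kick_table *t)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < MAX_VQS; i++)
		t->fds[i] = -1;
}

/* Assign or (with fd == -1) de-assign the eventfd of one virtqueue. */
static int kick_table_set(struct kick_table *t,
			  const struct vduse_vq_eventfd *e)
{
	assert(t != NULL && e != NULL);

	if (e->index >= MAX_VQS || e->fd < -1)
		return -EINVAL;

	t->fds[e->index] = e->fd;
	return 0;
}
```

With an unsigned fd field there is no natural "no eventfd" value, which is
the point of the suggestion above.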

Thanks,
Yongji


* Re: Re: [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases
  2021-01-28  4:31               ` Jason Wang
@ 2021-01-28  6:08                 ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-28  6:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Jan 28, 2021 at 12:31 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/28 11:52 AM, Yongji Xie wrote:
> > On Thu, Jan 28, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/1/27 5:11 PM, Yongji Xie wrote:
> >>> On Wed, Jan 27, 2021 at 11:38 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/1/20 2:52 PM, Yongji Xie wrote:
> >>>>> On Wed, Jan 20, 2021 at 12:24 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/1/19 12:59 PM, Xie Yongji wrote:
> >>>>>>> Now we have a global percpu counter to limit the recursion depth
> >>>>>>> of eventfd_signal(). This can avoid deadlock or stack overflow.
> >>>>>>> But in stack overflow case, it should be OK to increase the
> >>>>>>> recursion depth if needed. So we add a percpu counter in eventfd_ctx
> >>>>>>> to limit the recursion depth for deadlock case. Then it could be
> >>>>>>> fine to increase the global percpu counter later.
> >>>>>> I wonder whether it's worth introducing a percpu counter for each eventfd.
> >>>>>>
> >>>>>> How about simply checking whether eventfd_signal_count() is greater than 2?
> >>>>>>
> >>>>> It can't avoid deadlock in this way.
> >>>> I may be missing something, but the count is there to avoid recursive eventfd calls.
> >>>> So for VDUSE, what we suffer from is e.g. the interrupt injection path:
> >>>>
> >>>> userspace write IRQFD -> vq->cb() -> another IRQFD.
> >>>>
> >>>> It looks like increasing EVENTFD_WAKEUP_DEPTH should be sufficient?
> >>>>
> >>> Actually I mean the deadlock described in commit f0b493e ("io_uring:
> >>> prevent potential eventfd recursion on poll"). Just increasing
> >>> EVENTFD_WAKEUP_DEPTH would break that bug fix.
> >>
> >> Ok, so can we do something similar to that commit? (using async stuff
> >> like wq).
> >>
> > We can do that, but it will reduce performance, because the
> > eventfd recursion will be triggered every time KVM kicks the eventfd in
> > vhost-vdpa cases:
> >
> > KVM write KICKFD -> ops->kick_vq -> VDUSE write KICKFD
> >
> > Thanks,
> > Yongji
>
>
> Right, I think in the future we need to find a way to let KVM wake up
> VDUSE directly.
>

Yes, this would be better.
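
For readers following the recursion argument, here is a userspace sketch of
a depth guard. It is not the kernel's eventfd code: the thread-local counter
stands in for the percpu counter, and SKETCH_WAKEUP_DEPTH, wake_enter() and
wake_exit() are hypothetical names. A limit of 2 lets one nested signal
through, as in the KICKFD chain above, and refuses anything deeper.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_WAKEUP_DEPTH 2 /* hypothetical limit, not the kernel's value */

/* Stand-in for a percpu recursion counter. */
static _Thread_local int wake_depth;

/* Call before signalling; fails once the nesting limit is reached. */
static int wake_enter(void)
{
	if (wake_depth >= SKETCH_WAKEUP_DEPTH)
		return -EBUSY;
	wake_depth++;
	return 0;
}

/* Call after signalling to leave the current nesting level. */
static void wake_exit(void)
{
	assert(wake_depth > 0);
	wake_depth--;
}
```

A caller that gets -EBUSY would have to defer the signal (for example to a
workqueue), which is the performance cost discussed above.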

Thanks,
Yongji


* Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-28  6:03           ` Yongji Xie
@ 2021-01-28  6:14             ` Jason Wang
  2021-01-28  6:43               ` Yongji Xie
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Wang @ 2021-01-28  6:14 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/1/28 2:03 PM, Yongji Xie wrote:
>>>>> +
>>>>> +static const struct file_operations vduse_domain_fops = {
>>>>> +     .mmap = vduse_domain_mmap,
>>>>> +     .release = vduse_domain_release,
>>>>> +};
>>>> It's better to explain the reason for introducing a dedicated file for
>>>> mmap() here.
>>>>
>>> To make the implementation of iova_domain independent of vduse_dev.
>> My understanding is that, the only usage for this is to:
>>
>> 1) support different type of iova mappings
>> 2) or switch between iova domain mappings
>>
>> But I can't think of a need for this.
>>
> For example, share one iova_domain between several vduse devices.


Interesting.


>
> And it will be helpful if we want to split this patch into iova domain
> part and vduse device part. Because the page fault handler should be
> paired with dma_map/dma_unmap.


Ok.

[...]


>
>>>> This looks not safe, let's use idr here.
>>>>
>>> Could you give more details? It looks like idr should not be used in this
>>> case, which cannot tolerate failure. And using a list to store the msg
>>> is better than using idr when the msg needs to be re-inserted in some
>>> cases.
>> My understanding is that "unique" (it probably needs a better name) is a
>> token that is used to uniquely identify a message. The reply from
>> userspace is required to carry exactly the same token (unique). IDR
>> seems better, but considering we can hardly hit 64bit overflow, atomic might
>> be OK as well.
>>
>> Btw, in what case do we need to re-insert a message?
>>
> When the userspace daemon receives the message but crashes before replying to it.


Do we have code to do this?

[...]


>
>>>> So we had multiple types of requests/responses, is this better to
>>>> introduce a queue based admin interface other than ioctl?
>>>>
>>> Sorry, I didn't get your point. What do you mean by queue-based admin
>>> interface? Virtqueue-based?
>> Yes, a queue(virtqueue). The commands could be passed through the queue.
>> (Just an idea, not sure it's worth)
>>
> I considered it before. But I found it still needs some extra work
> (setting up the eventfd, setting the vring base and so on) to set up the
> admin virtqueue before using it for communication. So I turned to this simple way.


Yes. We might consider it in the future.

[...]


>
>>>> Any reason for such IOTLB invalidation here?
>>>>
>>> As I mentioned before, this is used to notify userspace to update the
>>> IOTLB. Mainly for virtio-vdpa case.
>> So the question is: usually the status is set several times
>> during driver initialization. Do we really need to update the IOTLB
>> every time?
>>
> I think we can check whether there are some changes after the last
> IOTLB updating here.


So the question remains: except for reset (write 0), can any other status
affect the IOTLB?

[...]

>
>> Something like swiotlb default value (64M)?
>>
> Do we need a module parameter to change it?


We can.

[...]

>
>>>>> +     union {
>>>>> +             struct vduse_vq_num vq_num; /* virtqueue num */
>>>>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
>>>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>>>> +             struct vduse_iova_range iova; /* iova range for updating */
>>>>> +             __u64 features; /* virtio features */
>>>>> +             __u8 status; /* device status */
>>>> Let's add some padding for future extensions.
>>>>
>>> Is sizeof(vduse_dev_config_data) ok? Or char[1024]?
>> 1024 seems too large, 128 or 256 looks better.
>>
> If so, sizeof(vduse_dev_config_data) is enough.


Ok, if we don't need a message larger than that in the future.

Thanks



* Re: Re: [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-01-28  6:14             ` Jason Wang
@ 2021-01-28  6:43               ` Yongji Xie
  0 siblings, 0 replies; 57+ messages in thread
From: Yongji Xie @ 2021-01-28  6:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Jan 28, 2021 at 2:14 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/1/28 2:03 PM, Yongji Xie wrote:
> >>>>> +
> >>>>> +static const struct file_operations vduse_domain_fops = {
> >>>>> +     .mmap = vduse_domain_mmap,
> >>>>> +     .release = vduse_domain_release,
> >>>>> +};
> >>>> It's better to explain the reason for introducing a dedicated file for
> >>>> mmap() here.
> >>>>
> >>> To make the implementation of iova_domain independent of vduse_dev.
> >> My understanding is that, the only usage for this is to:
> >>
> >> 1) support different type of iova mappings
> >> 2) or switch between iova domain mappings
> >>
> >> But I can't think of a need for this.
> >>
> > For example, share one iova_domain between several vduse devices.
>
>
> Interesting.
>
>
> >
> > And it will be helpful if we want to split this patch into iova domain
> > part and vduse device part. Because the page fault handler should be
> > paired with dma_map/dma_unmap.
>
>
> Ok.
>
> [...]
>
>
> >
> >>>> This looks not safe, let's use idr here.
> >>>>
> >>> Could you give more details? It looks like idr should not be used in this
> >>> case, which cannot tolerate failure. And using a list to store the msg
> >>> is better than using idr when the msg needs to be re-inserted in some
> >>> cases.
> >> My understanding is that "unique" (it probably needs a better name) is a
> >> token that is used to uniquely identify a message. The reply from
> >> userspace is required to carry exactly the same token (unique). IDR
> >> seems better, but considering we can hardly hit 64bit overflow, atomic might
> >> be OK as well.
> >>
> >> Btw, in what case do we need to re-insert a message?
> >>
> > When the userspace daemon receives the message but crashes before replying to it.
>
>
> Do we have code to do this?
>

Yes, in patch 9.
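
To illustrate the list-based approach, here is a userspace sketch and not
the code from patch 9. The msg_queue type and the queue_*() helpers are
hypothetical; messages wait on a send queue, move to a receive queue once
the daemon has read them, are matched by their "unique" token on reply, and
go back to the send queue if the daemon restarts without replying.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vduse_msg {
	uint32_t unique;        /* token echoed back in the reply */
	struct vduse_msg *next;
};

/* Hypothetical FIFO of messages. */
struct msg_queue {
	struct vduse_msg *head;
	struct vduse_msg *tail;
};

static void queue_push(struct msg_queue *q, struct vduse_msg *m)
{
	assert(q != NULL && m != NULL);
	m->next = NULL;
	if (q->tail)
		q->tail->next = m;
	else
		q->head = m;
	q->tail = m;
}

static struct vduse_msg *queue_pop(struct msg_queue *q)
{
	struct vduse_msg *m = q->head;

	if (!m)
		return NULL;
	q->head = m->next;
	if (!q->head)
		q->tail = NULL;
	m->next = NULL;
	return m;
}

/* Remove and return the message whose token equals 'unique', or NULL. */
static struct vduse_msg *queue_take(struct msg_queue *q, uint32_t unique)
{
	struct vduse_msg *prev = NULL, *m;

	for (m = q->head; m; prev = m, m = m->next) {
		if (m->unique != unique)
			continue;
		if (prev)
			prev->next = m->next;
		else
			q->head = m->next;
		if (q->tail == m)
			q->tail = prev;
		m->next = NULL;
		return m;
	}
	return NULL;
}

/* Daemon restarted: move every un-replied message back to the send queue. */
static void queue_requeue_all(struct msg_queue *recv, struct msg_queue *send)
{
	struct vduse_msg *m;

	while ((m = queue_pop(recv)) != NULL)
		queue_push(send, m);
}
```

Re-queuing cannot fail here because no allocation is involved, which is the
property the reply above relies on when preferring a list over idr.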

>
> >
> >>>> So we had multiple types of requests/responses, is this better to
> >>>> introduce a queue based admin interface other than ioctl?
> >>>>
> >>> Sorry, I didn't get your point. What do you mean by queue-based admin
> >>> interface? Virtqueue-based?
> >> Yes, a queue(virtqueue). The commands could be passed through the queue.
> >> (Just an idea, not sure it's worth)
> >>
> > I considered it before. But I found it still needs some extra work
> > (setting up the eventfd, setting the vring base and so on) to set up the
> > admin virtqueue before using it for communication. So I turned to this simple way.
>
>
> Yes. We might consider it in the future.
>

Agree.

>
>
> >
> >>>> Any reason for such IOTLB invalidation here?
> >>>>
> >>> As I mentioned before, this is used to notify userspace to update the
> >>> IOTLB. Mainly for virtio-vdpa case.
> >> So the question is, usually, there could be several times of status
> >> setting during driver initialization. Do we really need to update IOTLB
> >> every time?
> >>
> > I think we can check whether there are some changes after the last
> > IOTLB updating here.
>
>
> So the question remains: except for reset (write 0), can any other status
> affect the IOTLB?
>

OK, I get your point. The status would not affect the IOTLB. The reason
why we update the IOTLB here is that we can't do it in dma_map_ops, which
might run in an atomic context. So I want to notify userspace to
update the IOTLB before I/O is processed. Of course, it's not a must
because userspace can manually query it.

Thanks,
Yongji


* Re: Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
       [not found]             ` <20210129150359.caitcskrfhqed73z@steredhat>
@ 2021-01-30 11:33               ` Yongji Xie
  2021-02-01 11:05                 ` Stefano Garzarella
  0 siblings, 1 reply; 57+ messages in thread
From: Yongji Xie @ 2021-01-30 11:33 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Jason Wang, Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	Jens Axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Fri, Jan 29, 2021 at 11:04 PM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Thu, Jan 28, 2021 at 11:11:49AM +0800, Jason Wang wrote:
> >
> >On 2021/1/27 4:57 PM, Stefano Garzarella wrote:
> >>On Wed, Jan 27, 2021 at 11:33:03AM +0800, Jason Wang wrote:
> >>>
> >>>On 2021/1/20 7:08 PM, Stefano Garzarella wrote:
> >>>>On Wed, Jan 20, 2021 at 11:46:38AM +0800, Jason Wang wrote:
> >>>>>
> >>>>>On 2021/1/19 12:59 PM, Xie Yongji wrote:
> >>>>>>With VDUSE, we should be able to support all kinds of virtio devices.
> >>>>>>
> >>>>>>Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>>>>>---
> >>>>>> drivers/vhost/vdpa.c | 29 +++--------------------------
> >>>>>> 1 file changed, 3 insertions(+), 26 deletions(-)
> >>>>>>
> >>>>>>diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>>>>>index 29ed4173f04e..448be7875b6d 100644
> >>>>>>--- a/drivers/vhost/vdpa.c
> >>>>>>+++ b/drivers/vhost/vdpa.c
> >>>>>>@@ -22,6 +22,7 @@
> >>>>>> #include <linux/nospec.h>
> >>>>>> #include <linux/vhost.h>
> >>>>>> #include <linux/virtio_net.h>
> >>>>>>+#include <linux/virtio_blk.h>
> >>>>>> #include "vhost.h"
> >>>>>>@@ -185,26 +186,6 @@ static long
> >>>>>>vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user
> >>>>>>*statusp)
> >>>>>>     return 0;
> >>>>>> }
> >>>>>>-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
> >>>>>>-                      struct vhost_vdpa_config *c)
> >>>>>>-{
> >>>>>>-    long size = 0;
> >>>>>>-
> >>>>>>-    switch (v->virtio_id) {
> >>>>>>-    case VIRTIO_ID_NET:
> >>>>>>-        size = sizeof(struct virtio_net_config);
> >>>>>>-        break;
> >>>>>>-    }
> >>>>>>-
> >>>>>>-    if (c->len == 0)
> >>>>>>-        return -EINVAL;
> >>>>>>-
> >>>>>>-    if (c->len > size - c->off)
> >>>>>>-        return -E2BIG;
> >>>>>>-
> >>>>>>-    return 0;
> >>>>>>-}
> >>>>>
> >>>>>
> >>>>>I think we should use a separate patch for this.
> >>>>
> >>>>For the vdpa-blk simulator I had the same issues and I'm adding
> >>>>a .get_config_size() callback to vdpa devices.
> >>>>
> >>>>Do you think it makes sense, or is it better to remove this check in
> >>>>vhost/vdpa, delegating the boundary checks to the
> >>>>get_config/set_config callbacks?
> >>>
> >>>
> >>>A question here. How much value could we gain from
> >>>get_config_size(), considering we can let the vDPA parent validate the
> >>>length in its get_config()?
> >>>
> >>
> >>I agree, most of the implementations already validate the length,
> >>the only gain is an error returned since get_config() is void, but
> >>eventually we can add a return value to it.
> >
> >
> >Right, one problem here is that, for the virtio path, its get_config()
> >returns void. So we cannot propagate an error to virtio drivers. But it
> >might not be a big issue since we trust the kernel virtio driver.
> >
> >So I think it makes sense to change the return value in the vdpa config ops.
>
> Thanks for your feedback!
>
> @Xie do you plan to do this?
>
> Otherwise I can do it in my vdpa-blk-sim series, where for now I used this
> patch as is.
>

Hi Stefano, please do it in your vdpa-blk-sim series. I have no plan for it now.

Thanks,
Yongji


* Re: [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices
  2021-01-30 11:33               ` Yongji Xie
@ 2021-02-01 11:05                 ` Stefano Garzarella
  0 siblings, 0 replies; 57+ messages in thread
From: Stefano Garzarella @ 2021-02-01 11:05 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Michael S. Tsirkin, Stefan Hajnoczi, Parav Pandit,
	Bob Liu, Christoph Hellwig, Randy Dunlap, Matthew Wilcox, viro,
	Jens Axboe, bcrl, Jonathan Corbet, virtualization, netdev, kvm,
	linux-aio, linux-fsdevel

On Sat, Jan 30, 2021 at 07:33:08PM +0800, Yongji Xie wrote:
>On Fri, Jan 29, 2021 at 11:04 PM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> On Thu, Jan 28, 2021 at 11:11:49AM +0800, Jason Wang wrote:
>> >
>> >On 2021/1/27 4:57 PM, Stefano Garzarella wrote:
>> >>On Wed, Jan 27, 2021 at 11:33:03AM +0800, Jason Wang wrote:
>> >>>
>> >>>On 2021/1/20 7:08 PM, Stefano Garzarella wrote:
>> >>>>On Wed, Jan 20, 2021 at 11:46:38AM +0800, Jason Wang wrote:
>> >>>>>
>> >>>>>On 2021/1/19 12:59 PM, Xie Yongji wrote:
>> >>>>>>With VDUSE, we should be able to support all kinds of virtio devices.
>> >>>>>>
>> >>>>>>Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>> >>>>>>---
>> >>>>>> drivers/vhost/vdpa.c | 29 +++--------------------------
>> >>>>>> 1 file changed, 3 insertions(+), 26 deletions(-)
>> >>>>>>
>> >>>>>>diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> >>>>>>index 29ed4173f04e..448be7875b6d 100644
>> >>>>>>--- a/drivers/vhost/vdpa.c
>> >>>>>>+++ b/drivers/vhost/vdpa.c
>> >>>>>>@@ -22,6 +22,7 @@
>> >>>>>> #include <linux/nospec.h>
>> >>>>>> #include <linux/vhost.h>
>> >>>>>> #include <linux/virtio_net.h>
>> >>>>>>+#include <linux/virtio_blk.h>
>> >>>>>> #include "vhost.h"
>> >>>>>>@@ -185,26 +186,6 @@ static long
>> >>>>>>vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user
>> >>>>>>*statusp)
>> >>>>>>     return 0;
>> >>>>>> }
>> >>>>>>-static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
>> >>>>>>-                      struct vhost_vdpa_config *c)
>> >>>>>>-{
>> >>>>>>-    long size = 0;
>> >>>>>>-
>> >>>>>>-    switch (v->virtio_id) {
>> >>>>>>-    case VIRTIO_ID_NET:
>> >>>>>>-        size = sizeof(struct virtio_net_config);
>> >>>>>>-        break;
>> >>>>>>-    }
>> >>>>>>-
>> >>>>>>-    if (c->len == 0)
>> >>>>>>-        return -EINVAL;
>> >>>>>>-
>> >>>>>>-    if (c->len > size - c->off)
>> >>>>>>-        return -E2BIG;
>> >>>>>>-
>> >>>>>>-    return 0;
>> >>>>>>-}
>> >>>>>
>> >>>>>
>> >>>>>I think we should use a separate patch for this.
>> >>>>
>> >>>>For the vdpa-blk simulator I had the same issues and I'm adding
>> >>>>a .get_config_size() callback to vdpa devices.
>> >>>>
>> >>>>Do you think it makes sense, or is it better to remove this check in
>> >>>>vhost/vdpa, delegating the boundary checks to the
>> >>>>get_config/set_config callbacks?
>> >>>
>> >>>
>> >>>A question here. How much value could we gain from
>> >>>get_config_size() consider we can let vDPA parent to validate the
>> >>>length in its get_config().
>> >>>
>> >>
>> >>I agree, most of the implementations already validate the length,
>> >>the only gain is an error returned since get_config() is void, but
>> >>eventually we can add a return value to it.
>> >
>> >
>> >Right, one problem here is that, for the virtio path, its get_config()
>> >returns void. So we cannot propagate an error to virtio drivers. But it
>> >might not be a big issue since we trust the kernel virtio driver.
>> >
>> >So I think it makes sense to change the return value in the vdpa config ops.
>>
>> Thanks for your feedback!
>>
>> @Xie do you plan to do this?
>>
>> Otherwise I can do in my vdpa-blk-sim series, where for now I used this
>> patch as is.
>>
>
>Hi Stefano, please do it in your vdpa-blk-sim series. I have no plan for
>it now.

Okay, I'll do it.
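
As a rough illustration of what a get_config() with a return value could
look like, here is a userspace sketch. The sketch_dev type, its 64-byte
config space and sketch_get_config() are assumptions made for this example
and are not the vdpa config ops; the error codes mirror the ones used by
the removed vhost_vdpa_config_validate().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device with a private, fixed-size config space. */
struct sketch_dev {
	uint8_t config[64];
};

/*
 * Copy 'len' bytes of config space starting at 'offset' into 'buf'.
 * The device validates the range itself and reports errors to the
 * caller instead of returning void.
 */
static int sketch_get_config(const struct sketch_dev *dev, uint32_t offset,
			     void *buf, uint32_t len)
{
	assert(dev != NULL && buf != NULL);

	if (len == 0)
		return -EINVAL;

	/* Ordered so that offset + len is never computed and cannot wrap. */
	if (offset > sizeof(dev->config) ||
	    len > sizeof(dev->config) - offset)
		return -E2BIG;

	memcpy(buf, dev->config + offset, len);
	return 0;
}
```

With the check inside the device, the bus driver no longer needs a
per-device-type size table.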

Thanks,
Stefano



end of thread, other threads:[~2021-02-01 11:07 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-19  4:59 [RFC v3 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-01-19  4:59 ` [RFC v3 01/11] eventfd: track eventfd_signal() recursion depth separately in different cases Xie Yongji
2021-01-20  4:24   ` Jason Wang
2021-01-20  6:52     ` Yongji Xie
2021-01-27  3:37       ` Jason Wang
2021-01-27  9:11         ` Yongji Xie
2021-01-28  3:04           ` Jason Wang
2021-01-28  3:08             ` Jens Axboe
2021-01-28  5:12               ` Yongji Xie
2021-01-28  3:52             ` Yongji Xie
2021-01-28  4:31               ` Jason Wang
2021-01-28  6:08                 ` Yongji Xie
2021-01-19  4:59 ` [RFC v3 02/11] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
2021-01-19  4:59 ` [RFC v3 03/11] vdpa: Remove the restriction that only supports virtio-net devices Xie Yongji
2021-01-20  3:46   ` Jason Wang
2021-01-20  6:46     ` Yongji Xie
2021-01-20 11:08     ` Stefano Garzarella
2021-01-27  3:33       ` Jason Wang
2021-01-27  8:57         ` Stefano Garzarella
2021-01-28  3:11           ` Jason Wang
     [not found]             ` <20210129150359.caitcskrfhqed73z@steredhat>
2021-01-30 11:33               ` Yongji Xie
2021-02-01 11:05                 ` Stefano Garzarella
2021-01-27  8:59   ` Stefano Garzarella
2021-01-27  9:05     ` Yongji Xie
2021-01-19  4:59 ` [RFC v3 04/11] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
2021-01-20  3:44   ` Jason Wang
2021-01-20  6:44     ` Yongji Xie
2021-01-19  4:59 ` [RFC v3 05/11] vdpa: shared virtual addressing support Xie Yongji
2021-01-20  5:55   ` Jason Wang
2021-01-20  7:10     ` Yongji Xie
2021-01-27  3:43       ` Jason Wang
2021-01-19  4:59 ` [RFC v3 06/11] vhost-vdpa: Add an opaque pointer for vhost IOTLB Xie Yongji
2021-01-20  6:24   ` Jason Wang
2021-01-20  7:52     ` Yongji Xie
2021-01-27  3:51       ` Jason Wang
2021-01-27  9:27         ` Yongji Xie
2021-01-19  5:07 ` [RFC v3 07/11] vdpa: Pass the netlink attributes to ops.dev_add() Xie Yongji
2021-01-19  5:07   ` [RFC v3 08/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-01-19 14:53     ` Jonathan Corbet
2021-01-20  2:25       ` Yongji Xie
2021-01-19 17:53     ` Randy Dunlap
2021-01-20  2:42       ` Yongji Xie
2021-01-26  8:08     ` Jason Wang
2021-01-27  8:50       ` Yongji Xie
2021-01-28  4:27         ` Jason Wang
2021-01-28  6:03           ` Yongji Xie
2021-01-28  6:14             ` Jason Wang
2021-01-28  6:43               ` Yongji Xie
2021-01-26  8:19     ` Jason Wang
2021-01-27  8:59       ` Yongji Xie
2021-01-19  5:07   ` [RFC v3 09/11] vduse: Add VDUSE_GET_DEV ioctl Xie Yongji
2021-01-19  5:07   ` [RFC v3 10/11] vduse: grab the module's references until there is no vduse device Xie Yongji
2021-01-26  8:09     ` Jason Wang
2021-01-27  8:51       ` Yongji Xie
2021-01-19  5:07   ` [RFC v3 11/11] vduse: Introduce a workqueue for irq injection Xie Yongji
2021-01-26  8:17     ` Jason Wang
2021-01-27  9:00       ` Yongji Xie
