All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace
@ 2021-02-23 11:50 Xie Yongji
  2021-02-23 11:50 ` [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
                   ` (10 more replies)
  0 siblings, 11 replies; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This series introduces a framework, which can be used to implement
vDPA Devices in a userspace program. The work consist of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver will make use of message
mechnism to forward the config operation from vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply
those control messages.

In the data path, the core is mapping dma buffer into VDUSE
daemon's address space, which can be implemented in different ways
depending on the vdpa bus to which the vDPA device is attached.

In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
buffer is reside in a userspace memory region which can be shared to the
VDUSE userspace processs via transferring the shmfd.

The details and our user case is shown below:

------------------------    -------------------------   ----------------------------------------------
|            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
|       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
|       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
------------+-----------     -----------+------------   -------------+----------------------+---------
            |                           |                            |                      |
            |                           |                            |                      |
------------+---------------------------+----------------------------+----------------------+---------
|    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
|    -------+--------           --------+--------            -------+--------          -----+----    |
|           |                           |                           |                       |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
| | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
|           |      virtio bus           |                           |                       |        |
|   --------+----+-----------           |                           |                       |        |
|                |                      |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|      | virtio-blk device |            |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|                |                      |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|     |  virtio-vdpa driver |           |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|                |                      |                           |    vdpa bus           |        |
|     -----------+----------------------+---------------------------+------------           |        |
|                                                                                        ---+---     |
-----------------------------------------------------------------------------------------| NIC |------
                                                                                         ---+---
                                                                                            |
                                                                                   ---------+---------
                                                                                   | Remote Storages |
                                                                                   -------------------

We make use of it to implement a block device connecting to
our distributed storage, which can be used both in containers and
VMs. Thus, we can have an unified technology stack in this two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

Future work:
  - Improve performance
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)

V3 to V4:
- Rebase to vhost.git
- Split some patches
- Add some documents
- Use ioctl to inject interrupt rather than eventfd
- Enable config interrupt support
- Support binding irq to the specified cpu
- Add two module parameter to limit bounce/iova size
- Create char device rather than anon inode per vduse
- Reuse vhost IOTLB for iova domain
- Rework the message mechnism in control path

V2 to V3:
- Rework the MMU-based IOMMU driver
- Use the iova domain as iova allocator instead of genpool
- Support transferring vma->vm_file in vhost-vdpa
- Add SVA support in vhost-vdpa
- Remove the patches on bounce pages reclaim

V1 to V2:
- Add vhost-vdpa support
- Add some documents
- Based on the vdpa management tool
- Introduce a workqueue for irq injection
- Replace interval tree with array map to store the iova_map

Xie Yongji (11):
  eventfd: Increase the recursion depth of eventfd_signal()
  vhost-vdpa: protect concurrent access to vhost device iotlb
  vhost-iotlb: Add an opaque pointer for vhost IOTLB
  vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  vdpa: Support transferring virtual addressing during DMA mapping
  vduse: Implement an MMU-based IOMMU driver
  vduse: Introduce VDUSE - vDPA Device in Userspace
  vduse: Add config interrupt support
  Documentation: Add documentation for VDUSE
  vduse: Introduce a workqueue for irq injection
  vduse: Support binding irq to the specified cpu

 Documentation/userspace-api/index.rst              |    1 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 Documentation/userspace-api/vduse.rst              |  112 ++
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
 drivers/vdpa/vdpa.c                                |    9 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  486 +++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   61 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1399 ++++++++++++++++++++
 drivers/vhost/iotlb.c                              |   20 +-
 drivers/vhost/vdpa.c                               |  106 +-
 fs/eventfd.c                                       |    2 +-
 include/linux/eventfd.h                            |    5 +-
 include/linux/vdpa.h                               |   22 +-
 include/linux/vhost_iotlb.h                        |    3 +
 include/uapi/linux/vduse.h                         |  144 ++
 20 files changed, 2363 insertions(+), 36 deletions(-)
 create mode 100644 Documentation/userspace-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal()
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-02  6:44     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Increase the recursion depth of eventfd_signal() to 1. This
is the maximum recursion depth we have found so far.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/eventfd.c            | 2 +-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..cc7cd1dbedd3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..886d99cd38ef 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* Maximum recursion depth */
+#define EFD_WAKE_DEPTH 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-02-23 11:50 ` [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-02  6:47     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 03/11] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Use vhost_dev->mutex to protect vhost device iotlb from
concurrent access.

Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vhost/vdpa.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index c50079dfb281..5500e3bf05c1 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -723,6 +723,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 	if (r)
 		return r;
 
+	mutex_lock(&dev->mutex);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -742,6 +743,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 		r = -EINVAL;
 		break;
 	}
+	mutex_unlock(&dev->mutex);
 
 	return r;
 }
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 03/11] vhost-iotlb: Add an opaque pointer for vhost IOTLB
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-02-23 11:50 ` [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
  2021-02-23 11:50 ` [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-02  6:49     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 04/11] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Add an opaque pointer for vhost IOTLB. And introduce
vhost_iotlb_add_range_ctx() to accept it.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vhost/iotlb.c       | 20 ++++++++++++++++----
 include/linux/vhost_iotlb.h |  3 +++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..5c99e1112cbb 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -36,19 +36,21 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
 EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
 
 /**
- * vhost_iotlb_add_range - add a new range to vhost IOTLB
+ * vhost_iotlb_add_range_ctx - add a new range to vhost IOTLB
  * @iotlb: the IOTLB
  * @start: start of the IOVA range
  * @last: last of IOVA range
  * @addr: the address that is mapped to @start
  * @perm: access permission of this range
+ * @opaque: the opaque pointer for the new mapping
  *
  * Returns an error last is smaller than start or memory allocation
  * fails
  */
-int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
-			  u64 start, u64 last,
-			  u64 addr, unsigned int perm)
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
+			      u64 start, u64 last,
+			      u64 addr, unsigned int perm,
+			      void *opaque)
 {
 	struct vhost_iotlb_map *map;
 
@@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 	map->last = last;
 	map->addr = addr;
 	map->perm = perm;
+	map->opaque = opaque;
 
 	iotlb->nmaps++;
 	vhost_iotlb_itree_insert(map, &iotlb->root);
@@ -80,6 +83,15 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_iotlb_add_range_ctx);
+
+int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
+			  u64 start, u64 last,
+			  u64 addr, unsigned int perm)
+{
+	return vhost_iotlb_add_range_ctx(iotlb, start, last,
+					 addr, perm, NULL);
+}
 EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
 
 /**
diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
index 6b09b786a762..2d0e2f52f938 100644
--- a/include/linux/vhost_iotlb.h
+++ b/include/linux/vhost_iotlb.h
@@ -17,6 +17,7 @@ struct vhost_iotlb_map {
 	u32 perm;
 	u32 flags_padding;
 	u64 __subtree_last;
+	void *opaque;
 };
 
 #define VHOST_IOTLB_FLAG_RETIRE 0x1
@@ -29,6 +30,8 @@ struct vhost_iotlb {
 	unsigned int flags;
 };
 
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb, u64 start, u64 last,
+			      u64 addr, unsigned int perm, void *opaque);
 int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
 			  u64 addr, unsigned int perm);
 void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 04/11] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (2 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 03/11] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-02  6:50     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Add an opaque pointer for DMA mapping.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 6 +++---
 drivers/vhost/vdpa.c             | 2 +-
 include/linux/vdpa.h             | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index d5942842432d..5cfc262ce055 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -512,14 +512,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
 }
 
 static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
-			   u64 pa, u32 perm)
+			   u64 pa, u32 perm, void *opaque)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
 	int ret;
 
 	spin_lock(&vdpasim->iommu_lock);
-	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
-				    perm);
+	ret = vhost_iotlb_add_range_ctx(vdpasim->iommu, iova, iova + size - 1,
+					pa, perm, opaque);
 	spin_unlock(&vdpasim->iommu_lock);
 
 	return ret;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 5500e3bf05c1..70857fe3263c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -544,7 +544,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 4ab5494503a8..93dca2c328ae 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -241,7 +241,7 @@ struct vdpa_config_ops {
 	/* DMA ops */
 	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
 	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
-		       u64 pa, u32 perm);
+		       u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
 
 	/* Free device resources */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (3 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 04/11] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-03 10:52   ` Mika Penttilä
  2021-03-04  3:07     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver Xie Yongji
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This patch introduces an attribute for vDPA device to indicate
whether virtual address can be used. If vDPA device driver set
it, vhost-vdpa bus driver will not pin user page and transfer
userspace virtual address instead of physical address during
DMA mapping. And corresponding vma->vm_file and offset will be
also passed as an opaque pointer.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/ifcvf/ifcvf_main.c   |   2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c |   2 +-
 drivers/vdpa/vdpa.c               |   9 +++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  |   2 +-
 drivers/vhost/vdpa.c              | 104 +++++++++++++++++++++++++++++++-------
 include/linux/vdpa.h              |  20 ++++++--
 6 files changed, 113 insertions(+), 26 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index 7c8bbfcf6c3e..228b9f920fea 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -432,7 +432,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
 				    dev, &ifc_vdpa_ops,
-				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
+				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
 	if (adapter == NULL) {
 		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
 		return -ENOMEM;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 029822060017..54290438da28 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1964,7 +1964,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
 	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
 
 	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
-				 2 * mlx5_vdpa_max_qps(max_vqs), NULL);
+				 2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
 	if (IS_ERR(ndev))
 		return PTR_ERR(ndev);
 
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 9700a0adcca0..fafc0ee5eb05 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
  * @nvqs: number of virtqueues supported by this device
  * @size: size of the parent structure that contains private data
  * @name: name of the vdpa device; optional.
+ * @use_va: indicate whether virtual address can be used by this device
  *
  * Driver should use vdpa_alloc_device() wrapper macro instead of
  * using this directly.
@@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
  */
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					int nvqs, size_t size, const char *name)
+					int nvqs, size_t size, const char *name,
+					bool use_va)
 {
 	struct vdpa_device *vdev;
 	int err = -EINVAL;
@@ -92,6 +94,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	if (!!config->dma_map != !!config->dma_unmap)
 		goto err;
 
+	/* It should only work for the device that use on-chip IOMMU */
+	if (use_va && !(config->dma_map || config->set_map))
+		goto err;
+
 	err = -ENOMEM;
 	vdev = kzalloc(size, GFP_KERNEL);
 	if (!vdev)
@@ -108,6 +114,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	vdev->config = config;
 	vdev->features_valid = false;
 	vdev->nvqs = nvqs;
+	vdev->use_va = use_va;
 
 	if (name)
 		err = dev_set_name(&vdev->dev, "%s", name);
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 5cfc262ce055..3a9a2dd4e987 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
 		ops = &vdpasim_config_ops;
 
 	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
-				    dev_attr->nvqs, dev_attr->name);
+				    dev_attr->nvqs, dev_attr->name, false);
 	if (!vdpasim)
 		goto err_alloc;
 
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 70857fe3263c..93769ace34df 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -480,21 +480,31 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
 	struct vhost_dev *dev = &v->vdev;
+	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
 	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
 	struct page *page;
 	unsigned long pfn, pinned;
 
 	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
-		pinned = map->size >> PAGE_SHIFT;
-		for (pfn = map->addr >> PAGE_SHIFT;
-		     pinned > 0; pfn++, pinned--) {
-			page = pfn_to_page(pfn);
-			if (map->perm & VHOST_ACCESS_WO)
-				set_page_dirty_lock(page);
-			unpin_user_page(page);
+		if (!vdpa->use_va) {
+			pinned = map->size >> PAGE_SHIFT;
+			for (pfn = map->addr >> PAGE_SHIFT;
+			     pinned > 0; pfn++, pinned--) {
+				page = pfn_to_page(pfn);
+				if (map->perm & VHOST_ACCESS_WO)
+					set_page_dirty_lock(page);
+				unpin_user_page(page);
+			}
+			atomic64_sub(map->size >> PAGE_SHIFT,
+					&dev->mm->pinned_vm);
+		} else {
+			map_file = (struct vdpa_map_file *)map->opaque;
+			if (map_file->file)
+				fput(map_file->file);
+			kfree(map_file);
 		}
-		atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 		vhost_iotlb_map_free(iotlb, map);
 	}
 }
@@ -530,21 +540,21 @@ static int perm_to_iommu_flags(u32 perm)
 	return flags | IOMMU_CACHE;
 }
 
-static int vhost_vdpa_map(struct vhost_vdpa *v,
-			  u64 iova, u64 size, u64 pa, u32 perm)
+static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
+			  u64 size, u64 pa, u32 perm, void *opaque)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vdpa_device *vdpa = v->vdpa;
 	const struct vdpa_config_ops *ops = vdpa->config;
 	int r = 0;
 
-	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
-				  pa, perm);
+	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
+				      pa, perm, opaque);
 	if (r)
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
@@ -552,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		r = iommu_map(v->domain, iova, pa, size,
 			      perm_to_iommu_flags(perm));
 	}
-
-	if (r)
+	if (r) {
 		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
-	else
+		return r;
+	}
+
+	if (!vdpa->use_va)
 		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 
-	return r;
+	return 0;
 }
 
 static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
@@ -579,10 +591,60 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
+static int vhost_vdpa_va_map(struct vhost_vdpa *v,
+			     u64 iova, u64 size, u64 uaddr, u32 perm)
+{
+	struct vhost_dev *dev = &v->vdev;
+	u64 offset, map_size, map_iova = iova;
+	struct vdpa_map_file *map_file;
+	struct vm_area_struct *vma;
+	int ret;
+
+	mmap_read_lock(dev->mm);
+
+	while (size) {
+		vma = find_vma(dev->mm, uaddr);
+		if (!vma) {
+			ret = -EINVAL;
+			goto err;
+		}
+		map_size = min(size, vma->vm_end - uaddr);
+		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
+		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
+		if (!map_file) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
+			!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
+			map_file->file = get_file(vma->vm_file);
+			map_file->offset = offset;
+		}
+		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
+				     perm, map_file);
+		if (ret) {
+			if (map_file->file)
+				fput(map_file->file);
+			kfree(map_file);
+			goto err;
+		}
+		size -= map_size;
+		uaddr += map_size;
+		map_iova += map_size;
+	}
+	mmap_read_unlock(dev->mm);
+
+	return 0;
+err:
+	vhost_vdpa_unmap(v, iova, map_iova - iova);
+	return ret;
+}
+
 static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 					   struct vhost_iotlb_msg *msg)
 {
 	struct vhost_dev *dev = &v->vdev;
+	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
 	struct page **page_list;
 	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
@@ -601,6 +663,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				    msg->iova + msg->size - 1))
 		return -EEXIST;
 
+	if (vdpa->use_va)
+		return vhost_vdpa_va_map(v, msg->iova, msg->size,
+					 msg->uaddr, msg->perm);
+
 	/* Limit the use of memory for bookkeeping */
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
 	if (!page_list)
@@ -654,7 +720,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     msg->perm);
+						     msg->perm, NULL);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -683,7 +749,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, msg->perm);
+			     map_pfn << PAGE_SHIFT, msg->perm, NULL);
 out:
 	if (ret) {
 		if (nchunks) {
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 93dca2c328ae..bfae6d780c38 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
  * @config: the configuration ops for this device.
  * @index: device index
  * @features_valid: were features initialized? for legacy guests
+ * @use_va: indicate whether virtual address can be used by this device
  * @nvqs: maximum number of supported virtqueues
  * @mdev: management device pointer; caller must setup when registering device as part
  *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
@@ -54,6 +55,7 @@ struct vdpa_device {
 	const struct vdpa_config_ops *config;
 	unsigned int index;
 	bool features_valid;
+	bool use_va;
 	int nvqs;
 	struct vdpa_mgmt_dev *mdev;
 };
@@ -69,6 +71,16 @@ struct vdpa_iova_range {
 };
 
 /**
+ * Corresponding file area for device memory mapping
+ * @file: vma->vm_file for the mapping
+ * @offset: mapping offset in the vm_file
+ */
+struct vdpa_map_file {
+	struct file *file;
+	u64 offset;
+};
+
+/**
  * vDPA_config_ops - operations for configuring a vDPA device.
  * Note: vDPA device drivers are required to implement all of the
  * operations unless it is mentioned to be optional in the following
@@ -250,14 +262,16 @@ struct vdpa_config_ops {
 
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					int nvqs, size_t size, const char *name);
+					int nvqs, size_t size,
+					const char *name, bool use_va);
 
-#define vdpa_alloc_device(dev_struct, member, parent, config, nvqs, name)   \
+#define vdpa_alloc_device(dev_struct, member, parent, config, \
+			  nvqs, name, use_va) \
 			  container_of(__vdpa_alloc_device( \
 				       parent, config, nvqs, \
 				       sizeof(dev_struct) + \
 				       BUILD_BUG_ON_ZERO(offsetof( \
-				       dev_struct, member)), name), \
+				       dev_struct, member)), name, use_va), \
 				       dev_struct, member)
 
 int vdpa_register_device(struct vdpa_device *vdev);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (4 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-04  4:20     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This implements a MMU-based IOMMU driver to support mapping
kernel dma buffer into userspace. The basic idea behind it is
treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
up MMU mapping instead of IOMMU mapping for the DMA transfer so
that the userspace process is able to use its virtual address to
access the dma buffer in kernel.

And to avoid security issue, a bounce-buffering mechanism is
introduced to prevent userspace accessing the original buffer
directly.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 486 +++++++++++++++++++++++++++++++++++
 drivers/vdpa/vdpa_user/iova_domain.h |  61 +++++
 2 files changed, 547 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..9285d430d486
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,486 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/highmem.h>
+
+#include "iova_domain.h"
+
+#define IOVA_START_PFN 1
+#define IOVA_ALLOC_ORDER 12
+#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
+
+static inline struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	u64 index = iova >> PAGE_SHIFT;
+
+	return domain->bounce_pages[index];
+}
+
+static inline void
+vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
+				u64 iova, struct page *page)
+{
+	u64 index = iova >> PAGE_SHIFT;
+
+	domain->bounce_pages[index] = page;
+}
+
+static enum dma_data_direction perm_to_dir(int perm)
+{
+	enum dma_data_direction dir;
+
+	switch (perm) {
+	case VHOST_MAP_WO:
+		dir = DMA_FROM_DEVICE;
+		break;
+	case VHOST_MAP_RO:
+		dir = DMA_TO_DEVICE;
+		break;
+	case VHOST_MAP_RW:
+		dir = DMA_BIDIRECTIONAL;
+		break;
+	default:
+		break;
+	}
+
+	return dir;
+}
+
+static int dir_to_perm(enum dma_data_direction dir)
+{
+	int perm = -EFAULT;
+
+	switch (dir) {
+	case DMA_FROM_DEVICE:
+		perm = VHOST_MAP_WO;
+		break;
+	case DMA_TO_DEVICE:
+		perm = VHOST_MAP_RO;
+		break;
+	case DMA_BIDIRECTIONAL:
+		perm = VHOST_MAP_RW;
+		break;
+	default:
+		break;
+	}
+
+	return perm;
+}
+
+static void do_bounce(phys_addr_t orig, void *addr, size_t size,
+			enum dma_data_direction dir)
+{
+	unsigned long pfn = PFN_DOWN(orig);
+
+	if (PageHighMem(pfn_to_page(pfn))) {
+		unsigned int offset = offset_in_page(orig);
+		char *buffer;
+		unsigned int sz = 0;
+		unsigned long flags;
+
+		while (size) {
+			sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+			local_irq_save(flags);
+			buffer = kmap_atomic(pfn_to_page(pfn));
+			if (dir == DMA_TO_DEVICE)
+				memcpy(addr, buffer + offset, sz);
+			else
+				memcpy(buffer + offset, addr, sz);
+			kunmap_atomic(buffer);
+			local_irq_restore(flags);
+
+			size -= sz;
+			pfn++;
+			addr += sz;
+			offset = 0;
+		}
+	} else if (dir == DMA_TO_DEVICE) {
+		memcpy(addr, phys_to_virt(orig), size);
+	} else {
+		memcpy(phys_to_virt(orig), addr, size);
+	}
+}
+
+static struct page *
+vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	u64 start = iova & PAGE_MASK;
+	u64 last = start + PAGE_SIZE - 1;
+	struct vhost_iotlb_map *map;
+	struct page *page = NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
+	if (!map)
+		goto out;
+
+	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
+	get_page(page);
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static struct page *
+vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	u64 start = iova & PAGE_MASK;
+	u64 last = start + PAGE_SIZE - 1;
+	struct vhost_iotlb_map *map;
+	struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
+
+	if (!new_page)
+		return NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	if (!vhost_iotlb_itree_first(domain->iotlb, start, last)) {
+		__free_page(new_page);
+		goto out;
+	}
+	page = vduse_domain_get_bounce_page(domain, iova);
+	if (page) {
+		get_page(page);
+		__free_page(new_page);
+		goto out;
+	}
+	vduse_domain_set_bounce_page(domain, iova, new_page);
+	get_page(new_page);
+	page = new_page;
+
+	for (map = vhost_iotlb_itree_first(domain->iotlb, start, last); map;
+	     map = vhost_iotlb_itree_next(map, start, last)) {
+		unsigned int src_offset = 0, dst_offset = 0;
+		phys_addr_t src;
+		void *dst;
+		size_t sz;
+
+		if (perm_to_dir(map->perm) == DMA_FROM_DEVICE)
+			continue;
+
+		if (start > map->start)
+			src_offset = start - map->start;
+		else
+			dst_offset = map->start - start;
+
+		src = map->addr + src_offset;
+		dst = page_address(page) + dst_offset;
+		sz = min_t(size_t, map->size - src_offset,
+				PAGE_SIZE - dst_offset);
+		do_bounce(src, dst, sz, DMA_TO_DEVICE);
+	}
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static void
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
+				u64 iova, size_t size)
+{
+	struct page *page;
+
+	spin_lock(&domain->iotlb_lock);
+	if (WARN_ON(vhost_iotlb_itree_first(domain->iotlb, iova,
+						iova + size - 1)))
+		goto out;
+
+	while (size > 0) {
+		page = vduse_domain_get_bounce_page(domain, iova);
+		if (page) {
+			vduse_domain_set_bounce_page(domain, iova, NULL);
+			__free_page(page);
+		}
+		size -= PAGE_SIZE;
+		iova += PAGE_SIZE;
+	}
+out:
+	spin_unlock(&domain->iotlb_lock);
+}
+
+static void vduse_domain_bounce(struct vduse_iova_domain *domain,
+				dma_addr_t iova, phys_addr_t orig,
+				size_t size, enum dma_data_direction dir)
+{
+	unsigned int offset = offset_in_page(iova);
+
+	while (size) {
+		struct page *p = vduse_domain_get_bounce_page(domain, iova);
+		size_t sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+		WARN_ON(!p && dir == DMA_FROM_DEVICE);
+
+		if (p)
+			do_bounce(orig, page_address(p) + offset, sz, dir);
+
+		size -= sz;
+		orig += sz;
+		iova += sz;
+		offset = 0;
+	}
+}
+
+static dma_addr_t vduse_domain_alloc_iova(struct iova_domain *iovad,
+				unsigned long size, unsigned long limit)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+	unsigned long iova_pfn;
+
+	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+		iova_len = roundup_pow_of_two(iova_len);
+	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
+
+	return iova_pfn << shift;
+}
+
+static void vduse_domain_free_iova(struct iova_domain *iovad,
+				dma_addr_t iova, size_t size)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+
+	free_iova_fast(iovad, iova >> shift, iova_len);
+}
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				struct page *page, unsigned long offset,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+	unsigned long limit = domain->bounce_size - 1;
+	phys_addr_t pa = page_to_phys(page) + offset;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+	int ret;
+
+	if (!iova)
+		return DMA_MAPPING_ERROR;
+
+	spin_lock(&domain->iotlb_lock);
+	ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
+				    (u64)iova + size - 1,
+				    pa, dir_to_perm(dir));
+	spin_unlock(&domain->iotlb_lock);
+	if (ret) {
+		vduse_domain_free_iova(iovad, iova, size);
+		return DMA_MAPPING_ERROR;
+	}
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, iova, pa, size, DMA_TO_DEVICE);
+
+	return iova;
+}
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			dma_addr_t dma_addr, size_t size,
+			enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+	struct vhost_iotlb_map *map;
+	phys_addr_t pa;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
+				      (u64)dma_addr + size - 1);
+	if (WARN_ON(!map)) {
+		spin_unlock(&domain->iotlb_lock);
+		return;
+	}
+	pa = map->addr;
+	vhost_iotlb_map_free(domain->iotlb, map);
+	spin_unlock(&domain->iotlb_lock);
+
+	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, dma_addr, pa,
+					size, DMA_FROM_DEVICE);
+
+	vduse_domain_free_iova(iovad, dma_addr, size);
+}
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				size_t size, dma_addr_t *dma_addr,
+				gfp_t flag, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	unsigned long limit = domain->iova_limit;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+	void *orig = alloc_pages_exact(size, flag);
+	int ret;
+
+	if (!iova || !orig)
+		goto err;
+
+	spin_lock(&domain->iotlb_lock);
+	ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
+				    (u64)iova + size - 1,
+				    virt_to_phys(orig), VHOST_MAP_RW);
+	spin_unlock(&domain->iotlb_lock);
+	if (ret)
+		goto err;
+
+	*dma_addr = iova;
+
+	return orig;
+err:
+	*dma_addr = DMA_MAPPING_ERROR;
+	if (orig)
+		free_pages_exact(orig, size);
+	if (iova)
+		vduse_domain_free_iova(iovad, iova, size);
+
+	return NULL;
+}
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	struct vhost_iotlb_map *map;
+	phys_addr_t pa;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
+				      (u64)dma_addr + size - 1);
+	if (WARN_ON(!map)) {
+		spin_unlock(&domain->iotlb_lock);
+		return;
+	}
+	pa = map->addr;
+	vhost_iotlb_map_free(domain->iotlb, map);
+	spin_unlock(&domain->iotlb_lock);
+
+	vduse_domain_free_iova(iovad, dma_addr, size);
+	free_pages_exact(phys_to_virt(pa), size);
+}
+
+static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
+{
+	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
+	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
+	struct page *page;
+
+	if (!domain)
+		return VM_FAULT_SIGBUS;
+
+	if (iova < domain->bounce_size)
+		page = vduse_domain_alloc_bounce_page(domain, iova);
+	else
+		page = vduse_domain_get_mapping_page(domain, iova);
+
+	if (!page)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = page;
+
+	return 0;
+}
+
+static const struct vm_operations_struct vduse_domain_mmap_ops = {
+	.fault = vduse_domain_mmap_fault,
+};
+
+static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_private_data = domain;
+	vma->vm_ops = &vduse_domain_mmap_ops;
+
+	return 0;
+}
+
+static int vduse_domain_release(struct inode *inode, struct file *file)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
+	put_iova_domain(&domain->stream_iovad);
+	put_iova_domain(&domain->consistent_iovad);
+	vhost_iotlb_free(domain->iotlb);
+	vfree(domain->bounce_pages);
+	kfree(domain);
+
+	return 0;
+}
+
+static const struct file_operations vduse_domain_fops = {
+	.mmap = vduse_domain_mmap,
+	.release = vduse_domain_release,
+};
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain)
+{
+	fput(domain->file);
+}
+
+struct vduse_iova_domain *
+vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
+{
+	struct vduse_iova_domain *domain;
+	struct file *file;
+	unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
+
+	if (iova_limit <= bounce_size)
+		return NULL;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->iotlb = vhost_iotlb_alloc(0, 0);
+	if (!domain->iotlb)
+		goto err_iotlb;
+
+	domain->iova_limit = iova_limit;
+	domain->bounce_size = PAGE_ALIGN(bounce_size);
+	domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
+	if (!domain->bounce_pages)
+		goto err_page;
+
+	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
+				domain, O_RDWR);
+	if (IS_ERR(file))
+		goto err_file;
+
+	domain->file = file;
+	spin_lock_init(&domain->iotlb_lock);
+	init_iova_domain(&domain->stream_iovad,
+			IOVA_ALLOC_SIZE, IOVA_START_PFN);
+	init_iova_domain(&domain->consistent_iovad,
+			PAGE_SIZE, bounce_pfns);
+
+	return domain;
+err_file:
+	vfree(domain->bounce_pages);
+err_page:
+	vhost_iotlb_free(domain->iotlb);
+err_iotlb:
+	kfree(domain);
+	return NULL;
+}
+
+int vduse_domain_init(void)
+{
+	return iova_cache_get();
+}
+
+void vduse_domain_exit(void)
+{
+	iova_cache_put();
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
new file mode 100644
index 000000000000..9c85d8346626
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_IOVA_DOMAIN_H
+#define _VDUSE_IOVA_DOMAIN_H
+
+#include <linux/iova.h>
+#include <linux/dma-mapping.h>
+#include <linux/vhost_iotlb.h>
+
+struct vduse_iova_domain {
+	struct iova_domain stream_iovad;
+	struct iova_domain consistent_iovad;
+	struct page **bounce_pages;
+	size_t bounce_size;
+	unsigned long iova_limit;
+	struct vhost_iotlb *iotlb;
+	spinlock_t iotlb_lock;
+	struct file *file;
+};
+
+static inline struct file *
+vduse_domain_file(struct vduse_iova_domain *domain)
+{
+	return domain->file;
+}
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				struct page *page, unsigned long offset,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs);
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			dma_addr_t dma_addr, size_t size,
+			enum dma_data_direction dir, unsigned long attrs);
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				size_t size, dma_addr_t *dma_addr,
+				gfp_t flag, unsigned long attrs);
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs);
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain);
+
+struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
+						size_t bounce_size);
+
+int vduse_domain_init(void);
+
+void vduse_domain_exit(void);
+
+#endif /* _VDUSE_IOVA_DOMAIN_H */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (5 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-04  6:27     ` Jason Wang
  2021-03-10 12:58     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 08/11] vduse: Add config interrupt support Xie Yongji
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This VDUSE driver enables implementing vDPA devices in userspace.
Both control path and data path of vDPA devices will be able to
be handled in userspace.

In the control path, the VDUSE driver will make use of message
mechnism to forward the config operation from vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply
those control messages.

In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
the file descriptors referring to vDPA device's iova regions. Then
userspace can use mmap() to access those iova regions. Besides,
userspace can use ioctl() to inject interrupt and use the eventfd
mechanism to receive virtqueue kicks.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1348 ++++++++++++++++++++
 include/uapi/linux/vduse.h                         |  136 ++
 6 files changed, 1501 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a4c75a28c839..71722e6f8f23 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
 'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
 '|'   00-7F  linux/media.h
 0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
 0x89  00-06  arch/x86/include/asm/sockios.h
 0x89  0B-DF  linux/sockios.h
 0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index ffd1e098bfd2..92f07715e3b6 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -25,6 +25,16 @@ config VDPA_SIM_NET
 	help
 	  vDPA networking device simulator which loops TX traffic back to RX.
 
+config VDPA_USER
+	tristate "VDUSE (vDPA Device in Userspace) support"
+	depends on EVENTFD && MMU && HAS_DMA
+	select DMA_OPS
+	select VHOST_IOTLB
+	select IOMMU_IOVA
+	help
+	  With VDUSE it is possible to emulate a vDPA Device
+	  in a userspace program.
+
 config IFCVF
 	tristate "Intel IFC VF vDPA driver"
 	depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index d160e9b63a66..66e97778ad03 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
 obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
 obj-$(CONFIG_IFCVF)    += ifcvf/
 obj-$(CONFIG_MLX5_VDPA) += mlx5/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index 000000000000..393bf99c48be
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1348 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/dma-map-ops.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/vdpa.h>
+#include <uapi/linux/vduse.h>
+#include <uapi/linux/vdpa.h>
+#include <uapi/linux/virtio_config.h>
+#include <linux/mod_devicetable.h>
+
+#include "iova_domain.h"
+
+#define DRV_VERSION  "1.0"
+#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
+#define DRV_DESC     "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+
+struct vduse_virtqueue {
+	u16 index;
+	bool ready;
+	spinlock_t kick_lock;
+	spinlock_t irq_lock;
+	struct eventfd_ctx *kickfd;
+	struct vdpa_callback cb;
+};
+
+struct vduse_dev;
+
+struct vduse_vdpa {
+	struct vdpa_device vdpa;
+	struct vduse_dev *dev;
+};
+
+struct vduse_dev {
+	struct vduse_vdpa *vdev;
+	struct device dev;
+	struct cdev cdev;
+	struct vduse_virtqueue *vqs;
+	struct vduse_iova_domain *domain;
+	struct vhost_iotlb *iommu;
+	spinlock_t iommu_lock;
+	atomic_t bounce_map;
+	struct mutex msg_lock;
+	atomic64_t msg_unique;
+	wait_queue_head_t waitq;
+	struct list_head send_list;
+	struct list_head recv_list;
+	struct list_head list;
+	bool connected;
+	int minor;
+	u16 vq_size_max;
+	u16 vq_num;
+	u32 vq_align;
+	u32 device_id;
+	u32 vendor_id;
+};
+
+struct vduse_dev_msg {
+	struct vduse_dev_request req;
+	struct vduse_dev_response resp;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	bool completed;
+};
+
+static unsigned long max_bounce_size = (64 * 1024 * 1024);
+module_param(max_bounce_size, ulong, 0444);
+MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
+
+static unsigned long max_iova_size = (128 * 1024 * 1024);
+module_param(max_iova_size, ulong, 0444);
+MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
+
+static DEFINE_MUTEX(vduse_lock);
+static LIST_HEAD(vduse_devs);
+static DEFINE_IDA(vduse_ida);
+
+static dev_t vduse_major;
+static struct class *vduse_class;
+
+static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
+{
+	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
+
+	return vdev->dev;
+}
+
+static inline struct vduse_dev *dev_to_vduse(struct device *dev)
+{
+	struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+	return vdpa_to_vduse(vdpa);
+}
+
+static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
+					    uint32_t unique)
+{
+	struct vduse_dev_msg *tmp, *msg = NULL;
+
+	list_for_each_entry(tmp, head, list) {
+		if (tmp->req.unique == unique) {
+			msg = tmp;
+			list_del(&tmp->list);
+			break;
+		}
+	}
+
+	return msg;
+}
+
+static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
+{
+	struct vduse_dev_msg *msg = NULL;
+
+	if (!list_empty(head)) {
+		msg = list_first_entry(head, struct vduse_dev_msg, list);
+		list_del(&msg->list);
+	}
+
+	return msg;
+}
+
+static void vduse_enqueue_msg(struct list_head *head,
+			      struct vduse_dev_msg *msg)
+{
+	list_add_tail(&msg->list, head);
+}
+
+static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
+{
+	int ret;
+
+	init_waitqueue_head(&msg->waitq);
+	mutex_lock(&dev->msg_lock);
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	mutex_unlock(&dev->msg_lock);
+	ret = wait_event_interruptible(msg->waitq, msg->completed);
+	mutex_lock(&dev->msg_lock);
+	if (!msg->completed)
+		list_del(&msg->list);
+	else
+		ret = msg->resp.result;
+	mutex_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static u64 vduse_dev_get_features(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_GET_FEATURES;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+
+	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.features;
+}
+
+static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_FEATURES;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.features = features;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static u8 vduse_dev_get_status(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_GET_STATUS;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+
+	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.status;
+}
+
+static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_STATUS;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.status = status;
+
+	vduse_dev_msg_sync(dev, &msg);
+}
+
+static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
+					void *buf, unsigned int len)
+{
+	struct vduse_dev_msg msg = { 0 };
+	unsigned int sz;
+
+	while (len) {
+		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
+		msg.req.type = VDUSE_GET_CONFIG;
+		msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+		msg.req.config.offset = offset;
+		msg.req.config.len = sz;
+		vduse_dev_msg_sync(dev, &msg);
+		memcpy(buf, msg.resp.config.data, sz);
+		buf += sz;
+		offset += sz;
+		len -= sz;
+	}
+}
+
+static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
+					const void *buf, unsigned int len)
+{
+	struct vduse_dev_msg msg = { 0 };
+	unsigned int sz;
+
+	while (len) {
+		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
+		msg.req.type = VDUSE_SET_CONFIG;
+		msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+		msg.req.config.offset = offset;
+		msg.req.config.len = sz;
+		memcpy(msg.req.config.data, buf, sz);
+		vduse_dev_msg_sync(dev, &msg);
+		buf += sz;
+		offset += sz;
+		len -= sz;
+	}
+}
+
+static void vduse_dev_set_vq_num(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, u32 num)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_NUM;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.vq_num.index = vq->index;
+	msg.req.vq_num.num = num;
+
+	vduse_dev_msg_sync(dev, &msg);
+}
+
+static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, u64 desc_addr,
+				u64 driver_addr, u64 device_addr)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_ADDR;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.vq_addr.index = vq->index;
+	msg.req.vq_addr.desc_addr = desc_addr;
+	msg.req.vq_addr.driver_addr = driver_addr;
+	msg.req.vq_addr.device_addr = device_addr;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, bool ready)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_READY;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.vq_ready.index = vq->index;
+	msg.req.vq_ready.ready = ready;
+
+	vduse_dev_msg_sync(dev, &msg);
+}
+
+static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
+				   struct vduse_virtqueue *vq)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_GET_VQ_READY;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.vq_ready.index = vq->index;
+
+	return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
+}
+
+static int vduse_dev_get_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg msg = { 0 };
+	int ret;
+
+	msg.req.type = VDUSE_GET_VQ_STATE;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.vq_state.index = vq->index;
+
+	ret = vduse_dev_msg_sync(dev, &msg);
+	if (!ret)
+		state->avail_index = msg.resp.vq_state.avail_idx;
+
+	return ret;
+}
+
+static int vduse_dev_set_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_STATE;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.vq_state.index = vq->index;
+	msg.req.vq_state.avail_idx = state->avail_index;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+				u64 start, u64 last)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	if (last < start)
+		return -EINVAL;
+
+	msg.req.type = VDUSE_UPDATE_IOTLB;
+	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
+	msg.req.iova.start = start;
+	msg.req.iova.last = last;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int size = sizeof(struct vduse_dev_request);
+	ssize_t ret = 0;
+
+	if (iov_iter_count(to) < size)
+		return 0;
+
+	mutex_lock(&dev->msg_lock);
+	while (1) {
+		msg = vduse_dequeue_msg(&dev->send_list);
+		if (msg)
+			break;
+
+		ret = -EAGAIN;
+		if (file->f_flags & O_NONBLOCK)
+			goto unlock;
+
+		mutex_unlock(&dev->msg_lock);
+		ret = wait_event_interruptible_exclusive(dev->waitq,
+					!list_empty(&dev->send_list));
+		if (ret)
+			return ret;
+
+		mutex_lock(&dev->msg_lock);
+	}
+	ret = copy_to_iter(&msg->req, size, to);
+	if (ret != size) {
+		ret = -EFAULT;
+		vduse_enqueue_msg(&dev->send_list, msg);
+		goto unlock;
+	}
+	vduse_enqueue_msg(&dev->recv_list, msg);
+unlock:
+	mutex_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_response resp;
+	struct vduse_dev_msg *msg;
+	size_t ret;
+
+	ret = copy_from_iter(&resp, sizeof(resp), from);
+	if (ret != sizeof(resp))
+		return -EINVAL;
+
+	mutex_lock(&dev->msg_lock);
+	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
+	if (!msg) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	memcpy(&msg->resp, &resp, sizeof(resp));
+	msg->completed = 1;
+	wake_up(&msg->waitq);
+unlock:
+	mutex_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
+{
+	struct vduse_dev *dev = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &dev->waitq, wait);
+
+	if (!list_empty(&dev->send_list))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int vduse_iotlb_add_range(struct vduse_dev *dev,
+				 u64 start, u64 last,
+				 u64 addr, unsigned int perm,
+				 struct file *file, u64 offset)
+{
+	struct vdpa_map_file *map_file;
+	int ret;
+
+	map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
+	if (!map_file)
+		return -ENOMEM;
+
+	map_file->file = get_file(file);
+	map_file->offset = offset;
+
+	spin_lock(&dev->iommu_lock);
+	ret = vhost_iotlb_add_range_ctx(dev->iommu, start, last,
+					addr, perm, map_file);
+	spin_unlock(&dev->iommu_lock);
+	if (ret) {
+		fput(map_file->file);
+		kfree(map_file);
+		return ret;
+	}
+	return 0;
+}
+
+static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
+{
+	struct vdpa_map_file *map_file;
+	struct vhost_iotlb_map *map;
+
+	spin_lock(&dev->iommu_lock);
+	while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		fput(map_file->file);
+		kfree(map_file);
+		vhost_iotlb_map_free(dev->iommu, map);
+	}
+	spin_unlock(&dev->iommu_lock);
+}
+
+static void vduse_dev_reset(struct vduse_dev *dev)
+{
+	int i;
+
+	atomic_set(&dev->bounce_map, 0);
+	vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
+	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		vq->ready = false;
+		vq->cb.callback = NULL;
+		vq->cb.private = NULL;
+		spin_unlock(&vq->irq_lock);
+	}
+}
+
+static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+				u64 desc_area, u64 driver_area,
+				u64 device_area)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_addr(dev, vq, desc_area,
+					driver_area, device_area);
+}
+
+static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->kick_lock);
+	if (vq->ready && vq->kickfd)
+		eventfd_signal(vq->kickfd, 1);
+	spin_unlock(&vq->kick_lock);
+}
+
+static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+			      struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->irq_lock);
+	vq->cb.callback = cb->callback;
+	vq->cb.private = cb->private;
+	spin_unlock(&vq->irq_lock);
+}
+
+static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_num(dev, vq, num);
+}
+
+static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
+					u16 idx, bool ready)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_ready(dev, vq, ready);
+	vq->ready = ready;
+}
+
+static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->ready = vduse_dev_get_vq_ready(dev, vq);
+
+	return vq->ready;
+}
+
+static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_state(dev, vq, state);
+}
+
+static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_get_vq_state(dev, vq, state);
+}
+
+static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_align;
+}
+
+static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
+
+	return (vduse_dev_get_features(dev) | fixed);
+}
+
+static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_set_features(dev, features);
+}
+
+static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
+				  struct vdpa_callback *cb)
+{
+	/* We don't support config interrupt */
+}
+
+static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_size_max;
+}
+
+static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->device_id;
+}
+
+static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vendor_id;
+}
+
+static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_get_status(dev);
+}
+
+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	if (status == 0)
+		vduse_dev_reset(dev);
+
+	vduse_dev_set_status(dev, status);
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
+			     void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_get_config(dev, offset, buf, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
+			const void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_set_config(dev, offset, buf, len);
+}
+
+static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
+				struct vhost_iotlb *iotlb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
+	u64 start = 0ULL, last = ULLONG_MAX;
+	int ret = 0;
+
+	vduse_iotlb_del_range(dev, start, last);
+
+	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+		map = vhost_iotlb_itree_next(map, start, last)) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		if (!map_file->file)
+			continue;
+
+		ret = vduse_iotlb_add_range(dev, map->start, map->last,
+					    map->addr, map->perm,
+					    map_file->file,
+					    map_file->offset);
+		if (ret)
+			break;
+	}
+	vduse_dev_update_iotlb(dev, start, last);
+
+	return ret;
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	WARN_ON(!list_empty(&dev->send_list));
+	WARN_ON(!list_empty(&dev->recv_list));
+	dev->vdev = NULL;
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_features		= vduse_vdpa_get_features,
+	.set_features		= vduse_vdpa_set_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.set_map		= vduse_vdpa_set_map,
+	.free			= vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+					unsigned long offset, size_t size,
+					enum dma_data_direction dir,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
+		vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
+				      0, VDUSE_ACCESS_RW,
+				      vduse_domain_file(domain), 0)) {
+		atomic_set(&vdev->bounce_map, 0);
+		return DMA_MAPPING_ERROR;
+	}
+
+	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+					dma_addr_t *dma_addr, gfp_t flag,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova;
+	void *addr;
+
+	*dma_addr = DMA_MAPPING_ERROR;
+	addr = vduse_domain_alloc_coherent(domain, size,
+				(dma_addr_t *)&iova, flag, attrs);
+	if (!addr)
+		return NULL;
+
+	if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
+				  iova, VDUSE_ACCESS_RW,
+				  vduse_domain_file(domain), iova)) {
+		vduse_domain_free_coherent(domain, size, addr, iova, attrs);
+		return NULL;
+	}
+	*dma_addr = (dma_addr_t)iova;
+
+	return addr;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long start = (unsigned long)dma_addr;
+	unsigned long last = start + size - 1;
+
+	vduse_iotlb_del_range(vdev, start, last);
+	vduse_dev_update_iotlb(vdev, start, last);
+	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
+}
+
+static const struct dma_map_ops vduse_dev_dma_ops = {
+	.map_page = vduse_dev_map_page,
+	.unmap_page = vduse_dev_unmap_page,
+	.alloc = vduse_dev_alloc_coherent,
+	.free = vduse_dev_free_coherent,
+};
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+	unsigned int flags = 0;
+
+	switch (perm) {
+	case VDUSE_ACCESS_WO:
+		flags |= O_WRONLY;
+		break;
+	case VDUSE_ACCESS_RO:
+		flags |= O_RDONLY;
+		break;
+	case VDUSE_ACCESS_RW:
+		flags |= O_RDWR;
+		break;
+	default:
+		WARN(1, "invalidate vhost IOTLB permission\n");
+		break;
+	}
+
+	return flags;
+}
+
+static int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct eventfd_ctx *ctx = NULL;
+	struct vduse_virtqueue *vq;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	if (eventfd->fd > 0) {
+		ctx = eventfd_ctx_fdget(eventfd->fd);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
+	}
+	spin_lock(&vq->kick_lock);
+	if (vq->kickfd)
+		eventfd_ctx_put(vq->kickfd);
+	vq->kickfd = ctx;
+	spin_unlock(&vq->kick_lock);
+
+	return 0;
+}
+
+static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	struct vduse_dev *dev = file->private_data;
+	void __user *argp = (void __user *)arg;
+	int ret;
+
+	switch (cmd) {
+	case VDUSE_IOTLB_GET_FD: {
+		struct vduse_iotlb_entry entry;
+		struct vhost_iotlb_map *map;
+		struct vdpa_map_file *map_file;
+		struct file *f = NULL;
+
+		ret = -EFAULT;
+		if (copy_from_user(&entry, argp, sizeof(entry)))
+			break;
+
+		spin_lock(&dev->iommu_lock);
+		map = vhost_iotlb_itree_first(dev->iommu, entry.start,
+					      entry.last);
+		if (map) {
+			map_file = (struct vdpa_map_file *)map->opaque;
+			f = get_file(map_file->file);
+			entry.offset = map_file->offset;
+			entry.start = map->start;
+			entry.last = map->last;
+			entry.perm = map->perm;
+		}
+		spin_unlock(&dev->iommu_lock);
+		if (!f) {
+			ret = -EINVAL;
+			break;
+		}
+		if (copy_to_user(argp, &entry, sizeof(entry))) {
+			fput(f);
+			ret = -EFAULT;
+			break;
+		}
+		ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
+		if (ret < 0) {
+			fput(f);
+			break;
+		}
+		fd_install(ret, f);
+		break;
+	}
+	case VDUSE_VQ_SETUP_KICKFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_kickfd_setup(dev, &eventfd);
+		break;
+	}
+	case VDUSE_INJECT_VQ_IRQ: {
+		struct vduse_virtqueue *vq;
+
+		ret = -EINVAL;
+		if (arg >= dev->vq_num)
+			break;
+
+		vq = &dev->vqs[arg];
+		spin_lock_irq(&vq->irq_lock);
+		if (vq->ready && vq->cb.callback) {
+			vq->cb.callback(vq->cb.private);
+			ret = 0;
+		}
+		spin_unlock_irq(&vq->irq_lock);
+		break;
+	}
+	default:
+		ret = -ENOIOCTLCMD;
+		break;
+	}
+
+	return ret;
+}
+
+static int vduse_dev_release(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->kick_lock);
+		if (vq->kickfd)
+			eventfd_ctx_put(vq->kickfd);
+		vq->kickfd = NULL;
+		spin_unlock(&vq->kick_lock);
+	}
+
+	mutex_lock(&dev->msg_lock);
+	while ((msg = vduse_dequeue_msg(&dev->recv_list)))
+		vduse_enqueue_msg(&dev->send_list, msg);
+	mutex_unlock(&dev->msg_lock);
+
+	dev->connected = false;
+
+	return 0;
+}
+
+static int vduse_dev_open(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = container_of(inode->i_cdev,
+					struct vduse_dev, cdev);
+	int ret = -EBUSY;
+
+	mutex_lock(&vduse_lock);
+	if (dev->connected)
+		goto unlock;
+
+	ret = 0;
+	dev->connected = true;
+	file->private_data = dev;
+unlock:
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_dev_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vduse_dev_open,
+	.release	= vduse_dev_release,
+	.read_iter	= vduse_dev_read_iter,
+	.write_iter	= vduse_dev_write_iter,
+	.poll		= vduse_dev_poll,
+	.unlocked_ioctl	= vduse_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct vduse_dev *vduse_dev_create(void)
+{
+	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+
+	if (!dev)
+		return NULL;
+
+	dev->iommu = vhost_iotlb_alloc(0, 0);
+	if (!dev->iommu) {
+		kfree(dev);
+		return NULL;
+	}
+
+	mutex_init(&dev->msg_lock);
+	INIT_LIST_HEAD(&dev->send_list);
+	INIT_LIST_HEAD(&dev->recv_list);
+	atomic64_set(&dev->msg_unique, 0);
+	spin_lock_init(&dev->iommu_lock);
+	atomic_set(&dev->bounce_map, 0);
+
+	init_waitqueue_head(&dev->waitq);
+
+	return dev;
+}
+
+static void vduse_dev_destroy(struct vduse_dev *dev)
+{
+	vhost_iotlb_free(dev->iommu);
+	mutex_destroy(&dev->msg_lock);
+	kfree(dev);
+}
+
+static struct vduse_dev *vduse_find_dev(const char *name)
+{
+	struct vduse_dev *tmp, *dev = NULL;
+
+	list_for_each_entry(tmp, &vduse_devs, list) {
+		if (!strcmp(dev_name(&tmp->dev), name)) {
+			dev = tmp;
+			break;
+		}
+	}
+	return dev;
+}
+
+static int vduse_destroy_dev(char *name)
+{
+	struct vduse_dev *dev = vduse_find_dev(name);
+
+	if (!dev)
+		return -EINVAL;
+
+	if (dev->vdev || dev->connected)
+		return -EBUSY;
+
+	dev->connected = true;
+	list_del(&dev->list);
+	cdev_device_del(&dev->cdev, &dev->dev);
+	put_device(&dev->dev);
+
+	return 0;
+}
+
+static void vduse_release_dev(struct device *device)
+{
+	struct vduse_dev *dev =
+		container_of(device, struct vduse_dev, dev);
+
+	ida_simple_remove(&vduse_ida, dev->minor);
+	kfree(dev->vqs);
+	vduse_domain_destroy(dev->domain);
+	vduse_dev_destroy(dev);
+	module_put(THIS_MODULE);
+}
+
+static int vduse_create_dev(struct vduse_dev_config *config)
+{
+	int i, ret = -ENOMEM;
+	struct vduse_dev *dev;
+
+	if (config->bounce_size > max_bounce_size)
+		return -EINVAL;
+
+	if (config->bounce_size > max_iova_size)
+		return -EINVAL;
+
+	if (vduse_find_dev(config->name))
+		return -EEXIST;
+
+	dev = vduse_dev_create();
+	if (!dev)
+		return -ENOMEM;
+
+	dev->device_id = config->device_id;
+	dev->vendor_id = config->vendor_id;
+	dev->domain = vduse_domain_create(max_iova_size - 1,
+					config->bounce_size);
+	if (!dev->domain)
+		goto err_domain;
+
+	dev->vq_align = config->vq_align;
+	dev->vq_size_max = config->vq_size_max;
+	dev->vq_num = config->vq_num;
+	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
+	if (!dev->vqs)
+		goto err_vqs;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		dev->vqs[i].index = i;
+		spin_lock_init(&dev->vqs[i].kick_lock);
+		spin_lock_init(&dev->vqs[i].irq_lock);
+	}
+
+	ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
+	if (ret < 0)
+		goto err_ida;
+
+	dev->minor = ret;
+	device_initialize(&dev->dev);
+	dev->dev.release = vduse_release_dev;
+	dev->dev.class = vduse_class;
+	dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
+	ret = dev_set_name(&dev->dev, "%s", config->name);
+	if (ret)
+		goto err_name;
+
+	cdev_init(&dev->cdev, &vduse_dev_fops);
+	dev->cdev.owner = THIS_MODULE;
+
+	ret = cdev_device_add(&dev->cdev, &dev->dev);
+	if (ret) {
+		put_device(&dev->dev);
+		return ret;
+	}
+	list_add(&dev->list, &vduse_devs);
+	__module_get(THIS_MODULE);
+
+	return 0;
+err_name:
+	ida_simple_remove(&vduse_ida, dev->minor);
+err_ida:
+	kfree(dev->vqs);
+err_vqs:
+	vduse_domain_destroy(dev->domain);
+err_domain:
+	vduse_dev_destroy(dev);
+	return ret;
+}
+
+static long vduse_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret;
+	void __user *argp = (void __user *)arg;
+
+	mutex_lock(&vduse_lock);
+	switch (cmd) {
+	case VDUSE_CREATE_DEV: {
+		struct vduse_dev_config config;
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, sizeof(config)))
+			break;
+
+		ret = vduse_create_dev(&config);
+		break;
+	}
+	case VDUSE_DESTROY_DEV: {
+		char name[VDUSE_NAME_MAX];
+
+		ret = -EFAULT;
+		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
+			break;
+
+		ret = vduse_destroy_dev(name);
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vduse_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static char *vduse_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
+}
+
+static struct miscdevice vduse_misc = {
+	.fops = &vduse_fops,
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vduse",
+	.nodename = "vduse/control",
+};
+
+static void vduse_mgmtdev_release(struct device *dev)
+{
+}
+
+static struct device vduse_mgmtdev = {
+	.init_name = "vduse",
+	.release = vduse_mgmtdev_release,
+};
+
+static struct vdpa_mgmt_dev mgmt_dev;
+
+static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
+{
+	struct vduse_vdpa *vdev = dev->vdev;
+	int ret;
+
+	if (vdev)
+		return -EEXIST;
+
+	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
+				 &vduse_vdpa_config_ops,
+				 dev->vq_num, name, true);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->dev = dev;
+	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
+	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
+	if (ret)
+		goto err;
+
+	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
+	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
+	vdev->vdpa.mdev = &mgmt_dev;
+
+	ret = _vdpa_register_device(&vdev->vdpa);
+	if (ret)
+		goto err;
+
+	dev->vdev = vdev;
+
+	return 0;
+err:
+	put_device(&vdev->vdpa.dev);
+	return ret;
+}
+
+static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
+{
+	struct vduse_dev *dev;
+	int ret = -EINVAL;
+
+	mutex_lock(&vduse_lock);
+	dev = vduse_find_dev(name);
+	if (!dev)
+		goto unlock;
+
+	ret = vduse_dev_add_vdpa(dev, name);
+unlock:
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
+{
+	_vdpa_unregister_device(dev);
+}
+
+static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
+	.dev_add = vdpa_dev_add,
+	.dev_del = vdpa_dev_del,
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct vdpa_mgmt_dev mgmt_dev = {
+	.device = &vduse_mgmtdev,
+	.id_table = id_table,
+	.ops = &vdpa_dev_mgmtdev_ops,
+};
+
+static int vduse_mgmtdev_init(void)
+{
+	int ret;
+
+	ret = device_register(&vduse_mgmtdev);
+	if (ret)
+		return ret;
+
+	ret = vdpa_mgmtdev_register(&mgmt_dev);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	device_unregister(&vduse_mgmtdev);
+	return ret;
+}
+
+static void vduse_mgmtdev_exit(void)
+{
+	vdpa_mgmtdev_unregister(&mgmt_dev);
+	device_unregister(&vduse_mgmtdev);
+}
+
+static int vduse_init(void)
+{
+	int ret;
+
+	ret = misc_register(&vduse_misc);
+	if (ret)
+		return ret;
+
+	vduse_class = class_create(THIS_MODULE, "vduse");
+	if (IS_ERR(vduse_class)) {
+		ret = PTR_ERR(vduse_class);
+		goto err_class;
+	}
+	vduse_class->devnode = vduse_devnode;
+
+	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
+	if (ret)
+		goto err_chardev;
+
+	ret = vduse_domain_init();
+	if (ret)
+		goto err_domain;
+
+	ret = vduse_mgmtdev_init();
+	if (ret)
+		goto err_mgmtdev;
+
+	return 0;
+err_mgmtdev:
+	vduse_domain_exit();
+err_domain:
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+err_chardev:
+	class_destroy(vduse_class);
+err_class:
+	misc_deregister(&vduse_misc);
+	return ret;
+}
+module_init(vduse_init);
+
+static void vduse_exit(void)
+{
+	misc_deregister(&vduse_misc);
+	class_destroy(vduse_class);
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+	vduse_domain_exit();
+	vduse_mgmtdev_exit();
+}
+module_exit(vduse_exit);
+
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
new file mode 100644
index 000000000000..9391c4acfa53
--- /dev/null
+++ b/include/uapi/linux/vduse.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_VDUSE_H_
+#define _UAPI_VDUSE_H_
+
+#include <linux/types.h>
+
+#define VDUSE_CONFIG_DATA_LEN	256
+#define VDUSE_NAME_MAX	256
+
+/* the control messages definition for read/write */
+
+enum vduse_req_type {
+	VDUSE_SET_VQ_NUM,
+	VDUSE_SET_VQ_ADDR,
+	VDUSE_SET_VQ_READY,
+	VDUSE_GET_VQ_READY,
+	VDUSE_SET_VQ_STATE,
+	VDUSE_GET_VQ_STATE,
+	VDUSE_SET_FEATURES,
+	VDUSE_GET_FEATURES,
+	VDUSE_SET_STATUS,
+	VDUSE_GET_STATUS,
+	VDUSE_SET_CONFIG,
+	VDUSE_GET_CONFIG,
+	VDUSE_UPDATE_IOTLB,
+};
+
+struct vduse_vq_num {
+	__u32 index;
+	__u32 num;
+};
+
+struct vduse_vq_addr {
+	__u32 index;
+	__u64 desc_addr;
+	__u64 driver_addr;
+	__u64 device_addr;
+};
+
+struct vduse_vq_ready {
+	__u32 index;
+	__u8 ready;
+};
+
+struct vduse_vq_state {
+	__u32 index;
+	__u16 avail_idx;
+};
+
+struct vduse_dev_config_data {
+	__u32 offset;
+	__u32 len;
+	__u8 data[VDUSE_CONFIG_DATA_LEN];
+};
+
+struct vduse_iova_range {
+	__u64 start;
+	__u64 last;
+};
+
+struct vduse_dev_request {
+	__u32 type; /* request type */
+	__u32 unique; /* request id */
+	__u32 reserved[2]; /* for feature use */
+	union {
+		struct vduse_vq_num vq_num; /* virtqueue num */
+		struct vduse_vq_addr vq_addr; /* virtqueue address */
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		struct vduse_iova_range iova; /* iova range for updating */
+		__u64 features; /* virtio features */
+		__u8 status; /* device status */
+	};
+};
+
+struct vduse_dev_response {
+	__u32 request_id; /* corresponding request id */
+#define VDUSE_REQUEST_OK	0x00
+#define VDUSE_REQUEST_FAILED	0x01
+	__u8 result; /* the result of request */
+	__u8 reserved[11]; /* for feature use */
+	union {
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		__u64 features; /* virtio features */
+		__u8 status; /* device status */
+	};
+};
+
+/* ioctls */
+
+struct vduse_dev_config {
+	char name[VDUSE_NAME_MAX]; /* vduse device name */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 bounce_size; /* bounce buffer size for iommu */
+	__u16 vq_num; /* the number of virtqueues */
+	__u16 vq_size_max; /* the max size of virtqueue */
+	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
+};
+
+struct vduse_iotlb_entry {
+	__u64 offset; /* the mmap offset on fd */
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* last of the IOVA range */
+#define VDUSE_ACCESS_RO 0x1
+#define VDUSE_ACCESS_WO 0x2
+#define VDUSE_ACCESS_RW 0x3
+	__u8 perm; /* access permission of this range */
+};
+
+struct vduse_vq_eventfd {
+	__u32 index; /* virtqueue index */
+	int fd; /* eventfd, -1 means de-assigning the eventfd */
+};
+
+#define VDUSE_BASE	0x81
+
+/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
+#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
+
+/* Destroy a vduse device. Make sure there are no references to the char device */
+#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
+
+/* Get a file descriptor for the mmap'able iova region */
+#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
+
+/* Setup an eventfd to receive kick for virtqueue */
+#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
+
+/* Inject an interrupt for specific virtqueue */
+#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)
+
+#endif /* _UAPI_VDUSE_H_ */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 08/11] vduse: Add config interrupt support
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (6 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-02-23 11:50 ` [RFC v4 09/11] Documentation: Add documentation for VDUSE Xie Yongji
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This patch introduces a new ioctl VDUSE_INJECT_CONFIG_IRQ
to support injecting config interrupt.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 24 +++++++++++++++++++++++-
 include/uapi/linux/vduse.h         |  3 +++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 393bf99c48be..8042d3fa57f1 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -66,6 +66,8 @@ struct vduse_dev {
 	struct list_head send_list;
 	struct list_head recv_list;
 	struct list_head list;
+	struct vdpa_callback config_cb;
+	spinlock_t irq_lock;
 	bool connected;
 	int minor;
 	u16 vq_size_max;
@@ -483,6 +485,11 @@ static void vduse_dev_reset(struct vduse_dev *dev)
 	vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
 	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
 
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = NULL;
+	dev->config_cb.private = NULL;
+	spin_unlock(&dev->irq_lock);
+
 	for (i = 0; i < dev->vq_num; i++) {
 		struct vduse_virtqueue *vq = &dev->vqs[i];
 
@@ -599,7 +606,12 @@ static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
 static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
 				  struct vdpa_callback *cb)
 {
-	/* We don't support config interrupt */
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = cb->callback;
+	dev->config_cb.private = cb->private;
+	spin_unlock(&dev->irq_lock);
 }
 
 static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
@@ -913,6 +925,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 		spin_unlock_irq(&vq->irq_lock);
 		break;
 	}
+	case VDUSE_INJECT_CONFIG_IRQ:
+		ret = -EINVAL;
+		spin_lock_irq(&dev->irq_lock);
+		if (dev->config_cb.callback) {
+			dev->config_cb.callback(dev->config_cb.private);
+			ret = 0;
+		}
+		spin_unlock_irq(&dev->irq_lock);
+		break;
 	default:
 		ret = -ENOIOCTLCMD;
 		break;
@@ -996,6 +1017,7 @@ static struct vduse_dev *vduse_dev_create(void)
 	INIT_LIST_HEAD(&dev->recv_list);
 	atomic64_set(&dev->msg_unique, 0);
 	spin_lock_init(&dev->iommu_lock);
+	spin_lock_init(&dev->irq_lock);
 	atomic_set(&dev->bounce_map, 0);
 
 	init_waitqueue_head(&dev->waitq);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 9391c4acfa53..9070cd512cb4 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -133,4 +133,7 @@ struct vduse_vq_eventfd {
 /* Inject an interrupt for specific virtqueue */
 #define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)
 
+/* Inject a config interrupt */
+#define VDUSE_INJECT_CONFIG_IRQ	_IO(VDUSE_BASE, 0x06)
+
 #endif /* _UAPI_VDUSE_H_ */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 09/11] Documentation: Add documentation for VDUSE
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (7 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 08/11] vduse: Add config interrupt support Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-04  6:39     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 10/11] vduse: Introduce a workqueue for irq injection Xie Yongji
  2021-02-23 11:50 ` [RFC v4 11/11] vduse: Support binding irq to the specified cpu Xie Yongji
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/vduse.rst | 112 ++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)
 create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index acd2cc2a538d..f63119130898 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -24,6 +24,7 @@ place where this information is gathered.
    ioctl/index
    iommu
    media/index
+   vduse
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
new file mode 100644
index 000000000000..2a20e686bb59
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,112 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the character device (/dev/vduse/control). Then a device file with the
+specified name (/dev/vduse/$NAME) will appear, which can be used to
+implement the userspace vDPA device's control path and data path.
+
+To implement control path, a message-based communication protocol and some
+types of control messages are introduced in the VDUSE framework:
+
+- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_VQ_STATE: Set the state for virtqueue
+
+- VDUSE_GET_VQ_STATE: Get the state for virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
+
+Those control messages are mostly based on the vdpa_config_ops in
+include/linux/vdpa.h which defines a unified interface to control
+different types of vdpa device. Userspace needs to read()/write()
+on the VDUSE device file to receive/reply those control messages
+from/to VDUSE kernel module as follows:
+
+.. code-block:: c
+
+	static int vduse_message_handler(int dev_fd)
+	{
+		int len;
+		struct vduse_dev_request req;
+		struct vduse_dev_response resp;
+
+		len = read(dev_fd, &req, sizeof(req));
+		if (len != sizeof(req))
+			return -1;
+
+		resp.request_id = req.unique;
+
+		switch (req.type) {
+
+		/* handle different types of message */
+
+		}
+
+		len = write(dev_fd, &resp, sizeof(resp));
+		if (len != sizeof(resp))
+			return -1;
+
+		return 0;
+	}
+
+In the deta path, vDPA device's iova regions will be mapped into userspace
+with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
+
+- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
+  access this iova region by passing the fd to mmap().
+
+Besides, the following ioctls on the VDUSE device file are provided to support
+interrupt injection and setting up eventfd for virtqueue kicks:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE kernel module to notify userspace to consume the vring.
+
+- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
+
+- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
+
+MMU-based IOMMU Driver
+----------------------
+In virtio-vdpa case, VDUSE framework implements an MMU-based on-chip IOMMU
+driver to support mapping the kernel DMA buffer into the userspace iova
+region dynamically.
+
+The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
+The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
+so that the userspace process is able to use its virtual address to access
+the DMA buffer in kernel.
+
+And to avoid security issue, a bounce-buffering mechanism is introduced to
+prevent userspace accessing the original buffer directly which may contain other
+kernel data. During the mapping, unmapping, the driver will copy the data from
+the original buffer to the bounce buffer and back, depending on the direction of
+the transfer. And the bounce-buffer addresses will be mapped into the user address
+space instead of the original one.
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (8 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 09/11] Documentation: Add documentation for VDUSE Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-04  6:59     ` Jason Wang
  2021-02-23 11:50 ` [RFC v4 11/11] vduse: Support binding irq to the specified cpu Xie Yongji
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

This patch introduces a workqueue to support injecting
virtqueue's interrupt asynchronously. This is mainly
for performance considerations which makes sure the push()
and pop() for used vring can be asynchronous.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 8042d3fa57f1..f5adeb9ee027 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -42,6 +42,7 @@ struct vduse_virtqueue {
 	spinlock_t irq_lock;
 	struct eventfd_ctx *kickfd;
 	struct vdpa_callback cb;
+	struct work_struct inject;
 };
 
 struct vduse_dev;
@@ -99,6 +100,7 @@ static DEFINE_IDA(vduse_ida);
 
 static dev_t vduse_major;
 static struct class *vduse_class;
+static struct workqueue_struct *vduse_irq_wq;
 
 static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
 {
@@ -852,6 +854,17 @@ static int vduse_kickfd_setup(struct vduse_dev *dev,
 	return 0;
 }
 
+static void vduse_vq_irq_inject(struct work_struct *work)
+{
+	struct vduse_virtqueue *vq = container_of(work,
+					struct vduse_virtqueue, inject);
+
+	spin_lock_irq(&vq->irq_lock);
+	if (vq->ready && vq->cb.callback)
+		vq->cb.callback(vq->cb.private);
+	spin_unlock_irq(&vq->irq_lock);
+}
+
 static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 			unsigned long arg)
 {
@@ -917,12 +930,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 			break;
 
 		vq = &dev->vqs[arg];
-		spin_lock_irq(&vq->irq_lock);
-		if (vq->ready && vq->cb.callback) {
-			vq->cb.callback(vq->cb.private);
-			ret = 0;
-		}
-		spin_unlock_irq(&vq->irq_lock);
+		queue_work(vduse_irq_wq, &vq->inject);
 		break;
 	}
 	case VDUSE_INJECT_CONFIG_IRQ:
@@ -1109,6 +1117,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
 
 	for (i = 0; i < dev->vq_num; i++) {
 		dev->vqs[i].index = i;
+		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
 		spin_lock_init(&dev->vqs[i].kick_lock);
 		spin_lock_init(&dev->vqs[i].irq_lock);
 	}
@@ -1333,6 +1342,11 @@ static int vduse_init(void)
 	if (ret)
 		goto err_chardev;
 
+	vduse_irq_wq = alloc_workqueue("vduse-irq",
+				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
+	if (!vduse_irq_wq)
+		goto err_wq;
+
 	ret = vduse_domain_init();
 	if (ret)
 		goto err_domain;
@@ -1344,6 +1358,8 @@ static int vduse_init(void)
 	return 0;
 err_mgmtdev:
 	vduse_domain_exit();
+err_wq:
+	destroy_workqueue(vduse_irq_wq);
 err_domain:
 	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
 err_chardev:
@@ -1359,6 +1375,7 @@ static void vduse_exit(void)
 	misc_deregister(&vduse_misc);
 	class_destroy(vduse_class);
 	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+	destroy_workqueue(vduse_irq_wq);
 	vduse_domain_exit();
 	vduse_mgmtdev_exit();
 }
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC v4 11/11] vduse: Support binding irq to the specified cpu
  2021-02-23 11:50 [RFC v4 00/11] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (9 preceding siblings ...)
  2021-02-23 11:50 ` [RFC v4 10/11] vduse: Introduce a workqueue for irq injection Xie Yongji
@ 2021-02-23 11:50 ` Xie Yongji
  2021-03-04  7:30     ` Jason Wang
  10 siblings, 1 reply; 92+ messages in thread
From: Xie Yongji @ 2021-02-23 11:50 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, bob.liu, hch, rdunlap,
	willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel

Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
injecting virtqueue's interrupt to the specified cpu.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 22 +++++++++++++++++-----
 include/uapi/linux/vduse.h         |  7 ++++++-
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index f5adeb9ee027..df3d467fff40 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -923,14 +923,27 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
 		break;
 	}
 	case VDUSE_INJECT_VQ_IRQ: {
+		struct vduse_vq_irq irq;
 		struct vduse_virtqueue *vq;
 
+		ret = -EFAULT;
+		if (copy_from_user(&irq, argp, sizeof(irq)))
+			break;
+
 		ret = -EINVAL;
-		if (arg >= dev->vq_num)
+		if (irq.index >= dev->vq_num)
+			break;
+
+		if (irq.cpu != -1 && (irq.cpu >= nr_cpu_ids ||
+		    !cpu_online(irq.cpu)))
 			break;
 
-		vq = &dev->vqs[arg];
-		queue_work(vduse_irq_wq, &vq->inject);
+		ret = 0;
+		vq = &dev->vqs[irq.index];
+		if (irq.cpu == -1)
+			queue_work(vduse_irq_wq, &vq->inject);
+		else
+			queue_work_on(irq.cpu, vduse_irq_wq, &vq->inject);
 		break;
 	}
 	case VDUSE_INJECT_CONFIG_IRQ:
@@ -1342,8 +1355,7 @@ static int vduse_init(void)
 	if (ret)
 		goto err_chardev;
 
-	vduse_irq_wq = alloc_workqueue("vduse-irq",
-				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
+	vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_HIGHPRI, 0);
 	if (!vduse_irq_wq)
 		goto err_wq;
 
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 9070cd512cb4..9c70fd842ce5 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -116,6 +116,11 @@ struct vduse_vq_eventfd {
 	int fd; /* eventfd, -1 means de-assigning the eventfd */
 };
 
+struct vduse_vq_irq {
+	__u32 index; /* virtqueue index */
+	int cpu; /* bind irq to the specified cpu, -1 means running on the current cpu */
+};
+
 #define VDUSE_BASE	0x81
 
 /* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
@@ -131,7 +136,7 @@ struct vduse_vq_eventfd {
 #define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
 
 /* Inject an interrupt for specific virtqueue */
-#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)
+#define VDUSE_INJECT_VQ_IRQ	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_irq)
 
 /* Inject a config interrupt */
 #define VDUSE_INJECT_CONFIG_IRQ	_IO(VDUSE_BASE, 0x06)
-- 
2.11.0


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal()
  2021-02-23 11:50 ` [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
@ 2021-03-02  6:44     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:44 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Increase the recursion depth of eventfd_signal() to 1. This
> is the maximum recursion depth we have found so far.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>

It might be useful to explain how/when we can reach for this condition.

Thanks


> ---
>   fs/eventfd.c            | 2 +-
>   include/linux/eventfd.h | 5 ++++-
>   2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..cc7cd1dbedd3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>   	 * it returns true, the eventfd_signal() call should be deferred to a
>   	 * safe context.
>   	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>   		return 0;
>   
>   	spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..886d99cd38ef 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>   #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>   #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>   
> +/* Maximum recursion depth */
> +#define EFD_WAKE_DEPTH 1
> +
>   struct eventfd_ctx;
>   struct file;
>   
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>   
>   static inline bool eventfd_signal_count(void)
>   {
> -	return this_cpu_read(eventfd_wake_count);
> +	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>   }
>   
>   #else /* CONFIG_EVENTFD */


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal()
@ 2021-03-02  6:44     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:44 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Increase the recursion depth of eventfd_signal() to 1. This
> is the maximum recursion depth we have found so far.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>

It might be useful to explain how/when we can reach for this condition.

Thanks


> ---
>   fs/eventfd.c            | 2 +-
>   include/linux/eventfd.h | 5 ++++-
>   2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..cc7cd1dbedd3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>   	 * it returns true, the eventfd_signal() call should be deferred to a
>   	 * safe context.
>   	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>   		return 0;
>   
>   	spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..886d99cd38ef 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>   #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>   #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>   
> +/* Maximum recursion depth */
> +#define EFD_WAKE_DEPTH 1
> +
>   struct eventfd_ctx;
>   struct file;
>   
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>   
>   static inline bool eventfd_signal_count(void)
>   {
> -	return this_cpu_read(eventfd_wake_count);
> +	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>   }
>   
>   #else /* CONFIG_EVENTFD */

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-02-23 11:50 ` [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
@ 2021-03-02  6:47     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:47 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Use vhost_dev->mutex to protect vhost device iotlb from
> concurrent access.
>
> Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vhost/vdpa.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index c50079dfb281..5500e3bf05c1 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -723,6 +723,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   	if (r)
>   		return r;
>   
> +	mutex_lock(&dev->mutex);


I think this should be done before the vhost_dev_check_owner() above.

Thanks


>   	switch (msg->type) {
>   	case VHOST_IOTLB_UPDATE:
>   		r = vhost_vdpa_process_iotlb_update(v, msg);
> @@ -742,6 +743,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   		r = -EINVAL;
>   		break;
>   	}
> +	mutex_unlock(&dev->mutex);
>   
>   	return r;
>   }


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb
@ 2021-03-02  6:47     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:47 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Use vhost_dev->mutex to protect vhost device iotlb from
> concurrent access.
>
> Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vhost/vdpa.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index c50079dfb281..5500e3bf05c1 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -723,6 +723,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   	if (r)
>   		return r;
>   
> +	mutex_lock(&dev->mutex);


I think this should be done before the vhost_dev_check_owner() above.

Thanks


>   	switch (msg->type) {
>   	case VHOST_IOTLB_UPDATE:
>   		r = vhost_vdpa_process_iotlb_update(v, msg);
> @@ -742,6 +743,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   		r = -EINVAL;
>   		break;
>   	}
> +	mutex_unlock(&dev->mutex);
>   
>   	return r;
>   }

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 03/11] vhost-iotlb: Add an opaque pointer for vhost IOTLB
  2021-02-23 11:50 ` [RFC v4 03/11] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
@ 2021-03-02  6:49     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:49 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Add an opaque pointer for vhost IOTLB. And introduce
> vhost_iotlb_add_range_ctx() to accept it.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/vhost/iotlb.c       | 20 ++++++++++++++++----
>   include/linux/vhost_iotlb.h |  3 +++
>   2 files changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
> index 0fd3f87e913c..5c99e1112cbb 100644
> --- a/drivers/vhost/iotlb.c
> +++ b/drivers/vhost/iotlb.c
> @@ -36,19 +36,21 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
>   EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
>   
>   /**
> - * vhost_iotlb_add_range - add a new range to vhost IOTLB
> + * vhost_iotlb_add_range_ctx - add a new range to vhost IOTLB
>    * @iotlb: the IOTLB
>    * @start: start of the IOVA range
>    * @last: last of IOVA range
>    * @addr: the address that is mapped to @start
>    * @perm: access permission of this range
> + * @opaque: the opaque pointer for the new mapping
>    *
>    * Returns an error last is smaller than start or memory allocation
>    * fails
>    */
> -int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> -			  u64 start, u64 last,
> -			  u64 addr, unsigned int perm)
> +int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
> +			      u64 start, u64 last,
> +			      u64 addr, unsigned int perm,
> +			      void *opaque)
>   {
>   	struct vhost_iotlb_map *map;
>   
> @@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>   	map->last = last;
>   	map->addr = addr;
>   	map->perm = perm;
> +	map->opaque = opaque;
>   
>   	iotlb->nmaps++;
>   	vhost_iotlb_itree_insert(map, &iotlb->root);
> @@ -80,6 +83,15 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>   
>   	return 0;
>   }
> +EXPORT_SYMBOL_GPL(vhost_iotlb_add_range_ctx);
> +
> +int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> +			  u64 start, u64 last,
> +			  u64 addr, unsigned int perm)
> +{
> +	return vhost_iotlb_add_range_ctx(iotlb, start, last,
> +					 addr, perm, NULL);
> +}
>   EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
>   
>   /**
> diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
> index 6b09b786a762..2d0e2f52f938 100644
> --- a/include/linux/vhost_iotlb.h
> +++ b/include/linux/vhost_iotlb.h
> @@ -17,6 +17,7 @@ struct vhost_iotlb_map {
>   	u32 perm;
>   	u32 flags_padding;
>   	u64 __subtree_last;
> +	void *opaque;
>   };
>   
>   #define VHOST_IOTLB_FLAG_RETIRE 0x1
> @@ -29,6 +30,8 @@ struct vhost_iotlb {
>   	unsigned int flags;
>   };
>   
> +int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb, u64 start, u64 last,
> +			      u64 addr, unsigned int perm, void *opaque);
>   int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
>   			  u64 addr, unsigned int perm);
>   void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 03/11] vhost-iotlb: Add an opaque pointer for vhost IOTLB
@ 2021-03-02  6:49     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:49 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Add an opaque pointer for vhost IOTLB. And introduce
> vhost_iotlb_add_range_ctx() to accept it.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/vhost/iotlb.c       | 20 ++++++++++++++++----
>   include/linux/vhost_iotlb.h |  3 +++
>   2 files changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
> index 0fd3f87e913c..5c99e1112cbb 100644
> --- a/drivers/vhost/iotlb.c
> +++ b/drivers/vhost/iotlb.c
> @@ -36,19 +36,21 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
>   EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
>   
>   /**
> - * vhost_iotlb_add_range - add a new range to vhost IOTLB
> + * vhost_iotlb_add_range_ctx - add a new range to vhost IOTLB
>    * @iotlb: the IOTLB
>    * @start: start of the IOVA range
>    * @last: last of IOVA range
>    * @addr: the address that is mapped to @start
>    * @perm: access permission of this range
> + * @opaque: the opaque pointer for the new mapping
>    *
>    * Returns an error last is smaller than start or memory allocation
>    * fails
>    */
> -int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> -			  u64 start, u64 last,
> -			  u64 addr, unsigned int perm)
> +int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
> +			      u64 start, u64 last,
> +			      u64 addr, unsigned int perm,
> +			      void *opaque)
>   {
>   	struct vhost_iotlb_map *map;
>   
> @@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>   	map->last = last;
>   	map->addr = addr;
>   	map->perm = perm;
> +	map->opaque = opaque;
>   
>   	iotlb->nmaps++;
>   	vhost_iotlb_itree_insert(map, &iotlb->root);
> @@ -80,6 +83,15 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
>   
>   	return 0;
>   }
> +EXPORT_SYMBOL_GPL(vhost_iotlb_add_range_ctx);
> +
> +int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
> +			  u64 start, u64 last,
> +			  u64 addr, unsigned int perm)
> +{
> +	return vhost_iotlb_add_range_ctx(iotlb, start, last,
> +					 addr, perm, NULL);
> +}
>   EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
>   
>   /**
> diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
> index 6b09b786a762..2d0e2f52f938 100644
> --- a/include/linux/vhost_iotlb.h
> +++ b/include/linux/vhost_iotlb.h
> @@ -17,6 +17,7 @@ struct vhost_iotlb_map {
>   	u32 perm;
>   	u32 flags_padding;
>   	u64 __subtree_last;
> +	void *opaque;
>   };
>   
>   #define VHOST_IOTLB_FLAG_RETIRE 0x1
> @@ -29,6 +30,8 @@ struct vhost_iotlb {
>   	unsigned int flags;
>   };
>   
> +int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb, u64 start, u64 last,
> +			      u64 addr, unsigned int perm, void *opaque);
>   int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
>   			  u64 addr, unsigned int perm);
>   void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 04/11] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  2021-02-23 11:50 ` [RFC v4 04/11] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
@ 2021-03-02  6:50     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:50 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Add an opaque pointer for DMA mapping.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/vdpa/vdpa_sim/vdpa_sim.c | 6 +++---
>   drivers/vhost/vdpa.c             | 2 +-
>   include/linux/vdpa.h             | 2 +-
>   3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index d5942842432d..5cfc262ce055 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -512,14 +512,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
>   }
>   
>   static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
> -			   u64 pa, u32 perm)
> +			   u64 pa, u32 perm, void *opaque)
>   {
>   	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
>   	int ret;
>   
>   	spin_lock(&vdpasim->iommu_lock);
> -	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
> -				    perm);
> +	ret = vhost_iotlb_add_range_ctx(vdpasim->iommu, iova, iova + size - 1,
> +					pa, perm, opaque);
>   	spin_unlock(&vdpasim->iommu_lock);
>   
>   	return ret;
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 5500e3bf05c1..70857fe3263c 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -544,7 +544,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index 4ab5494503a8..93dca2c328ae 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -241,7 +241,7 @@ struct vdpa_config_ops {
>   	/* DMA ops */
>   	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
>   	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
> -		       u64 pa, u32 perm);
> +		       u64 pa, u32 perm, void *opaque);
>   	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
>   
>   	/* Free device resources */


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 04/11] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
@ 2021-03-02  6:50     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-02  6:50 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Add an opaque pointer for DMA mapping.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/vdpa/vdpa_sim/vdpa_sim.c | 6 +++---
>   drivers/vhost/vdpa.c             | 2 +-
>   include/linux/vdpa.h             | 2 +-
>   3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index d5942842432d..5cfc262ce055 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -512,14 +512,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
>   }
>   
>   static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
> -			   u64 pa, u32 perm)
> +			   u64 pa, u32 perm, void *opaque)
>   {
>   	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
>   	int ret;
>   
>   	spin_lock(&vdpasim->iommu_lock);
> -	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
> -				    perm);
> +	ret = vhost_iotlb_add_range_ctx(vdpasim->iommu, iova, iova + size - 1,
> +					pa, perm, opaque);
>   	spin_unlock(&vdpasim->iommu_lock);
>   
>   	return ret;
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 5500e3bf05c1..70857fe3263c 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -544,7 +544,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index 4ab5494503a8..93dca2c328ae 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -241,7 +241,7 @@ struct vdpa_config_ops {
>   	/* DMA ops */
>   	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
>   	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
> -		       u64 pa, u32 perm);
> +		       u64 pa, u32 perm, void *opaque);
>   	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
>   
>   	/* Free device resources */

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-03-02  6:47     ` Jason Wang
  (?)
@ 2021-03-02 10:20     ` Yongji Xie
  -1 siblings, 0 replies; 92+ messages in thread
From: Yongji Xie @ 2021-03-02 10:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Tue, Mar 2, 2021 at 2:47 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > Use vhost_dev->mutex to protect vhost device iotlb from
> > concurrent access.
> >
> > Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vhost/vdpa.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index c50079dfb281..5500e3bf05c1 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -723,6 +723,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
> >       if (r)
> >               return r;
> >
> > +     mutex_lock(&dev->mutex);
>
>
> I think this should be done before the vhost_dev_check_owner() above.
>

Agree. Will do it in v5.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal()
  2021-03-02  6:44     ` Jason Wang
  (?)
@ 2021-03-02 10:32     ` Yongji Xie
  -1 siblings, 0 replies; 92+ messages in thread
From: Yongji Xie @ 2021-03-02 10:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Tue, Mar 2, 2021 at 2:44 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > Increase the recursion depth of eventfd_signal() to 1. This
> > is the maximum recursion depth we have found so far.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>
>
> Acked-by: Jason Wang <jasowang@redhat.com>
>
> It might be useful to explain how/when we can reach for this condition.
>

Fine.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping
  2021-02-23 11:50 ` [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
@ 2021-03-03 10:52   ` Mika Penttilä
  2021-03-03 12:45     ` Yongji Xie
  2021-03-04  3:07     ` Jason Wang
  1 sibling, 1 reply; 92+ messages in thread
From: Mika Penttilä @ 2021-03-03 10:52 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, bob.liu,
	hch, rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel



On 23.2.2021 13.50, Xie Yongji wrote:
> This patch introduces an attribute for vDPA device to indicate
> whether virtual address can be used. If vDPA device driver set
> it, vhost-vdpa bus driver will not pin user page and transfer
> userspace virtual address instead of physical address during
> DMA mapping. And corresponding vma->vm_file and offset will be
> also passed as an opaque pointer.

In the virtual addressing case, who is then responsible for the pinning 
or even mapping physical pages to the vaddr?


> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/ifcvf/ifcvf_main.c   |   2 +-
>   drivers/vdpa/mlx5/net/mlx5_vnet.c |   2 +-
>   drivers/vdpa/vdpa.c               |   9 +++-
>   drivers/vdpa/vdpa_sim/vdpa_sim.c  |   2 +-
>   drivers/vhost/vdpa.c              | 104 +++++++++++++++++++++++++++++++-------
>   include/linux/vdpa.h              |  20 ++++++--
>   6 files changed, 113 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> index 7c8bbfcf6c3e..228b9f920fea 100644
> --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> @@ -432,7 +432,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   
>   	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
>   				    dev, &ifc_vdpa_ops,
> -				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
> +				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
>   	if (adapter == NULL) {
>   		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
>   		return -ENOMEM;
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 029822060017..54290438da28 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1964,7 +1964,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
>   	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
>   
>   	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> -				 2 * mlx5_vdpa_max_qps(max_vqs), NULL);
> +				 2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
>   	if (IS_ERR(ndev))
>   		return PTR_ERR(ndev);
>   
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index 9700a0adcca0..fafc0ee5eb05 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
>    * @nvqs: number of virtqueues supported by this device
>    * @size: size of the parent structure that contains private data
>    * @name: name of the vdpa device; optional.
> + * @use_va: indicate whether virtual address can be used by this device
>    *
>    * Driver should use vdpa_alloc_device() wrapper macro instead of
>    * using this directly.
> @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
>    */
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name)
> +					int nvqs, size_t size, const char *name,
> +					bool use_va)
>   {
>   	struct vdpa_device *vdev;
>   	int err = -EINVAL;
> @@ -92,6 +94,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	if (!!config->dma_map != !!config->dma_unmap)
>   		goto err;
>   
> +	/* It should only work for the device that use on-chip IOMMU */
> +	if (use_va && !(config->dma_map || config->set_map))
> +		goto err;
> +
>   	err = -ENOMEM;
>   	vdev = kzalloc(size, GFP_KERNEL);
>   	if (!vdev)
> @@ -108,6 +114,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	vdev->config = config;
>   	vdev->features_valid = false;
>   	vdev->nvqs = nvqs;
> +	vdev->use_va = use_va;
>   
>   	if (name)
>   		err = dev_set_name(&vdev->dev, "%s", name);
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 5cfc262ce055..3a9a2dd4e987 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
>   		ops = &vdpasim_config_ops;
>   
>   	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> -				    dev_attr->nvqs, dev_attr->name);
> +				    dev_attr->nvqs, dev_attr->name, false);
>   	if (!vdpasim)
>   		goto err_alloc;
>   
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 70857fe3263c..93769ace34df 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -480,21 +480,31 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
>   	struct page *page;
>   	unsigned long pfn, pinned;
>   
>   	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> -		pinned = map->size >> PAGE_SHIFT;
> -		for (pfn = map->addr >> PAGE_SHIFT;
> -		     pinned > 0; pfn++, pinned--) {
> -			page = pfn_to_page(pfn);
> -			if (map->perm & VHOST_ACCESS_WO)
> -				set_page_dirty_lock(page);
> -			unpin_user_page(page);
> +		if (!vdpa->use_va) {
> +			pinned = map->size >> PAGE_SHIFT;
> +			for (pfn = map->addr >> PAGE_SHIFT;
> +			     pinned > 0; pfn++, pinned--) {
> +				page = pfn_to_page(pfn);
> +				if (map->perm & VHOST_ACCESS_WO)
> +					set_page_dirty_lock(page);
> +				unpin_user_page(page);
> +			}
> +			atomic64_sub(map->size >> PAGE_SHIFT,
> +					&dev->mm->pinned_vm);
> +		} else {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			if (map_file->file)
> +				fput(map_file->file);
> +			kfree(map_file);
>   		}
> -		atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   		vhost_iotlb_map_free(iotlb, map);
>   	}
>   }
> @@ -530,21 +540,21 @@ static int perm_to_iommu_flags(u32 perm)
>   	return flags | IOMMU_CACHE;
>   }
>   
> -static int vhost_vdpa_map(struct vhost_vdpa *v,
> -			  u64 iova, u64 size, u64 pa, u32 perm)
> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> +			  u64 size, u64 pa, u32 perm, void *opaque)
>   {
>   	struct vhost_dev *dev = &v->vdev;
>   	struct vdpa_device *vdpa = v->vdpa;
>   	const struct vdpa_config_ops *ops = vdpa->config;
>   	int r = 0;
>   
> -	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> -				  pa, perm);
> +	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
> +				      pa, perm, opaque);
>   	if (r)
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> @@ -552,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		r = iommu_map(v->domain, iova, pa, size,
>   			      perm_to_iommu_flags(perm));
>   	}
> -
> -	if (r)
> +	if (r) {
>   		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> -	else
> +		return r;
> +	}
> +
> +	if (!vdpa->use_va)
>   		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   
> -	return r;
> +	return 0;
>   }
>   
>   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> @@ -579,10 +591,60 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>   	}
>   }
>   
> +static int vhost_vdpa_va_map(struct vhost_vdpa *v,
> +			     u64 iova, u64 size, u64 uaddr, u32 perm)
> +{
> +	struct vhost_dev *dev = &v->vdev;
> +	u64 offset, map_size, map_iova = iova;
> +	struct vdpa_map_file *map_file;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	mmap_read_lock(dev->mm);
> +
> +	while (size) {
> +		vma = find_vma(dev->mm, uaddr);
> +		if (!vma) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +		map_size = min(size, vma->vm_end - uaddr);
> +		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> +		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
> +		if (!map_file) {
> +			ret = -ENOMEM;
> +			goto err;
> +		}
> +		if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
> +			!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> +			map_file->file = get_file(vma->vm_file);
> +			map_file->offset = offset;
> +		}
> +		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> +				     perm, map_file);
> +		if (ret) {
> +			if (map_file->file)
> +				fput(map_file->file);
> +			kfree(map_file);
> +			goto err;
> +		}
> +		size -= map_size;
> +		uaddr += map_size;
> +		map_iova += map_size;
> +	}
> +	mmap_read_unlock(dev->mm);
> +
> +	return 0;
> +err:
> +	vhost_vdpa_unmap(v, iova, map_iova - iova);
> +	return ret;
> +}
> +
>   static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   					   struct vhost_iotlb_msg *msg)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct page **page_list;
>   	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
> @@ -601,6 +663,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				    msg->iova + msg->size - 1))
>   		return -EEXIST;
>   
> +	if (vdpa->use_va)
> +		return vhost_vdpa_va_map(v, msg->iova, msg->size,
> +					 msg->uaddr, msg->perm);
> +
>   	/* Limit the use of memory for bookkeeping */
>   	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>   	if (!page_list)
> @@ -654,7 +720,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>   				ret = vhost_vdpa_map(v, iova, csize,
>   						     map_pfn << PAGE_SHIFT,
> -						     msg->perm);
> +						     msg->perm, NULL);
>   				if (ret) {
>   					/*
>   					 * Unpin the pages that are left unmapped
> @@ -683,7 +749,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   
>   	/* Pin the rest chunk */
>   	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> -			     map_pfn << PAGE_SHIFT, msg->perm);
> +			     map_pfn << PAGE_SHIFT, msg->perm, NULL);
>   out:
>   	if (ret) {
>   		if (nchunks) {
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index 93dca2c328ae..bfae6d780c38 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
>    * @config: the configuration ops for this device.
>    * @index: device index
>    * @features_valid: were features initialized? for legacy guests
> + * @use_va: indicate whether virtual address can be used by this device
>    * @nvqs: maximum number of supported virtqueues
>    * @mdev: management device pointer; caller must setup when registering device as part
>    *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
> @@ -54,6 +55,7 @@ struct vdpa_device {
>   	const struct vdpa_config_ops *config;
>   	unsigned int index;
>   	bool features_valid;
> +	bool use_va;
>   	int nvqs;
>   	struct vdpa_mgmt_dev *mdev;
>   };
> @@ -69,6 +71,16 @@ struct vdpa_iova_range {
>   };
>   
>   /**
> + * Corresponding file area for device memory mapping
> + * @file: vma->vm_file for the mapping
> + * @offset: mapping offset in the vm_file
> + */
> +struct vdpa_map_file {
> +	struct file *file;
> +	u64 offset;
> +};
> +
> +/**
>    * vDPA_config_ops - operations for configuring a vDPA device.
>    * Note: vDPA device drivers are required to implement all of the
>    * operations unless it is mentioned to be optional in the following
> @@ -250,14 +262,16 @@ struct vdpa_config_ops {
>   
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name);
> +					int nvqs, size_t size,
> +					const char *name, bool use_va);
>   
> -#define vdpa_alloc_device(dev_struct, member, parent, config, nvqs, name)   \
> +#define vdpa_alloc_device(dev_struct, member, parent, config, \
> +			  nvqs, name, use_va) \
>   			  container_of(__vdpa_alloc_device( \
>   				       parent, config, nvqs, \
>   				       sizeof(dev_struct) + \
>   				       BUILD_BUG_ON_ZERO(offsetof( \
> -				       dev_struct, member)), name), \
> +				       dev_struct, member)), name, use_va), \
>   				       dev_struct, member)
>   
>   int vdpa_register_device(struct vdpa_device *vdev);


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping
  2021-03-03 10:52   ` Mika Penttilä
@ 2021-03-03 12:45     ` Yongji Xie
  2021-03-03 13:38       ` Mika Penttilä
  0 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-03 12:45 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Stefano Garzarella, Parav Pandit, Bob Liu, Christoph Hellwig,
	Randy Dunlap, Matthew Wilcox, viro, Jens Axboe, bcrl,
	Jonathan Corbet, virtualization, netdev, kvm, linux-aio,
	linux-fsdevel

On Wed, Mar 3, 2021 at 6:52 PM Mika Penttilä <mika.penttila@nextfour.com> wrote:
>
>
>
> On 23.2.2021 13.50, Xie Yongji wrote:
> > This patch introduces an attribute for vDPA device to indicate
> > whether virtual address can be used. If vDPA device driver set
> > it, vhost-vdpa bus driver will not pin user page and transfer
> > userspace virtual address instead of physical address during
> > DMA mapping. And corresponding vma->vm_file and offset will be
> > also passed as an opaque pointer.
>
> In the virtual addressing case, who is then responsible for the pinning
> or even mapping physical pages to the vaddr?
>

Actually there's no need to pin the physical pages any more in this
case. The vDPA device should be able to access the user space memory
with virtual address if this attribute is set to true. For example, in
VDUSE case, we will make use of the vma->vm_file to support sharing
the memory between VM and VDUSE daemon.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping
  2021-03-03 12:45     ` Yongji Xie
@ 2021-03-03 13:38       ` Mika Penttilä
  0 siblings, 0 replies; 92+ messages in thread
From: Mika Penttilä @ 2021-03-03 13:38 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Stefano Garzarella, Parav Pandit, Bob Liu, Christoph Hellwig,
	Randy Dunlap, Matthew Wilcox, viro, Jens Axboe, bcrl,
	Jonathan Corbet, virtualization, netdev, kvm, linux-aio,
	linux-fsdevel



On 3.3.2021 14.45, Yongji Xie wrote:
> On Wed, Mar 3, 2021 at 6:52 PM Mika Penttilä <mika.penttila@nextfour.com> wrote:
>>
>>
>> On 23.2.2021 13.50, Xie Yongji wrote:
>>> This patch introduces an attribute for vDPA device to indicate
>>> whether virtual address can be used. If vDPA device driver set
>>> it, vhost-vdpa bus driver will not pin user page and transfer
>>> userspace virtual address instead of physical address during
>>> DMA mapping. And corresponding vma->vm_file and offset will be
>>> also passed as an opaque pointer.
>> In the virtual addressing case, who is then responsible for the pinning
>> or even mapping physical pages to the vaddr?
>>
> Actually there's no need to pin the physical pages any more in this
> case. The vDPA device should be able to access the user space memory
> with virtual address if this attribute is set to true. For example, in
> VDUSE case, we will make use of the vma->vm_file to support sharing
> the memory between VM and VDUSE daemon.
>
> Thanks,
> Yongji
Ah ok you have a shared file pointer + offset.

thanks,
Mika


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping
  2021-02-23 11:50 ` [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
@ 2021-03-04  3:07     ` Jason Wang
  2021-03-04  3:07     ` Jason Wang
  1 sibling, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  3:07 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This patch introduces an attribute for vDPA device to indicate
> whether virtual address can be used. If vDPA device driver set
> it, vhost-vdpa bus driver will not pin user page and transfer
> userspace virtual address instead of physical address during
> DMA mapping. And corresponding vma->vm_file and offset will be
> also passed as an opaque pointer.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/ifcvf/ifcvf_main.c   |   2 +-
>   drivers/vdpa/mlx5/net/mlx5_vnet.c |   2 +-
>   drivers/vdpa/vdpa.c               |   9 +++-
>   drivers/vdpa/vdpa_sim/vdpa_sim.c  |   2 +-
>   drivers/vhost/vdpa.c              | 104 +++++++++++++++++++++++++++++++-------
>   include/linux/vdpa.h              |  20 ++++++--
>   6 files changed, 113 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> index 7c8bbfcf6c3e..228b9f920fea 100644
> --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> @@ -432,7 +432,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   
>   	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
>   				    dev, &ifc_vdpa_ops,
> -				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
> +				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
>   	if (adapter == NULL) {
>   		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
>   		return -ENOMEM;
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 029822060017..54290438da28 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1964,7 +1964,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
>   	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
>   
>   	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> -				 2 * mlx5_vdpa_max_qps(max_vqs), NULL);
> +				 2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
>   	if (IS_ERR(ndev))
>   		return PTR_ERR(ndev);
>   
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index 9700a0adcca0..fafc0ee5eb05 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
>    * @nvqs: number of virtqueues supported by this device
>    * @size: size of the parent structure that contains private data
>    * @name: name of the vdpa device; optional.
> + * @use_va: indicate whether virtual address can be used by this device


I think "use_va" means va must be used instead of "can be" here.


>    *
>    * Driver should use vdpa_alloc_device() wrapper macro instead of
>    * using this directly.
> @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
>    */
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name)
> +					int nvqs, size_t size, const char *name,
> +					bool use_va)
>   {
>   	struct vdpa_device *vdev;
>   	int err = -EINVAL;
> @@ -92,6 +94,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	if (!!config->dma_map != !!config->dma_unmap)
>   		goto err;
>   
> +	/* It should only work for the device that use on-chip IOMMU */
> +	if (use_va && !(config->dma_map || config->set_map))
> +		goto err;
> +
>   	err = -ENOMEM;
>   	vdev = kzalloc(size, GFP_KERNEL);
>   	if (!vdev)
> @@ -108,6 +114,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	vdev->config = config;
>   	vdev->features_valid = false;
>   	vdev->nvqs = nvqs;
> +	vdev->use_va = use_va;
>   
>   	if (name)
>   		err = dev_set_name(&vdev->dev, "%s", name);
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 5cfc262ce055..3a9a2dd4e987 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
>   		ops = &vdpasim_config_ops;
>   
>   	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> -				    dev_attr->nvqs, dev_attr->name);
> +				    dev_attr->nvqs, dev_attr->name, false);
>   	if (!vdpasim)
>   		goto err_alloc;
>   
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 70857fe3263c..93769ace34df 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -480,21 +480,31 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
>   	struct page *page;
>   	unsigned long pfn, pinned;
>   
>   	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> -		pinned = map->size >> PAGE_SHIFT;
> -		for (pfn = map->addr >> PAGE_SHIFT;
> -		     pinned > 0; pfn++, pinned--) {
> -			page = pfn_to_page(pfn);
> -			if (map->perm & VHOST_ACCESS_WO)
> -				set_page_dirty_lock(page);
> -			unpin_user_page(page);
> +		if (!vdpa->use_va) {
> +			pinned = map->size >> PAGE_SHIFT;
> +			for (pfn = map->addr >> PAGE_SHIFT;
> +			     pinned > 0; pfn++, pinned--) {
> +				page = pfn_to_page(pfn);
> +				if (map->perm & VHOST_ACCESS_WO)
> +					set_page_dirty_lock(page);
> +				unpin_user_page(page);
> +			}
> +			atomic64_sub(map->size >> PAGE_SHIFT,
> +					&dev->mm->pinned_vm);
> +		} else {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			if (map_file->file)
> +				fput(map_file->file);
> +			kfree(map_file);
>   		}
> -		atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   		vhost_iotlb_map_free(iotlb, map);
>   	}
>   }
> @@ -530,21 +540,21 @@ static int perm_to_iommu_flags(u32 perm)
>   	return flags | IOMMU_CACHE;
>   }
>   
> -static int vhost_vdpa_map(struct vhost_vdpa *v,
> -			  u64 iova, u64 size, u64 pa, u32 perm)
> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> +			  u64 size, u64 pa, u32 perm, void *opaque)
>   {
>   	struct vhost_dev *dev = &v->vdev;
>   	struct vdpa_device *vdpa = v->vdpa;
>   	const struct vdpa_config_ops *ops = vdpa->config;
>   	int r = 0;
>   
> -	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> -				  pa, perm);
> +	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
> +				      pa, perm, opaque);
>   	if (r)
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> @@ -552,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		r = iommu_map(v->domain, iova, pa, size,
>   			      perm_to_iommu_flags(perm));
>   	}
> -
> -	if (r)
> +	if (r) {
>   		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> -	else
> +		return r;
> +	}
> +
> +	if (!vdpa->use_va)
>   		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   
> -	return r;
> +	return 0;
>   }
>   
>   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> @@ -579,10 +591,60 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>   	}
>   }
>   
> +static int vhost_vdpa_va_map(struct vhost_vdpa *v,
> +			     u64 iova, u64 size, u64 uaddr, u32 perm)
> +{
> +	struct vhost_dev *dev = &v->vdev;
> +	u64 offset, map_size, map_iova = iova;
> +	struct vdpa_map_file *map_file;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	mmap_read_lock(dev->mm);
> +
> +	while (size) {
> +		vma = find_vma(dev->mm, uaddr);
> +		if (!vma) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +		map_size = min(size, vma->vm_end - uaddr);
> +		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> +		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
> +		if (!map_file) {
> +			ret = -ENOMEM;
> +			goto err;
> +		}
> +		if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
> +			!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> +			map_file->file = get_file(vma->vm_file);
> +			map_file->offset = offset;
> +		}


I think it's better to do the flag check right after find_vma(), this 
can avoid things like kfree etc (e.g the code will still call 
vhost_vdpa_map() even if the flag is not expected now).


> +		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> +				     perm, map_file);
> +		if (ret) {
> +			if (map_file->file)
> +				fput(map_file->file);
> +			kfree(map_file);
> +			goto err;
> +		}
> +		size -= map_size;
> +		uaddr += map_size;
> +		map_iova += map_size;
> +	}
> +	mmap_read_unlock(dev->mm);
> +
> +	return 0;
> +err:
> +	vhost_vdpa_unmap(v, iova, map_iova - iova);
> +	return ret;
> +}
> +
>   static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   					   struct vhost_iotlb_msg *msg)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct page **page_list;
>   	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
> @@ -601,6 +663,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				    msg->iova + msg->size - 1))
>   		return -EEXIST;
>   
> +	if (vdpa->use_va)
> +		return vhost_vdpa_va_map(v, msg->iova, msg->size,
> +					 msg->uaddr, msg->perm);


If possible, I would like to factor out the pa map below into a 
something like vhost_vdpa_pa_map() first with a separated patch. Then 
introduce vhost_vdpa_va_map().

Thanks


> +
>   	/* Limit the use of memory for bookkeeping */
>   	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>   	if (!page_list)
> @@ -654,7 +720,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>   				ret = vhost_vdpa_map(v, iova, csize,
>   						     map_pfn << PAGE_SHIFT,
> -						     msg->perm);
> +						     msg->perm, NULL);
>   				if (ret) {
>   					/*
>   					 * Unpin the pages that are left unmapped
> @@ -683,7 +749,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   
>   	/* Pin the rest chunk */
>   	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> -			     map_pfn << PAGE_SHIFT, msg->perm);
> +			     map_pfn << PAGE_SHIFT, msg->perm, NULL);
>   out:
>   	if (ret) {
>   		if (nchunks) {
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index 93dca2c328ae..bfae6d780c38 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
>    * @config: the configuration ops for this device.
>    * @index: device index
>    * @features_valid: were features initialized? for legacy guests
> + * @use_va: indicate whether virtual address can be used by this device
>    * @nvqs: maximum number of supported virtqueues
>    * @mdev: management device pointer; caller must setup when registering device as part
>    *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
> @@ -54,6 +55,7 @@ struct vdpa_device {
>   	const struct vdpa_config_ops *config;
>   	unsigned int index;
>   	bool features_valid;
> +	bool use_va;
>   	int nvqs;
>   	struct vdpa_mgmt_dev *mdev;
>   };
> @@ -69,6 +71,16 @@ struct vdpa_iova_range {
>   };
>   
>   /**
> + * Corresponding file area for device memory mapping
> + * @file: vma->vm_file for the mapping
> + * @offset: mapping offset in the vm_file
> + */
> +struct vdpa_map_file {
> +	struct file *file;
> +	u64 offset;
> +};
> +
> +/**
>    * vDPA_config_ops - operations for configuring a vDPA device.
>    * Note: vDPA device drivers are required to implement all of the
>    * operations unless it is mentioned to be optional in the following
> @@ -250,14 +262,16 @@ struct vdpa_config_ops {
>   
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name);
> +					int nvqs, size_t size,
> +					const char *name, bool use_va);
>   
> -#define vdpa_alloc_device(dev_struct, member, parent, config, nvqs, name)   \
> +#define vdpa_alloc_device(dev_struct, member, parent, config, \
> +			  nvqs, name, use_va) \
>   			  container_of(__vdpa_alloc_device( \
>   				       parent, config, nvqs, \
>   				       sizeof(dev_struct) + \
>   				       BUILD_BUG_ON_ZERO(offsetof( \
> -				       dev_struct, member)), name), \
> +				       dev_struct, member)), name, use_va), \
>   				       dev_struct, member)
>   
>   int vdpa_register_device(struct vdpa_device *vdev);


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping
@ 2021-03-04  3:07     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  3:07 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This patch introduces an attribute for vDPA device to indicate
> whether virtual address can be used. If vDPA device driver set
> it, vhost-vdpa bus driver will not pin user page and transfer
> userspace virtual address instead of physical address during
> DMA mapping. And corresponding vma->vm_file and offset will be
> also passed as an opaque pointer.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/ifcvf/ifcvf_main.c   |   2 +-
>   drivers/vdpa/mlx5/net/mlx5_vnet.c |   2 +-
>   drivers/vdpa/vdpa.c               |   9 +++-
>   drivers/vdpa/vdpa_sim/vdpa_sim.c  |   2 +-
>   drivers/vhost/vdpa.c              | 104 +++++++++++++++++++++++++++++++-------
>   include/linux/vdpa.h              |  20 ++++++--
>   6 files changed, 113 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> index 7c8bbfcf6c3e..228b9f920fea 100644
> --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> @@ -432,7 +432,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   
>   	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
>   				    dev, &ifc_vdpa_ops,
> -				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
> +				    IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
>   	if (adapter == NULL) {
>   		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
>   		return -ENOMEM;
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 029822060017..54290438da28 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1964,7 +1964,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
>   	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
>   
>   	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> -				 2 * mlx5_vdpa_max_qps(max_vqs), NULL);
> +				 2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
>   	if (IS_ERR(ndev))
>   		return PTR_ERR(ndev);
>   
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index 9700a0adcca0..fafc0ee5eb05 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
>    * @nvqs: number of virtqueues supported by this device
>    * @size: size of the parent structure that contains private data
>    * @name: name of the vdpa device; optional.
> + * @use_va: indicate whether virtual address can be used by this device


I think "use_va" means va must be used instead of "can be" here.


>    *
>    * Driver should use vdpa_alloc_device() wrapper macro instead of
>    * using this directly.
> @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
>    */
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name)
> +					int nvqs, size_t size, const char *name,
> +					bool use_va)
>   {
>   	struct vdpa_device *vdev;
>   	int err = -EINVAL;
> @@ -92,6 +94,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	if (!!config->dma_map != !!config->dma_unmap)
>   		goto err;
>   
> +	/* It should only work for the device that use on-chip IOMMU */
> +	if (use_va && !(config->dma_map || config->set_map))
> +		goto err;
> +
>   	err = -ENOMEM;
>   	vdev = kzalloc(size, GFP_KERNEL);
>   	if (!vdev)
> @@ -108,6 +114,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	vdev->config = config;
>   	vdev->features_valid = false;
>   	vdev->nvqs = nvqs;
> +	vdev->use_va = use_va;
>   
>   	if (name)
>   		err = dev_set_name(&vdev->dev, "%s", name);
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 5cfc262ce055..3a9a2dd4e987 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
>   		ops = &vdpasim_config_ops;
>   
>   	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> -				    dev_attr->nvqs, dev_attr->name);
> +				    dev_attr->nvqs, dev_attr->name, false);
>   	if (!vdpasim)
>   		goto err_alloc;
>   
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 70857fe3263c..93769ace34df 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -480,21 +480,31 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
>   	struct page *page;
>   	unsigned long pfn, pinned;
>   
>   	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> -		pinned = map->size >> PAGE_SHIFT;
> -		for (pfn = map->addr >> PAGE_SHIFT;
> -		     pinned > 0; pfn++, pinned--) {
> -			page = pfn_to_page(pfn);
> -			if (map->perm & VHOST_ACCESS_WO)
> -				set_page_dirty_lock(page);
> -			unpin_user_page(page);
> +		if (!vdpa->use_va) {
> +			pinned = map->size >> PAGE_SHIFT;
> +			for (pfn = map->addr >> PAGE_SHIFT;
> +			     pinned > 0; pfn++, pinned--) {
> +				page = pfn_to_page(pfn);
> +				if (map->perm & VHOST_ACCESS_WO)
> +					set_page_dirty_lock(page);
> +				unpin_user_page(page);
> +			}
> +			atomic64_sub(map->size >> PAGE_SHIFT,
> +					&dev->mm->pinned_vm);
> +		} else {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			if (map_file->file)
> +				fput(map_file->file);
> +			kfree(map_file);
>   		}
> -		atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   		vhost_iotlb_map_free(iotlb, map);
>   	}
>   }
> @@ -530,21 +540,21 @@ static int perm_to_iommu_flags(u32 perm)
>   	return flags | IOMMU_CACHE;
>   }
>   
> -static int vhost_vdpa_map(struct vhost_vdpa *v,
> -			  u64 iova, u64 size, u64 pa, u32 perm)
> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> +			  u64 size, u64 pa, u32 perm, void *opaque)
>   {
>   	struct vhost_dev *dev = &v->vdev;
>   	struct vdpa_device *vdpa = v->vdpa;
>   	const struct vdpa_config_ops *ops = vdpa->config;
>   	int r = 0;
>   
> -	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> -				  pa, perm);
> +	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
> +				      pa, perm, opaque);
>   	if (r)
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> @@ -552,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		r = iommu_map(v->domain, iova, pa, size,
>   			      perm_to_iommu_flags(perm));
>   	}
> -
> -	if (r)
> +	if (r) {
>   		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> -	else
> +		return r;
> +	}
> +
> +	if (!vdpa->use_va)
>   		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   
> -	return r;
> +	return 0;
>   }
>   
>   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> @@ -579,10 +591,60 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>   	}
>   }
>   
> +static int vhost_vdpa_va_map(struct vhost_vdpa *v,
> +			     u64 iova, u64 size, u64 uaddr, u32 perm)
> +{
> +	struct vhost_dev *dev = &v->vdev;
> +	u64 offset, map_size, map_iova = iova;
> +	struct vdpa_map_file *map_file;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	mmap_read_lock(dev->mm);
> +
> +	while (size) {
> +		vma = find_vma(dev->mm, uaddr);
> +		if (!vma) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +		map_size = min(size, vma->vm_end - uaddr);
> +		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> +		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
> +		if (!map_file) {
> +			ret = -ENOMEM;
> +			goto err;
> +		}
> +		if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
> +			!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> +			map_file->file = get_file(vma->vm_file);
> +			map_file->offset = offset;
> +		}


I think it's better to do the flag check right after find_vma(), this 
can avoid things like kfree etc (e.g the code will still call 
vhost_vdpa_map() even if the flag is not expected now).


> +		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> +				     perm, map_file);
> +		if (ret) {
> +			if (map_file->file)
> +				fput(map_file->file);
> +			kfree(map_file);
> +			goto err;
> +		}
> +		size -= map_size;
> +		uaddr += map_size;
> +		map_iova += map_size;
> +	}
> +	mmap_read_unlock(dev->mm);
> +
> +	return 0;
> +err:
> +	vhost_vdpa_unmap(v, iova, map_iova - iova);
> +	return ret;
> +}
> +
>   static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   					   struct vhost_iotlb_msg *msg)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct page **page_list;
>   	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
> @@ -601,6 +663,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				    msg->iova + msg->size - 1))
>   		return -EEXIST;
>   
> +	if (vdpa->use_va)
> +		return vhost_vdpa_va_map(v, msg->iova, msg->size,
> +					 msg->uaddr, msg->perm);


If possible, I would like to factor out the pa map below into a 
something like vhost_vdpa_pa_map() first with a separated patch. Then 
introduce vhost_vdpa_va_map().

Thanks


> +
>   	/* Limit the use of memory for bookkeeping */
>   	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>   	if (!page_list)
> @@ -654,7 +720,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>   				ret = vhost_vdpa_map(v, iova, csize,
>   						     map_pfn << PAGE_SHIFT,
> -						     msg->perm);
> +						     msg->perm, NULL);
>   				if (ret) {
>   					/*
>   					 * Unpin the pages that are left unmapped
> @@ -683,7 +749,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   
>   	/* Pin the rest chunk */
>   	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> -			     map_pfn << PAGE_SHIFT, msg->perm);
> +			     map_pfn << PAGE_SHIFT, msg->perm, NULL);
>   out:
>   	if (ret) {
>   		if (nchunks) {
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index 93dca2c328ae..bfae6d780c38 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
>    * @config: the configuration ops for this device.
>    * @index: device index
>    * @features_valid: were features initialized? for legacy guests
> + * @use_va: indicate whether virtual address can be used by this device
>    * @nvqs: maximum number of supported virtqueues
>    * @mdev: management device pointer; caller must setup when registering device as part
>    *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
> @@ -54,6 +55,7 @@ struct vdpa_device {
>   	const struct vdpa_config_ops *config;
>   	unsigned int index;
>   	bool features_valid;
> +	bool use_va;
>   	int nvqs;
>   	struct vdpa_mgmt_dev *mdev;
>   };
> @@ -69,6 +71,16 @@ struct vdpa_iova_range {
>   };
>   
>   /**
> + * Corresponding file area for device memory mapping
> + * @file: vma->vm_file for the mapping
> + * @offset: mapping offset in the vm_file
> + */
> +struct vdpa_map_file {
> +	struct file *file;
> +	u64 offset;
> +};
> +
> +/**
>    * vDPA_config_ops - operations for configuring a vDPA device.
>    * Note: vDPA device drivers are required to implement all of the
>    * operations unless it is mentioned to be optional in the following
> @@ -250,14 +262,16 @@ struct vdpa_config_ops {
>   
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					int nvqs, size_t size, const char *name);
> +					int nvqs, size_t size,
> +					const char *name, bool use_va);
>   
> -#define vdpa_alloc_device(dev_struct, member, parent, config, nvqs, name)   \
> +#define vdpa_alloc_device(dev_struct, member, parent, config, \
> +			  nvqs, name, use_va) \
>   			  container_of(__vdpa_alloc_device( \
>   				       parent, config, nvqs, \
>   				       sizeof(dev_struct) + \
>   				       BUILD_BUG_ON_ZERO(offsetof( \
> -				       dev_struct, member)), name), \
> +				       dev_struct, member)), name, use_va), \
>   				       dev_struct, member)
>   
>   int vdpa_register_device(struct vdpa_device *vdev);

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-02-23 11:50 ` [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver Xie Yongji
@ 2021-03-04  4:20     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  4:20 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This implements a MMU-based IOMMU driver to support mapping
> kernel dma buffer into userspace. The basic idea behind it is
> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
> up MMU mapping instead of IOMMU mapping for the DMA transfer so
> that the userspace process is able to use its virtual address to
> access the dma buffer in kernel.
>
> And to avoid security issue, a bounce-buffering mechanism is
> introduced to prevent userspace accessing the original buffer
> directly.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_user/iova_domain.c | 486 +++++++++++++++++++++++++++++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h |  61 +++++
>   2 files changed, 547 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> new file mode 100644
> index 000000000000..9285d430d486
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -0,0 +1,486 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/highmem.h>
> +
> +#include "iova_domain.h"
> +
> +#define IOVA_START_PFN 1
> +#define IOVA_ALLOC_ORDER 12
> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> +
> +static inline struct page *
> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 index = iova >> PAGE_SHIFT;
> +
> +	return domain->bounce_pages[index];
> +}
> +
> +static inline void
> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> +				u64 iova, struct page *page)
> +{
> +	u64 index = iova >> PAGE_SHIFT;
> +
> +	domain->bounce_pages[index] = page;
> +}
> +
> +static enum dma_data_direction perm_to_dir(int perm)
> +{
> +	enum dma_data_direction dir;
> +
> +	switch (perm) {
> +	case VHOST_MAP_WO:
> +		dir = DMA_FROM_DEVICE;
> +		break;
> +	case VHOST_MAP_RO:
> +		dir = DMA_TO_DEVICE;
> +		break;
> +	case VHOST_MAP_RW:
> +		dir = DMA_BIDIRECTIONAL;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return dir;
> +}
> +
> +static int dir_to_perm(enum dma_data_direction dir)
> +{
> +	int perm = -EFAULT;
> +
> +	switch (dir) {
> +	case DMA_FROM_DEVICE:
> +		perm = VHOST_MAP_WO;
> +		break;
> +	case DMA_TO_DEVICE:
> +		perm = VHOST_MAP_RO;
> +		break;
> +	case DMA_BIDIRECTIONAL:
> +		perm = VHOST_MAP_RW;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return perm;
> +}


Let's move the above two helpers to vhost_iotlb.h so they could be used 
by other driver e.g (vpda_sim)


> +
> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
> +			enum dma_data_direction dir)
> +{
> +	unsigned long pfn = PFN_DOWN(orig);
> +
> +	if (PageHighMem(pfn_to_page(pfn))) {
> +		unsigned int offset = offset_in_page(orig);
> +		char *buffer;
> +		unsigned int sz = 0;
> +		unsigned long flags;
> +
> +		while (size) {
> +			sz = min_t(size_t, PAGE_SIZE - offset, size);
> +
> +			local_irq_save(flags);
> +			buffer = kmap_atomic(pfn_to_page(pfn));
> +			if (dir == DMA_TO_DEVICE)
> +				memcpy(addr, buffer + offset, sz);
> +			else
> +				memcpy(buffer + offset, addr, sz);
> +			kunmap_atomic(buffer);
> +			local_irq_restore(flags);


I wonder why we need to deal with highmem and irq flags explicitly like 
this. Doesn't kmap_atomic() will take care all of those?


> +
> +			size -= sz;
> +			pfn++;
> +			addr += sz;
> +			offset = 0;
> +		}
> +	} else if (dir == DMA_TO_DEVICE) {
> +		memcpy(addr, phys_to_virt(orig), size);
> +	} else {
> +		memcpy(phys_to_virt(orig), addr, size);
> +	}
> +}
> +
> +static struct page *
> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 start = iova & PAGE_MASK;
> +	u64 last = start + PAGE_SIZE - 1;
> +	struct vhost_iotlb_map *map;
> +	struct page *page = NULL;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
> +	if (!map)
> +		goto out;
> +
> +	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
> +	get_page(page);
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return page;
> +}
> +
> +static struct page *
> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 start = iova & PAGE_MASK;
> +	u64 last = start + PAGE_SIZE - 1;
> +	struct vhost_iotlb_map *map;
> +	struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
> +
> +	if (!new_page)
> +		return NULL;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	if (!vhost_iotlb_itree_first(domain->iotlb, start, last)) {
> +		__free_page(new_page);
> +		goto out;
> +	}
> +	page = vduse_domain_get_bounce_page(domain, iova);
> +	if (page) {
> +		get_page(page);
> +		__free_page(new_page);
> +		goto out;
> +	}
> +	vduse_domain_set_bounce_page(domain, iova, new_page);
> +	get_page(new_page);
> +	page = new_page;
> +
> +	for (map = vhost_iotlb_itree_first(domain->iotlb, start, last); map;
> +	     map = vhost_iotlb_itree_next(map, start, last)) {
> +		unsigned int src_offset = 0, dst_offset = 0;
> +		phys_addr_t src;
> +		void *dst;
> +		size_t sz;
> +
> +		if (perm_to_dir(map->perm) == DMA_FROM_DEVICE)
> +			continue;
> +
> +		if (start > map->start)
> +			src_offset = start - map->start;
> +		else
> +			dst_offset = map->start - start;
> +
> +		src = map->addr + src_offset;
> +		dst = page_address(page) + dst_offset;
> +		sz = min_t(size_t, map->size - src_offset,
> +				PAGE_SIZE - dst_offset);
> +		do_bounce(src, dst, sz, DMA_TO_DEVICE);
> +	}
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return page;
> +}
> +
> +static void
> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> +				u64 iova, size_t size)
> +{
> +	struct page *page;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	if (WARN_ON(vhost_iotlb_itree_first(domain->iotlb, iova,
> +						iova + size - 1)))
> +		goto out;
> +
> +	while (size > 0) {
> +		page = vduse_domain_get_bounce_page(domain, iova);
> +		if (page) {
> +			vduse_domain_set_bounce_page(domain, iova, NULL);
> +			__free_page(page);
> +		}
> +		size -= PAGE_SIZE;
> +		iova += PAGE_SIZE;
> +	}
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +}
> +
> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> +				dma_addr_t iova, phys_addr_t orig,
> +				size_t size, enum dma_data_direction dir)
> +{
> +	unsigned int offset = offset_in_page(iova);
> +
> +	while (size) {
> +		struct page *p = vduse_domain_get_bounce_page(domain, iova);
> +		size_t sz = min_t(size_t, PAGE_SIZE - offset, size);
> +
> +		WARN_ON(!p && dir == DMA_FROM_DEVICE);
> +
> +		if (p)
> +			do_bounce(orig, page_address(p) + offset, sz, dir);
> +
> +		size -= sz;
> +		orig += sz;
> +		iova += sz;
> +		offset = 0;
> +	}
> +}
> +
> +static dma_addr_t vduse_domain_alloc_iova(struct iova_domain *iovad,
> +				unsigned long size, unsigned long limit)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +	unsigned long iova_pfn;
> +
> +	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +		iova_len = roundup_pow_of_two(iova_len);
> +	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> +
> +	return iova_pfn << shift;
> +}
> +
> +static void vduse_domain_free_iova(struct iova_domain *iovad,
> +				dma_addr_t iova, size_t size)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +
> +	free_iova_fast(iovad, iova >> shift, iova_len);
> +}
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				struct page *page, unsigned long offset,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +	unsigned long limit = domain->bounce_size - 1;
> +	phys_addr_t pa = page_to_phys(page) + offset;
> +	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> +	int ret;
> +
> +	if (!iova)
> +		return DMA_MAPPING_ERROR;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> +				    (u64)iova + size - 1,
> +				    pa, dir_to_perm(dir));
> +	spin_unlock(&domain->iotlb_lock);
> +	if (ret) {
> +		vduse_domain_free_iova(iovad, iova, size);
> +		return DMA_MAPPING_ERROR;
> +	}
> +	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, iova, pa, size, DMA_TO_DEVICE);
> +
> +	return iova;
> +}
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			dma_addr_t dma_addr, size_t size,
> +			enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +	struct vhost_iotlb_map *map;
> +	phys_addr_t pa;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> +				      (u64)dma_addr + size - 1);
> +	if (WARN_ON(!map)) {
> +		spin_unlock(&domain->iotlb_lock);
> +		return;
> +	}
> +	pa = map->addr;
> +	vhost_iotlb_map_free(domain->iotlb, map);
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, dma_addr, pa,
> +					size, DMA_FROM_DEVICE);
> +
> +	vduse_domain_free_iova(iovad, dma_addr, size);
> +}
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				size_t size, dma_addr_t *dma_addr,
> +				gfp_t flag, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	unsigned long limit = domain->iova_limit;
> +	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> +	void *orig = alloc_pages_exact(size, flag);
> +	int ret;
> +
> +	if (!iova || !orig)
> +		goto err;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> +				    (u64)iova + size - 1,
> +				    virt_to_phys(orig), VHOST_MAP_RW);
> +	spin_unlock(&domain->iotlb_lock);
> +	if (ret)
> +		goto err;
> +
> +	*dma_addr = iova;
> +
> +	return orig;
> +err:
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	if (orig)
> +		free_pages_exact(orig, size);
> +	if (iova)
> +		vduse_domain_free_iova(iovad, iova, size);
> +
> +	return NULL;
> +}
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	struct vhost_iotlb_map *map;
> +	phys_addr_t pa;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> +				      (u64)dma_addr + size - 1);
> +	if (WARN_ON(!map)) {
> +		spin_unlock(&domain->iotlb_lock);
> +		return;
> +	}
> +	pa = map->addr;
> +	vhost_iotlb_map_free(domain->iotlb, map);
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	vduse_domain_free_iova(iovad, dma_addr, size);
> +	free_pages_exact(phys_to_virt(pa), size);
> +}
> +
> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> +	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> +	struct page *page;
> +
> +	if (!domain)
> +		return VM_FAULT_SIGBUS;
> +
> +	if (iova < domain->bounce_size)
> +		page = vduse_domain_alloc_bounce_page(domain, iova);
> +	else
> +		page = vduse_domain_get_mapping_page(domain, iova);
> +
> +	if (!page)
> +		return VM_FAULT_SIGBUS;
> +
> +	vmf->page = page;
> +
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> +	.fault = vduse_domain_mmap_fault,
> +};
> +
> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
> +	vma->vm_private_data = domain;
> +	vma->vm_ops = &vduse_domain_mmap_ops;
> +
> +	return 0;
> +}
> +
> +static int vduse_domain_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
> +	put_iova_domain(&domain->stream_iovad);
> +	put_iova_domain(&domain->consistent_iovad);
> +	vhost_iotlb_free(domain->iotlb);
> +	vfree(domain->bounce_pages);
> +	kfree(domain);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_domain_fops = {
> +	.mmap = vduse_domain_mmap,
> +	.release = vduse_domain_release,
> +};
> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> +{
> +	fput(domain->file);
> +}
> +
> +struct vduse_iova_domain *
> +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
> +{
> +	struct vduse_iova_domain *domain;
> +	struct file *file;
> +	unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> +
> +	if (iova_limit <= bounce_size)
> +		return NULL;
> +
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!domain)
> +		return NULL;
> +
> +	domain->iotlb = vhost_iotlb_alloc(0, 0);
> +	if (!domain->iotlb)
> +		goto err_iotlb;
> +
> +	domain->iova_limit = iova_limit;
> +	domain->bounce_size = PAGE_ALIGN(bounce_size);
> +	domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
> +	if (!domain->bounce_pages)
> +		goto err_page;
> +
> +	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> +				domain, O_RDWR);
> +	if (IS_ERR(file))
> +		goto err_file;
> +
> +	domain->file = file;
> +	spin_lock_init(&domain->iotlb_lock);
> +	init_iova_domain(&domain->stream_iovad,
> +			IOVA_ALLOC_SIZE, IOVA_START_PFN);
> +	init_iova_domain(&domain->consistent_iovad,
> +			PAGE_SIZE, bounce_pfns);
> +
> +	return domain;
> +err_file:
> +	vfree(domain->bounce_pages);
> +err_page:
> +	vhost_iotlb_free(domain->iotlb);
> +err_iotlb:
> +	kfree(domain);
> +	return NULL;
> +}
> +
> +int vduse_domain_init(void)
> +{
> +	return iova_cache_get();
> +}
> +
> +void vduse_domain_exit(void)
> +{
> +	iova_cache_put();
> +}
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> new file mode 100644
> index 000000000000..9c85d8346626
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_IOVA_DOMAIN_H
> +#define _VDUSE_IOVA_DOMAIN_H
> +
> +#include <linux/iova.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/vhost_iotlb.h>
> +
> +struct vduse_iova_domain {
> +	struct iova_domain stream_iovad;
> +	struct iova_domain consistent_iovad;
> +	struct page **bounce_pages;
> +	size_t bounce_size;
> +	unsigned long iova_limit;
> +	struct vhost_iotlb *iotlb;


Sorry if I've asked this before.

But what's the reason for maintaing a dedicated IOTLB here? I think we 
could reuse vduse_dev->iommu since the device can not be used by both 
virtio and vhost in the same time or use vduse_iova_domain->iotlb for 
set_map().

Also, since vhost IOTLB support per mapping token (opauqe), can we use 
that instead of the bounce_pages *?

Thanks


> +	spinlock_t iotlb_lock;
> +	struct file *file;
> +};
> +
> +static inline struct file *
> +vduse_domain_file(struct vduse_iova_domain *domain)
> +{
> +	return domain->file;
> +}
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				struct page *page, unsigned long offset,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs);
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			dma_addr_t dma_addr, size_t size,
> +			enum dma_data_direction dir, unsigned long attrs);
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				size_t size, dma_addr_t *dma_addr,
> +				gfp_t flag, unsigned long attrs);
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs);
> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain);
> +
> +struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
> +						size_t bounce_size);
> +
> +int vduse_domain_init(void);
> +
> +void vduse_domain_exit(void);
> +
> +#endif /* _VDUSE_IOVA_DOMAIN_H */


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
@ 2021-03-04  4:20     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  4:20 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This implements a MMU-based IOMMU driver to support mapping
> kernel dma buffer into userspace. The basic idea behind it is
> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
> up MMU mapping instead of IOMMU mapping for the DMA transfer so
> that the userspace process is able to use its virtual address to
> access the dma buffer in kernel.
>
> And to avoid security issue, a bounce-buffering mechanism is
> introduced to prevent userspace accessing the original buffer
> directly.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_user/iova_domain.c | 486 +++++++++++++++++++++++++++++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h |  61 +++++
>   2 files changed, 547 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> new file mode 100644
> index 000000000000..9285d430d486
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -0,0 +1,486 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/highmem.h>
> +
> +#include "iova_domain.h"
> +
> +#define IOVA_START_PFN 1
> +#define IOVA_ALLOC_ORDER 12
> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> +
> +static inline struct page *
> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 index = iova >> PAGE_SHIFT;
> +
> +	return domain->bounce_pages[index];
> +}
> +
> +static inline void
> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> +				u64 iova, struct page *page)
> +{
> +	u64 index = iova >> PAGE_SHIFT;
> +
> +	domain->bounce_pages[index] = page;
> +}
> +
> +static enum dma_data_direction perm_to_dir(int perm)
> +{
> +	enum dma_data_direction dir;
> +
> +	switch (perm) {
> +	case VHOST_MAP_WO:
> +		dir = DMA_FROM_DEVICE;
> +		break;
> +	case VHOST_MAP_RO:
> +		dir = DMA_TO_DEVICE;
> +		break;
> +	case VHOST_MAP_RW:
> +		dir = DMA_BIDIRECTIONAL;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return dir;
> +}
> +
> +static int dir_to_perm(enum dma_data_direction dir)
> +{
> +	int perm = -EFAULT;
> +
> +	switch (dir) {
> +	case DMA_FROM_DEVICE:
> +		perm = VHOST_MAP_WO;
> +		break;
> +	case DMA_TO_DEVICE:
> +		perm = VHOST_MAP_RO;
> +		break;
> +	case DMA_BIDIRECTIONAL:
> +		perm = VHOST_MAP_RW;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return perm;
> +}


Let's move the above two helpers to vhost_iotlb.h so they could be used 
by other driver e.g (vpda_sim)


> +
> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
> +			enum dma_data_direction dir)
> +{
> +	unsigned long pfn = PFN_DOWN(orig);
> +
> +	if (PageHighMem(pfn_to_page(pfn))) {
> +		unsigned int offset = offset_in_page(orig);
> +		char *buffer;
> +		unsigned int sz = 0;
> +		unsigned long flags;
> +
> +		while (size) {
> +			sz = min_t(size_t, PAGE_SIZE - offset, size);
> +
> +			local_irq_save(flags);
> +			buffer = kmap_atomic(pfn_to_page(pfn));
> +			if (dir == DMA_TO_DEVICE)
> +				memcpy(addr, buffer + offset, sz);
> +			else
> +				memcpy(buffer + offset, addr, sz);
> +			kunmap_atomic(buffer);
> +			local_irq_restore(flags);


I wonder why we need to deal with highmem and irq flags explicitly like 
this. Doesn't kmap_atomic() will take care all of those?


> +
> +			size -= sz;
> +			pfn++;
> +			addr += sz;
> +			offset = 0;
> +		}
> +	} else if (dir == DMA_TO_DEVICE) {
> +		memcpy(addr, phys_to_virt(orig), size);
> +	} else {
> +		memcpy(phys_to_virt(orig), addr, size);
> +	}
> +}
> +
> +static struct page *
> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 start = iova & PAGE_MASK;
> +	u64 last = start + PAGE_SIZE - 1;
> +	struct vhost_iotlb_map *map;
> +	struct page *page = NULL;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
> +	if (!map)
> +		goto out;
> +
> +	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
> +	get_page(page);
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return page;
> +}
> +
> +static struct page *
> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 start = iova & PAGE_MASK;
> +	u64 last = start + PAGE_SIZE - 1;
> +	struct vhost_iotlb_map *map;
> +	struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
> +
> +	if (!new_page)
> +		return NULL;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	if (!vhost_iotlb_itree_first(domain->iotlb, start, last)) {
> +		__free_page(new_page);
> +		goto out;
> +	}
> +	page = vduse_domain_get_bounce_page(domain, iova);
> +	if (page) {
> +		get_page(page);
> +		__free_page(new_page);
> +		goto out;
> +	}
> +	vduse_domain_set_bounce_page(domain, iova, new_page);
> +	get_page(new_page);
> +	page = new_page;
> +
> +	for (map = vhost_iotlb_itree_first(domain->iotlb, start, last); map;
> +	     map = vhost_iotlb_itree_next(map, start, last)) {
> +		unsigned int src_offset = 0, dst_offset = 0;
> +		phys_addr_t src;
> +		void *dst;
> +		size_t sz;
> +
> +		if (perm_to_dir(map->perm) == DMA_FROM_DEVICE)
> +			continue;
> +
> +		if (start > map->start)
> +			src_offset = start - map->start;
> +		else
> +			dst_offset = map->start - start;
> +
> +		src = map->addr + src_offset;
> +		dst = page_address(page) + dst_offset;
> +		sz = min_t(size_t, map->size - src_offset,
> +				PAGE_SIZE - dst_offset);
> +		do_bounce(src, dst, sz, DMA_TO_DEVICE);
> +	}
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return page;
> +}
> +
> +static void
> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> +				u64 iova, size_t size)
> +{
> +	struct page *page;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	if (WARN_ON(vhost_iotlb_itree_first(domain->iotlb, iova,
> +						iova + size - 1)))
> +		goto out;
> +
> +	while (size > 0) {
> +		page = vduse_domain_get_bounce_page(domain, iova);
> +		if (page) {
> +			vduse_domain_set_bounce_page(domain, iova, NULL);
> +			__free_page(page);
> +		}
> +		size -= PAGE_SIZE;
> +		iova += PAGE_SIZE;
> +	}
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +}
> +
> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> +				dma_addr_t iova, phys_addr_t orig,
> +				size_t size, enum dma_data_direction dir)
> +{
> +	unsigned int offset = offset_in_page(iova);
> +
> +	while (size) {
> +		struct page *p = vduse_domain_get_bounce_page(domain, iova);
> +		size_t sz = min_t(size_t, PAGE_SIZE - offset, size);
> +
> +		WARN_ON(!p && dir == DMA_FROM_DEVICE);
> +
> +		if (p)
> +			do_bounce(orig, page_address(p) + offset, sz, dir);
> +
> +		size -= sz;
> +		orig += sz;
> +		iova += sz;
> +		offset = 0;
> +	}
> +}
> +
> +static dma_addr_t vduse_domain_alloc_iova(struct iova_domain *iovad,
> +				unsigned long size, unsigned long limit)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +	unsigned long iova_pfn;
> +
> +	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +		iova_len = roundup_pow_of_two(iova_len);
> +	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> +
> +	return iova_pfn << shift;
> +}
> +
> +static void vduse_domain_free_iova(struct iova_domain *iovad,
> +				dma_addr_t iova, size_t size)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +
> +	free_iova_fast(iovad, iova >> shift, iova_len);
> +}
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				struct page *page, unsigned long offset,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +	unsigned long limit = domain->bounce_size - 1;
> +	phys_addr_t pa = page_to_phys(page) + offset;
> +	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> +	int ret;
> +
> +	if (!iova)
> +		return DMA_MAPPING_ERROR;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> +				    (u64)iova + size - 1,
> +				    pa, dir_to_perm(dir));
> +	spin_unlock(&domain->iotlb_lock);
> +	if (ret) {
> +		vduse_domain_free_iova(iovad, iova, size);
> +		return DMA_MAPPING_ERROR;
> +	}
> +	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, iova, pa, size, DMA_TO_DEVICE);
> +
> +	return iova;
> +}
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			dma_addr_t dma_addr, size_t size,
> +			enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +	struct vhost_iotlb_map *map;
> +	phys_addr_t pa;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> +				      (u64)dma_addr + size - 1);
> +	if (WARN_ON(!map)) {
> +		spin_unlock(&domain->iotlb_lock);
> +		return;
> +	}
> +	pa = map->addr;
> +	vhost_iotlb_map_free(domain->iotlb, map);
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, dma_addr, pa,
> +					size, DMA_FROM_DEVICE);
> +
> +	vduse_domain_free_iova(iovad, dma_addr, size);
> +}
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				size_t size, dma_addr_t *dma_addr,
> +				gfp_t flag, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	unsigned long limit = domain->iova_limit;
> +	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> +	void *orig = alloc_pages_exact(size, flag);
> +	int ret;
> +
> +	if (!iova || !orig)
> +		goto err;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> +				    (u64)iova + size - 1,
> +				    virt_to_phys(orig), VHOST_MAP_RW);
> +	spin_unlock(&domain->iotlb_lock);
> +	if (ret)
> +		goto err;
> +
> +	*dma_addr = iova;
> +
> +	return orig;
> +err:
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	if (orig)
> +		free_pages_exact(orig, size);
> +	if (iova)
> +		vduse_domain_free_iova(iovad, iova, size);
> +
> +	return NULL;
> +}
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	struct vhost_iotlb_map *map;
> +	phys_addr_t pa;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> +				      (u64)dma_addr + size - 1);
> +	if (WARN_ON(!map)) {
> +		spin_unlock(&domain->iotlb_lock);
> +		return;
> +	}
> +	pa = map->addr;
> +	vhost_iotlb_map_free(domain->iotlb, map);
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	vduse_domain_free_iova(iovad, dma_addr, size);
> +	free_pages_exact(phys_to_virt(pa), size);
> +}
> +
> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> +	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> +	struct page *page;
> +
> +	if (!domain)
> +		return VM_FAULT_SIGBUS;
> +
> +	if (iova < domain->bounce_size)
> +		page = vduse_domain_alloc_bounce_page(domain, iova);
> +	else
> +		page = vduse_domain_get_mapping_page(domain, iova);
> +
> +	if (!page)
> +		return VM_FAULT_SIGBUS;
> +
> +	vmf->page = page;
> +
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> +	.fault = vduse_domain_mmap_fault,
> +};
> +
> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
> +	vma->vm_private_data = domain;
> +	vma->vm_ops = &vduse_domain_mmap_ops;
> +
> +	return 0;
> +}
> +
> +static int vduse_domain_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
> +	put_iova_domain(&domain->stream_iovad);
> +	put_iova_domain(&domain->consistent_iovad);
> +	vhost_iotlb_free(domain->iotlb);
> +	vfree(domain->bounce_pages);
> +	kfree(domain);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_domain_fops = {
> +	.mmap = vduse_domain_mmap,
> +	.release = vduse_domain_release,
> +};
> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> +{
> +	fput(domain->file);
> +}
> +
> +struct vduse_iova_domain *
> +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
> +{
> +	struct vduse_iova_domain *domain;
> +	struct file *file;
> +	unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> +
> +	if (iova_limit <= bounce_size)
> +		return NULL;
> +
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!domain)
> +		return NULL;
> +
> +	domain->iotlb = vhost_iotlb_alloc(0, 0);
> +	if (!domain->iotlb)
> +		goto err_iotlb;
> +
> +	domain->iova_limit = iova_limit;
> +	domain->bounce_size = PAGE_ALIGN(bounce_size);
> +	domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
> +	if (!domain->bounce_pages)
> +		goto err_page;
> +
> +	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> +				domain, O_RDWR);
> +	if (IS_ERR(file))
> +		goto err_file;
> +
> +	domain->file = file;
> +	spin_lock_init(&domain->iotlb_lock);
> +	init_iova_domain(&domain->stream_iovad,
> +			IOVA_ALLOC_SIZE, IOVA_START_PFN);
> +	init_iova_domain(&domain->consistent_iovad,
> +			PAGE_SIZE, bounce_pfns);
> +
> +	return domain;
> +err_file:
> +	vfree(domain->bounce_pages);
> +err_page:
> +	vhost_iotlb_free(domain->iotlb);
> +err_iotlb:
> +	kfree(domain);
> +	return NULL;
> +}
> +
> +int vduse_domain_init(void)
> +{
> +	return iova_cache_get();
> +}
> +
> +void vduse_domain_exit(void)
> +{
> +	iova_cache_put();
> +}
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> new file mode 100644
> index 000000000000..9c85d8346626
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_IOVA_DOMAIN_H
> +#define _VDUSE_IOVA_DOMAIN_H
> +
> +#include <linux/iova.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/vhost_iotlb.h>
> +
> +struct vduse_iova_domain {
> +	struct iova_domain stream_iovad;
> +	struct iova_domain consistent_iovad;
> +	struct page **bounce_pages;
> +	size_t bounce_size;
> +	unsigned long iova_limit;
> +	struct vhost_iotlb *iotlb;


Sorry if I've asked this before.

But what's the reason for maintaing a dedicated IOTLB here? I think we 
could reuse vduse_dev->iommu since the device can not be used by both 
virtio and vhost in the same time or use vduse_iova_domain->iotlb for 
set_map().

Also, since vhost IOTLB support per mapping token (opauqe), can we use 
that instead of the bounce_pages *?

Thanks


> +	spinlock_t iotlb_lock;
> +	struct file *file;
> +};
> +
> +static inline struct file *
> +vduse_domain_file(struct vduse_iova_domain *domain)
> +{
> +	return domain->file;
> +}
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				struct page *page, unsigned long offset,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs);
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			dma_addr_t dma_addr, size_t size,
> +			enum dma_data_direction dir, unsigned long attrs);
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				size_t size, dma_addr_t *dma_addr,
> +				gfp_t flag, unsigned long attrs);
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs);
> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain);
> +
> +struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
> +						size_t bounce_size);
> +
> +int vduse_domain_init(void);
> +
> +void vduse_domain_exit(void);
> +
> +#endif /* _VDUSE_IOVA_DOMAIN_H */

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-04  4:20     ` Jason Wang
  (?)
@ 2021-03-04  5:12     ` Yongji Xie
  2021-03-05  3:35         ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-04  5:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Mar 4, 2021 at 12:21 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > This implements a MMU-based IOMMU driver to support mapping
> > kernel dma buffer into userspace. The basic idea behind it is
> > treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
> > up MMU mapping instead of IOMMU mapping for the DMA transfer so
> > that the userspace process is able to use its virtual address to
> > access the dma buffer in kernel.
> >
> > And to avoid security issue, a bounce-buffering mechanism is
> > introduced to prevent userspace accessing the original buffer
> > directly.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vdpa/vdpa_user/iova_domain.c | 486 +++++++++++++++++++++++++++++++++++
> >   drivers/vdpa/vdpa_user/iova_domain.h |  61 +++++
> >   2 files changed, 547 insertions(+)
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > new file mode 100644
> > index 000000000000..9285d430d486
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > @@ -0,0 +1,486 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * MMU-based IOMMU implementation
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/slab.h>
> > +#include <linux/file.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/highmem.h>
> > +
> > +#include "iova_domain.h"
> > +
> > +#define IOVA_START_PFN 1
> > +#define IOVA_ALLOC_ORDER 12
> > +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> > +
> > +static inline struct page *
> > +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> > +{
> > +     u64 index = iova >> PAGE_SHIFT;
> > +
> > +     return domain->bounce_pages[index];
> > +}
> > +
> > +static inline void
> > +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> > +                             u64 iova, struct page *page)
> > +{
> > +     u64 index = iova >> PAGE_SHIFT;
> > +
> > +     domain->bounce_pages[index] = page;
> > +}
> > +
> > +static enum dma_data_direction perm_to_dir(int perm)
> > +{
> > +     enum dma_data_direction dir;
> > +
> > +     switch (perm) {
> > +     case VHOST_MAP_WO:
> > +             dir = DMA_FROM_DEVICE;
> > +             break;
> > +     case VHOST_MAP_RO:
> > +             dir = DMA_TO_DEVICE;
> > +             break;
> > +     case VHOST_MAP_RW:
> > +             dir = DMA_BIDIRECTIONAL;
> > +             break;
> > +     default:
> > +             break;
> > +     }
> > +
> > +     return dir;
> > +}
> > +
> > +static int dir_to_perm(enum dma_data_direction dir)
> > +{
> > +     int perm = -EFAULT;
> > +
> > +     switch (dir) {
> > +     case DMA_FROM_DEVICE:
> > +             perm = VHOST_MAP_WO;
> > +             break;
> > +     case DMA_TO_DEVICE:
> > +             perm = VHOST_MAP_RO;
> > +             break;
> > +     case DMA_BIDIRECTIONAL:
> > +             perm = VHOST_MAP_RW;
> > +             break;
> > +     default:
> > +             break;
> > +     }
> > +
> > +     return perm;
> > +}
>
>
> Let's move the above two helpers to vhost_iotlb.h so they could be used
> by other driver e.g (vpda_sim)
>

Sure.

>
> > +
> > +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
> > +                     enum dma_data_direction dir)
> > +{
> > +     unsigned long pfn = PFN_DOWN(orig);
> > +
> > +     if (PageHighMem(pfn_to_page(pfn))) {
> > +             unsigned int offset = offset_in_page(orig);
> > +             char *buffer;
> > +             unsigned int sz = 0;
> > +             unsigned long flags;
> > +
> > +             while (size) {
> > +                     sz = min_t(size_t, PAGE_SIZE - offset, size);
> > +
> > +                     local_irq_save(flags);
> > +                     buffer = kmap_atomic(pfn_to_page(pfn));
> > +                     if (dir == DMA_TO_DEVICE)
> > +                             memcpy(addr, buffer + offset, sz);
> > +                     else
> > +                             memcpy(buffer + offset, addr, sz);
> > +                     kunmap_atomic(buffer);
> > +                     local_irq_restore(flags);
>
>
> I wonder why we need to deal with highmem and irq flags explicitly like
> this. Doesn't kmap_atomic() will take care all of those?
>

Yes, irq flags is useless here. Will remove it.

>
> > +
> > +                     size -= sz;
> > +                     pfn++;
> > +                     addr += sz;
> > +                     offset = 0;
> > +             }
> > +     } else if (dir == DMA_TO_DEVICE) {
> > +             memcpy(addr, phys_to_virt(orig), size);
> > +     } else {
> > +             memcpy(phys_to_virt(orig), addr, size);
> > +     }
> > +}
> > +
> > +static struct page *
> > +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
> > +{
> > +     u64 start = iova & PAGE_MASK;
> > +     u64 last = start + PAGE_SIZE - 1;
> > +     struct vhost_iotlb_map *map;
> > +     struct page *page = NULL;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     map = vhost_iotlb_itree_first(domain->iotlb, start, last);
> > +     if (!map)
> > +             goto out;
> > +
> > +     page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
> > +     get_page(page);
> > +out:
> > +     spin_unlock(&domain->iotlb_lock);
> > +
> > +     return page;
> > +}
> > +
> > +static struct page *
> > +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> > +{
> > +     u64 start = iova & PAGE_MASK;
> > +     u64 last = start + PAGE_SIZE - 1;
> > +     struct vhost_iotlb_map *map;
> > +     struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
> > +
> > +     if (!new_page)
> > +             return NULL;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     if (!vhost_iotlb_itree_first(domain->iotlb, start, last)) {
> > +             __free_page(new_page);
> > +             goto out;
> > +     }
> > +     page = vduse_domain_get_bounce_page(domain, iova);
> > +     if (page) {
> > +             get_page(page);
> > +             __free_page(new_page);
> > +             goto out;
> > +     }
> > +     vduse_domain_set_bounce_page(domain, iova, new_page);
> > +     get_page(new_page);
> > +     page = new_page;
> > +
> > +     for (map = vhost_iotlb_itree_first(domain->iotlb, start, last); map;
> > +          map = vhost_iotlb_itree_next(map, start, last)) {
> > +             unsigned int src_offset = 0, dst_offset = 0;
> > +             phys_addr_t src;
> > +             void *dst;
> > +             size_t sz;
> > +
> > +             if (perm_to_dir(map->perm) == DMA_FROM_DEVICE)
> > +                     continue;
> > +
> > +             if (start > map->start)
> > +                     src_offset = start - map->start;
> > +             else
> > +                     dst_offset = map->start - start;
> > +
> > +             src = map->addr + src_offset;
> > +             dst = page_address(page) + dst_offset;
> > +             sz = min_t(size_t, map->size - src_offset,
> > +                             PAGE_SIZE - dst_offset);
> > +             do_bounce(src, dst, sz, DMA_TO_DEVICE);
> > +     }
> > +out:
> > +     spin_unlock(&domain->iotlb_lock);
> > +
> > +     return page;
> > +}
> > +
> > +static void
> > +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> > +                             u64 iova, size_t size)
> > +{
> > +     struct page *page;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     if (WARN_ON(vhost_iotlb_itree_first(domain->iotlb, iova,
> > +                                             iova + size - 1)))
> > +             goto out;
> > +
> > +     while (size > 0) {
> > +             page = vduse_domain_get_bounce_page(domain, iova);
> > +             if (page) {
> > +                     vduse_domain_set_bounce_page(domain, iova, NULL);
> > +                     __free_page(page);
> > +             }
> > +             size -= PAGE_SIZE;
> > +             iova += PAGE_SIZE;
> > +     }
> > +out:
> > +     spin_unlock(&domain->iotlb_lock);
> > +}
> > +
> > +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> > +                             dma_addr_t iova, phys_addr_t orig,
> > +                             size_t size, enum dma_data_direction dir)
> > +{
> > +     unsigned int offset = offset_in_page(iova);
> > +
> > +     while (size) {
> > +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
> > +             size_t sz = min_t(size_t, PAGE_SIZE - offset, size);
> > +
> > +             WARN_ON(!p && dir == DMA_FROM_DEVICE);
> > +
> > +             if (p)
> > +                     do_bounce(orig, page_address(p) + offset, sz, dir);
> > +
> > +             size -= sz;
> > +             orig += sz;
> > +             iova += sz;
> > +             offset = 0;
> > +     }
> > +}
> > +
> > +static dma_addr_t vduse_domain_alloc_iova(struct iova_domain *iovad,
> > +                             unsigned long size, unsigned long limit)
> > +{
> > +     unsigned long shift = iova_shift(iovad);
> > +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> > +     unsigned long iova_pfn;
> > +
> > +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> > +             iova_len = roundup_pow_of_two(iova_len);
> > +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> > +
> > +     return iova_pfn << shift;
> > +}
> > +
> > +static void vduse_domain_free_iova(struct iova_domain *iovad,
> > +                             dma_addr_t iova, size_t size)
> > +{
> > +     unsigned long shift = iova_shift(iovad);
> > +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> > +
> > +     free_iova_fast(iovad, iova >> shift, iova_len);
> > +}
> > +
> > +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> > +                             struct page *page, unsigned long offset,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->stream_iovad;
> > +     unsigned long limit = domain->bounce_size - 1;
> > +     phys_addr_t pa = page_to_phys(page) + offset;
> > +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> > +     int ret;
> > +
> > +     if (!iova)
> > +             return DMA_MAPPING_ERROR;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> > +                                 (u64)iova + size - 1,
> > +                                 pa, dir_to_perm(dir));
> > +     spin_unlock(&domain->iotlb_lock);
> > +     if (ret) {
> > +             vduse_domain_free_iova(iovad, iova, size);
> > +             return DMA_MAPPING_ERROR;
> > +     }
> > +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> > +             vduse_domain_bounce(domain, iova, pa, size, DMA_TO_DEVICE);
> > +
> > +     return iova;
> > +}
> > +
> > +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> > +                     dma_addr_t dma_addr, size_t size,
> > +                     enum dma_data_direction dir, unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->stream_iovad;
> > +     struct vhost_iotlb_map *map;
> > +     phys_addr_t pa;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> > +                                   (u64)dma_addr + size - 1);
> > +     if (WARN_ON(!map)) {
> > +             spin_unlock(&domain->iotlb_lock);
> > +             return;
> > +     }
> > +     pa = map->addr;
> > +     vhost_iotlb_map_free(domain->iotlb, map);
> > +     spin_unlock(&domain->iotlb_lock);
> > +
> > +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> > +             vduse_domain_bounce(domain, dma_addr, pa,
> > +                                     size, DMA_FROM_DEVICE);
> > +
> > +     vduse_domain_free_iova(iovad, dma_addr, size);
> > +}
> > +
> > +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> > +                             size_t size, dma_addr_t *dma_addr,
> > +                             gfp_t flag, unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->consistent_iovad;
> > +     unsigned long limit = domain->iova_limit;
> > +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> > +     void *orig = alloc_pages_exact(size, flag);
> > +     int ret;
> > +
> > +     if (!iova || !orig)
> > +             goto err;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> > +                                 (u64)iova + size - 1,
> > +                                 virt_to_phys(orig), VHOST_MAP_RW);
> > +     spin_unlock(&domain->iotlb_lock);
> > +     if (ret)
> > +             goto err;
> > +
> > +     *dma_addr = iova;
> > +
> > +     return orig;
> > +err:
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     if (orig)
> > +             free_pages_exact(orig, size);
> > +     if (iova)
> > +             vduse_domain_free_iova(iovad, iova, size);
> > +
> > +     return NULL;
> > +}
> > +
> > +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> > +                             void *vaddr, dma_addr_t dma_addr,
> > +                             unsigned long attrs)
> > +{
> > +     struct iova_domain *iovad = &domain->consistent_iovad;
> > +     struct vhost_iotlb_map *map;
> > +     phys_addr_t pa;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> > +                                   (u64)dma_addr + size - 1);
> > +     if (WARN_ON(!map)) {
> > +             spin_unlock(&domain->iotlb_lock);
> > +             return;
> > +     }
> > +     pa = map->addr;
> > +     vhost_iotlb_map_free(domain->iotlb, map);
> > +     spin_unlock(&domain->iotlb_lock);
> > +
> > +     vduse_domain_free_iova(iovad, dma_addr, size);
> > +     free_pages_exact(phys_to_virt(pa), size);
> > +}
> > +
> > +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> > +{
> > +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> > +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> > +     struct page *page;
> > +
> > +     if (!domain)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     if (iova < domain->bounce_size)
> > +             page = vduse_domain_alloc_bounce_page(domain, iova);
> > +     else
> > +             page = vduse_domain_get_mapping_page(domain, iova);
> > +
> > +     if (!page)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     vmf->page = page;
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> > +     .fault = vduse_domain_mmap_fault,
> > +};
> > +
> > +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +     struct vduse_iova_domain *domain = file->private_data;
> > +
> > +     vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
> > +     vma->vm_private_data = domain;
> > +     vma->vm_ops = &vduse_domain_mmap_ops;
> > +
> > +     return 0;
> > +}
> > +
> > +static int vduse_domain_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_iova_domain *domain = file->private_data;
> > +
> > +     vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
> > +     put_iova_domain(&domain->stream_iovad);
> > +     put_iova_domain(&domain->consistent_iovad);
> > +     vhost_iotlb_free(domain->iotlb);
> > +     vfree(domain->bounce_pages);
> > +     kfree(domain);
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations vduse_domain_fops = {
> > +     .mmap = vduse_domain_mmap,
> > +     .release = vduse_domain_release,
> > +};
> > +
> > +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> > +{
> > +     fput(domain->file);
> > +}
> > +
> > +struct vduse_iova_domain *
> > +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
> > +{
> > +     struct vduse_iova_domain *domain;
> > +     struct file *file;
> > +     unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> > +
> > +     if (iova_limit <= bounce_size)
> > +             return NULL;
> > +
> > +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > +     if (!domain)
> > +             return NULL;
> > +
> > +     domain->iotlb = vhost_iotlb_alloc(0, 0);
> > +     if (!domain->iotlb)
> > +             goto err_iotlb;
> > +
> > +     domain->iova_limit = iova_limit;
> > +     domain->bounce_size = PAGE_ALIGN(bounce_size);
> > +     domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
> > +     if (!domain->bounce_pages)
> > +             goto err_page;
> > +
> > +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> > +                             domain, O_RDWR);
> > +     if (IS_ERR(file))
> > +             goto err_file;
> > +
> > +     domain->file = file;
> > +     spin_lock_init(&domain->iotlb_lock);
> > +     init_iova_domain(&domain->stream_iovad,
> > +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
> > +     init_iova_domain(&domain->consistent_iovad,
> > +                     PAGE_SIZE, bounce_pfns);
> > +
> > +     return domain;
> > +err_file:
> > +     vfree(domain->bounce_pages);
> > +err_page:
> > +     vhost_iotlb_free(domain->iotlb);
> > +err_iotlb:
> > +     kfree(domain);
> > +     return NULL;
> > +}
> > +
> > +int vduse_domain_init(void)
> > +{
> > +     return iova_cache_get();
> > +}
> > +
> > +void vduse_domain_exit(void)
> > +{
> > +     iova_cache_put();
> > +}
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> > new file mode 100644
> > index 000000000000..9c85d8346626
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> > @@ -0,0 +1,61 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * MMU-based IOMMU implementation
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#ifndef _VDUSE_IOVA_DOMAIN_H
> > +#define _VDUSE_IOVA_DOMAIN_H
> > +
> > +#include <linux/iova.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/vhost_iotlb.h>
> > +
> > +struct vduse_iova_domain {
> > +     struct iova_domain stream_iovad;
> > +     struct iova_domain consistent_iovad;
> > +     struct page **bounce_pages;
> > +     size_t bounce_size;
> > +     unsigned long iova_limit;
> > +     struct vhost_iotlb *iotlb;
>
>
> Sorry if I've asked this before.
>
> But what's the reason for maintaing a dedicated IOTLB here? I think we
> could reuse vduse_dev->iommu since the device can not be used by both
> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
> set_map().
>

The main difference between domain->iotlb and dev->iotlb is the way to
deal with bounce buffer. In the domain->iotlb case, bounce buffer
needs to be mapped each DMA transfer because we need to get the bounce
pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
buffer only needs to be mapped once during initialization, which will
be used to tell userspace how to do mmap().

> Also, since vhost IOTLB support per mapping token (opauqe), can we use
> that instead of the bounce_pages *?
>

Sorry, I didn't get you here. Which value do you mean to store in the
opaque pointer?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 05/11] vdpa: Support transferring virtual addressing during DMA mapping
  2021-03-04  3:07     ` Jason Wang
  (?)
@ 2021-03-04  5:40     ` Yongji Xie
  -1 siblings, 0 replies; 92+ messages in thread
From: Yongji Xie @ 2021-03-04  5:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Mar 4, 2021 at 11:07 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > This patch introduces an attribute for vDPA device to indicate
> > whether virtual address can be used. If vDPA device driver set
> > it, vhost-vdpa bus driver will not pin user page and transfer
> > userspace virtual address instead of physical address during
> > DMA mapping. And corresponding vma->vm_file and offset will be
> > also passed as an opaque pointer.
> >
> > Suggested-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   drivers/vdpa/ifcvf/ifcvf_main.c   |   2 +-
> >   drivers/vdpa/mlx5/net/mlx5_vnet.c |   2 +-
> >   drivers/vdpa/vdpa.c               |   9 +++-
> >   drivers/vdpa/vdpa_sim/vdpa_sim.c  |   2 +-
> >   drivers/vhost/vdpa.c              | 104 +++++++++++++++++++++++++++++++-------
> >   include/linux/vdpa.h              |  20 ++++++--
> >   6 files changed, 113 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> > index 7c8bbfcf6c3e..228b9f920fea 100644
> > --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> > +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> > @@ -432,7 +432,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >
> >       adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
> >                                   dev, &ifc_vdpa_ops,
> > -                                 IFCVF_MAX_QUEUE_PAIRS * 2, NULL);
> > +                                 IFCVF_MAX_QUEUE_PAIRS * 2, NULL, false);
> >       if (adapter == NULL) {
> >               IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
> >               return -ENOMEM;
> > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > index 029822060017..54290438da28 100644
> > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > @@ -1964,7 +1964,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
> >       max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
> >
> >       ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> > -                              2 * mlx5_vdpa_max_qps(max_vqs), NULL);
> > +                              2 * mlx5_vdpa_max_qps(max_vqs), NULL, false);
> >       if (IS_ERR(ndev))
> >               return PTR_ERR(ndev);
> >
> > diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> > index 9700a0adcca0..fafc0ee5eb05 100644
> > --- a/drivers/vdpa/vdpa.c
> > +++ b/drivers/vdpa/vdpa.c
> > @@ -72,6 +72,7 @@ static void vdpa_release_dev(struct device *d)
> >    * @nvqs: number of virtqueues supported by this device
> >    * @size: size of the parent structure that contains private data
> >    * @name: name of the vdpa device; optional.
> > + * @use_va: indicate whether virtual address can be used by this device
>
>
> I think "use_va" means va must be used instead of "can be" here.
>

Right.

>
> >    *
> >    * Driver should use vdpa_alloc_device() wrapper macro instead of
> >    * using this directly.
> > @@ -81,7 +82,8 @@ static void vdpa_release_dev(struct device *d)
> >    */
> >   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
> >                                       const struct vdpa_config_ops *config,
> > -                                     int nvqs, size_t size, const char *name)
> > +                                     int nvqs, size_t size, const char *name,
> > +                                     bool use_va)
> >   {
> >       struct vdpa_device *vdev;
> >       int err = -EINVAL;
> > @@ -92,6 +94,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
> >       if (!!config->dma_map != !!config->dma_unmap)
> >               goto err;
> >
> > +     /* It should only work for the device that use on-chip IOMMU */
> > +     if (use_va && !(config->dma_map || config->set_map))
> > +             goto err;
> > +
> >       err = -ENOMEM;
> >       vdev = kzalloc(size, GFP_KERNEL);
> >       if (!vdev)
> > @@ -108,6 +114,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
> >       vdev->config = config;
> >       vdev->features_valid = false;
> >       vdev->nvqs = nvqs;
> > +     vdev->use_va = use_va;
> >
> >       if (name)
> >               err = dev_set_name(&vdev->dev, "%s", name);
> > diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > index 5cfc262ce055..3a9a2dd4e987 100644
> > --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> > @@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
> >               ops = &vdpasim_config_ops;
> >
> >       vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> > -                                 dev_attr->nvqs, dev_attr->name);
> > +                                 dev_attr->nvqs, dev_attr->name, false);
> >       if (!vdpasim)
> >               goto err_alloc;
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index 70857fe3263c..93769ace34df 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -480,21 +480,31 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> >   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
> >   {
> >       struct vhost_dev *dev = &v->vdev;
> > +     struct vdpa_device *vdpa = v->vdpa;
> >       struct vhost_iotlb *iotlb = dev->iotlb;
> >       struct vhost_iotlb_map *map;
> > +     struct vdpa_map_file *map_file;
> >       struct page *page;
> >       unsigned long pfn, pinned;
> >
> >       while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> > -             pinned = map->size >> PAGE_SHIFT;
> > -             for (pfn = map->addr >> PAGE_SHIFT;
> > -                  pinned > 0; pfn++, pinned--) {
> > -                     page = pfn_to_page(pfn);
> > -                     if (map->perm & VHOST_ACCESS_WO)
> > -                             set_page_dirty_lock(page);
> > -                     unpin_user_page(page);
> > +             if (!vdpa->use_va) {
> > +                     pinned = map->size >> PAGE_SHIFT;
> > +                     for (pfn = map->addr >> PAGE_SHIFT;
> > +                          pinned > 0; pfn++, pinned--) {
> > +                             page = pfn_to_page(pfn);
> > +                             if (map->perm & VHOST_ACCESS_WO)
> > +                                     set_page_dirty_lock(page);
> > +                             unpin_user_page(page);
> > +                     }
> > +                     atomic64_sub(map->size >> PAGE_SHIFT,
> > +                                     &dev->mm->pinned_vm);
> > +             } else {
> > +                     map_file = (struct vdpa_map_file *)map->opaque;
> > +                     if (map_file->file)
> > +                             fput(map_file->file);
> > +                     kfree(map_file);
> >               }
> > -             atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
> >               vhost_iotlb_map_free(iotlb, map);
> >       }
> >   }
> > @@ -530,21 +540,21 @@ static int perm_to_iommu_flags(u32 perm)
> >       return flags | IOMMU_CACHE;
> >   }
> >
> > -static int vhost_vdpa_map(struct vhost_vdpa *v,
> > -                       u64 iova, u64 size, u64 pa, u32 perm)
> > +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> > +                       u64 size, u64 pa, u32 perm, void *opaque)
> >   {
> >       struct vhost_dev *dev = &v->vdev;
> >       struct vdpa_device *vdpa = v->vdpa;
> >       const struct vdpa_config_ops *ops = vdpa->config;
> >       int r = 0;
> >
> > -     r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> > -                               pa, perm);
> > +     r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
> > +                                   pa, perm, opaque);
> >       if (r)
> >               return r;
> >
> >       if (ops->dma_map) {
> > -             r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
> > +             r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
> >       } else if (ops->set_map) {
> >               if (!v->in_batch)
> >                       r = ops->set_map(vdpa, dev->iotlb);
> > @@ -552,13 +562,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
> >               r = iommu_map(v->domain, iova, pa, size,
> >                             perm_to_iommu_flags(perm));
> >       }
> > -
> > -     if (r)
> > +     if (r) {
> >               vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> > -     else
> > +             return r;
> > +     }
> > +
> > +     if (!vdpa->use_va)
> >               atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
> >
> > -     return r;
> > +     return 0;
> >   }
> >
> >   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> > @@ -579,10 +591,60 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> >       }
> >   }
> >
> > +static int vhost_vdpa_va_map(struct vhost_vdpa *v,
> > +                          u64 iova, u64 size, u64 uaddr, u32 perm)
> > +{
> > +     struct vhost_dev *dev = &v->vdev;
> > +     u64 offset, map_size, map_iova = iova;
> > +     struct vdpa_map_file *map_file;
> > +     struct vm_area_struct *vma;
> > +     int ret;
> > +
> > +     mmap_read_lock(dev->mm);
> > +
> > +     while (size) {
> > +             vma = find_vma(dev->mm, uaddr);
> > +             if (!vma) {
> > +                     ret = -EINVAL;
> > +                     goto err;
> > +             }
> > +             map_size = min(size, vma->vm_end - uaddr);
> > +             offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> > +             map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
> > +             if (!map_file) {
> > +                     ret = -ENOMEM;
> > +                     goto err;
> > +             }
> > +             if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
> > +                     !(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> > +                     map_file->file = get_file(vma->vm_file);
> > +                     map_file->offset = offset;
> > +             }
>
>
> I think it's better to do the flag check right after find_vma(), this
> can avoid things like kfree etc (e.g the code will still call
> vhost_vdpa_map() even if the flag is not expected now).
>

Make sense to me.

>
> > +             ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> > +                                  perm, map_file);
> > +             if (ret) {
> > +                     if (map_file->file)
> > +                             fput(map_file->file);
> > +                     kfree(map_file);
> > +                     goto err;
> > +             }
> > +             size -= map_size;
> > +             uaddr += map_size;
> > +             map_iova += map_size;
> > +     }
> > +     mmap_read_unlock(dev->mm);
> > +
> > +     return 0;
> > +err:
> > +     vhost_vdpa_unmap(v, iova, map_iova - iova);
> > +     return ret;
> > +}
> > +
> >   static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >                                          struct vhost_iotlb_msg *msg)
> >   {
> >       struct vhost_dev *dev = &v->vdev;
> > +     struct vdpa_device *vdpa = v->vdpa;
> >       struct vhost_iotlb *iotlb = dev->iotlb;
> >       struct page **page_list;
> >       unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
> > @@ -601,6 +663,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >                                   msg->iova + msg->size - 1))
> >               return -EEXIST;
> >
> > +     if (vdpa->use_va)
> > +             return vhost_vdpa_va_map(v, msg->iova, msg->size,
> > +                                      msg->uaddr, msg->perm);
>
>
> If possible, I would like to factor out the pa map below into a
> something like vhost_vdpa_pa_map() first with a separated patch. Then
> introduce vhost_vdpa_va_map().
>

Fine, will do it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-02-23 11:50 ` [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-03-04  6:27     ` Jason Wang
  2021-03-10 12:58     ` Jason Wang
  1 sibling, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  6:27 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
>
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> the file descriptors referring to vDPA device's iova regions. Then
> userspace can use mmap() to access those iova regions. Besides,
> userspace can use ioctl() to inject interrupt and use the eventfd
> mechanism to receive virtqueue kicks.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1348 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  136 ++
>   6 files changed, 1501 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a28c839..71722e6f8f23 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index ffd1e098bfd2..92f07715e3b6 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>   	help
>   	  vDPA networking device simulator which loops TX traffic back to RX.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index d160e9b63a66..66e97778ad03 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..393bf99c48be
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1348 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/miscdevice.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_VERSION  "1.0"
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	bool ready;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device dev;
> +	struct cdev cdev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	struct vhost_iotlb *iommu;
> +	spinlock_t iommu_lock;
> +	atomic_t bounce_map;
> +	struct mutex msg_lock;
> +	atomic64_t msg_unique;


"next_request_id" should be better.


> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct list_head list;
> +	bool connected;
> +	int minor;
> +	u16 vq_size_max;
> +	u16 vq_num;
> +	u32 vq_align;
> +	u32 device_id;
> +	u32 vendor_id;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> +module_param(max_bounce_size, ulong, 0444);
> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> +
> +static unsigned long max_iova_size = (128 * 1024 * 1024);
> +module_param(max_iova_size, ulong, 0444);
> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static LIST_HEAD(vduse_devs);
> +static DEFINE_IDA(vduse_ida);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t unique)
> +{
> +	struct vduse_dev_msg *tmp, *msg = NULL;
> +
> +	list_for_each_entry(tmp, head, list) {


Shoudl we use list_for_each_entry_safe()?


> +		if (tmp->req.unique == unique) {
> +			msg = tmp;
> +			list_del(&tmp->list);
> +			break;
> +		}
> +	}
> +
> +	return msg;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> +{
> +	int ret;
> +
> +	init_waitqueue_head(&msg->waitq);
> +	mutex_lock(&dev->msg_lock);
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	mutex_unlock(&dev->msg_lock);
> +	ret = wait_event_interruptible(msg->waitq, msg->completed);
> +	mutex_lock(&dev->msg_lock);
> +	if (!msg->completed)
> +		list_del(&msg->list);
> +	else
> +		ret = msg->resp.result;
> +	mutex_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_FEATURES;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.features;
> +}
> +
> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_FEATURES;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.features = features;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_STATUS;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.status;
> +}
> +
> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_STATUS;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.status = status;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> +					void *buf, unsigned int len)


Btw, the ident looks odd here and other may places wherhe functions has 
more than one line of arguments.


> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_GET_CONFIG;
> +		msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		vduse_dev_msg_sync(dev, &msg);
> +		memcpy(buf, msg.resp.config.data, sz);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> +					const void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_SET_CONFIG;
> +		msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		memcpy(msg.req.config.data, buf, sz);
> +		vduse_dev_msg_sync(dev, &msg);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u32 num)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_NUM;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_num.index = vq->index;
> +	msg.req.vq_num.num = num;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u64 desc_addr,
> +				u64 driver_addr, u64 device_addr)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_ADDR;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_addr.index = vq->index;
> +	msg.req.vq_addr.desc_addr = desc_addr;
> +	msg.req.vq_addr.driver_addr = driver_addr;
> +	msg.req.vq_addr.device_addr = device_addr;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, bool ready)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_READY;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_ready.index = vq->index;
> +	msg.req.vq_ready.ready = ready;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> +				   struct vduse_virtqueue *vq)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_VQ_READY;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_ready.index = vq->index;
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;
> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_sync(dev, &msg);
> +	if (!ret)
> +		state->avail_index = msg.resp.vq_state.avail_idx;
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_STATE;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_state.index = vq->index;
> +	msg.req.vq_state.avail_idx = state->avail_index;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg.req.type = VDUSE_UPDATE_IOTLB;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.iova.start = start;
> +	msg.req.iova.last = last;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret = 0;
> +
> +	if (iov_iter_count(to) < size)
> +		return 0;
> +
> +	mutex_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		mutex_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		mutex_lock(&dev->msg_lock);
> +	}
> +	ret = copy_to_iter(&msg->req, size, to);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	vduse_enqueue_msg(&dev->recv_list, msg);
> +unlock:
> +	mutex_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	mutex_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
> +				 u64 start, u64 last,
> +				 u64 addr, unsigned int perm,
> +				 struct file *file, u64 offset)
> +{
> +	struct vdpa_map_file *map_file;
> +	int ret;
> +
> +	map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
> +	if (!map_file)
> +		return -ENOMEM;
> +
> +	map_file->file = get_file(file);
> +	map_file->offset = offset;
> +
> +	spin_lock(&dev->iommu_lock);
> +	ret = vhost_iotlb_add_range_ctx(dev->iommu, start, last,
> +					addr, perm, map_file);
> +	spin_unlock(&dev->iommu_lock);
> +	if (ret) {
> +		fput(map_file->file);
> +		kfree(map_file);
> +		return ret;
> +	}
> +	return 0;
> +}
> +
> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
> +{
> +	struct vdpa_map_file *map_file;
> +	struct vhost_iotlb_map *map;
> +
> +	spin_lock(&dev->iommu_lock);
> +	while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
> +		map_file = (struct vdpa_map_file *)map->opaque;
> +		fput(map_file->file);
> +		kfree(map_file);
> +		vhost_iotlb_map_free(dev->iommu, map);
> +	}
> +	spin_unlock(&dev->iommu_lock);
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	atomic_set(&dev->bounce_map, 0);
> +	vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
> +	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->ready = false;
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_addr(dev, vq, desc_area,
> +					driver_area, device_area);
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->ready && vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_num(dev, vq, num);
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_ready(dev, vq, ready);
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = vduse_dev_get_vq_ready(dev, vq);
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_state(dev, vq, state);
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> +
> +	return (vduse_dev_get_features(dev) | fixed);


What happens if we don't do such fixup. I think we should fail if 
usersapce doesnt offer ACCESS_PLATFORM instead.


> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_set_features(dev, features);
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	/* We don't support config interrupt */
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_status(dev);
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	if (status == 0)
> +		vduse_dev_reset(dev);
> +
> +	vduse_dev_set_status(dev, status);
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +			     void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_get_config(dev, offset, buf, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_set_config(dev, offset, buf, len);
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
> +	u64 start = 0ULL, last = ULLONG_MAX;
> +	int ret = 0;
> +
> +	vduse_iotlb_del_range(dev, start, last);
> +
> +	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> +		map = vhost_iotlb_itree_next(map, start, last)) {
> +		map_file = (struct vdpa_map_file *)map->opaque;
> +		if (!map_file->file)
> +			continue;
> +
> +		ret = vduse_iotlb_add_range(dev, map->start, map->last,
> +					    map->addr, map->perm,
> +					    map_file->file,
> +					    map_file->offset);
> +		if (ret)
> +			break;
> +	}
> +	vduse_dev_update_iotlb(dev, start, last);
> +
> +	return ret;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	WARN_ON(!list_empty(&dev->send_list));
> +	WARN_ON(!list_empty(&dev->recv_list));
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +					unsigned long offset, size_t size,
> +					enum dma_data_direction dir,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
> +		vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
> +				      0, VDUSE_ACCESS_RW,
> +				      vduse_domain_file(domain), 0)) {
> +		atomic_set(&vdev->bounce_map, 0);
> +		return DMA_MAPPING_ERROR;


Can we add the bounce mapping page by page here?


> +	}
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
> +				  iova, VDUSE_ACCESS_RW,
> +				  vduse_domain_file(domain), iova)) {
> +		vduse_domain_free_coherent(domain, size, addr, iova, attrs);
> +		return NULL;
> +	}
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long start = (unsigned long)dma_addr;
> +	unsigned long last = start + size - 1;
> +
> +	vduse_iotlb_del_range(vdev, start, last);
> +	vduse_dev_update_iotlb(vdev, start, last);
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	if (eventfd->fd > 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	}
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		spin_lock(&dev->iommu_lock);
> +		map = vhost_iotlb_itree_first(dev->iommu, entry.start,
> +					      entry.last);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&dev->iommu_lock);
> +		if (!f) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			ret = -EFAULT;
> +			break;
> +		}
> +		ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
> +		if (ret < 0) {
> +			fput(f);
> +			break;
> +		}
> +		fd_install(ret, f);
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_INJECT_VQ_IRQ: {
> +		struct vduse_virtqueue *vq;
> +
> +		ret = -EINVAL;
> +		if (arg >= dev->vq_num)
> +			break;
> +
> +		vq = &dev->vqs[arg];
> +		spin_lock_irq(&vq->irq_lock);
> +		if (vq->ready && vq->cb.callback) {
> +			vq->cb.callback(vq->cb.private);
> +			ret = 0;
> +		}
> +		spin_unlock_irq(&vq->irq_lock);
> +		break;
> +	}
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->kick_lock);
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +	}
> +
> +	mutex_lock(&dev->msg_lock);
> +	while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +	mutex_unlock(&dev->msg_lock);
> +
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = container_of(inode->i_cdev,
> +					struct vduse_dev, cdev);
> +	int ret = -EBUSY;
> +
> +	mutex_lock(&vduse_lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	dev->iommu = vhost_iotlb_alloc(0, 0);
> +	if (!dev->iommu) {
> +		kfree(dev);
> +		return NULL;
> +	}
> +
> +	mutex_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	atomic64_set(&dev->msg_unique, 0);
> +	spin_lock_init(&dev->iommu_lock);
> +	atomic_set(&dev->bounce_map, 0);
> +
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	vhost_iotlb_free(dev->iommu);
> +	mutex_destroy(&dev->msg_lock);
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *tmp, *dev = NULL;
> +
> +	list_for_each_entry(tmp, &vduse_devs, list) {
> +		if (!strcmp(dev_name(&tmp->dev), name)) {
> +			dev = tmp;
> +			break;
> +		}
> +	}
> +	return dev;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	if (dev->vdev || dev->connected)
> +		return -EBUSY;
> +
> +	dev->connected = true;
> +	list_del(&dev->list);
> +	cdev_device_del(&dev->cdev, &dev->dev);
> +	put_device(&dev->dev);
> +
> +	return 0;
> +}
> +
> +static void vduse_release_dev(struct device *device)
> +{
> +	struct vduse_dev *dev =
> +		container_of(device, struct vduse_dev, dev);
> +
> +	ida_simple_remove(&vduse_ida, dev->minor);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config)
> +{
> +	int i, ret = -ENOMEM;
> +	struct vduse_dev *dev;
> +
> +	if (config->bounce_size > max_bounce_size)
> +		return -EINVAL;
> +
> +	if (config->bounce_size > max_iova_size)
> +		return -EINVAL;
> +
> +	if (vduse_find_dev(config->name))
> +		return -EEXIST;
> +
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->domain = vduse_domain_create(max_iova_size - 1,
> +					config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_ida;
> +
> +	dev->minor = ret;
> +	device_initialize(&dev->dev);
> +	dev->dev.release = vduse_release_dev;
> +	dev->dev.class = vduse_class;
> +	dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> +	ret = dev_set_name(&dev->dev, "%s", config->name);


Do we need to add a namespce here? E.g "vduse-%s", config->name.


> +	if (ret)
> +		goto err_name;
> +
> +	cdev_init(&dev->cdev, &vduse_dev_fops);
> +	dev->cdev.owner = THIS_MODULE;
> +
> +	ret = cdev_device_add(&dev->cdev, &dev->dev);
> +	if (ret) {
> +		put_device(&dev->dev);
> +		return ret;
> +	}
> +	list_add(&dev->list, &vduse_devs);
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_name:
> +	ida_simple_remove(&vduse_ida, dev->minor);
> +err_ida:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	vduse_dev_destroy(dev);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, sizeof(config)))
> +			break;
> +
> +		ret = vduse_create_dev(&config);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static struct miscdevice vduse_misc = {
> +	.fops = &vduse_fops,
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "vduse",
> +	.nodename = "vduse/control",
> +};
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev = dev->vdev;
> +	int ret;
> +
> +	if (vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,


I think the char dev should be used as the parent here.


> +				 &vduse_vdpa_config_ops,
> +				 dev->vq_num, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret)
> +		goto err;
> +
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	ret = _vdpa_register_device(&vdev->vdpa);
> +	if (ret)
> +		goto err;
> +
> +	dev->vdev = vdev;
> +
> +	return 0;
> +err:
> +	put_device(&vdev->vdpa.dev);
> +	return ret;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret = -EINVAL;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev)
> +		goto unlock;


Any reason for this check? I think vdpa core layer has already did for 
the name check for us.


> +
> +	ret = vduse_dev_add_vdpa(dev, name);
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +
> +	ret = misc_register(&vduse_misc);
> +	if (ret)
> +		return ret;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class)) {
> +		ret = PTR_ERR(vduse_class);
> +		goto err_class;
> +	}
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;


Should we validate max_bounce_size < max_iova_size here?

Thanks


> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev:
> +	class_destroy(vduse_class);
> +err_class:
> +	misc_deregister(&vduse_misc);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	misc_deregister(&vduse_misc);
> +	class_destroy(vduse_class);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	vduse_domain_exit();
> +	vduse_mgmtdev_exit();
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..9391c4acfa53
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,136 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_CONFIG_DATA_LEN	256
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	VDUSE_SET_VQ_NUM,
> +	VDUSE_SET_VQ_ADDR,
> +	VDUSE_SET_VQ_READY,
> +	VDUSE_GET_VQ_READY,
> +	VDUSE_SET_VQ_STATE,
> +	VDUSE_GET_VQ_STATE,
> +	VDUSE_SET_FEATURES,
> +	VDUSE_GET_FEATURES,
> +	VDUSE_SET_STATUS,
> +	VDUSE_GET_STATUS,
> +	VDUSE_SET_CONFIG,
> +	VDUSE_GET_CONFIG,
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_num {
> +	__u32 index;
> +	__u32 num;
> +};
> +
> +struct vduse_vq_addr {
> +	__u32 index;
> +	__u64 desc_addr;
> +	__u64 driver_addr;
> +	__u64 device_addr;
> +};
> +
> +struct vduse_vq_ready {
> +	__u32 index;
> +	__u8 ready;
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index;
> +	__u16 avail_idx;
> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset;
> +	__u32 len;
> +	__u8 data[VDUSE_CONFIG_DATA_LEN];
> +};
> +
> +struct vduse_iova_range {
> +	__u64 start;
> +	__u64 last;
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 unique; /* request id */


Let's simply use "request_id" here.


> +	__u32 reserved[2]; /* for feature use */
> +	union {
> +		struct vduse_vq_num vq_num; /* virtqueue num */
> +		struct vduse_vq_addr vq_addr; /* virtqueue address */
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */


It might be better to use struct for feaures and status as well for 
consistency.

And to be safe, let's add explicity padding here.


> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQUEST_OK	0x00
> +#define VDUSE_REQUEST_FAILED	0x01
> +	__u8 result; /* the result of request */
> +	__u8 reserved[11]; /* for feature use */


Looks like this will be a hole which is similar to 
429711aec282c4b5fe5bbd7b2f0bbbff4110ffb2. Need to make sure the reserved 
end at 8 byte boundary.


> +	union {
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_num; /* the number of virtqueues */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */


Let's define a macro for this.


> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
> +
> +/* Get a file descriptor for the mmap'able iova region */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)


I wonder do we need a version/feature handshake that is for future 
extension instead of just reserve bits in uABI? E.g VDUSE_GET_VERSION ...

Thanks


> +
> +#endif /* _UAPI_VDUSE_H_ */


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-03-04  6:27     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  6:27 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
>
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> the file descriptors referring to vDPA device's iova regions. Then
> userspace can use mmap() to access those iova regions. Besides,
> userspace can use ioctl() to inject interrupt and use the eventfd
> mechanism to receive virtqueue kicks.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1348 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  136 ++
>   6 files changed, 1501 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a28c839..71722e6f8f23 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index ffd1e098bfd2..92f07715e3b6 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>   	help
>   	  vDPA networking device simulator which loops TX traffic back to RX.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index d160e9b63a66..66e97778ad03 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..393bf99c48be
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1348 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/miscdevice.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_VERSION  "1.0"
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	bool ready;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device dev;
> +	struct cdev cdev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	struct vhost_iotlb *iommu;
> +	spinlock_t iommu_lock;
> +	atomic_t bounce_map;
> +	struct mutex msg_lock;
> +	atomic64_t msg_unique;


"next_request_id" should be better.


> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct list_head list;
> +	bool connected;
> +	int minor;
> +	u16 vq_size_max;
> +	u16 vq_num;
> +	u32 vq_align;
> +	u32 device_id;
> +	u32 vendor_id;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> +module_param(max_bounce_size, ulong, 0444);
> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> +
> +static unsigned long max_iova_size = (128 * 1024 * 1024);
> +module_param(max_iova_size, ulong, 0444);
> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static LIST_HEAD(vduse_devs);
> +static DEFINE_IDA(vduse_ida);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t unique)
> +{
> +	struct vduse_dev_msg *tmp, *msg = NULL;
> +
> +	list_for_each_entry(tmp, head, list) {


Shoudl we use list_for_each_entry_safe()?


> +		if (tmp->req.unique == unique) {
> +			msg = tmp;
> +			list_del(&tmp->list);
> +			break;
> +		}
> +	}
> +
> +	return msg;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> +{
> +	int ret;
> +
> +	init_waitqueue_head(&msg->waitq);
> +	mutex_lock(&dev->msg_lock);
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	mutex_unlock(&dev->msg_lock);
> +	ret = wait_event_interruptible(msg->waitq, msg->completed);
> +	mutex_lock(&dev->msg_lock);
> +	if (!msg->completed)
> +		list_del(&msg->list);
> +	else
> +		ret = msg->resp.result;
> +	mutex_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_FEATURES;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.features;
> +}
> +
> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_FEATURES;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.features = features;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_STATUS;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.status;
> +}
> +
> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_STATUS;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.status = status;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> +					void *buf, unsigned int len)


Btw, the ident looks odd here and other may places wherhe functions has 
more than one line of arguments.


> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_GET_CONFIG;
> +		msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		vduse_dev_msg_sync(dev, &msg);
> +		memcpy(buf, msg.resp.config.data, sz);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> +					const void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_SET_CONFIG;
> +		msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		memcpy(msg.req.config.data, buf, sz);
> +		vduse_dev_msg_sync(dev, &msg);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u32 num)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_NUM;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_num.index = vq->index;
> +	msg.req.vq_num.num = num;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, u64 desc_addr,
> +				u64 driver_addr, u64 device_addr)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_ADDR;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_addr.index = vq->index;
> +	msg.req.vq_addr.desc_addr = desc_addr;
> +	msg.req.vq_addr.driver_addr = driver_addr;
> +	msg.req.vq_addr.device_addr = device_addr;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, bool ready)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_READY;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_ready.index = vq->index;
> +	msg.req.vq_ready.ready = ready;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> +				   struct vduse_virtqueue *vq)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_VQ_READY;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_ready.index = vq->index;
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;
> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_sync(dev, &msg);
> +	if (!ret)
> +		state->avail_index = msg.resp.vq_state.avail_idx;
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_STATE;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_state.index = vq->index;
> +	msg.req.vq_state.avail_idx = state->avail_index;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg.req.type = VDUSE_UPDATE_IOTLB;
> +	msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.iova.start = start;
> +	msg.req.iova.last = last;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret = 0;
> +
> +	if (iov_iter_count(to) < size)
> +		return 0;
> +
> +	mutex_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		mutex_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		mutex_lock(&dev->msg_lock);
> +	}
> +	ret = copy_to_iter(&msg->req, size, to);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	vduse_enqueue_msg(&dev->recv_list, msg);
> +unlock:
> +	mutex_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	mutex_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
> +				 u64 start, u64 last,
> +				 u64 addr, unsigned int perm,
> +				 struct file *file, u64 offset)
> +{
> +	struct vdpa_map_file *map_file;
> +	int ret;
> +
> +	map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
> +	if (!map_file)
> +		return -ENOMEM;
> +
> +	map_file->file = get_file(file);
> +	map_file->offset = offset;
> +
> +	spin_lock(&dev->iommu_lock);
> +	ret = vhost_iotlb_add_range_ctx(dev->iommu, start, last,
> +					addr, perm, map_file);
> +	spin_unlock(&dev->iommu_lock);
> +	if (ret) {
> +		fput(map_file->file);
> +		kfree(map_file);
> +		return ret;
> +	}
> +	return 0;
> +}
> +
> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
> +{
> +	struct vdpa_map_file *map_file;
> +	struct vhost_iotlb_map *map;
> +
> +	spin_lock(&dev->iommu_lock);
> +	while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
> +		map_file = (struct vdpa_map_file *)map->opaque;
> +		fput(map_file->file);
> +		kfree(map_file);
> +		vhost_iotlb_map_free(dev->iommu, map);
> +	}
> +	spin_unlock(&dev->iommu_lock);
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	atomic_set(&dev->bounce_map, 0);
> +	vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
> +	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->ready = false;
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_addr(dev, vq, desc_area,
> +					driver_area, device_area);
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->ready && vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_num(dev, vq, num);
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_ready(dev, vq, ready);
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = vduse_dev_get_vq_ready(dev, vq);
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_state(dev, vq, state);
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> +
> +	return (vduse_dev_get_features(dev) | fixed);


What happens if we don't do such fixup. I think we should fail if 
usersapce doesnt offer ACCESS_PLATFORM instead.


> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_set_features(dev, features);
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	/* We don't support config interrupt */
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_status(dev);
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	if (status == 0)
> +		vduse_dev_reset(dev);
> +
> +	vduse_dev_set_status(dev, status);
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +			     void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_get_config(dev, offset, buf, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_set_config(dev, offset, buf, len);
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
> +	u64 start = 0ULL, last = ULLONG_MAX;
> +	int ret = 0;
> +
> +	vduse_iotlb_del_range(dev, start, last);
> +
> +	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> +		map = vhost_iotlb_itree_next(map, start, last)) {
> +		map_file = (struct vdpa_map_file *)map->opaque;
> +		if (!map_file->file)
> +			continue;
> +
> +		ret = vduse_iotlb_add_range(dev, map->start, map->last,
> +					    map->addr, map->perm,
> +					    map_file->file,
> +					    map_file->offset);
> +		if (ret)
> +			break;
> +	}
> +	vduse_dev_update_iotlb(dev, start, last);
> +
> +	return ret;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	WARN_ON(!list_empty(&dev->send_list));
> +	WARN_ON(!list_empty(&dev->recv_list));
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +					unsigned long offset, size_t size,
> +					enum dma_data_direction dir,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
> +		vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
> +				      0, VDUSE_ACCESS_RW,
> +				      vduse_domain_file(domain), 0)) {
> +		atomic_set(&vdev->bounce_map, 0);
> +		return DMA_MAPPING_ERROR;


Can we add the bounce mapping page by page here?


> +	}
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
> +				  iova, VDUSE_ACCESS_RW,
> +				  vduse_domain_file(domain), iova)) {
> +		vduse_domain_free_coherent(domain, size, addr, iova, attrs);
> +		return NULL;
> +	}
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long start = (unsigned long)dma_addr;
> +	unsigned long last = start + size - 1;
> +
> +	vduse_iotlb_del_range(vdev, start, last);
> +	vduse_dev_update_iotlb(vdev, start, last);
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	if (eventfd->fd > 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	}
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		spin_lock(&dev->iommu_lock);
> +		map = vhost_iotlb_itree_first(dev->iommu, entry.start,
> +					      entry.last);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&dev->iommu_lock);
> +		if (!f) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			ret = -EFAULT;
> +			break;
> +		}
> +		ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
> +		if (ret < 0) {
> +			fput(f);
> +			break;
> +		}
> +		fd_install(ret, f);
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_INJECT_VQ_IRQ: {
> +		struct vduse_virtqueue *vq;
> +
> +		ret = -EINVAL;
> +		if (arg >= dev->vq_num)
> +			break;
> +
> +		vq = &dev->vqs[arg];
> +		spin_lock_irq(&vq->irq_lock);
> +		if (vq->ready && vq->cb.callback) {
> +			vq->cb.callback(vq->cb.private);
> +			ret = 0;
> +		}
> +		spin_unlock_irq(&vq->irq_lock);
> +		break;
> +	}
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->kick_lock);
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +	}
> +
> +	mutex_lock(&dev->msg_lock);
> +	while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +	mutex_unlock(&dev->msg_lock);
> +
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = container_of(inode->i_cdev,
> +					struct vduse_dev, cdev);
> +	int ret = -EBUSY;
> +
> +	mutex_lock(&vduse_lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	dev->iommu = vhost_iotlb_alloc(0, 0);
> +	if (!dev->iommu) {
> +		kfree(dev);
> +		return NULL;
> +	}
> +
> +	mutex_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	atomic64_set(&dev->msg_unique, 0);
> +	spin_lock_init(&dev->iommu_lock);
> +	atomic_set(&dev->bounce_map, 0);
> +
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	vhost_iotlb_free(dev->iommu);
> +	mutex_destroy(&dev->msg_lock);
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *tmp, *dev = NULL;
> +
> +	list_for_each_entry(tmp, &vduse_devs, list) {
> +		if (!strcmp(dev_name(&tmp->dev), name)) {
> +			dev = tmp;
> +			break;
> +		}
> +	}
> +	return dev;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	if (dev->vdev || dev->connected)
> +		return -EBUSY;
> +
> +	dev->connected = true;
> +	list_del(&dev->list);
> +	cdev_device_del(&dev->cdev, &dev->dev);
> +	put_device(&dev->dev);
> +
> +	return 0;
> +}
> +
> +static void vduse_release_dev(struct device *device)
> +{
> +	struct vduse_dev *dev =
> +		container_of(device, struct vduse_dev, dev);
> +
> +	ida_simple_remove(&vduse_ida, dev->minor);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config)
> +{
> +	int i, ret = -ENOMEM;
> +	struct vduse_dev *dev;
> +
> +	if (config->bounce_size > max_bounce_size)
> +		return -EINVAL;
> +
> +	if (config->bounce_size > max_iova_size)
> +		return -EINVAL;
> +
> +	if (vduse_find_dev(config->name))
> +		return -EEXIST;
> +
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->domain = vduse_domain_create(max_iova_size - 1,
> +					config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_ida;
> +
> +	dev->minor = ret;
> +	device_initialize(&dev->dev);
> +	dev->dev.release = vduse_release_dev;
> +	dev->dev.class = vduse_class;
> +	dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> +	ret = dev_set_name(&dev->dev, "%s", config->name);


Do we need to add a namespce here? E.g "vduse-%s", config->name.


> +	if (ret)
> +		goto err_name;
> +
> +	cdev_init(&dev->cdev, &vduse_dev_fops);
> +	dev->cdev.owner = THIS_MODULE;
> +
> +	ret = cdev_device_add(&dev->cdev, &dev->dev);
> +	if (ret) {
> +		put_device(&dev->dev);
> +		return ret;
> +	}
> +	list_add(&dev->list, &vduse_devs);
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_name:
> +	ida_simple_remove(&vduse_ida, dev->minor);
> +err_ida:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	vduse_dev_destroy(dev);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, sizeof(config)))
> +			break;
> +
> +		ret = vduse_create_dev(&config);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static struct miscdevice vduse_misc = {
> +	.fops = &vduse_fops,
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "vduse",
> +	.nodename = "vduse/control",
> +};
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev = dev->vdev;
> +	int ret;
> +
> +	if (vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,


I think the char dev should be used as the parent here.


> +				 &vduse_vdpa_config_ops,
> +				 dev->vq_num, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret)
> +		goto err;
> +
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	ret = _vdpa_register_device(&vdev->vdpa);
> +	if (ret)
> +		goto err;
> +
> +	dev->vdev = vdev;
> +
> +	return 0;
> +err:
> +	put_device(&vdev->vdpa.dev);
> +	return ret;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret = -EINVAL;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev)
> +		goto unlock;


Any reason for this check? I think vdpa core layer has already did for 
the name check for us.


> +
> +	ret = vduse_dev_add_vdpa(dev, name);
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +
> +	ret = misc_register(&vduse_misc);
> +	if (ret)
> +		return ret;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class)) {
> +		ret = PTR_ERR(vduse_class);
> +		goto err_class;
> +	}
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;


Should we validate max_bounce_size < max_iova_size here?

Thanks


> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev:
> +	class_destroy(vduse_class);
> +err_class:
> +	misc_deregister(&vduse_misc);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	misc_deregister(&vduse_misc);
> +	class_destroy(vduse_class);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	vduse_domain_exit();
> +	vduse_mgmtdev_exit();
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..9391c4acfa53
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,136 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_CONFIG_DATA_LEN	256
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	VDUSE_SET_VQ_NUM,
> +	VDUSE_SET_VQ_ADDR,
> +	VDUSE_SET_VQ_READY,
> +	VDUSE_GET_VQ_READY,
> +	VDUSE_SET_VQ_STATE,
> +	VDUSE_GET_VQ_STATE,
> +	VDUSE_SET_FEATURES,
> +	VDUSE_GET_FEATURES,
> +	VDUSE_SET_STATUS,
> +	VDUSE_GET_STATUS,
> +	VDUSE_SET_CONFIG,
> +	VDUSE_GET_CONFIG,
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_num {
> +	__u32 index;
> +	__u32 num;
> +};
> +
> +struct vduse_vq_addr {
> +	__u32 index;
> +	__u64 desc_addr;
> +	__u64 driver_addr;
> +	__u64 device_addr;
> +};
> +
> +struct vduse_vq_ready {
> +	__u32 index;
> +	__u8 ready;
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index;
> +	__u16 avail_idx;
> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset;
> +	__u32 len;
> +	__u8 data[VDUSE_CONFIG_DATA_LEN];
> +};
> +
> +struct vduse_iova_range {
> +	__u64 start;
> +	__u64 last;
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 unique; /* request id */


Let's simply use "request_id" here.


> +	__u32 reserved[2]; /* for feature use */
> +	union {
> +		struct vduse_vq_num vq_num; /* virtqueue num */
> +		struct vduse_vq_addr vq_addr; /* virtqueue address */
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */


It might be better to use struct for feaures and status as well for 
consistency.

And to be safe, let's add explicity padding here.


> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQUEST_OK	0x00
> +#define VDUSE_REQUEST_FAILED	0x01
> +	__u8 result; /* the result of request */
> +	__u8 reserved[11]; /* for feature use */


Looks like this will be a hole which is similar to 
429711aec282c4b5fe5bbd7b2f0bbbff4110ffb2. Need to make sure the reserved 
end at 8 byte boundary.


> +	union {
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		__u64 features; /* virtio features */
> +		__u8 status; /* device status */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_num; /* the number of virtqueues */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */


Let's define a macro for this.


> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
> +
> +/* Get a file descriptor for the mmap'able iova region */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)


I wonder do we need a version/feature handshake that is for future 
extension instead of just reserve bits in uABI? E.g VDUSE_GET_VERSION ...

Thanks


> +
> +#endif /* _UAPI_VDUSE_H_ */


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 09/11] Documentation: Add documentation for VDUSE
  2021-02-23 11:50 ` [RFC v4 09/11] Documentation: Add documentation for VDUSE Xie Yongji
@ 2021-03-04  6:39     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  6:39 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/index.rst |   1 +
>   Documentation/userspace-api/vduse.rst | 112 ++++++++++++++++++++++++++++++++++
>   2 files changed, 113 insertions(+)
>   create mode 100644 Documentation/userspace-api/vduse.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index acd2cc2a538d..f63119130898 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -24,6 +24,7 @@ place where this information is gathered.
>      ioctl/index
>      iommu
>      media/index
> +   vduse
>   
>   .. only::  subproject and html
>   
> diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index 000000000000..2a20e686bb59
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,112 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +
> +How VDUSE works
> +------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the character device (/dev/vduse/control). Then a device file with the
> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> +implement the userspace vDPA device's control path and data path.


It's better to mention that in order to le thte device to be registered 
on the bus, admin need to use the management API(netlink) to create the 
vDPA device.

Some codes to demnonstrate how to create the device will be better.


> +
> +To implement control path, a message-based communication protocol and some
> +types of control messages are introduced in the VDUSE framework:
> +
> +- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
> +
> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> +
> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> +
> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> +
> +- VDUSE_SET_VQ_STATE: Set the state for virtqueue
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue
> +
> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> +
> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> +
> +- VDUSE_SET_STATUS: Set the device status
> +
> +- VDUSE_GET_STATUS: Get the device status
> +
> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> +
> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Those control messages are mostly based on the vdpa_config_ops in
> +include/linux/vdpa.h which defines a unified interface to control
> +different types of vdpa device. Userspace needs to read()/write()
> +on the VDUSE device file to receive/reply those control messages
> +from/to VDUSE kernel module as follows:
> +
> +.. code-block:: c
> +
> +	static int vduse_message_handler(int dev_fd)
> +	{
> +		int len;
> +		struct vduse_dev_request req;
> +		struct vduse_dev_response resp;
> +
> +		len = read(dev_fd, &req, sizeof(req));
> +		if (len != sizeof(req))
> +			return -1;
> +
> +		resp.request_id = req.unique;
> +
> +		switch (req.type) {
> +
> +		/* handle different types of message */
> +
> +		}
> +
> +		len = write(dev_fd, &resp, sizeof(resp));
> +		if (len != sizeof(resp))
> +			return -1;
> +
> +		return 0;
> +	}
> +
> +In the deta path, vDPA device's iova regions will be mapped into userspace
> +with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> +  access this iova region by passing the fd to mmap().


It would be better to have codes to explain how it is expected to work here.


> +
> +Besides, the following ioctls on the VDUSE device file are provided to support
> +interrupt injection and setting up eventfd for virtqueue kicks:
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE kernel module to notify userspace to consume the vring.
> +
> +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
> +
> +- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
> +
> +MMU-based IOMMU Driver
> +----------------------
> +In virtio-vdpa case, VDUSE framework implements an MMU-based on-chip IOMMU
> +driver to support mapping the kernel DMA buffer into the userspace iova
> +region dynamically.
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the DMA buffer in kernel.
> +
> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> +prevent userspace accessing the original buffer directly which may contain other
> +kernel data.


It's worth to mention this is designed for virtio-vdpa (kernel virtio 
drivers).

Thanks


>   During the mapping, unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> +space instead of the original one.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 09/11] Documentation: Add documentation for VDUSE
@ 2021-03-04  6:39     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  6:39 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/index.rst |   1 +
>   Documentation/userspace-api/vduse.rst | 112 ++++++++++++++++++++++++++++++++++
>   2 files changed, 113 insertions(+)
>   create mode 100644 Documentation/userspace-api/vduse.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index acd2cc2a538d..f63119130898 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -24,6 +24,7 @@ place where this information is gathered.
>      ioctl/index
>      iommu
>      media/index
> +   vduse
>   
>   .. only::  subproject and html
>   
> diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index 000000000000..2a20e686bb59
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,112 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +
> +How VDUSE works
> +------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the character device (/dev/vduse/control). Then a device file with the
> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> +implement the userspace vDPA device's control path and data path.


It's better to mention that in order to le thte device to be registered 
on the bus, admin need to use the management API(netlink) to create the 
vDPA device.

Some codes to demnonstrate how to create the device will be better.


> +
> +To implement control path, a message-based communication protocol and some
> +types of control messages are introduced in the VDUSE framework:
> +
> +- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
> +
> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> +
> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> +
> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> +
> +- VDUSE_SET_VQ_STATE: Set the state for virtqueue
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue
> +
> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> +
> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> +
> +- VDUSE_SET_STATUS: Set the device status
> +
> +- VDUSE_GET_STATUS: Get the device status
> +
> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> +
> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Those control messages are mostly based on the vdpa_config_ops in
> +include/linux/vdpa.h which defines a unified interface to control
> +different types of vdpa device. Userspace needs to read()/write()
> +on the VDUSE device file to receive/reply those control messages
> +from/to VDUSE kernel module as follows:
> +
> +.. code-block:: c
> +
> +	static int vduse_message_handler(int dev_fd)
> +	{
> +		int len;
> +		struct vduse_dev_request req;
> +		struct vduse_dev_response resp;
> +
> +		len = read(dev_fd, &req, sizeof(req));
> +		if (len != sizeof(req))
> +			return -1;
> +
> +		resp.request_id = req.unique;
> +
> +		switch (req.type) {
> +
> +		/* handle different types of message */
> +
> +		}
> +
> +		len = write(dev_fd, &resp, sizeof(resp));
> +		if (len != sizeof(resp))
> +			return -1;
> +
> +		return 0;
> +	}
> +
> +In the deta path, vDPA device's iova regions will be mapped into userspace
> +with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> +  access this iova region by passing the fd to mmap().


It would be better to have codes to explain how it is expected to work here.


> +
> +Besides, the following ioctls on the VDUSE device file are provided to support
> +interrupt injection and setting up eventfd for virtqueue kicks:
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE kernel module to notify userspace to consume the vring.
> +
> +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
> +
> +- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
> +
> +MMU-based IOMMU Driver
> +----------------------
> +In virtio-vdpa case, VDUSE framework implements an MMU-based on-chip IOMMU
> +driver to support mapping the kernel DMA buffer into the userspace iova
> +region dynamically.
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the DMA buffer in kernel.
> +
> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> +prevent userspace accessing the original buffer directly which may contain other
> +kernel data.


It's worth to mention this is designed for virtio-vdpa (kernel virtio 
drivers).

Thanks


>   During the mapping, unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> +space instead of the original one.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-02-23 11:50 ` [RFC v4 10/11] vduse: Introduce a workqueue for irq injection Xie Yongji
@ 2021-03-04  6:59     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  6:59 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This patch introduces a workqueue to support injecting
> virtqueue's interrupt asynchronously. This is mainly
> for performance considerations which makes sure the push()
> and pop() for used vring can be asynchronous.


Do you have pref numbers for this patch?

Thanks


>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_user/vduse_dev.c | 29 +++++++++++++++++++++++------
>   1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 8042d3fa57f1..f5adeb9ee027 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -42,6 +42,7 @@ struct vduse_virtqueue {
>   	spinlock_t irq_lock;
>   	struct eventfd_ctx *kickfd;
>   	struct vdpa_callback cb;
> +	struct work_struct inject;
>   };
>   
>   struct vduse_dev;
> @@ -99,6 +100,7 @@ static DEFINE_IDA(vduse_ida);
>   
>   static dev_t vduse_major;
>   static struct class *vduse_class;
> +static struct workqueue_struct *vduse_irq_wq;
>   
>   static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>   {
> @@ -852,6 +854,17 @@ static int vduse_kickfd_setup(struct vduse_dev *dev,
>   	return 0;
>   }
>   
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
>   static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   			unsigned long arg)
>   {
> @@ -917,12 +930,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   			break;
>   
>   		vq = &dev->vqs[arg];
> -		spin_lock_irq(&vq->irq_lock);
> -		if (vq->ready && vq->cb.callback) {
> -			vq->cb.callback(vq->cb.private);
> -			ret = 0;
> -		}
> -		spin_unlock_irq(&vq->irq_lock);
> +		queue_work(vduse_irq_wq, &vq->inject);
>   		break;
>   	}
>   	case VDUSE_INJECT_CONFIG_IRQ:
> @@ -1109,6 +1117,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
>   
>   	for (i = 0; i < dev->vq_num; i++) {
>   		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>   		spin_lock_init(&dev->vqs[i].kick_lock);
>   		spin_lock_init(&dev->vqs[i].irq_lock);
>   	}
> @@ -1333,6 +1342,11 @@ static int vduse_init(void)
>   	if (ret)
>   		goto err_chardev;
>   
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
>   	ret = vduse_domain_init();
>   	if (ret)
>   		goto err_domain;
> @@ -1344,6 +1358,8 @@ static int vduse_init(void)
>   	return 0;
>   err_mgmtdev:
>   	vduse_domain_exit();
> +err_wq:
> +	destroy_workqueue(vduse_irq_wq);
>   err_domain:
>   	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>   err_chardev:
> @@ -1359,6 +1375,7 @@ static void vduse_exit(void)
>   	misc_deregister(&vduse_misc);
>   	class_destroy(vduse_class);
>   	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	destroy_workqueue(vduse_irq_wq);
>   	vduse_domain_exit();
>   	vduse_mgmtdev_exit();
>   }


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
@ 2021-03-04  6:59     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  6:59 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> This patch introduces a workqueue to support injecting
> virtqueue's interrupt asynchronously. This is mainly
> for performance considerations which makes sure the push()
> and pop() for used vring can be asynchronous.


Do you have pref numbers for this patch?

Thanks


>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_user/vduse_dev.c | 29 +++++++++++++++++++++++------
>   1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 8042d3fa57f1..f5adeb9ee027 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -42,6 +42,7 @@ struct vduse_virtqueue {
>   	spinlock_t irq_lock;
>   	struct eventfd_ctx *kickfd;
>   	struct vdpa_callback cb;
> +	struct work_struct inject;
>   };
>   
>   struct vduse_dev;
> @@ -99,6 +100,7 @@ static DEFINE_IDA(vduse_ida);
>   
>   static dev_t vduse_major;
>   static struct class *vduse_class;
> +static struct workqueue_struct *vduse_irq_wq;
>   
>   static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>   {
> @@ -852,6 +854,17 @@ static int vduse_kickfd_setup(struct vduse_dev *dev,
>   	return 0;
>   }
>   
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
>   static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   			unsigned long arg)
>   {
> @@ -917,12 +930,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   			break;
>   
>   		vq = &dev->vqs[arg];
> -		spin_lock_irq(&vq->irq_lock);
> -		if (vq->ready && vq->cb.callback) {
> -			vq->cb.callback(vq->cb.private);
> -			ret = 0;
> -		}
> -		spin_unlock_irq(&vq->irq_lock);
> +		queue_work(vduse_irq_wq, &vq->inject);
>   		break;
>   	}
>   	case VDUSE_INJECT_CONFIG_IRQ:
> @@ -1109,6 +1117,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
>   
>   	for (i = 0; i < dev->vq_num; i++) {
>   		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>   		spin_lock_init(&dev->vqs[i].kick_lock);
>   		spin_lock_init(&dev->vqs[i].irq_lock);
>   	}
> @@ -1333,6 +1342,11 @@ static int vduse_init(void)
>   	if (ret)
>   		goto err_chardev;
>   
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
>   	ret = vduse_domain_init();
>   	if (ret)
>   		goto err_domain;
> @@ -1344,6 +1358,8 @@ static int vduse_init(void)
>   	return 0;
>   err_mgmtdev:
>   	vduse_domain_exit();
> +err_wq:
> +	destroy_workqueue(vduse_irq_wq);
>   err_domain:
>   	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>   err_chardev:
> @@ -1359,6 +1375,7 @@ static void vduse_exit(void)
>   	misc_deregister(&vduse_misc);
>   	class_destroy(vduse_class);
>   	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	destroy_workqueue(vduse_irq_wq);
>   	vduse_domain_exit();
>   	vduse_mgmtdev_exit();
>   }

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
  2021-02-23 11:50 ` [RFC v4 11/11] vduse: Support binding irq to the specified cpu Xie Yongji
@ 2021-03-04  7:30     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  7:30 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
> injecting virtqueue's interrupt to the specified cpu.


How userspace know which CPU is this irq for? It looks to me we need to 
do it at different level.

E.g introduce some API in sys to allow admin to tune for that.

But I think we can do that in antoher patch on top of this series.

Thanks


>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_user/vduse_dev.c | 22 +++++++++++++++++-----
>   include/uapi/linux/vduse.h         |  7 ++++++-
>   2 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index f5adeb9ee027..df3d467fff40 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -923,14 +923,27 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   		break;
>   	}
>   	case VDUSE_INJECT_VQ_IRQ: {
> +		struct vduse_vq_irq irq;
>   		struct vduse_virtqueue *vq;
>   
> +		ret = -EFAULT;
> +		if (copy_from_user(&irq, argp, sizeof(irq)))
> +			break;
> +
>   		ret = -EINVAL;
> -		if (arg >= dev->vq_num)
> +		if (irq.index >= dev->vq_num)
> +			break;
> +
> +		if (irq.cpu != -1 && (irq.cpu >= nr_cpu_ids ||
> +		    !cpu_online(irq.cpu)))
>   			break;
>   
> -		vq = &dev->vqs[arg];
> -		queue_work(vduse_irq_wq, &vq->inject);
> +		ret = 0;
> +		vq = &dev->vqs[irq.index];
> +		if (irq.cpu == -1)
> +			queue_work(vduse_irq_wq, &vq->inject);
> +		else
> +			queue_work_on(irq.cpu, vduse_irq_wq, &vq->inject);
>   		break;
>   	}
>   	case VDUSE_INJECT_CONFIG_IRQ:
> @@ -1342,8 +1355,7 @@ static int vduse_init(void)
>   	if (ret)
>   		goto err_chardev;
>   
> -	vduse_irq_wq = alloc_workqueue("vduse-irq",
> -				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_HIGHPRI, 0);
>   	if (!vduse_irq_wq)
>   		goto err_wq;
>   
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 9070cd512cb4..9c70fd842ce5 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -116,6 +116,11 @@ struct vduse_vq_eventfd {
>   	int fd; /* eventfd, -1 means de-assigning the eventfd */
>   };
>   
> +struct vduse_vq_irq {
> +	__u32 index; /* virtqueue index */
> +	int cpu; /* bind irq to the specified cpu, -1 means running on the current cpu */
> +};
> +
>   #define VDUSE_BASE	0x81
>   
>   /* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> @@ -131,7 +136,7 @@ struct vduse_vq_eventfd {
>   #define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
>   
>   /* Inject an interrupt for specific virtqueue */
> -#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)
> +#define VDUSE_INJECT_VQ_IRQ	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_irq)
>   
>   /* Inject a config interrupt */
>   #define VDUSE_INJECT_CONFIG_IRQ	_IO(VDUSE_BASE, 0x06)


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
@ 2021-03-04  7:30     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-04  7:30 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet
  Cc: linux-aio, netdev, linux-fsdevel, kvm, virtualization


On 2021/2/23 7:50 下午, Xie Yongji wrote:
> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
> injecting virtqueue's interrupt to the specified cpu.


How userspace know which CPU is this irq for? It looks to me we need to 
do it at different level.

E.g introduce some API in sys to allow admin to tune for that.

But I think we can do that in antoher patch on top of this series.

Thanks


>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_user/vduse_dev.c | 22 +++++++++++++++++-----
>   include/uapi/linux/vduse.h         |  7 ++++++-
>   2 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index f5adeb9ee027..df3d467fff40 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -923,14 +923,27 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   		break;
>   	}
>   	case VDUSE_INJECT_VQ_IRQ: {
> +		struct vduse_vq_irq irq;
>   		struct vduse_virtqueue *vq;
>   
> +		ret = -EFAULT;
> +		if (copy_from_user(&irq, argp, sizeof(irq)))
> +			break;
> +
>   		ret = -EINVAL;
> -		if (arg >= dev->vq_num)
> +		if (irq.index >= dev->vq_num)
> +			break;
> +
> +		if (irq.cpu != -1 && (irq.cpu >= nr_cpu_ids ||
> +		    !cpu_online(irq.cpu)))
>   			break;
>   
> -		vq = &dev->vqs[arg];
> -		queue_work(vduse_irq_wq, &vq->inject);
> +		ret = 0;
> +		vq = &dev->vqs[irq.index];
> +		if (irq.cpu == -1)
> +			queue_work(vduse_irq_wq, &vq->inject);
> +		else
> +			queue_work_on(irq.cpu, vduse_irq_wq, &vq->inject);
>   		break;
>   	}
>   	case VDUSE_INJECT_CONFIG_IRQ:
> @@ -1342,8 +1355,7 @@ static int vduse_init(void)
>   	if (ret)
>   		goto err_chardev;
>   
> -	vduse_irq_wq = alloc_workqueue("vduse-irq",
> -				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_HIGHPRI, 0);
>   	if (!vduse_irq_wq)
>   		goto err_wq;
>   
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 9070cd512cb4..9c70fd842ce5 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -116,6 +116,11 @@ struct vduse_vq_eventfd {
>   	int fd; /* eventfd, -1 means de-assigning the eventfd */
>   };
>   
> +struct vduse_vq_irq {
> +	__u32 index; /* virtqueue index */
> +	int cpu; /* bind irq to the specified cpu, -1 means running on the current cpu */
> +};
> +
>   #define VDUSE_BASE	0x81
>   
>   /* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> @@ -131,7 +136,7 @@ struct vduse_vq_eventfd {
>   #define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
>   
>   /* Inject an interrupt for specific virtqueue */
> -#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)
> +#define VDUSE_INJECT_VQ_IRQ	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_irq)
>   
>   /* Inject a config interrupt */
>   #define VDUSE_INJECT_CONFIG_IRQ	_IO(VDUSE_BASE, 0x06)

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-03-04  6:27     ` Jason Wang
  (?)
@ 2021-03-04  8:05     ` Yongji Xie
  2021-03-05  3:20         ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-04  8:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Mar 4, 2021 at 2:27 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > Both control path and data path of vDPA devices will be able to
> > be handled in userspace.
> >
> > In the control path, the VDUSE driver will make use of message
> > mechnism to forward the config operation from vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive/reply
> > those control messages.
> >
> > In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> > the file descriptors referring to vDPA device's iova regions. Then
> > userspace can use mmap() to access those iova regions. Besides,
> > userspace can use ioctl() to inject interrupt and use the eventfd
> > mechanism to receive virtqueue kicks.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >   drivers/vdpa/Kconfig                               |   10 +
> >   drivers/vdpa/Makefile                              |    1 +
> >   drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1348 ++++++++++++++++++++
> >   include/uapi/linux/vduse.h                         |  136 ++
> >   6 files changed, 1501 insertions(+)
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index a4c75a28c839..71722e6f8f23 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >   '|'   00-7F  linux/media.h
> >   0x80  00-1F  linux/fb.h
> > +0x81  00-1F  linux/vduse.h
> >   0x89  00-06  arch/x86/include/asm/sockios.h
> >   0x89  0B-DF  linux/sockios.h
> >   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> > diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> > index ffd1e098bfd2..92f07715e3b6 100644
> > --- a/drivers/vdpa/Kconfig
> > +++ b/drivers/vdpa/Kconfig
> > @@ -25,6 +25,16 @@ config VDPA_SIM_NET
> >       help
> >         vDPA networking device simulator which loops TX traffic back to RX.
> >
> > +config VDPA_USER
> > +     tristate "VDUSE (vDPA Device in Userspace) support"
> > +     depends on EVENTFD && MMU && HAS_DMA
> > +     select DMA_OPS
> > +     select VHOST_IOTLB
> > +     select IOMMU_IOVA
> > +     help
> > +       With VDUSE it is possible to emulate a vDPA Device
> > +       in a userspace program.
> > +
> >   config IFCVF
> >       tristate "Intel IFC VF vDPA driver"
> >       depends on PCI_MSI
> > diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> > index d160e9b63a66..66e97778ad03 100644
> > --- a/drivers/vdpa/Makefile
> > +++ b/drivers/vdpa/Makefile
> > @@ -1,5 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_VDPA) += vdpa.o
> >   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> > +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >   obj-$(CONFIG_IFCVF)    += ifcvf/
> >   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> > diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> > new file mode 100644
> > index 000000000000..260e0b26af99
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +vduse-y := vduse_dev.o iova_domain.o
> > +
> > +obj-$(CONFIG_VDPA_USER) += vduse.o
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > new file mode 100644
> > index 000000000000..393bf99c48be
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -0,0 +1,1348 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/cdev.h>
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/slab.h>
> > +#include <linux/wait.h>
> > +#include <linux/dma-map-ops.h>
> > +#include <linux/poll.h>
> > +#include <linux/file.h>
> > +#include <linux/uio.h>
> > +#include <linux/vdpa.h>
> > +#include <uapi/linux/vduse.h>
> > +#include <uapi/linux/vdpa.h>
> > +#include <uapi/linux/virtio_config.h>
> > +#include <linux/mod_devicetable.h>
> > +
> > +#include "iova_domain.h"
> > +
> > +#define DRV_VERSION  "1.0"
> > +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> > +#define DRV_DESC     "vDPA Device in Userspace"
> > +#define DRV_LICENSE  "GPL v2"
> > +
> > +#define VDUSE_DEV_MAX (1U << MINORBITS)
> > +
> > +struct vduse_virtqueue {
> > +     u16 index;
> > +     bool ready;
> > +     spinlock_t kick_lock;
> > +     spinlock_t irq_lock;
> > +     struct eventfd_ctx *kickfd;
> > +     struct vdpa_callback cb;
> > +};
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_vdpa {
> > +     struct vdpa_device vdpa;
> > +     struct vduse_dev *dev;
> > +};
> > +
> > +struct vduse_dev {
> > +     struct vduse_vdpa *vdev;
> > +     struct device dev;
> > +     struct cdev cdev;
> > +     struct vduse_virtqueue *vqs;
> > +     struct vduse_iova_domain *domain;
> > +     struct vhost_iotlb *iommu;
> > +     spinlock_t iommu_lock;
> > +     atomic_t bounce_map;
> > +     struct mutex msg_lock;
> > +     atomic64_t msg_unique;
>
>
> "next_request_id" should be better.
>

OK.

>
> > +     wait_queue_head_t waitq;
> > +     struct list_head send_list;
> > +     struct list_head recv_list;
> > +     struct list_head list;
> > +     bool connected;
> > +     int minor;
> > +     u16 vq_size_max;
> > +     u16 vq_num;
> > +     u32 vq_align;
> > +     u32 device_id;
> > +     u32 vendor_id;
> > +};
> > +
> > +struct vduse_dev_msg {
> > +     struct vduse_dev_request req;
> > +     struct vduse_dev_response resp;
> > +     struct list_head list;
> > +     wait_queue_head_t waitq;
> > +     bool completed;
> > +};
> > +
> > +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> > +module_param(max_bounce_size, ulong, 0444);
> > +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> > +
> > +static unsigned long max_iova_size = (128 * 1024 * 1024);
> > +module_param(max_iova_size, ulong, 0444);
> > +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> > +
> > +static DEFINE_MUTEX(vduse_lock);
> > +static LIST_HEAD(vduse_devs);
> > +static DEFINE_IDA(vduse_ida);
> > +
> > +static dev_t vduse_major;
> > +static struct class *vduse_class;
> > +
> > +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> > +
> > +     return vdev->dev;
> > +}
> > +
> > +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> > +{
> > +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> > +
> > +     return vdpa_to_vduse(vdpa);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> > +                                         uint32_t unique)
> > +{
> > +     struct vduse_dev_msg *tmp, *msg = NULL;
> > +
> > +     list_for_each_entry(tmp, head, list) {
>
>
> Shoudl we use list_for_each_entry_safe()?
>

Looks like list_for_each_entry() is ok here. We will break the loop
after deleting one node.

>
> > +             if (tmp->req.unique == unique) {
> > +                     msg = tmp;
> > +                     list_del(&tmp->list);
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return msg;
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> > +{
> > +     struct vduse_dev_msg *msg = NULL;
> > +
> > +     if (!list_empty(head)) {
> > +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> > +             list_del(&msg->list);
> > +     }
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_enqueue_msg(struct list_head *head,
> > +                           struct vduse_dev_msg *msg)
> > +{
> > +     list_add_tail(&msg->list, head);
> > +}
> > +
> > +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> > +{
> > +     int ret;
> > +
> > +     init_waitqueue_head(&msg->waitq);
> > +     mutex_lock(&dev->msg_lock);
> > +     vduse_enqueue_msg(&dev->send_list, msg);
> > +     wake_up(&dev->waitq);
> > +     mutex_unlock(&dev->msg_lock);
> > +     ret = wait_event_interruptible(msg->waitq, msg->completed);
> > +     mutex_lock(&dev->msg_lock);
> > +     if (!msg->completed)
> > +             list_del(&msg->list);
> > +     else
> > +             ret = msg->resp.result;
> > +     mutex_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_GET_FEATURES;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +
> > +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.features;
> > +}
> > +
> > +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_FEATURES;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.features = features;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_GET_STATUS;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +
> > +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.status;
> > +}
> > +
> > +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_STATUS;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.status = status;
> > +
> > +     vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> > +                                     void *buf, unsigned int len)
>
>
> Btw, the ident looks odd here and other may places wherhe functions has
> more than one line of arguments.
>

OK, will fix it.

>
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     unsigned int sz;
> > +
> > +     while (len) {
> > +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> > +             msg.req.type = VDUSE_GET_CONFIG;
> > +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +             msg.req.config.offset = offset;
> > +             msg.req.config.len = sz;
> > +             vduse_dev_msg_sync(dev, &msg);
> > +             memcpy(buf, msg.resp.config.data, sz);
> > +             buf += sz;
> > +             offset += sz;
> > +             len -= sz;
> > +     }
> > +}
> > +
> > +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> > +                                     const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     unsigned int sz;
> > +
> > +     while (len) {
> > +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> > +             msg.req.type = VDUSE_SET_CONFIG;
> > +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +             msg.req.config.offset = offset;
> > +             msg.req.config.len = sz;
> > +             memcpy(msg.req.config.data, buf, sz);
> > +             vduse_dev_msg_sync(dev, &msg);
> > +             buf += sz;
> > +             offset += sz;
> > +             len -= sz;
> > +     }
> > +}
> > +
> > +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, u32 num)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_NUM;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.vq_num.index = vq->index;
> > +     msg.req.vq_num.num = num;
> > +
> > +     vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, u64 desc_addr,
> > +                             u64 driver_addr, u64 device_addr)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_ADDR;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.vq_addr.index = vq->index;
> > +     msg.req.vq_addr.desc_addr = desc_addr;
> > +     msg.req.vq_addr.driver_addr = driver_addr;
> > +     msg.req.vq_addr.device_addr = device_addr;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, bool ready)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_READY;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.vq_ready.index = vq->index;
> > +     msg.req.vq_ready.ready = ready;
> > +
> > +     vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> > +                                struct vduse_virtqueue *vq)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_GET_VQ_READY;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.vq_ready.index = vq->index;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> > +}
> > +
> > +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     int ret;
> > +
> > +     msg.req.type = VDUSE_GET_VQ_STATE;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.vq_state.index = vq->index;
> > +
> > +     ret = vduse_dev_msg_sync(dev, &msg);
> > +     if (!ret)
> > +             state->avail_index = msg.resp.vq_state.avail_idx;
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_STATE;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.vq_state.index = vq->index;
> > +     msg.req.vq_state.avail_idx = state->avail_index;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > +                             u64 start, u64 last)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     if (last < start)
> > +             return -EINVAL;
> > +
> > +     msg.req.type = VDUSE_UPDATE_IOTLB;
> > +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> > +     msg.req.iova.start = start;
> > +     msg.req.iova.last = last;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int size = sizeof(struct vduse_dev_request);
> > +     ssize_t ret = 0;
> > +
> > +     if (iov_iter_count(to) < size)
> > +             return 0;
> > +
> > +     mutex_lock(&dev->msg_lock);
> > +     while (1) {
> > +             msg = vduse_dequeue_msg(&dev->send_list);
> > +             if (msg)
> > +                     break;
> > +
> > +             ret = -EAGAIN;
> > +             if (file->f_flags & O_NONBLOCK)
> > +                     goto unlock;
> > +
> > +             mutex_unlock(&dev->msg_lock);
> > +             ret = wait_event_interruptible_exclusive(dev->waitq,
> > +                                     !list_empty(&dev->send_list));
> > +             if (ret)
> > +                     return ret;
> > +
> > +             mutex_lock(&dev->msg_lock);
> > +     }
> > +     ret = copy_to_iter(&msg->req, size, to);
> > +     if (ret != size) {
> > +             ret = -EFAULT;
> > +             vduse_enqueue_msg(&dev->send_list, msg);
> > +             goto unlock;
> > +     }
> > +     vduse_enqueue_msg(&dev->recv_list, msg);
> > +unlock:
> > +     mutex_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_response resp;
> > +     struct vduse_dev_msg *msg;
> > +     size_t ret;
> > +
> > +     ret = copy_from_iter(&resp, sizeof(resp), from);
> > +     if (ret != sizeof(resp))
> > +             return -EINVAL;
> > +
> > +     mutex_lock(&dev->msg_lock);
> > +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> > +     if (!msg) {
> > +             ret = -EINVAL;
> > +             goto unlock;
> > +     }
> > +
> > +     memcpy(&msg->resp, &resp, sizeof(resp));
> > +     msg->completed = 1;
> > +     wake_up(&msg->waitq);
> > +unlock:
> > +     mutex_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     __poll_t mask = 0;
> > +
> > +     poll_wait(file, &dev->waitq, wait);
> > +
> > +     if (!list_empty(&dev->send_list))
> > +             mask |= EPOLLIN | EPOLLRDNORM;
> > +
> > +     return mask;
> > +}
> > +
> > +static int vduse_iotlb_add_range(struct vduse_dev *dev,
> > +                              u64 start, u64 last,
> > +                              u64 addr, unsigned int perm,
> > +                              struct file *file, u64 offset)
> > +{
> > +     struct vdpa_map_file *map_file;
> > +     int ret;
> > +
> > +     map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
> > +     if (!map_file)
> > +             return -ENOMEM;
> > +
> > +     map_file->file = get_file(file);
> > +     map_file->offset = offset;
> > +
> > +     spin_lock(&dev->iommu_lock);
> > +     ret = vhost_iotlb_add_range_ctx(dev->iommu, start, last,
> > +                                     addr, perm, map_file);
> > +     spin_unlock(&dev->iommu_lock);
> > +     if (ret) {
> > +             fput(map_file->file);
> > +             kfree(map_file);
> > +             return ret;
> > +     }
> > +     return 0;
> > +}
> > +
> > +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
> > +{
> > +     struct vdpa_map_file *map_file;
> > +     struct vhost_iotlb_map *map;
> > +
> > +     spin_lock(&dev->iommu_lock);
> > +     while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
> > +             map_file = (struct vdpa_map_file *)map->opaque;
> > +             fput(map_file->file);
> > +             kfree(map_file);
> > +             vhost_iotlb_map_free(dev->iommu, map);
> > +     }
> > +     spin_unlock(&dev->iommu_lock);
> > +}
> > +
> > +static void vduse_dev_reset(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     atomic_set(&dev->bounce_map, 0);
> > +     vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
> > +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             vq->ready = false;
> > +             vq->cb.callback = NULL;
> > +             vq->cb.private = NULL;
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +}
> > +
> > +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > +                             u64 desc_area, u64 driver_area,
> > +                             u64 device_area)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
> > +                                     driver_area, device_area);
> > +}
> > +
> > +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->ready && vq->kickfd)
> > +             eventfd_signal(vq->kickfd, 1);
> > +     spin_unlock(&vq->kick_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> > +                           struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->irq_lock);
> > +     vq->cb.callback = cb->callback;
> > +     vq->cb.private = cb->private;
> > +     spin_unlock(&vq->irq_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_num(dev, vq, num);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> > +                                     u16 idx, bool ready)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_ready(dev, vq, ready);
> > +     vq->ready = ready;
> > +}
> > +
> > +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
> > +
> > +     return vq->ready;
> > +}
> > +
> > +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_set_vq_state(dev, vq, state);
> > +}
> > +
> > +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_get_vq_state(dev, vq, state);
> > +}
> > +
> > +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_align;
> > +}
> > +
> > +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> > +
> > +     return (vduse_dev_get_features(dev) | fixed);
>
>
> What happens if we don't do such fixup. I think we should fail if
> usersapce doesnt offer ACCESS_PLATFORM instead.
>

Make sense.

>
> > +}
> > +
> > +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_set_features(dev, features);
> > +}
> > +
> > +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> > +                               struct vdpa_callback *cb)
> > +{
> > +     /* We don't support config interrupt */
> > +}
> > +
> > +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_size_max;
> > +}
> > +
> > +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->device_id;
> > +}
> > +
> > +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vendor_id;
> > +}
> > +
> > +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_get_status(dev);
> > +}
> > +
> > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     if (status == 0)
> > +             vduse_dev_reset(dev);
> > +
> > +     vduse_dev_set_status(dev, status);
> > +}
> > +
> > +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                          void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_get_config(dev, offset, buf, len);
> > +}
> > +
> > +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                     const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_set_config(dev, offset, buf, len);
> > +}
> > +
> > +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > +                             struct vhost_iotlb *iotlb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vhost_iotlb_map *map;
> > +     struct vdpa_map_file *map_file;
> > +     u64 start = 0ULL, last = ULLONG_MAX;
> > +     int ret = 0;
> > +
> > +     vduse_iotlb_del_range(dev, start, last);
> > +
> > +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> > +             map = vhost_iotlb_itree_next(map, start, last)) {
> > +             map_file = (struct vdpa_map_file *)map->opaque;
> > +             if (!map_file->file)
> > +                     continue;
> > +
> > +             ret = vduse_iotlb_add_range(dev, map->start, map->last,
> > +                                         map->addr, map->perm,
> > +                                         map_file->file,
> > +                                         map_file->offset);
> > +             if (ret)
> > +                     break;
> > +     }
> > +     vduse_dev_update_iotlb(dev, start, last);
> > +
> > +     return ret;
> > +}
> > +
> > +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     WARN_ON(!list_empty(&dev->send_list));
> > +     WARN_ON(!list_empty(&dev->recv_list));
> > +     dev->vdev = NULL;
> > +}
> > +
> > +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > +     .set_vq_address         = vduse_vdpa_set_vq_address,
> > +     .kick_vq                = vduse_vdpa_kick_vq,
> > +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> > +     .set_vq_num             = vduse_vdpa_set_vq_num,
> > +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> > +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> > +     .set_vq_state           = vduse_vdpa_set_vq_state,
> > +     .get_vq_state           = vduse_vdpa_get_vq_state,
> > +     .get_vq_align           = vduse_vdpa_get_vq_align,
> > +     .get_features           = vduse_vdpa_get_features,
> > +     .set_features           = vduse_vdpa_set_features,
> > +     .set_config_cb          = vduse_vdpa_set_config_cb,
> > +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> > +     .get_device_id          = vduse_vdpa_get_device_id,
> > +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> > +     .get_status             = vduse_vdpa_get_status,
> > +     .set_status             = vduse_vdpa_set_status,
> > +     .get_config             = vduse_vdpa_get_config,
> > +     .set_config             = vduse_vdpa_set_config,
> > +     .set_map                = vduse_vdpa_set_map,
> > +     .free                   = vduse_vdpa_free,
> > +};
> > +
> > +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > +                                     unsigned long offset, size_t size,
> > +                                     enum dma_data_direction dir,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
> > +             vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
> > +                                   0, VDUSE_ACCESS_RW,
> > +                                   vduse_domain_file(domain), 0)) {
> > +             atomic_set(&vdev->bounce_map, 0);
> > +             return DMA_MAPPING_ERROR;
>
>
> Can we add the bounce mapping page by page here?
>

Do you mean mapping the bounce buffer to user space page by page? If
so, userspace needs to call lots of mmap() for that.

>
> > +     }
> > +
> > +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > +}
> > +
> > +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > +}
> > +
> > +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > +                                     dma_addr_t *dma_addr, gfp_t flag,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova;
> > +     void *addr;
> > +
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     addr = vduse_domain_alloc_coherent(domain, size,
> > +                             (dma_addr_t *)&iova, flag, attrs);
> > +     if (!addr)
> > +             return NULL;
> > +
> > +     if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
> > +                               iova, VDUSE_ACCESS_RW,
> > +                               vduse_domain_file(domain), iova)) {
> > +             vduse_domain_free_coherent(domain, size, addr, iova, attrs);
> > +             return NULL;
> > +     }
> > +     *dma_addr = (dma_addr_t)iova;
> > +
> > +     return addr;
> > +}
> > +
> > +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > +                                     void *vaddr, dma_addr_t dma_addr,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long start = (unsigned long)dma_addr;
> > +     unsigned long last = start + size - 1;
> > +
> > +     vduse_iotlb_del_range(vdev, start, last);
> > +     vduse_dev_update_iotlb(vdev, start, last);
> > +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > +}
> > +
> > +static const struct dma_map_ops vduse_dev_dma_ops = {
> > +     .map_page = vduse_dev_map_page,
> > +     .unmap_page = vduse_dev_unmap_page,
> > +     .alloc = vduse_dev_alloc_coherent,
> > +     .free = vduse_dev_free_coherent,
> > +};
> > +
> > +static unsigned int perm_to_file_flags(u8 perm)
> > +{
> > +     unsigned int flags = 0;
> > +
> > +     switch (perm) {
> > +     case VDUSE_ACCESS_WO:
> > +             flags |= O_WRONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RO:
> > +             flags |= O_RDONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RW:
> > +             flags |= O_RDWR;
> > +             break;
> > +     default:
> > +             WARN(1, "invalidate vhost IOTLB permission\n");
> > +             break;
> > +     }
> > +
> > +     return flags;
> > +}
> > +
> > +static int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct eventfd_ctx *ctx = NULL;
> > +     struct vduse_virtqueue *vq;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     vq = &dev->vqs[eventfd->index];
> > +     if (eventfd->fd > 0) {
> > +             ctx = eventfd_ctx_fdget(eventfd->fd);
> > +             if (IS_ERR(ctx))
> > +                     return PTR_ERR(ctx);
> > +     }
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->kickfd)
> > +             eventfd_ctx_put(vq->kickfd);
> > +     vq->kickfd = ctx;
> > +     spin_unlock(&vq->kick_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     void __user *argp = (void __user *)arg;
> > +     int ret;
> > +
> > +     switch (cmd) {
> > +     case VDUSE_IOTLB_GET_FD: {
> > +             struct vduse_iotlb_entry entry;
> > +             struct vhost_iotlb_map *map;
> > +             struct vdpa_map_file *map_file;
> > +             struct file *f = NULL;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&entry, argp, sizeof(entry)))
> > +                     break;
> > +
> > +             spin_lock(&dev->iommu_lock);
> > +             map = vhost_iotlb_itree_first(dev->iommu, entry.start,
> > +                                           entry.last);
> > +             if (map) {
> > +                     map_file = (struct vdpa_map_file *)map->opaque;
> > +                     f = get_file(map_file->file);
> > +                     entry.offset = map_file->offset;
> > +                     entry.start = map->start;
> > +                     entry.last = map->last;
> > +                     entry.perm = map->perm;
> > +             }
> > +             spin_unlock(&dev->iommu_lock);
> > +             if (!f) {
> > +                     ret = -EINVAL;
> > +                     break;
> > +             }
> > +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> > +                     fput(f);
> > +                     ret = -EFAULT;
> > +                     break;
> > +             }
> > +             ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
> > +             if (ret < 0) {
> > +                     fput(f);
> > +                     break;
> > +             }
> > +             fd_install(ret, f);
> > +             break;
> > +     }
> > +     case VDUSE_VQ_SETUP_KICKFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_kickfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     case VDUSE_INJECT_VQ_IRQ: {
> > +             struct vduse_virtqueue *vq;
> > +
> > +             ret = -EINVAL;
> > +             if (arg >= dev->vq_num)
> > +                     break;
> > +
> > +             vq = &dev->vqs[arg];
> > +             spin_lock_irq(&vq->irq_lock);
> > +             if (vq->ready && vq->cb.callback) {
> > +                     vq->cb.callback(vq->cb.private);
> > +                     ret = 0;
> > +             }
> > +             spin_unlock_irq(&vq->irq_lock);
> > +             break;
> > +     }
> > +     default:
> > +             ret = -ENOIOCTLCMD;
> > +             break;
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int i;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->kick_lock);
> > +             if (vq->kickfd)
> > +                     eventfd_ctx_put(vq->kickfd);
> > +             vq->kickfd = NULL;
> > +             spin_unlock(&vq->kick_lock);
> > +     }
> > +
> > +     mutex_lock(&dev->msg_lock);
> > +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> > +             vduse_enqueue_msg(&dev->send_list, msg);
> > +     mutex_unlock(&dev->msg_lock);
> > +
> > +     dev->connected = false;
> > +
> > +     return 0;
> > +}
> > +
> > +static int vduse_dev_open(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = container_of(inode->i_cdev,
> > +                                     struct vduse_dev, cdev);
> > +     int ret = -EBUSY;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     if (dev->connected)
> > +             goto unlock;
> > +
> > +     ret = 0;
> > +     dev->connected = true;
> > +     file->private_data = dev;
> > +unlock:
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct file_operations vduse_dev_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = vduse_dev_open,
> > +     .release        = vduse_dev_release,
> > +     .read_iter      = vduse_dev_read_iter,
> > +     .write_iter     = vduse_dev_write_iter,
> > +     .poll           = vduse_dev_poll,
> > +     .unlocked_ioctl = vduse_dev_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static struct vduse_dev *vduse_dev_create(void)
> > +{
> > +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +
> > +     if (!dev)
> > +             return NULL;
> > +
> > +     dev->iommu = vhost_iotlb_alloc(0, 0);
> > +     if (!dev->iommu) {
> > +             kfree(dev);
> > +             return NULL;
> > +     }
> > +
> > +     mutex_init(&dev->msg_lock);
> > +     INIT_LIST_HEAD(&dev->send_list);
> > +     INIT_LIST_HEAD(&dev->recv_list);
> > +     atomic64_set(&dev->msg_unique, 0);
> > +     spin_lock_init(&dev->iommu_lock);
> > +     atomic_set(&dev->bounce_map, 0);
> > +
> > +     init_waitqueue_head(&dev->waitq);
> > +
> > +     return dev;
> > +}
> > +
> > +static void vduse_dev_destroy(struct vduse_dev *dev)
> > +{
> > +     vhost_iotlb_free(dev->iommu);
> > +     mutex_destroy(&dev->msg_lock);
> > +     kfree(dev);
> > +}
> > +
> > +static struct vduse_dev *vduse_find_dev(const char *name)
> > +{
> > +     struct vduse_dev *tmp, *dev = NULL;
> > +
> > +     list_for_each_entry(tmp, &vduse_devs, list) {
> > +             if (!strcmp(dev_name(&tmp->dev), name)) {
> > +                     dev = tmp;
> > +                     break;
> > +             }
> > +     }
> > +     return dev;
> > +}
> > +
> > +static int vduse_destroy_dev(char *name)
> > +{
> > +     struct vduse_dev *dev = vduse_find_dev(name);
> > +
> > +     if (!dev)
> > +             return -EINVAL;
> > +
> > +     if (dev->vdev || dev->connected)
> > +             return -EBUSY;
> > +
> > +     dev->connected = true;
> > +     list_del(&dev->list);
> > +     cdev_device_del(&dev->cdev, &dev->dev);
> > +     put_device(&dev->dev);
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_release_dev(struct device *device)
> > +{
> > +     struct vduse_dev *dev =
> > +             container_of(device, struct vduse_dev, dev);
> > +
> > +     ida_simple_remove(&vduse_ida, dev->minor);
> > +     kfree(dev->vqs);
> > +     vduse_domain_destroy(dev->domain);
> > +     vduse_dev_destroy(dev);
> > +     module_put(THIS_MODULE);
> > +}
> > +
> > +static int vduse_create_dev(struct vduse_dev_config *config)
> > +{
> > +     int i, ret = -ENOMEM;
> > +     struct vduse_dev *dev;
> > +
> > +     if (config->bounce_size > max_bounce_size)
> > +             return -EINVAL;
> > +
> > +     if (config->bounce_size > max_iova_size)
> > +             return -EINVAL;
> > +
> > +     if (vduse_find_dev(config->name))
> > +             return -EEXIST;
> > +
> > +     dev = vduse_dev_create();
> > +     if (!dev)
> > +             return -ENOMEM;
> > +
> > +     dev->device_id = config->device_id;
> > +     dev->vendor_id = config->vendor_id;
> > +     dev->domain = vduse_domain_create(max_iova_size - 1,
> > +                                     config->bounce_size);
> > +     if (!dev->domain)
> > +             goto err_domain;
> > +
> > +     dev->vq_align = config->vq_align;
> > +     dev->vq_size_max = config->vq_size_max;
> > +     dev->vq_num = config->vq_num;
> > +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> > +     if (!dev->vqs)
> > +             goto err_vqs;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             dev->vqs[i].index = i;
> > +             spin_lock_init(&dev->vqs[i].kick_lock);
> > +             spin_lock_init(&dev->vqs[i].irq_lock);
> > +     }
> > +
> > +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> > +     if (ret < 0)
> > +             goto err_ida;
> > +
> > +     dev->minor = ret;
> > +     device_initialize(&dev->dev);
> > +     dev->dev.release = vduse_release_dev;
> > +     dev->dev.class = vduse_class;
> > +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> > +     ret = dev_set_name(&dev->dev, "%s", config->name);
>
>
> Do we need to add a namespce here? E.g "vduse-%s", config->name.
>

Actually we already have a parent dir "/dev/vduse/" for it.

>
> > +     if (ret)
> > +             goto err_name;
> > +
> > +     cdev_init(&dev->cdev, &vduse_dev_fops);
> > +     dev->cdev.owner = THIS_MODULE;
> > +
> > +     ret = cdev_device_add(&dev->cdev, &dev->dev);
> > +     if (ret) {
> > +             put_device(&dev->dev);
> > +             return ret;
> > +     }
> > +     list_add(&dev->list, &vduse_devs);
> > +     __module_get(THIS_MODULE);
> > +
> > +     return 0;
> > +err_name:
> > +     ida_simple_remove(&vduse_ida, dev->minor);
> > +err_ida:
> > +     kfree(dev->vqs);
> > +err_vqs:
> > +     vduse_domain_destroy(dev->domain);
> > +err_domain:
> > +     vduse_dev_destroy(dev);
> > +     return ret;
> > +}
> > +
> > +static long vduse_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     int ret;
> > +     void __user *argp = (void __user *)arg;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     switch (cmd) {
> > +     case VDUSE_CREATE_DEV: {
> > +             struct vduse_dev_config config;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, sizeof(config)))
> > +                     break;
> > +
> > +             ret = vduse_create_dev(&config);
> > +             break;
> > +     }
> > +     case VDUSE_DESTROY_DEV: {
> > +             char name[VDUSE_NAME_MAX];
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> > +                     break;
> > +
> > +             ret = vduse_destroy_dev(name);
> > +             break;
> > +     }
> > +     default:
> > +             ret = -EINVAL;
> > +             break;
> > +     }
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct file_operations vduse_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .unlocked_ioctl = vduse_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static char *vduse_devnode(struct device *dev, umode_t *mode)
> > +{
> > +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> > +}
> > +
> > +static struct miscdevice vduse_misc = {
> > +     .fops = &vduse_fops,
> > +     .minor = MISC_DYNAMIC_MINOR,
> > +     .name = "vduse",
> > +     .nodename = "vduse/control",
> > +};
> > +
> > +static void vduse_mgmtdev_release(struct device *dev)
> > +{
> > +}
> > +
> > +static struct device vduse_mgmtdev = {
> > +     .init_name = "vduse",
> > +     .release = vduse_mgmtdev_release,
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev;
> > +
> > +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> > +{
> > +     struct vduse_vdpa *vdev = dev->vdev;
> > +     int ret;
> > +
> > +     if (vdev)
> > +             return -EEXIST;
> > +
> > +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
>
>
> I think the char dev should be used as the parent here.
>

Agree.

>
> > +                              &vduse_vdpa_config_ops,
> > +                              dev->vq_num, name, true);
> > +     if (!vdev)
> > +             return -ENOMEM;
> > +
> > +     vdev->dev = dev;
> > +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> > +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> > +     if (ret)
> > +             goto err;
> > +
> > +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> > +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> > +     vdev->vdpa.mdev = &mgmt_dev;
> > +
> > +     ret = _vdpa_register_device(&vdev->vdpa);
> > +     if (ret)
> > +             goto err;
> > +
> > +     dev->vdev = vdev;
> > +
> > +     return 0;
> > +err:
> > +     put_device(&vdev->vdpa.dev);
> > +     return ret;
> > +}
> > +
> > +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> > +{
> > +     struct vduse_dev *dev;
> > +     int ret = -EINVAL;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     dev = vduse_find_dev(name);
> > +     if (!dev)
> > +             goto unlock;
>
>
> Any reason for this check? I think vdpa core layer has already did for
> the name check for us.
>

We need to check whether the vduse device with the name is created.

>
> > +
> > +     ret = vduse_dev_add_vdpa(dev, name);
> > +unlock:
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > +{
> > +     _vdpa_unregister_device(dev);
> > +}
> > +
> > +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> > +     .dev_add = vdpa_dev_add,
> > +     .dev_del = vdpa_dev_del,
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> > +     { 0 },
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev = {
> > +     .device = &vduse_mgmtdev,
> > +     .id_table = id_table,
> > +     .ops = &vdpa_dev_mgmtdev_ops,
> > +};
> > +
> > +static int vduse_mgmtdev_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = device_register(&vduse_mgmtdev);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = vdpa_mgmtdev_register(&mgmt_dev);
> > +     if (ret)
> > +             goto err;
> > +
> > +     return 0;
> > +err:
> > +     device_unregister(&vduse_mgmtdev);
> > +     return ret;
> > +}
> > +
> > +static void vduse_mgmtdev_exit(void)
> > +{
> > +     vdpa_mgmtdev_unregister(&mgmt_dev);
> > +     device_unregister(&vduse_mgmtdev);
> > +}
> > +
> > +static int vduse_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = misc_register(&vduse_misc);
> > +     if (ret)
> > +             return ret;
> > +
> > +     vduse_class = class_create(THIS_MODULE, "vduse");
> > +     if (IS_ERR(vduse_class)) {
> > +             ret = PTR_ERR(vduse_class);
> > +             goto err_class;
> > +     }
> > +     vduse_class->devnode = vduse_devnode;
> > +
> > +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> > +     if (ret)
> > +             goto err_chardev;
> > +
> > +     ret = vduse_domain_init();
> > +     if (ret)
> > +             goto err_domain;
> > +
> > +     ret = vduse_mgmtdev_init();
> > +     if (ret)
> > +             goto err_mgmtdev;
>
>
> Should we validate max_bounce_size < max_iova_size here?
>

Sure.

>
>
> > +
> > +     return 0;
> > +err_mgmtdev:
> > +     vduse_domain_exit();
> > +err_domain:
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +err_chardev:
> > +     class_destroy(vduse_class);
> > +err_class:
> > +     misc_deregister(&vduse_misc);
> > +     return ret;
> > +}
> > +module_init(vduse_init);
> > +
> > +static void vduse_exit(void)
> > +{
> > +     misc_deregister(&vduse_misc);
> > +     class_destroy(vduse_class);
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +     vduse_domain_exit();
> > +     vduse_mgmtdev_exit();
> > +}
> > +module_exit(vduse_exit);
> > +
> > +MODULE_VERSION(DRV_VERSION);
> > +MODULE_LICENSE(DRV_LICENSE);
> > +MODULE_AUTHOR(DRV_AUTHOR);
> > +MODULE_DESCRIPTION(DRV_DESC);
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > new file mode 100644
> > index 000000000000..9391c4acfa53
> > --- /dev/null
> > +++ b/include/uapi/linux/vduse.h
> > @@ -0,0 +1,136 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_VDUSE_H_
> > +#define _UAPI_VDUSE_H_
> > +
> > +#include <linux/types.h>
> > +
> > +#define VDUSE_CONFIG_DATA_LEN        256
> > +#define VDUSE_NAME_MAX       256
> > +
> > +/* the control messages definition for read/write */
> > +
> > +enum vduse_req_type {
> > +     VDUSE_SET_VQ_NUM,
> > +     VDUSE_SET_VQ_ADDR,
> > +     VDUSE_SET_VQ_READY,
> > +     VDUSE_GET_VQ_READY,
> > +     VDUSE_SET_VQ_STATE,
> > +     VDUSE_GET_VQ_STATE,
> > +     VDUSE_SET_FEATURES,
> > +     VDUSE_GET_FEATURES,
> > +     VDUSE_SET_STATUS,
> > +     VDUSE_GET_STATUS,
> > +     VDUSE_SET_CONFIG,
> > +     VDUSE_GET_CONFIG,
> > +     VDUSE_UPDATE_IOTLB,
> > +};
> > +
> > +struct vduse_vq_num {
> > +     __u32 index;
> > +     __u32 num;
> > +};
> > +
> > +struct vduse_vq_addr {
> > +     __u32 index;
> > +     __u64 desc_addr;
> > +     __u64 driver_addr;
> > +     __u64 device_addr;
> > +};
> > +
> > +struct vduse_vq_ready {
> > +     __u32 index;
> > +     __u8 ready;
> > +};
> > +
> > +struct vduse_vq_state {
> > +     __u32 index;
> > +     __u16 avail_idx;
> > +};
> > +
> > +struct vduse_dev_config_data {
> > +     __u32 offset;
> > +     __u32 len;
> > +     __u8 data[VDUSE_CONFIG_DATA_LEN];
> > +};
> > +
> > +struct vduse_iova_range {
> > +     __u64 start;
> > +     __u64 last;
> > +};
> > +
> > +struct vduse_dev_request {
> > +     __u32 type; /* request type */
> > +     __u32 unique; /* request id */
>
>
> Let's simply use "request_id" here.
>

OK.

>
> > +     __u32 reserved[2]; /* for feature use */
> > +     union {
> > +             struct vduse_vq_num vq_num; /* virtqueue num */
> > +             struct vduse_vq_addr vq_addr; /* virtqueue address */
> > +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> > +             struct vduse_vq_state vq_state; /* virtqueue state */
> > +             struct vduse_dev_config_data config; /* virtio device config space */
> > +             struct vduse_iova_range iova; /* iova range for updating */
> > +             __u64 features; /* virtio features */
> > +             __u8 status; /* device status */
>
>
> It might be better to use struct for feaures and status as well for
> consistency.
>

OK.

> And to be safe, let's add explicity padding here.
>

Do you mean add padding for the union?

>
> > +     };
> > +};
> > +
> > +struct vduse_dev_response {
> > +     __u32 request_id; /* corresponding request id */
> > +#define VDUSE_REQUEST_OK     0x00
> > +#define VDUSE_REQUEST_FAILED 0x01
> > +     __u8 result; /* the result of request */
> > +     __u8 reserved[11]; /* for feature use */
>
>
> Looks like this will be a hole which is similar to
> 429711aec282c4b5fe5bbd7b2f0bbbff4110ffb2. Need to make sure the reserved
> end at 8 byte boundary.
>

Will fix it.

>
> > +     union {
> > +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> > +             struct vduse_vq_state vq_state; /* virtqueue state */
> > +             struct vduse_dev_config_data config; /* virtio device config space */
> > +             __u64 features; /* virtio features */
> > +             __u8 status; /* device status */
> > +     };
> > +};
> > +
> > +/* ioctls */
> > +
> > +struct vduse_dev_config {
> > +     char name[VDUSE_NAME_MAX]; /* vduse device name */
> > +     __u32 vendor_id; /* virtio vendor id */
> > +     __u32 device_id; /* virtio device id */
> > +     __u64 bounce_size; /* bounce buffer size for iommu */
> > +     __u16 vq_num; /* the number of virtqueues */
> > +     __u16 vq_size_max; /* the max size of virtqueue */
> > +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> > +};
> > +
> > +struct vduse_iotlb_entry {
> > +     __u64 offset; /* the mmap offset on fd */
> > +     __u64 start; /* start of the IOVA range */
> > +     __u64 last; /* last of the IOVA range */
> > +#define VDUSE_ACCESS_RO 0x1
> > +#define VDUSE_ACCESS_WO 0x2
> > +#define VDUSE_ACCESS_RW 0x3
> > +     __u8 perm; /* access permission of this range */
> > +};
> > +
> > +struct vduse_vq_eventfd {
> > +     __u32 index; /* virtqueue index */
> > +     int fd; /* eventfd, -1 means de-assigning the eventfd */
>
>
> Let's define a macro for this.
>

OK.

>
> > +};
> > +
> > +#define VDUSE_BASE   0x81
> > +
> > +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> > +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> > +
> > +/* Destroy a vduse device. Make sure there are no references to the char device */
> > +#define VDUSE_DESTROY_DEV    _IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
> > +
> > +/* Get a file descriptor for the mmap'able iova region */
> > +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
> > +
> > +/* Setup an eventfd to receive kick for virtqueue */
> > +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
> > +
> > +/* Inject an interrupt for specific virtqueue */
> > +#define VDUSE_INJECT_VQ_IRQ  _IO(VDUSE_BASE, 0x05)
>
>
> I wonder do we need a version/feature handshake that is for future
> extension instead of just reserve bits in uABI? E.g VDUSE_GET_VERSION ...
>

Agree. Will do it in v5.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
  2021-03-04  7:30     ` Jason Wang
  (?)
@ 2021-03-04  8:19     ` Yongji Xie
  2021-03-05  3:11         ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-04  8:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Mar 4, 2021 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
> > injecting virtqueue's interrupt to the specified cpu.
>
>
> How userspace know which CPU is this irq for? It looks to me we need to
> do it at different level.
>
> E.g introduce some API in sys to allow admin to tune for that.
>
> But I think we can do that in antoher patch on top of this series.
>

OK. I will think more about it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-04  6:59     ` Jason Wang
  (?)
@ 2021-03-04  8:58     ` Yongji Xie
  2021-03-05  3:04         ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-04  8:58 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > This patch introduces a workqueue to support injecting
> > virtqueue's interrupt asynchronously. This is mainly
> > for performance considerations which makes sure the push()
> > and pop() for used vring can be asynchronous.
>
>
> Do you have pref numbers for this patch?
>

No, I can do some tests for it if needed.

Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
if we call irq callback in ioctl context. Something like:

virtqueue_push();
virtio_notify();
    ioctl()
-------------------------------------------------
        irq_cb()
            virtqueue_get_buf()

The used vring is always empty each time we call virtqueue_push() in
userspace. Not sure if it is what we expected.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 09/11] Documentation: Add documentation for VDUSE
  2021-03-04  6:39     ` Jason Wang
  (?)
@ 2021-03-04 10:35     ` Yongji Xie
  -1 siblings, 0 replies; 92+ messages in thread
From: Yongji Xie @ 2021-03-04 10:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Thu, Mar 4, 2021 at 2:40 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/userspace-api/index.rst |   1 +
> >   Documentation/userspace-api/vduse.rst | 112 ++++++++++++++++++++++++++++++++++
> >   2 files changed, 113 insertions(+)
> >   create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> > index acd2cc2a538d..f63119130898 100644
> > --- a/Documentation/userspace-api/index.rst
> > +++ b/Documentation/userspace-api/index.rst
> > @@ -24,6 +24,7 @@ place where this information is gathered.
> >      ioctl/index
> >      iommu
> >      media/index
> > +   vduse
> >
> >   .. only::  subproject and html
> >
> > diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> > new file mode 100644
> > index 000000000000..2a20e686bb59
> > --- /dev/null
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -0,0 +1,112 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace.
> > +
> > +How VDUSE works
> > +------------
> > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > +the character device (/dev/vduse/control). Then a device file with the
> > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > +implement the userspace vDPA device's control path and data path.
>
>
> It's better to mention that in order to le thte device to be registered
> on the bus, admin need to use the management API(netlink) to create the
> vDPA device.
>
> Some codes to demnonstrate how to create the device will be better.
>

OK.

>
> > +
> > +To implement control path, a message-based communication protocol and some
> > +types of control messages are introduced in the VDUSE framework:
> > +
> > +- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
> > +
> > +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> > +
> > +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> > +
> > +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> > +
> > +- VDUSE_SET_VQ_STATE: Set the state for virtqueue
> > +
> > +- VDUSE_GET_VQ_STATE: Get the state for virtqueue
> > +
> > +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> > +
> > +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> > +
> > +- VDUSE_SET_STATUS: Set the device status
> > +
> > +- VDUSE_GET_STATUS: Get the device status
> > +
> > +- VDUSE_SET_CONFIG: Write to device specific configuration space
> > +
> > +- VDUSE_GET_CONFIG: Read from device specific configuration space
> > +
> > +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> > +
> > +Those control messages are mostly based on the vdpa_config_ops in
> > +include/linux/vdpa.h which defines a unified interface to control
> > +different types of vdpa device. Userspace needs to read()/write()
> > +on the VDUSE device file to receive/reply those control messages
> > +from/to VDUSE kernel module as follows:
> > +
> > +.. code-block:: c
> > +
> > +     static int vduse_message_handler(int dev_fd)
> > +     {
> > +             int len;
> > +             struct vduse_dev_request req;
> > +             struct vduse_dev_response resp;
> > +
> > +             len = read(dev_fd, &req, sizeof(req));
> > +             if (len != sizeof(req))
> > +                     return -1;
> > +
> > +             resp.request_id = req.unique;
> > +
> > +             switch (req.type) {
> > +
> > +             /* handle different types of message */
> > +
> > +             }
> > +
> > +             len = write(dev_fd, &resp, sizeof(resp));
> > +             if (len != sizeof(resp))
> > +                     return -1;
> > +
> > +             return 0;
> > +     }
> > +
> > +In the deta path, vDPA device's iova regions will be mapped into userspace
> > +with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
> > +
> > +- VDUSE_IOTLB_GET_FD: get the file descriptor to iova region. Userspace can
> > +  access this iova region by passing the fd to mmap().
>
>
> It would be better to have codes to explain how it is expected to work here.
>

OK.

>
> > +
> > +Besides, the following ioctls on the VDUSE device file are provided to support
> > +interrupt injection and setting up eventfd for virtqueue kicks:
> > +
> > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> > +  by VDUSE kernel module to notify userspace to consume the vring.
> > +
> > +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
> > +
> > +- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
> > +
> > +MMU-based IOMMU Driver
> > +----------------------
> > +In virtio-vdpa case, VDUSE framework implements an MMU-based on-chip IOMMU
> > +driver to support mapping the kernel DMA buffer into the userspace iova
> > +region dynamically.
> > +
> > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > +so that the userspace process is able to use its virtual address to access
> > +the DMA buffer in kernel.
> > +
> > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > +prevent userspace accessing the original buffer directly which may contain other
> > +kernel data.
>
>
> It's worth to mention this is designed for virtio-vdpa (kernel virtio
> drivers).
>

Will do it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-04  8:58     ` Yongji Xie
@ 2021-03-05  3:04         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/4 4:58 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> This patch introduces a workqueue to support injecting
>>> virtqueue's interrupt asynchronously. This is mainly
>>> for performance considerations which makes sure the push()
>>> and pop() for used vring can be asynchronous.
>>
>> Do you have pref numbers for this patch?
>>
> No, I can do some tests for it if needed.
>
> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> if we call irq callback in ioctl context. Something like:
>
> virtqueue_push();
> virtio_notify();
>      ioctl()
> -------------------------------------------------
>          irq_cb()
>              virtqueue_get_buf()
>
> The used vring is always empty each time we call virtqueue_push() in
> userspace. Not sure if it is what we expected.


I'm not sure I get the issue.

THe used ring should be filled by virtqueue_push() which is done by 
userspace before?

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
@ 2021-03-05  3:04         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/4 4:58 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> This patch introduces a workqueue to support injecting
>>> virtqueue's interrupt asynchronously. This is mainly
>>> for performance considerations which makes sure the push()
>>> and pop() for used vring can be asynchronous.
>>
>> Do you have pref numbers for this patch?
>>
> No, I can do some tests for it if needed.
>
> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> if we call irq callback in ioctl context. Something like:
>
> virtqueue_push();
> virtio_notify();
>      ioctl()
> -------------------------------------------------
>          irq_cb()
>              virtqueue_get_buf()
>
> The used vring is always empty each time we call virtqueue_push() in
> userspace. Not sure if it is what we expected.


I'm not sure I get the issue.

THe used ring should be filled by virtqueue_push() which is done by 
userspace before?

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
  2021-03-04  8:19     ` Yongji Xie
@ 2021-03-05  3:11         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:11 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/4 4:19 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
>>> injecting virtqueue's interrupt to the specified cpu.
>>
>> How userspace know which CPU is this irq for? It looks to me we need to
>> do it at different level.
>>
>> E.g introduce some API in sys to allow admin to tune for that.
>>
>> But I think we can do that in antoher patch on top of this series.
>>
> OK. I will think more about it.


It should be soemthing like 
/sys/class/vduse/$dev_name/vq/0/irq_affinity. Also need to make sure 
eventfd could not be reused.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
@ 2021-03-05  3:11         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:11 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/4 4:19 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
>>> injecting virtqueue's interrupt to the specified cpu.
>>
>> How userspace know which CPU is this irq for? It looks to me we need to
>> do it at different level.
>>
>> E.g introduce some API in sys to allow admin to tune for that.
>>
>> But I think we can do that in antoher patch on top of this series.
>>
> OK. I will think more about it.


It should be soemthing like 
/sys/class/vduse/$dev_name/vq/0/irq_affinity. Also need to make sure 
eventfd could not be reused.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-03-04  8:05     ` Yongji Xie
@ 2021-03-05  3:20         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:20 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/4 4:05 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 2:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of message
>>> mechnism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
>>> the file descriptors referring to vDPA device's iova regions. Then
>>> userspace can use mmap() to access those iova regions. Besides,
>>> userspace can use ioctl() to inject interrupt and use the eventfd
>>> mechanism to receive virtqueue kicks.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1348 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  136 ++
>>>    6 files changed, 1501 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index ffd1e098bfd2..92f07715e3b6 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>>>        help
>>>          vDPA networking device simulator which loops TX traffic back to RX.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index d160e9b63a66..66e97778ad03 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,5 +1,6 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..393bf99c48be
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1348 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/miscdevice.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_VERSION  "1.0"
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     bool ready;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device dev;
>>> +     struct cdev cdev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     struct vhost_iotlb *iommu;
>>> +     spinlock_t iommu_lock;
>>> +     atomic_t bounce_map;
>>> +     struct mutex msg_lock;
>>> +     atomic64_t msg_unique;
>>
>> "next_request_id" should be better.
>>
> OK.
>
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct list_head list;
>>> +     bool connected;
>>> +     int minor;
>>> +     u16 vq_size_max;
>>> +     u16 vq_num;
>>> +     u32 vq_align;
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
>>> +module_param(max_bounce_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
>>> +
>>> +static unsigned long max_iova_size = (128 * 1024 * 1024);
>>> +module_param(max_iova_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static LIST_HEAD(vduse_devs);
>>> +static DEFINE_IDA(vduse_ida);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t unique)
>>> +{
>>> +     struct vduse_dev_msg *tmp, *msg = NULL;
>>> +
>>> +     list_for_each_entry(tmp, head, list) {
>>
>> Shoudl we use list_for_each_entry_safe()?
>>
> Looks like list_for_each_entry() is ok here. We will break the loop
> after deleting one node.


Right.


>
>>> +             if (tmp->req.unique == unique) {
>>> +                     msg = tmp;
>>> +                     list_del(&tmp->list);
>>> +                     break;
>>> +             }
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
>>> +{
>>> +     int ret;
>>> +
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     mutex_lock(&dev->msg_lock);
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     mutex_unlock(&dev->msg_lock);
>>> +     ret = wait_event_interruptible(msg->waitq, msg->completed);
>>> +     mutex_lock(&dev->msg_lock);
>>> +     if (!msg->completed)
>>> +             list_del(&msg->list);
>>> +     else
>>> +             ret = msg->resp.result;
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_FEATURES;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.features;
>>> +}
>>> +
>>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_FEATURES;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.features = features;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_STATUS;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.status;
>>> +}
>>> +
>>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_STATUS;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.status = status;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
>>> +                                     void *buf, unsigned int len)
>>
>> Btw, the ident looks odd here and other may places wherhe functions has
>> more than one line of arguments.
>>
> OK, will fix it.
>
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_GET_CONFIG;
>>> +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             memcpy(buf, msg.resp.config.data, sz);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
>>> +                                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_SET_CONFIG;
>>> +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             memcpy(msg.req.config.data, buf, sz);
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, u32 num)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_NUM;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_num.index = vq->index;
>>> +     msg.req.vq_num.num = num;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, u64 desc_addr,
>>> +                             u64 driver_addr, u64 device_addr)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_ADDR;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_addr.index = vq->index;
>>> +     msg.req.vq_addr.desc_addr = desc_addr;
>>> +     msg.req.vq_addr.driver_addr = driver_addr;
>>> +     msg.req.vq_addr.device_addr = device_addr;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, bool ready)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_READY;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +     msg.req.vq_ready.ready = ready;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
>>> +                                struct vduse_virtqueue *vq)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_READY;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, &msg);
>>> +     if (!ret)
>>> +             state->avail_index = msg.resp.vq_state.avail_idx;
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_STATE;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_state.index = vq->index;
>>> +     msg.req.vq_state.avail_idx = state->avail_index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.iova.start = start;
>>> +     msg.req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret = 0;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return 0;
>>> +
>>> +     mutex_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             mutex_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             mutex_lock(&dev->msg_lock);
>>> +     }
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +unlock:
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     mutex_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -EINVAL;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
>>> +                              u64 start, u64 last,
>>> +                              u64 addr, unsigned int perm,
>>> +                              struct file *file, u64 offset)
>>> +{
>>> +     struct vdpa_map_file *map_file;
>>> +     int ret;
>>> +
>>> +     map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
>>> +     if (!map_file)
>>> +             return -ENOMEM;
>>> +
>>> +     map_file->file = get_file(file);
>>> +     map_file->offset = offset;
>>> +
>>> +     spin_lock(&dev->iommu_lock);
>>> +     ret = vhost_iotlb_add_range_ctx(dev->iommu, start, last,
>>> +                                     addr, perm, map_file);
>>> +     spin_unlock(&dev->iommu_lock);
>>> +     if (ret) {
>>> +             fput(map_file->file);
>>> +             kfree(map_file);
>>> +             return ret;
>>> +     }
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
>>> +{
>>> +     struct vdpa_map_file *map_file;
>>> +     struct vhost_iotlb_map *map;
>>> +
>>> +     spin_lock(&dev->iommu_lock);
>>> +     while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>> +             fput(map_file->file);
>>> +             kfree(map_file);
>>> +             vhost_iotlb_map_free(dev->iommu, map);
>>> +     }
>>> +     spin_unlock(&dev->iommu_lock);
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     atomic_set(&dev->bounce_map, 0);
>>> +     vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
>>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->ready = false;
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
>>> +                                     driver_area, device_area);
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->ready && vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_num(dev, vq, num);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_ready(dev, vq, ready);
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
>>> +
>>> +     return (vduse_dev_get_features(dev) | fixed);
>>
>> What happens if we don't do such fixup. I think we should fail if
>> usersapce doesnt offer ACCESS_PLATFORM instead.
>>
> Make sense.
>
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_set_features(dev, features);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     /* We don't support config interrupt */
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_get_status(dev);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     if (status == 0)
>>> +             vduse_dev_reset(dev);
>>> +
>>> +     vduse_dev_set_status(dev, status);
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                          void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_get_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_set_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vhost_iotlb_map *map;
>>> +     struct vdpa_map_file *map_file;
>>> +     u64 start = 0ULL, last = ULLONG_MAX;
>>> +     int ret = 0;
>>> +
>>> +     vduse_iotlb_del_range(dev, start, last);
>>> +
>>> +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
>>> +             map = vhost_iotlb_itree_next(map, start, last)) {
>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>> +             if (!map_file->file)
>>> +                     continue;
>>> +
>>> +             ret = vduse_iotlb_add_range(dev, map->start, map->last,
>>> +                                         map->addr, map->perm,
>>> +                                         map_file->file,
>>> +                                         map_file->offset);
>>> +             if (ret)
>>> +                     break;
>>> +     }
>>> +     vduse_dev_update_iotlb(dev, start, last);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     WARN_ON(!list_empty(&dev->send_list));
>>> +     WARN_ON(!list_empty(&dev->recv_list));
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                     unsigned long offset, size_t size,
>>> +                                     enum dma_data_direction dir,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
>>> +             vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
>>> +                                   0, VDUSE_ACCESS_RW,
>>> +                                   vduse_domain_file(domain), 0)) {
>>> +             atomic_set(&vdev->bounce_map, 0);
>>> +             return DMA_MAPPING_ERROR;
>>
>> Can we add the bounce mapping page by page here?
>>
> Do you mean mapping the bounce buffer to user space page by page? If
> so, userspace needs to call lots of mmap() for that.


I get this.


>
>>> +     }
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
>>> +                               iova, VDUSE_ACCESS_RW,
>>> +                               vduse_domain_file(domain), iova)) {
>>> +             vduse_domain_free_coherent(domain, size, addr, iova, attrs);
>>> +             return NULL;
>>> +     }
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long start = (unsigned long)dma_addr;
>>> +     unsigned long last = start + size - 1;
>>> +
>>> +     vduse_iotlb_del_range(vdev, start, last);
>>> +     vduse_dev_update_iotlb(vdev, start, last);
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     if (eventfd->fd > 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     }
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             spin_lock(&dev->iommu_lock);
>>> +             map = vhost_iotlb_itree_first(dev->iommu, entry.start,
>>> +                                           entry.last);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&dev->iommu_lock);
>>> +             if (!f) {
>>> +                     ret = -EINVAL;
>>> +                     break;
>>> +             }
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     ret = -EFAULT;
>>> +                     break;
>>> +             }
>>> +             ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
>>> +             if (ret < 0) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             fd_install(ret, f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_INJECT_VQ_IRQ: {
>>> +             struct vduse_virtqueue *vq;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (arg >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             vq = &dev->vqs[arg];
>>> +             spin_lock_irq(&vq->irq_lock);
>>> +             if (vq->ready && vq->cb.callback) {
>>> +                     vq->cb.callback(vq->cb.private);
>>> +                     ret = 0;
>>> +             }
>>> +             spin_unlock_irq(&vq->irq_lock);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +     }
>>> +
>>> +     mutex_lock(&dev->msg_lock);
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = container_of(inode->i_cdev,
>>> +                                     struct vduse_dev, cdev);
>>> +     int ret = -EBUSY;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     dev->iommu = vhost_iotlb_alloc(0, 0);
>>> +     if (!dev->iommu) {
>>> +             kfree(dev);
>>> +             return NULL;
>>> +     }
>>> +
>>> +     mutex_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     atomic64_set(&dev->msg_unique, 0);
>>> +     spin_lock_init(&dev->iommu_lock);
>>> +     atomic_set(&dev->bounce_map, 0);
>>> +
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     vhost_iotlb_free(dev->iommu);
>>> +     mutex_destroy(&dev->msg_lock);
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *tmp, *dev = NULL;
>>> +
>>> +     list_for_each_entry(tmp, &vduse_devs, list) {
>>> +             if (!strcmp(dev_name(&tmp->dev), name)) {
>>> +                     dev = tmp;
>>> +                     break;
>>> +             }
>>> +     }
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     if (dev->vdev || dev->connected)
>>> +             return -EBUSY;
>>> +
>>> +     dev->connected = true;
>>> +     list_del(&dev->list);
>>> +     cdev_device_del(&dev->cdev, &dev->dev);
>>> +     put_device(&dev->dev);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_release_dev(struct device *device)
>>> +{
>>> +     struct vduse_dev *dev =
>>> +             container_of(device, struct vduse_dev, dev);
>>> +
>>> +     ida_simple_remove(&vduse_ida, dev->minor);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     vduse_dev_destroy(dev);
>>> +     module_put(THIS_MODULE);
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config)
>>> +{
>>> +     int i, ret = -ENOMEM;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     if (config->bounce_size > max_bounce_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (config->bounce_size > max_iova_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (vduse_find_dev(config->name))
>>> +             return -EEXIST;
>>> +
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->domain = vduse_domain_create(max_iova_size - 1,
>>> +                                     config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_ida;
>>> +
>>> +     dev->minor = ret;
>>> +     device_initialize(&dev->dev);
>>> +     dev->dev.release = vduse_release_dev;
>>> +     dev->dev.class = vduse_class;
>>> +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
>>> +     ret = dev_set_name(&dev->dev, "%s", config->name);
>>
>> Do we need to add a namespce here? E.g "vduse-%s", config->name.
>>
> Actually we already have a parent dir "/dev/vduse/" for it.


Oh, right. Then it should be fine.


>
>>> +     if (ret)
>>> +             goto err_name;
>>> +
>>> +     cdev_init(&dev->cdev, &vduse_dev_fops);
>>> +     dev->cdev.owner = THIS_MODULE;
>>> +
>>> +     ret = cdev_device_add(&dev->cdev, &dev->dev);
>>> +     if (ret) {
>>> +             put_device(&dev->dev);
>>> +             return ret;
>>> +     }
>>> +     list_add(&dev->list, &vduse_devs);
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_name:
>>> +     ida_simple_remove(&vduse_ida, dev->minor);
>>> +err_ida:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     vduse_dev_destroy(dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, sizeof(config)))
>>> +                     break;
>>> +
>>> +             ret = vduse_create_dev(&config);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static struct miscdevice vduse_misc = {
>>> +     .fops = &vduse_fops,
>>> +     .minor = MISC_DYNAMIC_MINOR,
>>> +     .name = "vduse",
>>> +     .nodename = "vduse/control",
>>> +};
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev = dev->vdev;
>>> +     int ret;
>>> +
>>> +     if (vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
>>
>> I think the char dev should be used as the parent here.
>>
> Agree.
>
>>> +                              &vduse_vdpa_config_ops,
>>> +                              dev->vq_num, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     ret = _vdpa_register_device(&vdev->vdpa);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     dev->vdev = vdev;
>>> +
>>> +     return 0;
>>> +err:
>>> +     put_device(&vdev->vdpa.dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret = -EINVAL;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev)
>>> +             goto unlock;
>>
>> Any reason for this check? I think vdpa core layer has already did for
>> the name check for us.
>>
> We need to check whether the vduse device with the name is created.


Yes.


>
>>> +
>>> +     ret = vduse_dev_add_vdpa(dev, name);
>>> +unlock:
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = misc_register(&vduse_misc);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class)) {
>>> +             ret = PTR_ERR(vduse_class);
>>> +             goto err_class;
>>> +     }
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>
>> Should we validate max_bounce_size < max_iova_size here?
>>
> Sure.
>
>>
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev:
>>> +     class_destroy(vduse_class);
>>> +err_class:
>>> +     misc_deregister(&vduse_misc);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     misc_deregister(&vduse_misc);
>>> +     class_destroy(vduse_class);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     vduse_domain_exit();
>>> +     vduse_mgmtdev_exit();
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_VERSION(DRV_VERSION);
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..9391c4acfa53
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,136 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_CONFIG_DATA_LEN        256
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     VDUSE_SET_VQ_NUM,
>>> +     VDUSE_SET_VQ_ADDR,
>>> +     VDUSE_SET_VQ_READY,
>>> +     VDUSE_GET_VQ_READY,
>>> +     VDUSE_SET_VQ_STATE,
>>> +     VDUSE_GET_VQ_STATE,
>>> +     VDUSE_SET_FEATURES,
>>> +     VDUSE_GET_FEATURES,
>>> +     VDUSE_SET_STATUS,
>>> +     VDUSE_GET_STATUS,
>>> +     VDUSE_SET_CONFIG,
>>> +     VDUSE_GET_CONFIG,
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_num {
>>> +     __u32 index;
>>> +     __u32 num;
>>> +};
>>> +
>>> +struct vduse_vq_addr {
>>> +     __u32 index;
>>> +     __u64 desc_addr;
>>> +     __u64 driver_addr;
>>> +     __u64 device_addr;
>>> +};
>>> +
>>> +struct vduse_vq_ready {
>>> +     __u32 index;
>>> +     __u8 ready;
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index;
>>> +     __u16 avail_idx;
>>> +};
>>> +
>>> +struct vduse_dev_config_data {
>>> +     __u32 offset;
>>> +     __u32 len;
>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN];
>>> +};
>>> +
>>> +struct vduse_iova_range {
>>> +     __u64 start;
>>> +     __u64 last;
>>> +};
>>> +
>>> +struct vduse_dev_request {
>>> +     __u32 type; /* request type */
>>> +     __u32 unique; /* request id */
>>
>> Let's simply use "request_id" here.
>>
> OK.
>
>>> +     __u32 reserved[2]; /* for feature use */
>>> +     union {
>>> +             struct vduse_vq_num vq_num; /* virtqueue num */
>>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             struct vduse_iova_range iova; /* iova range for updating */
>>> +             __u64 features; /* virtio features */
>>> +             __u8 status; /* device status */
>>
>> It might be better to use struct for feaures and status as well for
>> consistency.
>>
> OK.
>
>> And to be safe, let's add explicity padding here.
>>
> Do you mean add padding for the union?


I think so.


>
>>> +     };
>>> +};
>>> +
>>> +struct vduse_dev_response {
>>> +     __u32 request_id; /* corresponding request id */
>>> +#define VDUSE_REQUEST_OK     0x00
>>> +#define VDUSE_REQUEST_FAILED 0x01
>>> +     __u8 result; /* the result of request */
>>> +     __u8 reserved[11]; /* for feature use */
>>
>> Looks like this will be a hole which is similar to
>> 429711aec282c4b5fe5bbd7b2f0bbbff4110ffb2. Need to make sure the reserved
>> end at 8 byte boundary.
>>
> Will fix it.
>
>>> +     union {
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             __u64 features; /* virtio features */
>>> +             __u8 status; /* device status */
>>> +     };
>>> +};
>>> +
>>> +/* ioctls */
>>> +
>>> +struct vduse_dev_config {
>>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
>>> +     __u32 vendor_id; /* virtio vendor id */
>>> +     __u32 device_id; /* virtio device id */
>>> +     __u64 bounce_size; /* bounce buffer size for iommu */
>>> +     __u16 vq_num; /* the number of virtqueues */
>>> +     __u16 vq_size_max; /* the max size of virtqueue */
>>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
>>> +};
>>> +
>>> +struct vduse_iotlb_entry {
>>> +     __u64 offset; /* the mmap offset on fd */
>>> +     __u64 start; /* start of the IOVA range */
>>> +     __u64 last; /* last of the IOVA range */
>>> +#define VDUSE_ACCESS_RO 0x1
>>> +#define VDUSE_ACCESS_WO 0x2
>>> +#define VDUSE_ACCESS_RW 0x3
>>> +     __u8 perm; /* access permission of this range */
>>> +};
>>> +
>>> +struct vduse_vq_eventfd {
>>> +     __u32 index; /* virtqueue index */
>>> +     int fd; /* eventfd, -1 means de-assigning the eventfd */
>>
>> Let's define a macro for this.
>>
> OK.
>
>>> +};
>>> +
>>> +#define VDUSE_BASE   0x81
>>> +
>>> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
>>> +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
>>> +
>>> +/* Destroy a vduse device. Make sure there are no references to the char device */
>>> +#define VDUSE_DESTROY_DEV    _IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
>>> +
>>> +/* Get a file descriptor for the mmap'able iova region */
>>> +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
>>> +
>>> +/* Setup an eventfd to receive kick for virtqueue */
>>> +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
>>> +
>>> +/* Inject an interrupt for specific virtqueue */
>>> +#define VDUSE_INJECT_VQ_IRQ  _IO(VDUSE_BASE, 0x05)
>>
>> I wonder do we need a version/feature handshake that is for future
>> extension instead of just reserve bits in uABI? E.g VDUSE_GET_VERSION ...
>>
> Agree. Will do it in v5.


Btw, I think you can remove "RFC" then Michael can consider to merge the 
series.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-03-05  3:20         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:20 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/4 4:05 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 2:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of message
>>> mechnism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
>>> the file descriptors referring to vDPA device's iova regions. Then
>>> userspace can use mmap() to access those iova regions. Besides,
>>> userspace can use ioctl() to inject interrupt and use the eventfd
>>> mechanism to receive virtqueue kicks.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1348 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  136 ++
>>>    6 files changed, 1501 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index ffd1e098bfd2..92f07715e3b6 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>>>        help
>>>          vDPA networking device simulator which loops TX traffic back to RX.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index d160e9b63a66..66e97778ad03 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,5 +1,6 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..393bf99c48be
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1348 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/miscdevice.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_VERSION  "1.0"
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     bool ready;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device dev;
>>> +     struct cdev cdev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     struct vhost_iotlb *iommu;
>>> +     spinlock_t iommu_lock;
>>> +     atomic_t bounce_map;
>>> +     struct mutex msg_lock;
>>> +     atomic64_t msg_unique;
>>
>> "next_request_id" should be better.
>>
> OK.
>
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct list_head list;
>>> +     bool connected;
>>> +     int minor;
>>> +     u16 vq_size_max;
>>> +     u16 vq_num;
>>> +     u32 vq_align;
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
>>> +module_param(max_bounce_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
>>> +
>>> +static unsigned long max_iova_size = (128 * 1024 * 1024);
>>> +module_param(max_iova_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static LIST_HEAD(vduse_devs);
>>> +static DEFINE_IDA(vduse_ida);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t unique)
>>> +{
>>> +     struct vduse_dev_msg *tmp, *msg = NULL;
>>> +
>>> +     list_for_each_entry(tmp, head, list) {
>>
>> Shoudl we use list_for_each_entry_safe()?
>>
> Looks like list_for_each_entry() is ok here. We will break the loop
> after deleting one node.


Right.


>
>>> +             if (tmp->req.unique == unique) {
>>> +                     msg = tmp;
>>> +                     list_del(&tmp->list);
>>> +                     break;
>>> +             }
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
>>> +{
>>> +     int ret;
>>> +
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     mutex_lock(&dev->msg_lock);
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     mutex_unlock(&dev->msg_lock);
>>> +     ret = wait_event_interruptible(msg->waitq, msg->completed);
>>> +     mutex_lock(&dev->msg_lock);
>>> +     if (!msg->completed)
>>> +             list_del(&msg->list);
>>> +     else
>>> +             ret = msg->resp.result;
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_FEATURES;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.features;
>>> +}
>>> +
>>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_FEATURES;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.features = features;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_STATUS;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.status;
>>> +}
>>> +
>>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_STATUS;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.status = status;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
>>> +                                     void *buf, unsigned int len)
>>
>> Btw, the ident looks odd here and other may places wherhe functions has
>> more than one line of arguments.
>>
> OK, will fix it.
>
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_GET_CONFIG;
>>> +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             memcpy(buf, msg.resp.config.data, sz);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
>>> +                                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_SET_CONFIG;
>>> +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             memcpy(msg.req.config.data, buf, sz);
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, u32 num)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_NUM;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_num.index = vq->index;
>>> +     msg.req.vq_num.num = num;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, u64 desc_addr,
>>> +                             u64 driver_addr, u64 device_addr)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_ADDR;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_addr.index = vq->index;
>>> +     msg.req.vq_addr.desc_addr = desc_addr;
>>> +     msg.req.vq_addr.driver_addr = driver_addr;
>>> +     msg.req.vq_addr.device_addr = device_addr;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, bool ready)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_READY;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +     msg.req.vq_ready.ready = ready;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
>>> +                                struct vduse_virtqueue *vq)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_READY;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, &msg);
>>> +     if (!ret)
>>> +             state->avail_index = msg.resp.vq_state.avail_idx;
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_STATE;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_state.index = vq->index;
>>> +     msg.req.vq_state.avail_idx = state->avail_index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
>>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.iova.start = start;
>>> +     msg.req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret = 0;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return 0;
>>> +
>>> +     mutex_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             mutex_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             mutex_lock(&dev->msg_lock);
>>> +     }
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +unlock:
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     mutex_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -EINVAL;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
>>> +                              u64 start, u64 last,
>>> +                              u64 addr, unsigned int perm,
>>> +                              struct file *file, u64 offset)
>>> +{
>>> +     struct vdpa_map_file *map_file;
>>> +     int ret;
>>> +
>>> +     map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
>>> +     if (!map_file)
>>> +             return -ENOMEM;
>>> +
>>> +     map_file->file = get_file(file);
>>> +     map_file->offset = offset;
>>> +
>>> +     spin_lock(&dev->iommu_lock);
>>> +     ret = vhost_iotlb_add_range_ctx(dev->iommu, start, last,
>>> +                                     addr, perm, map_file);
>>> +     spin_unlock(&dev->iommu_lock);
>>> +     if (ret) {
>>> +             fput(map_file->file);
>>> +             kfree(map_file);
>>> +             return ret;
>>> +     }
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
>>> +{
>>> +     struct vdpa_map_file *map_file;
>>> +     struct vhost_iotlb_map *map;
>>> +
>>> +     spin_lock(&dev->iommu_lock);
>>> +     while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>> +             fput(map_file->file);
>>> +             kfree(map_file);
>>> +             vhost_iotlb_map_free(dev->iommu, map);
>>> +     }
>>> +     spin_unlock(&dev->iommu_lock);
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     atomic_set(&dev->bounce_map, 0);
>>> +     vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
>>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->ready = false;
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
>>> +                                     driver_area, device_area);
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->ready && vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_num(dev, vq, num);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_ready(dev, vq, ready);
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
>>> +
>>> +     return (vduse_dev_get_features(dev) | fixed);
>>
>> What happens if we don't do such fixup. I think we should fail if
>> usersapce doesnt offer ACCESS_PLATFORM instead.
>>
> Make sense.
>
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_set_features(dev, features);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     /* We don't support config interrupt */
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_get_status(dev);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     if (status == 0)
>>> +             vduse_dev_reset(dev);
>>> +
>>> +     vduse_dev_set_status(dev, status);
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                          void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_get_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_set_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vhost_iotlb_map *map;
>>> +     struct vdpa_map_file *map_file;
>>> +     u64 start = 0ULL, last = ULLONG_MAX;
>>> +     int ret = 0;
>>> +
>>> +     vduse_iotlb_del_range(dev, start, last);
>>> +
>>> +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
>>> +             map = vhost_iotlb_itree_next(map, start, last)) {
>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>> +             if (!map_file->file)
>>> +                     continue;
>>> +
>>> +             ret = vduse_iotlb_add_range(dev, map->start, map->last,
>>> +                                         map->addr, map->perm,
>>> +                                         map_file->file,
>>> +                                         map_file->offset);
>>> +             if (ret)
>>> +                     break;
>>> +     }
>>> +     vduse_dev_update_iotlb(dev, start, last);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     WARN_ON(!list_empty(&dev->send_list));
>>> +     WARN_ON(!list_empty(&dev->recv_list));
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                     unsigned long offset, size_t size,
>>> +                                     enum dma_data_direction dir,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
>>> +             vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
>>> +                                   0, VDUSE_ACCESS_RW,
>>> +                                   vduse_domain_file(domain), 0)) {
>>> +             atomic_set(&vdev->bounce_map, 0);
>>> +             return DMA_MAPPING_ERROR;
>>
>> Can we add the bounce mapping page by page here?
>>
> Do you mean mapping the bounce buffer to user space page by page? If
> so, userspace needs to call lots of mmap() for that.


I get this.


>
>>> +     }
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
>>> +                               iova, VDUSE_ACCESS_RW,
>>> +                               vduse_domain_file(domain), iova)) {
>>> +             vduse_domain_free_coherent(domain, size, addr, iova, attrs);
>>> +             return NULL;
>>> +     }
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long start = (unsigned long)dma_addr;
>>> +     unsigned long last = start + size - 1;
>>> +
>>> +     vduse_iotlb_del_range(vdev, start, last);
>>> +     vduse_dev_update_iotlb(vdev, start, last);
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     if (eventfd->fd > 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     }
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             spin_lock(&dev->iommu_lock);
>>> +             map = vhost_iotlb_itree_first(dev->iommu, entry.start,
>>> +                                           entry.last);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&dev->iommu_lock);
>>> +             if (!f) {
>>> +                     ret = -EINVAL;
>>> +                     break;
>>> +             }
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     ret = -EFAULT;
>>> +                     break;
>>> +             }
>>> +             ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
>>> +             if (ret < 0) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             fd_install(ret, f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_INJECT_VQ_IRQ: {
>>> +             struct vduse_virtqueue *vq;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (arg >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             vq = &dev->vqs[arg];
>>> +             spin_lock_irq(&vq->irq_lock);
>>> +             if (vq->ready && vq->cb.callback) {
>>> +                     vq->cb.callback(vq->cb.private);
>>> +                     ret = 0;
>>> +             }
>>> +             spin_unlock_irq(&vq->irq_lock);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +     }
>>> +
>>> +     mutex_lock(&dev->msg_lock);
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +     mutex_unlock(&dev->msg_lock);
>>> +
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = container_of(inode->i_cdev,
>>> +                                     struct vduse_dev, cdev);
>>> +     int ret = -EBUSY;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     dev->iommu = vhost_iotlb_alloc(0, 0);
>>> +     if (!dev->iommu) {
>>> +             kfree(dev);
>>> +             return NULL;
>>> +     }
>>> +
>>> +     mutex_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     atomic64_set(&dev->msg_unique, 0);
>>> +     spin_lock_init(&dev->iommu_lock);
>>> +     atomic_set(&dev->bounce_map, 0);
>>> +
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     vhost_iotlb_free(dev->iommu);
>>> +     mutex_destroy(&dev->msg_lock);
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *tmp, *dev = NULL;
>>> +
>>> +     list_for_each_entry(tmp, &vduse_devs, list) {
>>> +             if (!strcmp(dev_name(&tmp->dev), name)) {
>>> +                     dev = tmp;
>>> +                     break;
>>> +             }
>>> +     }
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     if (dev->vdev || dev->connected)
>>> +             return -EBUSY;
>>> +
>>> +     dev->connected = true;
>>> +     list_del(&dev->list);
>>> +     cdev_device_del(&dev->cdev, &dev->dev);
>>> +     put_device(&dev->dev);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_release_dev(struct device *device)
>>> +{
>>> +     struct vduse_dev *dev =
>>> +             container_of(device, struct vduse_dev, dev);
>>> +
>>> +     ida_simple_remove(&vduse_ida, dev->minor);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     vduse_dev_destroy(dev);
>>> +     module_put(THIS_MODULE);
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config)
>>> +{
>>> +     int i, ret = -ENOMEM;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     if (config->bounce_size > max_bounce_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (config->bounce_size > max_iova_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (vduse_find_dev(config->name))
>>> +             return -EEXIST;
>>> +
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->domain = vduse_domain_create(max_iova_size - 1,
>>> +                                     config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_ida;
>>> +
>>> +     dev->minor = ret;
>>> +     device_initialize(&dev->dev);
>>> +     dev->dev.release = vduse_release_dev;
>>> +     dev->dev.class = vduse_class;
>>> +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
>>> +     ret = dev_set_name(&dev->dev, "%s", config->name);
>>
>> Do we need to add a namespce here? E.g "vduse-%s", config->name.
>>
> Actually we already have a parent dir "/dev/vduse/" for it.


Oh, right. Then it should be fine.


>
>>> +     if (ret)
>>> +             goto err_name;
>>> +
>>> +     cdev_init(&dev->cdev, &vduse_dev_fops);
>>> +     dev->cdev.owner = THIS_MODULE;
>>> +
>>> +     ret = cdev_device_add(&dev->cdev, &dev->dev);
>>> +     if (ret) {
>>> +             put_device(&dev->dev);
>>> +             return ret;
>>> +     }
>>> +     list_add(&dev->list, &vduse_devs);
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_name:
>>> +     ida_simple_remove(&vduse_ida, dev->minor);
>>> +err_ida:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     vduse_dev_destroy(dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, sizeof(config)))
>>> +                     break;
>>> +
>>> +             ret = vduse_create_dev(&config);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static struct miscdevice vduse_misc = {
>>> +     .fops = &vduse_fops,
>>> +     .minor = MISC_DYNAMIC_MINOR,
>>> +     .name = "vduse",
>>> +     .nodename = "vduse/control",
>>> +};
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev = dev->vdev;
>>> +     int ret;
>>> +
>>> +     if (vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
>>
>> I think the char dev should be used as the parent here.
>>
> Agree.
>
>>> +                              &vduse_vdpa_config_ops,
>>> +                              dev->vq_num, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     ret = _vdpa_register_device(&vdev->vdpa);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     dev->vdev = vdev;
>>> +
>>> +     return 0;
>>> +err:
>>> +     put_device(&vdev->vdpa.dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret = -EINVAL;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev)
>>> +             goto unlock;
>>
>> Any reason for this check? I think vdpa core layer has already did for
>> the name check for us.
>>
> We need to check whether the vduse device with the name is created.


Yes.


>
>>> +
>>> +     ret = vduse_dev_add_vdpa(dev, name);
>>> +unlock:
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = misc_register(&vduse_misc);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class)) {
>>> +             ret = PTR_ERR(vduse_class);
>>> +             goto err_class;
>>> +     }
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>
>> Should we validate max_bounce_size < max_iova_size here?
>>
> Sure.
>
>>
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev:
>>> +     class_destroy(vduse_class);
>>> +err_class:
>>> +     misc_deregister(&vduse_misc);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     misc_deregister(&vduse_misc);
>>> +     class_destroy(vduse_class);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     vduse_domain_exit();
>>> +     vduse_mgmtdev_exit();
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_VERSION(DRV_VERSION);
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..9391c4acfa53
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,136 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_CONFIG_DATA_LEN        256
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     VDUSE_SET_VQ_NUM,
>>> +     VDUSE_SET_VQ_ADDR,
>>> +     VDUSE_SET_VQ_READY,
>>> +     VDUSE_GET_VQ_READY,
>>> +     VDUSE_SET_VQ_STATE,
>>> +     VDUSE_GET_VQ_STATE,
>>> +     VDUSE_SET_FEATURES,
>>> +     VDUSE_GET_FEATURES,
>>> +     VDUSE_SET_STATUS,
>>> +     VDUSE_GET_STATUS,
>>> +     VDUSE_SET_CONFIG,
>>> +     VDUSE_GET_CONFIG,
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_num {
>>> +     __u32 index;
>>> +     __u32 num;
>>> +};
>>> +
>>> +struct vduse_vq_addr {
>>> +     __u32 index;
>>> +     __u64 desc_addr;
>>> +     __u64 driver_addr;
>>> +     __u64 device_addr;
>>> +};
>>> +
>>> +struct vduse_vq_ready {
>>> +     __u32 index;
>>> +     __u8 ready;
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index;
>>> +     __u16 avail_idx;
>>> +};
>>> +
>>> +struct vduse_dev_config_data {
>>> +     __u32 offset;
>>> +     __u32 len;
>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN];
>>> +};
>>> +
>>> +struct vduse_iova_range {
>>> +     __u64 start;
>>> +     __u64 last;
>>> +};
>>> +
>>> +struct vduse_dev_request {
>>> +     __u32 type; /* request type */
>>> +     __u32 unique; /* request id */
>>
>> Let's simply use "request_id" here.
>>
> OK.
>
>>> +     __u32 reserved[2]; /* for feature use */
>>> +     union {
>>> +             struct vduse_vq_num vq_num; /* virtqueue num */
>>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             struct vduse_iova_range iova; /* iova range for updating */
>>> +             __u64 features; /* virtio features */
>>> +             __u8 status; /* device status */
>>
>> It might be better to use struct for feaures and status as well for
>> consistency.
>>
> OK.
>
>> And to be safe, let's add explicity padding here.
>>
> Do you mean add padding for the union?


I think so.


>
>>> +     };
>>> +};
>>> +
>>> +struct vduse_dev_response {
>>> +     __u32 request_id; /* corresponding request id */
>>> +#define VDUSE_REQUEST_OK     0x00
>>> +#define VDUSE_REQUEST_FAILED 0x01
>>> +     __u8 result; /* the result of request */
>>> +     __u8 reserved[11]; /* for feature use */
>>
>> Looks like this will be a hole which is similar to
>> 429711aec282c4b5fe5bbd7b2f0bbbff4110ffb2. Need to make sure the reserved
>> end at 8 byte boundary.
>>
> Will fix it.
>
>>> +     union {
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             __u64 features; /* virtio features */
>>> +             __u8 status; /* device status */
>>> +     };
>>> +};
>>> +
>>> +/* ioctls */
>>> +
>>> +struct vduse_dev_config {
>>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
>>> +     __u32 vendor_id; /* virtio vendor id */
>>> +     __u32 device_id; /* virtio device id */
>>> +     __u64 bounce_size; /* bounce buffer size for iommu */
>>> +     __u16 vq_num; /* the number of virtqueues */
>>> +     __u16 vq_size_max; /* the max size of virtqueue */
>>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
>>> +};
>>> +
>>> +struct vduse_iotlb_entry {
>>> +     __u64 offset; /* the mmap offset on fd */
>>> +     __u64 start; /* start of the IOVA range */
>>> +     __u64 last; /* last of the IOVA range */
>>> +#define VDUSE_ACCESS_RO 0x1
>>> +#define VDUSE_ACCESS_WO 0x2
>>> +#define VDUSE_ACCESS_RW 0x3
>>> +     __u8 perm; /* access permission of this range */
>>> +};
>>> +
>>> +struct vduse_vq_eventfd {
>>> +     __u32 index; /* virtqueue index */
>>> +     int fd; /* eventfd, -1 means de-assigning the eventfd */
>>
>> Let's define a macro for this.
>>
> OK.
>
>>> +};
>>> +
>>> +#define VDUSE_BASE   0x81
>>> +
>>> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
>>> +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
>>> +
>>> +/* Destroy a vduse device. Make sure there are no references to the char device */
>>> +#define VDUSE_DESTROY_DEV    _IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
>>> +
>>> +/* Get a file descriptor for the mmap'able iova region */
>>> +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
>>> +
>>> +/* Setup an eventfd to receive kick for virtqueue */
>>> +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
>>> +
>>> +/* Inject an interrupt for specific virtqueue */
>>> +#define VDUSE_INJECT_VQ_IRQ  _IO(VDUSE_BASE, 0x05)
>>
>> I wonder do we need a version/feature handshake that is for future
>> extension instead of just reserve bits in uABI? E.g VDUSE_GET_VERSION ...
>>
> Agree. Will do it in v5.


Btw, I think you can remove "RFC" then Michael can consider to merge the 
series.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  3:04         ` Jason Wang
  (?)
@ 2021-03-05  3:30         ` Yongji Xie
  2021-03-05  3:42             ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  3:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/4 4:58 下午, Yongji Xie wrote:
> > On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>> This patch introduces a workqueue to support injecting
> >>> virtqueue's interrupt asynchronously. This is mainly
> >>> for performance considerations which makes sure the push()
> >>> and pop() for used vring can be asynchronous.
> >>
> >> Do you have pref numbers for this patch?
> >>
> > No, I can do some tests for it if needed.
> >
> > Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> > if we call irq callback in ioctl context. Something like:
> >
> > virtqueue_push();
> > virtio_notify();
> >      ioctl()
> > -------------------------------------------------
> >          irq_cb()
> >              virtqueue_get_buf()
> >
> > The used vring is always empty each time we call virtqueue_push() in
> > userspace. Not sure if it is what we expected.
>
>
> I'm not sure I get the issue.
>
> THe used ring should be filled by virtqueue_push() which is done by
> userspace before?
>

After userspace call virtqueue_push(), it always call virtio_notify()
immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
will inject an irq to VM and return, then vcpu thread will call
interrupt handler. But in container (virtio-vdpa) cases,
virtio_notify() will call interrupt handler directly. So it looks like
we have to optimize the virtio-vdpa cases. But one problem is we don't
know whether we are in the VM user case or container user case.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-04  5:12     ` Yongji Xie
@ 2021-03-05  3:35         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:35 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/4 1:12 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 12:21 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> This implements a MMU-based IOMMU driver to support mapping
>>> kernel dma buffer into userspace. The basic idea behind it is
>>> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
>>> up MMU mapping instead of IOMMU mapping for the DMA transfer so
>>> that the userspace process is able to use its virtual address to
>>> access the dma buffer in kernel.
>>>
>>> And to avoid security issue, a bounce-buffering mechanism is
>>> introduced to prevent userspace accessing the original buffer
>>> directly.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    drivers/vdpa/vdpa_user/iova_domain.c | 486 +++++++++++++++++++++++++++++++++++
>>>    drivers/vdpa/vdpa_user/iova_domain.h |  61 +++++
>>>    2 files changed, 547 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>> new file mode 100644
>>> index 000000000000..9285d430d486
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>> @@ -0,0 +1,486 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <linux/anon_inodes.h>
>>> +#include <linux/highmem.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define IOVA_START_PFN 1
>>> +#define IOVA_ALLOC_ORDER 12
>>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>>> +
>>> +static inline struct page *
>>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 index = iova >> PAGE_SHIFT;
>>> +
>>> +     return domain->bounce_pages[index];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
>>> +                             u64 iova, struct page *page)
>>> +{
>>> +     u64 index = iova >> PAGE_SHIFT;
>>> +
>>> +     domain->bounce_pages[index] = page;
>>> +}
>>> +
>>> +static enum dma_data_direction perm_to_dir(int perm)
>>> +{
>>> +     enum dma_data_direction dir;
>>> +
>>> +     switch (perm) {
>>> +     case VHOST_MAP_WO:
>>> +             dir = DMA_FROM_DEVICE;
>>> +             break;
>>> +     case VHOST_MAP_RO:
>>> +             dir = DMA_TO_DEVICE;
>>> +             break;
>>> +     case VHOST_MAP_RW:
>>> +             dir = DMA_BIDIRECTIONAL;
>>> +             break;
>>> +     default:
>>> +             break;
>>> +     }
>>> +
>>> +     return dir;
>>> +}
>>> +
>>> +static int dir_to_perm(enum dma_data_direction dir)
>>> +{
>>> +     int perm = -EFAULT;
>>> +
>>> +     switch (dir) {
>>> +     case DMA_FROM_DEVICE:
>>> +             perm = VHOST_MAP_WO;
>>> +             break;
>>> +     case DMA_TO_DEVICE:
>>> +             perm = VHOST_MAP_RO;
>>> +             break;
>>> +     case DMA_BIDIRECTIONAL:
>>> +             perm = VHOST_MAP_RW;
>>> +             break;
>>> +     default:
>>> +             break;
>>> +     }
>>> +
>>> +     return perm;
>>> +}
>>
>> Let's move the above two helpers to vhost_iotlb.h so they could be used
>> by other driver e.g (vpda_sim)
>>
> Sure.
>
>>> +
>>> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
>>> +                     enum dma_data_direction dir)
>>> +{
>>> +     unsigned long pfn = PFN_DOWN(orig);
>>> +
>>> +     if (PageHighMem(pfn_to_page(pfn))) {
>>> +             unsigned int offset = offset_in_page(orig);
>>> +             char *buffer;
>>> +             unsigned int sz = 0;
>>> +             unsigned long flags;
>>> +
>>> +             while (size) {
>>> +                     sz = min_t(size_t, PAGE_SIZE - offset, size);
>>> +
>>> +                     local_irq_save(flags);
>>> +                     buffer = kmap_atomic(pfn_to_page(pfn));
>>> +                     if (dir == DMA_TO_DEVICE)
>>> +                             memcpy(addr, buffer + offset, sz);
>>> +                     else
>>> +                             memcpy(buffer + offset, addr, sz);
>>> +                     kunmap_atomic(buffer);
>>> +                     local_irq_restore(flags);
>>
>> I wonder why we need to deal with highmem and irq flags explicitly like
>> this. Doesn't kmap_atomic() will take care all of those?
>>
> Yes, irq flags is useless here. Will remove it.
>
>>> +
>>> +                     size -= sz;
>>> +                     pfn++;
>>> +                     addr += sz;
>>> +                     offset = 0;
>>> +             }
>>> +     } else if (dir == DMA_TO_DEVICE) {
>>> +             memcpy(addr, phys_to_virt(orig), size);
>>> +     } else {
>>> +             memcpy(phys_to_virt(orig), addr, size);
>>> +     }
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 start = iova & PAGE_MASK;
>>> +     u64 last = start + PAGE_SIZE - 1;
>>> +     struct vhost_iotlb_map *map;
>>> +     struct page *page = NULL;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, start, last);
>>> +     if (!map)
>>> +             goto out;
>>> +
>>> +     page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
>>> +     get_page(page);
>>> +out:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     return page;
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 start = iova & PAGE_MASK;
>>> +     u64 last = start + PAGE_SIZE - 1;
>>> +     struct vhost_iotlb_map *map;
>>> +     struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
>>> +
>>> +     if (!new_page)
>>> +             return NULL;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     if (!vhost_iotlb_itree_first(domain->iotlb, start, last)) {
>>> +             __free_page(new_page);
>>> +             goto out;
>>> +     }
>>> +     page = vduse_domain_get_bounce_page(domain, iova);
>>> +     if (page) {
>>> +             get_page(page);
>>> +             __free_page(new_page);
>>> +             goto out;
>>> +     }
>>> +     vduse_domain_set_bounce_page(domain, iova, new_page);
>>> +     get_page(new_page);
>>> +     page = new_page;
>>> +
>>> +     for (map = vhost_iotlb_itree_first(domain->iotlb, start, last); map;
>>> +          map = vhost_iotlb_itree_next(map, start, last)) {
>>> +             unsigned int src_offset = 0, dst_offset = 0;
>>> +             phys_addr_t src;
>>> +             void *dst;
>>> +             size_t sz;
>>> +
>>> +             if (perm_to_dir(map->perm) == DMA_FROM_DEVICE)
>>> +                     continue;
>>> +
>>> +             if (start > map->start)
>>> +                     src_offset = start - map->start;
>>> +             else
>>> +                     dst_offset = map->start - start;
>>> +
>>> +             src = map->addr + src_offset;
>>> +             dst = page_address(page) + dst_offset;
>>> +             sz = min_t(size_t, map->size - src_offset,
>>> +                             PAGE_SIZE - dst_offset);
>>> +             do_bounce(src, dst, sz, DMA_TO_DEVICE);
>>> +     }
>>> +out:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     return page;
>>> +}
>>> +
>>> +static void
>>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
>>> +                             u64 iova, size_t size)
>>> +{
>>> +     struct page *page;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     if (WARN_ON(vhost_iotlb_itree_first(domain->iotlb, iova,
>>> +                                             iova + size - 1)))
>>> +             goto out;
>>> +
>>> +     while (size > 0) {
>>> +             page = vduse_domain_get_bounce_page(domain, iova);
>>> +             if (page) {
>>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
>>> +                     __free_page(page);
>>> +             }
>>> +             size -= PAGE_SIZE;
>>> +             iova += PAGE_SIZE;
>>> +     }
>>> +out:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +}
>>> +
>>> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>> +                             dma_addr_t iova, phys_addr_t orig,
>>> +                             size_t size, enum dma_data_direction dir)
>>> +{
>>> +     unsigned int offset = offset_in_page(iova);
>>> +
>>> +     while (size) {
>>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
>>> +             size_t sz = min_t(size_t, PAGE_SIZE - offset, size);
>>> +
>>> +             WARN_ON(!p && dir == DMA_FROM_DEVICE);
>>> +
>>> +             if (p)
>>> +                     do_bounce(orig, page_address(p) + offset, sz, dir);
>>> +
>>> +             size -= sz;
>>> +             orig += sz;
>>> +             iova += sz;
>>> +             offset = 0;
>>> +     }
>>> +}
>>> +
>>> +static dma_addr_t vduse_domain_alloc_iova(struct iova_domain *iovad,
>>> +                             unsigned long size, unsigned long limit)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +     unsigned long iova_pfn;
>>> +
>>> +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>>> +             iova_len = roundup_pow_of_two(iova_len);
>>> +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
>>> +
>>> +     return iova_pfn << shift;
>>> +}
>>> +
>>> +static void vduse_domain_free_iova(struct iova_domain *iovad,
>>> +                             dma_addr_t iova, size_t size)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +
>>> +     free_iova_fast(iovad, iova >> shift, iova_len);
>>> +}
>>> +
>>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
>>> +                             struct page *page, unsigned long offset,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +     unsigned long limit = domain->bounce_size - 1;
>>> +     phys_addr_t pa = page_to_phys(page) + offset;
>>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +     int ret;
>>> +
>>> +     if (!iova)
>>> +             return DMA_MAPPING_ERROR;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
>>> +                                 (u64)iova + size - 1,
>>> +                                 pa, dir_to_perm(dir));
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +     if (ret) {
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +             return DMA_MAPPING_ERROR;
>>> +     }
>>> +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, iova, pa, size, DMA_TO_DEVICE);
>>> +
>>> +     return iova;
>>> +}
>>> +
>>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
>>> +                     dma_addr_t dma_addr, size_t size,
>>> +                     enum dma_data_direction dir, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +     struct vhost_iotlb_map *map;
>>> +     phys_addr_t pa;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
>>> +                                   (u64)dma_addr + size - 1);
>>> +     if (WARN_ON(!map)) {
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             return;
>>> +     }
>>> +     pa = map->addr;
>>> +     vhost_iotlb_map_free(domain->iotlb, map);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, dma_addr, pa,
>>> +                                     size, DMA_FROM_DEVICE);
>>> +
>>> +     vduse_domain_free_iova(iovad, dma_addr, size);
>>> +}
>>> +
>>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>>> +                             size_t size, dma_addr_t *dma_addr,
>>> +                             gfp_t flag, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     unsigned long limit = domain->iova_limit;
>>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +     void *orig = alloc_pages_exact(size, flag);
>>> +     int ret;
>>> +
>>> +     if (!iova || !orig)
>>> +             goto err;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
>>> +                                 (u64)iova + size - 1,
>>> +                                 virt_to_phys(orig), VHOST_MAP_RW);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     *dma_addr = iova;
>>> +
>>> +     return orig;
>>> +err:
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     if (orig)
>>> +             free_pages_exact(orig, size);
>>> +     if (iova)
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
>>> +                             void *vaddr, dma_addr_t dma_addr,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     struct vhost_iotlb_map *map;
>>> +     phys_addr_t pa;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
>>> +                                   (u64)dma_addr + size - 1);
>>> +     if (WARN_ON(!map)) {
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             return;
>>> +     }
>>> +     pa = map->addr;
>>> +     vhost_iotlb_map_free(domain->iotlb, map);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     vduse_domain_free_iova(iovad, dma_addr, size);
>>> +     free_pages_exact(phys_to_virt(pa), size);
>>> +}
>>> +
>>> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
>>> +{
>>> +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
>>> +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
>>> +     struct page *page;
>>> +
>>> +     if (!domain)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     if (iova < domain->bounce_size)
>>> +             page = vduse_domain_alloc_bounce_page(domain, iova);
>>> +     else
>>> +             page = vduse_domain_get_mapping_page(domain, iova);
>>> +
>>> +     if (!page)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     vmf->page = page;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
>>> +     .fault = vduse_domain_mmap_fault,
>>> +};
>>> +
>>> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
>>> +     vma->vm_private_data = domain;
>>> +     vma->vm_ops = &vduse_domain_mmap_ops;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_domain_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
>>> +     put_iova_domain(&domain->stream_iovad);
>>> +     put_iova_domain(&domain->consistent_iovad);
>>> +     vhost_iotlb_free(domain->iotlb);
>>> +     vfree(domain->bounce_pages);
>>> +     kfree(domain);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_domain_fops = {
>>> +     .mmap = vduse_domain_mmap,
>>> +     .release = vduse_domain_release,
>>> +};
>>> +
>>> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
>>> +{
>>> +     fput(domain->file);
>>> +}
>>> +
>>> +struct vduse_iova_domain *
>>> +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
>>> +{
>>> +     struct vduse_iova_domain *domain;
>>> +     struct file *file;
>>> +     unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
>>> +
>>> +     if (iova_limit <= bounce_size)
>>> +             return NULL;
>>> +
>>> +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>>> +     if (!domain)
>>> +             return NULL;
>>> +
>>> +     domain->iotlb = vhost_iotlb_alloc(0, 0);
>>> +     if (!domain->iotlb)
>>> +             goto err_iotlb;
>>> +
>>> +     domain->iova_limit = iova_limit;
>>> +     domain->bounce_size = PAGE_ALIGN(bounce_size);
>>> +     domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
>>> +     if (!domain->bounce_pages)
>>> +             goto err_page;
>>> +
>>> +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
>>> +                             domain, O_RDWR);
>>> +     if (IS_ERR(file))
>>> +             goto err_file;
>>> +
>>> +     domain->file = file;
>>> +     spin_lock_init(&domain->iotlb_lock);
>>> +     init_iova_domain(&domain->stream_iovad,
>>> +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
>>> +     init_iova_domain(&domain->consistent_iovad,
>>> +                     PAGE_SIZE, bounce_pfns);
>>> +
>>> +     return domain;
>>> +err_file:
>>> +     vfree(domain->bounce_pages);
>>> +err_page:
>>> +     vhost_iotlb_free(domain->iotlb);
>>> +err_iotlb:
>>> +     kfree(domain);
>>> +     return NULL;
>>> +}
>>> +
>>> +int vduse_domain_init(void)
>>> +{
>>> +     return iova_cache_get();
>>> +}
>>> +
>>> +void vduse_domain_exit(void)
>>> +{
>>> +     iova_cache_put();
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
>>> new file mode 100644
>>> index 000000000000..9c85d8346626
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
>>> @@ -0,0 +1,61 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_IOVA_DOMAIN_H
>>> +#define _VDUSE_IOVA_DOMAIN_H
>>> +
>>> +#include <linux/iova.h>
>>> +#include <linux/dma-mapping.h>
>>> +#include <linux/vhost_iotlb.h>
>>> +
>>> +struct vduse_iova_domain {
>>> +     struct iova_domain stream_iovad;
>>> +     struct iova_domain consistent_iovad;
>>> +     struct page **bounce_pages;
>>> +     size_t bounce_size;
>>> +     unsigned long iova_limit;
>>> +     struct vhost_iotlb *iotlb;
>>
>> Sorry if I've asked this before.
>>
>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>> could reuse vduse_dev->iommu since the device can not be used by both
>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>> set_map().
>>
> The main difference between domain->iotlb and dev->iotlb is the way to
> deal with bounce buffer. In the domain->iotlb case, bounce buffer
> needs to be mapped each DMA transfer because we need to get the bounce
> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> buffer only needs to be mapped once during initialization, which will
> be used to tell userspace how to do mmap().
>
>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>> that instead of the bounce_pages *?
>>
> Sorry, I didn't get you here. Which value do you mean to store in the
> opaque pointer?


So I would like to have a way to use a single IOTLB for manage all kinds 
of mappings. Two possible ideas:

1) map bounce page one by one in vduse_dev_map_page(), in 
VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then 
for bounce pages, userspace still only need to map it once and we can 
maintain the actual mapping by storing the page or pa in the opaque 
field of IOTLB entry.
2) map bounce page once in vduse_dev_map_page() and store struct page 
**bounce_pages in the opaque field of this single IOTLB entry.

Does this work?

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
@ 2021-03-05  3:35         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:35 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/4 1:12 下午, Yongji Xie wrote:
> On Thu, Mar 4, 2021 at 12:21 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>> This implements a MMU-based IOMMU driver to support mapping
>>> kernel dma buffer into userspace. The basic idea behind it is
>>> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
>>> up MMU mapping instead of IOMMU mapping for the DMA transfer so
>>> that the userspace process is able to use its virtual address to
>>> access the dma buffer in kernel.
>>>
>>> And to avoid security issue, a bounce-buffering mechanism is
>>> introduced to prevent userspace accessing the original buffer
>>> directly.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    drivers/vdpa/vdpa_user/iova_domain.c | 486 +++++++++++++++++++++++++++++++++++
>>>    drivers/vdpa/vdpa_user/iova_domain.h |  61 +++++
>>>    2 files changed, 547 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>> new file mode 100644
>>> index 000000000000..9285d430d486
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>> @@ -0,0 +1,486 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <linux/anon_inodes.h>
>>> +#include <linux/highmem.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define IOVA_START_PFN 1
>>> +#define IOVA_ALLOC_ORDER 12
>>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>>> +
>>> +static inline struct page *
>>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 index = iova >> PAGE_SHIFT;
>>> +
>>> +     return domain->bounce_pages[index];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
>>> +                             u64 iova, struct page *page)
>>> +{
>>> +     u64 index = iova >> PAGE_SHIFT;
>>> +
>>> +     domain->bounce_pages[index] = page;
>>> +}
>>> +
>>> +static enum dma_data_direction perm_to_dir(int perm)
>>> +{
>>> +     enum dma_data_direction dir;
>>> +
>>> +     switch (perm) {
>>> +     case VHOST_MAP_WO:
>>> +             dir = DMA_FROM_DEVICE;
>>> +             break;
>>> +     case VHOST_MAP_RO:
>>> +             dir = DMA_TO_DEVICE;
>>> +             break;
>>> +     case VHOST_MAP_RW:
>>> +             dir = DMA_BIDIRECTIONAL;
>>> +             break;
>>> +     default:
>>> +             break;
>>> +     }
>>> +
>>> +     return dir;
>>> +}
>>> +
>>> +static int dir_to_perm(enum dma_data_direction dir)
>>> +{
>>> +     int perm = -EFAULT;
>>> +
>>> +     switch (dir) {
>>> +     case DMA_FROM_DEVICE:
>>> +             perm = VHOST_MAP_WO;
>>> +             break;
>>> +     case DMA_TO_DEVICE:
>>> +             perm = VHOST_MAP_RO;
>>> +             break;
>>> +     case DMA_BIDIRECTIONAL:
>>> +             perm = VHOST_MAP_RW;
>>> +             break;
>>> +     default:
>>> +             break;
>>> +     }
>>> +
>>> +     return perm;
>>> +}
>>
>> Let's move the above two helpers to vhost_iotlb.h so they could be used
>> by other driver e.g (vpda_sim)
>>
> Sure.
>
>>> +
>>> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
>>> +                     enum dma_data_direction dir)
>>> +{
>>> +     unsigned long pfn = PFN_DOWN(orig);
>>> +
>>> +     if (PageHighMem(pfn_to_page(pfn))) {
>>> +             unsigned int offset = offset_in_page(orig);
>>> +             char *buffer;
>>> +             unsigned int sz = 0;
>>> +             unsigned long flags;
>>> +
>>> +             while (size) {
>>> +                     sz = min_t(size_t, PAGE_SIZE - offset, size);
>>> +
>>> +                     local_irq_save(flags);
>>> +                     buffer = kmap_atomic(pfn_to_page(pfn));
>>> +                     if (dir == DMA_TO_DEVICE)
>>> +                             memcpy(addr, buffer + offset, sz);
>>> +                     else
>>> +                             memcpy(buffer + offset, addr, sz);
>>> +                     kunmap_atomic(buffer);
>>> +                     local_irq_restore(flags);
>>
>> I wonder why we need to deal with highmem and irq flags explicitly like
>> this. Doesn't kmap_atomic() will take care all of those?
>>
> Yes, irq flags is useless here. Will remove it.
>
>>> +
>>> +                     size -= sz;
>>> +                     pfn++;
>>> +                     addr += sz;
>>> +                     offset = 0;
>>> +             }
>>> +     } else if (dir == DMA_TO_DEVICE) {
>>> +             memcpy(addr, phys_to_virt(orig), size);
>>> +     } else {
>>> +             memcpy(phys_to_virt(orig), addr, size);
>>> +     }
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 start = iova & PAGE_MASK;
>>> +     u64 last = start + PAGE_SIZE - 1;
>>> +     struct vhost_iotlb_map *map;
>>> +     struct page *page = NULL;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, start, last);
>>> +     if (!map)
>>> +             goto out;
>>> +
>>> +     page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
>>> +     get_page(page);
>>> +out:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     return page;
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 start = iova & PAGE_MASK;
>>> +     u64 last = start + PAGE_SIZE - 1;
>>> +     struct vhost_iotlb_map *map;
>>> +     struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
>>> +
>>> +     if (!new_page)
>>> +             return NULL;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     if (!vhost_iotlb_itree_first(domain->iotlb, start, last)) {
>>> +             __free_page(new_page);
>>> +             goto out;
>>> +     }
>>> +     page = vduse_domain_get_bounce_page(domain, iova);
>>> +     if (page) {
>>> +             get_page(page);
>>> +             __free_page(new_page);
>>> +             goto out;
>>> +     }
>>> +     vduse_domain_set_bounce_page(domain, iova, new_page);
>>> +     get_page(new_page);
>>> +     page = new_page;
>>> +
>>> +     for (map = vhost_iotlb_itree_first(domain->iotlb, start, last); map;
>>> +          map = vhost_iotlb_itree_next(map, start, last)) {
>>> +             unsigned int src_offset = 0, dst_offset = 0;
>>> +             phys_addr_t src;
>>> +             void *dst;
>>> +             size_t sz;
>>> +
>>> +             if (perm_to_dir(map->perm) == DMA_FROM_DEVICE)
>>> +                     continue;
>>> +
>>> +             if (start > map->start)
>>> +                     src_offset = start - map->start;
>>> +             else
>>> +                     dst_offset = map->start - start;
>>> +
>>> +             src = map->addr + src_offset;
>>> +             dst = page_address(page) + dst_offset;
>>> +             sz = min_t(size_t, map->size - src_offset,
>>> +                             PAGE_SIZE - dst_offset);
>>> +             do_bounce(src, dst, sz, DMA_TO_DEVICE);
>>> +     }
>>> +out:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     return page;
>>> +}
>>> +
>>> +static void
>>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
>>> +                             u64 iova, size_t size)
>>> +{
>>> +     struct page *page;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     if (WARN_ON(vhost_iotlb_itree_first(domain->iotlb, iova,
>>> +                                             iova + size - 1)))
>>> +             goto out;
>>> +
>>> +     while (size > 0) {
>>> +             page = vduse_domain_get_bounce_page(domain, iova);
>>> +             if (page) {
>>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
>>> +                     __free_page(page);
>>> +             }
>>> +             size -= PAGE_SIZE;
>>> +             iova += PAGE_SIZE;
>>> +     }
>>> +out:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +}
>>> +
>>> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>> +                             dma_addr_t iova, phys_addr_t orig,
>>> +                             size_t size, enum dma_data_direction dir)
>>> +{
>>> +     unsigned int offset = offset_in_page(iova);
>>> +
>>> +     while (size) {
>>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
>>> +             size_t sz = min_t(size_t, PAGE_SIZE - offset, size);
>>> +
>>> +             WARN_ON(!p && dir == DMA_FROM_DEVICE);
>>> +
>>> +             if (p)
>>> +                     do_bounce(orig, page_address(p) + offset, sz, dir);
>>> +
>>> +             size -= sz;
>>> +             orig += sz;
>>> +             iova += sz;
>>> +             offset = 0;
>>> +     }
>>> +}
>>> +
>>> +static dma_addr_t vduse_domain_alloc_iova(struct iova_domain *iovad,
>>> +                             unsigned long size, unsigned long limit)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +     unsigned long iova_pfn;
>>> +
>>> +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>>> +             iova_len = roundup_pow_of_two(iova_len);
>>> +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
>>> +
>>> +     return iova_pfn << shift;
>>> +}
>>> +
>>> +static void vduse_domain_free_iova(struct iova_domain *iovad,
>>> +                             dma_addr_t iova, size_t size)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +
>>> +     free_iova_fast(iovad, iova >> shift, iova_len);
>>> +}
>>> +
>>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
>>> +                             struct page *page, unsigned long offset,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +     unsigned long limit = domain->bounce_size - 1;
>>> +     phys_addr_t pa = page_to_phys(page) + offset;
>>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +     int ret;
>>> +
>>> +     if (!iova)
>>> +             return DMA_MAPPING_ERROR;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
>>> +                                 (u64)iova + size - 1,
>>> +                                 pa, dir_to_perm(dir));
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +     if (ret) {
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +             return DMA_MAPPING_ERROR;
>>> +     }
>>> +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, iova, pa, size, DMA_TO_DEVICE);
>>> +
>>> +     return iova;
>>> +}
>>> +
>>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
>>> +                     dma_addr_t dma_addr, size_t size,
>>> +                     enum dma_data_direction dir, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +     struct vhost_iotlb_map *map;
>>> +     phys_addr_t pa;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
>>> +                                   (u64)dma_addr + size - 1);
>>> +     if (WARN_ON(!map)) {
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             return;
>>> +     }
>>> +     pa = map->addr;
>>> +     vhost_iotlb_map_free(domain->iotlb, map);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, dma_addr, pa,
>>> +                                     size, DMA_FROM_DEVICE);
>>> +
>>> +     vduse_domain_free_iova(iovad, dma_addr, size);
>>> +}
>>> +
>>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>>> +                             size_t size, dma_addr_t *dma_addr,
>>> +                             gfp_t flag, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     unsigned long limit = domain->iova_limit;
>>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +     void *orig = alloc_pages_exact(size, flag);
>>> +     int ret;
>>> +
>>> +     if (!iova || !orig)
>>> +             goto err;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
>>> +                                 (u64)iova + size - 1,
>>> +                                 virt_to_phys(orig), VHOST_MAP_RW);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     *dma_addr = iova;
>>> +
>>> +     return orig;
>>> +err:
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     if (orig)
>>> +             free_pages_exact(orig, size);
>>> +     if (iova)
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
>>> +                             void *vaddr, dma_addr_t dma_addr,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     struct vhost_iotlb_map *map;
>>> +     phys_addr_t pa;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
>>> +                                   (u64)dma_addr + size - 1);
>>> +     if (WARN_ON(!map)) {
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             return;
>>> +     }
>>> +     pa = map->addr;
>>> +     vhost_iotlb_map_free(domain->iotlb, map);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     vduse_domain_free_iova(iovad, dma_addr, size);
>>> +     free_pages_exact(phys_to_virt(pa), size);
>>> +}
>>> +
>>> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
>>> +{
>>> +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
>>> +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
>>> +     struct page *page;
>>> +
>>> +     if (!domain)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     if (iova < domain->bounce_size)
>>> +             page = vduse_domain_alloc_bounce_page(domain, iova);
>>> +     else
>>> +             page = vduse_domain_get_mapping_page(domain, iova);
>>> +
>>> +     if (!page)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     vmf->page = page;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
>>> +     .fault = vduse_domain_mmap_fault,
>>> +};
>>> +
>>> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
>>> +     vma->vm_private_data = domain;
>>> +     vma->vm_ops = &vduse_domain_mmap_ops;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_domain_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
>>> +     put_iova_domain(&domain->stream_iovad);
>>> +     put_iova_domain(&domain->consistent_iovad);
>>> +     vhost_iotlb_free(domain->iotlb);
>>> +     vfree(domain->bounce_pages);
>>> +     kfree(domain);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_domain_fops = {
>>> +     .mmap = vduse_domain_mmap,
>>> +     .release = vduse_domain_release,
>>> +};
>>> +
>>> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
>>> +{
>>> +     fput(domain->file);
>>> +}
>>> +
>>> +struct vduse_iova_domain *
>>> +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
>>> +{
>>> +     struct vduse_iova_domain *domain;
>>> +     struct file *file;
>>> +     unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
>>> +
>>> +     if (iova_limit <= bounce_size)
>>> +             return NULL;
>>> +
>>> +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>>> +     if (!domain)
>>> +             return NULL;
>>> +
>>> +     domain->iotlb = vhost_iotlb_alloc(0, 0);
>>> +     if (!domain->iotlb)
>>> +             goto err_iotlb;
>>> +
>>> +     domain->iova_limit = iova_limit;
>>> +     domain->bounce_size = PAGE_ALIGN(bounce_size);
>>> +     domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
>>> +     if (!domain->bounce_pages)
>>> +             goto err_page;
>>> +
>>> +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
>>> +                             domain, O_RDWR);
>>> +     if (IS_ERR(file))
>>> +             goto err_file;
>>> +
>>> +     domain->file = file;
>>> +     spin_lock_init(&domain->iotlb_lock);
>>> +     init_iova_domain(&domain->stream_iovad,
>>> +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
>>> +     init_iova_domain(&domain->consistent_iovad,
>>> +                     PAGE_SIZE, bounce_pfns);
>>> +
>>> +     return domain;
>>> +err_file:
>>> +     vfree(domain->bounce_pages);
>>> +err_page:
>>> +     vhost_iotlb_free(domain->iotlb);
>>> +err_iotlb:
>>> +     kfree(domain);
>>> +     return NULL;
>>> +}
>>> +
>>> +int vduse_domain_init(void)
>>> +{
>>> +     return iova_cache_get();
>>> +}
>>> +
>>> +void vduse_domain_exit(void)
>>> +{
>>> +     iova_cache_put();
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
>>> new file mode 100644
>>> index 000000000000..9c85d8346626
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
>>> @@ -0,0 +1,61 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_IOVA_DOMAIN_H
>>> +#define _VDUSE_IOVA_DOMAIN_H
>>> +
>>> +#include <linux/iova.h>
>>> +#include <linux/dma-mapping.h>
>>> +#include <linux/vhost_iotlb.h>
>>> +
>>> +struct vduse_iova_domain {
>>> +     struct iova_domain stream_iovad;
>>> +     struct iova_domain consistent_iovad;
>>> +     struct page **bounce_pages;
>>> +     size_t bounce_size;
>>> +     unsigned long iova_limit;
>>> +     struct vhost_iotlb *iotlb;
>>
>> Sorry if I've asked this before.
>>
>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>> could reuse vduse_dev->iommu since the device can not be used by both
>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>> set_map().
>>
> The main difference between domain->iotlb and dev->iotlb is the way to
> deal with bounce buffer. In the domain->iotlb case, bounce buffer
> needs to be mapped each DMA transfer because we need to get the bounce
> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> buffer only needs to be mapped once during initialization, which will
> be used to tell userspace how to do mmap().
>
>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>> that instead of the bounce_pages *?
>>
> Sorry, I didn't get you here. Which value do you mean to store in the
> opaque pointer?


So I would like to have a way to use a single IOTLB for manage all kinds 
of mappings. Two possible ideas:

1) map bounce page one by one in vduse_dev_map_page(), in 
VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then 
for bounce pages, userspace still only need to map it once and we can 
maintain the actual mapping by storing the page or pa in the opaque 
field of IOTLB entry.
2) map bounce page once in vduse_dev_map_page() and store struct page 
**bounce_pages in the opaque field of this single IOTLB entry.

Does this work?

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
  2021-03-05  3:11         ` Jason Wang
  (?)
@ 2021-03-05  3:37         ` Yongji Xie
  2021-03-05  3:44             ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  3:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 11:11 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/4 4:19 下午, Yongji Xie wrote:
> > On Thu, Mar 4, 2021 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
> >>> injecting virtqueue's interrupt to the specified cpu.
> >>
> >> How userspace know which CPU is this irq for? It looks to me we need to
> >> do it at different level.
> >>
> >> E.g introduce some API in sys to allow admin to tune for that.
> >>
> >> But I think we can do that in antoher patch on top of this series.
> >>
> > OK. I will think more about it.
>
>
> It should be soemthing like
> /sys/class/vduse/$dev_name/vq/0/irq_affinity. Also need to make sure
> eventfd could not be reused.
>

Looks like we doesn't use eventfd now. Do you mean we need to use
eventfd in this case?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  3:30         ` Yongji Xie
@ 2021-03-05  3:42             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:42 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/5 11:30 上午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>> This patch introduces a workqueue to support injecting
>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>> for performance considerations which makes sure the push()
>>>>> and pop() for used vring can be asynchronous.
>>>> Do you have pref numbers for this patch?
>>>>
>>> No, I can do some tests for it if needed.
>>>
>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>> if we call irq callback in ioctl context. Something like:
>>>
>>> virtqueue_push();
>>> virtio_notify();
>>>       ioctl()
>>> -------------------------------------------------
>>>           irq_cb()
>>>               virtqueue_get_buf()
>>>
>>> The used vring is always empty each time we call virtqueue_push() in
>>> userspace. Not sure if it is what we expected.
>>
>> I'm not sure I get the issue.
>>
>> THe used ring should be filled by virtqueue_push() which is done by
>> userspace before?
>>
> After userspace call virtqueue_push(), it always call virtio_notify()
> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
> will inject an irq to VM and return, then vcpu thread will call
> interrupt handler. But in container (virtio-vdpa) cases,
> virtio_notify() will call interrupt handler directly. So it looks like
> we have to optimize the virtio-vdpa cases. But one problem is we don't
> know whether we are in the VM user case or container user case.


Yes, but I still don't get why used ring is empty after the ioctl()? 
Used ring does not use bounce page so it should be visible to the kernel 
driver. What did I miss :) ?

Thanks



>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
@ 2021-03-05  3:42             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:42 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/5 11:30 上午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>> This patch introduces a workqueue to support injecting
>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>> for performance considerations which makes sure the push()
>>>>> and pop() for used vring can be asynchronous.
>>>> Do you have pref numbers for this patch?
>>>>
>>> No, I can do some tests for it if needed.
>>>
>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>> if we call irq callback in ioctl context. Something like:
>>>
>>> virtqueue_push();
>>> virtio_notify();
>>>       ioctl()
>>> -------------------------------------------------
>>>           irq_cb()
>>>               virtqueue_get_buf()
>>>
>>> The used vring is always empty each time we call virtqueue_push() in
>>> userspace. Not sure if it is what we expected.
>>
>> I'm not sure I get the issue.
>>
>> THe used ring should be filled by virtqueue_push() which is done by
>> userspace before?
>>
> After userspace call virtqueue_push(), it always call virtio_notify()
> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
> will inject an irq to VM and return, then vcpu thread will call
> interrupt handler. But in container (virtio-vdpa) cases,
> virtio_notify() will call interrupt handler directly. So it looks like
> we have to optimize the virtio-vdpa cases. But one problem is we don't
> know whether we are in the VM user case or container user case.


Yes, but I still don't get why used ring is empty after the ioctl()? 
Used ring does not use bounce page so it should be visible to the kernel 
driver. What did I miss :) ?

Thanks



>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
  2021-03-05  3:37         ` Yongji Xie
@ 2021-03-05  3:44             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:44 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/5 11:37 上午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 11:11 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/4 4:19 下午, Yongji Xie wrote:
>>> On Thu, Mar 4, 2021 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
>>>>> injecting virtqueue's interrupt to the specified cpu.
>>>> How userspace know which CPU is this irq for? It looks to me we need to
>>>> do it at different level.
>>>>
>>>> E.g introduce some API in sys to allow admin to tune for that.
>>>>
>>>> But I think we can do that in antoher patch on top of this series.
>>>>
>>> OK. I will think more about it.
>>
>> It should be soemthing like
>> /sys/class/vduse/$dev_name/vq/0/irq_affinity. Also need to make sure
>> eventfd could not be reused.
>>
> Looks like we doesn't use eventfd now. Do you mean we need to use
> eventfd in this case?


No, I meant if we're using eventfd, do we allow a single eventfd to be 
used for injecting irq for more than one virtqueue? (If not, I guess it 
should be ok).

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
@ 2021-03-05  3:44             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  3:44 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/5 11:37 上午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 11:11 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/4 4:19 下午, Yongji Xie wrote:
>>> On Thu, Mar 4, 2021 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
>>>>> injecting virtqueue's interrupt to the specified cpu.
>>>> How userspace know which CPU is this irq for? It looks to me we need to
>>>> do it at different level.
>>>>
>>>> E.g introduce some API in sys to allow admin to tune for that.
>>>>
>>>> But I think we can do that in antoher patch on top of this series.
>>>>
>>> OK. I will think more about it.
>>
>> It should be soemthing like
>> /sys/class/vduse/$dev_name/vq/0/irq_affinity. Also need to make sure
>> eventfd could not be reused.
>>
> Looks like we doesn't use eventfd now. Do you mean we need to use
> eventfd in this case?


No, I meant if we're using eventfd, do we allow a single eventfd to be 
used for injecting irq for more than one virtqueue? (If not, I guess it 
should be ok).

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 07/11] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-03-05  3:20         ` Jason Wang
  (?)
@ 2021-03-05  3:49         ` Yongji Xie
  -1 siblings, 0 replies; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  3:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 11:20 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/4 4:05 下午, Yongji Xie wrote:
> > On Thu, Mar 4, 2021 at 2:27 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>> This VDUSE driver enables implementing vDPA devices in userspace.
> >>> Both control path and data path of vDPA devices will be able to
> >>> be handled in userspace.
> >>>
> >>> In the control path, the VDUSE driver will make use of message
> >>> mechnism to forward the config operation from vdpa bus driver
> >>> to userspace. Userspace can use read()/write() to receive/reply
> >>> those control messages.
> >>>
> >>> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> >>> the file descriptors referring to vDPA device's iova regions. Then
> >>> userspace can use mmap() to access those iova regions. Besides,
> >>> userspace can use ioctl() to inject interrupt and use the eventfd
> >>> mechanism to receive virtqueue kicks.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >>>    drivers/vdpa/Kconfig                               |   10 +
> >>>    drivers/vdpa/Makefile                              |    1 +
> >>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1348 ++++++++++++++++++++
> >>>    include/uapi/linux/vduse.h                         |  136 ++
> >>>    6 files changed, 1501 insertions(+)
> >>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >>>    create mode 100644 include/uapi/linux/vduse.h
> >>>
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index a4c75a28c839..71722e6f8f23 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >>>    '|'   00-7F  linux/media.h
> >>>    0x80  00-1F  linux/fb.h
> >>> +0x81  00-1F  linux/vduse.h
> >>>    0x89  00-06  arch/x86/include/asm/sockios.h
> >>>    0x89  0B-DF  linux/sockios.h
> >>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> >>> index ffd1e098bfd2..92f07715e3b6 100644
> >>> --- a/drivers/vdpa/Kconfig
> >>> +++ b/drivers/vdpa/Kconfig
> >>> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
> >>>        help
> >>>          vDPA networking device simulator which loops TX traffic back to RX.
> >>>
> >>> +config VDPA_USER
> >>> +     tristate "VDUSE (vDPA Device in Userspace) support"
> >>> +     depends on EVENTFD && MMU && HAS_DMA
> >>> +     select DMA_OPS
> >>> +     select VHOST_IOTLB
> >>> +     select IOMMU_IOVA
> >>> +     help
> >>> +       With VDUSE it is possible to emulate a vDPA Device
> >>> +       in a userspace program.
> >>> +
> >>>    config IFCVF
> >>>        tristate "Intel IFC VF vDPA driver"
> >>>        depends on PCI_MSI
> >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> >>> index d160e9b63a66..66e97778ad03 100644
> >>> --- a/drivers/vdpa/Makefile
> >>> +++ b/drivers/vdpa/Makefile
> >>> @@ -1,5 +1,6 @@
> >>>    # SPDX-License-Identifier: GPL-2.0
> >>>    obj-$(CONFIG_VDPA) += vdpa.o
> >>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >>>    obj-$(CONFIG_IFCVF)    += ifcvf/
> >>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> >>> new file mode 100644
> >>> index 000000000000..260e0b26af99
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/Makefile
> >>> @@ -0,0 +1,5 @@
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +vduse-y := vduse_dev.o iova_domain.o
> >>> +
> >>> +obj-$(CONFIG_VDPA_USER) += vduse.o
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> new file mode 100644
> >>> index 000000000000..393bf99c48be
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> @@ -0,0 +1,1348 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * VDUSE: vDPA Device in Userspace
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/init.h>
> >>> +#include <linux/module.h>
> >>> +#include <linux/miscdevice.h>
> >>> +#include <linux/cdev.h>
> >>> +#include <linux/device.h>
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/dma-map-ops.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/file.h>
> >>> +#include <linux/uio.h>
> >>> +#include <linux/vdpa.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +#include <uapi/linux/vdpa.h>
> >>> +#include <uapi/linux/virtio_config.h>
> >>> +#include <linux/mod_devicetable.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define DRV_VERSION  "1.0"
> >>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> >>> +#define DRV_DESC     "vDPA Device in Userspace"
> >>> +#define DRV_LICENSE  "GPL v2"
> >>> +
> >>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> >>> +
> >>> +struct vduse_virtqueue {
> >>> +     u16 index;
> >>> +     bool ready;
> >>> +     spinlock_t kick_lock;
> >>> +     spinlock_t irq_lock;
> >>> +     struct eventfd_ctx *kickfd;
> >>> +     struct vdpa_callback cb;
> >>> +};
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_vdpa {
> >>> +     struct vdpa_device vdpa;
> >>> +     struct vduse_dev *dev;
> >>> +};
> >>> +
> >>> +struct vduse_dev {
> >>> +     struct vduse_vdpa *vdev;
> >>> +     struct device dev;
> >>> +     struct cdev cdev;
> >>> +     struct vduse_virtqueue *vqs;
> >>> +     struct vduse_iova_domain *domain;
> >>> +     struct vhost_iotlb *iommu;
> >>> +     spinlock_t iommu_lock;
> >>> +     atomic_t bounce_map;
> >>> +     struct mutex msg_lock;
> >>> +     atomic64_t msg_unique;
> >>
> >> "next_request_id" should be better.
> >>
> > OK.
> >
> >>> +     wait_queue_head_t waitq;
> >>> +     struct list_head send_list;
> >>> +     struct list_head recv_list;
> >>> +     struct list_head list;
> >>> +     bool connected;
> >>> +     int minor;
> >>> +     u16 vq_size_max;
> >>> +     u16 vq_num;
> >>> +     u32 vq_align;
> >>> +     u32 device_id;
> >>> +     u32 vendor_id;
> >>> +};
> >>> +
> >>> +struct vduse_dev_msg {
> >>> +     struct vduse_dev_request req;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct list_head list;
> >>> +     wait_queue_head_t waitq;
> >>> +     bool completed;
> >>> +};
> >>> +
> >>> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> >>> +module_param(max_bounce_size, ulong, 0444);
> >>> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> >>> +
> >>> +static unsigned long max_iova_size = (128 * 1024 * 1024);
> >>> +module_param(max_iova_size, ulong, 0444);
> >>> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> >>> +
> >>> +static DEFINE_MUTEX(vduse_lock);
> >>> +static LIST_HEAD(vduse_devs);
> >>> +static DEFINE_IDA(vduse_ida);
> >>> +
> >>> +static dev_t vduse_major;
> >>> +static struct class *vduse_class;
> >>> +
> >>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> >>> +
> >>> +     return vdev->dev;
> >>> +}
> >>> +
> >>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> >>> +{
> >>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> >>> +
> >>> +     return vdpa_to_vduse(vdpa);
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> >>> +                                         uint32_t unique)
> >>> +{
> >>> +     struct vduse_dev_msg *tmp, *msg = NULL;
> >>> +
> >>> +     list_for_each_entry(tmp, head, list) {
> >>
> >> Shoudl we use list_for_each_entry_safe()?
> >>
> > Looks like list_for_each_entry() is ok here. We will break the loop
> > after deleting one node.
>
>
> Right.
>
>
> >
> >>> +             if (tmp->req.unique == unique) {
> >>> +                     msg = tmp;
> >>> +                     list_del(&tmp->list);
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = NULL;
> >>> +
> >>> +     if (!list_empty(head)) {
> >>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> >>> +             list_del(&msg->list);
> >>> +     }
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static void vduse_enqueue_msg(struct list_head *head,
> >>> +                           struct vduse_dev_msg *msg)
> >>> +{
> >>> +     list_add_tail(&msg->list, head);
> >>> +}
> >>> +
> >>> +static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     init_waitqueue_head(&msg->waitq);
> >>> +     mutex_lock(&dev->msg_lock);
> >>> +     vduse_enqueue_msg(&dev->send_list, msg);
> >>> +     wake_up(&dev->waitq);
> >>> +     mutex_unlock(&dev->msg_lock);
> >>> +     ret = wait_event_interruptible(msg->waitq, msg->completed);
> >>> +     mutex_lock(&dev->msg_lock);
> >>> +     if (!msg->completed)
> >>> +             list_del(&msg->list);
> >>> +     else
> >>> +             ret = msg->resp.result;
> >>> +     mutex_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_GET_FEATURES;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.features;
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_FEATURES;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.features = features;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_GET_STATUS;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.status;
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_STATUS;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.status = status;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> >>> +                                     void *buf, unsigned int len)
> >>
> >> Btw, the ident looks odd here and other may places wherhe functions has
> >> more than one line of arguments.
> >>
> > OK, will fix it.
> >
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     unsigned int sz;
> >>> +
> >>> +     while (len) {
> >>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> >>> +             msg.req.type = VDUSE_GET_CONFIG;
> >>> +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +             msg.req.config.offset = offset;
> >>> +             msg.req.config.len = sz;
> >>> +             vduse_dev_msg_sync(dev, &msg);
> >>> +             memcpy(buf, msg.resp.config.data, sz);
> >>> +             buf += sz;
> >>> +             offset += sz;
> >>> +             len -= sz;
> >>> +     }
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> >>> +                                     const void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     unsigned int sz;
> >>> +
> >>> +     while (len) {
> >>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> >>> +             msg.req.type = VDUSE_SET_CONFIG;
> >>> +             msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +             msg.req.config.offset = offset;
> >>> +             msg.req.config.len = sz;
> >>> +             memcpy(msg.req.config.data, buf, sz);
> >>> +             vduse_dev_msg_sync(dev, &msg);
> >>> +             buf += sz;
> >>> +             offset += sz;
> >>> +             len -= sz;
> >>> +     }
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq, u32 num)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_NUM;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.vq_num.index = vq->index;
> >>> +     msg.req.vq_num.num = num;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq, u64 desc_addr,
> >>> +                             u64 driver_addr, u64 device_addr)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_ADDR;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.vq_addr.index = vq->index;
> >>> +     msg.req.vq_addr.desc_addr = desc_addr;
> >>> +     msg.req.vq_addr.driver_addr = driver_addr;
> >>> +     msg.req.vq_addr.device_addr = device_addr;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq, bool ready)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_READY;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.vq_ready.index = vq->index;
> >>> +     msg.req.vq_ready.ready = ready;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> >>> +                                struct vduse_virtqueue *vq)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_GET_VQ_READY;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.vq_ready.index = vq->index;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> >>> +}
> >>> +
> >>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     int ret;
> >>> +
> >>> +     msg.req.type = VDUSE_GET_VQ_STATE;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.vq_state.index = vq->index;
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, &msg);
> >>> +     if (!ret)
> >>> +             state->avail_index = msg.resp.vq_state.avail_idx;
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_STATE;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.vq_state.index = vq->index;
> >>> +     msg.req.vq_state.avail_idx = state->avail_index;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> >>> +                             u64 start, u64 last)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     if (last < start)
> >>> +             return -EINVAL;
> >>> +
> >>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
> >>> +     msg.req.unique = atomic64_fetch_inc(&dev->msg_unique);
> >>> +     msg.req.iova.start = start;
> >>> +     msg.req.iova.last = last;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int size = sizeof(struct vduse_dev_request);
> >>> +     ssize_t ret = 0;
> >>> +
> >>> +     if (iov_iter_count(to) < size)
> >>> +             return 0;
> >>> +
> >>> +     mutex_lock(&dev->msg_lock);
> >>> +     while (1) {
> >>> +             msg = vduse_dequeue_msg(&dev->send_list);
> >>> +             if (msg)
> >>> +                     break;
> >>> +
> >>> +             ret = -EAGAIN;
> >>> +             if (file->f_flags & O_NONBLOCK)
> >>> +                     goto unlock;
> >>> +
> >>> +             mutex_unlock(&dev->msg_lock);
> >>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
> >>> +                                     !list_empty(&dev->send_list));
> >>> +             if (ret)
> >>> +                     return ret;
> >>> +
> >>> +             mutex_lock(&dev->msg_lock);
> >>> +     }
> >>> +     ret = copy_to_iter(&msg->req, size, to);
> >>> +     if (ret != size) {
> >>> +             ret = -EFAULT;
> >>> +             vduse_enqueue_msg(&dev->send_list, msg);
> >>> +             goto unlock;
> >>> +     }
> >>> +     vduse_enqueue_msg(&dev->recv_list, msg);
> >>> +unlock:
> >>> +     mutex_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     size_t ret;
> >>> +
> >>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
> >>> +     if (ret != sizeof(resp))
> >>> +             return -EINVAL;
> >>> +
> >>> +     mutex_lock(&dev->msg_lock);
> >>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> >>> +     if (!msg) {
> >>> +             ret = -EINVAL;
> >>> +             goto unlock;
> >>> +     }
> >>> +
> >>> +     memcpy(&msg->resp, &resp, sizeof(resp));
> >>> +     msg->completed = 1;
> >>> +     wake_up(&msg->waitq);
> >>> +unlock:
> >>> +     mutex_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     __poll_t mask = 0;
> >>> +
> >>> +     poll_wait(file, &dev->waitq, wait);
> >>> +
> >>> +     if (!list_empty(&dev->send_list))
> >>> +             mask |= EPOLLIN | EPOLLRDNORM;
> >>> +
> >>> +     return mask;
> >>> +}
> >>> +
> >>> +static int vduse_iotlb_add_range(struct vduse_dev *dev,
> >>> +                              u64 start, u64 last,
> >>> +                              u64 addr, unsigned int perm,
> >>> +                              struct file *file, u64 offset)
> >>> +{
> >>> +     struct vdpa_map_file *map_file;
> >>> +     int ret;
> >>> +
> >>> +     map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
> >>> +     if (!map_file)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     map_file->file = get_file(file);
> >>> +     map_file->offset = offset;
> >>> +
> >>> +     spin_lock(&dev->iommu_lock);
> >>> +     ret = vhost_iotlb_add_range_ctx(dev->iommu, start, last,
> >>> +                                     addr, perm, map_file);
> >>> +     spin_unlock(&dev->iommu_lock);
> >>> +     if (ret) {
> >>> +             fput(map_file->file);
> >>> +             kfree(map_file);
> >>> +             return ret;
> >>> +     }
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_iotlb_del_range(struct vduse_dev *dev, u64 start, u64 last)
> >>> +{
> >>> +     struct vdpa_map_file *map_file;
> >>> +     struct vhost_iotlb_map *map;
> >>> +
> >>> +     spin_lock(&dev->iommu_lock);
> >>> +     while ((map = vhost_iotlb_itree_first(dev->iommu, start, last))) {
> >>> +             map_file = (struct vdpa_map_file *)map->opaque;
> >>> +             fput(map_file->file);
> >>> +             kfree(map_file);
> >>> +             vhost_iotlb_map_free(dev->iommu, map);
> >>> +     }
> >>> +     spin_unlock(&dev->iommu_lock);
> >>> +}
> >>> +
> >>> +static void vduse_dev_reset(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     atomic_set(&dev->bounce_map, 0);
> >>> +     vduse_iotlb_del_range(dev, 0ULL, ULLONG_MAX);
> >>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             vq->ready = false;
> >>> +             vq->cb.callback = NULL;
> >>> +             vq->cb.private = NULL;
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >>> +                             u64 desc_area, u64 driver_area,
> >>> +                             u64 device_area)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
> >>> +                                     driver_area, device_area);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->ready && vq->kickfd)
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> >>> +                           struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->irq_lock);
> >>> +     vq->cb.callback = cb->callback;
> >>> +     vq->cb.private = cb->private;
> >>> +     spin_unlock(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vduse_dev_set_vq_num(dev, vq, num);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> >>> +                                     u16 idx, bool ready)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vduse_dev_set_vq_ready(dev, vq, ready);
> >>> +     vq->ready = ready;
> >>> +}
> >>> +
> >>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
> >>> +
> >>> +     return vq->ready;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_set_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_get_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_align;
> >>> +}
> >>> +
> >>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     u64 fixed = (1ULL << VIRTIO_F_ACCESS_PLATFORM);
> >>> +
> >>> +     return (vduse_dev_get_features(dev) | fixed);
> >>
> >> What happens if we don't do such fixup. I think we should fail if
> >> usersapce doesnt offer ACCESS_PLATFORM instead.
> >>
> > Make sense.
> >
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return vduse_dev_set_features(dev, features);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> >>> +                               struct vdpa_callback *cb)
> >>> +{
> >>> +     /* We don't support config interrupt */
> >>> +}
> >>> +
> >>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_size_max;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->device_id;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vendor_id;
> >>> +}
> >>> +
> >>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return vduse_dev_get_status(dev);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     if (status == 0)
> >>> +             vduse_dev_reset(dev);
> >>> +
> >>> +     vduse_dev_set_status(dev, status);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                          void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     vduse_dev_get_config(dev, offset, buf, len);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                     const void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     vduse_dev_set_config(dev, offset, buf, len);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> >>> +                             struct vhost_iotlb *iotlb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vhost_iotlb_map *map;
> >>> +     struct vdpa_map_file *map_file;
> >>> +     u64 start = 0ULL, last = ULLONG_MAX;
> >>> +     int ret = 0;
> >>> +
> >>> +     vduse_iotlb_del_range(dev, start, last);
> >>> +
> >>> +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> >>> +             map = vhost_iotlb_itree_next(map, start, last)) {
> >>> +             map_file = (struct vdpa_map_file *)map->opaque;
> >>> +             if (!map_file->file)
> >>> +                     continue;
> >>> +
> >>> +             ret = vduse_iotlb_add_range(dev, map->start, map->last,
> >>> +                                         map->addr, map->perm,
> >>> +                                         map_file->file,
> >>> +                                         map_file->offset);
> >>> +             if (ret)
> >>> +                     break;
> >>> +     }
> >>> +     vduse_dev_update_iotlb(dev, start, last);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     WARN_ON(!list_empty(&dev->send_list));
> >>> +     WARN_ON(!list_empty(&dev->recv_list));
> >>> +     dev->vdev = NULL;
> >>> +}
> >>> +
> >>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
> >>> +     .kick_vq                = vduse_vdpa_kick_vq,
> >>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> >>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
> >>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> >>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> >>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
> >>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
> >>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
> >>> +     .get_features           = vduse_vdpa_get_features,
> >>> +     .set_features           = vduse_vdpa_set_features,
> >>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
> >>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> >>> +     .get_device_id          = vduse_vdpa_get_device_id,
> >>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> >>> +     .get_status             = vduse_vdpa_get_status,
> >>> +     .set_status             = vduse_vdpa_set_status,
> >>> +     .get_config             = vduse_vdpa_get_config,
> >>> +     .set_config             = vduse_vdpa_set_config,
> >>> +     .set_map                = vduse_vdpa_set_map,
> >>> +     .free                   = vduse_vdpa_free,
> >>> +};
> >>> +
> >>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> >>> +                                     unsigned long offset, size_t size,
> >>> +                                     enum dma_data_direction dir,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     if (atomic_xchg(&vdev->bounce_map, 1) == 0 &&
> >>> +             vduse_iotlb_add_range(vdev, 0, domain->bounce_size - 1,
> >>> +                                   0, VDUSE_ACCESS_RW,
> >>> +                                   vduse_domain_file(domain), 0)) {
> >>> +             atomic_set(&vdev->bounce_map, 0);
> >>> +             return DMA_MAPPING_ERROR;
> >>
> >> Can we add the bounce mapping page by page here?
> >>
> > Do you mean mapping the bounce buffer to user space page by page? If
> > so, userspace needs to call lots of mmap() for that.
>
>
> I get this.
>
>
> >
> >>> +     }
> >>> +
> >>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> >>> +                                     dma_addr_t *dma_addr, gfp_t flag,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long iova;
> >>> +     void *addr;
> >>> +
> >>> +     *dma_addr = DMA_MAPPING_ERROR;
> >>> +     addr = vduse_domain_alloc_coherent(domain, size,
> >>> +                             (dma_addr_t *)&iova, flag, attrs);
> >>> +     if (!addr)
> >>> +             return NULL;
> >>> +
> >>> +     if (vduse_iotlb_add_range(vdev, iova, iova + size - 1,
> >>> +                               iova, VDUSE_ACCESS_RW,
> >>> +                               vduse_domain_file(domain), iova)) {
> >>> +             vduse_domain_free_coherent(domain, size, addr, iova, attrs);
> >>> +             return NULL;
> >>> +     }
> >>> +     *dma_addr = (dma_addr_t)iova;
> >>> +
> >>> +     return addr;
> >>> +}
> >>> +
> >>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> >>> +                                     void *vaddr, dma_addr_t dma_addr,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long start = (unsigned long)dma_addr;
> >>> +     unsigned long last = start + size - 1;
> >>> +
> >>> +     vduse_iotlb_del_range(vdev, start, last);
> >>> +     vduse_dev_update_iotlb(vdev, start, last);
> >>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> >>> +}
> >>> +
> >>> +static const struct dma_map_ops vduse_dev_dma_ops = {
> >>> +     .map_page = vduse_dev_map_page,
> >>> +     .unmap_page = vduse_dev_unmap_page,
> >>> +     .alloc = vduse_dev_alloc_coherent,
> >>> +     .free = vduse_dev_free_coherent,
> >>> +};
> >>> +
> >>> +static unsigned int perm_to_file_flags(u8 perm)
> >>> +{
> >>> +     unsigned int flags = 0;
> >>> +
> >>> +     switch (perm) {
> >>> +     case VDUSE_ACCESS_WO:
> >>> +             flags |= O_WRONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RO:
> >>> +             flags |= O_RDONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RW:
> >>> +             flags |= O_RDWR;
> >>> +             break;
> >>> +     default:
> >>> +             WARN(1, "invalidate vhost IOTLB permission\n");
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return flags;
> >>> +}
> >>> +
> >>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct eventfd_ctx *ctx = NULL;
> >>> +     struct vduse_virtqueue *vq;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     if (eventfd->fd > 0) {
> >>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
> >>> +             if (IS_ERR(ctx))
> >>> +                     return PTR_ERR(ctx);
> >>> +     }
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->kickfd)
> >>> +             eventfd_ctx_put(vq->kickfd);
> >>> +     vq->kickfd = ctx;
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >>> +                     unsigned long arg)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +     int ret;
> >>> +
> >>> +     switch (cmd) {
> >>> +     case VDUSE_IOTLB_GET_FD: {
> >>> +             struct vduse_iotlb_entry entry;
> >>> +             struct vhost_iotlb_map *map;
> >>> +             struct vdpa_map_file *map_file;
> >>> +             struct file *f = NULL;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
> >>> +                     break;
> >>> +
> >>> +             spin_lock(&dev->iommu_lock);
> >>> +             map = vhost_iotlb_itree_first(dev->iommu, entry.start,
> >>> +                                           entry.last);
> >>> +             if (map) {
> >>> +                     map_file = (struct vdpa_map_file *)map->opaque;
> >>> +                     f = get_file(map_file->file);
> >>> +                     entry.offset = map_file->offset;
> >>> +                     entry.start = map->start;
> >>> +                     entry.last = map->last;
> >>> +                     entry.perm = map->perm;
> >>> +             }
> >>> +             spin_unlock(&dev->iommu_lock);
> >>> +             if (!f) {
> >>> +                     ret = -EINVAL;
> >>> +                     break;
> >>> +             }
> >>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> >>> +                     fput(f);
> >>> +                     ret = -EFAULT;
> >>> +                     break;
> >>> +             }
> >>> +             ret = get_unused_fd_flags(perm_to_file_flags(entry.perm));
> >>> +             if (ret < 0) {
> >>> +                     fput(f);
> >>> +                     break;
> >>> +             }
> >>> +             fd_install(ret, f);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_VQ_SETUP_KICKFD: {
> >>> +             struct vduse_vq_eventfd eventfd;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_kickfd_setup(dev, &eventfd);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_INJECT_VQ_IRQ: {
> >>> +             struct vduse_virtqueue *vq;
> >>> +
> >>> +             ret = -EINVAL;
> >>> +             if (arg >= dev->vq_num)
> >>> +                     break;
> >>> +
> >>> +             vq = &dev->vqs[arg];
> >>> +             spin_lock_irq(&vq->irq_lock);
> >>> +             if (vq->ready && vq->cb.callback) {
> >>> +                     vq->cb.callback(vq->cb.private);
> >>> +                     ret = 0;
> >>> +             }
> >>> +             spin_unlock_irq(&vq->irq_lock);
> >>> +             break;
> >>> +     }
> >>> +     default:
> >>> +             ret = -ENOIOCTLCMD;
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_dev_release(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->kick_lock);
> >>> +             if (vq->kickfd)
> >>> +                     eventfd_ctx_put(vq->kickfd);
> >>> +             vq->kickfd = NULL;
> >>> +             spin_unlock(&vq->kick_lock);
> >>> +     }
> >>> +
> >>> +     mutex_lock(&dev->msg_lock);
> >>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> >>> +             vduse_enqueue_msg(&dev->send_list, msg);
> >>> +     mutex_unlock(&dev->msg_lock);
> >>> +
> >>> +     dev->connected = false;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_dev_open(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_dev *dev = container_of(inode->i_cdev,
> >>> +                                     struct vduse_dev, cdev);
> >>> +     int ret = -EBUSY;
> >>> +
> >>> +     mutex_lock(&vduse_lock);
> >>> +     if (dev->connected)
> >>> +             goto unlock;
> >>> +
> >>> +     ret = 0;
> >>> +     dev->connected = true;
> >>> +     file->private_data = dev;
> >>> +unlock:
> >>> +     mutex_unlock(&vduse_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_dev_fops = {
> >>> +     .owner          = THIS_MODULE,
> >>> +     .open           = vduse_dev_open,
> >>> +     .release        = vduse_dev_release,
> >>> +     .read_iter      = vduse_dev_read_iter,
> >>> +     .write_iter     = vduse_dev_write_iter,
> >>> +     .poll           = vduse_dev_poll,
> >>> +     .unlocked_ioctl = vduse_dev_ioctl,
> >>> +     .compat_ioctl   = compat_ptr_ioctl,
> >>> +     .llseek         = noop_llseek,
> >>> +};
> >>> +
> >>> +static struct vduse_dev *vduse_dev_create(void)
> >>> +{
> >>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> >>> +
> >>> +     if (!dev)
> >>> +             return NULL;
> >>> +
> >>> +     dev->iommu = vhost_iotlb_alloc(0, 0);
> >>> +     if (!dev->iommu) {
> >>> +             kfree(dev);
> >>> +             return NULL;
> >>> +     }
> >>> +
> >>> +     mutex_init(&dev->msg_lock);
> >>> +     INIT_LIST_HEAD(&dev->send_list);
> >>> +     INIT_LIST_HEAD(&dev->recv_list);
> >>> +     atomic64_set(&dev->msg_unique, 0);
> >>> +     spin_lock_init(&dev->iommu_lock);
> >>> +     atomic_set(&dev->bounce_map, 0);
> >>> +
> >>> +     init_waitqueue_head(&dev->waitq);
> >>> +
> >>> +     return dev;
> >>> +}
> >>> +
> >>> +static void vduse_dev_destroy(struct vduse_dev *dev)
> >>> +{
> >>> +     vhost_iotlb_free(dev->iommu);
> >>> +     mutex_destroy(&dev->msg_lock);
> >>> +     kfree(dev);
> >>> +}
> >>> +
> >>> +static struct vduse_dev *vduse_find_dev(const char *name)
> >>> +{
> >>> +     struct vduse_dev *tmp, *dev = NULL;
> >>> +
> >>> +     list_for_each_entry(tmp, &vduse_devs, list) {
> >>> +             if (!strcmp(dev_name(&tmp->dev), name)) {
> >>> +                     dev = tmp;
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +     return dev;
> >>> +}
> >>> +
> >>> +static int vduse_destroy_dev(char *name)
> >>> +{
> >>> +     struct vduse_dev *dev = vduse_find_dev(name);
> >>> +
> >>> +     if (!dev)
> >>> +             return -EINVAL;
> >>> +
> >>> +     if (dev->vdev || dev->connected)
> >>> +             return -EBUSY;
> >>> +
> >>> +     dev->connected = true;
> >>> +     list_del(&dev->list);
> >>> +     cdev_device_del(&dev->cdev, &dev->dev);
> >>> +     put_device(&dev->dev);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_release_dev(struct device *device)
> >>> +{
> >>> +     struct vduse_dev *dev =
> >>> +             container_of(device, struct vduse_dev, dev);
> >>> +
> >>> +     ida_simple_remove(&vduse_ida, dev->minor);
> >>> +     kfree(dev->vqs);
> >>> +     vduse_domain_destroy(dev->domain);
> >>> +     vduse_dev_destroy(dev);
> >>> +     module_put(THIS_MODULE);
> >>> +}
> >>> +
> >>> +static int vduse_create_dev(struct vduse_dev_config *config)
> >>> +{
> >>> +     int i, ret = -ENOMEM;
> >>> +     struct vduse_dev *dev;
> >>> +
> >>> +     if (config->bounce_size > max_bounce_size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     if (config->bounce_size > max_iova_size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     if (vduse_find_dev(config->name))
> >>> +             return -EEXIST;
> >>> +
> >>> +     dev = vduse_dev_create();
> >>> +     if (!dev)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     dev->device_id = config->device_id;
> >>> +     dev->vendor_id = config->vendor_id;
> >>> +     dev->domain = vduse_domain_create(max_iova_size - 1,
> >>> +                                     config->bounce_size);
> >>> +     if (!dev->domain)
> >>> +             goto err_domain;
> >>> +
> >>> +     dev->vq_align = config->vq_align;
> >>> +     dev->vq_size_max = config->vq_size_max;
> >>> +     dev->vq_num = config->vq_num;
> >>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> >>> +     if (!dev->vqs)
> >>> +             goto err_vqs;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             dev->vqs[i].index = i;
> >>> +             spin_lock_init(&dev->vqs[i].kick_lock);
> >>> +             spin_lock_init(&dev->vqs[i].irq_lock);
> >>> +     }
> >>> +
> >>> +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> >>> +     if (ret < 0)
> >>> +             goto err_ida;
> >>> +
> >>> +     dev->minor = ret;
> >>> +     device_initialize(&dev->dev);
> >>> +     dev->dev.release = vduse_release_dev;
> >>> +     dev->dev.class = vduse_class;
> >>> +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> >>> +     ret = dev_set_name(&dev->dev, "%s", config->name);
> >>
> >> Do we need to add a namespce here? E.g "vduse-%s", config->name.
> >>
> > Actually we already have a parent dir "/dev/vduse/" for it.
>
>
> Oh, right. Then it should be fine.
>
>
> >
> >>> +     if (ret)
> >>> +             goto err_name;
> >>> +
> >>> +     cdev_init(&dev->cdev, &vduse_dev_fops);
> >>> +     dev->cdev.owner = THIS_MODULE;
> >>> +
> >>> +     ret = cdev_device_add(&dev->cdev, &dev->dev);
> >>> +     if (ret) {
> >>> +             put_device(&dev->dev);
> >>> +             return ret;
> >>> +     }
> >>> +     list_add(&dev->list, &vduse_devs);
> >>> +     __module_get(THIS_MODULE);
> >>> +
> >>> +     return 0;
> >>> +err_name:
> >>> +     ida_simple_remove(&vduse_ida, dev->minor);
> >>> +err_ida:
> >>> +     kfree(dev->vqs);
> >>> +err_vqs:
> >>> +     vduse_domain_destroy(dev->domain);
> >>> +err_domain:
> >>> +     vduse_dev_destroy(dev);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> >>> +                     unsigned long arg)
> >>> +{
> >>> +     int ret;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +
> >>> +     mutex_lock(&vduse_lock);
> >>> +     switch (cmd) {
> >>> +     case VDUSE_CREATE_DEV: {
> >>> +             struct vduse_dev_config config;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&config, argp, sizeof(config)))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_create_dev(&config);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_DESTROY_DEV: {
> >>> +             char name[VDUSE_NAME_MAX];
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_destroy_dev(name);
> >>> +             break;
> >>> +     }
> >>> +     default:
> >>> +             ret = -EINVAL;
> >>> +             break;
> >>> +     }
> >>> +     mutex_unlock(&vduse_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_fops = {
> >>> +     .owner          = THIS_MODULE,
> >>> +     .unlocked_ioctl = vduse_ioctl,
> >>> +     .compat_ioctl   = compat_ptr_ioctl,
> >>> +     .llseek         = noop_llseek,
> >>> +};
> >>> +
> >>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> >>> +{
> >>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> >>> +}
> >>> +
> >>> +static struct miscdevice vduse_misc = {
> >>> +     .fops = &vduse_fops,
> >>> +     .minor = MISC_DYNAMIC_MINOR,
> >>> +     .name = "vduse",
> >>> +     .nodename = "vduse/control",
> >>> +};
> >>> +
> >>> +static void vduse_mgmtdev_release(struct device *dev)
> >>> +{
> >>> +}
> >>> +
> >>> +static struct device vduse_mgmtdev = {
> >>> +     .init_name = "vduse",
> >>> +     .release = vduse_mgmtdev_release,
> >>> +};
> >>> +
> >>> +static struct vdpa_mgmt_dev mgmt_dev;
> >>> +
> >>> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> >>> +{
> >>> +     struct vduse_vdpa *vdev = dev->vdev;
> >>> +     int ret;
> >>> +
> >>> +     if (vdev)
> >>> +             return -EEXIST;
> >>> +
> >>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, NULL,
> >>
> >> I think the char dev should be used as the parent here.
> >>
> > Agree.
> >
> >>> +                              &vduse_vdpa_config_ops,
> >>> +                              dev->vq_num, name, true);
> >>> +     if (!vdev)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     vdev->dev = dev;
> >>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> >>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> >>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> >>> +     vdev->vdpa.mdev = &mgmt_dev;
> >>> +
> >>> +     ret = _vdpa_register_device(&vdev->vdpa);
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     dev->vdev = vdev;
> >>> +
> >>> +     return 0;
> >>> +err:
> >>> +     put_device(&vdev->vdpa.dev);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> >>> +{
> >>> +     struct vduse_dev *dev;
> >>> +     int ret = -EINVAL;
> >>> +
> >>> +     mutex_lock(&vduse_lock);
> >>> +     dev = vduse_find_dev(name);
> >>> +     if (!dev)
> >>> +             goto unlock;
> >>
> >> Any reason for this check? I think vdpa core layer has already did for
> >> the name check for us.
> >>
> > We need to check whether the vduse device with the name is created.
>
>
> Yes.
>
>
> >
> >>> +
> >>> +     ret = vduse_dev_add_vdpa(dev, name);
> >>> +unlock:
> >>> +     mutex_unlock(&vduse_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> >>> +{
> >>> +     _vdpa_unregister_device(dev);
> >>> +}
> >>> +
> >>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> >>> +     .dev_add = vdpa_dev_add,
> >>> +     .dev_del = vdpa_dev_del,
> >>> +};
> >>> +
> >>> +static struct virtio_device_id id_table[] = {
> >>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> >>> +     { 0 },
> >>> +};
> >>> +
> >>> +static struct vdpa_mgmt_dev mgmt_dev = {
> >>> +     .device = &vduse_mgmtdev,
> >>> +     .id_table = id_table,
> >>> +     .ops = &vdpa_dev_mgmtdev_ops,
> >>> +};
> >>> +
> >>> +static int vduse_mgmtdev_init(void)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     ret = device_register(&vduse_mgmtdev);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     return 0;
> >>> +err:
> >>> +     device_unregister(&vduse_mgmtdev);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_mgmtdev_exit(void)
> >>> +{
> >>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
> >>> +     device_unregister(&vduse_mgmtdev);
> >>> +}
> >>> +
> >>> +static int vduse_init(void)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     ret = misc_register(&vduse_misc);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     vduse_class = class_create(THIS_MODULE, "vduse");
> >>> +     if (IS_ERR(vduse_class)) {
> >>> +             ret = PTR_ERR(vduse_class);
> >>> +             goto err_class;
> >>> +     }
> >>> +     vduse_class->devnode = vduse_devnode;
> >>> +
> >>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> >>> +     if (ret)
> >>> +             goto err_chardev;
> >>> +
> >>> +     ret = vduse_domain_init();
> >>> +     if (ret)
> >>> +             goto err_domain;
> >>> +
> >>> +     ret = vduse_mgmtdev_init();
> >>> +     if (ret)
> >>> +             goto err_mgmtdev;
> >>
> >> Should we validate max_bounce_size < max_iova_size here?
> >>
> > Sure.
> >
> >>
> >>> +
> >>> +     return 0;
> >>> +err_mgmtdev:
> >>> +     vduse_domain_exit();
> >>> +err_domain:
> >>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> >>> +err_chardev:
> >>> +     class_destroy(vduse_class);
> >>> +err_class:
> >>> +     misc_deregister(&vduse_misc);
> >>> +     return ret;
> >>> +}
> >>> +module_init(vduse_init);
> >>> +
> >>> +static void vduse_exit(void)
> >>> +{
> >>> +     misc_deregister(&vduse_misc);
> >>> +     class_destroy(vduse_class);
> >>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> >>> +     vduse_domain_exit();
> >>> +     vduse_mgmtdev_exit();
> >>> +}
> >>> +module_exit(vduse_exit);
> >>> +
> >>> +MODULE_VERSION(DRV_VERSION);
> >>> +MODULE_LICENSE(DRV_LICENSE);
> >>> +MODULE_AUTHOR(DRV_AUTHOR);
> >>> +MODULE_DESCRIPTION(DRV_DESC);
> >>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> >>> new file mode 100644
> >>> index 000000000000..9391c4acfa53
> >>> --- /dev/null
> >>> +++ b/include/uapi/linux/vduse.h
> >>> @@ -0,0 +1,136 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >>> +#ifndef _UAPI_VDUSE_H_
> >>> +#define _UAPI_VDUSE_H_
> >>> +
> >>> +#include <linux/types.h>
> >>> +
> >>> +#define VDUSE_CONFIG_DATA_LEN        256
> >>> +#define VDUSE_NAME_MAX       256
> >>> +
> >>> +/* the control messages definition for read/write */
> >>> +
> >>> +enum vduse_req_type {
> >>> +     VDUSE_SET_VQ_NUM,
> >>> +     VDUSE_SET_VQ_ADDR,
> >>> +     VDUSE_SET_VQ_READY,
> >>> +     VDUSE_GET_VQ_READY,
> >>> +     VDUSE_SET_VQ_STATE,
> >>> +     VDUSE_GET_VQ_STATE,
> >>> +     VDUSE_SET_FEATURES,
> >>> +     VDUSE_GET_FEATURES,
> >>> +     VDUSE_SET_STATUS,
> >>> +     VDUSE_GET_STATUS,
> >>> +     VDUSE_SET_CONFIG,
> >>> +     VDUSE_GET_CONFIG,
> >>> +     VDUSE_UPDATE_IOTLB,
> >>> +};
> >>> +
> >>> +struct vduse_vq_num {
> >>> +     __u32 index;
> >>> +     __u32 num;
> >>> +};
> >>> +
> >>> +struct vduse_vq_addr {
> >>> +     __u32 index;
> >>> +     __u64 desc_addr;
> >>> +     __u64 driver_addr;
> >>> +     __u64 device_addr;
> >>> +};
> >>> +
> >>> +struct vduse_vq_ready {
> >>> +     __u32 index;
> >>> +     __u8 ready;
> >>> +};
> >>> +
> >>> +struct vduse_vq_state {
> >>> +     __u32 index;
> >>> +     __u16 avail_idx;
> >>> +};
> >>> +
> >>> +struct vduse_dev_config_data {
> >>> +     __u32 offset;
> >>> +     __u32 len;
> >>> +     __u8 data[VDUSE_CONFIG_DATA_LEN];
> >>> +};
> >>> +
> >>> +struct vduse_iova_range {
> >>> +     __u64 start;
> >>> +     __u64 last;
> >>> +};
> >>> +
> >>> +struct vduse_dev_request {
> >>> +     __u32 type; /* request type */
> >>> +     __u32 unique; /* request id */
> >>
> >> Let's simply use "request_id" here.
> >>
> > OK.
> >
> >>> +     __u32 reserved[2]; /* for feature use */
> >>> +     union {
> >>> +             struct vduse_vq_num vq_num; /* virtqueue num */
> >>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
> >>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> >>> +             struct vduse_vq_state vq_state; /* virtqueue state */
> >>> +             struct vduse_dev_config_data config; /* virtio device config space */
> >>> +             struct vduse_iova_range iova; /* iova range for updating */
> >>> +             __u64 features; /* virtio features */
> >>> +             __u8 status; /* device status */
> >>
> >> It might be better to use struct for feaures and status as well for
> >> consistency.
> >>
> > OK.
> >
> >> And to be safe, let's add explicity padding here.
> >>
> > Do you mean add padding for the union?
>
>
> I think so.
>
>
> >
> >>> +     };
> >>> +};
> >>> +
> >>> +struct vduse_dev_response {
> >>> +     __u32 request_id; /* corresponding request id */
> >>> +#define VDUSE_REQUEST_OK     0x00
> >>> +#define VDUSE_REQUEST_FAILED 0x01
> >>> +     __u8 result; /* the result of request */
> >>> +     __u8 reserved[11]; /* for feature use */
> >>
> >> Looks like this will be a hole which is similar to
> >> 429711aec282c4b5fe5bbd7b2f0bbbff4110ffb2. Need to make sure the reserved
> >> end at 8 byte boundary.
> >>
> > Will fix it.
> >
> >>> +     union {
> >>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> >>> +             struct vduse_vq_state vq_state; /* virtqueue state */
> >>> +             struct vduse_dev_config_data config; /* virtio device config space */
> >>> +             __u64 features; /* virtio features */
> >>> +             __u8 status; /* device status */
> >>> +     };
> >>> +};
> >>> +
> >>> +/* ioctls */
> >>> +
> >>> +struct vduse_dev_config {
> >>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
> >>> +     __u32 vendor_id; /* virtio vendor id */
> >>> +     __u32 device_id; /* virtio device id */
> >>> +     __u64 bounce_size; /* bounce buffer size for iommu */
> >>> +     __u16 vq_num; /* the number of virtqueues */
> >>> +     __u16 vq_size_max; /* the max size of virtqueue */
> >>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> >>> +};
> >>> +
> >>> +struct vduse_iotlb_entry {
> >>> +     __u64 offset; /* the mmap offset on fd */
> >>> +     __u64 start; /* start of the IOVA range */
> >>> +     __u64 last; /* last of the IOVA range */
> >>> +#define VDUSE_ACCESS_RO 0x1
> >>> +#define VDUSE_ACCESS_WO 0x2
> >>> +#define VDUSE_ACCESS_RW 0x3
> >>> +     __u8 perm; /* access permission of this range */
> >>> +};
> >>> +
> >>> +struct vduse_vq_eventfd {
> >>> +     __u32 index; /* virtqueue index */
> >>> +     int fd; /* eventfd, -1 means de-assigning the eventfd */
> >>
> >> Let's define a macro for this.
> >>
> > OK.
> >
> >>> +};
> >>> +
> >>> +#define VDUSE_BASE   0x81
> >>> +
> >>> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> >>> +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> >>> +
> >>> +/* Destroy a vduse device. Make sure there are no references to the char device */
> >>> +#define VDUSE_DESTROY_DEV    _IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
> >>> +
> >>> +/* Get a file descriptor for the mmap'able iova region */
> >>> +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
> >>> +
> >>> +/* Setup an eventfd to receive kick for virtqueue */
> >>> +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
> >>> +
> >>> +/* Inject an interrupt for specific virtqueue */
> >>> +#define VDUSE_INJECT_VQ_IRQ  _IO(VDUSE_BASE, 0x05)
> >>
> >> I wonder do we need a version/feature handshake that is for future
> >> extension instead of just reserve bits in uABI? E.g VDUSE_GET_VERSION ...
> >>
> > Agree. Will do it in v5.
>
>
> Btw, I think you can remove "RFC" then Michael can consider to merge the
> series.
>

Fine.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-05  3:35         ` Jason Wang
  (?)
@ 2021-03-05  6:15         ` Yongji Xie
  2021-03-05  6:51           ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  6:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 11:36 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/4 1:12 下午, Yongji Xie wrote:
> > On Thu, Mar 4, 2021 at 12:21 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>> This implements a MMU-based IOMMU driver to support mapping
> >>> kernel dma buffer into userspace. The basic idea behind it is
> >>> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
> >>> up MMU mapping instead of IOMMU mapping for the DMA transfer so
> >>> that the userspace process is able to use its virtual address to
> >>> access the dma buffer in kernel.
> >>>
> >>> And to avoid security issue, a bounce-buffering mechanism is
> >>> introduced to prevent userspace accessing the original buffer
> >>> directly.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    drivers/vdpa/vdpa_user/iova_domain.c | 486 +++++++++++++++++++++++++++++++++++
> >>>    drivers/vdpa/vdpa_user/iova_domain.h |  61 +++++
> >>>    2 files changed, 547 insertions(+)
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >>>
> >>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> new file mode 100644
> >>> index 000000000000..9285d430d486
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> @@ -0,0 +1,486 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * MMU-based IOMMU implementation
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/slab.h>
> >>> +#include <linux/file.h>
> >>> +#include <linux/anon_inodes.h>
> >>> +#include <linux/highmem.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define IOVA_START_PFN 1
> >>> +#define IOVA_ALLOC_ORDER 12
> >>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> >>> +
> >>> +static inline struct page *
> >>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> >>> +{
> >>> +     u64 index = iova >> PAGE_SHIFT;
> >>> +
> >>> +     return domain->bounce_pages[index];
> >>> +}
> >>> +
> >>> +static inline void
> >>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             u64 iova, struct page *page)
> >>> +{
> >>> +     u64 index = iova >> PAGE_SHIFT;
> >>> +
> >>> +     domain->bounce_pages[index] = page;
> >>> +}
> >>> +
> >>> +static enum dma_data_direction perm_to_dir(int perm)
> >>> +{
> >>> +     enum dma_data_direction dir;
> >>> +
> >>> +     switch (perm) {
> >>> +     case VHOST_MAP_WO:
> >>> +             dir = DMA_FROM_DEVICE;
> >>> +             break;
> >>> +     case VHOST_MAP_RO:
> >>> +             dir = DMA_TO_DEVICE;
> >>> +             break;
> >>> +     case VHOST_MAP_RW:
> >>> +             dir = DMA_BIDIRECTIONAL;
> >>> +             break;
> >>> +     default:
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return dir;
> >>> +}
> >>> +
> >>> +static int dir_to_perm(enum dma_data_direction dir)
> >>> +{
> >>> +     int perm = -EFAULT;
> >>> +
> >>> +     switch (dir) {
> >>> +     case DMA_FROM_DEVICE:
> >>> +             perm = VHOST_MAP_WO;
> >>> +             break;
> >>> +     case DMA_TO_DEVICE:
> >>> +             perm = VHOST_MAP_RO;
> >>> +             break;
> >>> +     case DMA_BIDIRECTIONAL:
> >>> +             perm = VHOST_MAP_RW;
> >>> +             break;
> >>> +     default:
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return perm;
> >>> +}
> >>
> >> Let's move the above two helpers to vhost_iotlb.h so they could be used
> >> by other driver e.g (vpda_sim)
> >>
> > Sure.
> >
> >>> +
> >>> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
> >>> +                     enum dma_data_direction dir)
> >>> +{
> >>> +     unsigned long pfn = PFN_DOWN(orig);
> >>> +
> >>> +     if (PageHighMem(pfn_to_page(pfn))) {
> >>> +             unsigned int offset = offset_in_page(orig);
> >>> +             char *buffer;
> >>> +             unsigned int sz = 0;
> >>> +             unsigned long flags;
> >>> +
> >>> +             while (size) {
> >>> +                     sz = min_t(size_t, PAGE_SIZE - offset, size);
> >>> +
> >>> +                     local_irq_save(flags);
> >>> +                     buffer = kmap_atomic(pfn_to_page(pfn));
> >>> +                     if (dir == DMA_TO_DEVICE)
> >>> +                             memcpy(addr, buffer + offset, sz);
> >>> +                     else
> >>> +                             memcpy(buffer + offset, addr, sz);
> >>> +                     kunmap_atomic(buffer);
> >>> +                     local_irq_restore(flags);
> >>
> >> I wonder why we need to deal with highmem and irq flags explicitly like
> >> this. Doesn't kmap_atomic() will take care all of those?
> >>
> > Yes, irq flags is useless here. Will remove it.
> >
> >>> +
> >>> +                     size -= sz;
> >>> +                     pfn++;
> >>> +                     addr += sz;
> >>> +                     offset = 0;
> >>> +             }
> >>> +     } else if (dir == DMA_TO_DEVICE) {
> >>> +             memcpy(addr, phys_to_virt(orig), size);
> >>> +     } else {
> >>> +             memcpy(phys_to_virt(orig), addr, size);
> >>> +     }
> >>> +}
> >>> +
> >>> +static struct page *
> >>> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
> >>> +{
> >>> +     u64 start = iova & PAGE_MASK;
> >>> +     u64 last = start + PAGE_SIZE - 1;
> >>> +     struct vhost_iotlb_map *map;
> >>> +     struct page *page = NULL;
> >>> +
> >>> +     spin_lock(&domain->iotlb_lock);
> >>> +     map = vhost_iotlb_itree_first(domain->iotlb, start, last);
> >>> +     if (!map)
> >>> +             goto out;
> >>> +
> >>> +     page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
> >>> +     get_page(page);
> >>> +out:
> >>> +     spin_unlock(&domain->iotlb_lock);
> >>> +
> >>> +     return page;
> >>> +}
> >>> +
> >>> +static struct page *
> >>> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> >>> +{
> >>> +     u64 start = iova & PAGE_MASK;
> >>> +     u64 last = start + PAGE_SIZE - 1;
> >>> +     struct vhost_iotlb_map *map;
> >>> +     struct page *page = NULL, *new_page = alloc_page(GFP_KERNEL);
> >>> +
> >>> +     if (!new_page)
> >>> +             return NULL;
> >>> +
> >>> +     spin_lock(&domain->iotlb_lock);
> >>> +     if (!vhost_iotlb_itree_first(domain->iotlb, start, last)) {
> >>> +             __free_page(new_page);
> >>> +             goto out;
> >>> +     }
> >>> +     page = vduse_domain_get_bounce_page(domain, iova);
> >>> +     if (page) {
> >>> +             get_page(page);
> >>> +             __free_page(new_page);
> >>> +             goto out;
> >>> +     }
> >>> +     vduse_domain_set_bounce_page(domain, iova, new_page);
> >>> +     get_page(new_page);
> >>> +     page = new_page;
> >>> +
> >>> +     for (map = vhost_iotlb_itree_first(domain->iotlb, start, last); map;
> >>> +          map = vhost_iotlb_itree_next(map, start, last)) {
> >>> +             unsigned int src_offset = 0, dst_offset = 0;
> >>> +             phys_addr_t src;
> >>> +             void *dst;
> >>> +             size_t sz;
> >>> +
> >>> +             if (perm_to_dir(map->perm) == DMA_FROM_DEVICE)
> >>> +                     continue;
> >>> +
> >>> +             if (start > map->start)
> >>> +                     src_offset = start - map->start;
> >>> +             else
> >>> +                     dst_offset = map->start - start;
> >>> +
> >>> +             src = map->addr + src_offset;
> >>> +             dst = page_address(page) + dst_offset;
> >>> +             sz = min_t(size_t, map->size - src_offset,
> >>> +                             PAGE_SIZE - dst_offset);
> >>> +             do_bounce(src, dst, sz, DMA_TO_DEVICE);
> >>> +     }
> >>> +out:
> >>> +     spin_unlock(&domain->iotlb_lock);
> >>> +
> >>> +     return page;
> >>> +}
> >>> +
> >>> +static void
> >>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> >>> +                             u64 iova, size_t size)
> >>> +{
> >>> +     struct page *page;
> >>> +
> >>> +     spin_lock(&domain->iotlb_lock);
> >>> +     if (WARN_ON(vhost_iotlb_itree_first(domain->iotlb, iova,
> >>> +                                             iova + size - 1)))
> >>> +             goto out;
> >>> +
> >>> +     while (size > 0) {
> >>> +             page = vduse_domain_get_bounce_page(domain, iova);
> >>> +             if (page) {
> >>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
> >>> +                     __free_page(page);
> >>> +             }
> >>> +             size -= PAGE_SIZE;
> >>> +             iova += PAGE_SIZE;
> >>> +     }
> >>> +out:
> >>> +     spin_unlock(&domain->iotlb_lock);
> >>> +}
> >>> +
> >>> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> >>> +                             dma_addr_t iova, phys_addr_t orig,
> >>> +                             size_t size, enum dma_data_direction dir)
> >>> +{
> >>> +     unsigned int offset = offset_in_page(iova);
> >>> +
> >>> +     while (size) {
> >>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
> >>> +             size_t sz = min_t(size_t, PAGE_SIZE - offset, size);
> >>> +
> >>> +             WARN_ON(!p && dir == DMA_FROM_DEVICE);
> >>> +
> >>> +             if (p)
> >>> +                     do_bounce(orig, page_address(p) + offset, sz, dir);
> >>> +
> >>> +             size -= sz;
> >>> +             orig += sz;
> >>> +             iova += sz;
> >>> +             offset = 0;
> >>> +     }
> >>> +}
> >>> +
> >>> +static dma_addr_t vduse_domain_alloc_iova(struct iova_domain *iovad,
> >>> +                             unsigned long size, unsigned long limit)
> >>> +{
> >>> +     unsigned long shift = iova_shift(iovad);
> >>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> >>> +     unsigned long iova_pfn;
> >>> +
> >>> +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> >>> +             iova_len = roundup_pow_of_two(iova_len);
> >>> +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> >>> +
> >>> +     return iova_pfn << shift;
> >>> +}
> >>> +
> >>> +static void vduse_domain_free_iova(struct iova_domain *iovad,
> >>> +                             dma_addr_t iova, size_t size)
> >>> +{
> >>> +     unsigned long shift = iova_shift(iovad);
> >>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> >>> +
> >>> +     free_iova_fast(iovad, iova >> shift, iova_len);
> >>> +}
> >>> +
> >>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> >>> +                             struct page *page, unsigned long offset,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->stream_iovad;
> >>> +     unsigned long limit = domain->bounce_size - 1;
> >>> +     phys_addr_t pa = page_to_phys(page) + offset;
> >>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> >>> +     int ret;
> >>> +
> >>> +     if (!iova)
> >>> +             return DMA_MAPPING_ERROR;
> >>> +
> >>> +     spin_lock(&domain->iotlb_lock);
> >>> +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> >>> +                                 (u64)iova + size - 1,
> >>> +                                 pa, dir_to_perm(dir));
> >>> +     spin_unlock(&domain->iotlb_lock);
> >>> +     if (ret) {
> >>> +             vduse_domain_free_iova(iovad, iova, size);
> >>> +             return DMA_MAPPING_ERROR;
> >>> +     }
> >>> +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> >>> +             vduse_domain_bounce(domain, iova, pa, size, DMA_TO_DEVICE);
> >>> +
> >>> +     return iova;
> >>> +}
> >>> +
> >>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> >>> +                     dma_addr_t dma_addr, size_t size,
> >>> +                     enum dma_data_direction dir, unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->stream_iovad;
> >>> +     struct vhost_iotlb_map *map;
> >>> +     phys_addr_t pa;
> >>> +
> >>> +     spin_lock(&domain->iotlb_lock);
> >>> +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> >>> +                                   (u64)dma_addr + size - 1);
> >>> +     if (WARN_ON(!map)) {
> >>> +             spin_unlock(&domain->iotlb_lock);
> >>> +             return;
> >>> +     }
> >>> +     pa = map->addr;
> >>> +     vhost_iotlb_map_free(domain->iotlb, map);
> >>> +     spin_unlock(&domain->iotlb_lock);
> >>> +
> >>> +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> >>> +             vduse_domain_bounce(domain, dma_addr, pa,
> >>> +                                     size, DMA_FROM_DEVICE);
> >>> +
> >>> +     vduse_domain_free_iova(iovad, dma_addr, size);
> >>> +}
> >>> +
> >>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> >>> +                             size_t size, dma_addr_t *dma_addr,
> >>> +                             gfp_t flag, unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->consistent_iovad;
> >>> +     unsigned long limit = domain->iova_limit;
> >>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> >>> +     void *orig = alloc_pages_exact(size, flag);
> >>> +     int ret;
> >>> +
> >>> +     if (!iova || !orig)
> >>> +             goto err;
> >>> +
> >>> +     spin_lock(&domain->iotlb_lock);
> >>> +     ret = vhost_iotlb_add_range(domain->iotlb, (u64)iova,
> >>> +                                 (u64)iova + size - 1,
> >>> +                                 virt_to_phys(orig), VHOST_MAP_RW);
> >>> +     spin_unlock(&domain->iotlb_lock);
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     *dma_addr = iova;
> >>> +
> >>> +     return orig;
> >>> +err:
> >>> +     *dma_addr = DMA_MAPPING_ERROR;
> >>> +     if (orig)
> >>> +             free_pages_exact(orig, size);
> >>> +     if (iova)
> >>> +             vduse_domain_free_iova(iovad, iova, size);
> >>> +
> >>> +     return NULL;
> >>> +}
> >>> +
> >>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> >>> +                             void *vaddr, dma_addr_t dma_addr,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct iova_domain *iovad = &domain->consistent_iovad;
> >>> +     struct vhost_iotlb_map *map;
> >>> +     phys_addr_t pa;
> >>> +
> >>> +     spin_lock(&domain->iotlb_lock);
> >>> +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> >>> +                                   (u64)dma_addr + size - 1);
> >>> +     if (WARN_ON(!map)) {
> >>> +             spin_unlock(&domain->iotlb_lock);
> >>> +             return;
> >>> +     }
> >>> +     pa = map->addr;
> >>> +     vhost_iotlb_map_free(domain->iotlb, map);
> >>> +     spin_unlock(&domain->iotlb_lock);
> >>> +
> >>> +     vduse_domain_free_iova(iovad, dma_addr, size);
> >>> +     free_pages_exact(phys_to_virt(pa), size);
> >>> +}
> >>> +
> >>> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> >>> +{
> >>> +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> >>> +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> >>> +     struct page *page;
> >>> +
> >>> +     if (!domain)
> >>> +             return VM_FAULT_SIGBUS;
> >>> +
> >>> +     if (iova < domain->bounce_size)
> >>> +             page = vduse_domain_alloc_bounce_page(domain, iova);
> >>> +     else
> >>> +             page = vduse_domain_get_mapping_page(domain, iova);
> >>> +
> >>> +     if (!page)
> >>> +             return VM_FAULT_SIGBUS;
> >>> +
> >>> +     vmf->page = page;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> >>> +     .fault = vduse_domain_mmap_fault,
> >>> +};
> >>> +
> >>> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> >>> +{
> >>> +     struct vduse_iova_domain *domain = file->private_data;
> >>> +
> >>> +     vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
> >>> +     vma->vm_private_data = domain;
> >>> +     vma->vm_ops = &vduse_domain_mmap_ops;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_domain_release(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_iova_domain *domain = file->private_data;
> >>> +
> >>> +     vduse_domain_free_bounce_pages(domain, 0, domain->bounce_size);
> >>> +     put_iova_domain(&domain->stream_iovad);
> >>> +     put_iova_domain(&domain->consistent_iovad);
> >>> +     vhost_iotlb_free(domain->iotlb);
> >>> +     vfree(domain->bounce_pages);
> >>> +     kfree(domain);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_domain_fops = {
> >>> +     .mmap = vduse_domain_mmap,
> >>> +     .release = vduse_domain_release,
> >>> +};
> >>> +
> >>> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> >>> +{
> >>> +     fput(domain->file);
> >>> +}
> >>> +
> >>> +struct vduse_iova_domain *
> >>> +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
> >>> +{
> >>> +     struct vduse_iova_domain *domain;
> >>> +     struct file *file;
> >>> +     unsigned long bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> >>> +
> >>> +     if (iova_limit <= bounce_size)
> >>> +             return NULL;
> >>> +
> >>> +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> >>> +     if (!domain)
> >>> +             return NULL;
> >>> +
> >>> +     domain->iotlb = vhost_iotlb_alloc(0, 0);
> >>> +     if (!domain->iotlb)
> >>> +             goto err_iotlb;
> >>> +
> >>> +     domain->iova_limit = iova_limit;
> >>> +     domain->bounce_size = PAGE_ALIGN(bounce_size);
> >>> +     domain->bounce_pages = vzalloc(bounce_pfns * sizeof(struct page *));
> >>> +     if (!domain->bounce_pages)
> >>> +             goto err_page;
> >>> +
> >>> +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> >>> +                             domain, O_RDWR);
> >>> +     if (IS_ERR(file))
> >>> +             goto err_file;
> >>> +
> >>> +     domain->file = file;
> >>> +     spin_lock_init(&domain->iotlb_lock);
> >>> +     init_iova_domain(&domain->stream_iovad,
> >>> +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
> >>> +     init_iova_domain(&domain->consistent_iovad,
> >>> +                     PAGE_SIZE, bounce_pfns);
> >>> +
> >>> +     return domain;
> >>> +err_file:
> >>> +     vfree(domain->bounce_pages);
> >>> +err_page:
> >>> +     vhost_iotlb_free(domain->iotlb);
> >>> +err_iotlb:
> >>> +     kfree(domain);
> >>> +     return NULL;
> >>> +}
> >>> +
> >>> +int vduse_domain_init(void)
> >>> +{
> >>> +     return iova_cache_get();
> >>> +}
> >>> +
> >>> +void vduse_domain_exit(void)
> >>> +{
> >>> +     iova_cache_put();
> >>> +}
> >>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> >>> new file mode 100644
> >>> index 000000000000..9c85d8346626
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> >>> @@ -0,0 +1,61 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>> +/*
> >>> + * MMU-based IOMMU implementation
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#ifndef _VDUSE_IOVA_DOMAIN_H
> >>> +#define _VDUSE_IOVA_DOMAIN_H
> >>> +
> >>> +#include <linux/iova.h>
> >>> +#include <linux/dma-mapping.h>
> >>> +#include <linux/vhost_iotlb.h>
> >>> +
> >>> +struct vduse_iova_domain {
> >>> +     struct iova_domain stream_iovad;
> >>> +     struct iova_domain consistent_iovad;
> >>> +     struct page **bounce_pages;
> >>> +     size_t bounce_size;
> >>> +     unsigned long iova_limit;
> >>> +     struct vhost_iotlb *iotlb;
> >>
> >> Sorry if I've asked this before.
> >>
> >> But what's the reason for maintaing a dedicated IOTLB here? I think we
> >> could reuse vduse_dev->iommu since the device can not be used by both
> >> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
> >> set_map().
> >>
> > The main difference between domain->iotlb and dev->iotlb is the way to
> > deal with bounce buffer. In the domain->iotlb case, bounce buffer
> > needs to be mapped each DMA transfer because we need to get the bounce
> > pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> > buffer only needs to be mapped once during initialization, which will
> > be used to tell userspace how to do mmap().
> >
> >> Also, since vhost IOTLB support per mapping token (opauqe), can we use
> >> that instead of the bounce_pages *?
> >>
> > Sorry, I didn't get you here. Which value do you mean to store in the
> > opaque pointer?
>
>
> So I would like to have a way to use a single IOTLB for manage all kinds
> of mappings. Two possible ideas:
>
> 1) map bounce page one by one in vduse_dev_map_page(), in
> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
> for bounce pages, userspace still only need to map it once and we can
> maintain the actual mapping by storing the page or pa in the opaque
> field of IOTLB entry.

Looks like userspace still needs to unmap the old region and map a new
region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.

> 2) map bounce page once in vduse_dev_map_page() and store struct page
> **bounce_pages in the opaque field of this single IOTLB entry.
>

We can get the struct page **bounce_pages from vduse_iova_domain. Why
do we need to store it in the opaque field?  Should the opaque field
be used to store vdpa_map_file?

And I think it works. One problem is we need to find a place to store
the original DMA buffer's address and size. I think we can modify the
array of bounce_pages for this purpose.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  3:42             ` Jason Wang
  (?)
@ 2021-03-05  6:36             ` Yongji Xie
  2021-03-05  7:01                 ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  6:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 11:30 上午, Yongji Xie wrote:
> > On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/4 4:58 下午, Yongji Xie wrote:
> >>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>>>> This patch introduces a workqueue to support injecting
> >>>>> virtqueue's interrupt asynchronously. This is mainly
> >>>>> for performance considerations which makes sure the push()
> >>>>> and pop() for used vring can be asynchronous.
> >>>> Do you have pref numbers for this patch?
> >>>>
> >>> No, I can do some tests for it if needed.
> >>>
> >>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> >>> if we call irq callback in ioctl context. Something like:
> >>>
> >>> virtqueue_push();
> >>> virtio_notify();
> >>>       ioctl()
> >>> -------------------------------------------------
> >>>           irq_cb()
> >>>               virtqueue_get_buf()
> >>>
> >>> The used vring is always empty each time we call virtqueue_push() in
> >>> userspace. Not sure if it is what we expected.
> >>
> >> I'm not sure I get the issue.
> >>
> >> THe used ring should be filled by virtqueue_push() which is done by
> >> userspace before?
> >>
> > After userspace call virtqueue_push(), it always call virtio_notify()
> > immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
> > will inject an irq to VM and return, then vcpu thread will call
> > interrupt handler. But in container (virtio-vdpa) cases,
> > virtio_notify() will call interrupt handler directly. So it looks like
> > we have to optimize the virtio-vdpa cases. But one problem is we don't
> > know whether we are in the VM user case or container user case.
>
>
> Yes, but I still don't get why used ring is empty after the ioctl()?
> Used ring does not use bounce page so it should be visible to the kernel
> driver. What did I miss :) ?
>

Sorry, I'm not saying the kernel can't see the correct used vring. I
mean the kernel will consume the used vring in the ioctl context
directly in the virtio-vdpa case. In userspace's view, that means
virtqueue_push() is used vring's producer and virtio_notify() is used
vring's consumer. They will be called one by one in one thread rather
than different threads, which looks odd and has a bad effect on
performance.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 11/11] vduse: Support binding irq to the specified cpu
  2021-03-05  3:44             ` Jason Wang
  (?)
@ 2021-03-05  6:40             ` Yongji Xie
  -1 siblings, 0 replies; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  6:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 11:44 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 11:37 上午, Yongji Xie wrote:
> > On Fri, Mar 5, 2021 at 11:11 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/4 4:19 下午, Yongji Xie wrote:
> >>> On Thu, Mar 4, 2021 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>>>> Add a parameter for the ioctl VDUSE_INJECT_VQ_IRQ to support
> >>>>> injecting virtqueue's interrupt to the specified cpu.
> >>>> How userspace know which CPU is this irq for? It looks to me we need to
> >>>> do it at different level.
> >>>>
> >>>> E.g introduce some API in sys to allow admin to tune for that.
> >>>>
> >>>> But I think we can do that in antoher patch on top of this series.
> >>>>
> >>> OK. I will think more about it.
> >>
> >> It should be soemthing like
> >> /sys/class/vduse/$dev_name/vq/0/irq_affinity. Also need to make sure
> >> eventfd could not be reused.
> >>
> > Looks like we doesn't use eventfd now. Do you mean we need to use
> > eventfd in this case?
>
>
> No, I meant if we're using eventfd, do we allow a single eventfd to be
> used for injecting irq for more than one virtqueue? (If not, I guess it
> should be ok).
>

OK, I see. I think we don't allow that now.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-05  6:15         ` Yongji Xie
@ 2021-03-05  6:51           ` Jason Wang
  2021-03-05  7:13             ` Yongji Xie
  0 siblings, 1 reply; 92+ messages in thread
From: Jason Wang @ 2021-03-05  6:51 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


[-- Attachment #1.1: Type: text/plain, Size: 2160 bytes --]


On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>>> Sorry if I've asked this before.
>>>>
>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>>>> could reuse vduse_dev->iommu since the device can not be used by both
>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>>>> set_map().
>>>>
>>> The main difference between domain->iotlb and dev->iotlb is the way to
>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>>> needs to be mapped each DMA transfer because we need to get the bounce
>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>>> buffer only needs to be mapped once during initialization, which will
>>> be used to tell userspace how to do mmap().
>>>
>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>>>> that instead of the bounce_pages *?
>>>>
>>> Sorry, I didn't get you here. Which value do you mean to store in the
>>> opaque pointer?
>> So I would like to have a way to use a single IOTLB for manage all kinds
>> of mappings. Two possible ideas:
>>
>> 1) map bounce page one by one in vduse_dev_map_page(), in
>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>> for bounce pages, userspace still only need to map it once and we can
>> maintain the actual mapping by storing the page or pa in the opaque
>> field of IOTLB entry.
> Looks like userspace still needs to unmap the old region and map a new
> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.


I don't get here. Can you give an example?


>
>> 2) map bounce page once in vduse_dev_map_page() and store struct page
>> **bounce_pages in the opaque field of this single IOTLB entry.
>>
> We can get the struct page **bounce_pages from vduse_iova_domain. Why
> do we need to store it in the opaque field?  Should the opaque field
> be used to store vdpa_map_file?


Oh yes, you're right.


>
> And I think it works. One problem is we need to find a place to store
> the original DMA buffer's address and size. I think we can modify the
> array of bounce_pages for this purpose.
>
> Thanks,
> Yongji


Yes.

Thanks


>

[-- Attachment #1.2: Type: text/html, Size: 3866 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  6:36             ` Yongji Xie
@ 2021-03-05  7:01                 ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  7:01 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/5 2:36 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>> for performance considerations which makes sure the push()
>>>>>>> and pop() for used vring can be asynchronous.
>>>>>> Do you have pref numbers for this patch?
>>>>>>
>>>>> No, I can do some tests for it if needed.
>>>>>
>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>> if we call irq callback in ioctl context. Something like:
>>>>>
>>>>> virtqueue_push();
>>>>> virtio_notify();
>>>>>        ioctl()
>>>>> -------------------------------------------------
>>>>>            irq_cb()
>>>>>                virtqueue_get_buf()
>>>>>
>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>> userspace. Not sure if it is what we expected.
>>>> I'm not sure I get the issue.
>>>>
>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>> userspace before?
>>>>
>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>> will inject an irq to VM and return, then vcpu thread will call
>>> interrupt handler. But in container (virtio-vdpa) cases,
>>> virtio_notify() will call interrupt handler directly. So it looks like
>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>> know whether we are in the VM user case or container user case.
>>
>> Yes, but I still don't get why used ring is empty after the ioctl()?
>> Used ring does not use bounce page so it should be visible to the kernel
>> driver. What did I miss :) ?
>>
> Sorry, I'm not saying the kernel can't see the correct used vring. I
> mean the kernel will consume the used vring in the ioctl context
> directly in the virtio-vdpa case. In userspace's view, that means
> virtqueue_push() is used vring's producer and virtio_notify() is used
> vring's consumer. They will be called one by one in one thread rather
> than different threads, which looks odd and has a bad effect on
> performance.


Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you 
want to squash this patch into patch 8?

So I think we can see obvious difference when virtio-vdpa is used.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
@ 2021-03-05  7:01                 ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  7:01 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/5 2:36 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>> for performance considerations which makes sure the push()
>>>>>>> and pop() for used vring can be asynchronous.
>>>>>> Do you have pref numbers for this patch?
>>>>>>
>>>>> No, I can do some tests for it if needed.
>>>>>
>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>> if we call irq callback in ioctl context. Something like:
>>>>>
>>>>> virtqueue_push();
>>>>> virtio_notify();
>>>>>        ioctl()
>>>>> -------------------------------------------------
>>>>>            irq_cb()
>>>>>                virtqueue_get_buf()
>>>>>
>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>> userspace. Not sure if it is what we expected.
>>>> I'm not sure I get the issue.
>>>>
>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>> userspace before?
>>>>
>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>> will inject an irq to VM and return, then vcpu thread will call
>>> interrupt handler. But in container (virtio-vdpa) cases,
>>> virtio_notify() will call interrupt handler directly. So it looks like
>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>> know whether we are in the VM user case or container user case.
>>
>> Yes, but I still don't get why used ring is empty after the ioctl()?
>> Used ring does not use bounce page so it should be visible to the kernel
>> driver. What did I miss :) ?
>>
> Sorry, I'm not saying the kernel can't see the correct used vring. I
> mean the kernel will consume the used vring in the ioctl context
> directly in the virtio-vdpa case. In userspace's view, that means
> virtqueue_push() is used vring's producer and virtio_notify() is used
> vring's consumer. They will be called one by one in one thread rather
> than different threads, which looks odd and has a bad effect on
> performance.


Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you 
want to squash this patch into patch 8?

So I think we can see obvious difference when virtio-vdpa is used.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-05  6:51           ` Jason Wang
@ 2021-03-05  7:13             ` Yongji Xie
  2021-03-05  7:27                 ` Jason Wang
  0 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  7:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>
> Sorry if I've asked this before.
>
> But what's the reason for maintaing a dedicated IOTLB here? I think we
> could reuse vduse_dev->iommu since the device can not be used by both
> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
> set_map().
>
> The main difference between domain->iotlb and dev->iotlb is the way to
> deal with bounce buffer. In the domain->iotlb case, bounce buffer
> needs to be mapped each DMA transfer because we need to get the bounce
> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> buffer only needs to be mapped once during initialization, which will
> be used to tell userspace how to do mmap().
>
> Also, since vhost IOTLB support per mapping token (opauqe), can we use
> that instead of the bounce_pages *?
>
> Sorry, I didn't get you here. Which value do you mean to store in the
> opaque pointer?
>
> So I would like to have a way to use a single IOTLB for manage all kinds
> of mappings. Two possible ideas:
>
> 1) map bounce page one by one in vduse_dev_map_page(), in
> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
> for bounce pages, userspace still only need to map it once and we can
> maintain the actual mapping by storing the page or pa in the opaque
> field of IOTLB entry.
>
> Looks like userspace still needs to unmap the old region and map a new
> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>
>
> I don't get here. Can you give an example?
>

For example, userspace needs to process two I/O requests (one page per
request). To process the first request, userspace uses
VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
it. To process the second request, userspace uses VDUSE_IOTLB_GET_FD
ioctl to query the new iova region and map a new region (0 ~ 8192).
Then userspace needs to traverse the list of iova regions and unmap
the old region (0 ~ 4096). Looks like this is a little complex.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-05  7:13             ` Yongji Xie
@ 2021-03-05  7:27                 ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  7:27 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/5 3:13 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>
>> Sorry if I've asked this before.
>>
>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>> could reuse vduse_dev->iommu since the device can not be used by both
>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>> set_map().
>>
>> The main difference between domain->iotlb and dev->iotlb is the way to
>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>> needs to be mapped each DMA transfer because we need to get the bounce
>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>> buffer only needs to be mapped once during initialization, which will
>> be used to tell userspace how to do mmap().
>>
>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>> that instead of the bounce_pages *?
>>
>> Sorry, I didn't get you here. Which value do you mean to store in the
>> opaque pointer?
>>
>> So I would like to have a way to use a single IOTLB for manage all kinds
>> of mappings. Two possible ideas:
>>
>> 1) map bounce page one by one in vduse_dev_map_page(), in
>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>> for bounce pages, userspace still only need to map it once and we can
>> maintain the actual mapping by storing the page or pa in the opaque
>> field of IOTLB entry.
>>
>> Looks like userspace still needs to unmap the old region and map a new
>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>
>>
>> I don't get here. Can you give an example?
>>
> For example, userspace needs to process two I/O requests (one page per
> request). To process the first request, userspace uses
> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
> it.


I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum 
range as far as they are backed by the same fd.

In the case of bounce page, it's bascially the whole range of bounce buffer?

Thanks


> To process the second request, userspace uses VDUSE_IOTLB_GET_FD
> ioctl to query the new iova region and map a new region (0 ~ 8192).
> Then userspace needs to traverse the list of iova regions and unmap
> the old region (0 ~ 4096). Looks like this is a little complex.
>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
@ 2021-03-05  7:27                 ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  7:27 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/5 3:13 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>
>> Sorry if I've asked this before.
>>
>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>> could reuse vduse_dev->iommu since the device can not be used by both
>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>> set_map().
>>
>> The main difference between domain->iotlb and dev->iotlb is the way to
>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>> needs to be mapped each DMA transfer because we need to get the bounce
>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>> buffer only needs to be mapped once during initialization, which will
>> be used to tell userspace how to do mmap().
>>
>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>> that instead of the bounce_pages *?
>>
>> Sorry, I didn't get you here. Which value do you mean to store in the
>> opaque pointer?
>>
>> So I would like to have a way to use a single IOTLB for manage all kinds
>> of mappings. Two possible ideas:
>>
>> 1) map bounce page one by one in vduse_dev_map_page(), in
>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>> for bounce pages, userspace still only need to map it once and we can
>> maintain the actual mapping by storing the page or pa in the opaque
>> field of IOTLB entry.
>>
>> Looks like userspace still needs to unmap the old region and map a new
>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>
>>
>> I don't get here. Can you give an example?
>>
> For example, userspace needs to process two I/O requests (one page per
> request). To process the first request, userspace uses
> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
> it.


I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum 
range as far as they are backed by the same fd.

In the case of bounce page, it's bascially the whole range of bounce buffer?

Thanks


> To process the second request, userspace uses VDUSE_IOTLB_GET_FD
> ioctl to query the new iova region and map a new region (0 ~ 8192).
> Then userspace needs to traverse the list of iova regions and unmap
> the old region (0 ~ 4096). Looks like this is a little complex.
>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  7:01                 ` Jason Wang
  (?)
@ 2021-03-05  7:27                 ` Yongji Xie
  2021-03-05  7:36                     ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  7:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 2:36 下午, Yongji Xie wrote:
> > On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/5 11:30 上午, Yongji Xie wrote:
> >>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
> >>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>>>>>> This patch introduces a workqueue to support injecting
> >>>>>>> virtqueue's interrupt asynchronously. This is mainly
> >>>>>>> for performance considerations which makes sure the push()
> >>>>>>> and pop() for used vring can be asynchronous.
> >>>>>> Do you have pref numbers for this patch?
> >>>>>>
> >>>>> No, I can do some tests for it if needed.
> >>>>>
> >>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> >>>>> if we call irq callback in ioctl context. Something like:
> >>>>>
> >>>>> virtqueue_push();
> >>>>> virtio_notify();
> >>>>>        ioctl()
> >>>>> -------------------------------------------------
> >>>>>            irq_cb()
> >>>>>                virtqueue_get_buf()
> >>>>>
> >>>>> The used vring is always empty each time we call virtqueue_push() in
> >>>>> userspace. Not sure if it is what we expected.
> >>>> I'm not sure I get the issue.
> >>>>
> >>>> THe used ring should be filled by virtqueue_push() which is done by
> >>>> userspace before?
> >>>>
> >>> After userspace call virtqueue_push(), it always call virtio_notify()
> >>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
> >>> will inject an irq to VM and return, then vcpu thread will call
> >>> interrupt handler. But in container (virtio-vdpa) cases,
> >>> virtio_notify() will call interrupt handler directly. So it looks like
> >>> we have to optimize the virtio-vdpa cases. But one problem is we don't
> >>> know whether we are in the VM user case or container user case.
> >>
> >> Yes, but I still don't get why used ring is empty after the ioctl()?
> >> Used ring does not use bounce page so it should be visible to the kernel
> >> driver. What did I miss :) ?
> >>
> > Sorry, I'm not saying the kernel can't see the correct used vring. I
> > mean the kernel will consume the used vring in the ioctl context
> > directly in the virtio-vdpa case. In userspace's view, that means
> > virtqueue_push() is used vring's producer and virtio_notify() is used
> > vring's consumer. They will be called one by one in one thread rather
> > than different threads, which looks odd and has a bad effect on
> > performance.
>
>
> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
> want to squash this patch into patch 8?
>
> So I think we can see obvious difference when virtio-vdpa is used.
>

But it looks like we don't need this workqueue in vhost-vdpa cases.
Any suggestions?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  7:27                 ` Yongji Xie
@ 2021-03-05  7:36                     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  7:36 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/5 3:27 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>>>> for performance considerations which makes sure the push()
>>>>>>>>> and pop() for used vring can be asynchronous.
>>>>>>>> Do you have pref numbers for this patch?
>>>>>>>>
>>>>>>> No, I can do some tests for it if needed.
>>>>>>>
>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>>>> if we call irq callback in ioctl context. Something like:
>>>>>>>
>>>>>>> virtqueue_push();
>>>>>>> virtio_notify();
>>>>>>>         ioctl()
>>>>>>> -------------------------------------------------
>>>>>>>             irq_cb()
>>>>>>>                 virtqueue_get_buf()
>>>>>>>
>>>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>>>> userspace. Not sure if it is what we expected.
>>>>>> I'm not sure I get the issue.
>>>>>>
>>>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>>>> userspace before?
>>>>>>
>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>>>> will inject an irq to VM and return, then vcpu thread will call
>>>>> interrupt handler. But in container (virtio-vdpa) cases,
>>>>> virtio_notify() will call interrupt handler directly. So it looks like
>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>>>> know whether we are in the VM user case or container user case.
>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
>>>> Used ring does not use bounce page so it should be visible to the kernel
>>>> driver. What did I miss :) ?
>>>>
>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
>>> mean the kernel will consume the used vring in the ioctl context
>>> directly in the virtio-vdpa case. In userspace's view, that means
>>> virtqueue_push() is used vring's producer and virtio_notify() is used
>>> vring's consumer. They will be called one by one in one thread rather
>>> than different threads, which looks odd and has a bad effect on
>>> performance.
>>
>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
>> want to squash this patch into patch 8?
>>
>> So I think we can see obvious difference when virtio-vdpa is used.
>>
> But it looks like we don't need this workqueue in vhost-vdpa cases.
> Any suggestions?


I haven't had a deep thought. But I feel we can solve this by using the 
irq bypass manager (or something similar). Then we don't need it to be 
relayed via workqueue and vdpa. But I'm not sure how hard it will be.

Do you see any obvious performance regression by using the workqueue? Or 
we can optimize it in the future.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
@ 2021-03-05  7:36                     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-05  7:36 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/5 3:27 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>>>> for performance considerations which makes sure the push()
>>>>>>>>> and pop() for used vring can be asynchronous.
>>>>>>>> Do you have pref numbers for this patch?
>>>>>>>>
>>>>>>> No, I can do some tests for it if needed.
>>>>>>>
>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>>>> if we call irq callback in ioctl context. Something like:
>>>>>>>
>>>>>>> virtqueue_push();
>>>>>>> virtio_notify();
>>>>>>>         ioctl()
>>>>>>> -------------------------------------------------
>>>>>>>             irq_cb()
>>>>>>>                 virtqueue_get_buf()
>>>>>>>
>>>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>>>> userspace. Not sure if it is what we expected.
>>>>>> I'm not sure I get the issue.
>>>>>>
>>>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>>>> userspace before?
>>>>>>
>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>>>> will inject an irq to VM and return, then vcpu thread will call
>>>>> interrupt handler. But in container (virtio-vdpa) cases,
>>>>> virtio_notify() will call interrupt handler directly. So it looks like
>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>>>> know whether we are in the VM user case or container user case.
>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
>>>> Used ring does not use bounce page so it should be visible to the kernel
>>>> driver. What did I miss :) ?
>>>>
>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
>>> mean the kernel will consume the used vring in the ioctl context
>>> directly in the virtio-vdpa case. In userspace's view, that means
>>> virtqueue_push() is used vring's producer and virtio_notify() is used
>>> vring's consumer. They will be called one by one in one thread rather
>>> than different threads, which looks odd and has a bad effect on
>>> performance.
>>
>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
>> want to squash this patch into patch 8?
>>
>> So I think we can see obvious difference when virtio-vdpa is used.
>>
> But it looks like we don't need this workqueue in vhost-vdpa cases.
> Any suggestions?


I haven't had a deep thought. But I feel we can solve this by using the 
irq bypass manager (or something similar). Then we don't need it to be 
relayed via workqueue and vdpa. But I'm not sure how hard it will be.

Do you see any obvious performance regression by using the workqueue? Or 
we can optimize it in the future.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-05  7:27                 ` Jason Wang
  (?)
@ 2021-03-05  7:59                 ` Yongji Xie
  2021-03-08  3:17                     ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  7:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 3:13 下午, Yongji Xie wrote:
> > On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/5 2:15 下午, Yongji Xie wrote:
> >>
> >> Sorry if I've asked this before.
> >>
> >> But what's the reason for maintaing a dedicated IOTLB here? I think we
> >> could reuse vduse_dev->iommu since the device can not be used by both
> >> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
> >> set_map().
> >>
> >> The main difference between domain->iotlb and dev->iotlb is the way to
> >> deal with bounce buffer. In the domain->iotlb case, bounce buffer
> >> needs to be mapped each DMA transfer because we need to get the bounce
> >> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> >> buffer only needs to be mapped once during initialization, which will
> >> be used to tell userspace how to do mmap().
> >>
> >> Also, since vhost IOTLB support per mapping token (opauqe), can we use
> >> that instead of the bounce_pages *?
> >>
> >> Sorry, I didn't get you here. Which value do you mean to store in the
> >> opaque pointer?
> >>
> >> So I would like to have a way to use a single IOTLB for manage all kinds
> >> of mappings. Two possible ideas:
> >>
> >> 1) map bounce page one by one in vduse_dev_map_page(), in
> >> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
> >> for bounce pages, userspace still only need to map it once and we can
> >> maintain the actual mapping by storing the page or pa in the opaque
> >> field of IOTLB entry.
> >>
> >> Looks like userspace still needs to unmap the old region and map a new
> >> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
> >>
> >>
> >> I don't get here. Can you give an example?
> >>
> > For example, userspace needs to process two I/O requests (one page per
> > request). To process the first request, userspace uses
> > VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
> > it.
>
>
> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
> range as far as they are backed by the same fd.
>

But now the bounce page is mapped one by one. The second page (4096 ~
8192) might not be mapped when userspace is processing the first
request. So the maximum range is 0 ~ 4096 at that time.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  7:36                     ` Jason Wang
  (?)
@ 2021-03-05  8:12                     ` Yongji Xie
  2021-03-08  3:04                         ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-05  8:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 3:27 下午, Yongji Xie wrote:
> > On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/5 2:36 下午, Yongji Xie wrote:
> >>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
> >>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
> >>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>>>>>>>> This patch introduces a workqueue to support injecting
> >>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
> >>>>>>>>> for performance considerations which makes sure the push()
> >>>>>>>>> and pop() for used vring can be asynchronous.
> >>>>>>>> Do you have pref numbers for this patch?
> >>>>>>>>
> >>>>>>> No, I can do some tests for it if needed.
> >>>>>>>
> >>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> >>>>>>> if we call irq callback in ioctl context. Something like:
> >>>>>>>
> >>>>>>> virtqueue_push();
> >>>>>>> virtio_notify();
> >>>>>>>         ioctl()
> >>>>>>> -------------------------------------------------
> >>>>>>>             irq_cb()
> >>>>>>>                 virtqueue_get_buf()
> >>>>>>>
> >>>>>>> The used vring is always empty each time we call virtqueue_push() in
> >>>>>>> userspace. Not sure if it is what we expected.
> >>>>>> I'm not sure I get the issue.
> >>>>>>
> >>>>>> THe used ring should be filled by virtqueue_push() which is done by
> >>>>>> userspace before?
> >>>>>>
> >>>>> After userspace call virtqueue_push(), it always call virtio_notify()
> >>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
> >>>>> will inject an irq to VM and return, then vcpu thread will call
> >>>>> interrupt handler. But in container (virtio-vdpa) cases,
> >>>>> virtio_notify() will call interrupt handler directly. So it looks like
> >>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
> >>>>> know whether we are in the VM user case or container user case.
> >>>> Yes, but I still don't get why used ring is empty after the ioctl()?
> >>>> Used ring does not use bounce page so it should be visible to the kernel
> >>>> driver. What did I miss :) ?
> >>>>
> >>> Sorry, I'm not saying the kernel can't see the correct used vring. I
> >>> mean the kernel will consume the used vring in the ioctl context
> >>> directly in the virtio-vdpa case. In userspace's view, that means
> >>> virtqueue_push() is used vring's producer and virtio_notify() is used
> >>> vring's consumer. They will be called one by one in one thread rather
> >>> than different threads, which looks odd and has a bad effect on
> >>> performance.
> >>
> >> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
> >> want to squash this patch into patch 8?
> >>
> >> So I think we can see obvious difference when virtio-vdpa is used.
> >>
> > But it looks like we don't need this workqueue in vhost-vdpa cases.
> > Any suggestions?
>
>
> I haven't had a deep thought. But I feel we can solve this by using the
> irq bypass manager (or something similar). Then we don't need it to be
> relayed via workqueue and vdpa. But I'm not sure how hard it will be.
>

 Or let vdpa bus drivers give us some information?

> Do you see any obvious performance regression by using the workqueue? Or
> we can optimize it in the future.
>

Agree.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-05  8:12                     ` Yongji Xie
@ 2021-03-08  3:04                         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  3:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/5 4:12 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 3:27 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>>>>>> for performance considerations which makes sure the push()
>>>>>>>>>>> and pop() for used vring can be asynchronous.
>>>>>>>>>> Do you have pref numbers for this patch?
>>>>>>>>>>
>>>>>>>>> No, I can do some tests for it if needed.
>>>>>>>>>
>>>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>>>>>> if we call irq callback in ioctl context. Something like:
>>>>>>>>>
>>>>>>>>> virtqueue_push();
>>>>>>>>> virtio_notify();
>>>>>>>>>          ioctl()
>>>>>>>>> -------------------------------------------------
>>>>>>>>>              irq_cb()
>>>>>>>>>                  virtqueue_get_buf()
>>>>>>>>>
>>>>>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>>>>>> userspace. Not sure if it is what we expected.
>>>>>>>> I'm not sure I get the issue.
>>>>>>>>
>>>>>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>>>>>> userspace before?
>>>>>>>>
>>>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>>>>>> will inject an irq to VM and return, then vcpu thread will call
>>>>>>> interrupt handler. But in container (virtio-vdpa) cases,
>>>>>>> virtio_notify() will call interrupt handler directly. So it looks like
>>>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>>>>>> know whether we are in the VM user case or container user case.
>>>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
>>>>>> Used ring does not use bounce page so it should be visible to the kernel
>>>>>> driver. What did I miss :) ?
>>>>>>
>>>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
>>>>> mean the kernel will consume the used vring in the ioctl context
>>>>> directly in the virtio-vdpa case. In userspace's view, that means
>>>>> virtqueue_push() is used vring's producer and virtio_notify() is used
>>>>> vring's consumer. They will be called one by one in one thread rather
>>>>> than different threads, which looks odd and has a bad effect on
>>>>> performance.
>>>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
>>>> want to squash this patch into patch 8?
>>>>
>>>> So I think we can see obvious difference when virtio-vdpa is used.
>>>>
>>> But it looks like we don't need this workqueue in vhost-vdpa cases.
>>> Any suggestions?
>>
>> I haven't had a deep thought. But I feel we can solve this by using the
>> irq bypass manager (or something similar). Then we don't need it to be
>> relayed via workqueue and vdpa. But I'm not sure how hard it will be.
>>
>   Or let vdpa bus drivers give us some information?


This kind of 'type' is proposed in the early RFC of vDPA series. One 
issue is that at device level, we should not differ virtio from vhost, 
so if we introduce that, it might encourge people to design a device 
that is dedicated to vhost or virtio which might not be good.

But we can re-visit this when necessary.

Thanks


>
>> Do you see any obvious performance regression by using the workqueue? Or
>> we can optimize it in the future.
>>
> Agree.
>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
@ 2021-03-08  3:04                         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  3:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/5 4:12 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 3:27 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>>>>>> for performance considerations which makes sure the push()
>>>>>>>>>>> and pop() for used vring can be asynchronous.
>>>>>>>>>> Do you have pref numbers for this patch?
>>>>>>>>>>
>>>>>>>>> No, I can do some tests for it if needed.
>>>>>>>>>
>>>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>>>>>> if we call irq callback in ioctl context. Something like:
>>>>>>>>>
>>>>>>>>> virtqueue_push();
>>>>>>>>> virtio_notify();
>>>>>>>>>          ioctl()
>>>>>>>>> -------------------------------------------------
>>>>>>>>>              irq_cb()
>>>>>>>>>                  virtqueue_get_buf()
>>>>>>>>>
>>>>>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>>>>>> userspace. Not sure if it is what we expected.
>>>>>>>> I'm not sure I get the issue.
>>>>>>>>
>>>>>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>>>>>> userspace before?
>>>>>>>>
>>>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>>>>>> will inject an irq to VM and return, then vcpu thread will call
>>>>>>> interrupt handler. But in container (virtio-vdpa) cases,
>>>>>>> virtio_notify() will call interrupt handler directly. So it looks like
>>>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>>>>>> know whether we are in the VM user case or container user case.
>>>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
>>>>>> Used ring does not use bounce page so it should be visible to the kernel
>>>>>> driver. What did I miss :) ?
>>>>>>
>>>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
>>>>> mean the kernel will consume the used vring in the ioctl context
>>>>> directly in the virtio-vdpa case. In userspace's view, that means
>>>>> virtqueue_push() is used vring's producer and virtio_notify() is used
>>>>> vring's consumer. They will be called one by one in one thread rather
>>>>> than different threads, which looks odd and has a bad effect on
>>>>> performance.
>>>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
>>>> want to squash this patch into patch 8?
>>>>
>>>> So I think we can see obvious difference when virtio-vdpa is used.
>>>>
>>> But it looks like we don't need this workqueue in vhost-vdpa cases.
>>> Any suggestions?
>>
>> I haven't had a deep thought. But I feel we can solve this by using the
>> irq bypass manager (or something similar). Then we don't need it to be
>> relayed via workqueue and vdpa. But I'm not sure how hard it will be.
>>
>   Or let vdpa bus drivers give us some information?


This kind of 'type' is proposed in the early RFC of vDPA series. One 
issue is that at device level, we should not differ virtio from vhost, 
so if we introduce that, it might encourge people to design a device 
that is dedicated to vhost or virtio which might not be good.

But we can re-visit this when necessary.

Thanks


>
>> Do you see any obvious performance regression by using the workqueue? Or
>> we can optimize it in the future.
>>
> Agree.
>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-05  7:59                 ` Yongji Xie
@ 2021-03-08  3:17                     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  3:17 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/5 3:59 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>>>
>>>> Sorry if I've asked this before.
>>>>
>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>>>> could reuse vduse_dev->iommu since the device can not be used by both
>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>>>> set_map().
>>>>
>>>> The main difference between domain->iotlb and dev->iotlb is the way to
>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>>>> needs to be mapped each DMA transfer because we need to get the bounce
>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>>>> buffer only needs to be mapped once during initialization, which will
>>>> be used to tell userspace how to do mmap().
>>>>
>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>>>> that instead of the bounce_pages *?
>>>>
>>>> Sorry, I didn't get you here. Which value do you mean to store in the
>>>> opaque pointer?
>>>>
>>>> So I would like to have a way to use a single IOTLB for manage all kinds
>>>> of mappings. Two possible ideas:
>>>>
>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>>>> for bounce pages, userspace still only need to map it once and we can
>>>> maintain the actual mapping by storing the page or pa in the opaque
>>>> field of IOTLB entry.
>>>>
>>>> Looks like userspace still needs to unmap the old region and map a new
>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>>>
>>>>
>>>> I don't get here. Can you give an example?
>>>>
>>> For example, userspace needs to process two I/O requests (one page per
>>> request). To process the first request, userspace uses
>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
>>> it.
>>
>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
>> range as far as they are backed by the same fd.
>>
> But now the bounce page is mapped one by one. The second page (4096 ~
> 8192) might not be mapped when userspace is processing the first
> request. So the maximum range is 0 ~ 4096 at that time.
>
> Thanks,
> Yongji


A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return 
the whole bounce map range which is setup in vduse_dev_map_page()? So my 
understanding is that usersapce may choose to map all its range via mmap().

So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here 
'map' means using multiple itree entries instead of a single one). Then 
in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu) 
until the range is backed by a different file.

With this, there's no userspace visible changes and there's no need for 
the domain->iotlb?

Thanks


>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
@ 2021-03-08  3:17                     ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  3:17 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/5 3:59 下午, Yongji Xie wrote:
> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>>>
>>>> Sorry if I've asked this before.
>>>>
>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>>>> could reuse vduse_dev->iommu since the device can not be used by both
>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>>>> set_map().
>>>>
>>>> The main difference between domain->iotlb and dev->iotlb is the way to
>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>>>> needs to be mapped each DMA transfer because we need to get the bounce
>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>>>> buffer only needs to be mapped once during initialization, which will
>>>> be used to tell userspace how to do mmap().
>>>>
>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>>>> that instead of the bounce_pages *?
>>>>
>>>> Sorry, I didn't get you here. Which value do you mean to store in the
>>>> opaque pointer?
>>>>
>>>> So I would like to have a way to use a single IOTLB for manage all kinds
>>>> of mappings. Two possible ideas:
>>>>
>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>>>> for bounce pages, userspace still only need to map it once and we can
>>>> maintain the actual mapping by storing the page or pa in the opaque
>>>> field of IOTLB entry.
>>>>
>>>> Looks like userspace still needs to unmap the old region and map a new
>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>>>
>>>>
>>>> I don't get here. Can you give an example?
>>>>
>>> For example, userspace needs to process two I/O requests (one page per
>>> request). To process the first request, userspace uses
>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
>>> it.
>>
>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
>> range as far as they are backed by the same fd.
>>
> But now the bounce page is mapped one by one. The second page (4096 ~
> 8192) might not be mapped when userspace is processing the first
> request. So the maximum range is 0 ~ 4096 at that time.
>
> Thanks,
> Yongji


A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return 
the whole bounce map range which is setup in vduse_dev_map_page()? So my 
understanding is that usersapce may choose to map all its range via mmap().

So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here 
'map' means using multiple itree entries instead of a single one). Then 
in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu) 
until the range is backed by a different file.

With this, there's no userspace visible changes and there's no need for 
the domain->iotlb?

Thanks


>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-08  3:17                     ` Jason Wang
  (?)
@ 2021-03-08  3:45                     ` Yongji Xie
  2021-03-08  3:52                         ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-08  3:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Mon, Mar 8, 2021 at 11:17 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 3:59 下午, Yongji Xie wrote:
> > On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/5 3:13 下午, Yongji Xie wrote:
> >>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
> >>>>
> >>>> Sorry if I've asked this before.
> >>>>
> >>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
> >>>> could reuse vduse_dev->iommu since the device can not be used by both
> >>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
> >>>> set_map().
> >>>>
> >>>> The main difference between domain->iotlb and dev->iotlb is the way to
> >>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
> >>>> needs to be mapped each DMA transfer because we need to get the bounce
> >>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> >>>> buffer only needs to be mapped once during initialization, which will
> >>>> be used to tell userspace how to do mmap().
> >>>>
> >>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
> >>>> that instead of the bounce_pages *?
> >>>>
> >>>> Sorry, I didn't get you here. Which value do you mean to store in the
> >>>> opaque pointer?
> >>>>
> >>>> So I would like to have a way to use a single IOTLB for manage all kinds
> >>>> of mappings. Two possible ideas:
> >>>>
> >>>> 1) map bounce page one by one in vduse_dev_map_page(), in
> >>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
> >>>> for bounce pages, userspace still only need to map it once and we can
> >>>> maintain the actual mapping by storing the page or pa in the opaque
> >>>> field of IOTLB entry.
> >>>>
> >>>> Looks like userspace still needs to unmap the old region and map a new
> >>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
> >>>>
> >>>>
> >>>> I don't get here. Can you give an example?
> >>>>
> >>> For example, userspace needs to process two I/O requests (one page per
> >>> request). To process the first request, userspace uses
> >>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
> >>> it.
> >>
> >> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
> >> range as far as they are backed by the same fd.
> >>
> > But now the bounce page is mapped one by one. The second page (4096 ~
> > 8192) might not be mapped when userspace is processing the first
> > request. So the maximum range is 0 ~ 4096 at that time.
> >
> > Thanks,
> > Yongji
>
>
> A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return
> the whole bounce map range which is setup in vduse_dev_map_page()? So my
> understanding is that usersapce may choose to map all its range via mmap().
>

Yes.

> So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here
> 'map' means using multiple itree entries instead of a single one). Then
> in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu)
> until the range is backed by a different file.
>
> With this, there's no userspace visible changes and there's no need for
> the domain->iotlb?
>

In this case, I wonder what range can be obtained if userspace calls
VDUSE_IOTLB_GET_FD when the first I/O (e.g. 4K) occurs. [0, 4K] or [0,
64M]? In current implementation, userspace will map [0, 64M].

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-08  3:45                     ` Yongji Xie
@ 2021-03-08  3:52                         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  3:52 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/8 11:45 上午, Yongji Xie wrote:
> On Mon, Mar 8, 2021 at 11:17 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 3:59 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>>>>>
>>>>>> Sorry if I've asked this before.
>>>>>>
>>>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>>>>>> could reuse vduse_dev->iommu since the device can not be used by both
>>>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>>>>>> set_map().
>>>>>>
>>>>>> The main difference between domain->iotlb and dev->iotlb is the way to
>>>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>>>>>> needs to be mapped each DMA transfer because we need to get the bounce
>>>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>>>>>> buffer only needs to be mapped once during initialization, which will
>>>>>> be used to tell userspace how to do mmap().
>>>>>>
>>>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>>>>>> that instead of the bounce_pages *?
>>>>>>
>>>>>> Sorry, I didn't get you here. Which value do you mean to store in the
>>>>>> opaque pointer?
>>>>>>
>>>>>> So I would like to have a way to use a single IOTLB for manage all kinds
>>>>>> of mappings. Two possible ideas:
>>>>>>
>>>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
>>>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>>>>>> for bounce pages, userspace still only need to map it once and we can
>>>>>> maintain the actual mapping by storing the page or pa in the opaque
>>>>>> field of IOTLB entry.
>>>>>>
>>>>>> Looks like userspace still needs to unmap the old region and map a new
>>>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>>>>>
>>>>>>
>>>>>> I don't get here. Can you give an example?
>>>>>>
>>>>> For example, userspace needs to process two I/O requests (one page per
>>>>> request). To process the first request, userspace uses
>>>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
>>>>> it.
>>>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
>>>> range as far as they are backed by the same fd.
>>>>
>>> But now the bounce page is mapped one by one. The second page (4096 ~
>>> 8192) might not be mapped when userspace is processing the first
>>> request. So the maximum range is 0 ~ 4096 at that time.
>>>
>>> Thanks,
>>> Yongji
>>
>> A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return
>> the whole bounce map range which is setup in vduse_dev_map_page()? So my
>> understanding is that usersapce may choose to map all its range via mmap().
>>
> Yes.
>
>> So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here
>> 'map' means using multiple itree entries instead of a single one). Then
>> in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu)
>> until the range is backed by a different file.
>>
>> With this, there's no userspace visible changes and there's no need for
>> the domain->iotlb?
>>
> In this case, I wonder what range can be obtained if userspace calls
> VDUSE_IOTLB_GET_FD when the first I/O (e.g. 4K) occurs. [0, 4K] or [0,
> 64M]? In current implementation, userspace will map [0, 64M].


It should still be [0, 64M). Do you see any issue?

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
@ 2021-03-08  3:52                         ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  3:52 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/8 11:45 上午, Yongji Xie wrote:
> On Mon, Mar 8, 2021 at 11:17 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 3:59 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>>>>>
>>>>>> Sorry if I've asked this before.
>>>>>>
>>>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>>>>>> could reuse vduse_dev->iommu since the device can not be used by both
>>>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>>>>>> set_map().
>>>>>>
>>>>>> The main difference between domain->iotlb and dev->iotlb is the way to
>>>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>>>>>> needs to be mapped each DMA transfer because we need to get the bounce
>>>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>>>>>> buffer only needs to be mapped once during initialization, which will
>>>>>> be used to tell userspace how to do mmap().
>>>>>>
>>>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>>>>>> that instead of the bounce_pages *?
>>>>>>
>>>>>> Sorry, I didn't get you here. Which value do you mean to store in the
>>>>>> opaque pointer?
>>>>>>
>>>>>> So I would like to have a way to use a single IOTLB for manage all kinds
>>>>>> of mappings. Two possible ideas:
>>>>>>
>>>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
>>>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>>>>>> for bounce pages, userspace still only need to map it once and we can
>>>>>> maintain the actual mapping by storing the page or pa in the opaque
>>>>>> field of IOTLB entry.
>>>>>>
>>>>>> Looks like userspace still needs to unmap the old region and map a new
>>>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>>>>>
>>>>>>
>>>>>> I don't get here. Can you give an example?
>>>>>>
>>>>> For example, userspace needs to process two I/O requests (one page per
>>>>> request). To process the first request, userspace uses
>>>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
>>>>> it.
>>>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
>>>> range as far as they are backed by the same fd.
>>>>
>>> But now the bounce page is mapped one by one. The second page (4096 ~
>>> 8192) might not be mapped when userspace is processing the first
>>> request. So the maximum range is 0 ~ 4096 at that time.
>>>
>>> Thanks,
>>> Yongji
>>
>> A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return
>> the whole bounce map range which is setup in vduse_dev_map_page()? So my
>> understanding is that usersapce may choose to map all its range via mmap().
>>
> Yes.
>
>> So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here
>> 'map' means using multiple itree entries instead of a single one). Then
>> in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu)
>> until the range is backed by a different file.
>>
>> With this, there's no userspace visible changes and there's no need for
>> the domain->iotlb?
>>
> In this case, I wonder what range can be obtained if userspace calls
> VDUSE_IOTLB_GET_FD when the first I/O (e.g. 4K) occurs. [0, 4K] or [0,
> 64M]? In current implementation, userspace will map [0, 64M].


It should still be [0, 64M). Do you see any issue?

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-08  3:04                         ` Jason Wang
  (?)
@ 2021-03-08  4:50                         ` Yongji Xie
  2021-03-08  7:01                             ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-08  4:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Mon, Mar 8, 2021 at 11:04 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/5 4:12 下午, Yongji Xie wrote:
> > On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/5 3:27 下午, Yongji Xie wrote:
> >>> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
> >>>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
> >>>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
> >>>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>>>>>>>>>> This patch introduces a workqueue to support injecting
> >>>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
> >>>>>>>>>>> for performance considerations which makes sure the push()
> >>>>>>>>>>> and pop() for used vring can be asynchronous.
> >>>>>>>>>> Do you have pref numbers for this patch?
> >>>>>>>>>>
> >>>>>>>>> No, I can do some tests for it if needed.
> >>>>>>>>>
> >>>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> >>>>>>>>> if we call irq callback in ioctl context. Something like:
> >>>>>>>>>
> >>>>>>>>> virtqueue_push();
> >>>>>>>>> virtio_notify();
> >>>>>>>>>          ioctl()
> >>>>>>>>> -------------------------------------------------
> >>>>>>>>>              irq_cb()
> >>>>>>>>>                  virtqueue_get_buf()
> >>>>>>>>>
> >>>>>>>>> The used vring is always empty each time we call virtqueue_push() in
> >>>>>>>>> userspace. Not sure if it is what we expected.
> >>>>>>>> I'm not sure I get the issue.
> >>>>>>>>
> >>>>>>>> THe used ring should be filled by virtqueue_push() which is done by
> >>>>>>>> userspace before?
> >>>>>>>>
> >>>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
> >>>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
> >>>>>>> will inject an irq to VM and return, then vcpu thread will call
> >>>>>>> interrupt handler. But in container (virtio-vdpa) cases,
> >>>>>>> virtio_notify() will call interrupt handler directly. So it looks like
> >>>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
> >>>>>>> know whether we are in the VM user case or container user case.
> >>>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
> >>>>>> Used ring does not use bounce page so it should be visible to the kernel
> >>>>>> driver. What did I miss :) ?
> >>>>>>
> >>>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
> >>>>> mean the kernel will consume the used vring in the ioctl context
> >>>>> directly in the virtio-vdpa case. In userspace's view, that means
> >>>>> virtqueue_push() is used vring's producer and virtio_notify() is used
> >>>>> vring's consumer. They will be called one by one in one thread rather
> >>>>> than different threads, which looks odd and has a bad effect on
> >>>>> performance.
> >>>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
> >>>> want to squash this patch into patch 8?
> >>>>
> >>>> So I think we can see obvious difference when virtio-vdpa is used.
> >>>>
> >>> But it looks like we don't need this workqueue in vhost-vdpa cases.
> >>> Any suggestions?
> >>
> >> I haven't had a deep thought. But I feel we can solve this by using the
> >> irq bypass manager (or something similar). Then we don't need it to be
> >> relayed via workqueue and vdpa. But I'm not sure how hard it will be.
> >>
> >   Or let vdpa bus drivers give us some information?
>
>
> This kind of 'type' is proposed in the early RFC of vDPA series. One
> issue is that at device level, we should not differ virtio from vhost,
> so if we introduce that, it might encourge people to design a device
> that is dedicated to vhost or virtio which might not be good.
>
> But we can re-visit this when necessary.
>

OK, I see. How about adding some information in ops.set_vq_cb()?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-08  3:52                         ` Jason Wang
  (?)
@ 2021-03-08  5:05                         ` Yongji Xie
  2021-03-08  7:04                             ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-08  5:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Mon, Mar 8, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/8 11:45 上午, Yongji Xie wrote:
> > On Mon, Mar 8, 2021 at 11:17 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/5 3:59 下午, Yongji Xie wrote:
> >>> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
> >>>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
> >>>>>>
> >>>>>> Sorry if I've asked this before.
> >>>>>>
> >>>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
> >>>>>> could reuse vduse_dev->iommu since the device can not be used by both
> >>>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
> >>>>>> set_map().
> >>>>>>
> >>>>>> The main difference between domain->iotlb and dev->iotlb is the way to
> >>>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
> >>>>>> needs to be mapped each DMA transfer because we need to get the bounce
> >>>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> >>>>>> buffer only needs to be mapped once during initialization, which will
> >>>>>> be used to tell userspace how to do mmap().
> >>>>>>
> >>>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
> >>>>>> that instead of the bounce_pages *?
> >>>>>>
> >>>>>> Sorry, I didn't get you here. Which value do you mean to store in the
> >>>>>> opaque pointer?
> >>>>>>
> >>>>>> So I would like to have a way to use a single IOTLB for manage all kinds
> >>>>>> of mappings. Two possible ideas:
> >>>>>>
> >>>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
> >>>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
> >>>>>> for bounce pages, userspace still only need to map it once and we can
> >>>>>> maintain the actual mapping by storing the page or pa in the opaque
> >>>>>> field of IOTLB entry.
> >>>>>>
> >>>>>> Looks like userspace still needs to unmap the old region and map a new
> >>>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
> >>>>>>
> >>>>>>
> >>>>>> I don't get here. Can you give an example?
> >>>>>>
> >>>>> For example, userspace needs to process two I/O requests (one page per
> >>>>> request). To process the first request, userspace uses
> >>>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
> >>>>> it.
> >>>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
> >>>> range as far as they are backed by the same fd.
> >>>>
> >>> But now the bounce page is mapped one by one. The second page (4096 ~
> >>> 8192) might not be mapped when userspace is processing the first
> >>> request. So the maximum range is 0 ~ 4096 at that time.
> >>>
> >>> Thanks,
> >>> Yongji
> >>
> >> A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return
> >> the whole bounce map range which is setup in vduse_dev_map_page()? So my
> >> understanding is that usersapce may choose to map all its range via mmap().
> >>
> > Yes.
> >
> >> So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here
> >> 'map' means using multiple itree entries instead of a single one). Then
> >> in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu)
> >> until the range is backed by a different file.
> >>
> >> With this, there's no userspace visible changes and there's no need for
> >> the domain->iotlb?
> >>
> > In this case, I wonder what range can be obtained if userspace calls
> > VDUSE_IOTLB_GET_FD when the first I/O (e.g. 4K) occurs. [0, 4K] or [0,
> > 64M]? In current implementation, userspace will map [0, 64M].
>
>
> It should still be [0, 64M). Do you see any issue?
>

Does it mean we still need to map the whole bounce buffer into itree
(dev->iommu) at initialization?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-08  4:50                         ` Yongji Xie
@ 2021-03-08  7:01                             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  7:01 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/8 12:50 下午, Yongji Xie wrote:
> On Mon, Mar 8, 2021 at 11:04 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 4:12 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 3:27 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
>>>>>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>>>>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>>>>>>>> for performance considerations which makes sure the push()
>>>>>>>>>>>>> and pop() for used vring can be asynchronous.
>>>>>>>>>>>> Do you have pref numbers for this patch?
>>>>>>>>>>>>
>>>>>>>>>>> No, I can do some tests for it if needed.
>>>>>>>>>>>
>>>>>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>>>>>>>> if we call irq callback in ioctl context. Something like:
>>>>>>>>>>>
>>>>>>>>>>> virtqueue_push();
>>>>>>>>>>> virtio_notify();
>>>>>>>>>>>           ioctl()
>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>               irq_cb()
>>>>>>>>>>>                   virtqueue_get_buf()
>>>>>>>>>>>
>>>>>>>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>>>>>>>> userspace. Not sure if it is what we expected.
>>>>>>>>>> I'm not sure I get the issue.
>>>>>>>>>>
>>>>>>>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>>>>>>>> userspace before?
>>>>>>>>>>
>>>>>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>>>>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>>>>>>>> will inject an irq to VM and return, then vcpu thread will call
>>>>>>>>> interrupt handler. But in container (virtio-vdpa) cases,
>>>>>>>>> virtio_notify() will call interrupt handler directly. So it looks like
>>>>>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>>>>>>>> know whether we are in the VM user case or container user case.
>>>>>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
>>>>>>>> Used ring does not use bounce page so it should be visible to the kernel
>>>>>>>> driver. What did I miss :) ?
>>>>>>>>
>>>>>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
>>>>>>> mean the kernel will consume the used vring in the ioctl context
>>>>>>> directly in the virtio-vdpa case. In userspace's view, that means
>>>>>>> virtqueue_push() is used vring's producer and virtio_notify() is used
>>>>>>> vring's consumer. They will be called one by one in one thread rather
>>>>>>> than different threads, which looks odd and has a bad effect on
>>>>>>> performance.
>>>>>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
>>>>>> want to squash this patch into patch 8?
>>>>>>
>>>>>> So I think we can see obvious difference when virtio-vdpa is used.
>>>>>>
>>>>> But it looks like we don't need this workqueue in vhost-vdpa cases.
>>>>> Any suggestions?
>>>> I haven't had a deep thought. But I feel we can solve this by using the
>>>> irq bypass manager (or something similar). Then we don't need it to be
>>>> relayed via workqueue and vdpa. But I'm not sure how hard it will be.
>>>>
>>>    Or let vdpa bus drivers give us some information?
>>
>> This kind of 'type' is proposed in the early RFC of vDPA series. One
>> issue is that at device level, we should not differ virtio from vhost,
>> so if we introduce that, it might encourge people to design a device
>> that is dedicated to vhost or virtio which might not be good.
>>
>> But we can re-visit this when necessary.
>>
> OK, I see. How about adding some information in ops.set_vq_cb()?


I'm not sure I get this, maybe you can explain a little bit more?

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
@ 2021-03-08  7:01                             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  7:01 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/8 12:50 下午, Yongji Xie wrote:
> On Mon, Mar 8, 2021 at 11:04 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/5 4:12 下午, Yongji Xie wrote:
>>> On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 3:27 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
>>>>>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>>>>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>>>>>>>> for performance considerations which makes sure the push()
>>>>>>>>>>>>> and pop() for used vring can be asynchronous.
>>>>>>>>>>>> Do you have pref numbers for this patch?
>>>>>>>>>>>>
>>>>>>>>>>> No, I can do some tests for it if needed.
>>>>>>>>>>>
>>>>>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>>>>>>>> if we call irq callback in ioctl context. Something like:
>>>>>>>>>>>
>>>>>>>>>>> virtqueue_push();
>>>>>>>>>>> virtio_notify();
>>>>>>>>>>>           ioctl()
>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>               irq_cb()
>>>>>>>>>>>                   virtqueue_get_buf()
>>>>>>>>>>>
>>>>>>>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>>>>>>>> userspace. Not sure if it is what we expected.
>>>>>>>>>> I'm not sure I get the issue.
>>>>>>>>>>
>>>>>>>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>>>>>>>> userspace before?
>>>>>>>>>>
>>>>>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>>>>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
>>>>>>>>> will inject an irq to VM and return, then vcpu thread will call
>>>>>>>>> interrupt handler. But in container (virtio-vdpa) cases,
>>>>>>>>> virtio_notify() will call interrupt handler directly. So it looks like
>>>>>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
>>>>>>>>> know whether we are in the VM user case or container user case.
>>>>>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
>>>>>>>> Used ring does not use bounce page so it should be visible to the kernel
>>>>>>>> driver. What did I miss :) ?
>>>>>>>>
>>>>>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
>>>>>>> mean the kernel will consume the used vring in the ioctl context
>>>>>>> directly in the virtio-vdpa case. In userspace's view, that means
>>>>>>> virtqueue_push() is used vring's producer and virtio_notify() is used
>>>>>>> vring's consumer. They will be called one by one in one thread rather
>>>>>>> than different threads, which looks odd and has a bad effect on
>>>>>>> performance.
>>>>>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
>>>>>> want to squash this patch into patch 8?
>>>>>>
>>>>>> So I think we can see obvious difference when virtio-vdpa is used.
>>>>>>
>>>>> But it looks like we don't need this workqueue in vhost-vdpa cases.
>>>>> Any suggestions?
>>>> I haven't had a deep thought. But I feel we can solve this by using the
>>>> irq bypass manager (or something similar). Then we don't need it to be
>>>> relayed via workqueue and vdpa. But I'm not sure how hard it will be.
>>>>
>>>    Or let vdpa bus drivers give us some information?
>>
>> This kind of 'type' is proposed in the early RFC of vDPA series. One
>> issue is that at device level, we should not differ virtio from vhost,
>> so if we introduce that, it might encourge people to design a device
>> that is dedicated to vhost or virtio which might not be good.
>>
>> But we can re-visit this when necessary.
>>
> OK, I see. How about adding some information in ops.set_vq_cb()?


I'm not sure I get this, maybe you can explain a little bit more?

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-08  5:05                         ` Yongji Xie
@ 2021-03-08  7:04                             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  7:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/8 1:05 下午, Yongji Xie wrote:
> On Mon, Mar 8, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/8 11:45 上午, Yongji Xie wrote:
>>> On Mon, Mar 8, 2021 at 11:17 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 3:59 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
>>>>>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>>>>>>>
>>>>>>>> Sorry if I've asked this before.
>>>>>>>>
>>>>>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>>>>>>>> could reuse vduse_dev->iommu since the device can not be used by both
>>>>>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>>>>>>>> set_map().
>>>>>>>>
>>>>>>>> The main difference between domain->iotlb and dev->iotlb is the way to
>>>>>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>>>>>>>> needs to be mapped each DMA transfer because we need to get the bounce
>>>>>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>>>>>>>> buffer only needs to be mapped once during initialization, which will
>>>>>>>> be used to tell userspace how to do mmap().
>>>>>>>>
>>>>>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>>>>>>>> that instead of the bounce_pages *?
>>>>>>>>
>>>>>>>> Sorry, I didn't get you here. Which value do you mean to store in the
>>>>>>>> opaque pointer?
>>>>>>>>
>>>>>>>> So I would like to have a way to use a single IOTLB for manage all kinds
>>>>>>>> of mappings. Two possible ideas:
>>>>>>>>
>>>>>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
>>>>>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>>>>>>>> for bounce pages, userspace still only need to map it once and we can
>>>>>>>> maintain the actual mapping by storing the page or pa in the opaque
>>>>>>>> field of IOTLB entry.
>>>>>>>>
>>>>>>>> Looks like userspace still needs to unmap the old region and map a new
>>>>>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>>>>>>>
>>>>>>>>
>>>>>>>> I don't get here. Can you give an example?
>>>>>>>>
>>>>>>> For example, userspace needs to process two I/O requests (one page per
>>>>>>> request). To process the first request, userspace uses
>>>>>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
>>>>>>> it.
>>>>>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
>>>>>> range as far as they are backed by the same fd.
>>>>>>
>>>>> But now the bounce page is mapped one by one. The second page (4096 ~
>>>>> 8192) might not be mapped when userspace is processing the first
>>>>> request. So the maximum range is 0 ~ 4096 at that time.
>>>>>
>>>>> Thanks,
>>>>> Yongji
>>>> A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return
>>>> the whole bounce map range which is setup in vduse_dev_map_page()? So my
>>>> understanding is that usersapce may choose to map all its range via mmap().
>>>>
>>> Yes.
>>>
>>>> So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here
>>>> 'map' means using multiple itree entries instead of a single one). Then
>>>> in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu)
>>>> until the range is backed by a different file.
>>>>
>>>> With this, there's no userspace visible changes and there's no need for
>>>> the domain->iotlb?
>>>>
>>> In this case, I wonder what range can be obtained if userspace calls
>>> VDUSE_IOTLB_GET_FD when the first I/O (e.g. 4K) occurs. [0, 4K] or [0,
>>> 64M]? In current implementation, userspace will map [0, 64M].
>>
>> It should still be [0, 64M). Do you see any issue?
>>
> Does it mean we still need to map the whole bounce buffer into itree
> (dev->iommu) at initialization?


It's your choice I think, the point is to use a single IOTLB for 
maintaining mappings of all types of pages (bounce, coherent, or shared).

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
@ 2021-03-08  7:04                             ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  7:04 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, linux-aio,
	netdev, Randy Dunlap, Matthew Wilcox, virtualization,
	Christoph Hellwig, Bob Liu, bcrl, viro, Stefan Hajnoczi,
	linux-fsdevel


On 2021/3/8 1:05 下午, Yongji Xie wrote:
> On Mon, Mar 8, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/8 11:45 上午, Yongji Xie wrote:
>>> On Mon, Mar 8, 2021 at 11:17 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 3:59 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
>>>>>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
>>>>>>>>
>>>>>>>> Sorry if I've asked this before.
>>>>>>>>
>>>>>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
>>>>>>>> could reuse vduse_dev->iommu since the device can not be used by both
>>>>>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
>>>>>>>> set_map().
>>>>>>>>
>>>>>>>> The main difference between domain->iotlb and dev->iotlb is the way to
>>>>>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
>>>>>>>> needs to be mapped each DMA transfer because we need to get the bounce
>>>>>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
>>>>>>>> buffer only needs to be mapped once during initialization, which will
>>>>>>>> be used to tell userspace how to do mmap().
>>>>>>>>
>>>>>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
>>>>>>>> that instead of the bounce_pages *?
>>>>>>>>
>>>>>>>> Sorry, I didn't get you here. Which value do you mean to store in the
>>>>>>>> opaque pointer?
>>>>>>>>
>>>>>>>> So I would like to have a way to use a single IOTLB for manage all kinds
>>>>>>>> of mappings. Two possible ideas:
>>>>>>>>
>>>>>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
>>>>>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
>>>>>>>> for bounce pages, userspace still only need to map it once and we can
>>>>>>>> maintain the actual mapping by storing the page or pa in the opaque
>>>>>>>> field of IOTLB entry.
>>>>>>>>
>>>>>>>> Looks like userspace still needs to unmap the old region and map a new
>>>>>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
>>>>>>>>
>>>>>>>>
>>>>>>>> I don't get here. Can you give an example?
>>>>>>>>
>>>>>>> For example, userspace needs to process two I/O requests (one page per
>>>>>>> request). To process the first request, userspace uses
>>>>>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
>>>>>>> it.
>>>>>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
>>>>>> range as far as they are backed by the same fd.
>>>>>>
>>>>> But now the bounce page is mapped one by one. The second page (4096 ~
>>>>> 8192) might not be mapped when userspace is processing the first
>>>>> request. So the maximum range is 0 ~ 4096 at that time.
>>>>>
>>>>> Thanks,
>>>>> Yongji
>>>> A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return
>>>> the whole bounce map range which is setup in vduse_dev_map_page()? So my
>>>> understanding is that usersapce may choose to map all its range via mmap().
>>>>
>>> Yes.
>>>
>>>> So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here
>>>> 'map' means using multiple itree entries instead of a single one). Then
>>>> in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu)
>>>> until the range is backed by a different file.
>>>>
>>>> With this, there's no userspace visible changes and there's no need for
>>>> the domain->iotlb?
>>>>
>>> In this case, I wonder what range can be obtained if userspace calls
>>> VDUSE_IOTLB_GET_FD when the first I/O (e.g. 4K) occurs. [0, 4K] or [0,
>>> 64M]? In current implementation, userspace will map [0, 64M].
>>
>> It should still be [0, 64M). Do you see any issue?
>>
> Does it mean we still need to map the whole bounce buffer into itree
> (dev->iommu) at initialization?


It's your choice I think, the point is to use a single IOTLB for 
maintaining mappings of all types of pages (bounce, coherent, or shared).

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 06/11] vduse: Implement an MMU-based IOMMU driver
  2021-03-08  7:04                             ` Jason Wang
  (?)
@ 2021-03-08  7:08                             ` Yongji Xie
  -1 siblings, 0 replies; 92+ messages in thread
From: Yongji Xie @ 2021-03-08  7:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Mon, Mar 8, 2021 at 3:04 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/8 1:05 下午, Yongji Xie wrote:
> > On Mon, Mar 8, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/8 11:45 上午, Yongji Xie wrote:
> >>> On Mon, Mar 8, 2021 at 11:17 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/5 3:59 下午, Yongji Xie wrote:
> >>>>> On Fri, Mar 5, 2021 at 3:27 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/3/5 3:13 下午, Yongji Xie wrote:
> >>>>>>> On Fri, Mar 5, 2021 at 2:52 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> On 2021/3/5 2:15 下午, Yongji Xie wrote:
> >>>>>>>>
> >>>>>>>> Sorry if I've asked this before.
> >>>>>>>>
> >>>>>>>> But what's the reason for maintaing a dedicated IOTLB here? I think we
> >>>>>>>> could reuse vduse_dev->iommu since the device can not be used by both
> >>>>>>>> virtio and vhost in the same time or use vduse_iova_domain->iotlb for
> >>>>>>>> set_map().
> >>>>>>>>
> >>>>>>>> The main difference between domain->iotlb and dev->iotlb is the way to
> >>>>>>>> deal with bounce buffer. In the domain->iotlb case, bounce buffer
> >>>>>>>> needs to be mapped each DMA transfer because we need to get the bounce
> >>>>>>>> pages by an IOVA during DMA unmapping. In the dev->iotlb case, bounce
> >>>>>>>> buffer only needs to be mapped once during initialization, which will
> >>>>>>>> be used to tell userspace how to do mmap().
> >>>>>>>>
> >>>>>>>> Also, since vhost IOTLB support per mapping token (opauqe), can we use
> >>>>>>>> that instead of the bounce_pages *?
> >>>>>>>>
> >>>>>>>> Sorry, I didn't get you here. Which value do you mean to store in the
> >>>>>>>> opaque pointer?
> >>>>>>>>
> >>>>>>>> So I would like to have a way to use a single IOTLB for manage all kinds
> >>>>>>>> of mappings. Two possible ideas:
> >>>>>>>>
> >>>>>>>> 1) map bounce page one by one in vduse_dev_map_page(), in
> >>>>>>>> VDUSE_IOTLB_GET_FD, try to merge the result if we had the same fd. Then
> >>>>>>>> for bounce pages, userspace still only need to map it once and we can
> >>>>>>>> maintain the actual mapping by storing the page or pa in the opaque
> >>>>>>>> field of IOTLB entry.
> >>>>>>>>
> >>>>>>>> Looks like userspace still needs to unmap the old region and map a new
> >>>>>>>> region (size is changed) with the fd in each VDUSE_IOTLB_GET_FD ioctl.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't get here. Can you give an example?
> >>>>>>>>
> >>>>>>> For example, userspace needs to process two I/O requests (one page per
> >>>>>>> request). To process the first request, userspace uses
> >>>>>>> VDUSE_IOTLB_GET_FD ioctl to query the iova region (0 ~ 4096) and mmap
> >>>>>>> it.
> >>>>>> I think in this case we should let VDUSE_IOTLB_GET_FD return the maximum
> >>>>>> range as far as they are backed by the same fd.
> >>>>>>
> >>>>> But now the bounce page is mapped one by one. The second page (4096 ~
> >>>>> 8192) might not be mapped when userspace is processing the first
> >>>>> request. So the maximum range is 0 ~ 4096 at that time.
> >>>>>
> >>>>> Thanks,
> >>>>> Yongji
> >>>> A question, if I read the code correctly, VDUSE_IOTLB_GET_FD will return
> >>>> the whole bounce map range which is setup in vduse_dev_map_page()? So my
> >>>> understanding is that usersapce may choose to map all its range via mmap().
> >>>>
> >>> Yes.
> >>>
> >>>> So if we 'map' bounce page one by one in vduse_dev_map_page(). (Here
> >>>> 'map' means using multiple itree entries instead of a single one). Then
> >>>> in the VDUSE_IOTLB_GET_FD we can keep traversing itree (dev->iommu)
> >>>> until the range is backed by a different file.
> >>>>
> >>>> With this, there's no userspace visible changes and there's no need for
> >>>> the domain->iotlb?
> >>>>
> >>> In this case, I wonder what range can be obtained if userspace calls
> >>> VDUSE_IOTLB_GET_FD when the first I/O (e.g. 4K) occurs. [0, 4K] or [0,
> >>> 64M]? In current implementation, userspace will map [0, 64M].
> >>
> >> It should still be [0, 64M). Do you see any issue?
> >>
> > Does it mean we still need to map the whole bounce buffer into itree
> > (dev->iommu) at initialization?
>
>
> It's your choice I think, the point is to use a single IOTLB for
> maintaining mappings of all types of pages (bounce, coherent, or shared).
>

OK, got it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-08  7:01                             ` Jason Wang
  (?)
@ 2021-03-08  7:16                             ` Yongji Xie
  2021-03-08  7:29                                 ` Jason Wang
  -1 siblings, 1 reply; 92+ messages in thread
From: Yongji Xie @ 2021-03-08  7:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel

On Mon, Mar 8, 2021 at 3:02 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/8 12:50 下午, Yongji Xie wrote:
> > On Mon, Mar 8, 2021 at 11:04 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/3/5 4:12 下午, Yongji Xie wrote:
> >>> On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/3/5 3:27 下午, Yongji Xie wrote:
> >>>>> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
> >>>>>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
> >>>>>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
> >>>>>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
> >>>>>>>>>>>>> This patch introduces a workqueue to support injecting
> >>>>>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
> >>>>>>>>>>>>> for performance considerations which makes sure the push()
> >>>>>>>>>>>>> and pop() for used vring can be asynchronous.
> >>>>>>>>>>>> Do you have pref numbers for this patch?
> >>>>>>>>>>>>
> >>>>>>>>>>> No, I can do some tests for it if needed.
> >>>>>>>>>>>
> >>>>>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
> >>>>>>>>>>> if we call irq callback in ioctl context. Something like:
> >>>>>>>>>>>
> >>>>>>>>>>> virtqueue_push();
> >>>>>>>>>>> virtio_notify();
> >>>>>>>>>>>           ioctl()
> >>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>               irq_cb()
> >>>>>>>>>>>                   virtqueue_get_buf()
> >>>>>>>>>>>
> >>>>>>>>>>> The used vring is always empty each time we call virtqueue_push() in
> >>>>>>>>>>> userspace. Not sure if it is what we expected.
> >>>>>>>>>> I'm not sure I get the issue.
> >>>>>>>>>>
> >>>>>>>>>> THe used ring should be filled by virtqueue_push() which is done by
> >>>>>>>>>> userspace before?
> >>>>>>>>>>
> >>>>>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
> >>>>>>>>> immediately. In traditional VM (vhost-vdpa) cases, virtio_notify()
> >>>>>>>>> will inject an irq to VM and return, then vcpu thread will call
> >>>>>>>>> interrupt handler. But in container (virtio-vdpa) cases,
> >>>>>>>>> virtio_notify() will call interrupt handler directly. So it looks like
> >>>>>>>>> we have to optimize the virtio-vdpa cases. But one problem is we don't
> >>>>>>>>> know whether we are in the VM user case or container user case.
> >>>>>>>> Yes, but I still don't get why used ring is empty after the ioctl()?
> >>>>>>>> Used ring does not use bounce page so it should be visible to the kernel
> >>>>>>>> driver. What did I miss :) ?
> >>>>>>>>
> >>>>>>> Sorry, I'm not saying the kernel can't see the correct used vring. I
> >>>>>>> mean the kernel will consume the used vring in the ioctl context
> >>>>>>> directly in the virtio-vdpa case. In userspace's view, that means
> >>>>>>> virtqueue_push() is used vring's producer and virtio_notify() is used
> >>>>>>> vring's consumer. They will be called one by one in one thread rather
> >>>>>>> than different threads, which looks odd and has a bad effect on
> >>>>>>> performance.
> >>>>>> Yes, that's why we need a workqueue (WQ_UNBOUND you used). Or do you
> >>>>>> want to squash this patch into patch 8?
> >>>>>>
> >>>>>> So I think we can see obvious difference when virtio-vdpa is used.
> >>>>>>
> >>>>> But it looks like we don't need this workqueue in vhost-vdpa cases.
> >>>>> Any suggestions?
> >>>> I haven't had a deep thought. But I feel we can solve this by using the
> >>>> irq bypass manager (or something similar). Then we don't need it to be
> >>>> relayed via workqueue and vdpa. But I'm not sure how hard it will be.
> >>>>
> >>>    Or let vdpa bus drivers give us some information?
> >>
> >> This kind of 'type' is proposed in the early RFC of vDPA series. One
> >> issue is that at device level, we should not differ virtio from vhost,
> >> so if we introduce that, it might encourge people to design a device
> >> that is dedicated to vhost or virtio which might not be good.
> >>
> >> But we can re-visit this when necessary.
> >>
> > OK, I see. How about adding some information in ops.set_vq_cb()?
>
>
> I'm not sure I get this, maybe you can explain a little bit more?
>

For example, add an extra parameter for ops.set_vq_cb() to indicate
whether this callback will trigger the interrupt handler directly.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC v4 10/11] vduse: Introduce a workqueue for irq injection
  2021-03-08  7:16                             ` Yongji Xie
@ 2021-03-08  7:29                                 ` Jason Wang
  0 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2021-03-08  7:29 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Bob Liu, Christoph Hellwig, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	virtualization, netdev, kvm, linux-aio, linux-fsdevel


On 2021/3/8 3:16 下午, Yongji Xie wrote:
> On Mon, Mar 8, 2021 at 3:02 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/3/8 12:50 下午, Yongji Xie wrote:
>>> On Mon, Mar 8, 2021 at 11:04 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/3/5 4:12 下午, Yongji Xie wrote:
>>>>> On Fri, Mar 5, 2021 at 3:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2021/3/5 3:27 下午, Yongji Xie wrote:
>>>>>>> On Fri, Mar 5, 2021 at 3:01 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>> On 2021/3/5 2:36 下午, Yongji Xie wrote:
>>>>>>>>> On Fri, Mar 5, 2021 at 11:42 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>> On 2021/3/5 11:30 上午, Yongji Xie wrote:
>>>>>>>>>>> On Fri, Mar 5, 2021 at 11:05 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>> On 2021/3/4 4:58 下午, Yongji Xie wrote:
>>>>>>>>>>>>> On Thu, Mar 4, 2021 at 2:59 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>> On 2021/2/23 7:50 下午, Xie Yongji wrote:
>>>>>>>>>>>>>>> This patch introduces a workqueue to support injecting
>>>>>>>>>>>>>>> virtqueue's interrupt asynchronously. This is mainly
>>>>>>>>>>>>>>> for performance considerations which makes sure the push()
>>>>>>>>>>>>>>> and pop() for used vring can be asynchronous.
>>>>>>>>>>>>>> Do you have pref numbers for this patch?
>>>>>>>>>>>>>>
>>>>>>>>>>>>> No, I can do some tests for it if needed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another problem is the VIRTIO_RING_F_EVENT_IDX feature will be useless
>>>>>>>>>>>>> if we call irq callback in ioctl context. Something like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> virtqueue_push();
>>>>>>>>>>>>> virtio_notify();
>>>>>>>>>>>>>            ioctl()
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>>                irq_cb()
>>>>>>>>>>>>>                    virtqueue_get_buf()
>>>>>>>>>>>>>
>>>>>>>>>>>>> The used vring is always empty each time we call virtqueue_push() in
>>>>>>>>>>>>> userspace. Not sure if it is what we expected.
>>>>>>>>>>>> I'm not sure I get the issue.
>>>>>>>>>>>>
>>>>>>>>>>>> THe used ring should be filled by virtqueue_push() which is done by
>>>>>>>>>>>> userspace before?
>>>>>>>>>>>>
>>>>>>>>>>> After userspace call virtqueue_push(), it always call virtio_notify()
>>>>>>>>>>> immediately. In traditional VM (vhost-vdp