kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace
@ 2021-03-31  8:05 Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 01/10] file: Export receive_fd() to modules Xie Yongji
                   ` (10 more replies)
  0 siblings, 11 replies; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

This series introduces a framework, which can be used to implement
vDPA Devices in a userspace program. The work consist of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver will make use of message
mechnism to forward the config operation from vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply
those control messages.

In the data path, the core is mapping dma buffer into VDUSE
daemon's address space, which can be implemented in different ways
depending on the vdpa bus to which the vDPA device is attached.

In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
buffer is reside in a userspace memory region which can be shared to the
VDUSE userspace processs via transferring the shmfd.

The details and our user case is shown below:

------------------------    -------------------------   ----------------------------------------------
|            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
|       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
|       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
------------+-----------     -----------+------------   -------------+----------------------+---------
            |                           |                            |                      |
            |                           |                            |                      |
------------+---------------------------+----------------------------+----------------------+---------
|    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
|    -------+--------           --------+--------            -------+--------          -----+----    |
|           |                           |                           |                       |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
| | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
|           |      virtio bus           |                           |                       |        |
|   --------+----+-----------           |                           |                       |        |
|                |                      |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|      | virtio-blk device |            |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|                |                      |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|     |  virtio-vdpa driver |           |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|                |                      |                           |    vdpa bus           |        |
|     -----------+----------------------+---------------------------+------------           |        |
|                                                                                        ---+---     |
-----------------------------------------------------------------------------------------| NIC |------
                                                                                         ---+---
                                                                                            |
                                                                                   ---------+---------
                                                                                   | Remote Storages |
                                                                                   -------------------

We make use of it to implement a block device connecting to
our distributed storage, which can be used both in containers and
VMs. Thus, we can have an unified technology stack in this two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

Future work:
  - Improve performance
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)

V5 to V6:
- Export receive_fd() instead of __receive_fd()
- Factor out the unmapping logic of pa and va separatedly
- Remove the logic of bounce page allocation in page fault handler
- Use PAGE_SIZE as IOVA allocation granule
- Add EPOLLOUT support
- Enable setting API version in userspace
- Fix some bugs

V4 to V5:
- Remove the patch for irq binding
- Use a single IOTLB for all types of mapping
- Factor out vhost_vdpa_pa_map()
- Add some sample codes in document
- Use receice_fd_user() to pass file descriptor
- Fix some bugs

V3 to V4:
- Rebase to vhost.git
- Split some patches
- Add some documents
- Use ioctl to inject interrupt rather than eventfd
- Enable config interrupt support
- Support binding irq to the specified cpu
- Add two module parameter to limit bounce/iova size
- Create char device rather than anon inode per vduse
- Reuse vhost IOTLB for iova domain
- Rework the message mechnism in control path

V2 to V3:
- Rework the MMU-based IOMMU driver
- Use the iova domain as iova allocator instead of genpool
- Support transferring vma->vm_file in vhost-vdpa
- Add SVA support in vhost-vdpa
- Remove the patches on bounce pages reclaim

V1 to V2:
- Add vhost-vdpa support
- Add some documents
- Based on the vdpa management tool
- Introduce a workqueue for irq injection
- Replace interval tree with array map to store the iova_map

Xie Yongji (10):
  file: Export receive_fd() to modules
  eventfd: Increase the recursion depth of eventfd_signal()
  vhost-vdpa: protect concurrent access to vhost device iotlb
  vhost-iotlb: Add an opaque pointer for vhost IOTLB
  vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
  vdpa: Support transferring virtual addressing during DMA mapping
  vduse: Implement an MMU-based IOMMU driver
  vduse: Introduce VDUSE - vDPA Device in Userspace
  Documentation: Add documentation for VDUSE

 Documentation/userspace-api/index.rst              |    1 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 Documentation/userspace-api/vduse.rst              |  212 +++
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
 drivers/vdpa/vdpa.c                                |    9 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  521 ++++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   70 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
 drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
 drivers/vhost/iotlb.c                              |   20 +-
 drivers/vhost/vdpa.c                               |  154 ++-
 fs/eventfd.c                                       |    2 +-
 fs/file.c                                          |    6 +
 include/linux/eventfd.h                            |    5 +-
 include/linux/file.h                               |    7 +-
 include/linux/vdpa.h                               |   21 +-
 include/linux/vhost_iotlb.h                        |    3 +
 include/uapi/linux/vduse.h                         |  175 +++
 23 files changed, 2548 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/userspace-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

-- 
2.11.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-03-31  9:15   ` Christian Brauner
  2021-03-31  8:05 ` [PATCH v6 02/10] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

Export receive_fd() so that some modules can use
it to pass file descriptor between processes without
missing any security stuffs.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/file.c            | 6 ++++++
 include/linux/file.h | 7 +++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index dab120b71e44..d7d957217576 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1108,6 +1108,12 @@ int __receive_fd(int fd, struct file *file, int __user *ufd, unsigned int o_flag
 	return new_fd;
 }
 
+int receive_fd(struct file *file, unsigned int o_flags)
+{
+	return __receive_fd(-1, file, NULL, o_flags);
+}
+EXPORT_SYMBOL(receive_fd);
+
 static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 {
 	int err = -EBADF;
diff --git a/include/linux/file.h b/include/linux/file.h
index 225982792fa2..4667f9567d3e 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -94,6 +94,9 @@ extern void fd_install(unsigned int fd, struct file *file);
 
 extern int __receive_fd(int fd, struct file *file, int __user *ufd,
 			unsigned int o_flags);
+
+extern int receive_fd(struct file *file, unsigned int o_flags);
+
 static inline int receive_fd_user(struct file *file, int __user *ufd,
 				  unsigned int o_flags)
 {
@@ -101,10 +104,6 @@ static inline int receive_fd_user(struct file *file, int __user *ufd,
 		return -EFAULT;
 	return __receive_fd(-1, file, ufd, o_flags);
 }
-static inline int receive_fd(struct file *file, unsigned int o_flags)
-{
-	return __receive_fd(-1, file, NULL, o_flags);
-}
 static inline int receive_fd_replace(int fd, struct file *file, unsigned int o_flags)
 {
 	return __receive_fd(fd, file, NULL, o_flags);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 02/10] eventfd: Increase the recursion depth of eventfd_signal()
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 01/10] file: Export receive_fd() to modules Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

Increase the recursion depth of eventfd_signal() to 1. This
is the maximum recursion depth we have found so far, which
can be triggered with the following call chain:

    kvm_io_bus_write                        [kvm]
      --> ioeventfd_write                   [kvm]
        --> eventfd_signal                  [eventfd]
          --> vhost_poll_wakeup             [vhost]
            --> vduse_vdpa_kick_vq          [vduse]
              --> eventfd_signal            [eventfd]

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 fs/eventfd.c            | 2 +-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..cc7cd1dbedd3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..886d99cd38ef 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* Maximum recursion depth */
+#define EFD_WAKE_DEPTH 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 01/10] file: Export receive_fd() to modules Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 02/10] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-04-09 16:15   ` Michael S. Tsirkin
  2021-03-31  8:05 ` [PATCH v6 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

Use vhost_dev->mutex to protect vhost device iotlb from
concurrent access.

Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
Cc: stable@vger.kernel.org
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
 drivers/vhost/vdpa.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 3947fbc2d1d5..63b28d3aee7c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -725,9 +725,11 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 	const struct vdpa_config_ops *ops = vdpa->config;
 	int r = 0;
 
+	mutex_lock(&dev->mutex);
+
 	r = vhost_dev_check_owner(dev);
 	if (r)
-		return r;
+		goto unlock;
 
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
@@ -748,6 +750,8 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
 		r = -EINVAL;
 		break;
 	}
+unlock:
+	mutex_unlock(&dev->mutex);
 
 	return r;
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (2 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

Add an opaque pointer for vhost IOTLB. And introduce
vhost_iotlb_add_range_ctx() to accept it.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/iotlb.c       | 20 ++++++++++++++++----
 include/linux/vhost_iotlb.h |  3 +++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..5c99e1112cbb 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -36,19 +36,21 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
 EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
 
 /**
- * vhost_iotlb_add_range - add a new range to vhost IOTLB
+ * vhost_iotlb_add_range_ctx - add a new range to vhost IOTLB
  * @iotlb: the IOTLB
  * @start: start of the IOVA range
  * @last: last of IOVA range
  * @addr: the address that is mapped to @start
  * @perm: access permission of this range
+ * @opaque: the opaque pointer for the new mapping
  *
  * Returns an error last is smaller than start or memory allocation
  * fails
  */
-int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
-			  u64 start, u64 last,
-			  u64 addr, unsigned int perm)
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
+			      u64 start, u64 last,
+			      u64 addr, unsigned int perm,
+			      void *opaque)
 {
 	struct vhost_iotlb_map *map;
 
@@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 	map->last = last;
 	map->addr = addr;
 	map->perm = perm;
+	map->opaque = opaque;
 
 	iotlb->nmaps++;
 	vhost_iotlb_itree_insert(map, &iotlb->root);
@@ -80,6 +83,15 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_iotlb_add_range_ctx);
+
+int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
+			  u64 start, u64 last,
+			  u64 addr, unsigned int perm)
+{
+	return vhost_iotlb_add_range_ctx(iotlb, start, last,
+					 addr, perm, NULL);
+}
 EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
 
 /**
diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
index 6b09b786a762..2d0e2f52f938 100644
--- a/include/linux/vhost_iotlb.h
+++ b/include/linux/vhost_iotlb.h
@@ -17,6 +17,7 @@ struct vhost_iotlb_map {
 	u32 perm;
 	u32 flags_padding;
 	u64 __subtree_last;
+	void *opaque;
 };
 
 #define VHOST_IOTLB_FLAG_RETIRE 0x1
@@ -29,6 +30,8 @@ struct vhost_iotlb {
 	unsigned int flags;
 };
 
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb, u64 start, u64 last,
+			      u64 addr, unsigned int perm, void *opaque);
 int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
 			  u64 addr, unsigned int perm);
 void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (3 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap() Xie Yongji
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

Add an opaque pointer for DMA mapping.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 6 +++---
 drivers/vhost/vdpa.c             | 2 +-
 include/linux/vdpa.h             | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 5b6b2f87d40c..ff331f088baf 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -512,14 +512,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
 }
 
 static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
-			   u64 pa, u32 perm)
+			   u64 pa, u32 perm, void *opaque)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
 	int ret;
 
 	spin_lock(&vdpasim->iommu_lock);
-	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
-				    perm);
+	ret = vhost_iotlb_add_range_ctx(vdpasim->iommu, iova, iova + size - 1,
+					pa, perm, opaque);
 	spin_unlock(&vdpasim->iommu_lock);
 
 	return ret;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 63b28d3aee7c..22cab98610a1 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -550,7 +550,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 15fa085fab05..b01f7c9096bf 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -241,7 +241,7 @@ struct vdpa_config_ops {
 	/* DMA ops */
 	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
 	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
-		       u64 pa, u32 perm);
+		       u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
 
 	/* Free device resources */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (4 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-03-31  8:05 ` [PATCH v6 07/10] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

The upcoming patch is going to support VA mapping/unmapping.
So let's factor out the logic of PA mapping/unmapping firstly
to make the code more readable.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vdpa.c | 53 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 22cab98610a1..f9aab9013745 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -483,7 +483,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 	return r;
 }
 
-static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vhost_iotlb *iotlb = dev->iotlb;
@@ -505,6 +505,11 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 	}
 }
 
+static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+{
+	return vhost_vdpa_pa_unmap(v, start, last);
+}
+
 static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v)
 {
 	struct vhost_dev *dev = &v->vdev;
@@ -585,37 +590,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
-static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
-					   struct vhost_iotlb_msg *msg)
+static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
+			     u64 iova, u64 size, u64 uaddr, u32 perm)
 {
 	struct vhost_dev *dev = &v->vdev;
-	struct vhost_iotlb *iotlb = dev->iotlb;
 	struct page **page_list;
 	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
-	u64 iova = msg->iova;
+	u64 start = iova;
 	long pinned;
 	int ret = 0;
 
-	if (msg->iova < v->range.first ||
-	    msg->iova + msg->size - 1 > v->range.last)
-		return -EINVAL;
-
-	if (vhost_iotlb_itree_first(iotlb, msg->iova,
-				    msg->iova + msg->size - 1))
-		return -EEXIST;
-
 	/* Limit the use of memory for bookkeeping */
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
 	if (!page_list)
 		return -ENOMEM;
 
-	if (msg->perm & VHOST_ACCESS_WO)
+	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;
 
-	npages = PAGE_ALIGN(msg->size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
+	npages = PAGE_ALIGN(size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
@@ -629,7 +625,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 		goto unlock;
 	}
 
-	cur_base = msg->uaddr & PAGE_MASK;
+	cur_base = uaddr & PAGE_MASK;
 	iova &= PAGE_MASK;
 	nchunks = 0;
 
@@ -660,7 +656,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     msg->perm);
+						     perm);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -689,7 +685,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, msg->perm);
+			     map_pfn << PAGE_SHIFT, perm);
 out:
 	if (ret) {
 		if (nchunks) {
@@ -708,13 +704,32 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 			for (pfn = map_pfn; pfn <= last_pfn; pfn++)
 				unpin_user_page(pfn_to_page(pfn));
 		}
-		vhost_vdpa_unmap(v, msg->iova, msg->size);
+		vhost_vdpa_unmap(v, start, size);
 	}
 unlock:
 	mmap_read_unlock(dev->mm);
 free:
 	free_page((unsigned long)page_list);
 	return ret;
+
+}
+
+static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
+					   struct vhost_iotlb_msg *msg)
+{
+	struct vhost_dev *dev = &v->vdev;
+	struct vhost_iotlb *iotlb = dev->iotlb;
+
+	if (msg->iova < v->range.first ||
+	    msg->iova + msg->size - 1 > v->range.last)
+		return -EINVAL;
+
+	if (vhost_iotlb_itree_first(iotlb, msg->iova,
+				    msg->iova + msg->size - 1))
+		return -EEXIST;
+
+	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
+				 msg->perm);
 }
 
 static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 07/10] vdpa: Support transferring virtual addressing during DMA mapping
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (5 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap() Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-04-08  2:36   ` Jason Wang
  2021-03-31  8:05 ` [PATCH v6 08/10] vduse: Implement an MMU-based IOMMU driver Xie Yongji
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

This patch introduces an attribute for vDPA device to indicate
whether virtual address can be used. If vDPA device driver set
it, vhost-vdpa bus driver will not pin user page and transfer
userspace virtual address instead of physical address during
DMA mapping. And corresponding vma->vm_file and offset will be
also passed as an opaque pointer.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
 drivers/vdpa/vdpa.c               |  9 +++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  |  2 +-
 drivers/vdpa/virtio_pci/vp_vdpa.c |  2 +-
 drivers/vhost/vdpa.c              | 99 ++++++++++++++++++++++++++++++++++-----
 include/linux/vdpa.h              | 19 ++++++--
 7 files changed, 116 insertions(+), 19 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index d555a6a5d1ba..aee013f3eb5f 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -431,7 +431,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
-				    dev, &ifc_vdpa_ops, NULL);
+				    dev, &ifc_vdpa_ops, NULL, false);
 	if (adapter == NULL) {
 		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
 		return -ENOMEM;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 71397fdafa6a..fb62ebcf464a 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1982,7 +1982,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
 	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
 
 	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
-				 NULL);
+				 NULL, false);
 	if (IS_ERR(ndev))
 		return PTR_ERR(ndev);
 
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 5cffce67cab0..97fbac276c72 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -71,6 +71,7 @@ static void vdpa_release_dev(struct device *d)
  * @config: the bus operations that is supported by this device
  * @size: size of the parent structure that contains private data
  * @name: name of the vdpa device; optional.
+ * @use_va: indicate whether virtual address must be used by this device
  *
  * Driver should use vdpa_alloc_device() wrapper macro instead of
  * using this directly.
@@ -80,7 +81,8 @@ static void vdpa_release_dev(struct device *d)
  */
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					size_t size, const char *name)
+					size_t size, const char *name,
+					bool use_va)
 {
 	struct vdpa_device *vdev;
 	int err = -EINVAL;
@@ -91,6 +93,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	if (!!config->dma_map != !!config->dma_unmap)
 		goto err;
 
+	/* It should only work for the device that use on-chip IOMMU */
+	if (use_va && !(config->dma_map || config->set_map))
+		goto err;
+
 	err = -ENOMEM;
 	vdev = kzalloc(size, GFP_KERNEL);
 	if (!vdev)
@@ -106,6 +112,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	vdev->index = err;
 	vdev->config = config;
 	vdev->features_valid = false;
+	vdev->use_va = use_va;
 
 	if (name)
 		err = dev_set_name(&vdev->dev, "%s", name);
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index ff331f088baf..d26334e9a412 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
 		ops = &vdpasim_config_ops;
 
 	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
-				    dev_attr->name);
+				    dev_attr->name, false);
 	if (!vdpasim)
 		goto err_alloc;
 
diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c
index 1321a2fcd088..03b36aed48d6 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -377,7 +377,7 @@ static int vp_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return ret;
 
 	vp_vdpa = vdpa_alloc_device(struct vp_vdpa, vdpa,
-				    dev, &vp_vdpa_ops, NULL);
+				    dev, &vp_vdpa_ops, NULL, false);
 	if (vp_vdpa == NULL) {
 		dev_err(dev, "vp_vdpa: Failed to allocate vDPA structure\n");
 		return -ENOMEM;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index f9aab9013745..613ea400e0e5 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -505,8 +505,28 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 	}
 }
 
+static void vhost_vdpa_va_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+{
+	struct vhost_dev *dev = &v->vdev;
+	struct vhost_iotlb *iotlb = dev->iotlb;
+	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
+
+	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		fput(map_file->file);
+		kfree(map_file);
+		vhost_iotlb_map_free(iotlb, map);
+	}
+}
+
 static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
+	struct vdpa_device *vdpa = v->vdpa;
+
+	if (vdpa->use_va)
+		return vhost_vdpa_va_unmap(v, start, last);
+
 	return vhost_vdpa_pa_unmap(v, start, last);
 }
 
@@ -541,21 +561,21 @@ static int perm_to_iommu_flags(u32 perm)
 	return flags | IOMMU_CACHE;
 }
 
-static int vhost_vdpa_map(struct vhost_vdpa *v,
-			  u64 iova, u64 size, u64 pa, u32 perm)
+static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
+			  u64 size, u64 pa, u32 perm, void *opaque)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vdpa_device *vdpa = v->vdpa;
 	const struct vdpa_config_ops *ops = vdpa->config;
 	int r = 0;
 
-	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
-				  pa, perm);
+	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
+				      pa, perm, opaque);
 	if (r)
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
@@ -563,13 +583,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		r = iommu_map(v->domain, iova, pa, size,
 			      perm_to_iommu_flags(perm));
 	}
-
-	if (r)
+	if (r) {
 		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
-	else
+		return r;
+	}
+
+	if (!vdpa->use_va)
 		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 
-	return r;
+	return 0;
 }
 
 static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
@@ -590,6 +612,56 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
+static int vhost_vdpa_va_map(struct vhost_vdpa *v,
+			     u64 iova, u64 size, u64 uaddr, u32 perm)
+{
+	struct vhost_dev *dev = &v->vdev;
+	u64 offset, map_size, map_iova = iova;
+	struct vdpa_map_file *map_file;
+	struct vm_area_struct *vma;
+	int ret;
+
+	mmap_read_lock(dev->mm);
+
+	while (size) {
+		vma = find_vma(dev->mm, uaddr);
+		if (!vma) {
+			ret = -EINVAL;
+			break;
+		}
+		map_size = min(size, vma->vm_end - uaddr);
+		if (!(vma->vm_file && (vma->vm_flags & VM_SHARED) &&
+			!(vma->vm_flags & (VM_IO | VM_PFNMAP))))
+			goto next;
+
+		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
+		if (!map_file) {
+			ret = -ENOMEM;
+			break;
+		}
+		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
+		map_file->offset = offset;
+		map_file->file = get_file(vma->vm_file);
+		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
+				     perm, map_file);
+		if (ret) {
+			fput(map_file->file);
+			kfree(map_file);
+			break;
+		}
+next:
+		size -= map_size;
+		uaddr += map_size;
+		map_iova += map_size;
+	}
+	if (ret)
+		vhost_vdpa_unmap(v, iova, map_iova - iova);
+
+	mmap_read_unlock(dev->mm);
+
+	return ret;
+}
+
 static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 			     u64 iova, u64 size, u64 uaddr, u32 perm)
 {
@@ -656,7 +728,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     perm);
+						     perm, NULL);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -685,7 +757,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, perm);
+			     map_pfn << PAGE_SHIFT, perm, NULL);
 out:
 	if (ret) {
 		if (nchunks) {
@@ -718,6 +790,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 					   struct vhost_iotlb_msg *msg)
 {
 	struct vhost_dev *dev = &v->vdev;
+	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
 
 	if (msg->iova < v->range.first ||
@@ -728,6 +801,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				    msg->iova + msg->size - 1))
 		return -EEXIST;
 
+	if (vdpa->use_va)
+		return vhost_vdpa_va_map(v, msg->iova, msg->size,
+					 msg->uaddr, msg->perm);
+
 	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
 				 msg->perm);
 }
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index b01f7c9096bf..e67404e4b23e 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
  * @config: the configuration ops for this device.
  * @index: device index
  * @features_valid: were features initialized? for legacy guests
+ * @use_va: indicate whether virtual address must be used by this device
  * @nvqs: maximum number of supported virtqueues
  * @mdev: management device pointer; caller must setup when registering device as part
  *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
@@ -54,6 +55,7 @@ struct vdpa_device {
 	const struct vdpa_config_ops *config;
 	unsigned int index;
 	bool features_valid;
+	bool use_va;
 	int nvqs;
 	struct vdpa_mgmt_dev *mdev;
 };
@@ -69,6 +71,16 @@ struct vdpa_iova_range {
 };
 
 /**
+ * Corresponding file area for device memory mapping
+ * @file: vma->vm_file for the mapping
+ * @offset: mapping offset in the vm_file
+ */
+struct vdpa_map_file {
+	struct file *file;
+	u64 offset;
+};
+
+/**
  * vDPA_config_ops - operations for configuring a vDPA device.
  * Note: vDPA device drivers are required to implement all of the
  * operations unless it is mentioned to be optional in the following
@@ -250,14 +262,15 @@ struct vdpa_config_ops {
 
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					size_t size, const char *name);
+					size_t size, const char *name,
+					bool use_va);
 
-#define vdpa_alloc_device(dev_struct, member, parent, config, name)   \
+#define vdpa_alloc_device(dev_struct, member, parent, config, name, use_va)   \
 			  container_of(__vdpa_alloc_device( \
 				       parent, config, \
 				       sizeof(dev_struct) + \
 				       BUILD_BUG_ON_ZERO(offsetof( \
-				       dev_struct, member)), name), \
+				       dev_struct, member)), name, use_va), \
 				       dev_struct, member)
 
 int vdpa_register_device(struct vdpa_device *vdev, int nvqs);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 08/10] vduse: Implement an MMU-based IOMMU driver
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (6 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 07/10] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-04-08  3:25   ` Jason Wang
  2021-03-31  8:05 ` [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

This implements an MMU-based IOMMU driver to support mapping
kernel dma buffer into userspace. The basic idea behind it is
treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
up MMU mapping instead of IOMMU mapping for the DMA transfer so
that the userspace process is able to use its virtual address to
access the dma buffer in kernel.

And to avoid security issue, a bounce-buffering mechanism is
introduced to prevent userspace accessing the original buffer
directly.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 521 +++++++++++++++++++++++++++++++++++
 drivers/vdpa/vdpa_user/iova_domain.h |  70 +++++
 2 files changed, 591 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..ed2a944d99b4
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,521 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+#include <linux/vdpa.h>
+
+#include "iova_domain.h"
+
+static int vduse_iotlb_add_range(struct vduse_iova_domain *domain,
+				 u64 start, u64 last,
+				 u64 addr, unsigned int perm,
+				 struct file *file, u64 offset)
+{
+	struct vdpa_map_file *map_file;
+	int ret;
+
+	map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
+	if (!map_file)
+		return -ENOMEM;
+
+	map_file->file = get_file(file);
+	map_file->offset = offset;
+
+	ret = vhost_iotlb_add_range_ctx(domain->iotlb, start, last,
+					addr, perm, map_file);
+	if (ret) {
+		fput(map_file->file);
+		kfree(map_file);
+		return ret;
+	}
+	return 0;
+}
+
+static void vduse_iotlb_del_range(struct vduse_iova_domain *domain,
+				  u64 start, u64 last)
+{
+	struct vdpa_map_file *map_file;
+	struct vhost_iotlb_map *map;
+
+	while ((map = vhost_iotlb_itree_first(domain->iotlb, start, last))) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		fput(map_file->file);
+		kfree(map_file);
+		vhost_iotlb_map_free(domain->iotlb, map);
+	}
+}
+
+int vduse_domain_set_map(struct vduse_iova_domain *domain,
+			 struct vhost_iotlb *iotlb)
+{
+	struct vdpa_map_file *map_file;
+	struct vhost_iotlb_map *map;
+	u64 start = 0ULL, last = ULLONG_MAX;
+	int ret;
+
+	spin_lock(&domain->iotlb_lock);
+	vduse_iotlb_del_range(domain, start, last);
+
+	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+	     map = vhost_iotlb_itree_next(map, start, last)) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		ret = vduse_iotlb_add_range(domain, map->start, map->last,
+					    map->addr, map->perm,
+					    map_file->file,
+					    map_file->offset);
+		if (ret)
+			goto err;
+	}
+	spin_unlock(&domain->iotlb_lock);
+
+	return 0;
+err:
+	vduse_iotlb_del_range(domain, start, last);
+	spin_unlock(&domain->iotlb_lock);
+	return ret;
+}
+
+static int vduse_domain_map_bounce_page(struct vduse_iova_domain *domain,
+					 u64 iova, u64 size, u64 paddr)
+{
+	struct vduse_bounce_map *map;
+	u64 last = iova + size - 1;
+
+	while (iova <= last) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		if (!map->bounce_page) {
+			map->bounce_page = alloc_page(GFP_ATOMIC);
+			if (!map->bounce_page)
+				return -ENOMEM;
+		}
+		map->orig_phys = paddr;
+		paddr += PAGE_SIZE;
+		iova += PAGE_SIZE;
+	}
+	return 0;
+}
+
+static void vduse_domain_unmap_bounce_page(struct vduse_iova_domain *domain,
+					   u64 iova, u64 size)
+{
+	struct vduse_bounce_map *map;
+	u64 last = iova + size - 1;
+
+	while (iova <= last) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		map->orig_phys = INVALID_PHYS_ADDR;
+		iova += PAGE_SIZE;
+	}
+}
+
+static void do_bounce(phys_addr_t orig, void *addr, size_t size,
+		      enum dma_data_direction dir)
+{
+	unsigned long pfn = PFN_DOWN(orig);
+	unsigned int offset = offset_in_page(orig);
+	char *buffer;
+	unsigned int sz = 0;
+
+	while (size) {
+		sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+		buffer = kmap_atomic(pfn_to_page(pfn));
+		if (dir == DMA_TO_DEVICE)
+			memcpy(addr, buffer + offset, sz);
+		else
+			memcpy(buffer + offset, addr, sz);
+		kunmap_atomic(buffer);
+
+		size -= sz;
+		pfn++;
+		addr += sz;
+		offset = 0;
+	}
+}
+
+static void vduse_domain_bounce(struct vduse_iova_domain *domain,
+				dma_addr_t iova, size_t size,
+				enum dma_data_direction dir)
+{
+	struct vduse_bounce_map *map;
+	unsigned int offset;
+	void *addr;
+	size_t sz;
+
+	while (size) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		offset = offset_in_page(iova);
+		sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+		if (WARN_ON(!map->bounce_page ||
+			    map->orig_phys == INVALID_PHYS_ADDR))
+			return;
+
+		addr = page_address(map->bounce_page) + offset;
+		do_bounce(map->orig_phys + offset, addr, sz, dir);
+		size -= sz;
+		iova += sz;
+	}
+}
+
+static struct page *
+vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	u64 start = iova & PAGE_MASK;
+	u64 last = start + PAGE_SIZE - 1;
+	struct vhost_iotlb_map *map;
+	struct page *page = NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
+	if (!map)
+		goto out;
+
+	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
+	get_page(page);
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	struct vduse_bounce_map *map;
+	struct page *page = NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+	if (!map->bounce_page)
+		goto out;
+
+	page = map->bounce_page;
+	get_page(page);
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static void
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain)
+{
+	struct vduse_bounce_map *map;
+	unsigned long pfn, bounce_pfns;
+
+	spin_lock(&domain->iotlb_lock);
+	bounce_pfns = domain->bounce_size >> PAGE_SHIFT;
+
+	for (pfn = 0; pfn < bounce_pfns; pfn++) {
+		map = &domain->bounce_maps[pfn];
+		if (WARN_ON(map->orig_phys != INVALID_PHYS_ADDR))
+			continue;
+
+		if (!map->bounce_page)
+			continue;
+
+		__free_page(map->bounce_page);
+		map->bounce_page = NULL;
+	}
+	spin_unlock(&domain->iotlb_lock);
+}
+
+void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
+{
+	if (!domain->bounce_map)
+		return;
+
+	spin_lock(&domain->iotlb_lock);
+	if (!domain->bounce_map)
+		goto unlock;
+
+	vduse_iotlb_del_range(domain, 0, domain->bounce_size - 1);
+	domain->bounce_map = 0;
+unlock:
+	spin_unlock(&domain->iotlb_lock);
+}
+
+static int vduse_domain_init_bounce_map(struct vduse_iova_domain *domain)
+{
+	int ret = 0;
+
+	if (domain->bounce_map)
+		return 0;
+
+	spin_lock(&domain->iotlb_lock);
+	if (domain->bounce_map)
+		goto unlock;
+
+	ret = vduse_iotlb_add_range(domain, 0, domain->bounce_size - 1,
+				    0, VHOST_MAP_RW, domain->file, 0);
+	if (ret)
+		goto unlock;
+
+	domain->bounce_map = 1;
+unlock:
+	spin_unlock(&domain->iotlb_lock);
+	return ret;
+}
+
+static dma_addr_t
+vduse_domain_alloc_iova(struct iova_domain *iovad,
+			unsigned long size, unsigned long limit)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+	unsigned long iova_pfn;
+
+	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+		iova_len = roundup_pow_of_two(iova_len);
+	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
+
+	return iova_pfn << shift;
+}
+
+static void vduse_domain_free_iova(struct iova_domain *iovad,
+				   dma_addr_t iova, size_t size)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+
+	free_iova_fast(iovad, iova >> shift, iova_len);
+}
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				 struct page *page, unsigned long offset,
+				 size_t size, enum dma_data_direction dir,
+				 unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+	unsigned long limit = domain->bounce_size - 1;
+	phys_addr_t pa = page_to_phys(page) + offset;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+
+	if (!iova)
+		return DMA_MAPPING_ERROR;
+
+	if (vduse_domain_init_bounce_map(domain))
+		goto err;
+
+	if (vduse_domain_map_bounce_page(domain, (u64)iova, (u64)size, pa))
+		goto err;
+
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, iova, size, DMA_TO_DEVICE);
+
+	return iova;
+err:
+	vduse_domain_free_iova(iovad, iova, size);
+	return DMA_MAPPING_ERROR;
+}
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			     dma_addr_t dma_addr, size_t size,
+			     enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+
+	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, dma_addr, size, DMA_FROM_DEVICE);
+
+	vduse_domain_unmap_bounce_page(domain, (u64)dma_addr, (u64)size);
+	vduse_domain_free_iova(iovad, dma_addr, size);
+}
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				  size_t size, dma_addr_t *dma_addr,
+				  gfp_t flag, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	unsigned long limit = domain->iova_limit;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+	void *orig = alloc_pages_exact(size, flag);
+
+	if (!iova || !orig)
+		goto err;
+
+	spin_lock(&domain->iotlb_lock);
+	if (vduse_iotlb_add_range(domain, (u64)iova, (u64)iova + size - 1,
+				  virt_to_phys(orig), VHOST_MAP_RW,
+				  domain->file, (u64)iova)) {
+		spin_unlock(&domain->iotlb_lock);
+		goto err;
+	}
+	spin_unlock(&domain->iotlb_lock);
+
+	*dma_addr = iova;
+
+	return orig;
+err:
+	*dma_addr = DMA_MAPPING_ERROR;
+	if (orig)
+		free_pages_exact(orig, size);
+	if (iova)
+		vduse_domain_free_iova(iovad, iova, size);
+
+	return NULL;
+}
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
+	phys_addr_t pa;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
+				      (u64)dma_addr + size - 1);
+	if (WARN_ON(!map)) {
+		spin_unlock(&domain->iotlb_lock);
+		return;
+	}
+	map_file = (struct vdpa_map_file *)map->opaque;
+	fput(map_file->file);
+	kfree(map_file);
+	pa = map->addr;
+	vhost_iotlb_map_free(domain->iotlb, map);
+	spin_unlock(&domain->iotlb_lock);
+
+	vduse_domain_free_iova(iovad, dma_addr, size);
+	free_pages_exact(phys_to_virt(pa), size);
+}
+
+static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
+{
+	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
+	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
+	struct page *page;
+
+	if (!domain)
+		return VM_FAULT_SIGBUS;
+
+	if (iova < domain->bounce_size)
+		page = vduse_domain_get_bounce_page(domain, iova);
+	else
+		page = vduse_domain_get_mapping_page(domain, iova);
+
+	if (!page)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = page;
+
+	return 0;
+}
+
+static const struct vm_operations_struct vduse_domain_mmap_ops = {
+	.fault = vduse_domain_mmap_fault,
+};
+
+static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_private_data = domain;
+	vma->vm_ops = &vduse_domain_mmap_ops;
+
+	return 0;
+}
+
+static int vduse_domain_release(struct inode *inode, struct file *file)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vduse_domain_reset_bounce_map(domain);
+	vduse_domain_free_bounce_pages(domain);
+	put_iova_domain(&domain->stream_iovad);
+	put_iova_domain(&domain->consistent_iovad);
+	vhost_iotlb_free(domain->iotlb);
+	vfree(domain->bounce_maps);
+	kfree(domain);
+
+	return 0;
+}
+
+static const struct file_operations vduse_domain_fops = {
+	.mmap = vduse_domain_mmap,
+	.release = vduse_domain_release,
+};
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain)
+{
+	fput(domain->file);
+}
+
+struct vduse_iova_domain *
+vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
+{
+	struct vduse_iova_domain *domain;
+	struct file *file;
+	struct vduse_bounce_map *map;
+	unsigned long pfn, bounce_pfns;
+
+	bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
+	if (iova_limit <= bounce_size)
+		return NULL;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->iotlb = vhost_iotlb_alloc(0, 0);
+	if (!domain->iotlb)
+		goto err_iotlb;
+
+	domain->iova_limit = iova_limit;
+	domain->bounce_size = PAGE_ALIGN(bounce_size);
+	domain->bounce_maps = vzalloc(bounce_pfns *
+				sizeof(struct vduse_bounce_map));
+	if (!domain->bounce_maps)
+		goto err_map;
+
+	for (pfn = 0; pfn < bounce_pfns; pfn++) {
+		map = &domain->bounce_maps[pfn];
+		map->orig_phys = INVALID_PHYS_ADDR;
+	}
+	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
+				domain, O_RDWR);
+	if (IS_ERR(file))
+		goto err_file;
+
+	domain->file = file;
+	spin_lock_init(&domain->iotlb_lock);
+	init_iova_domain(&domain->stream_iovad,
+			PAGE_SIZE, IOVA_START_PFN);
+	init_iova_domain(&domain->consistent_iovad,
+			PAGE_SIZE, bounce_pfns);
+
+	return domain;
+err_file:
+	vfree(domain->bounce_maps);
+err_map:
+	vhost_iotlb_free(domain->iotlb);
+err_iotlb:
+	kfree(domain);
+	return NULL;
+}
+
+int vduse_domain_init(void)
+{
+	return iova_cache_get();
+}
+
+void vduse_domain_exit(void)
+{
+	iova_cache_put();
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
new file mode 100644
index 000000000000..31f822afa5f5
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_IOVA_DOMAIN_H
+#define _VDUSE_IOVA_DOMAIN_H
+
+#include <linux/iova.h>
+#include <linux/dma-mapping.h>
+#include <linux/vhost_iotlb.h>
+
+#define IOVA_START_PFN 1
+
+#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
+
+struct vduse_bounce_map {
+	struct page *bounce_page;
+	u64 orig_phys;
+};
+
+struct vduse_iova_domain {
+	struct iova_domain stream_iovad;
+	struct iova_domain consistent_iovad;
+	struct vduse_bounce_map *bounce_maps;
+	size_t bounce_size;
+	unsigned long iova_limit;
+	int bounce_map;
+	struct vhost_iotlb *iotlb;
+	spinlock_t iotlb_lock;
+	struct file *file;
+};
+
+int vduse_domain_set_map(struct vduse_iova_domain *domain,
+			struct vhost_iotlb *iotlb);
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				struct page *page, unsigned long offset,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs);
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			dma_addr_t dma_addr, size_t size,
+			enum dma_data_direction dir, unsigned long attrs);
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				size_t size, dma_addr_t *dma_addr,
+				gfp_t flag, unsigned long attrs);
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs);
+
+void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain);
+
+struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
+						size_t bounce_size);
+
+int vduse_domain_init(void);
+
+void vduse_domain_exit(void);
+
+#endif /* _VDUSE_IOVA_DOMAIN_H */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (7 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 08/10] vduse: Implement an MMU-based IOMMU driver Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-04-08  6:57   ` Jason Wang
  2021-04-16  3:24   ` Jason Wang
  2021-03-31  8:05 ` [PATCH v6 10/10] Documentation: Add documentation for VDUSE Xie Yongji
  2021-04-14  7:34 ` [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
  10 siblings, 2 replies; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

This VDUSE driver enables implementing vDPA devices in userspace.
Both control path and data path of vDPA devices will be able to
be handled in userspace.

In the control path, the VDUSE driver will make use of message
mechnism to forward the config operation from vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply
those control messages.

In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
the file descriptors referring to vDPA device's iova regions. Then
userspace can use mmap() to access those iova regions. Besides,
userspace can use ioctl() to inject interrupt and use the eventfd
mechanism to receive virtqueue kicks.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
 include/uapi/linux/vduse.h                         |  175 +++
 6 files changed, 1554 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a4c75a28c839..71722e6f8f23 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
 'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
 '|'   00-7F  linux/media.h
 0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
 0x89  00-06  arch/x86/include/asm/sockios.h
 0x89  0B-DF  linux/sockios.h
 0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a245809c99d0..77a1da522c21 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -25,6 +25,16 @@ config VDPA_SIM_NET
 	help
 	  vDPA networking device simulator which loops TX traffic back to RX.
 
+config VDPA_USER
+	tristate "VDUSE (vDPA Device in Userspace) support"
+	depends on EVENTFD && MMU && HAS_DMA
+	select DMA_OPS
+	select VHOST_IOTLB
+	select IOMMU_IOVA
+	help
+	  With VDUSE it is possible to emulate a vDPA Device
+	  in a userspace program.
+
 config IFCVF
 	tristate "Intel IFC VF vDPA driver"
 	depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
 obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
 obj-$(CONFIG_IFCVF)    += ifcvf/
 obj-$(CONFIG_MLX5_VDPA) += mlx5/
 obj-$(CONFIG_VP_VDPA)    += virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index 000000000000..51ca73464d0d
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1362 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/dma-map-ops.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/vdpa.h>
+#include <uapi/linux/vduse.h>
+#include <uapi/linux/vdpa.h>
+#include <uapi/linux/virtio_config.h>
+#include <linux/mod_devicetable.h>
+
+#include "iova_domain.h"
+
+#define DRV_VERSION  "1.0"
+#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
+#define DRV_DESC     "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+
+struct vduse_virtqueue {
+	u16 index;
+	bool ready;
+	spinlock_t kick_lock;
+	spinlock_t irq_lock;
+	struct eventfd_ctx *kickfd;
+	struct vdpa_callback cb;
+	struct work_struct inject;
+};
+
+struct vduse_dev;
+
+struct vduse_vdpa {
+	struct vdpa_device vdpa;
+	struct vduse_dev *dev;
+};
+
+struct vduse_dev {
+	struct vduse_vdpa *vdev;
+	struct device dev;
+	struct cdev cdev;
+	struct vduse_virtqueue *vqs;
+	struct vduse_iova_domain *domain;
+	struct mutex lock;
+	spinlock_t msg_lock;
+	atomic64_t msg_unique;
+	wait_queue_head_t waitq;
+	struct list_head send_list;
+	struct list_head recv_list;
+	struct list_head list;
+	struct vdpa_callback config_cb;
+	spinlock_t irq_lock;
+	unsigned long api_version;
+	bool connected;
+	int minor;
+	u16 vq_size_max;
+	u16 vq_num;
+	u32 vq_align;
+	u32 device_id;
+	u32 vendor_id;
+};
+
+struct vduse_dev_msg {
+	struct vduse_dev_request req;
+	struct vduse_dev_response resp;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	bool completed;
+};
+
+struct vduse_control {
+	unsigned long api_version;
+};
+
+static unsigned long max_bounce_size = (64 * 1024 * 1024);
+module_param(max_bounce_size, ulong, 0444);
+MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
+
+static unsigned long max_iova_size = (128 * 1024 * 1024);
+module_param(max_iova_size, ulong, 0444);
+MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
+
+static DEFINE_MUTEX(vduse_lock);
+static LIST_HEAD(vduse_devs);
+static DEFINE_IDA(vduse_ida);
+
+static dev_t vduse_major;
+static struct class *vduse_class;
+static struct workqueue_struct *vduse_irq_wq;
+
+static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
+{
+	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
+
+	return vdev->dev;
+}
+
+static inline struct vduse_dev *dev_to_vduse(struct device *dev)
+{
+	struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+	return vdpa_to_vduse(vdpa);
+}
+
+static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
+					    uint32_t request_id)
+{
+	struct vduse_dev_msg *tmp, *msg = NULL;
+
+	list_for_each_entry(tmp, head, list) {
+		if (tmp->req.request_id == request_id) {
+			msg = tmp;
+			list_del(&tmp->list);
+			break;
+		}
+	}
+
+	return msg;
+}
+
+static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
+{
+	struct vduse_dev_msg *msg = NULL;
+
+	if (!list_empty(head)) {
+		msg = list_first_entry(head, struct vduse_dev_msg, list);
+		list_del(&msg->list);
+	}
+
+	return msg;
+}
+
+static void vduse_enqueue_msg(struct list_head *head,
+			      struct vduse_dev_msg *msg)
+{
+	list_add_tail(&msg->list, head);
+}
+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+			      struct vduse_dev_msg *msg)
+{
+	init_waitqueue_head(&msg->waitq);
+	spin_lock(&dev->msg_lock);
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	spin_unlock(&dev->msg_lock);
+	wait_event_interruptible(msg->waitq, msg->completed);
+	spin_lock(&dev->msg_lock);
+	if (!msg->completed)
+		list_del(&msg->list);
+	spin_unlock(&dev->msg_lock);
+
+	return (msg->resp.result == VDUSE_REQUEST_OK) ? 0 : -1;
+}
+
+static u32 vduse_dev_get_request_id(struct vduse_dev *dev)
+{
+	return atomic64_fetch_inc(&dev->msg_unique);
+}
+
+static u64 vduse_dev_get_features(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_GET_FEATURES;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+
+	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.f.features;
+}
+
+static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_FEATURES;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.f.features = features;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static u8 vduse_dev_get_status(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_GET_STATUS;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+
+	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.s.status;
+}
+
+static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_STATUS;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.s.status = status;
+
+	vduse_dev_msg_sync(dev, &msg);
+}
+
+static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
+				 void *buf, unsigned int len)
+{
+	struct vduse_dev_msg msg = { 0 };
+	unsigned int sz;
+
+	while (len) {
+		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
+		msg.req.type = VDUSE_GET_CONFIG;
+		msg.req.request_id = vduse_dev_get_request_id(dev);
+		msg.req.config.offset = offset;
+		msg.req.config.len = sz;
+		vduse_dev_msg_sync(dev, &msg);
+		memcpy(buf, msg.resp.config.data, sz);
+		buf += sz;
+		offset += sz;
+		len -= sz;
+	}
+}
+
+static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
+				 const void *buf, unsigned int len)
+{
+	struct vduse_dev_msg msg = { 0 };
+	unsigned int sz;
+
+	while (len) {
+		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
+		msg.req.type = VDUSE_SET_CONFIG;
+		msg.req.request_id = vduse_dev_get_request_id(dev);
+		msg.req.config.offset = offset;
+		msg.req.config.len = sz;
+		memcpy(msg.req.config.data, buf, sz);
+		vduse_dev_msg_sync(dev, &msg);
+		buf += sz;
+		offset += sz;
+		len -= sz;
+	}
+}
+
+static void vduse_dev_set_vq_num(struct vduse_dev *dev,
+				 struct vduse_virtqueue *vq, u32 num)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_NUM;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.vq_num.index = vq->index;
+	msg.req.vq_num.num = num;
+
+	vduse_dev_msg_sync(dev, &msg);
+}
+
+static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
+				 struct vduse_virtqueue *vq, u64 desc_addr,
+				 u64 driver_addr, u64 device_addr)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_ADDR;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.vq_addr.index = vq->index;
+	msg.req.vq_addr.desc_addr = desc_addr;
+	msg.req.vq_addr.driver_addr = driver_addr;
+	msg.req.vq_addr.device_addr = device_addr;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq, bool ready)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_READY;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.vq_ready.index = vq->index;
+	msg.req.vq_ready.ready = ready;
+
+	vduse_dev_msg_sync(dev, &msg);
+}
+
+static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
+				   struct vduse_virtqueue *vq)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_GET_VQ_READY;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.vq_ready.index = vq->index;
+
+	return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
+}
+
+static int vduse_dev_get_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg msg = { 0 };
+	int ret;
+
+	msg.req.type = VDUSE_GET_VQ_STATE;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.vq_state.index = vq->index;
+
+	ret = vduse_dev_msg_sync(dev, &msg);
+	if (!ret)
+		state->avail_index = msg.resp.vq_state.avail_idx;
+
+	return ret;
+}
+
+static int vduse_dev_set_vq_state(struct vduse_dev *dev,
+				struct vduse_virtqueue *vq,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	msg.req.type = VDUSE_SET_VQ_STATE;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.vq_state.index = vq->index;
+	msg.req.vq_state.avail_idx = state->avail_index;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+				u64 start, u64 last)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	if (last < start)
+		return -EINVAL;
+
+	msg.req.type = VDUSE_UPDATE_IOTLB;
+	msg.req.request_id = vduse_dev_get_request_id(dev);
+	msg.req.iova.start = start;
+	msg.req.iova.last = last;
+
+	return vduse_dev_msg_sync(dev, &msg);
+}
+
+static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int size = sizeof(struct vduse_dev_request);
+	ssize_t ret = 0;
+
+	if (iov_iter_count(to) < size)
+		return 0;
+
+	spin_lock(&dev->msg_lock);
+	while (1) {
+		msg = vduse_dequeue_msg(&dev->send_list);
+		if (msg)
+			break;
+
+		ret = -EAGAIN;
+		if (file->f_flags & O_NONBLOCK)
+			goto unlock;
+
+		spin_unlock(&dev->msg_lock);
+		ret = wait_event_interruptible_exclusive(dev->waitq,
+					!list_empty(&dev->send_list));
+		if (ret)
+			return ret;
+
+		spin_lock(&dev->msg_lock);
+	}
+	spin_unlock(&dev->msg_lock);
+	ret = copy_to_iter(&msg->req, size, to);
+	spin_lock(&dev->msg_lock);
+	if (ret != size) {
+		ret = -EFAULT;
+		vduse_enqueue_msg(&dev->send_list, msg);
+		goto unlock;
+	}
+	vduse_enqueue_msg(&dev->recv_list, msg);
+	wake_up(&dev->waitq);
+unlock:
+	spin_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_response resp;
+	struct vduse_dev_msg *msg;
+	size_t ret;
+
+	ret = copy_from_iter(&resp, sizeof(resp), from);
+	if (ret != sizeof(resp))
+		return -EINVAL;
+
+	spin_lock(&dev->msg_lock);
+	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
+	if (!msg) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	memcpy(&msg->resp, &resp, sizeof(resp));
+	msg->completed = 1;
+	wake_up(&msg->waitq);
+unlock:
+	spin_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
+{
+	struct vduse_dev *dev = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &dev->waitq, wait);
+
+	if (!list_empty(&dev->send_list))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	if (!list_empty(&dev->recv_list))
+		mask |= EPOLLOUT;
+
+	return mask;
+}
+
+static void vduse_dev_reset(struct vduse_dev *dev)
+{
+	int i;
+
+	/* The coherent mappings are handled in vduse_dev_free_coherent() */
+	vduse_domain_reset_bounce_map(dev->domain);
+	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = NULL;
+	dev->config_cb.private = NULL;
+	spin_unlock(&dev->irq_lock);
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->irq_lock);
+		vq->ready = false;
+		vq->cb.callback = NULL;
+		vq->cb.private = NULL;
+		spin_unlock(&vq->irq_lock);
+	}
+}
+
+static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+				u64 desc_area, u64 driver_area,
+				u64 device_area)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_addr(dev, vq, desc_area,
+					driver_area, device_area);
+}
+
+static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->kick_lock);
+	if (vq->ready && vq->kickfd)
+		eventfd_signal(vq->kickfd, 1);
+	spin_unlock(&vq->kick_lock);
+}
+
+static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+			      struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->irq_lock);
+	vq->cb.callback = cb->callback;
+	vq->cb.private = cb->private;
+	spin_unlock(&vq->irq_lock);
+}
+
+static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_num(dev, vq, num);
+}
+
+static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
+					u16 idx, bool ready)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vduse_dev_set_vq_ready(dev, vq, ready);
+	vq->ready = ready;
+}
+
+static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->ready = vduse_dev_get_vq_ready(dev, vq);
+
+	return vq->ready;
+}
+
+static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_set_vq_state(dev, vq, state);
+}
+
+static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_get_vq_state(dev, vq, state);
+}
+
+static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_align;
+}
+
+static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_get_features(dev);
+}
+
+static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
+		return -EINVAL;
+
+	return vduse_dev_set_features(dev, features);
+}
+
+static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
+				  struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = cb->callback;
+	dev->config_cb.private = cb->private;
+	spin_unlock(&dev->irq_lock);
+}
+
+static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_size_max;
+}
+
+static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->device_id;
+}
+
+static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vendor_id;
+}
+
+static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return vduse_dev_get_status(dev);
+}
+
+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_set_status(dev, status);
+
+	if (status == 0)
+		vduse_dev_reset(dev);
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
+			     void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_get_config(dev, offset, buf, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
+			const void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	vduse_dev_set_config(dev, offset, buf, len);
+}
+
+static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
+				struct vhost_iotlb *iotlb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	int ret;
+
+	ret = vduse_domain_set_map(dev->domain, iotlb);
+	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+
+	return ret;
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	WARN_ON(!list_empty(&dev->send_list));
+	WARN_ON(!list_empty(&dev->recv_list));
+	dev->vdev = NULL;
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_features		= vduse_vdpa_get_features,
+	.set_features		= vduse_vdpa_set_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.set_map		= vduse_vdpa_set_map,
+	.free			= vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+				     unsigned long offset, size_t size,
+				     enum dma_data_direction dir,
+				     unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+					dma_addr_t *dma_addr, gfp_t flag,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova;
+	void *addr;
+
+	*dma_addr = DMA_MAPPING_ERROR;
+	addr = vduse_domain_alloc_coherent(domain, size,
+				(dma_addr_t *)&iova, flag, attrs);
+	if (!addr)
+		return NULL;
+
+	*dma_addr = (dma_addr_t)iova;
+	vduse_dev_update_iotlb(vdev, iova, iova + size - 1);
+
+	return addr;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long start = (unsigned long)dma_addr;
+	unsigned long last = start + size - 1;
+
+	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
+	vduse_dev_update_iotlb(vdev, start, last);
+}
+
+static const struct dma_map_ops vduse_dev_dma_ops = {
+	.map_page = vduse_dev_map_page,
+	.unmap_page = vduse_dev_unmap_page,
+	.alloc = vduse_dev_alloc_coherent,
+	.free = vduse_dev_free_coherent,
+};
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+	unsigned int flags = 0;
+
+	switch (perm) {
+	case VDUSE_ACCESS_WO:
+		flags |= O_WRONLY;
+		break;
+	case VDUSE_ACCESS_RO:
+		flags |= O_RDONLY;
+		break;
+	case VDUSE_ACCESS_RW:
+		flags |= O_RDWR;
+		break;
+	default:
+		WARN(1, "invalidate vhost IOTLB permission\n");
+		break;
+	}
+
+	return flags;
+}
+
+static int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct eventfd_ctx *ctx = NULL;
+	struct vduse_virtqueue *vq;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	vq = &dev->vqs[eventfd->index];
+	if (eventfd->fd > 0) {
+		ctx = eventfd_ctx_fdget(eventfd->fd);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
+	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
+		return 0;
+
+	spin_lock(&vq->kick_lock);
+	if (vq->kickfd)
+		eventfd_ctx_put(vq->kickfd);
+	vq->kickfd = ctx;
+	spin_unlock(&vq->kick_lock);
+
+	return 0;
+}
+
+static void vduse_vq_irq_inject(struct work_struct *work)
+{
+	struct vduse_virtqueue *vq = container_of(work,
+					struct vduse_virtqueue, inject);
+
+	spin_lock_irq(&vq->irq_lock);
+	if (vq->ready && vq->cb.callback)
+		vq->cb.callback(vq->cb.private);
+	spin_unlock_irq(&vq->irq_lock);
+}
+
+static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
+			    unsigned long arg)
+{
+	struct vduse_dev *dev = file->private_data;
+	void __user *argp = (void __user *)arg;
+	int ret;
+
+	switch (cmd) {
+	case VDUSE_IOTLB_GET_FD: {
+		struct vduse_iotlb_entry entry;
+		struct vhost_iotlb_map *map;
+		struct vdpa_map_file *map_file;
+		struct vduse_iova_domain *domain = dev->domain;
+		struct file *f = NULL;
+
+		ret = -EFAULT;
+		if (copy_from_user(&entry, argp, sizeof(entry)))
+			break;
+
+		ret = -EINVAL;
+		if (entry.start > entry.last)
+			break;
+
+		spin_lock(&domain->iotlb_lock);
+		map = vhost_iotlb_itree_first(domain->iotlb,
+					      entry.start, entry.last);
+		if (map) {
+			map_file = (struct vdpa_map_file *)map->opaque;
+			f = get_file(map_file->file);
+			entry.offset = map_file->offset;
+			entry.start = map->start;
+			entry.last = map->last;
+			entry.perm = map->perm;
+		}
+		spin_unlock(&domain->iotlb_lock);
+		ret = -EINVAL;
+		if (!f)
+			break;
+
+		ret = -EFAULT;
+		if (copy_to_user(argp, &entry, sizeof(entry))) {
+			fput(f);
+			break;
+		}
+		ret = receive_fd(f, perm_to_file_flags(entry.perm));
+		fput(f);
+		break;
+	}
+	case VDUSE_VQ_SETUP_KICKFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_kickfd_setup(dev, &eventfd);
+		break;
+	}
+	case VDUSE_INJECT_VQ_IRQ:
+		ret = -EINVAL;
+		if (arg >= dev->vq_num)
+			break;
+
+		ret = 0;
+		queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
+		break;
+	case VDUSE_INJECT_CONFIG_IRQ:
+		ret = -EINVAL;
+		spin_lock_irq(&dev->irq_lock);
+		if (dev->config_cb.callback) {
+			dev->config_cb.callback(dev->config_cb.private);
+			ret = 0;
+		}
+		spin_unlock_irq(&dev->irq_lock);
+		break;
+	default:
+		ret = -ENOIOCTLCMD;
+		break;
+	}
+
+	return ret;
+}
+
+static int vduse_dev_release(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int i;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock(&vq->kick_lock);
+		if (vq->kickfd)
+			eventfd_ctx_put(vq->kickfd);
+		vq->kickfd = NULL;
+		spin_unlock(&vq->kick_lock);
+	}
+
+	spin_lock(&dev->msg_lock);
+	/*  Make sure the inflight messages can processed */
+	while ((msg = vduse_dequeue_msg(&dev->recv_list)))
+		vduse_enqueue_msg(&dev->send_list, msg);
+	spin_unlock(&dev->msg_lock);
+
+	dev->connected = false;
+
+	return 0;
+}
+
+static int vduse_dev_open(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = container_of(inode->i_cdev,
+					struct vduse_dev, cdev);
+	int ret = -EBUSY;
+
+	mutex_lock(&dev->lock);
+	if (dev->connected)
+		goto unlock;
+
+	ret = 0;
+	dev->connected = true;
+	file->private_data = dev;
+unlock:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_dev_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vduse_dev_open,
+	.release	= vduse_dev_release,
+	.read_iter	= vduse_dev_read_iter,
+	.write_iter	= vduse_dev_write_iter,
+	.poll		= vduse_dev_poll,
+	.unlocked_ioctl	= vduse_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct vduse_dev *vduse_dev_create(void)
+{
+	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+
+	if (!dev)
+		return NULL;
+
+	mutex_init(&dev->lock);
+	spin_lock_init(&dev->msg_lock);
+	INIT_LIST_HEAD(&dev->send_list);
+	INIT_LIST_HEAD(&dev->recv_list);
+	atomic64_set(&dev->msg_unique, 0);
+	spin_lock_init(&dev->irq_lock);
+
+	init_waitqueue_head(&dev->waitq);
+
+	return dev;
+}
+
+static void vduse_dev_destroy(struct vduse_dev *dev)
+{
+	kfree(dev);
+}
+
+static struct vduse_dev *vduse_find_dev(const char *name)
+{
+	struct vduse_dev *tmp, *dev = NULL;
+
+	list_for_each_entry(tmp, &vduse_devs, list) {
+		if (!strcmp(dev_name(&tmp->dev), name)) {
+			dev = tmp;
+			break;
+		}
+	}
+	return dev;
+}
+
+static int vduse_destroy_dev(char *name)
+{
+	struct vduse_dev *dev = vduse_find_dev(name);
+
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&dev->lock);
+	if (dev->vdev || dev->connected) {
+		mutex_unlock(&dev->lock);
+		return -EBUSY;
+	}
+	dev->connected = true;
+	mutex_unlock(&dev->lock);
+
+	list_del(&dev->list);
+	cdev_device_del(&dev->cdev, &dev->dev);
+	put_device(&dev->dev);
+	module_put(THIS_MODULE);
+
+	return 0;
+}
+
+static void vduse_release_dev(struct device *device)
+{
+	struct vduse_dev *dev =
+		container_of(device, struct vduse_dev, dev);
+
+	ida_simple_remove(&vduse_ida, dev->minor);
+	kfree(dev->vqs);
+	vduse_domain_destroy(dev->domain);
+	vduse_dev_destroy(dev);
+}
+
+static int vduse_create_dev(struct vduse_dev_config *config,
+			    unsigned long api_version)
+{
+	int i, ret = -ENOMEM;
+	struct vduse_dev *dev;
+
+	if (config->bounce_size > max_bounce_size)
+		return -EINVAL;
+
+	if (config->bounce_size > max_iova_size)
+		return -EINVAL;
+
+	if (vduse_find_dev(config->name))
+		return -EEXIST;
+
+	dev = vduse_dev_create();
+	if (!dev)
+		return -ENOMEM;
+
+	dev->api_version = api_version;
+	dev->device_id = config->device_id;
+	dev->vendor_id = config->vendor_id;
+	dev->domain = vduse_domain_create(max_iova_size - 1,
+					config->bounce_size);
+	if (!dev->domain)
+		goto err_domain;
+
+	dev->vq_align = config->vq_align;
+	dev->vq_size_max = config->vq_size_max;
+	dev->vq_num = config->vq_num;
+	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
+	if (!dev->vqs)
+		goto err_vqs;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		dev->vqs[i].index = i;
+		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
+		spin_lock_init(&dev->vqs[i].kick_lock);
+		spin_lock_init(&dev->vqs[i].irq_lock);
+	}
+
+	ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
+	if (ret < 0)
+		goto err_ida;
+
+	dev->minor = ret;
+	device_initialize(&dev->dev);
+	dev->dev.release = vduse_release_dev;
+	dev->dev.class = vduse_class;
+	dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
+	ret = dev_set_name(&dev->dev, "%s", config->name);
+	if (ret) {
+		put_device(&dev->dev);
+		return ret;
+	}
+	cdev_init(&dev->cdev, &vduse_dev_fops);
+	dev->cdev.owner = THIS_MODULE;
+
+	ret = cdev_device_add(&dev->cdev, &dev->dev);
+	if (ret) {
+		put_device(&dev->dev);
+		return ret;
+	}
+	list_add(&dev->list, &vduse_devs);
+	__module_get(THIS_MODULE);
+
+	return 0;
+err_ida:
+	kfree(dev->vqs);
+err_vqs:
+	vduse_domain_destroy(dev->domain);
+err_domain:
+	vduse_dev_destroy(dev);
+	return ret;
+}
+
+static long vduse_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret;
+	void __user *argp = (void __user *)arg;
+	struct vduse_control *control = file->private_data;
+
+	mutex_lock(&vduse_lock);
+	switch (cmd) {
+	case VDUSE_GET_API_VERSION:
+		ret = control->api_version;
+		break;
+	case VDUSE_SET_API_VERSION:
+		ret = -EINVAL;
+		if (arg > VDUSE_API_VERSION)
+			break;
+
+		ret = 0;
+		control->api_version = arg;
+		break;
+	case VDUSE_CREATE_DEV: {
+		struct vduse_dev_config config;
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, sizeof(config)))
+			break;
+
+		ret = vduse_create_dev(&config, control->api_version);
+		break;
+	}
+	case VDUSE_DESTROY_DEV: {
+		char name[VDUSE_NAME_MAX];
+
+		ret = -EFAULT;
+		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
+			break;
+
+		ret = vduse_destroy_dev(name);
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static int vduse_release(struct inode *inode, struct file *file)
+{
+	struct vduse_control *control = file->private_data;
+
+	kfree(control);
+	return 0;
+}
+
+static int vduse_open(struct inode *inode, struct file *file)
+{
+	struct vduse_control *control;
+
+	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
+	if (!control)
+		return -ENOMEM;
+
+	control->api_version = VDUSE_API_VERSION;
+	file->private_data = control;
+
+	return 0;
+}
+
+static const struct file_operations vduse_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vduse_open,
+	.release	= vduse_release,
+	.unlocked_ioctl	= vduse_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static char *vduse_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
+}
+
+static struct miscdevice vduse_misc = {
+	.fops = &vduse_fops,
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vduse",
+	.nodename = "vduse/control",
+};
+
+static void vduse_mgmtdev_release(struct device *dev)
+{
+}
+
+static struct device vduse_mgmtdev = {
+	.init_name = "vduse",
+	.release = vduse_mgmtdev_release,
+};
+
+static struct vdpa_mgmt_dev mgmt_dev;
+
+static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
+{
+	struct vduse_vdpa *vdev;
+	int ret;
+
+	if (dev->vdev)
+		return -EEXIST;
+
+	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, &dev->dev,
+				 &vduse_vdpa_config_ops, name, true);
+	if (!vdev)
+		return -ENOMEM;
+
+	dev->vdev = vdev;
+	vdev->dev = dev;
+	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
+	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
+	if (ret) {
+		put_device(&vdev->vdpa.dev);
+		return ret;
+	}
+	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
+	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
+	vdev->vdpa.mdev = &mgmt_dev;
+
+	return 0;
+}
+
+static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
+{
+	struct vduse_dev *dev;
+	int ret = -EINVAL;
+
+	mutex_lock(&vduse_lock);
+	dev = vduse_find_dev(name);
+	if (!dev) {
+		mutex_unlock(&vduse_lock);
+		return -EINVAL;
+	}
+	ret = vduse_dev_init_vdpa(dev, name);
+	mutex_unlock(&vduse_lock);
+	if (ret)
+		return ret;
+
+	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
+	if (ret) {
+		put_device(&dev->vdev->vdpa.dev);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
+{
+	_vdpa_unregister_device(dev);
+}
+
+static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
+	.dev_add = vdpa_dev_add,
+	.dev_del = vdpa_dev_del,
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct vdpa_mgmt_dev mgmt_dev = {
+	.device = &vduse_mgmtdev,
+	.id_table = id_table,
+	.ops = &vdpa_dev_mgmtdev_ops,
+};
+
+static int vduse_mgmtdev_init(void)
+{
+	int ret;
+
+	ret = device_register(&vduse_mgmtdev);
+	if (ret)
+		return ret;
+
+	ret = vdpa_mgmtdev_register(&mgmt_dev);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	device_unregister(&vduse_mgmtdev);
+	return ret;
+}
+
+static void vduse_mgmtdev_exit(void)
+{
+	vdpa_mgmtdev_unregister(&mgmt_dev);
+	device_unregister(&vduse_mgmtdev);
+}
+
+static int vduse_init(void)
+{
+	int ret;
+
+	if (max_bounce_size >= max_iova_size)
+		return -EINVAL;
+
+	ret = misc_register(&vduse_misc);
+	if (ret)
+		return ret;
+
+	vduse_class = class_create(THIS_MODULE, "vduse");
+	if (IS_ERR(vduse_class)) {
+		ret = PTR_ERR(vduse_class);
+		goto err_class;
+	}
+	vduse_class->devnode = vduse_devnode;
+
+	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
+	if (ret)
+		goto err_chardev;
+
+	vduse_irq_wq = alloc_workqueue("vduse-irq",
+				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
+	if (!vduse_irq_wq)
+		goto err_wq;
+
+	ret = vduse_domain_init();
+	if (ret)
+		goto err_domain;
+
+	ret = vduse_mgmtdev_init();
+	if (ret)
+		goto err_mgmtdev;
+
+	return 0;
+err_mgmtdev:
+	vduse_domain_exit();
+err_domain:
+	destroy_workqueue(vduse_irq_wq);
+err_wq:
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+err_chardev:
+	class_destroy(vduse_class);
+err_class:
+	misc_deregister(&vduse_misc);
+	return ret;
+}
+module_init(vduse_init);
+
+static void vduse_exit(void)
+{
+	misc_deregister(&vduse_misc);
+	class_destroy(vduse_class);
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+	destroy_workqueue(vduse_irq_wq);
+	vduse_domain_exit();
+	vduse_mgmtdev_exit();
+}
+module_exit(vduse_exit);
+
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
new file mode 100644
index 000000000000..66a6e5212226
--- /dev/null
+++ b/include/uapi/linux/vduse.h
@@ -0,0 +1,175 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_VDUSE_H_
+#define _UAPI_VDUSE_H_
+
+#include <linux/types.h>
+
+#define VDUSE_API_VERSION	0
+
+#define VDUSE_CONFIG_DATA_LEN	256
+#define VDUSE_NAME_MAX	256
+
+/* the control messages definition for read/write */
+
+enum vduse_req_type {
+	/* Set the vring address of virtqueue. */
+	VDUSE_SET_VQ_NUM,
+	/* Set the vring address of virtqueue. */
+	VDUSE_SET_VQ_ADDR,
+	/* Set ready status of virtqueue */
+	VDUSE_SET_VQ_READY,
+	/* Get ready status of virtqueue */
+	VDUSE_GET_VQ_READY,
+	/* Set the state for virtqueue */
+	VDUSE_SET_VQ_STATE,
+	/* Get the state for virtqueue */
+	VDUSE_GET_VQ_STATE,
+	/* Set virtio features supported by the driver */
+	VDUSE_SET_FEATURES,
+	/* Get virtio features supported by the device */
+	VDUSE_GET_FEATURES,
+	/* Set the device status */
+	VDUSE_SET_STATUS,
+	/* Get the device status */
+	VDUSE_GET_STATUS,
+	/* Write to device specific configuration space */
+	VDUSE_SET_CONFIG,
+	/* Read from device specific configuration space */
+	VDUSE_GET_CONFIG,
+	/* Notify userspace to update the memory mapping in device IOTLB */
+	VDUSE_UPDATE_IOTLB,
+};
+
+struct vduse_vq_num {
+	__u32 index; /* virtqueue index */
+	__u32 num; /* the size of virtqueue */
+};
+
+struct vduse_vq_addr {
+	__u32 index; /* virtqueue index */
+	__u64 desc_addr; /* address of desc area */
+	__u64 driver_addr; /* address of driver area */
+	__u64 device_addr; /* address of device area */
+};
+
+struct vduse_vq_ready {
+	__u32 index; /* virtqueue index */
+	__u8 ready; /* ready status of virtqueue */
+};
+
+struct vduse_vq_state {
+	__u32 index; /* virtqueue index */
+	__u16 avail_idx; /* virtqueue state (last_avail_idx) */
+};
+
+struct vduse_dev_config_data {
+	__u32 offset; /* offset from the beginning of config space */
+	__u32 len; /* the length to read/write */
+	__u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
+};
+
+struct vduse_iova_range {
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* end of the IOVA range */
+};
+
+struct vduse_features {
+	__u64 features; /* virtio features */
+};
+
+struct vduse_status {
+	__u8 status; /* device status */
+};
+
+struct vduse_dev_request {
+	__u32 type; /* request type */
+	__u32 request_id; /* request id */
+	__u32 reserved[2]; /* for future use */
+	union {
+		struct vduse_vq_num vq_num; /* virtqueue num */
+		struct vduse_vq_addr vq_addr; /* virtqueue address */
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		struct vduse_iova_range iova; /* iova range for updating */
+		struct vduse_features f; /* virtio features */
+		struct vduse_status s; /* device status */
+		__u32 padding[16]; /* padding */
+	};
+};
+
+struct vduse_dev_response {
+	__u32 request_id; /* corresponding request id */
+#define VDUSE_REQUEST_OK	0x00
+#define VDUSE_REQUEST_FAILED	0x01
+	__u32 result; /* the result of request */
+	__u32 reserved[2]; /* for future use */
+	union {
+		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_dev_config_data config; /* virtio device config space */
+		struct vduse_features f; /* virtio features */
+		struct vduse_status s; /* device status */
+		__u32 padding[16]; /* padding */
+	};
+};
+
+/* ioctls */
+
+struct vduse_dev_config {
+	char name[VDUSE_NAME_MAX]; /* vduse device name */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 bounce_size; /* bounce buffer size for iommu */
+	__u16 vq_num; /* the number of virtqueues */
+	__u16 vq_size_max; /* the max size of virtqueue */
+	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
+	__u32 reserved[8]; /* for future use */
+};
+
+struct vduse_iotlb_entry {
+	__u64 offset; /* the mmap offset on fd */
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* last of the IOVA range */
+#define VDUSE_ACCESS_RO 0x1
+#define VDUSE_ACCESS_WO 0x2
+#define VDUSE_ACCESS_RW 0x3
+	__u8 perm; /* access permission of this range */
+};
+
+struct vduse_vq_eventfd {
+	__u32 index; /* virtqueue index */
+#define VDUSE_EVENTFD_DEASSIGN -1
+	int fd; /* eventfd, -1 means de-assigning the eventfd */
+};
+
+#define VDUSE_BASE	0x81
+
+/* Get the version of VDUSE API. This is used for future extension */
+#define VDUSE_GET_API_VERSION	_IO(VDUSE_BASE, 0x00)
+
+/* Set the version of VDUSE API. */
+#define VDUSE_SET_API_VERSION	_IO(VDUSE_BASE, 0x01)
+
+/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
+#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
+
+/* Destroy a vduse device. Make sure there are no references to the char device */
+#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
+
+/*
+ * Get a file descriptor for the first overlapped iova region,
+ * -EINVAL means the iova region doesn't exist.
+ */
+#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
+
+/* Setup an eventfd to receive kick for virtqueue */
+#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
+
+/* Inject an interrupt for specific virtqueue */
+#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x06)
+
+/* Inject a config interrupt */
+#define VDUSE_INJECT_CONFIG_IRQ	_IO(VDUSE_BASE, 0x07)
+
+#endif /* _UAPI_VDUSE_H_ */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (8 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-03-31  8:05 ` Xie Yongji
  2021-04-08  7:18   ` Jason Wang
  2021-04-14 14:14   ` Stefan Hajnoczi
  2021-04-14  7:34 ` [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
  10 siblings, 2 replies; 62+ messages in thread
From: Xie Yongji @ 2021-03-31  8:05 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
 2 files changed, 213 insertions(+)
 create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index acd2cc2a538d..f63119130898 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -24,6 +24,7 @@ place where this information is gathered.
    ioctl/index
    iommu
    media/index
+   vduse
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
new file mode 100644
index 000000000000..8c4e2b2df8bb
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,212 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the character device (/dev/vduse/control). Then a device file with the
+specified name (/dev/vduse/$NAME) will appear, which can be used to
+implement the userspace vDPA device's control path and data path.
+
+To implement control path, a message-based communication protocol and some
+types of control messages are introduced in the VDUSE framework:
+
+- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_VQ_STATE: Set the state for virtqueue
+
+- VDUSE_GET_VQ_STATE: Get the state for virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
+
+Those control messages are mostly based on the vdpa_config_ops in
+include/linux/vdpa.h which defines a unified interface to control
+different types of vdpa device. Userspace needs to read()/write()
+on the VDUSE device file to receive/reply those control messages
+from/to VDUSE kernel module as follows:
+
+.. code-block:: c
+
+	static int vduse_message_handler(int dev_fd)
+	{
+		int len;
+		struct vduse_dev_request req;
+		struct vduse_dev_response resp;
+
+		len = read(dev_fd, &req, sizeof(req));
+		if (len != sizeof(req))
+			return -1;
+
+		resp.request_id = req.request_id;
+
+		switch (req.type) {
+
+		/* handle different types of message */
+
+		}
+
+		len = write(dev_fd, &resp, sizeof(resp));
+		if (len != sizeof(resp))
+			return -1;
+
+		return 0;
+	}
+
+In the data path, vDPA device's iova regions will be mapped into userspace
+with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
+
+- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
+  Userspace can access this iova region by passing fd and corresponding size, offset,
+  perm to mmap(). For example:
+
+.. code-block:: c
+
+	static int perm_to_prot(uint8_t perm)
+	{
+		int prot = 0;
+
+		switch (perm) {
+		case VDUSE_ACCESS_WO:
+			prot |= PROT_WRITE;
+			break;
+		case VDUSE_ACCESS_RO:
+			prot |= PROT_READ;
+			break;
+		case VDUSE_ACCESS_RW:
+			prot |= PROT_READ | PROT_WRITE;
+			break;
+		}
+
+		return prot;
+	}
+
+	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
+	{
+		int fd;
+		void *addr;
+		size_t size;
+		struct vduse_iotlb_entry entry;
+
+		entry.start = iova;
+		entry.last = iova + 1;
+		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
+		if (fd < 0)
+			return NULL;
+
+		size = entry.last - entry.start + 1;
+		*len = entry.last - iova + 1;
+		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
+			    fd, entry.offset);
+		close(fd);
+		if (addr == MAP_FAILED)
+			return NULL;
+
+		/* do something to cache this iova region */
+
+		return addr + iova - entry.start;
+	}
+
+Besides, the following ioctls on the VDUSE device file are provided to support
+interrupt injection and setting up eventfd for virtqueue kicks:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE kernel module to notify userspace to consume the vring.
+
+- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
+
+- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
+
+Register VDUSE device on vDPA bus
+---------------------------------
+In order to make the VDUSE device work, administrator needs to use the management
+API (netlink) to register it on vDPA bus. Some sample codes are show below:
+
+.. code-block:: c
+
+	static int netlink_add_vduse(const char *name, int device_id)
+	{
+		struct nl_sock *nlsock;
+		struct nl_msg *msg;
+		int famid;
+
+		nlsock = nl_socket_alloc();
+		if (!nlsock)
+			return -ENOMEM;
+
+		if (genl_connect(nlsock))
+			goto free_sock;
+
+		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+		if (famid < 0)
+			goto close_sock;
+
+		msg = nlmsg_alloc();
+		if (!msg)
+			goto close_sock;
+
+		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
+		    VDPA_CMD_DEV_NEW, 0))
+			goto nla_put_failure;
+
+		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+		NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
+		NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
+
+		if (nl_send_sync(nlsock, msg))
+			goto close_sock;
+
+		nl_close(nlsock);
+		nl_socket_free(nlsock);
+
+		return 0;
+	nla_put_failure:
+		nlmsg_free(msg);
+	close_sock:
+		nl_close(nlsock);
+	free_sock:
+		nl_socket_free(nlsock);
+		return -1;
+	}
+
+MMU-based IOMMU Driver
+----------------------
+VDUSE framework implements an MMU-based on-chip IOMMU driver to support
+mapping the kernel DMA buffer into the userspace iova region dynamically.
+This is mainly designed for virtio-vdpa case (kernel virtio drivers).
+
+The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
+The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
+so that the userspace process is able to use its virtual address to access
+the DMA buffer in kernel.
+
+And to avoid security issue, a bounce-buffering mechanism is introduced to
+prevent userspace accessing the original buffer directly which may contain other
+kernel data. During the mapping, unmapping, the driver will copy the data from
+the original buffer to the bounce buffer and back, depending on the direction of
+the transfer. And the bounce-buffer addresses will be mapped into the user address
+space instead of the original one.
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31  8:05 ` [PATCH v6 01/10] file: Export receive_fd() to modules Xie Yongji
@ 2021-03-31  9:15   ` Christian Brauner
  2021-03-31  9:26     ` Dan Carpenter
  2021-03-31 11:32     ` Yongji Xie
  0 siblings, 2 replies; 62+ messages in thread
From: Christian Brauner @ 2021-03-31  9:15 UTC (permalink / raw)
  To: Xie Yongji, hch
  Cc: mst, jasowang, stefanha, sgarzare, parav, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> Export receive_fd() so that some modules can use
> it to pass file descriptor between processes without
> missing any security stuffs.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---

Yeah, as I said in the other mail I'd be comfortable with exposing just
this variant of the helper.
Maybe this should be a separate patch bundled together with Christoph's
patch to split parts of receive_fd() into a separate helper.
This would also allow us to simplify a few other codepaths in drivers as
well btw. I just took a hasty stab at two of them:

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index c119736ca56a..3c716bf6d84b 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -3728,8 +3728,9 @@ static int binder_apply_fd_fixups(struct binder_proc *proc,
        int ret = 0;

        list_for_each_entry(fixup, &t->fd_fixups, fixup_entry) {
-               int fd = get_unused_fd_flags(O_CLOEXEC);
+               int fd = receive_fd(fixup->file, O_CLOEXEC);

+               fd = receive_fd(fixup->file, O_CLOEXEC);
                if (fd < 0) {
                        binder_debug(BINDER_DEBUG_TRANSACTION,
                                     "failed fd fixup txn %d fd %d\n",
@@ -3741,7 +3742,7 @@ static int binder_apply_fd_fixups(struct binder_proc *proc,
                             "fd fixup txn %d fd %d\n",
                             t->debug_id, fd);
                trace_binder_transaction_fd_recv(t, fd, fixup->offset);
-               fd_install(fd, fixup->file);
+               fput(fixup->file);
                fixup->file = NULL;
                if (binder_alloc_copy_to_buffer(&proc->alloc, t->buffer,
                                                fixup->offset, &fd,
diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 5e2374580e27..c3a6b6abb7f4 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -629,12 +629,6 @@ int ptm_open_peer(struct file *master, struct tty_struct *tty, int flags)
        if (tty->driver != ptm_driver)
                return -EIO;

-       fd = get_unused_fd_flags(flags);
-       if (fd < 0) {
-               retval = fd;
-               goto err;
-       }
-
        /* Compute the slave's path */
        path.mnt = devpts_mntget(master, tty->driver_data);
        if (IS_ERR(path.mnt)) {
@@ -650,7 +644,8 @@ int ptm_open_peer(struct file *master, struct tty_struct *tty, int flags)
                goto err_put;
        }

-       fd_install(fd, filp);
+       fd = receive_fd(filp, flags);
+       fput(filp);
        return fd;

 err_put:

>  fs/file.c            | 6 ++++++
>  include/linux/file.h | 7 +++----
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index dab120b71e44..d7d957217576 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -1108,6 +1108,12 @@ int __receive_fd(int fd, struct file *file, int __user *ufd, unsigned int o_flag
>  	return new_fd;
>  }
>  
> +int receive_fd(struct file *file, unsigned int o_flags)
> +{
> +	return __receive_fd(-1, file, NULL, o_flags);
> +}
> +EXPORT_SYMBOL(receive_fd);
> +
>  static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
>  {
>  	int err = -EBADF;
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 225982792fa2..4667f9567d3e 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -94,6 +94,9 @@ extern void fd_install(unsigned int fd, struct file *file);
>  
>  extern int __receive_fd(int fd, struct file *file, int __user *ufd,
>  			unsigned int o_flags);
> +
> +extern int receive_fd(struct file *file, unsigned int o_flags);
> +
>  static inline int receive_fd_user(struct file *file, int __user *ufd,
>  				  unsigned int o_flags)
>  {
> @@ -101,10 +104,6 @@ static inline int receive_fd_user(struct file *file, int __user *ufd,
>  		return -EFAULT;
>  	return __receive_fd(-1, file, ufd, o_flags);
>  }
> -static inline int receive_fd(struct file *file, unsigned int o_flags)
> -{
> -	return __receive_fd(-1, file, NULL, o_flags);
> -}
>  static inline int receive_fd_replace(int fd, struct file *file, unsigned int o_flags)
>  {
>  	return __receive_fd(fd, file, NULL, o_flags);
> -- 
> 2.11.0
> 

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31  9:15   ` Christian Brauner
@ 2021-03-31  9:26     ` Dan Carpenter
  2021-03-31  9:28       ` Christian Brauner
  2021-03-31 11:32     ` Yongji Xie
  1 sibling, 1 reply; 62+ messages in thread
From: Dan Carpenter @ 2021-03-31  9:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Xie Yongji, hch, mst, jasowang, stefanha, sgarzare, parav,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 11:15:45AM +0200, Christian Brauner wrote:
> On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> > Export receive_fd() so that some modules can use
> > it to pass file descriptor between processes without
> > missing any security stuffs.
> > 
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> 
> Yeah, as I said in the other mail I'd be comfortable with exposing just
> this variant of the helper.
> Maybe this should be a separate patch bundled together with Christoph's
> patch to split parts of receive_fd() into a separate helper.
> This would also allow us to simplify a few other codepaths in drivers as
> well btw. I just took a hasty stab at two of them:
> 
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index c119736ca56a..3c716bf6d84b 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -3728,8 +3728,9 @@ static int binder_apply_fd_fixups(struct binder_proc *proc,
>         int ret = 0;
> 
>         list_for_each_entry(fixup, &t->fd_fixups, fixup_entry) {
> -               int fd = get_unused_fd_flags(O_CLOEXEC);
> +               int fd = receive_fd(fixup->file, O_CLOEXEC);
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assignment duplicated on the next line.

> 
> +               fd = receive_fd(fixup->file, O_CLOEXEC);
>                 if (fd < 0) {
>                         binder_debug(BINDER_DEBUG_TRANSACTION,
>                                      "failed fd fixup txn %d fd %d\n",

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31  9:26     ` Dan Carpenter
@ 2021-03-31  9:28       ` Christian Brauner
  0 siblings, 0 replies; 62+ messages in thread
From: Christian Brauner @ 2021-03-31  9:28 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Xie Yongji, hch, mst, jasowang, stefanha, sgarzare, parav,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 12:26:24PM +0300, Dan Carpenter wrote:
> On Wed, Mar 31, 2021 at 11:15:45AM +0200, Christian Brauner wrote:
> > On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> > > Export receive_fd() so that some modules can use
> > > it to pass file descriptor between processes without
> > > missing any security stuffs.
> > > 
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > ---
> > 
> > Yeah, as I said in the other mail I'd be comfortable with exposing just
> > this variant of the helper.
> > Maybe this should be a separate patch bundled together with Christoph's
> > patch to split parts of receive_fd() into a separate helper.
> > This would also allow us to simplify a few other codepaths in drivers as
> > well btw. I just took a hasty stab at two of them:
> > 
> > diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> > index c119736ca56a..3c716bf6d84b 100644
> > --- a/drivers/android/binder.c
> > +++ b/drivers/android/binder.c
> > @@ -3728,8 +3728,9 @@ static int binder_apply_fd_fixups(struct binder_proc *proc,
> >         int ret = 0;
> > 
> >         list_for_each_entry(fixup, &t->fd_fixups, fixup_entry) {
> > -               int fd = get_unused_fd_flags(O_CLOEXEC);
> > +               int fd = receive_fd(fixup->file, O_CLOEXEC);
>                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Assignment duplicated on the next line.

I didn't this for immediate inclusion that's why I said "hasty" but
thank you! :)

Christian

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31  9:15   ` Christian Brauner
  2021-03-31  9:26     ` Dan Carpenter
@ 2021-03-31 11:32     ` Yongji Xie
  2021-03-31 12:23       ` Christian Brauner
  1 sibling, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-03-31 11:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christian Brauner, Randy Dunlap, Matthew Wilcox, viro,
	Jens Axboe, bcrl, Jonathan Corbet, Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 5:15 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> > Export receive_fd() so that some modules can use
> > it to pass file descriptor between processes without
> > missing any security stuffs.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
>
> Yeah, as I said in the other mail I'd be comfortable with exposing just
> this variant of the helper.

Thanks, I got it now.

> Maybe this should be a separate patch bundled together with Christoph's
> patch to split parts of receive_fd() into a separate helper.

Do we need to add the seccomp notifier into the separate helper? In
our case, the file passed to the separate helper is from another
process.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31 11:32     ` Yongji Xie
@ 2021-03-31 12:23       ` Christian Brauner
  2021-03-31 13:59         ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Christian Brauner @ 2021-03-31 12:23 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christian Brauner, Randy Dunlap, Matthew Wilcox, viro,
	Jens Axboe, bcrl, Jonathan Corbet, Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 07:32:33PM +0800, Yongji Xie wrote:
> On Wed, Mar 31, 2021 at 5:15 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> > > Export receive_fd() so that some modules can use
> > > it to pass file descriptor between processes without
> > > missing any security stuffs.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > ---
> >
> > Yeah, as I said in the other mail I'd be comfortable with exposing just
> > this variant of the helper.
> 
> Thanks, I got it now.
> 
> > Maybe this should be a separate patch bundled together with Christoph's
> > patch to split parts of receive_fd() into a separate helper.
> 
> Do we need to add the seccomp notifier into the separate helper? In
> our case, the file passed to the separate helper is from another
> process.

Not sure what you mean. Christoph has proposed
https://lore.kernel.org/linux-fsdevel/20210325082209.1067987-2-hch@lst.de
I was just saying that if we think this patch is useful we might bundle
it together with the
EXPORT_SYMBOL(receive_fd)
part here, convert all drivers that currently open-code get_unused_fd()
+ fd_install() to use receive_fd(), and make this a separate patchset.

I don't think that needs to hinder reviewing your series though.

Christian

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31 12:23       ` Christian Brauner
@ 2021-03-31 13:59         ` Yongji Xie
  2021-03-31 14:07           ` Christian Brauner
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-03-31 13:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christian Brauner, Randy Dunlap, Matthew Wilcox, viro,
	Jens Axboe, bcrl, Jonathan Corbet, Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 8:23 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Mar 31, 2021 at 07:32:33PM +0800, Yongji Xie wrote:
> > On Wed, Mar 31, 2021 at 5:15 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> > > > Export receive_fd() so that some modules can use
> > > > it to pass file descriptor between processes without
> > > > missing any security stuffs.
> > > >
> > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > ---
> > >
> > > Yeah, as I said in the other mail I'd be comfortable with exposing just
> > > this variant of the helper.
> >
> > Thanks, I got it now.
> >
> > > Maybe this should be a separate patch bundled together with Christoph's
> > > patch to split parts of receive_fd() into a separate helper.
> >
> > Do we need to add the seccomp notifier into the separate helper? In
> > our case, the file passed to the separate helper is from another
> > process.
>
> Not sure what you mean. Christoph has proposed
> https://lore.kernel.org/linux-fsdevel/20210325082209.1067987-2-hch@lst.de
> I was just saying that if we think this patch is useful we might bundle
> it together with the
> EXPORT_SYMBOL(receive_fd)
> part here, convert all drivers that currently open-code get_unused_fd()
> + fd_install() to use receive_fd(), and make this a separate patchset.
>

Yes, I see. We can split the parts (get_unused_fd() + fd_install()) of
receive_fd() into a separate helper and convert all drivers to use
that. What I mean is that I also would like to use
security_file_receive() in my modules. So I'm not sure if it's ok to
add security_file_receive() into the separate helper. Or do I need to
export security_file_receive() separately?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31 13:59         ` Yongji Xie
@ 2021-03-31 14:07           ` Christian Brauner
  2021-03-31 14:37             ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Christian Brauner @ 2021-03-31 14:07 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christian Brauner, Randy Dunlap, Matthew Wilcox, viro,
	Jens Axboe, bcrl, Jonathan Corbet, Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 09:59:07PM +0800, Yongji Xie wrote:
> On Wed, Mar 31, 2021 at 8:23 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Wed, Mar 31, 2021 at 07:32:33PM +0800, Yongji Xie wrote:
> > > On Wed, Mar 31, 2021 at 5:15 PM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > >
> > > > On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> > > > > Export receive_fd() so that some modules can use
> > > > > it to pass file descriptor between processes without
> > > > > missing any security stuffs.
> > > > >
> > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > > ---
> > > >
> > > > Yeah, as I said in the other mail I'd be comfortable with exposing just
> > > > this variant of the helper.
> > >
> > > Thanks, I got it now.
> > >
> > > > Maybe this should be a separate patch bundled together with Christoph's
> > > > patch to split parts of receive_fd() into a separate helper.
> > >
> > > Do we need to add the seccomp notifier into the separate helper? In
> > > our case, the file passed to the separate helper is from another
> > > process.
> >
> > Not sure what you mean. Christoph has proposed
> > https://lore.kernel.org/linux-fsdevel/20210325082209.1067987-2-hch@lst.de
> > I was just saying that if we think this patch is useful we might bundle
> > it together with the
> > EXPORT_SYMBOL(receive_fd)
> > part here, convert all drivers that currently open-code get_unused_fd()
> > + fd_install() to use receive_fd(), and make this a separate patchset.
> >
> 
> Yes, I see. We can split the parts (get_unused_fd() + fd_install()) of
> receive_fd() into a separate helper and convert all drivers to use
> that. What I mean is that I also would like to use
> security_file_receive() in my modules. So I'm not sure if it's ok to
> add security_file_receive() into the separate helper. Or do I need to
> export security_file_receive() separately?

I think I confused you which is my bad. What you do here is - in my
opinion - correct.
I'm just saying that exporting receive_fd() allows further cleanups and
your export here could go on top of Christoph's change in a separate
series.

Christian

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 01/10] file: Export receive_fd() to modules
  2021-03-31 14:07           ` Christian Brauner
@ 2021-03-31 14:37             ` Yongji Xie
  0 siblings, 0 replies; 62+ messages in thread
From: Yongji Xie @ 2021-03-31 14:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang,
	Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christian Brauner, Randy Dunlap, Matthew Wilcox, viro,
	Jens Axboe, bcrl, Jonathan Corbet, Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 10:08 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Mar 31, 2021 at 09:59:07PM +0800, Yongji Xie wrote:
> > On Wed, Mar 31, 2021 at 8:23 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Wed, Mar 31, 2021 at 07:32:33PM +0800, Yongji Xie wrote:
> > > > On Wed, Mar 31, 2021 at 5:15 PM Christian Brauner
> > > > <christian.brauner@ubuntu.com> wrote:
> > > > >
> > > > > On Wed, Mar 31, 2021 at 04:05:10PM +0800, Xie Yongji wrote:
> > > > > > Export receive_fd() so that some modules can use
> > > > > > it to pass file descriptor between processes without
> > > > > > missing any security stuffs.
> > > > > >
> > > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > > > ---
> > > > >
> > > > > Yeah, as I said in the other mail I'd be comfortable with exposing just
> > > > > this variant of the helper.
> > > >
> > > > Thanks, I got it now.
> > > >
> > > > > Maybe this should be a separate patch bundled together with Christoph's
> > > > > patch to split parts of receive_fd() into a separate helper.
> > > >
> > > > Do we need to add the seccomp notifier into the separate helper? In
> > > > our case, the file passed to the separate helper is from another
> > > > process.
> > >
> > > Not sure what you mean. Christoph has proposed
> > > https://lore.kernel.org/linux-fsdevel/20210325082209.1067987-2-hch@lst.de
> > > I was just saying that if we think this patch is useful we might bundle
> > > it together with the
> > > EXPORT_SYMBOL(receive_fd)
> > > part here, convert all drivers that currently open-code get_unused_fd()
> > > + fd_install() to use receive_fd(), and make this a separate patchset.
> > >
> >
> > Yes, I see. We can split the parts (get_unused_fd() + fd_install()) of
> > receive_fd() into a separate helper and convert all drivers to use
> > that. What I mean is that I also would like to use
> > security_file_receive() in my modules. So I'm not sure if it's ok to
> > add security_file_receive() into the separate helper. Or do I need to
> > export security_file_receive() separately?
>
> I think I confused you which is my bad. What you do here is - in my
> opinion - correct.
> I'm just saying that exporting receive_fd() allows further cleanups and
> your export here could go on top of Christoph's change in a separate
> series.
>

Oh, I get you now! I'm glad to do that.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 07/10] vdpa: Support transferring virtual addressing during DMA mapping
  2021-03-31  8:05 ` [PATCH v6 07/10] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
@ 2021-04-08  2:36   ` Jason Wang
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Wang @ 2021-04-08  2:36 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel


在 2021/3/31 下午4:05, Xie Yongji 写道:
> This patch introduces an attribute for vDPA device to indicate
> whether virtual address can be used. If vDPA device driver set
> it, vhost-vdpa bus driver will not pin user page and transfer
> userspace virtual address instead of physical address during
> DMA mapping. And corresponding vma->vm_file and offset will be
> also passed as an opaque pointer.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
>   drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
>   drivers/vdpa/vdpa.c               |  9 +++-
>   drivers/vdpa/vdpa_sim/vdpa_sim.c  |  2 +-
>   drivers/vdpa/virtio_pci/vp_vdpa.c |  2 +-
>   drivers/vhost/vdpa.c              | 99 ++++++++++++++++++++++++++++++++++-----
>   include/linux/vdpa.h              | 19 ++++++--
>   7 files changed, 116 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> index d555a6a5d1ba..aee013f3eb5f 100644
> --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> @@ -431,7 +431,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   	}
>   
>   	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
> -				    dev, &ifc_vdpa_ops, NULL);
> +				    dev, &ifc_vdpa_ops, NULL, false);
>   	if (adapter == NULL) {
>   		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
>   		return -ENOMEM;
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 71397fdafa6a..fb62ebcf464a 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1982,7 +1982,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
>   	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
>   
>   	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> -				 NULL);
> +				 NULL, false);
>   	if (IS_ERR(ndev))
>   		return PTR_ERR(ndev);
>   
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index 5cffce67cab0..97fbac276c72 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -71,6 +71,7 @@ static void vdpa_release_dev(struct device *d)
>    * @config: the bus operations that is supported by this device
>    * @size: size of the parent structure that contains private data
>    * @name: name of the vdpa device; optional.
> + * @use_va: indicate whether virtual address must be used by this device
>    *
>    * Driver should use vdpa_alloc_device() wrapper macro instead of
>    * using this directly.
> @@ -80,7 +81,8 @@ static void vdpa_release_dev(struct device *d)
>    */
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					size_t size, const char *name)
> +					size_t size, const char *name,
> +					bool use_va)
>   {
>   	struct vdpa_device *vdev;
>   	int err = -EINVAL;
> @@ -91,6 +93,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	if (!!config->dma_map != !!config->dma_unmap)
>   		goto err;
>   
> +	/* It should only work for the device that use on-chip IOMMU */
> +	if (use_va && !(config->dma_map || config->set_map))
> +		goto err;
> +
>   	err = -ENOMEM;
>   	vdev = kzalloc(size, GFP_KERNEL);
>   	if (!vdev)
> @@ -106,6 +112,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	vdev->index = err;
>   	vdev->config = config;
>   	vdev->features_valid = false;
> +	vdev->use_va = use_va;
>   
>   	if (name)
>   		err = dev_set_name(&vdev->dev, "%s", name);
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index ff331f088baf..d26334e9a412 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
>   		ops = &vdpasim_config_ops;
>   
>   	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> -				    dev_attr->name);
> +				    dev_attr->name, false);
>   	if (!vdpasim)
>   		goto err_alloc;
>   
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index 1321a2fcd088..03b36aed48d6 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -377,7 +377,7 @@ static int vp_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   		return ret;
>   
>   	vp_vdpa = vdpa_alloc_device(struct vp_vdpa, vdpa,
> -				    dev, &vp_vdpa_ops, NULL);
> +				    dev, &vp_vdpa_ops, NULL, false);
>   	if (vp_vdpa == NULL) {
>   		dev_err(dev, "vp_vdpa: Failed to allocate vDPA structure\n");
>   		return -ENOMEM;
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index f9aab9013745..613ea400e0e5 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -505,8 +505,28 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   	}
>   }
>   
> +static void vhost_vdpa_va_unmap(struct vhost_vdpa *v, u64 start, u64 last)
> +{
> +	struct vhost_dev *dev = &v->vdev;
> +	struct vhost_iotlb *iotlb = dev->iotlb;
> +	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
> +
> +	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> +		map_file = (struct vdpa_map_file *)map->opaque;
> +		fput(map_file->file);
> +		kfree(map_file);
> +		vhost_iotlb_map_free(iotlb, map);
> +	}
> +}
> +
>   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   {
> +	struct vdpa_device *vdpa = v->vdpa;
> +
> +	if (vdpa->use_va)
> +		return vhost_vdpa_va_unmap(v, start, last);
> +
>   	return vhost_vdpa_pa_unmap(v, start, last);
>   }
>   
> @@ -541,21 +561,21 @@ static int perm_to_iommu_flags(u32 perm)
>   	return flags | IOMMU_CACHE;
>   }
>   
> -static int vhost_vdpa_map(struct vhost_vdpa *v,
> -			  u64 iova, u64 size, u64 pa, u32 perm)
> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> +			  u64 size, u64 pa, u32 perm, void *opaque)
>   {
>   	struct vhost_dev *dev = &v->vdev;
>   	struct vdpa_device *vdpa = v->vdpa;
>   	const struct vdpa_config_ops *ops = vdpa->config;
>   	int r = 0;
>   
> -	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> -				  pa, perm);
> +	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
> +				      pa, perm, opaque);
>   	if (r)
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> @@ -563,13 +583,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		r = iommu_map(v->domain, iova, pa, size,
>   			      perm_to_iommu_flags(perm));
>   	}
> -
> -	if (r)
> +	if (r) {
>   		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> -	else
> +		return r;
> +	}
> +
> +	if (!vdpa->use_va)
>   		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   
> -	return r;
> +	return 0;
>   }
>   
>   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> @@ -590,6 +612,56 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>   	}
>   }
>   
> +static int vhost_vdpa_va_map(struct vhost_vdpa *v,
> +			     u64 iova, u64 size, u64 uaddr, u32 perm)
> +{
> +	struct vhost_dev *dev = &v->vdev;
> +	u64 offset, map_size, map_iova = iova;
> +	struct vdpa_map_file *map_file;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	mmap_read_lock(dev->mm);
> +
> +	while (size) {
> +		vma = find_vma(dev->mm, uaddr);
> +		if (!vma) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		map_size = min(size, vma->vm_end - uaddr);
> +		if (!(vma->vm_file && (vma->vm_flags & VM_SHARED) &&
> +			!(vma->vm_flags & (VM_IO | VM_PFNMAP))))
> +			goto next;
> +
> +		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
> +		if (!map_file) {
> +			ret = -ENOMEM;
> +			break;
> +		}
> +		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> +		map_file->offset = offset;
> +		map_file->file = get_file(vma->vm_file);
> +		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> +				     perm, map_file);
> +		if (ret) {
> +			fput(map_file->file);
> +			kfree(map_file);
> +			break;
> +		}
> +next:
> +		size -= map_size;
> +		uaddr += map_size;
> +		map_iova += map_size;
> +	}
> +	if (ret)
> +		vhost_vdpa_unmap(v, iova, map_iova - iova);
> +
> +	mmap_read_unlock(dev->mm);
> +
> +	return ret;
> +}
> +
>   static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>   			     u64 iova, u64 size, u64 uaddr, u32 perm)
>   {
> @@ -656,7 +728,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>   				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>   				ret = vhost_vdpa_map(v, iova, csize,
>   						     map_pfn << PAGE_SHIFT,
> -						     perm);
> +						     perm, NULL);
>   				if (ret) {
>   					/*
>   					 * Unpin the pages that are left unmapped
> @@ -685,7 +757,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>   
>   	/* Pin the rest chunk */
>   	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> -			     map_pfn << PAGE_SHIFT, perm);
> +			     map_pfn << PAGE_SHIFT, perm, NULL);
>   out:
>   	if (ret) {
>   		if (nchunks) {
> @@ -718,6 +790,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   					   struct vhost_iotlb_msg *msg)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   
>   	if (msg->iova < v->range.first ||
> @@ -728,6 +801,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				    msg->iova + msg->size - 1))
>   		return -EEXIST;
>   
> +	if (vdpa->use_va)
> +		return vhost_vdpa_va_map(v, msg->iova, msg->size,
> +					 msg->uaddr, msg->perm);
> +
>   	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
>   				 msg->perm);
>   }
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index b01f7c9096bf..e67404e4b23e 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
>    * @config: the configuration ops for this device.
>    * @index: device index
>    * @features_valid: were features initialized? for legacy guests
> + * @use_va: indicate whether virtual address must be used by this device
>    * @nvqs: maximum number of supported virtqueues
>    * @mdev: management device pointer; caller must setup when registering device as part
>    *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
> @@ -54,6 +55,7 @@ struct vdpa_device {
>   	const struct vdpa_config_ops *config;
>   	unsigned int index;
>   	bool features_valid;
> +	bool use_va;
>   	int nvqs;
>   	struct vdpa_mgmt_dev *mdev;
>   };
> @@ -69,6 +71,16 @@ struct vdpa_iova_range {
>   };
>   
>   /**
> + * Corresponding file area for device memory mapping
> + * @file: vma->vm_file for the mapping
> + * @offset: mapping offset in the vm_file
> + */
> +struct vdpa_map_file {
> +	struct file *file;
> +	u64 offset;
> +};
> +
> +/**
>    * vDPA_config_ops - operations for configuring a vDPA device.
>    * Note: vDPA device drivers are required to implement all of the
>    * operations unless it is mentioned to be optional in the following
> @@ -250,14 +262,15 @@ struct vdpa_config_ops {
>   
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					size_t size, const char *name);
> +					size_t size, const char *name,
> +					bool use_va);
>   
> -#define vdpa_alloc_device(dev_struct, member, parent, config, name)   \
> +#define vdpa_alloc_device(dev_struct, member, parent, config, name, use_va)   \
>   			  container_of(__vdpa_alloc_device( \
>   				       parent, config, \
>   				       sizeof(dev_struct) + \
>   				       BUILD_BUG_ON_ZERO(offsetof( \
> -				       dev_struct, member)), name), \
> +				       dev_struct, member)), name, use_va), \
>   				       dev_struct, member)
>   
>   int vdpa_register_device(struct vdpa_device *vdev, int nvqs);


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 08/10] vduse: Implement an MMU-based IOMMU driver
  2021-03-31  8:05 ` [PATCH v6 08/10] vduse: Implement an MMU-based IOMMU driver Xie Yongji
@ 2021-04-08  3:25   ` Jason Wang
  2021-04-08  5:27     ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-08  3:25 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel


在 2021/3/31 下午4:05, Xie Yongji 写道:
> This implements an MMU-based IOMMU driver to support mapping
> kernel dma buffer into userspace. The basic idea behind it is
> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
> up MMU mapping instead of IOMMU mapping for the DMA transfer so
> that the userspace process is able to use its virtual address to
> access the dma buffer in kernel.
>
> And to avoid security issue, a bounce-buffering mechanism is
> introduced to prevent userspace accessing the original buffer
> directly.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>

With some nits:


> ---
>   drivers/vdpa/vdpa_user/iova_domain.c | 521 +++++++++++++++++++++++++++++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h |  70 +++++
>   2 files changed, 591 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h


[...]


> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> +				dma_addr_t iova, size_t size,
> +				enum dma_data_direction dir)
> +{
> +	struct vduse_bounce_map *map;
> +	unsigned int offset;
> +	void *addr;
> +	size_t sz;
> +
> +	while (size) {
> +		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> +		offset = offset_in_page(iova);
> +		sz = min_t(size_t, PAGE_SIZE - offset, size);
> +
> +		if (WARN_ON(!map->bounce_page ||
> +			    map->orig_phys == INVALID_PHYS_ADDR))
> +			return;
> +
> +		addr = page_address(map->bounce_page) + offset;
> +		do_bounce(map->orig_phys + offset, addr, sz, dir);
> +		size -= sz;
> +		iova += sz;
> +	}
> +}
> +
> +static struct page *
> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)


It's better to rename this as "vduse_domain_get_coherent_page?".


> +{
> +	u64 start = iova & PAGE_MASK;
> +	u64 last = start + PAGE_SIZE - 1;
> +	struct vhost_iotlb_map *map;
> +	struct page *page = NULL;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
> +	if (!map)
> +		goto out;
> +
> +	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
> +	get_page(page);
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return page;
> +}
> +


[...]


> +
> +static dma_addr_t
> +vduse_domain_alloc_iova(struct iova_domain *iovad,
> +			unsigned long size, unsigned long limit)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +	unsigned long iova_pfn;
> +
> +	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +		iova_len = roundup_pow_of_two(iova_len);


Let's add a comment as what has been done in dma-iommu.c?

(In the future, it looks to me it's better to move them to 
alloc_iova_fast()).

Thanks



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 08/10] vduse: Implement an MMU-based IOMMU driver
  2021-04-08  3:25   ` Jason Wang
@ 2021-04-08  5:27     ` Yongji Xie
  0 siblings, 0 replies; 62+ messages in thread
From: Yongji Xie @ 2021-04-08  5:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Thu, Apr 8, 2021 at 11:26 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/3/31 下午4:05, Xie Yongji 写道:
> > This implements an MMU-based IOMMU driver to support mapping
> > kernel dma buffer into userspace. The basic idea behind it is
> > treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
> > up MMU mapping instead of IOMMU mapping for the DMA transfer so
> > that the userspace process is able to use its virtual address to
> > access the dma buffer in kernel.
> >
> > And to avoid security issue, a bounce-buffering mechanism is
> > introduced to prevent userspace accessing the original buffer
> > directly.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>
>
> Acked-by: Jason Wang <jasowang@redhat.com>
>
> With some nits:
>
>
> > ---
> >   drivers/vdpa/vdpa_user/iova_domain.c | 521 +++++++++++++++++++++++++++++++++++
> >   drivers/vdpa/vdpa_user/iova_domain.h |  70 +++++
> >   2 files changed, 591 insertions(+)
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>
>
> [...]
>
>
> > +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> > +                             dma_addr_t iova, size_t size,
> > +                             enum dma_data_direction dir)
> > +{
> > +     struct vduse_bounce_map *map;
> > +     unsigned int offset;
> > +     void *addr;
> > +     size_t sz;
> > +
> > +     while (size) {
> > +             map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> > +             offset = offset_in_page(iova);
> > +             sz = min_t(size_t, PAGE_SIZE - offset, size);
> > +
> > +             if (WARN_ON(!map->bounce_page ||
> > +                         map->orig_phys == INVALID_PHYS_ADDR))
> > +                     return;
> > +
> > +             addr = page_address(map->bounce_page) + offset;
> > +             do_bounce(map->orig_phys + offset, addr, sz, dir);
> > +             size -= sz;
> > +             iova += sz;
> > +     }
> > +}
> > +
> > +static struct page *
> > +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
>
>
> It's better to rename this as "vduse_domain_get_coherent_page?".
>

OK.

>
> > +{
> > +     u64 start = iova & PAGE_MASK;
> > +     u64 last = start + PAGE_SIZE - 1;
> > +     struct vhost_iotlb_map *map;
> > +     struct page *page = NULL;
> > +
> > +     spin_lock(&domain->iotlb_lock);
> > +     map = vhost_iotlb_itree_first(domain->iotlb, start, last);
> > +     if (!map)
> > +             goto out;
> > +
> > +     page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
> > +     get_page(page);
> > +out:
> > +     spin_unlock(&domain->iotlb_lock);
> > +
> > +     return page;
> > +}
> > +
>
>
> [...]
>
>
> > +
> > +static dma_addr_t
> > +vduse_domain_alloc_iova(struct iova_domain *iovad,
> > +                     unsigned long size, unsigned long limit)
> > +{
> > +     unsigned long shift = iova_shift(iovad);
> > +     unsigned long iova_len = iova_align(iovad, size) >> shift;
> > +     unsigned long iova_pfn;
> > +
> > +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> > +             iova_len = roundup_pow_of_two(iova_len);
>
>
> Let's add a comment as what has been done in dma-iommu.c?
>

Fine.

> (In the future, it looks to me it's better to move them to
> alloc_iova_fast()).
>

Agree.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-03-31  8:05 ` [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
@ 2021-04-08  6:57   ` Jason Wang
  2021-04-08  9:36     ` Yongji Xie
  2021-04-16  3:24   ` Jason Wang
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-08  6:57 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel


在 2021/3/31 下午4:05, Xie Yongji 写道:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
>
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> the file descriptors referring to vDPA device's iova regions. Then
> userspace can use mmap() to access those iova regions. Besides,
> userspace can use ioctl() to inject interrupt and use the eventfd
> mechanism to receive virtqueue kicks.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  175 +++
>   6 files changed, 1554 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a28c839..71722e6f8f23 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index a245809c99d0..77a1da522c21 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>   	help
>   	  vDPA networking device simulator which loops TX traffic back to RX.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index 67fe7f3d6943..f02ebed33f19 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
>   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..51ca73464d0d
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1362 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/miscdevice.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_VERSION  "1.0"
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	bool ready;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +	struct work_struct inject;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device dev;
> +	struct cdev cdev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	struct mutex lock;
> +	spinlock_t msg_lock;
> +	atomic64_t msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct list_head list;
> +	struct vdpa_callback config_cb;
> +	spinlock_t irq_lock;
> +	unsigned long api_version;
> +	bool connected;
> +	int minor;
> +	u16 vq_size_max;
> +	u16 vq_num;
> +	u32 vq_align;
> +	u32 device_id;
> +	u32 vendor_id;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +struct vduse_control {
> +	unsigned long api_version;
> +};
> +
> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> +module_param(max_bounce_size, ulong, 0444);
> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> +
> +static unsigned long max_iova_size = (128 * 1024 * 1024);
> +module_param(max_iova_size, ulong, 0444);
> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static LIST_HEAD(vduse_devs);
> +static DEFINE_IDA(vduse_ida);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +static struct workqueue_struct *vduse_irq_wq;
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t request_id)
> +{
> +	struct vduse_dev_msg *tmp, *msg = NULL;
> +
> +	list_for_each_entry(tmp, head, list) {
> +		if (tmp->req.request_id == request_id) {
> +			msg = tmp;
> +			list_del(&tmp->list);
> +			break;
> +		}
> +	}
> +
> +	return msg;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_sync(struct vduse_dev *dev,
> +			      struct vduse_dev_msg *msg)
> +{
> +	init_waitqueue_head(&msg->waitq);
> +	spin_lock(&dev->msg_lock);
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	spin_unlock(&dev->msg_lock);
> +	wait_event_interruptible(msg->waitq, msg->completed);
> +	spin_lock(&dev->msg_lock);
> +	if (!msg->completed)
> +		list_del(&msg->list);
> +	spin_unlock(&dev->msg_lock);
> +
> +	return (msg->resp.result == VDUSE_REQUEST_OK) ? 0 : -1;
> +}
> +
> +static u32 vduse_dev_get_request_id(struct vduse_dev *dev)
> +{
> +	return atomic64_fetch_inc(&dev->msg_unique);
> +}
> +
> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_FEATURES;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.f.features;
> +}
> +
> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_FEATURES;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.f.features = features;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_STATUS;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.s.status;
> +}
> +
> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_STATUS;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.s.status = status;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> +				 void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_GET_CONFIG;
> +		msg.req.request_id = vduse_dev_get_request_id(dev);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		vduse_dev_msg_sync(dev, &msg);
> +		memcpy(buf, msg.resp.config.data, sz);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> +				 const void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_SET_CONFIG;
> +		msg.req.request_id = vduse_dev_get_request_id(dev);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		memcpy(msg.req.config.data, buf, sz);
> +		vduse_dev_msg_sync(dev, &msg);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> +				 struct vduse_virtqueue *vq, u32 num)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_NUM;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.vq_num.index = vq->index;
> +	msg.req.vq_num.num = num;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> +				 struct vduse_virtqueue *vq, u64 desc_addr,
> +				 u64 driver_addr, u64 device_addr)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_ADDR;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.vq_addr.index = vq->index;
> +	msg.req.vq_addr.desc_addr = desc_addr;
> +	msg.req.vq_addr.driver_addr = driver_addr;
> +	msg.req.vq_addr.device_addr = device_addr;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, bool ready)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_READY;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.vq_ready.index = vq->index;
> +	msg.req.vq_ready.ready = ready;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> +				   struct vduse_virtqueue *vq)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_VQ_READY;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.vq_ready.index = vq->index;
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;
> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_sync(dev, &msg);
> +	if (!ret)
> +		state->avail_index = msg.resp.vq_state.avail_idx;
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_STATE;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.vq_state.index = vq->index;
> +	msg.req.vq_state.avail_idx = state->avail_index;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg.req.type = VDUSE_UPDATE_IOTLB;
> +	msg.req.request_id = vduse_dev_get_request_id(dev);
> +	msg.req.iova.start = start;
> +	msg.req.iova.last = last;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret = 0;
> +
> +	if (iov_iter_count(to) < size)
> +		return 0;
> +
> +	spin_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		spin_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		spin_lock(&dev->msg_lock);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +	ret = copy_to_iter(&msg->req, size, to);
> +	spin_lock(&dev->msg_lock);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	vduse_enqueue_msg(&dev->recv_list, msg);
> +	wake_up(&dev->waitq);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +	if (!list_empty(&dev->recv_list))
> +		mask |= EPOLLOUT;
> +
> +	return mask;
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	/* The coherent mappings are handled in vduse_dev_free_coherent() */
> +	vduse_domain_reset_bounce_map(dev->domain);
> +	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = NULL;
> +	dev->config_cb.private = NULL;
> +	spin_unlock(&dev->irq_lock);
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->ready = false;
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_addr(dev, vq, desc_area,
> +					driver_area, device_area);
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->ready && vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_num(dev, vq, num);
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_ready(dev, vq, ready);
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = vduse_dev_get_vq_ready(dev, vq);
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_state(dev, vq, state);
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_features(dev);
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +		return -EINVAL;
> +
> +	return vduse_dev_set_features(dev, features);
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = cb->callback;
> +	dev->config_cb.private = cb->private;
> +	spin_unlock(&dev->irq_lock);
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_status(dev);
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_set_status(dev, status);
> +
> +	if (status == 0)
> +		vduse_dev_reset(dev);
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +			     void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_get_config(dev, offset, buf, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_set_config(dev, offset, buf, len);
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	int ret;
> +
> +	ret = vduse_domain_set_map(dev->domain, iotlb);
> +	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +
> +	return ret;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	WARN_ON(!list_empty(&dev->send_list));
> +	WARN_ON(!list_empty(&dev->recv_list));
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +				     unsigned long offset, size_t size,
> +				     enum dma_data_direction dir,
> +				     unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	*dma_addr = (dma_addr_t)iova;
> +	vduse_dev_update_iotlb(vdev, iova, iova + size - 1);
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long start = (unsigned long)dma_addr;
> +	unsigned long last = start + size - 1;
> +
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +	vduse_dev_update_iotlb(vdev, start, last);
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	if (eventfd->fd > 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> +		return 0;


Do we allow the userspace to switch kickfd here? If yes, we need to deal 
with that.


> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct vduse_iova_domain *domain = dev->domain;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (entry.start > entry.last)
> +			break;
> +
> +		spin_lock(&domain->iotlb_lock);
> +		map = vhost_iotlb_itree_first(domain->iotlb,
> +					      entry.start, entry.last);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&domain->iotlb_lock);
> +		ret = -EINVAL;
> +		if (!f)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			break;
> +		}
> +		ret = receive_fd(f, perm_to_file_flags(entry.perm));
> +		fput(f);


Any reason to drop the refcnt here?


> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_INJECT_VQ_IRQ:
> +		ret = -EINVAL;
> +		if (arg >= dev->vq_num)
> +			break;
> +
> +		ret = 0;
> +		queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
> +		break;
> +	case VDUSE_INJECT_CONFIG_IRQ:
> +		ret = -EINVAL;
> +		spin_lock_irq(&dev->irq_lock);
> +		if (dev->config_cb.callback) {
> +			dev->config_cb.callback(dev->config_cb.private);
> +			ret = 0;
> +		}
> +		spin_unlock_irq(&dev->irq_lock);


For consistency, is it better to use vduse_irq_wq here?


> +		break;
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->kick_lock);
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +	}
> +
> +	spin_lock(&dev->msg_lock);
> +	/*  Make sure the inflight messages can processed */


This might be better:

/*  Make sure the inflight messages can processed after reconncection */

> +	while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +	spin_unlock(&dev->msg_lock);
> +
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = container_of(inode->i_cdev,
> +					struct vduse_dev, cdev);
> +	int ret = -EBUSY;
> +
> +	mutex_lock(&dev->lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&dev->lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	mutex_init(&dev->lock);
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	atomic64_set(&dev->msg_unique, 0);
> +	spin_lock_init(&dev->irq_lock);
> +
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *tmp, *dev = NULL;
> +
> +	list_for_each_entry(tmp, &vduse_devs, list) {
> +		if (!strcmp(dev_name(&tmp->dev), name)) {
> +			dev = tmp;
> +			break;
> +		}
> +	}
> +	return dev;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->lock);
> +	if (dev->vdev || dev->connected) {
> +		mutex_unlock(&dev->lock);
> +		return -EBUSY;
> +	}
> +	dev->connected = true;
> +	mutex_unlock(&dev->lock);
> +
> +	list_del(&dev->list);
> +	cdev_device_del(&dev->cdev, &dev->dev);
> +	put_device(&dev->dev);
> +	module_put(THIS_MODULE);
> +
> +	return 0;
> +}
> +
> +static void vduse_release_dev(struct device *device)
> +{
> +	struct vduse_dev *dev =
> +		container_of(device, struct vduse_dev, dev);
> +
> +	ida_simple_remove(&vduse_ida, dev->minor);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	vduse_dev_destroy(dev);
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config,
> +			    unsigned long api_version)
> +{
> +	int i, ret = -ENOMEM;
> +	struct vduse_dev *dev;
> +
> +	if (config->bounce_size > max_bounce_size)
> +		return -EINVAL;
> +
> +	if (config->bounce_size > max_iova_size)
> +		return -EINVAL;
> +
> +	if (vduse_find_dev(config->name))
> +		return -EEXIST;
> +
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->api_version = api_version;
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->domain = vduse_domain_create(max_iova_size - 1,
> +					config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_ida;
> +
> +	dev->minor = ret;
> +	device_initialize(&dev->dev);
> +	dev->dev.release = vduse_release_dev;
> +	dev->dev.class = vduse_class;
> +	dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> +	ret = dev_set_name(&dev->dev, "%s", config->name);
> +	if (ret) {
> +		put_device(&dev->dev);
> +		return ret;
> +	}
> +	cdev_init(&dev->cdev, &vduse_dev_fops);
> +	dev->cdev.owner = THIS_MODULE;
> +
> +	ret = cdev_device_add(&dev->cdev, &dev->dev);
> +	if (ret) {
> +		put_device(&dev->dev);
> +		return ret;


Let's introduce an error label for this.


> +	}
> +	list_add(&dev->list, &vduse_devs);
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_ida:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	vduse_dev_destroy(dev);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +	struct vduse_control *control = file->private_data;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_GET_API_VERSION:
> +		ret = control->api_version;
> +		break;
> +	case VDUSE_SET_API_VERSION:
> +		ret = -EINVAL;
> +		if (arg > VDUSE_API_VERSION)
> +			break;
> +
> +		ret = 0;
> +		control->api_version = arg;
> +		break;
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, sizeof(config)))
> +			break;
> +
> +		ret = vduse_create_dev(&config, control->api_version);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static int vduse_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control = file->private_data;
> +
> +	kfree(control);
> +	return 0;
> +}
> +
> +static int vduse_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control;
> +
> +	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> +	if (!control)
> +		return -ENOMEM;
> +
> +	control->api_version = VDUSE_API_VERSION;
> +	file->private_data = control;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_open,
> +	.release	= vduse_release,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static struct miscdevice vduse_misc = {
> +	.fops = &vduse_fops,
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "vduse",
> +	.nodename = "vduse/control",
> +};
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev;
> +	int ret;
> +
> +	if (dev->vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, &dev->dev,
> +				 &vduse_vdpa_config_ops, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	dev->vdev = vdev;
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret) {
> +		put_device(&vdev->vdpa.dev);
> +		return ret;
> +	}
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	return 0;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret = -EINVAL;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev) {
> +		mutex_unlock(&vduse_lock);
> +		return -EINVAL;
> +	}
> +	ret = vduse_dev_init_vdpa(dev, name);
> +	mutex_unlock(&vduse_lock);
> +	if (ret)
> +		return ret;
> +
> +	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> +	if (ret) {
> +		put_device(&dev->vdev->vdpa.dev);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +
> +	if (max_bounce_size >= max_iova_size)
> +		return -EINVAL;
> +
> +	ret = misc_register(&vduse_misc);
> +	if (ret)
> +		return ret;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class)) {
> +		ret = PTR_ERR(vduse_class);
> +		goto err_class;
> +	}
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev;
> +
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;
> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	destroy_workqueue(vduse_irq_wq);
> +err_wq:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev:
> +	class_destroy(vduse_class);
> +err_class:
> +	misc_deregister(&vduse_misc);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	misc_deregister(&vduse_misc);
> +	class_destroy(vduse_class);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	destroy_workqueue(vduse_irq_wq);
> +	vduse_domain_exit();
> +	vduse_mgmtdev_exit();
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..66a6e5212226
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,175 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_API_VERSION	0
> +
> +#define VDUSE_CONFIG_DATA_LEN	256
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	/* Set the vring address of virtqueue. */
> +	VDUSE_SET_VQ_NUM,
> +	/* Set the vring address of virtqueue. */
> +	VDUSE_SET_VQ_ADDR,
> +	/* Set ready status of virtqueue */
> +	VDUSE_SET_VQ_READY,
> +	/* Get ready status of virtqueue */
> +	VDUSE_GET_VQ_READY,
> +	/* Set the state for virtqueue */
> +	VDUSE_SET_VQ_STATE,
> +	/* Get the state for virtqueue */
> +	VDUSE_GET_VQ_STATE,
> +	/* Set virtio features supported by the driver */
> +	VDUSE_SET_FEATURES,
> +	/* Get virtio features supported by the device */
> +	VDUSE_GET_FEATURES,
> +	/* Set the device status */
> +	VDUSE_SET_STATUS,
> +	/* Get the device status */
> +	VDUSE_GET_STATUS,
> +	/* Write to device specific configuration space */
> +	VDUSE_SET_CONFIG,
> +	/* Read from device specific configuration space */
> +	VDUSE_GET_CONFIG,
> +	/* Notify userspace to update the memory mapping in device IOTLB */
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_num {
> +	__u32 index; /* virtqueue index */


I think it's better to have a consistent style of the doc/comment. If 
yes, let's move those comment above the field.


> +	__u32 num; /* the size of virtqueue */
> +};
> +
> +struct vduse_vq_addr {
> +	__u32 index; /* virtqueue index */
> +	__u64 desc_addr; /* address of desc area */
> +	__u64 driver_addr; /* address of driver area */
> +	__u64 device_addr; /* address of device area */
> +};
> +
> +struct vduse_vq_ready {
> +	__u32 index; /* virtqueue index */
> +	__u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index; /* virtqueue index */
> +	__u16 avail_idx; /* virtqueue state (last_avail_idx) */


Let's use __u64 here to be consistent with get_vq_state(). The idea is 
to support packed virtqueue.


> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset; /* offset from the beginning of config space */
> +	__u32 len; /* the length to read/write */
> +	__u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */


Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can 
not change it in the future.

So this might suffcient for future features or all type of virtio devices.


> +};
> +
> +struct vduse_iova_range {
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* end of the IOVA range */
> +};
> +
> +struct vduse_features {
> +	__u64 features; /* virtio features */
> +};
> +
> +struct vduse_status {
> +	__u8 status; /* device status */
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 request_id; /* request id */
> +	__u32 reserved[2]; /* for future use */
> +	union {
> +		struct vduse_vq_num vq_num; /* virtqueue num */
> +		struct vduse_vq_addr vq_addr; /* virtqueue address */
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		struct vduse_features f; /* virtio features */
> +		struct vduse_status s; /* device status */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQUEST_OK	0x00
> +#define VDUSE_REQUEST_FAILED	0x01
> +	__u32 result; /* the result of request */
> +	__u32 reserved[2]; /* for future use */
> +	union {
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_features f; /* virtio features */
> +		struct vduse_status s; /* device status */
> +		__u32 padding[16]; /* padding */


So it looks to me this padding doesn't work since vduse_dev_config_data 
is larger than it.


> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_num; /* the number of virtqueues */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +	__u32 reserved[8]; /* for future use */


Is there a hole before reserved?


> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION	_IO(VDUSE_BASE, 0x00)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION	_IO(VDUSE_BASE, 0x01)
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> +
> +/*
> + * Get a file descriptor for the first overlapped iova region,
> + * -EINVAL means the iova region doesn't exist.
> + */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x06)


Missing parameter?

Thanks


> +
> +/* Inject a config interrupt */
> +#define VDUSE_INJECT_CONFIG_IRQ	_IO(VDUSE_BASE, 0x07)
> +
> +#endif /* _UAPI_VDUSE_H_ */


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-03-31  8:05 ` [PATCH v6 10/10] Documentation: Add documentation for VDUSE Xie Yongji
@ 2021-04-08  7:18   ` Jason Wang
  2021-04-08  8:09     ` Yongji Xie
  2021-04-14 14:14   ` Stefan Hajnoczi
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-08  7:18 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel


在 2021/3/31 下午4:05, Xie Yongji 写道:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/index.rst |   1 +
>   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>   2 files changed, 213 insertions(+)
>   create mode 100644 Documentation/userspace-api/vduse.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index acd2cc2a538d..f63119130898 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -24,6 +24,7 @@ place where this information is gathered.
>      ioctl/index
>      iommu
>      media/index
> +   vduse
>   
>   .. only::  subproject and html
>   
> diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index 000000000000..8c4e2b2df8bb
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,212 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +
> +How VDUSE works
> +------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the character device (/dev/vduse/control). Then a device file with the
> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> +implement the userspace vDPA device's control path and data path.
> +
> +To implement control path, a message-based communication protocol and some
> +types of control messages are introduced in the VDUSE framework:
> +
> +- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
> +
> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> +
> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> +
> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> +
> +- VDUSE_SET_VQ_STATE: Set the state for virtqueue
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue
> +
> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> +
> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> +
> +- VDUSE_SET_STATUS: Set the device status
> +
> +- VDUSE_GET_STATUS: Get the device status
> +
> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> +
> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Those control messages are mostly based on the vdpa_config_ops in
> +include/linux/vdpa.h which defines a unified interface to control
> +different types of vdpa device. Userspace needs to read()/write()
> +on the VDUSE device file to receive/reply those control messages
> +from/to VDUSE kernel module as follows:
> +
> +.. code-block:: c
> +
> +	static int vduse_message_handler(int dev_fd)
> +	{
> +		int len;
> +		struct vduse_dev_request req;
> +		struct vduse_dev_response resp;
> +
> +		len = read(dev_fd, &req, sizeof(req));
> +		if (len != sizeof(req))
> +			return -1;
> +
> +		resp.request_id = req.request_id;
> +
> +		switch (req.type) {
> +
> +		/* handle different types of message */
> +
> +		}
> +
> +		len = write(dev_fd, &resp, sizeof(resp));
> +		if (len != sizeof(resp))
> +			return -1;
> +
> +		return 0;
> +	}
> +
> +In the data path, vDPA device's iova regions will be mapped into userspace
> +with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
> +  Userspace can access this iova region by passing fd and corresponding size, offset,
> +  perm to mmap(). For example:
> +
> +.. code-block:: c
> +
> +	static int perm_to_prot(uint8_t perm)
> +	{
> +		int prot = 0;
> +
> +		switch (perm) {
> +		case VDUSE_ACCESS_WO:
> +			prot |= PROT_WRITE;
> +			break;
> +		case VDUSE_ACCESS_RO:
> +			prot |= PROT_READ;
> +			break;
> +		case VDUSE_ACCESS_RW:
> +			prot |= PROT_READ | PROT_WRITE;
> +			break;
> +		}
> +
> +		return prot;
> +	}
> +
> +	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> +	{
> +		int fd;
> +		void *addr;
> +		size_t size;
> +		struct vduse_iotlb_entry entry;
> +
> +		entry.start = iova;
> +		entry.last = iova + 1;
> +		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> +		if (fd < 0)
> +			return NULL;
> +
> +		size = entry.last - entry.start + 1;
> +		*len = entry.last - iova + 1;
> +		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> +			    fd, entry.offset);
> +		close(fd);
> +		if (addr == MAP_FAILED)
> +			return NULL;
> +
> +		/* do something to cache this iova region */
> +
> +		return addr + iova - entry.start;
> +	}
> +
> +Besides, the following ioctls on the VDUSE device file are provided to support
> +interrupt injection and setting up eventfd for virtqueue kicks:
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE kernel module to notify userspace to consume the vring.
> +
> +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
> +
> +- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
> +
> +Register VDUSE device on vDPA bus
> +---------------------------------
> +In order to make the VDUSE device work, administrator needs to use the management
> +API (netlink) to register it on vDPA bus. Some sample codes are show below:
> +
> +.. code-block:: c
> +
> +	static int netlink_add_vduse(const char *name, int device_id)
> +	{
> +		struct nl_sock *nlsock;
> +		struct nl_msg *msg;
> +		int famid;
> +
> +		nlsock = nl_socket_alloc();
> +		if (!nlsock)
> +			return -ENOMEM;
> +
> +		if (genl_connect(nlsock))
> +			goto free_sock;
> +
> +		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> +		if (famid < 0)
> +			goto close_sock;
> +
> +		msg = nlmsg_alloc();
> +		if (!msg)
> +			goto close_sock;
> +
> +		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> +		    VDPA_CMD_DEV_NEW, 0))
> +			goto nla_put_failure;
> +
> +		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> +		NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> +		NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> +
> +		if (nl_send_sync(nlsock, msg))
> +			goto close_sock;
> +
> +		nl_close(nlsock);
> +		nl_socket_free(nlsock);
> +
> +		return 0;
> +	nla_put_failure:
> +		nlmsg_free(msg);
> +	close_sock:
> +		nl_close(nlsock);
> +	free_sock:
> +		nl_socket_free(nlsock);
> +		return -1;
> +	}


Let's also explain this can be done via vdpa tool in iproute2 as well.

Otherwise

Acked-by: Jason Wang <jasowang@redhat.com>


> +
> +MMU-based IOMMU Driver
> +----------------------
> +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> +mapping the kernel DMA buffer into the userspace iova region dynamically.
> +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the DMA buffer in kernel.
> +
> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> +prevent userspace accessing the original buffer directly which may contain other
> +kernel data. During the mapping, unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> +space instead of the original one.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-08  7:18   ` Jason Wang
@ 2021-04-08  8:09     ` Yongji Xie
  0 siblings, 0 replies; 62+ messages in thread
From: Yongji Xie @ 2021-04-08  8:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Thu, Apr 8, 2021 at 3:18 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/3/31 下午4:05, Xie Yongji 写道:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/userspace-api/index.rst |   1 +
> >   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> >   2 files changed, 213 insertions(+)
> >   create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> > index acd2cc2a538d..f63119130898 100644
> > --- a/Documentation/userspace-api/index.rst
> > +++ b/Documentation/userspace-api/index.rst
> > @@ -24,6 +24,7 @@ place where this information is gathered.
> >      ioctl/index
> >      iommu
> >      media/index
> > +   vduse
> >
> >   .. only::  subproject and html
> >
> > diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> > new file mode 100644
> > index 000000000000..8c4e2b2df8bb
> > --- /dev/null
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -0,0 +1,212 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace.
> > +
> > +How VDUSE works
> > +------------
> > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > +the character device (/dev/vduse/control). Then a device file with the
> > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > +implement the userspace vDPA device's control path and data path.
> > +
> > +To implement control path, a message-based communication protocol and some
> > +types of control messages are introduced in the VDUSE framework:
> > +
> > +- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
> > +
> > +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> > +
> > +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> > +
> > +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> > +
> > +- VDUSE_SET_VQ_STATE: Set the state for virtqueue
> > +
> > +- VDUSE_GET_VQ_STATE: Get the state for virtqueue
> > +
> > +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> > +
> > +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> > +
> > +- VDUSE_SET_STATUS: Set the device status
> > +
> > +- VDUSE_GET_STATUS: Get the device status
> > +
> > +- VDUSE_SET_CONFIG: Write to device specific configuration space
> > +
> > +- VDUSE_GET_CONFIG: Read from device specific configuration space
> > +
> > +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> > +
> > +Those control messages are mostly based on the vdpa_config_ops in
> > +include/linux/vdpa.h which defines a unified interface to control
> > +different types of vdpa device. Userspace needs to read()/write()
> > +on the VDUSE device file to receive/reply those control messages
> > +from/to VDUSE kernel module as follows:
> > +
> > +.. code-block:: c
> > +
> > +     static int vduse_message_handler(int dev_fd)
> > +     {
> > +             int len;
> > +             struct vduse_dev_request req;
> > +             struct vduse_dev_response resp;
> > +
> > +             len = read(dev_fd, &req, sizeof(req));
> > +             if (len != sizeof(req))
> > +                     return -1;
> > +
> > +             resp.request_id = req.request_id;
> > +
> > +             switch (req.type) {
> > +
> > +             /* handle different types of message */
> > +
> > +             }
> > +
> > +             len = write(dev_fd, &resp, sizeof(resp));
> > +             if (len != sizeof(resp))
> > +                     return -1;
> > +
> > +             return 0;
> > +     }
> > +
> > +In the data path, vDPA device's iova regions will be mapped into userspace
> > +with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
> > +
> > +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
> > +  Userspace can access this iova region by passing fd and corresponding size, offset,
> > +  perm to mmap(). For example:
> > +
> > +.. code-block:: c
> > +
> > +     static int perm_to_prot(uint8_t perm)
> > +     {
> > +             int prot = 0;
> > +
> > +             switch (perm) {
> > +             case VDUSE_ACCESS_WO:
> > +                     prot |= PROT_WRITE;
> > +                     break;
> > +             case VDUSE_ACCESS_RO:
> > +                     prot |= PROT_READ;
> > +                     break;
> > +             case VDUSE_ACCESS_RW:
> > +                     prot |= PROT_READ | PROT_WRITE;
> > +                     break;
> > +             }
> > +
> > +             return prot;
> > +     }
> > +
> > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > +     {
> > +             int fd;
> > +             void *addr;
> > +             size_t size;
> > +             struct vduse_iotlb_entry entry;
> > +
> > +             entry.start = iova;
> > +             entry.last = iova + 1;
> > +             fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> > +             if (fd < 0)
> > +                     return NULL;
> > +
> > +             size = entry.last - entry.start + 1;
> > +             *len = entry.last - iova + 1;
> > +             addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > +                         fd, entry.offset);
> > +             close(fd);
> > +             if (addr == MAP_FAILED)
> > +                     return NULL;
> > +
> > +             /* do something to cache this iova region */
> > +
> > +             return addr + iova - entry.start;
> > +     }
> > +
> > +Besides, the following ioctls on the VDUSE device file are provided to support
> > +interrupt injection and setting up eventfd for virtqueue kicks:
> > +
> > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> > +  by VDUSE kernel module to notify userspace to consume the vring.
> > +
> > +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
> > +
> > +- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
> > +
> > +Register VDUSE device on vDPA bus
> > +---------------------------------
> > +In order to make the VDUSE device work, administrator needs to use the management
> > +API (netlink) to register it on vDPA bus. Some sample codes are show below:
> > +
> > +.. code-block:: c
> > +
> > +     static int netlink_add_vduse(const char *name, int device_id)
> > +     {
> > +             struct nl_sock *nlsock;
> > +             struct nl_msg *msg;
> > +             int famid;
> > +
> > +             nlsock = nl_socket_alloc();
> > +             if (!nlsock)
> > +                     return -ENOMEM;
> > +
> > +             if (genl_connect(nlsock))
> > +                     goto free_sock;
> > +
> > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > +             if (famid < 0)
> > +                     goto close_sock;
> > +
> > +             msg = nlmsg_alloc();
> > +             if (!msg)
> > +                     goto close_sock;
> > +
> > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > +                 VDPA_CMD_DEV_NEW, 0))
> > +                     goto nla_put_failure;
> > +
> > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> > +
> > +             if (nl_send_sync(nlsock, msg))
> > +                     goto close_sock;
> > +
> > +             nl_close(nlsock);
> > +             nl_socket_free(nlsock);
> > +
> > +             return 0;
> > +     nla_put_failure:
> > +             nlmsg_free(msg);
> > +     close_sock:
> > +             nl_close(nlsock);
> > +     free_sock:
> > +             nl_socket_free(nlsock);
> > +             return -1;
> > +     }
>
>
> Let's also explain this can be done via vdpa tool in iproute2 as well.
>

Sure.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-08  6:57   ` Jason Wang
@ 2021-04-08  9:36     ` Yongji Xie
  2021-04-09  5:36       ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-08  9:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Thu, Apr 8, 2021 at 2:57 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/3/31 下午4:05, Xie Yongji 写道:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > Both control path and data path of vDPA devices will be able to
> > be handled in userspace.
> >
> > In the control path, the VDUSE driver will make use of message
> > mechnism to forward the config operation from vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive/reply
> > those control messages.
> >
> > In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> > the file descriptors referring to vDPA device's iova regions. Then
> > userspace can use mmap() to access those iova regions. Besides,
> > userspace can use ioctl() to inject interrupt and use the eventfd
> > mechanism to receive virtqueue kicks.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >   drivers/vdpa/Kconfig                               |   10 +
> >   drivers/vdpa/Makefile                              |    1 +
> >   drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
> >   include/uapi/linux/vduse.h                         |  175 +++
> >   6 files changed, 1554 insertions(+)
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index a4c75a28c839..71722e6f8f23 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >   '|'   00-7F  linux/media.h
> >   0x80  00-1F  linux/fb.h
> > +0x81  00-1F  linux/vduse.h
> >   0x89  00-06  arch/x86/include/asm/sockios.h
> >   0x89  0B-DF  linux/sockios.h
> >   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> > diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> > index a245809c99d0..77a1da522c21 100644
> > --- a/drivers/vdpa/Kconfig
> > +++ b/drivers/vdpa/Kconfig
> > @@ -25,6 +25,16 @@ config VDPA_SIM_NET
> >       help
> >         vDPA networking device simulator which loops TX traffic back to RX.
> >
> > +config VDPA_USER
> > +     tristate "VDUSE (vDPA Device in Userspace) support"
> > +     depends on EVENTFD && MMU && HAS_DMA
> > +     select DMA_OPS
> > +     select VHOST_IOTLB
> > +     select IOMMU_IOVA
> > +     help
> > +       With VDUSE it is possible to emulate a vDPA Device
> > +       in a userspace program.
> > +
> >   config IFCVF
> >       tristate "Intel IFC VF vDPA driver"
> >       depends on PCI_MSI
> > diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> > index 67fe7f3d6943..f02ebed33f19 100644
> > --- a/drivers/vdpa/Makefile
> > +++ b/drivers/vdpa/Makefile
> > @@ -1,6 +1,7 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_VDPA) += vdpa.o
> >   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> > +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >   obj-$(CONFIG_IFCVF)    += ifcvf/
> >   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> > diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> > new file mode 100644
> > index 000000000000..260e0b26af99
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +vduse-y := vduse_dev.o iova_domain.o
> > +
> > +obj-$(CONFIG_VDPA_USER) += vduse.o
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > new file mode 100644
> > index 000000000000..51ca73464d0d
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -0,0 +1,1362 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/cdev.h>
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/slab.h>
> > +#include <linux/wait.h>
> > +#include <linux/dma-map-ops.h>
> > +#include <linux/poll.h>
> > +#include <linux/file.h>
> > +#include <linux/uio.h>
> > +#include <linux/vdpa.h>
> > +#include <uapi/linux/vduse.h>
> > +#include <uapi/linux/vdpa.h>
> > +#include <uapi/linux/virtio_config.h>
> > +#include <linux/mod_devicetable.h>
> > +
> > +#include "iova_domain.h"
> > +
> > +#define DRV_VERSION  "1.0"
> > +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> > +#define DRV_DESC     "vDPA Device in Userspace"
> > +#define DRV_LICENSE  "GPL v2"
> > +
> > +#define VDUSE_DEV_MAX (1U << MINORBITS)
> > +
> > +struct vduse_virtqueue {
> > +     u16 index;
> > +     bool ready;
> > +     spinlock_t kick_lock;
> > +     spinlock_t irq_lock;
> > +     struct eventfd_ctx *kickfd;
> > +     struct vdpa_callback cb;
> > +     struct work_struct inject;
> > +};
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_vdpa {
> > +     struct vdpa_device vdpa;
> > +     struct vduse_dev *dev;
> > +};
> > +
> > +struct vduse_dev {
> > +     struct vduse_vdpa *vdev;
> > +     struct device dev;
> > +     struct cdev cdev;
> > +     struct vduse_virtqueue *vqs;
> > +     struct vduse_iova_domain *domain;
> > +     struct mutex lock;
> > +     spinlock_t msg_lock;
> > +     atomic64_t msg_unique;
> > +     wait_queue_head_t waitq;
> > +     struct list_head send_list;
> > +     struct list_head recv_list;
> > +     struct list_head list;
> > +     struct vdpa_callback config_cb;
> > +     spinlock_t irq_lock;
> > +     unsigned long api_version;
> > +     bool connected;
> > +     int minor;
> > +     u16 vq_size_max;
> > +     u16 vq_num;
> > +     u32 vq_align;
> > +     u32 device_id;
> > +     u32 vendor_id;
> > +};
> > +
> > +struct vduse_dev_msg {
> > +     struct vduse_dev_request req;
> > +     struct vduse_dev_response resp;
> > +     struct list_head list;
> > +     wait_queue_head_t waitq;
> > +     bool completed;
> > +};
> > +
> > +struct vduse_control {
> > +     unsigned long api_version;
> > +};
> > +
> > +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> > +module_param(max_bounce_size, ulong, 0444);
> > +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> > +
> > +static unsigned long max_iova_size = (128 * 1024 * 1024);
> > +module_param(max_iova_size, ulong, 0444);
> > +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> > +
> > +static DEFINE_MUTEX(vduse_lock);
> > +static LIST_HEAD(vduse_devs);
> > +static DEFINE_IDA(vduse_ida);
> > +
> > +static dev_t vduse_major;
> > +static struct class *vduse_class;
> > +static struct workqueue_struct *vduse_irq_wq;
> > +
> > +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> > +
> > +     return vdev->dev;
> > +}
> > +
> > +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> > +{
> > +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> > +
> > +     return vdpa_to_vduse(vdpa);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> > +                                         uint32_t request_id)
> > +{
> > +     struct vduse_dev_msg *tmp, *msg = NULL;
> > +
> > +     list_for_each_entry(tmp, head, list) {
> > +             if (tmp->req.request_id == request_id) {
> > +                     msg = tmp;
> > +                     list_del(&tmp->list);
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return msg;
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> > +{
> > +     struct vduse_dev_msg *msg = NULL;
> > +
> > +     if (!list_empty(head)) {
> > +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> > +             list_del(&msg->list);
> > +     }
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_enqueue_msg(struct list_head *head,
> > +                           struct vduse_dev_msg *msg)
> > +{
> > +     list_add_tail(&msg->list, head);
> > +}
> > +
> > +static int vduse_dev_msg_sync(struct vduse_dev *dev,
> > +                           struct vduse_dev_msg *msg)
> > +{
> > +     init_waitqueue_head(&msg->waitq);
> > +     spin_lock(&dev->msg_lock);
> > +     vduse_enqueue_msg(&dev->send_list, msg);
> > +     wake_up(&dev->waitq);
> > +     spin_unlock(&dev->msg_lock);
> > +     wait_event_interruptible(msg->waitq, msg->completed);
> > +     spin_lock(&dev->msg_lock);
> > +     if (!msg->completed)
> > +             list_del(&msg->list);
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return (msg->resp.result == VDUSE_REQUEST_OK) ? 0 : -1;
> > +}
> > +
> > +static u32 vduse_dev_get_request_id(struct vduse_dev *dev)
> > +{
> > +     return atomic64_fetch_inc(&dev->msg_unique);
> > +}
> > +
> > +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_GET_FEATURES;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +
> > +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.f.features;
> > +}
> > +
> > +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_FEATURES;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.f.features = features;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_GET_STATUS;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +
> > +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.s.status;
> > +}
> > +
> > +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_STATUS;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.s.status = status;
> > +
> > +     vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> > +                              void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     unsigned int sz;
> > +
> > +     while (len) {
> > +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> > +             msg.req.type = VDUSE_GET_CONFIG;
> > +             msg.req.request_id = vduse_dev_get_request_id(dev);
> > +             msg.req.config.offset = offset;
> > +             msg.req.config.len = sz;
> > +             vduse_dev_msg_sync(dev, &msg);
> > +             memcpy(buf, msg.resp.config.data, sz);
> > +             buf += sz;
> > +             offset += sz;
> > +             len -= sz;
> > +     }
> > +}
> > +
> > +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> > +                              const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     unsigned int sz;
> > +
> > +     while (len) {
> > +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> > +             msg.req.type = VDUSE_SET_CONFIG;
> > +             msg.req.request_id = vduse_dev_get_request_id(dev);
> > +             msg.req.config.offset = offset;
> > +             msg.req.config.len = sz;
> > +             memcpy(msg.req.config.data, buf, sz);
> > +             vduse_dev_msg_sync(dev, &msg);
> > +             buf += sz;
> > +             offset += sz;
> > +             len -= sz;
> > +     }
> > +}
> > +
> > +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> > +                              struct vduse_virtqueue *vq, u32 num)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_NUM;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.vq_num.index = vq->index;
> > +     msg.req.vq_num.num = num;
> > +
> > +     vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> > +                              struct vduse_virtqueue *vq, u64 desc_addr,
> > +                              u64 driver_addr, u64 device_addr)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_ADDR;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.vq_addr.index = vq->index;
> > +     msg.req.vq_addr.desc_addr = desc_addr;
> > +     msg.req.vq_addr.driver_addr = driver_addr;
> > +     msg.req.vq_addr.device_addr = device_addr;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq, bool ready)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_READY;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.vq_ready.index = vq->index;
> > +     msg.req.vq_ready.ready = ready;
> > +
> > +     vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> > +                                struct vduse_virtqueue *vq)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_GET_VQ_READY;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.vq_ready.index = vq->index;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> > +}
> > +
> > +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     int ret;
> > +
> > +     msg.req.type = VDUSE_GET_VQ_STATE;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.vq_state.index = vq->index;
> > +
> > +     ret = vduse_dev_msg_sync(dev, &msg);
> > +     if (!ret)
> > +             state->avail_index = msg.resp.vq_state.avail_idx;
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> > +                             struct vduse_virtqueue *vq,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     msg.req.type = VDUSE_SET_VQ_STATE;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.vq_state.index = vq->index;
> > +     msg.req.vq_state.avail_idx = state->avail_index;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > +                             u64 start, u64 last)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     if (last < start)
> > +             return -EINVAL;
> > +
> > +     msg.req.type = VDUSE_UPDATE_IOTLB;
> > +     msg.req.request_id = vduse_dev_get_request_id(dev);
> > +     msg.req.iova.start = start;
> > +     msg.req.iova.last = last;
> > +
> > +     return vduse_dev_msg_sync(dev, &msg);
> > +}
> > +
> > +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int size = sizeof(struct vduse_dev_request);
> > +     ssize_t ret = 0;
> > +
> > +     if (iov_iter_count(to) < size)
> > +             return 0;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     while (1) {
> > +             msg = vduse_dequeue_msg(&dev->send_list);
> > +             if (msg)
> > +                     break;
> > +
> > +             ret = -EAGAIN;
> > +             if (file->f_flags & O_NONBLOCK)
> > +                     goto unlock;
> > +
> > +             spin_unlock(&dev->msg_lock);
> > +             ret = wait_event_interruptible_exclusive(dev->waitq,
> > +                                     !list_empty(&dev->send_list));
> > +             if (ret)
> > +                     return ret;
> > +
> > +             spin_lock(&dev->msg_lock);
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +     ret = copy_to_iter(&msg->req, size, to);
> > +     spin_lock(&dev->msg_lock);
> > +     if (ret != size) {
> > +             ret = -EFAULT;
> > +             vduse_enqueue_msg(&dev->send_list, msg);
> > +             goto unlock;
> > +     }
> > +     vduse_enqueue_msg(&dev->recv_list, msg);
> > +     wake_up(&dev->waitq);
> > +unlock:
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_response resp;
> > +     struct vduse_dev_msg *msg;
> > +     size_t ret;
> > +
> > +     ret = copy_from_iter(&resp, sizeof(resp), from);
> > +     if (ret != sizeof(resp))
> > +             return -EINVAL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> > +     if (!msg) {
> > +             ret = -EINVAL;
> > +             goto unlock;
> > +     }
> > +
> > +     memcpy(&msg->resp, &resp, sizeof(resp));
> > +     msg->completed = 1;
> > +     wake_up(&msg->waitq);
> > +unlock:
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     __poll_t mask = 0;
> > +
> > +     poll_wait(file, &dev->waitq, wait);
> > +
> > +     if (!list_empty(&dev->send_list))
> > +             mask |= EPOLLIN | EPOLLRDNORM;
> > +     if (!list_empty(&dev->recv_list))
> > +             mask |= EPOLLOUT;
> > +
> > +     return mask;
> > +}
> > +
> > +static void vduse_dev_reset(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +
> > +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > +     vduse_domain_reset_bounce_map(dev->domain);
> > +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > +
> > +     spin_lock(&dev->irq_lock);
> > +     dev->config_cb.callback = NULL;
> > +     dev->config_cb.private = NULL;
> > +     spin_unlock(&dev->irq_lock);
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             vq->ready = false;
> > +             vq->cb.callback = NULL;
> > +             vq->cb.private = NULL;
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +}
> > +
> > +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > +                             u64 desc_area, u64 driver_area,
> > +                             u64 device_area)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
> > +                                     driver_area, device_area);
> > +}
> > +
> > +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->ready && vq->kickfd)
> > +             eventfd_signal(vq->kickfd, 1);
> > +     spin_unlock(&vq->kick_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> > +                           struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->irq_lock);
> > +     vq->cb.callback = cb->callback;
> > +     vq->cb.private = cb->private;
> > +     spin_unlock(&vq->irq_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_num(dev, vq, num);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> > +                                     u16 idx, bool ready)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vduse_dev_set_vq_ready(dev, vq, ready);
> > +     vq->ready = ready;
> > +}
> > +
> > +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
> > +
> > +     return vq->ready;
> > +}
> > +
> > +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_set_vq_state(dev, vq, state);
> > +}
> > +
> > +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_get_vq_state(dev, vq, state);
> > +}
> > +
> > +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_align;
> > +}
> > +
> > +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_get_features(dev);
> > +}
> > +
> > +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> > +             return -EINVAL;
> > +
> > +     return vduse_dev_set_features(dev, features);
> > +}
> > +
> > +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> > +                               struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     spin_lock(&dev->irq_lock);
> > +     dev->config_cb.callback = cb->callback;
> > +     dev->config_cb.private = cb->private;
> > +     spin_unlock(&dev->irq_lock);
> > +}
> > +
> > +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_size_max;
> > +}
> > +
> > +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->device_id;
> > +}
> > +
> > +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vendor_id;
> > +}
> > +
> > +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return vduse_dev_get_status(dev);
> > +}
> > +
> > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_set_status(dev, status);
> > +
> > +     if (status == 0)
> > +             vduse_dev_reset(dev);
> > +}
> > +
> > +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                          void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_get_config(dev, offset, buf, len);
> > +}
> > +
> > +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                     const void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     vduse_dev_set_config(dev, offset, buf, len);
> > +}
> > +
> > +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > +                             struct vhost_iotlb *iotlb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     int ret;
> > +
> > +     ret = vduse_domain_set_map(dev->domain, iotlb);
> > +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > +
> > +     return ret;
> > +}
> > +
> > +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     WARN_ON(!list_empty(&dev->send_list));
> > +     WARN_ON(!list_empty(&dev->recv_list));
> > +     dev->vdev = NULL;
> > +}
> > +
> > +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > +     .set_vq_address         = vduse_vdpa_set_vq_address,
> > +     .kick_vq                = vduse_vdpa_kick_vq,
> > +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> > +     .set_vq_num             = vduse_vdpa_set_vq_num,
> > +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> > +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> > +     .set_vq_state           = vduse_vdpa_set_vq_state,
> > +     .get_vq_state           = vduse_vdpa_get_vq_state,
> > +     .get_vq_align           = vduse_vdpa_get_vq_align,
> > +     .get_features           = vduse_vdpa_get_features,
> > +     .set_features           = vduse_vdpa_set_features,
> > +     .set_config_cb          = vduse_vdpa_set_config_cb,
> > +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> > +     .get_device_id          = vduse_vdpa_get_device_id,
> > +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> > +     .get_status             = vduse_vdpa_get_status,
> > +     .set_status             = vduse_vdpa_set_status,
> > +     .get_config             = vduse_vdpa_get_config,
> > +     .set_config             = vduse_vdpa_set_config,
> > +     .set_map                = vduse_vdpa_set_map,
> > +     .free                   = vduse_vdpa_free,
> > +};
> > +
> > +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > +                                  unsigned long offset, size_t size,
> > +                                  enum dma_data_direction dir,
> > +                                  unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > +}
> > +
> > +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > +}
> > +
> > +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > +                                     dma_addr_t *dma_addr, gfp_t flag,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova;
> > +     void *addr;
> > +
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     addr = vduse_domain_alloc_coherent(domain, size,
> > +                             (dma_addr_t *)&iova, flag, attrs);
> > +     if (!addr)
> > +             return NULL;
> > +
> > +     *dma_addr = (dma_addr_t)iova;
> > +     vduse_dev_update_iotlb(vdev, iova, iova + size - 1);
> > +
> > +     return addr;
> > +}
> > +
> > +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > +                                     void *vaddr, dma_addr_t dma_addr,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long start = (unsigned long)dma_addr;
> > +     unsigned long last = start + size - 1;
> > +
> > +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > +     vduse_dev_update_iotlb(vdev, start, last);
> > +}
> > +
> > +static const struct dma_map_ops vduse_dev_dma_ops = {
> > +     .map_page = vduse_dev_map_page,
> > +     .unmap_page = vduse_dev_unmap_page,
> > +     .alloc = vduse_dev_alloc_coherent,
> > +     .free = vduse_dev_free_coherent,
> > +};
> > +
> > +static unsigned int perm_to_file_flags(u8 perm)
> > +{
> > +     unsigned int flags = 0;
> > +
> > +     switch (perm) {
> > +     case VDUSE_ACCESS_WO:
> > +             flags |= O_WRONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RO:
> > +             flags |= O_RDONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RW:
> > +             flags |= O_RDWR;
> > +             break;
> > +     default:
> > +             WARN(1, "invalidate vhost IOTLB permission\n");
> > +             break;
> > +     }
> > +
> > +     return flags;
> > +}
> > +
> > +static int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct eventfd_ctx *ctx = NULL;
> > +     struct vduse_virtqueue *vq;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     vq = &dev->vqs[eventfd->index];
> > +     if (eventfd->fd > 0) {
> > +             ctx = eventfd_ctx_fdget(eventfd->fd);
> > +             if (IS_ERR(ctx))
> > +                     return PTR_ERR(ctx);
> > +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> > +             return 0;
>
>
> Do we allow the userspace to switch kickfd here? If yes, we need to deal
> with that.
>

Do you mean the case that eventfd->fd > 0? I think we have dealt with it.

>
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->kickfd)
> > +             eventfd_ctx_put(vq->kickfd);
> > +     vq->kickfd = ctx;
> > +     spin_unlock(&vq->kick_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_vq_irq_inject(struct work_struct *work)
> > +{
> > +     struct vduse_virtqueue *vq = container_of(work,
> > +                                     struct vduse_virtqueue, inject);
> > +
> > +     spin_lock_irq(&vq->irq_lock);
> > +     if (vq->ready && vq->cb.callback)
> > +             vq->cb.callback(vq->cb.private);
> > +     spin_unlock_irq(&vq->irq_lock);
> > +}
> > +
> > +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > +                         unsigned long arg)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     void __user *argp = (void __user *)arg;
> > +     int ret;
> > +
> > +     switch (cmd) {
> > +     case VDUSE_IOTLB_GET_FD: {
> > +             struct vduse_iotlb_entry entry;
> > +             struct vhost_iotlb_map *map;
> > +             struct vdpa_map_file *map_file;
> > +             struct vduse_iova_domain *domain = dev->domain;
> > +             struct file *f = NULL;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&entry, argp, sizeof(entry)))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (entry.start > entry.last)
> > +                     break;
> > +
> > +             spin_lock(&domain->iotlb_lock);
> > +             map = vhost_iotlb_itree_first(domain->iotlb,
> > +                                           entry.start, entry.last);
> > +             if (map) {
> > +                     map_file = (struct vdpa_map_file *)map->opaque;
> > +                     f = get_file(map_file->file);
> > +                     entry.offset = map_file->offset;
> > +                     entry.start = map->start;
> > +                     entry.last = map->last;
> > +                     entry.perm = map->perm;
> > +             }
> > +             spin_unlock(&domain->iotlb_lock);
> > +             ret = -EINVAL;
> > +             if (!f)
> > +                     break;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> > +                     fput(f);
> > +                     break;
> > +             }
> > +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
> > +             fput(f);
>
>
> Any reason to drop the refcnt here?
>

We will do get_file() in receive_fd() too.

>
> > +             break;
> > +     }
> > +     case VDUSE_VQ_SETUP_KICKFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_kickfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     case VDUSE_INJECT_VQ_IRQ:
> > +             ret = -EINVAL;
> > +             if (arg >= dev->vq_num)
> > +                     break;
> > +
> > +             ret = 0;
> > +             queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
> > +             break;
> > +     case VDUSE_INJECT_CONFIG_IRQ:
> > +             ret = -EINVAL;
> > +             spin_lock_irq(&dev->irq_lock);
> > +             if (dev->config_cb.callback) {
> > +                     dev->config_cb.callback(dev->config_cb.private);
> > +                     ret = 0;
> > +             }
> > +             spin_unlock_irq(&dev->irq_lock);
>
>
> For consistency, is it better to use vduse_irq_wq here?
>

Looks good to me.

>
> > +             break;
> > +     default:
> > +             ret = -ENOIOCTLCMD;
> > +             break;
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int i;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             spin_lock(&vq->kick_lock);
> > +             if (vq->kickfd)
> > +                     eventfd_ctx_put(vq->kickfd);
> > +             vq->kickfd = NULL;
> > +             spin_unlock(&vq->kick_lock);
> > +     }
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     /*  Make sure the inflight messages can processed */
>
>
> This might be better:
>
> /*  Make sure the inflight messages can processed after reconncection */
>

OK.

> > +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> > +             vduse_enqueue_msg(&dev->send_list, msg);
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     dev->connected = false;
> > +
> > +     return 0;
> > +}
> > +
> > +static int vduse_dev_open(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = container_of(inode->i_cdev,
> > +                                     struct vduse_dev, cdev);
> > +     int ret = -EBUSY;
> > +
> > +     mutex_lock(&dev->lock);
> > +     if (dev->connected)
> > +             goto unlock;
> > +
> > +     ret = 0;
> > +     dev->connected = true;
> > +     file->private_data = dev;
> > +unlock:
> > +     mutex_unlock(&dev->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct file_operations vduse_dev_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = vduse_dev_open,
> > +     .release        = vduse_dev_release,
> > +     .read_iter      = vduse_dev_read_iter,
> > +     .write_iter     = vduse_dev_write_iter,
> > +     .poll           = vduse_dev_poll,
> > +     .unlocked_ioctl = vduse_dev_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static struct vduse_dev *vduse_dev_create(void)
> > +{
> > +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +
> > +     if (!dev)
> > +             return NULL;
> > +
> > +     mutex_init(&dev->lock);
> > +     spin_lock_init(&dev->msg_lock);
> > +     INIT_LIST_HEAD(&dev->send_list);
> > +     INIT_LIST_HEAD(&dev->recv_list);
> > +     atomic64_set(&dev->msg_unique, 0);
> > +     spin_lock_init(&dev->irq_lock);
> > +
> > +     init_waitqueue_head(&dev->waitq);
> > +
> > +     return dev;
> > +}
> > +
> > +static void vduse_dev_destroy(struct vduse_dev *dev)
> > +{
> > +     kfree(dev);
> > +}
> > +
> > +static struct vduse_dev *vduse_find_dev(const char *name)
> > +{
> > +     struct vduse_dev *tmp, *dev = NULL;
> > +
> > +     list_for_each_entry(tmp, &vduse_devs, list) {
> > +             if (!strcmp(dev_name(&tmp->dev), name)) {
> > +                     dev = tmp;
> > +                     break;
> > +             }
> > +     }
> > +     return dev;
> > +}
> > +
> > +static int vduse_destroy_dev(char *name)
> > +{
> > +     struct vduse_dev *dev = vduse_find_dev(name);
> > +
> > +     if (!dev)
> > +             return -EINVAL;
> > +
> > +     mutex_lock(&dev->lock);
> > +     if (dev->vdev || dev->connected) {
> > +             mutex_unlock(&dev->lock);
> > +             return -EBUSY;
> > +     }
> > +     dev->connected = true;
> > +     mutex_unlock(&dev->lock);
> > +
> > +     list_del(&dev->list);
> > +     cdev_device_del(&dev->cdev, &dev->dev);
> > +     put_device(&dev->dev);
> > +     module_put(THIS_MODULE);
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_release_dev(struct device *device)
> > +{
> > +     struct vduse_dev *dev =
> > +             container_of(device, struct vduse_dev, dev);
> > +
> > +     ida_simple_remove(&vduse_ida, dev->minor);
> > +     kfree(dev->vqs);
> > +     vduse_domain_destroy(dev->domain);
> > +     vduse_dev_destroy(dev);
> > +}
> > +
> > +static int vduse_create_dev(struct vduse_dev_config *config,
> > +                         unsigned long api_version)
> > +{
> > +     int i, ret = -ENOMEM;
> > +     struct vduse_dev *dev;
> > +
> > +     if (config->bounce_size > max_bounce_size)
> > +             return -EINVAL;
> > +
> > +     if (config->bounce_size > max_iova_size)
> > +             return -EINVAL;
> > +
> > +     if (vduse_find_dev(config->name))
> > +             return -EEXIST;
> > +
> > +     dev = vduse_dev_create();
> > +     if (!dev)
> > +             return -ENOMEM;
> > +
> > +     dev->api_version = api_version;
> > +     dev->device_id = config->device_id;
> > +     dev->vendor_id = config->vendor_id;
> > +     dev->domain = vduse_domain_create(max_iova_size - 1,
> > +                                     config->bounce_size);
> > +     if (!dev->domain)
> > +             goto err_domain;
> > +
> > +     dev->vq_align = config->vq_align;
> > +     dev->vq_size_max = config->vq_size_max;
> > +     dev->vq_num = config->vq_num;
> > +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> > +     if (!dev->vqs)
> > +             goto err_vqs;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             dev->vqs[i].index = i;
> > +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> > +             spin_lock_init(&dev->vqs[i].kick_lock);
> > +             spin_lock_init(&dev->vqs[i].irq_lock);
> > +     }
> > +
> > +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> > +     if (ret < 0)
> > +             goto err_ida;
> > +
> > +     dev->minor = ret;
> > +     device_initialize(&dev->dev);
> > +     dev->dev.release = vduse_release_dev;
> > +     dev->dev.class = vduse_class;
> > +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> > +     ret = dev_set_name(&dev->dev, "%s", config->name);
> > +     if (ret) {
> > +             put_device(&dev->dev);
> > +             return ret;
> > +     }
> > +     cdev_init(&dev->cdev, &vduse_dev_fops);
> > +     dev->cdev.owner = THIS_MODULE;
> > +
> > +     ret = cdev_device_add(&dev->cdev, &dev->dev);
> > +     if (ret) {
> > +             put_device(&dev->dev);
> > +             return ret;
>
>
> Let's introduce an error label for this.
>

But put_device() would be duplicated with the below error handling.

>
> > +     }
> > +     list_add(&dev->list, &vduse_devs);
> > +     __module_get(THIS_MODULE);
> > +
> > +     return 0;
> > +err_ida:
> > +     kfree(dev->vqs);
> > +err_vqs:
> > +     vduse_domain_destroy(dev->domain);
> > +err_domain:
> > +     vduse_dev_destroy(dev);
> > +     return ret;
> > +}
> > +
> > +static long vduse_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     int ret;
> > +     void __user *argp = (void __user *)arg;
> > +     struct vduse_control *control = file->private_data;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     switch (cmd) {
> > +     case VDUSE_GET_API_VERSION:
> > +             ret = control->api_version;
> > +             break;
> > +     case VDUSE_SET_API_VERSION:
> > +             ret = -EINVAL;
> > +             if (arg > VDUSE_API_VERSION)
> > +                     break;
> > +
> > +             ret = 0;
> > +             control->api_version = arg;
> > +             break;
> > +     case VDUSE_CREATE_DEV: {
> > +             struct vduse_dev_config config;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, sizeof(config)))
> > +                     break;
> > +
> > +             ret = vduse_create_dev(&config, control->api_version);
> > +             break;
> > +     }
> > +     case VDUSE_DESTROY_DEV: {
> > +             char name[VDUSE_NAME_MAX];
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> > +                     break;
> > +
> > +             ret = vduse_destroy_dev(name);
> > +             break;
> > +     }
> > +     default:
> > +             ret = -EINVAL;
> > +             break;
> > +     }
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_control *control = file->private_data;
> > +
> > +     kfree(control);
> > +     return 0;
> > +}
> > +
> > +static int vduse_open(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_control *control;
> > +
> > +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> > +     if (!control)
> > +             return -ENOMEM;
> > +
> > +     control->api_version = VDUSE_API_VERSION;
> > +     file->private_data = control;
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations vduse_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = vduse_open,
> > +     .release        = vduse_release,
> > +     .unlocked_ioctl = vduse_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static char *vduse_devnode(struct device *dev, umode_t *mode)
> > +{
> > +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> > +}
> > +
> > +static struct miscdevice vduse_misc = {
> > +     .fops = &vduse_fops,
> > +     .minor = MISC_DYNAMIC_MINOR,
> > +     .name = "vduse",
> > +     .nodename = "vduse/control",
> > +};
> > +
> > +static void vduse_mgmtdev_release(struct device *dev)
> > +{
> > +}
> > +
> > +static struct device vduse_mgmtdev = {
> > +     .init_name = "vduse",
> > +     .release = vduse_mgmtdev_release,
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev;
> > +
> > +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> > +{
> > +     struct vduse_vdpa *vdev;
> > +     int ret;
> > +
> > +     if (dev->vdev)
> > +             return -EEXIST;
> > +
> > +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, &dev->dev,
> > +                              &vduse_vdpa_config_ops, name, true);
> > +     if (!vdev)
> > +             return -ENOMEM;
> > +
> > +     dev->vdev = vdev;
> > +     vdev->dev = dev;
> > +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> > +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> > +     if (ret) {
> > +             put_device(&vdev->vdpa.dev);
> > +             return ret;
> > +     }
> > +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> > +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> > +     vdev->vdpa.mdev = &mgmt_dev;
> > +
> > +     return 0;
> > +}
> > +
> > +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> > +{
> > +     struct vduse_dev *dev;
> > +     int ret = -EINVAL;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     dev = vduse_find_dev(name);
> > +     if (!dev) {
> > +             mutex_unlock(&vduse_lock);
> > +             return -EINVAL;
> > +     }
> > +     ret = vduse_dev_init_vdpa(dev, name);
> > +     mutex_unlock(&vduse_lock);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> > +     if (ret) {
> > +             put_device(&dev->vdev->vdpa.dev);
> > +             return ret;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > +{
> > +     _vdpa_unregister_device(dev);
> > +}
> > +
> > +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> > +     .dev_add = vdpa_dev_add,
> > +     .dev_del = vdpa_dev_del,
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> > +     { 0 },
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev = {
> > +     .device = &vduse_mgmtdev,
> > +     .id_table = id_table,
> > +     .ops = &vdpa_dev_mgmtdev_ops,
> > +};
> > +
> > +static int vduse_mgmtdev_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = device_register(&vduse_mgmtdev);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = vdpa_mgmtdev_register(&mgmt_dev);
> > +     if (ret)
> > +             goto err;
> > +
> > +     return 0;
> > +err:
> > +     device_unregister(&vduse_mgmtdev);
> > +     return ret;
> > +}
> > +
> > +static void vduse_mgmtdev_exit(void)
> > +{
> > +     vdpa_mgmtdev_unregister(&mgmt_dev);
> > +     device_unregister(&vduse_mgmtdev);
> > +}
> > +
> > +static int vduse_init(void)
> > +{
> > +     int ret;
> > +
> > +     if (max_bounce_size >= max_iova_size)
> > +             return -EINVAL;
> > +
> > +     ret = misc_register(&vduse_misc);
> > +     if (ret)
> > +             return ret;
> > +
> > +     vduse_class = class_create(THIS_MODULE, "vduse");
> > +     if (IS_ERR(vduse_class)) {
> > +             ret = PTR_ERR(vduse_class);
> > +             goto err_class;
> > +     }
> > +     vduse_class->devnode = vduse_devnode;
> > +
> > +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> > +     if (ret)
> > +             goto err_chardev;
> > +
> > +     vduse_irq_wq = alloc_workqueue("vduse-irq",
> > +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> > +     if (!vduse_irq_wq)
> > +             goto err_wq;
> > +
> > +     ret = vduse_domain_init();
> > +     if (ret)
> > +             goto err_domain;
> > +
> > +     ret = vduse_mgmtdev_init();
> > +     if (ret)
> > +             goto err_mgmtdev;
> > +
> > +     return 0;
> > +err_mgmtdev:
> > +     vduse_domain_exit();
> > +err_domain:
> > +     destroy_workqueue(vduse_irq_wq);
> > +err_wq:
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +err_chardev:
> > +     class_destroy(vduse_class);
> > +err_class:
> > +     misc_deregister(&vduse_misc);
> > +     return ret;
> > +}
> > +module_init(vduse_init);
> > +
> > +static void vduse_exit(void)
> > +{
> > +     misc_deregister(&vduse_misc);
> > +     class_destroy(vduse_class);
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +     destroy_workqueue(vduse_irq_wq);
> > +     vduse_domain_exit();
> > +     vduse_mgmtdev_exit();
> > +}
> > +module_exit(vduse_exit);
> > +
> > +MODULE_VERSION(DRV_VERSION);
> > +MODULE_LICENSE(DRV_LICENSE);
> > +MODULE_AUTHOR(DRV_AUTHOR);
> > +MODULE_DESCRIPTION(DRV_DESC);
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > new file mode 100644
> > index 000000000000..66a6e5212226
> > --- /dev/null
> > +++ b/include/uapi/linux/vduse.h
> > @@ -0,0 +1,175 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_VDUSE_H_
> > +#define _UAPI_VDUSE_H_
> > +
> > +#include <linux/types.h>
> > +
> > +#define VDUSE_API_VERSION    0
> > +
> > +#define VDUSE_CONFIG_DATA_LEN        256
> > +#define VDUSE_NAME_MAX       256
> > +
> > +/* the control messages definition for read/write */
> > +
> > +enum vduse_req_type {
> > +     /* Set the vring address of virtqueue. */
> > +     VDUSE_SET_VQ_NUM,
> > +     /* Set the vring address of virtqueue. */
> > +     VDUSE_SET_VQ_ADDR,
> > +     /* Set ready status of virtqueue */
> > +     VDUSE_SET_VQ_READY,
> > +     /* Get ready status of virtqueue */
> > +     VDUSE_GET_VQ_READY,
> > +     /* Set the state for virtqueue */
> > +     VDUSE_SET_VQ_STATE,
> > +     /* Get the state for virtqueue */
> > +     VDUSE_GET_VQ_STATE,
> > +     /* Set virtio features supported by the driver */
> > +     VDUSE_SET_FEATURES,
> > +     /* Get virtio features supported by the device */
> > +     VDUSE_GET_FEATURES,
> > +     /* Set the device status */
> > +     VDUSE_SET_STATUS,
> > +     /* Get the device status */
> > +     VDUSE_GET_STATUS,
> > +     /* Write to device specific configuration space */
> > +     VDUSE_SET_CONFIG,
> > +     /* Read from device specific configuration space */
> > +     VDUSE_GET_CONFIG,
> > +     /* Notify userspace to update the memory mapping in device IOTLB */
> > +     VDUSE_UPDATE_IOTLB,
> > +};
> > +
> > +struct vduse_vq_num {
> > +     __u32 index; /* virtqueue index */
>
>
> I think it's better to have a consistent style of the doc/comment. If
> yes, let's move those comment above the field.
>

Fine.

>
> > +     __u32 num; /* the size of virtqueue */
> > +};
> > +
> > +struct vduse_vq_addr {
> > +     __u32 index; /* virtqueue index */
> > +     __u64 desc_addr; /* address of desc area */
> > +     __u64 driver_addr; /* address of driver area */
> > +     __u64 device_addr; /* address of device area */
> > +};
> > +
> > +struct vduse_vq_ready {
> > +     __u32 index; /* virtqueue index */
> > +     __u8 ready; /* ready status of virtqueue */
> > +};
> > +
> > +struct vduse_vq_state {
> > +     __u32 index; /* virtqueue index */
> > +     __u16 avail_idx; /* virtqueue state (last_avail_idx) */
>
>
> Let's use __u64 here to be consistent with get_vq_state(). The idea is
> to support packed virtqueue.
>

OK. But looks like sizeof(struct vdpa_vq_state) is still equal to 2.
Do you mean we will extend it in the future?

>
> > +};
> > +
> > +struct vduse_dev_config_data {
> > +     __u32 offset; /* offset from the beginning of config space */
> > +     __u32 len; /* the length to read/write */
> > +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
>
>
> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
> not change it in the future.
>
> So this might suffcient for future features or all type of virtio devices.
>

Do you mean 256 is no enough here?

>
> > +};
> > +
> > +struct vduse_iova_range {
> > +     __u64 start; /* start of the IOVA range */
> > +     __u64 last; /* end of the IOVA range */
> > +};
> > +
> > +struct vduse_features {
> > +     __u64 features; /* virtio features */
> > +};
> > +
> > +struct vduse_status {
> > +     __u8 status; /* device status */
> > +};
> > +
> > +struct vduse_dev_request {
> > +     __u32 type; /* request type */
> > +     __u32 request_id; /* request id */
> > +     __u32 reserved[2]; /* for future use */
> > +     union {
> > +             struct vduse_vq_num vq_num; /* virtqueue num */
> > +             struct vduse_vq_addr vq_addr; /* virtqueue address */
> > +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> > +             struct vduse_vq_state vq_state; /* virtqueue state */
> > +             struct vduse_dev_config_data config; /* virtio device config space */
> > +             struct vduse_iova_range iova; /* iova range for updating */
> > +             struct vduse_features f; /* virtio features */
> > +             struct vduse_status s; /* device status */
> > +             __u32 padding[16]; /* padding */
> > +     };
> > +};
> > +
> > +struct vduse_dev_response {
> > +     __u32 request_id; /* corresponding request id */
> > +#define VDUSE_REQUEST_OK     0x00
> > +#define VDUSE_REQUEST_FAILED 0x01
> > +     __u32 result; /* the result of request */
> > +     __u32 reserved[2]; /* for future use */
> > +     union {
> > +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> > +             struct vduse_vq_state vq_state; /* virtqueue state */
> > +             struct vduse_dev_config_data config; /* virtio device config space */
> > +             struct vduse_features f; /* virtio features */
> > +             struct vduse_status s; /* device status */
> > +             __u32 padding[16]; /* padding */
>
>
> So it looks to me this padding doesn't work since vduse_dev_config_data
> is larger than it.
>

Oh, my bad. Will fix it.

>
> > +     };
> > +};
> > +
> > +/* ioctls */
> > +
> > +struct vduse_dev_config {
> > +     char name[VDUSE_NAME_MAX]; /* vduse device name */
> > +     __u32 vendor_id; /* virtio vendor id */
> > +     __u32 device_id; /* virtio device id */
> > +     __u64 bounce_size; /* bounce buffer size for iommu */
> > +     __u16 vq_num; /* the number of virtqueues */
> > +     __u16 vq_size_max; /* the max size of virtqueue */
> > +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> > +     __u32 reserved[8]; /* for future use */
>
>
> Is there a hole before reserved?
>

But I don't find the hole in below layout:

| 256 | 4 | 4 | 8 | 2 | 2 | 4 | 32 |

>
> > +};
> > +
> > +struct vduse_iotlb_entry {
> > +     __u64 offset; /* the mmap offset on fd */
> > +     __u64 start; /* start of the IOVA range */
> > +     __u64 last; /* last of the IOVA range */
> > +#define VDUSE_ACCESS_RO 0x1
> > +#define VDUSE_ACCESS_WO 0x2
> > +#define VDUSE_ACCESS_RW 0x3
> > +     __u8 perm; /* access permission of this range */
> > +};
> > +
> > +struct vduse_vq_eventfd {
> > +     __u32 index; /* virtqueue index */
> > +#define VDUSE_EVENTFD_DEASSIGN -1
> > +     int fd; /* eventfd, -1 means de-assigning the eventfd */
> > +};
> > +
> > +#define VDUSE_BASE   0x81
> > +
> > +/* Get the version of VDUSE API. This is used for future extension */
> > +#define VDUSE_GET_API_VERSION        _IO(VDUSE_BASE, 0x00)
> > +
> > +/* Set the version of VDUSE API. */
> > +#define VDUSE_SET_API_VERSION        _IO(VDUSE_BASE, 0x01)
> > +
> > +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> > +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> > +
> > +/* Destroy a vduse device. Make sure there are no references to the char device */
> > +#define VDUSE_DESTROY_DEV    _IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> > +
> > +/*
> > + * Get a file descriptor for the first overlapped iova region,
> > + * -EINVAL means the iova region doesn't exist.
> > + */
> > +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> > +
> > +/* Setup an eventfd to receive kick for virtqueue */
> > +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
> > +
> > +/* Inject an interrupt for specific virtqueue */
> > +#define VDUSE_INJECT_VQ_IRQ  _IO(VDUSE_BASE, 0x06)
>
>
> Missing parameter?
>

We use the argp to store the virtqueue index here. Is it OK?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-08  9:36     ` Yongji Xie
@ 2021-04-09  5:36       ` Jason Wang
  2021-04-09  8:02         ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-09  5:36 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/8 下午5:36, Yongji Xie 写道:
> On Thu, Apr 8, 2021 at 2:57 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/3/31 下午4:05, Xie Yongji 写道:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of message
>>> mechnism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
>>> the file descriptors referring to vDPA device's iova regions. Then
>>> userspace can use mmap() to access those iova regions. Besides,
>>> userspace can use ioctl() to inject interrupt and use the eventfd
>>> mechanism to receive virtqueue kicks.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  175 +++
>>>    6 files changed, 1554 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index a245809c99d0..77a1da522c21 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>>>        help
>>>          vDPA networking device simulator which loops TX traffic back to RX.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index 67fe7f3d6943..f02ebed33f19 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,6 +1,7 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..51ca73464d0d
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1362 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/miscdevice.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_VERSION  "1.0"
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     bool ready;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +     struct work_struct inject;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device dev;
>>> +     struct cdev cdev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     struct mutex lock;
>>> +     spinlock_t msg_lock;
>>> +     atomic64_t msg_unique;
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct list_head list;
>>> +     struct vdpa_callback config_cb;
>>> +     spinlock_t irq_lock;
>>> +     unsigned long api_version;
>>> +     bool connected;
>>> +     int minor;
>>> +     u16 vq_size_max;
>>> +     u16 vq_num;
>>> +     u32 vq_align;
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +struct vduse_control {
>>> +     unsigned long api_version;
>>> +};
>>> +
>>> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
>>> +module_param(max_bounce_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
>>> +
>>> +static unsigned long max_iova_size = (128 * 1024 * 1024);
>>> +module_param(max_iova_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static LIST_HEAD(vduse_devs);
>>> +static DEFINE_IDA(vduse_ida);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +static struct workqueue_struct *vduse_irq_wq;
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t request_id)
>>> +{
>>> +     struct vduse_dev_msg *tmp, *msg = NULL;
>>> +
>>> +     list_for_each_entry(tmp, head, list) {
>>> +             if (tmp->req.request_id == request_id) {
>>> +                     msg = tmp;
>>> +                     list_del(&tmp->list);
>>> +                     break;
>>> +             }
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_sync(struct vduse_dev *dev,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     spin_lock(&dev->msg_lock);
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     wait_event_interruptible(msg->waitq, msg->completed);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (!msg->completed)
>>> +             list_del(&msg->list);
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return (msg->resp.result == VDUSE_REQUEST_OK) ? 0 : -1;
>>> +}
>>> +
>>> +static u32 vduse_dev_get_request_id(struct vduse_dev *dev)
>>> +{
>>> +     return atomic64_fetch_inc(&dev->msg_unique);
>>> +}
>>> +
>>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_FEATURES;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.f.features;
>>> +}
>>> +
>>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_FEATURES;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.f.features = features;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_STATUS;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.s.status;
>>> +}
>>> +
>>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_STATUS;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.s.status = status;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
>>> +                              void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_GET_CONFIG;
>>> +             msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             memcpy(buf, msg.resp.config.data, sz);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
>>> +                              const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_SET_CONFIG;
>>> +             msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             memcpy(msg.req.config.data, buf, sz);
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
>>> +                              struct vduse_virtqueue *vq, u32 num)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_NUM;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.vq_num.index = vq->index;
>>> +     msg.req.vq_num.num = num;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
>>> +                              struct vduse_virtqueue *vq, u64 desc_addr,
>>> +                              u64 driver_addr, u64 device_addr)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_ADDR;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.vq_addr.index = vq->index;
>>> +     msg.req.vq_addr.desc_addr = desc_addr;
>>> +     msg.req.vq_addr.driver_addr = driver_addr;
>>> +     msg.req.vq_addr.device_addr = device_addr;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, bool ready)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_READY;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +     msg.req.vq_ready.ready = ready;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
>>> +                                struct vduse_virtqueue *vq)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_READY;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, &msg);
>>> +     if (!ret)
>>> +             state->avail_index = msg.resp.vq_state.avail_idx;
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_STATE;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.vq_state.index = vq->index;
>>> +     msg.req.vq_state.avail_idx = state->avail_index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
>>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
>>> +     msg.req.iova.start = start;
>>> +     msg.req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret = 0;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return 0;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             spin_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             spin_lock(&dev->msg_lock);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -EINVAL;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +     if (!list_empty(&dev->recv_list))
>>> +             mask |= EPOLLOUT;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
>>> +     vduse_domain_reset_bounce_map(dev->domain);
>>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = NULL;
>>> +     dev->config_cb.private = NULL;
>>> +     spin_unlock(&dev->irq_lock);
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->ready = false;
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
>>> +                                     driver_area, device_area);
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->ready && vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_num(dev, vq, num);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_ready(dev, vq, ready);
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_get_features(dev);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
>>> +             return -EINVAL;
>>> +
>>> +     return vduse_dev_set_features(dev, features);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = cb->callback;
>>> +     dev->config_cb.private = cb->private;
>>> +     spin_unlock(&dev->irq_lock);
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_get_status(dev);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_set_status(dev, status);
>>> +
>>> +     if (status == 0)
>>> +             vduse_dev_reset(dev);
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                          void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_get_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_set_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     int ret;
>>> +
>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     WARN_ON(!list_empty(&dev->send_list));
>>> +     WARN_ON(!list_empty(&dev->recv_list));
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                  unsigned long offset, size_t size,
>>> +                                  enum dma_data_direction dir,
>>> +                                  unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +     vduse_dev_update_iotlb(vdev, iova, iova + size - 1);
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long start = (unsigned long)dma_addr;
>>> +     unsigned long last = start + size - 1;
>>> +
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +     vduse_dev_update_iotlb(vdev, start, last);
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     if (eventfd->fd > 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>> +             return 0;
>>
>> Do we allow the userspace to switch kickfd here? If yes, we need to deal
>> with that.
>>
> Do you mean the case that eventfd->fd > 0? I think we have dealt with it.


Ok, but the above code still looks suspicious. E.g what happens if 
eventfd->fd is 0?


>
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virtqueue *vq = container_of(work,
>>> +                                     struct vduse_virtqueue, inject);
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb.callback)
>>> +             vq->cb.callback(vq->cb.private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                         unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct vduse_iova_domain *domain = dev->domain;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (entry.start > entry.last)
>>> +                     break;
>>> +
>>> +             spin_lock(&domain->iotlb_lock);
>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>> +                                           entry.start, entry.last);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             ret = -EINVAL;
>>> +             if (!f)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>> +             fput(f);
>>
>> Any reason to drop the refcnt here?
>>
> We will do get_file() in receive_fd() too.


I see.


>
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_INJECT_VQ_IRQ:
>>> +             ret = -EINVAL;
>>> +             if (arg >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
>>> +             break;
>>> +     case VDUSE_INJECT_CONFIG_IRQ:
>>> +             ret = -EINVAL;
>>> +             spin_lock_irq(&dev->irq_lock);
>>> +             if (dev->config_cb.callback) {
>>> +                     dev->config_cb.callback(dev->config_cb.private);
>>> +                     ret = 0;
>>> +             }
>>> +             spin_unlock_irq(&dev->irq_lock);
>>
>> For consistency, is it better to use vduse_irq_wq here?
>>
> Looks good to me.
>
>>> +             break;
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +     }
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     /*  Make sure the inflight messages can processed */
>>
>> This might be better:
>>
>> /*  Make sure the inflight messages can processed after reconncection */
>>
> OK.
>
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = container_of(inode->i_cdev,
>>> +                                     struct vduse_dev, cdev);
>>> +     int ret = -EBUSY;
>>> +
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     mutex_init(&dev->lock);
>>> +     spin_lock_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     atomic64_set(&dev->msg_unique, 0);
>>> +     spin_lock_init(&dev->irq_lock);
>>> +
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *tmp, *dev = NULL;
>>> +
>>> +     list_for_each_entry(tmp, &vduse_devs, list) {
>>> +             if (!strcmp(dev_name(&tmp->dev), name)) {
>>> +                     dev = tmp;
>>> +                     break;
>>> +             }
>>> +     }
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->vdev || dev->connected) {
>>> +             mutex_unlock(&dev->lock);
>>> +             return -EBUSY;
>>> +     }
>>> +     dev->connected = true;
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     list_del(&dev->list);
>>> +     cdev_device_del(&dev->cdev, &dev->dev);
>>> +     put_device(&dev->dev);
>>> +     module_put(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_release_dev(struct device *device)
>>> +{
>>> +     struct vduse_dev *dev =
>>> +             container_of(device, struct vduse_dev, dev);
>>> +
>>> +     ida_simple_remove(&vduse_ida, dev->minor);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     vduse_dev_destroy(dev);
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config,
>>> +                         unsigned long api_version)
>>> +{
>>> +     int i, ret = -ENOMEM;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     if (config->bounce_size > max_bounce_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (config->bounce_size > max_iova_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (vduse_find_dev(config->name))
>>> +             return -EEXIST;
>>> +
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->api_version = api_version;
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->domain = vduse_domain_create(max_iova_size - 1,
>>> +                                     config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_ida;
>>> +
>>> +     dev->minor = ret;
>>> +     device_initialize(&dev->dev);
>>> +     dev->dev.release = vduse_release_dev;
>>> +     dev->dev.class = vduse_class;
>>> +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
>>> +     ret = dev_set_name(&dev->dev, "%s", config->name);
>>> +     if (ret) {
>>> +             put_device(&dev->dev);
>>> +             return ret;
>>> +     }
>>> +     cdev_init(&dev->cdev, &vduse_dev_fops);
>>> +     dev->cdev.owner = THIS_MODULE;
>>> +
>>> +     ret = cdev_device_add(&dev->cdev, &dev->dev);
>>> +     if (ret) {
>>> +             put_device(&dev->dev);
>>> +             return ret;
>>
>> Let's introduce an error label for this.
>>
> But put_device() would be duplicated with the below error handling.


I think it's as simple as put the error label after the return.


>
>>> +     }
>>> +     list_add(&dev->list, &vduse_devs);
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_ida:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     vduse_dev_destroy(dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_GET_API_VERSION:
>>> +             ret = control->api_version;
>>> +             break;
>>> +     case VDUSE_SET_API_VERSION:
>>> +             ret = -EINVAL;
>>> +             if (arg > VDUSE_API_VERSION)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             control->api_version = arg;
>>> +             break;
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, sizeof(config)))
>>> +                     break;
>>> +
>>> +             ret = vduse_create_dev(&config, control->api_version);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     kfree(control);
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control;
>>> +
>>> +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
>>> +     if (!control)
>>> +             return -ENOMEM;
>>> +
>>> +     control->api_version = VDUSE_API_VERSION;
>>> +     file->private_data = control;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_open,
>>> +     .release        = vduse_release,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static struct miscdevice vduse_misc = {
>>> +     .fops = &vduse_fops,
>>> +     .minor = MISC_DYNAMIC_MINOR,
>>> +     .name = "vduse",
>>> +     .nodename = "vduse/control",
>>> +};
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev;
>>> +     int ret;
>>> +
>>> +     if (dev->vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, &dev->dev,
>>> +                              &vduse_vdpa_config_ops, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->vdev = vdev;
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret) {
>>> +             put_device(&vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret = -EINVAL;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev) {
>>> +             mutex_unlock(&vduse_lock);
>>> +             return -EINVAL;
>>> +     }
>>> +     ret = vduse_dev_init_vdpa(dev, name);
>>> +     mutex_unlock(&vduse_lock);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
>>> +     if (ret) {
>>> +             put_device(&dev->vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     if (max_bounce_size >= max_iova_size)
>>> +             return -EINVAL;
>>> +
>>> +     ret = misc_register(&vduse_misc);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class)) {
>>> +             ret = PTR_ERR(vduse_class);
>>> +             goto err_class;
>>> +     }
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev;
>>> +
>>> +     vduse_irq_wq = alloc_workqueue("vduse-irq",
>>> +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
>>> +     if (!vduse_irq_wq)
>>> +             goto err_wq;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +err_wq:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev:
>>> +     class_destroy(vduse_class);
>>> +err_class:
>>> +     misc_deregister(&vduse_misc);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     misc_deregister(&vduse_misc);
>>> +     class_destroy(vduse_class);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +     vduse_domain_exit();
>>> +     vduse_mgmtdev_exit();
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_VERSION(DRV_VERSION);
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..66a6e5212226
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,175 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_API_VERSION    0
>>> +
>>> +#define VDUSE_CONFIG_DATA_LEN        256
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     /* Set the vring address of virtqueue. */
>>> +     VDUSE_SET_VQ_NUM,
>>> +     /* Set the vring address of virtqueue. */
>>> +     VDUSE_SET_VQ_ADDR,
>>> +     /* Set ready status of virtqueue */
>>> +     VDUSE_SET_VQ_READY,
>>> +     /* Get ready status of virtqueue */
>>> +     VDUSE_GET_VQ_READY,
>>> +     /* Set the state for virtqueue */
>>> +     VDUSE_SET_VQ_STATE,
>>> +     /* Get the state for virtqueue */
>>> +     VDUSE_GET_VQ_STATE,
>>> +     /* Set virtio features supported by the driver */
>>> +     VDUSE_SET_FEATURES,
>>> +     /* Get virtio features supported by the device */
>>> +     VDUSE_GET_FEATURES,
>>> +     /* Set the device status */
>>> +     VDUSE_SET_STATUS,
>>> +     /* Get the device status */
>>> +     VDUSE_GET_STATUS,
>>> +     /* Write to device specific configuration space */
>>> +     VDUSE_SET_CONFIG,
>>> +     /* Read from device specific configuration space */
>>> +     VDUSE_GET_CONFIG,
>>> +     /* Notify userspace to update the memory mapping in device IOTLB */
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_num {
>>> +     __u32 index; /* virtqueue index */
>>
>> I think it's better to have a consistent style of the doc/comment. If
>> yes, let's move those comment above the field.
>>
> Fine.
>
>>> +     __u32 num; /* the size of virtqueue */
>>> +};
>>> +
>>> +struct vduse_vq_addr {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u64 desc_addr; /* address of desc area */
>>> +     __u64 driver_addr; /* address of driver area */
>>> +     __u64 device_addr; /* address of device area */
>>> +};
>>> +
>>> +struct vduse_vq_ready {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u8 ready; /* ready status of virtqueue */
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u16 avail_idx; /* virtqueue state (last_avail_idx) */
>>
>> Let's use __u64 here to be consistent with get_vq_state().


__u32 actually:

struct vhost_vring_state {
         unsigned int index;
         unsigned int num;
};


>>   The idea is
>> to support packed virtqueue.
>>
> OK. But looks like sizeof(struct vdpa_vq_state) is still equal to 2.


It's a bug that needs to be fixed.


> Do you mean we will extend it in the future?


u32 should be fine, and it's not for future. For packed virtqueue we 
need something like:

le32 {
     last_avai_idx : 15;
     last_avail_wrap_counter : 1;
     used_idx : 15;
     used_wrap_counter : 1;
}

As has been done in 
https://lore.kernel.org/lkml/20190717105255.63488-3-jasowang@redhat.com/T/


>
>>> +};
>>> +
>>> +struct vduse_dev_config_data {
>>> +     __u32 offset; /* offset from the beginning of config space */
>>> +     __u32 len; /* the length to read/write */
>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
>>
>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
>> not change it in the future.
>>
>> So this might suffcient for future features or all type of virtio devices.
>>
> Do you mean 256 is no enough here?


Yes.


>
>>> +};
>>> +
>>> +struct vduse_iova_range {
>>> +     __u64 start; /* start of the IOVA range */
>>> +     __u64 last; /* end of the IOVA range */
>>> +};
>>> +
>>> +struct vduse_features {
>>> +     __u64 features; /* virtio features */
>>> +};
>>> +
>>> +struct vduse_status {
>>> +     __u8 status; /* device status */
>>> +};
>>> +
>>> +struct vduse_dev_request {
>>> +     __u32 type; /* request type */
>>> +     __u32 request_id; /* request id */
>>> +     __u32 reserved[2]; /* for future use */
>>> +     union {
>>> +             struct vduse_vq_num vq_num; /* virtqueue num */
>>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             struct vduse_iova_range iova; /* iova range for updating */
>>> +             struct vduse_features f; /* virtio features */
>>> +             struct vduse_status s; /* device status */
>>> +             __u32 padding[16]; /* padding */
>>> +     };
>>> +};
>>> +
>>> +struct vduse_dev_response {
>>> +     __u32 request_id; /* corresponding request id */
>>> +#define VDUSE_REQUEST_OK     0x00
>>> +#define VDUSE_REQUEST_FAILED 0x01
>>> +     __u32 result; /* the result of request */
>>> +     __u32 reserved[2]; /* for future use */
>>> +     union {
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             struct vduse_features f; /* virtio features */
>>> +             struct vduse_status s; /* device status */
>>> +             __u32 padding[16]; /* padding */
>>
>> So it looks to me this padding doesn't work since vduse_dev_config_data
>> is larger than it.
>>
> Oh, my bad. Will fix it.
>
>>> +     };
>>> +};
>>> +
>>> +/* ioctls */
>>> +
>>> +struct vduse_dev_config {
>>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
>>> +     __u32 vendor_id; /* virtio vendor id */
>>> +     __u32 device_id; /* virtio device id */
>>> +     __u64 bounce_size; /* bounce buffer size for iommu */
>>> +     __u16 vq_num; /* the number of virtqueues */
>>> +     __u16 vq_size_max; /* the max size of virtqueue */
>>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
>>> +     __u32 reserved[8]; /* for future use */
>>
>> Is there a hole before reserved?
>>
> But I don't find the hole in below layout:
>
> | 256 | 4 | 4 | 8 | 2 | 2 | 4 | 32 |
>

Looks correct, better to check with pahole to double confirm.


>>> +};
>>> +
>>> +struct vduse_iotlb_entry {
>>> +     __u64 offset; /* the mmap offset on fd */
>>> +     __u64 start; /* start of the IOVA range */
>>> +     __u64 last; /* last of the IOVA range */
>>> +#define VDUSE_ACCESS_RO 0x1
>>> +#define VDUSE_ACCESS_WO 0x2
>>> +#define VDUSE_ACCESS_RW 0x3
>>> +     __u8 perm; /* access permission of this range */
>>> +};
>>> +
>>> +struct vduse_vq_eventfd {
>>> +     __u32 index; /* virtqueue index */
>>> +#define VDUSE_EVENTFD_DEASSIGN -1
>>> +     int fd; /* eventfd, -1 means de-assigning the eventfd */
>>> +};
>>> +
>>> +#define VDUSE_BASE   0x81
>>> +
>>> +/* Get the version of VDUSE API. This is used for future extension */
>>> +#define VDUSE_GET_API_VERSION        _IO(VDUSE_BASE, 0x00)
>>> +
>>> +/* Set the version of VDUSE API. */
>>> +#define VDUSE_SET_API_VERSION        _IO(VDUSE_BASE, 0x01)
>>> +
>>> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
>>> +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
>>> +
>>> +/* Destroy a vduse device. Make sure there are no references to the char device */
>>> +#define VDUSE_DESTROY_DEV    _IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
>>> +
>>> +/*
>>> + * Get a file descriptor for the first overlapped iova region,
>>> + * -EINVAL means the iova region doesn't exist.
>>> + */
>>> +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
>>> +
>>> +/* Setup an eventfd to receive kick for virtqueue */
>>> +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
>>> +
>>> +/* Inject an interrupt for specific virtqueue */
>>> +#define VDUSE_INJECT_VQ_IRQ  _IO(VDUSE_BASE, 0x06)
>>
>> Missing parameter?
>>
> We use the argp to store the virtqueue index here. Is it OK?


So I meant it should be something like:

#define VDUSE_INJECT_VQ_IRQ  _IOW(VDUSE_BASE, 0x06, unsigned int)

?

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-09  5:36       ` Jason Wang
@ 2021-04-09  8:02         ` Yongji Xie
  2021-04-12  7:16           ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-09  8:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Fri, Apr 9, 2021 at 1:36 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/8 下午5:36, Yongji Xie 写道:
> > On Thu, Apr 8, 2021 at 2:57 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/3/31 下午4:05, Xie Yongji 写道:
> >>> This VDUSE driver enables implementing vDPA devices in userspace.
> >>> Both control path and data path of vDPA devices will be able to
> >>> be handled in userspace.
> >>>
> >>> In the control path, the VDUSE driver will make use of message
> >>> mechnism to forward the config operation from vdpa bus driver
> >>> to userspace. Userspace can use read()/write() to receive/reply
> >>> those control messages.
> >>>
> >>> In the data path, VDUSE_IOTLB_GET_FD ioctl will be used to get
> >>> the file descriptors referring to vDPA device's iova regions. Then
> >>> userspace can use mmap() to access those iova regions. Besides,
> >>> userspace can use ioctl() to inject interrupt and use the eventfd
> >>> mechanism to receive virtqueue kicks.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >>>    drivers/vdpa/Kconfig                               |   10 +
> >>>    drivers/vdpa/Makefile                              |    1 +
> >>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
> >>>    include/uapi/linux/vduse.h                         |  175 +++
> >>>    6 files changed, 1554 insertions(+)
> >>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >>>    create mode 100644 include/uapi/linux/vduse.h
> >>>
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index a4c75a28c839..71722e6f8f23 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >>>    '|'   00-7F  linux/media.h
> >>>    0x80  00-1F  linux/fb.h
> >>> +0x81  00-1F  linux/vduse.h
> >>>    0x89  00-06  arch/x86/include/asm/sockios.h
> >>>    0x89  0B-DF  linux/sockios.h
> >>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> >>> index a245809c99d0..77a1da522c21 100644
> >>> --- a/drivers/vdpa/Kconfig
> >>> +++ b/drivers/vdpa/Kconfig
> >>> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
> >>>        help
> >>>          vDPA networking device simulator which loops TX traffic back to RX.
> >>>
> >>> +config VDPA_USER
> >>> +     tristate "VDUSE (vDPA Device in Userspace) support"
> >>> +     depends on EVENTFD && MMU && HAS_DMA
> >>> +     select DMA_OPS
> >>> +     select VHOST_IOTLB
> >>> +     select IOMMU_IOVA
> >>> +     help
> >>> +       With VDUSE it is possible to emulate a vDPA Device
> >>> +       in a userspace program.
> >>> +
> >>>    config IFCVF
> >>>        tristate "Intel IFC VF vDPA driver"
> >>>        depends on PCI_MSI
> >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> >>> index 67fe7f3d6943..f02ebed33f19 100644
> >>> --- a/drivers/vdpa/Makefile
> >>> +++ b/drivers/vdpa/Makefile
> >>> @@ -1,6 +1,7 @@
> >>>    # SPDX-License-Identifier: GPL-2.0
> >>>    obj-$(CONFIG_VDPA) += vdpa.o
> >>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >>>    obj-$(CONFIG_IFCVF)    += ifcvf/
> >>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> >>> new file mode 100644
> >>> index 000000000000..260e0b26af99
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/Makefile
> >>> @@ -0,0 +1,5 @@
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +vduse-y := vduse_dev.o iova_domain.o
> >>> +
> >>> +obj-$(CONFIG_VDPA_USER) += vduse.o
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> new file mode 100644
> >>> index 000000000000..51ca73464d0d
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> @@ -0,0 +1,1362 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * VDUSE: vDPA Device in Userspace
> >>> + *
> >>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/init.h>
> >>> +#include <linux/module.h>
> >>> +#include <linux/miscdevice.h>
> >>> +#include <linux/cdev.h>
> >>> +#include <linux/device.h>
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/dma-map-ops.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/file.h>
> >>> +#include <linux/uio.h>
> >>> +#include <linux/vdpa.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +#include <uapi/linux/vdpa.h>
> >>> +#include <uapi/linux/virtio_config.h>
> >>> +#include <linux/mod_devicetable.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define DRV_VERSION  "1.0"
> >>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> >>> +#define DRV_DESC     "vDPA Device in Userspace"
> >>> +#define DRV_LICENSE  "GPL v2"
> >>> +
> >>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> >>> +
> >>> +struct vduse_virtqueue {
> >>> +     u16 index;
> >>> +     bool ready;
> >>> +     spinlock_t kick_lock;
> >>> +     spinlock_t irq_lock;
> >>> +     struct eventfd_ctx *kickfd;
> >>> +     struct vdpa_callback cb;
> >>> +     struct work_struct inject;
> >>> +};
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_vdpa {
> >>> +     struct vdpa_device vdpa;
> >>> +     struct vduse_dev *dev;
> >>> +};
> >>> +
> >>> +struct vduse_dev {
> >>> +     struct vduse_vdpa *vdev;
> >>> +     struct device dev;
> >>> +     struct cdev cdev;
> >>> +     struct vduse_virtqueue *vqs;
> >>> +     struct vduse_iova_domain *domain;
> >>> +     struct mutex lock;
> >>> +     spinlock_t msg_lock;
> >>> +     atomic64_t msg_unique;
> >>> +     wait_queue_head_t waitq;
> >>> +     struct list_head send_list;
> >>> +     struct list_head recv_list;
> >>> +     struct list_head list;
> >>> +     struct vdpa_callback config_cb;
> >>> +     spinlock_t irq_lock;
> >>> +     unsigned long api_version;
> >>> +     bool connected;
> >>> +     int minor;
> >>> +     u16 vq_size_max;
> >>> +     u16 vq_num;
> >>> +     u32 vq_align;
> >>> +     u32 device_id;
> >>> +     u32 vendor_id;
> >>> +};
> >>> +
> >>> +struct vduse_dev_msg {
> >>> +     struct vduse_dev_request req;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct list_head list;
> >>> +     wait_queue_head_t waitq;
> >>> +     bool completed;
> >>> +};
> >>> +
> >>> +struct vduse_control {
> >>> +     unsigned long api_version;
> >>> +};
> >>> +
> >>> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> >>> +module_param(max_bounce_size, ulong, 0444);
> >>> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> >>> +
> >>> +static unsigned long max_iova_size = (128 * 1024 * 1024);
> >>> +module_param(max_iova_size, ulong, 0444);
> >>> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> >>> +
> >>> +static DEFINE_MUTEX(vduse_lock);
> >>> +static LIST_HEAD(vduse_devs);
> >>> +static DEFINE_IDA(vduse_ida);
> >>> +
> >>> +static dev_t vduse_major;
> >>> +static struct class *vduse_class;
> >>> +static struct workqueue_struct *vduse_irq_wq;
> >>> +
> >>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> >>> +
> >>> +     return vdev->dev;
> >>> +}
> >>> +
> >>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> >>> +{
> >>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> >>> +
> >>> +     return vdpa_to_vduse(vdpa);
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> >>> +                                         uint32_t request_id)
> >>> +{
> >>> +     struct vduse_dev_msg *tmp, *msg = NULL;
> >>> +
> >>> +     list_for_each_entry(tmp, head, list) {
> >>> +             if (tmp->req.request_id == request_id) {
> >>> +                     msg = tmp;
> >>> +                     list_del(&tmp->list);
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = NULL;
> >>> +
> >>> +     if (!list_empty(head)) {
> >>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> >>> +             list_del(&msg->list);
> >>> +     }
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static void vduse_enqueue_msg(struct list_head *head,
> >>> +                           struct vduse_dev_msg *msg)
> >>> +{
> >>> +     list_add_tail(&msg->list, head);
> >>> +}
> >>> +
> >>> +static int vduse_dev_msg_sync(struct vduse_dev *dev,
> >>> +                           struct vduse_dev_msg *msg)
> >>> +{
> >>> +     init_waitqueue_head(&msg->waitq);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     vduse_enqueue_msg(&dev->send_list, msg);
> >>> +     wake_up(&dev->waitq);
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +     wait_event_interruptible(msg->waitq, msg->completed);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     if (!msg->completed)
> >>> +             list_del(&msg->list);
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return (msg->resp.result == VDUSE_REQUEST_OK) ? 0 : -1;
> >>> +}
> >>> +
> >>> +static u32 vduse_dev_get_request_id(struct vduse_dev *dev)
> >>> +{
> >>> +     return atomic64_fetch_inc(&dev->msg_unique);
> >>> +}
> >>> +
> >>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_GET_FEATURES;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.f.features;
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_FEATURES;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.f.features = features;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_GET_STATUS;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.s.status;
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_STATUS;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.s.status = status;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> >>> +                              void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     unsigned int sz;
> >>> +
> >>> +     while (len) {
> >>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> >>> +             msg.req.type = VDUSE_GET_CONFIG;
> >>> +             msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +             msg.req.config.offset = offset;
> >>> +             msg.req.config.len = sz;
> >>> +             vduse_dev_msg_sync(dev, &msg);
> >>> +             memcpy(buf, msg.resp.config.data, sz);
> >>> +             buf += sz;
> >>> +             offset += sz;
> >>> +             len -= sz;
> >>> +     }
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> >>> +                              const void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     unsigned int sz;
> >>> +
> >>> +     while (len) {
> >>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> >>> +             msg.req.type = VDUSE_SET_CONFIG;
> >>> +             msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +             msg.req.config.offset = offset;
> >>> +             msg.req.config.len = sz;
> >>> +             memcpy(msg.req.config.data, buf, sz);
> >>> +             vduse_dev_msg_sync(dev, &msg);
> >>> +             buf += sz;
> >>> +             offset += sz;
> >>> +             len -= sz;
> >>> +     }
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> >>> +                              struct vduse_virtqueue *vq, u32 num)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_NUM;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.vq_num.index = vq->index;
> >>> +     msg.req.vq_num.num = num;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> >>> +                              struct vduse_virtqueue *vq, u64 desc_addr,
> >>> +                              u64 driver_addr, u64 device_addr)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_ADDR;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.vq_addr.index = vq->index;
> >>> +     msg.req.vq_addr.desc_addr = desc_addr;
> >>> +     msg.req.vq_addr.driver_addr = driver_addr;
> >>> +     msg.req.vq_addr.device_addr = device_addr;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq, bool ready)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_READY;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.vq_ready.index = vq->index;
> >>> +     msg.req.vq_ready.ready = ready;
> >>> +
> >>> +     vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> >>> +                                struct vduse_virtqueue *vq)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_GET_VQ_READY;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.vq_ready.index = vq->index;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> >>> +}
> >>> +
> >>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     int ret;
> >>> +
> >>> +     msg.req.type = VDUSE_GET_VQ_STATE;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.vq_state.index = vq->index;
> >>> +
> >>> +     ret = vduse_dev_msg_sync(dev, &msg);
> >>> +     if (!ret)
> >>> +             state->avail_index = msg.resp.vq_state.avail_idx;
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> >>> +                             struct vduse_virtqueue *vq,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     msg.req.type = VDUSE_SET_VQ_STATE;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.vq_state.index = vq->index;
> >>> +     msg.req.vq_state.avail_idx = state->avail_index;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> >>> +                             u64 start, u64 last)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     if (last < start)
> >>> +             return -EINVAL;
> >>> +
> >>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
> >>> +     msg.req.request_id = vduse_dev_get_request_id(dev);
> >>> +     msg.req.iova.start = start;
> >>> +     msg.req.iova.last = last;
> >>> +
> >>> +     return vduse_dev_msg_sync(dev, &msg);
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int size = sizeof(struct vduse_dev_request);
> >>> +     ssize_t ret = 0;
> >>> +
> >>> +     if (iov_iter_count(to) < size)
> >>> +             return 0;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     while (1) {
> >>> +             msg = vduse_dequeue_msg(&dev->send_list);
> >>> +             if (msg)
> >>> +                     break;
> >>> +
> >>> +             ret = -EAGAIN;
> >>> +             if (file->f_flags & O_NONBLOCK)
> >>> +                     goto unlock;
> >>> +
> >>> +             spin_unlock(&dev->msg_lock);
> >>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
> >>> +                                     !list_empty(&dev->send_list));
> >>> +             if (ret)
> >>> +                     return ret;
> >>> +
> >>> +             spin_lock(&dev->msg_lock);
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +     ret = copy_to_iter(&msg->req, size, to);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     if (ret != size) {
> >>> +             ret = -EFAULT;
> >>> +             vduse_enqueue_msg(&dev->send_list, msg);
> >>> +             goto unlock;
> >>> +     }
> >>> +     vduse_enqueue_msg(&dev->recv_list, msg);
> >>> +     wake_up(&dev->waitq);
> >>> +unlock:
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     size_t ret;
> >>> +
> >>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
> >>> +     if (ret != sizeof(resp))
> >>> +             return -EINVAL;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> >>> +     if (!msg) {
> >>> +             ret = -EINVAL;
> >>> +             goto unlock;
> >>> +     }
> >>> +
> >>> +     memcpy(&msg->resp, &resp, sizeof(resp));
> >>> +     msg->completed = 1;
> >>> +     wake_up(&msg->waitq);
> >>> +unlock:
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     __poll_t mask = 0;
> >>> +
> >>> +     poll_wait(file, &dev->waitq, wait);
> >>> +
> >>> +     if (!list_empty(&dev->send_list))
> >>> +             mask |= EPOLLIN | EPOLLRDNORM;
> >>> +     if (!list_empty(&dev->recv_list))
> >>> +             mask |= EPOLLOUT;
> >>> +
> >>> +     return mask;
> >>> +}
> >>> +
> >>> +static void vduse_dev_reset(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
> >>> +     vduse_domain_reset_bounce_map(dev->domain);
> >>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> >>> +
> >>> +     spin_lock(&dev->irq_lock);
> >>> +     dev->config_cb.callback = NULL;
> >>> +     dev->config_cb.private = NULL;
> >>> +     spin_unlock(&dev->irq_lock);
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             vq->ready = false;
> >>> +             vq->cb.callback = NULL;
> >>> +             vq->cb.private = NULL;
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >>> +                             u64 desc_area, u64 driver_area,
> >>> +                             u64 device_area)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
> >>> +                                     driver_area, device_area);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->ready && vq->kickfd)
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> >>> +                           struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->irq_lock);
> >>> +     vq->cb.callback = cb->callback;
> >>> +     vq->cb.private = cb->private;
> >>> +     spin_unlock(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vduse_dev_set_vq_num(dev, vq, num);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> >>> +                                     u16 idx, bool ready)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vduse_dev_set_vq_ready(dev, vq, ready);
> >>> +     vq->ready = ready;
> >>> +}
> >>> +
> >>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
> >>> +
> >>> +     return vq->ready;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_set_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_get_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_align;
> >>> +}
> >>> +
> >>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return vduse_dev_get_features(dev);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> >>> +             return -EINVAL;
> >>> +
> >>> +     return vduse_dev_set_features(dev, features);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> >>> +                               struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     spin_lock(&dev->irq_lock);
> >>> +     dev->config_cb.callback = cb->callback;
> >>> +     dev->config_cb.private = cb->private;
> >>> +     spin_unlock(&dev->irq_lock);
> >>> +}
> >>> +
> >>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_size_max;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->device_id;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vendor_id;
> >>> +}
> >>> +
> >>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return vduse_dev_get_status(dev);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     vduse_dev_set_status(dev, status);
> >>> +
> >>> +     if (status == 0)
> >>> +             vduse_dev_reset(dev);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                          void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     vduse_dev_get_config(dev, offset, buf, len);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                     const void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     vduse_dev_set_config(dev, offset, buf, len);
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> >>> +                             struct vhost_iotlb *iotlb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     int ret;
> >>> +
> >>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
> >>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     WARN_ON(!list_empty(&dev->send_list));
> >>> +     WARN_ON(!list_empty(&dev->recv_list));
> >>> +     dev->vdev = NULL;
> >>> +}
> >>> +
> >>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
> >>> +     .kick_vq                = vduse_vdpa_kick_vq,
> >>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> >>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
> >>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> >>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> >>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
> >>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
> >>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
> >>> +     .get_features           = vduse_vdpa_get_features,
> >>> +     .set_features           = vduse_vdpa_set_features,
> >>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
> >>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> >>> +     .get_device_id          = vduse_vdpa_get_device_id,
> >>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> >>> +     .get_status             = vduse_vdpa_get_status,
> >>> +     .set_status             = vduse_vdpa_set_status,
> >>> +     .get_config             = vduse_vdpa_get_config,
> >>> +     .set_config             = vduse_vdpa_set_config,
> >>> +     .set_map                = vduse_vdpa_set_map,
> >>> +     .free                   = vduse_vdpa_free,
> >>> +};
> >>> +
> >>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> >>> +                                  unsigned long offset, size_t size,
> >>> +                                  enum dma_data_direction dir,
> >>> +                                  unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> >>> +                                     dma_addr_t *dma_addr, gfp_t flag,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long iova;
> >>> +     void *addr;
> >>> +
> >>> +     *dma_addr = DMA_MAPPING_ERROR;
> >>> +     addr = vduse_domain_alloc_coherent(domain, size,
> >>> +                             (dma_addr_t *)&iova, flag, attrs);
> >>> +     if (!addr)
> >>> +             return NULL;
> >>> +
> >>> +     *dma_addr = (dma_addr_t)iova;
> >>> +     vduse_dev_update_iotlb(vdev, iova, iova + size - 1);
> >>> +
> >>> +     return addr;
> >>> +}
> >>> +
> >>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> >>> +                                     void *vaddr, dma_addr_t dma_addr,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long start = (unsigned long)dma_addr;
> >>> +     unsigned long last = start + size - 1;
> >>> +
> >>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> >>> +     vduse_dev_update_iotlb(vdev, start, last);
> >>> +}
> >>> +
> >>> +static const struct dma_map_ops vduse_dev_dma_ops = {
> >>> +     .map_page = vduse_dev_map_page,
> >>> +     .unmap_page = vduse_dev_unmap_page,
> >>> +     .alloc = vduse_dev_alloc_coherent,
> >>> +     .free = vduse_dev_free_coherent,
> >>> +};
> >>> +
> >>> +static unsigned int perm_to_file_flags(u8 perm)
> >>> +{
> >>> +     unsigned int flags = 0;
> >>> +
> >>> +     switch (perm) {
> >>> +     case VDUSE_ACCESS_WO:
> >>> +             flags |= O_WRONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RO:
> >>> +             flags |= O_RDONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RW:
> >>> +             flags |= O_RDWR;
> >>> +             break;
> >>> +     default:
> >>> +             WARN(1, "invalidate vhost IOTLB permission\n");
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return flags;
> >>> +}
> >>> +
> >>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct eventfd_ctx *ctx = NULL;
> >>> +     struct vduse_virtqueue *vq;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     if (eventfd->fd > 0) {
> >>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
> >>> +             if (IS_ERR(ctx))
> >>> +                     return PTR_ERR(ctx);
> >>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> >>> +             return 0;
> >>
> >> Do we allow the userspace to switch kickfd here? If yes, we need to deal
> >> with that.
> >>
> > Do you mean the case that eventfd->fd > 0? I think we have dealt with it.
>
>
> Ok, but the above code still looks suspicious. E.g what happens if
> eventfd->fd is 0?
>

Oh, yes. I think we need eventfd->fd >= 0 here.

>
> >
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->kickfd)
> >>> +             eventfd_ctx_put(vq->kickfd);
> >>> +     vq->kickfd = ctx;
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_vq_irq_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_virtqueue *vq = container_of(work,
> >>> +                                     struct vduse_virtqueue, inject);
> >>> +
> >>> +     spin_lock_irq(&vq->irq_lock);
> >>> +     if (vq->ready && vq->cb.callback)
> >>> +             vq->cb.callback(vq->cb.private);
> >>> +     spin_unlock_irq(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >>> +                         unsigned long arg)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +     int ret;
> >>> +
> >>> +     switch (cmd) {
> >>> +     case VDUSE_IOTLB_GET_FD: {
> >>> +             struct vduse_iotlb_entry entry;
> >>> +             struct vhost_iotlb_map *map;
> >>> +             struct vdpa_map_file *map_file;
> >>> +             struct vduse_iova_domain *domain = dev->domain;
> >>> +             struct file *f = NULL;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
> >>> +                     break;
> >>> +
> >>> +             ret = -EINVAL;
> >>> +             if (entry.start > entry.last)
> >>> +                     break;
> >>> +
> >>> +             spin_lock(&domain->iotlb_lock);
> >>> +             map = vhost_iotlb_itree_first(domain->iotlb,
> >>> +                                           entry.start, entry.last);
> >>> +             if (map) {
> >>> +                     map_file = (struct vdpa_map_file *)map->opaque;
> >>> +                     f = get_file(map_file->file);
> >>> +                     entry.offset = map_file->offset;
> >>> +                     entry.start = map->start;
> >>> +                     entry.last = map->last;
> >>> +                     entry.perm = map->perm;
> >>> +             }
> >>> +             spin_unlock(&domain->iotlb_lock);
> >>> +             ret = -EINVAL;
> >>> +             if (!f)
> >>> +                     break;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> >>> +                     fput(f);
> >>> +                     break;
> >>> +             }
> >>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
> >>> +             fput(f);
> >>
> >> Any reason to drop the refcnt here?
> >>
> > We will do get_file() in receive_fd() too.
>
>
> I see.
>
>
> >
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_VQ_SETUP_KICKFD: {
> >>> +             struct vduse_vq_eventfd eventfd;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_kickfd_setup(dev, &eventfd);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_INJECT_VQ_IRQ:
> >>> +             ret = -EINVAL;
> >>> +             if (arg >= dev->vq_num)
> >>> +                     break;
> >>> +
> >>> +             ret = 0;
> >>> +             queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
> >>> +             break;
> >>> +     case VDUSE_INJECT_CONFIG_IRQ:
> >>> +             ret = -EINVAL;
> >>> +             spin_lock_irq(&dev->irq_lock);
> >>> +             if (dev->config_cb.callback) {
> >>> +                     dev->config_cb.callback(dev->config_cb.private);
> >>> +                     ret = 0;
> >>> +             }
> >>> +             spin_unlock_irq(&dev->irq_lock);
> >>
> >> For consistency, is it better to use vduse_irq_wq here?
> >>
> > Looks good to me.
> >
> >>> +             break;
> >>> +     default:
> >>> +             ret = -ENOIOCTLCMD;
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_dev_release(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->kick_lock);
> >>> +             if (vq->kickfd)
> >>> +                     eventfd_ctx_put(vq->kickfd);
> >>> +             vq->kickfd = NULL;
> >>> +             spin_unlock(&vq->kick_lock);
> >>> +     }
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     /*  Make sure the inflight messages can processed */
> >>
> >> This might be better:
> >>
> >> /*  Make sure the inflight messages can processed after reconncection */
> >>
> > OK.
> >
> >>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> >>> +             vduse_enqueue_msg(&dev->send_list, msg);
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     dev->connected = false;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_dev_open(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_dev *dev = container_of(inode->i_cdev,
> >>> +                                     struct vduse_dev, cdev);
> >>> +     int ret = -EBUSY;
> >>> +
> >>> +     mutex_lock(&dev->lock);
> >>> +     if (dev->connected)
> >>> +             goto unlock;
> >>> +
> >>> +     ret = 0;
> >>> +     dev->connected = true;
> >>> +     file->private_data = dev;
> >>> +unlock:
> >>> +     mutex_unlock(&dev->lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_dev_fops = {
> >>> +     .owner          = THIS_MODULE,
> >>> +     .open           = vduse_dev_open,
> >>> +     .release        = vduse_dev_release,
> >>> +     .read_iter      = vduse_dev_read_iter,
> >>> +     .write_iter     = vduse_dev_write_iter,
> >>> +     .poll           = vduse_dev_poll,
> >>> +     .unlocked_ioctl = vduse_dev_ioctl,
> >>> +     .compat_ioctl   = compat_ptr_ioctl,
> >>> +     .llseek         = noop_llseek,
> >>> +};
> >>> +
> >>> +static struct vduse_dev *vduse_dev_create(void)
> >>> +{
> >>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> >>> +
> >>> +     if (!dev)
> >>> +             return NULL;
> >>> +
> >>> +     mutex_init(&dev->lock);
> >>> +     spin_lock_init(&dev->msg_lock);
> >>> +     INIT_LIST_HEAD(&dev->send_list);
> >>> +     INIT_LIST_HEAD(&dev->recv_list);
> >>> +     atomic64_set(&dev->msg_unique, 0);
> >>> +     spin_lock_init(&dev->irq_lock);
> >>> +
> >>> +     init_waitqueue_head(&dev->waitq);
> >>> +
> >>> +     return dev;
> >>> +}
> >>> +
> >>> +static void vduse_dev_destroy(struct vduse_dev *dev)
> >>> +{
> >>> +     kfree(dev);
> >>> +}
> >>> +
> >>> +static struct vduse_dev *vduse_find_dev(const char *name)
> >>> +{
> >>> +     struct vduse_dev *tmp, *dev = NULL;
> >>> +
> >>> +     list_for_each_entry(tmp, &vduse_devs, list) {
> >>> +             if (!strcmp(dev_name(&tmp->dev), name)) {
> >>> +                     dev = tmp;
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +     return dev;
> >>> +}
> >>> +
> >>> +static int vduse_destroy_dev(char *name)
> >>> +{
> >>> +     struct vduse_dev *dev = vduse_find_dev(name);
> >>> +
> >>> +     if (!dev)
> >>> +             return -EINVAL;
> >>> +
> >>> +     mutex_lock(&dev->lock);
> >>> +     if (dev->vdev || dev->connected) {
> >>> +             mutex_unlock(&dev->lock);
> >>> +             return -EBUSY;
> >>> +     }
> >>> +     dev->connected = true;
> >>> +     mutex_unlock(&dev->lock);
> >>> +
> >>> +     list_del(&dev->list);
> >>> +     cdev_device_del(&dev->cdev, &dev->dev);
> >>> +     put_device(&dev->dev);
> >>> +     module_put(THIS_MODULE);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_release_dev(struct device *device)
> >>> +{
> >>> +     struct vduse_dev *dev =
> >>> +             container_of(device, struct vduse_dev, dev);
> >>> +
> >>> +     ida_simple_remove(&vduse_ida, dev->minor);
> >>> +     kfree(dev->vqs);
> >>> +     vduse_domain_destroy(dev->domain);
> >>> +     vduse_dev_destroy(dev);
> >>> +}
> >>> +
> >>> +static int vduse_create_dev(struct vduse_dev_config *config,
> >>> +                         unsigned long api_version)
> >>> +{
> >>> +     int i, ret = -ENOMEM;
> >>> +     struct vduse_dev *dev;
> >>> +
> >>> +     if (config->bounce_size > max_bounce_size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     if (config->bounce_size > max_iova_size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     if (vduse_find_dev(config->name))
> >>> +             return -EEXIST;
> >>> +
> >>> +     dev = vduse_dev_create();
> >>> +     if (!dev)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     dev->api_version = api_version;
> >>> +     dev->device_id = config->device_id;
> >>> +     dev->vendor_id = config->vendor_id;
> >>> +     dev->domain = vduse_domain_create(max_iova_size - 1,
> >>> +                                     config->bounce_size);
> >>> +     if (!dev->domain)
> >>> +             goto err_domain;
> >>> +
> >>> +     dev->vq_align = config->vq_align;
> >>> +     dev->vq_size_max = config->vq_size_max;
> >>> +     dev->vq_num = config->vq_num;
> >>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> >>> +     if (!dev->vqs)
> >>> +             goto err_vqs;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             dev->vqs[i].index = i;
> >>> +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> >>> +             spin_lock_init(&dev->vqs[i].kick_lock);
> >>> +             spin_lock_init(&dev->vqs[i].irq_lock);
> >>> +     }
> >>> +
> >>> +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> >>> +     if (ret < 0)
> >>> +             goto err_ida;
> >>> +
> >>> +     dev->minor = ret;
> >>> +     device_initialize(&dev->dev);
> >>> +     dev->dev.release = vduse_release_dev;
> >>> +     dev->dev.class = vduse_class;
> >>> +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> >>> +     ret = dev_set_name(&dev->dev, "%s", config->name);
> >>> +     if (ret) {
> >>> +             put_device(&dev->dev);
> >>> +             return ret;
> >>> +     }
> >>> +     cdev_init(&dev->cdev, &vduse_dev_fops);
> >>> +     dev->cdev.owner = THIS_MODULE;
> >>> +
> >>> +     ret = cdev_device_add(&dev->cdev, &dev->dev);
> >>> +     if (ret) {
> >>> +             put_device(&dev->dev);
> >>> +             return ret;
> >>
> >> Let's introduce an error label for this.
> >>
> > But put_device() would be duplicated with the below error handling.
>
>
> I think it's as simple as put the error label after the return.
>

OK.

>
> >
> >>> +     }
> >>> +     list_add(&dev->list, &vduse_devs);
> >>> +     __module_get(THIS_MODULE);
> >>> +
> >>> +     return 0;
> >>> +err_ida:
> >>> +     kfree(dev->vqs);
> >>> +err_vqs:
> >>> +     vduse_domain_destroy(dev->domain);
> >>> +err_domain:
> >>> +     vduse_dev_destroy(dev);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> >>> +                     unsigned long arg)
> >>> +{
> >>> +     int ret;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +     struct vduse_control *control = file->private_data;
> >>> +
> >>> +     mutex_lock(&vduse_lock);
> >>> +     switch (cmd) {
> >>> +     case VDUSE_GET_API_VERSION:
> >>> +             ret = control->api_version;
> >>> +             break;
> >>> +     case VDUSE_SET_API_VERSION:
> >>> +             ret = -EINVAL;
> >>> +             if (arg > VDUSE_API_VERSION)
> >>> +                     break;
> >>> +
> >>> +             ret = 0;
> >>> +             control->api_version = arg;
> >>> +             break;
> >>> +     case VDUSE_CREATE_DEV: {
> >>> +             struct vduse_dev_config config;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&config, argp, sizeof(config)))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_create_dev(&config, control->api_version);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_DESTROY_DEV: {
> >>> +             char name[VDUSE_NAME_MAX];
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> >>> +                     break;
> >>> +
> >>> +             ret = vduse_destroy_dev(name);
> >>> +             break;
> >>> +     }
> >>> +     default:
> >>> +             ret = -EINVAL;
> >>> +             break;
> >>> +     }
> >>> +     mutex_unlock(&vduse_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static int vduse_release(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_control *control = file->private_data;
> >>> +
> >>> +     kfree(control);
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_open(struct inode *inode, struct file *file)
> >>> +{
> >>> +     struct vduse_control *control;
> >>> +
> >>> +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> >>> +     if (!control)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     control->api_version = VDUSE_API_VERSION;
> >>> +     file->private_data = control;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static const struct file_operations vduse_fops = {
> >>> +     .owner          = THIS_MODULE,
> >>> +     .open           = vduse_open,
> >>> +     .release        = vduse_release,
> >>> +     .unlocked_ioctl = vduse_ioctl,
> >>> +     .compat_ioctl   = compat_ptr_ioctl,
> >>> +     .llseek         = noop_llseek,
> >>> +};
> >>> +
> >>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> >>> +{
> >>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> >>> +}
> >>> +
> >>> +static struct miscdevice vduse_misc = {
> >>> +     .fops = &vduse_fops,
> >>> +     .minor = MISC_DYNAMIC_MINOR,
> >>> +     .name = "vduse",
> >>> +     .nodename = "vduse/control",
> >>> +};
> >>> +
> >>> +static void vduse_mgmtdev_release(struct device *dev)
> >>> +{
> >>> +}
> >>> +
> >>> +static struct device vduse_mgmtdev = {
> >>> +     .init_name = "vduse",
> >>> +     .release = vduse_mgmtdev_release,
> >>> +};
> >>> +
> >>> +static struct vdpa_mgmt_dev mgmt_dev;
> >>> +
> >>> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> >>> +{
> >>> +     struct vduse_vdpa *vdev;
> >>> +     int ret;
> >>> +
> >>> +     if (dev->vdev)
> >>> +             return -EEXIST;
> >>> +
> >>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, &dev->dev,
> >>> +                              &vduse_vdpa_config_ops, name, true);
> >>> +     if (!vdev)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     dev->vdev = vdev;
> >>> +     vdev->dev = dev;
> >>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> >>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> >>> +     if (ret) {
> >>> +             put_device(&vdev->vdpa.dev);
> >>> +             return ret;
> >>> +     }
> >>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> >>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> >>> +     vdev->vdpa.mdev = &mgmt_dev;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> >>> +{
> >>> +     struct vduse_dev *dev;
> >>> +     int ret = -EINVAL;
> >>> +
> >>> +     mutex_lock(&vduse_lock);
> >>> +     dev = vduse_find_dev(name);
> >>> +     if (!dev) {
> >>> +             mutex_unlock(&vduse_lock);
> >>> +             return -EINVAL;
> >>> +     }
> >>> +     ret = vduse_dev_init_vdpa(dev, name);
> >>> +     mutex_unlock(&vduse_lock);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> >>> +     if (ret) {
> >>> +             put_device(&dev->vdev->vdpa.dev);
> >>> +             return ret;
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> >>> +{
> >>> +     _vdpa_unregister_device(dev);
> >>> +}
> >>> +
> >>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> >>> +     .dev_add = vdpa_dev_add,
> >>> +     .dev_del = vdpa_dev_del,
> >>> +};
> >>> +
> >>> +static struct virtio_device_id id_table[] = {
> >>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> >>> +     { 0 },
> >>> +};
> >>> +
> >>> +static struct vdpa_mgmt_dev mgmt_dev = {
> >>> +     .device = &vduse_mgmtdev,
> >>> +     .id_table = id_table,
> >>> +     .ops = &vdpa_dev_mgmtdev_ops,
> >>> +};
> >>> +
> >>> +static int vduse_mgmtdev_init(void)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     ret = device_register(&vduse_mgmtdev);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
> >>> +     if (ret)
> >>> +             goto err;
> >>> +
> >>> +     return 0;
> >>> +err:
> >>> +     device_unregister(&vduse_mgmtdev);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static void vduse_mgmtdev_exit(void)
> >>> +{
> >>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
> >>> +     device_unregister(&vduse_mgmtdev);
> >>> +}
> >>> +
> >>> +static int vduse_init(void)
> >>> +{
> >>> +     int ret;
> >>> +
> >>> +     if (max_bounce_size >= max_iova_size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     ret = misc_register(&vduse_misc);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     vduse_class = class_create(THIS_MODULE, "vduse");
> >>> +     if (IS_ERR(vduse_class)) {
> >>> +             ret = PTR_ERR(vduse_class);
> >>> +             goto err_class;
> >>> +     }
> >>> +     vduse_class->devnode = vduse_devnode;
> >>> +
> >>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> >>> +     if (ret)
> >>> +             goto err_chardev;
> >>> +
> >>> +     vduse_irq_wq = alloc_workqueue("vduse-irq",
> >>> +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> >>> +     if (!vduse_irq_wq)
> >>> +             goto err_wq;
> >>> +
> >>> +     ret = vduse_domain_init();
> >>> +     if (ret)
> >>> +             goto err_domain;
> >>> +
> >>> +     ret = vduse_mgmtdev_init();
> >>> +     if (ret)
> >>> +             goto err_mgmtdev;
> >>> +
> >>> +     return 0;
> >>> +err_mgmtdev:
> >>> +     vduse_domain_exit();
> >>> +err_domain:
> >>> +     destroy_workqueue(vduse_irq_wq);
> >>> +err_wq:
> >>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> >>> +err_chardev:
> >>> +     class_destroy(vduse_class);
> >>> +err_class:
> >>> +     misc_deregister(&vduse_misc);
> >>> +     return ret;
> >>> +}
> >>> +module_init(vduse_init);
> >>> +
> >>> +static void vduse_exit(void)
> >>> +{
> >>> +     misc_deregister(&vduse_misc);
> >>> +     class_destroy(vduse_class);
> >>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> >>> +     destroy_workqueue(vduse_irq_wq);
> >>> +     vduse_domain_exit();
> >>> +     vduse_mgmtdev_exit();
> >>> +}
> >>> +module_exit(vduse_exit);
> >>> +
> >>> +MODULE_VERSION(DRV_VERSION);
> >>> +MODULE_LICENSE(DRV_LICENSE);
> >>> +MODULE_AUTHOR(DRV_AUTHOR);
> >>> +MODULE_DESCRIPTION(DRV_DESC);
> >>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> >>> new file mode 100644
> >>> index 000000000000..66a6e5212226
> >>> --- /dev/null
> >>> +++ b/include/uapi/linux/vduse.h
> >>> @@ -0,0 +1,175 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >>> +#ifndef _UAPI_VDUSE_H_
> >>> +#define _UAPI_VDUSE_H_
> >>> +
> >>> +#include <linux/types.h>
> >>> +
> >>> +#define VDUSE_API_VERSION    0
> >>> +
> >>> +#define VDUSE_CONFIG_DATA_LEN        256
> >>> +#define VDUSE_NAME_MAX       256
> >>> +
> >>> +/* the control messages definition for read/write */
> >>> +
> >>> +enum vduse_req_type {
> >>> +     /* Set the vring address of virtqueue. */
> >>> +     VDUSE_SET_VQ_NUM,
> >>> +     /* Set the vring address of virtqueue. */
> >>> +     VDUSE_SET_VQ_ADDR,
> >>> +     /* Set ready status of virtqueue */
> >>> +     VDUSE_SET_VQ_READY,
> >>> +     /* Get ready status of virtqueue */
> >>> +     VDUSE_GET_VQ_READY,
> >>> +     /* Set the state for virtqueue */
> >>> +     VDUSE_SET_VQ_STATE,
> >>> +     /* Get the state for virtqueue */
> >>> +     VDUSE_GET_VQ_STATE,
> >>> +     /* Set virtio features supported by the driver */
> >>> +     VDUSE_SET_FEATURES,
> >>> +     /* Get virtio features supported by the device */
> >>> +     VDUSE_GET_FEATURES,
> >>> +     /* Set the device status */
> >>> +     VDUSE_SET_STATUS,
> >>> +     /* Get the device status */
> >>> +     VDUSE_GET_STATUS,
> >>> +     /* Write to device specific configuration space */
> >>> +     VDUSE_SET_CONFIG,
> >>> +     /* Read from device specific configuration space */
> >>> +     VDUSE_GET_CONFIG,
> >>> +     /* Notify userspace to update the memory mapping in device IOTLB */
> >>> +     VDUSE_UPDATE_IOTLB,
> >>> +};
> >>> +
> >>> +struct vduse_vq_num {
> >>> +     __u32 index; /* virtqueue index */
> >>
> >> I think it's better to have a consistent style of the doc/comment. If
> >> yes, let's move those comment above the field.
> >>
> > Fine.
> >
> >>> +     __u32 num; /* the size of virtqueue */
> >>> +};
> >>> +
> >>> +struct vduse_vq_addr {
> >>> +     __u32 index; /* virtqueue index */
> >>> +     __u64 desc_addr; /* address of desc area */
> >>> +     __u64 driver_addr; /* address of driver area */
> >>> +     __u64 device_addr; /* address of device area */
> >>> +};
> >>> +
> >>> +struct vduse_vq_ready {
> >>> +     __u32 index; /* virtqueue index */
> >>> +     __u8 ready; /* ready status of virtqueue */
> >>> +};
> >>> +
> >>> +struct vduse_vq_state {
> >>> +     __u32 index; /* virtqueue index */
> >>> +     __u16 avail_idx; /* virtqueue state (last_avail_idx) */
> >>
> >> Let's use __u64 here to be consistent with get_vq_state().
>
>
> __u32 actually:
>
> struct vhost_vring_state {
>          unsigned int index;
>          unsigned int num;
> };
>

OK.

>
> >>   The idea is
> >> to support packed virtqueue.
> >>
> > OK. But looks like sizeof(struct vdpa_vq_state) is still equal to 2.
>
>
> It's a bug that needs to be fixed.
>
>
> > Do you mean we will extend it in the future?
>
>
> u32 should be fine, and it's not for future. For packed virtqueue we
> need something like:
>
> le32 {
>      last_avai_idx : 15;
>      last_avail_wrap_counter : 1;
>      used_idx : 15;
>      used_wrap_counter : 1;
> }
>
> As has been done in
> https://lore.kernel.org/lkml/20190717105255.63488-3-jasowang@redhat.com/T/
>

I see.

>
> >
> >>> +};
> >>> +
> >>> +struct vduse_dev_config_data {
> >>> +     __u32 offset; /* offset from the beginning of config space */
> >>> +     __u32 len; /* the length to read/write */
> >>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
> >>
> >> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
> >> not change it in the future.
> >>
> >> So this might suffcient for future features or all type of virtio devices.
> >>
> > Do you mean 256 is no enough here?
>
>
> Yes.
>

But this request will be submitted multiple times if config lengh is
larger than 256. So do you think whether we need to extent the size to
512 or larger?

>
> >
> >>> +};
> >>> +
> >>> +struct vduse_iova_range {
> >>> +     __u64 start; /* start of the IOVA range */
> >>> +     __u64 last; /* end of the IOVA range */
> >>> +};
> >>> +
> >>> +struct vduse_features {
> >>> +     __u64 features; /* virtio features */
> >>> +};
> >>> +
> >>> +struct vduse_status {
> >>> +     __u8 status; /* device status */
> >>> +};
> >>> +
> >>> +struct vduse_dev_request {
> >>> +     __u32 type; /* request type */
> >>> +     __u32 request_id; /* request id */
> >>> +     __u32 reserved[2]; /* for future use */
> >>> +     union {
> >>> +             struct vduse_vq_num vq_num; /* virtqueue num */
> >>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
> >>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> >>> +             struct vduse_vq_state vq_state; /* virtqueue state */
> >>> +             struct vduse_dev_config_data config; /* virtio device config space */
> >>> +             struct vduse_iova_range iova; /* iova range for updating */
> >>> +             struct vduse_features f; /* virtio features */
> >>> +             struct vduse_status s; /* device status */
> >>> +             __u32 padding[16]; /* padding */
> >>> +     };
> >>> +};
> >>> +
> >>> +struct vduse_dev_response {
> >>> +     __u32 request_id; /* corresponding request id */
> >>> +#define VDUSE_REQUEST_OK     0x00
> >>> +#define VDUSE_REQUEST_FAILED 0x01
> >>> +     __u32 result; /* the result of request */
> >>> +     __u32 reserved[2]; /* for future use */
> >>> +     union {
> >>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> >>> +             struct vduse_vq_state vq_state; /* virtqueue state */
> >>> +             struct vduse_dev_config_data config; /* virtio device config space */
> >>> +             struct vduse_features f; /* virtio features */
> >>> +             struct vduse_status s; /* device status */
> >>> +             __u32 padding[16]; /* padding */
> >>
> >> So it looks to me this padding doesn't work since vduse_dev_config_data
> >> is larger than it.
> >>
> > Oh, my bad. Will fix it.
> >
> >>> +     };
> >>> +};
> >>> +
> >>> +/* ioctls */
> >>> +
> >>> +struct vduse_dev_config {
> >>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
> >>> +     __u32 vendor_id; /* virtio vendor id */
> >>> +     __u32 device_id; /* virtio device id */
> >>> +     __u64 bounce_size; /* bounce buffer size for iommu */
> >>> +     __u16 vq_num; /* the number of virtqueues */
> >>> +     __u16 vq_size_max; /* the max size of virtqueue */
> >>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> >>> +     __u32 reserved[8]; /* for future use */
> >>
> >> Is there a hole before reserved?
> >>
> > But I don't find the hole in below layout:
> >
> > | 256 | 4 | 4 | 8 | 2 | 2 | 4 | 32 |
> >
>
> Looks correct, better to check with pahole to double confirm.
>

OK. Sure.

>
> >>> +};
> >>> +
> >>> +struct vduse_iotlb_entry {
> >>> +     __u64 offset; /* the mmap offset on fd */
> >>> +     __u64 start; /* start of the IOVA range */
> >>> +     __u64 last; /* last of the IOVA range */
> >>> +#define VDUSE_ACCESS_RO 0x1
> >>> +#define VDUSE_ACCESS_WO 0x2
> >>> +#define VDUSE_ACCESS_RW 0x3
> >>> +     __u8 perm; /* access permission of this range */
> >>> +};
> >>> +
> >>> +struct vduse_vq_eventfd {
> >>> +     __u32 index; /* virtqueue index */
> >>> +#define VDUSE_EVENTFD_DEASSIGN -1
> >>> +     int fd; /* eventfd, -1 means de-assigning the eventfd */
> >>> +};
> >>> +
> >>> +#define VDUSE_BASE   0x81
> >>> +
> >>> +/* Get the version of VDUSE API. This is used for future extension */
> >>> +#define VDUSE_GET_API_VERSION        _IO(VDUSE_BASE, 0x00)
> >>> +
> >>> +/* Set the version of VDUSE API. */
> >>> +#define VDUSE_SET_API_VERSION        _IO(VDUSE_BASE, 0x01)
> >>> +
> >>> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> >>> +#define VDUSE_CREATE_DEV     _IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> >>> +
> >>> +/* Destroy a vduse device. Make sure there are no references to the char device */
> >>> +#define VDUSE_DESTROY_DEV    _IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> >>> +
> >>> +/*
> >>> + * Get a file descriptor for the first overlapped iova region,
> >>> + * -EINVAL means the iova region doesn't exist.
> >>> + */
> >>> +#define VDUSE_IOTLB_GET_FD   _IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> >>> +
> >>> +/* Setup an eventfd to receive kick for virtqueue */
> >>> +#define VDUSE_VQ_SETUP_KICKFD        _IOW(VDUSE_BASE, 0x05, struct vduse_vq_eventfd)
> >>> +
> >>> +/* Inject an interrupt for specific virtqueue */
> >>> +#define VDUSE_INJECT_VQ_IRQ  _IO(VDUSE_BASE, 0x06)
> >>
> >> Missing parameter?
> >>
> > We use the argp to store the virtqueue index here. Is it OK?
>
>
> So I meant it should be something like:
>
> #define VDUSE_INJECT_VQ_IRQ  _IOW(VDUSE_BASE, 0x06, unsigned int)
>

It's OK to me.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-03-31  8:05 ` [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
@ 2021-04-09 16:15   ` Michael S. Tsirkin
  2021-04-11  5:36     ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Michael S. Tsirkin @ 2021-04-09 16:15 UTC (permalink / raw)
  To: Xie Yongji
  Cc: jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 04:05:12PM +0800, Xie Yongji wrote:
> Use vhost_dev->mutex to protect vhost device iotlb from
> concurrent access.
> 
> Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> Cc: stable@vger.kernel.org
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> Acked-by: Jason Wang <jasowang@redhat.com>
> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>

I could not figure out whether there's a bug there now.
If yes when is the concurrent access triggered?

> ---
>  drivers/vhost/vdpa.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 3947fbc2d1d5..63b28d3aee7c 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -725,9 +725,11 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>  	const struct vdpa_config_ops *ops = vdpa->config;
>  	int r = 0;
>  
> +	mutex_lock(&dev->mutex);
> +
>  	r = vhost_dev_check_owner(dev);
>  	if (r)
> -		return r;
> +		goto unlock;
>  
>  	switch (msg->type) {
>  	case VHOST_IOTLB_UPDATE:
> @@ -748,6 +750,8 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>  		r = -EINVAL;
>  		break;
>  	}
> +unlock:
> +	mutex_unlock(&dev->mutex);
>  
>  	return r;
>  }
> -- 
> 2.11.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-04-09 16:15   ` Michael S. Tsirkin
@ 2021-04-11  5:36     ` Yongji Xie
  2021-04-11 20:48       ` Michael S. Tsirkin
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-11  5:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Sat, Apr 10, 2021 at 12:16 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Mar 31, 2021 at 04:05:12PM +0800, Xie Yongji wrote:
> > Use vhost_dev->mutex to protect vhost device iotlb from
> > concurrent access.
> >
> > Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > Acked-by: Jason Wang <jasowang@redhat.com>
> > Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
>
> I could not figure out whether there's a bug there now.
> If yes when is the concurrent access triggered?
>

When userspace sends the VHOST_IOTLB_MSG_V2 message concurrently?

vhost_vdpa_chr_write_iter -> vhost_chr_write_iter ->
vhost_vdpa_process_iotlb_msg()

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-04-11  5:36     ` Yongji Xie
@ 2021-04-11 20:48       ` Michael S. Tsirkin
  2021-04-12  2:29         ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Michael S. Tsirkin @ 2021-04-11 20:48 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Sun, Apr 11, 2021 at 01:36:18PM +0800, Yongji Xie wrote:
> On Sat, Apr 10, 2021 at 12:16 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Mar 31, 2021 at 04:05:12PM +0800, Xie Yongji wrote:
> > > Use vhost_dev->mutex to protect vhost device iotlb from
> > > concurrent access.
> > >
> > > Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > Acked-by: Jason Wang <jasowang@redhat.com>
> > > Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
> >
> > I could not figure out whether there's a bug there now.
> > If yes when is the concurrent access triggered?
> >
> 
> When userspace sends the VHOST_IOTLB_MSG_V2 message concurrently?
> 
> vhost_vdpa_chr_write_iter -> vhost_chr_write_iter ->
> vhost_vdpa_process_iotlb_msg()
> 
> Thanks,
> Yongji

And then what happens currently?

-- 
MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: Re: [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-04-11 20:48       ` Michael S. Tsirkin
@ 2021-04-12  2:29         ` Yongji Xie
  2021-04-12  9:00           ` Michael S. Tsirkin
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-12  2:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Mon, Apr 12, 2021 at 4:49 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Apr 11, 2021 at 01:36:18PM +0800, Yongji Xie wrote:
> > On Sat, Apr 10, 2021 at 12:16 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Mar 31, 2021 at 04:05:12PM +0800, Xie Yongji wrote:
> > > > Use vhost_dev->mutex to protect vhost device iotlb from
> > > > concurrent access.
> > > >
> > > > Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > Acked-by: Jason Wang <jasowang@redhat.com>
> > > > Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
> > >
> > > I could not figure out whether there's a bug there now.
> > > If yes when is the concurrent access triggered?
> > >
> >
> > When userspace sends the VHOST_IOTLB_MSG_V2 message concurrently?
> >
> > vhost_vdpa_chr_write_iter -> vhost_chr_write_iter ->
> > vhost_vdpa_process_iotlb_msg()
> >
> > Thanks,
> > Yongji
>
> And then what happens currently?
>

Then we might access vhost_vdpa_map() concurrently and cause
corruption of the list and interval tree in struct vhost_iotlb.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-09  8:02         ` Yongji Xie
@ 2021-04-12  7:16           ` Jason Wang
  2021-04-12  8:02             ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-12  7:16 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/9 下午4:02, Yongji Xie 写道:
>>>>> +};
>>>>> +
>>>>> +struct vduse_dev_config_data {
>>>>> +     __u32 offset; /* offset from the beginning of config space */
>>>>> +     __u32 len; /* the length to read/write */
>>>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
>>>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
>>>> not change it in the future.
>>>>
>>>> So this might suffcient for future features or all type of virtio devices.
>>>>
>>> Do you mean 256 is no enough here?
>> Yes.
>>
> But this request will be submitted multiple times if config lengh is
> larger than 256. So do you think whether we need to extent the size to
> 512 or larger?


So I think you'd better either:

1) document the limitation (256) in somewhere, (better both uapi and doc)

or

2) make it variable

Thanks


>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-12  7:16           ` Jason Wang
@ 2021-04-12  8:02             ` Yongji Xie
  2021-04-12  9:37               ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-12  8:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Mon, Apr 12, 2021 at 3:16 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/9 下午4:02, Yongji Xie 写道:
> >>>>> +};
> >>>>> +
> >>>>> +struct vduse_dev_config_data {
> >>>>> +     __u32 offset; /* offset from the beginning of config space */
> >>>>> +     __u32 len; /* the length to read/write */
> >>>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
> >>>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
> >>>> not change it in the future.
> >>>>
> >>>> So this might suffcient for future features or all type of virtio devices.
> >>>>
> >>> Do you mean 256 is no enough here?
> >> Yes.
> >>
> > But this request will be submitted multiple times if config lengh is
> > larger than 256. So do you think whether we need to extent the size to
> > 512 or larger?
>
>
> So I think you'd better either:
>
> 1) document the limitation (256) in somewhere, (better both uapi and doc)
>

But the VDUSE_CONFIG_DATA_LEN doesn't mean the limitation of
configuration space. It only means the maximum size of one data
transfer for configuration space. Do you mean document this?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: Re: [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb
  2021-04-12  2:29         ` Yongji Xie
@ 2021-04-12  9:00           ` Michael S. Tsirkin
  0 siblings, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2021-04-12  9:00 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Mon, Apr 12, 2021 at 10:29:17AM +0800, Yongji Xie wrote:
> On Mon, Apr 12, 2021 at 4:49 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Sun, Apr 11, 2021 at 01:36:18PM +0800, Yongji Xie wrote:
> > > On Sat, Apr 10, 2021 at 12:16 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Wed, Mar 31, 2021 at 04:05:12PM +0800, Xie Yongji wrote:
> > > > > Use vhost_dev->mutex to protect vhost device iotlb from
> > > > > concurrent access.
> > > > >
> > > > > Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> > > > > Cc: stable@vger.kernel.org
> > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > > Acked-by: Jason Wang <jasowang@redhat.com>
> > > > > Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
> > > >
> > > > I could not figure out whether there's a bug there now.
> > > > If yes when is the concurrent access triggered?
> > > >
> > >
> > > When userspace sends the VHOST_IOTLB_MSG_V2 message concurrently?
> > >
> > > vhost_vdpa_chr_write_iter -> vhost_chr_write_iter ->
> > > vhost_vdpa_process_iotlb_msg()
> > >
> > > Thanks,
> > > Yongji
> >
> > And then what happens currently?
> >
> 
> Then we might access vhost_vdpa_map() concurrently and cause
> corruption of the list and interval tree in struct vhost_iotlb.
> 
> Thanks,
> Yongji

OK. Sounds like it's actually needed in this release if possible.  Pls
add this info in the commit log and post it as a separate patch. 

-- 
MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-12  8:02             ` Yongji Xie
@ 2021-04-12  9:37               ` Jason Wang
  2021-04-12  9:59                 ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-12  9:37 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/12 下午4:02, Yongji Xie 写道:
> On Mon, Apr 12, 2021 at 3:16 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/4/9 下午4:02, Yongji Xie 写道:
>>>>>>> +};
>>>>>>> +
>>>>>>> +struct vduse_dev_config_data {
>>>>>>> +     __u32 offset; /* offset from the beginning of config space */
>>>>>>> +     __u32 len; /* the length to read/write */
>>>>>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
>>>>>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
>>>>>> not change it in the future.
>>>>>>
>>>>>> So this might suffcient for future features or all type of virtio devices.
>>>>>>
>>>>> Do you mean 256 is no enough here?
>>>> Yes.
>>>>
>>> But this request will be submitted multiple times if config lengh is
>>> larger than 256. So do you think whether we need to extent the size to
>>> 512 or larger?
>>
>> So I think you'd better either:
>>
>> 1) document the limitation (256) in somewhere, (better both uapi and doc)
>>
> But the VDUSE_CONFIG_DATA_LEN doesn't mean the limitation of
> configuration space. It only means the maximum size of one data
> transfer for configuration space. Do you mean document this?


Yes, and another thing is that since you're using 
data[VDUSE_CONFIG_DATA_LEN] in the uapi, it implies the length is always 
256 which seems not good and not what the code is wrote.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-12  9:37               ` Jason Wang
@ 2021-04-12  9:59                 ` Yongji Xie
  2021-04-13  3:35                   ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-12  9:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Mon, Apr 12, 2021 at 5:37 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/12 下午4:02, Yongji Xie 写道:
> > On Mon, Apr 12, 2021 at 3:16 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/4/9 下午4:02, Yongji Xie 写道:
> >>>>>>> +};
> >>>>>>> +
> >>>>>>> +struct vduse_dev_config_data {
> >>>>>>> +     __u32 offset; /* offset from the beginning of config space */
> >>>>>>> +     __u32 len; /* the length to read/write */
> >>>>>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
> >>>>>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
> >>>>>> not change it in the future.
> >>>>>>
> >>>>>> So this might suffcient for future features or all type of virtio devices.
> >>>>>>
> >>>>> Do you mean 256 is no enough here?
> >>>> Yes.
> >>>>
> >>> But this request will be submitted multiple times if config lengh is
> >>> larger than 256. So do you think whether we need to extent the size to
> >>> 512 or larger?
> >>
> >> So I think you'd better either:
> >>
> >> 1) document the limitation (256) in somewhere, (better both uapi and doc)
> >>
> > But the VDUSE_CONFIG_DATA_LEN doesn't mean the limitation of
> > configuration space. It only means the maximum size of one data
> > transfer for configuration space. Do you mean document this?
>
>
> Yes, and another thing is that since you're using
> data[VDUSE_CONFIG_DATA_LEN] in the uapi, it implies the length is always
> 256 which seems not good and not what the code is wrote.
>

How about renaming VDUSE_CONFIG_DATA_LEN to VDUSE_MAX_TRANSFER_LEN?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-12  9:59                 ` Yongji Xie
@ 2021-04-13  3:35                   ` Jason Wang
  2021-04-13  4:28                     ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-13  3:35 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/12 下午5:59, Yongji Xie 写道:
> On Mon, Apr 12, 2021 at 5:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/4/12 下午4:02, Yongji Xie 写道:
>>> On Mon, Apr 12, 2021 at 3:16 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> 在 2021/4/9 下午4:02, Yongji Xie 写道:
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +struct vduse_dev_config_data {
>>>>>>>>> +     __u32 offset; /* offset from the beginning of config space */
>>>>>>>>> +     __u32 len; /* the length to read/write */
>>>>>>>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
>>>>>>>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
>>>>>>>> not change it in the future.
>>>>>>>>
>>>>>>>> So this might suffcient for future features or all type of virtio devices.
>>>>>>>>
>>>>>>> Do you mean 256 is no enough here?
>>>>>> Yes.
>>>>>>
>>>>> But this request will be submitted multiple times if config lengh is
>>>>> larger than 256. So do you think whether we need to extent the size to
>>>>> 512 or larger?
>>>> So I think you'd better either:
>>>>
>>>> 1) document the limitation (256) in somewhere, (better both uapi and doc)
>>>>
>>> But the VDUSE_CONFIG_DATA_LEN doesn't mean the limitation of
>>> configuration space. It only means the maximum size of one data
>>> transfer for configuration space. Do you mean document this?
>>
>> Yes, and another thing is that since you're using
>> data[VDUSE_CONFIG_DATA_LEN] in the uapi, it implies the length is always
>> 256 which seems not good and not what the code is wrote.
>>
> How about renaming VDUSE_CONFIG_DATA_LEN to VDUSE_MAX_TRANSFER_LEN?
>
> Thanks,
> Yongji


So a question is the reason to have a limitation of this in the uAPI? 
Note that in vhost-vdpa we don't have such:

struct vhost_vdpa_config {
         __u32 off;
         __u32 len;
         __u8 buf[0];
};

Thanks


>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-13  3:35                   ` Jason Wang
@ 2021-04-13  4:28                     ` Yongji Xie
  2021-04-14  8:18                       ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-13  4:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Tue, Apr 13, 2021 at 11:35 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/12 下午5:59, Yongji Xie 写道:
> > On Mon, Apr 12, 2021 at 5:37 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/4/12 下午4:02, Yongji Xie 写道:
> >>> On Mon, Apr 12, 2021 at 3:16 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> 在 2021/4/9 下午4:02, Yongji Xie 写道:
> >>>>>>>>> +};
> >>>>>>>>> +
> >>>>>>>>> +struct vduse_dev_config_data {
> >>>>>>>>> +     __u32 offset; /* offset from the beginning of config space */
> >>>>>>>>> +     __u32 len; /* the length to read/write */
> >>>>>>>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
> >>>>>>>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
> >>>>>>>> not change it in the future.
> >>>>>>>>
> >>>>>>>> So this might suffcient for future features or all type of virtio devices.
> >>>>>>>>
> >>>>>>> Do you mean 256 is no enough here?
> >>>>>> Yes.
> >>>>>>
> >>>>> But this request will be submitted multiple times if config lengh is
> >>>>> larger than 256. So do you think whether we need to extent the size to
> >>>>> 512 or larger?
> >>>> So I think you'd better either:
> >>>>
> >>>> 1) document the limitation (256) in somewhere, (better both uapi and doc)
> >>>>
> >>> But the VDUSE_CONFIG_DATA_LEN doesn't mean the limitation of
> >>> configuration space. It only means the maximum size of one data
> >>> transfer for configuration space. Do you mean document this?
> >>
> >> Yes, and another thing is that since you're using
> >> data[VDUSE_CONFIG_DATA_LEN] in the uapi, it implies the length is always
> >> 256 which seems not good and not what the code is wrote.
> >>
> > How about renaming VDUSE_CONFIG_DATA_LEN to VDUSE_MAX_TRANSFER_LEN?
> >
> > Thanks,
> > Yongji
>
>
> So a question is the reason to have a limitation of this in the uAPI?
> Note that in vhost-vdpa we don't have such:
>
> struct vhost_vdpa_config {
>          __u32 off;
>          __u32 len;
>          __u8 buf[0];
> };
>

If so, we need to call read()/write() multiple times each time
receiving/sending one request or response in userspace and kernel. For
example,

1. read and check request/response type
2. read and check config length if type is VDUSE_SET_CONFIG or VDUSE_GET_CONFIG
3. read the payload

Not sure if it's worth it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
                   ` (9 preceding siblings ...)
  2021-03-31  8:05 ` [PATCH v6 10/10] Documentation: Add documentation for VDUSE Xie Yongji
@ 2021-04-14  7:34 ` Michael S. Tsirkin
  2021-04-14  7:49   ` Jason Wang
  2021-04-14  7:54   ` Yongji Xie
  10 siblings, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2021-04-14  7:34 UTC (permalink / raw)
  To: Xie Yongji
  Cc: jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Mar 31, 2021 at 04:05:09PM +0800, Xie Yongji wrote:
> This series introduces a framework, which can be used to implement
> vDPA Devices in a userspace program. The work consist of two parts:
> control path forwarding and data path offloading.
> 
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
> 
> In the data path, the core is mapping dma buffer into VDUSE
> daemon's address space, which can be implemented in different ways
> depending on the vdpa bus to which the vDPA device is attached.
> 
> In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
> bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
> buffer is reside in a userspace memory region which can be shared to the
> VDUSE userspace processs via transferring the shmfd.
> 
> The details and our user case is shown below:
> 
> ------------------------    -------------------------   ----------------------------------------------
> |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
> |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
> |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
> ------------+-----------     -----------+------------   -------------+----------------------+---------
>             |                           |                            |                      |
>             |                           |                            |                      |
> ------------+---------------------------+----------------------------+----------------------+---------
> |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
> |    -------+--------           --------+--------            -------+--------          -----+----    |
> |           |                           |                           |                       |        |
> | ----------+----------       ----------+-----------         -------+-------                |        |
> | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
> | ----------+----------       ----------+-----------         -------+-------                |        |
> |           |      virtio bus           |                           |                       |        |
> |   --------+----+-----------           |                           |                       |        |
> |                |                      |                           |                       |        |
> |      ----------+----------            |                           |                       |        |
> |      | virtio-blk device |            |                           |                       |        |
> |      ----------+----------            |                           |                       |        |
> |                |                      |                           |                       |        |
> |     -----------+-----------           |                           |                       |        |
> |     |  virtio-vdpa driver |           |                           |                       |        |
> |     -----------+-----------           |                           |                       |        |
> |                |                      |                           |    vdpa bus           |        |
> |     -----------+----------------------+---------------------------+------------           |        |
> |                                                                                        ---+---     |
> -----------------------------------------------------------------------------------------| NIC |------
>                                                                                          ---+---
>                                                                                             |
>                                                                                    ---------+---------
>                                                                                    | Remote Storages |
>                                                                                    -------------------

This all looks quite similar to vhost-user-block except that one
does not need any kernel support at all.

So I am still scratching my head about its advantages over
vhost-user-block.


> We make use of it to implement a block device connecting to
> our distributed storage, which can be used both in containers and
> VMs. Thus, we can have an unified technology stack in this two cases.

Maybe the container part is the answer. How does that stack look?

> To test it with null-blk:
> 
>   $ qemu-storage-daemon \
>       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>       --monitor chardev=charmonitor \
>       --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
>       --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128
> 
> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
> 
> Future work:
>   - Improve performance
>   - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
> 
> V5 to V6:
> - Export receive_fd() instead of __receive_fd()
> - Factor out the unmapping logic of pa and va separatedly
> - Remove the logic of bounce page allocation in page fault handler
> - Use PAGE_SIZE as IOVA allocation granule
> - Add EPOLLOUT support
> - Enable setting API version in userspace
> - Fix some bugs
> 
> V4 to V5:
> - Remove the patch for irq binding
> - Use a single IOTLB for all types of mapping
> - Factor out vhost_vdpa_pa_map()
> - Add some sample codes in document
> - Use receice_fd_user() to pass file descriptor
> - Fix some bugs
> 
> V3 to V4:
> - Rebase to vhost.git
> - Split some patches
> - Add some documents
> - Use ioctl to inject interrupt rather than eventfd
> - Enable config interrupt support
> - Support binding irq to the specified cpu
> - Add two module parameter to limit bounce/iova size
> - Create char device rather than anon inode per vduse
> - Reuse vhost IOTLB for iova domain
> - Rework the message mechnism in control path
> 
> V2 to V3:
> - Rework the MMU-based IOMMU driver
> - Use the iova domain as iova allocator instead of genpool
> - Support transferring vma->vm_file in vhost-vdpa
> - Add SVA support in vhost-vdpa
> - Remove the patches on bounce pages reclaim
> 
> V1 to V2:
> - Add vhost-vdpa support
> - Add some documents
> - Based on the vdpa management tool
> - Introduce a workqueue for irq injection
> - Replace interval tree with array map to store the iova_map
> 
> Xie Yongji (10):
>   file: Export receive_fd() to modules
>   eventfd: Increase the recursion depth of eventfd_signal()
>   vhost-vdpa: protect concurrent access to vhost device iotlb
>   vhost-iotlb: Add an opaque pointer for vhost IOTLB
>   vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
>   vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
>   vdpa: Support transferring virtual addressing during DMA mapping
>   vduse: Implement an MMU-based IOMMU driver
>   vduse: Introduce VDUSE - vDPA Device in Userspace
>   Documentation: Add documentation for VDUSE
> 
>  Documentation/userspace-api/index.rst              |    1 +
>  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>  Documentation/userspace-api/vduse.rst              |  212 +++
>  drivers/vdpa/Kconfig                               |   10 +
>  drivers/vdpa/Makefile                              |    1 +
>  drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
>  drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
>  drivers/vdpa/vdpa.c                                |    9 +-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
>  drivers/vdpa/vdpa_user/Makefile                    |    5 +
>  drivers/vdpa/vdpa_user/iova_domain.c               |  521 ++++++++
>  drivers/vdpa/vdpa_user/iova_domain.h               |   70 +
>  drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
>  drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
>  drivers/vhost/iotlb.c                              |   20 +-
>  drivers/vhost/vdpa.c                               |  154 ++-
>  fs/eventfd.c                                       |    2 +-
>  fs/file.c                                          |    6 +
>  include/linux/eventfd.h                            |    5 +-
>  include/linux/file.h                               |    7 +-
>  include/linux/vdpa.h                               |   21 +-
>  include/linux/vhost_iotlb.h                        |    3 +
>  include/uapi/linux/vduse.h                         |  175 +++
>  23 files changed, 2548 insertions(+), 51 deletions(-)
>  create mode 100644 Documentation/userspace-api/vduse.rst
>  create mode 100644 drivers/vdpa/vdpa_user/Makefile
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>  create mode 100644 include/uapi/linux/vduse.h
> 
> -- 
> 2.11.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-04-14  7:34 ` [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
@ 2021-04-14  7:49   ` Jason Wang
  2021-04-14  7:54   ` Yongji Xie
  1 sibling, 0 replies; 62+ messages in thread
From: Jason Wang @ 2021-04-14  7:49 UTC (permalink / raw)
  To: Michael S. Tsirkin, Xie Yongji
  Cc: stefanha, sgarzare, parav, hch, christian.brauner, rdunlap,
	willy, viro, axboe, bcrl, corbet, mika.penttila, dan.carpenter,
	virtualization, netdev, kvm, linux-fsdevel


在 2021/4/14 下午3:34, Michael S. Tsirkin 写道:
> On Wed, Mar 31, 2021 at 04:05:09PM +0800, Xie Yongji wrote:
>> This series introduces a framework, which can be used to implement
>> vDPA Devices in a userspace program. The work consist of two parts:
>> control path forwarding and data path offloading.
>>
>> In the control path, the VDUSE driver will make use of message
>> mechnism to forward the config operation from vdpa bus driver
>> to userspace. Userspace can use read()/write() to receive/reply
>> those control messages.
>>
>> In the data path, the core is mapping dma buffer into VDUSE
>> daemon's address space, which can be implemented in different ways
>> depending on the vdpa bus to which the vDPA device is attached.
>>
>> In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
>> bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
>> buffer is reside in a userspace memory region which can be shared to the
>> VDUSE userspace processs via transferring the shmfd.
>>
>> The details and our user case is shown below:
>>
>> ------------------------    -------------------------   ----------------------------------------------
>> |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
>> |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
>> |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
>> ------------+-----------     -----------+------------   -------------+----------------------+---------
>>              |                           |                            |                      |
>>              |                           |                            |                      |
>> ------------+---------------------------+----------------------------+----------------------+---------
>> |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
>> |    -------+--------           --------+--------            -------+--------          -----+----    |
>> |           |                           |                           |                       |        |
>> | ----------+----------       ----------+-----------         -------+-------                |        |
>> | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
>> | ----------+----------       ----------+-----------         -------+-------                |        |
>> |           |      virtio bus           |                           |                       |        |
>> |   --------+----+-----------           |                           |                       |        |
>> |                |                      |                           |                       |        |
>> |      ----------+----------            |                           |                       |        |
>> |      | virtio-blk device |            |                           |                       |        |
>> |      ----------+----------            |                           |                       |        |
>> |                |                      |                           |                       |        |
>> |     -----------+-----------           |                           |                       |        |
>> |     |  virtio-vdpa driver |           |                           |                       |        |
>> |     -----------+-----------           |                           |                       |        |
>> |                |                      |                           |    vdpa bus           |        |
>> |     -----------+----------------------+---------------------------+------------           |        |
>> |                                                                                        ---+---     |
>> -----------------------------------------------------------------------------------------| NIC |------
>>                                                                                           ---+---
>>                                                                                              |
>>                                                                                     ---------+---------
>>                                                                                     | Remote Storages |
>>                                                                                     -------------------
> This all looks quite similar to vhost-user-block except that one
> does not need any kernel support at all.
>
> So I am still scratching my head about its advantages over
> vhost-user-block.
>
>
>> We make use of it to implement a block device connecting to
>> our distributed storage, which can be used both in containers and
>> VMs. Thus, we can have an unified technology stack in this two cases.
> Maybe the container part is the answer. How does that stack look?


Yong Ji may add more and I think this has been demonstrated in the above 
figure: the userspace vDPA device can provide a kenrel virito-blk device 
via virtio_vdpa driver.

Thanks


>
>> To test it with null-blk:
>>
>>    $ qemu-storage-daemon \
>>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>>        --monitor chardev=charmonitor \
>>        --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
>>        --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128
>>
>> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
>>
>> Future work:
>>    - Improve performance
>>    - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
>>
>> V5 to V6:
>> - Export receive_fd() instead of __receive_fd()
>> - Factor out the unmapping logic of pa and va separatedly
>> - Remove the logic of bounce page allocation in page fault handler
>> - Use PAGE_SIZE as IOVA allocation granule
>> - Add EPOLLOUT support
>> - Enable setting API version in userspace
>> - Fix some bugs
>>
>> V4 to V5:
>> - Remove the patch for irq binding
>> - Use a single IOTLB for all types of mapping
>> - Factor out vhost_vdpa_pa_map()
>> - Add some sample codes in document
>> - Use receice_fd_user() to pass file descriptor
>> - Fix some bugs
>>
>> V3 to V4:
>> - Rebase to vhost.git
>> - Split some patches
>> - Add some documents
>> - Use ioctl to inject interrupt rather than eventfd
>> - Enable config interrupt support
>> - Support binding irq to the specified cpu
>> - Add two module parameter to limit bounce/iova size
>> - Create char device rather than anon inode per vduse
>> - Reuse vhost IOTLB for iova domain
>> - Rework the message mechnism in control path
>>
>> V2 to V3:
>> - Rework the MMU-based IOMMU driver
>> - Use the iova domain as iova allocator instead of genpool
>> - Support transferring vma->vm_file in vhost-vdpa
>> - Add SVA support in vhost-vdpa
>> - Remove the patches on bounce pages reclaim
>>
>> V1 to V2:
>> - Add vhost-vdpa support
>> - Add some documents
>> - Based on the vdpa management tool
>> - Introduce a workqueue for irq injection
>> - Replace interval tree with array map to store the iova_map
>>
>> Xie Yongji (10):
>>    file: Export receive_fd() to modules
>>    eventfd: Increase the recursion depth of eventfd_signal()
>>    vhost-vdpa: protect concurrent access to vhost device iotlb
>>    vhost-iotlb: Add an opaque pointer for vhost IOTLB
>>    vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
>>    vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
>>    vdpa: Support transferring virtual addressing during DMA mapping
>>    vduse: Implement an MMU-based IOMMU driver
>>    vduse: Introduce VDUSE - vDPA Device in Userspace
>>    Documentation: Add documentation for VDUSE
>>
>>   Documentation/userspace-api/index.rst              |    1 +
>>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>   Documentation/userspace-api/vduse.rst              |  212 +++
>>   drivers/vdpa/Kconfig                               |   10 +
>>   drivers/vdpa/Makefile                              |    1 +
>>   drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
>>   drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
>>   drivers/vdpa/vdpa.c                                |    9 +-
>>   drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
>>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>   drivers/vdpa/vdpa_user/iova_domain.c               |  521 ++++++++
>>   drivers/vdpa/vdpa_user/iova_domain.h               |   70 +
>>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1362 ++++++++++++++++++++
>>   drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
>>   drivers/vhost/iotlb.c                              |   20 +-
>>   drivers/vhost/vdpa.c                               |  154 ++-
>>   fs/eventfd.c                                       |    2 +-
>>   fs/file.c                                          |    6 +
>>   include/linux/eventfd.h                            |    5 +-
>>   include/linux/file.h                               |    7 +-
>>   include/linux/vdpa.h                               |   21 +-
>>   include/linux/vhost_iotlb.h                        |    3 +
>>   include/uapi/linux/vduse.h                         |  175 +++
>>   23 files changed, 2548 insertions(+), 51 deletions(-)
>>   create mode 100644 Documentation/userspace-api/vduse.rst
>>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>   create mode 100644 include/uapi/linux/vduse.h
>>
>> -- 
>> 2.11.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-04-14  7:34 ` [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
  2021-04-14  7:49   ` Jason Wang
@ 2021-04-14  7:54   ` Yongji Xie
  1 sibling, 0 replies; 62+ messages in thread
From: Yongji Xie @ 2021-04-14  7:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Stefan Hajnoczi, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Apr 14, 2021 at 3:35 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Mar 31, 2021 at 04:05:09PM +0800, Xie Yongji wrote:
> > This series introduces a framework, which can be used to implement
> > vDPA Devices in a userspace program. The work consist of two parts:
> > control path forwarding and data path offloading.
> >
> > In the control path, the VDUSE driver will make use of message
> > mechnism to forward the config operation from vdpa bus driver
> > to userspace. Userspace can use read()/write() to receive/reply
> > those control messages.
> >
> > In the data path, the core is mapping dma buffer into VDUSE
> > daemon's address space, which can be implemented in different ways
> > depending on the vdpa bus to which the vDPA device is attached.
> >
> > In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
> > bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
> > buffer is reside in a userspace memory region which can be shared to the
> > VDUSE userspace processs via transferring the shmfd.
> >
> > The details and our user case is shown below:
> >
> > ------------------------    -------------------------   ----------------------------------------------
> > |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
> > |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
> > |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
> > ------------+-----------     -----------+------------   -------------+----------------------+---------
> >             |                           |                            |                      |
> >             |                           |                            |                      |
> > ------------+---------------------------+----------------------------+----------------------+---------
> > |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
> > |    -------+--------           --------+--------            -------+--------          -----+----    |
> > |           |                           |                           |                       |        |
> > | ----------+----------       ----------+-----------         -------+-------                |        |
> > | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
> > | ----------+----------       ----------+-----------         -------+-------                |        |
> > |           |      virtio bus           |                           |                       |        |
> > |   --------+----+-----------           |                           |                       |        |
> > |                |                      |                           |                       |        |
> > |      ----------+----------            |                           |                       |        |
> > |      | virtio-blk device |            |                           |                       |        |
> > |      ----------+----------            |                           |                       |        |
> > |                |                      |                           |                       |        |
> > |     -----------+-----------           |                           |                       |        |
> > |     |  virtio-vdpa driver |           |                           |                       |        |
> > |     -----------+-----------           |                           |                       |        |
> > |                |                      |                           |    vdpa bus           |        |
> > |     -----------+----------------------+---------------------------+------------           |        |
> > |                                                                                        ---+---     |
> > -----------------------------------------------------------------------------------------| NIC |------
> >                                                                                          ---+---
> >                                                                                             |
> >                                                                                    ---------+---------
> >                                                                                    | Remote Storages |
> >                                                                                    -------------------
>
> This all looks quite similar to vhost-user-block except that one
> does not need any kernel support at all.
>
> So I am still scratching my head about its advantages over
> vhost-user-block.
>

It plays the same role as vhost-user-block in VM user cases.

>
> > We make use of it to implement a block device connecting to
> > our distributed storage, which can be used both in containers and
> > VMs. Thus, we can have an unified technology stack in this two cases.
>
> Maybe the container part is the answer. How does that stack look?
>

Yes, it enables containers to reuse virtio software stack. We can have
one daemon that provides service to both containers and virtual
machines.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-13  4:28                     ` Yongji Xie
@ 2021-04-14  8:18                       ` Jason Wang
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Wang @ 2021-04-14  8:18 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/13 下午12:28, Yongji Xie 写道:
> On Tue, Apr 13, 2021 at 11:35 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/4/12 下午5:59, Yongji Xie 写道:
>>> On Mon, Apr 12, 2021 at 5:37 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> 在 2021/4/12 下午4:02, Yongji Xie 写道:
>>>>> On Mon, Apr 12, 2021 at 3:16 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> 在 2021/4/9 下午4:02, Yongji Xie 写道:
>>>>>>>>>>> +};
>>>>>>>>>>> +
>>>>>>>>>>> +struct vduse_dev_config_data {
>>>>>>>>>>> +     __u32 offset; /* offset from the beginning of config space */
>>>>>>>>>>> +     __u32 len; /* the length to read/write */
>>>>>>>>>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN]; /* data buffer used to read/write */
>>>>>>>>>> Note that since VDUSE_CONFIG_DATA_LEN is part of uAPI it means we can
>>>>>>>>>> not change it in the future.
>>>>>>>>>>
>>>>>>>>>> So this might suffcient for future features or all type of virtio devices.
>>>>>>>>>>
>>>>>>>>> Do you mean 256 is no enough here?
>>>>>>>> Yes.
>>>>>>>>
>>>>>>> But this request will be submitted multiple times if config lengh is
>>>>>>> larger than 256. So do you think whether we need to extent the size to
>>>>>>> 512 or larger?
>>>>>> So I think you'd better either:
>>>>>>
>>>>>> 1) document the limitation (256) in somewhere, (better both uapi and doc)
>>>>>>
>>>>> But the VDUSE_CONFIG_DATA_LEN doesn't mean the limitation of
>>>>> configuration space. It only means the maximum size of one data
>>>>> transfer for configuration space. Do you mean document this?
>>>> Yes, and another thing is that since you're using
>>>> data[VDUSE_CONFIG_DATA_LEN] in the uapi, it implies the length is always
>>>> 256 which seems not good and not what the code is wrote.
>>>>
>>> How about renaming VDUSE_CONFIG_DATA_LEN to VDUSE_MAX_TRANSFER_LEN?
>>>
>>> Thanks,
>>> Yongji
>>
>> So a question is the reason to have a limitation of this in the uAPI?
>> Note that in vhost-vdpa we don't have such:
>>
>> struct vhost_vdpa_config {
>>           __u32 off;
>>           __u32 len;
>>           __u8 buf[0];
>> };
>>
> If so, we need to call read()/write() multiple times each time
> receiving/sending one request or response in userspace and kernel. For
> example,
>
> 1. read and check request/response type
> 2. read and check config length if type is VDUSE_SET_CONFIG or VDUSE_GET_CONFIG
> 3. read the payload
>
> Not sure if it's worth it.
>
> Thanks,
> Yongji


Right, I see.

So I'm fine with current approach.

Thanks



>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-03-31  8:05 ` [PATCH v6 10/10] Documentation: Add documentation for VDUSE Xie Yongji
  2021-04-08  7:18   ` Jason Wang
@ 2021-04-14 14:14   ` Stefan Hajnoczi
  2021-04-15  5:38     ` Yongji Xie
  1 sibling, 1 reply; 62+ messages in thread
From: Stefan Hajnoczi @ 2021-04-14 14:14 UTC (permalink / raw)
  To: Xie Yongji
  Cc: mst, jasowang, sgarzare, parav, hch, christian.brauner, rdunlap,
	willy, viro, axboe, bcrl, corbet, mika.penttila, dan.carpenter,
	virtualization, netdev, kvm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3614 bytes --]

On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>  2 files changed, 213 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst

Just looking over the documentation briefly (I haven't studied the code
yet)...

> +How VDUSE works
> +------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the character device (/dev/vduse/control). Then a device file with the
> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> +implement the userspace vDPA device's control path and data path.

These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
message? (Please consider reordering the documentation to make it clear
what the sequence of steps are.)

> +	static int netlink_add_vduse(const char *name, int device_id)
> +	{
> +		struct nl_sock *nlsock;
> +		struct nl_msg *msg;
> +		int famid;
> +
> +		nlsock = nl_socket_alloc();
> +		if (!nlsock)
> +			return -ENOMEM;
> +
> +		if (genl_connect(nlsock))
> +			goto free_sock;
> +
> +		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> +		if (famid < 0)
> +			goto close_sock;
> +
> +		msg = nlmsg_alloc();
> +		if (!msg)
> +			goto close_sock;
> +
> +		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> +		    VDPA_CMD_DEV_NEW, 0))
> +			goto nla_put_failure;
> +
> +		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> +		NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> +		NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);

What are the permission/capability requirements for VDUSE?

How does VDUSE interact with namespaces?

What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().

> +MMU-based IOMMU Driver
> +----------------------
> +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> +mapping the kernel DMA buffer into the userspace iova region dynamically.
> +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the DMA buffer in kernel.
> +
> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> +prevent userspace accessing the original buffer directly which may contain other
> +kernel data. During the mapping, unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> +space instead of the original one.

Is mmap(2) the right interface if memory is not actually shared, why not
just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
semantics are clear. For example, don't expect to be able to busy wait
on the memory because changes will not be visible to the other side.

(I guess I'm missing something here and that mmap(2) is the right
approach, but maybe this documentation section can be clarified.)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-14 14:14   ` Stefan Hajnoczi
@ 2021-04-15  5:38     ` Yongji Xie
  2021-04-15  7:19       ` Stefan Hajnoczi
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-15  5:38 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Jason Wang, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  Documentation/userspace-api/index.rst |   1 +
> >  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> >  2 files changed, 213 insertions(+)
> >  create mode 100644 Documentation/userspace-api/vduse.rst
>
> Just looking over the documentation briefly (I haven't studied the code
> yet)...
>

Thank you!

> > +How VDUSE works
> > +------------
> > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > +the character device (/dev/vduse/control). Then a device file with the
> > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > +implement the userspace vDPA device's control path and data path.
>
> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> message? (Please consider reordering the documentation to make it clear
> what the sequence of steps are.)
>

No, VDUSE devices should be created before sending the
VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.

> > +     static int netlink_add_vduse(const char *name, int device_id)
> > +     {
> > +             struct nl_sock *nlsock;
> > +             struct nl_msg *msg;
> > +             int famid;
> > +
> > +             nlsock = nl_socket_alloc();
> > +             if (!nlsock)
> > +                     return -ENOMEM;
> > +
> > +             if (genl_connect(nlsock))
> > +                     goto free_sock;
> > +
> > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > +             if (famid < 0)
> > +                     goto close_sock;
> > +
> > +             msg = nlmsg_alloc();
> > +             if (!msg)
> > +                     goto close_sock;
> > +
> > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > +                 VDPA_CMD_DEV_NEW, 0))
> > +                     goto nla_put_failure;
> > +
> > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>
> What are the permission/capability requirements for VDUSE?
>

Now I think we need privileged permission (root user). Because
userspace daemon is able to access avail vring, used vring, descriptor
table in kernel driver directly.

> How does VDUSE interact with namespaces?
>

Not sure I get your point here. Do you mean how the emulated vDPA
device interact with namespaces? This should work like hardware vDPA
devices do. VDUSE daemon can reside outside the namespace of a
container which uses the vDPA device.

> What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
> v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
>

It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
can be found in include/uapi/linux/vdpa.h.

> > +MMU-based IOMMU Driver
> > +----------------------
> > +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> > +mapping the kernel DMA buffer into the userspace iova region dynamically.
> > +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> > +
> > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > +so that the userspace process is able to use its virtual address to access
> > +the DMA buffer in kernel.
> > +
> > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > +prevent userspace accessing the original buffer directly which may contain other
> > +kernel data. During the mapping, unmapping, the driver will copy the data from
> > +the original buffer to the bounce buffer and back, depending on the direction of
> > +the transfer. And the bounce-buffer addresses will be mapped into the user address
> > +space instead of the original one.
>
> Is mmap(2) the right interface if memory is not actually shared, why not
> just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
> semantics are clear. For example, don't expect to be able to busy wait
> on the memory because changes will not be visible to the other side.
>
> (I guess I'm missing something here and that mmap(2) is the right
> approach, but maybe this documentation section can be clarified.)

It's for performance considerations on the one hand. We might need to
call pread(2)/pwrite(2) multiple times for each request. On the other
hand, we can handle the virtqueue in a unified way for both vhost-vdpa
case and virtio-vdpa case. Otherwise, userspace daemon needs to know
which iova ranges need to be accessed with pread(2)/pwrite(2). And in
the future, we might be able to avoid bouncing in some cases.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15  5:38     ` Yongji Xie
@ 2021-04-15  7:19       ` Stefan Hajnoczi
  2021-04-15  8:33         ` Yongji Xie
  2021-04-15  8:36         ` Jason Wang
  0 siblings, 2 replies; 62+ messages in thread
From: Stefan Hajnoczi @ 2021-04-15  7:19 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Jason Wang, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 7947 bytes --]

On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > implementing software-emulated vDPA devices in userspace. This
> > > document is intended to clarify the VDUSE design and usage.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > ---
> > >  Documentation/userspace-api/index.rst |   1 +
> > >  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > >  2 files changed, 213 insertions(+)
> > >  create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > Just looking over the documentation briefly (I haven't studied the code
> > yet)...
> >
> 
> Thank you!
> 
> > > +How VDUSE works
> > > +------------
> > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > +the character device (/dev/vduse/control). Then a device file with the
> > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > +implement the userspace vDPA device's control path and data path.
> >
> > These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> > message? (Please consider reordering the documentation to make it clear
> > what the sequence of steps are.)
> >
> 
> No, VDUSE devices should be created before sending the
> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.

I see. Please include an overview of the steps before going into detail.
Something like:

  VDUSE devices are started as follows:

  1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
     /dev/vduse/control.

  2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
     messages will arrive while attaching the VDUSE instance to vDPA.

  3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
     instance to vDPA.

  VDUSE devices are stopped as follows:

  ...

> > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > +     {
> > > +             struct nl_sock *nlsock;
> > > +             struct nl_msg *msg;
> > > +             int famid;
> > > +
> > > +             nlsock = nl_socket_alloc();
> > > +             if (!nlsock)
> > > +                     return -ENOMEM;
> > > +
> > > +             if (genl_connect(nlsock))
> > > +                     goto free_sock;
> > > +
> > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > +             if (famid < 0)
> > > +                     goto close_sock;
> > > +
> > > +             msg = nlmsg_alloc();
> > > +             if (!msg)
> > > +                     goto close_sock;
> > > +
> > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > +                 VDPA_CMD_DEV_NEW, 0))
> > > +                     goto nla_put_failure;
> > > +
> > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> >
> > What are the permission/capability requirements for VDUSE?
> >
> 
> Now I think we need privileged permission (root user). Because
> userspace daemon is able to access avail vring, used vring, descriptor
> table in kernel driver directly.

Please state this explicitly at the start of the document. Existing
interfaces like FUSE are designed to avoid trusting userspace. Therefore
people might think the same is the case here. It's critical that people
are aware of this before deploying VDUSE with virtio-vdpa.

We should probably pause here and think about whether it's possible to
avoid trusting userspace. Even if it takes some effort and costs some
performance it would probably be worthwhile.

Is the security situation different with vhost-vdpa? In that case it
seems more likely that the host kernel doesn't need to trust the
userspace VDUSE device.

Regarding privileges in general: userspace VDUSE processes shouldn't
need to run as root. The VDUSE device lifecycle will require privileges
to attach vhost-vdpa and virtio-vdpa devices, but the actual userspace
process that emulates the device should be able to run unprivileged.
Emulated devices are an attack surface and even if you are comfortable
with running them as root in your specific use case, it will be an issue
as soon as other people want to use VDUSE and could give VDUSE a
reputation for poor security.

> > How does VDUSE interact with namespaces?
> >
> 
> Not sure I get your point here. Do you mean how the emulated vDPA
> device interact with namespaces? This should work like hardware vDPA
> devices do. VDUSE daemon can reside outside the namespace of a
> container which uses the vDPA device.

Can VDUSE devices run inside containers? Are /dev/vduse/$NAME and vDPA
device names global?

> > What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
> > v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
> >
> 
> It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
> can be found in include/uapi/linux/vdpa.h.

VDPA_ATTR_DEV_ID is only used by VDPA_CMD_DEV_GET in Linux v5.12-rc6,
not by VDPA_CMD_DEV_NEW.

The example in this document uses VDPA_ATTR_DEV_ID with
VDPA_CMD_DEV_NEW. Is the example outdated?

> 
> > > +MMU-based IOMMU Driver
> > > +----------------------
> > > +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> > > +mapping the kernel DMA buffer into the userspace iova region dynamically.
> > > +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> > > +
> > > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > > +so that the userspace process is able to use its virtual address to access
> > > +the DMA buffer in kernel.
> > > +
> > > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > > +prevent userspace accessing the original buffer directly which may contain other
> > > +kernel data. During the mapping, unmapping, the driver will copy the data from
> > > +the original buffer to the bounce buffer and back, depending on the direction of
> > > +the transfer. And the bounce-buffer addresses will be mapped into the user address
> > > +space instead of the original one.
> >
> > Is mmap(2) the right interface if memory is not actually shared, why not
> > just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
> > semantics are clear. For example, don't expect to be able to busy wait
> > on the memory because changes will not be visible to the other side.
> >
> > (I guess I'm missing something here and that mmap(2) is the right
> > approach, but maybe this documentation section can be clarified.)
> 
> It's for performance considerations on the one hand. We might need to
> call pread(2)/pwrite(2) multiple times for each request.

Userspace can keep page-sized pread() buffers around to avoid additional
syscalls during a request.

mmap() access does reduce the number of syscalls, but it also introduces
page faults (effectively doing the page-sized pread() I mentioned
above).

It's not obvious to me that there is a fundamental difference between
the two approaches in terms of performance.

> On the other
> hand, we can handle the virtqueue in a unified way for both vhost-vdpa
> case and virtio-vdpa case. Otherwise, userspace daemon needs to know
> which iova ranges need to be accessed with pread(2)/pwrite(2). And in
> the future, we might be able to avoid bouncing in some cases.

Ah, I see. So bounce buffers are not used for vhost-vdpa?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15  7:19       ` Stefan Hajnoczi
@ 2021-04-15  8:33         ` Yongji Xie
  2021-04-15 14:17           ` Stefan Hajnoczi
  2021-04-15  8:36         ` Jason Wang
  1 sibling, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-15  8:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Jason Wang, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Thu, Apr 15, 2021 at 3:19 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > > implementing software-emulated vDPA devices in userspace. This
> > > > document is intended to clarify the VDUSE design and usage.
> > > >
> > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > ---
> > > >  Documentation/userspace-api/index.rst |   1 +
> > > >  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 213 insertions(+)
> > > >  create mode 100644 Documentation/userspace-api/vduse.rst
> > >
> > > Just looking over the documentation briefly (I haven't studied the code
> > > yet)...
> > >
> >
> > Thank you!
> >
> > > > +How VDUSE works
> > > > +------------
> > > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > > +the character device (/dev/vduse/control). Then a device file with the
> > > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > > +implement the userspace vDPA device's control path and data path.
> > >
> > > These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> > > message? (Please consider reordering the documentation to make it clear
> > > what the sequence of steps are.)
> > >
> >
> > No, VDUSE devices should be created before sending the
> > VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
>
> I see. Please include an overview of the steps before going into detail.
> Something like:
>
>   VDUSE devices are started as follows:
>
>   1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>      /dev/vduse/control.
>
>   2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>      messages will arrive while attaching the VDUSE instance to vDPA.
>
>   3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>      instance to vDPA.
>
>   VDUSE devices are stopped as follows:
>
>   ...
>

Sure.

> > > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > > +     {
> > > > +             struct nl_sock *nlsock;
> > > > +             struct nl_msg *msg;
> > > > +             int famid;
> > > > +
> > > > +             nlsock = nl_socket_alloc();
> > > > +             if (!nlsock)
> > > > +                     return -ENOMEM;
> > > > +
> > > > +             if (genl_connect(nlsock))
> > > > +                     goto free_sock;
> > > > +
> > > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > > +             if (famid < 0)
> > > > +                     goto close_sock;
> > > > +
> > > > +             msg = nlmsg_alloc();
> > > > +             if (!msg)
> > > > +                     goto close_sock;
> > > > +
> > > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > > +                 VDPA_CMD_DEV_NEW, 0))
> > > > +                     goto nla_put_failure;
> > > > +
> > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> > >
> > > What are the permission/capability requirements for VDUSE?
> > >
> >
> > Now I think we need privileged permission (root user). Because
> > userspace daemon is able to access avail vring, used vring, descriptor
> > table in kernel driver directly.
>
> Please state this explicitly at the start of the document. Existing
> interfaces like FUSE are designed to avoid trusting userspace. Therefore
> people might think the same is the case here. It's critical that people
> are aware of this before deploying VDUSE with virtio-vdpa.
>
> We should probably pause here and think about whether it's possible to
> avoid trusting userspace. Even if it takes some effort and costs some
> performance it would probably be worthwhile.
>
> Is the security situation different with vhost-vdpa? In that case it
> seems more likely that the host kernel doesn't need to trust the
> userspace VDUSE device.
>

Yes.

> Regarding privileges in general: userspace VDUSE processes shouldn't
> need to run as root. The VDUSE device lifecycle will require privileges
> to attach vhost-vdpa and virtio-vdpa devices, but the actual userspace
> process that emulates the device should be able to run unprivileged.
> Emulated devices are an attack surface and even if you are comfortable
> with running them as root in your specific use case, it will be an issue
> as soon as other people want to use VDUSE and could give VDUSE a
> reputation for poor security.
>

Agreed. Rethink about the virtio-vdpa case. The security risks mainly
come from the untrusted user being able to rewrite the content of
avail vring, used vring, descriptor table. But it seems that the worst
result of doing this is getting a broken virtqueue. Not sure if it's
acceptable to kernel.

> > > How does VDUSE interact with namespaces?
> > >
> >
> > Not sure I get your point here. Do you mean how the emulated vDPA
> > device interact with namespaces? This should work like hardware vDPA
> > devices do. VDUSE daemon can reside outside the namespace of a
> > container which uses the vDPA device.
>
> Can VDUSE devices run inside containers? Are /dev/vduse/$NAME and vDPA
> device names global?
>

I think we can run it inside containers. But there might be some
limitations. As you mentioned, the device name is global. So we need
to make sure the VDUSE daemons in different containers don't use the
same name to create vDPA devices.

> > > What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
> > > v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
> > >
> >
> > It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
> > can be found in include/uapi/linux/vdpa.h.
>
> VDPA_ATTR_DEV_ID is only used by VDPA_CMD_DEV_GET in Linux v5.12-rc6,
> not by VDPA_CMD_DEV_NEW.
>
> The example in this document uses VDPA_ATTR_DEV_ID with
> VDPA_CMD_DEV_NEW. Is the example outdated?
>

Oh, you are right. Will update it.

> >
> > > > +MMU-based IOMMU Driver
> > > > +----------------------
> > > > +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> > > > +mapping the kernel DMA buffer into the userspace iova region dynamically.
> > > > +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> > > > +
> > > > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > > > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > > > +so that the userspace process is able to use its virtual address to access
> > > > +the DMA buffer in kernel.
> > > > +
> > > > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > > > +prevent userspace accessing the original buffer directly which may contain other
> > > > +kernel data. During the mapping, unmapping, the driver will copy the data from
> > > > +the original buffer to the bounce buffer and back, depending on the direction of
> > > > +the transfer. And the bounce-buffer addresses will be mapped into the user address
> > > > +space instead of the original one.
> > >
> > > Is mmap(2) the right interface if memory is not actually shared, why not
> > > just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
> > > semantics are clear. For example, don't expect to be able to busy wait
> > > on the memory because changes will not be visible to the other side.
> > >
> > > (I guess I'm missing something here and that mmap(2) is the right
> > > approach, but maybe this documentation section can be clarified.)
> >
> > It's for performance considerations on the one hand. We might need to
> > call pread(2)/pwrite(2) multiple times for each request.
>
> Userspace can keep page-sized pread() buffers around to avoid additional
> syscalls during a request.
>

In the indirect descriptors case , it looks like we can't use one
pread() to get all buffers?

> mmap() access does reduce the number of syscalls, but it also introduces
> page faults (effectively doing the page-sized pread() I mentioned
> above).
>

Yes, but only on the first access.

> It's not obvious to me that there is a fundamental difference between
> the two approaches in terms of performance.
>
> > On the other
> > hand, we can handle the virtqueue in a unified way for both vhost-vdpa
> > case and virtio-vdpa case. Otherwise, userspace daemon needs to know
> > which iova ranges need to be accessed with pread(2)/pwrite(2). And in
> > the future, we might be able to avoid bouncing in some cases.
>
> Ah, I see. So bounce buffers are not used for vhost-vdpa?
>

Yes.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15  7:19       ` Stefan Hajnoczi
  2021-04-15  8:33         ` Yongji Xie
@ 2021-04-15  8:36         ` Jason Wang
  2021-04-15  9:04           ` Jason Wang
  2021-04-15 14:38           ` Stefan Hajnoczi
  1 sibling, 2 replies; 62+ messages in thread
From: Jason Wang @ 2021-04-15  8:36 UTC (permalink / raw)
  To: Stefan Hajnoczi, Yongji Xie
  Cc: Michael S. Tsirkin, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/15 下午3:19, Stefan Hajnoczi 写道:
> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
>>>> VDUSE (vDPA Device in Userspace) is a framework to support
>>>> implementing software-emulated vDPA devices in userspace. This
>>>> document is intended to clarify the VDUSE design and usage.
>>>>
>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>> ---
>>>>   Documentation/userspace-api/index.rst |   1 +
>>>>   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>>>>   2 files changed, 213 insertions(+)
>>>>   create mode 100644 Documentation/userspace-api/vduse.rst
>>> Just looking over the documentation briefly (I haven't studied the code
>>> yet)...
>>>
>> Thank you!
>>
>>>> +How VDUSE works
>>>> +------------
>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>>> +the character device (/dev/vduse/control). Then a device file with the
>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
>>>> +implement the userspace vDPA device's control path and data path.
>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
>>> message? (Please consider reordering the documentation to make it clear
>>> what the sequence of steps are.)
>>>
>> No, VDUSE devices should be created before sending the
>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> I see. Please include an overview of the steps before going into detail.
> Something like:
>
>    VDUSE devices are started as follows:
>
>    1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>       /dev/vduse/control.
>
>    2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>       messages will arrive while attaching the VDUSE instance to vDPA.
>
>    3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>       instance to vDPA.
>
>    VDUSE devices are stopped as follows:
>
>    ...
>
>>>> +     static int netlink_add_vduse(const char *name, int device_id)
>>>> +     {
>>>> +             struct nl_sock *nlsock;
>>>> +             struct nl_msg *msg;
>>>> +             int famid;
>>>> +
>>>> +             nlsock = nl_socket_alloc();
>>>> +             if (!nlsock)
>>>> +                     return -ENOMEM;
>>>> +
>>>> +             if (genl_connect(nlsock))
>>>> +                     goto free_sock;
>>>> +
>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
>>>> +             if (famid < 0)
>>>> +                     goto close_sock;
>>>> +
>>>> +             msg = nlmsg_alloc();
>>>> +             if (!msg)
>>>> +                     goto close_sock;
>>>> +
>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
>>>> +                 VDPA_CMD_DEV_NEW, 0))
>>>> +                     goto nla_put_failure;
>>>> +
>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>>> What are the permission/capability requirements for VDUSE?
>>>
>> Now I think we need privileged permission (root user). Because
>> userspace daemon is able to access avail vring, used vring, descriptor
>> table in kernel driver directly.
> Please state this explicitly at the start of the document. Existing
> interfaces like FUSE are designed to avoid trusting userspace.


There're some subtle difference here. VDUSE present a device to kernel 
which means IOMMU is probably the only thing to prevent a malicous device.


> Therefore
> people might think the same is the case here. It's critical that people
> are aware of this before deploying VDUSE with virtio-vdpa.
>
> We should probably pause here and think about whether it's possible to
> avoid trusting userspace. Even if it takes some effort and costs some
> performance it would probably be worthwhile.


Since the bounce buffer is used the only attack surface is the coherent 
area, if we want to enforce stronger isolation we need to use shadow 
virtqueue (which is proposed in earlier version by me) in this case. But 
I'm not sure it's worth to do that.


>
> Is the security situation different with vhost-vdpa? In that case it
> seems more likely that the host kernel doesn't need to trust the
> userspace VDUSE device.
>
> Regarding privileges in general: userspace VDUSE processes shouldn't
> need to run as root. The VDUSE device lifecycle will require privileges
> to attach vhost-vdpa and virtio-vdpa devices, but the actual userspace
> process that emulates the device should be able to run unprivileged.
> Emulated devices are an attack surface and even if you are comfortable
> with running them as root in your specific use case, it will be an issue
> as soon as other people want to use VDUSE and could give VDUSE a
> reputation for poor security.


In this case, I think it works as other char device:

- privilleged process to create and destroy the VDUSE
- fd is passed via SCM_RIGHTS to unprivilleged process that implements 
the device


>
>>> How does VDUSE interact with namespaces?
>>>
>> Not sure I get your point here. Do you mean how the emulated vDPA
>> device interact with namespaces? This should work like hardware vDPA
>> devices do. VDUSE daemon can reside outside the namespace of a
>> container which uses the vDPA device.
> Can VDUSE devices run inside containers? Are /dev/vduse/$NAME and vDPA
> device names global?


I think it's a global one, we can add namespace on top.


>
>>> What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
>>> v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
>>>
>> It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
>> can be found in include/uapi/linux/vdpa.h.
> VDPA_ATTR_DEV_ID is only used by VDPA_CMD_DEV_GET in Linux v5.12-rc6,
> not by VDPA_CMD_DEV_NEW.
>
> The example in this document uses VDPA_ATTR_DEV_ID with
> VDPA_CMD_DEV_NEW. Is the example outdated?
>
>>>> +MMU-based IOMMU Driver
>>>> +----------------------
>>>> +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
>>>> +mapping the kernel DMA buffer into the userspace iova region dynamically.
>>>> +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
>>>> +
>>>> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
>>>> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
>>>> +so that the userspace process is able to use its virtual address to access
>>>> +the DMA buffer in kernel.
>>>> +
>>>> +And to avoid security issue, a bounce-buffering mechanism is introduced to
>>>> +prevent userspace accessing the original buffer directly which may contain other
>>>> +kernel data. During the mapping, unmapping, the driver will copy the data from
>>>> +the original buffer to the bounce buffer and back, depending on the direction of
>>>> +the transfer. And the bounce-buffer addresses will be mapped into the user address
>>>> +space instead of the original one.
>>> Is mmap(2) the right interface if memory is not actually shared, why not
>>> just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
>>> semantics are clear. For example, don't expect to be able to busy wait
>>> on the memory because changes will not be visible to the other side.
>>>
>>> (I guess I'm missing something here and that mmap(2) is the right
>>> approach, but maybe this documentation section can be clarified.)
>> It's for performance considerations on the one hand. We might need to
>> call pread(2)/pwrite(2) multiple times for each request.
> Userspace can keep page-sized pread() buffers around to avoid additional
> syscalls during a request.


I'm not sure I get here. But the length of the request is not 
necessarily PAGE_SIZE.


>
> mmap() access does reduce the number of syscalls, but it also introduces
> page faults (effectively doing the page-sized pread() I mentioned
> above).


You can access the data directly if there's already a page fault. So 
mmap() should be much faster in this case.


>
> It's not obvious to me that there is a fundamental difference between
> the two approaches in terms of performance.
>
>> On the other
>> hand, we can handle the virtqueue in a unified way for both vhost-vdpa
>> case and virtio-vdpa case. Otherwise, userspace daemon needs to know
>> which iova ranges need to be accessed with pread(2)/pwrite(2). And in
>> the future, we might be able to avoid bouncing in some cases.
> Ah, I see. So bounce buffers are not used for vhost-vdpa?


Yes, VDUSE can pass different fds to usersapce for mmap().

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15  8:36         ` Jason Wang
@ 2021-04-15  9:04           ` Jason Wang
  2021-04-15 11:17             ` Yongji Xie
  2021-04-15 14:38           ` Stefan Hajnoczi
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-15  9:04 UTC (permalink / raw)
  To: Stefan Hajnoczi, Yongji Xie
  Cc: Michael S. Tsirkin, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/15 下午4:36, Jason Wang 写道:
>>>
>> Please state this explicitly at the start of the document. Existing
>> interfaces like FUSE are designed to avoid trusting userspace.
>
>
> There're some subtle difference here. VDUSE present a device to kernel 
> which means IOMMU is probably the only thing to prevent a malicous 
> device.
>
>
>> Therefore
>> people might think the same is the case here. It's critical that people
>> are aware of this before deploying VDUSE with virtio-vdpa.
>>
>> We should probably pause here and think about whether it's possible to
>> avoid trusting userspace. Even if it takes some effort and costs some
>> performance it would probably be worthwhile.
>
>
> Since the bounce buffer is used the only attack surface is the 
> coherent area, if we want to enforce stronger isolation we need to use 
> shadow virtqueue (which is proposed in earlier version by me) in this 
> case. But I'm not sure it's worth to do that.



So this reminds me the discussion in the end of last year. We need to 
make sure we don't suffer from the same issues for VDUSE at least

https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b

Or we can solve it at virtio level, e.g remember the dma address instead 
of depending on the addr in the descriptor ring

Thanks


>
>
>>
>> Is the security situation different with vhost-vdpa? In that case it
>> seems more likely that the host kernel doesn't need to trust the
>> userspace VDUSE device.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15  9:04           ` Jason Wang
@ 2021-04-15 11:17             ` Yongji Xie
  2021-04-16  2:20               ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-15 11:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/15 下午4:36, Jason Wang 写道:
> >>>
> >> Please state this explicitly at the start of the document. Existing
> >> interfaces like FUSE are designed to avoid trusting userspace.
> >
> >
> > There're some subtle difference here. VDUSE present a device to kernel
> > which means IOMMU is probably the only thing to prevent a malicous
> > device.
> >
> >
> >> Therefore
> >> people might think the same is the case here. It's critical that people
> >> are aware of this before deploying VDUSE with virtio-vdpa.
> >>
> >> We should probably pause here and think about whether it's possible to
> >> avoid trusting userspace. Even if it takes some effort and costs some
> >> performance it would probably be worthwhile.
> >
> >
> > Since the bounce buffer is used the only attack surface is the
> > coherent area, if we want to enforce stronger isolation we need to use
> > shadow virtqueue (which is proposed in earlier version by me) in this
> > case. But I'm not sure it's worth to do that.
>
>
>
> So this reminds me the discussion in the end of last year. We need to
> make sure we don't suffer from the same issues for VDUSE at least
>
> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
>
> Or we can solve it at virtio level, e.g remember the dma address instead
> of depending on the addr in the descriptor ring
>

I might miss something. But VDUSE has recorded the dma address during
dma mapping, so we would not do bouncing if the addr/length is invalid
during dma unmapping. Is it enough?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15  8:33         ` Yongji Xie
@ 2021-04-15 14:17           ` Stefan Hajnoczi
  0 siblings, 0 replies; 62+ messages in thread
From: Stefan Hajnoczi @ 2021-04-15 14:17 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Jason Wang, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1023 bytes --]

On Thu, Apr 15, 2021 at 04:33:27PM +0800, Yongji Xie wrote:
> On Thu, Apr 15, 2021 at 3:19 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > It's not obvious to me that there is a fundamental difference between
> > the two approaches in terms of performance.
> >
> > > On the other
> > > hand, we can handle the virtqueue in a unified way for both vhost-vdpa
> > > case and virtio-vdpa case. Otherwise, userspace daemon needs to know
> > > which iova ranges need to be accessed with pread(2)/pwrite(2). And in
> > > the future, we might be able to avoid bouncing in some cases.
> >
> > Ah, I see. So bounce buffers are not used for vhost-vdpa?
> >
> 
> Yes.

Okay, in that case I understand why mmap is used and it's nice to keep
virtio-vpda and vhost-vdpa unified. Thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15  8:36         ` Jason Wang
  2021-04-15  9:04           ` Jason Wang
@ 2021-04-15 14:38           ` Stefan Hajnoczi
  2021-04-16  2:23             ` Jason Wang
  2021-04-16  3:13             ` Yongji Xie
  1 sibling, 2 replies; 62+ messages in thread
From: Stefan Hajnoczi @ 2021-04-15 14:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Michael S. Tsirkin, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5389 bytes --]

On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
> 
> 在 2021/4/15 下午3:19, Stefan Hajnoczi 写道:
> > On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > > > implementing software-emulated vDPA devices in userspace. This
> > > > > document is intended to clarify the VDUSE design and usage.
> > > > > 
> > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > > ---
> > > > >   Documentation/userspace-api/index.rst |   1 +
> > > > >   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > > > >   2 files changed, 213 insertions(+)
> > > > >   create mode 100644 Documentation/userspace-api/vduse.rst
> > > > Just looking over the documentation briefly (I haven't studied the code
> > > > yet)...
> > > > 
> > > Thank you!
> > > 
> > > > > +How VDUSE works
> > > > > +------------
> > > > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > > > +the character device (/dev/vduse/control). Then a device file with the
> > > > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > > > +implement the userspace vDPA device's control path and data path.
> > > > These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> > > > message? (Please consider reordering the documentation to make it clear
> > > > what the sequence of steps are.)
> > > > 
> > > No, VDUSE devices should be created before sending the
> > > VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> > I see. Please include an overview of the steps before going into detail.
> > Something like:
> > 
> >    VDUSE devices are started as follows:
> > 
> >    1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> >       /dev/vduse/control.
> > 
> >    2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> >       messages will arrive while attaching the VDUSE instance to vDPA.
> > 
> >    3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> >       instance to vDPA.
> > 
> >    VDUSE devices are stopped as follows:
> > 
> >    ...
> > 
> > > > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > > > +     {
> > > > > +             struct nl_sock *nlsock;
> > > > > +             struct nl_msg *msg;
> > > > > +             int famid;
> > > > > +
> > > > > +             nlsock = nl_socket_alloc();
> > > > > +             if (!nlsock)
> > > > > +                     return -ENOMEM;
> > > > > +
> > > > > +             if (genl_connect(nlsock))
> > > > > +                     goto free_sock;
> > > > > +
> > > > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > > > +             if (famid < 0)
> > > > > +                     goto close_sock;
> > > > > +
> > > > > +             msg = nlmsg_alloc();
> > > > > +             if (!msg)
> > > > > +                     goto close_sock;
> > > > > +
> > > > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > > > +                 VDPA_CMD_DEV_NEW, 0))
> > > > > +                     goto nla_put_failure;
> > > > > +
> > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> > > > What are the permission/capability requirements for VDUSE?
> > > > 
> > > Now I think we need privileged permission (root user). Because
> > > userspace daemon is able to access avail vring, used vring, descriptor
> > > table in kernel driver directly.
> > Please state this explicitly at the start of the document. Existing
> > interfaces like FUSE are designed to avoid trusting userspace.
> 
> 
> There're some subtle difference here. VDUSE present a device to kernel which
> means IOMMU is probably the only thing to prevent a malicous device.
> 
> 
> > Therefore
> > people might think the same is the case here. It's critical that people
> > are aware of this before deploying VDUSE with virtio-vdpa.
> > 
> > We should probably pause here and think about whether it's possible to
> > avoid trusting userspace. Even if it takes some effort and costs some
> > performance it would probably be worthwhile.
> 
> 
> Since the bounce buffer is used the only attack surface is the coherent
> area, if we want to enforce stronger isolation we need to use shadow
> virtqueue (which is proposed in earlier version by me) in this case. But I'm
> not sure it's worth to do that.

The security situation needs to be clear before merging this feature.

I think the IOMMU and vring can be made secure. What is more concerning
is the kernel code that runs on top: VIRTIO device drivers, network
stack, file systems, etc. They trust devices to an extent.

Since virtio-vdpa is a big reason for doing VDUSE in the first place I
don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
is needed.

I'm going to be offline for a week and don't want to be a bottleneck.
I'll catch up when I'm back.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15 11:17             ` Yongji Xie
@ 2021-04-16  2:20               ` Jason Wang
  2021-04-16  2:58                 ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-16  2:20 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/15 下午7:17, Yongji Xie 写道:
> On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/4/15 下午4:36, Jason Wang 写道:
>>>> Please state this explicitly at the start of the document. Existing
>>>> interfaces like FUSE are designed to avoid trusting userspace.
>>>
>>> There're some subtle difference here. VDUSE present a device to kernel
>>> which means IOMMU is probably the only thing to prevent a malicous
>>> device.
>>>
>>>
>>>> Therefore
>>>> people might think the same is the case here. It's critical that people
>>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>>
>>>> We should probably pause here and think about whether it's possible to
>>>> avoid trusting userspace. Even if it takes some effort and costs some
>>>> performance it would probably be worthwhile.
>>>
>>> Since the bounce buffer is used the only attack surface is the
>>> coherent area, if we want to enforce stronger isolation we need to use
>>> shadow virtqueue (which is proposed in earlier version by me) in this
>>> case. But I'm not sure it's worth to do that.
>>
>>
>> So this reminds me the discussion in the end of last year. We need to
>> make sure we don't suffer from the same issues for VDUSE at least
>>
>> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
>>
>> Or we can solve it at virtio level, e.g remember the dma address instead
>> of depending on the addr in the descriptor ring
>>
> I might miss something. But VDUSE has recorded the dma address during
> dma mapping, so we would not do bouncing if the addr/length is invalid
> during dma unmapping. Is it enough?


E.g malicous device write a buggy dma address in the descriptor ring, so 
we had:

vring_unmap_one_split(desc->addr, desc->len)
     dma_unmap_single()
         vduse_dev_unmap_page()
             vduse_domain_bounce()

And in vduse_domain_bounce() we had:

         while (size) {
                 map = &domain->bounce_maps[iova >> PAGE_SHIFT];
                 offset = offset_in_page(iova);
                 sz = min_t(size_t, PAGE_SIZE - offset, size);

This means we trust the iova which is dangerous and exacly the issue 
mentioned in the above link.

 From VDUSE level need to make sure iova is legal.

 From virtio level, we should not truse desc->addr.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15 14:38           ` Stefan Hajnoczi
@ 2021-04-16  2:23             ` Jason Wang
  2021-04-16  3:19               ` Yongji Xie
  2021-04-16  3:13             ` Yongji Xie
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-16  2:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Yongji Xie, Michael S. Tsirkin, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/15 下午10:38, Stefan Hajnoczi 写道:
> On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
>> 在 2021/4/15 下午3:19, Stefan Hajnoczi 写道:
>>> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
>>>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
>>>>>> VDUSE (vDPA Device in Userspace) is a framework to support
>>>>>> implementing software-emulated vDPA devices in userspace. This
>>>>>> document is intended to clarify the VDUSE design and usage.
>>>>>>
>>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>>> ---
>>>>>>    Documentation/userspace-api/index.rst |   1 +
>>>>>>    Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>>>>>>    2 files changed, 213 insertions(+)
>>>>>>    create mode 100644 Documentation/userspace-api/vduse.rst
>>>>> Just looking over the documentation briefly (I haven't studied the code
>>>>> yet)...
>>>>>
>>>> Thank you!
>>>>
>>>>>> +How VDUSE works
>>>>>> +------------
>>>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>>>>> +the character device (/dev/vduse/control). Then a device file with the
>>>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
>>>>>> +implement the userspace vDPA device's control path and data path.
>>>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
>>>>> message? (Please consider reordering the documentation to make it clear
>>>>> what the sequence of steps are.)
>>>>>
>>>> No, VDUSE devices should be created before sending the
>>>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
>>> I see. Please include an overview of the steps before going into detail.
>>> Something like:
>>>
>>>     VDUSE devices are started as follows:
>>>
>>>     1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>>>        /dev/vduse/control.
>>>
>>>     2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>>>        messages will arrive while attaching the VDUSE instance to vDPA.
>>>
>>>     3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>>>        instance to vDPA.
>>>
>>>     VDUSE devices are stopped as follows:
>>>
>>>     ...
>>>
>>>>>> +     static int netlink_add_vduse(const char *name, int device_id)
>>>>>> +     {
>>>>>> +             struct nl_sock *nlsock;
>>>>>> +             struct nl_msg *msg;
>>>>>> +             int famid;
>>>>>> +
>>>>>> +             nlsock = nl_socket_alloc();
>>>>>> +             if (!nlsock)
>>>>>> +                     return -ENOMEM;
>>>>>> +
>>>>>> +             if (genl_connect(nlsock))
>>>>>> +                     goto free_sock;
>>>>>> +
>>>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
>>>>>> +             if (famid < 0)
>>>>>> +                     goto close_sock;
>>>>>> +
>>>>>> +             msg = nlmsg_alloc();
>>>>>> +             if (!msg)
>>>>>> +                     goto close_sock;
>>>>>> +
>>>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
>>>>>> +                 VDPA_CMD_DEV_NEW, 0))
>>>>>> +                     goto nla_put_failure;
>>>>>> +
>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
>>>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>>>>> What are the permission/capability requirements for VDUSE?
>>>>>
>>>> Now I think we need privileged permission (root user). Because
>>>> userspace daemon is able to access avail vring, used vring, descriptor
>>>> table in kernel driver directly.
>>> Please state this explicitly at the start of the document. Existing
>>> interfaces like FUSE are designed to avoid trusting userspace.
>>
>> There're some subtle difference here. VDUSE present a device to kernel which
>> means IOMMU is probably the only thing to prevent a malicous device.
>>
>>
>>> Therefore
>>> people might think the same is the case here. It's critical that people
>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>
>>> We should probably pause here and think about whether it's possible to
>>> avoid trusting userspace. Even if it takes some effort and costs some
>>> performance it would probably be worthwhile.
>>
>> Since the bounce buffer is used the only attack surface is the coherent
>> area, if we want to enforce stronger isolation we need to use shadow
>> virtqueue (which is proposed in earlier version by me) in this case. But I'm
>> not sure it's worth to do that.
> The security situation needs to be clear before merging this feature.


+1


>
> I think the IOMMU and vring can be made secure. What is more concerning
> is the kernel code that runs on top: VIRTIO device drivers, network
> stack, file systems, etc. They trust devices to an extent.
>
> Since virtio-vdpa is a big reason for doing VDUSE in the first place I
> don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
> is needed.


Yes, so the case of VDUSE is something similar to the case of e.g SEV.

Both cases won't trust device and use some kind of software IOTLB.

That means we need to protect at both IOTLB and virtio drivers.

Let me post patches for virtio first.


>
> I'm going to be offline for a week and don't want to be a bottleneck.
> I'll catch up when I'm back.


Thanks a lot for comments and I think we had sufficent time to make 
VDUSE safe before merging.


>
> Stefan


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-16  2:20               ` Jason Wang
@ 2021-04-16  2:58                 ` Yongji Xie
  2021-04-16  3:02                   ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-16  2:58 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Fri, Apr 16, 2021 at 10:20 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/15 下午7:17, Yongji Xie 写道:
> > On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/4/15 下午4:36, Jason Wang 写道:
> >>>> Please state this explicitly at the start of the document. Existing
> >>>> interfaces like FUSE are designed to avoid trusting userspace.
> >>>
> >>> There're some subtle difference here. VDUSE present a device to kernel
> >>> which means IOMMU is probably the only thing to prevent a malicous
> >>> device.
> >>>
> >>>
> >>>> Therefore
> >>>> people might think the same is the case here. It's critical that people
> >>>> are aware of this before deploying VDUSE with virtio-vdpa.
> >>>>
> >>>> We should probably pause here and think about whether it's possible to
> >>>> avoid trusting userspace. Even if it takes some effort and costs some
> >>>> performance it would probably be worthwhile.
> >>>
> >>> Since the bounce buffer is used the only attack surface is the
> >>> coherent area, if we want to enforce stronger isolation we need to use
> >>> shadow virtqueue (which is proposed in earlier version by me) in this
> >>> case. But I'm not sure it's worth to do that.
> >>
> >>
> >> So this reminds me the discussion in the end of last year. We need to
> >> make sure we don't suffer from the same issues for VDUSE at least
> >>
> >> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
> >>
> >> Or we can solve it at virtio level, e.g remember the dma address instead
> >> of depending on the addr in the descriptor ring
> >>
> > I might miss something. But VDUSE has recorded the dma address during
> > dma mapping, so we would not do bouncing if the addr/length is invalid
> > during dma unmapping. Is it enough?
>
>
> E.g malicous device write a buggy dma address in the descriptor ring, so
> we had:
>
> vring_unmap_one_split(desc->addr, desc->len)
>      dma_unmap_single()
>          vduse_dev_unmap_page()
>              vduse_domain_bounce()
>
> And in vduse_domain_bounce() we had:
>
>          while (size) {
>                  map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>                  offset = offset_in_page(iova);
>                  sz = min_t(size_t, PAGE_SIZE - offset, size);
>
> This means we trust the iova which is dangerous and exacly the issue
> mentioned in the above link.
>
>  From VDUSE level need to make sure iova is legal.
>

I think we already do that in vduse_domain_bounce():

    while (size) {
        map = &domain->bounce_maps[iova >> PAGE_SHIFT];

        if (WARN_ON(!map->bounce_page ||
            map->orig_phys == INVALID_PHYS_ADDR))
            return;


>  From virtio level, we should not truse desc->addr.
>

We would not touch desc->addr after vring_unmap_one_split(). So I'm
not sure what we need to do at the virtio level.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-16  2:58                 ` Yongji Xie
@ 2021-04-16  3:02                   ` Jason Wang
  2021-04-16  3:18                     ` Yongji Xie
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-16  3:02 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/16 上午10:58, Yongji Xie 写道:
> On Fri, Apr 16, 2021 at 10:20 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/4/15 下午7:17, Yongji Xie 写道:
>>> On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> 在 2021/4/15 下午4:36, Jason Wang 写道:
>>>>>> Please state this explicitly at the start of the document. Existing
>>>>>> interfaces like FUSE are designed to avoid trusting userspace.
>>>>> There're some subtle difference here. VDUSE present a device to kernel
>>>>> which means IOMMU is probably the only thing to prevent a malicous
>>>>> device.
>>>>>
>>>>>
>>>>>> Therefore
>>>>>> people might think the same is the case here. It's critical that people
>>>>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>>>>
>>>>>> We should probably pause here and think about whether it's possible to
>>>>>> avoid trusting userspace. Even if it takes some effort and costs some
>>>>>> performance it would probably be worthwhile.
>>>>> Since the bounce buffer is used the only attack surface is the
>>>>> coherent area, if we want to enforce stronger isolation we need to use
>>>>> shadow virtqueue (which is proposed in earlier version by me) in this
>>>>> case. But I'm not sure it's worth to do that.
>>>>
>>>> So this reminds me the discussion in the end of last year. We need to
>>>> make sure we don't suffer from the same issues for VDUSE at least
>>>>
>>>> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
>>>>
>>>> Or we can solve it at virtio level, e.g remember the dma address instead
>>>> of depending on the addr in the descriptor ring
>>>>
>>> I might miss something. But VDUSE has recorded the dma address during
>>> dma mapping, so we would not do bouncing if the addr/length is invalid
>>> during dma unmapping. Is it enough?
>>
>> E.g malicous device write a buggy dma address in the descriptor ring, so
>> we had:
>>
>> vring_unmap_one_split(desc->addr, desc->len)
>>       dma_unmap_single()
>>           vduse_dev_unmap_page()
>>               vduse_domain_bounce()
>>
>> And in vduse_domain_bounce() we had:
>>
>>           while (size) {
>>                   map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>                   offset = offset_in_page(iova);
>>                   sz = min_t(size_t, PAGE_SIZE - offset, size);
>>
>> This means we trust the iova which is dangerous and exacly the issue
>> mentioned in the above link.
>>
>>   From VDUSE level need to make sure iova is legal.
>>
> I think we already do that in vduse_domain_bounce():
>
>      while (size) {
>          map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>
>          if (WARN_ON(!map->bounce_page ||
>              map->orig_phys == INVALID_PHYS_ADDR))
>              return;


So you don't check whether iova is legal before using it, so it's at 
least a possible out of bound access of the bounce_maps[] isn't it? (e.g 
what happens if iova is ULLONG_MAX).


>
>
>>   From virtio level, we should not truse desc->addr.
>>
> We would not touch desc->addr after vring_unmap_one_split(). So I'm
> not sure what we need to do at the virtio level.


I think the point is to record the dma addres/len somewhere instead of 
reading them from descriptor ring.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-15 14:38           ` Stefan Hajnoczi
  2021-04-16  2:23             ` Jason Wang
@ 2021-04-16  3:13             ` Yongji Xie
  1 sibling, 0 replies; 62+ messages in thread
From: Yongji Xie @ 2021-04-16  3:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jason Wang, Michael S. Tsirkin, Stefano Garzarella, Parav Pandit,
	Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Thu, Apr 15, 2021 at 10:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
> >
> > 在 2021/4/15 下午3:19, Stefan Hajnoczi 写道:
> > > On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > > > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > > > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > > > > implementing software-emulated vDPA devices in userspace. This
> > > > > > document is intended to clarify the VDUSE design and usage.
> > > > > >
> > > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > > > ---
> > > > > >   Documentation/userspace-api/index.rst |   1 +
> > > > > >   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > > > > >   2 files changed, 213 insertions(+)
> > > > > >   create mode 100644 Documentation/userspace-api/vduse.rst
> > > > > Just looking over the documentation briefly (I haven't studied the code
> > > > > yet)...
> > > > >
> > > > Thank you!
> > > >
> > > > > > +How VDUSE works
> > > > > > +------------
> > > > > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > > > > +the character device (/dev/vduse/control). Then a device file with the
> > > > > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > > > > +implement the userspace vDPA device's control path and data path.
> > > > > These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> > > > > message? (Please consider reordering the documentation to make it clear
> > > > > what the sequence of steps are.)
> > > > >
> > > > No, VDUSE devices should be created before sending the
> > > > VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> > > I see. Please include an overview of the steps before going into detail.
> > > Something like:
> > >
> > >    VDUSE devices are started as follows:
> > >
> > >    1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > >       /dev/vduse/control.
> > >
> > >    2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> > >       messages will arrive while attaching the VDUSE instance to vDPA.
> > >
> > >    3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> > >       instance to vDPA.
> > >
> > >    VDUSE devices are stopped as follows:
> > >
> > >    ...
> > >
> > > > > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > > > > +     {
> > > > > > +             struct nl_sock *nlsock;
> > > > > > +             struct nl_msg *msg;
> > > > > > +             int famid;
> > > > > > +
> > > > > > +             nlsock = nl_socket_alloc();
> > > > > > +             if (!nlsock)
> > > > > > +                     return -ENOMEM;
> > > > > > +
> > > > > > +             if (genl_connect(nlsock))
> > > > > > +                     goto free_sock;
> > > > > > +
> > > > > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > > > > +             if (famid < 0)
> > > > > > +                     goto close_sock;
> > > > > > +
> > > > > > +             msg = nlmsg_alloc();
> > > > > > +             if (!msg)
> > > > > > +                     goto close_sock;
> > > > > > +
> > > > > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > > > > +                 VDPA_CMD_DEV_NEW, 0))
> > > > > > +                     goto nla_put_failure;
> > > > > > +
> > > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > > > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> > > > > What are the permission/capability requirements for VDUSE?
> > > > >
> > > > Now I think we need privileged permission (root user). Because
> > > > userspace daemon is able to access avail vring, used vring, descriptor
> > > > table in kernel driver directly.
> > > Please state this explicitly at the start of the document. Existing
> > > interfaces like FUSE are designed to avoid trusting userspace.
> >
> >
> > There're some subtle difference here. VDUSE present a device to kernel which
> > means IOMMU is probably the only thing to prevent a malicous device.
> >
> >
> > > Therefore
> > > people might think the same is the case here. It's critical that people
> > > are aware of this before deploying VDUSE with virtio-vdpa.
> > >
> > > We should probably pause here and think about whether it's possible to
> > > avoid trusting userspace. Even if it takes some effort and costs some
> > > performance it would probably be worthwhile.
> >
> >
> > Since the bounce buffer is used the only attack surface is the coherent
> > area, if we want to enforce stronger isolation we need to use shadow
> > virtqueue (which is proposed in earlier version by me) in this case. But I'm
> > not sure it's worth to do that.
>
> The security situation needs to be clear before merging this feature.
>
> I think the IOMMU and vring can be made secure. What is more concerning
> is the kernel code that runs on top: VIRTIO device drivers, network
> stack, file systems, etc. They trust devices to an extent.
>

I will dig into it to see if there is any security issue.

> Since virtio-vdpa is a big reason for doing VDUSE in the first place I
> don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
> is needed.
>
> I'm going to be offline for a week and don't want to be a bottleneck.
> I'll catch up when I'm back.
>

Thanks for your comments!

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-16  3:02                   ` Jason Wang
@ 2021-04-16  3:18                     ` Yongji Xie
  0 siblings, 0 replies; 62+ messages in thread
From: Yongji Xie @ 2021-04-16  3:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Fri, Apr 16, 2021 at 11:03 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/16 上午10:58, Yongji Xie 写道:
> > On Fri, Apr 16, 2021 at 10:20 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/4/15 下午7:17, Yongji Xie 写道:
> >>> On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> 在 2021/4/15 下午4:36, Jason Wang 写道:
> >>>>>> Please state this explicitly at the start of the document. Existing
> >>>>>> interfaces like FUSE are designed to avoid trusting userspace.
> >>>>> There're some subtle difference here. VDUSE present a device to kernel
> >>>>> which means IOMMU is probably the only thing to prevent a malicous
> >>>>> device.
> >>>>>
> >>>>>
> >>>>>> Therefore
> >>>>>> people might think the same is the case here. It's critical that people
> >>>>>> are aware of this before deploying VDUSE with virtio-vdpa.
> >>>>>>
> >>>>>> We should probably pause here and think about whether it's possible to
> >>>>>> avoid trusting userspace. Even if it takes some effort and costs some
> >>>>>> performance it would probably be worthwhile.
> >>>>> Since the bounce buffer is used the only attack surface is the
> >>>>> coherent area, if we want to enforce stronger isolation we need to use
> >>>>> shadow virtqueue (which is proposed in earlier version by me) in this
> >>>>> case. But I'm not sure it's worth to do that.
> >>>>
> >>>> So this reminds me the discussion in the end of last year. We need to
> >>>> make sure we don't suffer from the same issues for VDUSE at least
> >>>>
> >>>> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
> >>>>
> >>>> Or we can solve it at virtio level, e.g remember the dma address instead
> >>>> of depending on the addr in the descriptor ring
> >>>>
> >>> I might miss something. But VDUSE has recorded the dma address during
> >>> dma mapping, so we would not do bouncing if the addr/length is invalid
> >>> during dma unmapping. Is it enough?
> >>
> >> E.g malicous device write a buggy dma address in the descriptor ring, so
> >> we had:
> >>
> >> vring_unmap_one_split(desc->addr, desc->len)
> >>       dma_unmap_single()
> >>           vduse_dev_unmap_page()
> >>               vduse_domain_bounce()
> >>
> >> And in vduse_domain_bounce() we had:
> >>
> >>           while (size) {
> >>                   map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> >>                   offset = offset_in_page(iova);
> >>                   sz = min_t(size_t, PAGE_SIZE - offset, size);
> >>
> >> This means we trust the iova which is dangerous and exacly the issue
> >> mentioned in the above link.
> >>
> >>   From VDUSE level need to make sure iova is legal.
> >>
> > I think we already do that in vduse_domain_bounce():
> >
> >      while (size) {
> >          map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> >
> >          if (WARN_ON(!map->bounce_page ||
> >              map->orig_phys == INVALID_PHYS_ADDR))
> >              return;
>
>
> So you don't check whether iova is legal before using it, so it's at
> least a possible out of bound access of the bounce_maps[] isn't it? (e.g
> what happens if iova is ULLONG_MAX).
>

Oh, yes. Will do it!

>
> >
> >
> >>   From virtio level, we should not truse desc->addr.
> >>
> > We would not touch desc->addr after vring_unmap_one_split(). So I'm
> > not sure what we need to do at the virtio level.
>
>
> I think the point is to record the dma addres/len somewhere instead of
> reading them from descriptor ring.
>

OK, I see.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-16  2:23             ` Jason Wang
@ 2021-04-16  3:19               ` Yongji Xie
  2021-04-16  5:39                 ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Yongji Xie @ 2021-04-16  3:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Fri, Apr 16, 2021 at 10:24 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/4/15 下午10:38, Stefan Hajnoczi 写道:
> > On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
> >> 在 2021/4/15 下午3:19, Stefan Hajnoczi 写道:
> >>> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> >>>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> >>>>>> VDUSE (vDPA Device in Userspace) is a framework to support
> >>>>>> implementing software-emulated vDPA devices in userspace. This
> >>>>>> document is intended to clarify the VDUSE design and usage.
> >>>>>>
> >>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>>>>> ---
> >>>>>>    Documentation/userspace-api/index.rst |   1 +
> >>>>>>    Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> >>>>>>    2 files changed, 213 insertions(+)
> >>>>>>    create mode 100644 Documentation/userspace-api/vduse.rst
> >>>>> Just looking over the documentation briefly (I haven't studied the code
> >>>>> yet)...
> >>>>>
> >>>> Thank you!
> >>>>
> >>>>>> +How VDUSE works
> >>>>>> +------------
> >>>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> >>>>>> +the character device (/dev/vduse/control). Then a device file with the
> >>>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> >>>>>> +implement the userspace vDPA device's control path and data path.
> >>>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> >>>>> message? (Please consider reordering the documentation to make it clear
> >>>>> what the sequence of steps are.)
> >>>>>
> >>>> No, VDUSE devices should be created before sending the
> >>>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> >>> I see. Please include an overview of the steps before going into detail.
> >>> Something like:
> >>>
> >>>     VDUSE devices are started as follows:
> >>>
> >>>     1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> >>>        /dev/vduse/control.
> >>>
> >>>     2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> >>>        messages will arrive while attaching the VDUSE instance to vDPA.
> >>>
> >>>     3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> >>>        instance to vDPA.
> >>>
> >>>     VDUSE devices are stopped as follows:
> >>>
> >>>     ...
> >>>
> >>>>>> +     static int netlink_add_vduse(const char *name, int device_id)
> >>>>>> +     {
> >>>>>> +             struct nl_sock *nlsock;
> >>>>>> +             struct nl_msg *msg;
> >>>>>> +             int famid;
> >>>>>> +
> >>>>>> +             nlsock = nl_socket_alloc();
> >>>>>> +             if (!nlsock)
> >>>>>> +                     return -ENOMEM;
> >>>>>> +
> >>>>>> +             if (genl_connect(nlsock))
> >>>>>> +                     goto free_sock;
> >>>>>> +
> >>>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> >>>>>> +             if (famid < 0)
> >>>>>> +                     goto close_sock;
> >>>>>> +
> >>>>>> +             msg = nlmsg_alloc();
> >>>>>> +             if (!msg)
> >>>>>> +                     goto close_sock;
> >>>>>> +
> >>>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> >>>>>> +                 VDPA_CMD_DEV_NEW, 0))
> >>>>>> +                     goto nla_put_failure;
> >>>>>> +
> >>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> >>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> >>>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> >>>>> What are the permission/capability requirements for VDUSE?
> >>>>>
> >>>> Now I think we need privileged permission (root user). Because
> >>>> userspace daemon is able to access avail vring, used vring, descriptor
> >>>> table in kernel driver directly.
> >>> Please state this explicitly at the start of the document. Existing
> >>> interfaces like FUSE are designed to avoid trusting userspace.
> >>
> >> There're some subtle difference here. VDUSE present a device to kernel which
> >> means IOMMU is probably the only thing to prevent a malicous device.
> >>
> >>
> >>> Therefore
> >>> people might think the same is the case here. It's critical that people
> >>> are aware of this before deploying VDUSE with virtio-vdpa.
> >>>
> >>> We should probably pause here and think about whether it's possible to
> >>> avoid trusting userspace. Even if it takes some effort and costs some
> >>> performance it would probably be worthwhile.
> >>
> >> Since the bounce buffer is used the only attack surface is the coherent
> >> area, if we want to enforce stronger isolation we need to use shadow
> >> virtqueue (which is proposed in earlier version by me) in this case. But I'm
> >> not sure it's worth to do that.
> > The security situation needs to be clear before merging this feature.
>
>
> +1
>
>
> >
> > I think the IOMMU and vring can be made secure. What is more concerning
> > is the kernel code that runs on top: VIRTIO device drivers, network
> > stack, file systems, etc. They trust devices to an extent.
> >
> > Since virtio-vdpa is a big reason for doing VDUSE in the first place I
> > don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
> > is needed.
>
>
> Yes, so the case of VDUSE is something similar to the case of e.g SEV.
>
> Both cases won't trust device and use some kind of software IOTLB.
>
> That means we need to protect at both IOTLB and virtio drivers.
>
> Let me post patches for virtio first.
>

Looking forward your patches.

Thanks.
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-03-31  8:05 ` [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
  2021-04-08  6:57   ` Jason Wang
@ 2021-04-16  3:24   ` Jason Wang
  2021-04-16  8:43     ` Yongji Xie
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-04-16  3:24 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter
  Cc: virtualization, netdev, kvm, linux-fsdevel


在 2021/3/31 下午4:05, Xie Yongji 写道:
> +	}
> +	case VDUSE_INJECT_VQ_IRQ:
> +		ret = -EINVAL;
> +		if (arg >= dev->vq_num)
> +			break;
> +
> +		ret = 0;
> +		queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
> +		break;


One additional note:

Please use array_index_nospec() for all vqs[idx] access where idx is 
under the control of userspace to avoid potential spectre exploitation.

Thanks


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 10/10] Documentation: Add documentation for VDUSE
  2021-04-16  3:19               ` Yongji Xie
@ 2021-04-16  5:39                 ` Jason Wang
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Wang @ 2021-04-16  5:39 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel


在 2021/4/16 上午11:19, Yongji Xie 写道:
> On Fri, Apr 16, 2021 at 10:24 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/4/15 下午10:38, Stefan Hajnoczi 写道:
>>> On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
>>>> 在 2021/4/15 下午3:19, Stefan Hajnoczi 写道:
>>>>> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
>>>>>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
>>>>>>>> VDUSE (vDPA Device in Userspace) is a framework to support
>>>>>>>> implementing software-emulated vDPA devices in userspace. This
>>>>>>>> document is intended to clarify the VDUSE design and usage.
>>>>>>>>
>>>>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>>>>> ---
>>>>>>>>     Documentation/userspace-api/index.rst |   1 +
>>>>>>>>     Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>>>>>>>>     2 files changed, 213 insertions(+)
>>>>>>>>     create mode 100644 Documentation/userspace-api/vduse.rst
>>>>>>> Just looking over the documentation briefly (I haven't studied the code
>>>>>>> yet)...
>>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>>> +How VDUSE works
>>>>>>>> +------------
>>>>>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>>>>>>> +the character device (/dev/vduse/control). Then a device file with the
>>>>>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
>>>>>>>> +implement the userspace vDPA device's control path and data path.
>>>>>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
>>>>>>> message? (Please consider reordering the documentation to make it clear
>>>>>>> what the sequence of steps are.)
>>>>>>>
>>>>>> No, VDUSE devices should be created before sending the
>>>>>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
>>>>> I see. Please include an overview of the steps before going into detail.
>>>>> Something like:
>>>>>
>>>>>      VDUSE devices are started as follows:
>>>>>
>>>>>      1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>>>>>         /dev/vduse/control.
>>>>>
>>>>>      2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>>>>>         messages will arrive while attaching the VDUSE instance to vDPA.
>>>>>
>>>>>      3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>>>>>         instance to vDPA.
>>>>>
>>>>>      VDUSE devices are stopped as follows:
>>>>>
>>>>>      ...
>>>>>
>>>>>>>> +     static int netlink_add_vduse(const char *name, int device_id)
>>>>>>>> +     {
>>>>>>>> +             struct nl_sock *nlsock;
>>>>>>>> +             struct nl_msg *msg;
>>>>>>>> +             int famid;
>>>>>>>> +
>>>>>>>> +             nlsock = nl_socket_alloc();
>>>>>>>> +             if (!nlsock)
>>>>>>>> +                     return -ENOMEM;
>>>>>>>> +
>>>>>>>> +             if (genl_connect(nlsock))
>>>>>>>> +                     goto free_sock;
>>>>>>>> +
>>>>>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
>>>>>>>> +             if (famid < 0)
>>>>>>>> +                     goto close_sock;
>>>>>>>> +
>>>>>>>> +             msg = nlmsg_alloc();
>>>>>>>> +             if (!msg)
>>>>>>>> +                     goto close_sock;
>>>>>>>> +
>>>>>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
>>>>>>>> +                 VDPA_CMD_DEV_NEW, 0))
>>>>>>>> +                     goto nla_put_failure;
>>>>>>>> +
>>>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
>>>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
>>>>>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>>>>>>> What are the permission/capability requirements for VDUSE?
>>>>>>>
>>>>>> Now I think we need privileged permission (root user). Because
>>>>>> userspace daemon is able to access avail vring, used vring, descriptor
>>>>>> table in kernel driver directly.
>>>>> Please state this explicitly at the start of the document. Existing
>>>>> interfaces like FUSE are designed to avoid trusting userspace.
>>>> There're some subtle difference here. VDUSE present a device to kernel which
>>>> means IOMMU is probably the only thing to prevent a malicous device.
>>>>
>>>>
>>>>> Therefore
>>>>> people might think the same is the case here. It's critical that people
>>>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>>>
>>>>> We should probably pause here and think about whether it's possible to
>>>>> avoid trusting userspace. Even if it takes some effort and costs some
>>>>> performance it would probably be worthwhile.
>>>> Since the bounce buffer is used the only attack surface is the coherent
>>>> area, if we want to enforce stronger isolation we need to use shadow
>>>> virtqueue (which is proposed in earlier version by me) in this case. But I'm
>>>> not sure it's worth to do that.
>>> The security situation needs to be clear before merging this feature.
>>
>> +1
>>
>>
>>> I think the IOMMU and vring can be made secure. What is more concerning
>>> is the kernel code that runs on top: VIRTIO device drivers, network
>>> stack, file systems, etc. They trust devices to an extent.
>>>
>>> Since virtio-vdpa is a big reason for doing VDUSE in the first place I
>>> don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
>>> is needed.
>>
>> Yes, so the case of VDUSE is something similar to the case of e.g SEV.
>>
>> Both cases won't trust device and use some kind of software IOTLB.
>>
>> That means we need to protect at both IOTLB and virtio drivers.
>>
>> Let me post patches for virtio first.
>>
> Looking forward your patches.
>
> Thanks.
> Yongji
>

Fortuantely, packed ring has already did this since the descriptor talbe 
is expected to be re-wrote by the device. I just need to conver the 
split ring.

Thanks




^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Re: [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-04-16  3:24   ` Jason Wang
@ 2021-04-16  8:43     ` Yongji Xie
  0 siblings, 0 replies; 62+ messages in thread
From: Yongji Xie @ 2021-04-16  8:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, virtualization, netdev, kvm, linux-fsdevel

On Fri, Apr 16, 2021 at 11:24 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/3/31 下午4:05, Xie Yongji 写道:
> > +     }
> > +     case VDUSE_INJECT_VQ_IRQ:
> > +             ret = -EINVAL;
> > +             if (arg >= dev->vq_num)
> > +                     break;
> > +
> > +             ret = 0;
> > +             queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
> > +             break;
>
>
> One additional note:
>
> Please use array_index_nospec() for all vqs[idx] access where idx is
> under the control of userspace to avoid potential spectre exploitation.
>

OK, I see.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2021-04-16  8:43 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-31  8:05 [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-03-31  8:05 ` [PATCH v6 01/10] file: Export receive_fd() to modules Xie Yongji
2021-03-31  9:15   ` Christian Brauner
2021-03-31  9:26     ` Dan Carpenter
2021-03-31  9:28       ` Christian Brauner
2021-03-31 11:32     ` Yongji Xie
2021-03-31 12:23       ` Christian Brauner
2021-03-31 13:59         ` Yongji Xie
2021-03-31 14:07           ` Christian Brauner
2021-03-31 14:37             ` Yongji Xie
2021-03-31  8:05 ` [PATCH v6 02/10] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
2021-03-31  8:05 ` [PATCH v6 03/10] vhost-vdpa: protect concurrent access to vhost device iotlb Xie Yongji
2021-04-09 16:15   ` Michael S. Tsirkin
2021-04-11  5:36     ` Yongji Xie
2021-04-11 20:48       ` Michael S. Tsirkin
2021-04-12  2:29         ` Yongji Xie
2021-04-12  9:00           ` Michael S. Tsirkin
2021-03-31  8:05 ` [PATCH v6 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
2021-03-31  8:05 ` [PATCH v6 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
2021-03-31  8:05 ` [PATCH v6 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap() Xie Yongji
2021-03-31  8:05 ` [PATCH v6 07/10] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
2021-04-08  2:36   ` Jason Wang
2021-03-31  8:05 ` [PATCH v6 08/10] vduse: Implement an MMU-based IOMMU driver Xie Yongji
2021-04-08  3:25   ` Jason Wang
2021-04-08  5:27     ` Yongji Xie
2021-03-31  8:05 ` [PATCH v6 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-04-08  6:57   ` Jason Wang
2021-04-08  9:36     ` Yongji Xie
2021-04-09  5:36       ` Jason Wang
2021-04-09  8:02         ` Yongji Xie
2021-04-12  7:16           ` Jason Wang
2021-04-12  8:02             ` Yongji Xie
2021-04-12  9:37               ` Jason Wang
2021-04-12  9:59                 ` Yongji Xie
2021-04-13  3:35                   ` Jason Wang
2021-04-13  4:28                     ` Yongji Xie
2021-04-14  8:18                       ` Jason Wang
2021-04-16  3:24   ` Jason Wang
2021-04-16  8:43     ` Yongji Xie
2021-03-31  8:05 ` [PATCH v6 10/10] Documentation: Add documentation for VDUSE Xie Yongji
2021-04-08  7:18   ` Jason Wang
2021-04-08  8:09     ` Yongji Xie
2021-04-14 14:14   ` Stefan Hajnoczi
2021-04-15  5:38     ` Yongji Xie
2021-04-15  7:19       ` Stefan Hajnoczi
2021-04-15  8:33         ` Yongji Xie
2021-04-15 14:17           ` Stefan Hajnoczi
2021-04-15  8:36         ` Jason Wang
2021-04-15  9:04           ` Jason Wang
2021-04-15 11:17             ` Yongji Xie
2021-04-16  2:20               ` Jason Wang
2021-04-16  2:58                 ` Yongji Xie
2021-04-16  3:02                   ` Jason Wang
2021-04-16  3:18                     ` Yongji Xie
2021-04-15 14:38           ` Stefan Hajnoczi
2021-04-16  2:23             ` Jason Wang
2021-04-16  3:19               ` Yongji Xie
2021-04-16  5:39                 ` Jason Wang
2021-04-16  3:13             ` Yongji Xie
2021-04-14  7:34 ` [PATCH v6 00/10] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
2021-04-14  7:49   ` Jason Wang
2021-04-14  7:54   ` Yongji Xie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).