All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-15 14:13 ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

This series introduces a framework that makes it possible to implement
software-emulated vDPA devices in userspace. And to make it simple, the
emulated vDPA device's control path is handled in the kernel and only the
data path is implemented in the userspace.

Since the emuldated vDPA device's control path is handled in the kernel,
a message mechnism is introduced to make userspace be aware of the data
path related changes. Userspace can use read()/write() to receive/reply
the control messages.

In the data path, the core is mapping dma buffer into VDUSE daemon's
address space, which can be implemented in different ways depending on
the vdpa bus to which the vDPA device is attached.

In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
buffer is reside in a userspace memory region which can be shared to the
VDUSE userspace processs via transferring the shmfd.

The details and our user case is shown below:

------------------------    -------------------------   ----------------------------------------------
|            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
|       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
|       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
------------+-----------     -----------+------------   -------------+----------------------+---------
            |                           |                            |                      |
            |                           |                            |                      |
------------+---------------------------+----------------------------+----------------------+---------
|    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
|    -------+--------           --------+--------            -------+--------          -----+----    |
|           |                           |                           |                       |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
| | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
|           |      virtio bus           |                           |                       |        |
|   --------+----+-----------           |                           |                       |        |
|                |                      |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|      | virtio-blk device |            |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|                |                      |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|     |  virtio-vdpa driver |           |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|                |                      |                           |    vdpa bus           |        |
|     -----------+----------------------+---------------------------+------------           |        |
|                                                                                        ---+---     |
-----------------------------------------------------------------------------------------| NIC |------
                                                                                         ---+---
                                                                                            |
                                                                                   ---------+---------
                                                                                   | Remote Storages |
                                                                                   -------------------

We make use of it to implement a block device connecting to
our distributed storage, which can be used both in containers and
VMs. Thus, we can have an unified technology stack in this two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

To make the userspace VDUSE processes such as qemu-storage-daemon able to
be run by an unprivileged user. We did some works on virtio driver to avoid
trusting device, including:

  - validating the used length: 

    * https://lore.kernel.org/lkml/20210531135852.113-1-xieyongji@bytedance.com/
    * https://lore.kernel.org/lkml/20210525125622.1203-1-xieyongji@bytedance.com/

  - validating the device config:
    
    * https://lore.kernel.org/lkml/20210615104810.151-1-xieyongji@bytedance.com/

  - validating the device response:

    * https://lore.kernel.org/lkml/20210615105218.214-1-xieyongji@bytedance.com/

Since I'm not sure if I missing something during auditing, especially on some
virtio device drivers that I'm not familiar with, we limit the supported device
type to virtio block device currently. The support for other device types can be
added after the security issue of corresponding device driver is clarified or
fixed in the future.

Future work:
  - Improve performance
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
  - Support more device types

V7 to V8:
- Rebased to newest kernel tree
- Rework VDUSE driver to handle the device's control path in kernel
- Limit the supported device type to virtio block device
- Export free_iova_fast()
- Remove the virtio-blk and virtio-scsi patches (will send them alone)
- Remove all module parameters
- Use the same MAJOR for both control device and VDUSE devices
- Avoid eventfd cleanup in vduse_dev_release()

V6 to V7:
- Export alloc_iova_fast()
- Add get_config_size() callback
- Add some patches to avoid trusting virtio devices
- Add limited device emulation
- Add some documents
- Use workqueue to inject config irq
- Add parameter on vq irq injecting
- Rename vduse_domain_get_mapping_page() to vduse_domain_get_coherent_page()
- Add WARN_ON() to catch message failure
- Add some padding/reserved fields to uAPI structure
- Fix some bugs
- Rebase to vhost.git

V5 to V6:
- Export receive_fd() instead of __receive_fd()
- Factor out the unmapping logic of pa and va separatedly
- Remove the logic of bounce page allocation in page fault handler
- Use PAGE_SIZE as IOVA allocation granule
- Add EPOLLOUT support
- Enable setting API version in userspace
- Fix some bugs

V4 to V5:
- Remove the patch for irq binding
- Use a single IOTLB for all types of mapping
- Factor out vhost_vdpa_pa_map()
- Add some sample codes in document
- Use receice_fd_user() to pass file descriptor
- Fix some bugs

V3 to V4:
- Rebase to vhost.git
- Split some patches
- Add some documents
- Use ioctl to inject interrupt rather than eventfd
- Enable config interrupt support
- Support binding irq to the specified cpu
- Add two module parameter to limit bounce/iova size
- Create char device rather than anon inode per vduse
- Reuse vhost IOTLB for iova domain
- Rework the message mechnism in control path

V2 to V3:
- Rework the MMU-based IOMMU driver
- Use the iova domain as iova allocator instead of genpool
- Support transferring vma->vm_file in vhost-vdpa
- Add SVA support in vhost-vdpa
- Remove the patches on bounce pages reclaim

V1 to V2:
- Add vhost-vdpa support
- Add some documents
- Based on the vdpa management tool
- Introduce a workqueue for irq injection
- Replace interval tree with array map to store the iova_map

Xie Yongji (10):
  iova: Export alloc_iova_fast() and free_iova_fast();
  file: Export receive_fd() to modules
  eventfd: Increase the recursion depth of eventfd_signal()
  vhost-iotlb: Add an opaque pointer for vhost IOTLB
  vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
  vdpa: Support transferring virtual addressing during DMA mapping
  vduse: Implement an MMU-based IOMMU driver
  vduse: Introduce VDUSE - vDPA Device in Userspace
  Documentation: Add documentation for VDUSE

 Documentation/userspace-api/index.rst              |    1 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 Documentation/userspace-api/vduse.rst              |  222 +++
 drivers/iommu/iova.c                               |    2 +
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
 drivers/vdpa/vdpa.c                                |    9 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  545 ++++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   73 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
 drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
 drivers/vhost/iotlb.c                              |   20 +-
 drivers/vhost/vdpa.c                               |  148 +-
 fs/eventfd.c                                       |    2 +-
 fs/file.c                                          |    6 +
 include/linux/eventfd.h                            |    5 +-
 include/linux/file.h                               |    7 +-
 include/linux/vdpa.h                               |   21 +-
 include/linux/vhost_iotlb.h                        |    3 +
 include/uapi/linux/vduse.h                         |  143 ++
 24 files changed, 2641 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/userspace-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-15 14:13 ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

This series introduces a framework that makes it possible to implement
software-emulated vDPA devices in userspace. And to make it simple, the
emulated vDPA device's control path is handled in the kernel and only the
data path is implemented in the userspace.

Since the emuldated vDPA device's control path is handled in the kernel,
a message mechnism is introduced to make userspace be aware of the data
path related changes. Userspace can use read()/write() to receive/reply
the control messages.

In the data path, the core is mapping dma buffer into VDUSE daemon's
address space, which can be implemented in different ways depending on
the vdpa bus to which the vDPA device is attached.

In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with
bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
buffer is reside in a userspace memory region which can be shared to the
VDUSE userspace processs via transferring the shmfd.

The details and our user case is shown below:

------------------------    -------------------------   ----------------------------------------------
|            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
|       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
|       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
------------+-----------     -----------+------------   -------------+----------------------+---------
            |                           |                            |                      |
            |                           |                            |                      |
------------+---------------------------+----------------------------+----------------------+---------
|    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
|    -------+--------           --------+--------            -------+--------          -----+----    |
|           |                           |                           |                       |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
| | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
|           |      virtio bus           |                           |                       |        |
|   --------+----+-----------           |                           |                       |        |
|                |                      |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|      | virtio-blk device |            |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|                |                      |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|     |  virtio-vdpa driver |           |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|                |                      |                           |    vdpa bus           |        |
|     -----------+----------------------+---------------------------+------------           |        |
|                                                                                        ---+---     |
-----------------------------------------------------------------------------------------| NIC |------
                                                                                         ---+---
                                                                                            |
                                                                                   ---------+---------
                                                                                   | Remote Storages |
                                                                                   -------------------

We make use of it to implement a block device connecting to
our distributed storage, which can be used both in containers and
VMs. Thus, we can have an unified technology stack in this two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

To make the userspace VDUSE processes such as qemu-storage-daemon able to
be run by an unprivileged user. We did some works on virtio driver to avoid
trusting device, including:

  - validating the used length: 

    * https://lore.kernel.org/lkml/20210531135852.113-1-xieyongji@bytedance.com/
    * https://lore.kernel.org/lkml/20210525125622.1203-1-xieyongji@bytedance.com/

  - validating the device config:
    
    * https://lore.kernel.org/lkml/20210615104810.151-1-xieyongji@bytedance.com/

  - validating the device response:

    * https://lore.kernel.org/lkml/20210615105218.214-1-xieyongji@bytedance.com/

Since I'm not sure if I missing something during auditing, especially on some
virtio device drivers that I'm not familiar with, we limit the supported device
type to virtio block device currently. The support for other device types can be
added after the security issue of corresponding device driver is clarified or
fixed in the future.

Future work:
  - Improve performance
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
  - Support more device types

V7 to V8:
- Rebased to newest kernel tree
- Rework VDUSE driver to handle the device's control path in kernel
- Limit the supported device type to virtio block device
- Export free_iova_fast()
- Remove the virtio-blk and virtio-scsi patches (will send them alone)
- Remove all module parameters
- Use the same MAJOR for both control device and VDUSE devices
- Avoid eventfd cleanup in vduse_dev_release()

V6 to V7:
- Export alloc_iova_fast()
- Add get_config_size() callback
- Add some patches to avoid trusting virtio devices
- Add limited device emulation
- Add some documents
- Use workqueue to inject config irq
- Add parameter on vq irq injecting
- Rename vduse_domain_get_mapping_page() to vduse_domain_get_coherent_page()
- Add WARN_ON() to catch message failure
- Add some padding/reserved fields to uAPI structure
- Fix some bugs
- Rebase to vhost.git

V5 to V6:
- Export receive_fd() instead of __receive_fd()
- Factor out the unmapping logic of pa and va separatedly
- Remove the logic of bounce page allocation in page fault handler
- Use PAGE_SIZE as IOVA allocation granule
- Add EPOLLOUT support
- Enable setting API version in userspace
- Fix some bugs

V4 to V5:
- Remove the patch for irq binding
- Use a single IOTLB for all types of mapping
- Factor out vhost_vdpa_pa_map()
- Add some sample codes in document
- Use receice_fd_user() to pass file descriptor
- Fix some bugs

V3 to V4:
- Rebase to vhost.git
- Split some patches
- Add some documents
- Use ioctl to inject interrupt rather than eventfd
- Enable config interrupt support
- Support binding irq to the specified cpu
- Add two module parameter to limit bounce/iova size
- Create char device rather than anon inode per vduse
- Reuse vhost IOTLB for iova domain
- Rework the message mechnism in control path

V2 to V3:
- Rework the MMU-based IOMMU driver
- Use the iova domain as iova allocator instead of genpool
- Support transferring vma->vm_file in vhost-vdpa
- Add SVA support in vhost-vdpa
- Remove the patches on bounce pages reclaim

V1 to V2:
- Add vhost-vdpa support
- Add some documents
- Based on the vdpa management tool
- Introduce a workqueue for irq injection
- Replace interval tree with array map to store the iova_map

Xie Yongji (10):
  iova: Export alloc_iova_fast() and free_iova_fast();
  file: Export receive_fd() to modules
  eventfd: Increase the recursion depth of eventfd_signal()
  vhost-iotlb: Add an opaque pointer for vhost IOTLB
  vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
  vdpa: Support transferring virtual addressing during DMA mapping
  vduse: Implement an MMU-based IOMMU driver
  vduse: Introduce VDUSE - vDPA Device in Userspace
  Documentation: Add documentation for VDUSE

 Documentation/userspace-api/index.rst              |    1 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 Documentation/userspace-api/vduse.rst              |  222 +++
 drivers/iommu/iova.c                               |    2 +
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
 drivers/vdpa/vdpa.c                                |    9 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  545 ++++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |   73 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
 drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
 drivers/vhost/iotlb.c                              |   20 +-
 drivers/vhost/vdpa.c                               |  148 +-
 fs/eventfd.c                                       |    2 +-
 fs/file.c                                          |    6 +
 include/linux/eventfd.h                            |    5 +-
 include/linux/file.h                               |    7 +-
 include/linux/vdpa.h                               |   21 +-
 include/linux/vhost_iotlb.h                        |    3 +
 include/uapi/linux/vduse.h                         |  143 ++
 24 files changed, 2641 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/userspace-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 01/10] iova: Export alloc_iova_fast() and free_iova_fast();
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

Export alloc_iova_fast() and free_iova_fast() so that
some modules can use it to improve iova allocation efficiency.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/iommu/iova.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b7ecd5b08039..59916b4b7fe9 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -518,6 +518,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 
 	return new_iova->pfn_lo;
 }
+EXPORT_SYMBOL_GPL(alloc_iova_fast);
 
 /**
  * free_iova_fast - free iova pfn range into rcache
@@ -535,6 +536,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)
 
 	free_iova(iovad, pfn);
 }
+EXPORT_SYMBOL_GPL(free_iova_fast);
 
 #define fq_ring_for_each(i, fq) \
 	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 01/10] iova: Export alloc_iova_fast() and free_iova_fast();
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

Export alloc_iova_fast() and free_iova_fast() so that
some modules can use it to improve iova allocation efficiency.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/iommu/iova.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b7ecd5b08039..59916b4b7fe9 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -518,6 +518,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 
 	return new_iova->pfn_lo;
 }
+EXPORT_SYMBOL_GPL(alloc_iova_fast);
 
 /**
  * free_iova_fast - free iova pfn range into rcache
@@ -535,6 +536,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)
 
 	free_iova(iovad, pfn);
 }
+EXPORT_SYMBOL_GPL(free_iova_fast);
 
 #define fq_ring_for_each(i, fq) \
 	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 02/10] file: Export receive_fd() to modules
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

Export receive_fd() so that some modules can use
it to pass file descriptor between processes without
missing any security stuffs.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/file.c            | 6 ++++++
 include/linux/file.h | 7 +++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 86dc9956af32..210e540672aa 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
 	return new_fd;
 }
 
+int receive_fd(struct file *file, unsigned int o_flags)
+{
+	return __receive_fd(file, NULL, o_flags);
+}
+EXPORT_SYMBOL_GPL(receive_fd);
+
 static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 {
 	int err = -EBADF;
diff --git a/include/linux/file.h b/include/linux/file.h
index 2de2e4613d7b..51e830b4fe3a 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -94,6 +94,9 @@ extern void fd_install(unsigned int fd, struct file *file);
 
 extern int __receive_fd(struct file *file, int __user *ufd,
 			unsigned int o_flags);
+
+extern int receive_fd(struct file *file, unsigned int o_flags);
+
 static inline int receive_fd_user(struct file *file, int __user *ufd,
 				  unsigned int o_flags)
 {
@@ -101,10 +104,6 @@ static inline int receive_fd_user(struct file *file, int __user *ufd,
 		return -EFAULT;
 	return __receive_fd(file, ufd, o_flags);
 }
-static inline int receive_fd(struct file *file, unsigned int o_flags)
-{
-	return __receive_fd(file, NULL, o_flags);
-}
 int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags);
 
 extern void flush_delayed_fput(void);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 02/10] file: Export receive_fd() to modules
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

Export receive_fd() so that some modules can use
it to pass file descriptor between processes without
missing any security stuffs.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 fs/file.c            | 6 ++++++
 include/linux/file.h | 7 +++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 86dc9956af32..210e540672aa 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
 	return new_fd;
 }
 
+int receive_fd(struct file *file, unsigned int o_flags)
+{
+	return __receive_fd(file, NULL, o_flags);
+}
+EXPORT_SYMBOL_GPL(receive_fd);
+
 static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 {
 	int err = -EBADF;
diff --git a/include/linux/file.h b/include/linux/file.h
index 2de2e4613d7b..51e830b4fe3a 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -94,6 +94,9 @@ extern void fd_install(unsigned int fd, struct file *file);
 
 extern int __receive_fd(struct file *file, int __user *ufd,
 			unsigned int o_flags);
+
+extern int receive_fd(struct file *file, unsigned int o_flags);
+
 static inline int receive_fd_user(struct file *file, int __user *ufd,
 				  unsigned int o_flags)
 {
@@ -101,10 +104,6 @@ static inline int receive_fd_user(struct file *file, int __user *ufd,
 		return -EFAULT;
 	return __receive_fd(file, ufd, o_flags);
 }
-static inline int receive_fd(struct file *file, unsigned int o_flags)
-{
-	return __receive_fd(file, NULL, o_flags);
-}
 int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags);
 
 extern void flush_delayed_fput(void);
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

Increase the recursion depth of eventfd_signal() to 1. This
is the maximum recursion depth we have found so far, which
can be triggered with the following call chain:

    kvm_io_bus_write                        [kvm]
      --> ioeventfd_write                   [kvm]
        --> eventfd_signal                  [eventfd]
          --> vhost_poll_wakeup             [vhost]
            --> vduse_vdpa_kick_vq          [vduse]
              --> eventfd_signal            [eventfd]

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 fs/eventfd.c            | 2 +-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..cc7cd1dbedd3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..886d99cd38ef 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* Maximum recursion depth */
+#define EFD_WAKE_DEPTH 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

Increase the recursion depth of eventfd_signal() to 1. This
is the maximum recursion depth we have found so far, which
can be triggered with the following call chain:

    kvm_io_bus_write                        [kvm]
      --> ioeventfd_write                   [kvm]
        --> eventfd_signal                  [eventfd]
          --> vhost_poll_wakeup             [vhost]
            --> vduse_vdpa_kick_vq          [vduse]
              --> eventfd_signal            [eventfd]

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 fs/eventfd.c            | 2 +-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..cc7cd1dbedd3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..886d99cd38ef 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* Maximum recursion depth */
+#define EFD_WAKE_DEPTH 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

Add an opaque pointer for vhost IOTLB. And introduce
vhost_iotlb_add_range_ctx() to accept it.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/iotlb.c       | 20 ++++++++++++++++----
 include/linux/vhost_iotlb.h |  3 +++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..5c99e1112cbb 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -36,19 +36,21 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
 EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
 
 /**
- * vhost_iotlb_add_range - add a new range to vhost IOTLB
+ * vhost_iotlb_add_range_ctx - add a new range to vhost IOTLB
  * @iotlb: the IOTLB
  * @start: start of the IOVA range
  * @last: last of IOVA range
  * @addr: the address that is mapped to @start
  * @perm: access permission of this range
+ * @opaque: the opaque pointer for the new mapping
  *
  * Returns an error last is smaller than start or memory allocation
  * fails
  */
-int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
-			  u64 start, u64 last,
-			  u64 addr, unsigned int perm)
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
+			      u64 start, u64 last,
+			      u64 addr, unsigned int perm,
+			      void *opaque)
 {
 	struct vhost_iotlb_map *map;
 
@@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 	map->last = last;
 	map->addr = addr;
 	map->perm = perm;
+	map->opaque = opaque;
 
 	iotlb->nmaps++;
 	vhost_iotlb_itree_insert(map, &iotlb->root);
@@ -80,6 +83,15 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_iotlb_add_range_ctx);
+
+int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
+			  u64 start, u64 last,
+			  u64 addr, unsigned int perm)
+{
+	return vhost_iotlb_add_range_ctx(iotlb, start, last,
+					 addr, perm, NULL);
+}
 EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
 
 /**
diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
index 6b09b786a762..2d0e2f52f938 100644
--- a/include/linux/vhost_iotlb.h
+++ b/include/linux/vhost_iotlb.h
@@ -17,6 +17,7 @@ struct vhost_iotlb_map {
 	u32 perm;
 	u32 flags_padding;
 	u64 __subtree_last;
+	void *opaque;
 };
 
 #define VHOST_IOTLB_FLAG_RETIRE 0x1
@@ -29,6 +30,8 @@ struct vhost_iotlb {
 	unsigned int flags;
 };
 
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb, u64 start, u64 last,
+			      u64 addr, unsigned int perm, void *opaque);
 int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
 			  u64 addr, unsigned int perm);
 void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

Add an opaque pointer for vhost IOTLB. And introduce
vhost_iotlb_add_range_ctx() to accept it.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/iotlb.c       | 20 ++++++++++++++++----
 include/linux/vhost_iotlb.h |  3 +++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..5c99e1112cbb 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -36,19 +36,21 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
 EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
 
 /**
- * vhost_iotlb_add_range - add a new range to vhost IOTLB
+ * vhost_iotlb_add_range_ctx - add a new range to vhost IOTLB
  * @iotlb: the IOTLB
  * @start: start of the IOVA range
  * @last: last of IOVA range
  * @addr: the address that is mapped to @start
  * @perm: access permission of this range
+ * @opaque: the opaque pointer for the new mapping
  *
  * Returns an error last is smaller than start or memory allocation
  * fails
  */
-int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
-			  u64 start, u64 last,
-			  u64 addr, unsigned int perm)
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
+			      u64 start, u64 last,
+			      u64 addr, unsigned int perm,
+			      void *opaque)
 {
 	struct vhost_iotlb_map *map;
 
@@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 	map->last = last;
 	map->addr = addr;
 	map->perm = perm;
+	map->opaque = opaque;
 
 	iotlb->nmaps++;
 	vhost_iotlb_itree_insert(map, &iotlb->root);
@@ -80,6 +83,15 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vhost_iotlb_add_range_ctx);
+
+int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
+			  u64 start, u64 last,
+			  u64 addr, unsigned int perm)
+{
+	return vhost_iotlb_add_range_ctx(iotlb, start, last,
+					 addr, perm, NULL);
+}
 EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
 
 /**
diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
index 6b09b786a762..2d0e2f52f938 100644
--- a/include/linux/vhost_iotlb.h
+++ b/include/linux/vhost_iotlb.h
@@ -17,6 +17,7 @@ struct vhost_iotlb_map {
 	u32 perm;
 	u32 flags_padding;
 	u64 __subtree_last;
+	void *opaque;
 };
 
 #define VHOST_IOTLB_FLAG_RETIRE 0x1
@@ -29,6 +30,8 @@ struct vhost_iotlb {
 	unsigned int flags;
 };
 
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb, u64 start, u64 last,
+			      u64 addr, unsigned int perm, void *opaque);
 int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
 			  u64 addr, unsigned int perm);
 void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

Add an opaque pointer for DMA mapping.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 6 +++---
 drivers/vhost/vdpa.c             | 2 +-
 include/linux/vdpa.h             | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 98f793bc9376..efd0cb3d964d 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -542,14 +542,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
 }
 
 static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
-			   u64 pa, u32 perm)
+			   u64 pa, u32 perm, void *opaque)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
 	int ret;
 
 	spin_lock(&vdpasim->iommu_lock);
-	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
-				    perm);
+	ret = vhost_iotlb_add_range_ctx(vdpasim->iommu, iova, iova + size - 1,
+					pa, perm, opaque);
 	spin_unlock(&vdpasim->iommu_lock);
 
 	return ret;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index fb41db3da611..1d5c5c6b6d5d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -565,7 +565,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index f311d227aa1b..281f768cb597 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -245,7 +245,7 @@ struct vdpa_config_ops {
 	/* DMA ops */
 	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
 	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
-		       u64 pa, u32 perm);
+		       u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
 
 	/* Free device resources */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

Add an opaque pointer for DMA mapping.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 6 +++---
 drivers/vhost/vdpa.c             | 2 +-
 include/linux/vdpa.h             | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 98f793bc9376..efd0cb3d964d 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -542,14 +542,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
 }
 
 static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
-			   u64 pa, u32 perm)
+			   u64 pa, u32 perm, void *opaque)
 {
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
 	int ret;
 
 	spin_lock(&vdpasim->iommu_lock);
-	ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
-				    perm);
+	ret = vhost_iotlb_add_range_ctx(vdpasim->iommu, iova, iova + size - 1,
+					pa, perm, opaque);
 	spin_unlock(&vdpasim->iommu_lock);
 
 	return ret;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index fb41db3da611..1d5c5c6b6d5d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -565,7 +565,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index f311d227aa1b..281f768cb597 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -245,7 +245,7 @@ struct vdpa_config_ops {
 	/* DMA ops */
 	int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
 	int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
-		       u64 pa, u32 perm);
+		       u64 pa, u32 perm, void *opaque);
 	int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
 
 	/* Free device resources */
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

The upcoming patch is going to support VA mapping/unmapping.
So let's factor out the logic of PA mapping/unmapping firstly
to make the code more readable.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vdpa.c | 53 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 1d5c5c6b6d5d..c5ec45b920f8 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -498,7 +498,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 	return r;
 }
 
-static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vhost_iotlb *iotlb = dev->iotlb;
@@ -520,6 +520,11 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 	}
 }
 
+static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+{
+	return vhost_vdpa_pa_unmap(v, start, last);
+}
+
 static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v)
 {
 	struct vhost_dev *dev = &v->vdev;
@@ -600,37 +605,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
-static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
-					   struct vhost_iotlb_msg *msg)
+static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
+			     u64 iova, u64 size, u64 uaddr, u32 perm)
 {
 	struct vhost_dev *dev = &v->vdev;
-	struct vhost_iotlb *iotlb = dev->iotlb;
 	struct page **page_list;
 	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
-	u64 iova = msg->iova;
+	u64 start = iova;
 	long pinned;
 	int ret = 0;
 
-	if (msg->iova < v->range.first ||
-	    msg->iova + msg->size - 1 > v->range.last)
-		return -EINVAL;
-
-	if (vhost_iotlb_itree_first(iotlb, msg->iova,
-				    msg->iova + msg->size - 1))
-		return -EEXIST;
-
 	/* Limit the use of memory for bookkeeping */
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
 	if (!page_list)
 		return -ENOMEM;
 
-	if (msg->perm & VHOST_ACCESS_WO)
+	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;
 
-	npages = PAGE_ALIGN(msg->size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
+	npages = PAGE_ALIGN(size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
@@ -644,7 +640,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 		goto unlock;
 	}
 
-	cur_base = msg->uaddr & PAGE_MASK;
+	cur_base = uaddr & PAGE_MASK;
 	iova &= PAGE_MASK;
 	nchunks = 0;
 
@@ -675,7 +671,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     msg->perm);
+						     perm);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -704,7 +700,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, msg->perm);
+			     map_pfn << PAGE_SHIFT, perm);
 out:
 	if (ret) {
 		if (nchunks) {
@@ -723,13 +719,32 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 			for (pfn = map_pfn; pfn <= last_pfn; pfn++)
 				unpin_user_page(pfn_to_page(pfn));
 		}
-		vhost_vdpa_unmap(v, msg->iova, msg->size);
+		vhost_vdpa_unmap(v, start, size);
 	}
 unlock:
 	mmap_read_unlock(dev->mm);
 free:
 	free_page((unsigned long)page_list);
 	return ret;
+
+}
+
+static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
+					   struct vhost_iotlb_msg *msg)
+{
+	struct vhost_dev *dev = &v->vdev;
+	struct vhost_iotlb *iotlb = dev->iotlb;
+
+	if (msg->iova < v->range.first ||
+	    msg->iova + msg->size - 1 > v->range.last)
+		return -EINVAL;
+
+	if (vhost_iotlb_itree_first(iotlb, msg->iova,
+				    msg->iova + msg->size - 1))
+		return -EEXIST;
+
+	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
+				 msg->perm);
 }
 
 static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

The upcoming patch is going to support VA mapping/unmapping.
So let's factor out the logic of PA mapping/unmapping firstly
to make the code more readable.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vdpa.c | 53 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 1d5c5c6b6d5d..c5ec45b920f8 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -498,7 +498,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 	return r;
 }
 
-static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vhost_iotlb *iotlb = dev->iotlb;
@@ -520,6 +520,11 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 	}
 }
 
+static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+{
+	return vhost_vdpa_pa_unmap(v, start, last);
+}
+
 static void vhost_vdpa_iotlb_free(struct vhost_vdpa *v)
 {
 	struct vhost_dev *dev = &v->vdev;
@@ -600,37 +605,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
-static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
-					   struct vhost_iotlb_msg *msg)
+static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
+			     u64 iova, u64 size, u64 uaddr, u32 perm)
 {
 	struct vhost_dev *dev = &v->vdev;
-	struct vhost_iotlb *iotlb = dev->iotlb;
 	struct page **page_list;
 	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
-	u64 iova = msg->iova;
+	u64 start = iova;
 	long pinned;
 	int ret = 0;
 
-	if (msg->iova < v->range.first ||
-	    msg->iova + msg->size - 1 > v->range.last)
-		return -EINVAL;
-
-	if (vhost_iotlb_itree_first(iotlb, msg->iova,
-				    msg->iova + msg->size - 1))
-		return -EEXIST;
-
 	/* Limit the use of memory for bookkeeping */
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
 	if (!page_list)
 		return -ENOMEM;
 
-	if (msg->perm & VHOST_ACCESS_WO)
+	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;
 
-	npages = PAGE_ALIGN(msg->size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
+	npages = PAGE_ALIGN(size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
@@ -644,7 +640,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 		goto unlock;
 	}
 
-	cur_base = msg->uaddr & PAGE_MASK;
+	cur_base = uaddr & PAGE_MASK;
 	iova &= PAGE_MASK;
 	nchunks = 0;
 
@@ -675,7 +671,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     msg->perm);
+						     perm);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -704,7 +700,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, msg->perm);
+			     map_pfn << PAGE_SHIFT, perm);
 out:
 	if (ret) {
 		if (nchunks) {
@@ -723,13 +719,32 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 			for (pfn = map_pfn; pfn <= last_pfn; pfn++)
 				unpin_user_page(pfn_to_page(pfn));
 		}
-		vhost_vdpa_unmap(v, msg->iova, msg->size);
+		vhost_vdpa_unmap(v, start, size);
 	}
 unlock:
 	mmap_read_unlock(dev->mm);
 free:
 	free_page((unsigned long)page_list);
 	return ret;
+
+}
+
+static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
+					   struct vhost_iotlb_msg *msg)
+{
+	struct vhost_dev *dev = &v->vdev;
+	struct vhost_iotlb *iotlb = dev->iotlb;
+
+	if (msg->iova < v->range.first ||
+	    msg->iova + msg->size - 1 > v->range.last)
+		return -EINVAL;
+
+	if (vhost_iotlb_itree_first(iotlb, msg->iova,
+				    msg->iova + msg->size - 1))
+		return -EEXIST;
+
+	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
+				 msg->perm);
 }
 
 static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 07/10] vdpa: Support transferring virtual addressing during DMA mapping
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

This patch introduces an attribute for vDPA device to indicate
whether virtual address can be used. If vDPA device driver set
it, vhost-vdpa bus driver will not pin user page and transfer
userspace virtual address instead of physical address during
DMA mapping. And corresponding vma->vm_file and offset will be
also passed as an opaque pointer.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
 drivers/vdpa/vdpa.c               |  9 +++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  |  2 +-
 drivers/vdpa/virtio_pci/vp_vdpa.c |  2 +-
 drivers/vhost/vdpa.c              | 99 ++++++++++++++++++++++++++++++++++-----
 include/linux/vdpa.h              | 19 ++++++--
 7 files changed, 116 insertions(+), 19 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index ab0ab5cf0f6e..daf9746e51e6 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -476,7 +476,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
-				    dev, &ifc_vdpa_ops, NULL);
+				    dev, &ifc_vdpa_ops, NULL, false);
 	if (adapter == NULL) {
 		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
 		return -ENOMEM;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index dda5dc6f7737..2b7ca111f039 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2012,7 +2012,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name)
 	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
 
 	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
-				 name);
+				 name, false);
 	if (IS_ERR(ndev))
 		return PTR_ERR(ndev);
 
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index bb3f1d1f0422..8f01d6a7ecc5 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -71,6 +71,7 @@ static void vdpa_release_dev(struct device *d)
  * @config: the bus operations that is supported by this device
  * @size: size of the parent structure that contains private data
  * @name: name of the vdpa device; optional.
+ * @use_va: indicate whether virtual address must be used by this device
  *
  * Driver should use vdpa_alloc_device() wrapper macro instead of
  * using this directly.
@@ -80,7 +81,8 @@ static void vdpa_release_dev(struct device *d)
  */
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					size_t size, const char *name)
+					size_t size, const char *name,
+					bool use_va)
 {
 	struct vdpa_device *vdev;
 	int err = -EINVAL;
@@ -91,6 +93,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	if (!!config->dma_map != !!config->dma_unmap)
 		goto err;
 
+	/* It should only work for the device that use on-chip IOMMU */
+	if (use_va && !(config->dma_map || config->set_map))
+		goto err;
+
 	err = -ENOMEM;
 	vdev = kzalloc(size, GFP_KERNEL);
 	if (!vdev)
@@ -106,6 +112,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	vdev->index = err;
 	vdev->config = config;
 	vdev->features_valid = false;
+	vdev->use_va = use_va;
 
 	if (name)
 		err = dev_set_name(&vdev->dev, "%s", name);
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index efd0cb3d964d..a43479cf57ea 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -250,7 +250,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
 		ops = &vdpasim_config_ops;
 
 	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
-				    dev_attr->name);
+				    dev_attr->name, false);
 	if (!vdpasim)
 		goto err_alloc;
 
diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c
index c76ebb531212..f907f42e83bb 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -399,7 +399,7 @@ static int vp_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return ret;
 
 	vp_vdpa = vdpa_alloc_device(struct vp_vdpa, vdpa,
-				    dev, &vp_vdpa_ops, NULL);
+				    dev, &vp_vdpa_ops, NULL, false);
 	if (vp_vdpa == NULL) {
 		dev_err(dev, "vp_vdpa: Failed to allocate vDPA structure\n");
 		return -ENOMEM;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index c5ec45b920f8..45d1c0955961 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -520,8 +520,28 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 	}
 }
 
+static void vhost_vdpa_va_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+{
+	struct vhost_dev *dev = &v->vdev;
+	struct vhost_iotlb *iotlb = dev->iotlb;
+	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
+
+	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		fput(map_file->file);
+		kfree(map_file);
+		vhost_iotlb_map_free(iotlb, map);
+	}
+}
+
 static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
+	struct vdpa_device *vdpa = v->vdpa;
+
+	if (vdpa->use_va)
+		return vhost_vdpa_va_unmap(v, start, last);
+
 	return vhost_vdpa_pa_unmap(v, start, last);
 }
 
@@ -556,21 +576,21 @@ static int perm_to_iommu_flags(u32 perm)
 	return flags | IOMMU_CACHE;
 }
 
-static int vhost_vdpa_map(struct vhost_vdpa *v,
-			  u64 iova, u64 size, u64 pa, u32 perm)
+static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
+			  u64 size, u64 pa, u32 perm, void *opaque)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vdpa_device *vdpa = v->vdpa;
 	const struct vdpa_config_ops *ops = vdpa->config;
 	int r = 0;
 
-	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
-				  pa, perm);
+	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
+				      pa, perm, opaque);
 	if (r)
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
@@ -578,13 +598,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		r = iommu_map(v->domain, iova, pa, size,
 			      perm_to_iommu_flags(perm));
 	}
-
-	if (r)
+	if (r) {
 		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
-	else
+		return r;
+	}
+
+	if (!vdpa->use_va)
 		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 
-	return r;
+	return 0;
 }
 
 static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
@@ -605,6 +627,56 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
+static int vhost_vdpa_va_map(struct vhost_vdpa *v,
+			     u64 iova, u64 size, u64 uaddr, u32 perm)
+{
+	struct vhost_dev *dev = &v->vdev;
+	u64 offset, map_size, map_iova = iova;
+	struct vdpa_map_file *map_file;
+	struct vm_area_struct *vma;
+	int ret;
+
+	mmap_read_lock(dev->mm);
+
+	while (size) {
+		vma = find_vma(dev->mm, uaddr);
+		if (!vma) {
+			ret = -EINVAL;
+			break;
+		}
+		map_size = min(size, vma->vm_end - uaddr);
+		if (!(vma->vm_file && (vma->vm_flags & VM_SHARED) &&
+			!(vma->vm_flags & (VM_IO | VM_PFNMAP))))
+			goto next;
+
+		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
+		if (!map_file) {
+			ret = -ENOMEM;
+			break;
+		}
+		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
+		map_file->offset = offset;
+		map_file->file = get_file(vma->vm_file);
+		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
+				     perm, map_file);
+		if (ret) {
+			fput(map_file->file);
+			kfree(map_file);
+			break;
+		}
+next:
+		size -= map_size;
+		uaddr += map_size;
+		map_iova += map_size;
+	}
+	if (ret)
+		vhost_vdpa_unmap(v, iova, map_iova - iova);
+
+	mmap_read_unlock(dev->mm);
+
+	return ret;
+}
+
 static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 			     u64 iova, u64 size, u64 uaddr, u32 perm)
 {
@@ -671,7 +743,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     perm);
+						     perm, NULL);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -700,7 +772,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, perm);
+			     map_pfn << PAGE_SHIFT, perm, NULL);
 out:
 	if (ret) {
 		if (nchunks) {
@@ -733,6 +805,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 					   struct vhost_iotlb_msg *msg)
 {
 	struct vhost_dev *dev = &v->vdev;
+	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
 
 	if (msg->iova < v->range.first ||
@@ -743,6 +816,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				    msg->iova + msg->size - 1))
 		return -EEXIST;
 
+	if (vdpa->use_va)
+		return vhost_vdpa_va_map(v, msg->iova, msg->size,
+					 msg->uaddr, msg->perm);
+
 	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
 				 msg->perm);
 }
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 281f768cb597..85124d197c55 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
  * @config: the configuration ops for this device.
  * @index: device index
  * @features_valid: were features initialized? for legacy guests
+ * @use_va: indicate whether virtual address must be used by this device
  * @nvqs: maximum number of supported virtqueues
  * @mdev: management device pointer; caller must setup when registering device as part
  *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
@@ -54,6 +55,7 @@ struct vdpa_device {
 	const struct vdpa_config_ops *config;
 	unsigned int index;
 	bool features_valid;
+	bool use_va;
 	int nvqs;
 	struct vdpa_mgmt_dev *mdev;
 };
@@ -69,6 +71,16 @@ struct vdpa_iova_range {
 };
 
 /**
+ * Corresponding file area for device memory mapping
+ * @file: vma->vm_file for the mapping
+ * @offset: mapping offset in the vm_file
+ */
+struct vdpa_map_file {
+	struct file *file;
+	u64 offset;
+};
+
+/**
  * struct vdpa_config_ops - operations for configuring a vDPA device.
  * Note: vDPA device drivers are required to implement all of the
  * operations unless it is mentioned to be optional in the following
@@ -254,14 +266,15 @@ struct vdpa_config_ops {
 
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					size_t size, const char *name);
+					size_t size, const char *name,
+					bool use_va);
 
-#define vdpa_alloc_device(dev_struct, member, parent, config, name)   \
+#define vdpa_alloc_device(dev_struct, member, parent, config, name, use_va)   \
 			  container_of(__vdpa_alloc_device( \
 				       parent, config, \
 				       sizeof(dev_struct) + \
 				       BUILD_BUG_ON_ZERO(offsetof( \
-				       dev_struct, member)), name), \
+				       dev_struct, member)), name, use_va), \
 				       dev_struct, member)
 
 int vdpa_register_device(struct vdpa_device *vdev, int nvqs);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 07/10] vdpa: Support transferring virtual addressing during DMA mapping
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

This patch introduces an attribute for vDPA device to indicate
whether virtual address can be used. If vDPA device driver set
it, vhost-vdpa bus driver will not pin user page and transfer
userspace virtual address instead of physical address during
DMA mapping. And corresponding vma->vm_file and offset will be
also passed as an opaque pointer.

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/ifcvf/ifcvf_main.c   |  2 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
 drivers/vdpa/vdpa.c               |  9 +++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c  |  2 +-
 drivers/vdpa/virtio_pci/vp_vdpa.c |  2 +-
 drivers/vhost/vdpa.c              | 99 ++++++++++++++++++++++++++++++++++-----
 include/linux/vdpa.h              | 19 ++++++--
 7 files changed, 116 insertions(+), 19 deletions(-)

diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index ab0ab5cf0f6e..daf9746e51e6 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -476,7 +476,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
-				    dev, &ifc_vdpa_ops, NULL);
+				    dev, &ifc_vdpa_ops, NULL, false);
 	if (adapter == NULL) {
 		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
 		return -ENOMEM;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index dda5dc6f7737..2b7ca111f039 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2012,7 +2012,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name)
 	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
 
 	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
-				 name);
+				 name, false);
 	if (IS_ERR(ndev))
 		return PTR_ERR(ndev);
 
diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index bb3f1d1f0422..8f01d6a7ecc5 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -71,6 +71,7 @@ static void vdpa_release_dev(struct device *d)
  * @config: the bus operations that is supported by this device
  * @size: size of the parent structure that contains private data
  * @name: name of the vdpa device; optional.
+ * @use_va: indicate whether virtual address must be used by this device
  *
  * Driver should use vdpa_alloc_device() wrapper macro instead of
  * using this directly.
@@ -80,7 +81,8 @@ static void vdpa_release_dev(struct device *d)
  */
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					size_t size, const char *name)
+					size_t size, const char *name,
+					bool use_va)
 {
 	struct vdpa_device *vdev;
 	int err = -EINVAL;
@@ -91,6 +93,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	if (!!config->dma_map != !!config->dma_unmap)
 		goto err;
 
+	/* It should only work for the device that use on-chip IOMMU */
+	if (use_va && !(config->dma_map || config->set_map))
+		goto err;
+
 	err = -ENOMEM;
 	vdev = kzalloc(size, GFP_KERNEL);
 	if (!vdev)
@@ -106,6 +112,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 	vdev->index = err;
 	vdev->config = config;
 	vdev->features_valid = false;
+	vdev->use_va = use_va;
 
 	if (name)
 		err = dev_set_name(&vdev->dev, "%s", name);
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index efd0cb3d964d..a43479cf57ea 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -250,7 +250,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
 		ops = &vdpasim_config_ops;
 
 	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
-				    dev_attr->name);
+				    dev_attr->name, false);
 	if (!vdpasim)
 		goto err_alloc;
 
diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c
index c76ebb531212..f907f42e83bb 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -399,7 +399,7 @@ static int vp_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return ret;
 
 	vp_vdpa = vdpa_alloc_device(struct vp_vdpa, vdpa,
-				    dev, &vp_vdpa_ops, NULL);
+				    dev, &vp_vdpa_ops, NULL, false);
 	if (vp_vdpa == NULL) {
 		dev_err(dev, "vp_vdpa: Failed to allocate vDPA structure\n");
 		return -ENOMEM;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index c5ec45b920f8..45d1c0955961 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -520,8 +520,28 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 	}
 }
 
+static void vhost_vdpa_va_unmap(struct vhost_vdpa *v, u64 start, u64 last)
+{
+	struct vhost_dev *dev = &v->vdev;
+	struct vhost_iotlb *iotlb = dev->iotlb;
+	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
+
+	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		fput(map_file->file);
+		kfree(map_file);
+		vhost_iotlb_map_free(iotlb, map);
+	}
+}
+
 static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
 {
+	struct vdpa_device *vdpa = v->vdpa;
+
+	if (vdpa->use_va)
+		return vhost_vdpa_va_unmap(v, start, last);
+
 	return vhost_vdpa_pa_unmap(v, start, last);
 }
 
@@ -556,21 +576,21 @@ static int perm_to_iommu_flags(u32 perm)
 	return flags | IOMMU_CACHE;
 }
 
-static int vhost_vdpa_map(struct vhost_vdpa *v,
-			  u64 iova, u64 size, u64 pa, u32 perm)
+static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
+			  u64 size, u64 pa, u32 perm, void *opaque)
 {
 	struct vhost_dev *dev = &v->vdev;
 	struct vdpa_device *vdpa = v->vdpa;
 	const struct vdpa_config_ops *ops = vdpa->config;
 	int r = 0;
 
-	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
-				  pa, perm);
+	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
+				      pa, perm, opaque);
 	if (r)
 		return r;
 
 	if (ops->dma_map) {
-		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
+		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
 	} else if (ops->set_map) {
 		if (!v->in_batch)
 			r = ops->set_map(vdpa, dev->iotlb);
@@ -578,13 +598,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
 		r = iommu_map(v->domain, iova, pa, size,
 			      perm_to_iommu_flags(perm));
 	}
-
-	if (r)
+	if (r) {
 		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
-	else
+		return r;
+	}
+
+	if (!vdpa->use_va)
 		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
 
-	return r;
+	return 0;
 }
 
 static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
@@ -605,6 +627,56 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
 	}
 }
 
+static int vhost_vdpa_va_map(struct vhost_vdpa *v,
+			     u64 iova, u64 size, u64 uaddr, u32 perm)
+{
+	struct vhost_dev *dev = &v->vdev;
+	u64 offset, map_size, map_iova = iova;
+	struct vdpa_map_file *map_file;
+	struct vm_area_struct *vma;
+	int ret;
+
+	mmap_read_lock(dev->mm);
+
+	while (size) {
+		vma = find_vma(dev->mm, uaddr);
+		if (!vma) {
+			ret = -EINVAL;
+			break;
+		}
+		map_size = min(size, vma->vm_end - uaddr);
+		if (!(vma->vm_file && (vma->vm_flags & VM_SHARED) &&
+			!(vma->vm_flags & (VM_IO | VM_PFNMAP))))
+			goto next;
+
+		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
+		if (!map_file) {
+			ret = -ENOMEM;
+			break;
+		}
+		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
+		map_file->offset = offset;
+		map_file->file = get_file(vma->vm_file);
+		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
+				     perm, map_file);
+		if (ret) {
+			fput(map_file->file);
+			kfree(map_file);
+			break;
+		}
+next:
+		size -= map_size;
+		uaddr += map_size;
+		map_iova += map_size;
+	}
+	if (ret)
+		vhost_vdpa_unmap(v, iova, map_iova - iova);
+
+	mmap_read_unlock(dev->mm);
+
+	return ret;
+}
+
 static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 			     u64 iova, u64 size, u64 uaddr, u32 perm)
 {
@@ -671,7 +743,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
 				ret = vhost_vdpa_map(v, iova, csize,
 						     map_pfn << PAGE_SHIFT,
-						     perm);
+						     perm, NULL);
 				if (ret) {
 					/*
 					 * Unpin the pages that are left unmapped
@@ -700,7 +772,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 
 	/* Pin the rest chunk */
 	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
-			     map_pfn << PAGE_SHIFT, perm);
+			     map_pfn << PAGE_SHIFT, perm, NULL);
 out:
 	if (ret) {
 		if (nchunks) {
@@ -733,6 +805,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 					   struct vhost_iotlb_msg *msg)
 {
 	struct vhost_dev *dev = &v->vdev;
+	struct vdpa_device *vdpa = v->vdpa;
 	struct vhost_iotlb *iotlb = dev->iotlb;
 
 	if (msg->iova < v->range.first ||
@@ -743,6 +816,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
 				    msg->iova + msg->size - 1))
 		return -EEXIST;
 
+	if (vdpa->use_va)
+		return vhost_vdpa_va_map(v, msg->iova, msg->size,
+					 msg->uaddr, msg->perm);
+
 	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
 				 msg->perm);
 }
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 281f768cb597..85124d197c55 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
  * @config: the configuration ops for this device.
  * @index: device index
  * @features_valid: were features initialized? for legacy guests
+ * @use_va: indicate whether virtual address must be used by this device
  * @nvqs: maximum number of supported virtqueues
  * @mdev: management device pointer; caller must setup when registering device as part
  *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
@@ -54,6 +55,7 @@ struct vdpa_device {
 	const struct vdpa_config_ops *config;
 	unsigned int index;
 	bool features_valid;
+	bool use_va;
 	int nvqs;
 	struct vdpa_mgmt_dev *mdev;
 };
@@ -69,6 +71,16 @@ struct vdpa_iova_range {
 };
 
 /**
+ * Corresponding file area for device memory mapping
+ * @file: vma->vm_file for the mapping
+ * @offset: mapping offset in the vm_file
+ */
+struct vdpa_map_file {
+	struct file *file;
+	u64 offset;
+};
+
+/**
  * struct vdpa_config_ops - operations for configuring a vDPA device.
  * Note: vDPA device drivers are required to implement all of the
  * operations unless it is mentioned to be optional in the following
@@ -254,14 +266,15 @@ struct vdpa_config_ops {
 
 struct vdpa_device *__vdpa_alloc_device(struct device *parent,
 					const struct vdpa_config_ops *config,
-					size_t size, const char *name);
+					size_t size, const char *name,
+					bool use_va);
 
-#define vdpa_alloc_device(dev_struct, member, parent, config, name)   \
+#define vdpa_alloc_device(dev_struct, member, parent, config, name, use_va)   \
 			  container_of(__vdpa_alloc_device( \
 				       parent, config, \
 				       sizeof(dev_struct) + \
 				       BUILD_BUG_ON_ZERO(offsetof( \
-				       dev_struct, member)), name), \
+				       dev_struct, member)), name, use_va), \
 				       dev_struct, member)
 
 int vdpa_register_device(struct vdpa_device *vdev, int nvqs);
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 08/10] vduse: Implement an MMU-based IOMMU driver
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

This implements an MMU-based IOMMU driver to support mapping
kernel dma buffer into userspace. The basic idea behind it is
treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
up MMU mapping instead of IOMMU mapping for the DMA transfer so
that the userspace process is able to use its virtual address to
access the dma buffer in kernel.

And to avoid security issue, a bounce-buffering mechanism is
introduced to prevent userspace accessing the original buffer
directly.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 545 +++++++++++++++++++++++++++++++++++
 drivers/vdpa/vdpa_user/iova_domain.h |  73 +++++
 2 files changed, 618 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..ad45026f5423
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,545 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+#include <linux/vdpa.h>
+
+#include "iova_domain.h"
+
+static int vduse_iotlb_add_range(struct vduse_iova_domain *domain,
+				 u64 start, u64 last,
+				 u64 addr, unsigned int perm,
+				 struct file *file, u64 offset)
+{
+	struct vdpa_map_file *map_file;
+	int ret;
+
+	map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
+	if (!map_file)
+		return -ENOMEM;
+
+	map_file->file = get_file(file);
+	map_file->offset = offset;
+
+	ret = vhost_iotlb_add_range_ctx(domain->iotlb, start, last,
+					addr, perm, map_file);
+	if (ret) {
+		fput(map_file->file);
+		kfree(map_file);
+		return ret;
+	}
+	return 0;
+}
+
+static void vduse_iotlb_del_range(struct vduse_iova_domain *domain,
+				  u64 start, u64 last)
+{
+	struct vdpa_map_file *map_file;
+	struct vhost_iotlb_map *map;
+
+	while ((map = vhost_iotlb_itree_first(domain->iotlb, start, last))) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		fput(map_file->file);
+		kfree(map_file);
+		vhost_iotlb_map_free(domain->iotlb, map);
+	}
+}
+
+int vduse_domain_set_map(struct vduse_iova_domain *domain,
+			 struct vhost_iotlb *iotlb)
+{
+	struct vdpa_map_file *map_file;
+	struct vhost_iotlb_map *map;
+	u64 start = 0ULL, last = ULLONG_MAX;
+	int ret;
+
+	spin_lock(&domain->iotlb_lock);
+	vduse_iotlb_del_range(domain, start, last);
+
+	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+	     map = vhost_iotlb_itree_next(map, start, last)) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		ret = vduse_iotlb_add_range(domain, map->start, map->last,
+					    map->addr, map->perm,
+					    map_file->file,
+					    map_file->offset);
+		if (ret)
+			goto err;
+	}
+	spin_unlock(&domain->iotlb_lock);
+
+	return 0;
+err:
+	vduse_iotlb_del_range(domain, start, last);
+	spin_unlock(&domain->iotlb_lock);
+	return ret;
+}
+
+void vduse_domain_clear_map(struct vduse_iova_domain *domain,
+			    struct vhost_iotlb *iotlb)
+{
+	struct vhost_iotlb_map *map;
+	u64 start = 0ULL, last = ULLONG_MAX;
+
+	spin_lock(&domain->iotlb_lock);
+	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+	     map = vhost_iotlb_itree_next(map, start, last)) {
+		vduse_iotlb_del_range(domain, map->start, map->last);
+	}
+	spin_unlock(&domain->iotlb_lock);
+}
+
+static int vduse_domain_map_bounce_page(struct vduse_iova_domain *domain,
+					 u64 iova, u64 size, u64 paddr)
+{
+	struct vduse_bounce_map *map;
+	u64 last = iova + size - 1;
+
+	while (iova <= last) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		if (!map->bounce_page) {
+			map->bounce_page = alloc_page(GFP_ATOMIC);
+			if (!map->bounce_page)
+				return -ENOMEM;
+		}
+		map->orig_phys = paddr;
+		paddr += PAGE_SIZE;
+		iova += PAGE_SIZE;
+	}
+	return 0;
+}
+
+static void vduse_domain_unmap_bounce_page(struct vduse_iova_domain *domain,
+					   u64 iova, u64 size)
+{
+	struct vduse_bounce_map *map;
+	u64 last = iova + size - 1;
+
+	while (iova <= last) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		map->orig_phys = INVALID_PHYS_ADDR;
+		iova += PAGE_SIZE;
+	}
+}
+
+static void do_bounce(phys_addr_t orig, void *addr, size_t size,
+		      enum dma_data_direction dir)
+{
+	unsigned long pfn = PFN_DOWN(orig);
+	unsigned int offset = offset_in_page(orig);
+	char *buffer;
+	unsigned int sz = 0;
+
+	while (size) {
+		sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+		buffer = kmap_atomic(pfn_to_page(pfn));
+		if (dir == DMA_TO_DEVICE)
+			memcpy(addr, buffer + offset, sz);
+		else
+			memcpy(buffer + offset, addr, sz);
+		kunmap_atomic(buffer);
+
+		size -= sz;
+		pfn++;
+		addr += sz;
+		offset = 0;
+	}
+}
+
+static void vduse_domain_bounce(struct vduse_iova_domain *domain,
+				dma_addr_t iova, size_t size,
+				enum dma_data_direction dir)
+{
+	struct vduse_bounce_map *map;
+	unsigned int offset;
+	void *addr;
+	size_t sz;
+
+	if (iova >= domain->bounce_size)
+		return;
+
+	while (size) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		offset = offset_in_page(iova);
+		sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+		if (WARN_ON(!map->bounce_page ||
+			    map->orig_phys == INVALID_PHYS_ADDR))
+			return;
+
+		addr = page_address(map->bounce_page) + offset;
+		do_bounce(map->orig_phys + offset, addr, sz, dir);
+		size -= sz;
+		iova += sz;
+	}
+}
+
+static struct page *
+vduse_domain_get_coherent_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	u64 start = iova & PAGE_MASK;
+	u64 last = start + PAGE_SIZE - 1;
+	struct vhost_iotlb_map *map;
+	struct page *page = NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
+	if (!map)
+		goto out;
+
+	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
+	get_page(page);
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	struct vduse_bounce_map *map;
+	struct page *page = NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+	if (!map->bounce_page)
+		goto out;
+
+	page = map->bounce_page;
+	get_page(page);
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static void
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain)
+{
+	struct vduse_bounce_map *map;
+	unsigned long pfn, bounce_pfns;
+
+	bounce_pfns = domain->bounce_size >> PAGE_SHIFT;
+
+	for (pfn = 0; pfn < bounce_pfns; pfn++) {
+		map = &domain->bounce_maps[pfn];
+		if (WARN_ON(map->orig_phys != INVALID_PHYS_ADDR))
+			continue;
+
+		if (!map->bounce_page)
+			continue;
+
+		__free_page(map->bounce_page);
+		map->bounce_page = NULL;
+	}
+}
+
+void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
+{
+	if (!domain->bounce_map)
+		return;
+
+	spin_lock(&domain->iotlb_lock);
+	if (!domain->bounce_map)
+		goto unlock;
+
+	vduse_iotlb_del_range(domain, 0, domain->bounce_size - 1);
+	domain->bounce_map = 0;
+unlock:
+	spin_unlock(&domain->iotlb_lock);
+}
+
+static int vduse_domain_init_bounce_map(struct vduse_iova_domain *domain)
+{
+	int ret = 0;
+
+	if (domain->bounce_map)
+		return 0;
+
+	spin_lock(&domain->iotlb_lock);
+	if (domain->bounce_map)
+		goto unlock;
+
+	ret = vduse_iotlb_add_range(domain, 0, domain->bounce_size - 1,
+				    0, VHOST_MAP_RW, domain->file, 0);
+	if (ret)
+		goto unlock;
+
+	domain->bounce_map = 1;
+unlock:
+	spin_unlock(&domain->iotlb_lock);
+	return ret;
+}
+
+static dma_addr_t
+vduse_domain_alloc_iova(struct iova_domain *iovad,
+			unsigned long size, unsigned long limit)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+	unsigned long iova_pfn;
+
+	/*
+	 * Freeing non-power-of-two-sized allocations back into the IOVA caches
+	 * will come back to bite us badly, so we have to waste a bit of space
+	 * rounding up anything cacheable to make sure that can't happen. The
+	 * order of the unadjusted size will still match upon freeing.
+	 */
+	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+		iova_len = roundup_pow_of_two(iova_len);
+	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
+
+	return iova_pfn << shift;
+}
+
+static void vduse_domain_free_iova(struct iova_domain *iovad,
+				   dma_addr_t iova, size_t size)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+
+	free_iova_fast(iovad, iova >> shift, iova_len);
+}
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				 struct page *page, unsigned long offset,
+				 size_t size, enum dma_data_direction dir,
+				 unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+	unsigned long limit = domain->bounce_size - 1;
+	phys_addr_t pa = page_to_phys(page) + offset;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+
+	if (!iova)
+		return DMA_MAPPING_ERROR;
+
+	if (vduse_domain_init_bounce_map(domain))
+		goto err;
+
+	if (vduse_domain_map_bounce_page(domain, (u64)iova, (u64)size, pa))
+		goto err;
+
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, iova, size, DMA_TO_DEVICE);
+
+	return iova;
+err:
+	vduse_domain_free_iova(iovad, iova, size);
+	return DMA_MAPPING_ERROR;
+}
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			     dma_addr_t dma_addr, size_t size,
+			     enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+
+	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, dma_addr, size, DMA_FROM_DEVICE);
+
+	vduse_domain_unmap_bounce_page(domain, (u64)dma_addr, (u64)size);
+	vduse_domain_free_iova(iovad, dma_addr, size);
+}
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				  size_t size, dma_addr_t *dma_addr,
+				  gfp_t flag, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	unsigned long limit = domain->iova_limit;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+	void *orig = alloc_pages_exact(size, flag);
+
+	if (!iova || !orig)
+		goto err;
+
+	spin_lock(&domain->iotlb_lock);
+	if (vduse_iotlb_add_range(domain, (u64)iova, (u64)iova + size - 1,
+				  virt_to_phys(orig), VHOST_MAP_RW,
+				  domain->file, (u64)iova)) {
+		spin_unlock(&domain->iotlb_lock);
+		goto err;
+	}
+	spin_unlock(&domain->iotlb_lock);
+
+	*dma_addr = iova;
+
+	return orig;
+err:
+	*dma_addr = DMA_MAPPING_ERROR;
+	if (orig)
+		free_pages_exact(orig, size);
+	if (iova)
+		vduse_domain_free_iova(iovad, iova, size);
+
+	return NULL;
+}
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
+	phys_addr_t pa;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
+				      (u64)dma_addr + size - 1);
+	if (WARN_ON(!map)) {
+		spin_unlock(&domain->iotlb_lock);
+		return;
+	}
+	map_file = (struct vdpa_map_file *)map->opaque;
+	fput(map_file->file);
+	kfree(map_file);
+	pa = map->addr;
+	vhost_iotlb_map_free(domain->iotlb, map);
+	spin_unlock(&domain->iotlb_lock);
+
+	vduse_domain_free_iova(iovad, dma_addr, size);
+	free_pages_exact(phys_to_virt(pa), size);
+}
+
+static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
+{
+	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
+	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
+	struct page *page;
+
+	if (!domain)
+		return VM_FAULT_SIGBUS;
+
+	if (iova < domain->bounce_size)
+		page = vduse_domain_get_bounce_page(domain, iova);
+	else
+		page = vduse_domain_get_coherent_page(domain, iova);
+
+	if (!page)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = page;
+
+	return 0;
+}
+
+static const struct vm_operations_struct vduse_domain_mmap_ops = {
+	.fault = vduse_domain_mmap_fault,
+};
+
+static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_private_data = domain;
+	vma->vm_ops = &vduse_domain_mmap_ops;
+
+	return 0;
+}
+
+static int vduse_domain_release(struct inode *inode, struct file *file)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	spin_lock(&domain->iotlb_lock);
+	vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
+	vduse_domain_free_bounce_pages(domain);
+	spin_unlock(&domain->iotlb_lock);
+	put_iova_domain(&domain->stream_iovad);
+	put_iova_domain(&domain->consistent_iovad);
+	vhost_iotlb_free(domain->iotlb);
+	vfree(domain->bounce_maps);
+	kfree(domain);
+
+	return 0;
+}
+
+static const struct file_operations vduse_domain_fops = {
+	.owner = THIS_MODULE,
+	.mmap = vduse_domain_mmap,
+	.release = vduse_domain_release,
+};
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain)
+{
+	fput(domain->file);
+}
+
+struct vduse_iova_domain *
+vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
+{
+	struct vduse_iova_domain *domain;
+	struct file *file;
+	struct vduse_bounce_map *map;
+	unsigned long pfn, bounce_pfns;
+
+	bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
+	if (iova_limit <= bounce_size)
+		return NULL;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->iotlb = vhost_iotlb_alloc(0, 0);
+	if (!domain->iotlb)
+		goto err_iotlb;
+
+	domain->iova_limit = iova_limit;
+	domain->bounce_size = PAGE_ALIGN(bounce_size);
+	domain->bounce_maps = vzalloc(bounce_pfns *
+				sizeof(struct vduse_bounce_map));
+	if (!domain->bounce_maps)
+		goto err_map;
+
+	for (pfn = 0; pfn < bounce_pfns; pfn++) {
+		map = &domain->bounce_maps[pfn];
+		map->orig_phys = INVALID_PHYS_ADDR;
+	}
+	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
+				domain, O_RDWR);
+	if (IS_ERR(file))
+		goto err_file;
+
+	domain->file = file;
+	spin_lock_init(&domain->iotlb_lock);
+	init_iova_domain(&domain->stream_iovad,
+			PAGE_SIZE, IOVA_START_PFN);
+	init_iova_domain(&domain->consistent_iovad,
+			PAGE_SIZE, bounce_pfns);
+
+	return domain;
+err_file:
+	vfree(domain->bounce_maps);
+err_map:
+	vhost_iotlb_free(domain->iotlb);
+err_iotlb:
+	kfree(domain);
+	return NULL;
+}
+
+int vduse_domain_init(void)
+{
+	return iova_cache_get();
+}
+
+void vduse_domain_exit(void)
+{
+	iova_cache_put();
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
new file mode 100644
index 000000000000..5fc41d01c412
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_IOVA_DOMAIN_H
+#define _VDUSE_IOVA_DOMAIN_H
+
+#include <linux/iova.h>
+#include <linux/dma-mapping.h>
+#include <linux/vhost_iotlb.h>
+
+#define IOVA_START_PFN 1
+
+#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
+
+struct vduse_bounce_map {
+	struct page *bounce_page;
+	u64 orig_phys;
+};
+
+struct vduse_iova_domain {
+	struct iova_domain stream_iovad;
+	struct iova_domain consistent_iovad;
+	struct vduse_bounce_map *bounce_maps;
+	size_t bounce_size;
+	unsigned long iova_limit;
+	int bounce_map;
+	struct vhost_iotlb *iotlb;
+	spinlock_t iotlb_lock;
+	struct file *file;
+};
+
+int vduse_domain_set_map(struct vduse_iova_domain *domain,
+			 struct vhost_iotlb *iotlb);
+
+void vduse_domain_clear_map(struct vduse_iova_domain *domain,
+			    struct vhost_iotlb *iotlb);
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				 struct page *page, unsigned long offset,
+				 size_t size, enum dma_data_direction dir,
+				 unsigned long attrs);
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			     dma_addr_t dma_addr, size_t size,
+			     enum dma_data_direction dir, unsigned long attrs);
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				  size_t size, dma_addr_t *dma_addr,
+				  gfp_t flag, unsigned long attrs);
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs);
+
+void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain);
+
+struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
+					      size_t bounce_size);
+
+int vduse_domain_init(void);
+
+void vduse_domain_exit(void);
+
+#endif /* _VDUSE_IOVA_DOMAIN_H */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 08/10] vduse: Implement an MMU-based IOMMU driver
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

This implements an MMU-based IOMMU driver to support mapping
kernel dma buffer into userspace. The basic idea behind it is
treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
up MMU mapping instead of IOMMU mapping for the DMA transfer so
that the userspace process is able to use its virtual address to
access the dma buffer in kernel.

And to avoid security issue, a bounce-buffering mechanism is
introduced to prevent userspace accessing the original buffer
directly.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 545 +++++++++++++++++++++++++++++++++++
 drivers/vdpa/vdpa_user/iova_domain.h |  73 +++++
 2 files changed, 618 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..ad45026f5423
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,545 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+#include <linux/vdpa.h>
+
+#include "iova_domain.h"
+
+static int vduse_iotlb_add_range(struct vduse_iova_domain *domain,
+				 u64 start, u64 last,
+				 u64 addr, unsigned int perm,
+				 struct file *file, u64 offset)
+{
+	struct vdpa_map_file *map_file;
+	int ret;
+
+	map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
+	if (!map_file)
+		return -ENOMEM;
+
+	map_file->file = get_file(file);
+	map_file->offset = offset;
+
+	ret = vhost_iotlb_add_range_ctx(domain->iotlb, start, last,
+					addr, perm, map_file);
+	if (ret) {
+		fput(map_file->file);
+		kfree(map_file);
+		return ret;
+	}
+	return 0;
+}
+
+static void vduse_iotlb_del_range(struct vduse_iova_domain *domain,
+				  u64 start, u64 last)
+{
+	struct vdpa_map_file *map_file;
+	struct vhost_iotlb_map *map;
+
+	while ((map = vhost_iotlb_itree_first(domain->iotlb, start, last))) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		fput(map_file->file);
+		kfree(map_file);
+		vhost_iotlb_map_free(domain->iotlb, map);
+	}
+}
+
+int vduse_domain_set_map(struct vduse_iova_domain *domain,
+			 struct vhost_iotlb *iotlb)
+{
+	struct vdpa_map_file *map_file;
+	struct vhost_iotlb_map *map;
+	u64 start = 0ULL, last = ULLONG_MAX;
+	int ret;
+
+	spin_lock(&domain->iotlb_lock);
+	vduse_iotlb_del_range(domain, start, last);
+
+	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+	     map = vhost_iotlb_itree_next(map, start, last)) {
+		map_file = (struct vdpa_map_file *)map->opaque;
+		ret = vduse_iotlb_add_range(domain, map->start, map->last,
+					    map->addr, map->perm,
+					    map_file->file,
+					    map_file->offset);
+		if (ret)
+			goto err;
+	}
+	spin_unlock(&domain->iotlb_lock);
+
+	return 0;
+err:
+	vduse_iotlb_del_range(domain, start, last);
+	spin_unlock(&domain->iotlb_lock);
+	return ret;
+}
+
+void vduse_domain_clear_map(struct vduse_iova_domain *domain,
+			    struct vhost_iotlb *iotlb)
+{
+	struct vhost_iotlb_map *map;
+	u64 start = 0ULL, last = ULLONG_MAX;
+
+	spin_lock(&domain->iotlb_lock);
+	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+	     map = vhost_iotlb_itree_next(map, start, last)) {
+		vduse_iotlb_del_range(domain, map->start, map->last);
+	}
+	spin_unlock(&domain->iotlb_lock);
+}
+
+static int vduse_domain_map_bounce_page(struct vduse_iova_domain *domain,
+					 u64 iova, u64 size, u64 paddr)
+{
+	struct vduse_bounce_map *map;
+	u64 last = iova + size - 1;
+
+	while (iova <= last) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		if (!map->bounce_page) {
+			map->bounce_page = alloc_page(GFP_ATOMIC);
+			if (!map->bounce_page)
+				return -ENOMEM;
+		}
+		map->orig_phys = paddr;
+		paddr += PAGE_SIZE;
+		iova += PAGE_SIZE;
+	}
+	return 0;
+}
+
+static void vduse_domain_unmap_bounce_page(struct vduse_iova_domain *domain,
+					   u64 iova, u64 size)
+{
+	struct vduse_bounce_map *map;
+	u64 last = iova + size - 1;
+
+	while (iova <= last) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		map->orig_phys = INVALID_PHYS_ADDR;
+		iova += PAGE_SIZE;
+	}
+}
+
+static void do_bounce(phys_addr_t orig, void *addr, size_t size,
+		      enum dma_data_direction dir)
+{
+	unsigned long pfn = PFN_DOWN(orig);
+	unsigned int offset = offset_in_page(orig);
+	char *buffer;
+	unsigned int sz = 0;
+
+	while (size) {
+		sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+		buffer = kmap_atomic(pfn_to_page(pfn));
+		if (dir == DMA_TO_DEVICE)
+			memcpy(addr, buffer + offset, sz);
+		else
+			memcpy(buffer + offset, addr, sz);
+		kunmap_atomic(buffer);
+
+		size -= sz;
+		pfn++;
+		addr += sz;
+		offset = 0;
+	}
+}
+
+static void vduse_domain_bounce(struct vduse_iova_domain *domain,
+				dma_addr_t iova, size_t size,
+				enum dma_data_direction dir)
+{
+	struct vduse_bounce_map *map;
+	unsigned int offset;
+	void *addr;
+	size_t sz;
+
+	if (iova >= domain->bounce_size)
+		return;
+
+	while (size) {
+		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+		offset = offset_in_page(iova);
+		sz = min_t(size_t, PAGE_SIZE - offset, size);
+
+		if (WARN_ON(!map->bounce_page ||
+			    map->orig_phys == INVALID_PHYS_ADDR))
+			return;
+
+		addr = page_address(map->bounce_page) + offset;
+		do_bounce(map->orig_phys + offset, addr, sz, dir);
+		size -= sz;
+		iova += sz;
+	}
+}
+
+static struct page *
+vduse_domain_get_coherent_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	u64 start = iova & PAGE_MASK;
+	u64 last = start + PAGE_SIZE - 1;
+	struct vhost_iotlb_map *map;
+	struct page *page = NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
+	if (!map)
+		goto out;
+
+	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
+	get_page(page);
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, u64 iova)
+{
+	struct vduse_bounce_map *map;
+	struct page *page = NULL;
+
+	spin_lock(&domain->iotlb_lock);
+	map = &domain->bounce_maps[iova >> PAGE_SHIFT];
+	if (!map->bounce_page)
+		goto out;
+
+	page = map->bounce_page;
+	get_page(page);
+out:
+	spin_unlock(&domain->iotlb_lock);
+
+	return page;
+}
+
+static void
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain)
+{
+	struct vduse_bounce_map *map;
+	unsigned long pfn, bounce_pfns;
+
+	bounce_pfns = domain->bounce_size >> PAGE_SHIFT;
+
+	for (pfn = 0; pfn < bounce_pfns; pfn++) {
+		map = &domain->bounce_maps[pfn];
+		if (WARN_ON(map->orig_phys != INVALID_PHYS_ADDR))
+			continue;
+
+		if (!map->bounce_page)
+			continue;
+
+		__free_page(map->bounce_page);
+		map->bounce_page = NULL;
+	}
+}
+
+void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
+{
+	if (!domain->bounce_map)
+		return;
+
+	spin_lock(&domain->iotlb_lock);
+	if (!domain->bounce_map)
+		goto unlock;
+
+	vduse_iotlb_del_range(domain, 0, domain->bounce_size - 1);
+	domain->bounce_map = 0;
+unlock:
+	spin_unlock(&domain->iotlb_lock);
+}
+
+static int vduse_domain_init_bounce_map(struct vduse_iova_domain *domain)
+{
+	int ret = 0;
+
+	if (domain->bounce_map)
+		return 0;
+
+	spin_lock(&domain->iotlb_lock);
+	if (domain->bounce_map)
+		goto unlock;
+
+	ret = vduse_iotlb_add_range(domain, 0, domain->bounce_size - 1,
+				    0, VHOST_MAP_RW, domain->file, 0);
+	if (ret)
+		goto unlock;
+
+	domain->bounce_map = 1;
+unlock:
+	spin_unlock(&domain->iotlb_lock);
+	return ret;
+}
+
+static dma_addr_t
+vduse_domain_alloc_iova(struct iova_domain *iovad,
+			unsigned long size, unsigned long limit)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+	unsigned long iova_pfn;
+
+	/*
+	 * Freeing non-power-of-two-sized allocations back into the IOVA caches
+	 * will come back to bite us badly, so we have to waste a bit of space
+	 * rounding up anything cacheable to make sure that can't happen. The
+	 * order of the unadjusted size will still match upon freeing.
+	 */
+	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+		iova_len = roundup_pow_of_two(iova_len);
+	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
+
+	return iova_pfn << shift;
+}
+
+static void vduse_domain_free_iova(struct iova_domain *iovad,
+				   dma_addr_t iova, size_t size)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long iova_len = iova_align(iovad, size) >> shift;
+
+	free_iova_fast(iovad, iova >> shift, iova_len);
+}
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				 struct page *page, unsigned long offset,
+				 size_t size, enum dma_data_direction dir,
+				 unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+	unsigned long limit = domain->bounce_size - 1;
+	phys_addr_t pa = page_to_phys(page) + offset;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+
+	if (!iova)
+		return DMA_MAPPING_ERROR;
+
+	if (vduse_domain_init_bounce_map(domain))
+		goto err;
+
+	if (vduse_domain_map_bounce_page(domain, (u64)iova, (u64)size, pa))
+		goto err;
+
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, iova, size, DMA_TO_DEVICE);
+
+	return iova;
+err:
+	vduse_domain_free_iova(iovad, iova, size);
+	return DMA_MAPPING_ERROR;
+}
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			     dma_addr_t dma_addr, size_t size,
+			     enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->stream_iovad;
+
+	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+		vduse_domain_bounce(domain, dma_addr, size, DMA_FROM_DEVICE);
+
+	vduse_domain_unmap_bounce_page(domain, (u64)dma_addr, (u64)size);
+	vduse_domain_free_iova(iovad, dma_addr, size);
+}
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				  size_t size, dma_addr_t *dma_addr,
+				  gfp_t flag, unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	unsigned long limit = domain->iova_limit;
+	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
+	void *orig = alloc_pages_exact(size, flag);
+
+	if (!iova || !orig)
+		goto err;
+
+	spin_lock(&domain->iotlb_lock);
+	if (vduse_iotlb_add_range(domain, (u64)iova, (u64)iova + size - 1,
+				  virt_to_phys(orig), VHOST_MAP_RW,
+				  domain->file, (u64)iova)) {
+		spin_unlock(&domain->iotlb_lock);
+		goto err;
+	}
+	spin_unlock(&domain->iotlb_lock);
+
+	*dma_addr = iova;
+
+	return orig;
+err:
+	*dma_addr = DMA_MAPPING_ERROR;
+	if (orig)
+		free_pages_exact(orig, size);
+	if (iova)
+		vduse_domain_free_iova(iovad, iova, size);
+
+	return NULL;
+}
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs)
+{
+	struct iova_domain *iovad = &domain->consistent_iovad;
+	struct vhost_iotlb_map *map;
+	struct vdpa_map_file *map_file;
+	phys_addr_t pa;
+
+	spin_lock(&domain->iotlb_lock);
+	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
+				      (u64)dma_addr + size - 1);
+	if (WARN_ON(!map)) {
+		spin_unlock(&domain->iotlb_lock);
+		return;
+	}
+	map_file = (struct vdpa_map_file *)map->opaque;
+	fput(map_file->file);
+	kfree(map_file);
+	pa = map->addr;
+	vhost_iotlb_map_free(domain->iotlb, map);
+	spin_unlock(&domain->iotlb_lock);
+
+	vduse_domain_free_iova(iovad, dma_addr, size);
+	free_pages_exact(phys_to_virt(pa), size);
+}
+
+static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
+{
+	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
+	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
+	struct page *page;
+
+	if (!domain)
+		return VM_FAULT_SIGBUS;
+
+	if (iova < domain->bounce_size)
+		page = vduse_domain_get_bounce_page(domain, iova);
+	else
+		page = vduse_domain_get_coherent_page(domain, iova);
+
+	if (!page)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = page;
+
+	return 0;
+}
+
+static const struct vm_operations_struct vduse_domain_mmap_ops = {
+	.fault = vduse_domain_mmap_fault,
+};
+
+static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_private_data = domain;
+	vma->vm_ops = &vduse_domain_mmap_ops;
+
+	return 0;
+}
+
+static int vduse_domain_release(struct inode *inode, struct file *file)
+{
+	struct vduse_iova_domain *domain = file->private_data;
+
+	spin_lock(&domain->iotlb_lock);
+	vduse_iotlb_del_range(domain, 0, ULLONG_MAX);
+	vduse_domain_free_bounce_pages(domain);
+	spin_unlock(&domain->iotlb_lock);
+	put_iova_domain(&domain->stream_iovad);
+	put_iova_domain(&domain->consistent_iovad);
+	vhost_iotlb_free(domain->iotlb);
+	vfree(domain->bounce_maps);
+	kfree(domain);
+
+	return 0;
+}
+
+static const struct file_operations vduse_domain_fops = {
+	.owner = THIS_MODULE,
+	.mmap = vduse_domain_mmap,
+	.release = vduse_domain_release,
+};
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain)
+{
+	fput(domain->file);
+}
+
+struct vduse_iova_domain *
+vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
+{
+	struct vduse_iova_domain *domain;
+	struct file *file;
+	struct vduse_bounce_map *map;
+	unsigned long pfn, bounce_pfns;
+
+	bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
+	if (iova_limit <= bounce_size)
+		return NULL;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->iotlb = vhost_iotlb_alloc(0, 0);
+	if (!domain->iotlb)
+		goto err_iotlb;
+
+	domain->iova_limit = iova_limit;
+	domain->bounce_size = PAGE_ALIGN(bounce_size);
+	domain->bounce_maps = vzalloc(bounce_pfns *
+				sizeof(struct vduse_bounce_map));
+	if (!domain->bounce_maps)
+		goto err_map;
+
+	for (pfn = 0; pfn < bounce_pfns; pfn++) {
+		map = &domain->bounce_maps[pfn];
+		map->orig_phys = INVALID_PHYS_ADDR;
+	}
+	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
+				domain, O_RDWR);
+	if (IS_ERR(file))
+		goto err_file;
+
+	domain->file = file;
+	spin_lock_init(&domain->iotlb_lock);
+	init_iova_domain(&domain->stream_iovad,
+			PAGE_SIZE, IOVA_START_PFN);
+	init_iova_domain(&domain->consistent_iovad,
+			PAGE_SIZE, bounce_pfns);
+
+	return domain;
+err_file:
+	vfree(domain->bounce_maps);
+err_map:
+	vhost_iotlb_free(domain->iotlb);
+err_iotlb:
+	kfree(domain);
+	return NULL;
+}
+
+int vduse_domain_init(void)
+{
+	return iova_cache_get();
+}
+
+void vduse_domain_exit(void)
+{
+	iova_cache_put();
+}
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
new file mode 100644
index 000000000000..5fc41d01c412
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#ifndef _VDUSE_IOVA_DOMAIN_H
+#define _VDUSE_IOVA_DOMAIN_H
+
+#include <linux/iova.h>
+#include <linux/dma-mapping.h>
+#include <linux/vhost_iotlb.h>
+
+#define IOVA_START_PFN 1
+
+#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
+
+struct vduse_bounce_map {
+	struct page *bounce_page;
+	u64 orig_phys;
+};
+
+struct vduse_iova_domain {
+	struct iova_domain stream_iovad;
+	struct iova_domain consistent_iovad;
+	struct vduse_bounce_map *bounce_maps;
+	size_t bounce_size;
+	unsigned long iova_limit;
+	int bounce_map;
+	struct vhost_iotlb *iotlb;
+	spinlock_t iotlb_lock;
+	struct file *file;
+};
+
+int vduse_domain_set_map(struct vduse_iova_domain *domain,
+			 struct vhost_iotlb *iotlb);
+
+void vduse_domain_clear_map(struct vduse_iova_domain *domain,
+			    struct vhost_iotlb *iotlb);
+
+dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
+				 struct page *page, unsigned long offset,
+				 size_t size, enum dma_data_direction dir,
+				 unsigned long attrs);
+
+void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
+			     dma_addr_t dma_addr, size_t size,
+			     enum dma_data_direction dir, unsigned long attrs);
+
+void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
+				  size_t size, dma_addr_t *dma_addr,
+				  gfp_t flag, unsigned long attrs);
+
+void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
+				void *vaddr, dma_addr_t dma_addr,
+				unsigned long attrs);
+
+void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
+
+void vduse_domain_destroy(struct vduse_iova_domain *domain);
+
+struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
+					      size_t bounce_size);
+
+int vduse_domain_init(void);
+
+void vduse_domain_exit(void);
+
+#endif /* _VDUSE_IOVA_DOMAIN_H */
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

This VDUSE driver enables implementing vDPA devices in userspace.
The vDPA device's control path is handled in kernel and the data
path is handled in userspace.

A message mechnism is used by VDUSE driver to forward some control
messages such as starting/stopping datapath to userspace. Userspace
can use read()/write() to receive/reply those control messages.

And some ioctls are introduced to help userspace to implement the
data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
descriptors referring to vDPA device's iova regions. Then userspace
can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
ioctls can be used to inject interrupt and setup the kickfd for
virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
configuration space and inject a config interrupt.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
 include/uapi/linux/vduse.h                         |  143 ++
 6 files changed, 1613 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 9bfc2b510c64..acd95e9dcfe7 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
 'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
 '|'   00-7F  linux/media.h
 0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
 0x89  00-06  arch/x86/include/asm/sockios.h
 0x89  0B-DF  linux/sockios.h
 0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
 	  vDPA block device simulator which terminates IO request in a
 	  memory buffer.
 
+config VDPA_USER
+	tristate "VDUSE (vDPA Device in Userspace) support"
+	depends on EVENTFD && MMU && HAS_DMA
+	select DMA_OPS
+	select VHOST_IOTLB
+	select IOMMU_IOVA
+	help
+	  With VDUSE it is possible to emulate a vDPA Device
+	  in a userspace program.
+
 config IFCVF
 	tristate "Intel IFC VF vDPA driver"
 	depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
 obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
 obj-$(CONFIG_IFCVF)    += ifcvf/
 obj-$(CONFIG_MLX5_VDPA) += mlx5/
 obj-$(CONFIG_VP_VDPA)    += virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index 000000000000..5271cbd15e28
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1453 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/dma-map-ops.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/vdpa.h>
+#include <linux/nospec.h>
+#include <uapi/linux/vduse.h>
+#include <uapi/linux/vdpa.h>
+#include <uapi/linux/virtio_config.h>
+#include <uapi/linux/virtio_ids.h>
+#include <uapi/linux/virtio_blk.h>
+#include <linux/mod_devicetable.h>
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
+#define DRV_DESC     "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30
+
+struct vduse_virtqueue {
+	u16 index;
+	u32 num;
+	u32 avail_idx;
+	u64 desc_addr;
+	u64 driver_addr;
+	u64 device_addr;
+	bool ready;
+	bool kicked;
+	spinlock_t kick_lock;
+	spinlock_t irq_lock;
+	struct eventfd_ctx *kickfd;
+	struct vdpa_callback cb;
+	struct work_struct inject;
+};
+
+struct vduse_dev;
+
+struct vduse_vdpa {
+	struct vdpa_device vdpa;
+	struct vduse_dev *dev;
+};
+
+struct vduse_dev {
+	struct vduse_vdpa *vdev;
+	struct device *dev;
+	struct vduse_virtqueue *vqs;
+	struct vduse_iova_domain *domain;
+	char *name;
+	struct mutex lock;
+	spinlock_t msg_lock;
+	u64 msg_unique;
+	wait_queue_head_t waitq;
+	struct list_head send_list;
+	struct list_head recv_list;
+	struct vdpa_callback config_cb;
+	struct work_struct inject;
+	spinlock_t irq_lock;
+	int minor;
+	bool connected;
+	bool started;
+	u64 api_version;
+	u64 user_features;
+	u64 features;
+	u32 device_id;
+	u32 vendor_id;
+	u32 generation;
+	u32 config_size;
+	void *config;
+	u8 status;
+	u16 vq_size_max;
+	u32 vq_num;
+	u32 vq_align;
+};
+
+struct vduse_dev_msg {
+	struct vduse_dev_request req;
+	struct vduse_dev_response resp;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	bool completed;
+};
+
+struct vduse_control {
+	u64 api_version;
+};
+
+static DEFINE_MUTEX(vduse_lock);
+static DEFINE_IDR(vduse_idr);
+
+static dev_t vduse_major;
+static struct class *vduse_class;
+static struct cdev vduse_ctrl_cdev;
+static struct cdev vduse_cdev;
+static struct workqueue_struct *vduse_irq_wq;
+
+static u32 allowed_device_id[] = {
+	VIRTIO_ID_BLOCK,
+};
+
+static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
+{
+	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
+
+	return vdev->dev;
+}
+
+static inline struct vduse_dev *dev_to_vduse(struct device *dev)
+{
+	struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+	return vdpa_to_vduse(vdpa);
+}
+
+static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
+					    uint32_t request_id)
+{
+	struct vduse_dev_msg *msg;
+
+	list_for_each_entry(msg, head, list) {
+		if (msg->req.request_id == request_id) {
+			list_del(&msg->list);
+			return msg;
+		}
+	}
+
+	return NULL;
+}
+
+static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
+{
+	struct vduse_dev_msg *msg = NULL;
+
+	if (!list_empty(head)) {
+		msg = list_first_entry(head, struct vduse_dev_msg, list);
+		list_del(&msg->list);
+	}
+
+	return msg;
+}
+
+static void vduse_enqueue_msg(struct list_head *head,
+			      struct vduse_dev_msg *msg)
+{
+	list_add_tail(&msg->list, head);
+}
+
+static int vduse_dev_msg_send(struct vduse_dev *dev,
+			      struct vduse_dev_msg *msg, bool no_reply)
+{
+	init_waitqueue_head(&msg->waitq);
+	spin_lock(&dev->msg_lock);
+	msg->req.request_id = dev->msg_unique++;
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	spin_unlock(&dev->msg_lock);
+	if (no_reply)
+		return 0;
+
+	wait_event_killable_timeout(msg->waitq, msg->completed,
+				    VDUSE_REQUEST_TIMEOUT * HZ);
+	spin_lock(&dev->msg_lock);
+	if (!msg->completed) {
+		list_del(&msg->list);
+		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
+}
+
+static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg;
+
+	spin_lock(&dev->msg_lock);
+	while ((msg = vduse_dequeue_msg(&dev->send_list))) {
+		if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
+			kfree(msg);
+		else
+			vduse_enqueue_msg(&dev->recv_list, msg);
+	}
+	while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
+		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
+		msg->completed = 1;
+		wake_up(&msg->waitq);
+	}
+	spin_unlock(&dev->msg_lock);
+}
+
+static void vduse_dev_start_dataplane(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
+					    GFP_KERNEL | __GFP_NOFAIL);
+
+	msg->req.type = VDUSE_START_DATAPLANE;
+	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
+	vduse_dev_msg_send(dev, msg, true);
+}
+
+static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
+					    GFP_KERNEL | __GFP_NOFAIL);
+
+	msg->req.type = VDUSE_STOP_DATAPLANE;
+	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
+	vduse_dev_msg_send(dev, msg, true);
+}
+
+static int vduse_dev_get_vq_state(struct vduse_dev *dev,
+				  struct vduse_virtqueue *vq,
+				  struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg msg = { 0 };
+	int ret;
+
+	msg.req.type = VDUSE_GET_VQ_STATE;
+	msg.req.vq_state.index = vq->index;
+
+	ret = vduse_dev_msg_send(dev, &msg, false);
+	if (ret)
+		return ret;
+
+	state->avail_index = msg.resp.vq_state.avail_idx;
+	return 0;
+}
+
+static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+				u64 start, u64 last)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	if (last < start)
+		return -EINVAL;
+
+	msg.req.type = VDUSE_UPDATE_IOTLB;
+	msg.req.iova.start = start;
+	msg.req.iova.last = last;
+
+	return vduse_dev_msg_send(dev, &msg, false);
+}
+
+static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int size = sizeof(struct vduse_dev_request);
+	ssize_t ret;
+
+	if (iov_iter_count(to) < size)
+		return -EINVAL;
+
+	spin_lock(&dev->msg_lock);
+	while (1) {
+		msg = vduse_dequeue_msg(&dev->send_list);
+		if (msg)
+			break;
+
+		ret = -EAGAIN;
+		if (file->f_flags & O_NONBLOCK)
+			goto unlock;
+
+		spin_unlock(&dev->msg_lock);
+		ret = wait_event_interruptible_exclusive(dev->waitq,
+					!list_empty(&dev->send_list));
+		if (ret)
+			return ret;
+
+		spin_lock(&dev->msg_lock);
+	}
+	spin_unlock(&dev->msg_lock);
+	ret = copy_to_iter(&msg->req, size, to);
+	spin_lock(&dev->msg_lock);
+	if (ret != size) {
+		ret = -EFAULT;
+		vduse_enqueue_msg(&dev->send_list, msg);
+		goto unlock;
+	}
+	if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
+		kfree(msg);
+	else
+		vduse_enqueue_msg(&dev->recv_list, msg);
+unlock:
+	spin_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_response resp;
+	struct vduse_dev_msg *msg;
+	size_t ret;
+
+	ret = copy_from_iter(&resp, sizeof(resp), from);
+	if (ret != sizeof(resp))
+		return -EINVAL;
+
+	spin_lock(&dev->msg_lock);
+	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
+	if (!msg) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	memcpy(&msg->resp, &resp, sizeof(resp));
+	msg->completed = 1;
+	wake_up(&msg->waitq);
+unlock:
+	spin_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
+{
+	struct vduse_dev *dev = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &dev->waitq, wait);
+
+	if (!list_empty(&dev->send_list))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	if (!list_empty(&dev->recv_list))
+		mask |= EPOLLOUT | EPOLLWRNORM;
+
+	return mask;
+}
+
+static void vduse_dev_reset(struct vduse_dev *dev)
+{
+	int i;
+	struct vduse_iova_domain *domain = dev->domain;
+
+	/* The coherent mappings are handled in vduse_dev_free_coherent() */
+	if (domain->bounce_map)
+		vduse_domain_reset_bounce_map(domain);
+
+	dev->features = 0;
+	dev->generation++;
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = NULL;
+	dev->config_cb.private = NULL;
+	spin_unlock(&dev->irq_lock);
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		vq->ready = false;
+		vq->desc_addr = 0;
+		vq->driver_addr = 0;
+		vq->device_addr = 0;
+		vq->avail_idx = 0;
+		vq->num = 0;
+
+		spin_lock(&vq->kick_lock);
+		vq->kicked = false;
+		if (vq->kickfd)
+			eventfd_ctx_put(vq->kickfd);
+		vq->kickfd = NULL;
+		spin_unlock(&vq->kick_lock);
+
+		spin_lock(&vq->irq_lock);
+		vq->cb.callback = NULL;
+		vq->cb.private = NULL;
+		spin_unlock(&vq->irq_lock);
+	}
+}
+
+static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+				u64 desc_area, u64 driver_area,
+				u64 device_area)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->desc_addr = desc_area;
+	vq->driver_addr = driver_area;
+	vq->device_addr = device_area;
+
+	return 0;
+}
+
+static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->kick_lock);
+	if (!vq->ready)
+		goto unlock;
+
+	if (vq->kickfd)
+		eventfd_signal(vq->kickfd, 1);
+	else
+		vq->kicked = true;
+unlock:
+	spin_unlock(&vq->kick_lock);
+}
+
+static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+			      struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->irq_lock);
+	vq->cb.callback = cb->callback;
+	vq->cb.private = cb->private;
+	spin_unlock(&vq->irq_lock);
+}
+
+static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->num = num;
+}
+
+static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
+					u16 idx, bool ready)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->ready = ready;
+}
+
+static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vq->ready;
+}
+
+static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->avail_idx = state->avail_index;
+	return 0;
+}
+
+static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_get_vq_state(dev, vq, state);
+}
+
+static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_align;
+}
+
+static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->user_features;
+}
+
+static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	dev->features = features;
+	return 0;
+}
+
+static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
+				  struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = cb->callback;
+	dev->config_cb.private = cb->private;
+	spin_unlock(&dev->irq_lock);
+}
+
+static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_size_max;
+}
+
+static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->device_id;
+}
+
+static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vendor_id;
+}
+
+static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->status;
+}
+
+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
+
+	dev->status = status;
+
+	if (dev->started == started)
+		return;
+
+	dev->started = started;
+	if (dev->started) {
+		vduse_dev_start_dataplane(dev);
+	} else {
+		vduse_dev_reset(dev);
+		vduse_dev_stop_dataplane(dev);
+	}
+}
+
+static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->config_size;
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
+				  void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	memcpy(buf, dev->config + offset, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
+			const void *buf, unsigned int len)
+{
+	/* Now we only support read-only configuration space */
+}
+
+static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->generation;
+}
+
+static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
+				struct vhost_iotlb *iotlb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	int ret;
+
+	ret = vduse_domain_set_map(dev->domain, iotlb);
+	if (ret)
+		return ret;
+
+	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+	if (ret) {
+		vduse_domain_clear_map(dev->domain, iotlb);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	dev->vdev = NULL;
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_features		= vduse_vdpa_get_features,
+	.set_features		= vduse_vdpa_set_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config_size	= vduse_vdpa_get_config_size,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.get_generation		= vduse_vdpa_get_generation,
+	.set_map		= vduse_vdpa_set_map,
+	.free			= vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+				     unsigned long offset, size_t size,
+				     enum dma_data_direction dir,
+				     unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+					dma_addr_t *dma_addr, gfp_t flag,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova;
+	void *addr;
+
+	*dma_addr = DMA_MAPPING_ERROR;
+	addr = vduse_domain_alloc_coherent(domain, size,
+				(dma_addr_t *)&iova, flag, attrs);
+	if (!addr)
+		return NULL;
+
+	*dma_addr = (dma_addr_t)iova;
+
+	return addr;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
+}
+
+static size_t vduse_dev_max_mapping_size(struct device *dev)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return domain->bounce_size;
+}
+
+static const struct dma_map_ops vduse_dev_dma_ops = {
+	.map_page = vduse_dev_map_page,
+	.unmap_page = vduse_dev_unmap_page,
+	.alloc = vduse_dev_alloc_coherent,
+	.free = vduse_dev_free_coherent,
+	.max_mapping_size = vduse_dev_max_mapping_size,
+};
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+	unsigned int flags = 0;
+
+	switch (perm) {
+	case VDUSE_ACCESS_WO:
+		flags |= O_WRONLY;
+		break;
+	case VDUSE_ACCESS_RO:
+		flags |= O_RDONLY;
+		break;
+	case VDUSE_ACCESS_RW:
+		flags |= O_RDWR;
+		break;
+	default:
+		WARN(1, "invalidate vhost IOTLB permission\n");
+		break;
+	}
+
+	return flags;
+}
+
+static int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct eventfd_ctx *ctx = NULL;
+	struct vduse_virtqueue *vq;
+	u32 index;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	index = array_index_nospec(eventfd->index, dev->vq_num);
+	vq = &dev->vqs[index];
+	if (eventfd->fd >= 0) {
+		ctx = eventfd_ctx_fdget(eventfd->fd);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
+	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
+		return 0;
+
+	spin_lock(&vq->kick_lock);
+	if (vq->kickfd)
+		eventfd_ctx_put(vq->kickfd);
+	vq->kickfd = ctx;
+	if (vq->ready && vq->kicked && vq->kickfd) {
+		eventfd_signal(vq->kickfd, 1);
+		vq->kicked = false;
+	}
+	spin_unlock(&vq->kick_lock);
+
+	return 0;
+}
+
+static void vduse_dev_irq_inject(struct work_struct *work)
+{
+	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
+
+	spin_lock_irq(&dev->irq_lock);
+	if (dev->config_cb.callback)
+		dev->config_cb.callback(dev->config_cb.private);
+	spin_unlock_irq(&dev->irq_lock);
+}
+
+static void vduse_vq_irq_inject(struct work_struct *work)
+{
+	struct vduse_virtqueue *vq = container_of(work,
+					struct vduse_virtqueue, inject);
+
+	spin_lock_irq(&vq->irq_lock);
+	if (vq->ready && vq->cb.callback)
+		vq->cb.callback(vq->cb.private);
+	spin_unlock_irq(&vq->irq_lock);
+}
+
+static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
+			    unsigned long arg)
+{
+	struct vduse_dev *dev = file->private_data;
+	void __user *argp = (void __user *)arg;
+	int ret;
+
+	switch (cmd) {
+	case VDUSE_IOTLB_GET_FD: {
+		struct vduse_iotlb_entry entry;
+		struct vhost_iotlb_map *map;
+		struct vdpa_map_file *map_file;
+		struct vduse_iova_domain *domain = dev->domain;
+		struct file *f = NULL;
+
+		ret = -EFAULT;
+		if (copy_from_user(&entry, argp, sizeof(entry)))
+			break;
+
+		ret = -EINVAL;
+		if (entry.start > entry.last)
+			break;
+
+		spin_lock(&domain->iotlb_lock);
+		map = vhost_iotlb_itree_first(domain->iotlb,
+					      entry.start, entry.last);
+		if (map) {
+			map_file = (struct vdpa_map_file *)map->opaque;
+			f = get_file(map_file->file);
+			entry.offset = map_file->offset;
+			entry.start = map->start;
+			entry.last = map->last;
+			entry.perm = map->perm;
+		}
+		spin_unlock(&domain->iotlb_lock);
+		ret = -EINVAL;
+		if (!f)
+			break;
+
+		ret = -EFAULT;
+		if (copy_to_user(argp, &entry, sizeof(entry))) {
+			fput(f);
+			break;
+		}
+		ret = receive_fd(f, perm_to_file_flags(entry.perm));
+		fput(f);
+		break;
+	}
+	case VDUSE_DEV_GET_FEATURES:
+		ret = put_user(dev->features, (u64 __user *)argp);
+		break;
+	case VDUSE_DEV_UPDATE_CONFIG: {
+		struct vduse_config_update config;
+		unsigned long size = offsetof(struct vduse_config_update,
+					      buffer);
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, size))
+			break;
+
+		ret = -EINVAL;
+		if (config.length == 0 ||
+		    config.length > dev->config_size - config.offset)
+			break;
+
+		ret = -EFAULT;
+		if (copy_from_user(dev->config + config.offset, argp + size,
+				   config.length))
+			break;
+
+		ret = 0;
+		queue_work(vduse_irq_wq, &dev->inject);
+		break;
+	}
+	case VDUSE_VQ_GET_INFO: {
+		struct vduse_vq_info vq_info;
+		u32 vq_index;
+
+		ret = -EFAULT;
+		if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
+			break;
+
+		ret = -EINVAL;
+		if (vq_info.index >= dev->vq_num)
+			break;
+
+		vq_index = array_index_nospec(vq_info.index, dev->vq_num);
+		vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
+		vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
+		vq_info.device_addr = dev->vqs[vq_index].device_addr;
+		vq_info.num = dev->vqs[vq_index].num;
+		vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
+		vq_info.ready = dev->vqs[vq_index].ready;
+
+		ret = -EFAULT;
+		if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
+			break;
+
+		ret = 0;
+		break;
+	}
+	case VDUSE_VQ_SETUP_KICKFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_kickfd_setup(dev, &eventfd);
+		break;
+	}
+	case VDUSE_VQ_INJECT_IRQ: {
+		u32 vq_index;
+
+		ret = -EFAULT;
+		if (get_user(vq_index, (u32 __user *)argp))
+			break;
+
+		ret = -EINVAL;
+		if (vq_index >= dev->vq_num)
+			break;
+
+		ret = 0;
+		vq_index = array_index_nospec(vq_index, dev->vq_num);
+		queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
+		break;
+	}
+	default:
+		ret = -ENOIOCTLCMD;
+		break;
+	}
+
+	return ret;
+}
+
+static int vduse_dev_release(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = file->private_data;
+
+	spin_lock(&dev->msg_lock);
+	/* Make sure the inflight messages can processed after reconncection */
+	list_splice_init(&dev->recv_list, &dev->send_list);
+	spin_unlock(&dev->msg_lock);
+	dev->connected = false;
+
+	return 0;
+}
+
+static struct vduse_dev *vduse_dev_get_from_minor(int minor)
+{
+	struct vduse_dev *dev;
+
+	mutex_lock(&vduse_lock);
+	dev = idr_find(&vduse_idr, minor);
+	mutex_unlock(&vduse_lock);
+
+	return dev;
+}
+
+static int vduse_dev_open(struct inode *inode, struct file *file)
+{
+	int ret;
+	struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
+
+	if (!dev)
+		return -ENODEV;
+
+	ret = -EBUSY;
+	mutex_lock(&dev->lock);
+	if (dev->connected)
+		goto unlock;
+
+	ret = 0;
+	dev->connected = true;
+	file->private_data = dev;
+unlock:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_dev_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vduse_dev_open,
+	.release	= vduse_dev_release,
+	.read_iter	= vduse_dev_read_iter,
+	.write_iter	= vduse_dev_write_iter,
+	.poll		= vduse_dev_poll,
+	.unlocked_ioctl	= vduse_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct vduse_dev *vduse_dev_create(void)
+{
+	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+
+	if (!dev)
+		return NULL;
+
+	mutex_init(&dev->lock);
+	spin_lock_init(&dev->msg_lock);
+	INIT_LIST_HEAD(&dev->send_list);
+	INIT_LIST_HEAD(&dev->recv_list);
+	spin_lock_init(&dev->irq_lock);
+
+	INIT_WORK(&dev->inject, vduse_dev_irq_inject);
+	init_waitqueue_head(&dev->waitq);
+
+	return dev;
+}
+
+static void vduse_dev_destroy(struct vduse_dev *dev)
+{
+	kfree(dev);
+}
+
+static struct vduse_dev *vduse_find_dev(const char *name)
+{
+	struct vduse_dev *dev;
+	int id;
+
+	idr_for_each_entry(&vduse_idr, dev, id)
+		if (!strcmp(dev->name, name))
+			return dev;
+
+	return NULL;
+}
+
+static int vduse_destroy_dev(char *name)
+{
+	struct vduse_dev *dev = vduse_find_dev(name);
+
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&dev->lock);
+	if (dev->vdev || dev->connected) {
+		mutex_unlock(&dev->lock);
+		return -EBUSY;
+	}
+	dev->connected = true;
+	mutex_unlock(&dev->lock);
+
+	vduse_dev_msg_cleanup(dev);
+	device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
+	idr_remove(&vduse_idr, dev->minor);
+	kvfree(dev->config);
+	kfree(dev->vqs);
+	vduse_domain_destroy(dev->domain);
+	kfree(dev->name);
+	vduse_dev_destroy(dev);
+	module_put(THIS_MODULE);
+
+	return 0;
+}
+
+static bool device_is_allowed(u32 device_id)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
+		if (allowed_device_id[i] == device_id)
+			return true;
+
+	return false;
+}
+
+static bool features_is_valid(u64 features)
+{
+	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
+		return false;
+
+	/* Now we only support read-only configuration space */
+	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
+		return false;
+
+	return true;
+}
+
+static bool vduse_validate_config(struct vduse_dev_config *config)
+{
+	if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
+		return false;
+
+	if (config->vq_align > PAGE_SIZE)
+		return false;
+
+	if (config->config_size > PAGE_SIZE)
+		return false;
+
+	if (!device_is_allowed(config->device_id))
+		return false;
+
+	if (!features_is_valid(config->features))
+		return false;
+
+	return true;
+}
+
+static int vduse_create_dev(struct vduse_dev_config *config,
+			    void *config_buf, u64 api_version)
+{
+	int i, ret;
+	struct vduse_dev *dev;
+
+	ret = -EEXIST;
+	if (vduse_find_dev(config->name))
+		goto err;
+
+	ret = -ENOMEM;
+	dev = vduse_dev_create();
+	if (!dev)
+		goto err;
+
+	dev->api_version = api_version;
+	dev->user_features = config->features;
+	dev->device_id = config->device_id;
+	dev->vendor_id = config->vendor_id;
+	dev->name = kstrdup(config->name, GFP_KERNEL);
+	if (!dev->name)
+		goto err_str;
+
+	dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
+					  config->bounce_size);
+	if (!dev->domain)
+		goto err_domain;
+
+	dev->config = config_buf;
+	dev->config_size = config->config_size;
+	dev->vq_align = config->vq_align;
+	dev->vq_size_max = config->vq_size_max;
+	dev->vq_num = config->vq_num;
+	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
+	if (!dev->vqs)
+		goto err_vqs;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		dev->vqs[i].index = i;
+		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
+		spin_lock_init(&dev->vqs[i].kick_lock);
+		spin_lock_init(&dev->vqs[i].irq_lock);
+	}
+
+	ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
+	if (ret < 0)
+		goto err_idr;
+
+	dev->minor = ret;
+	dev->dev = device_create(vduse_class, NULL,
+				 MKDEV(MAJOR(vduse_major), dev->minor),
+				 NULL, "%s", config->name);
+	if (IS_ERR(dev->dev)) {
+		ret = PTR_ERR(dev->dev);
+		goto err_dev;
+	}
+	__module_get(THIS_MODULE);
+
+	return 0;
+err_dev:
+	idr_remove(&vduse_idr, dev->minor);
+err_idr:
+	kfree(dev->vqs);
+err_vqs:
+	vduse_domain_destroy(dev->domain);
+err_domain:
+	kfree(dev->name);
+err_str:
+	vduse_dev_destroy(dev);
+err:
+	kvfree(config_buf);
+	return ret;
+}
+
+static long vduse_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret;
+	void __user *argp = (void __user *)arg;
+	struct vduse_control *control = file->private_data;
+
+	mutex_lock(&vduse_lock);
+	switch (cmd) {
+	case VDUSE_GET_API_VERSION:
+		ret = put_user(control->api_version, (u64 __user *)argp);
+		break;
+	case VDUSE_SET_API_VERSION: {
+		u64 api_version;
+
+		ret = -EFAULT;
+		if (get_user(api_version, (u64 __user *)argp))
+			break;
+
+		ret = -EINVAL;
+		if (api_version > VDUSE_API_VERSION)
+			break;
+
+		ret = 0;
+		control->api_version = api_version;
+		break;
+	}
+	case VDUSE_CREATE_DEV: {
+		struct vduse_dev_config config;
+		unsigned long size = offsetof(struct vduse_dev_config, config);
+		void *buf;
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, size))
+			break;
+
+		ret = -EINVAL;
+		if (vduse_validate_config(&config) == false)
+			break;
+
+		buf = vmemdup_user(argp + size, config.config_size);
+		if (IS_ERR(buf)) {
+			ret = PTR_ERR(buf);
+			break;
+		}
+		ret = vduse_create_dev(&config, buf, control->api_version);
+		break;
+	}
+	case VDUSE_DESTROY_DEV: {
+		char name[VDUSE_NAME_MAX];
+
+		ret = -EFAULT;
+		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
+			break;
+
+		ret = vduse_destroy_dev(name);
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static int vduse_release(struct inode *inode, struct file *file)
+{
+	struct vduse_control *control = file->private_data;
+
+	kfree(control);
+	return 0;
+}
+
+static int vduse_open(struct inode *inode, struct file *file)
+{
+	struct vduse_control *control;
+
+	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
+	if (!control)
+		return -ENOMEM;
+
+	control->api_version = VDUSE_API_VERSION;
+	file->private_data = control;
+
+	return 0;
+}
+
+static const struct file_operations vduse_ctrl_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vduse_open,
+	.release	= vduse_release,
+	.unlocked_ioctl	= vduse_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static char *vduse_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
+}
+
+static void vduse_mgmtdev_release(struct device *dev)
+{
+}
+
+static struct device vduse_mgmtdev = {
+	.init_name = "vduse",
+	.release = vduse_mgmtdev_release,
+};
+
+static struct vdpa_mgmt_dev mgmt_dev;
+
+static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
+{
+	struct vduse_vdpa *vdev;
+	int ret;
+
+	if (dev->vdev)
+		return -EEXIST;
+
+	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
+				 &vduse_vdpa_config_ops, name, true);
+	if (!vdev)
+		return -ENOMEM;
+
+	dev->vdev = vdev;
+	vdev->dev = dev;
+	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
+	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
+	if (ret) {
+		put_device(&vdev->vdpa.dev);
+		return ret;
+	}
+	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
+	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
+	vdev->vdpa.mdev = &mgmt_dev;
+
+	return 0;
+}
+
+static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
+{
+	struct vduse_dev *dev;
+	int ret;
+
+	mutex_lock(&vduse_lock);
+	dev = vduse_find_dev(name);
+	if (!dev) {
+		mutex_unlock(&vduse_lock);
+		return -EINVAL;
+	}
+	ret = vduse_dev_init_vdpa(dev, name);
+	mutex_unlock(&vduse_lock);
+	if (ret)
+		return ret;
+
+	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
+	if (ret) {
+		put_device(&dev->vdev->vdpa.dev);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
+{
+	_vdpa_unregister_device(dev);
+}
+
+static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
+	.dev_add = vdpa_dev_add,
+	.dev_del = vdpa_dev_del,
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct vdpa_mgmt_dev mgmt_dev = {
+	.device = &vduse_mgmtdev,
+	.id_table = id_table,
+	.ops = &vdpa_dev_mgmtdev_ops,
+};
+
+static int vduse_mgmtdev_init(void)
+{
+	int ret;
+
+	ret = device_register(&vduse_mgmtdev);
+	if (ret)
+		return ret;
+
+	ret = vdpa_mgmtdev_register(&mgmt_dev);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	device_unregister(&vduse_mgmtdev);
+	return ret;
+}
+
+static void vduse_mgmtdev_exit(void)
+{
+	vdpa_mgmtdev_unregister(&mgmt_dev);
+	device_unregister(&vduse_mgmtdev);
+}
+
+static int vduse_init(void)
+{
+	int ret;
+	struct device *dev;
+
+	vduse_class = class_create(THIS_MODULE, "vduse");
+	if (IS_ERR(vduse_class))
+		return PTR_ERR(vduse_class);
+
+	vduse_class->devnode = vduse_devnode;
+
+	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
+	if (ret)
+		goto err_chardev_region;
+
+	/* /dev/vduse/control */
+	cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
+	vduse_ctrl_cdev.owner = THIS_MODULE;
+	ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
+	if (ret)
+		goto err_ctrl_cdev;
+
+	dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto err_device;
+	}
+
+	/* /dev/vduse/$DEVICE */
+	cdev_init(&vduse_cdev, &vduse_dev_fops);
+	vduse_cdev.owner = THIS_MODULE;
+	ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
+		       VDUSE_DEV_MAX - 1);
+	if (ret)
+		goto err_cdev;
+
+	vduse_irq_wq = alloc_workqueue("vduse-irq",
+				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
+	if (!vduse_irq_wq)
+		goto err_wq;
+
+	ret = vduse_domain_init();
+	if (ret)
+		goto err_domain;
+
+	ret = vduse_mgmtdev_init();
+	if (ret)
+		goto err_mgmtdev;
+
+	return 0;
+err_mgmtdev:
+	vduse_domain_exit();
+err_domain:
+	destroy_workqueue(vduse_irq_wq);
+err_wq:
+	cdev_del(&vduse_cdev);
+err_cdev:
+	device_destroy(vduse_class, vduse_major);
+err_device:
+	cdev_del(&vduse_ctrl_cdev);
+err_ctrl_cdev:
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+err_chardev_region:
+	class_destroy(vduse_class);
+	return ret;
+}
+module_init(vduse_init);
+
+static void vduse_exit(void)
+{
+	vduse_mgmtdev_exit();
+	vduse_domain_exit();
+	destroy_workqueue(vduse_irq_wq);
+	cdev_del(&vduse_cdev);
+	device_destroy(vduse_class, vduse_major);
+	cdev_del(&vduse_ctrl_cdev);
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+	class_destroy(vduse_class);
+}
+module_exit(vduse_exit);
+
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
new file mode 100644
index 000000000000..f21b2e51b5c8
--- /dev/null
+++ b/include/uapi/linux/vduse.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_VDUSE_H_
+#define _UAPI_VDUSE_H_
+
+#include <linux/types.h>
+
+#define VDUSE_API_VERSION	0
+
+#define VDUSE_NAME_MAX	256
+
+/* the control messages definition for read/write */
+
+enum vduse_req_type {
+	/* Get the state for virtqueue from userspace */
+	VDUSE_GET_VQ_STATE,
+	/* Notify userspace to start the dataplane, no reply */
+	VDUSE_START_DATAPLANE,
+	/* Notify userspace to stop the dataplane, no reply */
+	VDUSE_STOP_DATAPLANE,
+	/* Notify userspace to update the memory mapping in device IOTLB */
+	VDUSE_UPDATE_IOTLB,
+};
+
+struct vduse_vq_state {
+	__u32 index; /* virtqueue index */
+	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
+};
+
+struct vduse_iova_range {
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* end of the IOVA range */
+};
+
+struct vduse_dev_request {
+	__u32 type; /* request type */
+	__u32 request_id; /* request id */
+#define VDUSE_REQ_FLAGS_NO_REPLY	(1 << 0) /* No need to reply */
+	__u32 flags; /* request flags */
+	__u32 reserved; /* for future use */
+	union {
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_iova_range iova; /* iova range for updating */
+		__u32 padding[16]; /* padding */
+	};
+};
+
+struct vduse_dev_response {
+	__u32 request_id; /* corresponding request id */
+#define VDUSE_REQ_RESULT_OK	0x00
+#define VDUSE_REQ_RESULT_FAILED	0x01
+	__u32 result; /* the result of request */
+	__u32 reserved[2]; /* for future use */
+	union {
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		__u32 padding[16]; /* padding */
+	};
+};
+
+/* ioctls */
+
+struct vduse_dev_config {
+	char name[VDUSE_NAME_MAX]; /* vduse device name */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 features; /* device features */
+	__u64 bounce_size; /* bounce buffer size for iommu */
+	__u16 vq_size_max; /* the max size of virtqueue */
+	__u16 padding; /* padding */
+	__u32 vq_num; /* the number of virtqueues */
+	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
+	__u32 config_size; /* the size of the configuration space */
+	__u32 reserved[15]; /* for future use */
+	__u8 config[0]; /* the buffer of the configuration space */
+};
+
+struct vduse_iotlb_entry {
+	__u64 offset; /* the mmap offset on fd */
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* last of the IOVA range */
+#define VDUSE_ACCESS_RO 0x1
+#define VDUSE_ACCESS_WO 0x2
+#define VDUSE_ACCESS_RW 0x3
+	__u8 perm; /* access permission of this range */
+};
+
+struct vduse_config_update {
+	__u32 offset; /* offset from the beginning of configuration space */
+	__u32 length; /* the length to write to configuration space */
+	__u8 buffer[0]; /* buffer used to write from */
+};
+
+struct vduse_vq_info {
+	__u32 index; /* virtqueue index */
+	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
+	__u64 desc_addr; /* address of desc area */
+	__u64 driver_addr; /* address of driver area */
+	__u64 device_addr; /* address of device area */
+	__u32 num; /* the size of virtqueue */
+	__u8 ready; /* ready status of virtqueue */
+};
+
+struct vduse_vq_eventfd {
+	__u32 index; /* virtqueue index */
+#define VDUSE_EVENTFD_DEASSIGN -1
+	int fd; /* eventfd, -1 means de-assigning the eventfd */
+};
+
+#define VDUSE_BASE	0x81
+
+/* Get the version of VDUSE API. This is used for future extension */
+#define VDUSE_GET_API_VERSION	_IOR(VDUSE_BASE, 0x00, __u64)
+
+/* Set the version of VDUSE API. */
+#define VDUSE_SET_API_VERSION	_IOW(VDUSE_BASE, 0x01, __u64)
+
+/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
+#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
+
+/* Destroy a vduse device. Make sure there are no references to the char device */
+#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
+
+/*
+ * Get a file descriptor for the first overlapped iova region,
+ * -EINVAL means the iova region doesn't exist.
+ */
+#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
+
+/* Get the negotiated features */
+#define VDUSE_DEV_GET_FEATURES	_IOR(VDUSE_BASE, 0x05, __u64)
+
+/* Update the configuration space */
+#define VDUSE_DEV_UPDATE_CONFIG	_IOW(VDUSE_BASE, 0x06, struct vduse_config_update)
+
+/* Get the specified virtqueue's information */
+#define VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x07, struct vduse_vq_info)
+
+/* Setup an eventfd to receive kick for virtqueue */
+#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x08, struct vduse_vq_eventfd)
+
+/* Inject an interrupt for specific virtqueue */
+#define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x09, __u32)
+
+#endif /* _UAPI_VDUSE_H_ */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

This VDUSE driver enables implementing vDPA devices in userspace.
The vDPA device's control path is handled in kernel and the data
path is handled in userspace.

A message mechnism is used by VDUSE driver to forward some control
messages such as starting/stopping datapath to userspace. Userspace
can use read()/write() to receive/reply those control messages.

And some ioctls are introduced to help userspace to implement the
data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
descriptors referring to vDPA device's iova regions. Then userspace
can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
ioctls can be used to inject interrupt and setup the kickfd for
virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
configuration space and inject a config interrupt.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |   10 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
 include/uapi/linux/vduse.h                         |  143 ++
 6 files changed, 1613 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 9bfc2b510c64..acd95e9dcfe7 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
 'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
 '|'   00-7F  linux/media.h
 0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
 0x89  00-06  arch/x86/include/asm/sockios.h
 0x89  0B-DF  linux/sockios.h
 0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
 	  vDPA block device simulator which terminates IO request in a
 	  memory buffer.
 
+config VDPA_USER
+	tristate "VDUSE (vDPA Device in Userspace) support"
+	depends on EVENTFD && MMU && HAS_DMA
+	select DMA_OPS
+	select VHOST_IOTLB
+	select IOMMU_IOVA
+	help
+	  With VDUSE it is possible to emulate a vDPA Device
+	  in a userspace program.
+
 config IFCVF
 	tristate "Intel IFC VF vDPA driver"
 	depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
 obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
 obj-$(CONFIG_IFCVF)    += ifcvf/
 obj-$(CONFIG_MLX5_VDPA) += mlx5/
 obj-$(CONFIG_VP_VDPA)    += virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index 000000000000..5271cbd15e28
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1453 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@bytedance.com>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/slab.h>
+#include <linux/wait.h>
+#include <linux/dma-map-ops.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/vdpa.h>
+#include <linux/nospec.h>
+#include <uapi/linux/vduse.h>
+#include <uapi/linux/vdpa.h>
+#include <uapi/linux/virtio_config.h>
+#include <uapi/linux/virtio_ids.h>
+#include <uapi/linux/virtio_blk.h>
+#include <linux/mod_devicetable.h>
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
+#define DRV_DESC     "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30
+
+struct vduse_virtqueue {
+	u16 index;
+	u32 num;
+	u32 avail_idx;
+	u64 desc_addr;
+	u64 driver_addr;
+	u64 device_addr;
+	bool ready;
+	bool kicked;
+	spinlock_t kick_lock;
+	spinlock_t irq_lock;
+	struct eventfd_ctx *kickfd;
+	struct vdpa_callback cb;
+	struct work_struct inject;
+};
+
+struct vduse_dev;
+
+struct vduse_vdpa {
+	struct vdpa_device vdpa;
+	struct vduse_dev *dev;
+};
+
+struct vduse_dev {
+	struct vduse_vdpa *vdev;
+	struct device *dev;
+	struct vduse_virtqueue *vqs;
+	struct vduse_iova_domain *domain;
+	char *name;
+	struct mutex lock;
+	spinlock_t msg_lock;
+	u64 msg_unique;
+	wait_queue_head_t waitq;
+	struct list_head send_list;
+	struct list_head recv_list;
+	struct vdpa_callback config_cb;
+	struct work_struct inject;
+	spinlock_t irq_lock;
+	int minor;
+	bool connected;
+	bool started;
+	u64 api_version;
+	u64 user_features;
+	u64 features;
+	u32 device_id;
+	u32 vendor_id;
+	u32 generation;
+	u32 config_size;
+	void *config;
+	u8 status;
+	u16 vq_size_max;
+	u32 vq_num;
+	u32 vq_align;
+};
+
+struct vduse_dev_msg {
+	struct vduse_dev_request req;
+	struct vduse_dev_response resp;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	bool completed;
+};
+
+struct vduse_control {
+	u64 api_version;
+};
+
+static DEFINE_MUTEX(vduse_lock);
+static DEFINE_IDR(vduse_idr);
+
+static dev_t vduse_major;
+static struct class *vduse_class;
+static struct cdev vduse_ctrl_cdev;
+static struct cdev vduse_cdev;
+static struct workqueue_struct *vduse_irq_wq;
+
+static u32 allowed_device_id[] = {
+	VIRTIO_ID_BLOCK,
+};
+
+static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
+{
+	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
+
+	return vdev->dev;
+}
+
+static inline struct vduse_dev *dev_to_vduse(struct device *dev)
+{
+	struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+	return vdpa_to_vduse(vdpa);
+}
+
+static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
+					    uint32_t request_id)
+{
+	struct vduse_dev_msg *msg;
+
+	list_for_each_entry(msg, head, list) {
+		if (msg->req.request_id == request_id) {
+			list_del(&msg->list);
+			return msg;
+		}
+	}
+
+	return NULL;
+}
+
+static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
+{
+	struct vduse_dev_msg *msg = NULL;
+
+	if (!list_empty(head)) {
+		msg = list_first_entry(head, struct vduse_dev_msg, list);
+		list_del(&msg->list);
+	}
+
+	return msg;
+}
+
+static void vduse_enqueue_msg(struct list_head *head,
+			      struct vduse_dev_msg *msg)
+{
+	list_add_tail(&msg->list, head);
+}
+
+static int vduse_dev_msg_send(struct vduse_dev *dev,
+			      struct vduse_dev_msg *msg, bool no_reply)
+{
+	init_waitqueue_head(&msg->waitq);
+	spin_lock(&dev->msg_lock);
+	msg->req.request_id = dev->msg_unique++;
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	spin_unlock(&dev->msg_lock);
+	if (no_reply)
+		return 0;
+
+	wait_event_killable_timeout(msg->waitq, msg->completed,
+				    VDUSE_REQUEST_TIMEOUT * HZ);
+	spin_lock(&dev->msg_lock);
+	if (!msg->completed) {
+		list_del(&msg->list);
+		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
+	}
+	spin_unlock(&dev->msg_lock);
+
+	return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
+}
+
+static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg;
+
+	spin_lock(&dev->msg_lock);
+	while ((msg = vduse_dequeue_msg(&dev->send_list))) {
+		if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
+			kfree(msg);
+		else
+			vduse_enqueue_msg(&dev->recv_list, msg);
+	}
+	while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
+		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
+		msg->completed = 1;
+		wake_up(&msg->waitq);
+	}
+	spin_unlock(&dev->msg_lock);
+}
+
+static void vduse_dev_start_dataplane(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
+					    GFP_KERNEL | __GFP_NOFAIL);
+
+	msg->req.type = VDUSE_START_DATAPLANE;
+	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
+	vduse_dev_msg_send(dev, msg, true);
+}
+
+static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
+{
+	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
+					    GFP_KERNEL | __GFP_NOFAIL);
+
+	msg->req.type = VDUSE_STOP_DATAPLANE;
+	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
+	vduse_dev_msg_send(dev, msg, true);
+}
+
+static int vduse_dev_get_vq_state(struct vduse_dev *dev,
+				  struct vduse_virtqueue *vq,
+				  struct vdpa_vq_state *state)
+{
+	struct vduse_dev_msg msg = { 0 };
+	int ret;
+
+	msg.req.type = VDUSE_GET_VQ_STATE;
+	msg.req.vq_state.index = vq->index;
+
+	ret = vduse_dev_msg_send(dev, &msg, false);
+	if (ret)
+		return ret;
+
+	state->avail_index = msg.resp.vq_state.avail_idx;
+	return 0;
+}
+
+static int vduse_dev_update_iotlb(struct vduse_dev *dev,
+				u64 start, u64 last)
+{
+	struct vduse_dev_msg msg = { 0 };
+
+	if (last < start)
+		return -EINVAL;
+
+	msg.req.type = VDUSE_UPDATE_IOTLB;
+	msg.req.iova.start = start;
+	msg.req.iova.last = last;
+
+	return vduse_dev_msg_send(dev, &msg, false);
+}
+
+static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_msg *msg;
+	int size = sizeof(struct vduse_dev_request);
+	ssize_t ret;
+
+	if (iov_iter_count(to) < size)
+		return -EINVAL;
+
+	spin_lock(&dev->msg_lock);
+	while (1) {
+		msg = vduse_dequeue_msg(&dev->send_list);
+		if (msg)
+			break;
+
+		ret = -EAGAIN;
+		if (file->f_flags & O_NONBLOCK)
+			goto unlock;
+
+		spin_unlock(&dev->msg_lock);
+		ret = wait_event_interruptible_exclusive(dev->waitq,
+					!list_empty(&dev->send_list));
+		if (ret)
+			return ret;
+
+		spin_lock(&dev->msg_lock);
+	}
+	spin_unlock(&dev->msg_lock);
+	ret = copy_to_iter(&msg->req, size, to);
+	spin_lock(&dev->msg_lock);
+	if (ret != size) {
+		ret = -EFAULT;
+		vduse_enqueue_msg(&dev->send_list, msg);
+		goto unlock;
+	}
+	if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
+		kfree(msg);
+	else
+		vduse_enqueue_msg(&dev->recv_list, msg);
+unlock:
+	spin_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vduse_dev *dev = file->private_data;
+	struct vduse_dev_response resp;
+	struct vduse_dev_msg *msg;
+	size_t ret;
+
+	ret = copy_from_iter(&resp, sizeof(resp), from);
+	if (ret != sizeof(resp))
+		return -EINVAL;
+
+	spin_lock(&dev->msg_lock);
+	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
+	if (!msg) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	memcpy(&msg->resp, &resp, sizeof(resp));
+	msg->completed = 1;
+	wake_up(&msg->waitq);
+unlock:
+	spin_unlock(&dev->msg_lock);
+
+	return ret;
+}
+
+static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
+{
+	struct vduse_dev *dev = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &dev->waitq, wait);
+
+	if (!list_empty(&dev->send_list))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	if (!list_empty(&dev->recv_list))
+		mask |= EPOLLOUT | EPOLLWRNORM;
+
+	return mask;
+}
+
+static void vduse_dev_reset(struct vduse_dev *dev)
+{
+	int i;
+	struct vduse_iova_domain *domain = dev->domain;
+
+	/* The coherent mappings are handled in vduse_dev_free_coherent() */
+	if (domain->bounce_map)
+		vduse_domain_reset_bounce_map(domain);
+
+	dev->features = 0;
+	dev->generation++;
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = NULL;
+	dev->config_cb.private = NULL;
+	spin_unlock(&dev->irq_lock);
+
+	for (i = 0; i < dev->vq_num; i++) {
+		struct vduse_virtqueue *vq = &dev->vqs[i];
+
+		vq->ready = false;
+		vq->desc_addr = 0;
+		vq->driver_addr = 0;
+		vq->device_addr = 0;
+		vq->avail_idx = 0;
+		vq->num = 0;
+
+		spin_lock(&vq->kick_lock);
+		vq->kicked = false;
+		if (vq->kickfd)
+			eventfd_ctx_put(vq->kickfd);
+		vq->kickfd = NULL;
+		spin_unlock(&vq->kick_lock);
+
+		spin_lock(&vq->irq_lock);
+		vq->cb.callback = NULL;
+		vq->cb.private = NULL;
+		spin_unlock(&vq->irq_lock);
+	}
+}
+
+static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+				u64 desc_area, u64 driver_area,
+				u64 device_area)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->desc_addr = desc_area;
+	vq->driver_addr = driver_area;
+	vq->device_addr = device_area;
+
+	return 0;
+}
+
+static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->kick_lock);
+	if (!vq->ready)
+		goto unlock;
+
+	if (vq->kickfd)
+		eventfd_signal(vq->kickfd, 1);
+	else
+		vq->kicked = true;
+unlock:
+	spin_unlock(&vq->kick_lock);
+}
+
+static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+			      struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	spin_lock(&vq->irq_lock);
+	vq->cb.callback = cb->callback;
+	vq->cb.private = cb->private;
+	spin_unlock(&vq->irq_lock);
+}
+
+static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->num = num;
+}
+
+static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
+					u16 idx, bool ready)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->ready = ready;
+}
+
+static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vq->ready;
+}
+
+static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
+				const struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	vq->avail_idx = state->avail_index;
+	return 0;
+}
+
+static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
+				struct vdpa_vq_state *state)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_virtqueue *vq = &dev->vqs[idx];
+
+	return vduse_dev_get_vq_state(dev, vq, state);
+}
+
+static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_align;
+}
+
+static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->user_features;
+}
+
+static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	dev->features = features;
+	return 0;
+}
+
+static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
+				  struct vdpa_callback *cb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	spin_lock(&dev->irq_lock);
+	dev->config_cb.callback = cb->callback;
+	dev->config_cb.private = cb->private;
+	spin_unlock(&dev->irq_lock);
+}
+
+static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vq_size_max;
+}
+
+static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->device_id;
+}
+
+static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->vendor_id;
+}
+
+static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->status;
+}
+
+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
+
+	dev->status = status;
+
+	if (dev->started == started)
+		return;
+
+	dev->started = started;
+	if (dev->started) {
+		vduse_dev_start_dataplane(dev);
+	} else {
+		vduse_dev_reset(dev);
+		vduse_dev_stop_dataplane(dev);
+	}
+}
+
+static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->config_size;
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
+				  void *buf, unsigned int len)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	memcpy(buf, dev->config + offset, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
+			const void *buf, unsigned int len)
+{
+	/* Now we only support read-only configuration space */
+}
+
+static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	return dev->generation;
+}
+
+static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
+				struct vhost_iotlb *iotlb)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	int ret;
+
+	ret = vduse_domain_set_map(dev->domain, iotlb);
+	if (ret)
+		return ret;
+
+	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+	if (ret) {
+		vduse_domain_clear_map(dev->domain, iotlb);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+	dev->vdev = NULL;
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_features		= vduse_vdpa_get_features,
+	.set_features		= vduse_vdpa_set_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config_size	= vduse_vdpa_get_config_size,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.get_generation		= vduse_vdpa_get_generation,
+	.set_map		= vduse_vdpa_set_map,
+	.free			= vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+				     unsigned long offset, size_t size,
+				     enum dma_data_direction dir,
+				     unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+				size_t size, enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+					dma_addr_t *dma_addr, gfp_t flag,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+	unsigned long iova;
+	void *addr;
+
+	*dma_addr = DMA_MAPPING_ERROR;
+	addr = vduse_domain_alloc_coherent(domain, size,
+				(dma_addr_t *)&iova, flag, attrs);
+	if (!addr)
+		return NULL;
+
+	*dma_addr = (dma_addr_t)iova;
+
+	return addr;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					unsigned long attrs)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
+}
+
+static size_t vduse_dev_max_mapping_size(struct device *dev)
+{
+	struct vduse_dev *vdev = dev_to_vduse(dev);
+	struct vduse_iova_domain *domain = vdev->domain;
+
+	return domain->bounce_size;
+}
+
+static const struct dma_map_ops vduse_dev_dma_ops = {
+	.map_page = vduse_dev_map_page,
+	.unmap_page = vduse_dev_unmap_page,
+	.alloc = vduse_dev_alloc_coherent,
+	.free = vduse_dev_free_coherent,
+	.max_mapping_size = vduse_dev_max_mapping_size,
+};
+
+static unsigned int perm_to_file_flags(u8 perm)
+{
+	unsigned int flags = 0;
+
+	switch (perm) {
+	case VDUSE_ACCESS_WO:
+		flags |= O_WRONLY;
+		break;
+	case VDUSE_ACCESS_RO:
+		flags |= O_RDONLY;
+		break;
+	case VDUSE_ACCESS_RW:
+		flags |= O_RDWR;
+		break;
+	default:
+		WARN(1, "invalidate vhost IOTLB permission\n");
+		break;
+	}
+
+	return flags;
+}
+
+static int vduse_kickfd_setup(struct vduse_dev *dev,
+			struct vduse_vq_eventfd *eventfd)
+{
+	struct eventfd_ctx *ctx = NULL;
+	struct vduse_virtqueue *vq;
+	u32 index;
+
+	if (eventfd->index >= dev->vq_num)
+		return -EINVAL;
+
+	index = array_index_nospec(eventfd->index, dev->vq_num);
+	vq = &dev->vqs[index];
+	if (eventfd->fd >= 0) {
+		ctx = eventfd_ctx_fdget(eventfd->fd);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
+	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
+		return 0;
+
+	spin_lock(&vq->kick_lock);
+	if (vq->kickfd)
+		eventfd_ctx_put(vq->kickfd);
+	vq->kickfd = ctx;
+	if (vq->ready && vq->kicked && vq->kickfd) {
+		eventfd_signal(vq->kickfd, 1);
+		vq->kicked = false;
+	}
+	spin_unlock(&vq->kick_lock);
+
+	return 0;
+}
+
+static void vduse_dev_irq_inject(struct work_struct *work)
+{
+	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
+
+	spin_lock_irq(&dev->irq_lock);
+	if (dev->config_cb.callback)
+		dev->config_cb.callback(dev->config_cb.private);
+	spin_unlock_irq(&dev->irq_lock);
+}
+
+static void vduse_vq_irq_inject(struct work_struct *work)
+{
+	struct vduse_virtqueue *vq = container_of(work,
+					struct vduse_virtqueue, inject);
+
+	spin_lock_irq(&vq->irq_lock);
+	if (vq->ready && vq->cb.callback)
+		vq->cb.callback(vq->cb.private);
+	spin_unlock_irq(&vq->irq_lock);
+}
+
+static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
+			    unsigned long arg)
+{
+	struct vduse_dev *dev = file->private_data;
+	void __user *argp = (void __user *)arg;
+	int ret;
+
+	switch (cmd) {
+	case VDUSE_IOTLB_GET_FD: {
+		struct vduse_iotlb_entry entry;
+		struct vhost_iotlb_map *map;
+		struct vdpa_map_file *map_file;
+		struct vduse_iova_domain *domain = dev->domain;
+		struct file *f = NULL;
+
+		ret = -EFAULT;
+		if (copy_from_user(&entry, argp, sizeof(entry)))
+			break;
+
+		ret = -EINVAL;
+		if (entry.start > entry.last)
+			break;
+
+		spin_lock(&domain->iotlb_lock);
+		map = vhost_iotlb_itree_first(domain->iotlb,
+					      entry.start, entry.last);
+		if (map) {
+			map_file = (struct vdpa_map_file *)map->opaque;
+			f = get_file(map_file->file);
+			entry.offset = map_file->offset;
+			entry.start = map->start;
+			entry.last = map->last;
+			entry.perm = map->perm;
+		}
+		spin_unlock(&domain->iotlb_lock);
+		ret = -EINVAL;
+		if (!f)
+			break;
+
+		ret = -EFAULT;
+		if (copy_to_user(argp, &entry, sizeof(entry))) {
+			fput(f);
+			break;
+		}
+		ret = receive_fd(f, perm_to_file_flags(entry.perm));
+		fput(f);
+		break;
+	}
+	case VDUSE_DEV_GET_FEATURES:
+		ret = put_user(dev->features, (u64 __user *)argp);
+		break;
+	case VDUSE_DEV_UPDATE_CONFIG: {
+		struct vduse_config_update config;
+		unsigned long size = offsetof(struct vduse_config_update,
+					      buffer);
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, size))
+			break;
+
+		ret = -EINVAL;
+		if (config.length == 0 ||
+		    config.length > dev->config_size - config.offset)
+			break;
+
+		ret = -EFAULT;
+		if (copy_from_user(dev->config + config.offset, argp + size,
+				   config.length))
+			break;
+
+		ret = 0;
+		queue_work(vduse_irq_wq, &dev->inject);
+		break;
+	}
+	case VDUSE_VQ_GET_INFO: {
+		struct vduse_vq_info vq_info;
+		u32 vq_index;
+
+		ret = -EFAULT;
+		if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
+			break;
+
+		ret = -EINVAL;
+		if (vq_info.index >= dev->vq_num)
+			break;
+
+		vq_index = array_index_nospec(vq_info.index, dev->vq_num);
+		vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
+		vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
+		vq_info.device_addr = dev->vqs[vq_index].device_addr;
+		vq_info.num = dev->vqs[vq_index].num;
+		vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
+		vq_info.ready = dev->vqs[vq_index].ready;
+
+		ret = -EFAULT;
+		if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
+			break;
+
+		ret = 0;
+		break;
+	}
+	case VDUSE_VQ_SETUP_KICKFD: {
+		struct vduse_vq_eventfd eventfd;
+
+		ret = -EFAULT;
+		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
+			break;
+
+		ret = vduse_kickfd_setup(dev, &eventfd);
+		break;
+	}
+	case VDUSE_VQ_INJECT_IRQ: {
+		u32 vq_index;
+
+		ret = -EFAULT;
+		if (get_user(vq_index, (u32 __user *)argp))
+			break;
+
+		ret = -EINVAL;
+		if (vq_index >= dev->vq_num)
+			break;
+
+		ret = 0;
+		vq_index = array_index_nospec(vq_index, dev->vq_num);
+		queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
+		break;
+	}
+	default:
+		ret = -ENOIOCTLCMD;
+		break;
+	}
+
+	return ret;
+}
+
+static int vduse_dev_release(struct inode *inode, struct file *file)
+{
+	struct vduse_dev *dev = file->private_data;
+
+	spin_lock(&dev->msg_lock);
+	/* Make sure the inflight messages can processed after reconncection */
+	list_splice_init(&dev->recv_list, &dev->send_list);
+	spin_unlock(&dev->msg_lock);
+	dev->connected = false;
+
+	return 0;
+}
+
+static struct vduse_dev *vduse_dev_get_from_minor(int minor)
+{
+	struct vduse_dev *dev;
+
+	mutex_lock(&vduse_lock);
+	dev = idr_find(&vduse_idr, minor);
+	mutex_unlock(&vduse_lock);
+
+	return dev;
+}
+
+static int vduse_dev_open(struct inode *inode, struct file *file)
+{
+	int ret;
+	struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
+
+	if (!dev)
+		return -ENODEV;
+
+	ret = -EBUSY;
+	mutex_lock(&dev->lock);
+	if (dev->connected)
+		goto unlock;
+
+	ret = 0;
+	dev->connected = true;
+	file->private_data = dev;
+unlock:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static const struct file_operations vduse_dev_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vduse_dev_open,
+	.release	= vduse_dev_release,
+	.read_iter	= vduse_dev_read_iter,
+	.write_iter	= vduse_dev_write_iter,
+	.poll		= vduse_dev_poll,
+	.unlocked_ioctl	= vduse_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static struct vduse_dev *vduse_dev_create(void)
+{
+	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+
+	if (!dev)
+		return NULL;
+
+	mutex_init(&dev->lock);
+	spin_lock_init(&dev->msg_lock);
+	INIT_LIST_HEAD(&dev->send_list);
+	INIT_LIST_HEAD(&dev->recv_list);
+	spin_lock_init(&dev->irq_lock);
+
+	INIT_WORK(&dev->inject, vduse_dev_irq_inject);
+	init_waitqueue_head(&dev->waitq);
+
+	return dev;
+}
+
+static void vduse_dev_destroy(struct vduse_dev *dev)
+{
+	kfree(dev);
+}
+
+static struct vduse_dev *vduse_find_dev(const char *name)
+{
+	struct vduse_dev *dev;
+	int id;
+
+	idr_for_each_entry(&vduse_idr, dev, id)
+		if (!strcmp(dev->name, name))
+			return dev;
+
+	return NULL;
+}
+
+static int vduse_destroy_dev(char *name)
+{
+	struct vduse_dev *dev = vduse_find_dev(name);
+
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&dev->lock);
+	if (dev->vdev || dev->connected) {
+		mutex_unlock(&dev->lock);
+		return -EBUSY;
+	}
+	dev->connected = true;
+	mutex_unlock(&dev->lock);
+
+	vduse_dev_msg_cleanup(dev);
+	device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
+	idr_remove(&vduse_idr, dev->minor);
+	kvfree(dev->config);
+	kfree(dev->vqs);
+	vduse_domain_destroy(dev->domain);
+	kfree(dev->name);
+	vduse_dev_destroy(dev);
+	module_put(THIS_MODULE);
+
+	return 0;
+}
+
+static bool device_is_allowed(u32 device_id)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
+		if (allowed_device_id[i] == device_id)
+			return true;
+
+	return false;
+}
+
+static bool features_is_valid(u64 features)
+{
+	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
+		return false;
+
+	/* Now we only support read-only configuration space */
+	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
+		return false;
+
+	return true;
+}
+
+static bool vduse_validate_config(struct vduse_dev_config *config)
+{
+	if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
+		return false;
+
+	if (config->vq_align > PAGE_SIZE)
+		return false;
+
+	if (config->config_size > PAGE_SIZE)
+		return false;
+
+	if (!device_is_allowed(config->device_id))
+		return false;
+
+	if (!features_is_valid(config->features))
+		return false;
+
+	return true;
+}
+
+static int vduse_create_dev(struct vduse_dev_config *config,
+			    void *config_buf, u64 api_version)
+{
+	int i, ret;
+	struct vduse_dev *dev;
+
+	ret = -EEXIST;
+	if (vduse_find_dev(config->name))
+		goto err;
+
+	ret = -ENOMEM;
+	dev = vduse_dev_create();
+	if (!dev)
+		goto err;
+
+	dev->api_version = api_version;
+	dev->user_features = config->features;
+	dev->device_id = config->device_id;
+	dev->vendor_id = config->vendor_id;
+	dev->name = kstrdup(config->name, GFP_KERNEL);
+	if (!dev->name)
+		goto err_str;
+
+	dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
+					  config->bounce_size);
+	if (!dev->domain)
+		goto err_domain;
+
+	dev->config = config_buf;
+	dev->config_size = config->config_size;
+	dev->vq_align = config->vq_align;
+	dev->vq_size_max = config->vq_size_max;
+	dev->vq_num = config->vq_num;
+	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
+	if (!dev->vqs)
+		goto err_vqs;
+
+	for (i = 0; i < dev->vq_num; i++) {
+		dev->vqs[i].index = i;
+		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
+		spin_lock_init(&dev->vqs[i].kick_lock);
+		spin_lock_init(&dev->vqs[i].irq_lock);
+	}
+
+	ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
+	if (ret < 0)
+		goto err_idr;
+
+	dev->minor = ret;
+	dev->dev = device_create(vduse_class, NULL,
+				 MKDEV(MAJOR(vduse_major), dev->minor),
+				 NULL, "%s", config->name);
+	if (IS_ERR(dev->dev)) {
+		ret = PTR_ERR(dev->dev);
+		goto err_dev;
+	}
+	__module_get(THIS_MODULE);
+
+	return 0;
+err_dev:
+	idr_remove(&vduse_idr, dev->minor);
+err_idr:
+	kfree(dev->vqs);
+err_vqs:
+	vduse_domain_destroy(dev->domain);
+err_domain:
+	kfree(dev->name);
+err_str:
+	vduse_dev_destroy(dev);
+err:
+	kvfree(config_buf);
+	return ret;
+}
+
+static long vduse_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret;
+	void __user *argp = (void __user *)arg;
+	struct vduse_control *control = file->private_data;
+
+	mutex_lock(&vduse_lock);
+	switch (cmd) {
+	case VDUSE_GET_API_VERSION:
+		ret = put_user(control->api_version, (u64 __user *)argp);
+		break;
+	case VDUSE_SET_API_VERSION: {
+		u64 api_version;
+
+		ret = -EFAULT;
+		if (get_user(api_version, (u64 __user *)argp))
+			break;
+
+		ret = -EINVAL;
+		if (api_version > VDUSE_API_VERSION)
+			break;
+
+		ret = 0;
+		control->api_version = api_version;
+		break;
+	}
+	case VDUSE_CREATE_DEV: {
+		struct vduse_dev_config config;
+		unsigned long size = offsetof(struct vduse_dev_config, config);
+		void *buf;
+
+		ret = -EFAULT;
+		if (copy_from_user(&config, argp, size))
+			break;
+
+		ret = -EINVAL;
+		if (vduse_validate_config(&config) == false)
+			break;
+
+		buf = vmemdup_user(argp + size, config.config_size);
+		if (IS_ERR(buf)) {
+			ret = PTR_ERR(buf);
+			break;
+		}
+		ret = vduse_create_dev(&config, buf, control->api_version);
+		break;
+	}
+	case VDUSE_DESTROY_DEV: {
+		char name[VDUSE_NAME_MAX];
+
+		ret = -EFAULT;
+		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
+			break;
+
+		ret = vduse_destroy_dev(name);
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	mutex_unlock(&vduse_lock);
+
+	return ret;
+}
+
+static int vduse_release(struct inode *inode, struct file *file)
+{
+	struct vduse_control *control = file->private_data;
+
+	kfree(control);
+	return 0;
+}
+
+static int vduse_open(struct inode *inode, struct file *file)
+{
+	struct vduse_control *control;
+
+	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
+	if (!control)
+		return -ENOMEM;
+
+	control->api_version = VDUSE_API_VERSION;
+	file->private_data = control;
+
+	return 0;
+}
+
+static const struct file_operations vduse_ctrl_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vduse_open,
+	.release	= vduse_release,
+	.unlocked_ioctl	= vduse_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.llseek		= noop_llseek,
+};
+
+static char *vduse_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
+}
+
+static void vduse_mgmtdev_release(struct device *dev)
+{
+}
+
+static struct device vduse_mgmtdev = {
+	.init_name = "vduse",
+	.release = vduse_mgmtdev_release,
+};
+
+static struct vdpa_mgmt_dev mgmt_dev;
+
+static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
+{
+	struct vduse_vdpa *vdev;
+	int ret;
+
+	if (dev->vdev)
+		return -EEXIST;
+
+	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
+				 &vduse_vdpa_config_ops, name, true);
+	if (!vdev)
+		return -ENOMEM;
+
+	dev->vdev = vdev;
+	vdev->dev = dev;
+	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
+	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
+	if (ret) {
+		put_device(&vdev->vdpa.dev);
+		return ret;
+	}
+	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
+	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
+	vdev->vdpa.mdev = &mgmt_dev;
+
+	return 0;
+}
+
+static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
+{
+	struct vduse_dev *dev;
+	int ret;
+
+	mutex_lock(&vduse_lock);
+	dev = vduse_find_dev(name);
+	if (!dev) {
+		mutex_unlock(&vduse_lock);
+		return -EINVAL;
+	}
+	ret = vduse_dev_init_vdpa(dev, name);
+	mutex_unlock(&vduse_lock);
+	if (ret)
+		return ret;
+
+	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
+	if (ret) {
+		put_device(&dev->vdev->vdpa.dev);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
+{
+	_vdpa_unregister_device(dev);
+}
+
+static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
+	.dev_add = vdpa_dev_add,
+	.dev_del = vdpa_dev_del,
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct vdpa_mgmt_dev mgmt_dev = {
+	.device = &vduse_mgmtdev,
+	.id_table = id_table,
+	.ops = &vdpa_dev_mgmtdev_ops,
+};
+
+static int vduse_mgmtdev_init(void)
+{
+	int ret;
+
+	ret = device_register(&vduse_mgmtdev);
+	if (ret)
+		return ret;
+
+	ret = vdpa_mgmtdev_register(&mgmt_dev);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	device_unregister(&vduse_mgmtdev);
+	return ret;
+}
+
+static void vduse_mgmtdev_exit(void)
+{
+	vdpa_mgmtdev_unregister(&mgmt_dev);
+	device_unregister(&vduse_mgmtdev);
+}
+
+static int vduse_init(void)
+{
+	int ret;
+	struct device *dev;
+
+	vduse_class = class_create(THIS_MODULE, "vduse");
+	if (IS_ERR(vduse_class))
+		return PTR_ERR(vduse_class);
+
+	vduse_class->devnode = vduse_devnode;
+
+	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
+	if (ret)
+		goto err_chardev_region;
+
+	/* /dev/vduse/control */
+	cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
+	vduse_ctrl_cdev.owner = THIS_MODULE;
+	ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
+	if (ret)
+		goto err_ctrl_cdev;
+
+	dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto err_device;
+	}
+
+	/* /dev/vduse/$DEVICE */
+	cdev_init(&vduse_cdev, &vduse_dev_fops);
+	vduse_cdev.owner = THIS_MODULE;
+	ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
+		       VDUSE_DEV_MAX - 1);
+	if (ret)
+		goto err_cdev;
+
+	vduse_irq_wq = alloc_workqueue("vduse-irq",
+				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
+	if (!vduse_irq_wq)
+		goto err_wq;
+
+	ret = vduse_domain_init();
+	if (ret)
+		goto err_domain;
+
+	ret = vduse_mgmtdev_init();
+	if (ret)
+		goto err_mgmtdev;
+
+	return 0;
+err_mgmtdev:
+	vduse_domain_exit();
+err_domain:
+	destroy_workqueue(vduse_irq_wq);
+err_wq:
+	cdev_del(&vduse_cdev);
+err_cdev:
+	device_destroy(vduse_class, vduse_major);
+err_device:
+	cdev_del(&vduse_ctrl_cdev);
+err_ctrl_cdev:
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+err_chardev_region:
+	class_destroy(vduse_class);
+	return ret;
+}
+module_init(vduse_init);
+
+static void vduse_exit(void)
+{
+	vduse_mgmtdev_exit();
+	vduse_domain_exit();
+	destroy_workqueue(vduse_irq_wq);
+	cdev_del(&vduse_cdev);
+	device_destroy(vduse_class, vduse_major);
+	cdev_del(&vduse_ctrl_cdev);
+	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
+	class_destroy(vduse_class);
+}
+module_exit(vduse_exit);
+
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
new file mode 100644
index 000000000000..f21b2e51b5c8
--- /dev/null
+++ b/include/uapi/linux/vduse.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_VDUSE_H_
+#define _UAPI_VDUSE_H_
+
+#include <linux/types.h>
+
+#define VDUSE_API_VERSION	0
+
+#define VDUSE_NAME_MAX	256
+
+/* the control messages definition for read/write */
+
+enum vduse_req_type {
+	/* Get the state for virtqueue from userspace */
+	VDUSE_GET_VQ_STATE,
+	/* Notify userspace to start the dataplane, no reply */
+	VDUSE_START_DATAPLANE,
+	/* Notify userspace to stop the dataplane, no reply */
+	VDUSE_STOP_DATAPLANE,
+	/* Notify userspace to update the memory mapping in device IOTLB */
+	VDUSE_UPDATE_IOTLB,
+};
+
+struct vduse_vq_state {
+	__u32 index; /* virtqueue index */
+	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
+};
+
+struct vduse_iova_range {
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* end of the IOVA range */
+};
+
+struct vduse_dev_request {
+	__u32 type; /* request type */
+	__u32 request_id; /* request id */
+#define VDUSE_REQ_FLAGS_NO_REPLY	(1 << 0) /* No need to reply */
+	__u32 flags; /* request flags */
+	__u32 reserved; /* for future use */
+	union {
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		struct vduse_iova_range iova; /* iova range for updating */
+		__u32 padding[16]; /* padding */
+	};
+};
+
+struct vduse_dev_response {
+	__u32 request_id; /* corresponding request id */
+#define VDUSE_REQ_RESULT_OK	0x00
+#define VDUSE_REQ_RESULT_FAILED	0x01
+	__u32 result; /* the result of request */
+	__u32 reserved[2]; /* for future use */
+	union {
+		struct vduse_vq_state vq_state; /* virtqueue state */
+		__u32 padding[16]; /* padding */
+	};
+};
+
+/* ioctls */
+
+struct vduse_dev_config {
+	char name[VDUSE_NAME_MAX]; /* vduse device name */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 features; /* device features */
+	__u64 bounce_size; /* bounce buffer size for iommu */
+	__u16 vq_size_max; /* the max size of virtqueue */
+	__u16 padding; /* padding */
+	__u32 vq_num; /* the number of virtqueues */
+	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
+	__u32 config_size; /* the size of the configuration space */
+	__u32 reserved[15]; /* for future use */
+	__u8 config[0]; /* the buffer of the configuration space */
+};
+
+struct vduse_iotlb_entry {
+	__u64 offset; /* the mmap offset on fd */
+	__u64 start; /* start of the IOVA range */
+	__u64 last; /* last of the IOVA range */
+#define VDUSE_ACCESS_RO 0x1
+#define VDUSE_ACCESS_WO 0x2
+#define VDUSE_ACCESS_RW 0x3
+	__u8 perm; /* access permission of this range */
+};
+
+struct vduse_config_update {
+	__u32 offset; /* offset from the beginning of configuration space */
+	__u32 length; /* the length to write to configuration space */
+	__u8 buffer[0]; /* buffer used to write from */
+};
+
+struct vduse_vq_info {
+	__u32 index; /* virtqueue index */
+	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
+	__u64 desc_addr; /* address of desc area */
+	__u64 driver_addr; /* address of driver area */
+	__u64 device_addr; /* address of device area */
+	__u32 num; /* the size of virtqueue */
+	__u8 ready; /* ready status of virtqueue */
+};
+
+struct vduse_vq_eventfd {
+	__u32 index; /* virtqueue index */
+#define VDUSE_EVENTFD_DEASSIGN -1
+	int fd; /* eventfd, -1 means de-assigning the eventfd */
+};
+
+#define VDUSE_BASE	0x81
+
+/* Get the version of VDUSE API. This is used for future extension */
+#define VDUSE_GET_API_VERSION	_IOR(VDUSE_BASE, 0x00, __u64)
+
+/* Set the version of VDUSE API. */
+#define VDUSE_SET_API_VERSION	_IOW(VDUSE_BASE, 0x01, __u64)
+
+/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
+#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
+
+/* Destroy a vduse device. Make sure there are no references to the char device */
+#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
+
+/*
+ * Get a file descriptor for the first overlapped iova region,
+ * -EINVAL means the iova region doesn't exist.
+ */
+#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
+
+/* Get the negotiated features */
+#define VDUSE_DEV_GET_FEATURES	_IOR(VDUSE_BASE, 0x05, __u64)
+
+/* Update the configuration space */
+#define VDUSE_DEV_UPDATE_CONFIG	_IOW(VDUSE_BASE, 0x06, struct vduse_config_update)
+
+/* Get the specified virtqueue's information */
+#define VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x07, struct vduse_vq_info)
+
+/* Setup an eventfd to receive kick for virtqueue */
+#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x08, struct vduse_vq_eventfd)
+
+/* Inject an interrupt for specific virtqueue */
+#define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x09, __u32)
+
+#endif /* _UAPI_VDUSE_H_ */
-- 
2.11.0


_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 10/10] Documentation: Add documentation for VDUSE
  2021-06-15 14:13 ` Xie Yongji
@ 2021-06-15 14:13   ` Xie Yongji
  -1 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++
 2 files changed, 223 insertions(+)
 create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
    iommu
    media/index
    sysfs-platform_profile
+   vduse
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
new file mode 100644
index 000000000000..2f9cd1a4e530
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,222 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace. And
+to make it simple, the emulated vDPA device's control path is handled in
+the kernel and only the data path is implemented in the userspace.
+
+Note that only virtio block device is supported by VDUSE framework now,
+which can reduce security risks when the userspace process that implements
+the data path is run by an unprivileged user. The Support for other device
+types can be added after the security issue is clarified or fixed in the future.
+
+Start/Stop VDUSE devices
+------------------------
+
+VDUSE devices are started as follows:
+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to vDPA bus.
+
+3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to vDPA bus.
+
+VDUSE devices are stopped as follows:
+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control
+
+The netlink messages metioned above can be sent via vdpa tool in iproute2
+or use the below sample codes:
+
+.. code-block:: c
+
+	static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+	{
+		struct nl_sock *nlsock;
+		struct nl_msg *msg;
+		int famid;
+
+		nlsock = nl_socket_alloc();
+		if (!nlsock)
+			return -ENOMEM;
+
+		if (genl_connect(nlsock))
+			goto free_sock;
+
+		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+		if (famid < 0)
+			goto close_sock;
+
+		msg = nlmsg_alloc();
+		if (!msg)
+			goto close_sock;
+
+		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0))
+			goto nla_put_failure;
+
+		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+		if (cmd == VDPA_CMD_DEV_NEW)
+			NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
+
+		if (nl_send_sync(nlsock, msg))
+			goto close_sock;
+
+		nl_close(nlsock);
+		nl_socket_free(nlsock);
+
+		return 0;
+	nla_put_failure:
+		nlmsg_free(msg);
+	close_sock:
+		nl_close(nlsock);
+	free_sock:
+		nl_socket_free(nlsock);
+		return -1;
+	}
+
+How VDUSE works
+---------------
+
+Since the emuldated vDPA device's control path is handled in the kernel,
+a message-based communication protocol and few types of control messages
+are introduced by VDUSE framework to make userspace be aware of the data
+path related changes:
+
+- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
+
+- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
+
+- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
+
+Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply
+those control messages from/to VDUSE kernel module as follows:
+
+.. code-block:: c
+
+	static int vduse_message_handler(int dev_fd)
+	{
+		int len;
+		struct vduse_dev_request req;
+		struct vduse_dev_response resp;
+
+		len = read(dev_fd, &req, sizeof(req));
+		if (len != sizeof(req))
+			return -1;
+
+		resp.request_id = req.request_id;
+
+		switch (req.type) {
+
+		/* handle different types of message */
+
+		}
+
+		if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
+			return 0;
+
+		len = write(dev_fd, &resp, sizeof(resp));
+		if (len != sizeof(resp))
+			return -1;
+
+		return 0;
+	}
+
+After VDUSE_START_DATAPLANE messages is received, userspace should start the
+dataplane processing with the help of some ioctls on /dev/vduse/$NAME:
+
+- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
+  Userspace can access this iova region by passing fd and corresponding size, offset,
+  perm to mmap(). For example:
+
+.. code-block:: c
+
+	static int perm_to_prot(uint8_t perm)
+	{
+		int prot = 0;
+
+		switch (perm) {
+		case VDUSE_ACCESS_WO:
+			prot |= PROT_WRITE;
+			break;
+		case VDUSE_ACCESS_RO:
+			prot |= PROT_READ;
+			break;
+		case VDUSE_ACCESS_RW:
+			prot |= PROT_READ | PROT_WRITE;
+			break;
+		}
+
+		return prot;
+	}
+
+	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
+	{
+		int fd;
+		void *addr;
+		size_t size;
+		struct vduse_iotlb_entry entry;
+
+		entry.start = iova;
+		entry.last = iova + 1;
+		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
+		if (fd < 0)
+			return NULL;
+
+		size = entry.last - entry.start + 1;
+		*len = entry.last - iova + 1;
+		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
+			    fd, entry.offset);
+		close(fd);
+		if (addr == MAP_FAILED)
+			return NULL;
+
+		/* do something to cache this iova region */
+
+		return addr + iova - entry.start;
+	}
+
+- VDUSE_DEV_GET_FEATURES: Get the negotiated features
+
+- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
+
+- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE kernel module to notify userspace to consume the vring.
+
+- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
+
+MMU-based IOMMU Driver
+----------------------
+
+VDUSE framework implements an MMU-based on-chip IOMMU driver to support
+mapping the kernel DMA buffer into the userspace iova region dynamically.
+This is mainly designed for virtio-vdpa case (kernel virtio drivers).
+
+The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
+The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
+so that the userspace process is able to use its virtual address to access
+the DMA buffer in kernel.
+
+And to avoid security issue, a bounce-buffering mechanism is introduced to
+prevent userspace accessing the original buffer directly which may contain other
+kernel data. During the mapping, unmapping, the driver will copy the data from
+the original buffer to the bounce buffer and back, depending on the direction of
+the transfer. And the bounce-buffer addresses will be mapped into the user address
+space instead of the original one.
-- 
2.11.0


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH v8 10/10] Documentation: Add documentation for VDUSE
@ 2021-06-15 14:13   ` Xie Yongji
  0 siblings, 0 replies; 193+ messages in thread
From: Xie Yongji @ 2021-06-15 14:13 UTC (permalink / raw)
  To: mst, jasowang, stefanha, sgarzare, parav, hch, christian.brauner,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++
 2 files changed, 223 insertions(+)
 create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
    iommu
    media/index
    sysfs-platform_profile
+   vduse
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
new file mode 100644
index 000000000000..2f9cd1a4e530
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,222 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace. And
+to make it simple, the emulated vDPA device's control path is handled in
+the kernel and only the data path is implemented in the userspace.
+
+Note that only virtio block device is supported by VDUSE framework now,
+which can reduce security risks when the userspace process that implements
+the data path is run by an unprivileged user. The Support for other device
+types can be added after the security issue is clarified or fixed in the future.
+
+Start/Stop VDUSE devices
+------------------------
+
+VDUSE devices are started as follows:
+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to vDPA bus.
+
+3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to vDPA bus.
+
+VDUSE devices are stopped as follows:
+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control
+
+The netlink messages metioned above can be sent via vdpa tool in iproute2
+or use the below sample codes:
+
+.. code-block:: c
+
+	static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+	{
+		struct nl_sock *nlsock;
+		struct nl_msg *msg;
+		int famid;
+
+		nlsock = nl_socket_alloc();
+		if (!nlsock)
+			return -ENOMEM;
+
+		if (genl_connect(nlsock))
+			goto free_sock;
+
+		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+		if (famid < 0)
+			goto close_sock;
+
+		msg = nlmsg_alloc();
+		if (!msg)
+			goto close_sock;
+
+		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0))
+			goto nla_put_failure;
+
+		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+		if (cmd == VDPA_CMD_DEV_NEW)
+			NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
+
+		if (nl_send_sync(nlsock, msg))
+			goto close_sock;
+
+		nl_close(nlsock);
+		nl_socket_free(nlsock);
+
+		return 0;
+	nla_put_failure:
+		nlmsg_free(msg);
+	close_sock:
+		nl_close(nlsock);
+	free_sock:
+		nl_socket_free(nlsock);
+		return -1;
+	}
+
+How VDUSE works
+---------------
+
+Since the emuldated vDPA device's control path is handled in the kernel,
+a message-based communication protocol and few types of control messages
+are introduced by VDUSE framework to make userspace be aware of the data
+path related changes:
+
+- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
+
+- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
+
+- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
+
+Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply
+those control messages from/to VDUSE kernel module as follows:
+
+.. code-block:: c
+
+	static int vduse_message_handler(int dev_fd)
+	{
+		int len;
+		struct vduse_dev_request req;
+		struct vduse_dev_response resp;
+
+		len = read(dev_fd, &req, sizeof(req));
+		if (len != sizeof(req))
+			return -1;
+
+		resp.request_id = req.request_id;
+
+		switch (req.type) {
+
+		/* handle different types of message */
+
+		}
+
+		if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
+			return 0;
+
+		len = write(dev_fd, &resp, sizeof(resp));
+		if (len != sizeof(resp))
+			return -1;
+
+		return 0;
+	}
+
+After VDUSE_START_DATAPLANE messages is received, userspace should start the
+dataplane processing with the help of some ioctls on /dev/vduse/$NAME:
+
+- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
+  Userspace can access this iova region by passing fd and corresponding size, offset,
+  perm to mmap(). For example:
+
+.. code-block:: c
+
+	static int perm_to_prot(uint8_t perm)
+	{
+		int prot = 0;
+
+		switch (perm) {
+		case VDUSE_ACCESS_WO:
+			prot |= PROT_WRITE;
+			break;
+		case VDUSE_ACCESS_RO:
+			prot |= PROT_READ;
+			break;
+		case VDUSE_ACCESS_RW:
+			prot |= PROT_READ | PROT_WRITE;
+			break;
+		}
+
+		return prot;
+	}
+
+	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
+	{
+		int fd;
+		void *addr;
+		size_t size;
+		struct vduse_iotlb_entry entry;
+
+		entry.start = iova;
+		entry.last = iova + 1;
+		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
+		if (fd < 0)
+			return NULL;
+
+		size = entry.last - entry.start + 1;
+		*len = entry.last - iova + 1;
+		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
+			    fd, entry.offset);
+		close(fd);
+		if (addr == MAP_FAILED)
+			return NULL;
+
+		/* do something to cache this iova region */
+
+		return addr + iova - entry.start;
+	}
+
+- VDUSE_DEV_GET_FEATURES: Get the negotiated features
+
+- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
+
+- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE kernel module to notify userspace to consume the vring.
+
+- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
+
+MMU-based IOMMU Driver
+----------------------
+
+VDUSE framework implements an MMU-based on-chip IOMMU driver to support
+mapping the kernel DMA buffer into the userspace iova region dynamically.
+This is mainly designed for virtio-vdpa case (kernel virtio drivers).
+
+The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
+The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
+so that the userspace process is able to use its virtual address to access
+the DMA buffer in kernel.
+
+And to avoid security issue, a bounce-buffering mechanism is introduced to
+prevent userspace accessing the original buffer directly which may contain other
+kernel data. During the mapping, unmapping, the driver will copy the data from
+the original buffer to the bounce buffer and back, depending on the direction of
+the transfer. And the bounce-buffer addresses will be mapped into the user address
+space instead of the original one.
-- 
2.11.0

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
  2021-06-15 14:13   ` Xie Yongji
  (?)
@ 2021-06-17  8:33     ` He Zhe
  -1 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-17  8:33 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel, He Zhe



On 6/15/21 10:13 PM, Xie Yongji wrote:
> Increase the recursion depth of eventfd_signal() to 1. This
> is the maximum recursion depth we have found so far, which
> can be triggered with the following call chain:
>
>     kvm_io_bus_write                        [kvm]
>       --> ioeventfd_write                   [kvm]
>         --> eventfd_signal                  [eventfd]
>           --> vhost_poll_wakeup             [vhost]
>             --> vduse_vdpa_kick_vq          [vduse]
>               --> eventfd_signal            [eventfd]
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> Acked-by: Jason Wang <jasowang@redhat.com>

The fix had been posted one year ago.

https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/


> ---
>  fs/eventfd.c            | 2 +-
>  include/linux/eventfd.h | 5 ++++-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..cc7cd1dbedd3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	 * it returns true, the eventfd_signal() call should be deferred to a
>  	 * safe context.
>  	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>  		return 0;
>  
>  	spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..886d99cd38ef 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* Maximum recursion depth */
> +#define EFD_WAKE_DEPTH 1
> +
>  struct eventfd_ctx;
>  struct file;
>  
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>  
>  static inline bool eventfd_signal_count(void)
>  {
> -	return this_cpu_read(eventfd_wake_count);
> +	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;

count is just count. How deep is acceptable should be put
where eventfd_signal_count is called.


Zhe

>  }
>  
>  #else /* CONFIG_EVENTFD */


^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
@ 2021-06-17  8:33     ` He Zhe
  0 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-17  8:33 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: He Zhe, kvm, netdev, linux-kernel, virtualization, iommu,
	songmuchun, linux-fsdevel



On 6/15/21 10:13 PM, Xie Yongji wrote:
> Increase the recursion depth of eventfd_signal() to 1. This
> is the maximum recursion depth we have found so far, which
> can be triggered with the following call chain:
>
>     kvm_io_bus_write                        [kvm]
>       --> ioeventfd_write                   [kvm]
>         --> eventfd_signal                  [eventfd]
>           --> vhost_poll_wakeup             [vhost]
>             --> vduse_vdpa_kick_vq          [vduse]
>               --> eventfd_signal            [eventfd]
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> Acked-by: Jason Wang <jasowang@redhat.com>

The fix had been posted one year ago.

https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/


> ---
>  fs/eventfd.c            | 2 +-
>  include/linux/eventfd.h | 5 ++++-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..cc7cd1dbedd3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	 * it returns true, the eventfd_signal() call should be deferred to a
>  	 * safe context.
>  	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>  		return 0;
>  
>  	spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..886d99cd38ef 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* Maximum recursion depth */
> +#define EFD_WAKE_DEPTH 1
> +
>  struct eventfd_ctx;
>  struct file;
>  
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>  
>  static inline bool eventfd_signal_count(void)
>  {
> -	return this_cpu_read(eventfd_wake_count);
> +	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;

count is just count. How deep is acceptable should be put
where eventfd_signal_count is called.


Zhe

>  }
>  
>  #else /* CONFIG_EVENTFD */

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
@ 2021-06-17  8:33     ` He Zhe
  0 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-17  8:33 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: He Zhe, kvm, netdev, linux-kernel, virtualization, iommu,
	songmuchun, linux-fsdevel



On 6/15/21 10:13 PM, Xie Yongji wrote:
> Increase the recursion depth of eventfd_signal() to 1. This
> is the maximum recursion depth we have found so far, which
> can be triggered with the following call chain:
>
>     kvm_io_bus_write                        [kvm]
>       --> ioeventfd_write                   [kvm]
>         --> eventfd_signal                  [eventfd]
>           --> vhost_poll_wakeup             [vhost]
>             --> vduse_vdpa_kick_vq          [vduse]
>               --> eventfd_signal            [eventfd]
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> Acked-by: Jason Wang <jasowang@redhat.com>

The fix had been posted one year ago.

https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/


> ---
>  fs/eventfd.c            | 2 +-
>  include/linux/eventfd.h | 5 ++++-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..cc7cd1dbedd3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	 * it returns true, the eventfd_signal() call should be deferred to a
>  	 * safe context.
>  	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>  		return 0;
>  
>  	spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..886d99cd38ef 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* Maximum recursion depth */
> +#define EFD_WAKE_DEPTH 1
> +
>  struct eventfd_ctx;
>  struct file;
>  
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>  
>  static inline bool eventfd_signal_count(void)
>  {
> -	return this_cpu_read(eventfd_wake_count);
> +	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;

count is just count. How deep is acceptable should be put
where eventfd_signal_count is called.


Zhe

>  }
>  
>  #else /* CONFIG_EVENTFD */

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
  2021-06-17  8:33     ` He Zhe
@ 2021-06-18  3:29       ` Yongji Xie
  -1 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-18  3:29 UTC (permalink / raw)
  To: He Zhe
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Stefano Garzarella, Parav Pandit, Christoph Hellwig,
	Christian Brauner, Randy Dunlap, Matthew Wilcox, Al Viro,
	Jens Axboe, bcrl, Jonathan Corbet, Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel, qiang.zhang

On Thu, Jun 17, 2021 at 4:34 PM He Zhe <zhe.he@windriver.com> wrote:
>
>
>
> On 6/15/21 10:13 PM, Xie Yongji wrote:
> > Increase the recursion depth of eventfd_signal() to 1. This
> > is the maximum recursion depth we have found so far, which
> > can be triggered with the following call chain:
> >
> >     kvm_io_bus_write                        [kvm]
> >       --> ioeventfd_write                   [kvm]
> >         --> eventfd_signal                  [eventfd]
> >           --> vhost_poll_wakeup             [vhost]
> >             --> vduse_vdpa_kick_vq          [vduse]
> >               --> eventfd_signal            [eventfd]
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > Acked-by: Jason Wang <jasowang@redhat.com>
>
> The fix had been posted one year ago.
>
> https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/
>

OK, so it seems to be a fix for the RT system if my understanding is
correct? Any reason why it's not merged? I'm happy to rebase my series
on your patch if you'd like to repost it.

BTW, I also notice another thread for this issue:

https://lore.kernel.org/linux-fsdevel/DM6PR11MB420291B550A10853403C7592FF349@DM6PR11MB4202.namprd11.prod.outlook.com/T/

>
> > ---
> >  fs/eventfd.c            | 2 +-
> >  include/linux/eventfd.h | 5 ++++-
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index e265b6dd4f34..cc7cd1dbedd3 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
> >        * it returns true, the eventfd_signal() call should be deferred to a
> >        * safe context.
> >        */
> > -     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> > +     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
> >               return 0;
> >
> >       spin_lock_irqsave(&ctx->wqh.lock, flags);
> > diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> > index fa0a524baed0..886d99cd38ef 100644
> > --- a/include/linux/eventfd.h
> > +++ b/include/linux/eventfd.h
> > @@ -29,6 +29,9 @@
> >  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> >  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> >
> > +/* Maximum recursion depth */
> > +#define EFD_WAKE_DEPTH 1
> > +
> >  struct eventfd_ctx;
> >  struct file;
> >
> > @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
> >
> >  static inline bool eventfd_signal_count(void)
> >  {
> > -     return this_cpu_read(eventfd_wake_count);
> > +     return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>
> count is just count. How deep is acceptable should be put
> where eventfd_signal_count is called.
>

The return value of this function is boolean rather than integer.
Please see the comments in eventfd_signal():

"then it should check eventfd_signal_count() before calling this
function. If it returns true, the eventfd_signal() call should be
deferred to a safe context."

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
@ 2021-06-18  3:29       ` Yongji Xie
  0 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-18  3:29 UTC (permalink / raw)
  To: He Zhe
  Cc: kvm, Michael S. Tsirkin, Jason Wang, virtualization,
	Christian Brauner, qiang.zhang, Jonathan Corbet, Matthew Wilcox,
	Christoph Hellwig, Dan Carpenter, Stefano Garzarella, Al Viro,
	Stefan Hajnoczi, songmuchun, Jens Axboe, Greg KH, Randy Dunlap,
	linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä

On Thu, Jun 17, 2021 at 4:34 PM He Zhe <zhe.he@windriver.com> wrote:
>
>
>
> On 6/15/21 10:13 PM, Xie Yongji wrote:
> > Increase the recursion depth of eventfd_signal() to 1. This
> > is the maximum recursion depth we have found so far, which
> > can be triggered with the following call chain:
> >
> >     kvm_io_bus_write                        [kvm]
> >       --> ioeventfd_write                   [kvm]
> >         --> eventfd_signal                  [eventfd]
> >           --> vhost_poll_wakeup             [vhost]
> >             --> vduse_vdpa_kick_vq          [vduse]
> >               --> eventfd_signal            [eventfd]
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > Acked-by: Jason Wang <jasowang@redhat.com>
>
> The fix had been posted one year ago.
>
> https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/
>

OK, so it seems to be a fix for the RT system if my understanding is
correct? Any reason why it's not merged? I'm happy to rebase my series
on your patch if you'd like to repost it.

BTW, I also notice another thread for this issue:

https://lore.kernel.org/linux-fsdevel/DM6PR11MB420291B550A10853403C7592FF349@DM6PR11MB4202.namprd11.prod.outlook.com/T/

>
> > ---
> >  fs/eventfd.c            | 2 +-
> >  include/linux/eventfd.h | 5 ++++-
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index e265b6dd4f34..cc7cd1dbedd3 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
> >        * it returns true, the eventfd_signal() call should be deferred to a
> >        * safe context.
> >        */
> > -     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> > +     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
> >               return 0;
> >
> >       spin_lock_irqsave(&ctx->wqh.lock, flags);
> > diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> > index fa0a524baed0..886d99cd38ef 100644
> > --- a/include/linux/eventfd.h
> > +++ b/include/linux/eventfd.h
> > @@ -29,6 +29,9 @@
> >  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> >  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> >
> > +/* Maximum recursion depth */
> > +#define EFD_WAKE_DEPTH 1
> > +
> >  struct eventfd_ctx;
> >  struct file;
> >
> > @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
> >
> >  static inline bool eventfd_signal_count(void)
> >  {
> > -     return this_cpu_read(eventfd_wake_count);
> > +     return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>
> count is just count. How deep is acceptable should be put
> where eventfd_signal_count is called.
>

The return value of this function is boolean rather than integer.
Please see the comments in eventfd_signal():

"then it should check eventfd_signal_count() before calling this
function. If it returns true, the eventfd_signal() call should be
deferred to a safe context."

Thanks,
Yongji
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
  2021-06-18  3:29       ` Yongji Xie
  (?)
@ 2021-06-18  8:41         ` He Zhe
  -1 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-18  8:41 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
	Stefano Garzarella, Parav Pandit, Christoph Hellwig,
	Christian Brauner, Randy Dunlap, Matthew Wilcox, Al Viro,
	Jens Axboe, bcrl, Jonathan Corbet, Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel, qiang.zhang



On 6/18/21 11:29 AM, Yongji Xie wrote:
> On Thu, Jun 17, 2021 at 4:34 PM He Zhe <zhe.he@windriver.com> wrote:
>>
>>
>> On 6/15/21 10:13 PM, Xie Yongji wrote:
>>> Increase the recursion depth of eventfd_signal() to 1. This
>>> is the maximum recursion depth we have found so far, which
>>> can be triggered with the following call chain:
>>>
>>>     kvm_io_bus_write                        [kvm]
>>>       --> ioeventfd_write                   [kvm]
>>>         --> eventfd_signal                  [eventfd]
>>>           --> vhost_poll_wakeup             [vhost]
>>>             --> vduse_vdpa_kick_vq          [vduse]
>>>               --> eventfd_signal            [eventfd]
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> Acked-by: Jason Wang <jasowang@redhat.com>
>> The fix had been posted one year ago.
>>
>> https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/
>>
> OK, so it seems to be a fix for the RT system if my understanding is
> correct? Any reason why it's not merged? I'm happy to rebase my series
> on your patch if you'd like to repost it.

It works for both mainline and RT kernel. The folks just reproduced in their RT
environments.

This patch somehow hasn't got maintainer's reply, so not merged yet.

And OK, I'll resend the patch.

>
> BTW, I also notice another thread for this issue:
>
> https://lore.kernel.org/linux-fsdevel/DM6PR11MB420291B550A10853403C7592FF349@DM6PR11MB4202.namprd11.prod.outlook.com/T/

This is the same way as my v1

https://lore.kernel.org/lkml/3b4aa4cb-0e76-89c2-c48a-cf24e1a36bc2@kernel.dk/

which was not what the maintainer wanted.


>
>>> ---
>>>  fs/eventfd.c            | 2 +-
>>>  include/linux/eventfd.h | 5 ++++-
>>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>> index e265b6dd4f34..cc7cd1dbedd3 100644
>>> --- a/fs/eventfd.c
>>> +++ b/fs/eventfd.c
>>> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>>>        * it returns true, the eventfd_signal() call should be deferred to a
>>>        * safe context.
>>>        */
>>> -     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
>>> +     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>>>               return 0;
>>>
>>>       spin_lock_irqsave(&ctx->wqh.lock, flags);
>>> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
>>> index fa0a524baed0..886d99cd38ef 100644
>>> --- a/include/linux/eventfd.h
>>> +++ b/include/linux/eventfd.h
>>> @@ -29,6 +29,9 @@
>>>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>>>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>>>
>>> +/* Maximum recursion depth */
>>> +#define EFD_WAKE_DEPTH 1
>>> +
>>>  struct eventfd_ctx;
>>>  struct file;
>>>
>>> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>>>
>>>  static inline bool eventfd_signal_count(void)
>>>  {
>>> -     return this_cpu_read(eventfd_wake_count);
>>> +     return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>> count is just count. How deep is acceptable should be put
>> where eventfd_signal_count is called.
>>
> The return value of this function is boolean rather than integer.
> Please see the comments in eventfd_signal():
>
> "then it should check eventfd_signal_count() before calling this
> function. If it returns true, the eventfd_signal() call should be
> deferred to a safe context."

OK. Now that the maintainer comments as such we can use it accordingly,
though I still got the feeling that the function name and the type of the return
value don't match.


Thanks,
Zhe

>
> Thanks,
> Yongji


^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
@ 2021-06-18  8:41         ` He Zhe
  0 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-18  8:41 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, Jason Wang, virtualization,
	Christian Brauner, qiang.zhang, Jonathan Corbet, Matthew Wilcox,
	Christoph Hellwig, Dan Carpenter, Stefano Garzarella, Al Viro,
	Stefan Hajnoczi, songmuchun, Jens Axboe, Greg KH, Randy Dunlap,
	linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä



On 6/18/21 11:29 AM, Yongji Xie wrote:
> On Thu, Jun 17, 2021 at 4:34 PM He Zhe <zhe.he@windriver.com> wrote:
>>
>>
>> On 6/15/21 10:13 PM, Xie Yongji wrote:
>>> Increase the recursion depth of eventfd_signal() to 1. This
>>> is the maximum recursion depth we have found so far, which
>>> can be triggered with the following call chain:
>>>
>>>     kvm_io_bus_write                        [kvm]
>>>       --> ioeventfd_write                   [kvm]
>>>         --> eventfd_signal                  [eventfd]
>>>           --> vhost_poll_wakeup             [vhost]
>>>             --> vduse_vdpa_kick_vq          [vduse]
>>>               --> eventfd_signal            [eventfd]
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> Acked-by: Jason Wang <jasowang@redhat.com>
>> The fix had been posted one year ago.
>>
>> https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/
>>
> OK, so it seems to be a fix for the RT system if my understanding is
> correct? Any reason why it's not merged? I'm happy to rebase my series
> on your patch if you'd like to repost it.

It works for both mainline and RT kernel. The folks just reproduced in their RT
environments.

This patch somehow hasn't got maintainer's reply, so not merged yet.

And OK, I'll resend the patch.

>
> BTW, I also notice another thread for this issue:
>
> https://lore.kernel.org/linux-fsdevel/DM6PR11MB420291B550A10853403C7592FF349@DM6PR11MB4202.namprd11.prod.outlook.com/T/

This is the same way as my v1

https://lore.kernel.org/lkml/3b4aa4cb-0e76-89c2-c48a-cf24e1a36bc2@kernel.dk/

which was not what the maintainer wanted.


>
>>> ---
>>>  fs/eventfd.c            | 2 +-
>>>  include/linux/eventfd.h | 5 ++++-
>>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>> index e265b6dd4f34..cc7cd1dbedd3 100644
>>> --- a/fs/eventfd.c
>>> +++ b/fs/eventfd.c
>>> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>>>        * it returns true, the eventfd_signal() call should be deferred to a
>>>        * safe context.
>>>        */
>>> -     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
>>> +     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>>>               return 0;
>>>
>>>       spin_lock_irqsave(&ctx->wqh.lock, flags);
>>> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
>>> index fa0a524baed0..886d99cd38ef 100644
>>> --- a/include/linux/eventfd.h
>>> +++ b/include/linux/eventfd.h
>>> @@ -29,6 +29,9 @@
>>>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>>>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>>>
>>> +/* Maximum recursion depth */
>>> +#define EFD_WAKE_DEPTH 1
>>> +
>>>  struct eventfd_ctx;
>>>  struct file;
>>>
>>> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>>>
>>>  static inline bool eventfd_signal_count(void)
>>>  {
>>> -     return this_cpu_read(eventfd_wake_count);
>>> +     return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>> count is just count. How deep is acceptable should be put
>> where eventfd_signal_count is called.
>>
> The return value of this function is boolean rather than integer.
> Please see the comments in eventfd_signal():
>
> "then it should check eventfd_signal_count() before calling this
> function. If it returns true, the eventfd_signal() call should be
> deferred to a safe context."

OK. Now that the maintainer comments as such we can use it accordingly,
though I still got the feeling that the function name and the type of the return
value don't match.


Thanks,
Zhe

>
> Thanks,
> Yongji

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
@ 2021-06-18  8:41         ` He Zhe
  0 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-18  8:41 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	qiang.zhang, Jonathan Corbet, joro, Matthew Wilcox,
	Christoph Hellwig, Dan Carpenter, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä



On 6/18/21 11:29 AM, Yongji Xie wrote:
> On Thu, Jun 17, 2021 at 4:34 PM He Zhe <zhe.he@windriver.com> wrote:
>>
>>
>> On 6/15/21 10:13 PM, Xie Yongji wrote:
>>> Increase the recursion depth of eventfd_signal() to 1. This
>>> is the maximum recursion depth we have found so far, which
>>> can be triggered with the following call chain:
>>>
>>>     kvm_io_bus_write                        [kvm]
>>>       --> ioeventfd_write                   [kvm]
>>>         --> eventfd_signal                  [eventfd]
>>>           --> vhost_poll_wakeup             [vhost]
>>>             --> vduse_vdpa_kick_vq          [vduse]
>>>               --> eventfd_signal            [eventfd]
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> Acked-by: Jason Wang <jasowang@redhat.com>
>> The fix had been posted one year ago.
>>
>> https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/
>>
> OK, so it seems to be a fix for the RT system if my understanding is
> correct? Any reason why it's not merged? I'm happy to rebase my series
> on your patch if you'd like to repost it.

It works for both mainline and RT kernel. The folks just reproduced in their RT
environments.

This patch somehow hasn't got maintainer's reply, so not merged yet.

And OK, I'll resend the patch.

>
> BTW, I also notice another thread for this issue:
>
> https://lore.kernel.org/linux-fsdevel/DM6PR11MB420291B550A10853403C7592FF349@DM6PR11MB4202.namprd11.prod.outlook.com/T/

This is the same way as my v1

https://lore.kernel.org/lkml/3b4aa4cb-0e76-89c2-c48a-cf24e1a36bc2@kernel.dk/

which was not what the maintainer wanted.


>
>>> ---
>>>  fs/eventfd.c            | 2 +-
>>>  include/linux/eventfd.h | 5 ++++-
>>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>> index e265b6dd4f34..cc7cd1dbedd3 100644
>>> --- a/fs/eventfd.c
>>> +++ b/fs/eventfd.c
>>> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>>>        * it returns true, the eventfd_signal() call should be deferred to a
>>>        * safe context.
>>>        */
>>> -     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
>>> +     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>>>               return 0;
>>>
>>>       spin_lock_irqsave(&ctx->wqh.lock, flags);
>>> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
>>> index fa0a524baed0..886d99cd38ef 100644
>>> --- a/include/linux/eventfd.h
>>> +++ b/include/linux/eventfd.h
>>> @@ -29,6 +29,9 @@
>>>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>>>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>>>
>>> +/* Maximum recursion depth */
>>> +#define EFD_WAKE_DEPTH 1
>>> +
>>>  struct eventfd_ctx;
>>>  struct file;
>>>
>>> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>>>
>>>  static inline bool eventfd_signal_count(void)
>>>  {
>>> -     return this_cpu_read(eventfd_wake_count);
>>> +     return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>> count is just count. How deep is acceptable should be put
>> where eventfd_signal_count is called.
>>
> The return value of this function is boolean rather than integer.
> Please see the comments in eventfd_signal():
>
> "then it should check eventfd_signal_count() before calling this
> function. If it returns true, the eventfd_signal() call should be
> deferred to a safe context."

OK. Now that the maintainer comments as such we can use it accordingly,
though I still got the feeling that the function name and the type of the return
value don't match.


Thanks,
Zhe

>
> Thanks,
> Yongji

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH] eventfd: Enlarge recursion limit to allow vhost to work
  2021-06-18  3:29       ` Yongji Xie
  (?)
@ 2021-06-18  8:44         ` He Zhe
  -1 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-18  8:44 UTC (permalink / raw)
  To: xieyongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, gregkh, songmuchun, virtualization,
	kvm, linux-fsdevel, iommu, linux-kernel, qiang.zhang, zhe.he

commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
introduces a percpu counter that tracks the percpu recursion depth and
warn if it greater than zero, to avoid potential deadlock and stack
overflow.

However sometimes different eventfds may be used in parallel. Specifically,
when heavy network load goes through kvm and vhost, working as below, it
would trigger the following call trace.

-  100.00%
   - 66.51%
        ret_from_fork
        kthread
      - vhost_worker
         - 33.47% handle_tx_kick
              handle_tx
              handle_tx_copy
              vhost_tx_batch.isra.0
              vhost_add_used_and_signal_n
              eventfd_signal
         - 33.05% handle_rx_net
              handle_rx
              vhost_add_used_and_signal_n
              eventfd_signal
   - 33.49%
        ioctl
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_ioctl
        ksys_ioctl
        do_vfs_ioctl
        kvm_vcpu_ioctl
        kvm_arch_vcpu_ioctl_run
        vmx_handle_exit
        handle_ept_misconfig
        kvm_io_bus_write
        __kvm_io_bus_write
        eventfd_signal

001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
---- snip ----
001: Call Trace:
001:  vhost_signal+0x15e/0x1b0 [vhost]
001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
001:  handle_rx+0xb9/0x900 [vhost_net]
001:  handle_rx_net+0x15/0x20 [vhost_net]
001:  vhost_worker+0xbe/0x120 [vhost]
001:  kthread+0x106/0x140
001:  ? log_used.part.0+0x20/0x20 [vhost]
001:  ? kthread_park+0x90/0x90
001:  ret_from_fork+0x35/0x40
001: ---[ end trace 0000000000000003 ]---

This patch enlarges the limit to 1 which is the maximum recursion depth we
have found so far.

The credit of modification for eventfd_signal_count goes to
Xie Yongji <xieyongji@bytedance.com>

Signed-off-by: He Zhe <zhe.he@windriver.com>
---
 fs/eventfd.c            | 3 ++-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..add6af91cacf 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
+	    EFD_WAKE_COUNT_MAX))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..74be152ebe87 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* This is the maximum recursion depth we find so far */
+#define EFD_WAKE_COUNT_MAX 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_COUNT_MAX;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH] eventfd: Enlarge recursion limit to allow vhost to work
@ 2021-06-18  8:44         ` He Zhe
  0 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-18  8:44 UTC (permalink / raw)
  To: xieyongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, gregkh, songmuchun, virtualization,
	kvm, linux-fsdevel, iommu, linux-kernel, qiang.zhang, zhe.he

commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
introduces a percpu counter that tracks the percpu recursion depth and
warn if it greater than zero, to avoid potential deadlock and stack
overflow.

However sometimes different eventfds may be used in parallel. Specifically,
when heavy network load goes through kvm and vhost, working as below, it
would trigger the following call trace.

-  100.00%
   - 66.51%
        ret_from_fork
        kthread
      - vhost_worker
         - 33.47% handle_tx_kick
              handle_tx
              handle_tx_copy
              vhost_tx_batch.isra.0
              vhost_add_used_and_signal_n
              eventfd_signal
         - 33.05% handle_rx_net
              handle_rx
              vhost_add_used_and_signal_n
              eventfd_signal
   - 33.49%
        ioctl
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_ioctl
        ksys_ioctl
        do_vfs_ioctl
        kvm_vcpu_ioctl
        kvm_arch_vcpu_ioctl_run
        vmx_handle_exit
        handle_ept_misconfig
        kvm_io_bus_write
        __kvm_io_bus_write
        eventfd_signal

001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
---- snip ----
001: Call Trace:
001:  vhost_signal+0x15e/0x1b0 [vhost]
001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
001:  handle_rx+0xb9/0x900 [vhost_net]
001:  handle_rx_net+0x15/0x20 [vhost_net]
001:  vhost_worker+0xbe/0x120 [vhost]
001:  kthread+0x106/0x140
001:  ? log_used.part.0+0x20/0x20 [vhost]
001:  ? kthread_park+0x90/0x90
001:  ret_from_fork+0x35/0x40
001: ---[ end trace 0000000000000003 ]---

This patch enlarges the limit to 1 which is the maximum recursion depth we
have found so far.

The credit of modification for eventfd_signal_count goes to
Xie Yongji <xieyongji@bytedance.com>

Signed-off-by: He Zhe <zhe.he@windriver.com>
---
 fs/eventfd.c            | 3 ++-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..add6af91cacf 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
+	    EFD_WAKE_COUNT_MAX))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..74be152ebe87 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* This is the maximum recursion depth we find so far */
+#define EFD_WAKE_COUNT_MAX 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_COUNT_MAX;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* [PATCH] eventfd: Enlarge recursion limit to allow vhost to work
@ 2021-06-18  8:44         ` He Zhe
  0 siblings, 0 replies; 193+ messages in thread
From: He Zhe @ 2021-06-18  8:44 UTC (permalink / raw)
  To: xieyongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, gregkh, songmuchun, virtualization,
	kvm, linux-fsdevel, iommu, linux-kernel, qiang.zhang, zhe.he

commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
introduces a percpu counter that tracks the percpu recursion depth and
warn if it greater than zero, to avoid potential deadlock and stack
overflow.

However sometimes different eventfds may be used in parallel. Specifically,
when heavy network load goes through kvm and vhost, working as below, it
would trigger the following call trace.

-  100.00%
   - 66.51%
        ret_from_fork
        kthread
      - vhost_worker
         - 33.47% handle_tx_kick
              handle_tx
              handle_tx_copy
              vhost_tx_batch.isra.0
              vhost_add_used_and_signal_n
              eventfd_signal
         - 33.05% handle_rx_net
              handle_rx
              vhost_add_used_and_signal_n
              eventfd_signal
   - 33.49%
        ioctl
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_ioctl
        ksys_ioctl
        do_vfs_ioctl
        kvm_vcpu_ioctl
        kvm_arch_vcpu_ioctl_run
        vmx_handle_exit
        handle_ept_misconfig
        kvm_io_bus_write
        __kvm_io_bus_write
        eventfd_signal

001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
---- snip ----
001: Call Trace:
001:  vhost_signal+0x15e/0x1b0 [vhost]
001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
001:  handle_rx+0xb9/0x900 [vhost_net]
001:  handle_rx_net+0x15/0x20 [vhost_net]
001:  vhost_worker+0xbe/0x120 [vhost]
001:  kthread+0x106/0x140
001:  ? log_used.part.0+0x20/0x20 [vhost]
001:  ? kthread_park+0x90/0x90
001:  ret_from_fork+0x35/0x40
001: ---[ end trace 0000000000000003 ]---

This patch enlarges the limit to 1 which is the maximum recursion depth we
have found so far.

The credit of modification for eventfd_signal_count goes to
Xie Yongji <xieyongji@bytedance.com>

Signed-off-by: He Zhe <zhe.he@windriver.com>
---
 fs/eventfd.c            | 3 ++-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..add6af91cacf 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
+	    EFD_WAKE_COUNT_MAX))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..74be152ebe87 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* This is the maximum recursion depth we find so far */
+#define EFD_WAKE_COUNT_MAX 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_COUNT_MAX;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.17.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-15 14:13   ` Xie Yongji
  (?)
@ 2021-06-21  9:13     ` Jason Wang
  -1 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-21  9:13 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: songmuchun, virtualization, netdev, kvm, linux-fsdevel, iommu,
	linux-kernel


在 2021/6/15 下午10:13, Xie Yongji 写道:
> This VDUSE driver enables implementing vDPA devices in userspace.
> The vDPA device's control path is handled in kernel and the data
> path is handled in userspace.
>
> A message mechnism is used by VDUSE driver to forward some control
> messages such as starting/stopping datapath to userspace. Userspace
> can use read()/write() to receive/reply those control messages.
>
> And some ioctls are introduced to help userspace to implement the
> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> descriptors referring to vDPA device's iova regions. Then userspace
> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> ioctls can be used to inject interrupt and setup the kickfd for
> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> configuration space and inject a config interrupt.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  143 ++
>   6 files changed, 1613 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 9bfc2b510c64..acd95e9dcfe7 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index a503c1b2bfd9..6e23bce6433a 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>   	  vDPA block device simulator which terminates IO request in a
>   	  memory buffer.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index 67fe7f3d6943..f02ebed33f19 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
>   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..5271cbd15e28
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1453 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <linux/nospec.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <uapi/linux/virtio_ids.h>
> +#include <uapi/linux/virtio_blk.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> +#define VDUSE_REQUEST_TIMEOUT 30
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	u32 num;
> +	u32 avail_idx;
> +	u64 desc_addr;
> +	u64 driver_addr;
> +	u64 device_addr;
> +	bool ready;
> +	bool kicked;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +	struct work_struct inject;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device *dev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	char *name;
> +	struct mutex lock;
> +	spinlock_t msg_lock;
> +	u64 msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct vdpa_callback config_cb;
> +	struct work_struct inject;
> +	spinlock_t irq_lock;
> +	int minor;
> +	bool connected;
> +	bool started;
> +	u64 api_version;
> +	u64 user_features;


Let's use device_features.


> +	u64 features;


And driver features.


> +	u32 device_id;
> +	u32 vendor_id;
> +	u32 generation;
> +	u32 config_size;
> +	void *config;
> +	u8 status;
> +	u16 vq_size_max;
> +	u32 vq_num;
> +	u32 vq_align;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +struct vduse_control {
> +	u64 api_version;
> +};
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static DEFINE_IDR(vduse_idr);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +static struct cdev vduse_ctrl_cdev;
> +static struct cdev vduse_cdev;
> +static struct workqueue_struct *vduse_irq_wq;
> +
> +static u32 allowed_device_id[] = {
> +	VIRTIO_ID_BLOCK,
> +};
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t request_id)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	list_for_each_entry(msg, head, list) {
> +		if (msg->req.request_id == request_id) {
> +			list_del(&msg->list);
> +			return msg;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_send(struct vduse_dev *dev,
> +			      struct vduse_dev_msg *msg, bool no_reply)
> +{


It looks to me the only user for no_reply=true is the dataplane start. I 
wonder no_reply is really needed consider we have switched to use 
wait_event_killable_timeout().

In another way, no_reply is false for vq state synchronization and IOTLB 
updating. I wonder if we can simply use no_reply = true for them.


> +	init_waitqueue_head(&msg->waitq);
> +	spin_lock(&dev->msg_lock);
> +	msg->req.request_id = dev->msg_unique++;
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	spin_unlock(&dev->msg_lock);
> +	if (no_reply)
> +		return 0;
> +
> +	wait_event_killable_timeout(msg->waitq, msg->completed,
> +				    VDUSE_REQUEST_TIMEOUT * HZ);
> +	spin_lock(&dev->msg_lock);
> +	if (!msg->completed) {
> +		list_del(&msg->list);
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;


Do we need to serialize the check by protecting it with the spinlock above?


> +}
> +
> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	spin_lock(&dev->msg_lock);
> +	while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> +		if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +			kfree(msg);
> +		else
> +			vduse_enqueue_msg(&dev->recv_list, msg);
> +	}
> +	while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +		msg->completed = 1;
> +		wake_up(&msg->waitq);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +}
> +
> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_START_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_STOP_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;


Can we simply use this flag instead of introducing a new parameter 
(no_reply) in vduse_dev_msg_send()?


> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				  struct vduse_virtqueue *vq,
> +				  struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;


Note that I post a series that implement the packed virtqueue support:

https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html

So this patch needs to be updated as well.


> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_send(dev, &msg, false);
> +	if (ret)
> +		return ret;
> +
> +	state->avail_index = msg.resp.vq_state.avail_idx;
> +	return 0;
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg.req.type = VDUSE_UPDATE_IOTLB;
> +	msg.req.iova.start = start;
> +	msg.req.iova.last = last;
> +
> +	return vduse_dev_msg_send(dev, &msg, false);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret;
> +
> +	if (iov_iter_count(to) < size)
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		spin_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		spin_lock(&dev->msg_lock);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +	ret = copy_to_iter(&msg->req, size, to);
> +	spin_lock(&dev->msg_lock);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +		kfree(msg);
> +	else
> +		vduse_enqueue_msg(&dev->recv_list, msg);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -ENOENT;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +	if (!list_empty(&dev->recv_list))
> +		mask |= EPOLLOUT | EPOLLWRNORM;
> +
> +	return mask;
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +	struct vduse_iova_domain *domain = dev->domain;
> +
> +	/* The coherent mappings are handled in vduse_dev_free_coherent() */
> +	if (domain->bounce_map)
> +		vduse_domain_reset_bounce_map(domain);
> +
> +	dev->features = 0;
> +	dev->generation++;
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = NULL;
> +	dev->config_cb.private = NULL;
> +	spin_unlock(&dev->irq_lock);
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		vq->ready = false;
> +		vq->desc_addr = 0;
> +		vq->driver_addr = 0;
> +		vq->device_addr = 0;
> +		vq->avail_idx = 0;
> +		vq->num = 0;
> +
> +		spin_lock(&vq->kick_lock);
> +		vq->kicked = false;
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->desc_addr = desc_area;
> +	vq->driver_addr = driver_area;
> +	vq->device_addr = device_area;
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (!vq->ready)
> +		goto unlock;
> +
> +	if (vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	else
> +		vq->kicked = true;
> +unlock:
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->num = num;
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->avail_idx = state->avail_index;
> +	return 0;
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->user_features;
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->features = features;
> +	return 0;
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = cb->callback;
> +	dev->config_cb.private = cb->private;
> +	spin_unlock(&dev->irq_lock);
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->status;
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> +
> +	dev->status = status;
> +
> +	if (dev->started == started)
> +		return;


If we check dev->status == status, (or only check the DRIVER_OK bit) 
then there's no need to introduce an extra dev->started.


> +
> +	dev->started = started;
> +	if (dev->started) {
> +		vduse_dev_start_dataplane(dev);
> +	} else {
> +		vduse_dev_reset(dev);
> +		vduse_dev_stop_dataplane(dev);


I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA, 
we have bouncing buffers so it's harmless if usersapce dataplane keeps 
performing read/write. For vhost-vDPA we don't have such stuffs.


> +	}
> +}
> +
> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->config_size;
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +				  void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	memcpy(buf, dev->config + offset, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	/* Now we only support read-only configuration space */
> +}
> +
> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->generation;
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	int ret;
> +
> +	ret = vduse_domain_set_map(dev->domain, iotlb);
> +	if (ret)
> +		return ret;
> +
> +	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +	if (ret) {
> +		vduse_domain_clear_map(dev->domain, iotlb);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config_size	= vduse_vdpa_get_config_size,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.get_generation		= vduse_vdpa_get_generation,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +				     unsigned long offset, size_t size,
> +				     enum dma_data_direction dir,
> +				     unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +}
> +
> +static size_t vduse_dev_max_mapping_size(struct device *dev)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return domain->bounce_size;
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +	.max_mapping_size = vduse_dev_max_mapping_size,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +	u32 index;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	index = array_index_nospec(eventfd->index, dev->vq_num);
> +	vq = &dev->vqs[index];
> +	if (eventfd->fd >= 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> +		return 0;
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	if (vq->ready && vq->kicked && vq->kickfd) {
> +		eventfd_signal(vq->kickfd, 1);
> +		vq->kicked = false;
> +	}
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static void vduse_dev_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> +
> +	spin_lock_irq(&dev->irq_lock);
> +	if (dev->config_cb.callback)
> +		dev->config_cb.callback(dev->config_cb.private);
> +	spin_unlock_irq(&dev->irq_lock);
> +}
> +
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct vduse_iova_domain *domain = dev->domain;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (entry.start > entry.last)
> +			break;
> +
> +		spin_lock(&domain->iotlb_lock);
> +		map = vhost_iotlb_itree_first(domain->iotlb,
> +					      entry.start, entry.last);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&domain->iotlb_lock);
> +		ret = -EINVAL;
> +		if (!f)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			break;
> +		}
> +		ret = receive_fd(f, perm_to_file_flags(entry.perm));
> +		fput(f);
> +		break;
> +	}
> +	case VDUSE_DEV_GET_FEATURES:
> +		ret = put_user(dev->features, (u64 __user *)argp);
> +		break;
> +	case VDUSE_DEV_UPDATE_CONFIG: {
> +		struct vduse_config_update config;
> +		unsigned long size = offsetof(struct vduse_config_update,
> +					      buffer);
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (config.length == 0 ||
> +		    config.length > dev->config_size - config.offset)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(dev->config + config.offset, argp + size,
> +				   config.length))
> +			break;
> +
> +		ret = 0;
> +		queue_work(vduse_irq_wq, &dev->inject);


I wonder if it's better to separate config interrupt out of config 
update or we need document this.


> +		break;
> +	}
> +	case VDUSE_VQ_GET_INFO: {


Do we need to limit this only when DRIVER_OK is set?


> +		struct vduse_vq_info vq_info;
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_info.index >= dev->vq_num)
> +			break;
> +
> +		vq_index = array_index_nospec(vq_info.index, dev->vq_num);
> +		vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
> +		vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
> +		vq_info.device_addr = dev->vqs[vq_index].device_addr;
> +		vq_info.num = dev->vqs[vq_index].num;
> +		vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
> +		vq_info.ready = dev->vqs[vq_index].ready;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
> +			break;
> +
> +		ret = 0;
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_VQ_INJECT_IRQ: {
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (get_user(vq_index, (u32 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_index >= dev->vq_num)
> +			break;
> +
> +		ret = 0;
> +		vq_index = array_index_nospec(vq_index, dev->vq_num);
> +		queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
> +		break;
> +	}
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +
> +	spin_lock(&dev->msg_lock);
> +	/* Make sure the inflight messages can processed after reconncection */
> +	list_splice_init(&dev->recv_list, &dev->send_list);
> +	spin_unlock(&dev->msg_lock);
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
> +{
> +	struct vduse_dev *dev;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = idr_find(&vduse_idr, minor);
> +	mutex_unlock(&vduse_lock);
> +
> +	return dev;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	int ret;
> +	struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	ret = -EBUSY;
> +	mutex_lock(&dev->lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&dev->lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	mutex_init(&dev->lock);
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	spin_lock_init(&dev->irq_lock);
> +
> +	INIT_WORK(&dev->inject, vduse_dev_irq_inject);
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int id;
> +
> +	idr_for_each_entry(&vduse_idr, dev, id)
> +		if (!strcmp(dev->name, name))
> +			return dev;
> +
> +	return NULL;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->lock);
> +	if (dev->vdev || dev->connected) {
> +		mutex_unlock(&dev->lock);
> +		return -EBUSY;
> +	}
> +	dev->connected = true;
> +	mutex_unlock(&dev->lock);
> +
> +	vduse_dev_msg_cleanup(dev);
> +	device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> +	idr_remove(&vduse_idr, dev->minor);
> +	kvfree(dev->config);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	kfree(dev->name);
> +	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
> +
> +	return 0;
> +}
> +
> +static bool device_is_allowed(u32 device_id)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
> +		if (allowed_device_id[i] == device_id)
> +			return true;
> +
> +	return false;
> +}
> +
> +static bool features_is_valid(u64 features)
> +{
> +	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +		return false;
> +
> +	/* Now we only support read-only configuration space */
> +	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{
> +	if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
> +		return false;
> +
> +	if (config->vq_align > PAGE_SIZE)
> +		return false;
> +
> +	if (config->config_size > PAGE_SIZE)
> +		return false;
> +
> +	if (!device_is_allowed(config->device_id))
> +		return false;
> +
> +	if (!features_is_valid(config->features))
> +		return false;


Do we need to validate whether or not config_size is too small otherwise 
we may have OOB access in get_config()?


> +
> +	return true;
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config,
> +			    void *config_buf, u64 api_version)
> +{
> +	int i, ret;
> +	struct vduse_dev *dev;
> +
> +	ret = -EEXIST;
> +	if (vduse_find_dev(config->name))
> +		goto err;
> +
> +	ret = -ENOMEM;
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		goto err;
> +
> +	dev->api_version = api_version;
> +	dev->user_features = config->features;
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->name = kstrdup(config->name, GFP_KERNEL);
> +	if (!dev->name)
> +		goto err_str;
> +
> +	dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> +					  config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->config = config_buf;
> +	dev->config_size = config->config_size;
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_idr;
> +
> +	dev->minor = ret;
> +	dev->dev = device_create(vduse_class, NULL,
> +				 MKDEV(MAJOR(vduse_major), dev->minor),
> +				 NULL, "%s", config->name);
> +	if (IS_ERR(dev->dev)) {
> +		ret = PTR_ERR(dev->dev);
> +		goto err_dev;
> +	}
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_dev:
> +	idr_remove(&vduse_idr, dev->minor);
> +err_idr:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	kfree(dev->name);
> +err_str:
> +	vduse_dev_destroy(dev);
> +err:
> +	kvfree(config_buf);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +	struct vduse_control *control = file->private_data;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_GET_API_VERSION:
> +		ret = put_user(control->api_version, (u64 __user *)argp);
> +		break;
> +	case VDUSE_SET_API_VERSION: {
> +		u64 api_version;
> +
> +		ret = -EFAULT;
> +		if (get_user(api_version, (u64 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (api_version > VDUSE_API_VERSION)
> +			break;
> +
> +		ret = 0;
> +		control->api_version = api_version;
> +		break;
> +	}
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +		unsigned long size = offsetof(struct vduse_dev_config, config);
> +		void *buf;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vduse_validate_config(&config) == false)
> +			break;
> +
> +		buf = vmemdup_user(argp + size, config.config_size);
> +		if (IS_ERR(buf)) {
> +			ret = PTR_ERR(buf);
> +			break;
> +		}
> +		ret = vduse_create_dev(&config, buf, control->api_version);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static int vduse_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control = file->private_data;
> +
> +	kfree(control);
> +	return 0;
> +}
> +
> +static int vduse_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control;
> +
> +	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> +	if (!control)
> +		return -ENOMEM;
> +
> +	control->api_version = VDUSE_API_VERSION;
> +	file->private_data = control;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_ctrl_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_open,
> +	.release	= vduse_release,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev;
> +	int ret;
> +
> +	if (dev->vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> +				 &vduse_vdpa_config_ops, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	dev->vdev = vdev;
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret) {
> +		put_device(&vdev->vdpa.dev);
> +		return ret;
> +	}
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	return 0;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev) {
> +		mutex_unlock(&vduse_lock);
> +		return -EINVAL;
> +	}
> +	ret = vduse_dev_init_vdpa(dev, name);
> +	mutex_unlock(&vduse_lock);
> +	if (ret)
> +		return ret;
> +
> +	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> +	if (ret) {
> +		put_device(&dev->vdev->vdpa.dev);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +	struct device *dev;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class))
> +		return PTR_ERR(vduse_class);
> +
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev_region;
> +
> +	/* /dev/vduse/control */
> +	cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
> +	vduse_ctrl_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
> +	if (ret)
> +		goto err_ctrl_cdev;
> +
> +	dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto err_device;
> +	}
> +
> +	/* /dev/vduse/$DEVICE */
> +	cdev_init(&vduse_cdev, &vduse_dev_fops);
> +	vduse_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
> +		       VDUSE_DEV_MAX - 1);
> +	if (ret)
> +		goto err_cdev;
> +
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;
> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	destroy_workqueue(vduse_irq_wq);
> +err_wq:
> +	cdev_del(&vduse_cdev);
> +err_cdev:
> +	device_destroy(vduse_class, vduse_major);
> +err_device:
> +	cdev_del(&vduse_ctrl_cdev);
> +err_ctrl_cdev:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev_region:
> +	class_destroy(vduse_class);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	vduse_mgmtdev_exit();
> +	vduse_domain_exit();
> +	destroy_workqueue(vduse_irq_wq);
> +	cdev_del(&vduse_cdev);
> +	device_destroy(vduse_class, vduse_major);
> +	cdev_del(&vduse_ctrl_cdev);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	class_destroy(vduse_class);
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_API_VERSION	0
> +
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	/* Get the state for virtqueue from userspace */
> +	VDUSE_GET_VQ_STATE,
> +	/* Notify userspace to start the dataplane, no reply */
> +	VDUSE_START_DATAPLANE,
> +	/* Notify userspace to stop the dataplane, no reply */
> +	VDUSE_STOP_DATAPLANE,
> +	/* Notify userspace to update the memory mapping in device IOTLB */
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};


This needs some tweaks to support packed virtqueue.


> +
> +struct vduse_iova_range {
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* end of the IOVA range */
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY	(1 << 0) /* No need to reply */
> +	__u32 flags; /* request flags */
> +	__u32 reserved; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK	0x00
> +#define VDUSE_REQ_RESULT_FAILED	0x01
> +	__u32 result; /* the result of request */
> +	__u32 reserved[2]; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 features; /* device features */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u16 padding; /* padding */
> +	__u32 vq_num; /* the number of virtqueues */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +	__u32 config_size; /* the size of the configuration space */
> +	__u32 reserved[15]; /* for future use */
> +	__u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> +	__u32 offset; /* offset from the beginning of configuration space */
> +	__u32 length; /* the length to write to configuration space */
> +	__u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +	__u64 desc_addr; /* address of desc area */
> +	__u64 driver_addr; /* address of driver area */
> +	__u64 device_addr; /* address of device area */
> +	__u32 num; /* the size of virtqueue */
> +	__u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION	_IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION	_IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> +
> +/*
> + * Get a file descriptor for the first overlapped iova region,
> + * -EINVAL means the iova region doesn't exist.
> + */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> +
> +/* Get the negotiated features */
> +#define VDUSE_DEV_GET_FEATURES	_IOR(VDUSE_BASE, 0x05, __u64)
> +
> +/* Update the configuration space */
> +#define VDUSE_DEV_UPDATE_CONFIG	_IOW(VDUSE_BASE, 0x06, struct vduse_config_update)
> +
> +/* Get the specified virtqueue's information */
> +#define VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x07, struct vduse_vq_info)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x08, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x09, __u32)
> +
> +#endif /* _UAPI_VDUSE_H_ */


^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-21  9:13     ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-21  9:13 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel


在 2021/6/15 下午10:13, Xie Yongji 写道:
> This VDUSE driver enables implementing vDPA devices in userspace.
> The vDPA device's control path is handled in kernel and the data
> path is handled in userspace.
>
> A message mechnism is used by VDUSE driver to forward some control
> messages such as starting/stopping datapath to userspace. Userspace
> can use read()/write() to receive/reply those control messages.
>
> And some ioctls are introduced to help userspace to implement the
> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> descriptors referring to vDPA device's iova regions. Then userspace
> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> ioctls can be used to inject interrupt and setup the kickfd for
> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> configuration space and inject a config interrupt.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  143 ++
>   6 files changed, 1613 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 9bfc2b510c64..acd95e9dcfe7 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index a503c1b2bfd9..6e23bce6433a 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>   	  vDPA block device simulator which terminates IO request in a
>   	  memory buffer.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index 67fe7f3d6943..f02ebed33f19 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
>   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..5271cbd15e28
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1453 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <linux/nospec.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <uapi/linux/virtio_ids.h>
> +#include <uapi/linux/virtio_blk.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> +#define VDUSE_REQUEST_TIMEOUT 30
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	u32 num;
> +	u32 avail_idx;
> +	u64 desc_addr;
> +	u64 driver_addr;
> +	u64 device_addr;
> +	bool ready;
> +	bool kicked;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +	struct work_struct inject;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device *dev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	char *name;
> +	struct mutex lock;
> +	spinlock_t msg_lock;
> +	u64 msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct vdpa_callback config_cb;
> +	struct work_struct inject;
> +	spinlock_t irq_lock;
> +	int minor;
> +	bool connected;
> +	bool started;
> +	u64 api_version;
> +	u64 user_features;


Let's use device_features.


> +	u64 features;


And driver features.


> +	u32 device_id;
> +	u32 vendor_id;
> +	u32 generation;
> +	u32 config_size;
> +	void *config;
> +	u8 status;
> +	u16 vq_size_max;
> +	u32 vq_num;
> +	u32 vq_align;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +struct vduse_control {
> +	u64 api_version;
> +};
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static DEFINE_IDR(vduse_idr);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +static struct cdev vduse_ctrl_cdev;
> +static struct cdev vduse_cdev;
> +static struct workqueue_struct *vduse_irq_wq;
> +
> +static u32 allowed_device_id[] = {
> +	VIRTIO_ID_BLOCK,
> +};
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t request_id)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	list_for_each_entry(msg, head, list) {
> +		if (msg->req.request_id == request_id) {
> +			list_del(&msg->list);
> +			return msg;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_send(struct vduse_dev *dev,
> +			      struct vduse_dev_msg *msg, bool no_reply)
> +{


It looks to me the only user for no_reply=true is the dataplane start. I 
wonder no_reply is really needed consider we have switched to use 
wait_event_killable_timeout().

In another way, no_reply is false for vq state synchronization and IOTLB 
updating. I wonder if we can simply use no_reply = true for them.


> +	init_waitqueue_head(&msg->waitq);
> +	spin_lock(&dev->msg_lock);
> +	msg->req.request_id = dev->msg_unique++;
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	spin_unlock(&dev->msg_lock);
> +	if (no_reply)
> +		return 0;
> +
> +	wait_event_killable_timeout(msg->waitq, msg->completed,
> +				    VDUSE_REQUEST_TIMEOUT * HZ);
> +	spin_lock(&dev->msg_lock);
> +	if (!msg->completed) {
> +		list_del(&msg->list);
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;


Do we need to serialize the check by protecting it with the spinlock above?


> +}
> +
> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	spin_lock(&dev->msg_lock);
> +	while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> +		if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +			kfree(msg);
> +		else
> +			vduse_enqueue_msg(&dev->recv_list, msg);
> +	}
> +	while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +		msg->completed = 1;
> +		wake_up(&msg->waitq);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +}
> +
> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_START_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_STOP_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;


Can we simply use this flag instead of introducing a new parameter 
(no_reply) in vduse_dev_msg_send()?


> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				  struct vduse_virtqueue *vq,
> +				  struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;


Note that I post a series that implement the packed virtqueue support:

https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html

So this patch needs to be updated as well.


> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_send(dev, &msg, false);
> +	if (ret)
> +		return ret;
> +
> +	state->avail_index = msg.resp.vq_state.avail_idx;
> +	return 0;
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg.req.type = VDUSE_UPDATE_IOTLB;
> +	msg.req.iova.start = start;
> +	msg.req.iova.last = last;
> +
> +	return vduse_dev_msg_send(dev, &msg, false);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret;
> +
> +	if (iov_iter_count(to) < size)
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		spin_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		spin_lock(&dev->msg_lock);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +	ret = copy_to_iter(&msg->req, size, to);
> +	spin_lock(&dev->msg_lock);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +		kfree(msg);
> +	else
> +		vduse_enqueue_msg(&dev->recv_list, msg);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -ENOENT;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +	if (!list_empty(&dev->recv_list))
> +		mask |= EPOLLOUT | EPOLLWRNORM;
> +
> +	return mask;
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +	struct vduse_iova_domain *domain = dev->domain;
> +
> +	/* The coherent mappings are handled in vduse_dev_free_coherent() */
> +	if (domain->bounce_map)
> +		vduse_domain_reset_bounce_map(domain);
> +
> +	dev->features = 0;
> +	dev->generation++;
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = NULL;
> +	dev->config_cb.private = NULL;
> +	spin_unlock(&dev->irq_lock);
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		vq->ready = false;
> +		vq->desc_addr = 0;
> +		vq->driver_addr = 0;
> +		vq->device_addr = 0;
> +		vq->avail_idx = 0;
> +		vq->num = 0;
> +
> +		spin_lock(&vq->kick_lock);
> +		vq->kicked = false;
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->desc_addr = desc_area;
> +	vq->driver_addr = driver_area;
> +	vq->device_addr = device_area;
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (!vq->ready)
> +		goto unlock;
> +
> +	if (vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	else
> +		vq->kicked = true;
> +unlock:
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->num = num;
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->avail_idx = state->avail_index;
> +	return 0;
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->user_features;
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->features = features;
> +	return 0;
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = cb->callback;
> +	dev->config_cb.private = cb->private;
> +	spin_unlock(&dev->irq_lock);
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->status;
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> +
> +	dev->status = status;
> +
> +	if (dev->started == started)
> +		return;


If we check dev->status == status, (or only check the DRIVER_OK bit) 
then there's no need to introduce an extra dev->started.


> +
> +	dev->started = started;
> +	if (dev->started) {
> +		vduse_dev_start_dataplane(dev);
> +	} else {
> +		vduse_dev_reset(dev);
> +		vduse_dev_stop_dataplane(dev);


I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA, 
we have bouncing buffers so it's harmless if usersapce dataplane keeps 
performing read/write. For vhost-vDPA we don't have such stuffs.


> +	}
> +}
> +
> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->config_size;
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +				  void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	memcpy(buf, dev->config + offset, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	/* Now we only support read-only configuration space */
> +}
> +
> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->generation;
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	int ret;
> +
> +	ret = vduse_domain_set_map(dev->domain, iotlb);
> +	if (ret)
> +		return ret;
> +
> +	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +	if (ret) {
> +		vduse_domain_clear_map(dev->domain, iotlb);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config_size	= vduse_vdpa_get_config_size,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.get_generation		= vduse_vdpa_get_generation,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +				     unsigned long offset, size_t size,
> +				     enum dma_data_direction dir,
> +				     unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +}
> +
> +static size_t vduse_dev_max_mapping_size(struct device *dev)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return domain->bounce_size;
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +	.max_mapping_size = vduse_dev_max_mapping_size,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +	u32 index;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	index = array_index_nospec(eventfd->index, dev->vq_num);
> +	vq = &dev->vqs[index];
> +	if (eventfd->fd >= 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> +		return 0;
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	if (vq->ready && vq->kicked && vq->kickfd) {
> +		eventfd_signal(vq->kickfd, 1);
> +		vq->kicked = false;
> +	}
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static void vduse_dev_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> +
> +	spin_lock_irq(&dev->irq_lock);
> +	if (dev->config_cb.callback)
> +		dev->config_cb.callback(dev->config_cb.private);
> +	spin_unlock_irq(&dev->irq_lock);
> +}
> +
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct vduse_iova_domain *domain = dev->domain;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (entry.start > entry.last)
> +			break;
> +
> +		spin_lock(&domain->iotlb_lock);
> +		map = vhost_iotlb_itree_first(domain->iotlb,
> +					      entry.start, entry.last);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&domain->iotlb_lock);
> +		ret = -EINVAL;
> +		if (!f)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			break;
> +		}
> +		ret = receive_fd(f, perm_to_file_flags(entry.perm));
> +		fput(f);
> +		break;
> +	}
> +	case VDUSE_DEV_GET_FEATURES:
> +		ret = put_user(dev->features, (u64 __user *)argp);
> +		break;
> +	case VDUSE_DEV_UPDATE_CONFIG: {
> +		struct vduse_config_update config;
> +		unsigned long size = offsetof(struct vduse_config_update,
> +					      buffer);
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (config.length == 0 ||
> +		    config.length > dev->config_size - config.offset)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(dev->config + config.offset, argp + size,
> +				   config.length))
> +			break;
> +
> +		ret = 0;
> +		queue_work(vduse_irq_wq, &dev->inject);


I wonder if it's better to separate config interrupt out of config 
update or we need document this.


> +		break;
> +	}
> +	case VDUSE_VQ_GET_INFO: {


Do we need to limit this only when DRIVER_OK is set?


> +		struct vduse_vq_info vq_info;
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_info.index >= dev->vq_num)
> +			break;
> +
> +		vq_index = array_index_nospec(vq_info.index, dev->vq_num);
> +		vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
> +		vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
> +		vq_info.device_addr = dev->vqs[vq_index].device_addr;
> +		vq_info.num = dev->vqs[vq_index].num;
> +		vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
> +		vq_info.ready = dev->vqs[vq_index].ready;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
> +			break;
> +
> +		ret = 0;
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_VQ_INJECT_IRQ: {
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (get_user(vq_index, (u32 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_index >= dev->vq_num)
> +			break;
> +
> +		ret = 0;
> +		vq_index = array_index_nospec(vq_index, dev->vq_num);
> +		queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
> +		break;
> +	}
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +
> +	spin_lock(&dev->msg_lock);
> +	/* Make sure the inflight messages can processed after reconncection */
> +	list_splice_init(&dev->recv_list, &dev->send_list);
> +	spin_unlock(&dev->msg_lock);
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
> +{
> +	struct vduse_dev *dev;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = idr_find(&vduse_idr, minor);
> +	mutex_unlock(&vduse_lock);
> +
> +	return dev;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	int ret;
> +	struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	ret = -EBUSY;
> +	mutex_lock(&dev->lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&dev->lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	mutex_init(&dev->lock);
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	spin_lock_init(&dev->irq_lock);
> +
> +	INIT_WORK(&dev->inject, vduse_dev_irq_inject);
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int id;
> +
> +	idr_for_each_entry(&vduse_idr, dev, id)
> +		if (!strcmp(dev->name, name))
> +			return dev;
> +
> +	return NULL;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->lock);
> +	if (dev->vdev || dev->connected) {
> +		mutex_unlock(&dev->lock);
> +		return -EBUSY;
> +	}
> +	dev->connected = true;
> +	mutex_unlock(&dev->lock);
> +
> +	vduse_dev_msg_cleanup(dev);
> +	device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> +	idr_remove(&vduse_idr, dev->minor);
> +	kvfree(dev->config);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	kfree(dev->name);
> +	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
> +
> +	return 0;
> +}
> +
> +static bool device_is_allowed(u32 device_id)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
> +		if (allowed_device_id[i] == device_id)
> +			return true;
> +
> +	return false;
> +}
> +
> +static bool features_is_valid(u64 features)
> +{
> +	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +		return false;
> +
> +	/* Now we only support read-only configuration space */
> +	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{
> +	if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
> +		return false;
> +
> +	if (config->vq_align > PAGE_SIZE)
> +		return false;
> +
> +	if (config->config_size > PAGE_SIZE)
> +		return false;
> +
> +	if (!device_is_allowed(config->device_id))
> +		return false;
> +
> +	if (!features_is_valid(config->features))
> +		return false;


Do we need to validate whether or not config_size is too small otherwise 
we may have OOB access in get_config()?


> +
> +	return true;
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config,
> +			    void *config_buf, u64 api_version)
> +{
> +	int i, ret;
> +	struct vduse_dev *dev;
> +
> +	ret = -EEXIST;
> +	if (vduse_find_dev(config->name))
> +		goto err;
> +
> +	ret = -ENOMEM;
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		goto err;
> +
> +	dev->api_version = api_version;
> +	dev->user_features = config->features;
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->name = kstrdup(config->name, GFP_KERNEL);
> +	if (!dev->name)
> +		goto err_str;
> +
> +	dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> +					  config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->config = config_buf;
> +	dev->config_size = config->config_size;
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_idr;
> +
> +	dev->minor = ret;
> +	dev->dev = device_create(vduse_class, NULL,
> +				 MKDEV(MAJOR(vduse_major), dev->minor),
> +				 NULL, "%s", config->name);
> +	if (IS_ERR(dev->dev)) {
> +		ret = PTR_ERR(dev->dev);
> +		goto err_dev;
> +	}
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_dev:
> +	idr_remove(&vduse_idr, dev->minor);
> +err_idr:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	kfree(dev->name);
> +err_str:
> +	vduse_dev_destroy(dev);
> +err:
> +	kvfree(config_buf);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +	struct vduse_control *control = file->private_data;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_GET_API_VERSION:
> +		ret = put_user(control->api_version, (u64 __user *)argp);
> +		break;
> +	case VDUSE_SET_API_VERSION: {
> +		u64 api_version;
> +
> +		ret = -EFAULT;
> +		if (get_user(api_version, (u64 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (api_version > VDUSE_API_VERSION)
> +			break;
> +
> +		ret = 0;
> +		control->api_version = api_version;
> +		break;
> +	}
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +		unsigned long size = offsetof(struct vduse_dev_config, config);
> +		void *buf;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vduse_validate_config(&config) == false)
> +			break;
> +
> +		buf = vmemdup_user(argp + size, config.config_size);
> +		if (IS_ERR(buf)) {
> +			ret = PTR_ERR(buf);
> +			break;
> +		}
> +		ret = vduse_create_dev(&config, buf, control->api_version);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static int vduse_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control = file->private_data;
> +
> +	kfree(control);
> +	return 0;
> +}
> +
> +static int vduse_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control;
> +
> +	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> +	if (!control)
> +		return -ENOMEM;
> +
> +	control->api_version = VDUSE_API_VERSION;
> +	file->private_data = control;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_ctrl_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_open,
> +	.release	= vduse_release,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev;
> +	int ret;
> +
> +	if (dev->vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> +				 &vduse_vdpa_config_ops, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	dev->vdev = vdev;
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret) {
> +		put_device(&vdev->vdpa.dev);
> +		return ret;
> +	}
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	return 0;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev) {
> +		mutex_unlock(&vduse_lock);
> +		return -EINVAL;
> +	}
> +	ret = vduse_dev_init_vdpa(dev, name);
> +	mutex_unlock(&vduse_lock);
> +	if (ret)
> +		return ret;
> +
> +	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> +	if (ret) {
> +		put_device(&dev->vdev->vdpa.dev);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +	struct device *dev;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class))
> +		return PTR_ERR(vduse_class);
> +
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev_region;
> +
> +	/* /dev/vduse/control */
> +	cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
> +	vduse_ctrl_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
> +	if (ret)
> +		goto err_ctrl_cdev;
> +
> +	dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto err_device;
> +	}
> +
> +	/* /dev/vduse/$DEVICE */
> +	cdev_init(&vduse_cdev, &vduse_dev_fops);
> +	vduse_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
> +		       VDUSE_DEV_MAX - 1);
> +	if (ret)
> +		goto err_cdev;
> +
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;
> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	destroy_workqueue(vduse_irq_wq);
> +err_wq:
> +	cdev_del(&vduse_cdev);
> +err_cdev:
> +	device_destroy(vduse_class, vduse_major);
> +err_device:
> +	cdev_del(&vduse_ctrl_cdev);
> +err_ctrl_cdev:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev_region:
> +	class_destroy(vduse_class);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	vduse_mgmtdev_exit();
> +	vduse_domain_exit();
> +	destroy_workqueue(vduse_irq_wq);
> +	cdev_del(&vduse_cdev);
> +	device_destroy(vduse_class, vduse_major);
> +	cdev_del(&vduse_ctrl_cdev);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	class_destroy(vduse_class);
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_API_VERSION	0
> +
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	/* Get the state for virtqueue from userspace */
> +	VDUSE_GET_VQ_STATE,
> +	/* Notify userspace to start the dataplane, no reply */
> +	VDUSE_START_DATAPLANE,
> +	/* Notify userspace to stop the dataplane, no reply */
> +	VDUSE_STOP_DATAPLANE,
> +	/* Notify userspace to update the memory mapping in device IOTLB */
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};


This needs some tweaks to support packed virtqueue.


> +
> +struct vduse_iova_range {
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* end of the IOVA range */
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY	(1 << 0) /* No need to reply */
> +	__u32 flags; /* request flags */
> +	__u32 reserved; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK	0x00
> +#define VDUSE_REQ_RESULT_FAILED	0x01
> +	__u32 result; /* the result of request */
> +	__u32 reserved[2]; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 features; /* device features */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u16 padding; /* padding */
> +	__u32 vq_num; /* the number of virtqueues */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +	__u32 config_size; /* the size of the configuration space */
> +	__u32 reserved[15]; /* for future use */
> +	__u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> +	__u32 offset; /* offset from the beginning of configuration space */
> +	__u32 length; /* the length to write to configuration space */
> +	__u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +	__u64 desc_addr; /* address of desc area */
> +	__u64 driver_addr; /* address of driver area */
> +	__u64 device_addr; /* address of device area */
> +	__u32 num; /* the size of virtqueue */
> +	__u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION	_IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION	_IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> +
> +/*
> + * Get a file descriptor for the first overlapped iova region,
> + * -EINVAL means the iova region doesn't exist.
> + */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> +
> +/* Get the negotiated features */
> +#define VDUSE_DEV_GET_FEATURES	_IOR(VDUSE_BASE, 0x05, __u64)
> +
> +/* Update the configuration space */
> +#define VDUSE_DEV_UPDATE_CONFIG	_IOW(VDUSE_BASE, 0x06, struct vduse_config_update)
> +
> +/* Get the specified virtqueue's information */
> +#define VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x07, struct vduse_vq_info)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x08, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x09, __u32)
> +
> +#endif /* _UAPI_VDUSE_H_ */


_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-21  9:13     ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-21  9:13 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel


在 2021/6/15 下午10:13, Xie Yongji 写道:
> This VDUSE driver enables implementing vDPA devices in userspace.
> The vDPA device's control path is handled in kernel and the data
> path is handled in userspace.
>
> A message mechnism is used by VDUSE driver to forward some control
> messages such as starting/stopping datapath to userspace. Userspace
> can use read()/write() to receive/reply those control messages.
>
> And some ioctls are introduced to help userspace to implement the
> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> descriptors referring to vDPA device's iova regions. Then userspace
> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> ioctls can be used to inject interrupt and setup the kickfd for
> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> configuration space and inject a config interrupt.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  143 ++
>   6 files changed, 1613 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 9bfc2b510c64..acd95e9dcfe7 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index a503c1b2bfd9..6e23bce6433a 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>   	  vDPA block device simulator which terminates IO request in a
>   	  memory buffer.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index 67fe7f3d6943..f02ebed33f19 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
>   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..5271cbd15e28
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1453 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <linux/nospec.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <uapi/linux/virtio_ids.h>
> +#include <uapi/linux/virtio_blk.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> +#define VDUSE_REQUEST_TIMEOUT 30
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	u32 num;
> +	u32 avail_idx;
> +	u64 desc_addr;
> +	u64 driver_addr;
> +	u64 device_addr;
> +	bool ready;
> +	bool kicked;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +	struct work_struct inject;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device *dev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	char *name;
> +	struct mutex lock;
> +	spinlock_t msg_lock;
> +	u64 msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct vdpa_callback config_cb;
> +	struct work_struct inject;
> +	spinlock_t irq_lock;
> +	int minor;
> +	bool connected;
> +	bool started;
> +	u64 api_version;
> +	u64 user_features;


Let's use device_features.


> +	u64 features;


And driver features.


> +	u32 device_id;
> +	u32 vendor_id;
> +	u32 generation;
> +	u32 config_size;
> +	void *config;
> +	u8 status;
> +	u16 vq_size_max;
> +	u32 vq_num;
> +	u32 vq_align;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +struct vduse_control {
> +	u64 api_version;
> +};
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static DEFINE_IDR(vduse_idr);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +static struct cdev vduse_ctrl_cdev;
> +static struct cdev vduse_cdev;
> +static struct workqueue_struct *vduse_irq_wq;
> +
> +static u32 allowed_device_id[] = {
> +	VIRTIO_ID_BLOCK,
> +};
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t request_id)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	list_for_each_entry(msg, head, list) {
> +		if (msg->req.request_id == request_id) {
> +			list_del(&msg->list);
> +			return msg;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_send(struct vduse_dev *dev,
> +			      struct vduse_dev_msg *msg, bool no_reply)
> +{


It looks to me the only user for no_reply=true is the dataplane start. I 
wonder no_reply is really needed consider we have switched to use 
wait_event_killable_timeout().

In another way, no_reply is false for vq state synchronization and IOTLB 
updating. I wonder if we can simply use no_reply = true for them.


> +	init_waitqueue_head(&msg->waitq);
> +	spin_lock(&dev->msg_lock);
> +	msg->req.request_id = dev->msg_unique++;
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	spin_unlock(&dev->msg_lock);
> +	if (no_reply)
> +		return 0;
> +
> +	wait_event_killable_timeout(msg->waitq, msg->completed,
> +				    VDUSE_REQUEST_TIMEOUT * HZ);
> +	spin_lock(&dev->msg_lock);
> +	if (!msg->completed) {
> +		list_del(&msg->list);
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;


Do we need to serialize the check by protecting it with the spinlock above?


> +}
> +
> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	spin_lock(&dev->msg_lock);
> +	while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> +		if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +			kfree(msg);
> +		else
> +			vduse_enqueue_msg(&dev->recv_list, msg);
> +	}
> +	while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +		msg->completed = 1;
> +		wake_up(&msg->waitq);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +}
> +
> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_START_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_STOP_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;


Can we simply use this flag instead of introducing a new parameter 
(no_reply) in vduse_dev_msg_send()?


> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				  struct vduse_virtqueue *vq,
> +				  struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;


Note that I post a series that implement the packed virtqueue support:

https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html

So this patch needs to be updated as well.


> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_send(dev, &msg, false);
> +	if (ret)
> +		return ret;
> +
> +	state->avail_index = msg.resp.vq_state.avail_idx;
> +	return 0;
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg.req.type = VDUSE_UPDATE_IOTLB;
> +	msg.req.iova.start = start;
> +	msg.req.iova.last = last;
> +
> +	return vduse_dev_msg_send(dev, &msg, false);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret;
> +
> +	if (iov_iter_count(to) < size)
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		spin_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		spin_lock(&dev->msg_lock);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +	ret = copy_to_iter(&msg->req, size, to);
> +	spin_lock(&dev->msg_lock);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +		kfree(msg);
> +	else
> +		vduse_enqueue_msg(&dev->recv_list, msg);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -ENOENT;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +	if (!list_empty(&dev->recv_list))
> +		mask |= EPOLLOUT | EPOLLWRNORM;
> +
> +	return mask;
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +	struct vduse_iova_domain *domain = dev->domain;
> +
> +	/* The coherent mappings are handled in vduse_dev_free_coherent() */
> +	if (domain->bounce_map)
> +		vduse_domain_reset_bounce_map(domain);
> +
> +	dev->features = 0;
> +	dev->generation++;
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = NULL;
> +	dev->config_cb.private = NULL;
> +	spin_unlock(&dev->irq_lock);
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		vq->ready = false;
> +		vq->desc_addr = 0;
> +		vq->driver_addr = 0;
> +		vq->device_addr = 0;
> +		vq->avail_idx = 0;
> +		vq->num = 0;
> +
> +		spin_lock(&vq->kick_lock);
> +		vq->kicked = false;
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->desc_addr = desc_area;
> +	vq->driver_addr = driver_area;
> +	vq->device_addr = device_area;
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (!vq->ready)
> +		goto unlock;
> +
> +	if (vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	else
> +		vq->kicked = true;
> +unlock:
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->num = num;
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->avail_idx = state->avail_index;
> +	return 0;
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->user_features;
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->features = features;
> +	return 0;
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = cb->callback;
> +	dev->config_cb.private = cb->private;
> +	spin_unlock(&dev->irq_lock);
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->status;
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> +
> +	dev->status = status;
> +
> +	if (dev->started == started)
> +		return;


If we check dev->status == status, (or only check the DRIVER_OK bit) 
then there's no need to introduce an extra dev->started.


> +
> +	dev->started = started;
> +	if (dev->started) {
> +		vduse_dev_start_dataplane(dev);
> +	} else {
> +		vduse_dev_reset(dev);
> +		vduse_dev_stop_dataplane(dev);


I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA, 
we have bouncing buffers so it's harmless if usersapce dataplane keeps 
performing read/write. For vhost-vDPA we don't have such stuffs.


> +	}
> +}
> +
> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->config_size;
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +				  void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	memcpy(buf, dev->config + offset, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	/* Now we only support read-only configuration space */
> +}
> +
> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->generation;
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	int ret;
> +
> +	ret = vduse_domain_set_map(dev->domain, iotlb);
> +	if (ret)
> +		return ret;
> +
> +	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +	if (ret) {
> +		vduse_domain_clear_map(dev->domain, iotlb);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config_size	= vduse_vdpa_get_config_size,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.get_generation		= vduse_vdpa_get_generation,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +				     unsigned long offset, size_t size,
> +				     enum dma_data_direction dir,
> +				     unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +}
> +
> +static size_t vduse_dev_max_mapping_size(struct device *dev)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return domain->bounce_size;
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +	.max_mapping_size = vduse_dev_max_mapping_size,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +	u32 index;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	index = array_index_nospec(eventfd->index, dev->vq_num);
> +	vq = &dev->vqs[index];
> +	if (eventfd->fd >= 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> +		return 0;
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	if (vq->ready && vq->kicked && vq->kickfd) {
> +		eventfd_signal(vq->kickfd, 1);
> +		vq->kicked = false;
> +	}
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static void vduse_dev_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> +
> +	spin_lock_irq(&dev->irq_lock);
> +	if (dev->config_cb.callback)
> +		dev->config_cb.callback(dev->config_cb.private);
> +	spin_unlock_irq(&dev->irq_lock);
> +}
> +
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct vduse_iova_domain *domain = dev->domain;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (entry.start > entry.last)
> +			break;
> +
> +		spin_lock(&domain->iotlb_lock);
> +		map = vhost_iotlb_itree_first(domain->iotlb,
> +					      entry.start, entry.last);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&domain->iotlb_lock);
> +		ret = -EINVAL;
> +		if (!f)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			break;
> +		}
> +		ret = receive_fd(f, perm_to_file_flags(entry.perm));
> +		fput(f);
> +		break;
> +	}
> +	case VDUSE_DEV_GET_FEATURES:
> +		ret = put_user(dev->features, (u64 __user *)argp);
> +		break;
> +	case VDUSE_DEV_UPDATE_CONFIG: {
> +		struct vduse_config_update config;
> +		unsigned long size = offsetof(struct vduse_config_update,
> +					      buffer);
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (config.length == 0 ||
> +		    config.length > dev->config_size - config.offset)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(dev->config + config.offset, argp + size,
> +				   config.length))
> +			break;
> +
> +		ret = 0;
> +		queue_work(vduse_irq_wq, &dev->inject);


I wonder if it's better to separate config interrupt out of config 
update or we need document this.


> +		break;
> +	}
> +	case VDUSE_VQ_GET_INFO: {


Do we need to limit this only when DRIVER_OK is set?


> +		struct vduse_vq_info vq_info;
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_info.index >= dev->vq_num)
> +			break;
> +
> +		vq_index = array_index_nospec(vq_info.index, dev->vq_num);
> +		vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
> +		vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
> +		vq_info.device_addr = dev->vqs[vq_index].device_addr;
> +		vq_info.num = dev->vqs[vq_index].num;
> +		vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
> +		vq_info.ready = dev->vqs[vq_index].ready;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
> +			break;
> +
> +		ret = 0;
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_VQ_INJECT_IRQ: {
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (get_user(vq_index, (u32 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_index >= dev->vq_num)
> +			break;
> +
> +		ret = 0;
> +		vq_index = array_index_nospec(vq_index, dev->vq_num);
> +		queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
> +		break;
> +	}
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +
> +	spin_lock(&dev->msg_lock);
> +	/* Make sure the inflight messages can processed after reconncection */
> +	list_splice_init(&dev->recv_list, &dev->send_list);
> +	spin_unlock(&dev->msg_lock);
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
> +{
> +	struct vduse_dev *dev;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = idr_find(&vduse_idr, minor);
> +	mutex_unlock(&vduse_lock);
> +
> +	return dev;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	int ret;
> +	struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	ret = -EBUSY;
> +	mutex_lock(&dev->lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&dev->lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	mutex_init(&dev->lock);
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	spin_lock_init(&dev->irq_lock);
> +
> +	INIT_WORK(&dev->inject, vduse_dev_irq_inject);
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int id;
> +
> +	idr_for_each_entry(&vduse_idr, dev, id)
> +		if (!strcmp(dev->name, name))
> +			return dev;
> +
> +	return NULL;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->lock);
> +	if (dev->vdev || dev->connected) {
> +		mutex_unlock(&dev->lock);
> +		return -EBUSY;
> +	}
> +	dev->connected = true;
> +	mutex_unlock(&dev->lock);
> +
> +	vduse_dev_msg_cleanup(dev);
> +	device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> +	idr_remove(&vduse_idr, dev->minor);
> +	kvfree(dev->config);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	kfree(dev->name);
> +	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
> +
> +	return 0;
> +}
> +
> +static bool device_is_allowed(u32 device_id)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
> +		if (allowed_device_id[i] == device_id)
> +			return true;
> +
> +	return false;
> +}
> +
> +static bool features_is_valid(u64 features)
> +{
> +	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +		return false;
> +
> +	/* Now we only support read-only configuration space */
> +	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{
> +	if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
> +		return false;
> +
> +	if (config->vq_align > PAGE_SIZE)
> +		return false;
> +
> +	if (config->config_size > PAGE_SIZE)
> +		return false;
> +
> +	if (!device_is_allowed(config->device_id))
> +		return false;
> +
> +	if (!features_is_valid(config->features))
> +		return false;


Do we need to validate whether or not config_size is too small otherwise 
we may have OOB access in get_config()?


> +
> +	return true;
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config,
> +			    void *config_buf, u64 api_version)
> +{
> +	int i, ret;
> +	struct vduse_dev *dev;
> +
> +	ret = -EEXIST;
> +	if (vduse_find_dev(config->name))
> +		goto err;
> +
> +	ret = -ENOMEM;
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		goto err;
> +
> +	dev->api_version = api_version;
> +	dev->user_features = config->features;
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->name = kstrdup(config->name, GFP_KERNEL);
> +	if (!dev->name)
> +		goto err_str;
> +
> +	dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> +					  config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->config = config_buf;
> +	dev->config_size = config->config_size;
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_idr;
> +
> +	dev->minor = ret;
> +	dev->dev = device_create(vduse_class, NULL,
> +				 MKDEV(MAJOR(vduse_major), dev->minor),
> +				 NULL, "%s", config->name);
> +	if (IS_ERR(dev->dev)) {
> +		ret = PTR_ERR(dev->dev);
> +		goto err_dev;
> +	}
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_dev:
> +	idr_remove(&vduse_idr, dev->minor);
> +err_idr:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	kfree(dev->name);
> +err_str:
> +	vduse_dev_destroy(dev);
> +err:
> +	kvfree(config_buf);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +	struct vduse_control *control = file->private_data;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_GET_API_VERSION:
> +		ret = put_user(control->api_version, (u64 __user *)argp);
> +		break;
> +	case VDUSE_SET_API_VERSION: {
> +		u64 api_version;
> +
> +		ret = -EFAULT;
> +		if (get_user(api_version, (u64 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (api_version > VDUSE_API_VERSION)
> +			break;
> +
> +		ret = 0;
> +		control->api_version = api_version;
> +		break;
> +	}
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +		unsigned long size = offsetof(struct vduse_dev_config, config);
> +		void *buf;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vduse_validate_config(&config) == false)
> +			break;
> +
> +		buf = vmemdup_user(argp + size, config.config_size);
> +		if (IS_ERR(buf)) {
> +			ret = PTR_ERR(buf);
> +			break;
> +		}
> +		ret = vduse_create_dev(&config, buf, control->api_version);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static int vduse_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control = file->private_data;
> +
> +	kfree(control);
> +	return 0;
> +}
> +
> +static int vduse_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control;
> +
> +	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> +	if (!control)
> +		return -ENOMEM;
> +
> +	control->api_version = VDUSE_API_VERSION;
> +	file->private_data = control;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_ctrl_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_open,
> +	.release	= vduse_release,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev;
> +	int ret;
> +
> +	if (dev->vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> +				 &vduse_vdpa_config_ops, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	dev->vdev = vdev;
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret) {
> +		put_device(&vdev->vdpa.dev);
> +		return ret;
> +	}
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	return 0;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev) {
> +		mutex_unlock(&vduse_lock);
> +		return -EINVAL;
> +	}
> +	ret = vduse_dev_init_vdpa(dev, name);
> +	mutex_unlock(&vduse_lock);
> +	if (ret)
> +		return ret;
> +
> +	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> +	if (ret) {
> +		put_device(&dev->vdev->vdpa.dev);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +	struct device *dev;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class))
> +		return PTR_ERR(vduse_class);
> +
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev_region;
> +
> +	/* /dev/vduse/control */
> +	cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
> +	vduse_ctrl_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
> +	if (ret)
> +		goto err_ctrl_cdev;
> +
> +	dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto err_device;
> +	}
> +
> +	/* /dev/vduse/$DEVICE */
> +	cdev_init(&vduse_cdev, &vduse_dev_fops);
> +	vduse_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
> +		       VDUSE_DEV_MAX - 1);
> +	if (ret)
> +		goto err_cdev;
> +
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;
> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	destroy_workqueue(vduse_irq_wq);
> +err_wq:
> +	cdev_del(&vduse_cdev);
> +err_cdev:
> +	device_destroy(vduse_class, vduse_major);
> +err_device:
> +	cdev_del(&vduse_ctrl_cdev);
> +err_ctrl_cdev:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev_region:
> +	class_destroy(vduse_class);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	vduse_mgmtdev_exit();
> +	vduse_domain_exit();
> +	destroy_workqueue(vduse_irq_wq);
> +	cdev_del(&vduse_cdev);
> +	device_destroy(vduse_class, vduse_major);
> +	cdev_del(&vduse_ctrl_cdev);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	class_destroy(vduse_class);
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_API_VERSION	0
> +
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	/* Get the state for virtqueue from userspace */
> +	VDUSE_GET_VQ_STATE,
> +	/* Notify userspace to start the dataplane, no reply */
> +	VDUSE_START_DATAPLANE,
> +	/* Notify userspace to stop the dataplane, no reply */
> +	VDUSE_STOP_DATAPLANE,
> +	/* Notify userspace to update the memory mapping in device IOTLB */
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};


This needs some tweaks to support packed virtqueue.


> +
> +struct vduse_iova_range {
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* end of the IOVA range */
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY	(1 << 0) /* No need to reply */
> +	__u32 flags; /* request flags */
> +	__u32 reserved; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK	0x00
> +#define VDUSE_REQ_RESULT_FAILED	0x01
> +	__u32 result; /* the result of request */
> +	__u32 reserved[2]; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 features; /* device features */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u16 padding; /* padding */
> +	__u32 vq_num; /* the number of virtqueues */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +	__u32 config_size; /* the size of the configuration space */
> +	__u32 reserved[15]; /* for future use */
> +	__u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> +	__u32 offset; /* offset from the beginning of configuration space */
> +	__u32 length; /* the length to write to configuration space */
> +	__u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +	__u64 desc_addr; /* address of desc area */
> +	__u64 driver_addr; /* address of driver area */
> +	__u64 device_addr; /* address of device area */
> +	__u32 num; /* the size of virtqueue */
> +	__u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION	_IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION	_IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> +
> +/*
> + * Get a file descriptor for the first overlapped iova region,
> + * -EINVAL means the iova region doesn't exist.
> + */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> +
> +/* Get the negotiated features */
> +#define VDUSE_DEV_GET_FEATURES	_IOR(VDUSE_BASE, 0x05, __u64)
> +
> +/* Update the configuration space */
> +#define VDUSE_DEV_UPDATE_CONFIG	_IOW(VDUSE_BASE, 0x06, struct vduse_config_update)
> +
> +/* Get the specified virtqueue's information */
> +#define VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x07, struct vduse_vq_info)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x08, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x09, __u32)
> +
> +#endif /* _UAPI_VDUSE_H_ */


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-21  9:13     ` Jason Wang
@ 2021-06-21 10:41       ` Yongji Xie
  -1 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-21 10:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, Al Viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel

On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/15 下午10:13, Xie Yongji 写道:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > The vDPA device's control path is handled in kernel and the data
> > path is handled in userspace.
> >
> > A message mechnism is used by VDUSE driver to forward some control
> > messages such as starting/stopping datapath to userspace. Userspace
> > can use read()/write() to receive/reply those control messages.
> >
> > And some ioctls are introduced to help userspace to implement the
> > data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> > descriptors referring to vDPA device's iova regions. Then userspace
> > can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> > and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> > and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> > ioctls can be used to inject interrupt and setup the kickfd for
> > virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> > configuration space and inject a config interrupt.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >   drivers/vdpa/Kconfig                               |   10 +
> >   drivers/vdpa/Makefile                              |    1 +
> >   drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
> >   include/uapi/linux/vduse.h                         |  143 ++
> >   6 files changed, 1613 insertions(+)
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 9bfc2b510c64..acd95e9dcfe7 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >   '|'   00-7F  linux/media.h
> >   0x80  00-1F  linux/fb.h
> > +0x81  00-1F  linux/vduse.h
> >   0x89  00-06  arch/x86/include/asm/sockios.h
> >   0x89  0B-DF  linux/sockios.h
> >   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> > diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> > index a503c1b2bfd9..6e23bce6433a 100644
> > --- a/drivers/vdpa/Kconfig
> > +++ b/drivers/vdpa/Kconfig
> > @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
> >         vDPA block device simulator which terminates IO request in a
> >         memory buffer.
> >
> > +config VDPA_USER
> > +     tristate "VDUSE (vDPA Device in Userspace) support"
> > +     depends on EVENTFD && MMU && HAS_DMA
> > +     select DMA_OPS
> > +     select VHOST_IOTLB
> > +     select IOMMU_IOVA
> > +     help
> > +       With VDUSE it is possible to emulate a vDPA Device
> > +       in a userspace program.
> > +
> >   config IFCVF
> >       tristate "Intel IFC VF vDPA driver"
> >       depends on PCI_MSI
> > diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> > index 67fe7f3d6943..f02ebed33f19 100644
> > --- a/drivers/vdpa/Makefile
> > +++ b/drivers/vdpa/Makefile
> > @@ -1,6 +1,7 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_VDPA) += vdpa.o
> >   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> > +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >   obj-$(CONFIG_IFCVF)    += ifcvf/
> >   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> > diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> > new file mode 100644
> > index 000000000000..260e0b26af99
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +vduse-y := vduse_dev.o iova_domain.o
> > +
> > +obj-$(CONFIG_VDPA_USER) += vduse.o
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > new file mode 100644
> > index 000000000000..5271cbd15e28
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -0,0 +1,1453 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/cdev.h>
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/slab.h>
> > +#include <linux/wait.h>
> > +#include <linux/dma-map-ops.h>
> > +#include <linux/poll.h>
> > +#include <linux/file.h>
> > +#include <linux/uio.h>
> > +#include <linux/vdpa.h>
> > +#include <linux/nospec.h>
> > +#include <uapi/linux/vduse.h>
> > +#include <uapi/linux/vdpa.h>
> > +#include <uapi/linux/virtio_config.h>
> > +#include <uapi/linux/virtio_ids.h>
> > +#include <uapi/linux/virtio_blk.h>
> > +#include <linux/mod_devicetable.h>
> > +
> > +#include "iova_domain.h"
> > +
> > +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> > +#define DRV_DESC     "vDPA Device in Userspace"
> > +#define DRV_LICENSE  "GPL v2"
> > +
> > +#define VDUSE_DEV_MAX (1U << MINORBITS)
> > +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> > +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> > +#define VDUSE_REQUEST_TIMEOUT 30
> > +
> > +struct vduse_virtqueue {
> > +     u16 index;
> > +     u32 num;
> > +     u32 avail_idx;
> > +     u64 desc_addr;
> > +     u64 driver_addr;
> > +     u64 device_addr;
> > +     bool ready;
> > +     bool kicked;
> > +     spinlock_t kick_lock;
> > +     spinlock_t irq_lock;
> > +     struct eventfd_ctx *kickfd;
> > +     struct vdpa_callback cb;
> > +     struct work_struct inject;
> > +};
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_vdpa {
> > +     struct vdpa_device vdpa;
> > +     struct vduse_dev *dev;
> > +};
> > +
> > +struct vduse_dev {
> > +     struct vduse_vdpa *vdev;
> > +     struct device *dev;
> > +     struct vduse_virtqueue *vqs;
> > +     struct vduse_iova_domain *domain;
> > +     char *name;
> > +     struct mutex lock;
> > +     spinlock_t msg_lock;
> > +     u64 msg_unique;
> > +     wait_queue_head_t waitq;
> > +     struct list_head send_list;
> > +     struct list_head recv_list;
> > +     struct vdpa_callback config_cb;
> > +     struct work_struct inject;
> > +     spinlock_t irq_lock;
> > +     int minor;
> > +     bool connected;
> > +     bool started;
> > +     u64 api_version;
> > +     u64 user_features;
>
>
> Let's use device_features.
>

OK.

>
> > +     u64 features;
>
>
> And driver features.
>

OK.

>
> > +     u32 device_id;
> > +     u32 vendor_id;
> > +     u32 generation;
> > +     u32 config_size;
> > +     void *config;
> > +     u8 status;
> > +     u16 vq_size_max;
> > +     u32 vq_num;
> > +     u32 vq_align;
> > +};
> > +
> > +struct vduse_dev_msg {
> > +     struct vduse_dev_request req;
> > +     struct vduse_dev_response resp;
> > +     struct list_head list;
> > +     wait_queue_head_t waitq;
> > +     bool completed;
> > +};
> > +
> > +struct vduse_control {
> > +     u64 api_version;
> > +};
> > +
> > +static DEFINE_MUTEX(vduse_lock);
> > +static DEFINE_IDR(vduse_idr);
> > +
> > +static dev_t vduse_major;
> > +static struct class *vduse_class;
> > +static struct cdev vduse_ctrl_cdev;
> > +static struct cdev vduse_cdev;
> > +static struct workqueue_struct *vduse_irq_wq;
> > +
> > +static u32 allowed_device_id[] = {
> > +     VIRTIO_ID_BLOCK,
> > +};
> > +
> > +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> > +
> > +     return vdev->dev;
> > +}
> > +
> > +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> > +{
> > +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> > +
> > +     return vdpa_to_vduse(vdpa);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> > +                                         uint32_t request_id)
> > +{
> > +     struct vduse_dev_msg *msg;
> > +
> > +     list_for_each_entry(msg, head, list) {
> > +             if (msg->req.request_id == request_id) {
> > +                     list_del(&msg->list);
> > +                     return msg;
> > +             }
> > +     }
> > +
> > +     return NULL;
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> > +{
> > +     struct vduse_dev_msg *msg = NULL;
> > +
> > +     if (!list_empty(head)) {
> > +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> > +             list_del(&msg->list);
> > +     }
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_enqueue_msg(struct list_head *head,
> > +                           struct vduse_dev_msg *msg)
> > +{
> > +     list_add_tail(&msg->list, head);
> > +}
> > +
> > +static int vduse_dev_msg_send(struct vduse_dev *dev,
> > +                           struct vduse_dev_msg *msg, bool no_reply)
> > +{
>
>
> It looks to me the only user for no_reply=true is the dataplane start. I
> wonder no_reply is really needed consider we have switched to use
> wait_event_killable_timeout().
>

Do we need to handle the error in this case if we remove the no_reply
flag. Print a warning message?

> In another way, no_reply is false for vq state synchronization and IOTLB
> updating. I wonder if we can simply use no_reply = true for them.
>

Looks like we can't, e.g. we need to get a reply from userspace for vq state.

>
> > +     init_waitqueue_head(&msg->waitq);
> > +     spin_lock(&dev->msg_lock);
> > +     msg->req.request_id = dev->msg_unique++;
> > +     vduse_enqueue_msg(&dev->send_list, msg);
> > +     wake_up(&dev->waitq);
> > +     spin_unlock(&dev->msg_lock);
> > +     if (no_reply)
> > +             return 0;
> > +
> > +     wait_event_killable_timeout(msg->waitq, msg->completed,
> > +                                 VDUSE_REQUEST_TIMEOUT * HZ);
> > +     spin_lock(&dev->msg_lock);
> > +     if (!msg->completed) {
> > +             list_del(&msg->list);
> > +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
>
>
> Do we need to serialize the check by protecting it with the spinlock above?
>

Good point.

>
> > +}
> > +
> > +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> > +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> > +                     kfree(msg);
> > +             else
> > +                     vduse_enqueue_msg(&dev->recv_list, msg);
> > +     }
> > +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> > +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> > +             msg->completed = 1;
> > +             wake_up(&msg->waitq);
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +}
> > +
> > +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> > +                                         GFP_KERNEL | __GFP_NOFAIL);
> > +
> > +     msg->req.type = VDUSE_START_DATAPLANE;
> > +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> > +     vduse_dev_msg_send(dev, msg, true);
> > +}
> > +
> > +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> > +                                         GFP_KERNEL | __GFP_NOFAIL);
> > +
> > +     msg->req.type = VDUSE_STOP_DATAPLANE;
> > +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>
>
> Can we simply use this flag instead of introducing a new parameter
> (no_reply) in vduse_dev_msg_send()?
>

Looks good to me.

>
> > +     vduse_dev_msg_send(dev, msg, true);
> > +}
> > +
> > +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> > +                               struct vduse_virtqueue *vq,
> > +                               struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     int ret;
>
>
> Note that I post a series that implement the packed virtqueue support:
>
> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
>
> So this patch needs to be updated as well.
>

Will do it.

>
> > +
> > +     msg.req.type = VDUSE_GET_VQ_STATE;
> > +     msg.req.vq_state.index = vq->index;
> > +
> > +     ret = vduse_dev_msg_send(dev, &msg, false);
> > +     if (ret)
> > +             return ret;
> > +
> > +     state->avail_index = msg.resp.vq_state.avail_idx;
> > +     return 0;
> > +}
> > +
> > +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > +                             u64 start, u64 last)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     if (last < start)
> > +             return -EINVAL;
> > +
> > +     msg.req.type = VDUSE_UPDATE_IOTLB;
> > +     msg.req.iova.start = start;
> > +     msg.req.iova.last = last;
> > +
> > +     return vduse_dev_msg_send(dev, &msg, false);
> > +}
> > +
> > +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int size = sizeof(struct vduse_dev_request);
> > +     ssize_t ret;
> > +
> > +     if (iov_iter_count(to) < size)
> > +             return -EINVAL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     while (1) {
> > +             msg = vduse_dequeue_msg(&dev->send_list);
> > +             if (msg)
> > +                     break;
> > +
> > +             ret = -EAGAIN;
> > +             if (file->f_flags & O_NONBLOCK)
> > +                     goto unlock;
> > +
> > +             spin_unlock(&dev->msg_lock);
> > +             ret = wait_event_interruptible_exclusive(dev->waitq,
> > +                                     !list_empty(&dev->send_list));
> > +             if (ret)
> > +                     return ret;
> > +
> > +             spin_lock(&dev->msg_lock);
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +     ret = copy_to_iter(&msg->req, size, to);
> > +     spin_lock(&dev->msg_lock);
> > +     if (ret != size) {
> > +             ret = -EFAULT;
> > +             vduse_enqueue_msg(&dev->send_list, msg);
> > +             goto unlock;
> > +     }
> > +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> > +             kfree(msg);
> > +     else
> > +             vduse_enqueue_msg(&dev->recv_list, msg);
> > +unlock:
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_response resp;
> > +     struct vduse_dev_msg *msg;
> > +     size_t ret;
> > +
> > +     ret = copy_from_iter(&resp, sizeof(resp), from);
> > +     if (ret != sizeof(resp))
> > +             return -EINVAL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> > +     if (!msg) {
> > +             ret = -ENOENT;
> > +             goto unlock;
> > +     }
> > +
> > +     memcpy(&msg->resp, &resp, sizeof(resp));
> > +     msg->completed = 1;
> > +     wake_up(&msg->waitq);
> > +unlock:
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     __poll_t mask = 0;
> > +
> > +     poll_wait(file, &dev->waitq, wait);
> > +
> > +     if (!list_empty(&dev->send_list))
> > +             mask |= EPOLLIN | EPOLLRDNORM;
> > +     if (!list_empty(&dev->recv_list))
> > +             mask |= EPOLLOUT | EPOLLWRNORM;
> > +
> > +     return mask;
> > +}
> > +
> > +static void vduse_dev_reset(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +     struct vduse_iova_domain *domain = dev->domain;
> > +
> > +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > +     if (domain->bounce_map)
> > +             vduse_domain_reset_bounce_map(domain);
> > +
> > +     dev->features = 0;
> > +     dev->generation++;
> > +     spin_lock(&dev->irq_lock);
> > +     dev->config_cb.callback = NULL;
> > +     dev->config_cb.private = NULL;
> > +     spin_unlock(&dev->irq_lock);
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             vq->ready = false;
> > +             vq->desc_addr = 0;
> > +             vq->driver_addr = 0;
> > +             vq->device_addr = 0;
> > +             vq->avail_idx = 0;
> > +             vq->num = 0;
> > +
> > +             spin_lock(&vq->kick_lock);
> > +             vq->kicked = false;
> > +             if (vq->kickfd)
> > +                     eventfd_ctx_put(vq->kickfd);
> > +             vq->kickfd = NULL;
> > +             spin_unlock(&vq->kick_lock);
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             vq->cb.callback = NULL;
> > +             vq->cb.private = NULL;
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +}
> > +
> > +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > +                             u64 desc_area, u64 driver_area,
> > +                             u64 device_area)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->desc_addr = desc_area;
> > +     vq->driver_addr = driver_area;
> > +     vq->device_addr = device_area;
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (!vq->ready)
> > +             goto unlock;
> > +
> > +     if (vq->kickfd)
> > +             eventfd_signal(vq->kickfd, 1);
> > +     else
> > +             vq->kicked = true;
> > +unlock:
> > +     spin_unlock(&vq->kick_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> > +                           struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->irq_lock);
> > +     vq->cb.callback = cb->callback;
> > +     vq->cb.private = cb->private;
> > +     spin_unlock(&vq->irq_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->num = num;
> > +}
> > +
> > +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> > +                                     u16 idx, bool ready)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->ready = ready;
> > +}
> > +
> > +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vq->ready;
> > +}
> > +
> > +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->avail_idx = state->avail_index;
> > +     return 0;
> > +}
> > +
> > +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_get_vq_state(dev, vq, state);
> > +}
> > +
> > +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_align;
> > +}
> > +
> > +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->user_features;
> > +}
> > +
> > +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     dev->features = features;
> > +     return 0;
> > +}
> > +
> > +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> > +                               struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     spin_lock(&dev->irq_lock);
> > +     dev->config_cb.callback = cb->callback;
> > +     dev->config_cb.private = cb->private;
> > +     spin_unlock(&dev->irq_lock);
> > +}
> > +
> > +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_size_max;
> > +}
> > +
> > +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->device_id;
> > +}
> > +
> > +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vendor_id;
> > +}
> > +
> > +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->status;
> > +}
> > +
> > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> > +
> > +     dev->status = status;
> > +
> > +     if (dev->started == started)
> > +             return;
>
>
> If we check dev->status == status, (or only check the DRIVER_OK bit)
> then there's no need to introduce an extra dev->started.
>

Will do it.

>
> > +
> > +     dev->started = started;
> > +     if (dev->started) {
> > +             vduse_dev_start_dataplane(dev);
> > +     } else {
> > +             vduse_dev_reset(dev);
> > +             vduse_dev_stop_dataplane(dev);
>
>
> I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA,
> we have bouncing buffers so it's harmless if usersapce dataplane keeps
> performing read/write. For vhost-vDPA we don't have such stuffs.
>

OK. So it still needs to be synchronized here. If so, how to handle
the error? Looks like printing a warning message should be enough.

>
> > +     }
> > +}
> > +
> > +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->config_size;
> > +}
> > +
> > +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                               void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     memcpy(buf, dev->config + offset, len);
> > +}
> > +
> > +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                     const void *buf, unsigned int len)
> > +{
> > +     /* Now we only support read-only configuration space */
> > +}
> > +
> > +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->generation;
> > +}
> > +
> > +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > +                             struct vhost_iotlb *iotlb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     int ret;
> > +
> > +     ret = vduse_domain_set_map(dev->domain, iotlb);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > +     if (ret) {
> > +             vduse_domain_clear_map(dev->domain, iotlb);
> > +             return ret;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     dev->vdev = NULL;
> > +}
> > +
> > +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > +     .set_vq_address         = vduse_vdpa_set_vq_address,
> > +     .kick_vq                = vduse_vdpa_kick_vq,
> > +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> > +     .set_vq_num             = vduse_vdpa_set_vq_num,
> > +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> > +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> > +     .set_vq_state           = vduse_vdpa_set_vq_state,
> > +     .get_vq_state           = vduse_vdpa_get_vq_state,
> > +     .get_vq_align           = vduse_vdpa_get_vq_align,
> > +     .get_features           = vduse_vdpa_get_features,
> > +     .set_features           = vduse_vdpa_set_features,
> > +     .set_config_cb          = vduse_vdpa_set_config_cb,
> > +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> > +     .get_device_id          = vduse_vdpa_get_device_id,
> > +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> > +     .get_status             = vduse_vdpa_get_status,
> > +     .set_status             = vduse_vdpa_set_status,
> > +     .get_config_size        = vduse_vdpa_get_config_size,
> > +     .get_config             = vduse_vdpa_get_config,
> > +     .set_config             = vduse_vdpa_set_config,
> > +     .get_generation         = vduse_vdpa_get_generation,
> > +     .set_map                = vduse_vdpa_set_map,
> > +     .free                   = vduse_vdpa_free,
> > +};
> > +
> > +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > +                                  unsigned long offset, size_t size,
> > +                                  enum dma_data_direction dir,
> > +                                  unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > +}
> > +
> > +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > +}
> > +
> > +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > +                                     dma_addr_t *dma_addr, gfp_t flag,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova;
> > +     void *addr;
> > +
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     addr = vduse_domain_alloc_coherent(domain, size,
> > +                             (dma_addr_t *)&iova, flag, attrs);
> > +     if (!addr)
> > +             return NULL;
> > +
> > +     *dma_addr = (dma_addr_t)iova;
> > +
> > +     return addr;
> > +}
> > +
> > +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > +                                     void *vaddr, dma_addr_t dma_addr,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > +}
> > +
> > +static size_t vduse_dev_max_mapping_size(struct device *dev)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return domain->bounce_size;
> > +}
> > +
> > +static const struct dma_map_ops vduse_dev_dma_ops = {
> > +     .map_page = vduse_dev_map_page,
> > +     .unmap_page = vduse_dev_unmap_page,
> > +     .alloc = vduse_dev_alloc_coherent,
> > +     .free = vduse_dev_free_coherent,
> > +     .max_mapping_size = vduse_dev_max_mapping_size,
> > +};
> > +
> > +static unsigned int perm_to_file_flags(u8 perm)
> > +{
> > +     unsigned int flags = 0;
> > +
> > +     switch (perm) {
> > +     case VDUSE_ACCESS_WO:
> > +             flags |= O_WRONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RO:
> > +             flags |= O_RDONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RW:
> > +             flags |= O_RDWR;
> > +             break;
> > +     default:
> > +             WARN(1, "invalidate vhost IOTLB permission\n");
> > +             break;
> > +     }
> > +
> > +     return flags;
> > +}
> > +
> > +static int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct eventfd_ctx *ctx = NULL;
> > +     struct vduse_virtqueue *vq;
> > +     u32 index;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     index = array_index_nospec(eventfd->index, dev->vq_num);
> > +     vq = &dev->vqs[index];
> > +     if (eventfd->fd >= 0) {
> > +             ctx = eventfd_ctx_fdget(eventfd->fd);
> > +             if (IS_ERR(ctx))
> > +                     return PTR_ERR(ctx);
> > +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> > +             return 0;
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->kickfd)
> > +             eventfd_ctx_put(vq->kickfd);
> > +     vq->kickfd = ctx;
> > +     if (vq->ready && vq->kicked && vq->kickfd) {
> > +             eventfd_signal(vq->kickfd, 1);
> > +             vq->kicked = false;
> > +     }
> > +     spin_unlock(&vq->kick_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_dev_irq_inject(struct work_struct *work)
> > +{
> > +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> > +
> > +     spin_lock_irq(&dev->irq_lock);
> > +     if (dev->config_cb.callback)
> > +             dev->config_cb.callback(dev->config_cb.private);
> > +     spin_unlock_irq(&dev->irq_lock);
> > +}
> > +
> > +static void vduse_vq_irq_inject(struct work_struct *work)
> > +{
> > +     struct vduse_virtqueue *vq = container_of(work,
> > +                                     struct vduse_virtqueue, inject);
> > +
> > +     spin_lock_irq(&vq->irq_lock);
> > +     if (vq->ready && vq->cb.callback)
> > +             vq->cb.callback(vq->cb.private);
> > +     spin_unlock_irq(&vq->irq_lock);
> > +}
> > +
> > +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > +                         unsigned long arg)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     void __user *argp = (void __user *)arg;
> > +     int ret;
> > +
> > +     switch (cmd) {
> > +     case VDUSE_IOTLB_GET_FD: {
> > +             struct vduse_iotlb_entry entry;
> > +             struct vhost_iotlb_map *map;
> > +             struct vdpa_map_file *map_file;
> > +             struct vduse_iova_domain *domain = dev->domain;
> > +             struct file *f = NULL;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&entry, argp, sizeof(entry)))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (entry.start > entry.last)
> > +                     break;
> > +
> > +             spin_lock(&domain->iotlb_lock);
> > +             map = vhost_iotlb_itree_first(domain->iotlb,
> > +                                           entry.start, entry.last);
> > +             if (map) {
> > +                     map_file = (struct vdpa_map_file *)map->opaque;
> > +                     f = get_file(map_file->file);
> > +                     entry.offset = map_file->offset;
> > +                     entry.start = map->start;
> > +                     entry.last = map->last;
> > +                     entry.perm = map->perm;
> > +             }
> > +             spin_unlock(&domain->iotlb_lock);
> > +             ret = -EINVAL;
> > +             if (!f)
> > +                     break;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> > +                     fput(f);
> > +                     break;
> > +             }
> > +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
> > +             fput(f);
> > +             break;
> > +     }
> > +     case VDUSE_DEV_GET_FEATURES:
> > +             ret = put_user(dev->features, (u64 __user *)argp);
> > +             break;
> > +     case VDUSE_DEV_UPDATE_CONFIG: {
> > +             struct vduse_config_update config;
> > +             unsigned long size = offsetof(struct vduse_config_update,
> > +                                           buffer);
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, size))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (config.length == 0 ||
> > +                 config.length > dev->config_size - config.offset)
> > +                     break;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(dev->config + config.offset, argp + size,
> > +                                config.length))
> > +                     break;
> > +
> > +             ret = 0;
> > +             queue_work(vduse_irq_wq, &dev->inject);
>
>
> I wonder if it's better to separate config interrupt out of config
> update or we need document this.
>

I have documented it in the docs. Looks like a config update should be
always followed by a config interrupt. I didn't find a case that uses
them separately.

>
> > +             break;
> > +     }
> > +     case VDUSE_VQ_GET_INFO: {
>
>
> Do we need to limit this only when DRIVER_OK is set?
>

Any reason to add this limitation?

>
> > +             struct vduse_vq_info vq_info;
> > +             u32 vq_index;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (vq_info.index >= dev->vq_num)
> > +                     break;
> > +
> > +             vq_index = array_index_nospec(vq_info.index, dev->vq_num);
> > +             vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
> > +             vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
> > +             vq_info.device_addr = dev->vqs[vq_index].device_addr;
> > +             vq_info.num = dev->vqs[vq_index].num;
> > +             vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
> > +             vq_info.ready = dev->vqs[vq_index].ready;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
> > +                     break;
> > +
> > +             ret = 0;
> > +             break;
> > +     }
> > +     case VDUSE_VQ_SETUP_KICKFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_kickfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     case VDUSE_VQ_INJECT_IRQ: {
> > +             u32 vq_index;
> > +
> > +             ret = -EFAULT;
> > +             if (get_user(vq_index, (u32 __user *)argp))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (vq_index >= dev->vq_num)
> > +                     break;
> > +
> > +             ret = 0;
> > +             vq_index = array_index_nospec(vq_index, dev->vq_num);
> > +             queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
> > +             break;
> > +     }
> > +     default:
> > +             ret = -ENOIOCTLCMD;
> > +             break;
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     /* Make sure the inflight messages can processed after reconncection */
> > +     list_splice_init(&dev->recv_list, &dev->send_list);
> > +     spin_unlock(&dev->msg_lock);
> > +     dev->connected = false;
> > +
> > +     return 0;
> > +}
> > +
> > +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
> > +{
> > +     struct vduse_dev *dev;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     dev = idr_find(&vduse_idr, minor);
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return dev;
> > +}
> > +
> > +static int vduse_dev_open(struct inode *inode, struct file *file)
> > +{
> > +     int ret;
> > +     struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
> > +
> > +     if (!dev)
> > +             return -ENODEV;
> > +
> > +     ret = -EBUSY;
> > +     mutex_lock(&dev->lock);
> > +     if (dev->connected)
> > +             goto unlock;
> > +
> > +     ret = 0;
> > +     dev->connected = true;
> > +     file->private_data = dev;
> > +unlock:
> > +     mutex_unlock(&dev->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct file_operations vduse_dev_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = vduse_dev_open,
> > +     .release        = vduse_dev_release,
> > +     .read_iter      = vduse_dev_read_iter,
> > +     .write_iter     = vduse_dev_write_iter,
> > +     .poll           = vduse_dev_poll,
> > +     .unlocked_ioctl = vduse_dev_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static struct vduse_dev *vduse_dev_create(void)
> > +{
> > +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +
> > +     if (!dev)
> > +             return NULL;
> > +
> > +     mutex_init(&dev->lock);
> > +     spin_lock_init(&dev->msg_lock);
> > +     INIT_LIST_HEAD(&dev->send_list);
> > +     INIT_LIST_HEAD(&dev->recv_list);
> > +     spin_lock_init(&dev->irq_lock);
> > +
> > +     INIT_WORK(&dev->inject, vduse_dev_irq_inject);
> > +     init_waitqueue_head(&dev->waitq);
> > +
> > +     return dev;
> > +}
> > +
> > +static void vduse_dev_destroy(struct vduse_dev *dev)
> > +{
> > +     kfree(dev);
> > +}
> > +
> > +static struct vduse_dev *vduse_find_dev(const char *name)
> > +{
> > +     struct vduse_dev *dev;
> > +     int id;
> > +
> > +     idr_for_each_entry(&vduse_idr, dev, id)
> > +             if (!strcmp(dev->name, name))
> > +                     return dev;
> > +
> > +     return NULL;
> > +}
> > +
> > +static int vduse_destroy_dev(char *name)
> > +{
> > +     struct vduse_dev *dev = vduse_find_dev(name);
> > +
> > +     if (!dev)
> > +             return -EINVAL;
> > +
> > +     mutex_lock(&dev->lock);
> > +     if (dev->vdev || dev->connected) {
> > +             mutex_unlock(&dev->lock);
> > +             return -EBUSY;
> > +     }
> > +     dev->connected = true;
> > +     mutex_unlock(&dev->lock);
> > +
> > +     vduse_dev_msg_cleanup(dev);
> > +     device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> > +     idr_remove(&vduse_idr, dev->minor);
> > +     kvfree(dev->config);
> > +     kfree(dev->vqs);
> > +     vduse_domain_destroy(dev->domain);
> > +     kfree(dev->name);
> > +     vduse_dev_destroy(dev);
> > +     module_put(THIS_MODULE);
> > +
> > +     return 0;
> > +}
> > +
> > +static bool device_is_allowed(u32 device_id)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
> > +             if (allowed_device_id[i] == device_id)
> > +                     return true;
> > +
> > +     return false;
> > +}
> > +
> > +static bool features_is_valid(u64 features)
> > +{
> > +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> > +             return false;
> > +
> > +     /* Now we only support read-only configuration space */
> > +     if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> > +             return false;
> > +
> > +     return true;
> > +}
> > +
> > +static bool vduse_validate_config(struct vduse_dev_config *config)
> > +{
> > +     if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
> > +             return false;
> > +
> > +     if (config->vq_align > PAGE_SIZE)
> > +             return false;
> > +
> > +     if (config->config_size > PAGE_SIZE)
> > +             return false;
> > +
> > +     if (!device_is_allowed(config->device_id))
> > +             return false;
> > +
> > +     if (!features_is_valid(config->features))
> > +             return false;
>
>
> Do we need to validate whether or not config_size is too small otherwise
> we may have OOB access in get_config()?
>

How about adding validation in get_config()? It seems to be hard to
define the lower bound.

>
> > +
> > +     return true;
> > +}
> > +
> > +static int vduse_create_dev(struct vduse_dev_config *config,
> > +                         void *config_buf, u64 api_version)
> > +{
> > +     int i, ret;
> > +     struct vduse_dev *dev;
> > +
> > +     ret = -EEXIST;
> > +     if (vduse_find_dev(config->name))
> > +             goto err;
> > +
> > +     ret = -ENOMEM;
> > +     dev = vduse_dev_create();
> > +     if (!dev)
> > +             goto err;
> > +
> > +     dev->api_version = api_version;
> > +     dev->user_features = config->features;
> > +     dev->device_id = config->device_id;
> > +     dev->vendor_id = config->vendor_id;
> > +     dev->name = kstrdup(config->name, GFP_KERNEL);
> > +     if (!dev->name)
> > +             goto err_str;
> > +
> > +     dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > +                                       config->bounce_size);
> > +     if (!dev->domain)
> > +             goto err_domain;
> > +
> > +     dev->config = config_buf;
> > +     dev->config_size = config->config_size;
> > +     dev->vq_align = config->vq_align;
> > +     dev->vq_size_max = config->vq_size_max;
> > +     dev->vq_num = config->vq_num;
> > +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> > +     if (!dev->vqs)
> > +             goto err_vqs;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             dev->vqs[i].index = i;
> > +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> > +             spin_lock_init(&dev->vqs[i].kick_lock);
> > +             spin_lock_init(&dev->vqs[i].irq_lock);
> > +     }
> > +
> > +     ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
> > +     if (ret < 0)
> > +             goto err_idr;
> > +
> > +     dev->minor = ret;
> > +     dev->dev = device_create(vduse_class, NULL,
> > +                              MKDEV(MAJOR(vduse_major), dev->minor),
> > +                              NULL, "%s", config->name);
> > +     if (IS_ERR(dev->dev)) {
> > +             ret = PTR_ERR(dev->dev);
> > +             goto err_dev;
> > +     }
> > +     __module_get(THIS_MODULE);
> > +
> > +     return 0;
> > +err_dev:
> > +     idr_remove(&vduse_idr, dev->minor);
> > +err_idr:
> > +     kfree(dev->vqs);
> > +err_vqs:
> > +     vduse_domain_destroy(dev->domain);
> > +err_domain:
> > +     kfree(dev->name);
> > +err_str:
> > +     vduse_dev_destroy(dev);
> > +err:
> > +     kvfree(config_buf);
> > +     return ret;
> > +}
> > +
> > +static long vduse_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     int ret;
> > +     void __user *argp = (void __user *)arg;
> > +     struct vduse_control *control = file->private_data;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     switch (cmd) {
> > +     case VDUSE_GET_API_VERSION:
> > +             ret = put_user(control->api_version, (u64 __user *)argp);
> > +             break;
> > +     case VDUSE_SET_API_VERSION: {
> > +             u64 api_version;
> > +
> > +             ret = -EFAULT;
> > +             if (get_user(api_version, (u64 __user *)argp))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (api_version > VDUSE_API_VERSION)
> > +                     break;
> > +
> > +             ret = 0;
> > +             control->api_version = api_version;
> > +             break;
> > +     }
> > +     case VDUSE_CREATE_DEV: {
> > +             struct vduse_dev_config config;
> > +             unsigned long size = offsetof(struct vduse_dev_config, config);
> > +             void *buf;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, size))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (vduse_validate_config(&config) == false)
> > +                     break;
> > +
> > +             buf = vmemdup_user(argp + size, config.config_size);
> > +             if (IS_ERR(buf)) {
> > +                     ret = PTR_ERR(buf);
> > +                     break;
> > +             }
> > +             ret = vduse_create_dev(&config, buf, control->api_version);
> > +             break;
> > +     }
> > +     case VDUSE_DESTROY_DEV: {
> > +             char name[VDUSE_NAME_MAX];
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> > +                     break;
> > +
> > +             ret = vduse_destroy_dev(name);
> > +             break;
> > +     }
> > +     default:
> > +             ret = -EINVAL;
> > +             break;
> > +     }
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_control *control = file->private_data;
> > +
> > +     kfree(control);
> > +     return 0;
> > +}
> > +
> > +static int vduse_open(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_control *control;
> > +
> > +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> > +     if (!control)
> > +             return -ENOMEM;
> > +
> > +     control->api_version = VDUSE_API_VERSION;
> > +     file->private_data = control;
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations vduse_ctrl_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = vduse_open,
> > +     .release        = vduse_release,
> > +     .unlocked_ioctl = vduse_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static char *vduse_devnode(struct device *dev, umode_t *mode)
> > +{
> > +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> > +}
> > +
> > +static void vduse_mgmtdev_release(struct device *dev)
> > +{
> > +}
> > +
> > +static struct device vduse_mgmtdev = {
> > +     .init_name = "vduse",
> > +     .release = vduse_mgmtdev_release,
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev;
> > +
> > +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> > +{
> > +     struct vduse_vdpa *vdev;
> > +     int ret;
> > +
> > +     if (dev->vdev)
> > +             return -EEXIST;
> > +
> > +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > +                              &vduse_vdpa_config_ops, name, true);
> > +     if (!vdev)
> > +             return -ENOMEM;
> > +
> > +     dev->vdev = vdev;
> > +     vdev->dev = dev;
> > +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> > +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> > +     if (ret) {
> > +             put_device(&vdev->vdpa.dev);
> > +             return ret;
> > +     }
> > +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> > +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> > +     vdev->vdpa.mdev = &mgmt_dev;
> > +
> > +     return 0;
> > +}
> > +
> > +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> > +{
> > +     struct vduse_dev *dev;
> > +     int ret;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     dev = vduse_find_dev(name);
> > +     if (!dev) {
> > +             mutex_unlock(&vduse_lock);
> > +             return -EINVAL;
> > +     }
> > +     ret = vduse_dev_init_vdpa(dev, name);
> > +     mutex_unlock(&vduse_lock);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> > +     if (ret) {
> > +             put_device(&dev->vdev->vdpa.dev);
> > +             return ret;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > +{
> > +     _vdpa_unregister_device(dev);
> > +}
> > +
> > +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> > +     .dev_add = vdpa_dev_add,
> > +     .dev_del = vdpa_dev_del,
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > +     { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
> > +     { 0 },
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev = {
> > +     .device = &vduse_mgmtdev,
> > +     .id_table = id_table,
> > +     .ops = &vdpa_dev_mgmtdev_ops,
> > +};
> > +
> > +static int vduse_mgmtdev_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = device_register(&vduse_mgmtdev);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = vdpa_mgmtdev_register(&mgmt_dev);
> > +     if (ret)
> > +             goto err;
> > +
> > +     return 0;
> > +err:
> > +     device_unregister(&vduse_mgmtdev);
> > +     return ret;
> > +}
> > +
> > +static void vduse_mgmtdev_exit(void)
> > +{
> > +     vdpa_mgmtdev_unregister(&mgmt_dev);
> > +     device_unregister(&vduse_mgmtdev);
> > +}
> > +
> > +static int vduse_init(void)
> > +{
> > +     int ret;
> > +     struct device *dev;
> > +
> > +     vduse_class = class_create(THIS_MODULE, "vduse");
> > +     if (IS_ERR(vduse_class))
> > +             return PTR_ERR(vduse_class);
> > +
> > +     vduse_class->devnode = vduse_devnode;
> > +
> > +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> > +     if (ret)
> > +             goto err_chardev_region;
> > +
> > +     /* /dev/vduse/control */
> > +     cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
> > +     vduse_ctrl_cdev.owner = THIS_MODULE;
> > +     ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
> > +     if (ret)
> > +             goto err_ctrl_cdev;
> > +
> > +     dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
> > +     if (IS_ERR(dev)) {
> > +             ret = PTR_ERR(dev);
> > +             goto err_device;
> > +     }
> > +
> > +     /* /dev/vduse/$DEVICE */
> > +     cdev_init(&vduse_cdev, &vduse_dev_fops);
> > +     vduse_cdev.owner = THIS_MODULE;
> > +     ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
> > +                    VDUSE_DEV_MAX - 1);
> > +     if (ret)
> > +             goto err_cdev;
> > +
> > +     vduse_irq_wq = alloc_workqueue("vduse-irq",
> > +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> > +     if (!vduse_irq_wq)
> > +             goto err_wq;
> > +
> > +     ret = vduse_domain_init();
> > +     if (ret)
> > +             goto err_domain;
> > +
> > +     ret = vduse_mgmtdev_init();
> > +     if (ret)
> > +             goto err_mgmtdev;
> > +
> > +     return 0;
> > +err_mgmtdev:
> > +     vduse_domain_exit();
> > +err_domain:
> > +     destroy_workqueue(vduse_irq_wq);
> > +err_wq:
> > +     cdev_del(&vduse_cdev);
> > +err_cdev:
> > +     device_destroy(vduse_class, vduse_major);
> > +err_device:
> > +     cdev_del(&vduse_ctrl_cdev);
> > +err_ctrl_cdev:
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +err_chardev_region:
> > +     class_destroy(vduse_class);
> > +     return ret;
> > +}
> > +module_init(vduse_init);
> > +
> > +static void vduse_exit(void)
> > +{
> > +     vduse_mgmtdev_exit();
> > +     vduse_domain_exit();
> > +     destroy_workqueue(vduse_irq_wq);
> > +     cdev_del(&vduse_cdev);
> > +     device_destroy(vduse_class, vduse_major);
> > +     cdev_del(&vduse_ctrl_cdev);
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +     class_destroy(vduse_class);
> > +}
> > +module_exit(vduse_exit);
> > +
> > +MODULE_LICENSE(DRV_LICENSE);
> > +MODULE_AUTHOR(DRV_AUTHOR);
> > +MODULE_DESCRIPTION(DRV_DESC);
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > new file mode 100644
> > index 000000000000..f21b2e51b5c8
> > --- /dev/null
> > +++ b/include/uapi/linux/vduse.h
> > @@ -0,0 +1,143 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_VDUSE_H_
> > +#define _UAPI_VDUSE_H_
> > +
> > +#include <linux/types.h>
> > +
> > +#define VDUSE_API_VERSION    0
> > +
> > +#define VDUSE_NAME_MAX       256
> > +
> > +/* the control messages definition for read/write */
> > +
> > +enum vduse_req_type {
> > +     /* Get the state for virtqueue from userspace */
> > +     VDUSE_GET_VQ_STATE,
> > +     /* Notify userspace to start the dataplane, no reply */
> > +     VDUSE_START_DATAPLANE,
> > +     /* Notify userspace to stop the dataplane, no reply */
> > +     VDUSE_STOP_DATAPLANE,
> > +     /* Notify userspace to update the memory mapping in device IOTLB */
> > +     VDUSE_UPDATE_IOTLB,
> > +};
> > +
> > +struct vduse_vq_state {
> > +     __u32 index; /* virtqueue index */
> > +     __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> > +};
>
>
> This needs some tweaks to support packed virtqueue.
>

OK.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-21 10:41       ` Yongji Xie
  0 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-21 10:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Stefano Garzarella, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä

On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/15 下午10:13, Xie Yongji 写道:
> > This VDUSE driver enables implementing vDPA devices in userspace.
> > The vDPA device's control path is handled in kernel and the data
> > path is handled in userspace.
> >
> > A message mechnism is used by VDUSE driver to forward some control
> > messages such as starting/stopping datapath to userspace. Userspace
> > can use read()/write() to receive/reply those control messages.
> >
> > And some ioctls are introduced to help userspace to implement the
> > data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> > descriptors referring to vDPA device's iova regions. Then userspace
> > can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> > and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> > and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> > ioctls can be used to inject interrupt and setup the kickfd for
> > virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> > configuration space and inject a config interrupt.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >   drivers/vdpa/Kconfig                               |   10 +
> >   drivers/vdpa/Makefile                              |    1 +
> >   drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
> >   include/uapi/linux/vduse.h                         |  143 ++
> >   6 files changed, 1613 insertions(+)
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 9bfc2b510c64..acd95e9dcfe7 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >   '|'   00-7F  linux/media.h
> >   0x80  00-1F  linux/fb.h
> > +0x81  00-1F  linux/vduse.h
> >   0x89  00-06  arch/x86/include/asm/sockios.h
> >   0x89  0B-DF  linux/sockios.h
> >   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> > diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> > index a503c1b2bfd9..6e23bce6433a 100644
> > --- a/drivers/vdpa/Kconfig
> > +++ b/drivers/vdpa/Kconfig
> > @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
> >         vDPA block device simulator which terminates IO request in a
> >         memory buffer.
> >
> > +config VDPA_USER
> > +     tristate "VDUSE (vDPA Device in Userspace) support"
> > +     depends on EVENTFD && MMU && HAS_DMA
> > +     select DMA_OPS
> > +     select VHOST_IOTLB
> > +     select IOMMU_IOVA
> > +     help
> > +       With VDUSE it is possible to emulate a vDPA Device
> > +       in a userspace program.
> > +
> >   config IFCVF
> >       tristate "Intel IFC VF vDPA driver"
> >       depends on PCI_MSI
> > diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> > index 67fe7f3d6943..f02ebed33f19 100644
> > --- a/drivers/vdpa/Makefile
> > +++ b/drivers/vdpa/Makefile
> > @@ -1,6 +1,7 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_VDPA) += vdpa.o
> >   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> > +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >   obj-$(CONFIG_IFCVF)    += ifcvf/
> >   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> > diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> > new file mode 100644
> > index 000000000000..260e0b26af99
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +vduse-y := vduse_dev.o iova_domain.o
> > +
> > +obj-$(CONFIG_VDPA_USER) += vduse.o
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > new file mode 100644
> > index 000000000000..5271cbd15e28
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -0,0 +1,1453 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> > + *
> > + * Author: Xie Yongji <xieyongji@bytedance.com>
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/cdev.h>
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/slab.h>
> > +#include <linux/wait.h>
> > +#include <linux/dma-map-ops.h>
> > +#include <linux/poll.h>
> > +#include <linux/file.h>
> > +#include <linux/uio.h>
> > +#include <linux/vdpa.h>
> > +#include <linux/nospec.h>
> > +#include <uapi/linux/vduse.h>
> > +#include <uapi/linux/vdpa.h>
> > +#include <uapi/linux/virtio_config.h>
> > +#include <uapi/linux/virtio_ids.h>
> > +#include <uapi/linux/virtio_blk.h>
> > +#include <linux/mod_devicetable.h>
> > +
> > +#include "iova_domain.h"
> > +
> > +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> > +#define DRV_DESC     "vDPA Device in Userspace"
> > +#define DRV_LICENSE  "GPL v2"
> > +
> > +#define VDUSE_DEV_MAX (1U << MINORBITS)
> > +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> > +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> > +#define VDUSE_REQUEST_TIMEOUT 30
> > +
> > +struct vduse_virtqueue {
> > +     u16 index;
> > +     u32 num;
> > +     u32 avail_idx;
> > +     u64 desc_addr;
> > +     u64 driver_addr;
> > +     u64 device_addr;
> > +     bool ready;
> > +     bool kicked;
> > +     spinlock_t kick_lock;
> > +     spinlock_t irq_lock;
> > +     struct eventfd_ctx *kickfd;
> > +     struct vdpa_callback cb;
> > +     struct work_struct inject;
> > +};
> > +
> > +struct vduse_dev;
> > +
> > +struct vduse_vdpa {
> > +     struct vdpa_device vdpa;
> > +     struct vduse_dev *dev;
> > +};
> > +
> > +struct vduse_dev {
> > +     struct vduse_vdpa *vdev;
> > +     struct device *dev;
> > +     struct vduse_virtqueue *vqs;
> > +     struct vduse_iova_domain *domain;
> > +     char *name;
> > +     struct mutex lock;
> > +     spinlock_t msg_lock;
> > +     u64 msg_unique;
> > +     wait_queue_head_t waitq;
> > +     struct list_head send_list;
> > +     struct list_head recv_list;
> > +     struct vdpa_callback config_cb;
> > +     struct work_struct inject;
> > +     spinlock_t irq_lock;
> > +     int minor;
> > +     bool connected;
> > +     bool started;
> > +     u64 api_version;
> > +     u64 user_features;
>
>
> Let's use device_features.
>

OK.

>
> > +     u64 features;
>
>
> And driver features.
>

OK.

>
> > +     u32 device_id;
> > +     u32 vendor_id;
> > +     u32 generation;
> > +     u32 config_size;
> > +     void *config;
> > +     u8 status;
> > +     u16 vq_size_max;
> > +     u32 vq_num;
> > +     u32 vq_align;
> > +};
> > +
> > +struct vduse_dev_msg {
> > +     struct vduse_dev_request req;
> > +     struct vduse_dev_response resp;
> > +     struct list_head list;
> > +     wait_queue_head_t waitq;
> > +     bool completed;
> > +};
> > +
> > +struct vduse_control {
> > +     u64 api_version;
> > +};
> > +
> > +static DEFINE_MUTEX(vduse_lock);
> > +static DEFINE_IDR(vduse_idr);
> > +
> > +static dev_t vduse_major;
> > +static struct class *vduse_class;
> > +static struct cdev vduse_ctrl_cdev;
> > +static struct cdev vduse_cdev;
> > +static struct workqueue_struct *vduse_irq_wq;
> > +
> > +static u32 allowed_device_id[] = {
> > +     VIRTIO_ID_BLOCK,
> > +};
> > +
> > +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> > +
> > +     return vdev->dev;
> > +}
> > +
> > +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> > +{
> > +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> > +
> > +     return vdpa_to_vduse(vdpa);
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> > +                                         uint32_t request_id)
> > +{
> > +     struct vduse_dev_msg *msg;
> > +
> > +     list_for_each_entry(msg, head, list) {
> > +             if (msg->req.request_id == request_id) {
> > +                     list_del(&msg->list);
> > +                     return msg;
> > +             }
> > +     }
> > +
> > +     return NULL;
> > +}
> > +
> > +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> > +{
> > +     struct vduse_dev_msg *msg = NULL;
> > +
> > +     if (!list_empty(head)) {
> > +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> > +             list_del(&msg->list);
> > +     }
> > +
> > +     return msg;
> > +}
> > +
> > +static void vduse_enqueue_msg(struct list_head *head,
> > +                           struct vduse_dev_msg *msg)
> > +{
> > +     list_add_tail(&msg->list, head);
> > +}
> > +
> > +static int vduse_dev_msg_send(struct vduse_dev *dev,
> > +                           struct vduse_dev_msg *msg, bool no_reply)
> > +{
>
>
> It looks to me the only user for no_reply=true is the dataplane start. I
> wonder no_reply is really needed consider we have switched to use
> wait_event_killable_timeout().
>

Do we need to handle the error in this case if we remove the no_reply
flag. Print a warning message?

> In another way, no_reply is false for vq state synchronization and IOTLB
> updating. I wonder if we can simply use no_reply = true for them.
>

Looks like we can't, e.g. we need to get a reply from userspace for vq state.

>
> > +     init_waitqueue_head(&msg->waitq);
> > +     spin_lock(&dev->msg_lock);
> > +     msg->req.request_id = dev->msg_unique++;
> > +     vduse_enqueue_msg(&dev->send_list, msg);
> > +     wake_up(&dev->waitq);
> > +     spin_unlock(&dev->msg_lock);
> > +     if (no_reply)
> > +             return 0;
> > +
> > +     wait_event_killable_timeout(msg->waitq, msg->completed,
> > +                                 VDUSE_REQUEST_TIMEOUT * HZ);
> > +     spin_lock(&dev->msg_lock);
> > +     if (!msg->completed) {
> > +             list_del(&msg->list);
> > +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
>
>
> Do we need to serialize the check by protecting it with the spinlock above?
>

Good point.

>
> > +}
> > +
> > +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> > +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> > +                     kfree(msg);
> > +             else
> > +                     vduse_enqueue_msg(&dev->recv_list, msg);
> > +     }
> > +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> > +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> > +             msg->completed = 1;
> > +             wake_up(&msg->waitq);
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +}
> > +
> > +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> > +                                         GFP_KERNEL | __GFP_NOFAIL);
> > +
> > +     msg->req.type = VDUSE_START_DATAPLANE;
> > +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> > +     vduse_dev_msg_send(dev, msg, true);
> > +}
> > +
> > +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> > +{
> > +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> > +                                         GFP_KERNEL | __GFP_NOFAIL);
> > +
> > +     msg->req.type = VDUSE_STOP_DATAPLANE;
> > +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>
>
> Can we simply use this flag instead of introducing a new parameter
> (no_reply) in vduse_dev_msg_send()?
>

Looks good to me.

>
> > +     vduse_dev_msg_send(dev, msg, true);
> > +}
> > +
> > +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> > +                               struct vduse_virtqueue *vq,
> > +                               struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +     int ret;
>
>
> Note that I post a series that implement the packed virtqueue support:
>
> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
>
> So this patch needs to be updated as well.
>

Will do it.

>
> > +
> > +     msg.req.type = VDUSE_GET_VQ_STATE;
> > +     msg.req.vq_state.index = vq->index;
> > +
> > +     ret = vduse_dev_msg_send(dev, &msg, false);
> > +     if (ret)
> > +             return ret;
> > +
> > +     state->avail_index = msg.resp.vq_state.avail_idx;
> > +     return 0;
> > +}
> > +
> > +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > +                             u64 start, u64 last)
> > +{
> > +     struct vduse_dev_msg msg = { 0 };
> > +
> > +     if (last < start)
> > +             return -EINVAL;
> > +
> > +     msg.req.type = VDUSE_UPDATE_IOTLB;
> > +     msg.req.iova.start = start;
> > +     msg.req.iova.last = last;
> > +
> > +     return vduse_dev_msg_send(dev, &msg, false);
> > +}
> > +
> > +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_msg *msg;
> > +     int size = sizeof(struct vduse_dev_request);
> > +     ssize_t ret;
> > +
> > +     if (iov_iter_count(to) < size)
> > +             return -EINVAL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     while (1) {
> > +             msg = vduse_dequeue_msg(&dev->send_list);
> > +             if (msg)
> > +                     break;
> > +
> > +             ret = -EAGAIN;
> > +             if (file->f_flags & O_NONBLOCK)
> > +                     goto unlock;
> > +
> > +             spin_unlock(&dev->msg_lock);
> > +             ret = wait_event_interruptible_exclusive(dev->waitq,
> > +                                     !list_empty(&dev->send_list));
> > +             if (ret)
> > +                     return ret;
> > +
> > +             spin_lock(&dev->msg_lock);
> > +     }
> > +     spin_unlock(&dev->msg_lock);
> > +     ret = copy_to_iter(&msg->req, size, to);
> > +     spin_lock(&dev->msg_lock);
> > +     if (ret != size) {
> > +             ret = -EFAULT;
> > +             vduse_enqueue_msg(&dev->send_list, msg);
> > +             goto unlock;
> > +     }
> > +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> > +             kfree(msg);
> > +     else
> > +             vduse_enqueue_msg(&dev->recv_list, msg);
> > +unlock:
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +     struct file *file = iocb->ki_filp;
> > +     struct vduse_dev *dev = file->private_data;
> > +     struct vduse_dev_response resp;
> > +     struct vduse_dev_msg *msg;
> > +     size_t ret;
> > +
> > +     ret = copy_from_iter(&resp, sizeof(resp), from);
> > +     if (ret != sizeof(resp))
> > +             return -EINVAL;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> > +     if (!msg) {
> > +             ret = -ENOENT;
> > +             goto unlock;
> > +     }
> > +
> > +     memcpy(&msg->resp, &resp, sizeof(resp));
> > +     msg->completed = 1;
> > +     wake_up(&msg->waitq);
> > +unlock:
> > +     spin_unlock(&dev->msg_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     __poll_t mask = 0;
> > +
> > +     poll_wait(file, &dev->waitq, wait);
> > +
> > +     if (!list_empty(&dev->send_list))
> > +             mask |= EPOLLIN | EPOLLRDNORM;
> > +     if (!list_empty(&dev->recv_list))
> > +             mask |= EPOLLOUT | EPOLLWRNORM;
> > +
> > +     return mask;
> > +}
> > +
> > +static void vduse_dev_reset(struct vduse_dev *dev)
> > +{
> > +     int i;
> > +     struct vduse_iova_domain *domain = dev->domain;
> > +
> > +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > +     if (domain->bounce_map)
> > +             vduse_domain_reset_bounce_map(domain);
> > +
> > +     dev->features = 0;
> > +     dev->generation++;
> > +     spin_lock(&dev->irq_lock);
> > +     dev->config_cb.callback = NULL;
> > +     dev->config_cb.private = NULL;
> > +     spin_unlock(&dev->irq_lock);
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             struct vduse_virtqueue *vq = &dev->vqs[i];
> > +
> > +             vq->ready = false;
> > +             vq->desc_addr = 0;
> > +             vq->driver_addr = 0;
> > +             vq->device_addr = 0;
> > +             vq->avail_idx = 0;
> > +             vq->num = 0;
> > +
> > +             spin_lock(&vq->kick_lock);
> > +             vq->kicked = false;
> > +             if (vq->kickfd)
> > +                     eventfd_ctx_put(vq->kickfd);
> > +             vq->kickfd = NULL;
> > +             spin_unlock(&vq->kick_lock);
> > +
> > +             spin_lock(&vq->irq_lock);
> > +             vq->cb.callback = NULL;
> > +             vq->cb.private = NULL;
> > +             spin_unlock(&vq->irq_lock);
> > +     }
> > +}
> > +
> > +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > +                             u64 desc_area, u64 driver_area,
> > +                             u64 device_area)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->desc_addr = desc_area;
> > +     vq->driver_addr = driver_area;
> > +     vq->device_addr = device_area;
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (!vq->ready)
> > +             goto unlock;
> > +
> > +     if (vq->kickfd)
> > +             eventfd_signal(vq->kickfd, 1);
> > +     else
> > +             vq->kicked = true;
> > +unlock:
> > +     spin_unlock(&vq->kick_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> > +                           struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     spin_lock(&vq->irq_lock);
> > +     vq->cb.callback = cb->callback;
> > +     vq->cb.private = cb->private;
> > +     spin_unlock(&vq->irq_lock);
> > +}
> > +
> > +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->num = num;
> > +}
> > +
> > +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> > +                                     u16 idx, bool ready)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->ready = ready;
> > +}
> > +
> > +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vq->ready;
> > +}
> > +
> > +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             const struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     vq->avail_idx = state->avail_index;
> > +     return 0;
> > +}
> > +
> > +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > +                             struct vdpa_vq_state *state)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> > +
> > +     return vduse_dev_get_vq_state(dev, vq, state);
> > +}
> > +
> > +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_align;
> > +}
> > +
> > +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->user_features;
> > +}
> > +
> > +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     dev->features = features;
> > +     return 0;
> > +}
> > +
> > +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> > +                               struct vdpa_callback *cb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     spin_lock(&dev->irq_lock);
> > +     dev->config_cb.callback = cb->callback;
> > +     dev->config_cb.private = cb->private;
> > +     spin_unlock(&dev->irq_lock);
> > +}
> > +
> > +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vq_size_max;
> > +}
> > +
> > +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->device_id;
> > +}
> > +
> > +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->vendor_id;
> > +}
> > +
> > +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->status;
> > +}
> > +
> > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> > +
> > +     dev->status = status;
> > +
> > +     if (dev->started == started)
> > +             return;
>
>
> If we check dev->status == status, (or only check the DRIVER_OK bit)
> then there's no need to introduce an extra dev->started.
>

Will do it.

>
> > +
> > +     dev->started = started;
> > +     if (dev->started) {
> > +             vduse_dev_start_dataplane(dev);
> > +     } else {
> > +             vduse_dev_reset(dev);
> > +             vduse_dev_stop_dataplane(dev);
>
>
> I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA,
> we have bouncing buffers so it's harmless if usersapce dataplane keeps
> performing read/write. For vhost-vDPA we don't have such stuffs.
>

OK. So it still needs to be synchronized here. If so, how to handle
the error? Looks like printing a warning message should be enough.

>
> > +     }
> > +}
> > +
> > +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->config_size;
> > +}
> > +
> > +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                               void *buf, unsigned int len)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     memcpy(buf, dev->config + offset, len);
> > +}
> > +
> > +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> > +                     const void *buf, unsigned int len)
> > +{
> > +     /* Now we only support read-only configuration space */
> > +}
> > +
> > +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     return dev->generation;
> > +}
> > +
> > +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > +                             struct vhost_iotlb *iotlb)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +     int ret;
> > +
> > +     ret = vduse_domain_set_map(dev->domain, iotlb);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > +     if (ret) {
> > +             vduse_domain_clear_map(dev->domain, iotlb);
> > +             return ret;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> > +{
> > +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +
> > +     dev->vdev = NULL;
> > +}
> > +
> > +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > +     .set_vq_address         = vduse_vdpa_set_vq_address,
> > +     .kick_vq                = vduse_vdpa_kick_vq,
> > +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> > +     .set_vq_num             = vduse_vdpa_set_vq_num,
> > +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> > +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> > +     .set_vq_state           = vduse_vdpa_set_vq_state,
> > +     .get_vq_state           = vduse_vdpa_get_vq_state,
> > +     .get_vq_align           = vduse_vdpa_get_vq_align,
> > +     .get_features           = vduse_vdpa_get_features,
> > +     .set_features           = vduse_vdpa_set_features,
> > +     .set_config_cb          = vduse_vdpa_set_config_cb,
> > +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> > +     .get_device_id          = vduse_vdpa_get_device_id,
> > +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> > +     .get_status             = vduse_vdpa_get_status,
> > +     .set_status             = vduse_vdpa_set_status,
> > +     .get_config_size        = vduse_vdpa_get_config_size,
> > +     .get_config             = vduse_vdpa_get_config,
> > +     .set_config             = vduse_vdpa_set_config,
> > +     .get_generation         = vduse_vdpa_get_generation,
> > +     .set_map                = vduse_vdpa_set_map,
> > +     .free                   = vduse_vdpa_free,
> > +};
> > +
> > +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> > +                                  unsigned long offset, size_t size,
> > +                                  enum dma_data_direction dir,
> > +                                  unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > +}
> > +
> > +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> > +                             size_t size, enum dma_data_direction dir,
> > +                             unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> > +}
> > +
> > +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> > +                                     dma_addr_t *dma_addr, gfp_t flag,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +     unsigned long iova;
> > +     void *addr;
> > +
> > +     *dma_addr = DMA_MAPPING_ERROR;
> > +     addr = vduse_domain_alloc_coherent(domain, size,
> > +                             (dma_addr_t *)&iova, flag, attrs);
> > +     if (!addr)
> > +             return NULL;
> > +
> > +     *dma_addr = (dma_addr_t)iova;
> > +
> > +     return addr;
> > +}
> > +
> > +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> > +                                     void *vaddr, dma_addr_t dma_addr,
> > +                                     unsigned long attrs)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> > +}
> > +
> > +static size_t vduse_dev_max_mapping_size(struct device *dev)
> > +{
> > +     struct vduse_dev *vdev = dev_to_vduse(dev);
> > +     struct vduse_iova_domain *domain = vdev->domain;
> > +
> > +     return domain->bounce_size;
> > +}
> > +
> > +static const struct dma_map_ops vduse_dev_dma_ops = {
> > +     .map_page = vduse_dev_map_page,
> > +     .unmap_page = vduse_dev_unmap_page,
> > +     .alloc = vduse_dev_alloc_coherent,
> > +     .free = vduse_dev_free_coherent,
> > +     .max_mapping_size = vduse_dev_max_mapping_size,
> > +};
> > +
> > +static unsigned int perm_to_file_flags(u8 perm)
> > +{
> > +     unsigned int flags = 0;
> > +
> > +     switch (perm) {
> > +     case VDUSE_ACCESS_WO:
> > +             flags |= O_WRONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RO:
> > +             flags |= O_RDONLY;
> > +             break;
> > +     case VDUSE_ACCESS_RW:
> > +             flags |= O_RDWR;
> > +             break;
> > +     default:
> > +             WARN(1, "invalidate vhost IOTLB permission\n");
> > +             break;
> > +     }
> > +
> > +     return flags;
> > +}
> > +
> > +static int vduse_kickfd_setup(struct vduse_dev *dev,
> > +                     struct vduse_vq_eventfd *eventfd)
> > +{
> > +     struct eventfd_ctx *ctx = NULL;
> > +     struct vduse_virtqueue *vq;
> > +     u32 index;
> > +
> > +     if (eventfd->index >= dev->vq_num)
> > +             return -EINVAL;
> > +
> > +     index = array_index_nospec(eventfd->index, dev->vq_num);
> > +     vq = &dev->vqs[index];
> > +     if (eventfd->fd >= 0) {
> > +             ctx = eventfd_ctx_fdget(eventfd->fd);
> > +             if (IS_ERR(ctx))
> > +                     return PTR_ERR(ctx);
> > +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> > +             return 0;
> > +
> > +     spin_lock(&vq->kick_lock);
> > +     if (vq->kickfd)
> > +             eventfd_ctx_put(vq->kickfd);
> > +     vq->kickfd = ctx;
> > +     if (vq->ready && vq->kicked && vq->kickfd) {
> > +             eventfd_signal(vq->kickfd, 1);
> > +             vq->kicked = false;
> > +     }
> > +     spin_unlock(&vq->kick_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +static void vduse_dev_irq_inject(struct work_struct *work)
> > +{
> > +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> > +
> > +     spin_lock_irq(&dev->irq_lock);
> > +     if (dev->config_cb.callback)
> > +             dev->config_cb.callback(dev->config_cb.private);
> > +     spin_unlock_irq(&dev->irq_lock);
> > +}
> > +
> > +static void vduse_vq_irq_inject(struct work_struct *work)
> > +{
> > +     struct vduse_virtqueue *vq = container_of(work,
> > +                                     struct vduse_virtqueue, inject);
> > +
> > +     spin_lock_irq(&vq->irq_lock);
> > +     if (vq->ready && vq->cb.callback)
> > +             vq->cb.callback(vq->cb.private);
> > +     spin_unlock_irq(&vq->irq_lock);
> > +}
> > +
> > +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > +                         unsigned long arg)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +     void __user *argp = (void __user *)arg;
> > +     int ret;
> > +
> > +     switch (cmd) {
> > +     case VDUSE_IOTLB_GET_FD: {
> > +             struct vduse_iotlb_entry entry;
> > +             struct vhost_iotlb_map *map;
> > +             struct vdpa_map_file *map_file;
> > +             struct vduse_iova_domain *domain = dev->domain;
> > +             struct file *f = NULL;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&entry, argp, sizeof(entry)))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (entry.start > entry.last)
> > +                     break;
> > +
> > +             spin_lock(&domain->iotlb_lock);
> > +             map = vhost_iotlb_itree_first(domain->iotlb,
> > +                                           entry.start, entry.last);
> > +             if (map) {
> > +                     map_file = (struct vdpa_map_file *)map->opaque;
> > +                     f = get_file(map_file->file);
> > +                     entry.offset = map_file->offset;
> > +                     entry.start = map->start;
> > +                     entry.last = map->last;
> > +                     entry.perm = map->perm;
> > +             }
> > +             spin_unlock(&domain->iotlb_lock);
> > +             ret = -EINVAL;
> > +             if (!f)
> > +                     break;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> > +                     fput(f);
> > +                     break;
> > +             }
> > +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
> > +             fput(f);
> > +             break;
> > +     }
> > +     case VDUSE_DEV_GET_FEATURES:
> > +             ret = put_user(dev->features, (u64 __user *)argp);
> > +             break;
> > +     case VDUSE_DEV_UPDATE_CONFIG: {
> > +             struct vduse_config_update config;
> > +             unsigned long size = offsetof(struct vduse_config_update,
> > +                                           buffer);
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, size))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (config.length == 0 ||
> > +                 config.length > dev->config_size - config.offset)
> > +                     break;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(dev->config + config.offset, argp + size,
> > +                                config.length))
> > +                     break;
> > +
> > +             ret = 0;
> > +             queue_work(vduse_irq_wq, &dev->inject);
>
>
> I wonder if it's better to separate config interrupt out of config
> update or we need document this.
>

I have documented it in the docs. Looks like a config update should be
always followed by a config interrupt. I didn't find a case that uses
them separately.

>
> > +             break;
> > +     }
> > +     case VDUSE_VQ_GET_INFO: {
>
>
> Do we need to limit this only when DRIVER_OK is set?
>

Any reason to add this limitation?

>
> > +             struct vduse_vq_info vq_info;
> > +             u32 vq_index;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (vq_info.index >= dev->vq_num)
> > +                     break;
> > +
> > +             vq_index = array_index_nospec(vq_info.index, dev->vq_num);
> > +             vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
> > +             vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
> > +             vq_info.device_addr = dev->vqs[vq_index].device_addr;
> > +             vq_info.num = dev->vqs[vq_index].num;
> > +             vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
> > +             vq_info.ready = dev->vqs[vq_index].ready;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
> > +                     break;
> > +
> > +             ret = 0;
> > +             break;
> > +     }
> > +     case VDUSE_VQ_SETUP_KICKFD: {
> > +             struct vduse_vq_eventfd eventfd;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> > +                     break;
> > +
> > +             ret = vduse_kickfd_setup(dev, &eventfd);
> > +             break;
> > +     }
> > +     case VDUSE_VQ_INJECT_IRQ: {
> > +             u32 vq_index;
> > +
> > +             ret = -EFAULT;
> > +             if (get_user(vq_index, (u32 __user *)argp))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (vq_index >= dev->vq_num)
> > +                     break;
> > +
> > +             ret = 0;
> > +             vq_index = array_index_nospec(vq_index, dev->vq_num);
> > +             queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
> > +             break;
> > +     }
> > +     default:
> > +             ret = -ENOIOCTLCMD;
> > +             break;
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_dev_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_dev *dev = file->private_data;
> > +
> > +     spin_lock(&dev->msg_lock);
> > +     /* Make sure the inflight messages can processed after reconncection */
> > +     list_splice_init(&dev->recv_list, &dev->send_list);
> > +     spin_unlock(&dev->msg_lock);
> > +     dev->connected = false;
> > +
> > +     return 0;
> > +}
> > +
> > +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
> > +{
> > +     struct vduse_dev *dev;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     dev = idr_find(&vduse_idr, minor);
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return dev;
> > +}
> > +
> > +static int vduse_dev_open(struct inode *inode, struct file *file)
> > +{
> > +     int ret;
> > +     struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
> > +
> > +     if (!dev)
> > +             return -ENODEV;
> > +
> > +     ret = -EBUSY;
> > +     mutex_lock(&dev->lock);
> > +     if (dev->connected)
> > +             goto unlock;
> > +
> > +     ret = 0;
> > +     dev->connected = true;
> > +     file->private_data = dev;
> > +unlock:
> > +     mutex_unlock(&dev->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct file_operations vduse_dev_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = vduse_dev_open,
> > +     .release        = vduse_dev_release,
> > +     .read_iter      = vduse_dev_read_iter,
> > +     .write_iter     = vduse_dev_write_iter,
> > +     .poll           = vduse_dev_poll,
> > +     .unlocked_ioctl = vduse_dev_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static struct vduse_dev *vduse_dev_create(void)
> > +{
> > +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +
> > +     if (!dev)
> > +             return NULL;
> > +
> > +     mutex_init(&dev->lock);
> > +     spin_lock_init(&dev->msg_lock);
> > +     INIT_LIST_HEAD(&dev->send_list);
> > +     INIT_LIST_HEAD(&dev->recv_list);
> > +     spin_lock_init(&dev->irq_lock);
> > +
> > +     INIT_WORK(&dev->inject, vduse_dev_irq_inject);
> > +     init_waitqueue_head(&dev->waitq);
> > +
> > +     return dev;
> > +}
> > +
> > +static void vduse_dev_destroy(struct vduse_dev *dev)
> > +{
> > +     kfree(dev);
> > +}
> > +
> > +static struct vduse_dev *vduse_find_dev(const char *name)
> > +{
> > +     struct vduse_dev *dev;
> > +     int id;
> > +
> > +     idr_for_each_entry(&vduse_idr, dev, id)
> > +             if (!strcmp(dev->name, name))
> > +                     return dev;
> > +
> > +     return NULL;
> > +}
> > +
> > +static int vduse_destroy_dev(char *name)
> > +{
> > +     struct vduse_dev *dev = vduse_find_dev(name);
> > +
> > +     if (!dev)
> > +             return -EINVAL;
> > +
> > +     mutex_lock(&dev->lock);
> > +     if (dev->vdev || dev->connected) {
> > +             mutex_unlock(&dev->lock);
> > +             return -EBUSY;
> > +     }
> > +     dev->connected = true;
> > +     mutex_unlock(&dev->lock);
> > +
> > +     vduse_dev_msg_cleanup(dev);
> > +     device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> > +     idr_remove(&vduse_idr, dev->minor);
> > +     kvfree(dev->config);
> > +     kfree(dev->vqs);
> > +     vduse_domain_destroy(dev->domain);
> > +     kfree(dev->name);
> > +     vduse_dev_destroy(dev);
> > +     module_put(THIS_MODULE);
> > +
> > +     return 0;
> > +}
> > +
> > +static bool device_is_allowed(u32 device_id)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
> > +             if (allowed_device_id[i] == device_id)
> > +                     return true;
> > +
> > +     return false;
> > +}
> > +
> > +static bool features_is_valid(u64 features)
> > +{
> > +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> > +             return false;
> > +
> > +     /* Now we only support read-only configuration space */
> > +     if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> > +             return false;
> > +
> > +     return true;
> > +}
> > +
> > +static bool vduse_validate_config(struct vduse_dev_config *config)
> > +{
> > +     if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
> > +             return false;
> > +
> > +     if (config->vq_align > PAGE_SIZE)
> > +             return false;
> > +
> > +     if (config->config_size > PAGE_SIZE)
> > +             return false;
> > +
> > +     if (!device_is_allowed(config->device_id))
> > +             return false;
> > +
> > +     if (!features_is_valid(config->features))
> > +             return false;
>
>
> Do we need to validate whether or not config_size is too small otherwise
> we may have OOB access in get_config()?
>

How about adding validation in get_config()? It seems to be hard to
define the lower bound.

>
> > +
> > +     return true;
> > +}
> > +
> > +static int vduse_create_dev(struct vduse_dev_config *config,
> > +                         void *config_buf, u64 api_version)
> > +{
> > +     int i, ret;
> > +     struct vduse_dev *dev;
> > +
> > +     ret = -EEXIST;
> > +     if (vduse_find_dev(config->name))
> > +             goto err;
> > +
> > +     ret = -ENOMEM;
> > +     dev = vduse_dev_create();
> > +     if (!dev)
> > +             goto err;
> > +
> > +     dev->api_version = api_version;
> > +     dev->user_features = config->features;
> > +     dev->device_id = config->device_id;
> > +     dev->vendor_id = config->vendor_id;
> > +     dev->name = kstrdup(config->name, GFP_KERNEL);
> > +     if (!dev->name)
> > +             goto err_str;
> > +
> > +     dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> > +                                       config->bounce_size);
> > +     if (!dev->domain)
> > +             goto err_domain;
> > +
> > +     dev->config = config_buf;
> > +     dev->config_size = config->config_size;
> > +     dev->vq_align = config->vq_align;
> > +     dev->vq_size_max = config->vq_size_max;
> > +     dev->vq_num = config->vq_num;
> > +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> > +     if (!dev->vqs)
> > +             goto err_vqs;
> > +
> > +     for (i = 0; i < dev->vq_num; i++) {
> > +             dev->vqs[i].index = i;
> > +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> > +             spin_lock_init(&dev->vqs[i].kick_lock);
> > +             spin_lock_init(&dev->vqs[i].irq_lock);
> > +     }
> > +
> > +     ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
> > +     if (ret < 0)
> > +             goto err_idr;
> > +
> > +     dev->minor = ret;
> > +     dev->dev = device_create(vduse_class, NULL,
> > +                              MKDEV(MAJOR(vduse_major), dev->minor),
> > +                              NULL, "%s", config->name);
> > +     if (IS_ERR(dev->dev)) {
> > +             ret = PTR_ERR(dev->dev);
> > +             goto err_dev;
> > +     }
> > +     __module_get(THIS_MODULE);
> > +
> > +     return 0;
> > +err_dev:
> > +     idr_remove(&vduse_idr, dev->minor);
> > +err_idr:
> > +     kfree(dev->vqs);
> > +err_vqs:
> > +     vduse_domain_destroy(dev->domain);
> > +err_domain:
> > +     kfree(dev->name);
> > +err_str:
> > +     vduse_dev_destroy(dev);
> > +err:
> > +     kvfree(config_buf);
> > +     return ret;
> > +}
> > +
> > +static long vduse_ioctl(struct file *file, unsigned int cmd,
> > +                     unsigned long arg)
> > +{
> > +     int ret;
> > +     void __user *argp = (void __user *)arg;
> > +     struct vduse_control *control = file->private_data;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     switch (cmd) {
> > +     case VDUSE_GET_API_VERSION:
> > +             ret = put_user(control->api_version, (u64 __user *)argp);
> > +             break;
> > +     case VDUSE_SET_API_VERSION: {
> > +             u64 api_version;
> > +
> > +             ret = -EFAULT;
> > +             if (get_user(api_version, (u64 __user *)argp))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (api_version > VDUSE_API_VERSION)
> > +                     break;
> > +
> > +             ret = 0;
> > +             control->api_version = api_version;
> > +             break;
> > +     }
> > +     case VDUSE_CREATE_DEV: {
> > +             struct vduse_dev_config config;
> > +             unsigned long size = offsetof(struct vduse_dev_config, config);
> > +             void *buf;
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(&config, argp, size))
> > +                     break;
> > +
> > +             ret = -EINVAL;
> > +             if (vduse_validate_config(&config) == false)
> > +                     break;
> > +
> > +             buf = vmemdup_user(argp + size, config.config_size);
> > +             if (IS_ERR(buf)) {
> > +                     ret = PTR_ERR(buf);
> > +                     break;
> > +             }
> > +             ret = vduse_create_dev(&config, buf, control->api_version);
> > +             break;
> > +     }
> > +     case VDUSE_DESTROY_DEV: {
> > +             char name[VDUSE_NAME_MAX];
> > +
> > +             ret = -EFAULT;
> > +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> > +                     break;
> > +
> > +             ret = vduse_destroy_dev(name);
> > +             break;
> > +     }
> > +     default:
> > +             ret = -EINVAL;
> > +             break;
> > +     }
> > +     mutex_unlock(&vduse_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static int vduse_release(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_control *control = file->private_data;
> > +
> > +     kfree(control);
> > +     return 0;
> > +}
> > +
> > +static int vduse_open(struct inode *inode, struct file *file)
> > +{
> > +     struct vduse_control *control;
> > +
> > +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> > +     if (!control)
> > +             return -ENOMEM;
> > +
> > +     control->api_version = VDUSE_API_VERSION;
> > +     file->private_data = control;
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations vduse_ctrl_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = vduse_open,
> > +     .release        = vduse_release,
> > +     .unlocked_ioctl = vduse_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +static char *vduse_devnode(struct device *dev, umode_t *mode)
> > +{
> > +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> > +}
> > +
> > +static void vduse_mgmtdev_release(struct device *dev)
> > +{
> > +}
> > +
> > +static struct device vduse_mgmtdev = {
> > +     .init_name = "vduse",
> > +     .release = vduse_mgmtdev_release,
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev;
> > +
> > +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> > +{
> > +     struct vduse_vdpa *vdev;
> > +     int ret;
> > +
> > +     if (dev->vdev)
> > +             return -EEXIST;
> > +
> > +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> > +                              &vduse_vdpa_config_ops, name, true);
> > +     if (!vdev)
> > +             return -ENOMEM;
> > +
> > +     dev->vdev = vdev;
> > +     vdev->dev = dev;
> > +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> > +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> > +     if (ret) {
> > +             put_device(&vdev->vdpa.dev);
> > +             return ret;
> > +     }
> > +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> > +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> > +     vdev->vdpa.mdev = &mgmt_dev;
> > +
> > +     return 0;
> > +}
> > +
> > +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> > +{
> > +     struct vduse_dev *dev;
> > +     int ret;
> > +
> > +     mutex_lock(&vduse_lock);
> > +     dev = vduse_find_dev(name);
> > +     if (!dev) {
> > +             mutex_unlock(&vduse_lock);
> > +             return -EINVAL;
> > +     }
> > +     ret = vduse_dev_init_vdpa(dev, name);
> > +     mutex_unlock(&vduse_lock);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> > +     if (ret) {
> > +             put_device(&dev->vdev->vdpa.dev);
> > +             return ret;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> > +{
> > +     _vdpa_unregister_device(dev);
> > +}
> > +
> > +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> > +     .dev_add = vdpa_dev_add,
> > +     .dev_del = vdpa_dev_del,
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > +     { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
> > +     { 0 },
> > +};
> > +
> > +static struct vdpa_mgmt_dev mgmt_dev = {
> > +     .device = &vduse_mgmtdev,
> > +     .id_table = id_table,
> > +     .ops = &vdpa_dev_mgmtdev_ops,
> > +};
> > +
> > +static int vduse_mgmtdev_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = device_register(&vduse_mgmtdev);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = vdpa_mgmtdev_register(&mgmt_dev);
> > +     if (ret)
> > +             goto err;
> > +
> > +     return 0;
> > +err:
> > +     device_unregister(&vduse_mgmtdev);
> > +     return ret;
> > +}
> > +
> > +static void vduse_mgmtdev_exit(void)
> > +{
> > +     vdpa_mgmtdev_unregister(&mgmt_dev);
> > +     device_unregister(&vduse_mgmtdev);
> > +}
> > +
> > +static int vduse_init(void)
> > +{
> > +     int ret;
> > +     struct device *dev;
> > +
> > +     vduse_class = class_create(THIS_MODULE, "vduse");
> > +     if (IS_ERR(vduse_class))
> > +             return PTR_ERR(vduse_class);
> > +
> > +     vduse_class->devnode = vduse_devnode;
> > +
> > +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> > +     if (ret)
> > +             goto err_chardev_region;
> > +
> > +     /* /dev/vduse/control */
> > +     cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
> > +     vduse_ctrl_cdev.owner = THIS_MODULE;
> > +     ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
> > +     if (ret)
> > +             goto err_ctrl_cdev;
> > +
> > +     dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
> > +     if (IS_ERR(dev)) {
> > +             ret = PTR_ERR(dev);
> > +             goto err_device;
> > +     }
> > +
> > +     /* /dev/vduse/$DEVICE */
> > +     cdev_init(&vduse_cdev, &vduse_dev_fops);
> > +     vduse_cdev.owner = THIS_MODULE;
> > +     ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
> > +                    VDUSE_DEV_MAX - 1);
> > +     if (ret)
> > +             goto err_cdev;
> > +
> > +     vduse_irq_wq = alloc_workqueue("vduse-irq",
> > +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> > +     if (!vduse_irq_wq)
> > +             goto err_wq;
> > +
> > +     ret = vduse_domain_init();
> > +     if (ret)
> > +             goto err_domain;
> > +
> > +     ret = vduse_mgmtdev_init();
> > +     if (ret)
> > +             goto err_mgmtdev;
> > +
> > +     return 0;
> > +err_mgmtdev:
> > +     vduse_domain_exit();
> > +err_domain:
> > +     destroy_workqueue(vduse_irq_wq);
> > +err_wq:
> > +     cdev_del(&vduse_cdev);
> > +err_cdev:
> > +     device_destroy(vduse_class, vduse_major);
> > +err_device:
> > +     cdev_del(&vduse_ctrl_cdev);
> > +err_ctrl_cdev:
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +err_chardev_region:
> > +     class_destroy(vduse_class);
> > +     return ret;
> > +}
> > +module_init(vduse_init);
> > +
> > +static void vduse_exit(void)
> > +{
> > +     vduse_mgmtdev_exit();
> > +     vduse_domain_exit();
> > +     destroy_workqueue(vduse_irq_wq);
> > +     cdev_del(&vduse_cdev);
> > +     device_destroy(vduse_class, vduse_major);
> > +     cdev_del(&vduse_ctrl_cdev);
> > +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> > +     class_destroy(vduse_class);
> > +}
> > +module_exit(vduse_exit);
> > +
> > +MODULE_LICENSE(DRV_LICENSE);
> > +MODULE_AUTHOR(DRV_AUTHOR);
> > +MODULE_DESCRIPTION(DRV_DESC);
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > new file mode 100644
> > index 000000000000..f21b2e51b5c8
> > --- /dev/null
> > +++ b/include/uapi/linux/vduse.h
> > @@ -0,0 +1,143 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_VDUSE_H_
> > +#define _UAPI_VDUSE_H_
> > +
> > +#include <linux/types.h>
> > +
> > +#define VDUSE_API_VERSION    0
> > +
> > +#define VDUSE_NAME_MAX       256
> > +
> > +/* the control messages definition for read/write */
> > +
> > +enum vduse_req_type {
> > +     /* Get the state for virtqueue from userspace */
> > +     VDUSE_GET_VQ_STATE,
> > +     /* Notify userspace to start the dataplane, no reply */
> > +     VDUSE_START_DATAPLANE,
> > +     /* Notify userspace to stop the dataplane, no reply */
> > +     VDUSE_STOP_DATAPLANE,
> > +     /* Notify userspace to update the memory mapping in device IOTLB */
> > +     VDUSE_UPDATE_IOTLB,
> > +};
> > +
> > +struct vduse_vq_state {
> > +     __u32 index; /* virtqueue index */
> > +     __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> > +};
>
>
> This needs some tweaks to support packed virtqueue.
>

OK.

Thanks,
Yongji
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-21 10:41       ` Yongji Xie
  (?)
@ 2021-06-22  5:06         ` Jason Wang
  -1 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-22  5:06 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, Al Viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel


在 2021/6/21 下午6:41, Yongji Xie 写道:
> On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/6/15 下午10:13, Xie Yongji 写道:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> The vDPA device's control path is handled in kernel and the data
>>> path is handled in userspace.
>>>
>>> A message mechnism is used by VDUSE driver to forward some control
>>> messages such as starting/stopping datapath to userspace. Userspace
>>> can use read()/write() to receive/reply those control messages.
>>>
>>> And some ioctls are introduced to help userspace to implement the
>>> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
>>> descriptors referring to vDPA device's iova regions. Then userspace
>>> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
>>> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
>>> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
>>> ioctls can be used to inject interrupt and setup the kickfd for
>>> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
>>> configuration space and inject a config interrupt.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  143 ++
>>>    6 files changed, 1613 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index 9bfc2b510c64..acd95e9dcfe7 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index a503c1b2bfd9..6e23bce6433a 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>>>          vDPA block device simulator which terminates IO request in a
>>>          memory buffer.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index 67fe7f3d6943..f02ebed33f19 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,6 +1,7 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..5271cbd15e28
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1453 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <linux/nospec.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <uapi/linux/virtio_ids.h>
>>> +#include <uapi/linux/virtio_blk.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
>>> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
>>> +#define VDUSE_REQUEST_TIMEOUT 30
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     u32 num;
>>> +     u32 avail_idx;
>>> +     u64 desc_addr;
>>> +     u64 driver_addr;
>>> +     u64 device_addr;
>>> +     bool ready;
>>> +     bool kicked;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +     struct work_struct inject;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device *dev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     char *name;
>>> +     struct mutex lock;
>>> +     spinlock_t msg_lock;
>>> +     u64 msg_unique;
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct vdpa_callback config_cb;
>>> +     struct work_struct inject;
>>> +     spinlock_t irq_lock;
>>> +     int minor;
>>> +     bool connected;
>>> +     bool started;
>>> +     u64 api_version;
>>> +     u64 user_features;
>>
>> Let's use device_features.
>>
> OK.
>
>>> +     u64 features;
>>
>> And driver features.
>>
> OK.
>
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +     u32 generation;
>>> +     u32 config_size;
>>> +     void *config;
>>> +     u8 status;
>>> +     u16 vq_size_max;
>>> +     u32 vq_num;
>>> +     u32 vq_align;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +struct vduse_control {
>>> +     u64 api_version;
>>> +};
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static DEFINE_IDR(vduse_idr);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +static struct cdev vduse_ctrl_cdev;
>>> +static struct cdev vduse_cdev;
>>> +static struct workqueue_struct *vduse_irq_wq;
>>> +
>>> +static u32 allowed_device_id[] = {
>>> +     VIRTIO_ID_BLOCK,
>>> +};
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t request_id)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     list_for_each_entry(msg, head, list) {
>>> +             if (msg->req.request_id == request_id) {
>>> +                     list_del(&msg->list);
>>> +                     return msg;
>>> +             }
>>> +     }
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_send(struct vduse_dev *dev,
>>> +                           struct vduse_dev_msg *msg, bool no_reply)
>>> +{
>>
>> It looks to me the only user for no_reply=true is the dataplane start. I
>> wonder no_reply is really needed consider we have switched to use
>> wait_event_killable_timeout().
>>
> Do we need to handle the error in this case if we remove the no_reply
> flag. Print a warning message?


See below.


>
>> In another way, no_reply is false for vq state synchronization and IOTLB
>> updating. I wonder if we can simply use no_reply = true for them.
>>
> Looks like we can't, e.g. we need to get a reply from userspace for vq state.


Right.


>
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg->req.request_id = dev->msg_unique++;
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     if (no_reply)
>>> +             return 0;
>>> +
>>> +     wait_event_killable_timeout(msg->waitq, msg->completed,
>>> +                                 VDUSE_REQUEST_TIMEOUT * HZ);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (!msg->completed) {
>>> +             list_del(&msg->list);
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
>>
>> Do we need to serialize the check by protecting it with the spinlock above?
>>
> Good point.
>
>>> +}
>>> +
>>> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
>>> +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +                     kfree(msg);
>>> +             else
>>> +                     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +     }
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +             msg->completed = 1;
>>> +             wake_up(&msg->waitq);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +}
>>> +
>>> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_START_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_STOP_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>
>> Can we simply use this flag instead of introducing a new parameter
>> (no_reply) in vduse_dev_msg_send()?
>>
> Looks good to me.
>
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                               struct vduse_virtqueue *vq,
>>> +                               struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>
>> Note that I post a series that implement the packed virtqueue support:
>>
>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
>>
>> So this patch needs to be updated as well.
>>
> Will do it.
>
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_send(dev, &msg, false);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     state->avail_index = msg.resp.vq_state.avail_idx;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
>>> +     msg.req.iova.start = start;
>>> +     msg.req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_send(dev, &msg, false);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             spin_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             spin_lock(&dev->msg_lock);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +             kfree(msg);
>>> +     else
>>> +             vduse_enqueue_msg(&dev->recv_list, msg);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -ENOENT;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +     if (!list_empty(&dev->recv_list))
>>> +             mask |= EPOLLOUT | EPOLLWRNORM;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +     struct vduse_iova_domain *domain = dev->domain;
>>> +
>>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
>>> +     if (domain->bounce_map)
>>> +             vduse_domain_reset_bounce_map(domain);
>>> +
>>> +     dev->features = 0;
>>> +     dev->generation++;
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = NULL;
>>> +     dev->config_cb.private = NULL;
>>> +     spin_unlock(&dev->irq_lock);
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             vq->ready = false;
>>> +             vq->desc_addr = 0;
>>> +             vq->driver_addr = 0;
>>> +             vq->device_addr = 0;
>>> +             vq->avail_idx = 0;
>>> +             vq->num = 0;
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             vq->kicked = false;
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->desc_addr = desc_area;
>>> +     vq->driver_addr = driver_area;
>>> +     vq->device_addr = device_area;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (!vq->ready)
>>> +             goto unlock;
>>> +
>>> +     if (vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     else
>>> +             vq->kicked = true;
>>> +unlock:
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->num = num;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->avail_idx = state->avail_index;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->user_features;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->features = features;
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = cb->callback;
>>> +     dev->config_cb.private = cb->private;
>>> +     spin_unlock(&dev->irq_lock);
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->status;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
>>> +
>>> +     dev->status = status;
>>> +
>>> +     if (dev->started == started)
>>> +             return;
>>
>> If we check dev->status == status, (or only check the DRIVER_OK bit)
>> then there's no need to introduce an extra dev->started.
>>
> Will do it.
>
>>> +
>>> +     dev->started = started;
>>> +     if (dev->started) {
>>> +             vduse_dev_start_dataplane(dev);
>>> +     } else {
>>> +             vduse_dev_reset(dev);
>>> +             vduse_dev_stop_dataplane(dev);
>>
>> I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA,
>> we have bouncing buffers so it's harmless if usersapce dataplane keeps
>> performing read/write. For vhost-vDPA we don't have such stuffs.
>>
> OK. So it still needs to be synchronized here. If so, how to handle
> the error? Looks like printing a warning message should be enough.


We need fix a way to propagate the error to the userspace.

E.g if we want to stop the deivce, we will delay the status reset until 
we get respose from the userspace?


>
>>> +     }
>>> +}
>>> +
>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->config_size;
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                               void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     memcpy(buf, dev->config + offset, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     /* Now we only support read-only configuration space */
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->generation;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     int ret;
>>> +
>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +     if (ret) {
>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .get_generation         = vduse_vdpa_get_generation,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                  unsigned long offset, size_t size,
>>> +                                  enum dma_data_direction dir,
>>> +                                  unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +}
>>> +
>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return domain->bounce_size;
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +     u32 index;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>> +     vq = &dev->vqs[index];
>>> +     if (eventfd->fd >= 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>> +             return 0;
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +             vq->kicked = false;
>>> +     }
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>> +
>>> +     spin_lock_irq(&dev->irq_lock);
>>> +     if (dev->config_cb.callback)
>>> +             dev->config_cb.callback(dev->config_cb.private);
>>> +     spin_unlock_irq(&dev->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virtqueue *vq = container_of(work,
>>> +                                     struct vduse_virtqueue, inject);
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb.callback)
>>> +             vq->cb.callback(vq->cb.private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                         unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct vduse_iova_domain *domain = dev->domain;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (entry.start > entry.last)
>>> +                     break;
>>> +
>>> +             spin_lock(&domain->iotlb_lock);
>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>> +                                           entry.start, entry.last);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             ret = -EINVAL;
>>> +             if (!f)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>> +             fput(f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DEV_GET_FEATURES:
>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>> +             struct vduse_config_update config;
>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>> +                                           buffer);
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (config.length == 0 ||
>>> +                 config.length > dev->config_size - config.offset)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>> +                                config.length))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>
>> I wonder if it's better to separate config interrupt out of config
>> update or we need document this.
>>
> I have documented it in the docs. Looks like a config update should be
> always followed by a config interrupt. I didn't find a case that uses
> them separately.


The uAPI doesn't prevent us from the following scenario:

update_config(mac[0], ..);
update_config(max[1], ..);

So it looks to me it's better to separate the config interrupt from the 
config updating.


>
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_GET_INFO: {
>>
>> Do we need to limit this only when DRIVER_OK is set?
>>
> Any reason to add this limitation?


Otherwise the vq is not fully initialized, e.g the desc_addr might not 
be correct.


>
>>> +             struct vduse_vq_info vq_info;
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_info.index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             vq_index = array_index_nospec(vq_info.index, dev->vq_num);
>>> +             vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
>>> +             vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
>>> +             vq_info.device_addr = dev->vqs[vq_index].device_addr;
>>> +             vq_info.num = dev->vqs[vq_index].num;
>>> +             vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
>>> +             vq_info.ready = dev->vqs[vq_index].ready;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_INJECT_IRQ: {
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(vq_index, (u32 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             vq_index = array_index_nospec(vq_index, dev->vq_num);
>>> +             queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     /* Make sure the inflight messages can processed after reconncection */
>>> +     list_splice_init(&dev->recv_list, &dev->send_list);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = idr_find(&vduse_idr, minor);
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     int ret;
>>> +     struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
>>> +
>>> +     if (!dev)
>>> +             return -ENODEV;
>>> +
>>> +     ret = -EBUSY;
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     mutex_init(&dev->lock);
>>> +     spin_lock_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     spin_lock_init(&dev->irq_lock);
>>> +
>>> +     INIT_WORK(&dev->inject, vduse_dev_irq_inject);
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int id;
>>> +
>>> +     idr_for_each_entry(&vduse_idr, dev, id)
>>> +             if (!strcmp(dev->name, name))
>>> +                     return dev;
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->vdev || dev->connected) {
>>> +             mutex_unlock(&dev->lock);
>>> +             return -EBUSY;
>>> +     }
>>> +     dev->connected = true;
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     vduse_dev_msg_cleanup(dev);
>>> +     device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +     kvfree(dev->config);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     kfree(dev->name);
>>> +     vduse_dev_destroy(dev);
>>> +     module_put(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static bool device_is_allowed(u32 device_id)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
>>> +             if (allowed_device_id[i] == device_id)
>>> +                     return true;
>>> +
>>> +     return false;
>>> +}
>>> +
>>> +static bool features_is_valid(u64 features)
>>> +{
>>> +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
>>> +             return false;
>>> +
>>> +     /* Now we only support read-only configuration space */
>>> +     if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
>>> +             return false;
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static bool vduse_validate_config(struct vduse_dev_config *config)
>>> +{
>>> +     if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->vq_align > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->config_size > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (!device_is_allowed(config->device_id))
>>> +             return false;
>>> +
>>> +     if (!features_is_valid(config->features))
>>> +             return false;
>>
>> Do we need to validate whether or not config_size is too small otherwise
>> we may have OOB access in get_config()?
>>
> How about adding validation in get_config()? It seems to be hard to
> define the lower bound.


It should work.

Thanks


>
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config,
>>> +                         void *config_buf, u64 api_version)
>>> +{
>>> +     int i, ret;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     ret = -EEXIST;
>>> +     if (vduse_find_dev(config->name))
>>> +             goto err;
>>> +
>>> +     ret = -ENOMEM;
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             goto err;
>>> +
>>> +     dev->api_version = api_version;
>>> +     dev->user_features = config->features;
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->name = kstrdup(config->name, GFP_KERNEL);
>>> +     if (!dev->name)
>>> +             goto err_str;
>>> +
>>> +     dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
>>> +                                       config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->config = config_buf;
>>> +     dev->config_size = config->config_size;
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_idr;
>>> +
>>> +     dev->minor = ret;
>>> +     dev->dev = device_create(vduse_class, NULL,
>>> +                              MKDEV(MAJOR(vduse_major), dev->minor),
>>> +                              NULL, "%s", config->name);
>>> +     if (IS_ERR(dev->dev)) {
>>> +             ret = PTR_ERR(dev->dev);
>>> +             goto err_dev;
>>> +     }
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_dev:
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +err_idr:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     kfree(dev->name);
>>> +err_str:
>>> +     vduse_dev_destroy(dev);
>>> +err:
>>> +     kvfree(config_buf);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_GET_API_VERSION:
>>> +             ret = put_user(control->api_version, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_SET_API_VERSION: {
>>> +             u64 api_version;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(api_version, (u64 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (api_version > VDUSE_API_VERSION)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             control->api_version = api_version;
>>> +             break;
>>> +     }
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +             unsigned long size = offsetof(struct vduse_dev_config, config);
>>> +             void *buf;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vduse_validate_config(&config) == false)
>>> +                     break;
>>> +
>>> +             buf = vmemdup_user(argp + size, config.config_size);
>>> +             if (IS_ERR(buf)) {
>>> +                     ret = PTR_ERR(buf);
>>> +                     break;
>>> +             }
>>> +             ret = vduse_create_dev(&config, buf, control->api_version);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     kfree(control);
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control;
>>> +
>>> +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
>>> +     if (!control)
>>> +             return -ENOMEM;
>>> +
>>> +     control->api_version = VDUSE_API_VERSION;
>>> +     file->private_data = control;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_ctrl_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_open,
>>> +     .release        = vduse_release,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev;
>>> +     int ret;
>>> +
>>> +     if (dev->vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
>>> +                              &vduse_vdpa_config_ops, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->vdev = vdev;
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret) {
>>> +             put_device(&vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev) {
>>> +             mutex_unlock(&vduse_lock);
>>> +             return -EINVAL;
>>> +     }
>>> +     ret = vduse_dev_init_vdpa(dev, name);
>>> +     mutex_unlock(&vduse_lock);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
>>> +     if (ret) {
>>> +             put_device(&dev->vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +     struct device *dev;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class))
>>> +             return PTR_ERR(vduse_class);
>>> +
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev_region;
>>> +
>>> +     /* /dev/vduse/control */
>>> +     cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
>>> +     vduse_ctrl_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
>>> +     if (ret)
>>> +             goto err_ctrl_cdev;
>>> +
>>> +     dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
>>> +     if (IS_ERR(dev)) {
>>> +             ret = PTR_ERR(dev);
>>> +             goto err_device;
>>> +     }
>>> +
>>> +     /* /dev/vduse/$DEVICE */
>>> +     cdev_init(&vduse_cdev, &vduse_dev_fops);
>>> +     vduse_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
>>> +                    VDUSE_DEV_MAX - 1);
>>> +     if (ret)
>>> +             goto err_cdev;
>>> +
>>> +     vduse_irq_wq = alloc_workqueue("vduse-irq",
>>> +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
>>> +     if (!vduse_irq_wq)
>>> +             goto err_wq;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +err_wq:
>>> +     cdev_del(&vduse_cdev);
>>> +err_cdev:
>>> +     device_destroy(vduse_class, vduse_major);
>>> +err_device:
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +err_ctrl_cdev:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev_region:
>>> +     class_destroy(vduse_class);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     vduse_mgmtdev_exit();
>>> +     vduse_domain_exit();
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +     cdev_del(&vduse_cdev);
>>> +     device_destroy(vduse_class, vduse_major);
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     class_destroy(vduse_class);
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..f21b2e51b5c8
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,143 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_API_VERSION    0
>>> +
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     /* Get the state for virtqueue from userspace */
>>> +     VDUSE_GET_VQ_STATE,
>>> +     /* Notify userspace to start the dataplane, no reply */
>>> +     VDUSE_START_DATAPLANE,
>>> +     /* Notify userspace to stop the dataplane, no reply */
>>> +     VDUSE_STOP_DATAPLANE,
>>> +     /* Notify userspace to update the memory mapping in device IOTLB */
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u32 avail_idx; /* virtqueue state (last_avail_idx) */
>>> +};
>>
>> This needs some tweaks to support packed virtqueue.
>>
> OK.
>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-22  5:06         ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-22  5:06 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Stefano Garzarella, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä


在 2021/6/21 下午6:41, Yongji Xie 写道:
> On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/6/15 下午10:13, Xie Yongji 写道:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> The vDPA device's control path is handled in kernel and the data
>>> path is handled in userspace.
>>>
>>> A message mechnism is used by VDUSE driver to forward some control
>>> messages such as starting/stopping datapath to userspace. Userspace
>>> can use read()/write() to receive/reply those control messages.
>>>
>>> And some ioctls are introduced to help userspace to implement the
>>> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
>>> descriptors referring to vDPA device's iova regions. Then userspace
>>> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
>>> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
>>> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
>>> ioctls can be used to inject interrupt and setup the kickfd for
>>> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
>>> configuration space and inject a config interrupt.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  143 ++
>>>    6 files changed, 1613 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index 9bfc2b510c64..acd95e9dcfe7 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index a503c1b2bfd9..6e23bce6433a 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>>>          vDPA block device simulator which terminates IO request in a
>>>          memory buffer.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index 67fe7f3d6943..f02ebed33f19 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,6 +1,7 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..5271cbd15e28
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1453 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <linux/nospec.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <uapi/linux/virtio_ids.h>
>>> +#include <uapi/linux/virtio_blk.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
>>> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
>>> +#define VDUSE_REQUEST_TIMEOUT 30
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     u32 num;
>>> +     u32 avail_idx;
>>> +     u64 desc_addr;
>>> +     u64 driver_addr;
>>> +     u64 device_addr;
>>> +     bool ready;
>>> +     bool kicked;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +     struct work_struct inject;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device *dev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     char *name;
>>> +     struct mutex lock;
>>> +     spinlock_t msg_lock;
>>> +     u64 msg_unique;
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct vdpa_callback config_cb;
>>> +     struct work_struct inject;
>>> +     spinlock_t irq_lock;
>>> +     int minor;
>>> +     bool connected;
>>> +     bool started;
>>> +     u64 api_version;
>>> +     u64 user_features;
>>
>> Let's use device_features.
>>
> OK.
>
>>> +     u64 features;
>>
>> And driver features.
>>
> OK.
>
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +     u32 generation;
>>> +     u32 config_size;
>>> +     void *config;
>>> +     u8 status;
>>> +     u16 vq_size_max;
>>> +     u32 vq_num;
>>> +     u32 vq_align;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +struct vduse_control {
>>> +     u64 api_version;
>>> +};
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static DEFINE_IDR(vduse_idr);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +static struct cdev vduse_ctrl_cdev;
>>> +static struct cdev vduse_cdev;
>>> +static struct workqueue_struct *vduse_irq_wq;
>>> +
>>> +static u32 allowed_device_id[] = {
>>> +     VIRTIO_ID_BLOCK,
>>> +};
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t request_id)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     list_for_each_entry(msg, head, list) {
>>> +             if (msg->req.request_id == request_id) {
>>> +                     list_del(&msg->list);
>>> +                     return msg;
>>> +             }
>>> +     }
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_send(struct vduse_dev *dev,
>>> +                           struct vduse_dev_msg *msg, bool no_reply)
>>> +{
>>
>> It looks to me the only user for no_reply=true is the dataplane start. I
>> wonder no_reply is really needed consider we have switched to use
>> wait_event_killable_timeout().
>>
> Do we need to handle the error in this case if we remove the no_reply
> flag. Print a warning message?


See below.


>
>> In another way, no_reply is false for vq state synchronization and IOTLB
>> updating. I wonder if we can simply use no_reply = true for them.
>>
> Looks like we can't, e.g. we need to get a reply from userspace for vq state.


Right.


>
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg->req.request_id = dev->msg_unique++;
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     if (no_reply)
>>> +             return 0;
>>> +
>>> +     wait_event_killable_timeout(msg->waitq, msg->completed,
>>> +                                 VDUSE_REQUEST_TIMEOUT * HZ);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (!msg->completed) {
>>> +             list_del(&msg->list);
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
>>
>> Do we need to serialize the check by protecting it with the spinlock above?
>>
> Good point.
>
>>> +}
>>> +
>>> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
>>> +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +                     kfree(msg);
>>> +             else
>>> +                     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +     }
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +             msg->completed = 1;
>>> +             wake_up(&msg->waitq);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +}
>>> +
>>> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_START_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_STOP_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>
>> Can we simply use this flag instead of introducing a new parameter
>> (no_reply) in vduse_dev_msg_send()?
>>
> Looks good to me.
>
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                               struct vduse_virtqueue *vq,
>>> +                               struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>
>> Note that I post a series that implement the packed virtqueue support:
>>
>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
>>
>> So this patch needs to be updated as well.
>>
> Will do it.
>
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_send(dev, &msg, false);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     state->avail_index = msg.resp.vq_state.avail_idx;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
>>> +     msg.req.iova.start = start;
>>> +     msg.req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_send(dev, &msg, false);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             spin_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             spin_lock(&dev->msg_lock);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +             kfree(msg);
>>> +     else
>>> +             vduse_enqueue_msg(&dev->recv_list, msg);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -ENOENT;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +     if (!list_empty(&dev->recv_list))
>>> +             mask |= EPOLLOUT | EPOLLWRNORM;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +     struct vduse_iova_domain *domain = dev->domain;
>>> +
>>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
>>> +     if (domain->bounce_map)
>>> +             vduse_domain_reset_bounce_map(domain);
>>> +
>>> +     dev->features = 0;
>>> +     dev->generation++;
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = NULL;
>>> +     dev->config_cb.private = NULL;
>>> +     spin_unlock(&dev->irq_lock);
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             vq->ready = false;
>>> +             vq->desc_addr = 0;
>>> +             vq->driver_addr = 0;
>>> +             vq->device_addr = 0;
>>> +             vq->avail_idx = 0;
>>> +             vq->num = 0;
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             vq->kicked = false;
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->desc_addr = desc_area;
>>> +     vq->driver_addr = driver_area;
>>> +     vq->device_addr = device_area;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (!vq->ready)
>>> +             goto unlock;
>>> +
>>> +     if (vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     else
>>> +             vq->kicked = true;
>>> +unlock:
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->num = num;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->avail_idx = state->avail_index;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->user_features;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->features = features;
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = cb->callback;
>>> +     dev->config_cb.private = cb->private;
>>> +     spin_unlock(&dev->irq_lock);
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->status;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
>>> +
>>> +     dev->status = status;
>>> +
>>> +     if (dev->started == started)
>>> +             return;
>>
>> If we check dev->status == status, (or only check the DRIVER_OK bit)
>> then there's no need to introduce an extra dev->started.
>>
> Will do it.
>
>>> +
>>> +     dev->started = started;
>>> +     if (dev->started) {
>>> +             vduse_dev_start_dataplane(dev);
>>> +     } else {
>>> +             vduse_dev_reset(dev);
>>> +             vduse_dev_stop_dataplane(dev);
>>
>> I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA,
>> we have bouncing buffers so it's harmless if usersapce dataplane keeps
>> performing read/write. For vhost-vDPA we don't have such stuffs.
>>
> OK. So it still needs to be synchronized here. If so, how to handle
> the error? Looks like printing a warning message should be enough.


We need fix a way to propagate the error to the userspace.

E.g if we want to stop the deivce, we will delay the status reset until 
we get respose from the userspace?


>
>>> +     }
>>> +}
>>> +
>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->config_size;
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                               void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     memcpy(buf, dev->config + offset, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     /* Now we only support read-only configuration space */
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->generation;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     int ret;
>>> +
>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +     if (ret) {
>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .get_generation         = vduse_vdpa_get_generation,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                  unsigned long offset, size_t size,
>>> +                                  enum dma_data_direction dir,
>>> +                                  unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +}
>>> +
>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return domain->bounce_size;
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +     u32 index;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>> +     vq = &dev->vqs[index];
>>> +     if (eventfd->fd >= 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>> +             return 0;
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +             vq->kicked = false;
>>> +     }
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>> +
>>> +     spin_lock_irq(&dev->irq_lock);
>>> +     if (dev->config_cb.callback)
>>> +             dev->config_cb.callback(dev->config_cb.private);
>>> +     spin_unlock_irq(&dev->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virtqueue *vq = container_of(work,
>>> +                                     struct vduse_virtqueue, inject);
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb.callback)
>>> +             vq->cb.callback(vq->cb.private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                         unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct vduse_iova_domain *domain = dev->domain;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (entry.start > entry.last)
>>> +                     break;
>>> +
>>> +             spin_lock(&domain->iotlb_lock);
>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>> +                                           entry.start, entry.last);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             ret = -EINVAL;
>>> +             if (!f)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>> +             fput(f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DEV_GET_FEATURES:
>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>> +             struct vduse_config_update config;
>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>> +                                           buffer);
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (config.length == 0 ||
>>> +                 config.length > dev->config_size - config.offset)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>> +                                config.length))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>
>> I wonder if it's better to separate config interrupt out of config
>> update or we need document this.
>>
> I have documented it in the docs. Looks like a config update should be
> always followed by a config interrupt. I didn't find a case that uses
> them separately.


The uAPI doesn't prevent us from the following scenario:

update_config(mac[0], ..);
update_config(max[1], ..);

So it looks to me it's better to separate the config interrupt from the 
config updating.


>
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_GET_INFO: {
>>
>> Do we need to limit this only when DRIVER_OK is set?
>>
> Any reason to add this limitation?


Otherwise the vq is not fully initialized, e.g the desc_addr might not 
be correct.


>
>>> +             struct vduse_vq_info vq_info;
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_info.index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             vq_index = array_index_nospec(vq_info.index, dev->vq_num);
>>> +             vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
>>> +             vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
>>> +             vq_info.device_addr = dev->vqs[vq_index].device_addr;
>>> +             vq_info.num = dev->vqs[vq_index].num;
>>> +             vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
>>> +             vq_info.ready = dev->vqs[vq_index].ready;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_INJECT_IRQ: {
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(vq_index, (u32 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             vq_index = array_index_nospec(vq_index, dev->vq_num);
>>> +             queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     /* Make sure the inflight messages can processed after reconncection */
>>> +     list_splice_init(&dev->recv_list, &dev->send_list);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = idr_find(&vduse_idr, minor);
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     int ret;
>>> +     struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
>>> +
>>> +     if (!dev)
>>> +             return -ENODEV;
>>> +
>>> +     ret = -EBUSY;
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     mutex_init(&dev->lock);
>>> +     spin_lock_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     spin_lock_init(&dev->irq_lock);
>>> +
>>> +     INIT_WORK(&dev->inject, vduse_dev_irq_inject);
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int id;
>>> +
>>> +     idr_for_each_entry(&vduse_idr, dev, id)
>>> +             if (!strcmp(dev->name, name))
>>> +                     return dev;
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->vdev || dev->connected) {
>>> +             mutex_unlock(&dev->lock);
>>> +             return -EBUSY;
>>> +     }
>>> +     dev->connected = true;
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     vduse_dev_msg_cleanup(dev);
>>> +     device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +     kvfree(dev->config);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     kfree(dev->name);
>>> +     vduse_dev_destroy(dev);
>>> +     module_put(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static bool device_is_allowed(u32 device_id)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
>>> +             if (allowed_device_id[i] == device_id)
>>> +                     return true;
>>> +
>>> +     return false;
>>> +}
>>> +
>>> +static bool features_is_valid(u64 features)
>>> +{
>>> +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
>>> +             return false;
>>> +
>>> +     /* Now we only support read-only configuration space */
>>> +     if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
>>> +             return false;
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static bool vduse_validate_config(struct vduse_dev_config *config)
>>> +{
>>> +     if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->vq_align > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->config_size > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (!device_is_allowed(config->device_id))
>>> +             return false;
>>> +
>>> +     if (!features_is_valid(config->features))
>>> +             return false;
>>
>> Do we need to validate whether or not config_size is too small otherwise
>> we may have OOB access in get_config()?
>>
> How about adding validation in get_config()? It seems to be hard to
> define the lower bound.


It should work.

Thanks


>
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config,
>>> +                         void *config_buf, u64 api_version)
>>> +{
>>> +     int i, ret;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     ret = -EEXIST;
>>> +     if (vduse_find_dev(config->name))
>>> +             goto err;
>>> +
>>> +     ret = -ENOMEM;
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             goto err;
>>> +
>>> +     dev->api_version = api_version;
>>> +     dev->user_features = config->features;
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->name = kstrdup(config->name, GFP_KERNEL);
>>> +     if (!dev->name)
>>> +             goto err_str;
>>> +
>>> +     dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
>>> +                                       config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->config = config_buf;
>>> +     dev->config_size = config->config_size;
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_idr;
>>> +
>>> +     dev->minor = ret;
>>> +     dev->dev = device_create(vduse_class, NULL,
>>> +                              MKDEV(MAJOR(vduse_major), dev->minor),
>>> +                              NULL, "%s", config->name);
>>> +     if (IS_ERR(dev->dev)) {
>>> +             ret = PTR_ERR(dev->dev);
>>> +             goto err_dev;
>>> +     }
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_dev:
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +err_idr:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     kfree(dev->name);
>>> +err_str:
>>> +     vduse_dev_destroy(dev);
>>> +err:
>>> +     kvfree(config_buf);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_GET_API_VERSION:
>>> +             ret = put_user(control->api_version, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_SET_API_VERSION: {
>>> +             u64 api_version;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(api_version, (u64 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (api_version > VDUSE_API_VERSION)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             control->api_version = api_version;
>>> +             break;
>>> +     }
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +             unsigned long size = offsetof(struct vduse_dev_config, config);
>>> +             void *buf;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vduse_validate_config(&config) == false)
>>> +                     break;
>>> +
>>> +             buf = vmemdup_user(argp + size, config.config_size);
>>> +             if (IS_ERR(buf)) {
>>> +                     ret = PTR_ERR(buf);
>>> +                     break;
>>> +             }
>>> +             ret = vduse_create_dev(&config, buf, control->api_version);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     kfree(control);
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control;
>>> +
>>> +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
>>> +     if (!control)
>>> +             return -ENOMEM;
>>> +
>>> +     control->api_version = VDUSE_API_VERSION;
>>> +     file->private_data = control;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_ctrl_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_open,
>>> +     .release        = vduse_release,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev;
>>> +     int ret;
>>> +
>>> +     if (dev->vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
>>> +                              &vduse_vdpa_config_ops, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->vdev = vdev;
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret) {
>>> +             put_device(&vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev) {
>>> +             mutex_unlock(&vduse_lock);
>>> +             return -EINVAL;
>>> +     }
>>> +     ret = vduse_dev_init_vdpa(dev, name);
>>> +     mutex_unlock(&vduse_lock);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
>>> +     if (ret) {
>>> +             put_device(&dev->vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +     struct device *dev;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class))
>>> +             return PTR_ERR(vduse_class);
>>> +
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev_region;
>>> +
>>> +     /* /dev/vduse/control */
>>> +     cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
>>> +     vduse_ctrl_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
>>> +     if (ret)
>>> +             goto err_ctrl_cdev;
>>> +
>>> +     dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
>>> +     if (IS_ERR(dev)) {
>>> +             ret = PTR_ERR(dev);
>>> +             goto err_device;
>>> +     }
>>> +
>>> +     /* /dev/vduse/$DEVICE */
>>> +     cdev_init(&vduse_cdev, &vduse_dev_fops);
>>> +     vduse_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
>>> +                    VDUSE_DEV_MAX - 1);
>>> +     if (ret)
>>> +             goto err_cdev;
>>> +
>>> +     vduse_irq_wq = alloc_workqueue("vduse-irq",
>>> +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
>>> +     if (!vduse_irq_wq)
>>> +             goto err_wq;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +err_wq:
>>> +     cdev_del(&vduse_cdev);
>>> +err_cdev:
>>> +     device_destroy(vduse_class, vduse_major);
>>> +err_device:
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +err_ctrl_cdev:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev_region:
>>> +     class_destroy(vduse_class);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     vduse_mgmtdev_exit();
>>> +     vduse_domain_exit();
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +     cdev_del(&vduse_cdev);
>>> +     device_destroy(vduse_class, vduse_major);
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     class_destroy(vduse_class);
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..f21b2e51b5c8
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,143 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_API_VERSION    0
>>> +
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     /* Get the state for virtqueue from userspace */
>>> +     VDUSE_GET_VQ_STATE,
>>> +     /* Notify userspace to start the dataplane, no reply */
>>> +     VDUSE_START_DATAPLANE,
>>> +     /* Notify userspace to stop the dataplane, no reply */
>>> +     VDUSE_STOP_DATAPLANE,
>>> +     /* Notify userspace to update the memory mapping in device IOTLB */
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u32 avail_idx; /* virtqueue state (last_avail_idx) */
>>> +};
>>
>> This needs some tweaks to support packed virtqueue.
>>
> OK.
>
> Thanks,
> Yongji
>

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-22  5:06         ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-22  5:06 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


在 2021/6/21 下午6:41, Yongji Xie 写道:
> On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/6/15 下午10:13, Xie Yongji 写道:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> The vDPA device's control path is handled in kernel and the data
>>> path is handled in userspace.
>>>
>>> A message mechnism is used by VDUSE driver to forward some control
>>> messages such as starting/stopping datapath to userspace. Userspace
>>> can use read()/write() to receive/reply those control messages.
>>>
>>> And some ioctls are introduced to help userspace to implement the
>>> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
>>> descriptors referring to vDPA device's iova regions. Then userspace
>>> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
>>> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
>>> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
>>> ioctls can be used to inject interrupt and setup the kickfd for
>>> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
>>> configuration space and inject a config interrupt.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  143 ++
>>>    6 files changed, 1613 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index 9bfc2b510c64..acd95e9dcfe7 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index a503c1b2bfd9..6e23bce6433a 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>>>          vDPA block device simulator which terminates IO request in a
>>>          memory buffer.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index 67fe7f3d6943..f02ebed33f19 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,6 +1,7 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..5271cbd15e28
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1453 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <linux/nospec.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <uapi/linux/virtio_ids.h>
>>> +#include <uapi/linux/virtio_blk.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
>>> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
>>> +#define VDUSE_REQUEST_TIMEOUT 30
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     u32 num;
>>> +     u32 avail_idx;
>>> +     u64 desc_addr;
>>> +     u64 driver_addr;
>>> +     u64 device_addr;
>>> +     bool ready;
>>> +     bool kicked;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +     struct work_struct inject;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device *dev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     char *name;
>>> +     struct mutex lock;
>>> +     spinlock_t msg_lock;
>>> +     u64 msg_unique;
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct vdpa_callback config_cb;
>>> +     struct work_struct inject;
>>> +     spinlock_t irq_lock;
>>> +     int minor;
>>> +     bool connected;
>>> +     bool started;
>>> +     u64 api_version;
>>> +     u64 user_features;
>>
>> Let's use device_features.
>>
> OK.
>
>>> +     u64 features;
>>
>> And driver features.
>>
> OK.
>
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +     u32 generation;
>>> +     u32 config_size;
>>> +     void *config;
>>> +     u8 status;
>>> +     u16 vq_size_max;
>>> +     u32 vq_num;
>>> +     u32 vq_align;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +struct vduse_control {
>>> +     u64 api_version;
>>> +};
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static DEFINE_IDR(vduse_idr);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +static struct cdev vduse_ctrl_cdev;
>>> +static struct cdev vduse_cdev;
>>> +static struct workqueue_struct *vduse_irq_wq;
>>> +
>>> +static u32 allowed_device_id[] = {
>>> +     VIRTIO_ID_BLOCK,
>>> +};
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t request_id)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     list_for_each_entry(msg, head, list) {
>>> +             if (msg->req.request_id == request_id) {
>>> +                     list_del(&msg->list);
>>> +                     return msg;
>>> +             }
>>> +     }
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_send(struct vduse_dev *dev,
>>> +                           struct vduse_dev_msg *msg, bool no_reply)
>>> +{
>>
>> It looks to me the only user for no_reply=true is the dataplane start. I
>> wonder no_reply is really needed consider we have switched to use
>> wait_event_killable_timeout().
>>
> Do we need to handle the error in this case if we remove the no_reply
> flag. Print a warning message?


See below.


>
>> In another way, no_reply is false for vq state synchronization and IOTLB
>> updating. I wonder if we can simply use no_reply = true for them.
>>
> Looks like we can't, e.g. we need to get a reply from userspace for vq state.


Right.


>
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg->req.request_id = dev->msg_unique++;
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     if (no_reply)
>>> +             return 0;
>>> +
>>> +     wait_event_killable_timeout(msg->waitq, msg->completed,
>>> +                                 VDUSE_REQUEST_TIMEOUT * HZ);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (!msg->completed) {
>>> +             list_del(&msg->list);
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
>>
>> Do we need to serialize the check by protecting it with the spinlock above?
>>
> Good point.
>
>>> +}
>>> +
>>> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
>>> +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +                     kfree(msg);
>>> +             else
>>> +                     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +     }
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +             msg->completed = 1;
>>> +             wake_up(&msg->waitq);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +}
>>> +
>>> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_START_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_STOP_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>
>> Can we simply use this flag instead of introducing a new parameter
>> (no_reply) in vduse_dev_msg_send()?
>>
> Looks good to me.
>
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                               struct vduse_virtqueue *vq,
>>> +                               struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>
>> Note that I post a series that implement the packed virtqueue support:
>>
>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
>>
>> So this patch needs to be updated as well.
>>
> Will do it.
>
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_send(dev, &msg, false);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     state->avail_index = msg.resp.vq_state.avail_idx;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
>>> +     msg.req.iova.start = start;
>>> +     msg.req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_send(dev, &msg, false);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             spin_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             spin_lock(&dev->msg_lock);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +             kfree(msg);
>>> +     else
>>> +             vduse_enqueue_msg(&dev->recv_list, msg);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -ENOENT;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +     if (!list_empty(&dev->recv_list))
>>> +             mask |= EPOLLOUT | EPOLLWRNORM;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +     struct vduse_iova_domain *domain = dev->domain;
>>> +
>>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
>>> +     if (domain->bounce_map)
>>> +             vduse_domain_reset_bounce_map(domain);
>>> +
>>> +     dev->features = 0;
>>> +     dev->generation++;
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = NULL;
>>> +     dev->config_cb.private = NULL;
>>> +     spin_unlock(&dev->irq_lock);
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             vq->ready = false;
>>> +             vq->desc_addr = 0;
>>> +             vq->driver_addr = 0;
>>> +             vq->device_addr = 0;
>>> +             vq->avail_idx = 0;
>>> +             vq->num = 0;
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             vq->kicked = false;
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->desc_addr = desc_area;
>>> +     vq->driver_addr = driver_area;
>>> +     vq->device_addr = device_area;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (!vq->ready)
>>> +             goto unlock;
>>> +
>>> +     if (vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     else
>>> +             vq->kicked = true;
>>> +unlock:
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->num = num;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->avail_idx = state->avail_index;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->user_features;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->features = features;
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = cb->callback;
>>> +     dev->config_cb.private = cb->private;
>>> +     spin_unlock(&dev->irq_lock);
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->status;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
>>> +
>>> +     dev->status = status;
>>> +
>>> +     if (dev->started == started)
>>> +             return;
>>
>> If we check dev->status == status, (or only check the DRIVER_OK bit)
>> then there's no need to introduce an extra dev->started.
>>
> Will do it.
>
>>> +
>>> +     dev->started = started;
>>> +     if (dev->started) {
>>> +             vduse_dev_start_dataplane(dev);
>>> +     } else {
>>> +             vduse_dev_reset(dev);
>>> +             vduse_dev_stop_dataplane(dev);
>>
>> I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA,
>> we have bouncing buffers so it's harmless if usersapce dataplane keeps
>> performing read/write. For vhost-vDPA we don't have such stuffs.
>>
> OK. So it still needs to be synchronized here. If so, how to handle
> the error? Looks like printing a warning message should be enough.


We need fix a way to propagate the error to the userspace.

E.g if we want to stop the deivce, we will delay the status reset until 
we get respose from the userspace?


>
>>> +     }
>>> +}
>>> +
>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->config_size;
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                               void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     memcpy(buf, dev->config + offset, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     /* Now we only support read-only configuration space */
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->generation;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     int ret;
>>> +
>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +     if (ret) {
>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .get_generation         = vduse_vdpa_get_generation,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                  unsigned long offset, size_t size,
>>> +                                  enum dma_data_direction dir,
>>> +                                  unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +}
>>> +
>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return domain->bounce_size;
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +     u32 index;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>> +     vq = &dev->vqs[index];
>>> +     if (eventfd->fd >= 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>> +             return 0;
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +             vq->kicked = false;
>>> +     }
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>> +
>>> +     spin_lock_irq(&dev->irq_lock);
>>> +     if (dev->config_cb.callback)
>>> +             dev->config_cb.callback(dev->config_cb.private);
>>> +     spin_unlock_irq(&dev->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virtqueue *vq = container_of(work,
>>> +                                     struct vduse_virtqueue, inject);
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb.callback)
>>> +             vq->cb.callback(vq->cb.private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                         unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct vduse_iova_domain *domain = dev->domain;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (entry.start > entry.last)
>>> +                     break;
>>> +
>>> +             spin_lock(&domain->iotlb_lock);
>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>> +                                           entry.start, entry.last);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             ret = -EINVAL;
>>> +             if (!f)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>> +             fput(f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DEV_GET_FEATURES:
>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>> +             struct vduse_config_update config;
>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>> +                                           buffer);
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (config.length == 0 ||
>>> +                 config.length > dev->config_size - config.offset)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>> +                                config.length))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>
>> I wonder if it's better to separate config interrupt out of config
>> update or we need document this.
>>
> I have documented it in the docs. Looks like a config update should be
> always followed by a config interrupt. I didn't find a case that uses
> them separately.


The uAPI doesn't prevent us from the following scenario:

update_config(mac[0], ..);
update_config(max[1], ..);

So it looks to me it's better to separate the config interrupt from the 
config updating.


>
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_GET_INFO: {
>>
>> Do we need to limit this only when DRIVER_OK is set?
>>
> Any reason to add this limitation?


Otherwise the vq is not fully initialized, e.g the desc_addr might not 
be correct.


>
>>> +             struct vduse_vq_info vq_info;
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_info.index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             vq_index = array_index_nospec(vq_info.index, dev->vq_num);
>>> +             vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
>>> +             vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
>>> +             vq_info.device_addr = dev->vqs[vq_index].device_addr;
>>> +             vq_info.num = dev->vqs[vq_index].num;
>>> +             vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
>>> +             vq_info.ready = dev->vqs[vq_index].ready;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_INJECT_IRQ: {
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(vq_index, (u32 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             vq_index = array_index_nospec(vq_index, dev->vq_num);
>>> +             queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     /* Make sure the inflight messages can processed after reconncection */
>>> +     list_splice_init(&dev->recv_list, &dev->send_list);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = idr_find(&vduse_idr, minor);
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     int ret;
>>> +     struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
>>> +
>>> +     if (!dev)
>>> +             return -ENODEV;
>>> +
>>> +     ret = -EBUSY;
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     mutex_init(&dev->lock);
>>> +     spin_lock_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     spin_lock_init(&dev->irq_lock);
>>> +
>>> +     INIT_WORK(&dev->inject, vduse_dev_irq_inject);
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int id;
>>> +
>>> +     idr_for_each_entry(&vduse_idr, dev, id)
>>> +             if (!strcmp(dev->name, name))
>>> +                     return dev;
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->vdev || dev->connected) {
>>> +             mutex_unlock(&dev->lock);
>>> +             return -EBUSY;
>>> +     }
>>> +     dev->connected = true;
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     vduse_dev_msg_cleanup(dev);
>>> +     device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +     kvfree(dev->config);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     kfree(dev->name);
>>> +     vduse_dev_destroy(dev);
>>> +     module_put(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static bool device_is_allowed(u32 device_id)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
>>> +             if (allowed_device_id[i] == device_id)
>>> +                     return true;
>>> +
>>> +     return false;
>>> +}
>>> +
>>> +static bool features_is_valid(u64 features)
>>> +{
>>> +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
>>> +             return false;
>>> +
>>> +     /* Now we only support read-only configuration space */
>>> +     if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
>>> +             return false;
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static bool vduse_validate_config(struct vduse_dev_config *config)
>>> +{
>>> +     if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->vq_align > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->config_size > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (!device_is_allowed(config->device_id))
>>> +             return false;
>>> +
>>> +     if (!features_is_valid(config->features))
>>> +             return false;
>>
>> Do we need to validate whether or not config_size is too small otherwise
>> we may have OOB access in get_config()?
>>
> How about adding validation in get_config()? It seems to be hard to
> define the lower bound.


It should work.

Thanks


>
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config,
>>> +                         void *config_buf, u64 api_version)
>>> +{
>>> +     int i, ret;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     ret = -EEXIST;
>>> +     if (vduse_find_dev(config->name))
>>> +             goto err;
>>> +
>>> +     ret = -ENOMEM;
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             goto err;
>>> +
>>> +     dev->api_version = api_version;
>>> +     dev->user_features = config->features;
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->name = kstrdup(config->name, GFP_KERNEL);
>>> +     if (!dev->name)
>>> +             goto err_str;
>>> +
>>> +     dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
>>> +                                       config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->config = config_buf;
>>> +     dev->config_size = config->config_size;
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_idr;
>>> +
>>> +     dev->minor = ret;
>>> +     dev->dev = device_create(vduse_class, NULL,
>>> +                              MKDEV(MAJOR(vduse_major), dev->minor),
>>> +                              NULL, "%s", config->name);
>>> +     if (IS_ERR(dev->dev)) {
>>> +             ret = PTR_ERR(dev->dev);
>>> +             goto err_dev;
>>> +     }
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_dev:
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +err_idr:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     kfree(dev->name);
>>> +err_str:
>>> +     vduse_dev_destroy(dev);
>>> +err:
>>> +     kvfree(config_buf);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_GET_API_VERSION:
>>> +             ret = put_user(control->api_version, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_SET_API_VERSION: {
>>> +             u64 api_version;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(api_version, (u64 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (api_version > VDUSE_API_VERSION)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             control->api_version = api_version;
>>> +             break;
>>> +     }
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +             unsigned long size = offsetof(struct vduse_dev_config, config);
>>> +             void *buf;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vduse_validate_config(&config) == false)
>>> +                     break;
>>> +
>>> +             buf = vmemdup_user(argp + size, config.config_size);
>>> +             if (IS_ERR(buf)) {
>>> +                     ret = PTR_ERR(buf);
>>> +                     break;
>>> +             }
>>> +             ret = vduse_create_dev(&config, buf, control->api_version);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     kfree(control);
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control;
>>> +
>>> +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
>>> +     if (!control)
>>> +             return -ENOMEM;
>>> +
>>> +     control->api_version = VDUSE_API_VERSION;
>>> +     file->private_data = control;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_ctrl_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_open,
>>> +     .release        = vduse_release,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev;
>>> +     int ret;
>>> +
>>> +     if (dev->vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
>>> +                              &vduse_vdpa_config_ops, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->vdev = vdev;
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret) {
>>> +             put_device(&vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev) {
>>> +             mutex_unlock(&vduse_lock);
>>> +             return -EINVAL;
>>> +     }
>>> +     ret = vduse_dev_init_vdpa(dev, name);
>>> +     mutex_unlock(&vduse_lock);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
>>> +     if (ret) {
>>> +             put_device(&dev->vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +     struct device *dev;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class))
>>> +             return PTR_ERR(vduse_class);
>>> +
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev_region;
>>> +
>>> +     /* /dev/vduse/control */
>>> +     cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
>>> +     vduse_ctrl_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
>>> +     if (ret)
>>> +             goto err_ctrl_cdev;
>>> +
>>> +     dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
>>> +     if (IS_ERR(dev)) {
>>> +             ret = PTR_ERR(dev);
>>> +             goto err_device;
>>> +     }
>>> +
>>> +     /* /dev/vduse/$DEVICE */
>>> +     cdev_init(&vduse_cdev, &vduse_dev_fops);
>>> +     vduse_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
>>> +                    VDUSE_DEV_MAX - 1);
>>> +     if (ret)
>>> +             goto err_cdev;
>>> +
>>> +     vduse_irq_wq = alloc_workqueue("vduse-irq",
>>> +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
>>> +     if (!vduse_irq_wq)
>>> +             goto err_wq;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +err_wq:
>>> +     cdev_del(&vduse_cdev);
>>> +err_cdev:
>>> +     device_destroy(vduse_class, vduse_major);
>>> +err_device:
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +err_ctrl_cdev:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev_region:
>>> +     class_destroy(vduse_class);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     vduse_mgmtdev_exit();
>>> +     vduse_domain_exit();
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +     cdev_del(&vduse_cdev);
>>> +     device_destroy(vduse_class, vduse_major);
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     class_destroy(vduse_class);
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..f21b2e51b5c8
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,143 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_API_VERSION    0
>>> +
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     /* Get the state for virtqueue from userspace */
>>> +     VDUSE_GET_VQ_STATE,
>>> +     /* Notify userspace to start the dataplane, no reply */
>>> +     VDUSE_START_DATAPLANE,
>>> +     /* Notify userspace to stop the dataplane, no reply */
>>> +     VDUSE_STOP_DATAPLANE,
>>> +     /* Notify userspace to update the memory mapping in device IOTLB */
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u32 avail_idx; /* virtqueue state (last_avail_idx) */
>>> +};
>>
>> This needs some tweaks to support packed virtqueue.
>>
> OK.
>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-22  5:06         ` Jason Wang
@ 2021-06-22  7:22           ` Yongji Xie
  -1 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-22  7:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, Al Viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel

.On Tue, Jun 22, 2021 at 1:07 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/21 下午6:41, Yongji Xie 写道:
> > On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/6/15 下午10:13, Xie Yongji 写道:
> >>> This VDUSE driver enables implementing vDPA devices in userspace.
> >>> The vDPA device's control path is handled in kernel and the data
> >>> path is handled in userspace.
> >>>
> >>> A message mechnism is used by VDUSE driver to forward some control
> >>> messages such as starting/stopping datapath to userspace. Userspace
> >>> can use read()/write() to receive/reply those control messages.
> >>>
> >>> And some ioctls are introduced to help userspace to implement the
> >>> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> >>> descriptors referring to vDPA device's iova regions. Then userspace
> >>> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> >>> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> >>> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> >>> ioctls can be used to inject interrupt and setup the kickfd for
> >>> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> >>> configuration space and inject a config interrupt.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >>>    drivers/vdpa/Kconfig                               |   10 +
> >>>    drivers/vdpa/Makefile                              |    1 +
> >>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
> >>>    include/uapi/linux/vduse.h                         |  143 ++
> >>>    6 files changed, 1613 insertions(+)
> >>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >>>    create mode 100644 include/uapi/linux/vduse.h
> >>>
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index 9bfc2b510c64..acd95e9dcfe7 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >>>    '|'   00-7F  linux/media.h
> >>>    0x80  00-1F  linux/fb.h
> >>> +0x81  00-1F  linux/vduse.h
> >>>    0x89  00-06  arch/x86/include/asm/sockios.h
> >>>    0x89  0B-DF  linux/sockios.h
> >>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> >>> index a503c1b2bfd9..6e23bce6433a 100644
> >>> --- a/drivers/vdpa/Kconfig
> >>> +++ b/drivers/vdpa/Kconfig
> >>> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
> >>>          vDPA block device simulator which terminates IO request in a
> >>>          memory buffer.
> >>>
> >>> +config VDPA_USER
> >>> +     tristate "VDUSE (vDPA Device in Userspace) support"
> >>> +     depends on EVENTFD && MMU && HAS_DMA
> >>> +     select DMA_OPS
> >>> +     select VHOST_IOTLB
> >>> +     select IOMMU_IOVA
> >>> +     help
> >>> +       With VDUSE it is possible to emulate a vDPA Device
> >>> +       in a userspace program.
> >>> +
> >>>    config IFCVF
> >>>        tristate "Intel IFC VF vDPA driver"
> >>>        depends on PCI_MSI
> >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> >>> index 67fe7f3d6943..f02ebed33f19 100644
> >>> --- a/drivers/vdpa/Makefile
> >>> +++ b/drivers/vdpa/Makefile
> >>> @@ -1,6 +1,7 @@
> >>>    # SPDX-License-Identifier: GPL-2.0
> >>>    obj-$(CONFIG_VDPA) += vdpa.o
> >>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >>>    obj-$(CONFIG_IFCVF)    += ifcvf/
> >>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> >>> new file mode 100644
> >>> index 000000000000..260e0b26af99
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/Makefile
> >>> @@ -0,0 +1,5 @@
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +vduse-y := vduse_dev.o iova_domain.o
> >>> +
> >>> +obj-$(CONFIG_VDPA_USER) += vduse.o
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> new file mode 100644
> >>> index 000000000000..5271cbd15e28
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> @@ -0,0 +1,1453 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * VDUSE: vDPA Device in Userspace
> >>> + *
> >>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/init.h>
> >>> +#include <linux/module.h>
> >>> +#include <linux/cdev.h>
> >>> +#include <linux/device.h>
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/dma-map-ops.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/file.h>
> >>> +#include <linux/uio.h>
> >>> +#include <linux/vdpa.h>
> >>> +#include <linux/nospec.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +#include <uapi/linux/vdpa.h>
> >>> +#include <uapi/linux/virtio_config.h>
> >>> +#include <uapi/linux/virtio_ids.h>
> >>> +#include <uapi/linux/virtio_blk.h>
> >>> +#include <linux/mod_devicetable.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> >>> +#define DRV_DESC     "vDPA Device in Userspace"
> >>> +#define DRV_LICENSE  "GPL v2"
> >>> +
> >>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> >>> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> >>> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> >>> +#define VDUSE_REQUEST_TIMEOUT 30
> >>> +
> >>> +struct vduse_virtqueue {
> >>> +     u16 index;
> >>> +     u32 num;
> >>> +     u32 avail_idx;
> >>> +     u64 desc_addr;
> >>> +     u64 driver_addr;
> >>> +     u64 device_addr;
> >>> +     bool ready;
> >>> +     bool kicked;
> >>> +     spinlock_t kick_lock;
> >>> +     spinlock_t irq_lock;
> >>> +     struct eventfd_ctx *kickfd;
> >>> +     struct vdpa_callback cb;
> >>> +     struct work_struct inject;
> >>> +};
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_vdpa {
> >>> +     struct vdpa_device vdpa;
> >>> +     struct vduse_dev *dev;
> >>> +};
> >>> +
> >>> +struct vduse_dev {
> >>> +     struct vduse_vdpa *vdev;
> >>> +     struct device *dev;
> >>> +     struct vduse_virtqueue *vqs;
> >>> +     struct vduse_iova_domain *domain;
> >>> +     char *name;
> >>> +     struct mutex lock;
> >>> +     spinlock_t msg_lock;
> >>> +     u64 msg_unique;
> >>> +     wait_queue_head_t waitq;
> >>> +     struct list_head send_list;
> >>> +     struct list_head recv_list;
> >>> +     struct vdpa_callback config_cb;
> >>> +     struct work_struct inject;
> >>> +     spinlock_t irq_lock;
> >>> +     int minor;
> >>> +     bool connected;
> >>> +     bool started;
> >>> +     u64 api_version;
> >>> +     u64 user_features;
> >>
> >> Let's use device_features.
> >>
> > OK.
> >
> >>> +     u64 features;
> >>
> >> And driver features.
> >>
> > OK.
> >
> >>> +     u32 device_id;
> >>> +     u32 vendor_id;
> >>> +     u32 generation;
> >>> +     u32 config_size;
> >>> +     void *config;
> >>> +     u8 status;
> >>> +     u16 vq_size_max;
> >>> +     u32 vq_num;
> >>> +     u32 vq_align;
> >>> +};
> >>> +
> >>> +struct vduse_dev_msg {
> >>> +     struct vduse_dev_request req;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct list_head list;
> >>> +     wait_queue_head_t waitq;
> >>> +     bool completed;
> >>> +};
> >>> +
> >>> +struct vduse_control {
> >>> +     u64 api_version;
> >>> +};
> >>> +
> >>> +static DEFINE_MUTEX(vduse_lock);
> >>> +static DEFINE_IDR(vduse_idr);
> >>> +
> >>> +static dev_t vduse_major;
> >>> +static struct class *vduse_class;
> >>> +static struct cdev vduse_ctrl_cdev;
> >>> +static struct cdev vduse_cdev;
> >>> +static struct workqueue_struct *vduse_irq_wq;
> >>> +
> >>> +static u32 allowed_device_id[] = {
> >>> +     VIRTIO_ID_BLOCK,
> >>> +};
> >>> +
> >>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> >>> +
> >>> +     return vdev->dev;
> >>> +}
> >>> +
> >>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> >>> +{
> >>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> >>> +
> >>> +     return vdpa_to_vduse(vdpa);
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> >>> +                                         uint32_t request_id)
> >>> +{
> >>> +     struct vduse_dev_msg *msg;
> >>> +
> >>> +     list_for_each_entry(msg, head, list) {
> >>> +             if (msg->req.request_id == request_id) {
> >>> +                     list_del(&msg->list);
> >>> +                     return msg;
> >>> +             }
> >>> +     }
> >>> +
> >>> +     return NULL;
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = NULL;
> >>> +
> >>> +     if (!list_empty(head)) {
> >>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> >>> +             list_del(&msg->list);
> >>> +     }
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static void vduse_enqueue_msg(struct list_head *head,
> >>> +                           struct vduse_dev_msg *msg)
> >>> +{
> >>> +     list_add_tail(&msg->list, head);
> >>> +}
> >>> +
> >>> +static int vduse_dev_msg_send(struct vduse_dev *dev,
> >>> +                           struct vduse_dev_msg *msg, bool no_reply)
> >>> +{
> >>
> >> It looks to me the only user for no_reply=true is the dataplane start. I
> >> wonder no_reply is really needed consider we have switched to use
> >> wait_event_killable_timeout().
> >>
> > Do we need to handle the error in this case if we remove the no_reply
> > flag. Print a warning message?
>
>
> See below.
>
>
> >
> >> In another way, no_reply is false for vq state synchronization and IOTLB
> >> updating. I wonder if we can simply use no_reply = true for them.
> >>
> > Looks like we can't, e.g. we need to get a reply from userspace for vq state.
>
>
> Right.
>
>
> >
> >>> +     init_waitqueue_head(&msg->waitq);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     msg->req.request_id = dev->msg_unique++;
> >>> +     vduse_enqueue_msg(&dev->send_list, msg);
> >>> +     wake_up(&dev->waitq);
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +     if (no_reply)
> >>> +             return 0;
> >>> +
> >>> +     wait_event_killable_timeout(msg->waitq, msg->completed,
> >>> +                                 VDUSE_REQUEST_TIMEOUT * HZ);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     if (!msg->completed) {
> >>> +             list_del(&msg->list);
> >>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
> >>
> >> Do we need to serialize the check by protecting it with the spinlock above?
> >>
> > Good point.
> >
> >>> +}
> >>> +
> >>> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> >>> +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> >>> +                     kfree(msg);
> >>> +             else
> >>> +                     vduse_enqueue_msg(&dev->recv_list, msg);
> >>> +     }
> >>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> >>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> >>> +             msg->completed = 1;
> >>> +             wake_up(&msg->waitq);
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +}
> >>> +
> >>> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> >>> +                                         GFP_KERNEL | __GFP_NOFAIL);
> >>> +
> >>> +     msg->req.type = VDUSE_START_DATAPLANE;
> >>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> >>> +     vduse_dev_msg_send(dev, msg, true);
> >>> +}
> >>> +
> >>> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> >>> +                                         GFP_KERNEL | __GFP_NOFAIL);
> >>> +
> >>> +     msg->req.type = VDUSE_STOP_DATAPLANE;
> >>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> >>
> >> Can we simply use this flag instead of introducing a new parameter
> >> (no_reply) in vduse_dev_msg_send()?
> >>
> > Looks good to me.
> >
> >>> +     vduse_dev_msg_send(dev, msg, true);
> >>> +}
> >>> +
> >>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> >>> +                               struct vduse_virtqueue *vq,
> >>> +                               struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     int ret;
> >>
> >> Note that I post a series that implement the packed virtqueue support:
> >>
> >> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
> >>
> >> So this patch needs to be updated as well.
> >>
> > Will do it.
> >
> >>> +
> >>> +     msg.req.type = VDUSE_GET_VQ_STATE;
> >>> +     msg.req.vq_state.index = vq->index;
> >>> +
> >>> +     ret = vduse_dev_msg_send(dev, &msg, false);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     state->avail_index = msg.resp.vq_state.avail_idx;
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> >>> +                             u64 start, u64 last)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     if (last < start)
> >>> +             return -EINVAL;
> >>> +
> >>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
> >>> +     msg.req.iova.start = start;
> >>> +     msg.req.iova.last = last;
> >>> +
> >>> +     return vduse_dev_msg_send(dev, &msg, false);
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int size = sizeof(struct vduse_dev_request);
> >>> +     ssize_t ret;
> >>> +
> >>> +     if (iov_iter_count(to) < size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     while (1) {
> >>> +             msg = vduse_dequeue_msg(&dev->send_list);
> >>> +             if (msg)
> >>> +                     break;
> >>> +
> >>> +             ret = -EAGAIN;
> >>> +             if (file->f_flags & O_NONBLOCK)
> >>> +                     goto unlock;
> >>> +
> >>> +             spin_unlock(&dev->msg_lock);
> >>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
> >>> +                                     !list_empty(&dev->send_list));
> >>> +             if (ret)
> >>> +                     return ret;
> >>> +
> >>> +             spin_lock(&dev->msg_lock);
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +     ret = copy_to_iter(&msg->req, size, to);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     if (ret != size) {
> >>> +             ret = -EFAULT;
> >>> +             vduse_enqueue_msg(&dev->send_list, msg);
> >>> +             goto unlock;
> >>> +     }
> >>> +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> >>> +             kfree(msg);
> >>> +     else
> >>> +             vduse_enqueue_msg(&dev->recv_list, msg);
> >>> +unlock:
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     size_t ret;
> >>> +
> >>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
> >>> +     if (ret != sizeof(resp))
> >>> +             return -EINVAL;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> >>> +     if (!msg) {
> >>> +             ret = -ENOENT;
> >>> +             goto unlock;
> >>> +     }
> >>> +
> >>> +     memcpy(&msg->resp, &resp, sizeof(resp));
> >>> +     msg->completed = 1;
> >>> +     wake_up(&msg->waitq);
> >>> +unlock:
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     __poll_t mask = 0;
> >>> +
> >>> +     poll_wait(file, &dev->waitq, wait);
> >>> +
> >>> +     if (!list_empty(&dev->send_list))
> >>> +             mask |= EPOLLIN | EPOLLRDNORM;
> >>> +     if (!list_empty(&dev->recv_list))
> >>> +             mask |= EPOLLOUT | EPOLLWRNORM;
> >>> +
> >>> +     return mask;
> >>> +}
> >>> +
> >>> +static void vduse_dev_reset(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +     struct vduse_iova_domain *domain = dev->domain;
> >>> +
> >>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
> >>> +     if (domain->bounce_map)
> >>> +             vduse_domain_reset_bounce_map(domain);
> >>> +
> >>> +     dev->features = 0;
> >>> +     dev->generation++;
> >>> +     spin_lock(&dev->irq_lock);
> >>> +     dev->config_cb.callback = NULL;
> >>> +     dev->config_cb.private = NULL;
> >>> +     spin_unlock(&dev->irq_lock);
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             vq->ready = false;
> >>> +             vq->desc_addr = 0;
> >>> +             vq->driver_addr = 0;
> >>> +             vq->device_addr = 0;
> >>> +             vq->avail_idx = 0;
> >>> +             vq->num = 0;
> >>> +
> >>> +             spin_lock(&vq->kick_lock);
> >>> +             vq->kicked = false;
> >>> +             if (vq->kickfd)
> >>> +                     eventfd_ctx_put(vq->kickfd);
> >>> +             vq->kickfd = NULL;
> >>> +             spin_unlock(&vq->kick_lock);
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             vq->cb.callback = NULL;
> >>> +             vq->cb.private = NULL;
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >>> +                             u64 desc_area, u64 driver_area,
> >>> +                             u64 device_area)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->desc_addr = desc_area;
> >>> +     vq->driver_addr = driver_area;
> >>> +     vq->device_addr = device_area;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (!vq->ready)
> >>> +             goto unlock;
> >>> +
> >>> +     if (vq->kickfd)
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +     else
> >>> +             vq->kicked = true;
> >>> +unlock:
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> >>> +                           struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->irq_lock);
> >>> +     vq->cb.callback = cb->callback;
> >>> +     vq->cb.private = cb->private;
> >>> +     spin_unlock(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->num = num;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> >>> +                                     u16 idx, bool ready)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->ready = ready;
> >>> +}
> >>> +
> >>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vq->ready;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->avail_idx = state->avail_index;
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_get_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_align;
> >>> +}
> >>> +
> >>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->user_features;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     dev->features = features;
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> >>> +                               struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     spin_lock(&dev->irq_lock);
> >>> +     dev->config_cb.callback = cb->callback;
> >>> +     dev->config_cb.private = cb->private;
> >>> +     spin_unlock(&dev->irq_lock);
> >>> +}
> >>> +
> >>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_size_max;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->device_id;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vendor_id;
> >>> +}
> >>> +
> >>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->status;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> >>> +
> >>> +     dev->status = status;
> >>> +
> >>> +     if (dev->started == started)
> >>> +             return;
> >>
> >> If we check dev->status == status, (or only check the DRIVER_OK bit)
> >> then there's no need to introduce an extra dev->started.
> >>
> > Will do it.
> >
> >>> +
> >>> +     dev->started = started;
> >>> +     if (dev->started) {
> >>> +             vduse_dev_start_dataplane(dev);
> >>> +     } else {
> >>> +             vduse_dev_reset(dev);
> >>> +             vduse_dev_stop_dataplane(dev);
> >>
> >> I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA,
> >> we have bouncing buffers so it's harmless if usersapce dataplane keeps
> >> performing read/write. For vhost-vDPA we don't have such stuffs.
> >>
> > OK. So it still needs to be synchronized here. If so, how to handle
> > the error? Looks like printing a warning message should be enough.
>
>
> We need fix a way to propagate the error to the userspace.
>
> E.g if we want to stop the deivce, we will delay the status reset until
> we get respose from the userspace?
>

I didn't get how to delay the status reset. And should it be a DoS
that we want to fix if the userspace doesn't give a response forever?

>
> >
> >>> +     }
> >>> +}
> >>> +
> >>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->config_size;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                               void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     memcpy(buf, dev->config + offset, len);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                     const void *buf, unsigned int len)
> >>> +{
> >>> +     /* Now we only support read-only configuration space */
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->generation;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> >>> +                             struct vhost_iotlb *iotlb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     int ret;
> >>> +
> >>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> >>> +     if (ret) {
> >>> +             vduse_domain_clear_map(dev->domain, iotlb);
> >>> +             return ret;
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     dev->vdev = NULL;
> >>> +}
> >>> +
> >>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
> >>> +     .kick_vq                = vduse_vdpa_kick_vq,
> >>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> >>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
> >>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> >>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> >>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
> >>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
> >>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
> >>> +     .get_features           = vduse_vdpa_get_features,
> >>> +     .set_features           = vduse_vdpa_set_features,
> >>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
> >>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> >>> +     .get_device_id          = vduse_vdpa_get_device_id,
> >>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> >>> +     .get_status             = vduse_vdpa_get_status,
> >>> +     .set_status             = vduse_vdpa_set_status,
> >>> +     .get_config_size        = vduse_vdpa_get_config_size,
> >>> +     .get_config             = vduse_vdpa_get_config,
> >>> +     .set_config             = vduse_vdpa_set_config,
> >>> +     .get_generation         = vduse_vdpa_get_generation,
> >>> +     .set_map                = vduse_vdpa_set_map,
> >>> +     .free                   = vduse_vdpa_free,
> >>> +};
> >>> +
> >>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> >>> +                                  unsigned long offset, size_t size,
> >>> +                                  enum dma_data_direction dir,
> >>> +                                  unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> >>> +                                     dma_addr_t *dma_addr, gfp_t flag,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long iova;
> >>> +     void *addr;
> >>> +
> >>> +     *dma_addr = DMA_MAPPING_ERROR;
> >>> +     addr = vduse_domain_alloc_coherent(domain, size,
> >>> +                             (dma_addr_t *)&iova, flag, attrs);
> >>> +     if (!addr)
> >>> +             return NULL;
> >>> +
> >>> +     *dma_addr = (dma_addr_t)iova;
> >>> +
> >>> +     return addr;
> >>> +}
> >>> +
> >>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> >>> +                                     void *vaddr, dma_addr_t dma_addr,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> >>> +}
> >>> +
> >>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return domain->bounce_size;
> >>> +}
> >>> +
> >>> +static const struct dma_map_ops vduse_dev_dma_ops = {
> >>> +     .map_page = vduse_dev_map_page,
> >>> +     .unmap_page = vduse_dev_unmap_page,
> >>> +     .alloc = vduse_dev_alloc_coherent,
> >>> +     .free = vduse_dev_free_coherent,
> >>> +     .max_mapping_size = vduse_dev_max_mapping_size,
> >>> +};
> >>> +
> >>> +static unsigned int perm_to_file_flags(u8 perm)
> >>> +{
> >>> +     unsigned int flags = 0;
> >>> +
> >>> +     switch (perm) {
> >>> +     case VDUSE_ACCESS_WO:
> >>> +             flags |= O_WRONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RO:
> >>> +             flags |= O_RDONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RW:
> >>> +             flags |= O_RDWR;
> >>> +             break;
> >>> +     default:
> >>> +             WARN(1, "invalidate vhost IOTLB permission\n");
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return flags;
> >>> +}
> >>> +
> >>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct eventfd_ctx *ctx = NULL;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     u32 index;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
> >>> +     vq = &dev->vqs[index];
> >>> +     if (eventfd->fd >= 0) {
> >>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
> >>> +             if (IS_ERR(ctx))
> >>> +                     return PTR_ERR(ctx);
> >>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> >>> +             return 0;
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->kickfd)
> >>> +             eventfd_ctx_put(vq->kickfd);
> >>> +     vq->kickfd = ctx;
> >>> +     if (vq->ready && vq->kicked && vq->kickfd) {
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +             vq->kicked = false;
> >>> +     }
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_dev_irq_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> >>> +
> >>> +     spin_lock_irq(&dev->irq_lock);
> >>> +     if (dev->config_cb.callback)
> >>> +             dev->config_cb.callback(dev->config_cb.private);
> >>> +     spin_unlock_irq(&dev->irq_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vq_irq_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_virtqueue *vq = container_of(work,
> >>> +                                     struct vduse_virtqueue, inject);
> >>> +
> >>> +     spin_lock_irq(&vq->irq_lock);
> >>> +     if (vq->ready && vq->cb.callback)
> >>> +             vq->cb.callback(vq->cb.private);
> >>> +     spin_unlock_irq(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >>> +                         unsigned long arg)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +     int ret;
> >>> +
> >>> +     switch (cmd) {
> >>> +     case VDUSE_IOTLB_GET_FD: {
> >>> +             struct vduse_iotlb_entry entry;
> >>> +             struct vhost_iotlb_map *map;
> >>> +             struct vdpa_map_file *map_file;
> >>> +             struct vduse_iova_domain *domain = dev->domain;
> >>> +             struct file *f = NULL;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
> >>> +                     break;
> >>> +
> >>> +             ret = -EINVAL;
> >>> +             if (entry.start > entry.last)
> >>> +                     break;
> >>> +
> >>> +             spin_lock(&domain->iotlb_lock);
> >>> +             map = vhost_iotlb_itree_first(domain->iotlb,
> >>> +                                           entry.start, entry.last);
> >>> +             if (map) {
> >>> +                     map_file = (struct vdpa_map_file *)map->opaque;
> >>> +                     f = get_file(map_file->file);
> >>> +                     entry.offset = map_file->offset;
> >>> +                     entry.start = map->start;
> >>> +                     entry.last = map->last;
> >>> +                     entry.perm = map->perm;
> >>> +             }
> >>> +             spin_unlock(&domain->iotlb_lock);
> >>> +             ret = -EINVAL;
> >>> +             if (!f)
> >>> +                     break;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> >>> +                     fput(f);
> >>> +                     break;
> >>> +             }
> >>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
> >>> +             fput(f);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_DEV_GET_FEATURES:
> >>> +             ret = put_user(dev->features, (u64 __user *)argp);
> >>> +             break;
> >>> +     case VDUSE_DEV_UPDATE_CONFIG: {
> >>> +             struct vduse_config_update config;
> >>> +             unsigned long size = offsetof(struct vduse_config_update,
> >>> +                                           buffer);
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&config, argp, size))
> >>> +                     break;
> >>> +
> >>> +             ret = -EINVAL;
> >>> +             if (config.length == 0 ||
> >>> +                 config.length > dev->config_size - config.offset)
> >>> +                     break;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(dev->config + config.offset, argp + size,
> >>> +                                config.length))
> >>> +                     break;
> >>> +
> >>> +             ret = 0;
> >>> +             queue_work(vduse_irq_wq, &dev->inject);
> >>
> >> I wonder if it's better to separate config interrupt out of config
> >> update or we need document this.
> >>
> > I have documented it in the docs. Looks like a config update should be
> > always followed by a config interrupt. I didn't find a case that uses
> > them separately.
>
>
> The uAPI doesn't prevent us from the following scenario:
>
> update_config(mac[0], ..);
> update_config(max[1], ..);
>
> So it looks to me it's better to separate the config interrupt from the
> config updating.
>

Fine.

>
> >
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_VQ_GET_INFO: {
> >>
> >> Do we need to limit this only when DRIVER_OK is set?
> >>
> > Any reason to add this limitation?
>
>
> Otherwise the vq is not fully initialized, e.g the desc_addr might not
> be correct.
>

The vq_info->ready can be used to tell userspace whether the vq is
initialized or not.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-22  7:22           ` Yongji Xie
  0 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-22  7:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Stefano Garzarella, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä

.On Tue, Jun 22, 2021 at 1:07 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/21 下午6:41, Yongji Xie 写道:
> > On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/6/15 下午10:13, Xie Yongji 写道:
> >>> This VDUSE driver enables implementing vDPA devices in userspace.
> >>> The vDPA device's control path is handled in kernel and the data
> >>> path is handled in userspace.
> >>>
> >>> A message mechnism is used by VDUSE driver to forward some control
> >>> messages such as starting/stopping datapath to userspace. Userspace
> >>> can use read()/write() to receive/reply those control messages.
> >>>
> >>> And some ioctls are introduced to help userspace to implement the
> >>> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> >>> descriptors referring to vDPA device's iova regions. Then userspace
> >>> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> >>> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> >>> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> >>> ioctls can be used to inject interrupt and setup the kickfd for
> >>> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> >>> configuration space and inject a config interrupt.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>> ---
> >>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >>>    drivers/vdpa/Kconfig                               |   10 +
> >>>    drivers/vdpa/Makefile                              |    1 +
> >>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
> >>>    include/uapi/linux/vduse.h                         |  143 ++
> >>>    6 files changed, 1613 insertions(+)
> >>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >>>    create mode 100644 include/uapi/linux/vduse.h
> >>>
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index 9bfc2b510c64..acd95e9dcfe7 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >>>    '|'   00-7F  linux/media.h
> >>>    0x80  00-1F  linux/fb.h
> >>> +0x81  00-1F  linux/vduse.h
> >>>    0x89  00-06  arch/x86/include/asm/sockios.h
> >>>    0x89  0B-DF  linux/sockios.h
> >>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> >>> index a503c1b2bfd9..6e23bce6433a 100644
> >>> --- a/drivers/vdpa/Kconfig
> >>> +++ b/drivers/vdpa/Kconfig
> >>> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
> >>>          vDPA block device simulator which terminates IO request in a
> >>>          memory buffer.
> >>>
> >>> +config VDPA_USER
> >>> +     tristate "VDUSE (vDPA Device in Userspace) support"
> >>> +     depends on EVENTFD && MMU && HAS_DMA
> >>> +     select DMA_OPS
> >>> +     select VHOST_IOTLB
> >>> +     select IOMMU_IOVA
> >>> +     help
> >>> +       With VDUSE it is possible to emulate a vDPA Device
> >>> +       in a userspace program.
> >>> +
> >>>    config IFCVF
> >>>        tristate "Intel IFC VF vDPA driver"
> >>>        depends on PCI_MSI
> >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> >>> index 67fe7f3d6943..f02ebed33f19 100644
> >>> --- a/drivers/vdpa/Makefile
> >>> +++ b/drivers/vdpa/Makefile
> >>> @@ -1,6 +1,7 @@
> >>>    # SPDX-License-Identifier: GPL-2.0
> >>>    obj-$(CONFIG_VDPA) += vdpa.o
> >>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >>>    obj-$(CONFIG_IFCVF)    += ifcvf/
> >>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> >>> new file mode 100644
> >>> index 000000000000..260e0b26af99
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/Makefile
> >>> @@ -0,0 +1,5 @@
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +vduse-y := vduse_dev.o iova_domain.o
> >>> +
> >>> +obj-$(CONFIG_VDPA_USER) += vduse.o
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> new file mode 100644
> >>> index 000000000000..5271cbd15e28
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> @@ -0,0 +1,1453 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * VDUSE: vDPA Device in Userspace
> >>> + *
> >>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@bytedance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/init.h>
> >>> +#include <linux/module.h>
> >>> +#include <linux/cdev.h>
> >>> +#include <linux/device.h>
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/dma-map-ops.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/file.h>
> >>> +#include <linux/uio.h>
> >>> +#include <linux/vdpa.h>
> >>> +#include <linux/nospec.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +#include <uapi/linux/vdpa.h>
> >>> +#include <uapi/linux/virtio_config.h>
> >>> +#include <uapi/linux/virtio_ids.h>
> >>> +#include <uapi/linux/virtio_blk.h>
> >>> +#include <linux/mod_devicetable.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> >>> +#define DRV_DESC     "vDPA Device in Userspace"
> >>> +#define DRV_LICENSE  "GPL v2"
> >>> +
> >>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> >>> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> >>> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> >>> +#define VDUSE_REQUEST_TIMEOUT 30
> >>> +
> >>> +struct vduse_virtqueue {
> >>> +     u16 index;
> >>> +     u32 num;
> >>> +     u32 avail_idx;
> >>> +     u64 desc_addr;
> >>> +     u64 driver_addr;
> >>> +     u64 device_addr;
> >>> +     bool ready;
> >>> +     bool kicked;
> >>> +     spinlock_t kick_lock;
> >>> +     spinlock_t irq_lock;
> >>> +     struct eventfd_ctx *kickfd;
> >>> +     struct vdpa_callback cb;
> >>> +     struct work_struct inject;
> >>> +};
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_vdpa {
> >>> +     struct vdpa_device vdpa;
> >>> +     struct vduse_dev *dev;
> >>> +};
> >>> +
> >>> +struct vduse_dev {
> >>> +     struct vduse_vdpa *vdev;
> >>> +     struct device *dev;
> >>> +     struct vduse_virtqueue *vqs;
> >>> +     struct vduse_iova_domain *domain;
> >>> +     char *name;
> >>> +     struct mutex lock;
> >>> +     spinlock_t msg_lock;
> >>> +     u64 msg_unique;
> >>> +     wait_queue_head_t waitq;
> >>> +     struct list_head send_list;
> >>> +     struct list_head recv_list;
> >>> +     struct vdpa_callback config_cb;
> >>> +     struct work_struct inject;
> >>> +     spinlock_t irq_lock;
> >>> +     int minor;
> >>> +     bool connected;
> >>> +     bool started;
> >>> +     u64 api_version;
> >>> +     u64 user_features;
> >>
> >> Let's use device_features.
> >>
> > OK.
> >
> >>> +     u64 features;
> >>
> >> And driver features.
> >>
> > OK.
> >
> >>> +     u32 device_id;
> >>> +     u32 vendor_id;
> >>> +     u32 generation;
> >>> +     u32 config_size;
> >>> +     void *config;
> >>> +     u8 status;
> >>> +     u16 vq_size_max;
> >>> +     u32 vq_num;
> >>> +     u32 vq_align;
> >>> +};
> >>> +
> >>> +struct vduse_dev_msg {
> >>> +     struct vduse_dev_request req;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct list_head list;
> >>> +     wait_queue_head_t waitq;
> >>> +     bool completed;
> >>> +};
> >>> +
> >>> +struct vduse_control {
> >>> +     u64 api_version;
> >>> +};
> >>> +
> >>> +static DEFINE_MUTEX(vduse_lock);
> >>> +static DEFINE_IDR(vduse_idr);
> >>> +
> >>> +static dev_t vduse_major;
> >>> +static struct class *vduse_class;
> >>> +static struct cdev vduse_ctrl_cdev;
> >>> +static struct cdev vduse_cdev;
> >>> +static struct workqueue_struct *vduse_irq_wq;
> >>> +
> >>> +static u32 allowed_device_id[] = {
> >>> +     VIRTIO_ID_BLOCK,
> >>> +};
> >>> +
> >>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> >>> +
> >>> +     return vdev->dev;
> >>> +}
> >>> +
> >>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> >>> +{
> >>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
> >>> +
> >>> +     return vdpa_to_vduse(vdpa);
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> >>> +                                         uint32_t request_id)
> >>> +{
> >>> +     struct vduse_dev_msg *msg;
> >>> +
> >>> +     list_for_each_entry(msg, head, list) {
> >>> +             if (msg->req.request_id == request_id) {
> >>> +                     list_del(&msg->list);
> >>> +                     return msg;
> >>> +             }
> >>> +     }
> >>> +
> >>> +     return NULL;
> >>> +}
> >>> +
> >>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = NULL;
> >>> +
> >>> +     if (!list_empty(head)) {
> >>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
> >>> +             list_del(&msg->list);
> >>> +     }
> >>> +
> >>> +     return msg;
> >>> +}
> >>> +
> >>> +static void vduse_enqueue_msg(struct list_head *head,
> >>> +                           struct vduse_dev_msg *msg)
> >>> +{
> >>> +     list_add_tail(&msg->list, head);
> >>> +}
> >>> +
> >>> +static int vduse_dev_msg_send(struct vduse_dev *dev,
> >>> +                           struct vduse_dev_msg *msg, bool no_reply)
> >>> +{
> >>
> >> It looks to me the only user for no_reply=true is the dataplane start. I
> >> wonder no_reply is really needed consider we have switched to use
> >> wait_event_killable_timeout().
> >>
> > Do we need to handle the error in this case if we remove the no_reply
> > flag. Print a warning message?
>
>
> See below.
>
>
> >
> >> In another way, no_reply is false for vq state synchronization and IOTLB
> >> updating. I wonder if we can simply use no_reply = true for them.
> >>
> > Looks like we can't, e.g. we need to get a reply from userspace for vq state.
>
>
> Right.
>
>
> >
> >>> +     init_waitqueue_head(&msg->waitq);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     msg->req.request_id = dev->msg_unique++;
> >>> +     vduse_enqueue_msg(&dev->send_list, msg);
> >>> +     wake_up(&dev->waitq);
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +     if (no_reply)
> >>> +             return 0;
> >>> +
> >>> +     wait_event_killable_timeout(msg->waitq, msg->completed,
> >>> +                                 VDUSE_REQUEST_TIMEOUT * HZ);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     if (!msg->completed) {
> >>> +             list_del(&msg->list);
> >>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
> >>
> >> Do we need to serialize the check by protecting it with the spinlock above?
> >>
> > Good point.
> >
> >>> +}
> >>> +
> >>> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> >>> +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> >>> +                     kfree(msg);
> >>> +             else
> >>> +                     vduse_enqueue_msg(&dev->recv_list, msg);
> >>> +     }
> >>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> >>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> >>> +             msg->completed = 1;
> >>> +             wake_up(&msg->waitq);
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +}
> >>> +
> >>> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> >>> +                                         GFP_KERNEL | __GFP_NOFAIL);
> >>> +
> >>> +     msg->req.type = VDUSE_START_DATAPLANE;
> >>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> >>> +     vduse_dev_msg_send(dev, msg, true);
> >>> +}
> >>> +
> >>> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> >>> +{
> >>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> >>> +                                         GFP_KERNEL | __GFP_NOFAIL);
> >>> +
> >>> +     msg->req.type = VDUSE_STOP_DATAPLANE;
> >>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> >>
> >> Can we simply use this flag instead of introducing a new parameter
> >> (no_reply) in vduse_dev_msg_send()?
> >>
> > Looks good to me.
> >
> >>> +     vduse_dev_msg_send(dev, msg, true);
> >>> +}
> >>> +
> >>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> >>> +                               struct vduse_virtqueue *vq,
> >>> +                               struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +     int ret;
> >>
> >> Note that I post a series that implement the packed virtqueue support:
> >>
> >> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
> >>
> >> So this patch needs to be updated as well.
> >>
> > Will do it.
> >
> >>> +
> >>> +     msg.req.type = VDUSE_GET_VQ_STATE;
> >>> +     msg.req.vq_state.index = vq->index;
> >>> +
> >>> +     ret = vduse_dev_msg_send(dev, &msg, false);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     state->avail_index = msg.resp.vq_state.avail_idx;
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> >>> +                             u64 start, u64 last)
> >>> +{
> >>> +     struct vduse_dev_msg msg = { 0 };
> >>> +
> >>> +     if (last < start)
> >>> +             return -EINVAL;
> >>> +
> >>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
> >>> +     msg.req.iova.start = start;
> >>> +     msg.req.iova.last = last;
> >>> +
> >>> +     return vduse_dev_msg_send(dev, &msg, false);
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     int size = sizeof(struct vduse_dev_request);
> >>> +     ssize_t ret;
> >>> +
> >>> +     if (iov_iter_count(to) < size)
> >>> +             return -EINVAL;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     while (1) {
> >>> +             msg = vduse_dequeue_msg(&dev->send_list);
> >>> +             if (msg)
> >>> +                     break;
> >>> +
> >>> +             ret = -EAGAIN;
> >>> +             if (file->f_flags & O_NONBLOCK)
> >>> +                     goto unlock;
> >>> +
> >>> +             spin_unlock(&dev->msg_lock);
> >>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
> >>> +                                     !list_empty(&dev->send_list));
> >>> +             if (ret)
> >>> +                     return ret;
> >>> +
> >>> +             spin_lock(&dev->msg_lock);
> >>> +     }
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +     ret = copy_to_iter(&msg->req, size, to);
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     if (ret != size) {
> >>> +             ret = -EFAULT;
> >>> +             vduse_enqueue_msg(&dev->send_list, msg);
> >>> +             goto unlock;
> >>> +     }
> >>> +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> >>> +             kfree(msg);
> >>> +     else
> >>> +             vduse_enqueue_msg(&dev->recv_list, msg);
> >>> +unlock:
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >>> +{
> >>> +     struct file *file = iocb->ki_filp;
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     struct vduse_dev_response resp;
> >>> +     struct vduse_dev_msg *msg;
> >>> +     size_t ret;
> >>> +
> >>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
> >>> +     if (ret != sizeof(resp))
> >>> +             return -EINVAL;
> >>> +
> >>> +     spin_lock(&dev->msg_lock);
> >>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> >>> +     if (!msg) {
> >>> +             ret = -ENOENT;
> >>> +             goto unlock;
> >>> +     }
> >>> +
> >>> +     memcpy(&msg->resp, &resp, sizeof(resp));
> >>> +     msg->completed = 1;
> >>> +     wake_up(&msg->waitq);
> >>> +unlock:
> >>> +     spin_unlock(&dev->msg_lock);
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     __poll_t mask = 0;
> >>> +
> >>> +     poll_wait(file, &dev->waitq, wait);
> >>> +
> >>> +     if (!list_empty(&dev->send_list))
> >>> +             mask |= EPOLLIN | EPOLLRDNORM;
> >>> +     if (!list_empty(&dev->recv_list))
> >>> +             mask |= EPOLLOUT | EPOLLWRNORM;
> >>> +
> >>> +     return mask;
> >>> +}
> >>> +
> >>> +static void vduse_dev_reset(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +     struct vduse_iova_domain *domain = dev->domain;
> >>> +
> >>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
> >>> +     if (domain->bounce_map)
> >>> +             vduse_domain_reset_bounce_map(domain);
> >>> +
> >>> +     dev->features = 0;
> >>> +     dev->generation++;
> >>> +     spin_lock(&dev->irq_lock);
> >>> +     dev->config_cb.callback = NULL;
> >>> +     dev->config_cb.private = NULL;
> >>> +     spin_unlock(&dev->irq_lock);
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             vq->ready = false;
> >>> +             vq->desc_addr = 0;
> >>> +             vq->driver_addr = 0;
> >>> +             vq->device_addr = 0;
> >>> +             vq->avail_idx = 0;
> >>> +             vq->num = 0;
> >>> +
> >>> +             spin_lock(&vq->kick_lock);
> >>> +             vq->kicked = false;
> >>> +             if (vq->kickfd)
> >>> +                     eventfd_ctx_put(vq->kickfd);
> >>> +             vq->kickfd = NULL;
> >>> +             spin_unlock(&vq->kick_lock);
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             vq->cb.callback = NULL;
> >>> +             vq->cb.private = NULL;
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >>> +                             u64 desc_area, u64 driver_area,
> >>> +                             u64 device_area)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->desc_addr = desc_area;
> >>> +     vq->driver_addr = driver_area;
> >>> +     vq->device_addr = device_area;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (!vq->ready)
> >>> +             goto unlock;
> >>> +
> >>> +     if (vq->kickfd)
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +     else
> >>> +             vq->kicked = true;
> >>> +unlock:
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> >>> +                           struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     spin_lock(&vq->irq_lock);
> >>> +     vq->cb.callback = cb->callback;
> >>> +     vq->cb.private = cb->private;
> >>> +     spin_unlock(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->num = num;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> >>> +                                     u16 idx, bool ready)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->ready = ready;
> >>> +}
> >>> +
> >>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vq->ready;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             const struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     vq->avail_idx = state->avail_index;
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> >>> +                             struct vdpa_vq_state *state)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
> >>> +
> >>> +     return vduse_dev_get_vq_state(dev, vq, state);
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_align;
> >>> +}
> >>> +
> >>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->user_features;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     dev->features = features;
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> >>> +                               struct vdpa_callback *cb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     spin_lock(&dev->irq_lock);
> >>> +     dev->config_cb.callback = cb->callback;
> >>> +     dev->config_cb.private = cb->private;
> >>> +     spin_unlock(&dev->irq_lock);
> >>> +}
> >>> +
> >>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vq_size_max;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->device_id;
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->vendor_id;
> >>> +}
> >>> +
> >>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->status;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> >>> +
> >>> +     dev->status = status;
> >>> +
> >>> +     if (dev->started == started)
> >>> +             return;
> >>
> >> If we check dev->status == status, (or only check the DRIVER_OK bit)
> >> then there's no need to introduce an extra dev->started.
> >>
> > Will do it.
> >
> >>> +
> >>> +     dev->started = started;
> >>> +     if (dev->started) {
> >>> +             vduse_dev_start_dataplane(dev);
> >>> +     } else {
> >>> +             vduse_dev_reset(dev);
> >>> +             vduse_dev_stop_dataplane(dev);
> >>
> >> I wonder if no_reply work for the case of vhost-vdpa. For virtio-vDPA,
> >> we have bouncing buffers so it's harmless if usersapce dataplane keeps
> >> performing read/write. For vhost-vDPA we don't have such stuffs.
> >>
> > OK. So it still needs to be synchronized here. If so, how to handle
> > the error? Looks like printing a warning message should be enough.
>
>
> We need fix a way to propagate the error to the userspace.
>
> E.g if we want to stop the deivce, we will delay the status reset until
> we get respose from the userspace?
>

I didn't get how to delay the status reset. And should it be a DoS
that we want to fix if the userspace doesn't give a response forever?

>
> >
> >>> +     }
> >>> +}
> >>> +
> >>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->config_size;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                               void *buf, unsigned int len)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     memcpy(buf, dev->config + offset, len);
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> >>> +                     const void *buf, unsigned int len)
> >>> +{
> >>> +     /* Now we only support read-only configuration space */
> >>> +}
> >>> +
> >>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     return dev->generation;
> >>> +}
> >>> +
> >>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> >>> +                             struct vhost_iotlb *iotlb)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +     int ret;
> >>> +
> >>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
> >>> +     if (ret)
> >>> +             return ret;
> >>> +
> >>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> >>> +     if (ret) {
> >>> +             vduse_domain_clear_map(dev->domain, iotlb);
> >>> +             return ret;
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> >>> +{
> >>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >>> +
> >>> +     dev->vdev = NULL;
> >>> +}
> >>> +
> >>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> >>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
> >>> +     .kick_vq                = vduse_vdpa_kick_vq,
> >>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
> >>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
> >>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
> >>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
> >>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
> >>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
> >>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
> >>> +     .get_features           = vduse_vdpa_get_features,
> >>> +     .set_features           = vduse_vdpa_set_features,
> >>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
> >>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
> >>> +     .get_device_id          = vduse_vdpa_get_device_id,
> >>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
> >>> +     .get_status             = vduse_vdpa_get_status,
> >>> +     .set_status             = vduse_vdpa_set_status,
> >>> +     .get_config_size        = vduse_vdpa_get_config_size,
> >>> +     .get_config             = vduse_vdpa_get_config,
> >>> +     .set_config             = vduse_vdpa_set_config,
> >>> +     .get_generation         = vduse_vdpa_get_generation,
> >>> +     .set_map                = vduse_vdpa_set_map,
> >>> +     .free                   = vduse_vdpa_free,
> >>> +};
> >>> +
> >>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> >>> +                                  unsigned long offset, size_t size,
> >>> +                                  enum dma_data_direction dir,
> >>> +                                  unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> >>> +                             size_t size, enum dma_data_direction dir,
> >>> +                             unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> >>> +}
> >>> +
> >>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> >>> +                                     dma_addr_t *dma_addr, gfp_t flag,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +     unsigned long iova;
> >>> +     void *addr;
> >>> +
> >>> +     *dma_addr = DMA_MAPPING_ERROR;
> >>> +     addr = vduse_domain_alloc_coherent(domain, size,
> >>> +                             (dma_addr_t *)&iova, flag, attrs);
> >>> +     if (!addr)
> >>> +             return NULL;
> >>> +
> >>> +     *dma_addr = (dma_addr_t)iova;
> >>> +
> >>> +     return addr;
> >>> +}
> >>> +
> >>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> >>> +                                     void *vaddr, dma_addr_t dma_addr,
> >>> +                                     unsigned long attrs)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> >>> +}
> >>> +
> >>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
> >>> +{
> >>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
> >>> +     struct vduse_iova_domain *domain = vdev->domain;
> >>> +
> >>> +     return domain->bounce_size;
> >>> +}
> >>> +
> >>> +static const struct dma_map_ops vduse_dev_dma_ops = {
> >>> +     .map_page = vduse_dev_map_page,
> >>> +     .unmap_page = vduse_dev_unmap_page,
> >>> +     .alloc = vduse_dev_alloc_coherent,
> >>> +     .free = vduse_dev_free_coherent,
> >>> +     .max_mapping_size = vduse_dev_max_mapping_size,
> >>> +};
> >>> +
> >>> +static unsigned int perm_to_file_flags(u8 perm)
> >>> +{
> >>> +     unsigned int flags = 0;
> >>> +
> >>> +     switch (perm) {
> >>> +     case VDUSE_ACCESS_WO:
> >>> +             flags |= O_WRONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RO:
> >>> +             flags |= O_RDONLY;
> >>> +             break;
> >>> +     case VDUSE_ACCESS_RW:
> >>> +             flags |= O_RDWR;
> >>> +             break;
> >>> +     default:
> >>> +             WARN(1, "invalidate vhost IOTLB permission\n");
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     return flags;
> >>> +}
> >>> +
> >>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct eventfd_ctx *ctx = NULL;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     u32 index;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
> >>> +     vq = &dev->vqs[index];
> >>> +     if (eventfd->fd >= 0) {
> >>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
> >>> +             if (IS_ERR(ctx))
> >>> +                     return PTR_ERR(ctx);
> >>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> >>> +             return 0;
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->kickfd)
> >>> +             eventfd_ctx_put(vq->kickfd);
> >>> +     vq->kickfd = ctx;
> >>> +     if (vq->ready && vq->kicked && vq->kickfd) {
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +             vq->kicked = false;
> >>> +     }
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_dev_irq_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> >>> +
> >>> +     spin_lock_irq(&dev->irq_lock);
> >>> +     if (dev->config_cb.callback)
> >>> +             dev->config_cb.callback(dev->config_cb.private);
> >>> +     spin_unlock_irq(&dev->irq_lock);
> >>> +}
> >>> +
> >>> +static void vduse_vq_irq_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_virtqueue *vq = container_of(work,
> >>> +                                     struct vduse_virtqueue, inject);
> >>> +
> >>> +     spin_lock_irq(&vq->irq_lock);
> >>> +     if (vq->ready && vq->cb.callback)
> >>> +             vq->cb.callback(vq->cb.private);
> >>> +     spin_unlock_irq(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >>> +                         unsigned long arg)
> >>> +{
> >>> +     struct vduse_dev *dev = file->private_data;
> >>> +     void __user *argp = (void __user *)arg;
> >>> +     int ret;
> >>> +
> >>> +     switch (cmd) {
> >>> +     case VDUSE_IOTLB_GET_FD: {
> >>> +             struct vduse_iotlb_entry entry;
> >>> +             struct vhost_iotlb_map *map;
> >>> +             struct vdpa_map_file *map_file;
> >>> +             struct vduse_iova_domain *domain = dev->domain;
> >>> +             struct file *f = NULL;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
> >>> +                     break;
> >>> +
> >>> +             ret = -EINVAL;
> >>> +             if (entry.start > entry.last)
> >>> +                     break;
> >>> +
> >>> +             spin_lock(&domain->iotlb_lock);
> >>> +             map = vhost_iotlb_itree_first(domain->iotlb,
> >>> +                                           entry.start, entry.last);
> >>> +             if (map) {
> >>> +                     map_file = (struct vdpa_map_file *)map->opaque;
> >>> +                     f = get_file(map_file->file);
> >>> +                     entry.offset = map_file->offset;
> >>> +                     entry.start = map->start;
> >>> +                     entry.last = map->last;
> >>> +                     entry.perm = map->perm;
> >>> +             }
> >>> +             spin_unlock(&domain->iotlb_lock);
> >>> +             ret = -EINVAL;
> >>> +             if (!f)
> >>> +                     break;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
> >>> +                     fput(f);
> >>> +                     break;
> >>> +             }
> >>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
> >>> +             fput(f);
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_DEV_GET_FEATURES:
> >>> +             ret = put_user(dev->features, (u64 __user *)argp);
> >>> +             break;
> >>> +     case VDUSE_DEV_UPDATE_CONFIG: {
> >>> +             struct vduse_config_update config;
> >>> +             unsigned long size = offsetof(struct vduse_config_update,
> >>> +                                           buffer);
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(&config, argp, size))
> >>> +                     break;
> >>> +
> >>> +             ret = -EINVAL;
> >>> +             if (config.length == 0 ||
> >>> +                 config.length > dev->config_size - config.offset)
> >>> +                     break;
> >>> +
> >>> +             ret = -EFAULT;
> >>> +             if (copy_from_user(dev->config + config.offset, argp + size,
> >>> +                                config.length))
> >>> +                     break;
> >>> +
> >>> +             ret = 0;
> >>> +             queue_work(vduse_irq_wq, &dev->inject);
> >>
> >> I wonder if it's better to separate config interrupt out of config
> >> update or we need document this.
> >>
> > I have documented it in the docs. Looks like a config update should be
> > always followed by a config interrupt. I didn't find a case that uses
> > them separately.
>
>
> The uAPI doesn't prevent us from the following scenario:
>
> update_config(mac[0], ..);
> update_config(max[1], ..);
>
> So it looks to me it's better to separate the config interrupt from the
> config updating.
>

Fine.

>
> >
> >>> +             break;
> >>> +     }
> >>> +     case VDUSE_VQ_GET_INFO: {
> >>
> >> Do we need to limit this only when DRIVER_OK is set?
> >>
> > Any reason to add this limitation?
>
>
> Otherwise the vq is not fully initialized, e.g the desc_addr might not
> be correct.
>

The vq_info->ready can be used to tell userspace whether the vq is
initialized or not.

Thanks,
Yongji

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-22  7:22           ` Yongji Xie
  (?)
@ 2021-06-22  7:49             ` Jason Wang
  -1 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-22  7:49 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, Al Viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel


在 2021/6/22 下午3:22, Yongji Xie 写道:
>> We need fix a way to propagate the error to the userspace.
>>
>> E.g if we want to stop the deivce, we will delay the status reset until
>> we get respose from the userspace?
>>
> I didn't get how to delay the status reset. And should it be a DoS
> that we want to fix if the userspace doesn't give a response forever?


You're right. So let's make set_status() can fail first, then propagate 
its failure via VHOST_VDPA_SET_STATUS.


>
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->config_size;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                               void *buf, unsigned int len)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     memcpy(buf, dev->config + offset, len);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                     const void *buf, unsigned int len)
>>>>> +{
>>>>> +     /* Now we only support read-only configuration space */
>>>>> +}
>>>>> +
>>>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->generation;
>>>>> +}
>>>>> +
>>>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>>>> +                             struct vhost_iotlb *iotlb)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +     int ret;
>>>>> +
>>>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>>>> +     if (ret)
>>>>> +             return ret;
>>>>> +
>>>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>>>> +     if (ret) {
>>>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>>>> +             return ret;
>>>>> +     }
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     dev->vdev = NULL;
>>>>> +}
>>>>> +
>>>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>>>> +     .get_features           = vduse_vdpa_get_features,
>>>>> +     .set_features           = vduse_vdpa_set_features,
>>>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>>>> +     .get_status             = vduse_vdpa_get_status,
>>>>> +     .set_status             = vduse_vdpa_set_status,
>>>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>>>> +     .get_config             = vduse_vdpa_get_config,
>>>>> +     .set_config             = vduse_vdpa_set_config,
>>>>> +     .get_generation         = vduse_vdpa_get_generation,
>>>>> +     .set_map                = vduse_vdpa_set_map,
>>>>> +     .free                   = vduse_vdpa_free,
>>>>> +};
>>>>> +
>>>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>>>> +                                  unsigned long offset, size_t size,
>>>>> +                                  enum dma_data_direction dir,
>>>>> +                                  unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>>>> +                             size_t size, enum dma_data_direction dir,
>>>>> +                             unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +     unsigned long iova;
>>>>> +     void *addr;
>>>>> +
>>>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>>>> +     if (!addr)
>>>>> +             return NULL;
>>>>> +
>>>>> +     *dma_addr = (dma_addr_t)iova;
>>>>> +
>>>>> +     return addr;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return domain->bounce_size;
>>>>> +}
>>>>> +
>>>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>>>> +     .map_page = vduse_dev_map_page,
>>>>> +     .unmap_page = vduse_dev_unmap_page,
>>>>> +     .alloc = vduse_dev_alloc_coherent,
>>>>> +     .free = vduse_dev_free_coherent,
>>>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>>>> +};
>>>>> +
>>>>> +static unsigned int perm_to_file_flags(u8 perm)
>>>>> +{
>>>>> +     unsigned int flags = 0;
>>>>> +
>>>>> +     switch (perm) {
>>>>> +     case VDUSE_ACCESS_WO:
>>>>> +             flags |= O_WRONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RO:
>>>>> +             flags |= O_RDONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RW:
>>>>> +             flags |= O_RDWR;
>>>>> +             break;
>>>>> +     default:
>>>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>>>> +             break;
>>>>> +     }
>>>>> +
>>>>> +     return flags;
>>>>> +}
>>>>> +
>>>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>>>> +                     struct vduse_vq_eventfd *eventfd)
>>>>> +{
>>>>> +     struct eventfd_ctx *ctx = NULL;
>>>>> +     struct vduse_virtqueue *vq;
>>>>> +     u32 index;
>>>>> +
>>>>> +     if (eventfd->index >= dev->vq_num)
>>>>> +             return -EINVAL;
>>>>> +
>>>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>>>> +     vq = &dev->vqs[index];
>>>>> +     if (eventfd->fd >= 0) {
>>>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>>>> +             if (IS_ERR(ctx))
>>>>> +                     return PTR_ERR(ctx);
>>>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>>>> +             return 0;
>>>>> +
>>>>> +     spin_lock(&vq->kick_lock);
>>>>> +     if (vq->kickfd)
>>>>> +             eventfd_ctx_put(vq->kickfd);
>>>>> +     vq->kickfd = ctx;
>>>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>>>> +             eventfd_signal(vq->kickfd, 1);
>>>>> +             vq->kicked = false;
>>>>> +     }
>>>>> +     spin_unlock(&vq->kick_lock);
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>>>> +
>>>>> +     spin_lock_irq(&dev->irq_lock);
>>>>> +     if (dev->config_cb.callback)
>>>>> +             dev->config_cb.callback(dev->config_cb.private);
>>>>> +     spin_unlock_irq(&dev->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_virtqueue *vq = container_of(work,
>>>>> +                                     struct vduse_virtqueue, inject);
>>>>> +
>>>>> +     spin_lock_irq(&vq->irq_lock);
>>>>> +     if (vq->ready && vq->cb.callback)
>>>>> +             vq->cb.callback(vq->cb.private);
>>>>> +     spin_unlock_irq(&vq->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>>>> +                         unsigned long arg)
>>>>> +{
>>>>> +     struct vduse_dev *dev = file->private_data;
>>>>> +     void __user *argp = (void __user *)arg;
>>>>> +     int ret;
>>>>> +
>>>>> +     switch (cmd) {
>>>>> +     case VDUSE_IOTLB_GET_FD: {
>>>>> +             struct vduse_iotlb_entry entry;
>>>>> +             struct vhost_iotlb_map *map;
>>>>> +             struct vdpa_map_file *map_file;
>>>>> +             struct vduse_iova_domain *domain = dev->domain;
>>>>> +             struct file *f = NULL;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (entry.start > entry.last)
>>>>> +                     break;
>>>>> +
>>>>> +             spin_lock(&domain->iotlb_lock);
>>>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>>>> +                                           entry.start, entry.last);
>>>>> +             if (map) {
>>>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>>>> +                     f = get_file(map_file->file);
>>>>> +                     entry.offset = map_file->offset;
>>>>> +                     entry.start = map->start;
>>>>> +                     entry.last = map->last;
>>>>> +                     entry.perm = map->perm;
>>>>> +             }
>>>>> +             spin_unlock(&domain->iotlb_lock);
>>>>> +             ret = -EINVAL;
>>>>> +             if (!f)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>>>> +                     fput(f);
>>>>> +                     break;
>>>>> +             }
>>>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>>>> +             fput(f);
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_DEV_GET_FEATURES:
>>>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>>>> +             break;
>>>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>>>> +             struct vduse_config_update config;
>>>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>>>> +                                           buffer);
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&config, argp, size))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (config.length == 0 ||
>>>>> +                 config.length > dev->config_size - config.offset)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>>>> +                                config.length))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = 0;
>>>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>>> I wonder if it's better to separate config interrupt out of config
>>>> update or we need document this.
>>>>
>>> I have documented it in the docs. Looks like a config update should be
>>> always followed by a config interrupt. I didn't find a case that uses
>>> them separately.
>> The uAPI doesn't prevent us from the following scenario:
>>
>> update_config(mac[0], ..);
>> update_config(max[1], ..);
>>
>> So it looks to me it's better to separate the config interrupt from the
>> config updating.
>>
> Fine.
>
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_VQ_GET_INFO: {
>>>> Do we need to limit this only when DRIVER_OK is set?
>>>>
>>> Any reason to add this limitation?
>> Otherwise the vq is not fully initialized, e.g the desc_addr might not
>> be correct.
>>
> The vq_info->ready can be used to tell userspace whether the vq is
> initialized or not.


Yes, this will work as well.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-22  7:49             ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-22  7:49 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Stefano Garzarella, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä


在 2021/6/22 下午3:22, Yongji Xie 写道:
>> We need fix a way to propagate the error to the userspace.
>>
>> E.g if we want to stop the deivce, we will delay the status reset until
>> we get respose from the userspace?
>>
> I didn't get how to delay the status reset. And should it be a DoS
> that we want to fix if the userspace doesn't give a response forever?


You're right. So let's make set_status() can fail first, then propagate 
its failure via VHOST_VDPA_SET_STATUS.


>
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->config_size;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                               void *buf, unsigned int len)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     memcpy(buf, dev->config + offset, len);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                     const void *buf, unsigned int len)
>>>>> +{
>>>>> +     /* Now we only support read-only configuration space */
>>>>> +}
>>>>> +
>>>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->generation;
>>>>> +}
>>>>> +
>>>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>>>> +                             struct vhost_iotlb *iotlb)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +     int ret;
>>>>> +
>>>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>>>> +     if (ret)
>>>>> +             return ret;
>>>>> +
>>>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>>>> +     if (ret) {
>>>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>>>> +             return ret;
>>>>> +     }
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     dev->vdev = NULL;
>>>>> +}
>>>>> +
>>>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>>>> +     .get_features           = vduse_vdpa_get_features,
>>>>> +     .set_features           = vduse_vdpa_set_features,
>>>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>>>> +     .get_status             = vduse_vdpa_get_status,
>>>>> +     .set_status             = vduse_vdpa_set_status,
>>>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>>>> +     .get_config             = vduse_vdpa_get_config,
>>>>> +     .set_config             = vduse_vdpa_set_config,
>>>>> +     .get_generation         = vduse_vdpa_get_generation,
>>>>> +     .set_map                = vduse_vdpa_set_map,
>>>>> +     .free                   = vduse_vdpa_free,
>>>>> +};
>>>>> +
>>>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>>>> +                                  unsigned long offset, size_t size,
>>>>> +                                  enum dma_data_direction dir,
>>>>> +                                  unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>>>> +                             size_t size, enum dma_data_direction dir,
>>>>> +                             unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +     unsigned long iova;
>>>>> +     void *addr;
>>>>> +
>>>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>>>> +     if (!addr)
>>>>> +             return NULL;
>>>>> +
>>>>> +     *dma_addr = (dma_addr_t)iova;
>>>>> +
>>>>> +     return addr;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return domain->bounce_size;
>>>>> +}
>>>>> +
>>>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>>>> +     .map_page = vduse_dev_map_page,
>>>>> +     .unmap_page = vduse_dev_unmap_page,
>>>>> +     .alloc = vduse_dev_alloc_coherent,
>>>>> +     .free = vduse_dev_free_coherent,
>>>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>>>> +};
>>>>> +
>>>>> +static unsigned int perm_to_file_flags(u8 perm)
>>>>> +{
>>>>> +     unsigned int flags = 0;
>>>>> +
>>>>> +     switch (perm) {
>>>>> +     case VDUSE_ACCESS_WO:
>>>>> +             flags |= O_WRONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RO:
>>>>> +             flags |= O_RDONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RW:
>>>>> +             flags |= O_RDWR;
>>>>> +             break;
>>>>> +     default:
>>>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>>>> +             break;
>>>>> +     }
>>>>> +
>>>>> +     return flags;
>>>>> +}
>>>>> +
>>>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>>>> +                     struct vduse_vq_eventfd *eventfd)
>>>>> +{
>>>>> +     struct eventfd_ctx *ctx = NULL;
>>>>> +     struct vduse_virtqueue *vq;
>>>>> +     u32 index;
>>>>> +
>>>>> +     if (eventfd->index >= dev->vq_num)
>>>>> +             return -EINVAL;
>>>>> +
>>>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>>>> +     vq = &dev->vqs[index];
>>>>> +     if (eventfd->fd >= 0) {
>>>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>>>> +             if (IS_ERR(ctx))
>>>>> +                     return PTR_ERR(ctx);
>>>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>>>> +             return 0;
>>>>> +
>>>>> +     spin_lock(&vq->kick_lock);
>>>>> +     if (vq->kickfd)
>>>>> +             eventfd_ctx_put(vq->kickfd);
>>>>> +     vq->kickfd = ctx;
>>>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>>>> +             eventfd_signal(vq->kickfd, 1);
>>>>> +             vq->kicked = false;
>>>>> +     }
>>>>> +     spin_unlock(&vq->kick_lock);
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>>>> +
>>>>> +     spin_lock_irq(&dev->irq_lock);
>>>>> +     if (dev->config_cb.callback)
>>>>> +             dev->config_cb.callback(dev->config_cb.private);
>>>>> +     spin_unlock_irq(&dev->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_virtqueue *vq = container_of(work,
>>>>> +                                     struct vduse_virtqueue, inject);
>>>>> +
>>>>> +     spin_lock_irq(&vq->irq_lock);
>>>>> +     if (vq->ready && vq->cb.callback)
>>>>> +             vq->cb.callback(vq->cb.private);
>>>>> +     spin_unlock_irq(&vq->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>>>> +                         unsigned long arg)
>>>>> +{
>>>>> +     struct vduse_dev *dev = file->private_data;
>>>>> +     void __user *argp = (void __user *)arg;
>>>>> +     int ret;
>>>>> +
>>>>> +     switch (cmd) {
>>>>> +     case VDUSE_IOTLB_GET_FD: {
>>>>> +             struct vduse_iotlb_entry entry;
>>>>> +             struct vhost_iotlb_map *map;
>>>>> +             struct vdpa_map_file *map_file;
>>>>> +             struct vduse_iova_domain *domain = dev->domain;
>>>>> +             struct file *f = NULL;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (entry.start > entry.last)
>>>>> +                     break;
>>>>> +
>>>>> +             spin_lock(&domain->iotlb_lock);
>>>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>>>> +                                           entry.start, entry.last);
>>>>> +             if (map) {
>>>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>>>> +                     f = get_file(map_file->file);
>>>>> +                     entry.offset = map_file->offset;
>>>>> +                     entry.start = map->start;
>>>>> +                     entry.last = map->last;
>>>>> +                     entry.perm = map->perm;
>>>>> +             }
>>>>> +             spin_unlock(&domain->iotlb_lock);
>>>>> +             ret = -EINVAL;
>>>>> +             if (!f)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>>>> +                     fput(f);
>>>>> +                     break;
>>>>> +             }
>>>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>>>> +             fput(f);
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_DEV_GET_FEATURES:
>>>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>>>> +             break;
>>>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>>>> +             struct vduse_config_update config;
>>>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>>>> +                                           buffer);
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&config, argp, size))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (config.length == 0 ||
>>>>> +                 config.length > dev->config_size - config.offset)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>>>> +                                config.length))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = 0;
>>>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>>> I wonder if it's better to separate config interrupt out of config
>>>> update or we need document this.
>>>>
>>> I have documented it in the docs. Looks like a config update should be
>>> always followed by a config interrupt. I didn't find a case that uses
>>> them separately.
>> The uAPI doesn't prevent us from the following scenario:
>>
>> update_config(mac[0], ..);
>> update_config(max[1], ..);
>>
>> So it looks to me it's better to separate the config interrupt from the
>> config updating.
>>
> Fine.
>
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_VQ_GET_INFO: {
>>>> Do we need to limit this only when DRIVER_OK is set?
>>>>
>>> Any reason to add this limitation?
>> Otherwise the vq is not fully initialized, e.g the desc_addr might not
>> be correct.
>>
> The vq_info->ready can be used to tell userspace whether the vq is
> initialized or not.


Yes, this will work as well.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-22  7:49             ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-22  7:49 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


在 2021/6/22 下午3:22, Yongji Xie 写道:
>> We need fix a way to propagate the error to the userspace.
>>
>> E.g if we want to stop the deivce, we will delay the status reset until
>> we get respose from the userspace?
>>
> I didn't get how to delay the status reset. And should it be a DoS
> that we want to fix if the userspace doesn't give a response forever?


You're right. So let's make set_status() can fail first, then propagate 
its failure via VHOST_VDPA_SET_STATUS.


>
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->config_size;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                               void *buf, unsigned int len)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     memcpy(buf, dev->config + offset, len);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                     const void *buf, unsigned int len)
>>>>> +{
>>>>> +     /* Now we only support read-only configuration space */
>>>>> +}
>>>>> +
>>>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->generation;
>>>>> +}
>>>>> +
>>>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>>>> +                             struct vhost_iotlb *iotlb)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +     int ret;
>>>>> +
>>>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>>>> +     if (ret)
>>>>> +             return ret;
>>>>> +
>>>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>>>> +     if (ret) {
>>>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>>>> +             return ret;
>>>>> +     }
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     dev->vdev = NULL;
>>>>> +}
>>>>> +
>>>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>>>> +     .get_features           = vduse_vdpa_get_features,
>>>>> +     .set_features           = vduse_vdpa_set_features,
>>>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>>>> +     .get_status             = vduse_vdpa_get_status,
>>>>> +     .set_status             = vduse_vdpa_set_status,
>>>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>>>> +     .get_config             = vduse_vdpa_get_config,
>>>>> +     .set_config             = vduse_vdpa_set_config,
>>>>> +     .get_generation         = vduse_vdpa_get_generation,
>>>>> +     .set_map                = vduse_vdpa_set_map,
>>>>> +     .free                   = vduse_vdpa_free,
>>>>> +};
>>>>> +
>>>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>>>> +                                  unsigned long offset, size_t size,
>>>>> +                                  enum dma_data_direction dir,
>>>>> +                                  unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>>>> +                             size_t size, enum dma_data_direction dir,
>>>>> +                             unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +     unsigned long iova;
>>>>> +     void *addr;
>>>>> +
>>>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>>>> +     if (!addr)
>>>>> +             return NULL;
>>>>> +
>>>>> +     *dma_addr = (dma_addr_t)iova;
>>>>> +
>>>>> +     return addr;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return domain->bounce_size;
>>>>> +}
>>>>> +
>>>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>>>> +     .map_page = vduse_dev_map_page,
>>>>> +     .unmap_page = vduse_dev_unmap_page,
>>>>> +     .alloc = vduse_dev_alloc_coherent,
>>>>> +     .free = vduse_dev_free_coherent,
>>>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>>>> +};
>>>>> +
>>>>> +static unsigned int perm_to_file_flags(u8 perm)
>>>>> +{
>>>>> +     unsigned int flags = 0;
>>>>> +
>>>>> +     switch (perm) {
>>>>> +     case VDUSE_ACCESS_WO:
>>>>> +             flags |= O_WRONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RO:
>>>>> +             flags |= O_RDONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RW:
>>>>> +             flags |= O_RDWR;
>>>>> +             break;
>>>>> +     default:
>>>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>>>> +             break;
>>>>> +     }
>>>>> +
>>>>> +     return flags;
>>>>> +}
>>>>> +
>>>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>>>> +                     struct vduse_vq_eventfd *eventfd)
>>>>> +{
>>>>> +     struct eventfd_ctx *ctx = NULL;
>>>>> +     struct vduse_virtqueue *vq;
>>>>> +     u32 index;
>>>>> +
>>>>> +     if (eventfd->index >= dev->vq_num)
>>>>> +             return -EINVAL;
>>>>> +
>>>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>>>> +     vq = &dev->vqs[index];
>>>>> +     if (eventfd->fd >= 0) {
>>>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>>>> +             if (IS_ERR(ctx))
>>>>> +                     return PTR_ERR(ctx);
>>>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>>>> +             return 0;
>>>>> +
>>>>> +     spin_lock(&vq->kick_lock);
>>>>> +     if (vq->kickfd)
>>>>> +             eventfd_ctx_put(vq->kickfd);
>>>>> +     vq->kickfd = ctx;
>>>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>>>> +             eventfd_signal(vq->kickfd, 1);
>>>>> +             vq->kicked = false;
>>>>> +     }
>>>>> +     spin_unlock(&vq->kick_lock);
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>>>> +
>>>>> +     spin_lock_irq(&dev->irq_lock);
>>>>> +     if (dev->config_cb.callback)
>>>>> +             dev->config_cb.callback(dev->config_cb.private);
>>>>> +     spin_unlock_irq(&dev->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_virtqueue *vq = container_of(work,
>>>>> +                                     struct vduse_virtqueue, inject);
>>>>> +
>>>>> +     spin_lock_irq(&vq->irq_lock);
>>>>> +     if (vq->ready && vq->cb.callback)
>>>>> +             vq->cb.callback(vq->cb.private);
>>>>> +     spin_unlock_irq(&vq->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>>>> +                         unsigned long arg)
>>>>> +{
>>>>> +     struct vduse_dev *dev = file->private_data;
>>>>> +     void __user *argp = (void __user *)arg;
>>>>> +     int ret;
>>>>> +
>>>>> +     switch (cmd) {
>>>>> +     case VDUSE_IOTLB_GET_FD: {
>>>>> +             struct vduse_iotlb_entry entry;
>>>>> +             struct vhost_iotlb_map *map;
>>>>> +             struct vdpa_map_file *map_file;
>>>>> +             struct vduse_iova_domain *domain = dev->domain;
>>>>> +             struct file *f = NULL;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (entry.start > entry.last)
>>>>> +                     break;
>>>>> +
>>>>> +             spin_lock(&domain->iotlb_lock);
>>>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>>>> +                                           entry.start, entry.last);
>>>>> +             if (map) {
>>>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>>>> +                     f = get_file(map_file->file);
>>>>> +                     entry.offset = map_file->offset;
>>>>> +                     entry.start = map->start;
>>>>> +                     entry.last = map->last;
>>>>> +                     entry.perm = map->perm;
>>>>> +             }
>>>>> +             spin_unlock(&domain->iotlb_lock);
>>>>> +             ret = -EINVAL;
>>>>> +             if (!f)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>>>> +                     fput(f);
>>>>> +                     break;
>>>>> +             }
>>>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>>>> +             fput(f);
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_DEV_GET_FEATURES:
>>>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>>>> +             break;
>>>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>>>> +             struct vduse_config_update config;
>>>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>>>> +                                           buffer);
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&config, argp, size))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (config.length == 0 ||
>>>>> +                 config.length > dev->config_size - config.offset)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>>>> +                                config.length))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = 0;
>>>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>>> I wonder if it's better to separate config interrupt out of config
>>>> update or we need document this.
>>>>
>>> I have documented it in the docs. Looks like a config update should be
>>> always followed by a config interrupt. I didn't find a case that uses
>>> them separately.
>> The uAPI doesn't prevent us from the following scenario:
>>
>> update_config(mac[0], ..);
>> update_config(max[1], ..);
>>
>> So it looks to me it's better to separate the config interrupt from the
>> config updating.
>>
> Fine.
>
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_VQ_GET_INFO: {
>>>> Do we need to limit this only when DRIVER_OK is set?
>>>>
>>> Any reason to add this limitation?
>> Otherwise the vq is not fully initialized, e.g the desc_addr might not
>> be correct.
>>
> The vq_info->ready can be used to tell userspace whether the vq is
> initialized or not.


Yes, this will work as well.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-22  7:49             ` Jason Wang
@ 2021-06-22  8:14               ` Yongji Xie
  -1 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-22  8:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, Al Viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel

On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/22 下午3:22, Yongji Xie 写道:
> >> We need fix a way to propagate the error to the userspace.
> >>
> >> E.g if we want to stop the deivce, we will delay the status reset until
> >> we get respose from the userspace?
> >>
> > I didn't get how to delay the status reset. And should it be a DoS
> > that we want to fix if the userspace doesn't give a response forever?
>
>
> You're right. So let's make set_status() can fail first, then propagate
> its failure via VHOST_VDPA_SET_STATUS.
>

OK. So we only need to propagate the failure in the vhost-vdpa case, right?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-22  8:14               ` Yongji Xie
  0 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-22  8:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Stefano Garzarella, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä

On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/22 下午3:22, Yongji Xie 写道:
> >> We need fix a way to propagate the error to the userspace.
> >>
> >> E.g if we want to stop the deivce, we will delay the status reset until
> >> we get respose from the userspace?
> >>
> > I didn't get how to delay the status reset. And should it be a DoS
> > that we want to fix if the userspace doesn't give a response forever?
>
>
> You're right. So let's make set_status() can fail first, then propagate
> its failure via VHOST_VDPA_SET_STATUS.
>

OK. So we only need to propagate the failure in the vhost-vdpa case, right?

Thanks,
Yongji
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-22  8:14               ` Yongji Xie
  (?)
@ 2021-06-23  3:30                 ` Jason Wang
  -1 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-23  3:30 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, Al Viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel


在 2021/6/22 下午4:14, Yongji Xie 写道:
> On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/6/22 下午3:22, Yongji Xie 写道:
>>>> We need fix a way to propagate the error to the userspace.
>>>>
>>>> E.g if we want to stop the deivce, we will delay the status reset until
>>>> we get respose from the userspace?
>>>>
>>> I didn't get how to delay the status reset. And should it be a DoS
>>> that we want to fix if the userspace doesn't give a response forever?
>>
>> You're right. So let's make set_status() can fail first, then propagate
>> its failure via VHOST_VDPA_SET_STATUS.
>>
> OK. So we only need to propagate the failure in the vhost-vdpa case, right?


I think not, we need to deal with the reset for virtio as well:

E.g in register_virtio_devices(), we have:

         /* We always start by resetting the device, in case a previous
          * driver messed it up.  This also tests that code path a 
little. */
       dev->config->reset(dev);

We probably need to make reset can fail and then fail the 
register_virtio_device() as well.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-23  3:30                 ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-23  3:30 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Stefano Garzarella, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä


在 2021/6/22 下午4:14, Yongji Xie 写道:
> On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/6/22 下午3:22, Yongji Xie 写道:
>>>> We need fix a way to propagate the error to the userspace.
>>>>
>>>> E.g if we want to stop the deivce, we will delay the status reset until
>>>> we get respose from the userspace?
>>>>
>>> I didn't get how to delay the status reset. And should it be a DoS
>>> that we want to fix if the userspace doesn't give a response forever?
>>
>> You're right. So let's make set_status() can fail first, then propagate
>> its failure via VHOST_VDPA_SET_STATUS.
>>
> OK. So we only need to propagate the failure in the vhost-vdpa case, right?


I think not, we need to deal with the reset for virtio as well:

E.g in register_virtio_devices(), we have:

         /* We always start by resetting the device, in case a previous
          * driver messed it up.  This also tests that code path a 
little. */
       dev->config->reset(dev);

We probably need to make reset can fail and then fail the 
register_virtio_device() as well.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-23  3:30                 ` Jason Wang
  0 siblings, 0 replies; 193+ messages in thread
From: Jason Wang @ 2021-06-23  3:30 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


在 2021/6/22 下午4:14, Yongji Xie 写道:
> On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/6/22 下午3:22, Yongji Xie 写道:
>>>> We need fix a way to propagate the error to the userspace.
>>>>
>>>> E.g if we want to stop the deivce, we will delay the status reset until
>>>> we get respose from the userspace?
>>>>
>>> I didn't get how to delay the status reset. And should it be a DoS
>>> that we want to fix if the userspace doesn't give a response forever?
>>
>> You're right. So let's make set_status() can fail first, then propagate
>> its failure via VHOST_VDPA_SET_STATUS.
>>
> OK. So we only need to propagate the failure in the vhost-vdpa case, right?


I think not, we need to deal with the reset for virtio as well:

E.g in register_virtio_devices(), we have:

         /* We always start by resetting the device, in case a previous
          * driver messed it up.  This also tests that code path a 
little. */
       dev->config->reset(dev);

We probably need to make reset can fail and then fail the 
register_virtio_device() as well.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
  2021-06-23  3:30                 ` Jason Wang
@ 2021-06-23  5:50                   ` Yongji Xie
  -1 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-23  5:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Stefano Garzarella,
	Parav Pandit, Christoph Hellwig, Christian Brauner, Randy Dunlap,
	Matthew Wilcox, Al Viro, Jens Axboe, bcrl, Jonathan Corbet,
	Mika Penttilä,
	Dan Carpenter, joro, Greg KH, songmuchun, virtualization, netdev,
	kvm, linux-fsdevel, iommu, linux-kernel

On Wed, Jun 23, 2021 at 11:31 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/22 下午4:14, Yongji Xie 写道:
> > On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/6/22 下午3:22, Yongji Xie 写道:
> >>>> We need fix a way to propagate the error to the userspace.
> >>>>
> >>>> E.g if we want to stop the deivce, we will delay the status reset until
> >>>> we get respose from the userspace?
> >>>>
> >>> I didn't get how to delay the status reset. And should it be a DoS
> >>> that we want to fix if the userspace doesn't give a response forever?
> >>
> >> You're right. So let's make set_status() can fail first, then propagate
> >> its failure via VHOST_VDPA_SET_STATUS.
> >>
> > OK. So we only need to propagate the failure in the vhost-vdpa case, right?
>
>
> I think not, we need to deal with the reset for virtio as well:
>
> E.g in register_virtio_devices(), we have:
>
>          /* We always start by resetting the device, in case a previous
>           * driver messed it up.  This also tests that code path a
> little. */
>        dev->config->reset(dev);
>
> We probably need to make reset can fail and then fail the
> register_virtio_device() as well.
>

OK, looks like virtio_add_status() and virtio_device_ready()[1] should
be also modified if we need to propagate the failure in the
virtio-vdpa case. Or do we only need to care about the reset case?

[1] https://lore.kernel.org/lkml/20210517093428.670-1-xieyongji@bytedance.com/

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 193+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
@ 2021-06-23  5:50                   ` Yongji Xie
  0 siblings, 0 replies; 193+ messages in thread
From: Yongji Xie @ 2021-06-23  5:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Stefano Garzarella, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä

On Wed, Jun 23, 2021 at 11:31 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/6/22 下午4:14, Yongji Xie 写道:
> > On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> 在 2021/6/22 下午3:22, Yongji Xie 写道:
> >>>> We need fix a way to propagate the error to the userspace.
> >>>>
> >>>> E.g if we want to stop the deivce, we will delay the status reset until
> >>>> we get respose from the userspace?
> >>>>
> >>> I didn't get how to delay the status reset. And should it be a DoS
> >>> that we want to fix if the userspace doesn't give a response forever?
> >>
> >> You're right. So let's make set_status() can fail first, then propagate
> >> its failure via VHOST_VDPA_SET_STATUS.
> >>
> > OK. So we only need to propagate the failure in the vhost-vdpa case, right?
>
>
> I think not, we need to deal with the reset for virtio as well:
>
> E.g in register_virtio_devices(), we have:
>
>          /* We always start by resetting the device, in case a previous
>           * driver messed it up.  This also tests that code path a
> little. */
>        dev->config->reset(dev);
>
> We probably need to make reset can fail and then fail the
> register_virtio_device() as well.
>

OK, looks