virtualization.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v5 01/11] file: Export __receive_fd() to modules
       [not found] ` <20210315053721.189-2-xieyongji@bytedance.com>
@ 2021-03-15  9:08   ` Christoph Hellwig
       [not found]     ` <CACycT3vrHOExXj6v8ULvUzdLcRkdzS5=TNK6=g4+RWEdN-nOJw@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2021-03-15  9:08 UTC (permalink / raw)
  To: Xie Yongji
  Cc: axboe, corbet, kvm, mst, netdev, rdunlap, willy, virtualization,
	hch, bob.liu, bcrl, viro, stefanha, linux-fsdevel, dan.carpenter,
	mika.penttila

On Mon, Mar 15, 2021 at 01:37:11PM +0800, Xie Yongji wrote:
> Export __receive_fd() so that some modules can use
> it to pass file descriptor between processes.

I really don't think any non-core code should do that, especilly not
modular mere driver code.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 03/11] vhost-vdpa: protect concurrent access to vhost device iotlb
       [not found] ` <20210315053721.189-4-xieyongji@bytedance.com>
@ 2021-03-23  3:02   ` Jason Wang
  2021-03-25 11:08   ` Stefano Garzarella
  1 sibling, 0 replies; 14+ messages in thread
From: Jason Wang @ 2021-03-23  3:02 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: linux-fsdevel, netdev, kvm, virtualization


在 2021/3/15 下午1:37, Xie Yongji 写道:
> Use vhost_dev->mutex to protect vhost device iotlb from
> concurrent access.
>
> Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>

Please cc stable for next version.

Thanks


> ---
>   drivers/vhost/vdpa.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index cb14c66eb2ec..3f7175c2ac24 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -719,9 +719,11 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   	const struct vdpa_config_ops *ops = vdpa->config;
>   	int r = 0;
>   
> +	mutex_lock(&dev->mutex);
> +
>   	r = vhost_dev_check_owner(dev);
>   	if (r)
> -		return r;
> +		goto unlock;
>   
>   	switch (msg->type) {
>   	case VHOST_IOTLB_UPDATE:
> @@ -742,6 +744,8 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,
>   		r = -EINVAL;
>   		break;
>   	}
> +unlock:
> +	mutex_unlock(&dev->mutex);
>   
>   	return r;
>   }

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 06/11] vdpa: factor out vhost_vdpa_pa_map()
       [not found] ` <20210315053721.189-7-xieyongji@bytedance.com>
@ 2021-03-23  3:09   ` Jason Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2021-03-23  3:09 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: linux-fsdevel, netdev, kvm, virtualization


在 2021/3/15 下午1:37, Xie Yongji 写道:
> The upcoming patch is going to support VA mapping. So let's
> factor out the logic of PA mapping firstly to make the code
> more readable.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


Acked-by: Jason Wang <jasowang@redhat.com>

While at it, I think it's better to factor out the unmap() part? Since 
the unpin and page dirty is not needed for va device.

Thanks


> ---
>   drivers/vhost/vdpa.c | 46 ++++++++++++++++++++++++++++------------------
>   1 file changed, 28 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index b24ec69a374b..7c83fbf3edac 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -579,37 +579,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>   	}
>   }
>   
> -static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> -					   struct vhost_iotlb_msg *msg)
> +static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> +			     u64 iova, u64 size, u64 uaddr, u32 perm)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> -	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct page **page_list;
>   	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
>   	unsigned int gup_flags = FOLL_LONGTERM;
>   	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>   	unsigned long lock_limit, sz2pin, nchunks, i;
> -	u64 iova = msg->iova;
> +	u64 start = iova;
>   	long pinned;
>   	int ret = 0;
>   
> -	if (msg->iova < v->range.first ||
> -	    msg->iova + msg->size - 1 > v->range.last)
> -		return -EINVAL;
> -
> -	if (vhost_iotlb_itree_first(iotlb, msg->iova,
> -				    msg->iova + msg->size - 1))
> -		return -EEXIST;
> -
>   	/* Limit the use of memory for bookkeeping */
>   	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>   	if (!page_list)
>   		return -ENOMEM;
>   
> -	if (msg->perm & VHOST_ACCESS_WO)
> +	if (perm & VHOST_ACCESS_WO)
>   		gup_flags |= FOLL_WRITE;
>   
> -	npages = PAGE_ALIGN(msg->size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
> +	npages = PAGE_ALIGN(size + (iova & ~PAGE_MASK)) >> PAGE_SHIFT;
>   	if (!npages) {
>   		ret = -EINVAL;
>   		goto free;
> @@ -623,7 +614,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   		goto unlock;
>   	}
>   
> -	cur_base = msg->uaddr & PAGE_MASK;
> +	cur_base = uaddr & PAGE_MASK;
>   	iova &= PAGE_MASK;
>   	nchunks = 0;
>   
> @@ -654,7 +645,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>   				ret = vhost_vdpa_map(v, iova, csize,
>   						     map_pfn << PAGE_SHIFT,
> -						     msg->perm);
> +						     perm);
>   				if (ret) {
>   					/*
>   					 * Unpin the pages that are left unmapped
> @@ -683,7 +674,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   
>   	/* Pin the rest chunk */
>   	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> -			     map_pfn << PAGE_SHIFT, msg->perm);
> +			     map_pfn << PAGE_SHIFT, perm);
>   out:
>   	if (ret) {
>   		if (nchunks) {
> @@ -702,13 +693,32 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   			for (pfn = map_pfn; pfn <= last_pfn; pfn++)
>   				unpin_user_page(pfn_to_page(pfn));
>   		}
> -		vhost_vdpa_unmap(v, msg->iova, msg->size);
> +		vhost_vdpa_unmap(v, start, size);
>   	}
>   unlock:
>   	mmap_read_unlock(dev->mm);
>   free:
>   	free_page((unsigned long)page_list);
>   	return ret;
> +
> +}
> +
> +static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> +					   struct vhost_iotlb_msg *msg)
> +{
> +	struct vhost_dev *dev = &v->vdev;
> +	struct vhost_iotlb *iotlb = dev->iotlb;
> +
> +	if (msg->iova < v->range.first ||
> +	    msg->iova + msg->size - 1 > v->range.last)
> +		return -EINVAL;
> +
> +	if (vhost_iotlb_itree_first(iotlb, msg->iova,
> +				    msg->iova + msg->size - 1))
> +		return -EEXIST;
> +
> +	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
> +				 msg->perm);
>   }
>   
>   static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev,

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 07/11] vdpa: Support transferring virtual addressing during DMA mapping
       [not found] ` <20210315053721.189-8-xieyongji@bytedance.com>
@ 2021-03-23  3:13   ` Jason Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2021-03-23  3:13 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: linux-fsdevel, netdev, kvm, virtualization


在 2021/3/15 下午1:37, Xie Yongji 写道:
> This patch introduces an attribute for vDPA device to indicate
> whether virtual address can be used. If vDPA device driver set
> it, vhost-vdpa bus driver will not pin user page and transfer
> userspace virtual address instead of physical address during
> DMA mapping. And corresponding vma->vm_file and offset will be
> also passed as an opaque pointer.
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/ifcvf/ifcvf_main.c   |   2 +-
>   drivers/vdpa/mlx5/net/mlx5_vnet.c |   2 +-
>   drivers/vdpa/vdpa.c               |   9 +++-
>   drivers/vdpa/vdpa_sim/vdpa_sim.c  |   2 +-
>   drivers/vdpa/virtio_pci/vp_vdpa.c |   2 +-
>   drivers/vhost/vdpa.c              | 104 +++++++++++++++++++++++++++++++-------
>   include/linux/vdpa.h              |  19 +++++--
>   7 files changed, 113 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
> index d555a6a5d1ba..aee013f3eb5f 100644
> --- a/drivers/vdpa/ifcvf/ifcvf_main.c
> +++ b/drivers/vdpa/ifcvf/ifcvf_main.c
> @@ -431,7 +431,7 @@ static int ifcvf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   	}
>   
>   	adapter = vdpa_alloc_device(struct ifcvf_adapter, vdpa,
> -				    dev, &ifc_vdpa_ops, NULL);
> +				    dev, &ifc_vdpa_ops, NULL, false);
>   	if (adapter == NULL) {
>   		IFCVF_ERR(pdev, "Failed to allocate vDPA structure");
>   		return -ENOMEM;
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 71397fdafa6a..fb62ebcf464a 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1982,7 +1982,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
>   	max_vqs = min_t(u32, max_vqs, MLX5_MAX_SUPPORTED_VQS);
>   
>   	ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, mdev->device, &mlx5_vdpa_ops,
> -				 NULL);
> +				 NULL, false);
>   	if (IS_ERR(ndev))
>   		return PTR_ERR(ndev);
>   
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index 5cffce67cab0..97fbac276c72 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -71,6 +71,7 @@ static void vdpa_release_dev(struct device *d)
>    * @config: the bus operations that is supported by this device
>    * @size: size of the parent structure that contains private data
>    * @name: name of the vdpa device; optional.
> + * @use_va: indicate whether virtual address must be used by this device
>    *
>    * Driver should use vdpa_alloc_device() wrapper macro instead of
>    * using this directly.
> @@ -80,7 +81,8 @@ static void vdpa_release_dev(struct device *d)
>    */
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					size_t size, const char *name)
> +					size_t size, const char *name,
> +					bool use_va)
>   {
>   	struct vdpa_device *vdev;
>   	int err = -EINVAL;
> @@ -91,6 +93,10 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	if (!!config->dma_map != !!config->dma_unmap)
>   		goto err;
>   
> +	/* It should only work for the device that use on-chip IOMMU */
> +	if (use_va && !(config->dma_map || config->set_map))
> +		goto err;
> +
>   	err = -ENOMEM;
>   	vdev = kzalloc(size, GFP_KERNEL);
>   	if (!vdev)
> @@ -106,6 +112,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   	vdev->index = err;
>   	vdev->config = config;
>   	vdev->features_valid = false;
> +	vdev->use_va = use_va;
>   
>   	if (name)
>   		err = dev_set_name(&vdev->dev, "%s", name);
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index ff331f088baf..d26334e9a412 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -235,7 +235,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr)
>   		ops = &vdpasim_config_ops;
>   
>   	vdpasim = vdpa_alloc_device(struct vdpasim, vdpa, NULL, ops,
> -				    dev_attr->name);
> +				    dev_attr->name, false);
>   	if (!vdpasim)
>   		goto err_alloc;
>   
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index 1321a2fcd088..03b36aed48d6 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -377,7 +377,7 @@ static int vp_vdpa_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   		return ret;
>   
>   	vp_vdpa = vdpa_alloc_device(struct vp_vdpa, vdpa,
> -				    dev, &vp_vdpa_ops, NULL);
> +				    dev, &vp_vdpa_ops, NULL, false);
>   	if (vp_vdpa == NULL) {
>   		dev_err(dev, "vp_vdpa: Failed to allocate vDPA structure\n");
>   		return -ENOMEM;
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 7c83fbf3edac..b65c21ae98d1 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -480,21 +480,30 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>   static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v, u64 start, u64 last)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
>   	struct page *page;
>   	unsigned long pfn, pinned;
>   
>   	while ((map = vhost_iotlb_itree_first(iotlb, start, last)) != NULL) {
> -		pinned = map->size >> PAGE_SHIFT;
> -		for (pfn = map->addr >> PAGE_SHIFT;
> -		     pinned > 0; pfn++, pinned--) {
> -			page = pfn_to_page(pfn);
> -			if (map->perm & VHOST_ACCESS_WO)
> -				set_page_dirty_lock(page);
> -			unpin_user_page(page);
> +		if (!vdpa->use_va) {
> +			pinned = map->size >> PAGE_SHIFT;
> +			for (pfn = map->addr >> PAGE_SHIFT;
> +			     pinned > 0; pfn++, pinned--) {
> +				page = pfn_to_page(pfn);
> +				if (map->perm & VHOST_ACCESS_WO)
> +					set_page_dirty_lock(page);
> +				unpin_user_page(page);
> +			}
> +			atomic64_sub(map->size >> PAGE_SHIFT,
> +					&dev->mm->pinned_vm);
> +		} else {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			fput(map_file->file);
> +			kfree(map_file);


Let's factor out the logic of pa and va separatedly here.

Other looks good to me.

Thanks


>   		}
> -		atomic64_sub(map->size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   		vhost_iotlb_map_free(iotlb, map);
>   	}
>   }
> @@ -530,21 +539,21 @@ static int perm_to_iommu_flags(u32 perm)
>   	return flags | IOMMU_CACHE;
>   }
>   
> -static int vhost_vdpa_map(struct vhost_vdpa *v,
> -			  u64 iova, u64 size, u64 pa, u32 perm)
> +static int vhost_vdpa_map(struct vhost_vdpa *v, u64 iova,
> +			  u64 size, u64 pa, u32 perm, void *opaque)
>   {
>   	struct vhost_dev *dev = &v->vdev;
>   	struct vdpa_device *vdpa = v->vdpa;
>   	const struct vdpa_config_ops *ops = vdpa->config;
>   	int r = 0;
>   
> -	r = vhost_iotlb_add_range(dev->iotlb, iova, iova + size - 1,
> -				  pa, perm);
> +	r = vhost_iotlb_add_range_ctx(dev->iotlb, iova, iova + size - 1,
> +				      pa, perm, opaque);
>   	if (r)
>   		return r;
>   
>   	if (ops->dma_map) {
> -		r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
> +		r = ops->dma_map(vdpa, iova, size, pa, perm, opaque);
>   	} else if (ops->set_map) {
>   		if (!v->in_batch)
>   			r = ops->set_map(vdpa, dev->iotlb);
> @@ -552,13 +561,15 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
>   		r = iommu_map(v->domain, iova, pa, size,
>   			      perm_to_iommu_flags(perm));
>   	}
> -
> -	if (r)
> +	if (r) {
>   		vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
> -	else
> +		return r;
> +	}
> +
> +	if (!vdpa->use_va)
>   		atomic64_add(size >> PAGE_SHIFT, &dev->mm->pinned_vm);
>   
> -	return r;
> +	return 0;
>   }
>   
>   static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
> @@ -579,6 +590,56 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>   	}
>   }
>   
> +static int vhost_vdpa_va_map(struct vhost_vdpa *v,
> +			     u64 iova, u64 size, u64 uaddr, u32 perm)
> +{
> +	struct vhost_dev *dev = &v->vdev;
> +	u64 offset, map_size, map_iova = iova;
> +	struct vdpa_map_file *map_file;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	mmap_read_lock(dev->mm);
> +
> +	while (size) {
> +		vma = find_vma(dev->mm, uaddr);
> +		if (!vma) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		map_size = min(size, vma->vm_end - uaddr);
> +		if (!(vma->vm_file && (vma->vm_flags & VM_SHARED) &&
> +			!(vma->vm_flags & (VM_IO | VM_PFNMAP))))
> +			goto next;
> +
> +		map_file = kzalloc(sizeof(*map_file), GFP_KERNEL);
> +		if (!map_file) {
> +			ret = -ENOMEM;
> +			break;
> +		}
> +		offset = (vma->vm_pgoff << PAGE_SHIFT) + uaddr - vma->vm_start;
> +		map_file->offset = offset;
> +		map_file->file = get_file(vma->vm_file);
> +		ret = vhost_vdpa_map(v, map_iova, map_size, uaddr,
> +				     perm, map_file);
> +		if (ret) {
> +			fput(map_file->file);
> +			kfree(map_file);
> +			break;
> +		}
> +next:
> +		size -= map_size;
> +		uaddr += map_size;
> +		map_iova += map_size;
> +	}
> +	if (ret)
> +		vhost_vdpa_unmap(v, iova, map_iova - iova);
> +
> +	mmap_read_unlock(dev->mm);
> +
> +	return ret;
> +}
> +
>   static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>   			     u64 iova, u64 size, u64 uaddr, u32 perm)
>   {
> @@ -645,7 +706,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>   				csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>   				ret = vhost_vdpa_map(v, iova, csize,
>   						     map_pfn << PAGE_SHIFT,
> -						     perm);
> +						     perm, NULL);
>   				if (ret) {
>   					/*
>   					 * Unpin the pages that are left unmapped
> @@ -674,7 +735,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>   
>   	/* Pin the rest chunk */
>   	ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << PAGE_SHIFT,
> -			     map_pfn << PAGE_SHIFT, perm);
> +			     map_pfn << PAGE_SHIFT, perm, NULL);
>   out:
>   	if (ret) {
>   		if (nchunks) {
> @@ -707,6 +768,7 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   					   struct vhost_iotlb_msg *msg)
>   {
>   	struct vhost_dev *dev = &v->vdev;
> +	struct vdpa_device *vdpa = v->vdpa;
>   	struct vhost_iotlb *iotlb = dev->iotlb;
>   
>   	if (msg->iova < v->range.first ||
> @@ -717,6 +779,10 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>   				    msg->iova + msg->size - 1))
>   		return -EEXIST;
>   
> +	if (vdpa->use_va)
> +		return vhost_vdpa_va_map(v, msg->iova, msg->size,
> +					 msg->uaddr, msg->perm);
> +
>   	return vhost_vdpa_pa_map(v, msg->iova, msg->size, msg->uaddr,
>   				 msg->perm);
>   }
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index b01f7c9096bf..e67404e4b23e 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -44,6 +44,7 @@ struct vdpa_mgmt_dev;
>    * @config: the configuration ops for this device.
>    * @index: device index
>    * @features_valid: were features initialized? for legacy guests
> + * @use_va: indicate whether virtual address must be used by this device
>    * @nvqs: maximum number of supported virtqueues
>    * @mdev: management device pointer; caller must setup when registering device as part
>    *	  of dev_add() mgmtdev ops callback before invoking _vdpa_register_device().
> @@ -54,6 +55,7 @@ struct vdpa_device {
>   	const struct vdpa_config_ops *config;
>   	unsigned int index;
>   	bool features_valid;
> +	bool use_va;
>   	int nvqs;
>   	struct vdpa_mgmt_dev *mdev;
>   };
> @@ -69,6 +71,16 @@ struct vdpa_iova_range {
>   };
>   
>   /**
> + * Corresponding file area for device memory mapping
> + * @file: vma->vm_file for the mapping
> + * @offset: mapping offset in the vm_file
> + */
> +struct vdpa_map_file {
> +	struct file *file;
> +	u64 offset;
> +};
> +
> +/**
>    * vDPA_config_ops - operations for configuring a vDPA device.
>    * Note: vDPA device drivers are required to implement all of the
>    * operations unless it is mentioned to be optional in the following
> @@ -250,14 +262,15 @@ struct vdpa_config_ops {
>   
>   struct vdpa_device *__vdpa_alloc_device(struct device *parent,
>   					const struct vdpa_config_ops *config,
> -					size_t size, const char *name);
> +					size_t size, const char *name,
> +					bool use_va);
>   
> -#define vdpa_alloc_device(dev_struct, member, parent, config, name)   \
> +#define vdpa_alloc_device(dev_struct, member, parent, config, name, use_va)   \
>   			  container_of(__vdpa_alloc_device( \
>   				       parent, config, \
>   				       sizeof(dev_struct) + \
>   				       BUILD_BUG_ON_ZERO(offsetof( \
> -				       dev_struct, member)), name), \
> +				       dev_struct, member)), name, use_va), \
>   				       dev_struct, member)
>   
>   int vdpa_register_device(struct vdpa_device *vdev, int nvqs);

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver
       [not found] ` <20210315053721.189-9-xieyongji@bytedance.com>
@ 2021-03-24  3:54   ` Jason Wang
       [not found]     ` <CACycT3v_-G6ju-poofXEzYt8QPKWNFHwsS7t=KTLgs-=g+iPQQ@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Wang @ 2021-03-24  3:54 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: linux-fsdevel, netdev, kvm, virtualization


在 2021/3/15 下午1:37, Xie Yongji 写道:
> This implements an MMU-based IOMMU driver to support mapping
> kernel dma buffer into userspace. The basic idea behind it is
> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
> up MMU mapping instead of IOMMU mapping for the DMA transfer so
> that the userspace process is able to use its virtual address to
> access the dma buffer in kernel.
>
> And to avoid security issue, a bounce-buffering mechanism is
> introduced to prevent userspace accessing the original buffer
> directly.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   drivers/vdpa/vdpa_user/iova_domain.c | 535 +++++++++++++++++++++++++++++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h |  75 +++++
>   2 files changed, 610 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> new file mode 100644
> index 000000000000..83de216b0e51
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -0,0 +1,535 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.


2021 as well.


> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/highmem.h>
> +#include <linux/vmalloc.h>
> +#include <linux/vdpa.h>
> +
> +#include "iova_domain.h"
> +
> +static int vduse_iotlb_add_range(struct vduse_iova_domain *domain,
> +				 u64 start, u64 last,
> +				 u64 addr, unsigned int perm,
> +				 struct file *file, u64 offset)
> +{
> +	struct vdpa_map_file *map_file;
> +	int ret;
> +
> +	map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
> +	if (!map_file)
> +		return -ENOMEM;
> +
> +	map_file->file = get_file(file);
> +	map_file->offset = offset;
> +
> +	ret = vhost_iotlb_add_range_ctx(domain->iotlb, start, last,
> +					addr, perm, map_file);
> +	if (ret) {
> +		fput(map_file->file);
> +		kfree(map_file);
> +		return ret;
> +	}
> +	return 0;
> +}
> +
> +static void vduse_iotlb_del_range(struct vduse_iova_domain *domain,
> +				  u64 start, u64 last)
> +{
> +	struct vdpa_map_file *map_file;
> +	struct vhost_iotlb_map *map;
> +
> +	while ((map = vhost_iotlb_itree_first(domain->iotlb, start, last))) {
> +		map_file = (struct vdpa_map_file *)map->opaque;
> +		fput(map_file->file);
> +		kfree(map_file);
> +		vhost_iotlb_map_free(domain->iotlb, map);
> +	}
> +}
> +
> +int vduse_domain_set_map(struct vduse_iova_domain *domain,
> +			 struct vhost_iotlb *iotlb)
> +{
> +	struct vdpa_map_file *map_file;
> +	struct vhost_iotlb_map *map;
> +	u64 start = 0ULL, last = ULLONG_MAX;
> +	int ret;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	vduse_iotlb_del_range(domain, start, last);
> +
> +	for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> +	     map = vhost_iotlb_itree_next(map, start, last)) {
> +		map_file = (struct vdpa_map_file *)map->opaque;
> +		ret = vduse_iotlb_add_range(domain, map->start, map->last,
> +					    map->addr, map->perm,
> +					    map_file->file,
> +					    map_file->offset);
> +		if (ret)
> +			goto err;
> +	}
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return 0;
> +err:
> +	vduse_iotlb_del_range(domain, start, last);
> +	spin_unlock(&domain->iotlb_lock);
> +	return ret;
> +}
> +
> +static void vduse_domain_map_bounce_page(struct vduse_iova_domain *domain,
> +					 u64 iova, u64 size, u64 paddr)
> +{
> +	struct vduse_bounce_map *map;
> +	unsigned int index;
> +	u64 last = iova + size - 1;
> +
> +	while (iova < last) {
> +		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> +		index = offset_in_page(iova) >> IOVA_ALLOC_ORDER;
> +		map->orig_phys[index] = paddr;
> +		paddr += IOVA_ALLOC_SIZE;
> +		iova += IOVA_ALLOC_SIZE;
> +	}
> +}
> +
> +static void vduse_domain_unmap_bounce_page(struct vduse_iova_domain *domain,
> +					   u64 iova, u64 size)
> +{
> +	struct vduse_bounce_map *map;
> +	unsigned int index;
> +	u64 last = iova + size - 1;
> +
> +	while (iova < last) {
> +		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> +		index = offset_in_page(iova) >> IOVA_ALLOC_ORDER;
> +		map->orig_phys[index] = INVALID_PHYS_ADDR;
> +		iova += IOVA_ALLOC_SIZE;
> +	}
> +}
> +
> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
> +		      enum dma_data_direction dir)
> +{
> +	unsigned long pfn = PFN_DOWN(orig);
> +
> +	if (PageHighMem(pfn_to_page(pfn))) {
> +		unsigned int offset = offset_in_page(orig);
> +		char *buffer;
> +		unsigned int sz = 0;
> +
> +		while (size) {
> +			sz = min_t(size_t, PAGE_SIZE - offset, size);
> +
> +			buffer = kmap_atomic(pfn_to_page(pfn));


So kmap_atomic() can autoamtically go with fast path if the page does 
not belong to highmem.

I think we can removce the condition and just use kmap_atomic() for all 
the cases here.


> +			if (dir == DMA_TO_DEVICE)
> +				memcpy(addr, buffer + offset, sz);
> +			else
> +				memcpy(buffer + offset, addr, sz);
> +			kunmap_atomic(buffer);
> +
> +			size -= sz;
> +			pfn++;
> +			addr += sz;
> +			offset = 0;
> +		}
> +	} else if (dir == DMA_TO_DEVICE) {
> +		memcpy(addr, phys_to_virt(orig), size);
> +	} else {
> +		memcpy(phys_to_virt(orig), addr, size);
> +	}
> +}
> +
> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> +				dma_addr_t iova, size_t size,
> +				enum dma_data_direction dir)
> +{
> +	struct vduse_bounce_map *map;
> +	unsigned int index, offset;
> +	void *addr;
> +	size_t sz;
> +
> +	while (size) {
> +		map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> +		offset = offset_in_page(iova);
> +		sz = min_t(size_t, IOVA_ALLOC_SIZE, size);
> +
> +		if (map->bounce_page &&
> +		    map->orig_phys[index] != INVALID_PHYS_ADDR) {
> +			addr = page_address(map->bounce_page) + offset;
> +			index = offset >> IOVA_ALLOC_ORDER;
> +			do_bounce(map->orig_phys[index], addr, sz, dir);
> +		}
> +		size -= sz;
> +		iova += sz;
> +	}
> +}
> +
> +static struct page *
> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 start = iova & PAGE_MASK;
> +	u64 last = start + PAGE_SIZE - 1;
> +	struct vhost_iotlb_map *map;
> +	struct page *page = NULL;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, start, last);
> +	if (!map)
> +		goto out;
> +
> +	page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
> +	get_page(page);
> +out:
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return page;
> +}
> +
> +static struct page *
> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
> +{
> +	u64 start = iova & PAGE_MASK;
> +	struct page *page = alloc_page(GFP_KERNEL);
> +	struct vduse_bounce_map *map;
> +
> +	if (!page)
> +		return NULL;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> +	if (map->bounce_page) {
> +		__free_page(page);
> +		goto out;
> +	}
> +	map->bounce_page = page;
> +
> +	/* paired with vduse_domain_map_page() */
> +	smp_mb();


So this is suspicious. It's better to explain like, we need make sure A 
must be done after B.

And it looks to me the iotlb_lock is sufficnet to do the synchronization 
here. E.g any reason that you don't take it in 
vduse_domain_map_bounce_page().

And what's more, is there anyway to aovid holding the spinlock during 
bouncing?


> +
> +	vduse_domain_bounce(domain, start, PAGE_SIZE, DMA_TO_DEVICE);
> +out:
> +	get_page(map->bounce_page);
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	return map->bounce_page;
> +}
> +
> +static void
> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain)
> +{
> +	struct vduse_bounce_map *map;
> +	unsigned long i, pfn, bounce_pfns;
> +
> +	bounce_pfns = domain->bounce_size >> PAGE_SHIFT;
> +
> +	for (pfn = 0; pfn < bounce_pfns; pfn++) {
> +		map = &domain->bounce_maps[pfn];
> +		for (i = 0; i < IOVA_MAPS_PER_PAGE; i++) {
> +			if (WARN_ON(map->orig_phys[i] != INVALID_PHYS_ADDR))
> +				continue;
> +		}
> +		if (!map->bounce_page)
> +			continue;
> +
> +		__free_page(map->bounce_page);
> +		map->bounce_page = NULL;
> +	}
> +}
> +
> +void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
> +{
> +	if (!domain->bounce_map)
> +		return;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	if (!domain->bounce_map)
> +		goto unlock;
> +
> +	vduse_iotlb_del_range(domain, 0, domain->bounce_size - 1);
> +	domain->bounce_map = 0;
> +	vduse_domain_free_bounce_pages(domain);
> +unlock:
> +	spin_unlock(&domain->iotlb_lock);
> +}
> +
> +static int vduse_domain_init_bounce_map(struct vduse_iova_domain *domain)
> +{
> +	int ret;
> +
> +	if (domain->bounce_map)
> +		return 0;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	if (domain->bounce_map)
> +		goto unlock;
> +
> +	ret = vduse_iotlb_add_range(domain, 0, domain->bounce_size - 1,
> +				    0, VHOST_MAP_RW, domain->file, 0);
> +	if (!ret)
> +		domain->bounce_map = 1;
> +unlock:
> +	spin_unlock(&domain->iotlb_lock);
> +	return ret;
> +}
> +
> +static dma_addr_t
> +vduse_domain_alloc_iova(struct iova_domain *iovad,
> +			unsigned long size, unsigned long limit)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +	unsigned long iova_pfn;
> +
> +	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +		iova_len = roundup_pow_of_two(iova_len);
> +	iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
> +
> +	return iova_pfn << shift;
> +}
> +
> +static void vduse_domain_free_iova(struct iova_domain *iovad,
> +				   dma_addr_t iova, size_t size)
> +{
> +	unsigned long shift = iova_shift(iovad);
> +	unsigned long iova_len = iova_align(iovad, size) >> shift;
> +
> +	free_iova_fast(iovad, iova >> shift, iova_len);
> +}
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				 struct page *page, unsigned long offset,
> +				 size_t size, enum dma_data_direction dir,
> +				 unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +	unsigned long limit = domain->bounce_size - 1;
> +	phys_addr_t pa = page_to_phys(page) + offset;
> +	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> +
> +	if (!iova)
> +		return DMA_MAPPING_ERROR;
> +
> +	if (vduse_domain_init_bounce_map(domain)) {
> +		vduse_domain_free_iova(iovad, iova, size);
> +		return DMA_MAPPING_ERROR;
> +	}
> +
> +	vduse_domain_map_bounce_page(domain, (u64)iova, (u64)size, pa);
> +
> +	/* paired with vduse_domain_alloc_bounce_page() */
> +	smp_mb();
> +
> +	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, iova, size, DMA_TO_DEVICE);
> +
> +	return iova;
> +}
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			     dma_addr_t dma_addr, size_t size,
> +			     enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->stream_iovad;
> +
> +	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
> +		vduse_domain_bounce(domain, dma_addr, size, DMA_FROM_DEVICE);
> +
> +	vduse_domain_unmap_bounce_page(domain, (u64)dma_addr, (u64)size);
> +	vduse_domain_free_iova(iovad, dma_addr, size);
> +}
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				  size_t size, dma_addr_t *dma_addr,
> +				  gfp_t flag, unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	unsigned long limit = domain->iova_limit;
> +	dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
> +	void *orig = alloc_pages_exact(size, flag);
> +
> +	if (!iova || !orig)
> +		goto err;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	if (vduse_iotlb_add_range(domain, (u64)iova, (u64)iova + size - 1,
> +				  virt_to_phys(orig), VHOST_MAP_RW,
> +				  domain->file, (u64)iova)) {
> +		spin_unlock(&domain->iotlb_lock);
> +		goto err;
> +	}
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	*dma_addr = iova;
> +
> +	return orig;
> +err:
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	if (orig)
> +		free_pages_exact(orig, size);
> +	if (iova)
> +		vduse_domain_free_iova(iovad, iova, size);
> +
> +	return NULL;
> +}
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs)
> +{
> +	struct iova_domain *iovad = &domain->consistent_iovad;
> +	struct vhost_iotlb_map *map;
> +	struct vdpa_map_file *map_file;
> +	phys_addr_t pa;
> +
> +	spin_lock(&domain->iotlb_lock);
> +	map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
> +				      (u64)dma_addr + size - 1);
> +	if (WARN_ON(!map)) {
> +		spin_unlock(&domain->iotlb_lock);
> +		return;
> +	}
> +	map_file = (struct vdpa_map_file *)map->opaque;
> +	fput(map_file->file);
> +	kfree(map_file);
> +	pa = map->addr;
> +	vhost_iotlb_map_free(domain->iotlb, map);
> +	spin_unlock(&domain->iotlb_lock);
> +
> +	vduse_domain_free_iova(iovad, dma_addr, size);
> +	free_pages_exact(phys_to_virt(pa), size);


I wonder whether we should free the coherent page after munmap(). 
Otherwise usersapce can poke kernel pages in this way, e.g the page 
could be allocated and used by other subsystems?


> +}
> +
> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
> +	unsigned long iova = vmf->pgoff << PAGE_SHIFT;
> +	struct page *page;
> +
> +	if (!domain)
> +		return VM_FAULT_SIGBUS;
> +
> +	if (iova < domain->bounce_size)
> +		page = vduse_domain_alloc_bounce_page(domain, iova);
> +	else
> +		page = vduse_domain_get_mapping_page(domain, iova);
> +
> +	if (!page)
> +		return VM_FAULT_SIGBUS;
> +
> +	vmf->page = page;
> +
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
> +	.fault = vduse_domain_mmap_fault,
> +};
> +
> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
> +	vma->vm_private_data = domain;
> +	vma->vm_ops = &vduse_domain_mmap_ops;
> +
> +	return 0;
> +}
> +
> +static int vduse_domain_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_iova_domain *domain = file->private_data;
> +
> +	vduse_domain_reset_bounce_map(domain);
> +	put_iova_domain(&domain->stream_iovad);
> +	put_iova_domain(&domain->consistent_iovad);
> +	vhost_iotlb_free(domain->iotlb);
> +	vfree(domain->bounce_maps);
> +	kfree(domain);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_domain_fops = {
> +	.mmap = vduse_domain_mmap,
> +	.release = vduse_domain_release,
> +};
> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
> +{
> +	fput(domain->file);
> +}
> +
> +struct vduse_iova_domain *
> +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
> +{
> +	struct vduse_iova_domain *domain;
> +	struct file *file;
> +	struct vduse_bounce_map *map;
> +	unsigned long i, pfn, bounce_pfns;
> +
> +	bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
> +	if (iova_limit <= bounce_size)
> +		return NULL;
> +
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!domain)
> +		return NULL;
> +
> +	domain->iotlb = vhost_iotlb_alloc(0, 0);
> +	if (!domain->iotlb)
> +		goto err_iotlb;
> +
> +	domain->iova_limit = iova_limit;
> +	domain->bounce_size = PAGE_ALIGN(bounce_size);
> +	domain->bounce_maps = vzalloc(bounce_pfns *
> +				sizeof(struct vduse_bounce_map));
> +	if (!domain->bounce_maps)
> +		goto err_map;
> +
> +	for (pfn = 0; pfn < bounce_pfns; pfn++) {
> +		map = &domain->bounce_maps[pfn];
> +		for (i = 0; i < IOVA_MAPS_PER_PAGE; i++)
> +			map->orig_phys[i] = INVALID_PHYS_ADDR;
> +	}
> +	file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
> +				domain, O_RDWR);
> +	if (IS_ERR(file))
> +		goto err_file;
> +
> +	domain->file = file;
> +	spin_lock_init(&domain->iotlb_lock);
> +	init_iova_domain(&domain->stream_iovad,
> +			IOVA_ALLOC_SIZE, IOVA_START_PFN);
> +	init_iova_domain(&domain->consistent_iovad,
> +			PAGE_SIZE, bounce_pfns);


Any reason for treating coherent and stream DMA differently (the 
different granule)?


> +
> +	return domain;
> +err_file:
> +	vfree(domain->bounce_maps);
> +err_map:
> +	vhost_iotlb_free(domain->iotlb);
> +err_iotlb:
> +	kfree(domain);
> +	return NULL;
> +}
> +
> +int vduse_domain_init(void)
> +{
> +	return iova_cache_get();
> +}
> +
> +void vduse_domain_exit(void)
> +{
> +	iova_cache_put();
> +}
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> new file mode 100644
> index 000000000000..faeeedfaa786
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * MMU-based IOMMU implementation
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#ifndef _VDUSE_IOVA_DOMAIN_H
> +#define _VDUSE_IOVA_DOMAIN_H
> +
> +#include <linux/iova.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/vhost_iotlb.h>
> +
> +#define IOVA_START_PFN 1
> +
> +#define IOVA_ALLOC_ORDER 12
> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> +
> +#define IOVA_MAPS_PER_PAGE (1 << (PAGE_SHIFT - IOVA_ALLOC_ORDER))
> +
> +#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
> +
> +struct vduse_bounce_map {
> +	struct page *bounce_page;
> +	u64 orig_phys[IOVA_MAPS_PER_PAGE];


Sorry if I had asked this before. But I'm not sure it's worth to have 
this extra complexitiy. If I read the code correctly, the 
IOVA_MAPS_PER_PAGE is 1 for the archs that have 4K page. Have you tested 
the code on the archs that have more than 4K page?

Thanks


> +};
> +
> +struct vduse_iova_domain {
> +	struct iova_domain stream_iovad;
> +	struct iova_domain consistent_iovad;
> +	struct vduse_bounce_map *bounce_maps;
> +	size_t bounce_size;
> +	unsigned long iova_limit;
> +	int bounce_map;
> +	struct vhost_iotlb *iotlb;
> +	spinlock_t iotlb_lock;
> +	struct file *file;
> +};
> +
> +int vduse_domain_set_map(struct vduse_iova_domain *domain,
> +			struct vhost_iotlb *iotlb);
> +
> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
> +				struct page *page, unsigned long offset,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs);
> +
> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
> +			dma_addr_t dma_addr, size_t size,
> +			enum dma_data_direction dir, unsigned long attrs);
> +
> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
> +				size_t size, dma_addr_t *dma_addr,
> +				gfp_t flag, unsigned long attrs);
> +
> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
> +				void *vaddr, dma_addr_t dma_addr,
> +				unsigned long attrs);
> +
> +void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain);
> +
> +void vduse_domain_destroy(struct vduse_iova_domain *domain);
> +
> +struct vduse_iova_domain *vduse_domain_create(unsigned long iova_limit,
> +						size_t bounce_size);
> +
> +int vduse_domain_init(void);
> +
> +void vduse_domain_exit(void);
> +
> +#endif /* _VDUSE_IOVA_DOMAIN_H */

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 09/11] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found] ` <20210315053721.189-10-xieyongji@bytedance.com>
@ 2021-03-24  4:43   ` Jason Wang
       [not found]     ` <CACycT3u5hEO8=JVkbHmKXYMManiZ1VedyS6O+7M8Rzbk1ftC7A@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Wang @ 2021-03-24  4:43 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: linux-fsdevel, netdev, kvm, virtualization


在 2021/3/15 下午1:37, Xie Yongji 写道:
> This VDUSE driver enables implementing vDPA devices in userspace.
> Both control path and data path of vDPA devices will be able to
> be handled in userspace.
>
> In the control path, the VDUSE driver will make use of message
> mechnism to forward the config operation from vdpa bus driver
> to userspace. Userspace can use read()/write() to receive/reply
> those control messages.
>
> In the data path, userspace can use mmap() to access vDPA device's
> iova regions obtained through VDUSE_IOTLB_GET_ENTRY ioctl. Besides,
> userspace can use ioctl() to inject interrupt and use the eventfd
> mechanism to receive virtqueue kicks.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1281 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  153 +++
>   6 files changed, 1451 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a28c839..71722e6f8f23 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index a245809c99d0..77a1da522c21 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>   	help
>   	  vDPA networking device simulator which loops TX traffic back to RX.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index 67fe7f3d6943..f02ebed33f19 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
>   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..07d0ae92d470
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1281 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/miscdevice.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_VERSION  "1.0"
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	bool ready;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +	struct work_struct inject;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device dev;
> +	struct cdev cdev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	spinlock_t msg_lock;
> +	atomic64_t msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct list_head list;
> +	bool connected;
> +	int minor;
> +	u16 vq_size_max;
> +	u16 vq_num;
> +	u32 vq_align;
> +	u32 device_id;
> +	u32 vendor_id;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
> +module_param(max_bounce_size, ulong, 0444);
> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
> +
> +static unsigned long max_iova_size = (128 * 1024 * 1024);
> +module_param(max_iova_size, ulong, 0444);
> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static LIST_HEAD(vduse_devs);
> +static DEFINE_IDA(vduse_ida);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +static struct workqueue_struct *vduse_irq_wq;
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t request_id)
> +{
> +	struct vduse_dev_msg *tmp, *msg = NULL;
> +
> +	list_for_each_entry(tmp, head, list) {
> +		if (tmp->req.request_id == request_id) {
> +			msg = tmp;
> +			list_del(&tmp->list);
> +			break;
> +		}
> +	}
> +
> +	return msg;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_sync(struct vduse_dev *dev,
> +			      struct vduse_dev_msg *msg)
> +{
> +	init_waitqueue_head(&msg->waitq);
> +	spin_lock(&dev->msg_lock);
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	spin_unlock(&dev->msg_lock);
> +	wait_event_interruptible(msg->waitq, msg->completed);
> +	spin_lock(&dev->msg_lock);
> +	if (!msg->completed)
> +		list_del(&msg->list);
> +	spin_unlock(&dev->msg_lock);
> +
> +	return (msg->resp.result == VDUSE_REQUEST_OK) ? 0 : -1;
> +}
> +
> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_FEATURES;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);


Let's introduce a helper for the atomic64_fetch_inc() here.


> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.f.features;
> +}
> +
> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_FEATURES;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.f.features = features;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_STATUS;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.s.status;
> +}
> +
> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_STATUS;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.s.status = status;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
> +				 void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_GET_CONFIG;
> +		msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		vduse_dev_msg_sync(dev, &msg);
> +		memcpy(buf, msg.resp.config.data, sz);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
> +				 const void *buf, unsigned int len)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	unsigned int sz;
> +
> +	while (len) {
> +		sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
> +		msg.req.type = VDUSE_SET_CONFIG;
> +		msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +		msg.req.config.offset = offset;
> +		msg.req.config.len = sz;
> +		memcpy(msg.req.config.data, buf, sz);
> +		vduse_dev_msg_sync(dev, &msg);
> +		buf += sz;
> +		offset += sz;
> +		len -= sz;
> +	}
> +}
> +
> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
> +				 struct vduse_virtqueue *vq, u32 num)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_NUM;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_num.index = vq->index;
> +	msg.req.vq_num.num = num;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
> +				 struct vduse_virtqueue *vq, u64 desc_addr,
> +				 u64 driver_addr, u64 device_addr)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_ADDR;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_addr.index = vq->index;
> +	msg.req.vq_addr.desc_addr = desc_addr;
> +	msg.req.vq_addr.driver_addr = driver_addr;
> +	msg.req.vq_addr.device_addr = device_addr;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq, bool ready)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_READY;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_ready.index = vq->index;
> +	msg.req.vq_ready.ready = ready;
> +
> +	vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
> +				   struct vduse_virtqueue *vq)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_GET_VQ_READY;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_ready.index = vq->index;
> +
> +	return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;
> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_sync(dev, &msg);
> +	if (!ret)
> +		state->avail_index = msg.resp.vq_state.avail_idx;
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
> +				struct vduse_virtqueue *vq,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	msg.req.type = VDUSE_SET_VQ_STATE;
> +	msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg.req.vq_state.index = vq->index;
> +	msg.req.vq_state.avail_idx = state->avail_index;
> +
> +	return vduse_dev_msg_sync(dev, &msg);
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg = kzalloc(sizeof(*msg), GFP_ATOMIC);


The return value is not checked.


> +	msg->req.type = VDUSE_UPDATE_IOTLB;


What would usespace do after receiving VDUSE_UPDATE_IOTLB? If it still 
needs to issue VDUSE_GET_ENTRY with probably -EINVAL, it's kind of 
overkill. So it looks to me that the VDUSE_UPDATE_IOTLB is acutally kind 
of flush or unmap here. If this is true, should we introduce a new type 
or just rename it as VDUSE_IOTLB_UNMAP?


> +	msg->req.request_id = atomic64_fetch_inc(&dev->msg_unique);
> +	msg->req.iova.start = start;
> +	msg->req.iova.last = last;
> +
> +	return vduse_dev_msg_sync(dev, msg);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret = 0;
> +
> +	if (iov_iter_count(to) < size)
> +		return 0;
> +
> +	spin_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		spin_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		spin_lock(&dev->msg_lock);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +	ret = copy_to_iter(&msg->req, size, to);
> +	spin_lock(&dev->msg_lock);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	vduse_enqueue_msg(&dev->recv_list, msg);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;


EPOLLOUT is missed here?


> +
> +	return mask;
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +
> +	vduse_domain_reset_bounce_map(dev->domain);
> +	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);


Simialrly, IOTLB update should be done before the resetting?

And it would be helpful to add comment to explain how coherent mappings 
is handled.


> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->ready = false;
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_addr(dev, vq, desc_area,
> +					driver_area, device_area);
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->ready && vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_num(dev, vq, num);
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vduse_dev_set_vq_ready(dev, vq, ready);
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = vduse_dev_get_vq_ready(dev, vq);
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_set_vq_state(dev, vq, state);
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_features(dev);
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +		return -EINVAL;
> +
> +	return vduse_dev_set_features(dev, features);
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	/* We don't support config interrupt */
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return vduse_dev_get_status(dev);
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	if (status == 0)
> +		vduse_dev_reset(dev);
> +
> +	vduse_dev_set_status(dev, status);
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +			     void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_get_config(dev, offset, buf, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	vduse_dev_set_config(dev, offset, buf, len);
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	int ret;
> +


So I wonder we need to do the vhost_dev_update_iotlb() before 
vduse_domain_set_map().

That is, we need to make sure the userspace's IOTLB is cleared after 
setting up the new map?


> +	ret = vduse_domain_set_map(dev->domain, iotlb);
> +	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +
> +	return ret;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	WARN_ON(!list_empty(&dev->send_list));
> +	WARN_ON(!list_empty(&dev->recv_list));
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +				     unsigned long offset, size_t size,
> +				     enum dma_data_direction dir,
> +				     unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	*dma_addr = (dma_addr_t)iova;
> +	vduse_dev_update_iotlb(vdev, iova, iova + size - 1);
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long start = (unsigned long)dma_addr;
> +	unsigned long last = start + size - 1;
> +
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +	vduse_dev_update_iotlb(vdev, start, last);
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalidate vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	vq = &dev->vqs[eventfd->index];
> +	if (eventfd->fd > 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> +		return 0;
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_ENTRY: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct vduse_iova_domain *domain = dev->domain;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		spin_lock(&domain->iotlb_lock);
> +		map = vhost_iotlb_itree_first(domain->iotlb,
> +					      entry.start, entry.start + 1);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&domain->iotlb_lock);
> +		ret = -EINVAL;


So we need document this in the uAPI doc. I think when userspace see 
-EINVAL it means the map doesn't exist.

Or should we make it more explicitly by e.g introduing new flags.


> +		if (!f)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			break;
> +		}
> +		ret = receive_fd_user(f, argp, perm_to_file_flags(entry.perm));
> +		fput(f);
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_INJECT_VQ_IRQ:
> +		ret = -EINVAL;
> +		if (arg >= dev->vq_num)
> +			break;
> +
> +		ret = 0;
> +		queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
> +		break;
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int i;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		spin_lock(&vq->kick_lock);
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +	}
> +
> +	spin_lock(&dev->msg_lock);
> +	while ((msg = vduse_dequeue_msg(&dev->recv_list)))
> +		vduse_enqueue_msg(&dev->send_list, msg);


What's the goal of this?

In addition to free the messages, we need wake up the processes that is 
in the waitq in this case.


> +	spin_unlock(&dev->msg_lock);
> +
> +	dev->connected = false;


Do we need to hold vduse mutex here?


> +
> +	return 0;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = container_of(inode->i_cdev,
> +					struct vduse_dev, cdev);
> +	int ret = -EBUSY;
> +
> +	mutex_lock(&vduse_lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	atomic64_set(&dev->msg_unique, 0);
> +
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *tmp, *dev = NULL;
> +
> +	list_for_each_entry(tmp, &vduse_devs, list) {
> +		if (!strcmp(dev_name(&tmp->dev), name)) {
> +			dev = tmp;
> +			break;
> +		}
> +	}
> +	return dev;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	if (dev->vdev || dev->connected)
> +		return -EBUSY;
> +
> +	dev->connected = true;


Need mutex here?


> +	list_del(&dev->list);
> +	cdev_device_del(&dev->cdev, &dev->dev);
> +	put_device(&dev->dev);
> +
> +	return 0;
> +}
> +
> +static void vduse_release_dev(struct device *device)
> +{
> +	struct vduse_dev *dev =
> +		container_of(device, struct vduse_dev, dev);
> +
> +	ida_simple_remove(&vduse_ida, dev->minor);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config)
> +{
> +	int i, ret = -ENOMEM;
> +	struct vduse_dev *dev;
> +
> +	if (config->bounce_size > max_bounce_size)
> +		return -EINVAL;
> +
> +	if (config->bounce_size > max_iova_size)
> +		return -EINVAL;
> +
> +	if (vduse_find_dev(config->name))
> +		return -EEXIST;
> +
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->domain = vduse_domain_create(max_iova_size - 1,
> +					config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_ida;
> +
> +	dev->minor = ret;
> +	device_initialize(&dev->dev);
> +	dev->dev.release = vduse_release_dev;
> +	dev->dev.class = vduse_class;
> +	dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
> +	ret = dev_set_name(&dev->dev, "%s", config->name);
> +	if (ret)
> +		goto err_name;
> +
> +	cdev_init(&dev->cdev, &vduse_dev_fops);
> +	dev->cdev.owner = THIS_MODULE;
> +
> +	ret = cdev_device_add(&dev->cdev, &dev->dev);
> +	if (ret) {
> +		put_device(&dev->dev);
> +		return ret;
> +	}
> +	list_add(&dev->list, &vduse_devs);
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_name:
> +	ida_simple_remove(&vduse_ida, dev->minor);
> +err_ida:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:


So the rewind after device_initialize() looks wrong, we should use 
put_device() which will use dev.relase().

See the comment of device_initialize():

  * NOTE: Use put_device() to give up your reference instead of freeing
  * @dev directly once you have called this function.
  */

> +	vduse_dev_destroy(dev);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_GET_API_VERSION:
> +		ret = VDUSE_API_VERSION;


To preseve the uAPI compatibility, besides GET_API_VERSION, we need 
SET_API_VERSION to support older userspace.

And we need probably all the ioctls when API version is not set from 
userspace.


> +		break;
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, sizeof(config)))
> +			break;
> +
> +		ret = vduse_create_dev(&config);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static struct miscdevice vduse_misc = {
> +	.fops = &vduse_fops,
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "vduse",
> +	.nodename = "vduse/control",
> +};
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev = dev->vdev;
> +	int ret;
> +
> +	if (vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, &dev->dev,
> +				 &vduse_vdpa_config_ops, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret)
> +		goto err;
> +
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	ret = _vdpa_register_device(&vdev->vdpa, dev->vq_num);
> +	if (ret)
> +		goto err;
> +
> +	dev->vdev = vdev;
> +
> +	return 0;
> +err:
> +	put_device(&vdev->vdpa.dev);
> +	return ret;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret = -EINVAL;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev)
> +		goto unlock;
> +
> +	ret = vduse_dev_add_vdpa(dev, name);
> +unlock:
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +
> +	if (max_bounce_size >= max_iova_size)
> +		return -EINVAL;
> +
> +	ret = misc_register(&vduse_misc);
> +	if (ret)
> +		return ret;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class)) {
> +		ret = PTR_ERR(vduse_class);
> +		goto err_class;
> +	}
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev;
> +
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;
> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	destroy_workqueue(vduse_irq_wq);
> +err_wq:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev:
> +	class_destroy(vduse_class);
> +err_class:
> +	misc_deregister(&vduse_misc);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	misc_deregister(&vduse_misc);
> +	class_destroy(vduse_class);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	destroy_workqueue(vduse_irq_wq);
> +	vduse_domain_exit();
> +	vduse_mgmtdev_exit();
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..37f7d7059aa8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,153 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_API_VERSION	0
> +
> +#define VDUSE_CONFIG_DATA_LEN	256
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	VDUSE_SET_VQ_NUM,
> +	VDUSE_SET_VQ_ADDR,
> +	VDUSE_SET_VQ_READY,
> +	VDUSE_GET_VQ_READY,
> +	VDUSE_SET_VQ_STATE,
> +	VDUSE_GET_VQ_STATE,
> +	VDUSE_SET_FEATURES,
> +	VDUSE_GET_FEATURES,
> +	VDUSE_SET_STATUS,
> +	VDUSE_GET_STATUS,
> +	VDUSE_SET_CONFIG,
> +	VDUSE_GET_CONFIG,
> +	VDUSE_UPDATE_IOTLB,
> +};


Need comment to explain each type.


> +
> +struct vduse_vq_num {
> +	__u32 index;
> +	__u32 num;
> +};
> +
> +struct vduse_vq_addr {
> +	__u32 index;
> +	__u64 desc_addr;
> +	__u64 driver_addr;
> +	__u64 device_addr;
> +};
> +
> +struct vduse_vq_ready {
> +	__u32 index;
> +	__u8 ready;
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index;
> +	__u16 avail_idx;
> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset;
> +	__u32 len;
> +	__u8 data[VDUSE_CONFIG_DATA_LEN];
> +};
> +
> +struct vduse_iova_range {
> +	__u64 start;
> +	__u64 last;
> +};
> +
> +struct vduse_features {
> +	__u64 features;
> +};
> +
> +struct vduse_status {
> +	__u8 status;
> +};


Need comment for all the above uapi.


> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 request_id; /* request id */
> +	__u32 reserved[2]; /* for feature use */
> +	union {
> +		struct vduse_vq_num vq_num; /* virtqueue num */
> +		struct vduse_vq_addr vq_addr; /* virtqueue address */
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		struct vduse_features f; /* virtio features */
> +		struct vduse_status s; /* device status */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQUEST_OK	0x00
> +#define VDUSE_REQUEST_FAILED	0x01
> +	__u32 result; /* the result of request */
> +	__u32 reserved[2]; /* for feature use */
> +	union {
> +		struct vduse_vq_ready vq_ready; /* virtqueue ready status */
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_dev_config_data config; /* virtio device config space */
> +		struct vduse_features f; /* virtio features */
> +		struct vduse_status s; /* device status */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_num; /* the number of virtqueues */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +};
> +
> +struct vduse_iotlb_entry {
> +	int fd;
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */


Let's re-order the perm or add explict padding here to avoid hole.

Thanks


> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION	_IO(VDUSE_BASE, 0x00)
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x02, char[VDUSE_NAME_MAX])
> +
> +/* Get a mmap'able iova region */
> +#define VDUSE_IOTLB_GET_ENTRY	_IOWR(VDUSE_BASE, 0x03, struct vduse_iotlb_entry)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x04, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)
> +
> +#endif /* _UAPI_VDUSE_H_ */


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 10/11] vduse: Add config interrupt support
       [not found] ` <20210315053721.189-11-xieyongji@bytedance.com>
@ 2021-03-24  4:45   ` Jason Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2021-03-24  4:45 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, bob.liu, hch,
	rdunlap, willy, viro, axboe, bcrl, corbet, mika.penttila,
	dan.carpenter
  Cc: linux-fsdevel, netdev, kvm, virtualization


在 2021/3/15 下午1:37, Xie Yongji 写道:
> This patch introduces a new ioctl VDUSE_INJECT_CONFIG_IRQ
> to support injecting config interrupt.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>


I suggest to squash this into path 9.

Other looks good.

Thanks


> ---
>   drivers/vdpa/vdpa_user/vduse_dev.c | 24 +++++++++++++++++++++++-
>   include/uapi/linux/vduse.h         |  3 +++
>   2 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 07d0ae92d470..cc12b58bdc09 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -64,6 +64,8 @@ struct vduse_dev {
>   	struct list_head send_list;
>   	struct list_head recv_list;
>   	struct list_head list;
> +	struct vdpa_callback config_cb;
> +	spinlock_t irq_lock;
>   	bool connected;
>   	int minor;
>   	u16 vq_size_max;
> @@ -439,6 +441,11 @@ static void vduse_dev_reset(struct vduse_dev *dev)
>   	vduse_domain_reset_bounce_map(dev->domain);
>   	vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>   
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = NULL;
> +	dev->config_cb.private = NULL;
> +	spin_unlock(&dev->irq_lock);
> +
>   	for (i = 0; i < dev->vq_num; i++) {
>   		struct vduse_virtqueue *vq = &dev->vqs[i];
>   
> @@ -557,7 +564,12 @@ static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>   static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>   				  struct vdpa_callback *cb)
>   {
> -	/* We don't support config interrupt */
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = cb->callback;
> +	dev->config_cb.private = cb->private;
> +	spin_unlock(&dev->irq_lock);
>   }
>   
>   static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> @@ -842,6 +854,15 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   		ret = 0;
>   		queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
>   		break;
> +	case VDUSE_INJECT_CONFIG_IRQ:
> +		ret = -EINVAL;
> +		spin_lock_irq(&dev->irq_lock);
> +		if (dev->config_cb.callback) {
> +			dev->config_cb.callback(dev->config_cb.private);
> +			ret = 0;
> +		}
> +		spin_unlock_irq(&dev->irq_lock);
> +		break;
>   	default:
>   		ret = -ENOIOCTLCMD;
>   		break;
> @@ -918,6 +939,7 @@ static struct vduse_dev *vduse_dev_create(void)
>   	INIT_LIST_HEAD(&dev->send_list);
>   	INIT_LIST_HEAD(&dev->recv_list);
>   	atomic64_set(&dev->msg_unique, 0);
> +	spin_lock_init(&dev->irq_lock);
>   
>   	init_waitqueue_head(&dev->waitq);
>   
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> index 37f7d7059aa8..337e766f5622 100644
> --- a/include/uapi/linux/vduse.h
> +++ b/include/uapi/linux/vduse.h
> @@ -150,4 +150,7 @@ struct vduse_vq_eventfd {
>   /* Inject an interrupt for specific virtqueue */
>   #define VDUSE_INJECT_VQ_IRQ	_IO(VDUSE_BASE, 0x05)
>   
> +/* Inject a config interrupt */
> +#define VDUSE_INJECT_CONFIG_IRQ	_IO(VDUSE_BASE, 0x06)
> +
>   #endif /* _UAPI_VDUSE_H_ */

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver
       [not found]     ` <CACycT3v_-G6ju-poofXEzYt8QPKWNFHwsS7t=KTLgs-=g+iPQQ@mail.gmail.com>
@ 2021-03-25  4:52       ` Jason Wang
       [not found]         ` <CACycT3uS870yy04rw7KBk==sioi+VNunxVz6BQH-Lmxk6m-VSg@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Wang @ 2021-03-25  4:52 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, netdev,
	Randy Dunlap, Matthew Wilcox, virtualization, Christoph Hellwig,
	Bob Liu, bcrl, viro, Stefan Hajnoczi, linux-fsdevel,
	Dan Carpenter, Mika Penttilä


在 2021/3/24 下午3:39, Yongji Xie 写道:
> On Wed, Mar 24, 2021 at 11:54 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/3/15 下午1:37, Xie Yongji 写道:
>>> This implements an MMU-based IOMMU driver to support mapping
>>> kernel dma buffer into userspace. The basic idea behind it is
>>> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
>>> up MMU mapping instead of IOMMU mapping for the DMA transfer so
>>> that the userspace process is able to use its virtual address to
>>> access the dma buffer in kernel.
>>>
>>> And to avoid security issue, a bounce-buffering mechanism is
>>> introduced to prevent userspace accessing the original buffer
>>> directly.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    drivers/vdpa/vdpa_user/iova_domain.c | 535 +++++++++++++++++++++++++++++++++++
>>>    drivers/vdpa/vdpa_user/iova_domain.h |  75 +++++
>>>    2 files changed, 610 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>> new file mode 100644
>>> index 000000000000..83de216b0e51
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>> @@ -0,0 +1,535 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>
>> 2021 as well.
>>
> Sure.
>
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <linux/anon_inodes.h>
>>> +#include <linux/highmem.h>
>>> +#include <linux/vmalloc.h>
>>> +#include <linux/vdpa.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +static int vduse_iotlb_add_range(struct vduse_iova_domain *domain,
>>> +                              u64 start, u64 last,
>>> +                              u64 addr, unsigned int perm,
>>> +                              struct file *file, u64 offset)
>>> +{
>>> +     struct vdpa_map_file *map_file;
>>> +     int ret;
>>> +
>>> +     map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
>>> +     if (!map_file)
>>> +             return -ENOMEM;
>>> +
>>> +     map_file->file = get_file(file);
>>> +     map_file->offset = offset;
>>> +
>>> +     ret = vhost_iotlb_add_range_ctx(domain->iotlb, start, last,
>>> +                                     addr, perm, map_file);
>>> +     if (ret) {
>>> +             fput(map_file->file);
>>> +             kfree(map_file);
>>> +             return ret;
>>> +     }
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_iotlb_del_range(struct vduse_iova_domain *domain,
>>> +                               u64 start, u64 last)
>>> +{
>>> +     struct vdpa_map_file *map_file;
>>> +     struct vhost_iotlb_map *map;
>>> +
>>> +     while ((map = vhost_iotlb_itree_first(domain->iotlb, start, last))) {
>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>> +             fput(map_file->file);
>>> +             kfree(map_file);
>>> +             vhost_iotlb_map_free(domain->iotlb, map);
>>> +     }
>>> +}
>>> +
>>> +int vduse_domain_set_map(struct vduse_iova_domain *domain,
>>> +                      struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vdpa_map_file *map_file;
>>> +     struct vhost_iotlb_map *map;
>>> +     u64 start = 0ULL, last = ULLONG_MAX;
>>> +     int ret;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     vduse_iotlb_del_range(domain, start, last);
>>> +
>>> +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
>>> +          map = vhost_iotlb_itree_next(map, start, last)) {
>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>> +             ret = vduse_iotlb_add_range(domain, map->start, map->last,
>>> +                                         map->addr, map->perm,
>>> +                                         map_file->file,
>>> +                                         map_file->offset);
>>> +             if (ret)
>>> +                     goto err;
>>> +     }
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     return 0;
>>> +err:
>>> +     vduse_iotlb_del_range(domain, start, last);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_domain_map_bounce_page(struct vduse_iova_domain *domain,
>>> +                                      u64 iova, u64 size, u64 paddr)
>>> +{
>>> +     struct vduse_bounce_map *map;
>>> +     unsigned int index;
>>> +     u64 last = iova + size - 1;
>>> +
>>> +     while (iova < last) {
>>> +             map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>> +             index = offset_in_page(iova) >> IOVA_ALLOC_ORDER;
>>> +             map->orig_phys[index] = paddr;
>>> +             paddr += IOVA_ALLOC_SIZE;
>>> +             iova += IOVA_ALLOC_SIZE;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_domain_unmap_bounce_page(struct vduse_iova_domain *domain,
>>> +                                        u64 iova, u64 size)
>>> +{
>>> +     struct vduse_bounce_map *map;
>>> +     unsigned int index;
>>> +     u64 last = iova + size - 1;
>>> +
>>> +     while (iova < last) {
>>> +             map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>> +             index = offset_in_page(iova) >> IOVA_ALLOC_ORDER;
>>> +             map->orig_phys[index] = INVALID_PHYS_ADDR;
>>> +             iova += IOVA_ALLOC_SIZE;
>>> +     }
>>> +}
>>> +
>>> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
>>> +                   enum dma_data_direction dir)
>>> +{
>>> +     unsigned long pfn = PFN_DOWN(orig);
>>> +
>>> +     if (PageHighMem(pfn_to_page(pfn))) {
>>> +             unsigned int offset = offset_in_page(orig);
>>> +             char *buffer;
>>> +             unsigned int sz = 0;
>>> +
>>> +             while (size) {
>>> +                     sz = min_t(size_t, PAGE_SIZE - offset, size);
>>> +
>>> +                     buffer = kmap_atomic(pfn_to_page(pfn));
>>
>> So kmap_atomic() can autoamtically go with fast path if the page does
>> not belong to highmem.
>>
>> I think we can removce the condition and just use kmap_atomic() for all
>> the cases here.
>>
> Looks good to me.
>
>>> +                     if (dir == DMA_TO_DEVICE)
>>> +                             memcpy(addr, buffer + offset, sz);
>>> +                     else
>>> +                             memcpy(buffer + offset, addr, sz);
>>> +                     kunmap_atomic(buffer);
>>> +
>>> +                     size -= sz;
>>> +                     pfn++;
>>> +                     addr += sz;
>>> +                     offset = 0;
>>> +             }
>>> +     } else if (dir == DMA_TO_DEVICE) {
>>> +             memcpy(addr, phys_to_virt(orig), size);
>>> +     } else {
>>> +             memcpy(phys_to_virt(orig), addr, size);
>>> +     }
>>> +}
>>> +
>>> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>> +                             dma_addr_t iova, size_t size,
>>> +                             enum dma_data_direction dir)
>>> +{
>>> +     struct vduse_bounce_map *map;
>>> +     unsigned int index, offset;
>>> +     void *addr;
>>> +     size_t sz;
>>> +
>>> +     while (size) {
>>> +             map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>> +             offset = offset_in_page(iova);
>>> +             sz = min_t(size_t, IOVA_ALLOC_SIZE, size);
>>> +
>>> +             if (map->bounce_page &&
>>> +                 map->orig_phys[index] != INVALID_PHYS_ADDR) {
>>> +                     addr = page_address(map->bounce_page) + offset;
>>> +                     index = offset >> IOVA_ALLOC_ORDER;
>>> +                     do_bounce(map->orig_phys[index], addr, sz, dir);
>>> +             }
>>> +             size -= sz;
>>> +             iova += sz;
>>> +     }
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 start = iova & PAGE_MASK;
>>> +     u64 last = start + PAGE_SIZE - 1;
>>> +     struct vhost_iotlb_map *map;
>>> +     struct page *page = NULL;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, start, last);
>>> +     if (!map)
>>> +             goto out;
>>> +
>>> +     page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
>>> +     get_page(page);
>>> +out:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     return page;
>>> +}
>>> +
>>> +static struct page *
>>> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
>>> +{
>>> +     u64 start = iova & PAGE_MASK;
>>> +     struct page *page = alloc_page(GFP_KERNEL);
>>> +     struct vduse_bounce_map *map;
>>> +
>>> +     if (!page)
>>> +             return NULL;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>> +     if (map->bounce_page) {
>>> +             __free_page(page);
>>> +             goto out;
>>> +     }
>>> +     map->bounce_page = page;
>>> +
>>> +     /* paired with vduse_domain_map_page() */
>>> +     smp_mb();
>>
>> So this is suspicious. It's better to explain like, we need make sure A
>> must be done after B.
> OK. I see. It's used to protect this pattern:
>
>     vduse_domain_alloc_bounce_page:          vduse_domain_map_page:
>     write map->bounce_page                           write map->orig_phys
>     mb()                                                            mb()
>     read map->orig_phys                                 read map->bounce_page
>
> Make sure there will always be a path to do bouncing.


Ok.


>
>> And it looks to me the iotlb_lock is sufficnet to do the synchronization
>> here. E.g any reason that you don't take it in
>> vduse_domain_map_bounce_page().
>>
> Yes, we can. But the performance in multi-queue cases will go down if
> we use iotlb_lock on this critical path.
>
>> And what's more, is there anyway to aovid holding the spinlock during
>> bouncing?
>>
> Looks like we can't. In the case that multiple page faults happen on
> the same page, we should make sure the bouncing is done before any
> page fault handler returns.


So it looks to me all those extra complexitiy comes from the fact that 
the bounce_page and orig_phys are set by different places so we need to 
do the bouncing in two places.

I wonder how much we can gain from the "lazy" boucning in page fault. 
The buffer mapped via dma_ops from virtio driver is expected to be 
accessed by the userspace soon.  It looks to me we can do all those 
stuffs during dma_map() then things would be greatly simplified.


>
>>> +
>>> +     vduse_domain_bounce(domain, start, PAGE_SIZE, DMA_TO_DEVICE);
>>> +out:
>>> +     get_page(map->bounce_page);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     return map->bounce_page;
>>> +}
>>> +
>>> +static void
>>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain)
>>> +{
>>> +     struct vduse_bounce_map *map;
>>> +     unsigned long i, pfn, bounce_pfns;
>>> +
>>> +     bounce_pfns = domain->bounce_size >> PAGE_SHIFT;
>>> +
>>> +     for (pfn = 0; pfn < bounce_pfns; pfn++) {
>>> +             map = &domain->bounce_maps[pfn];
>>> +             for (i = 0; i < IOVA_MAPS_PER_PAGE; i++) {
>>> +                     if (WARN_ON(map->orig_phys[i] != INVALID_PHYS_ADDR))
>>> +                             continue;
>>> +             }
>>> +             if (!map->bounce_page)
>>> +                     continue;
>>> +
>>> +             __free_page(map->bounce_page);
>>> +             map->bounce_page = NULL;
>>> +     }
>>> +}
>>> +
>>> +void vduse_domain_reset_bounce_map(struct vduse_iova_domain *domain)
>>> +{
>>> +     if (!domain->bounce_map)
>>> +             return;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     if (!domain->bounce_map)
>>> +             goto unlock;
>>> +
>>> +     vduse_iotlb_del_range(domain, 0, domain->bounce_size - 1);
>>> +     domain->bounce_map = 0;
>>> +     vduse_domain_free_bounce_pages(domain);
>>> +unlock:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +}
>>> +
>>> +static int vduse_domain_init_bounce_map(struct vduse_iova_domain *domain)
>>> +{
>>> +     int ret;
>>> +
>>> +     if (domain->bounce_map)
>>> +             return 0;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     if (domain->bounce_map)
>>> +             goto unlock;
>>> +
>>> +     ret = vduse_iotlb_add_range(domain, 0, domain->bounce_size - 1,
>>> +                                 0, VHOST_MAP_RW, domain->file, 0);
>>> +     if (!ret)
>>> +             domain->bounce_map = 1;
>>> +unlock:
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +     return ret;
>>> +}
>>> +
>>> +static dma_addr_t
>>> +vduse_domain_alloc_iova(struct iova_domain *iovad,
>>> +                     unsigned long size, unsigned long limit)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +     unsigned long iova_pfn;
>>> +
>>> +     if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>>> +             iova_len = roundup_pow_of_two(iova_len);
>>> +     iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
>>> +
>>> +     return iova_pfn << shift;
>>> +}
>>> +
>>> +static void vduse_domain_free_iova(struct iova_domain *iovad,
>>> +                                dma_addr_t iova, size_t size)
>>> +{
>>> +     unsigned long shift = iova_shift(iovad);
>>> +     unsigned long iova_len = iova_align(iovad, size) >> shift;
>>> +
>>> +     free_iova_fast(iovad, iova >> shift, iova_len);
>>> +}
>>> +
>>> +dma_addr_t vduse_domain_map_page(struct vduse_iova_domain *domain,
>>> +                              struct page *page, unsigned long offset,
>>> +                              size_t size, enum dma_data_direction dir,
>>> +                              unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +     unsigned long limit = domain->bounce_size - 1;
>>> +     phys_addr_t pa = page_to_phys(page) + offset;
>>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +
>>> +     if (!iova)
>>> +             return DMA_MAPPING_ERROR;
>>> +
>>> +     if (vduse_domain_init_bounce_map(domain)) {
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +             return DMA_MAPPING_ERROR;
>>> +     }
>>> +
>>> +     vduse_domain_map_bounce_page(domain, (u64)iova, (u64)size, pa);
>>> +
>>> +     /* paired with vduse_domain_alloc_bounce_page() */
>>> +     smp_mb();
>>> +
>>> +     if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, iova, size, DMA_TO_DEVICE);
>>> +
>>> +     return iova;
>>> +}
>>> +
>>> +void vduse_domain_unmap_page(struct vduse_iova_domain *domain,
>>> +                          dma_addr_t dma_addr, size_t size,
>>> +                          enum dma_data_direction dir, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->stream_iovad;
>>> +
>>> +     if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +             vduse_domain_bounce(domain, dma_addr, size, DMA_FROM_DEVICE);
>>> +
>>> +     vduse_domain_unmap_bounce_page(domain, (u64)dma_addr, (u64)size);
>>> +     vduse_domain_free_iova(iovad, dma_addr, size);
>>> +}
>>> +
>>> +void *vduse_domain_alloc_coherent(struct vduse_iova_domain *domain,
>>> +                               size_t size, dma_addr_t *dma_addr,
>>> +                               gfp_t flag, unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     unsigned long limit = domain->iova_limit;
>>> +     dma_addr_t iova = vduse_domain_alloc_iova(iovad, size, limit);
>>> +     void *orig = alloc_pages_exact(size, flag);
>>> +
>>> +     if (!iova || !orig)
>>> +             goto err;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     if (vduse_iotlb_add_range(domain, (u64)iova, (u64)iova + size - 1,
>>> +                               virt_to_phys(orig), VHOST_MAP_RW,
>>> +                               domain->file, (u64)iova)) {
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             goto err;
>>> +     }
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     *dma_addr = iova;
>>> +
>>> +     return orig;
>>> +err:
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     if (orig)
>>> +             free_pages_exact(orig, size);
>>> +     if (iova)
>>> +             vduse_domain_free_iova(iovad, iova, size);
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +void vduse_domain_free_coherent(struct vduse_iova_domain *domain, size_t size,
>>> +                             void *vaddr, dma_addr_t dma_addr,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct iova_domain *iovad = &domain->consistent_iovad;
>>> +     struct vhost_iotlb_map *map;
>>> +     struct vdpa_map_file *map_file;
>>> +     phys_addr_t pa;
>>> +
>>> +     spin_lock(&domain->iotlb_lock);
>>> +     map = vhost_iotlb_itree_first(domain->iotlb, (u64)dma_addr,
>>> +                                   (u64)dma_addr + size - 1);
>>> +     if (WARN_ON(!map)) {
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             return;
>>> +     }
>>> +     map_file = (struct vdpa_map_file *)map->opaque;
>>> +     fput(map_file->file);
>>> +     kfree(map_file);
>>> +     pa = map->addr;
>>> +     vhost_iotlb_map_free(domain->iotlb, map);
>>> +     spin_unlock(&domain->iotlb_lock);
>>> +
>>> +     vduse_domain_free_iova(iovad, dma_addr, size);
>>> +     free_pages_exact(phys_to_virt(pa), size);
>>
>> I wonder whether we should free the coherent page after munmap().
> But we don't know whether this coherent page is still needed by
> userspace. The userspace can call munmap() in any cases.
>
>> Otherwise usersapce can poke kernel pages in this way, e.g the page
>> could be allocated and used by other subsystems?
>>
> Sorry, I didn't get your point here. What's the relationship between
> this problem and munmap()?


Ok, so it should be fine, I miss the code that takes an extra refcnt 
when trying to map coherent page.

Thanks


>
>>> +}
>>> +
>>> +static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
>>> +{
>>> +     struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
>>> +     unsigned long iova = vmf->pgoff << PAGE_SHIFT;
>>> +     struct page *page;
>>> +
>>> +     if (!domain)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     if (iova < domain->bounce_size)
>>> +             page = vduse_domain_alloc_bounce_page(domain, iova);
>>> +     else
>>> +             page = vduse_domain_get_mapping_page(domain, iova);
>>> +
>>> +     if (!page)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     vmf->page = page;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct vm_operations_struct vduse_domain_mmap_ops = {
>>> +     .fault = vduse_domain_mmap_fault,
>>> +};
>>> +
>>> +static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
>>> +     vma->vm_private_data = domain;
>>> +     vma->vm_ops = &vduse_domain_mmap_ops;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_domain_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_iova_domain *domain = file->private_data;
>>> +
>>> +     vduse_domain_reset_bounce_map(domain);
>>> +     put_iova_domain(&domain->stream_iovad);
>>> +     put_iova_domain(&domain->consistent_iovad);
>>> +     vhost_iotlb_free(domain->iotlb);
>>> +     vfree(domain->bounce_maps);
>>> +     kfree(domain);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_domain_fops = {
>>> +     .mmap = vduse_domain_mmap,
>>> +     .release = vduse_domain_release,
>>> +};
>>> +
>>> +void vduse_domain_destroy(struct vduse_iova_domain *domain)
>>> +{
>>> +     fput(domain->file);
>>> +}
>>> +
>>> +struct vduse_iova_domain *
>>> +vduse_domain_create(unsigned long iova_limit, size_t bounce_size)
>>> +{
>>> +     struct vduse_iova_domain *domain;
>>> +     struct file *file;
>>> +     struct vduse_bounce_map *map;
>>> +     unsigned long i, pfn, bounce_pfns;
>>> +
>>> +     bounce_pfns = PAGE_ALIGN(bounce_size) >> PAGE_SHIFT;
>>> +     if (iova_limit <= bounce_size)
>>> +             return NULL;
>>> +
>>> +     domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>>> +     if (!domain)
>>> +             return NULL;
>>> +
>>> +     domain->iotlb = vhost_iotlb_alloc(0, 0);
>>> +     if (!domain->iotlb)
>>> +             goto err_iotlb;
>>> +
>>> +     domain->iova_limit = iova_limit;
>>> +     domain->bounce_size = PAGE_ALIGN(bounce_size);
>>> +     domain->bounce_maps = vzalloc(bounce_pfns *
>>> +                             sizeof(struct vduse_bounce_map));
>>> +     if (!domain->bounce_maps)
>>> +             goto err_map;
>>> +
>>> +     for (pfn = 0; pfn < bounce_pfns; pfn++) {
>>> +             map = &domain->bounce_maps[pfn];
>>> +             for (i = 0; i < IOVA_MAPS_PER_PAGE; i++)
>>> +                     map->orig_phys[i] = INVALID_PHYS_ADDR;
>>> +     }
>>> +     file = anon_inode_getfile("[vduse-domain]", &vduse_domain_fops,
>>> +                             domain, O_RDWR);
>>> +     if (IS_ERR(file))
>>> +             goto err_file;
>>> +
>>> +     domain->file = file;
>>> +     spin_lock_init(&domain->iotlb_lock);
>>> +     init_iova_domain(&domain->stream_iovad,
>>> +                     IOVA_ALLOC_SIZE, IOVA_START_PFN);
>>> +     init_iova_domain(&domain->consistent_iovad,
>>> +                     PAGE_SIZE, bounce_pfns);
>>
>> Any reason for treating coherent and stream DMA differently (the
>> different granule)?
>>
> To save space for small I/Os (less than PAGE_SIZE). We can have one
> bounce page for multiple small I/Os.
>
>>> +
>>> +     return domain;
>>> +err_file:
>>> +     vfree(domain->bounce_maps);
>>> +err_map:
>>> +     vhost_iotlb_free(domain->iotlb);
>>> +err_iotlb:
>>> +     kfree(domain);
>>> +     return NULL;
>>> +}
>>> +
>>> +int vduse_domain_init(void)
>>> +{
>>> +     return iova_cache_get();
>>> +}
>>> +
>>> +void vduse_domain_exit(void)
>>> +{
>>> +     iova_cache_put();
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
>>> new file mode 100644
>>> index 000000000000..faeeedfaa786
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
>>> @@ -0,0 +1,75 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_IOVA_DOMAIN_H
>>> +#define _VDUSE_IOVA_DOMAIN_H
>>> +
>>> +#include <linux/iova.h>
>>> +#include <linux/dma-mapping.h>
>>> +#include <linux/vhost_iotlb.h>
>>> +
>>> +#define IOVA_START_PFN 1
>>> +
>>> +#define IOVA_ALLOC_ORDER 12
>>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>>> +
>>> +#define IOVA_MAPS_PER_PAGE (1 << (PAGE_SHIFT - IOVA_ALLOC_ORDER))
>>> +
>>> +#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
>>> +
>>> +struct vduse_bounce_map {
>>> +     struct page *bounce_page;
>>> +     u64 orig_phys[IOVA_MAPS_PER_PAGE];
>>
>> Sorry if I had asked this before. But I'm not sure it's worth to have
>> this extra complexitiy. If I read the code correctly, the
>> IOVA_MAPS_PER_PAGE is 1 for the archs that have 4K page. Have you tested
>> the code on the archs that have more than 4K page?
>>
> No, I haven't test it. Now I think it's OK to remove this optimization
> in this patchset.
>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 09/11] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]     ` <CACycT3u5hEO8=JVkbHmKXYMManiZ1VedyS6O+7M8Rzbk1ftC7A@mail.gmail.com>
@ 2021-03-25  6:30       ` Jason Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2021-03-25  6:30 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, netdev,
	Randy Dunlap, Matthew Wilcox, virtualization, Christoph Hellwig,
	Bob Liu, bcrl, viro, Stefan Hajnoczi, linux-fsdevel,
	Dan Carpenter, Mika Penttilä


在 2021/3/24 下午4:55, Yongji Xie 写道:
> On Wed, Mar 24, 2021 at 12:43 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/3/15 下午1:37, Xie Yongji 写道:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of message
>>> mechnism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, userspace can use mmap() to access vDPA device's
>>> iova regions obtained through VDUSE_IOTLB_GET_ENTRY ioctl. Besides,
>>> userspace can use ioctl() to inject interrupt and use the eventfd
>>> mechanism to receive virtqueue kicks.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1281 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  153 +++
>>>    6 files changed, 1451 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index a245809c99d0..77a1da522c21 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -25,6 +25,16 @@ config VDPA_SIM_NET
>>>        help
>>>          vDPA networking device simulator which loops TX traffic back to RX.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index 67fe7f3d6943..f02ebed33f19 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,6 +1,7 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..07d0ae92d470
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1281 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/miscdevice.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_VERSION  "1.0"
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     bool ready;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +     struct work_struct inject;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device dev;
>>> +     struct cdev cdev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     spinlock_t msg_lock;
>>> +     atomic64_t msg_unique;
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct list_head list;
>>> +     bool connected;
>>> +     int minor;
>>> +     u16 vq_size_max;
>>> +     u16 vq_num;
>>> +     u32 vq_align;
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +static unsigned long max_bounce_size = (64 * 1024 * 1024);
>>> +module_param(max_bounce_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_bounce_size, "Maximum bounce buffer size. (default: 64M)");
>>> +
>>> +static unsigned long max_iova_size = (128 * 1024 * 1024);
>>> +module_param(max_iova_size, ulong, 0444);
>>> +MODULE_PARM_DESC(max_iova_size, "Maximum iova space size (default: 128M)");
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static LIST_HEAD(vduse_devs);
>>> +static DEFINE_IDA(vduse_ida);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +static struct workqueue_struct *vduse_irq_wq;
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t request_id)
>>> +{
>>> +     struct vduse_dev_msg *tmp, *msg = NULL;
>>> +
>>> +     list_for_each_entry(tmp, head, list) {
>>> +             if (tmp->req.request_id == request_id) {
>>> +                     msg = tmp;
>>> +                     list_del(&tmp->list);
>>> +                     break;
>>> +             }
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_sync(struct vduse_dev *dev,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     spin_lock(&dev->msg_lock);
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     wait_event_interruptible(msg->waitq, msg->completed);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (!msg->completed)
>>> +             list_del(&msg->list);
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return (msg->resp.result == VDUSE_REQUEST_OK) ? 0 : -1;
>>> +}
>>> +
>>> +static u64 vduse_dev_get_features(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_FEATURES;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>
>> Let's introduce a helper for the atomic64_fetch_inc() here.
>>
> Fine.
>
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.f.features;
>>> +}
>>> +
>>> +static int vduse_dev_set_features(struct vduse_dev *dev, u64 features)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_FEATURES;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.f.features = features;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static u8 vduse_dev_get_status(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_STATUS;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? 0 : msg.resp.s.status;
>>> +}
>>> +
>>> +static void vduse_dev_set_status(struct vduse_dev *dev, u8 status)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_STATUS;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.s.status = status;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
>>> +                              void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_GET_CONFIG;
>>> +             msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             memcpy(buf, msg.resp.config.data, sz);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_config(struct vduse_dev *dev, unsigned int offset,
>>> +                              const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     unsigned int sz;
>>> +
>>> +     while (len) {
>>> +             sz = min_t(unsigned int, len, sizeof(msg.req.config.data));
>>> +             msg.req.type = VDUSE_SET_CONFIG;
>>> +             msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +             msg.req.config.offset = offset;
>>> +             msg.req.config.len = sz;
>>> +             memcpy(msg.req.config.data, buf, sz);
>>> +             vduse_dev_msg_sync(dev, &msg);
>>> +             buf += sz;
>>> +             offset += sz;
>>> +             len -= sz;
>>> +     }
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_num(struct vduse_dev *dev,
>>> +                              struct vduse_virtqueue *vq, u32 num)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_NUM;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_num.index = vq->index;
>>> +     msg.req.vq_num.num = num;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_addr(struct vduse_dev *dev,
>>> +                              struct vduse_virtqueue *vq, u64 desc_addr,
>>> +                              u64 driver_addr, u64 device_addr)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_ADDR;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_addr.index = vq->index;
>>> +     msg.req.vq_addr.desc_addr = desc_addr;
>>> +     msg.req.vq_addr.driver_addr = driver_addr;
>>> +     msg.req.vq_addr.device_addr = device_addr;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static void vduse_dev_set_vq_ready(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq, bool ready)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_READY;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +     msg.req.vq_ready.ready = ready;
>>> +
>>> +     vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static bool vduse_dev_get_vq_ready(struct vduse_dev *dev,
>>> +                                struct vduse_virtqueue *vq)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_READY;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_ready.index = vq->index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg) ? false : msg.resp.vq_ready.ready;
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_sync(dev, &msg);
>>> +     if (!ret)
>>> +             state->avail_index = msg.resp.vq_state.avail_idx;
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_set_vq_state(struct vduse_dev *dev,
>>> +                             struct vduse_virtqueue *vq,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     msg.req.type = VDUSE_SET_VQ_STATE;
>>> +     msg.req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg.req.vq_state.index = vq->index;
>>> +     msg.req.vq_state.avail_idx = state->avail_index;
>>> +
>>> +     return vduse_dev_msg_sync(dev, &msg);
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg = kzalloc(sizeof(*msg), GFP_ATOMIC);
>>
>> The return value is not checked.
>>
> Will fix it.
>
>>> +     msg->req.type = VDUSE_UPDATE_IOTLB;
>>
>> What would usespace do after receiving VDUSE_UPDATE_IOTLB? If it still
>> needs to issue VDUSE_GET_ENTRY with probably -EINVAL, it's kind of
>> overkill. So it looks to me that the VDUSE_UPDATE_IOTLB is acutally kind
>> of flush or unmap here. If this is true, should we introduce a new type
>> or just rename it as VDUSE_IOTLB_UNMAP?
>>
> VDUSE_UPDATE_IOTLB is used to notify userspace of refreshing (include
> mapping and unmapping) the iotlb mapping. The reason why we can't use
> flush/unmap is explained below.
>
>>> +     msg->req.request_id = atomic64_fetch_inc(&dev->msg_unique);
>>> +     msg->req.iova.start = start;
>>> +     msg->req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_sync(dev, msg);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret = 0;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return 0;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             spin_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             spin_lock(&dev->msg_lock);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -EINVAL;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>
>> EPOLLOUT is missed here?
>>
> Why do we need EPOLLOUT here?


It means the fd is ready to be wrote?


>
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +
>>> +     vduse_domain_reset_bounce_map(dev->domain);
>>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>
>> Simialrly, IOTLB update should be done before the resetting?
>>
> The problem is userspace can still get valid bounce mapping through
> VDUSE_IOTLB_GET_ENTRY between receiving IOTLB_UNMAP and bounce mapping
> reset. Then userspace has no way to know when to invalidate these
> mappings.


Right, I think it might be helpful to add a comment here to explain the 
order.


>
>> And it would be helpful to add comment to explain how coherent mappings
>> is handled.
>>
> OK. It would be handled in vduse_dev_free_coherent().
>
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->ready = false;
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_addr(dev, vq, desc_area,
>>> +                                     driver_area, device_area);
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->ready && vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_num(dev, vq, num);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vduse_dev_set_vq_ready(dev, vq, ready);
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = vduse_dev_get_vq_ready(dev, vq);
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_set_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_get_features(dev);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
>>> +             return -EINVAL;
>>> +
>>> +     return vduse_dev_set_features(dev, features);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     /* We don't support config interrupt */
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return vduse_dev_get_status(dev);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     if (status == 0)
>>> +             vduse_dev_reset(dev);
>>> +
>>> +     vduse_dev_set_status(dev, status);
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                          void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_get_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     vduse_dev_set_config(dev, offset, buf, len);
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     int ret;
>>> +
>>
>> So I wonder we need to do the vhost_dev_update_iotlb() before
>> vduse_domain_set_map().
>>
>> That is, we need to make sure the userspace's IOTLB is cleared after
>> setting up the new map?
>>
> The same problem I described above. So we use UPDATE_IOTLB messages to
> notify userspace of refreshing the IOTLB after we change the iotlb
> itree.


Yes.


>
>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>> +     vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     WARN_ON(!list_empty(&dev->send_list));
>>> +     WARN_ON(!list_empty(&dev->recv_list));
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                  unsigned long offset, size_t size,
>>> +                                  enum dma_data_direction dir,
>>> +                                  unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +     vduse_dev_update_iotlb(vdev, iova, iova + size - 1);
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long start = (unsigned long)dma_addr;
>>> +     unsigned long last = start + size - 1;
>>> +
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +     vduse_dev_update_iotlb(vdev, start, last);
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     vq = &dev->vqs[eventfd->index];
>>> +     if (eventfd->fd > 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>> +             return 0;
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virtqueue *vq = container_of(work,
>>> +                                     struct vduse_virtqueue, inject);
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb.callback)
>>> +             vq->cb.callback(vq->cb.private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                         unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_ENTRY: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct vduse_iova_domain *domain = dev->domain;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             spin_lock(&domain->iotlb_lock);
>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>> +                                           entry.start, entry.start + 1);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             ret = -EINVAL;
>>
>> So we need document this in the uAPI doc. I think when userspace see
>> -EINVAL it means the map doesn't exist.
>>
> Fine with me.
>
>> Or should we make it more explicitly by e.g introduing new flags.
>>
>>
>>> +             if (!f)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             ret = receive_fd_user(f, argp, perm_to_file_flags(entry.perm));
>>> +             fput(f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_INJECT_VQ_IRQ:
>>> +             ret = -EINVAL;
>>> +             if (arg >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             queue_work(vduse_irq_wq, &dev->vqs[arg].inject);
>>> +             break;
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int i;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +     }
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list)))
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>
>> What's the goal of this?
>>
> Support reconnecting. Make sure userspace daemon can get the inflight
> messages after reboot.


I see, plase add a comment for this.


>
>> In addition to free the messages, we need wake up the processes that is
>> in the waitq in this case.
>>
>>
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     dev->connected = false;
>>
>> Do we need to hold vduse mutex here?
>>
> Looks like I didn't find any situation that requires the mutex.


Ok, I guess the reason is because there will be no external reference 
for the device?


>
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = container_of(inode->i_cdev,
>>> +                                     struct vduse_dev, cdev);
>>> +     int ret = -EBUSY;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     spin_lock_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     atomic64_set(&dev->msg_unique, 0);
>>> +
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *tmp, *dev = NULL;
>>> +
>>> +     list_for_each_entry(tmp, &vduse_devs, list) {
>>> +             if (!strcmp(dev_name(&tmp->dev), name)) {
>>> +                     dev = tmp;
>>> +                     break;
>>> +             }
>>> +     }
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     if (dev->vdev || dev->connected)
>>> +             return -EBUSY;
>>> +
>>> +     dev->connected = true;
>>
>> Need mutex here?
>>
> vduse_destroy_dev() is protected by the vduse_mutex.


I see.


>
>>> +     list_del(&dev->list);
>>> +     cdev_device_del(&dev->cdev, &dev->dev);
>>> +     put_device(&dev->dev);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_release_dev(struct device *device)
>>> +{
>>> +     struct vduse_dev *dev =
>>> +             container_of(device, struct vduse_dev, dev);
>>> +
>>> +     ida_simple_remove(&vduse_ida, dev->minor);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     vduse_dev_destroy(dev);
>>> +     module_put(THIS_MODULE);
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config)
>>> +{
>>> +     int i, ret = -ENOMEM;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     if (config->bounce_size > max_bounce_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (config->bounce_size > max_iova_size)
>>> +             return -EINVAL;
>>> +
>>> +     if (vduse_find_dev(config->name))
>>> +             return -EEXIST;
>>> +
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->domain = vduse_domain_create(max_iova_size - 1,
>>> +                                     config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = ida_simple_get(&vduse_ida, 0, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_ida;
>>> +
>>> +     dev->minor = ret;
>>> +     device_initialize(&dev->dev);
>>> +     dev->dev.release = vduse_release_dev;
>>> +     dev->dev.class = vduse_class;
>>> +     dev->dev.devt = MKDEV(MAJOR(vduse_major), dev->minor);
>>> +     ret = dev_set_name(&dev->dev, "%s", config->name);
>>> +     if (ret)
>>> +             goto err_name;
>>> +
>>> +     cdev_init(&dev->cdev, &vduse_dev_fops);
>>> +     dev->cdev.owner = THIS_MODULE;
>>> +
>>> +     ret = cdev_device_add(&dev->cdev, &dev->dev);
>>> +     if (ret) {
>>> +             put_device(&dev->dev);
>>> +             return ret;
>>> +     }
>>> +     list_add(&dev->list, &vduse_devs);
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_name:
>>> +     ida_simple_remove(&vduse_ida, dev->minor);
>>> +err_ida:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>
>> So the rewind after device_initialize() looks wrong, we should use
>> put_device() which will use dev.relase().
>>
> Oh, yes. We should also call put_device() in err_name case.
>
>> See the comment of device_initialize():
>>
>>    * NOTE: Use put_device() to give up your reference instead of freeing
>>    * @dev directly once you have called this function.
>>    */
>>
>>> +     vduse_dev_destroy(dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_GET_API_VERSION:
>>> +             ret = VDUSE_API_VERSION;
>>
>> To preseve the uAPI compatibility, besides GET_API_VERSION, we need
>> SET_API_VERSION to support older userspace.
>>
> Shouldn't the userspace keep compatibility to support older kernel? If
> so, we only need GET_API_VERSION here.


Actually the reverse. The new kernel need to make sure the old userspace 
can work. That is to say the kenrel should support version 0 forever 
even if it supports e.g version 1.

In this case, we should let userspace to choose which version it wants 
to use.

Thanks


>
>> And we need probably all the ioctls when API version is not set from
>> userspace.
>>
>>
>>> +             break;
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, sizeof(config)))
>>> +                     break;
>>> +
>>> +             ret = vduse_create_dev(&config);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static struct miscdevice vduse_misc = {
>>> +     .fops = &vduse_fops,
>>> +     .minor = MISC_DYNAMIC_MINOR,
>>> +     .name = "vduse",
>>> +     .nodename = "vduse/control",
>>> +};
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_add_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev = dev->vdev;
>>> +     int ret;
>>> +
>>> +     if (vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, &dev->dev,
>>> +                              &vduse_vdpa_config_ops, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     ret = _vdpa_register_device(&vdev->vdpa, dev->vq_num);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     dev->vdev = vdev;
>>> +
>>> +     return 0;
>>> +err:
>>> +     put_device(&vdev->vdpa.dev);
>>> +     return ret;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret = -EINVAL;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev)
>>> +             goto unlock;
>>> +
>>> +     ret = vduse_dev_add_vdpa(dev, name);
>>> +unlock:
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_DEV_ANY_ID, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     if (max_bounce_size >= max_iova_size)
>>> +             return -EINVAL;
>>> +
>>> +     ret = misc_register(&vduse_misc);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class)) {
>>> +             ret = PTR_ERR(vduse_class);
>>> +             goto err_class;
>>> +     }
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev;
>>> +
>>> +     vduse_irq_wq = alloc_workqueue("vduse-irq",
>>> +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
>>> +     if (!vduse_irq_wq)
>>> +             goto err_wq;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +err_wq:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev:
>>> +     class_destroy(vduse_class);
>>> +err_class:
>>> +     misc_deregister(&vduse_misc);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     misc_deregister(&vduse_misc);
>>> +     class_destroy(vduse_class);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +     vduse_domain_exit();
>>> +     vduse_mgmtdev_exit();
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_VERSION(DRV_VERSION);
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..37f7d7059aa8
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,153 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_API_VERSION    0
>>> +
>>> +#define VDUSE_CONFIG_DATA_LEN        256
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     VDUSE_SET_VQ_NUM,
>>> +     VDUSE_SET_VQ_ADDR,
>>> +     VDUSE_SET_VQ_READY,
>>> +     VDUSE_GET_VQ_READY,
>>> +     VDUSE_SET_VQ_STATE,
>>> +     VDUSE_GET_VQ_STATE,
>>> +     VDUSE_SET_FEATURES,
>>> +     VDUSE_GET_FEATURES,
>>> +     VDUSE_SET_STATUS,
>>> +     VDUSE_GET_STATUS,
>>> +     VDUSE_SET_CONFIG,
>>> +     VDUSE_GET_CONFIG,
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>
>> Need comment to explain each type.
>>
> Fine.
>
>>> +
>>> +struct vduse_vq_num {
>>> +     __u32 index;
>>> +     __u32 num;
>>> +};
>>> +
>>> +struct vduse_vq_addr {
>>> +     __u32 index;
>>> +     __u64 desc_addr;
>>> +     __u64 driver_addr;
>>> +     __u64 device_addr;
>>> +};
>>> +
>>> +struct vduse_vq_ready {
>>> +     __u32 index;
>>> +     __u8 ready;
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index;
>>> +     __u16 avail_idx;
>>> +};
>>> +
>>> +struct vduse_dev_config_data {
>>> +     __u32 offset;
>>> +     __u32 len;
>>> +     __u8 data[VDUSE_CONFIG_DATA_LEN];
>>> +};
>>> +
>>> +struct vduse_iova_range {
>>> +     __u64 start;
>>> +     __u64 last;
>>> +};
>>> +
>>> +struct vduse_features {
>>> +     __u64 features;
>>> +};
>>> +
>>> +struct vduse_status {
>>> +     __u8 status;
>>> +};
>>
>> Need comment for all the above uapi.
>>
> Fine.
>
>>> +
>>> +struct vduse_dev_request {
>>> +     __u32 type; /* request type */
>>> +     __u32 request_id; /* request id */
>>> +     __u32 reserved[2]; /* for feature use */
>>> +     union {
>>> +             struct vduse_vq_num vq_num; /* virtqueue num */
>>> +             struct vduse_vq_addr vq_addr; /* virtqueue address */
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             struct vduse_iova_range iova; /* iova range for updating */
>>> +             struct vduse_features f; /* virtio features */
>>> +             struct vduse_status s; /* device status */
>>> +             __u32 padding[16]; /* padding */
>>> +     };
>>> +};
>>> +
>>> +struct vduse_dev_response {
>>> +     __u32 request_id; /* corresponding request id */
>>> +#define VDUSE_REQUEST_OK     0x00
>>> +#define VDUSE_REQUEST_FAILED 0x01
>>> +     __u32 result; /* the result of request */
>>> +     __u32 reserved[2]; /* for feature use */
>>> +     union {
>>> +             struct vduse_vq_ready vq_ready; /* virtqueue ready status */
>>> +             struct vduse_vq_state vq_state; /* virtqueue state */
>>> +             struct vduse_dev_config_data config; /* virtio device config space */
>>> +             struct vduse_features f; /* virtio features */
>>> +             struct vduse_status s; /* device status */
>>> +             __u32 padding[16]; /* padding */
>>> +     };
>>> +};
>>> +
>>> +/* ioctls */
>>> +
>>> +struct vduse_dev_config {
>>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
>>> +     __u32 vendor_id; /* virtio vendor id */
>>> +     __u32 device_id; /* virtio device id */
>>> +     __u64 bounce_size; /* bounce buffer size for iommu */
>>> +     __u16 vq_num; /* the number of virtqueues */
>>> +     __u16 vq_size_max; /* the max size of virtqueue */
>>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
>>> +};
>>> +
>>> +struct vduse_iotlb_entry {
>>> +     int fd;
>>> +#define VDUSE_ACCESS_RO 0x1
>>> +#define VDUSE_ACCESS_WO 0x2
>>> +#define VDUSE_ACCESS_RW 0x3
>>> +     __u8 perm; /* access permission of this range */
>>
>> Let's re-order the perm or add explict padding here to avoid hole.
>>
> OK.
>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 01/11] file: Export __receive_fd() to modules
       [not found]     ` <CACycT3vrHOExXj6v8ULvUzdLcRkdzS5=TNK6=g4+RWEdN-nOJw@mail.gmail.com>
@ 2021-03-25  8:23       ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2021-03-25  8:23 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, netdev,
	Randy Dunlap, Matthew Wilcox, virtualization, Christoph Hellwig,
	Bob Liu, bcrl, viro, Stefan Hajnoczi, linux-fsdevel,
	Dan Carpenter, Mika Penttil??

On Mon, Mar 15, 2021 at 05:46:43PM +0800, Yongji Xie wrote:
> On Mon, Mar 15, 2021 at 5:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Mon, Mar 15, 2021 at 01:37:11PM +0800, Xie Yongji wrote:
> > > Export __receive_fd() so that some modules can use
> > > it to pass file descriptor between processes.
> >
> > I really don't think any non-core code should do that, especilly not
> > modular mere driver code.
> 
> Do you see any issue? Now I think we're able to do that with the help
> of get_unused_fd_flags() and fd_install() in modules. But we may miss
> some security stuff in this way. So I try to export __receive_fd() and
> use it instead.

The real problem is now what helper to use, but rather that random
drivers should not just mess with the FD table like that.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 03/11] vhost-vdpa: protect concurrent access to vhost device iotlb
       [not found] ` <20210315053721.189-4-xieyongji@bytedance.com>
  2021-03-23  3:02   ` [PATCH v5 03/11] vhost-vdpa: protect concurrent access to vhost device iotlb Jason Wang
@ 2021-03-25 11:08   ` Stefano Garzarella
  1 sibling, 0 replies; 14+ messages in thread
From: Stefano Garzarella @ 2021-03-25 11:08 UTC (permalink / raw)
  To: Xie Yongji
  Cc: axboe, corbet, kvm, mst, netdev, rdunlap, willy, virtualization,
	hch, bob.liu, bcrl, viro, stefanha, linux-fsdevel, dan.carpenter,
	mika.penttila

On Mon, Mar 15, 2021 at 01:37:13PM +0800, Xie Yongji wrote:
>Use vhost_dev->mutex to protect vhost device iotlb from
>concurrent access.
>
>Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
>Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>---
> drivers/vhost/vdpa.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)


Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver
       [not found]         ` <CACycT3uS870yy04rw7KBk==sioi+VNunxVz6BQH-Lmxk6m-VSg@mail.gmail.com>
@ 2021-03-26  4:26           ` Jason Wang
       [not found]             ` <CACycT3v6Lj61fafztOuzBNFLs2TbKeqrNLXkzv5RK6-h-iTnvA@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Wang @ 2021-03-26  4:26 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, netdev,
	Randy Dunlap, Matthew Wilcox, virtualization, Christoph Hellwig,
	Bob Liu, bcrl, viro, Stefan Hajnoczi, linux-fsdevel,
	Dan Carpenter, Mika Penttilä


在 2021/3/25 下午3:38, Yongji Xie 写道:
> On Thu, Mar 25, 2021 at 12:53 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/3/24 下午3:39, Yongji Xie 写道:
>>> On Wed, Mar 24, 2021 at 11:54 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> 在 2021/3/15 下午1:37, Xie Yongji 写道:
>>>>> This implements an MMU-based IOMMU driver to support mapping
>>>>> kernel dma buffer into userspace. The basic idea behind it is
>>>>> treating MMU (VA->PA) as IOMMU (IOVA->PA). The driver will set
>>>>> up MMU mapping instead of IOMMU mapping for the DMA transfer so
>>>>> that the userspace process is able to use its virtual address to
>>>>> access the dma buffer in kernel.
>>>>>
>>>>> And to avoid security issue, a bounce-buffering mechanism is
>>>>> introduced to prevent userspace accessing the original buffer
>>>>> directly.
>>>>>
>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>> ---
>>>>>     drivers/vdpa/vdpa_user/iova_domain.c | 535 +++++++++++++++++++++++++++++++++++
>>>>>     drivers/vdpa/vdpa_user/iova_domain.h |  75 +++++
>>>>>     2 files changed, 610 insertions(+)
>>>>>     create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>>>     create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>>>
>>>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>>>> new file mode 100644
>>>>> index 000000000000..83de216b0e51
>>>>> --- /dev/null
>>>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>>>> @@ -0,0 +1,535 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>>> +/*
>>>>> + * MMU-based IOMMU implementation
>>>>> + *
>>>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>>> 2021 as well.
>>>>
>>> Sure.
>>>
>>>>> + *
>>>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>>>> + *
>>>>> + */
>>>>> +
>>>>> +#include <linux/slab.h>
>>>>> +#include <linux/file.h>
>>>>> +#include <linux/anon_inodes.h>
>>>>> +#include <linux/highmem.h>
>>>>> +#include <linux/vmalloc.h>
>>>>> +#include <linux/vdpa.h>
>>>>> +
>>>>> +#include "iova_domain.h"
>>>>> +
>>>>> +static int vduse_iotlb_add_range(struct vduse_iova_domain *domain,
>>>>> +                              u64 start, u64 last,
>>>>> +                              u64 addr, unsigned int perm,
>>>>> +                              struct file *file, u64 offset)
>>>>> +{
>>>>> +     struct vdpa_map_file *map_file;
>>>>> +     int ret;
>>>>> +
>>>>> +     map_file = kmalloc(sizeof(*map_file), GFP_ATOMIC);
>>>>> +     if (!map_file)
>>>>> +             return -ENOMEM;
>>>>> +
>>>>> +     map_file->file = get_file(file);
>>>>> +     map_file->offset = offset;
>>>>> +
>>>>> +     ret = vhost_iotlb_add_range_ctx(domain->iotlb, start, last,
>>>>> +                                     addr, perm, map_file);
>>>>> +     if (ret) {
>>>>> +             fput(map_file->file);
>>>>> +             kfree(map_file);
>>>>> +             return ret;
>>>>> +     }
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_iotlb_del_range(struct vduse_iova_domain *domain,
>>>>> +                               u64 start, u64 last)
>>>>> +{
>>>>> +     struct vdpa_map_file *map_file;
>>>>> +     struct vhost_iotlb_map *map;
>>>>> +
>>>>> +     while ((map = vhost_iotlb_itree_first(domain->iotlb, start, last))) {
>>>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>>>> +             fput(map_file->file);
>>>>> +             kfree(map_file);
>>>>> +             vhost_iotlb_map_free(domain->iotlb, map);
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +int vduse_domain_set_map(struct vduse_iova_domain *domain,
>>>>> +                      struct vhost_iotlb *iotlb)
>>>>> +{
>>>>> +     struct vdpa_map_file *map_file;
>>>>> +     struct vhost_iotlb_map *map;
>>>>> +     u64 start = 0ULL, last = ULLONG_MAX;
>>>>> +     int ret;
>>>>> +
>>>>> +     spin_lock(&domain->iotlb_lock);
>>>>> +     vduse_iotlb_del_range(domain, start, last);
>>>>> +
>>>>> +     for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
>>>>> +          map = vhost_iotlb_itree_next(map, start, last)) {
>>>>> +             map_file = (struct vdpa_map_file *)map->opaque;
>>>>> +             ret = vduse_iotlb_add_range(domain, map->start, map->last,
>>>>> +                                         map->addr, map->perm,
>>>>> +                                         map_file->file,
>>>>> +                                         map_file->offset);
>>>>> +             if (ret)
>>>>> +                     goto err;
>>>>> +     }
>>>>> +     spin_unlock(&domain->iotlb_lock);
>>>>> +
>>>>> +     return 0;
>>>>> +err:
>>>>> +     vduse_iotlb_del_range(domain, start, last);
>>>>> +     spin_unlock(&domain->iotlb_lock);
>>>>> +     return ret;
>>>>> +}
>>>>> +
>>>>> +static void vduse_domain_map_bounce_page(struct vduse_iova_domain *domain,
>>>>> +                                      u64 iova, u64 size, u64 paddr)
>>>>> +{
>>>>> +     struct vduse_bounce_map *map;
>>>>> +     unsigned int index;
>>>>> +     u64 last = iova + size - 1;
>>>>> +
>>>>> +     while (iova < last) {
>>>>> +             map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>>>> +             index = offset_in_page(iova) >> IOVA_ALLOC_ORDER;
>>>>> +             map->orig_phys[index] = paddr;
>>>>> +             paddr += IOVA_ALLOC_SIZE;
>>>>> +             iova += IOVA_ALLOC_SIZE;
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static void vduse_domain_unmap_bounce_page(struct vduse_iova_domain *domain,
>>>>> +                                        u64 iova, u64 size)
>>>>> +{
>>>>> +     struct vduse_bounce_map *map;
>>>>> +     unsigned int index;
>>>>> +     u64 last = iova + size - 1;
>>>>> +
>>>>> +     while (iova < last) {
>>>>> +             map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>>>> +             index = offset_in_page(iova) >> IOVA_ALLOC_ORDER;
>>>>> +             map->orig_phys[index] = INVALID_PHYS_ADDR;
>>>>> +             iova += IOVA_ALLOC_SIZE;
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static void do_bounce(phys_addr_t orig, void *addr, size_t size,
>>>>> +                   enum dma_data_direction dir)
>>>>> +{
>>>>> +     unsigned long pfn = PFN_DOWN(orig);
>>>>> +
>>>>> +     if (PageHighMem(pfn_to_page(pfn))) {
>>>>> +             unsigned int offset = offset_in_page(orig);
>>>>> +             char *buffer;
>>>>> +             unsigned int sz = 0;
>>>>> +
>>>>> +             while (size) {
>>>>> +                     sz = min_t(size_t, PAGE_SIZE - offset, size);
>>>>> +
>>>>> +                     buffer = kmap_atomic(pfn_to_page(pfn));
>>>> So kmap_atomic() can autoamtically go with fast path if the page does
>>>> not belong to highmem.
>>>>
>>>> I think we can removce the condition and just use kmap_atomic() for all
>>>> the cases here.
>>>>
>>> Looks good to me.
>>>
>>>>> +                     if (dir == DMA_TO_DEVICE)
>>>>> +                             memcpy(addr, buffer + offset, sz);
>>>>> +                     else
>>>>> +                             memcpy(buffer + offset, addr, sz);
>>>>> +                     kunmap_atomic(buffer);
>>>>> +
>>>>> +                     size -= sz;
>>>>> +                     pfn++;
>>>>> +                     addr += sz;
>>>>> +                     offset = 0;
>>>>> +             }
>>>>> +     } else if (dir == DMA_TO_DEVICE) {
>>>>> +             memcpy(addr, phys_to_virt(orig), size);
>>>>> +     } else {
>>>>> +             memcpy(phys_to_virt(orig), addr, size);
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>>>> +                             dma_addr_t iova, size_t size,
>>>>> +                             enum dma_data_direction dir)
>>>>> +{
>>>>> +     struct vduse_bounce_map *map;
>>>>> +     unsigned int index, offset;
>>>>> +     void *addr;
>>>>> +     size_t sz;
>>>>> +
>>>>> +     while (size) {
>>>>> +             map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>>>> +             offset = offset_in_page(iova);
>>>>> +             sz = min_t(size_t, IOVA_ALLOC_SIZE, size);
>>>>> +
>>>>> +             if (map->bounce_page &&
>>>>> +                 map->orig_phys[index] != INVALID_PHYS_ADDR) {
>>>>> +                     addr = page_address(map->bounce_page) + offset;
>>>>> +                     index = offset >> IOVA_ALLOC_ORDER;
>>>>> +                     do_bounce(map->orig_phys[index], addr, sz, dir);
>>>>> +             }
>>>>> +             size -= sz;
>>>>> +             iova += sz;
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static struct page *
>>>>> +vduse_domain_get_mapping_page(struct vduse_iova_domain *domain, u64 iova)
>>>>> +{
>>>>> +     u64 start = iova & PAGE_MASK;
>>>>> +     u64 last = start + PAGE_SIZE - 1;
>>>>> +     struct vhost_iotlb_map *map;
>>>>> +     struct page *page = NULL;
>>>>> +
>>>>> +     spin_lock(&domain->iotlb_lock);
>>>>> +     map = vhost_iotlb_itree_first(domain->iotlb, start, last);
>>>>> +     if (!map)
>>>>> +             goto out;
>>>>> +
>>>>> +     page = pfn_to_page((map->addr + iova - map->start) >> PAGE_SHIFT);
>>>>> +     get_page(page);
>>>>> +out:
>>>>> +     spin_unlock(&domain->iotlb_lock);
>>>>> +
>>>>> +     return page;
>>>>> +}
>>>>> +
>>>>> +static struct page *
>>>>> +vduse_domain_alloc_bounce_page(struct vduse_iova_domain *domain, u64 iova)
>>>>> +{
>>>>> +     u64 start = iova & PAGE_MASK;
>>>>> +     struct page *page = alloc_page(GFP_KERNEL);
>>>>> +     struct vduse_bounce_map *map;
>>>>> +
>>>>> +     if (!page)
>>>>> +             return NULL;
>>>>> +
>>>>> +     spin_lock(&domain->iotlb_lock);
>>>>> +     map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>>>> +     if (map->bounce_page) {
>>>>> +             __free_page(page);
>>>>> +             goto out;
>>>>> +     }
>>>>> +     map->bounce_page = page;
>>>>> +
>>>>> +     /* paired with vduse_domain_map_page() */
>>>>> +     smp_mb();
>>>> So this is suspicious. It's better to explain like, we need make sure A
>>>> must be done after B.
>>> OK. I see. It's used to protect this pattern:
>>>
>>>      vduse_domain_alloc_bounce_page:          vduse_domain_map_page:
>>>      write map->bounce_page                           write map->orig_phys
>>>      mb()                                                            mb()
>>>      read map->orig_phys                                 read map->bounce_page
>>>
>>> Make sure there will always be a path to do bouncing.
>>
>> Ok.
>>
>>
>>>> And it looks to me the iotlb_lock is sufficnet to do the synchronization
>>>> here. E.g any reason that you don't take it in
>>>> vduse_domain_map_bounce_page().
>>>>
>>> Yes, we can. But the performance in multi-queue cases will go down if
>>> we use iotlb_lock on this critical path.
>>>
>>>> And what's more, is there anyway to aovid holding the spinlock during
>>>> bouncing?
>>>>
>>> Looks like we can't. In the case that multiple page faults happen on
>>> the same page, we should make sure the bouncing is done before any
>>> page fault handler returns.
>>
>> So it looks to me all those extra complexitiy comes from the fact that
>> the bounce_page and orig_phys are set by different places so we need to
>> do the bouncing in two places.
>>
>> I wonder how much we can gain from the "lazy" boucning in page fault.
>> The buffer mapped via dma_ops from virtio driver is expected to be
>> accessed by the userspace soon.  It looks to me we can do all those
>> stuffs during dma_map() then things would be greatly simplified.
>>
> If so, we need to allocate lots of pages from the pool reserved for
> atomic memory allocation requests.


This should be fine, a lot of drivers tries to allocate pages in atomic 
context. The point is to simplify the codes to make it easy to 
determince the correctness so we can add optimization on top simply by 
benchmarking the difference.

E.g we have serveral places that accesses orig_phys:

1) map_page(), write
2) unmap_page(), write
3) page fault handler, read

It's not clear to me how they were synchronized. Or if it was 
synchronzied implicitly (via iova allocator?), we'd better document it. 
Or simply use spinlock (which is the preferrable way I'd like to go). We 
probably don't need to worry too much about the cost of spinlock since 
iova allocater use it heavily.

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver
       [not found]             ` <CACycT3v6Lj61fafztOuzBNFLs2TbKeqrNLXkzv5RK6-h-iTnvA@mail.gmail.com>
@ 2021-03-26  6:16               ` Jason Wang
       [not found]                 ` <CACycT3t+2MC9rQ7iWdWQ4=O3ojCXHvHZ-M7y7AjXoXYZUiAOzQ@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Wang @ 2021-03-26  6:16 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, netdev,
	Randy Dunlap, Matthew Wilcox, virtualization, Christoph Hellwig,
	Bob Liu, bcrl, viro, Stefan Hajnoczi, linux-fsdevel,
	Dan Carpenter, Mika Penttilä


[-- Attachment #1.1: Type: text/plain, Size: 3148 bytes --]


在 2021/3/26 下午1:14, Yongji Xie 写道:
>>>>>>> +     }
>>>>>>> +     map->bounce_page = page;
>>>>>>> +
>>>>>>> +     /* paired with vduse_domain_map_page() */
>>>>>>> +     smp_mb();
>>>>>> So this is suspicious. It's better to explain like, we need make sure A
>>>>>> must be done after B.
>>>>> OK. I see. It's used to protect this pattern:
>>>>>
>>>>>       vduse_domain_alloc_bounce_page:          vduse_domain_map_page:
>>>>>       write map->bounce_page                           write map->orig_phys
>>>>>       mb()                                                            mb()
>>>>>       read map->orig_phys                                 read map->bounce_page
>>>>>
>>>>> Make sure there will always be a path to do bouncing.
>>>> Ok.
>>>>
>>>>
>>>>>> And it looks to me the iotlb_lock is sufficnet to do the synchronization
>>>>>> here. E.g any reason that you don't take it in
>>>>>> vduse_domain_map_bounce_page().
>>>>>>
>>>>> Yes, we can. But the performance in multi-queue cases will go down if
>>>>> we use iotlb_lock on this critical path.
>>>>>
>>>>>> And what's more, is there anyway to aovid holding the spinlock during
>>>>>> bouncing?
>>>>>>
>>>>> Looks like we can't. In the case that multiple page faults happen on
>>>>> the same page, we should make sure the bouncing is done before any
>>>>> page fault handler returns.
>>>> So it looks to me all those extra complexitiy comes from the fact that
>>>> the bounce_page and orig_phys are set by different places so we need to
>>>> do the bouncing in two places.
>>>>
>>>> I wonder how much we can gain from the "lazy" boucning in page fault.
>>>> The buffer mapped via dma_ops from virtio driver is expected to be
>>>> accessed by the userspace soon.  It looks to me we can do all those
>>>> stuffs during dma_map() then things would be greatly simplified.
>>>>
>>> If so, we need to allocate lots of pages from the pool reserved for
>>> atomic memory allocation requests.
>> This should be fine, a lot of drivers tries to allocate pages in atomic
>> context. The point is to simplify the codes to make it easy to
>> determince the correctness so we can add optimization on top simply by
>> benchmarking the difference.
>>
> OK. I will use this way in the next version.
>
>> E.g we have serveral places that accesses orig_phys:
>>
>> 1) map_page(), write
>> 2) unmap_page(), write
>> 3) page fault handler, read
>>
>> It's not clear to me how they were synchronized. Or if it was
>> synchronzied implicitly (via iova allocator?), we'd better document it.
> Yes.
>
>> Or simply use spinlock (which is the preferrable way I'd like to go). We
>> probably don't need to worry too much about the cost of spinlock since
>> iova allocater use it heavily.
>>
> Actually iova allocator implements a per-CPU cache to optimize it.
>
> Thanks,
> Yongji


Right, but have a quick glance, I guess what you meant is that usually 
there's no lock contention unless cpu hot-plug. This can work but the 
problem is that such synchornization depends on the internal 
implementation of IOVA allocator which is kind of fragile. I still think 
we should do that on our own.

Thanks



[-- Attachment #1.2: Type: text/html, Size: 5228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver
       [not found]                 ` <CACycT3t+2MC9rQ7iWdWQ4=O3ojCXHvHZ-M7y7AjXoXYZUiAOzQ@mail.gmail.com>
@ 2021-03-26  7:36                   ` Jason Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2021-03-26  7:36 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jens Axboe, Jonathan Corbet, kvm, Michael S. Tsirkin, netdev,
	Randy Dunlap, Matthew Wilcox, virtualization, Christoph Hellwig,
	bcrl, viro, Stefan Hajnoczi, linux-fsdevel, Dan Carpenter,
	Mika Penttilä


在 2021/3/26 下午2:56, Yongji Xie 写道:
> On Fri, Mar 26, 2021 at 2:16 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> 在 2021/3/26 下午1:14, Yongji Xie 写道:
>>
>> +     }
>> +     map->bounce_page = page;
>> +
>> +     /* paired with vduse_domain_map_page() */
>> +     smp_mb();
>>
>> So this is suspicious. It's better to explain like, we need make sure A
>> must be done after B.
>>
>> OK. I see. It's used to protect this pattern:
>>
>>       vduse_domain_alloc_bounce_page:          vduse_domain_map_page:
>>       write map->bounce_page                           write map->orig_phys
>>       mb()                                                            mb()
>>       read map->orig_phys                                 read map->bounce_page
>>
>> Make sure there will always be a path to do bouncing.
>>
>> Ok.
>>
>>
>> And it looks to me the iotlb_lock is sufficnet to do the synchronization
>> here. E.g any reason that you don't take it in
>> vduse_domain_map_bounce_page().
>>
>> Yes, we can. But the performance in multi-queue cases will go down if
>> we use iotlb_lock on this critical path.
>>
>> And what's more, is there anyway to aovid holding the spinlock during
>> bouncing?
>>
>> Looks like we can't. In the case that multiple page faults happen on
>> the same page, we should make sure the bouncing is done before any
>> page fault handler returns.
>>
>> So it looks to me all those extra complexitiy comes from the fact that
>> the bounce_page and orig_phys are set by different places so we need to
>> do the bouncing in two places.
>>
>> I wonder how much we can gain from the "lazy" boucning in page fault.
>> The buffer mapped via dma_ops from virtio driver is expected to be
>> accessed by the userspace soon.  It looks to me we can do all those
>> stuffs during dma_map() then things would be greatly simplified.
>>
>> If so, we need to allocate lots of pages from the pool reserved for
>> atomic memory allocation requests.
>>
>> This should be fine, a lot of drivers tries to allocate pages in atomic
>> context. The point is to simplify the codes to make it easy to
>> determince the correctness so we can add optimization on top simply by
>> benchmarking the difference.
>>
>> OK. I will use this way in the next version.
>>
>> E.g we have serveral places that accesses orig_phys:
>>
>> 1) map_page(), write
>> 2) unmap_page(), write
>> 3) page fault handler, read
>>
>> It's not clear to me how they were synchronized. Or if it was
>> synchronzied implicitly (via iova allocator?), we'd better document it.
>>
>> Yes.
>>
>> Or simply use spinlock (which is the preferrable way I'd like to go). We
>> probably don't need to worry too much about the cost of spinlock since
>> iova allocater use it heavily.
>>
>> Actually iova allocator implements a per-CPU cache to optimize it.
>>
>> Thanks,
>> Yongji
>>
>>
>> Right, but have a quick glance, I guess what you meant is that usually there's no lock contention unless cpu hot-plug. This can work but the problem is that such synchornization depends on the internal implementation of IOVA allocator which is kind of fragile. I still think we should do that on our own.
>>
> I might miss something. Looks like we don't need any synchronization
> if the page fault handler is removed as you suggested. We should not
> access the same orig_phys concurrently (in map_page() and
> unmap_page()) unless we free the iova before accessing.
>
> Thanks,
> Yongji


You're right. I overestimate the complexitiy that is required by the 
synchronization.

Thanks


>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-26  7:36 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210315053721.189-1-xieyongji@bytedance.com>
     [not found] ` <20210315053721.189-2-xieyongji@bytedance.com>
2021-03-15  9:08   ` [PATCH v5 01/11] file: Export __receive_fd() to modules Christoph Hellwig
     [not found]     ` <CACycT3vrHOExXj6v8ULvUzdLcRkdzS5=TNK6=g4+RWEdN-nOJw@mail.gmail.com>
2021-03-25  8:23       ` Christoph Hellwig
     [not found] ` <20210315053721.189-7-xieyongji@bytedance.com>
2021-03-23  3:09   ` [PATCH v5 06/11] vdpa: factor out vhost_vdpa_pa_map() Jason Wang
     [not found] ` <20210315053721.189-8-xieyongji@bytedance.com>
2021-03-23  3:13   ` [PATCH v5 07/11] vdpa: Support transferring virtual addressing during DMA mapping Jason Wang
     [not found] ` <20210315053721.189-9-xieyongji@bytedance.com>
2021-03-24  3:54   ` [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver Jason Wang
     [not found]     ` <CACycT3v_-G6ju-poofXEzYt8QPKWNFHwsS7t=KTLgs-=g+iPQQ@mail.gmail.com>
2021-03-25  4:52       ` Jason Wang
     [not found]         ` <CACycT3uS870yy04rw7KBk==sioi+VNunxVz6BQH-Lmxk6m-VSg@mail.gmail.com>
2021-03-26  4:26           ` Jason Wang
     [not found]             ` <CACycT3v6Lj61fafztOuzBNFLs2TbKeqrNLXkzv5RK6-h-iTnvA@mail.gmail.com>
2021-03-26  6:16               ` Jason Wang
     [not found]                 ` <CACycT3t+2MC9rQ7iWdWQ4=O3ojCXHvHZ-M7y7AjXoXYZUiAOzQ@mail.gmail.com>
2021-03-26  7:36                   ` Jason Wang
     [not found] ` <20210315053721.189-10-xieyongji@bytedance.com>
2021-03-24  4:43   ` [PATCH v5 09/11] vduse: Introduce VDUSE - vDPA Device in Userspace Jason Wang
     [not found]     ` <CACycT3u5hEO8=JVkbHmKXYMManiZ1VedyS6O+7M8Rzbk1ftC7A@mail.gmail.com>
2021-03-25  6:30       ` Jason Wang
     [not found] ` <20210315053721.189-11-xieyongji@bytedance.com>
2021-03-24  4:45   ` [PATCH v5 10/11] vduse: Add config interrupt support Jason Wang
     [not found] ` <20210315053721.189-4-xieyongji@bytedance.com>
2021-03-23  3:02   ` [PATCH v5 03/11] vhost-vdpa: protect concurrent access to vhost device iotlb Jason Wang
2021-03-25 11:08   ` Stefano Garzarella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).