IOMMU Archive on lore.kernel.org
 help / color / Atom feed
* (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
@ 2020-05-15  7:39 Tian, Kevin
  2020-05-15 15:19 ` Raj, Ashok
  2020-05-15 17:59 ` Alex Williamson
  0 siblings, 2 replies; 5+ messages in thread
From: Tian, Kevin @ 2020-05-15  7:39 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jean-philippe, Raj, Ashok, kvm, Tian, Jun J, iommu, linux-kernel,
	Sun, Yi Y, Wu, Hao

Hi, Alex,

When working on an updated version Yi and I found an design open
which needs your guidance.

In concept nested translation can be incarnated as one GPA->HPA page
table and multiple GVA->GPA page tables per VM. It means one container
is sufficient to include all SVA-capable devices assigned to the same guest,
as there is just one iova space (GPA->HPA) to be managed by vfio iommu
map/unmap api. GVA->GPA page tables are bound through the new api
introduced in this patch. It is different from legacy shadow translation 
which merges GIOVA->GPA->HPA into GIOVA->HPA thus each device requires 
its own iova space and must be in a separate container.

However supporting multiple SVA-capable devices in one container 
imposes one challenge. Now the bind_guest_pgtbl is implemented as
iommu type1 api. From the definition of iommu type 1 any operation
should be applied to all devices within the same container, just like
dma map/unmap. However this philosophy is incorrect regarding to
page table binding. We must follow the guest binding requirements 
between its page tables and assigned devices, otherwise every bound
address space is suddenly accessible by all assigned devices thus creating
security holes.

Do you think whether it's possible to extend iommu api to accept
device specific cmd? If not, moving it to vfio-pci or vfio-mdev sounds
also problematic, as PASID and page tables are IOMMU things which
are not touched by vfio device drivers today.

Alternatively can we impose the limitation that nesting APIs can be
only enabled for singleton containers which contains only one device?
This basically falls back to the state of legacy shadow translation 
vIOMMU. and our current Qemu vIOMMU implementation actually 
does this way regardless of whether nesting is used. Do you think 
whether such tradeoff is acceptable as a starting point?

Thanks
Kevin

> -----Original Message-----
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> To: alex.williamson@redhat.com; eric.auger@redhat.com
> Cc: Tian, Kevin <kevin.tian@intel.com>; jacob.jun.pan@linux.intel.com;
> joro@8bytes.org; Raj, Ashok <ashok.raj@intel.com>; Liu, Yi L
> <yi.l.liu@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
> <yi.y.sun@intel.com>; jean-philippe@linaro.org; peterx@redhat.com;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; Wu, Hao <hao.wu@intel.com>
> Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> hardware
> IOMMUs that have nesting DMA translation (a.k.a dual stage address
> translation). For such hardware IOMMUs, there are two stages/levels of
> address translation, and software may let userspace/VM to own the first-
> level/stage-1 translation structures. Example of such usage is vSVA (
> virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> translation structures and bind the structures to host, then hardware
> IOMMU would utilize nesting translation when doing DMA translation fo
> the devices behind such hardware IOMMU.
> 
> This patch adds vfio support for binding guest translation (a.k.a stage 1)
> structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> bind
> guest page table is needed, it also requires to expose interface to guest
> for iommu cache invalidation when guest modified the first-level/stage-1
> translation structures since hardware needs to be notified to flush stale
> iotlbs. This would be introduced in next patch.
> 
> In this patch, guest page table bind and unbind are done by using flags
> VFIO_IOMMU_BIND_GUEST_PGTBL and
> VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> struct iommu_gpasid_bind_data. Before binding guest page table to host,
> VM should have got a PASID allocated by host via
> VFIO_IOMMU_PASID_REQUEST.
> 
> Bind guest translation structures (here is guest page table) to host
> are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 158
> ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  46 ++++++++++++
>  2 files changed, 204 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 82a9e0b..a877747 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -130,6 +130,33 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>  					(!list_empty(&iommu->domain_list))
> 
> +struct domain_capsule {
> +	struct iommu_domain *domain;
> +	void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> +		      int (*fn)(struct device *dev, void *data),
> +		      void *data)
> +{
> +	struct domain_capsule dc = {.data = data};
> +	struct vfio_domain *d;
> +	struct vfio_group *g;
> +	int ret = 0;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		dc.domain = d->domain;
> +		list_for_each_entry(g, &d->group_list, next) {
> +			ret = iommu_group_for_each_dev(g->iommu_group,
> +						       &dc, fn);
> +			if (ret)
> +				break;
> +		}
> +	}
> +	return ret;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
> 
>  /*
> @@ -2314,6 +2341,88 @@ static int
> vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  	return 0;
>  }
> 
> +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *gbind_data =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +
> +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +}
> +
> +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *gbind_data =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +
> +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> +					gbind_data->hpasid);
> +}
> +
> +/**
> + * Unbind specific gpasid, caller of this function requires hold
> + * vfio_iommu->lock
> + */
> +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	return vfio_iommu_for_each_dev(iommu,
> +				vfio_unbind_gpasid_fn, gbind_data);
> +}
> +
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_for_each_dev(iommu,
> +			vfio_bind_gpasid_fn, gbind_data);
> +	/*
> +	 * If bind failed, it may not be a total failure. Some devices
> +	 * within the iommu group may have bind successfully. Although
> +	 * we don't enable pasid capability for non-singletion iommu
> +	 * groups, a unbind operation would be helpful to ensure no
> +	 * partial binding for an iommu group.
> +	 */
> +	if (ret)
> +		/*
> +		 * Undo all binds that already succeeded, no need to
> +		 * check the return value here since some device within
> +		 * the group has no successful bind when coming to this
> +		 * place switch.
> +		 */
> +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>  		default:
>  			return -EINVAL;
>  		}
> +
> +	} else if (cmd == VFIO_IOMMU_BIND) {
> +		struct vfio_iommu_type1_bind bind;
> +		u32 version;
> +		int data_size;
> +		void *gbind_data;
> +		int ret;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> +
> +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind.argsz < minsz)
> +			return -EINVAL;
> +
> +		/* Get the version of struct iommu_gpasid_bind_data */
> +		if (copy_from_user(&version,
> +			(void __user *) (arg + minsz),
> +					sizeof(version)))
> +			return -EFAULT;
> +
> +		data_size = iommu_uapi_get_data_size(
> +				IOMMU_UAPI_BIND_GPASID, version);
> +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> +		if (!gbind_data)
> +			return -ENOMEM;
> +
> +		if (copy_from_user(gbind_data,
> +			 (void __user *) (arg + minsz), data_size)) {
> +			kfree(gbind_data);
> +			return -EFAULT;
> +		}
> +
> +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> +							   gbind_data);
> +			break;
> +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> +							     gbind_data);
> +			break;
> +		default:
> +			ret = -EINVAL;
> +			break;
> +		}
> +		kfree(gbind_data);
> +		return ret;
>  	}
> 
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ebeaf3e..2235bc6 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
> 
>  #include <linux/types.h>
>  #include <linux/ioctl.h>
> +#include <linux/iommu.h>
> 
>  #define VFIO_API_VERSION	0
> 
> @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
>   */
>  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> 22)
> 
> +/**
> + * Supported flags:
> + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host
> for
> + *			nesting type IOMMUs. In @data field It takes struct
> + *			iommu_gpasid_bind_data.
> + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page
> table operation
> + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> + *
> + */
> +struct vfio_iommu_type1_bind {
> +	__u32		argsz;
> +	__u32		flags;
> +#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
> +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
> +	__u8		data[];
> +};
> +
> +#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL
> | \
> +
> 	VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> +
> +/**
> + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> + *				struct vfio_iommu_type1_bind)
> + *
> + * Manage address spaces of devices in this container. Initially a TYPE1
> + * container can only have one address space, managed with
> + * VFIO_IOMMU_MAP/UNMAP_DMA.
> + *
> + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by
> both MAP/UNMAP
> + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host)
> page
> + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> + * IOMMU may allow MAP/UNMAP and BIND to coexist, where
> MAP/UNMAP controls
> + * the traffics only require single stage translation while BIND controls the
> + * traffics require nesting translation. But this depends on the underlying
> + * IOMMU architecture and isn't guaranteed. Example of this is the guest
> SVA
> + * traffics, such traffics need nesting translation to gain gVA->gPA and then
> + * gPA->hPA translation.
> + *
> + * Availability of this feature depends on the device, its bus, the underlying
> + * IOMMU and the CPU architecture.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> 
>  /*
> --
> 2.7.4

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-05-15  7:39 (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Tian, Kevin
@ 2020-05-15 15:19 ` Raj, Ashok
  2020-05-15 23:36   ` Tian, Kevin
  2020-05-15 17:59 ` Alex Williamson
  1 sibling, 1 reply; 5+ messages in thread
From: Raj, Ashok @ 2020-05-15 15:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: jean-philippe, Ashok Raj, kvm, iommu, Tian, Jun J, Sun, Yi Y,
	linux-kernel, alex.williamson, Wu, Hao

On Fri, May 15, 2020 at 12:39:14AM -0700, Tian, Kevin wrote:
> Hi, Alex,
> 
> When working on an updated version Yi and I found an design open
> which needs your guidance.
> 
> In concept nested translation can be incarnated as one GPA->HPA page
> table and multiple GVA->GPA page tables per VM. It means one container
> is sufficient to include all SVA-capable devices assigned to the same guest,
> as there is just one iova space (GPA->HPA) to be managed by vfio iommu
> map/unmap api. GVA->GPA page tables are bound through the new api
> introduced in this patch. It is different from legacy shadow translation
> which merges GIOVA->GPA->HPA into GIOVA->HPA thus each device requires
> its own iova space and must be in a separate container.
> 
> However supporting multiple SVA-capable devices in one container
> imposes one challenge. Now the bind_guest_pgtbl is implemented as
> iommu type1 api. From the definition of iommu type 1 any operation
> should be applied to all devices within the same container, just like
> dma map/unmap. However this philosophy is incorrect regarding to
> page table binding. We must follow the guest binding requirements
> between its page tables and assigned devices, otherwise every bound
> address space is suddenly accessible by all assigned devices thus creating
> security holes.

The above 2 paragraphs are bit confusing :-) but what really matters
is the below: 
> 
> Do you think whether it's possible to extend iommu api to accept
> device specific cmd? If not, moving it to vfio-pci or vfio-mdev sounds
> also problematic, as PASID and page tables are IOMMU things which
> are not touched by vfio device drivers today.

All you are referring to is when admin groups multiple devices in a
single container, you are saying you can't give isolation to them
for SVA purposes. This is logically equivalent to a single group with
multiple devices.

	- Such as devices behind p2p bridge
	- MFD's or devices behind switches or hieararchies without ACS
	  support for instance.

By limitation you mean, disable PASID on those devices in a single 
container?

what about ATS? 

> 
> Alternatively can we impose the limitation that nesting APIs can be
> only enabled for singleton containers which contains only one device?
> This basically falls back to the state of legacy shadow translation
> vIOMMU. and our current Qemu vIOMMU implementation actually
> does this way regardless of whether nesting is used. Do you think
> whether such tradeoff is acceptable as a starting point?
> 
> Thanks
> Kevin
> 
> > -----Original Message-----
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> > To: alex.williamson@redhat.com; eric.auger@redhat.com
> > Cc: Tian, Kevin <kevin.tian@intel.com>; jacob.jun.pan@linux.intel.com;
> > joro@8bytes.org; Raj, Ashok <ashok.raj@intel.com>; Liu, Yi L
> > <yi.l.liu@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
> > <yi.y.sun@intel.com>; jean-philippe@linaro.org; peterx@redhat.com;
> > iommu@lists.linux-foundation.org; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; Wu, Hao <hao.wu@intel.com>
> > Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> >
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > hardware
> > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > translation). For such hardware IOMMUs, there are two stages/levels of
> > address translation, and software may let userspace/VM to own the first-
> > level/stage-1 translation structures. Example of such usage is vSVA (
> > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > translation structures and bind the structures to host, then hardware
> > IOMMU would utilize nesting translation when doing DMA translation fo
> > the devices behind such hardware IOMMU.
> >
> > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> > bind
> > guest page table is needed, it also requires to expose interface to guest
> > for iommu cache invalidation when guest modified the first-level/stage-1
> > translation structures since hardware needs to be notified to flush stale
> > iotlbs. This would be introduced in next patch.
> >
> > In this patch, guest page table bind and unbind are done by using flags
> > VFIO_IOMMU_BIND_GUEST_PGTBL and
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > VM should have got a PASID allocated by host via
> > VFIO_IOMMU_PASID_REQUEST.
> >
> > Bind guest translation structures (here is guest page table) to host
> > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 158
> > ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> >  2 files changed, 204 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 82a9e0b..a877747 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -130,6 +130,33 @@ struct vfio_regions {
> >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
> >  (!list_empty(&iommu->domain_list))
> >
> > +struct domain_capsule {
> > +struct iommu_domain *domain;
> > +void *data;
> > +};
> > +
> > +/* iommu->lock must be held */
> > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > +      int (*fn)(struct device *dev, void *data),
> > +      void *data)
> > +{
> > +struct domain_capsule dc = {.data = data};
> > +struct vfio_domain *d;
> > +struct vfio_group *g;
> > +int ret = 0;
> > +
> > +list_for_each_entry(d, &iommu->domain_list, next) {
> > +dc.domain = d->domain;
> > +list_for_each_entry(g, &d->group_list, next) {
> > +ret = iommu_group_for_each_dev(g->iommu_group,
> > +       &dc, fn);
> > +if (ret)
> > +break;
> > +}
> > +}
> > +return ret;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> >
> >  /*
> > @@ -2314,6 +2341,88 @@ static int
> > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  return 0;
> >  }
> >
> > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +struct domain_capsule *dc = (struct domain_capsule *)data;
> > +struct iommu_gpasid_bind_data *gbind_data =
> > +(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > +}
> > +
> > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +struct domain_capsule *dc = (struct domain_capsule *)data;
> > +struct iommu_gpasid_bind_data *gbind_data =
> > +(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +gbind_data->hpasid);
> > +}
> > +
> > +/**
> > + * Unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> > *iommu,
> > +struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +return vfio_iommu_for_each_dev(iommu,
> > +vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > +struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +int ret = 0;
> > +
> > +mutex_lock(&iommu->lock);
> > +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +ret = -EINVAL;
> > +goto out_unlock;
> > +}
> > +
> > +ret = vfio_iommu_for_each_dev(iommu,
> > +vfio_bind_gpasid_fn, gbind_data);
> > +/*
> > + * If bind failed, it may not be a total failure. Some devices
> > + * within the iommu group may have bind successfully. Although
> > + * we don't enable pasid capability for non-singletion iommu
> > + * groups, a unbind operation would be helpful to ensure no
> > + * partial binding for an iommu group.
> > + */
> > +if (ret)
> > +/*
> > + * Undo all binds that already succeeded, no need to
> > + * check the return value here since some device within
> > + * the group has no successful bind when coming to this
> > + * place switch.
> > + */
> > +vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > +
> > +out_unlock:
> > +mutex_unlock(&iommu->lock);
> > +return ret;
> > +}
> > +
> > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > +struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +int ret = 0;
> > +
> > +mutex_lock(&iommu->lock);
> > +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +ret = -EINVAL;
> > +goto out_unlock;
> > +}
> > +
> > +ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > +
> > +out_unlock:
> > +mutex_unlock(&iommu->lock);
> > +return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >     unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void
> > *iommu_data,
> >  default:
> >  return -EINVAL;
> >  }
> > +
> > +} else if (cmd == VFIO_IOMMU_BIND) {
> > +struct vfio_iommu_type1_bind bind;
> > +u32 version;
> > +int data_size;
> > +void *gbind_data;
> > +int ret;
> > +
> > +minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > +
> > +if (copy_from_user(&bind, (void __user *)arg, minsz))
> > +return -EFAULT;
> > +
> > +if (bind.argsz < minsz)
> > +return -EINVAL;
> > +
> > +/* Get the version of struct iommu_gpasid_bind_data */
> > +if (copy_from_user(&version,
> > +(void __user *) (arg + minsz),
> > +sizeof(version)))
> > +return -EFAULT;
> > +
> > +data_size = iommu_uapi_get_data_size(
> > +IOMMU_UAPI_BIND_GPASID, version);
> > +gbind_data = kzalloc(data_size, GFP_KERNEL);
> > +if (!gbind_data)
> > +return -ENOMEM;
> > +
> > +if (copy_from_user(gbind_data,
> > + (void __user *) (arg + minsz), data_size)) {
> > +kfree(gbind_data);
> > +return -EFAULT;
> > +}
> > +
> > +switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > +case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > +ret = vfio_iommu_type1_bind_gpasid(iommu,
> > +   gbind_data);
> > +break;
> > +case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > +ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > +     gbind_data);
> > +break;
> > +default:
> > +ret = -EINVAL;
> > +break;
> > +}
> > +kfree(gbind_data);
> > +return ret;
> >  }
> >
> >  return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ebeaf3e..2235bc6 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -14,6 +14,7 @@
> >
> >  #include <linux/types.h>
> >  #include <linux/ioctl.h>
> > +#include <linux/iommu.h>
> >
> >  #define VFIO_API_VERSION0
> >
> > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> >   */
> >  #define VFIO_IOMMU_PASID_REQUEST_IO(VFIO_TYPE, VFIO_BASE +
> > 22)
> >
> > +/**
> > + * Supported flags:
> > + *- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host
> > for
> > + *nesting type IOMMUs. In @data field It takes struct
> > + *iommu_gpasid_bind_data.
> > + *- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page
> > table operation
> > + *invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> > + *
> > + */
> > +struct vfio_iommu_type1_bind {
> > +__u32argsz;
> > +__u32flags;
> > +#define VFIO_IOMMU_BIND_GUEST_PGTBL(1 << 0)
> > +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL(1 << 1)
> > +__u8data[];
> > +};
> > +
> > +#define VFIO_IOMMU_BIND_MASK(VFIO_IOMMU_BIND_GUEST_PGTBL
> > | \
> > +
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> > +
> > +/**
> > + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> > + *struct vfio_iommu_type1_bind)
> > + *
> > + * Manage address spaces of devices in this container. Initially a TYPE1
> > + * container can only have one address space, managed with
> > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> > + *
> > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by
> > both MAP/UNMAP
> > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host)
> > page
> > + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where
> > MAP/UNMAP controls
> > + * the traffics only require single stage translation while BIND controls the
> > + * traffics require nesting translation. But this depends on the underlying
> > + * IOMMU architecture and isn't guaranteed. Example of this is the guest
> > SVA
> > + * traffics, such traffics need nesting translation to gain gVA->gPA and then
> > + * gPA->hPA translation.
> > + *
> > + * Availability of this feature depends on the device, its bus, the underlying
> > + * IOMMU and the CPU architecture.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_BIND_IO(VFIO_TYPE, VFIO_BASE + 23)
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> >  /*
> > --
> > 2.7.4
> 
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-05-15  7:39 (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Tian, Kevin
  2020-05-15 15:19 ` Raj, Ashok
@ 2020-05-15 17:59 ` Alex Williamson
  2020-05-15 23:26   ` Tian, Kevin
  1 sibling, 1 reply; 5+ messages in thread
From: Alex Williamson @ 2020-05-15 17:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: jean-philippe, Raj, Ashok, kvm, Tian, Jun J, iommu, linux-kernel,
	Sun,  Yi Y, Wu, Hao

On Fri, 15 May 2020 07:39:14 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> Hi, Alex,
> 
> When working on an updated version Yi and I found an design open
> which needs your guidance.
> 
> In concept nested translation can be incarnated as one GPA->HPA page
> table and multiple GVA->GPA page tables per VM. It means one container
> is sufficient to include all SVA-capable devices assigned to the same guest,
> as there is just one iova space (GPA->HPA) to be managed by vfio iommu
> map/unmap api. GVA->GPA page tables are bound through the new api
> introduced in this patch. It is different from legacy shadow translation 
> which merges GIOVA->GPA->HPA into GIOVA->HPA thus each device requires 
> its own iova space and must be in a separate container.

I think you've redefined the container as specifically IOVA context,
but I think a container is actually more of a shared IOMMU context
between groups and devices within those groups.  Userspace is under no
obligation to share a container between groups and the kernel is under
no obligation to let groups share a container.

> However supporting multiple SVA-capable devices in one container 
> imposes one challenge. Now the bind_guest_pgtbl is implemented as
> iommu type1 api. From the definition of iommu type 1 any operation
> should be applied to all devices within the same container, just like
> dma map/unmap. However this philosophy is incorrect regarding to
> page table binding. We must follow the guest binding requirements 
> between its page tables and assigned devices, otherwise every bound
> address space is suddenly accessible by all assigned devices thus creating
> security holes.

Is that a security hole, or is that simply the nature of a container?
Containers are meant to allow a user to share an IOMMU context between
groups of devices.  Sharing that context necessarily implies that the
user is granting the devices access to those address spaces.

> Do you think whether it's possible to extend iommu api to accept
> device specific cmd? If not, moving it to vfio-pci or vfio-mdev sounds
> also problematic, as PASID and page tables are IOMMU things which
> are not touched by vfio device drivers today.

Is this really that different from a group holding multiple devices,
each of which has a unique requester ID as seen by the IOMMU?  The
IOMMU can support separate contexts per device, but the topology
restricts us that those contexts are not fully isolated.  So we've
imposed the group restriction on ourselves to reflect that.  If a user
wants to share an IOMMU context between devices, maybe that precludes
the user from making use of nested translation.
 
> Alternatively can we impose the limitation that nesting APIs can be
> only enabled for singleton containers which contains only one device?
> This basically falls back to the state of legacy shadow translation 
> vIOMMU. and our current Qemu vIOMMU implementation actually 
> does this way regardless of whether nesting is used. Do you think 
> whether such tradeoff is acceptable as a starting point?

I think it's an acceptable starting point.  It seems what you're
describing is separating a monolithic notion of IOMMU context into
multiple layers, so we can share them at different points, ex. share a
GPA->HPA context among multiple different GVA->GPA contexts.  That
seems to imply something like the sub/super-container idea that's been
tossed around before, but never been fully defined or explored.  Thanks,

Alex

> > -----Original Message-----
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> > To: alex.williamson@redhat.com; eric.auger@redhat.com
> > Cc: Tian, Kevin <kevin.tian@intel.com>; jacob.jun.pan@linux.intel.com;
> > joro@8bytes.org; Raj, Ashok <ashok.raj@intel.com>; Liu, Yi L
> > <yi.l.liu@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
> > <yi.y.sun@intel.com>; jean-philippe@linaro.org; peterx@redhat.com;
> > iommu@lists.linux-foundation.org; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; Wu, Hao <hao.wu@intel.com>
> > Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> > 
> > From: Liu Yi L <yi.l.liu@intel.com>
> > 
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > hardware
> > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > translation). For such hardware IOMMUs, there are two stages/levels of
> > address translation, and software may let userspace/VM to own the first-
> > level/stage-1 translation structures. Example of such usage is vSVA (
> > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > translation structures and bind the structures to host, then hardware
> > IOMMU would utilize nesting translation when doing DMA translation fo
> > the devices behind such hardware IOMMU.
> > 
> > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> > bind
> > guest page table is needed, it also requires to expose interface to guest
> > for iommu cache invalidation when guest modified the first-level/stage-1
> > translation structures since hardware needs to be notified to flush stale
> > iotlbs. This would be introduced in next patch.
> > 
> > In this patch, guest page table bind and unbind are done by using flags
> > VFIO_IOMMU_BIND_GUEST_PGTBL and
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > VM should have got a PASID allocated by host via
> > VFIO_IOMMU_PASID_REQUEST.
> > 
> > Bind guest translation structures (here is guest page table) to host
> > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > 
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 158
> > ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> >  2 files changed, 204 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 82a9e0b..a877747 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -130,6 +130,33 @@ struct vfio_regions {
> >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> >  					(!list_empty(&iommu->domain_list))
> > 
> > +struct domain_capsule {
> > +	struct iommu_domain *domain;
> > +	void *data;
> > +};
> > +
> > +/* iommu->lock must be held */
> > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > +		      int (*fn)(struct device *dev, void *data),
> > +		      void *data)
> > +{
> > +	struct domain_capsule dc = {.data = data};
> > +	struct vfio_domain *d;
> > +	struct vfio_group *g;
> > +	int ret = 0;
> > +
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		dc.domain = d->domain;
> > +		list_for_each_entry(g, &d->group_list, next) {
> > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > +						       &dc, fn);
> > +			if (ret)
> > +				break;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> > 
> >  /*
> > @@ -2314,6 +2341,88 @@ static int
> > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  	return 0;
> >  }
> > 
> > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *gbind_data =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > +}
> > +
> > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *gbind_data =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +					gbind_data->hpasid);
> > +}
> > +
> > +/**
> > + * Unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> > *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	return vfio_iommu_for_each_dev(iommu,
> > +				vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_for_each_dev(iommu,
> > +			vfio_bind_gpasid_fn, gbind_data);
> > +	/*
> > +	 * If bind failed, it may not be a total failure. Some devices
> > +	 * within the iommu group may have bind successfully. Although
> > +	 * we don't enable pasid capability for non-singletion iommu
> > +	 * groups, a unbind operation would be helpful to ensure no
> > +	 * partial binding for an iommu group.
> > +	 */
> > +	if (ret)
> > +		/*
> > +		 * Undo all binds that already succeeded, no need to
> > +		 * check the return value here since some device within
> > +		 * the group has no successful bind when coming to this
> > +		 * place switch.
> > +		 */
> > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void
> > *iommu_data,
> >  		default:
> >  			return -EINVAL;
> >  		}
> > +
> > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > +		struct vfio_iommu_type1_bind bind;
> > +		u32 version;
> > +		int data_size;
> > +		void *gbind_data;
> > +		int ret;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > +
> > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (bind.argsz < minsz)
> > +			return -EINVAL;
> > +
> > +		/* Get the version of struct iommu_gpasid_bind_data */
> > +		if (copy_from_user(&version,
> > +			(void __user *) (arg + minsz),
> > +					sizeof(version)))
> > +			return -EFAULT;
> > +
> > +		data_size = iommu_uapi_get_data_size(
> > +				IOMMU_UAPI_BIND_GPASID, version);
> > +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> > +		if (!gbind_data)
> > +			return -ENOMEM;
> > +
> > +		if (copy_from_user(gbind_data,
> > +			 (void __user *) (arg + minsz), data_size)) {
> > +			kfree(gbind_data);
> > +			return -EFAULT;
> > +		}
> > +
> > +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> > +							   gbind_data);
> > +			break;
> > +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > +							     gbind_data);
> > +			break;
> > +		default:
> > +			ret = -EINVAL;
> > +			break;
> > +		}
> > +		kfree(gbind_data);
> > +		return ret;
> >  	}
> > 
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ebeaf3e..2235bc6 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -14,6 +14,7 @@
> > 
> >  #include <linux/types.h>
> >  #include <linux/ioctl.h>
> > +#include <linux/iommu.h>
> > 
> >  #define VFIO_API_VERSION	0
> > 
> > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> >   */
> >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> > 22)
> > 
> > +/**
> > + * Supported flags:
> > + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host
> > for
> > + *			nesting type IOMMUs. In @data field It takes struct
> > + *			iommu_gpasid_bind_data.
> > + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page
> > table operation
> > + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> > + *
> > + */
> > +struct vfio_iommu_type1_bind {
> > +	__u32		argsz;
> > +	__u32		flags;
> > +#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
> > +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
> > +	__u8		data[];
> > +};
> > +
> > +#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL
> > | \
> > +
> > 	VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> > +
> > +/**
> > + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> > + *				struct vfio_iommu_type1_bind)
> > + *
> > + * Manage address spaces of devices in this container. Initially a TYPE1
> > + * container can only have one address space, managed with
> > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> > + *
> > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by
> > both MAP/UNMAP
> > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host)
> > page
> > + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where
> > MAP/UNMAP controls
> > + * the traffics only require single stage translation while BIND controls the
> > + * traffics require nesting translation. But this depends on the underlying
> > + * IOMMU architecture and isn't guaranteed. Example of this is the guest
> > SVA
> > + * traffics, such traffics need nesting translation to gain gVA->gPA and then
> > + * gPA->hPA translation.
> > + *
> > + * Availability of this feature depends on the device, its bus, the underlying
> > + * IOMMU and the CPU architecture.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> > 
> >  /*
> > --
> > 2.7.4  
> 

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-05-15 17:59 ` Alex Williamson
@ 2020-05-15 23:26   ` Tian, Kevin
  0 siblings, 0 replies; 5+ messages in thread
From: Tian, Kevin @ 2020-05-15 23:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jean-philippe, Raj, Ashok, kvm, Tian, Jun J, iommu, linux-kernel,
	Sun, Yi Y, Wu, Hao

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, May 16, 2020 1:59 AM
> 
> On Fri, 15 May 2020 07:39:14 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > Hi, Alex,
> >
> > When working on an updated version Yi and I found an design open
> > which needs your guidance.
> >
> > In concept nested translation can be incarnated as one GPA->HPA page
> > table and multiple GVA->GPA page tables per VM. It means one container
> > is sufficient to include all SVA-capable devices assigned to the same guest,
> > as there is just one iova space (GPA->HPA) to be managed by vfio iommu
> > map/unmap api. GVA->GPA page tables are bound through the new api
> > introduced in this patch. It is different from legacy shadow translation
> > which merges GIOVA->GPA->HPA into GIOVA->HPA thus each device
> requires
> > its own iova space and must be in a separate container.
> 
> I think you've redefined the container as specifically IOVA context,
> but I think a container is actually more of a shared IOMMU context
> between groups and devices within those groups.  Userspace is under no
> obligation to share a container between groups and the kernel is under
> no obligation to let groups share a container.

oh, sorry, I didn't redefine anything here. Just describe the usage
examples of containers when vIOMMU is used. Before vSVA there
is just one address space per IOMMU context, which is what current
vfio iommu api is designed for. Now when extending it to support
vSVA there are two-layer multiple address spaces per IOMMU
 context, and the sharing possibility of those address spaces are 
different. Then this open is about whether we want to allow partial
sharing of IOMMU context within a container, i.e. sharing the 2nd
level but allows binding to different 1st levels.

> 
> > However supporting multiple SVA-capable devices in one container
> > imposes one challenge. Now the bind_guest_pgtbl is implemented as
> > iommu type1 api. From the definition of iommu type 1 any operation
> > should be applied to all devices within the same container, just like
> > dma map/unmap. However this philosophy is incorrect regarding to
> > page table binding. We must follow the guest binding requirements
> > between its page tables and assigned devices, otherwise every bound
> > address space is suddenly accessible by all assigned devices thus creating
> > security holes.
> 
> Is that a security hole, or is that simply the nature of a container?
> Containers are meant to allow a user to share an IOMMU context between
> groups of devices.  Sharing that context necessarily implies that the
> user is granting the devices access to those address spaces.

It's the nature of container. I just tried to state that the current api
is insufficient if we want to allow partial-sharing within container. 😊

> 
> > Do you think whether it's possible to extend iommu api to accept
> > device specific cmd? If not, moving it to vfio-pci or vfio-mdev sounds
> > also problematic, as PASID and page tables are IOMMU things which
> > are not touched by vfio device drivers today.
> 
> Is this really that different from a group holding multiple devices,
> each of which has a unique requester ID as seen by the IOMMU?  The
> IOMMU can support separate contexts per device, but the topology
> restricts us that those contexts are not fully isolated.  So we've
> imposed the group restriction on ourselves to reflect that.  If a user
> wants to share an IOMMU context between devices, maybe that precludes
> the user from making use of nested translation.

Agree.

> 
> > Alternatively can we impose the limitation that nesting APIs can be
> > only enabled for singleton containers which contains only one device?
> > This basically falls back to the state of legacy shadow translation
> > vIOMMU. and our current Qemu vIOMMU implementation actually
> > does this way regardless of whether nesting is used. Do you think
> > whether such tradeoff is acceptable as a starting point?
> 
> I think it's an acceptable starting point.  It seems what you're
> describing is separating a monolithic notion of IOMMU context into
> multiple layers, so we can share them at different points, ex. share a
> GPA->HPA context among multiple different GVA->GPA contexts.  That
> seems to imply something like the sub/super-container idea that's been
> tossed around before, but never been fully defined or explored.  Thanks,

yes, that'd be a future extension and we'll start with singleton
limitation for now. 

Thanks
Kevin

> 
> > > -----Original Message-----
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > > To: alex.williamson@redhat.com; eric.auger@redhat.com
> > > Cc: Tian, Kevin <kevin.tian@intel.com>; jacob.jun.pan@linux.intel.com;
> > > joro@8bytes.org; Raj, Ashok <ashok.raj@intel.com>; Liu, Yi L
> > > <yi.l.liu@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
> > > <yi.y.sun@intel.com>; jean-philippe@linaro.org; peterx@redhat.com;
> > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org; linux-
> > > kernel@vger.kernel.org; Wu, Hao <hao.wu@intel.com>
> > > Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> > >
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > > hardware
> > > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > > translation). For such hardware IOMMUs, there are two stages/levels of
> > > address translation, and software may let userspace/VM to own the first-
> > > level/stage-1 translation structures. Example of such usage is vSVA (
> > > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > > translation structures and bind the structures to host, then hardware
> > > IOMMU would utilize nesting translation when doing DMA translation fo
> > > the devices behind such hardware IOMMU.
> > >
> > > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not
> only
> > > bind
> > > guest page table is needed, it also requires to expose interface to guest
> > > for iommu cache invalidation when guest modified the first-level/stage-1
> > > translation structures since hardware needs to be notified to flush stale
> > > iotlbs. This would be introduced in next patch.
> > >
> > > In this patch, guest page table bind and unbind are done by using flags
> > > VFIO_IOMMU_BIND_GUEST_PGTBL and
> > > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > > VM should have got a PASID allocated by host via
> > > VFIO_IOMMU_PASID_REQUEST.
> > >
> > > Bind guest translation structures (here is guest page table) to host
> > > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 158
> > > ++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> > >  2 files changed, 204 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c
> > > index 82a9e0b..a877747 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> > >  					(!list_empty(&iommu->domain_list))
> > >
> > > +struct domain_capsule {
> > > +	struct iommu_domain *domain;
> > > +	void *data;
> > > +};
> > > +
> > > +/* iommu->lock must be held */
> > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > +		      int (*fn)(struct device *dev, void *data),
> > > +		      void *data)
> > > +{
> > > +	struct domain_capsule dc = {.data = data};
> > > +	struct vfio_domain *d;
> > > +	struct vfio_group *g;
> > > +	int ret = 0;
> > > +
> > > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > > +		dc.domain = d->domain;
> > > +		list_for_each_entry(g, &d->group_list, next) {
> > > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > > +						       &dc, fn);
> > > +			if (ret)
> > > +				break;
> > > +		}
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > >  static int put_pfn(unsigned long pfn, int prot);
> > >
> > >  /*
> > > @@ -2314,6 +2341,88 @@ static int
> > > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > >  	return 0;
> > >  }
> > >
> > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > > +}
> > > +
> > > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > > +					gbind_data->hpasid);
> > > +}
> > > +
> > > +/**
> > > + * Unbind specific gpasid, caller of this function requires hold
> > > + * vfio_iommu->lock
> > > + */
> > > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> > > *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +	return vfio_iommu_for_each_dev(iommu,
> > > +				vfio_unbind_gpasid_fn, gbind_data);
> > > +}
> > > +
> > > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_for_each_dev(iommu,
> > > +			vfio_bind_gpasid_fn, gbind_data);
> > > +	/*
> > > +	 * If bind failed, it may not be a total failure. Some devices
> > > +	 * within the iommu group may have bind successfully. Although
> > > +	 * we don't enable pasid capability for non-singletion iommu
> > > +	 * groups, a unbind operation would be helpful to ensure no
> > > +	 * partial binding for an iommu group.
> > > +	 */
> > > +	if (ret)
> > > +		/*
> > > +		 * Undo all binds that already succeeded, no need to
> > > +		 * check the return value here since some device within
> > > +		 * the group has no successful bind when coming to this
> > > +		 * place switch.
> > > +		 */
> > > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu
> *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > >  		default:
> > >  			return -EINVAL;
> > >  		}
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > +		struct vfio_iommu_type1_bind bind;
> > > +		u32 version;
> > > +		int data_size;
> > > +		void *gbind_data;
> > > +		int ret;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > > +
> > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (bind.argsz < minsz)
> > > +			return -EINVAL;
> > > +
> > > +		/* Get the version of struct iommu_gpasid_bind_data */
> > > +		if (copy_from_user(&version,
> > > +			(void __user *) (arg + minsz),
> > > +					sizeof(version)))
> > > +			return -EFAULT;
> > > +
> > > +		data_size = iommu_uapi_get_data_size(
> > > +				IOMMU_UAPI_BIND_GPASID, version);
> > > +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> > > +		if (!gbind_data)
> > > +			return -ENOMEM;
> > > +
> > > +		if (copy_from_user(gbind_data,
> > > +			 (void __user *) (arg + minsz), data_size)) {
> > > +			kfree(gbind_data);
> > > +			return -EFAULT;
> > > +		}
> > > +
> > > +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > > +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > > +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> > > +							   gbind_data);
> > > +			break;
> > > +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > > +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > > +							     gbind_data);
> > > +			break;
> > > +		default:
> > > +			ret = -EINVAL;
> > > +			break;
> > > +		}
> > > +		kfree(gbind_data);
> > > +		return ret;
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ebeaf3e..2235bc6 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -14,6 +14,7 @@
> > >
> > >  #include <linux/types.h>
> > >  #include <linux/ioctl.h>
> > > +#include <linux/iommu.h>
> > >
> > >  #define VFIO_API_VERSION	0
> > >
> > > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> > >   */
> > >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> > > 22)
> > >
> > > +/**
> > > + * Supported flags:
> > > + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host
> > > for
> > > + *			nesting type IOMMUs. In @data field It takes struct
> > > + *			iommu_gpasid_bind_data.
> > > + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page
> > > table operation
> > > + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> > > + *
> > > + */
> > > +struct vfio_iommu_type1_bind {
> > > +	__u32		argsz;
> > > +	__u32		flags;
> > > +#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
> > > +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
> > > +	__u8		data[];
> > > +};
> > > +
> > > +#define VFIO_IOMMU_BIND_MASK
> 	(VFIO_IOMMU_BIND_GUEST_PGTBL
> > > | \
> > > +
> > > 	VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> > > +
> > > +/**
> > > + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> > > + *				struct vfio_iommu_type1_bind)
> > > + *
> > > + * Manage address spaces of devices in this container. Initially a TYPE1
> > > + * container can only have one address space, managed with
> > > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> > > + *
> > > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed
> by
> > > both MAP/UNMAP
> > > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2
> (host)
> > > page
> > > + * tables, and BIND manages the stage-1 (guest) page tables. Other types
> of
> > > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where
> > > MAP/UNMAP controls
> > > + * the traffics only require single stage translation while BIND controls
> the
> > > + * traffics require nesting translation. But this depends on the underlying
> > > + * IOMMU architecture and isn't guaranteed. Example of this is the guest
> > > SVA
> > > + * traffics, such traffics need nesting translation to gain gVA->gPA and
> then
> > > + * gPA->hPA translation.
> > > + *
> > > + * Availability of this feature depends on the device, its bus, the
> underlying
> > > + * IOMMU and the CPU architecture.
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE +
> 23)
> > > +
> > >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU --------
> */
> > >
> > >  /*
> > > --
> > > 2.7.4
> >

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-05-15 15:19 ` Raj, Ashok
@ 2020-05-15 23:36   ` Tian, Kevin
  0 siblings, 0 replies; 5+ messages in thread
From: Tian, Kevin @ 2020-05-15 23:36 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: jean-philippe, kvm, iommu, Tian,  Jun J, Sun, Yi Y, linux-kernel,
	alex.williamson, Wu, Hao

> From: Raj, Ashok <ashok.raj@intel.com>
> Sent: Friday, May 15, 2020 11:20 PM
> 
> On Fri, May 15, 2020 at 12:39:14AM -0700, Tian, Kevin wrote:
> > Hi, Alex,
> >
> > When working on an updated version Yi and I found an design open
> > which needs your guidance.
> >
> > In concept nested translation can be incarnated as one GPA->HPA page
> > table and multiple GVA->GPA page tables per VM. It means one container
> > is sufficient to include all SVA-capable devices assigned to the same guest,
> > as there is just one iova space (GPA->HPA) to be managed by vfio iommu
> > map/unmap api. GVA->GPA page tables are bound through the new api
> > introduced in this patch. It is different from legacy shadow translation
> > which merges GIOVA->GPA->HPA into GIOVA->HPA thus each device
> requires
> > its own iova space and must be in a separate container.
> >
> > However supporting multiple SVA-capable devices in one container
> > imposes one challenge. Now the bind_guest_pgtbl is implemented as
> > iommu type1 api. From the definition of iommu type 1 any operation
> > should be applied to all devices within the same container, just like
> > dma map/unmap. However this philosophy is incorrect regarding to
> > page table binding. We must follow the guest binding requirements
> > between its page tables and assigned devices, otherwise every bound
> > address space is suddenly accessible by all assigned devices thus creating
> > security holes.
> 
> The above 2 paragraphs are bit confusing :-) but what really matters
> is the below:
> >
> > Do you think whether it's possible to extend iommu api to accept
> > device specific cmd? If not, moving it to vfio-pci or vfio-mdev sounds
> > also problematic, as PASID and page tables are IOMMU things which
> > are not touched by vfio device drivers today.
> 
> All you are referring to is when admin groups multiple devices in a
> single container, you are saying you can't give isolation to them
> for SVA purposes. This is logically equivalent to a single group with
> multiple devices.
> 
> 	- Such as devices behind p2p bridge
> 	- MFD's or devices behind switches or hieararchies without ACS
> 	  support for instance.
> 
> By limitation you mean, disable PASID on those devices in a single
> container?

the limitation means disabling nesting capability in such container
and yes it implies not exposing PASID capability to userspace too.

> 
> what about ATS?

ATS is possibly fine. VFIO exposes it to userspace even today.

Thanks
Kevin

> 
> >
> > Alternatively can we impose the limitation that nesting APIs can be
> > only enabled for singleton containers which contains only one device?
> > This basically falls back to the state of legacy shadow translation
> > vIOMMU. and our current Qemu vIOMMU implementation actually
> > does this way regardless of whether nesting is used. Do you think
> > whether such tradeoff is acceptable as a starting point?
> >
> > Thanks
> > Kevin
> >
> > > -----Original Message-----
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > > To: alex.williamson@redhat.com; eric.auger@redhat.com
> > > Cc: Tian, Kevin <kevin.tian@intel.com>; jacob.jun.pan@linux.intel.com;
> > > joro@8bytes.org; Raj, Ashok <ashok.raj@intel.com>; Liu, Yi L
> > > <yi.l.liu@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
> > > <yi.y.sun@intel.com>; jean-philippe@linaro.org; peterx@redhat.com;
> > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org; linux-
> > > kernel@vger.kernel.org; Wu, Hao <hao.wu@intel.com>
> > > Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> > >
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > > hardware
> > > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > > translation). For such hardware IOMMUs, there are two stages/levels of
> > > address translation, and software may let userspace/VM to own the first-
> > > level/stage-1 translation structures. Example of such usage is vSVA (
> > > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > > translation structures and bind the structures to host, then hardware
> > > IOMMU would utilize nesting translation when doing DMA translation fo
> > > the devices behind such hardware IOMMU.
> > >
> > > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not
> only
> > > bind
> > > guest page table is needed, it also requires to expose interface to guest
> > > for iommu cache invalidation when guest modified the first-level/stage-1
> > > translation structures since hardware needs to be notified to flush stale
> > > iotlbs. This would be introduced in next patch.
> > >
> > > In this patch, guest page table bind and unbind are done by using flags
> > > VFIO_IOMMU_BIND_GUEST_PGTBL and
> > > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > > VM should have got a PASID allocated by host via
> > > VFIO_IOMMU_PASID_REQUEST.
> > >
> > > Bind guest translation structures (here is guest page table) to host
> > > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 158
> > > ++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> > >  2 files changed, 204 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c
> > > index 82a9e0b..a877747 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
> > >  (!list_empty(&iommu->domain_list))
> > >
> > > +struct domain_capsule {
> > > +struct iommu_domain *domain;
> > > +void *data;
> > > +};
> > > +
> > > +/* iommu->lock must be held */
> > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > +      int (*fn)(struct device *dev, void *data),
> > > +      void *data)
> > > +{
> > > +struct domain_capsule dc = {.data = data};
> > > +struct vfio_domain *d;
> > > +struct vfio_group *g;
> > > +int ret = 0;
> > > +
> > > +list_for_each_entry(d, &iommu->domain_list, next) {
> > > +dc.domain = d->domain;
> > > +list_for_each_entry(g, &d->group_list, next) {
> > > +ret = iommu_group_for_each_dev(g->iommu_group,
> > > +       &dc, fn);
> > > +if (ret)
> > > +break;
> > > +}
> > > +}
> > > +return ret;
> > > +}
> > > +
> > >  static int put_pfn(unsigned long pfn, int prot);
> > >
> > >  /*
> > > @@ -2314,6 +2341,88 @@ static int
> > > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > >  return 0;
> > >  }
> > >
> > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +struct iommu_gpasid_bind_data *gbind_data =
> > > +(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > > +}
> > > +
> > > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +struct iommu_gpasid_bind_data *gbind_data =
> > > +(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +return iommu_sva_unbind_gpasid(dc->domain, dev,
> > > +gbind_data->hpasid);
> > > +}
> > > +
> > > +/**
> > > + * Unbind specific gpasid, caller of this function requires hold
> > > + * vfio_iommu->lock
> > > + */
> > > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> > > *iommu,
> > > +struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +return vfio_iommu_for_each_dev(iommu,
> > > +vfio_unbind_gpasid_fn, gbind_data);
> > > +}
> > > +
> > > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > > +struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +int ret = 0;
> > > +
> > > +mutex_lock(&iommu->lock);
> > > +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +ret = -EINVAL;
> > > +goto out_unlock;
> > > +}
> > > +
> > > +ret = vfio_iommu_for_each_dev(iommu,
> > > +vfio_bind_gpasid_fn, gbind_data);
> > > +/*
> > > + * If bind failed, it may not be a total failure. Some devices
> > > + * within the iommu group may have bind successfully. Although
> > > + * we don't enable pasid capability for non-singletion iommu
> > > + * groups, a unbind operation would be helpful to ensure no
> > > + * partial binding for an iommu group.
> > > + */
> > > +if (ret)
> > > +/*
> > > + * Undo all binds that already succeeded, no need to
> > > + * check the return value here since some device within
> > > + * the group has no successful bind when coming to this
> > > + * place switch.
> > > + */
> > > +vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > +
> > > +out_unlock:
> > > +mutex_unlock(&iommu->lock);
> > > +return ret;
> > > +}
> > > +
> > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu
> *iommu,
> > > +struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +int ret = 0;
> > > +
> > > +mutex_lock(&iommu->lock);
> > > +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +ret = -EINVAL;
> > > +goto out_unlock;
> > > +}
> > > +
> > > +ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > +
> > > +out_unlock:
> > > +mutex_unlock(&iommu->lock);
> > > +return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >     unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > >  default:
> > >  return -EINVAL;
> > >  }
> > > +
> > > +} else if (cmd == VFIO_IOMMU_BIND) {
> > > +struct vfio_iommu_type1_bind bind;
> > > +u32 version;
> > > +int data_size;
> > > +void *gbind_data;
> > > +int ret;
> > > +
> > > +minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > > +
> > > +if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > +return -EFAULT;
> > > +
> > > +if (bind.argsz < minsz)
> > > +return -EINVAL;
> > > +
> > > +/* Get the version of struct iommu_gpasid_bind_data */
> > > +if (copy_from_user(&version,
> > > +(void __user *) (arg + minsz),
> > > +sizeof(version)))
> > > +return -EFAULT;
> > > +
> > > +data_size = iommu_uapi_get_data_size(
> > > +IOMMU_UAPI_BIND_GPASID, version);
> > > +gbind_data = kzalloc(data_size, GFP_KERNEL);
> > > +if (!gbind_data)
> > > +return -ENOMEM;
> > > +
> > > +if (copy_from_user(gbind_data,
> > > + (void __user *) (arg + minsz), data_size)) {
> > > +kfree(gbind_data);
> > > +return -EFAULT;
> > > +}
> > > +
> > > +switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > > +case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > > +ret = vfio_iommu_type1_bind_gpasid(iommu,
> > > +   gbind_data);
> > > +break;
> > > +case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > > +ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > > +     gbind_data);
> > > +break;
> > > +default:
> > > +ret = -EINVAL;
> > > +break;
> > > +}
> > > +kfree(gbind_data);
> > > +return ret;
> > >  }
> > >
> > >  return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ebeaf3e..2235bc6 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -14,6 +14,7 @@
> > >
> > >  #include <linux/types.h>
> > >  #include <linux/ioctl.h>
> > > +#include <linux/iommu.h>
> > >
> > >  #define VFIO_API_VERSION0
> > >
> > > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> > >   */
> > >  #define VFIO_IOMMU_PASID_REQUEST_IO(VFIO_TYPE, VFIO_BASE +
> > > 22)
> > >
> > > +/**
> > > + * Supported flags:
> > > + *- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host
> > > for
> > > + *nesting type IOMMUs. In @data field It takes struct
> > > + *iommu_gpasid_bind_data.
> > > + *- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page
> > > table operation
> > > + *invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> > > + *
> > > + */
> > > +struct vfio_iommu_type1_bind {
> > > +__u32argsz;
> > > +__u32flags;
> > > +#define VFIO_IOMMU_BIND_GUEST_PGTBL(1 << 0)
> > > +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL(1 << 1)
> > > +__u8data[];
> > > +};
> > > +
> > > +#define VFIO_IOMMU_BIND_MASK(VFIO_IOMMU_BIND_GUEST_PGTBL
> > > | \
> > > +
> > > VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> > > +
> > > +/**
> > > + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> > > + *struct vfio_iommu_type1_bind)
> > > + *
> > > + * Manage address spaces of devices in this container. Initially a TYPE1
> > > + * container can only have one address space, managed with
> > > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> > > + *
> > > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed
> by
> > > both MAP/UNMAP
> > > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2
> (host)
> > > page
> > > + * tables, and BIND manages the stage-1 (guest) page tables. Other types
> of
> > > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where
> > > MAP/UNMAP controls
> > > + * the traffics only require single stage translation while BIND controls
> the
> > > + * traffics require nesting translation. But this depends on the underlying
> > > + * IOMMU architecture and isn't guaranteed. Example of this is the guest
> > > SVA
> > > + * traffics, such traffics need nesting translation to gain gVA->gPA and
> then
> > > + * gPA->hPA translation.
> > > + *
> > > + * Availability of this feature depends on the device, its bus, the
> underlying
> > > + * IOMMU and the CPU architecture.
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +#define VFIO_IOMMU_BIND_IO(VFIO_TYPE, VFIO_BASE + 23)
> > > +
> > >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU --------
> */
> > >
> > >  /*
> > > --
> > > 2.7.4
> >
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, back to index

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-15  7:39 (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Tian, Kevin
2020-05-15 15:19 ` Raj, Ashok
2020-05-15 23:36   ` Tian, Kevin
2020-05-15 17:59 ` Alex Williamson
2020-05-15 23:26   ` Tian, Kevin

IOMMU Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-iommu/0 linux-iommu/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-iommu linux-iommu/ https://lore.kernel.org/linux-iommu \
		iommu@lists.linux-foundation.org
	public-inbox-index linux-iommu

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.linux-foundation.lists.iommu


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git