linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
@ 2016-05-12 10:20 Yongji Xie
  2016-05-19 22:33 ` Alex Williamson
  0 siblings, 1 reply; 5+ messages in thread
From: Yongji Xie @ 2016-05-12 10:20 UTC
  To: kvm, linux-kernel, linux-pci
  Cc: alex.williamson, bhelgaas, aik, benh, paulus, mpe, warrier,
	zhong, nikunj, gwshan, kevin.tian, Yongji Xie

The current vfio-pci implementation does not allow mmap of
sub-page (size < PAGE_SIZE) MMIO BARs, because the MMIO page
backing such a BAR may be shared with other BARs. This causes
performance problems when we pass through a PCI device with
BARs of this kind: the guest cannot access the BARs directly,
so every MMIO access has to be emulated in the host.

However, not every sub-page BAR shares its page with other BARs.
We should allow mmap of those sub-page MMIO BARs that we can
guarantee will not share a page with any other BAR.

This patch adds support for that case. It also adds a dummy
resource to reserve the remainder of the page, which a
hot-added device's BAR might otherwise be assigned into. There
is no need to handle BARs that are not page aligned: we cannot
expect such a BAR to be assigned to the same offset within a
page in the guest when we pass it through, and it is hard to
access such a BAR from userspace because there is no way to
learn the BAR's offset within its page.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c         |   70 +++++++++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_private.h |    8 ++++
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 188b1ff..253c22f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -110,6 +110,50 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+static bool vfio_pci_bar_mmap_supported(struct vfio_pci_device *vdev, int index)
+{
+	struct resource *res = vdev->pdev->resource + index;
+	struct vfio_pci_dummy_resource *dummy_res = NULL;
+
+	if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+		return false;
+
+	if (!(res->flags & IORESOURCE_MEM))
+		return false;
+
+	/*
+	 * The PCI core shouldn't set up a resource with a type but
+	 * zero size. But there may be bugs that cause us to do that.
+	 */
+	if (!resource_size(res))
+		return false;
+
+	if (resource_size(res) >= PAGE_SIZE)
+		return true;
+
+	if (!(res->start & ~PAGE_MASK)) {
+		/*
+		 * Add a dummy resource to reserve the remainder
+		 * of the exclusive page in case that hot-add
+		 * device's bar is assigned into it.
+		 */
+		dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+		if (dummy_res == NULL)
+			return false;
+		dummy_res->resource.start = res->end + 1;
+		dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+		dummy_res->resource.flags = res->flags;
+		if (request_resource(res->parent, &dummy_res->resource)) {
+			kfree(dummy_res);
+			return false;
+		}
+		dummy_res->index = index;
+		list_add(&dummy_res->res_next, &vdev->dummy_resources_list);
+		return true;
+	}
+	return false;
+}
+
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
 
@@ -145,10 +189,12 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 static int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
-	int ret;
+	int ret, bar;
 	u16 cmd;
 	u8 msix_pos;
 
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
 	pci_set_power_state(pdev, PCI_D0);
 
 	/* Don't allow our initial saved state to include busmaster */
@@ -218,12 +264,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		vdev->bar_mmap_supported[bar] =
+				vfio_pci_bar_mmap_supported(vdev, bar);
+	}
 	return 0;
 }
 
 static void vfio_pci_disable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_dummy_resource *dummy_res, *tmp;
 	int i, bar;
 
 	/* Stop the device from further DMA */
@@ -252,6 +303,13 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 		vdev->barmap[bar] = NULL;
 	}
 
+	list_for_each_entry_safe(dummy_res, tmp,
+				 &vdev->dummy_resources_list, res_next) {
+		list_del(&dummy_res->res_next);
+		release_resource(&dummy_res->resource);
+		kfree(dummy_res);
+	}
+
 	vdev->needs_reset = true;
 
 	/*
@@ -623,9 +681,7 @@ static long vfio_pci_ioctl(void *device_data,
 
 			info.flags = VFIO_REGION_INFO_FLAG_READ |
 				     VFIO_REGION_INFO_FLAG_WRITE;
-			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
-			    pci_resource_flags(pdev, info.index) &
-			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
+			if (vdev->bar_mmap_supported[info.index]) {
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
 				if (info.index == vdev->msix_bar) {
 					ret = msix_sparse_mmap_cap(vdev, &caps);
@@ -1049,16 +1105,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		return -EINVAL;
 	if (index >= VFIO_PCI_ROM_REGION_INDEX)
 		return -EINVAL;
-	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
+	if (!vdev->bar_mmap_supported[index])
 		return -EINVAL;
 
-	phys_len = pci_resource_len(pdev, index);
+	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
 	req_len = vma->vm_end - vma->vm_start;
 	pgoff = vma->vm_pgoff &
 		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
 	req_start = pgoff << PAGE_SHIFT;
 
-	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
+	if (req_start + req_len > phys_len)
 		return -EINVAL;
 
 	if (index == vdev->msix_bar) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 016c14a..2128de8 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -57,9 +57,16 @@ struct vfio_pci_region {
 	u32				flags;
 };
 
+struct vfio_pci_dummy_resource {
+	struct resource		resource;
+	int			index;
+	struct list_head	res_next;
+};
+
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_RESOURCE_END + 1];
+	bool			bar_mmap_supported[PCI_STD_RESOURCE_END + 1];
 	u8			*pci_config_map;
 	u8			*vconfig;
 	struct perm_bits	*msi_perm;
@@ -88,6 +95,7 @@ struct vfio_pci_device {
 	int			refcnt;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	struct list_head	dummy_resources_list;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
-- 
1.7.9.5


* Re: [PATCH v3] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  2016-05-12 10:20 [PATCH v3] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Yongji Xie
@ 2016-05-19 22:33 ` Alex Williamson
  2016-05-23  3:45   ` Yongji Xie
       [not found]   ` <201605230345.u4N3dJip043323@mx0a-001b2d01.pphosted.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Alex Williamson @ 2016-05-19 22:33 UTC
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, bhelgaas, aik, benh, paulus, mpe,
	warrier, zhong, nikunj, gwshan, kevin.tian

On Thu, 12 May 2016 18:20:51 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> Current vfio-pci implementation disallows to mmap
> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
> page may be shared with other BARs. This will cause some
> performance issues when we passthrough a PCI device with
> this kind of BARs. Guest will be not able to handle the mmio
> accesses to the BARs which leads to mmio emulations in host.
> 
> However, not all sub-page BARs will share page with other BARs.
> We should allow to mmap the sub-page MMIO BARs which we can
> make sure will not share page with other BARs.
> 
> This patch adds support for this case. And we try to add a
> dummy resource to reserve the remainder of the page which
> hot-add device's BAR might be assigned into. But it's not
> necessary to handle the case when the BAR is not page aligned.
> Because we can't expect the BAR will be assigned into the same
> location in a page in guest when we passthrough the BAR. And
> it's hard to access this BAR in userspace because we have
> no way to get the BAR's location in a page.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         |   70 +++++++++++++++++++++++++++++++----
>  drivers/vfio/pci/vfio_pci_private.h |    8 ++++
>  2 files changed, 71 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 188b1ff..253c22f 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -110,6 +110,50 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> +static bool vfio_pci_bar_mmap_supported(struct vfio_pci_device *vdev, int index)
> +{
> +	struct resource *res = vdev->pdev->resource + index;
> +	struct vfio_pci_dummy_resource *dummy_res = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> +		return false;
> +
> +	if (!(res->flags & IORESOURCE_MEM))
> +		return false;
> +
> +	/*
> +	 * The PCI core shouldn't set up a resource with a type but
> +	 * zero size. But there may be bugs that cause us to do that.
> +	 */
> +	if (!resource_size(res))
> +		return false;
> +
> +	if (resource_size(res) >= PAGE_SIZE)
> +		return true;
> +
> +	if (!(res->start & ~PAGE_MASK)) {
> +		/*
> +		 * Add a dummy resource to reserve the remainder
> +		 * of the exclusive page in case that hot-add
> +		 * device's bar is assigned into it.
> +		 */
> +		dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> +		if (dummy_res == NULL)
> +			return false;
> +		dummy_res->resource.start = res->end + 1;
> +		dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> +		dummy_res->resource.flags = res->flags;
> +		if (request_resource(res->parent, &dummy_res->resource)) {
> +			kfree(dummy_res);
> +			return false;
> +		}
> +		dummy_res->index = index;
> +		list_add(&dummy_res->res_next, &vdev->dummy_resources_list);
> +		return true;
> +	}
> +	return false;
> +}

The name of this function is vfio_pci_bar_mmap_supported(), which
suggests we should be able to call it at any point to test if mmap is
supported, but that's not what it does.  It's actually a one time setup
function, that also happens to return what it found or managed to
reserve.  If we were to call this a second time, we might get a
different result.  So I think this either needs to change to something
like:

static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)

where it loops through all the BARs and results in a valid
bar_mmap_supported array, or the function should be made smart enough
to identify if the necessary resource has already been allocated such
that it can be called multiple times per BAR, at which point we could
remove the bar_mmap_supported array.

A comment describing why we can only support sub-page mmaps for
resources aligned at the start of a page would also be helpful for
future maintenance here.
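
As a rough illustration of the first option, the one-time probe could be
modelled in plain user-space C as below. This is only a sketch under
stated assumptions, not the kernel code: struct bar_desc, struct
bar_probe, probe_one(), probe_mmaps() and the SKETCH_* constants are
hypothetical names, the page size is assumed to be 4 KiB, and the
request_resource() call from the patch is reduced to computing the range
that the dummy resource would cover. In the patch itself a failed
reservation also makes the BAR non-mmappable; the sketch leaves that out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size */
#define SKETCH_NUM_BARS  6		/* standard PCI BARs 0..5 */

/* Hypothetical stand-in for the fields of struct resource used here. */
struct bar_desc {
	uint64_t start;
	uint64_t size;		/* 0 means the BAR is not implemented */
	bool is_mem;		/* models IORESOURCE_MEM */
};

/* Hypothetical per-BAR probe result, filled in exactly once. */
struct bar_probe {
	bool mmap_supported;
	bool needs_dummy;	/* remainder of the page must be reserved */
	uint64_t dummy_start;	/* first byte after the BAR */
	uint64_t dummy_end;	/* last byte of the BAR's page (inclusive) */
};

static void probe_one(const struct bar_desc *bar, struct bar_probe *out)
{
	assert(bar && out);

	out->mmap_supported = false;
	out->needs_dummy = false;
	out->dummy_start = 0;
	out->dummy_end = 0;

	/* Only implemented memory BARs can be mapped at all. */
	if (!bar->is_mem || bar->size == 0)
		return;

	/* A BAR of at least one page needs no extra reservation. */
	if (bar->size >= SKETCH_PAGE_SIZE) {
		out->mmap_supported = true;
		return;
	}

	/* Sub-page BAR: only handled when it starts on a page boundary. */
	if (bar->start & (SKETCH_PAGE_SIZE - 1))
		return;

	out->mmap_supported = true;
	out->needs_dummy = true;
	out->dummy_start = bar->start + bar->size;
	out->dummy_end = bar->start + SKETCH_PAGE_SIZE - 1;
}

/* Walk all BARs once and leave a valid result array behind. */
static void probe_mmaps(const struct bar_desc *bars, struct bar_probe *out)
{
	for (int i = 0; i < SKETCH_NUM_BARS; i++)
		probe_one(&bars[i], &out[i]);
}
```

With this shape the ioctl and mmap paths only read the cached result, so
they see the same answer no matter how often they ask.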

> +
>  static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
>  static void vfio_pci_disable(struct vfio_pci_device *vdev);
>  
> @@ -145,10 +189,12 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
>  static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> -	int ret;
> +	int ret, bar;
>  	u16 cmd;
>  	u8 msix_pos;
>  
> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> +
>  	pci_set_power_state(pdev, PCI_D0);
>  
>  	/* Don't allow our initial saved state to include busmaster */
> @@ -218,12 +264,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> +		vdev->bar_mmap_supported[bar] =
> +				vfio_pci_bar_mmap_supported(vdev, bar);
> +	}
>  	return 0;
>  }
>  
>  static void vfio_pci_disable(struct vfio_pci_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>  	int i, bar;
>  
>  	/* Stop the device from further DMA */
> @@ -252,6 +303,13 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>  		vdev->barmap[bar] = NULL;
>  	}
>  
> +	list_for_each_entry_safe(dummy_res, tmp,
> +				 &vdev->dummy_resources_list, res_next) {
> +		list_del(&dummy_res->res_next);
> +		release_resource(&dummy_res->resource);
> +		kfree(dummy_res);
> +	}
> +
>  	vdev->needs_reset = true;
>  
>  	/*
> @@ -623,9 +681,7 @@ static long vfio_pci_ioctl(void *device_data,
>  
>  			info.flags = VFIO_REGION_INFO_FLAG_READ |
>  				     VFIO_REGION_INFO_FLAG_WRITE;
> -			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
> -			    pci_resource_flags(pdev, info.index) &
> -			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
> +			if (vdev->bar_mmap_supported[info.index]) {
>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>  				if (info.index == vdev->msix_bar) {
>  					ret = msix_sparse_mmap_cap(vdev, &caps);
> @@ -1049,16 +1105,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
> +	if (!vdev->bar_mmap_supported[index])
>  		return -EINVAL;
>  
> -	phys_len = pci_resource_len(pdev, index);
> +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;
>  	pgoff = vma->vm_pgoff &
>  		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>  	req_start = pgoff << PAGE_SHIFT;
>  
> -	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
> +	if (req_start + req_len > phys_len)
>  		return -EINVAL;

I'm only able to find your QEMU patch from last year, have you posted a
version making use of this latest proposal?  Is the expectation still
that QEMU modify the BAR size as seen from the guest to the host page
size for sub-page regions exposing the MMAP flag?  Does that result in
any known incompatibilities with drivers in the guest?  Thanks,

Alex
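
For readers following the vfio_pci_mmap() hunk quoted above, the effect
of rounding phys_len up with PAGE_ALIGN() can be sketched in user-space
C. The helper names (sketch_page_align(), mmap_request_fits(),
mmap_request_fits_old()) and the 4 KiB page size are assumptions made
for illustration; only the two comparisons are taken from the diff, and
the separate bar_mmap_supported[] check is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size */

/* Round len up to the next multiple of the page size. */
static uint64_t sketch_page_align(uint64_t len)
{
	return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Check before the patch: sub-page BARs are always rejected. */
static bool mmap_request_fits_old(uint64_t bar_len, uint64_t req_start,
				  uint64_t req_len)
{
	if (bar_len < SKETCH_PAGE_SIZE)
		return false;
	return req_start + req_len <= bar_len;
}

/* Check after the patch: the BAR length is rounded up to a full page. */
static bool mmap_request_fits(uint64_t bar_len, uint64_t req_start,
			      uint64_t req_len)
{
	uint64_t phys_len = sketch_page_align(bar_len);

	/* mmap offsets are page granular, so req_start is page aligned. */
	assert((req_start & (SKETCH_PAGE_SIZE - 1)) == 0);
	return req_start + req_len <= phys_len;
}
```

A one-page mapping of a 256-byte BAR at offset 0 is refused by the old
check and accepted by the new one; a request that starts beyond the
first page of that BAR is refused by both.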

>  
>  	if (index == vdev->msix_bar) {
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 016c14a..2128de8 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -57,9 +57,16 @@ struct vfio_pci_region {
>  	u32				flags;
>  };
>  
> +struct vfio_pci_dummy_resource {
> +	struct resource		resource;
> +	int			index;
> +	struct list_head	res_next;
> +};
> +
>  struct vfio_pci_device {
>  	struct pci_dev		*pdev;
>  	void __iomem		*barmap[PCI_STD_RESOURCE_END + 1];
> +	bool			bar_mmap_supported[PCI_STD_RESOURCE_END + 1];
>  	u8			*pci_config_map;
>  	u8			*vconfig;
>  	struct perm_bits	*msi_perm;
> @@ -88,6 +95,7 @@ struct vfio_pci_device {
>  	int			refcnt;
>  	struct eventfd_ctx	*err_trigger;
>  	struct eventfd_ctx	*req_trigger;
> +	struct list_head	dummy_resources_list;
>  };
>  
>  #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)


* Re: [PATCH v3] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  2016-05-19 22:33 ` Alex Williamson
@ 2016-05-23  3:45   ` Yongji Xie
       [not found]   ` <201605230345.u4N3dJip043323@mx0a-001b2d01.pphosted.com>
  1 sibling, 0 replies; 5+ messages in thread
From: Yongji Xie @ 2016-05-23  3:45 UTC
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, bhelgaas, aik, benh, paulus, mpe,
	warrier, zhong, nikunj, gwshan, kevin.tian

On 2016/5/20 6:33, Alex Williamson wrote:

> On Thu, 12 May 2016 18:20:51 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> Current vfio-pci implementation disallows to mmap
>> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
>> page may be shared with other BARs. This will cause some
>> performance issues when we passthrough a PCI device with
>> this kind of BARs. Guest will be not able to handle the mmio
>> accesses to the BARs which leads to mmio emulations in host.
>>
>> However, not all sub-page BARs will share page with other BARs.
>> We should allow to mmap the sub-page MMIO BARs which we can
>> make sure will not share page with other BARs.
>>
>> This patch adds support for this case. And we try to add a
>> dummy resource to reserve the remainder of the page which
>> hot-add device's BAR might be assigned into. But it's not
>> necessary to handle the case when the BAR is not page aligned.
>> Because we can't expect the BAR will be assigned into the same
>> location in a page in guest when we passthrough the BAR. And
>> it's hard to access this BAR in userspace because we have
>> no way to get the BAR's location in a page.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci.c         |   70 +++++++++++++++++++++++++++++++----
>>   drivers/vfio/pci/vfio_pci_private.h |    8 ++++
>>   2 files changed, 71 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 188b1ff..253c22f 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -110,6 +110,50 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>>   }
>>   
>> +static bool vfio_pci_bar_mmap_supported(struct vfio_pci_device *vdev, int index)
>> +{
>> +	struct resource *res = vdev->pdev->resource + index;
>> +	struct vfio_pci_dummy_resource *dummy_res = NULL;
>> +
>> +	if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>> +		return false;
>> +
>> +	if (!(res->flags & IORESOURCE_MEM))
>> +		return false;
>> +
>> +	/*
>> +	 * The PCI core shouldn't set up a resource with a type but
>> +	 * zero size. But there may be bugs that cause us to do that.
>> +	 */
>> +	if (!resource_size(res))
>> +		return false;
>> +
>> +	if (resource_size(res) >= PAGE_SIZE)
>> +		return true;
>> +
>> +	if (!(res->start & ~PAGE_MASK)) {
>> +		/*
>> +		 * Add a dummy resource to reserve the remainder
>> +		 * of the exclusive page in case that hot-add
>> +		 * device's bar is assigned into it.
>> +		 */
>> +		dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>> +		if (dummy_res == NULL)
>> +			return false;
>> +		dummy_res->resource.start = res->end + 1;
>> +		dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>> +		dummy_res->resource.flags = res->flags;
>> +		if (request_resource(res->parent, &dummy_res->resource)) {
>> +			kfree(dummy_res);
>> +			return false;
>> +		}
>> +		dummy_res->index = index;
>> +		list_add(&dummy_res->res_next, &vdev->dummy_resources_list);
>> +		return true;
>> +	}
>> +	return false;
>> +}
> The name of this function is vfio_pci_bar_mmap_supported(), which
> suggests we should be able to call it at any point to test if mmap is
> supported, but that's not what it does.  It's actually a one time setup
> function, that also happens to return what it found or managed to
> reserve.  If we were to call this a second time, we might get a
> different result.  So I think this either needs to change to something
> like:
>
> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>
> where it loops through all the BARs and results in a valid
> bar_mmap_supported array, or the function should be made smart enough
> to identify if the necessary resource has already been allocated such
> that it can be call multiple times per BAR, at which point we could
> remove the bar_mmap_supported array.

Thanks for your comment. I will rename this function and keep the
bar_mmap_supported array to cache the result.

> A comment describing why we can only support sub-page mmaps for
> resources aligned at the start of a page would also be helpful for
> future maintenance here.

OK. I will add this.

>> +
>>   static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
>>   static void vfio_pci_disable(struct vfio_pci_device *vdev);
>>   
>> @@ -145,10 +189,12 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
>>   static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>   {
>>   	struct pci_dev *pdev = vdev->pdev;
>> -	int ret;
>> +	int ret, bar;
>>   	u16 cmd;
>>   	u8 msix_pos;
>>   
>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>> +
>>   	pci_set_power_state(pdev, PCI_D0);
>>   
>>   	/* Don't allow our initial saved state to include busmaster */
>> @@ -218,12 +264,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>   		}
>>   	}
>>   
>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>> +		vdev->bar_mmap_supported[bar] =
>> +				vfio_pci_bar_mmap_supported(vdev, bar);
>> +	}
>>   	return 0;
>>   }
>>   
>>   static void vfio_pci_disable(struct vfio_pci_device *vdev)
>>   {
>>   	struct pci_dev *pdev = vdev->pdev;
>> +	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>>   	int i, bar;
>>   
>>   	/* Stop the device from further DMA */
>> @@ -252,6 +303,13 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>>   		vdev->barmap[bar] = NULL;
>>   	}
>>   
>> +	list_for_each_entry_safe(dummy_res, tmp,
>> +				 &vdev->dummy_resources_list, res_next) {
>> +		list_del(&dummy_res->res_next);
>> +		release_resource(&dummy_res->resource);
>> +		kfree(dummy_res);
>> +	}
>> +
>>   	vdev->needs_reset = true;
>>   
>>   	/*
>> @@ -623,9 +681,7 @@ static long vfio_pci_ioctl(void *device_data,
>>   
>>   			info.flags = VFIO_REGION_INFO_FLAG_READ |
>>   				     VFIO_REGION_INFO_FLAG_WRITE;
>> -			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>> -			    pci_resource_flags(pdev, info.index) &
>> -			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
>> +			if (vdev->bar_mmap_supported[info.index]) {
>>   				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>>   				if (info.index == vdev->msix_bar) {
>>   					ret = msix_sparse_mmap_cap(vdev, &caps);
>> @@ -1049,16 +1105,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>>   		return -EINVAL;
>>   	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>>   		return -EINVAL;
>> -	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
>> +	if (!vdev->bar_mmap_supported[index])
>>   		return -EINVAL;
>>   
>> -	phys_len = pci_resource_len(pdev, index);
>> +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>>   	req_len = vma->vm_end - vma->vm_start;
>>   	pgoff = vma->vm_pgoff &
>>   		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>>   	req_start = pgoff << PAGE_SHIFT;
>>   
>> -	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>> +	if (req_start + req_len > phys_len)
>>   		return -EINVAL;
> I'm only able to find your QEMU patch from last year, have you posted a
> version making use of this latest proposal?  Is the expectation still
> that QEMU modify the BAR size as seen from the guest to the host page
> size for sub-page regions exposing the MMAP flag?  Does that result in
> any known incompatibilities with drivers in the guest?  Thanks,
>
> Alex

I will post the latest version soon. In QEMU, we would change the size of
the MemoryRegion rather than the VFIORegion, so the guest will still see the
real BAR size. The only restriction is that the BAR must sit in an exclusive
page in the guest too; otherwise we would not allow this BAR to be passed
through. On Power, we have changed SLOF to force all BARs to be page
aligned in the guest.
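
The "exclusive page" condition on the guest side could be checked along
the following lines. This is a hypothetical user-space sketch, not QEMU
code: struct guest_bar and guest_bar_page_exclusive() are made-up names,
the page size is assumed to be 4 KiB, and the rule encoded here (the BAR
starts on a page boundary and no other BAR overlaps the pages it covers)
is one reading of the restriction described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed guest page size */

/* Hypothetical description of one BAR as placed in the guest. */
struct guest_bar {
	uint64_t start;
	uint64_t size;		/* 0 means the BAR is not implemented */
};

/*
 * Return true if BAR idx starts on a page boundary and no other BAR in
 * the array overlaps the page-rounded range that BAR idx occupies.
 */
static bool guest_bar_page_exclusive(const struct guest_bar *bars,
				     size_t n, size_t idx)
{
	uint64_t lo, hi;

	assert(bars && idx < n);

	if (bars[idx].size == 0)
		return false;
	if (bars[idx].start & (SKETCH_PAGE_SIZE - 1))
		return false;

	lo = bars[idx].start;
	/* One past the last byte of the last page the BAR touches. */
	hi = (lo + bars[idx].size + SKETCH_PAGE_SIZE - 1) &
	     ~(SKETCH_PAGE_SIZE - 1);

	for (size_t i = 0; i < n; i++) {
		uint64_t s, e;

		if (i == idx || bars[i].size == 0)
			continue;
		s = bars[i].start;
		e = s + bars[i].size;	/* one past the end */
		if (s < hi && e > lo)
			return false;	/* another BAR shares a page */
	}
	return true;
}
```

A check of this kind would have to be repeated whenever the guest
reprograms a BAR, since the placement is not fixed by firmware alone.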

Thanks,
Yongji


* Re: [PATCH v3] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
       [not found]   ` <201605230345.u4N3dJip043323@mx0a-001b2d01.pphosted.com>
@ 2016-05-23 15:20     ` Alex Williamson
  2016-05-24 15:57       ` Yongji Xie
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Williamson @ 2016-05-23 15:20 UTC
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, bhelgaas, aik, benh, paulus, mpe,
	warrier, zhong, nikunj, gwshan, kevin.tian

On Mon, 23 May 2016 11:45:34 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> On 2016/5/20 6:33, Alex Williamson wrote:
> 
> > On Thu, 12 May 2016 18:20:51 +0800
> > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> >  
> >> Current vfio-pci implementation disallows to mmap
> >> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
> >> page may be shared with other BARs. This will cause some
> >> performance issues when we passthrough a PCI device with
> >> this kind of BARs. Guest will be not able to handle the mmio
> >> accesses to the BARs which leads to mmio emulations in host.
> >>
> >> However, not all sub-page BARs will share page with other BARs.
> >> We should allow to mmap the sub-page MMIO BARs which we can
> >> make sure will not share page with other BARs.
> >>
> >> This patch adds support for this case. And we try to add a
> >> dummy resource to reserve the remainder of the page which
> >> hot-add device's BAR might be assigned into. But it's not
> >> necessary to handle the case when the BAR is not page aligned.
> >> Because we can't expect the BAR will be assigned into the same
> >> location in a page in guest when we passthrough the BAR. And
> >> it's hard to access this BAR in userspace because we have
> >> no way to get the BAR's location in a page.
> >>
> >> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> >> ---
> >>   drivers/vfio/pci/vfio_pci.c         |   70 +++++++++++++++++++++++++++++++----
> >>   drivers/vfio/pci/vfio_pci_private.h |    8 ++++
> >>   2 files changed, 71 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 188b1ff..253c22f 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -110,6 +110,50 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> >>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> >>   }
> >>   
> >> +static bool vfio_pci_bar_mmap_supported(struct vfio_pci_device *vdev, int index)
> >> +{
> >> +	struct resource *res = vdev->pdev->resource + index;
> >> +	struct vfio_pci_dummy_resource *dummy_res = NULL;
> >> +
> >> +	if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >> +		return false;
> >> +
> >> +	if (!(res->flags & IORESOURCE_MEM))
> >> +		return false;
> >> +
> >> +	/*
> >> +	 * The PCI core shouldn't set up a resource with a type but
> >> +	 * zero size. But there may be bugs that cause us to do that.
> >> +	 */
> >> +	if (!resource_size(res))
> >> +		return false;
> >> +
> >> +	if (resource_size(res) >= PAGE_SIZE)
> >> +		return true;
> >> +
> >> +	if (!(res->start & ~PAGE_MASK)) {
> >> +		/*
> >> +		 * Add a dummy resource to reserve the remainder
> >> +		 * of the exclusive page in case that hot-add
> >> +		 * device's bar is assigned into it.
> >> +		 */
> >> +		dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >> +		if (dummy_res == NULL)
> >> +			return false;
> >> +		dummy_res->resource.start = res->end + 1;
> >> +		dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >> +		dummy_res->resource.flags = res->flags;
> >> +		if (request_resource(res->parent, &dummy_res->resource)) {
> >> +			kfree(dummy_res);
> >> +			return false;
> >> +		}
> >> +		dummy_res->index = index;
> >> +		list_add(&dummy_res->res_next, &vdev->dummy_resources_list);
> >> +		return true;
> >> +	}
> >> +	return false;
> >> +}  
> > The name of this function is vfio_pci_bar_mmap_supported(), which
> > suggests we should be able to call it at any point to test if mmap is
> > supported, but that's not what it does.  It's actually a one time setup
> > function, that also happens to return what it found or managed to
> > reserve.  If we were to call this a second time, we might get a
> > different result.  So I think this either needs to change to something
> > like:
> >
> > static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >
> > where it loops through all the BARs and results in a valid
> > bar_mmap_supported array, or the function should be made smart enough
> > to identify if the necessary resource has already been allocated such
> > that it can be call multiple times per BAR, at which point we could
> > remove the bar_mmap_supported array.  
> 
> Thanks for your comment. I would change the name of this function
> and reserve bar_mmap_supported array to cache the result.
> 
> > A comment describing why we can only support sub-page mmaps for
> > resources aligned at the start of a page would also be helpful for
> > future maintenance here.  
> 
> OK. I will add this.
> 
> >> +
> >>   static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
> >>   static void vfio_pci_disable(struct vfio_pci_device *vdev);
> >>   
> >> @@ -145,10 +189,12 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
> >>   static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>   {
> >>   	struct pci_dev *pdev = vdev->pdev;
> >> -	int ret;
> >> +	int ret, bar;
> >>   	u16 cmd;
> >>   	u8 msix_pos;
> >>   
> >> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >> +
> >>   	pci_set_power_state(pdev, PCI_D0);
> >>   
> >>   	/* Don't allow our initial saved state to include busmaster */
> >> @@ -218,12 +264,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>   		}
> >>   	}
> >>   
> >> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >> +		vdev->bar_mmap_supported[bar] =
> >> +				vfio_pci_bar_mmap_supported(vdev, bar);
> >> +	}
> >>   	return 0;
> >>   }
> >>   
> >>   static void vfio_pci_disable(struct vfio_pci_device *vdev)
> >>   {
> >>   	struct pci_dev *pdev = vdev->pdev;
> >> +	struct vfio_pci_dummy_resource *dummy_res, *tmp;
> >>   	int i, bar;
> >>   
> >>   	/* Stop the device from further DMA */
> >> @@ -252,6 +303,13 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
> >>   		vdev->barmap[bar] = NULL;
> >>   	}
> >>   
> >> +	list_for_each_entry_safe(dummy_res, tmp,
> >> +				 &vdev->dummy_resources_list, res_next) {
> >> +		list_del(&dummy_res->res_next);
> >> +		release_resource(&dummy_res->resource);
> >> +		kfree(dummy_res);
> >> +	}
> >> +
> >>   	vdev->needs_reset = true;
> >>   
> >>   	/*
> >> @@ -623,9 +681,7 @@ static long vfio_pci_ioctl(void *device_data,
> >>   
> >>   			info.flags = VFIO_REGION_INFO_FLAG_READ |
> >>   				     VFIO_REGION_INFO_FLAG_WRITE;
> >> -			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
> >> -			    pci_resource_flags(pdev, info.index) &
> >> -			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
> >> +			if (vdev->bar_mmap_supported[info.index]) {
> >>   				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
> >>   				if (info.index == vdev->msix_bar) {
> >>   					ret = msix_sparse_mmap_cap(vdev, &caps);
> >> @@ -1049,16 +1105,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >>   		return -EINVAL;
> >>   	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> >>   		return -EINVAL;
> >> -	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
> >> +	if (!vdev->bar_mmap_supported[index])
> >>   		return -EINVAL;
> >>   
> >> -	phys_len = pci_resource_len(pdev, index);
> >> +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
> >>   	req_len = vma->vm_end - vma->vm_start;
> >>   	pgoff = vma->vm_pgoff &
> >>   		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> >>   	req_start = pgoff << PAGE_SHIFT;
> >>   
> >> -	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
> >> +	if (req_start + req_len > phys_len)
> >>   		return -EINVAL;  
> > I'm only able to find your QEMU patch from last year, have you posted a
> > version making use of this latest proposal?  Is the expectation still
> > that QEMU modify the BAR size as seen from the guest to the host page
> > size for sub-page regions exposing the MMAP flag?  Does that result in
> > any known incompatibilities with drivers in the guest?  Thanks,
> >
> > Alex  
> 
> I will post the latest version soon. In QEMU, we would change the size of
> MemoryRegion instead of VFIORegion. So guest will see the real size. The
> only limit is that BAR must be in a exclusive page in guest too. Otherwise,
> we would not allow to passthrough this BAR. On Power, we have changed
> the SLOF to enforce all BARs to be page aligned in guest.

So the VM firmware will allocate page aligned BARs, but the guest OS is
free to realloc BARs as it sees fit.  VM firmware does not have the
final say in how BARs get mapped within the guest.  Besides, it seems
like a lot of hand waving and assumptions that QEMU gets a
MemoryListener region added with a sub-page size that magically gets
expanded because we think we know how the guest is mapping BARs.  I'm
not very confident in this solution.  Thanks,

Alex


* Re: [PATCH v3] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  2016-05-23 15:20     ` Alex Williamson
@ 2016-05-24 15:57       ` Yongji Xie
  0 siblings, 0 replies; 5+ messages in thread
From: Yongji Xie @ 2016-05-24 15:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, bhelgaas, aik, benh, paulus, mpe,
	warrier, zhong, nikunj, gwshan, kevin.tian

On 2016/5/23 23:20, Alex Williamson wrote:

> On Mon, 23 May 2016 11:45:34 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> On 2016/5/20 6:33, Alex Williamson wrote:
>>
>>> On Thu, 12 May 2016 18:20:51 +0800
>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>   
>>>> Current vfio-pci implementation disallows to mmap
>>>> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
>>>> page may be shared with other BARs. This will cause some
>>>> performance issues when we passthrough a PCI device with
>>>> this kind of BARs. Guest will be not able to handle the mmio
>>>> accesses to the BARs which leads to mmio emulations in host.
>>>>
>>>> However, not all sub-page BARs will share page with other BARs.
>>>> We should allow to mmap the sub-page MMIO BARs which we can
>>>> make sure will not share page with other BARs.
>>>>
>>>> This patch adds support for this case. And we try to add a
>>>> dummy resource to reserve the remainder of the page which
>>>> hot-add device's BAR might be assigned into. But it's not
>>>> necessary to handle the case when the BAR is not page aligned.
>>>> Because we can't expect the BAR will be assigned into the same
>>>> location in a page in guest when we passthrough the BAR. And
>>>> it's hard to access this BAR in userspace because we have
>>>> no way to get the BAR's location in a page.
>>>>
>>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>>>> ---
>>>>    drivers/vfio/pci/vfio_pci.c         |   70 +++++++++++++++++++++++++++++++----
>>>>    drivers/vfio/pci/vfio_pci_private.h |    8 ++++
>>>>    2 files changed, 71 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>>> index 188b1ff..253c22f 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -110,6 +110,50 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>>>    	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>>>>    }
>>>>    
>>>> +static bool vfio_pci_bar_mmap_supported(struct vfio_pci_device *vdev, int index)
>>>> +{
>>>> +	struct resource *res = vdev->pdev->resource + index;
>>>> +	struct vfio_pci_dummy_resource *dummy_res = NULL;
>>>> +
>>>> +	if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>>>> +		return false;
>>>> +
>>>> +	if (!(res->flags & IORESOURCE_MEM))
>>>> +		return false;
>>>> +
>>>> +	/*
>>>> +	 * The PCI core shouldn't set up a resource with a type but
>>>> +	 * zero size. But there may be bugs that cause us to do that.
>>>> +	 */
>>>> +	if (!resource_size(res))
>>>> +		return false;
>>>> +
>>>> +	if (resource_size(res) >= PAGE_SIZE)
>>>> +		return true;
>>>> +
>>>> +	if (!(res->start & ~PAGE_MASK)) {
>>>> +		/*
>>>> +		 * Add a dummy resource to reserve the remainder
>>>> +		 * of the exclusive page in case that hot-add
>>>> +		 * device's bar is assigned into it.
>>>> +		 */
>>>> +		dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>>>> +		if (dummy_res == NULL)
>>>> +			return false;
>>>> +		dummy_res->resource.start = res->end + 1;
>>>> +		dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>>>> +		dummy_res->resource.flags = res->flags;
>>>> +		if (request_resource(res->parent, &dummy_res->resource)) {
>>>> +			kfree(dummy_res);
>>>> +			return false;
>>>> +		}
>>>> +		dummy_res->index = index;
>>>> +		list_add(&dummy_res->res_next, &vdev->dummy_resources_list);
>>>> +		return true;
>>>> +	}
>>>> +	return false;
>>>> +}
>>> The name of this function is vfio_pci_bar_mmap_supported(), which
>>> suggests we should be able to call it at any point to test if mmap is
>>> supported, but that's not what it does.  It's actually a one time setup
>>> function, that also happens to return what it found or managed to
>>> reserve.  If we were to call this a second time, we might get a
>>> different result.  So I think this either needs to change to something
>>> like:
>>>
>>> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>>>
>>> where it loops through all the BARs and results in a valid
>>> bar_mmap_supported array, or the function should be made smart enough
>>> to identify if the necessary resource has already been allocated such
>>> that it can be call multiple times per BAR, at which point we could
>>> remove the bar_mmap_supported array.
>> Thanks for your comment. I would change the name of this function
>> and reserve bar_mmap_supported array to cache the result.
>>
>>> A comment describing why we can only support sub-page mmaps for
>>> resources aligned at the start of a page would also be helpful for
>>> future maintenance here.
>> OK. I will add this.
>>
>>>> +
>>>>    static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
>>>>    static void vfio_pci_disable(struct vfio_pci_device *vdev);
>>>>    
>>>> @@ -145,10 +189,12 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
>>>>    static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>>>    {
>>>>    	struct pci_dev *pdev = vdev->pdev;
>>>> -	int ret;
>>>> +	int ret, bar;
>>>>    	u16 cmd;
>>>>    	u8 msix_pos;
>>>>    
>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>>>> +
>>>>    	pci_set_power_state(pdev, PCI_D0);
>>>>    
>>>>    	/* Don't allow our initial saved state to include busmaster */
>>>> @@ -218,12 +264,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>>>    		}
>>>>    	}
>>>>    
>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>>>> +		vdev->bar_mmap_supported[bar] =
>>>> +				vfio_pci_bar_mmap_supported(vdev, bar);
>>>> +	}
>>>>    	return 0;
>>>>    }
>>>>    
>>>>    static void vfio_pci_disable(struct vfio_pci_device *vdev)
>>>>    {
>>>>    	struct pci_dev *pdev = vdev->pdev;
>>>> +	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>>>>    	int i, bar;
>>>>    
>>>>    	/* Stop the device from further DMA */
>>>> @@ -252,6 +303,13 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>>>>    		vdev->barmap[bar] = NULL;
>>>>    	}
>>>>    
>>>> +	list_for_each_entry_safe(dummy_res, tmp,
>>>> +				 &vdev->dummy_resources_list, res_next) {
>>>> +		list_del(&dummy_res->res_next);
>>>> +		release_resource(&dummy_res->resource);
>>>> +		kfree(dummy_res);
>>>> +	}
>>>> +
>>>>    	vdev->needs_reset = true;
>>>>    
>>>>    	/*
>>>> @@ -623,9 +681,7 @@ static long vfio_pci_ioctl(void *device_data,
>>>>    
>>>>    			info.flags = VFIO_REGION_INFO_FLAG_READ |
>>>>    				     VFIO_REGION_INFO_FLAG_WRITE;
>>>> -			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>>>> -			    pci_resource_flags(pdev, info.index) &
>>>> -			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
>>>> +			if (vdev->bar_mmap_supported[info.index]) {
>>>>    				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>>>>    				if (info.index == vdev->msix_bar) {
>>>>    					ret = msix_sparse_mmap_cap(vdev, &caps);
>>>> @@ -1049,16 +1105,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>>>>    		return -EINVAL;
>>>>    	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>>>>    		return -EINVAL;
>>>> -	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
>>>> +	if (!vdev->bar_mmap_supported[index])
>>>>    		return -EINVAL;
>>>>    
>>>> -	phys_len = pci_resource_len(pdev, index);
>>>> +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>>>>    	req_len = vma->vm_end - vma->vm_start;
>>>>    	pgoff = vma->vm_pgoff &
>>>>    		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>>>>    	req_start = pgoff << PAGE_SHIFT;
>>>>    
>>>> -	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>>>> +	if (req_start + req_len > phys_len)
>>>>    		return -EINVAL;
>>> I'm only able to find your QEMU patch from last year, have you posted a
>>> version making use of this latest proposal?  Is the expectation still
>>> that QEMU modify the BAR size as seen from the guest to the host page
>>> size for sub-page regions exposing the MMAP flag?  Does that result in
>>> any known incompatibilities with drivers in the guest?  Thanks,
>>>
>>> Alex
>> I will post the latest version soon. In QEMU, we would change the size of
>> MemoryRegion instead of VFIORegion. So guest will see the real size. The
>> only limit is that BAR must be in a exclusive page in guest too. Otherwise,
>> we would not allow to passthrough this BAR. On Power, we have changed
>> the SLOF to enforce all BARs to be page aligned in guest.
> So the VM firmware will allocate page aligned BARs, but the guest OS is
> free to realloc BARs as it sees fit.  VM firmware does not have the
> final say in how BARs get mapped within the guest.

Yes, the allocation of BARs in the guest is uncertain, so we should
detect whether the BAR sits in an exclusive page in the guest. This can
be done by giving the sub-page BAR's MemoryRegion a lower priority.
If there is any overlap, the right BAR would still get the FlatRange.
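
To illustrate only the exclusivity condition itself (not the
MemoryRegion priority mechanism), here is a minimal sketch in plain C.
It assumes the guest's BAR placements are already known; struct
guest_bar, pages_overlap() and bar_page_is_exclusive() are hypothetical
names for illustration and are not QEMU or kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one BAR as placed by the guest. */
struct guest_bar {
	uint64_t start;	/* guest physical address */
	uint64_t size;	/* size in bytes, 0 if the BAR is unused */
};

/* True when the pages touched by [a, a+alen) and [b, b+blen) intersect. */
static bool pages_overlap(uint64_t a, uint64_t alen,
			  uint64_t b, uint64_t blen, uint64_t page_size)
{
	uint64_t mask = ~(page_size - 1);
	uint64_t a_first = a & mask, a_last = (a + alen - 1) & mask;
	uint64_t b_first = b & mask, b_last = (b + blen - 1) & mask;

	return a_first <= b_last && b_first <= a_last;
}

/* True when no other BAR in the array shares a page with bars[idx]. */
static bool bar_page_is_exclusive(const struct guest_bar *bars, size_t n,
				  size_t idx, uint64_t page_size)
{
	/* page_size must be a non-zero power of two */
	assert(page_size && (page_size & (page_size - 1)) == 0);

	if (bars[idx].size == 0)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (i == idx || bars[i].size == 0)
			continue;
		if (pages_overlap(bars[idx].start, bars[idx].size,
				  bars[i].start, bars[i].size, page_size))
			return false;
	}
	return true;
}
```

With 4K pages, two 256-byte BARs at 0x10000 and 0x10800 share a page,
so neither would be treated as exclusive under this check.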

> Besides, it seems like a lot of hand waving and assumptions that QEMU gets a
> MemoryListener region added with a sub-page size that magically gets
> expanded because we think we know how the guest is mapping BARs.  I'm
> not very confident in this solutions.  Thanks,
>

Sorry, I didn't get your point...

In my opinion, the QEMU patch needs to do two things if we want to
pass through sub-page BARs to the guest:

1. Pass a valid size to the KVM ioctl KVM_SET_USER_MEMORY_REGION,
because the size of the BAR is still less than PAGE_SIZE. This could be
done by expanding the MemoryRegion of sub-page BARs.

2. Handle the case where another BAR shares the same page with the
sub-page BAR in the guest, because the guest may not allocate
page-aligned BARs. In this case, we should not pass through the
sub-page BAR. We could give the sub-page BAR's MemoryRegion a lower
priority to achieve that.
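
As a rough sketch of the arithmetic behind point 1, assuming the BAR
must start on a page boundary (as the kernel patch requires on the host
side): the region size is rounded up to a whole page before it is
handed on as a memory slot. struct slot_range, page_align_up() and
subpage_bar_slot() are hypothetical names used for illustration only;
this is not the actual QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page-granular range that would back a memory slot. */
struct slot_range {
	uint64_t start;
	uint64_t size;
};

/* Round len up to a multiple of page_size (a non-zero power of two). */
static uint64_t page_align_up(uint64_t len, uint64_t page_size)
{
	assert(page_size && (page_size & (page_size - 1)) == 0);
	return (len + page_size - 1) & ~(page_size - 1);
}

/*
 * Fill *out with the expanded range for a BAR, or return false when
 * the BAR is empty or does not start on a page boundary.
 */
static bool subpage_bar_slot(uint64_t bar_start, uint64_t bar_size,
			     uint64_t page_size, struct slot_range *out)
{
	assert(page_size && (page_size & (page_size - 1)) == 0);

	if (bar_size == 0 || (bar_start & (page_size - 1)))
		return false;

	out->start = bar_start;
	out->size = page_align_up(bar_size, page_size);
	return true;
}
```

For example, with 64K pages a 256-byte BAR at 0x20000 would be expanded
to a 0x10000-byte range, while the same BAR at 0x20100 would be
rejected.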

Thanks,
Yongji


end of thread, other threads:[~2016-05-24 15:58 UTC | newest]

Thread overview: 5+ messages
2016-05-12 10:20 [PATCH v3] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Yongji Xie
2016-05-19 22:33 ` Alex Williamson
2016-05-23  3:45   ` Yongji Xie
     [not found]   ` <201605230345.u4N3dJip043323@mx0a-001b2d01.pphosted.com>
2016-05-23 15:20     ` Alex Williamson
2016-05-24 15:57       ` Yongji Xie
