linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table on PPC64 platform
@ 2015-12-11  8:53 Yongji Xie
  2015-12-11  8:53 ` [RFC PATCH 1/3] powerpc/pci: Enforce all MMIO BARs to be page aligned Yongji Xie
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Yongji Xie @ 2015-12-11  8:53 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, alex.williamson, benh, paulus, mpe, warrier, zhong, nikunj,
	Yongji Xie

Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs and MSI-X table. This is because
sub-page BARs' mmio page may be shared with other BARs and MSI-X table
should not be accessed directly from the guest for security reasons.

But these would cause some performance issues for mmio accesses in guest
when vfio passthrough sub-page BARs or BARs containing MSI-X table on
PPC64 platform. This is because PAGE_SIZE is 64KB by default on PPC64
platform and the big page may easily hit the sub-page MMIO
BARs' unmmapping and cause the unmmaping of the mmio page which
MSI-X table locate in, which lead to mmio emulation in host.

For sub-page MMIO BARs' unmmapping, this patch set enforces all MMIO
BARs to be page aligned on PPC64 platform so that sub-page BAR's mmio
page would not be shared with other BARs. Then we can mmap sub-page
MMIO BARs in vfio-pci driver if all MMIO BARs are page aligned.

For MSI-X table's unmmapping, we think MSI-X table is safe to access
directly from the guest with EEH mechanism enabled which can ensure that
a given pci device can only shoot the MSIs assigned for its PE. So
we add support for mmapping MSI-X table in vfio-pci driver if EEH is
supported.

With this patch set applied, we can get almost 100% improvement on
performance for mmio accesses when we passthrough sub-page BARs in
our test.

The last two patches in the patch set can be used by qemu to:
	- Add support for a VFIO-PCI ioctl to indicate that platform
	  support all PCI BARs are page aligned.
	- Add support for a VFIO-PCI ioctl to indicate that platform
	  support mmapping MSI-X table.

Yongji Xie (3):
  powerpc/pci: Enforce all MMIO BARs to be page aligned
  vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned
  vfio-pci: Allow to mmap MSI-X table if EEH is supported

 arch/powerpc/kernel/pci-common.c    |   10 +++++++++-
 drivers/vfio/pci/vfio_pci.c         |   15 +++++++++++++--
 drivers/vfio/pci/vfio_pci_private.h |   10 ++++++++++
 include/uapi/linux/vfio.h           |    4 ++++
 4 files changed, 36 insertions(+), 3 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH 1/3] powerpc/pci: Enforce all MMIO BARs to be page aligned
  2015-12-11  8:53 [RFC PATCH 0/3] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table on PPC64 platform Yongji Xie
@ 2015-12-11  8:53 ` Yongji Xie
  2015-12-11  8:53 ` [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are " Yongji Xie
  2015-12-11  8:53 ` [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported Yongji Xie
  2 siblings, 0 replies; 15+ messages in thread
From: Yongji Xie @ 2015-12-11  8:53 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, alex.williamson, benh, paulus, mpe, warrier, zhong, nikunj,
	Yongji Xie

PAGE_SIZE is 64KB by default on PPC64 platform. When vfio
passthrough a pci device of which MMIO BARs are smaller than
64KB(PAGE_SIZE), guest would not handle the mmio accesses
to the BARs which leads to mmio emulations in host.

This is because vfio would not allow to passthrough one
BAR's mmio page which may be shared with other BARs.

To solve this performance issue, this patch enforces the
alignment of all MMIO BARs allocations to be at least PAGE_SIZE
on PPC64 platform because we have enough address space, so
that one BAR's mmio page would not be shared with other BARs.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 0f7a60f..6989e0f 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1074,6 +1074,11 @@ static int skip_isa_ioresource_align(struct pci_dev *dev)
  * bits, so it's ok to allocate at, say, 0x2800-0x28ff,
  * but we want to try to avoid allocating at 0x2900-0x2bff
  * which might have be mirrored at 0x0100-0x03ff..
+ *
+ * And for PPC64, we enforce the alignment of all MMIO BARs
+ * allocations to be at least PAGE_SIZE(64KB). This would be
+ * helpful to improve performance when we passthrough
+ * a PCI device of which BARs are smaller than PAGE_SIZE
  */
 resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 				resource_size_t size, resource_size_t align)
@@ -1087,7 +1092,10 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 		if (start & 0x300)
 			start = (start + 0x3ff) & ~0x3ff;
 	}
-
+#ifdef CONFIG_PPC64
+	if (res->flags & IORESOURCE_MEM)
+		start = PAGE_ALIGN(start);
+#endif
 	return start;
 }
 EXPORT_SYMBOL(pcibios_align_resource);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned
  2015-12-11  8:53 [RFC PATCH 0/3] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table on PPC64 platform Yongji Xie
  2015-12-11  8:53 ` [RFC PATCH 1/3] powerpc/pci: Enforce all MMIO BARs to be page aligned Yongji Xie
@ 2015-12-11  8:53 ` Yongji Xie
  2015-12-16 20:04   ` Alex Williamson
  2015-12-11  8:53 ` [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported Yongji Xie
  2 siblings, 1 reply; 15+ messages in thread
From: Yongji Xie @ 2015-12-11  8:53 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, alex.williamson, benh, paulus, mpe, warrier, zhong, nikunj,
	Yongji Xie

Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio page
may be shared with other BARs.

But we should allow to mmap these sub-page MMIO BARs if all MMIO BARs
are page aligned which leads the BARs' mmio page would not be shared
with other BARs.

This patch adds support for this case and we also add a
VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED flag to notify userspace that
platform supports all MMIO BARs to be page aligned.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c         |   10 +++++++++-
 drivers/vfio/pci/vfio_pci_private.h |    5 +++++
 include/uapi/linux/vfio.h           |    2 ++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 32b88bd..dbcad99 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -443,6 +443,9 @@ static long vfio_pci_ioctl(void *device_data,
 		if (vdev->reset_works)
 			info.flags |= VFIO_DEVICE_FLAGS_RESET;
 
+		if (vfio_pci_bar_page_aligned())
+			info.flags |= VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
+
 		info.num_regions = VFIO_PCI_NUM_REGIONS;
 		info.num_irqs = VFIO_PCI_NUM_IRQS;
 
@@ -479,7 +482,8 @@ static long vfio_pci_ioctl(void *device_data,
 				     VFIO_REGION_INFO_FLAG_WRITE;
 			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
 			    pci_resource_flags(pdev, info.index) &
-			    IORESOURCE_MEM && info.size >= PAGE_SIZE)
+			    IORESOURCE_MEM && (info.size >= PAGE_SIZE ||
+			    vfio_pci_bar_page_aligned()))
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
 			break;
 		case VFIO_PCI_ROM_REGION_INDEX:
@@ -855,6 +859,10 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	phys_len = pci_resource_len(pdev, index);
+
+	if (vfio_pci_bar_page_aligned())
+		phys_len = PAGE_ALIGN(phys_len);
+
 	req_len = vma->vm_end - vma->vm_start;
 	pgoff = vma->vm_pgoff &
 		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 0e7394f..319352a 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -69,6 +69,11 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+static inline bool vfio_pci_bar_page_aligned(void)
+{
+	return IS_ENABLED(CONFIG_PPC64);
+}
+
 extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 751b69f..1fc8066 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -171,6 +171,8 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
 #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
+/* Platform support all PCI MMIO BARs to be page aligned */
+#define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED	(1 << 4)
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-11  8:53 [RFC PATCH 0/3] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table on PPC64 platform Yongji Xie
  2015-12-11  8:53 ` [RFC PATCH 1/3] powerpc/pci: Enforce all MMIO BARs to be page aligned Yongji Xie
  2015-12-11  8:53 ` [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are " Yongji Xie
@ 2015-12-11  8:53 ` Yongji Xie
  2015-12-16 20:14   ` Alex Williamson
  2 siblings, 1 reply; 15+ messages in thread
From: Yongji Xie @ 2015-12-11  8:53 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, alex.williamson, benh, paulus, mpe, warrier, zhong, nikunj,
	Yongji Xie

Current vfio-pci implementation disallows to mmap MSI-X table in
case that user get to touch this directly.

However, EEH mechanism could ensure that a given pci device
can only shoot the MSIs assigned for its PE and guest kernel also
would not write to MSI-X table in pci_enable_msix() because
para-virtualization on PPC64 platform. So MSI-X table is safe to
access directly from the guest with EEH mechanism enabled.

This patch adds support for this case and allow to mmap MSI-X
table if EEH is supported on PPC64 platform.

And we also add a VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP flag to notify
userspace that it's safe to mmap MSI-X table.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c         |    5 ++++-
 drivers/vfio/pci/vfio_pci_private.h |    5 +++++
 include/uapi/linux/vfio.h           |    2 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index dbcad99..85d9980 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -446,6 +446,9 @@ static long vfio_pci_ioctl(void *device_data,
 		if (vfio_pci_bar_page_aligned())
 			info.flags |= VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
 
+		if (vfio_msix_table_mmap_enabled())
+			info.flags |= VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP;
+
 		info.num_regions = VFIO_PCI_NUM_REGIONS;
 		info.num_irqs = VFIO_PCI_NUM_IRQS;
 
@@ -871,7 +874,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
 		return -EINVAL;
 
-	if (index == vdev->msix_bar) {
+	if (index == vdev->msix_bar && !vfio_msix_table_mmap_enabled()) {
 		/*
 		 * Disallow mmaps overlapping the MSI-X table; users don't
 		 * get to touch this directly.  We could find somewhere
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 319352a..835619e 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -74,6 +74,11 @@ static inline bool vfio_pci_bar_page_aligned(void)
 	return IS_ENABLED(CONFIG_PPC64);
 }
 
+static inline bool vfio_msix_table_mmap_enabled(void)
+{
+	return IS_ENABLED(CONFIG_EEH);
+}
+
 extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 1fc8066..289e662 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -173,6 +173,8 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
 /* Platform support all PCI MMIO BARs to be page aligned */
 #define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED	(1 << 4)
+/* Platform support mmapping PCI MSI-X vector table */
+#define VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP	(1 << 5)
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned
  2015-12-11  8:53 ` [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are " Yongji Xie
@ 2015-12-16 20:04   ` Alex Williamson
  2015-12-17 10:26     ` yongji xie
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2015-12-16 20:04 UTC (permalink / raw)
  To: Yongji Xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, benh, paulus, mpe, warrier, zhong, nikunj

On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> Current vfio-pci implementation disallows to mmap
> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio page
> may be shared with other BARs.
> 
> But we should allow to mmap these sub-page MMIO BARs if all MMIO BARs
> are page aligned which leads the BARs' mmio page would not be shared
> with other BARs.
> 
> This patch adds support for this case and we also add a
> VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED flag to notify userspace that
> platform supports all MMIO BARs to be page aligned.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         |   10 +++++++++-
>  drivers/vfio/pci/vfio_pci_private.h |    5 +++++
>  include/uapi/linux/vfio.h           |    2 ++
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c
> b/drivers/vfio/pci/vfio_pci.c
> index 32b88bd..dbcad99 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -443,6 +443,9 @@ static long vfio_pci_ioctl(void *device_data,
>  		if (vdev->reset_works)
>  			info.flags |= VFIO_DEVICE_FLAGS_RESET;
>  
> +		if (vfio_pci_bar_page_aligned())
> +			info.flags |=
> VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
> +
>  		info.num_regions = VFIO_PCI_NUM_REGIONS;
>  		info.num_irqs = VFIO_PCI_NUM_IRQS;
>  
> @@ -479,7 +482,8 @@ static long vfio_pci_ioctl(void *device_data,
>  				     VFIO_REGION_INFO_FLAG_WRITE;
>  			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>  			    pci_resource_flags(pdev, info.index) &
> -			    IORESOURCE_MEM && info.size >=
> PAGE_SIZE)
> +			    IORESOURCE_MEM && (info.size >=
> PAGE_SIZE ||
> +			    vfio_pci_bar_page_aligned()))
>  				info.flags |=
> VFIO_REGION_INFO_FLAG_MMAP;
>  			break;
>  		case VFIO_PCI_ROM_REGION_INDEX:
> @@ -855,6 +859,10 @@ static int vfio_pci_mmap(void *device_data,
> struct vm_area_struct *vma)
>  		return -EINVAL;
>  
>  	phys_len = pci_resource_len(pdev, index);
> +
> +	if (vfio_pci_bar_page_aligned())
> +		phys_len = PAGE_ALIGN(phys_len);
> +
>  	req_len = vma->vm_end - vma->vm_start;
>  	pgoff = vma->vm_pgoff &
>  		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> diff --git a/drivers/vfio/pci/vfio_pci_private.h
> b/drivers/vfio/pci/vfio_pci_private.h
> index 0e7394f..319352a 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -69,6 +69,11 @@ struct vfio_pci_device {
>  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) ||
> is_msix(vdev)))
>  #define irq_is(vdev, type) (vdev->irq_type == type)
>  
> +static inline bool vfio_pci_bar_page_aligned(void)
> +{
> +	return IS_ENABLED(CONFIG_PPC64);
> +}

I really dislike this.  This is a problem for any architecture that
runs on larger pages, and even an annoyance on 4k hosts.  Why are we
only solving it for PPC64?  Can't we do something similar in the core
PCI code and detect it?

> +
>  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 751b69f..1fc8066 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -171,6 +171,8 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci
> device */
>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform
> device */
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device
> */
> +/* Platform support all PCI MMIO BARs to be page aligned */
> +#define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED	(1 << 4)
>  	__u32	num_regions;	/* Max region index + 1 */
>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>  };

Why is this on the device info, shouldn't it be per region?  Do we even
need a flag or can we just set the existing mmap flag with the
clarification that sub-host page size regions can mmap an entire host-
page aligned, sized area in the documentation?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-11  8:53 ` [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported Yongji Xie
@ 2015-12-16 20:14   ` Alex Williamson
  2015-12-17 10:08     ` David Laight
  2015-12-17 10:37     ` yongji xie
  0 siblings, 2 replies; 15+ messages in thread
From: Alex Williamson @ 2015-12-16 20:14 UTC (permalink / raw)
  To: Yongji Xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, benh, paulus, mpe, warrier, zhong, nikunj

On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> Current vfio-pci implementation disallows to mmap MSI-X table in
> case that user get to touch this directly.
> 
> However, EEH mechanism could ensure that a given pci device
> can only shoot the MSIs assigned for its PE and guest kernel also
> would not write to MSI-X table in pci_enable_msix() because
> para-virtualization on PPC64 platform. So MSI-X table is safe to
> access directly from the guest with EEH mechanism enabled.

The MSI-X table is paravirtualized on vfio in general and interrupt
remapping theoretically protects against errant interrupts, so why is
this PPC64 specific?  We have the same safeguards on x86 if we want to
decide they're sufficient.  Offhand, the only way I can think that a
device can touch the MSI-X table is via backdoors or p2p DMA with
another device.

> This patch adds support for this case and allow to mmap MSI-X
> table if EEH is supported on PPC64 platform.
> 
> And we also add a VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP flag to notify
> userspace that it's safe to mmap MSI-X table.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         |    5 ++++-
>  drivers/vfio/pci/vfio_pci_private.h |    5 +++++
>  include/uapi/linux/vfio.h           |    2 ++
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dbcad99..85d9980 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -446,6 +446,9 @@ static long vfio_pci_ioctl(void *device_data,
>  		if (vfio_pci_bar_page_aligned())
>  			info.flags |= VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
>  
> +		if (vfio_msix_table_mmap_enabled())
> +			info.flags |= VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP;
> +
>  		info.num_regions = VFIO_PCI_NUM_REGIONS;
>  		info.num_irqs = VFIO_PCI_NUM_IRQS;
>  
> @@ -871,7 +874,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>  		return -EINVAL;
>  
> -	if (index == vdev->msix_bar) {
> +	if (index == vdev->msix_bar && !vfio_msix_table_mmap_enabled()) {
>  		/*
>  		 * Disallow mmaps overlapping the MSI-X table; users don't
>  		 * get to touch this directly.  We could find somewhere
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 319352a..835619e 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -74,6 +74,11 @@ static inline bool vfio_pci_bar_page_aligned(void)
>  	return IS_ENABLED(CONFIG_PPC64);
>  }
>  
> +static inline bool vfio_msix_table_mmap_enabled(void)
> +{
> +	return IS_ENABLED(CONFIG_EEH);
> +}

I really dislike these.

> +
>  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1fc8066..289e662 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -173,6 +173,8 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
>  /* Platform support all PCI MMIO BARs to be page aligned */
>  #define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED	(1 << 4)
> +/* Platform support mmapping PCI MSI-X vector table */
> +#define VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP	(1 << 5)

Again, not sure why this is on the device versus the region, but I'd
prefer to investigate whether we can handle this with the sparse mmap
capability (or lack of) in the capability chains I proposed[1]. Thanks,

Alex

[1] https://lkml.org/lkml/2015/11/23/748

>  	__u32	num_regions;	/* Max region index + 1 */
>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>  };


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-16 20:14   ` Alex Williamson
@ 2015-12-17 10:08     ` David Laight
  2015-12-17 21:06       ` Alex Williamson
  2015-12-17 10:37     ` yongji xie
  1 sibling, 1 reply; 15+ messages in thread
From: David Laight @ 2015-12-17 10:08 UTC (permalink / raw)
  To: 'Alex Williamson',
	Yongji Xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: nikunj, zhong, aik, paulus, warrier

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1638 bytes --]

> The MSI-X table is paravirtualized on vfio in general and interrupt
> remapping theoretically protects against errant interrupts, so why is
> this PPC64 specific? We have the same safeguards on x86 if we want to
> decide they're sufficient. Offhand, the only way I can think that a
> device can touch the MSI-X table is via backdoors or p2p DMA with
> another device.

Is this all related to the statements in the PCI(e) spec that the
MSI-X table and Pending bit array should in their own BARs?
(ISTR it even suggests a BAR each.)

Since the MSI-X table exists in device memory/registers there is
nothing to stop the device modifying the table contents (or even
ignoring the contents and writing address+data pairs that are known
to reference the CPUs MSI-X interrupt generation logic).

We've an fpga based PCIe slave that has some additional PCIe slaves
(associated with the interrupt generation logic) that are currently
next to the PBA (which is 8k from the MSI-X table).
If we can't map the PBA we can't actually raise any interrupts.
The same would be true if page size is 64k and mapping the MSI-X
table banned.

Do we need to change our PCIe slave address map so we don't need
to access anything in the same page (which might be 64k were we to
target large ppc - which we don't at the moment) as both the
MSI-X table and the PBA?

I'd also note that being able to read the MSI-X table is a useful
diagnostic that the relevant interrupts are enabled properly.

	David

ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned
  2015-12-16 20:04   ` Alex Williamson
@ 2015-12-17 10:26     ` yongji xie
  2015-12-17 21:46       ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: yongji xie @ 2015-12-17 10:26 UTC (permalink / raw)
  To: Alex Williamson, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, benh, paulus, mpe, warrier, zhong, nikunj



On 2015/12/17 4:04, Alex Williamson wrote:
> On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
>> Current vfio-pci implementation disallows to mmap
>> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio page
>> may be shared with other BARs.
>>
>> But we should allow to mmap these sub-page MMIO BARs if all MMIO BARs
>> are page aligned which leads the BARs' mmio page would not be shared
>> with other BARs.
>>
>> This patch adds support for this case and we also add a
>> VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED flag to notify userspace that
>> platform supports all MMIO BARs to be page aligned.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci.c         |   10 +++++++++-
>>   drivers/vfio/pci/vfio_pci_private.h |    5 +++++
>>   include/uapi/linux/vfio.h           |    2 ++
>>   3 files changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c
>> b/drivers/vfio/pci/vfio_pci.c
>> index 32b88bd..dbcad99 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -443,6 +443,9 @@ static long vfio_pci_ioctl(void *device_data,
>>   		if (vdev->reset_works)
>>   			info.flags |= VFIO_DEVICE_FLAGS_RESET;
>>   
>> +		if (vfio_pci_bar_page_aligned())
>> +			info.flags |=
>> VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
>> +
>>   		info.num_regions = VFIO_PCI_NUM_REGIONS;
>>   		info.num_irqs = VFIO_PCI_NUM_IRQS;
>>   
>> @@ -479,7 +482,8 @@ static long vfio_pci_ioctl(void *device_data,
>>   				     VFIO_REGION_INFO_FLAG_WRITE;
>>   			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>>   			    pci_resource_flags(pdev, info.index) &
>> -			    IORESOURCE_MEM && info.size >=
>> PAGE_SIZE)
>> +			    IORESOURCE_MEM && (info.size >=
>> PAGE_SIZE ||
>> +			    vfio_pci_bar_page_aligned()))
>>   				info.flags |=
>> VFIO_REGION_INFO_FLAG_MMAP;
>>   			break;
>>   		case VFIO_PCI_ROM_REGION_INDEX:
>> @@ -855,6 +859,10 @@ static int vfio_pci_mmap(void *device_data,
>> struct vm_area_struct *vma)
>>   		return -EINVAL;
>>   
>>   	phys_len = pci_resource_len(pdev, index);
>> +
>> +	if (vfio_pci_bar_page_aligned())
>> +		phys_len = PAGE_ALIGN(phys_len);
>> +
>>   	req_len = vma->vm_end - vma->vm_start;
>>   	pgoff = vma->vm_pgoff &
>>   		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h
>> b/drivers/vfio/pci/vfio_pci_private.h
>> index 0e7394f..319352a 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -69,6 +69,11 @@ struct vfio_pci_device {
>>   #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) ||
>> is_msix(vdev)))
>>   #define irq_is(vdev, type) (vdev->irq_type == type)
>>   
>> +static inline bool vfio_pci_bar_page_aligned(void)
>> +{
>> +	return IS_ENABLED(CONFIG_PPC64);
>> +}
> I really dislike this.  This is a problem for any architecture that
> runs on larger pages, and even an annoyance on 4k hosts.  Why are we
> only solving it for PPC64?
Yes, I know it's a problem for other architectures. But I'm not sure if 
other archs prefer
to enforce the alignment of all BARs to be at least PAGE_SIZE which 
would result in
some waste of address space.

So I just propose a prototype and add PPC64 support here. And other 
archs could decide
to use it or not by themselves.
> Can't we do something similar in the core PCI code and detect it?
So you mean we can do it like this:

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d390fc1..f46c04d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -320,6 +320,11 @@ static inline resource_size_t 
pci_resource_alignment(struct pci_dev *dev,
         return resource_alignment(res);
  }

+static inline bool pci_bar_page_aligned(void)
+{
+       return IS_ENABLED(CONFIG_PPC64);
+}
+
  void pci_enable_acs(struct pci_dev *dev);

  struct pci_dev_reset_methods {

or add a config option to indicate that PCI MMIO BARs should be page 
aligned? Thanks.

Regards
Yongji Xie
>> +
>>   extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>>   extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>>   
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 751b69f..1fc8066 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -171,6 +171,8 @@ struct vfio_device_info {
>>   #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci
>> device */
>>   #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform
>> device */
>>   #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device
>> */
>> +/* Platform support all PCI MMIO BARs to be page aligned */
>> +#define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED	(1 << 4)
>>   	__u32	num_regions;	/* Max region index + 1 */
>>   	__u32	num_irqs;	/* Max IRQ index + 1 */
>>   };
> Why is this on the device info, shouldn't it be per region?  Do we even
> need a flag or can we just set the existing mmap flag with the
> clarification that sub-host page size regions can mmap an entire host-
> page aligned, sized area in the documentation?  Thanks,
>
> Alex
>
Yes, you are right! We can just set the existing mmap flag instead of 
adding a new flag.
I will fix it in the next version. Thanks.

Regards
Yongji Xie


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-16 20:14   ` Alex Williamson
  2015-12-17 10:08     ` David Laight
@ 2015-12-17 10:37     ` yongji xie
  2015-12-17 21:41       ` Alex Williamson
  1 sibling, 1 reply; 15+ messages in thread
From: yongji xie @ 2015-12-17 10:37 UTC (permalink / raw)
  To: Alex Williamson, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, benh, paulus, mpe, warrier, zhong, nikunj



On 2015/12/17 4:14, Alex Williamson wrote:
> On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
>> Current vfio-pci implementation disallows to mmap MSI-X table in
>> case that user get to touch this directly.
>>
>> However, EEH mechanism could ensure that a given pci device
>> can only shoot the MSIs assigned for its PE and guest kernel also
>> would not write to MSI-X table in pci_enable_msix() because
>> para-virtualization on PPC64 platform. So MSI-X table is safe to
>> access directly from the guest with EEH mechanism enabled.
> The MSI-X table is paravirtualized on vfio in general and interrupt
> remapping theoretically protects against errant interrupts, so why is
> this PPC64 specific?  We have the same safeguards on x86 if we want to
> decide they're sufficient.  Offhand, the only way I can think that a
> device can touch the MSI-X table is via backdoors or p2p DMA with
> another device.
Maybe I didn't make my point clear. The reasons why we can mmap MSI-X
table on PPC64 are:

1. EEH mechanism could ensure that a given pci device can only shoot
the MSIs assigned for its PE. So it would not do harm to other memory
space when the guest write a garbage MSI-X address/data to the vector table
if we passthough MSI-X tables to guest.

2. The guest kernel would not write to MSI-X table on PPC64 platform
when device drivers call pci_enable_msix() to initialize MSI-X interrupts.

So I think it is safe to mmap/passthrough MSI-X table on PPC64 platform.
And I'm not sure whether other architectures can ensure these two 
points. Thanks.

Regards
Yongji Xie
>> This patch adds support for this case and allow to mmap MSI-X
>> table if EEH is supported on PPC64 platform.
>>
>> And we also add a VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP flag to notify
>> userspace that it's safe to mmap MSI-X table.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci.c         |    5 ++++-
>>   drivers/vfio/pci/vfio_pci_private.h |    5 +++++
>>   include/uapi/linux/vfio.h           |    2 ++
>>   3 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dbcad99..85d9980 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -446,6 +446,9 @@ static long vfio_pci_ioctl(void *device_data,
>>   		if (vfio_pci_bar_page_aligned())
>>   			info.flags |= VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
>>   
>> +		if (vfio_msix_table_mmap_enabled())
>> +			info.flags |= VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP;
>> +
>>   		info.num_regions = VFIO_PCI_NUM_REGIONS;
>>   		info.num_irqs = VFIO_PCI_NUM_IRQS;
>>   
>> @@ -871,7 +874,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>>   	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>>   		return -EINVAL;
>>   
>> -	if (index == vdev->msix_bar) {
>> +	if (index == vdev->msix_bar && !vfio_msix_table_mmap_enabled()) {
>>   		/*
>>   		 * Disallow mmaps overlapping the MSI-X table; users don't
>>   		 * get to touch this directly.  We could find somewhere
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> index 319352a..835619e 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -74,6 +74,11 @@ static inline bool vfio_pci_bar_page_aligned(void)
>>   	return IS_ENABLED(CONFIG_PPC64);
>>   }
>>   
>> +static inline bool vfio_msix_table_mmap_enabled(void)
>> +{
>> +	return IS_ENABLED(CONFIG_EEH);
>> +}
> I really dislike these.
>
>> +
>>   extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>>   extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>>   
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 1fc8066..289e662 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -173,6 +173,8 @@ struct vfio_device_info {
>>   #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
>>   /* Platform support all PCI MMIO BARs to be page aligned */
>>   #define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED	(1 << 4)
>> +/* Platform support mmapping PCI MSI-X vector table */
>> +#define VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP	(1 << 5)
> Again, not sure why this is on the device versus the region, but I'd
> prefer to investigate whether we can handle this with the sparse mmap
> capability (or lack of) in the capability chains I proposed[1]. Thanks,
>
> Alex
>
> [1] https://lkml.org/lkml/2015/11/23/748
>
Good idea! I wiil investigate it. Thanks.

Regards
Yongji Xie
>>   	__u32	num_regions;	/* Max region index + 1 */
>>   	__u32	num_irqs;	/* Max IRQ index + 1 */
>>   };


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-17 10:08     ` David Laight
@ 2015-12-17 21:06       ` Alex Williamson
  2015-12-18 10:15         ` David Laight
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2015-12-17 21:06 UTC (permalink / raw)
  To: David Laight, Yongji Xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: nikunj, zhong, aik, paulus, warrier

On Thu, 2015-12-17 at 10:08 +0000, David Laight wrote:
> > The MSI-X table is paravirtualized on vfio in general and interrupt
> > remapping theoretically protects against errant interrupts, so why
> > is
> > this PPC64 specific? We have the same safeguards on x86 if we want
> > to
> > decide they're sufficient. Offhand, the only way I can think that a
> > device can touch the MSI-X table is via backdoors or p2p DMA with
> > another device.
> 
> Is this all related to the statements in the PCI(e) spec that the
> MSI-X table and Pending bit array should in their own BARs?
> (ISTR it even suggests a BAR each.)
> 
> Since the MSI-X table exists in device memory/registers there is
> nothing to stop the device modifying the table contents (or even
> ignoring the contents and writing address+data pairs that are known
> to reference the CPUs MSI-X interrupt generation logic).
> 
> We've an fpga based PCIe slave that has some additional PCIe slaves
> (associated with the interrupt generation logic) that are currently
> next to the PBA (which is 8k from the MSI-X table).
> If we can't map the PBA we can't actually raise any interrupts.
> The same would be true if page size is 64k and mapping the MSI-X
> table banned.
> 
> Do we need to change our PCIe slave address map so we don't need
> to access anything in the same page (which might be 64k were we to
> target large ppc - which we don't at the moment) as both the
> MSI-X table and the PBA?
> 
> I'd also note that being able to read the MSI-X table is a useful
> diagnostic that the relevant interrupts are enabled properly.

Yes, the spec requirement is that MSI-X structures must reside in a 4k
aligned area that doesn't overlap with other configuration registers
for the device.  It's only an advisement to put them into their own
BAR, and 4k clearly wasn't as forward looking as we'd hope.  Vfio
doesn't particularly care about the PBA, but if it resides in the same
host PAGE_SIZE area as the MSI-X vector table, you currently won't be
able to get to it.  Most devices are not at all dependent on the PBA
for any sort of functionality.

It's really more correct to say that both the vector table and PBA are
emulated by QEMU than paravirtualized.  Only PPC64 has the guest OS
taking a paravirtual path to program the vector table, everyone else
attempts to read/write to the device MMIO space, which gets trapped and
emulated in QEMU.  This is why the QEMU side patch has further ugly
hacks to mess with the ordering of MemoryRegions since even if we can
access and mmap the MSI-X vector table, we'll still trap into QEMU for
emulation.

How exactly does the ability to map the PBA affect your ability to
raise an interrupt?  I can only think that maybe you're writing PBA
bits to clear them, but the spec indicates that software should never
write to the PBA, only read, and that writes are undefined.  So that
would be very non-standard, QEMU drops writes, they don't even make it
to the hardware.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-17 10:37     ` yongji xie
@ 2015-12-17 21:41       ` Alex Williamson
  2015-12-17 22:48         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2015-12-17 21:41 UTC (permalink / raw)
  To: yongji xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, benh, paulus, mpe, warrier, zhong, nikunj

On Thu, 2015-12-17 at 18:37 +0800, yongji xie wrote:
> 
> On 2015/12/17 4:14, Alex Williamson wrote:
> > On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> > > Current vfio-pci implementation disallows to mmap MSI-X table in
> > > case that user get to touch this directly.
> > > 
> > > However, EEH mechanism could ensure that a given pci device
> > > can only shoot the MSIs assigned for its PE and guest kernel also
> > > would not write to MSI-X table in pci_enable_msix() because
> > > para-virtualization on PPC64 platform. So MSI-X table is safe to
> > > access directly from the guest with EEH mechanism enabled.
> > The MSI-X table is paravirtualized on vfio in general and interrupt
> > remapping theoretically protects against errant interrupts, so why
> > is
> > this PPC64 specific?  We have the same safeguards on x86 if we want
> > to
> > decide they're sufficient.  Offhand, the only way I can think that
> > a
> > device can touch the MSI-X table is via backdoors or p2p DMA with
> > another device.
> Maybe I didn't make my point clear. The reasons why we can mmap MSI-X
> table on PPC64 are:
> 
> 1. EEH mechanism could ensure that a given pci device can only shoot
> the MSIs assigned for its PE. So it would not do harm to other memory
> space when the guest write a garbage MSI-X address/data to the vector
> table
> if we passthough MSI-X tables to guest.

Interrupt remapping does the same on x86.

> 2. The guest kernel would not write to MSI-X table on PPC64 platform
> when device drivers call pci_enable_msix() to initialize MSI-X
> interrupts.

This is irrelevant to the vfio API.  vfio is a userspace driver
interface, QEMU is just one possible consumer of the interface.  Even
in the case of PPC64 & QEMU, the guest is still capable of writing to
the vector table, it just probably won't.

> So I think it is safe to mmap/passthrough MSI-X table on PPC64
> platform.
> And I'm not sure whether other architectures can ensure these two 
> points. 

There is another consideration, which is the API exposed to the user.
 vfio currently enforces interrupt setup through ioctls by making the
PCI mechanisms for interrupt programming inaccessible through the
device regions.  Ignoring that you are only focused on PPC64 with QEMU,
does it make sense for the vfio API to allow a user to manipulate
interrupt programming in a way that not only will not work, but in a
way that we expect to fail and require error isolation to recover from?
 I can't say I'm fully convinced that a footnote in the documentation
is sufficient for that.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned
  2015-12-17 10:26     ` yongji xie
@ 2015-12-17 21:46       ` Alex Williamson
  2015-12-18  8:23         ` yongji xie
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2015-12-17 21:46 UTC (permalink / raw)
  To: yongji xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, benh, paulus, mpe, warrier, zhong, nikunj

On Thu, 2015-12-17 at 18:26 +0800, yongji xie wrote:
> 
> On 2015/12/17 4:04, Alex Williamson wrote:
> > On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> > > Current vfio-pci implementation disallows to mmap
> > > sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
> > > page
> > > may be shared with other BARs.
> > > 
> > > But we should allow to mmap these sub-page MMIO BARs if all MMIO
> > > BARs
> > > are page aligned which leads the BARs' mmio page would not be
> > > shared
> > > with other BARs.
> > > 
> > > This patch adds support for this case and we also add a
> > > VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED flag to notify userspace that
> > > platform supports all MMIO BARs to be page aligned.
> > > 
> > > Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> > > ---
> > >   drivers/vfio/pci/vfio_pci.c         |   10 +++++++++-
> > >   drivers/vfio/pci/vfio_pci_private.h |    5 +++++
> > >   include/uapi/linux/vfio.h           |    2 ++
> > >   3 files changed, 16 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/vfio/pci/vfio_pci.c
> > > b/drivers/vfio/pci/vfio_pci.c
> > > index 32b88bd..dbcad99 100644
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -443,6 +443,9 @@ static long vfio_pci_ioctl(void *device_data,
> > >   		if (vdev->reset_works)
> > >   			info.flags |= VFIO_DEVICE_FLAGS_RESET;
> > >   
> > > +		if (vfio_pci_bar_page_aligned())
> > > +			info.flags |=
> > > VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
> > > +
> > >   		info.num_regions = VFIO_PCI_NUM_REGIONS;
> > >   		info.num_irqs = VFIO_PCI_NUM_IRQS;
> > >   
> > > @@ -479,7 +482,8 @@ static long vfio_pci_ioctl(void *device_data,
> > >   				     VFIO_REGION_INFO_FLAG_WRIT
> > > E;
> > >   			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
> > >   			    pci_resource_flags(pdev,
> > > info.index) &
> > > -			    IORESOURCE_MEM && info.size >=
> > > PAGE_SIZE)
> > > +			    IORESOURCE_MEM && (info.size >=
> > > PAGE_SIZE ||
> > > +			    vfio_pci_bar_page_aligned()))
> > >   				info.flags |=
> > > VFIO_REGION_INFO_FLAG_MMAP;
> > >   			break;
> > >   		case VFIO_PCI_ROM_REGION_INDEX:
> > > @@ -855,6 +859,10 @@ static int vfio_pci_mmap(void *device_data,
> > > struct vm_area_struct *vma)
> > >   		return -EINVAL;
> > >   
> > >   	phys_len = pci_resource_len(pdev, index);
> > > +
> > > +	if (vfio_pci_bar_page_aligned())
> > > +		phys_len = PAGE_ALIGN(phys_len);
> > > +
> > >   	req_len = vma->vm_end - vma->vm_start;
> > >   	pgoff = vma->vm_pgoff &
> > >   		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) -
> > > 1);
> > > diff --git a/drivers/vfio/pci/vfio_pci_private.h
> > > b/drivers/vfio/pci/vfio_pci_private.h
> > > index 0e7394f..319352a 100644
> > > --- a/drivers/vfio/pci/vfio_pci_private.h
> > > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > > @@ -69,6 +69,11 @@ struct vfio_pci_device {
> > >   #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) ||
> > > is_msix(vdev)))
> > >   #define irq_is(vdev, type) (vdev->irq_type == type)
> > >   
> > > +static inline bool vfio_pci_bar_page_aligned(void)
> > > +{
> > > +	return IS_ENABLED(CONFIG_PPC64);
> > > +}
> > I really dislike this.  This is a problem for any architecture that
> > runs on larger pages, and even an annoyance on 4k hosts.  Why are
> > we
> > only solving it for PPC64?
> Yes, I know it's a problem for other architectures. But I'm not sure
> if 
> other archs prefer
> to enforce the alignment of all BARs to be at least PAGE_SIZE which 
> would result in
> some waste of address space.
> 
> So I just propose a prototype and add PPC64 support here. And other 
> archs could decide
> to use it or not by themselves.
> > Can't we do something similar in the core PCI code and detect it?
> So you mean we can do it like this:
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index d390fc1..f46c04d 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -320,6 +320,11 @@ static inline resource_size_t 
> pci_resource_alignment(struct pci_dev *dev,
>          return resource_alignment(res);
>   }
> 
> +static inline bool pci_bar_page_aligned(void)
> +{
> +       return IS_ENABLED(CONFIG_PPC64);
> +}
> +
>   void pci_enable_acs(struct pci_dev *dev);
> 
>   struct pci_dev_reset_methods {
> 
> or add a config option to indicate that PCI MMIO BARs should be page 
> aligned? 

Yes, I'm thinking of a boot commandline option, maybe one that PPC64
can default to enabled if it chooses to.  The problem is not unique to
PPC64 and the solution should not be unique either.  I don't want to
need to revisit this for ARM, which we know is going to be similarly
afflicted.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-17 21:41       ` Alex Williamson
@ 2015-12-17 22:48         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2015-12-17 22:48 UTC (permalink / raw)
  To: Alex Williamson, yongji xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, paulus, mpe, warrier, zhong, nikunj

On Thu, 2015-12-17 at 14:41 -0700, Alex Williamson wrote:
> > So I think it is safe to mmap/passthrough MSI-X table on PPC64
> > platform.
> > And I'm not sure whether other architectures can ensure these two 
> > points. 
> 
> There is another consideration, which is the API exposed to the user.
>  vfio currently enforces interrupt setup through ioctls by making the
> PCI mechanisms for interrupt programming inaccessible through the
> device regions.  Ignoring that you are only focused on PPC64 with QEMU,
> does it make sense for the vfio API to allow a user to manipulate
> interrupt programming in a way that not only will not work, but in a
> way that we expect to fail and require error isolation to recover from?
>  I can't say I'm fully convinced that a footnote in the documentation
> is sufficient for that.  Thanks,

Well, one could argue that the "isolation" provided by qemu here is
fairly weak anyway ;-)

I mean. .. how do you know the device doesn't have a backdoor path into
that table via some other MMIO registers anyway ?

In any case, the HW isolation on platforms like pseries means that the
worst the guest can do si shoot itself in the foot. Big deal. On the
other hand, not bothering with intercepting the table has benefits,
such as reducing the memory region clutter, but also removing all the
massive performacne problems we see because adapters have critical
registers in the same 64K page as the MSI-X table.

So I don't think there is any question here that we *need* that
functionality in power. The filtering of the table by Qemu doesn't
provide any practical benefit, it just gets in the way.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned
  2015-12-17 21:46       ` Alex Williamson
@ 2015-12-18  8:23         ` yongji xie
  0 siblings, 0 replies; 15+ messages in thread
From: yongji xie @ 2015-12-18  8:23 UTC (permalink / raw)
  To: Alex Williamson, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: aik, benh, paulus, mpe, warrier, zhong, nikunj



On 2015/12/18 5:46, Alex Williamson wrote:
> On Thu, 2015-12-17 at 18:26 +0800, yongji xie wrote:
>> On 2015/12/17 4:04, Alex Williamson wrote:
>>> On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
>>>> Current vfio-pci implementation disallows to mmap
>>>> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
>>>> page
>>>> may be shared with other BARs.
>>>>
>>>> But we should allow to mmap these sub-page MMIO BARs if all MMIO
>>>> BARs
>>>> are page aligned which leads the BARs' mmio page would not be
>>>> shared
>>>> with other BARs.
>>>>
>>>> This patch adds support for this case and we also add a
>>>> VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED flag to notify userspace that
>>>> platform supports all MMIO BARs to be page aligned.
>>>>
>>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>>>> ---
>>>>    drivers/vfio/pci/vfio_pci.c         |   10 +++++++++-
>>>>    drivers/vfio/pci/vfio_pci_private.h |    5 +++++
>>>>    include/uapi/linux/vfio.h           |    2 ++
>>>>    3 files changed, 16 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c
>>>> b/drivers/vfio/pci/vfio_pci.c
>>>> index 32b88bd..dbcad99 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -443,6 +443,9 @@ static long vfio_pci_ioctl(void *device_data,
>>>>    		if (vdev->reset_works)
>>>>    			info.flags |= VFIO_DEVICE_FLAGS_RESET;
>>>>    
>>>> +		if (vfio_pci_bar_page_aligned())
>>>> +			info.flags |=
>>>> VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
>>>> +
>>>>    		info.num_regions = VFIO_PCI_NUM_REGIONS;
>>>>    		info.num_irqs = VFIO_PCI_NUM_IRQS;
>>>>    
>>>> @@ -479,7 +482,8 @@ static long vfio_pci_ioctl(void *device_data,
>>>>    				     VFIO_REGION_INFO_FLAG_WRIT
>>>> E;
>>>>    			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>>>>    			    pci_resource_flags(pdev,
>>>> info.index) &
>>>> -			    IORESOURCE_MEM && info.size >=
>>>> PAGE_SIZE)
>>>> +			    IORESOURCE_MEM && (info.size >=
>>>> PAGE_SIZE ||
>>>> +			    vfio_pci_bar_page_aligned()))
>>>>    				info.flags |=
>>>> VFIO_REGION_INFO_FLAG_MMAP;
>>>>    			break;
>>>>    		case VFIO_PCI_ROM_REGION_INDEX:
>>>> @@ -855,6 +859,10 @@ static int vfio_pci_mmap(void *device_data,
>>>> struct vm_area_struct *vma)
>>>>    		return -EINVAL;
>>>>    
>>>>    	phys_len = pci_resource_len(pdev, index);
>>>> +
>>>> +	if (vfio_pci_bar_page_aligned())
>>>> +		phys_len = PAGE_ALIGN(phys_len);
>>>> +
>>>>    	req_len = vma->vm_end - vma->vm_start;
>>>>    	pgoff = vma->vm_pgoff &
>>>>    		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) -
>>>> 1);
>>>> diff --git a/drivers/vfio/pci/vfio_pci_private.h
>>>> b/drivers/vfio/pci/vfio_pci_private.h
>>>> index 0e7394f..319352a 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_private.h
>>>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>>>> @@ -69,6 +69,11 @@ struct vfio_pci_device {
>>>>    #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) ||
>>>> is_msix(vdev)))
>>>>    #define irq_is(vdev, type) (vdev->irq_type == type)
>>>>    
>>>> +static inline bool vfio_pci_bar_page_aligned(void)
>>>> +{
>>>> +	return IS_ENABLED(CONFIG_PPC64);
>>>> +}
>>> I really dislike this.  This is a problem for any architecture that
>>> runs on larger pages, and even an annoyance on 4k hosts.  Why are
>>> we
>>> only solving it for PPC64?
>> Yes, I know it's a problem for other architectures. But I'm not sure
>> if
>> other archs prefer
>> to enforce the alignment of all BARs to be at least PAGE_SIZE which
>> would result in
>> some waste of address space.
>>
>> So I just propose a prototype and add PPC64 support here. And other
>> archs could decide
>> to use it or not by themselves.
>>> Can't we do something similar in the core PCI code and detect it?
>> So you mean we can do it like this:
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index d390fc1..f46c04d 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -320,6 +320,11 @@ static inline resource_size_t
>> pci_resource_alignment(struct pci_dev *dev,
>>           return resource_alignment(res);
>>    }
>>
>> +static inline bool pci_bar_page_aligned(void)
>> +{
>> +       return IS_ENABLED(CONFIG_PPC64);
>> +}
>> +
>>    void pci_enable_acs(struct pci_dev *dev);
>>
>>    struct pci_dev_reset_methods {
>>
>> or add a config option to indicate that PCI MMIO BARs should be page
>> aligned?
> Yes, I'm thinking of a boot commandline option, maybe one that PPC64
> can default to enabled if it chooses to.  The problem is not unique to
> PPC64 and the solution should not be unique either.  I don't want to
> need to revisit this for ARM, which we know is going to be similarly
> afflicted.  Thanks,
>
> Alex
>
OK. I will try to fix it by adding a boot commandline option.
It seems to be better than adding a config option. Thanks

Regards
Yongji Xie


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported
  2015-12-17 21:06       ` Alex Williamson
@ 2015-12-18 10:15         ` David Laight
  0 siblings, 0 replies; 15+ messages in thread
From: David Laight @ 2015-12-18 10:15 UTC (permalink / raw)
  To: 'Alex Williamson',
	Yongji Xie, kvm, linux-kernel, linux-api, linuxppc-dev
  Cc: nikunj, zhong, aik, paulus, warrier

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 3564 bytes --]

From: Alex Williamson
> Sent: 17 December 2015 21:07
...
> > Is this all related to the statements in the PCI(e) spec that the
> > MSI-X table and Pending bit array should in their own BARs?
> > (ISTR it even suggests a BAR each.)
> >
> > Since the MSI-X table exists in device memory/registers there is
> > nothing to stop the device modifying the table contents (or even
> > ignoring the contents and writing address+data pairs that are known
> > to reference the CPUs MSI-X interrupt generation logic).
> >
> > We've an fpga based PCIe slave that has some additional PCIe slaves
> > (associated with the interrupt generation logic) that are currently
> > next to the PBA (which is 8k from the MSI-X table).
> > If we can't map the PBA we can't actually raise any interrupts.
> > The same would be true if page size is 64k and mapping the MSI-X
> > table banned.
> >
> > Do we need to change our PCIe slave address map so we don't need
> > to access anything in the same page (which might be 64k were we to
> > target large ppc - which we don't at the moment) as both the
> > MSI-X table and the PBA?
> >
> > I'd also note that being able to read the MSI-X table is a useful
> > diagnostic that the relevant interrupts are enabled properly.
> 
> Yes, the spec requirement is that MSI-X structures must reside in a 4k
> aligned area that doesn't overlap with other configuration registers
> for the device.  It's only an advisement to put them into their own
> BAR, and 4k clearly wasn't as forward looking as we'd hope.  Vfio
> doesn't particularly care about the PBA, but if it resides in the same
> host PAGE_SIZE area as the MSI-X vector table, you currently won't be
> able to get to it.  Most devices are not at all dependent on the PBA
> for any sort of functionality.

Having some hint in the spec as to why these mapping rules might be
needed would have been useful.

> It's really more correct to say that both the vector table and PBA are
> emulated by QEMU than paravirtualized.  Only PPC64 has the guest OS
> taking a paravirtual path to program the vector table, everyone else
> attempts to read/write to the device MMIO space, which gets trapped and
> emulated in QEMU.  This is why the QEMU side patch has further ugly
> hacks to mess with the ordering of MemoryRegions since even if we can
> access and mmap the MSI-X vector table, we'll still trap into QEMU for
> emulation.

Thanks for that explanation.

> How exactly does the ability to map the PBA affect your ability to
> raise an interrupt?

There are other registers for the logic block that converts internal
interrupt requests into the PCIe writes in the locations following the PBA.
These include interrupt enable bits, and the ability to remove pending
interrupt requests (and other stuff for testing the interrupt block).
Yes I know I probably shouldn't have done that, but it all worked.

At least it is better than the previous version of the hardware that
required the driver read back the MSI-X table entries in order to
set up an on-board PTE to convert a 32bit on-board address to the
64bit PCIe address needed for the MSI-X.

I'll 'fix' our board by making both the MSI-X table and PBA area
accessible through one of the other BARs. (Annoyingly this will
be confusing because the BAR offsets will have to differ.)
Note that this makes a complete mockery of disallowing the mapping
in the first place.

	David


ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-12-18 10:17 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-11  8:53 [RFC PATCH 0/3] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table on PPC64 platform Yongji Xie
2015-12-11  8:53 ` [RFC PATCH 1/3] powerpc/pci: Enforce all MMIO BARs to be page aligned Yongji Xie
2015-12-11  8:53 ` [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are " Yongji Xie
2015-12-16 20:04   ` Alex Williamson
2015-12-17 10:26     ` yongji xie
2015-12-17 21:46       ` Alex Williamson
2015-12-18  8:23         ` yongji xie
2015-12-11  8:53 ` [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported Yongji Xie
2015-12-16 20:14   ` Alex Williamson
2015-12-17 10:08     ` David Laight
2015-12-17 21:06       ` Alex Williamson
2015-12-18 10:15         ` David Laight
2015-12-17 10:37     ` yongji xie
2015-12-17 21:41       ` Alex Williamson
2015-12-17 22:48         ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).