All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table
@ 2016-03-07  7:48 Yongji Xie
  2016-03-07  7:48 ` [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment Yongji Xie
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs and MSI-X table. This is because
sub-page BARs' mmio page may be shared with other BARs and MSI-X table
should not be accessed directly from the guest for security reasons.

But these will easily cause some performance issues for mmio accesses
in guest when vfio passthrough sub-page BARs or BARs containing MSI-X
table on PPC64 platform. This is because PAGE_SIZE is 64KB by default
on PPC64 platform and the big page may easily hit the sub-page MMIO
BARs' unmmapping and cause the unmmaping of the mmio page which
MSI-X table locate in, which lead to mmio emulation in host.

For sub-page MMIO BARs' unmmapping, this patchset modifies
resource_alignment kernel parameter to enforce the alignment of all
MMIO BARs to be at least PAGE_SZIE so that sub-page BAR's mmio page
will not be shared with other BARs. Then we can mmap sub-page MMIO BARs
in vfio-pci driver with the modified resource_alignment.

For MSI-X table's unmmapping, we think MSI-X table is safe to access
directly from userspace if PCI host bridge support filtering of MSIs
which can ensure that a given pci device can only shoot the MSIs
assigned for it. So we allow to mmap MSI-X table if
IOMMU_CAP_INTR_REMAP was set. And we add IOMMU_CAP_INTR_REMAP for
IODA host bridge on PPC64 platform.

With this patchset applied, we can get almost 100% improvement on
performance for mmio accesses when we passthrough sub-page BARs to guest
in our test.

The two vfio related patches(patch 5 and patch 6) are based on the
proposed patchset[1].

Changelog v4:
- Rebase on v4.5-rc6 with patchset[1] applied.
- Remove resource_page_aligned kernel parameter
- Fix some problems with resource_alignment kernel parameter
- Modify resource_alignment kernel parameter to support multiple
  devices.
- Remove host bridge attribute: msi_filtered
- Use IOMMU_CAP_INTR_REMAP to check if MSI-X table can be mmapped
- Add IOMMU_CAP_INTR_REMAP for IODA host bridge on PPC64 platform

Changelog v3:
- Rebase on new linux kernel mainline with the patchset[1] applied.
- Add a function to check whether PCI BARs'mmio page is shared with
  other BARs.
- Add a host bridge attribute to indicate PCI host bridge support
  filtering of MSIs.
- Use the new host bridge attribute to check if MSI-X table can
  be mmapped instead of CONFIG_EEH.
- Remove Kconfig option VFIO_PCI_MMAP_MSIX

Changelog v2:
- Rebase on v4.4-rc6 with the patchset[1] applied.
- Use kernel parameter to enforce all MMIO BARs to be page aligned
  on PCI core code instead of doing it on PPC64 arch code.
- Remove flags: VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED
		VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP
- Add a Kconfig option to support for mmapping MSI-X table.

[1] http://www.spinics.net/lists/kvm/msg127812.html

Yongji Xie (7):
  PCI: Add a new option for resource_alignment to reassign alignment
  PCI: Use IORESOURCE_WINDOW to identify bridge resources
  PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set
  PCI: Modify resource_alignment to support multiple devices
  vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set
  powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge

 Documentation/kernel-parameters.txt       |    9 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c |   17 ++++
 drivers/pci/pci.c                         |  126 ++++++++++++++++++++++++-----
 drivers/pci/probe.c                       |    3 +-
 drivers/pci/setup-bus.c                   |   21 ++---
 drivers/vfio/pci/vfio_pci.c               |   15 +++-
 drivers/vfio/pci/vfio_pci_rdwr.c          |    4 +-
 include/linux/pci.h                       |    4 +
 8 files changed, 162 insertions(+), 37 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
@ 2016-03-07  7:48 ` Yongji Xie
  2016-03-10  2:19   ` Alexey Kardashevskiy
  2016-03-07  7:48 ` [RFC PATCH v4 2/7] PCI: Use IORESOURCE_WINDOW to identify bridge resources Yongji Xie
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

When using resource_alignment kernel parameter, the current
implement reassigns the alignment by changing resources' size
which can potentially break some drivers.

So this patch adds a new option "noresize" for the parameter
to solve this problem.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 Documentation/kernel-parameters.txt |    5 ++++-
 drivers/pci/pci.c                   |   36 +++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9a53c92..d8b29ab 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2912,13 +2912,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 				window. The default value is 64 megabytes.
 		resource_alignment=
 				Format:
-				[<order of align>@][<domain>:]<bus>:<slot>.<func>[; ...]
+				[<order of align>@][<domain>:]<bus>:<slot>.<func>
+				[:noresize][; ...]
 				Specifies alignment and device to reassign
 				aligned memory resources.
 				If <order of align> is not specified,
 				PAGE_SIZE is used as alignment.
 				PCI-PCI bridge can be specified, if resource
 				windows need to be expanded.
+				noresize: Don't change the resources' sizes when
+				reassigning alignment.
 		ecrc=		Enable/disable PCIe ECRC (transaction layer
 				end-to-end CRC checking).
 				bios: Use BIOS/firmware settings. This is the
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 602eb42..760cce5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4598,7 +4598,8 @@ static DEFINE_SPINLOCK(resource_alignment_lock);
  * RETURNS: Resource alignment if it is specified.
  *          Zero if it is not specified.
  */
-static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev)
+static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
+		bool *resize)
 {
 	int seg, bus, slot, func, align_order, count;
 	resource_size_t align = 0;
@@ -4626,6 +4627,11 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev)
 			}
 		}
 		p += count;
+		if (!strncmp(p, ":noresize", 9)) {
+			*resize = false;
+			p += 9;
+		} else
+			*resize = true;
 		if (seg == pci_domain_nr(dev->bus) &&
 			bus == dev->bus->number &&
 			slot == PCI_SLOT(dev->devfn) &&
@@ -4658,11 +4664,12 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
 {
 	int i;
 	struct resource *r;
+	bool resize;
 	resource_size_t align, size;
 	u16 command;
 
 	/* check if specified PCI is target device to reassign */
-	align = pci_specified_resource_alignment(dev);
+	align = pci_specified_resource_alignment(dev, &resize);
 	if (!align)
 		return;
 
@@ -4684,15 +4691,24 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
 		if (!(r->flags & IORESOURCE_MEM))
 			continue;
 		size = resource_size(r);
-		if (size < align) {
-			size = align;
-			dev_info(&dev->dev,
-				"Rounding up size of resource #%d to %#llx.\n",
-				i, (unsigned long long)size);
+		if (resize) {
+			if (size < align) {
+				size = align;
+				dev_info(&dev->dev,
+					"Rounding up size of resource #%d to %#llx.\n",
+					i, (unsigned long long)size);
+			}
+			r->flags |= IORESOURCE_UNSET;
+			r->end = size - 1;
+			r->start = 0;
+		} else {
+			if (size > align)
+				align = size;
+			r->flags &= ~IORESOURCE_SIZEALIGN;
+			r->flags |= IORESOURCE_STARTALIGN | IORESOURCE_UNSET;
+			r->start = align;
+			r->end = r->start + size - 1;
 		}
-		r->flags |= IORESOURCE_UNSET;
-		r->end = size - 1;
-		r->start = 0;
 	}
 	/* Need to disable bridge's resource window,
 	 * to enable the kernel to reassign new resource
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v4 2/7] PCI: Use IORESOURCE_WINDOW to identify bridge resources
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
  2016-03-07  7:48 ` [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment Yongji Xie
@ 2016-03-07  7:48 ` Yongji Xie
  2016-03-07  7:48 ` [RFC PATCH v4 3/7] PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set Yongji Xie
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

Now we use the IORESOURCE_STARTALIGN to identify bridge
resources in __assign_resources_sorted(). But there would
be some problems because some PCI devices' resources may
also use IORESOURCE_STARTALIGN, e.g. using "noresize"
option of resource_alignment kernel parameter.

So this patch replaces IORESOURCE_STARTALIGN with
IORESOURCE_WINDOW.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/pci/setup-bus.c |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 7796d0a..4ff10ca 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -411,11 +411,11 @@ static void __assign_resources_sorted(struct list_head *head,
 
 		/*
 		 * There are two kinds of additional resources in the list:
-		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
-		 * 2. SR-IOV resource   -- IORESOURCE_SIZEALIGN
+		 * 1. bridge resource -- IORESOURCE_WINDOW
+		 * 2. SR-IOV resource
 		 * Here just fix the additional alignment for bridge
 		 */
-		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
+		if (!(dev_res->res->flags & IORESOURCE_WINDOW))
 			continue;
 
 		add_align = get_res_add_align(realloc_head, dev_res->res);
@@ -956,7 +956,7 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
 
 	b_res->start = min_align;
 	b_res->end = b_res->start + size0 - 1;
-	b_res->flags |= IORESOURCE_STARTALIGN;
+	b_res->flags |= IORESOURCE_STARTALIGN | IORESOURCE_WINDOW;
 	if (size1 > size0 && realloc_head) {
 		add_to_list(realloc_head, bus->self, b_res, size1-size0,
 			    min_align);
@@ -1104,7 +1104,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	}
 	b_res->start = min_align;
 	b_res->end = size0 + min_align - 1;
-	b_res->flags |= IORESOURCE_STARTALIGN;
+	b_res->flags |= IORESOURCE_STARTALIGN | IORESOURCE_WINDOW;
 	if (size1 > size0 && realloc_head) {
 		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
 		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx add_align %llx\n",
@@ -1140,7 +1140,8 @@ static void pci_bus_size_cardbus(struct pci_bus *bus,
 	 */
 	b_res[0].start = pci_cardbus_io_size;
 	b_res[0].end = b_res[0].start + pci_cardbus_io_size - 1;
-	b_res[0].flags |= IORESOURCE_IO | IORESOURCE_STARTALIGN;
+	b_res[0].flags |= IORESOURCE_IO | IORESOURCE_STARTALIGN |
+			  IORESOURCE_WINDOW;
 	if (realloc_head) {
 		b_res[0].end -= pci_cardbus_io_size;
 		add_to_list(realloc_head, bridge, b_res, pci_cardbus_io_size,
@@ -1152,7 +1153,8 @@ handle_b_res_1:
 		goto handle_b_res_2;
 	b_res[1].start = pci_cardbus_io_size;
 	b_res[1].end = b_res[1].start + pci_cardbus_io_size - 1;
-	b_res[1].flags |= IORESOURCE_IO | IORESOURCE_STARTALIGN;
+	b_res[1].flags |= IORESOURCE_IO | IORESOURCE_STARTALIGN |
+			  IORESOURCE_WINDOW;
 	if (realloc_head) {
 		b_res[1].end -= pci_cardbus_io_size;
 		add_to_list(realloc_head, bridge, b_res+1, pci_cardbus_io_size,
@@ -1190,7 +1192,7 @@ handle_b_res_2:
 		b_res[2].start = pci_cardbus_mem_size;
 		b_res[2].end = b_res[2].start + pci_cardbus_mem_size - 1;
 		b_res[2].flags |= IORESOURCE_MEM | IORESOURCE_PREFETCH |
-				  IORESOURCE_STARTALIGN;
+				  IORESOURCE_STARTALIGN | IORESOURCE_WINDOW;
 		if (realloc_head) {
 			b_res[2].end -= pci_cardbus_mem_size;
 			add_to_list(realloc_head, bridge, b_res+2,
@@ -1206,7 +1208,8 @@ handle_b_res_3:
 		goto handle_done;
 	b_res[3].start = pci_cardbus_mem_size;
 	b_res[3].end = b_res[3].start + b_res_3_size - 1;
-	b_res[3].flags |= IORESOURCE_MEM | IORESOURCE_STARTALIGN;
+	b_res[3].flags |= IORESOURCE_MEM | IORESOURCE_STARTALIGN |
+			  IORESOURCE_WINDOW;
 	if (realloc_head) {
 		b_res[3].end -= b_res_3_size;
 		add_to_list(realloc_head, bridge, b_res+3, b_res_3_size,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v4 3/7] PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
  2016-03-07  7:48 ` [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment Yongji Xie
  2016-03-07  7:48 ` [RFC PATCH v4 2/7] PCI: Use IORESOURCE_WINDOW to identify bridge resources Yongji Xie
@ 2016-03-07  7:48 ` Yongji Xie
  2016-03-16 16:31   ` Alex Williamson
  2016-03-07  7:48 ` [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices Yongji Xie
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

The resource_alignment will releases memory resources
allocated by firmware so that kernel can reassign new
resources later on. But this will cause the problem
that no resources can be allocated by kernel if
PCI_PROBE_ONLY was set, e.g. on pSeries platform
because PCI_PROBE_ONLY force kernel to use firmware
setup and not to reassign any resources.

To solve this problem, this patch ignores
resource_alignment if PCI_PROBE_ONLY was set.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 Documentation/kernel-parameters.txt |    2 ++
 drivers/pci/probe.c                 |    3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index d8b29ab..8028631 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2922,6 +2922,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 				windows need to be expanded.
 				noresize: Don't change the resources' sizes when
 				reassigning alignment.
+				Note that this option will not work if
+				PCI_PROBE_ONLY is set.
 		ecrc=		Enable/disable PCIe ECRC (transaction layer
 				end-to-end CRC checking).
 				bios: Use BIOS/firmware settings. This is the
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 6d7ab9b..bc31cad 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1719,7 +1719,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	pci_fixup_device(pci_fixup_header, dev);
 
 	/* moved out from quirk header fixup code */
-	pci_reassigndev_resource_alignment(dev);
+	if (!pci_has_flag(PCI_PROBE_ONLY))
+		pci_reassigndev_resource_alignment(dev);
 
 	/* Clear the state_saved flag. */
 	dev->state_saved = false;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
                   ` (2 preceding siblings ...)
  2016-03-07  7:48 ` [RFC PATCH v4 3/7] PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set Yongji Xie
@ 2016-03-07  7:48 ` Yongji Xie
  2016-03-16 16:30   ` Alex Williamson
  2016-03-07  7:48 ` [RFC PATCH v4 5/7] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Yongji Xie
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

When vfio passthrough a PCI device of which MMIO BARs
are smaller than PAGE_SIZE, guest will not handle the
mmio accesses to the BARs which leads to mmio emulations
in host.

This is because vfio will not allow to passthrough one
BAR's mmio page which may be shared with other BARs.

To solve this performance issue, this patch modifies
resource_alignment to support syntax where multiple
devices get the same alignment. So we can use something
like "pci=resource_alignment=*:*:*.*:noresize" to
enforce the alignment of all MMIO BARs to be at least
PAGE_SIZE so that one BAR's mmio page would not be
shared with other BARs.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 Documentation/kernel-parameters.txt |    2 +
 drivers/pci/pci.c                   |   90 ++++++++++++++++++++++++++++++-----
 include/linux/pci.h                 |    4 ++
 3 files changed, 85 insertions(+), 11 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 8028631..74b38ab 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 				aligned memory resources.
 				If <order of align> is not specified,
 				PAGE_SIZE is used as alignment.
+				<domain>, <bus>, <slot> and <func> can be set to
+				"*" which means match all values.
 				PCI-PCI bridge can be specified, if resource
 				windows need to be expanded.
 				noresize: Don't change the resources' sizes when
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 760cce5..44ab59f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -102,6 +102,8 @@ unsigned int pcibios_max_latency = 255;
 /* If set, the PCIe ARI capability will not be used. */
 static bool pcie_ari_disabled;
 
+bool pci_resources_page_aligned;
+
 /**
  * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
  * @bus: pointer to PCI bus structure to search
@@ -4604,6 +4606,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
 	int seg, bus, slot, func, align_order, count;
 	resource_size_t align = 0;
 	char *p;
+	bool invalid = false;
 
 	spin_lock(&resource_alignment_lock);
 	p = resource_alignment_param;
@@ -4615,16 +4618,49 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
 		} else {
 			align_order = -1;
 		}
-		if (sscanf(p, "%x:%x:%x.%x%n",
-			&seg, &bus, &slot, &func, &count) != 4) {
+		if (p[0] == '*' && p[1] == ':') {
+			seg = -1;
+			count = 1;
+		} else if (sscanf(p, "%x%n", &seg, &count) != 1 ||
+				p[count] != ':') {
+			invalid = true;
+			break;
+		}
+		p += count + 1;
+		if (*p == '*') {
+			bus = -1;
+			count = 1;
+		} else if (sscanf(p, "%x%n", &bus, &count) != 1) {
+			invalid = true;
+			break;
+		}
+		p += count;
+		if (*p == '.') {
+			slot = bus;
+			bus = seg;
 			seg = 0;
-			if (sscanf(p, "%x:%x.%x%n",
-					&bus, &slot, &func, &count) != 3) {
-				/* Invalid format */
-				printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n",
-					p);
+			p++;
+		} else if (*p == ':') {
+			p++;
+			if (p[0] == '*' && p[1] == '.') {
+				slot = -1;
+				count = 1;
+			} else if (sscanf(p, "%x%n", &slot, &count) != 1 ||
+					p[count] != '.') {
+				invalid = true;
 				break;
 			}
+			p += count + 1;
+		} else {
+			invalid = true;
+			break;
+		}
+		if (*p == '*') {
+			func = -1;
+			count = 1;
+		} else if (sscanf(p, "%x%n", &func, &count) != 1) {
+			invalid = true;
+			break;
 		}
 		p += count;
 		if (!strncmp(p, ":noresize", 9)) {
@@ -4632,23 +4668,34 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
 			p += 9;
 		} else
 			*resize = true;
-		if (seg == pci_domain_nr(dev->bus) &&
-			bus == dev->bus->number &&
-			slot == PCI_SLOT(dev->devfn) &&
-			func == PCI_FUNC(dev->devfn)) {
+		if ((seg == pci_domain_nr(dev->bus) || seg == -1) &&
+			(bus == dev->bus->number || bus == -1) &&
+			(slot == PCI_SLOT(dev->devfn) || slot == -1) &&
+			(func == PCI_FUNC(dev->devfn) || func == -1)) {
 			if (align_order == -1)
 				align = PAGE_SIZE;
 			else
 				align = 1 << align_order;
+			if (!pci_resources_page_aligned &&
+				(align >= PAGE_SIZE &&
+				seg == -1 && bus == -1 &&
+				slot == -1 && func == -1))
+				pci_resources_page_aligned = true;
 			/* Found */
 			break;
 		}
 		if (*p != ';' && *p != ',') {
 			/* End of param or invalid format */
+			invalid = true;
 			break;
 		}
 		p++;
 	}
+	if (invalid == true) {
+		/* Invalid format */
+		printk(KERN_ERR "PCI: Can't parse resource_alignment parameter:%s\n",
+				p);
+	}
 	spin_unlock(&resource_alignment_lock);
 	return align;
 }
@@ -4769,6 +4816,27 @@ static int __init pci_resource_alignment_sysfs_init(void)
 }
 late_initcall(pci_resource_alignment_sysfs_init);
 
+/*
+ * This function checks whether PCI BARs' mmio page will be shared
+ * with other BARs.
+ */
+bool pci_resources_share_page(struct pci_dev *dev, int resno)
+{
+	struct resource *res = dev->resource + resno;
+
+	if (resource_size(res) >= PAGE_SIZE)
+		return false;
+	if (pci_resources_page_aligned && !(res->start & ~PAGE_MASK) &&
+		res->flags & IORESOURCE_MEM) {
+		if (res->sibling)
+			return (res->sibling->start & ~PAGE_MASK);
+		else
+			return false;
+	}
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_resources_share_page);
+
 static void pci_no_domains(void)
 {
 #ifdef CONFIG_PCI_DOMAINS
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2771625..064a1b6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1511,6 +1511,10 @@ static inline int pci_get_new_domain_nr(void) { return -ENOSYS; }
 	 (pci_resource_end((dev), (bar)) -		\
 	  pci_resource_start((dev), (bar)) + 1))
 
+extern bool pci_resources_page_aligned;
+
+bool pci_resources_share_page(struct pci_dev *dev, int resno);
+
 /* Similar to the helpers above, these manipulate per-pci_dev
  * driver-specific data.  They are really just a wrapper around
  * the generic device structure functions of these calls.
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v4 5/7] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
                   ` (3 preceding siblings ...)
  2016-03-07  7:48 ` [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices Yongji Xie
@ 2016-03-07  7:48 ` Yongji Xie
  2016-03-16 16:30   ` Alex Williamson
  2016-03-07  7:48 ` [RFC PATCH v4 6/7] vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set Yongji Xie
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
page may be shared with other BARs.

But we should allow to mmap these sub-page MMIO BARs if PCI
resource allocator can make sure these BARs' mmio page will
not be shared with other BARs.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 1ce1d36..49d7a69 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -589,7 +589,8 @@ static long vfio_pci_ioctl(void *device_data,
 				     VFIO_REGION_INFO_FLAG_WRITE;
 			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
 			    pci_resource_flags(pdev, info.index) &
-			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
+			    IORESOURCE_MEM && !pci_resources_share_page(pdev,
+			    info.index)) {
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
 				if (info.index == vdev->msix_bar) {
 					ret = msix_sparse_mmap_cap(vdev, &caps);
@@ -1016,6 +1017,10 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	phys_len = pci_resource_len(pdev, index);
+
+	if (!pci_resources_share_page(pdev, index))
+		phys_len = PAGE_ALIGN(phys_len);
+
 	req_len = vma->vm_end - vma->vm_start;
 	pgoff = vma->vm_pgoff &
 		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v4 6/7] vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
                   ` (4 preceding siblings ...)
  2016-03-07  7:48 ` [RFC PATCH v4 5/7] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Yongji Xie
@ 2016-03-07  7:48 ` Yongji Xie
  2016-03-16 16:31   ` Alex Williamson
  2016-03-07  7:48 ` [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge Yongji Xie
  2016-03-16 10:51 ` [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
  7 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

Current vfio-pci implementation disallows to mmap MSI-X
table in case that user get to touch this directly.

But we should allow to mmap these MSI-X tables if IOMMU
supports interrupt remapping which can ensure that a
given pci device can only shoot the MSIs assigned for it.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c      |    8 +++++---
 drivers/vfio/pci/vfio_pci_rdwr.c |    4 +++-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 49d7a69..d6f4788 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -592,13 +592,14 @@ static long vfio_pci_ioctl(void *device_data,
 			    IORESOURCE_MEM && !pci_resources_share_page(pdev,
 			    info.index)) {
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
-				if (info.index == vdev->msix_bar) {
+				if (!iommu_capable(pdev->dev.bus,
+					IOMMU_CAP_INTR_REMAP) &&
+					info.index == vdev->msix_bar) {
 					ret = msix_sparse_mmap_cap(vdev, &caps);
 					if (ret)
 						return ret;
 				}
 			}
-
 			break;
 		case VFIO_PCI_ROM_REGION_INDEX:
 		{
@@ -1029,7 +1030,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
 		return -EINVAL;
 
-	if (index == vdev->msix_bar) {
+	if (!iommu_capable(pdev->dev.bus, IOMMU_CAP_INTR_REMAP) &&
+		index == vdev->msix_bar) {
 		/*
 		 * Disallow mmaps overlapping the MSI-X table; users don't
 		 * get to touch this directly.  We could find somewhere
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9..1c46c29 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/iommu.h>
 
 #include "vfio_pci_private.h"
 
@@ -164,7 +165,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	} else
 		io = vdev->barmap[bar];
 
-	if (bar == vdev->msix_bar) {
+	if (!iommu_capable(pdev->dev.bus, IOMMU_CAP_INTR_REMAP) &&
+		bar == vdev->msix_bar) {
 		x_start = vdev->msix_offset;
 		x_end = vdev->msix_offset + vdev->msix_size;
 	}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
                   ` (5 preceding siblings ...)
  2016-03-07  7:48 ` [RFC PATCH v4 6/7] vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set Yongji Xie
@ 2016-03-07  7:48 ` Yongji Xie
  2016-03-16 16:32   ` Alex Williamson
  2016-03-16 10:51 ` [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
  7 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-07  7:48 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj, Yongji Xie

This patch adds IOMMU_CAP_INTR_REMAP for IODA host bridge so that
we can mmap MSI-X table in vfio driver.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f90dc04..f01b9ab 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1955,6 +1955,20 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.free = pnv_ioda2_table_free,
 };
 
+static bool pnv_ioda_iommu_capable(enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_INTR_REMAP:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static struct iommu_ops pnv_ioda_iommu_ops = {
+	.capable = pnv_ioda_iommu_capable,
+};
+
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -3078,6 +3092,9 @@ static void pnv_pci_ioda_fixup(void)
 
 	/* Link NPU IODA tables to their PCI devices. */
 	pnv_npu_ioda_fixup();
+
+	/* Add IOMMU_CAP_INTR_REMAP */
+	bus_set_iommu(&pci_bus_type, &pnv_ioda_iommu_ops);
 }
 
 /*
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment
  2016-03-07  7:48 ` [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment Yongji Xie
@ 2016-03-10  2:19   ` Alexey Kardashevskiy
  2016-03-10  4:47     ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-10  2:19 UTC (permalink / raw)
  To: Yongji Xie, kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, alex.williamson, benh, paulus, mpe, warrier,
	zhong, nikunj

On 03/07/2016 06:48 PM, Yongji Xie wrote:
> When using resource_alignment kernel parameter, the current
> implement reassigns the alignment by changing resources' size
> which can potentially break some drivers.

How can this possibly break any driver?... It rounds up, not down, what do 
I miss here?

>
> So this patch adds a new option "noresize" for the parameter
> to solve this problem.
>
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>   Documentation/kernel-parameters.txt |    5 ++++-
>   drivers/pci/pci.c                   |   36 +++++++++++++++++++++++++----------
>   2 files changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 9a53c92..d8b29ab 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2912,13 +2912,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>   				window. The default value is 64 megabytes.
>   		resource_alignment=
>   				Format:
> -				[<order of align>@][<domain>:]<bus>:<slot>.<func>[; ...]
> +				[<order of align>@][<domain>:]<bus>:<slot>.<func>
> +				[:noresize][; ...]
>   				Specifies alignment and device to reassign
>   				aligned memory resources.
>   				If <order of align> is not specified,
>   				PAGE_SIZE is used as alignment.
>   				PCI-PCI bridge can be specified, if resource
>   				windows need to be expanded.
> +				noresize: Don't change the resources' sizes when
> +				reassigning alignment.
>   		ecrc=		Enable/disable PCIe ECRC (transaction layer
>   				end-to-end CRC checking).
>   				bios: Use BIOS/firmware settings. This is the
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 602eb42..760cce5 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4598,7 +4598,8 @@ static DEFINE_SPINLOCK(resource_alignment_lock);
>    * RETURNS: Resource alignment if it is specified.
>    *          Zero if it is not specified.
>    */
> -static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev)
> +static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
> +		bool *resize)
>   {
>   	int seg, bus, slot, func, align_order, count;
>   	resource_size_t align = 0;
> @@ -4626,6 +4627,11 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev)
>   			}
>   		}
>   		p += count;
> +		if (!strncmp(p, ":noresize", 9)) {
> +			*resize = false;
> +			p += 9;
> +		} else
> +			*resize = true;
>   		if (seg == pci_domain_nr(dev->bus) &&
>   			bus == dev->bus->number &&
>   			slot == PCI_SLOT(dev->devfn) &&
> @@ -4658,11 +4664,12 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
>   {
>   	int i;
>   	struct resource *r;
> +	bool resize;
>   	resource_size_t align, size;
>   	u16 command;
>
>   	/* check if specified PCI is target device to reassign */
> -	align = pci_specified_resource_alignment(dev);
> +	align = pci_specified_resource_alignment(dev, &resize);

A compiler should have warned here about passing a pointer to unitialized 
@resize.


>   	if (!align)
>   		return;
>
> @@ -4684,15 +4691,24 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
>   		if (!(r->flags & IORESOURCE_MEM))
>   			continue;
>   		size = resource_size(r);
> -		if (size < align) {
> -			size = align;
> -			dev_info(&dev->dev,
> -				"Rounding up size of resource #%d to %#llx.\n",
> -				i, (unsigned long long)size);
> +		if (resize) {
> +			if (size < align) {
> +				size = align;
> +				dev_info(&dev->dev,
> +					"Rounding up size of resource #%d to %#llx.\n",
> +					i, (unsigned long long)size);
> +			}
> +			r->flags |= IORESOURCE_UNSET;
> +			r->end = size - 1;
> +			r->start = 0;
> +		} else {
> +			if (size > align)
> +				align = size;
> +			r->flags &= ~IORESOURCE_SIZEALIGN;
> +			r->flags |= IORESOURCE_STARTALIGN | IORESOURCE_UNSET;
> +			r->start = align;
> +			r->end = r->start + size - 1;
>   		}
> -		r->flags |= IORESOURCE_UNSET;
> -		r->end = size - 1;
> -		r->start = 0;
>   	}
>   	/* Need to disable bridge's resource window,
>   	 * to enable the kernel to reassign new resource
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment
  2016-03-10  2:19   ` Alexey Kardashevskiy
@ 2016-03-10  4:47     ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-10  4:47 UTC (permalink / raw)
  To: Alexey Kardashevskiy, kvm, linux-kernel, linux-pci, linuxppc-dev,
	linux-doc
  Cc: bhelgaas, corbet, alex.williamson, benh, paulus, mpe, warrier,
	zhong, nikunj

On 2016/3/10 10:19, Alexey Kardashevskiy wrote:
> On 03/07/2016 06:48 PM, Yongji Xie wrote:
>> When using resource_alignment kernel parameter, the current
>> implement reassigns the alignment by changing resources' size
>> which can potentially break some drivers.
>
> How can this possibly break any driver?... It rounds up, not down, 
> what do I miss here?
>

If the driver uses the size to locate some register, e.g. the length of
register is related to the size,  or get some other information of the
device. The wrong size may break this driver.

So I think it's better not to touch the size here.  The resource_size()
should be the real size.

>>
>> So this patch adds a new option "noresize" for the parameter
>> to solve this problem.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   Documentation/kernel-parameters.txt |    5 ++++-
>>   drivers/pci/pci.c                   |   36 
>> +++++++++++++++++++++++++----------
>>   2 files changed, 30 insertions(+), 11 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt 
>> b/Documentation/kernel-parameters.txt
>> index 9a53c92..d8b29ab 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2912,13 +2912,16 @@ bytes respectively. Such letter suffixes can 
>> also be entirely omitted.
>>                   window. The default value is 64 megabytes.
>>           resource_alignment=
>>                   Format:
>> -                [<order of align>@][<domain>:]<bus>:<slot>.<func>[; 
>> ...]
>> +                [<order of align>@][<domain>:]<bus>:<slot>.<func>
>> +                [:noresize][; ...]
>>                   Specifies alignment and device to reassign
>>                   aligned memory resources.
>>                   If <order of align> is not specified,
>>                   PAGE_SIZE is used as alignment.
>>                   PCI-PCI bridge can be specified, if resource
>>                   windows need to be expanded.
>> +                noresize: Don't change the resources' sizes when
>> +                reassigning alignment.
>>           ecrc=        Enable/disable PCIe ECRC (transaction layer
>>                   end-to-end CRC checking).
>>                   bios: Use BIOS/firmware settings. This is the
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 602eb42..760cce5 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -4598,7 +4598,8 @@ static DEFINE_SPINLOCK(resource_alignment_lock);
>>    * RETURNS: Resource alignment if it is specified.
>>    *          Zero if it is not specified.
>>    */
>> -static resource_size_t pci_specified_resource_alignment(struct 
>> pci_dev *dev)
>> +static resource_size_t pci_specified_resource_alignment(struct 
>> pci_dev *dev,
>> +        bool *resize)
>>   {
>>       int seg, bus, slot, func, align_order, count;
>>       resource_size_t align = 0;
>> @@ -4626,6 +4627,11 @@ static resource_size_t 
>> pci_specified_resource_alignment(struct pci_dev *dev)
>>               }
>>           }
>>           p += count;
>> +        if (!strncmp(p, ":noresize", 9)) {
>> +            *resize = false;
>> +            p += 9;
>> +        } else
>> +            *resize = true;
>>           if (seg == pci_domain_nr(dev->bus) &&
>>               bus == dev->bus->number &&
>>               slot == PCI_SLOT(dev->devfn) &&
>> @@ -4658,11 +4664,12 @@ void 
>> pci_reassigndev_resource_alignment(struct pci_dev *dev)
>>   {
>>       int i;
>>       struct resource *r;
>> +    bool resize;
>>       resource_size_t align, size;
>>       u16 command;
>>
>>       /* check if specified PCI is target device to reassign */
>> -    align = pci_specified_resource_alignment(dev);
>> +    align = pci_specified_resource_alignment(dev, &resize);
>
> A compiler should have warned here about passing a pointer to 
> unitialized @resize.
>

OK. I'll fix it.

Thanks,
Yongji Xie

>>       if (!align)
>>           return;
>>
>> @@ -4684,15 +4691,24 @@ void 
>> pci_reassigndev_resource_alignment(struct pci_dev *dev)
>>           if (!(r->flags & IORESOURCE_MEM))
>>               continue;
>>           size = resource_size(r);
>> -        if (size < align) {
>> -            size = align;
>> -            dev_info(&dev->dev,
>> -                "Rounding up size of resource #%d to %#llx.\n",
>> -                i, (unsigned long long)size);
>> +        if (resize) {
>> +            if (size < align) {
>> +                size = align;
>> +                dev_info(&dev->dev,
>> +                    "Rounding up size of resource #%d to %#llx.\n",
>> +                    i, (unsigned long long)size);
>> +            }
>> +            r->flags |= IORESOURCE_UNSET;
>> +            r->end = size - 1;
>> +            r->start = 0;
>> +        } else {
>> +            if (size > align)
>> +                align = size;
>> +            r->flags &= ~IORESOURCE_SIZEALIGN;
>> +            r->flags |= IORESOURCE_STARTALIGN | IORESOURCE_UNSET;
>> +            r->start = align;
>> +            r->end = r->start + size - 1;
>>           }
>> -        r->flags |= IORESOURCE_UNSET;
>> -        r->end = size - 1;
>> -        r->start = 0;
>>       }
>>       /* Need to disable bridge's resource window,
>>        * to enable the kernel to reassign new resource
>>
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table
  2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
                   ` (6 preceding siblings ...)
  2016-03-07  7:48 ` [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge Yongji Xie
@ 2016-03-16 10:51 ` Yongji Xie
  2016-03-16 14:10   ` Bjorn Helgaas
  7 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-16 10:51 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc
  Cc: bhelgaas, corbet, aik, alex.williamson, benh, paulus, mpe,
	warrier, zhong, nikunj


Ping.

On 2016/3/7 15:48, Yongji Xie wrote:
> Current vfio-pci implementation disallows to mmap
> sub-page(size < PAGE_SIZE) MMIO BARs and MSI-X table. This is because
> sub-page BARs' mmio page may be shared with other BARs and MSI-X table
> should not be accessed directly from the guest for security reasons.
>
> But these will easily cause some performance issues for mmio accesses
> in guest when vfio passthrough sub-page BARs or BARs containing MSI-X
> table on PPC64 platform. This is because PAGE_SIZE is 64KB by default
> on PPC64 platform and the big page may easily hit the sub-page MMIO
> BARs' unmmapping and cause the unmmaping of the mmio page which
> MSI-X table locate in, which lead to mmio emulation in host.
>
> For sub-page MMIO BARs' unmmapping, this patchset modifies
> resource_alignment kernel parameter to enforce the alignment of all
> MMIO BARs to be at least PAGE_SZIE so that sub-page BAR's mmio page
> will not be shared with other BARs. Then we can mmap sub-page MMIO BARs
> in vfio-pci driver with the modified resource_alignment.
>
> For MSI-X table's unmmapping, we think MSI-X table is safe to access
> directly from userspace if PCI host bridge support filtering of MSIs
> which can ensure that a given pci device can only shoot the MSIs
> assigned for it. So we allow to mmap MSI-X table if
> IOMMU_CAP_INTR_REMAP was set. And we add IOMMU_CAP_INTR_REMAP for
> IODA host bridge on PPC64 platform.
>
> With this patchset applied, we can get almost 100% improvement on
> performance for mmio accesses when we passthrough sub-page BARs to guest
> in our test.
>
> The two vfio related patches(patch 5 and patch 6) are based on the
> proposed patchset[1].
>
> Changelog v4:
> - Rebase on v4.5-rc6 with patchset[1] applied.
> - Remove resource_page_aligned kernel parameter
> - Fix some problems with resource_alignment kernel parameter
> - Modify resource_alignment kernel parameter to support multiple
>    devices.
> - Remove host bridge attribute: msi_filtered
> - Use IOMMU_CAP_INTR_REMAP to check if MSI-X table can be mmapped
> - Add IOMMU_CAP_INTR_REMAP for IODA host bridge on PPC64 platform
>
> Changelog v3:
> - Rebase on new linux kernel mainline with the patchset[1] applied.
> - Add a function to check whether PCI BARs'mmio page is shared with
>    other BARs.
> - Add a host bridge attribute to indicate PCI host bridge support
>    filtering of MSIs.
> - Use the new host bridge attribute to check if MSI-X table can
>    be mmapped instead of CONFIG_EEH.
> - Remove Kconfig option VFIO_PCI_MMAP_MSIX
>
> Changelog v2:
> - Rebase on v4.4-rc6 with the patchset[1] applied.
> - Use kernel parameter to enforce all MMIO BARs to be page aligned
>    on PCI core code instead of doing it on PPC64 arch code.
> - Remove flags: VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED
> 		VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP
> - Add a Kconfig option to support for mmapping MSI-X table.
>
> [1] http://www.spinics.net/lists/kvm/msg127812.html
>
> Yongji Xie (7):
>    PCI: Add a new option for resource_alignment to reassign alignment
>    PCI: Use IORESOURCE_WINDOW to identify bridge resources
>    PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set
>    PCI: Modify resource_alignment to support multiple devices
>    vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
>    vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set
>    powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
>
>   Documentation/kernel-parameters.txt       |    9 ++-
>   arch/powerpc/platforms/powernv/pci-ioda.c |   17 ++++
>   drivers/pci/pci.c                         |  126 ++++++++++++++++++++++++-----
>   drivers/pci/probe.c                       |    3 +-
>   drivers/pci/setup-bus.c                   |   21 ++---
>   drivers/vfio/pci/vfio_pci.c               |   15 +++-
>   drivers/vfio/pci/vfio_pci_rdwr.c          |    4 +-
>   include/linux/pci.h                       |    4 +
>   8 files changed, 162 insertions(+), 37 deletions(-)
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table
  2016-03-16 10:51 ` [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
@ 2016-03-16 14:10   ` Bjorn Helgaas
  2016-03-17 10:46     ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2016-03-16 14:10 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, alex.williamson, benh, paulus, mpe, warrier, zhong,
	nikunj

On Wed, Mar 16, 2016 at 06:51:56PM +0800, Yongji Xie wrote:
> 
> Ping.

This is mainly VFIO stuff, and Alex had some security concerns, so I'm
not going to spend much time looking at this until he's satisfied.

When I do, I'll be looking hard at the resource_alignment kernel
parameter.  I'm opposed to kernel parameters in general because
they're very difficult for users to use correctly, and they lead to
kernel code paths that are rarely tested and hard to maintain.  So
I'll be looking for an excuse to reject changes in that area.

The changelog for 2/7 says it "replaces IORESOURCE_STARTALIGN with
IORESOURCE_WINDOW."  But even a glance at the patch itself shows that
IORESOURCE_WINDOW is *added* to some places, and it doesn't *replace*
IORESOURCE_STARTALIGN.

The changelog for 4/7 says:

  This is because vfio will not allow to passthrough one BAR's mmio
  page which may be shared with other BARs.  To solve this performance
  issue ...

with no mention at all of the actual *reason* vfio doesn't allow that
passthrough.  If I understand correctly, that reason has to do with
security, so your justification must be much stronger than "solving a
performance issue."

> On 2016/3/7 15:48, Yongji Xie wrote:
> >Current vfio-pci implementation disallows to mmap
> >sub-page(size < PAGE_SIZE) MMIO BARs and MSI-X table. This is because
> >sub-page BARs' mmio page may be shared with other BARs and MSI-X table
> >should not be accessed directly from the guest for security reasons.
> >
> >But these will easily cause some performance issues for mmio accesses
> >in guest when vfio passthrough sub-page BARs or BARs containing MSI-X
> >table on PPC64 platform. This is because PAGE_SIZE is 64KB by default
> >on PPC64 platform and the big page may easily hit the sub-page MMIO
> >BARs' unmmapping and cause the unmmaping of the mmio page which
> >MSI-X table locate in, which lead to mmio emulation in host.
> >
> >For sub-page MMIO BARs' unmmapping, this patchset modifies
> >resource_alignment kernel parameter to enforce the alignment of all
> >MMIO BARs to be at least PAGE_SZIE so that sub-page BAR's mmio page
> >will not be shared with other BARs. Then we can mmap sub-page MMIO BARs
> >in vfio-pci driver with the modified resource_alignment.
> >
> >For MSI-X table's unmmapping, we think MSI-X table is safe to access
> >directly from userspace if PCI host bridge support filtering of MSIs
> >which can ensure that a given pci device can only shoot the MSIs
> >assigned for it. So we allow to mmap MSI-X table if
> >IOMMU_CAP_INTR_REMAP was set. And we add IOMMU_CAP_INTR_REMAP for
> >IODA host bridge on PPC64 platform.
> >
> >With this patchset applied, we can get almost 100% improvement on
> >performance for mmio accesses when we passthrough sub-page BARs to guest
> >in our test.
> >
> >The two vfio related patches(patch 5 and patch 6) are based on the
> >proposed patchset[1].
> >
> >Changelog v4:
> >- Rebase on v4.5-rc6 with patchset[1] applied.
> >- Remove resource_page_aligned kernel parameter
> >- Fix some problems with resource_alignment kernel parameter
> >- Modify resource_alignment kernel parameter to support multiple
> >   devices.
> >- Remove host bridge attribute: msi_filtered
> >- Use IOMMU_CAP_INTR_REMAP to check if MSI-X table can be mmapped
> >- Add IOMMU_CAP_INTR_REMAP for IODA host bridge on PPC64 platform
> >
> >Changelog v3:
> >- Rebase on new linux kernel mainline with the patchset[1] applied.
> >- Add a function to check whether PCI BARs'mmio page is shared with
> >   other BARs.
> >- Add a host bridge attribute to indicate PCI host bridge support
> >   filtering of MSIs.
> >- Use the new host bridge attribute to check if MSI-X table can
> >   be mmapped instead of CONFIG_EEH.
> >- Remove Kconfig option VFIO_PCI_MMAP_MSIX
> >
> >Changelog v2:
> >- Rebase on v4.4-rc6 with the patchset[1] applied.
> >- Use kernel parameter to enforce all MMIO BARs to be page aligned
> >   on PCI core code instead of doing it on PPC64 arch code.
> >- Remove flags: VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED
> >		VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP
> >- Add a Kconfig option to support for mmapping MSI-X table.
> >
> >[1] http://www.spinics.net/lists/kvm/msg127812.html
> >
> >Yongji Xie (7):
> >   PCI: Add a new option for resource_alignment to reassign alignment
> >   PCI: Use IORESOURCE_WINDOW to identify bridge resources
> >   PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set
> >   PCI: Modify resource_alignment to support multiple devices
> >   vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
> >   vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set
> >   powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
> >
> >  Documentation/kernel-parameters.txt       |    9 ++-
> >  arch/powerpc/platforms/powernv/pci-ioda.c |   17 ++++
> >  drivers/pci/pci.c                         |  126 ++++++++++++++++++++++++-----
> >  drivers/pci/probe.c                       |    3 +-
> >  drivers/pci/setup-bus.c                   |   21 ++---
> >  drivers/vfio/pci/vfio_pci.c               |   15 +++-
> >  drivers/vfio/pci/vfio_pci_rdwr.c          |    4 +-
> >  include/linux/pci.h                       |    4 +
> >  8 files changed, 162 insertions(+), 37 deletions(-)
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices
  2016-03-07  7:48 ` [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices Yongji Xie
@ 2016-03-16 16:30   ` Alex Williamson
  2016-03-17 11:28     ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2016-03-16 16:30 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On Mon,  7 Mar 2016 15:48:35 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> When vfio passthrough a PCI device of which MMIO BARs
> are smaller than PAGE_SIZE, guest will not handle the
> mmio accesses to the BARs which leads to mmio emulations
> in host.
> 
> This is because vfio will not allow to passthrough one
> BAR's mmio page which may be shared with other BARs.
> 
> To solve this performance issue, this patch modifies
> resource_alignment to support syntax where multiple
> devices get the same alignment. So we can use something
> like "pci=resource_alignment=*:*:*.*:noresize" to
> enforce the alignment of all MMIO BARs to be at least
> PAGE_SIZE so that one BAR's mmio page would not be
> shared with other BARs.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  Documentation/kernel-parameters.txt |    2 +
>  drivers/pci/pci.c                   |   90 ++++++++++++++++++++++++++++++-----
>  include/linux/pci.h                 |    4 ++
>  3 files changed, 85 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 8028631..74b38ab 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  				aligned memory resources.
>  				If <order of align> is not specified,
>  				PAGE_SIZE is used as alignment.
> +				<domain>, <bus>, <slot> and <func> can be set to
> +				"*" which means match all values.

I don't see anywhere that you're automatically enabling this for your
platform, so presumably you're expecting users to determine on their
own that they have a performance problem and hoping that they'll figure
out that they need to use this option to resolve it.  The first irate
question you'll get back is why doesn't this happen automatically?

>  				PCI-PCI bridge can be specified, if resource
>  				windows need to be expanded.
>  				noresize: Don't change the resources' sizes when
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 760cce5..44ab59f 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -102,6 +102,8 @@ unsigned int pcibios_max_latency = 255;
>  /* If set, the PCIe ARI capability will not be used. */
>  static bool pcie_ari_disabled;
>  
> +bool pci_resources_page_aligned;
> +
>  /**
>   * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
>   * @bus: pointer to PCI bus structure to search
> @@ -4604,6 +4606,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>  	int seg, bus, slot, func, align_order, count;
>  	resource_size_t align = 0;
>  	char *p;
> +	bool invalid = false;
>  
>  	spin_lock(&resource_alignment_lock);
>  	p = resource_alignment_param;
> @@ -4615,16 +4618,49 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>  		} else {
>  			align_order = -1;
>  		}
> -		if (sscanf(p, "%x:%x:%x.%x%n",
> -			&seg, &bus, &slot, &func, &count) != 4) {
> +		if (p[0] == '*' && p[1] == ':') {
> +			seg = -1;
> +			count = 1;
> +		} else if (sscanf(p, "%x%n", &seg, &count) != 1 ||
> +				p[count] != ':') {
> +			invalid = true;
> +			break;
> +		}
> +		p += count + 1;
> +		if (*p == '*') {
> +			bus = -1;
> +			count = 1;
> +		} else if (sscanf(p, "%x%n", &bus, &count) != 1) {
> +			invalid = true;
> +			break;
> +		}
> +		p += count;
> +		if (*p == '.') {
> +			slot = bus;
> +			bus = seg;
>  			seg = 0;
> -			if (sscanf(p, "%x:%x.%x%n",
> -					&bus, &slot, &func, &count) != 3) {
> -				/* Invalid format */
> -				printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n",
> -					p);
> +			p++;
> +		} else if (*p == ':') {
> +			p++;
> +			if (p[0] == '*' && p[1] == '.') {
> +				slot = -1;
> +				count = 1;
> +			} else if (sscanf(p, "%x%n", &slot, &count) != 1 ||
> +					p[count] != '.') {
> +				invalid = true;
>  				break;
>  			}
> +			p += count + 1;
> +		} else {
> +			invalid = true;
> +			break;
> +		}
> +		if (*p == '*') {
> +			func = -1;
> +			count = 1;
> +		} else if (sscanf(p, "%x%n", &func, &count) != 1) {
> +			invalid = true;
> +			break;
>  		}
>  		p += count;
>  		if (!strncmp(p, ":noresize", 9)) {
> @@ -4632,23 +4668,34 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>  			p += 9;
>  		} else
>  			*resize = true;
> -		if (seg == pci_domain_nr(dev->bus) &&
> -			bus == dev->bus->number &&
> -			slot == PCI_SLOT(dev->devfn) &&
> -			func == PCI_FUNC(dev->devfn)) {
> +		if ((seg == pci_domain_nr(dev->bus) || seg == -1) &&
> +			(bus == dev->bus->number || bus == -1) &&
> +			(slot == PCI_SLOT(dev->devfn) || slot == -1) &&
> +			(func == PCI_FUNC(dev->devfn) || func == -1)) {
>  			if (align_order == -1)
>  				align = PAGE_SIZE;
>  			else
>  				align = 1 << align_order;
> +			if (!pci_resources_page_aligned &&
> +				(align >= PAGE_SIZE &&
> +				seg == -1 && bus == -1 &&
> +				slot == -1 && func == -1))
> +				pci_resources_page_aligned = true;
>  			/* Found */
>  			break;
>  		}
>  		if (*p != ';' && *p != ',') {
>  			/* End of param or invalid format */
> +			invalid = true;
>  			break;
>  		}
>  		p++;
>  	}
> +	if (invalid == true) {
> +		/* Invalid format */
> +		printk(KERN_ERR "PCI: Can't parse resource_alignment parameter:%s\n",
> +				p);
> +	}
>  	spin_unlock(&resource_alignment_lock);
>  	return align;
>  }
> @@ -4769,6 +4816,27 @@ static int __init pci_resource_alignment_sysfs_init(void)
>  }
>  late_initcall(pci_resource_alignment_sysfs_init);
>  
> +/*
> + * This function checks whether PCI BARs' mmio page will be shared
> + * with other BARs.
> + */
> +bool pci_resources_share_page(struct pci_dev *dev, int resno)
> +{
> +	struct resource *res = dev->resource + resno;
> +
> +	if (resource_size(res) >= PAGE_SIZE)
> +		return false;
> +	if (pci_resources_page_aligned && !(res->start & ~PAGE_MASK) &&
> +		res->flags & IORESOURCE_MEM) {
> +		if (res->sibling)
> +			return (res->sibling->start & ~PAGE_MASK);
> +		else
> +			return false;
> +	}
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(pci_resources_share_page);

This should be a separate patch, there's no mention of this part of the
change at all in the commitlog.  Also, pci_resource_page_aligned is
only set if we use the magic wildcards to set alignment for all
devices, it's not set if we use a specific seg:bus:dev.fn.  That's
not consistent.  Can't we make use of the STARTALIGN flag for this or
did that get removed when resources were assigned?

The test itself is not entirely reassuring, I'd like some positive
indication that the device has been aligned, not simply that it should
have been and the start alignment appears that it might have happened.
Apparently you don't trust pci_resources_page_aligned either since you
still test the start alignment.

> +
>  static void pci_no_domains(void)
>  {
>  #ifdef CONFIG_PCI_DOMAINS
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 2771625..064a1b6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1511,6 +1511,10 @@ static inline int pci_get_new_domain_nr(void) { return -ENOSYS; }
>  	 (pci_resource_end((dev), (bar)) -		\
>  	  pci_resource_start((dev), (bar)) + 1))
>  
> +extern bool pci_resources_page_aligned;

I'm not seeing why this shouldn't have been static since you're
providing a function that tests it, there shouldn't really be any
external consumers.

> +
> +bool pci_resources_share_page(struct pci_dev *dev, int resno);
> +
>  /* Similar to the helpers above, these manipulate per-pci_dev
>   * driver-specific data.  They are really just a wrapper around
>   * the generic device structure functions of these calls.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 5/7] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  2016-03-07  7:48 ` [RFC PATCH v4 5/7] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Yongji Xie
@ 2016-03-16 16:30   ` Alex Williamson
  2016-03-17 11:29     ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2016-03-16 16:30 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On Mon,  7 Mar 2016 15:48:36 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> Current vfio-pci implementation disallows to mmap
> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
> page may be shared with other BARs.
> 
> But we should allow to mmap these sub-page MMIO BARs if PCI
> resource allocator can make sure these BARs' mmio page will
> not be shared with other BARs.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 1ce1d36..49d7a69 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -589,7 +589,8 @@ static long vfio_pci_ioctl(void *device_data,
>  				     VFIO_REGION_INFO_FLAG_WRITE;
>  			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>  			    pci_resource_flags(pdev, info.index) &
> -			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
> +			    IORESOURCE_MEM && !pci_resources_share_page(pdev,
> +			    info.index)) {

this would be a preferable line wrap:

                            IORESOURCE_MEM &&
                            !pci_resources_share_page(pdev, info.index)) {

>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>  				if (info.index == vdev->msix_bar) {
>  					ret = msix_sparse_mmap_cap(vdev, &caps);
> @@ -1016,6 +1017,10 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
>  
>  	phys_len = pci_resource_len(pdev, index);
> +
> +	if (!pci_resources_share_page(pdev, index))
> +		phys_len = PAGE_ALIGN(phys_len);
> +
>  	req_len = vma->vm_end - vma->vm_start;
>  	pgoff = vma->vm_pgoff &
>  		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 6/7] vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set
  2016-03-07  7:48 ` [RFC PATCH v4 6/7] vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set Yongji Xie
@ 2016-03-16 16:31   ` Alex Williamson
  2016-03-17 11:32     ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2016-03-16 16:31 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj,
	Eric Auger, Will Deacon

[cc+ Eric, Will]

On Mon,  7 Mar 2016 15:48:37 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> Current vfio-pci implementation disallows to mmap MSI-X
> table in case that user get to touch this directly.
> 
> But we should allow to mmap these MSI-X tables if IOMMU
> supports interrupt remapping which can ensure that a
> given pci device can only shoot the MSIs assigned for it.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c      |    8 +++++---
>  drivers/vfio/pci/vfio_pci_rdwr.c |    4 +++-
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 49d7a69..d6f4788 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -592,13 +592,14 @@ static long vfio_pci_ioctl(void *device_data,
>  			    IORESOURCE_MEM && !pci_resources_share_page(pdev,
>  			    info.index)) {
>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
> -				if (info.index == vdev->msix_bar) {
> +				if (!iommu_capable(pdev->dev.bus,
> +					IOMMU_CAP_INTR_REMAP) &&
> +					info.index == vdev->msix_bar) {

We only need to test the IOMMU capability if it's the msix BAR, so why
test these in the reverse order?  It should be:

                                    info.index == vdev->msix_bar &&
                                    !iommu_capable(pdev->dev.bus,
                                                   IOMMU_CAP_INTR_REMAP)

Same below.

I think we also have the problem that ARM SMMU is setting this
capability when it's really not doing anything at all to provide
interrupt isolation.  Adding Eric and Will to the Cc for comment.

I slightly dislike using an IOMMU API interface here to determine if
it's safe to allow user access to the MSIx vector table, but it seems
like the best option we have at this point, if it's actually true for
all the IOMMU drivers participating in the IOMMU API.

>  					ret = msix_sparse_mmap_cap(vdev, &caps);
>  					if (ret)
>  						return ret;
>  				}
>  			}
> -
>  			break;
>  		case VFIO_PCI_ROM_REGION_INDEX:
>  		{
> @@ -1029,7 +1030,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>  		return -EINVAL;
>  
> -	if (index == vdev->msix_bar) {
> +	if (!iommu_capable(pdev->dev.bus, IOMMU_CAP_INTR_REMAP) &&
> +		index == vdev->msix_bar) {
>  		/*
>  		 * Disallow mmaps overlapping the MSI-X table; users don't
>  		 * get to touch this directly.  We could find somewhere
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 5ffd1d9..1c46c29 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/io.h>
>  #include <linux/vgaarb.h>
> +#include <linux/iommu.h>
>  
>  #include "vfio_pci_private.h"
>  
> @@ -164,7 +165,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
>  	} else
>  		io = vdev->barmap[bar];
>  
> -	if (bar == vdev->msix_bar) {
> +	if (!iommu_capable(pdev->dev.bus, IOMMU_CAP_INTR_REMAP) &&
> +		bar == vdev->msix_bar) {

Do we really want to test this on *every* read/write to any BAR (order
of tests matter)?  Even in the case of the MSIx BAR, should we cache
this when the device is first opened?

>  		x_start = vdev->msix_offset;
>  		x_end = vdev->msix_offset + vdev->msix_size;
>  	}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 3/7] PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set
  2016-03-07  7:48 ` [RFC PATCH v4 3/7] PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set Yongji Xie
@ 2016-03-16 16:31   ` Alex Williamson
  2016-03-17 11:35     ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2016-03-16 16:31 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On Mon,  7 Mar 2016 15:48:34 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> The resource_alignment will releases memory resources
> allocated by firmware so that kernel can reassign new
> resources later on. But this will cause the problem
> that no resources can be allocated by kernel if
> PCI_PROBE_ONLY was set, e.g. on pSeries platform
> because PCI_PROBE_ONLY force kernel to use firmware
> setup and not to reassign any resources.
> 
> To solve this problem, this patch ignores
> resource_alignment if PCI_PROBE_ONLY was set.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  Documentation/kernel-parameters.txt |    2 ++
>  drivers/pci/probe.c                 |    3 ++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index d8b29ab..8028631 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2922,6 +2922,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  				windows need to be expanded.
>  				noresize: Don't change the resources' sizes when
>  				reassigning alignment.
> +				Note that this option will not work if
> +				PCI_PROBE_ONLY is set.

How would a user have any idea if this is set?

>  		ecrc=		Enable/disable PCIe ECRC (transaction layer
>  				end-to-end CRC checking).
>  				bios: Use BIOS/firmware settings. This is the
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 6d7ab9b..bc31cad 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1719,7 +1719,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>  	pci_fixup_device(pci_fixup_header, dev);
>  
>  	/* moved out from quirk header fixup code */
> -	pci_reassigndev_resource_alignment(dev);
> +	if (!pci_has_flag(PCI_PROBE_ONLY))
> +		pci_reassigndev_resource_alignment(dev);
>  
>  	/* Clear the state_saved flag. */
>  	dev->state_saved = false;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
  2016-03-07  7:48 ` [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge Yongji Xie
@ 2016-03-16 16:32   ` Alex Williamson
  2016-03-17 11:38     ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2016-03-16 16:32 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On Mon,  7 Mar 2016 15:48:38 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> This patch adds IOMMU_CAP_INTR_REMAP for IODA host bridge so that
> we can mmap MSI-X table in vfio driver.
> 
> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f90dc04..f01b9ab 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1955,6 +1955,20 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>  	.free = pnv_ioda2_table_free,
>  };
>  
> +static bool pnv_ioda_iommu_capable(enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_INTR_REMAP:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static struct iommu_ops pnv_ioda_iommu_ops = {
> +	.capable = pnv_ioda_iommu_capable,
> +};
> +
>  static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  				      struct pnv_ioda_pe *pe, unsigned int base,
>  				      unsigned int segs)
> @@ -3078,6 +3092,9 @@ static void pnv_pci_ioda_fixup(void)
>  
>  	/* Link NPU IODA tables to their PCI devices. */
>  	pnv_npu_ioda_fixup();
> +
> +	/* Add IOMMU_CAP_INTR_REMAP */
> +	bus_set_iommu(&pci_bus_type, &pnv_ioda_iommu_ops);
>  }
>  
>  /*


Doesn't this set you up for a world of hurt?  bus_set_iommu() calls
iommu_bus_init() which sets up notifiers, which maybe you don't care
about, but it also means that iommu_domain_alloc(&pci_bus_type) will
segfault because you're not providing a domain_alloc callback here.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table
  2016-03-16 14:10   ` Bjorn Helgaas
@ 2016-03-17 10:46     ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-17 10:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, alex.williamson, benh, paulus, mpe, warrier, zhong,
	nikunj

On 2016/3/16 22:10, Bjorn Helgaas wrote:
> On Wed, Mar 16, 2016 at 06:51:56PM +0800, Yongji Xie wrote:
>> Ping.
> This is mainly VFIO stuff, and Alex had some security concerns, so I'm
> not going to spend much time looking at this until he's satisfied.
>
> When I do, I'll be looking hard at the resource_alignment kernel
> parameter.  I'm opposed to kernel parameters in general because
> they're very difficult for users to use correctly, and they lead to
> kernel code paths that are rarely tested and hard to maintain.  So
> I'll be looking for an excuse to reject changes in that area.
>
> The changelog for 2/7 says it "replaces IORESOURCE_STARTALIGN with
> IORESOURCE_WINDOW."  But even a glance at the patch itself shows that
> IORESOURCE_WINDOW is *added* to some places, and it doesn't *replace*
> IORESOURCE_STARTALIGN.

There is a problem with my statement. I mean we can use
IORESOURCE_WINDOW to identify bridge resources instead of
IORESOURCE_STARTALIGN here.

> The changelog for 4/7 says:
>
>    This is because vfio will not allow to passthrough one BAR's mmio
>    page which may be shared with other BARs.  To solve this performance
>    issue ...
>
> with no mention at all of the actual *reason* vfio doesn't allow that
> passthrough.  If I understand correctly, that reason has to do with
> security, so your justification must be much stronger than "solving a
> performance issue."

OK. I will try to make my justification become stronger.

Thanks,
Yongji Xie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices
  2016-03-16 16:30   ` Alex Williamson
@ 2016-03-17 11:28     ` Yongji Xie
  2016-03-17 12:40       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-17 11:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On 2016/3/17 0:30, Alex Williamson wrote:
> On Mon,  7 Mar 2016 15:48:35 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> When vfio passthrough a PCI device of which MMIO BARs
>> are smaller than PAGE_SIZE, guest will not handle the
>> mmio accesses to the BARs which leads to mmio emulations
>> in host.
>>
>> This is because vfio will not allow to passthrough one
>> BAR's mmio page which may be shared with other BARs.
>>
>> To solve this performance issue, this patch modifies
>> resource_alignment to support syntax where multiple
>> devices get the same alignment. So we can use something
>> like "pci=resource_alignment=*:*:*.*:noresize" to
>> enforce the alignment of all MMIO BARs to be at least
>> PAGE_SIZE so that one BAR's mmio page would not be
>> shared with other BARs.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   Documentation/kernel-parameters.txt |    2 +
>>   drivers/pci/pci.c                   |   90 ++++++++++++++++++++++++++++++-----
>>   include/linux/pci.h                 |    4 ++
>>   3 files changed, 85 insertions(+), 11 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index 8028631..74b38ab 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>   				aligned memory resources.
>>   				If <order of align> is not specified,
>>   				PAGE_SIZE is used as alignment.
>> +				<domain>, <bus>, <slot> and <func> can be set to
>> +				"*" which means match all values.
> I don't see anywhere that you're automatically enabling this for your
> platform, so presumably you're expecting users to determine on their
> own that they have a performance problem and hoping that they'll figure
> out that they need to use this option to resolve it.  The first irate
> question you'll get back is why doesn't this happen automatically?

Actually, I just want to make the code simple. Maybe it is
not a good idea to let users enable this option manually.
I will try to fix it like that:

diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 6f8065a..6659752 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -30,6 +30,8 @@
  #define PCIBIOS_MIN_IO         0x1000
  #define PCIBIOS_MIN_MEM                0x10000000

+#define PCIBIOS_MIN_ALIGNMENT  PAGE_SIZE
+
  struct pci_dev;

  /* Values for the `which' argument to sys_pciconfig_iobase syscall.  */
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index dadd28a..9f644e4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4593,6 +4593,8 @@ EXPORT_SYMBOL_GPL(pci_ignore_hotplug);
  static char resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0};
  static DEFINE_SPINLOCK(resource_alignment_lock);

+#define DISABLE_ARCH_ALIGNMENT -1
+#define DEFAULT_ALIGNMENT      -2
  /**
   * pci_specified_resource_alignment - get resource alignment specified 
by user.
   * @dev: the PCI device to get
@@ -4609,6 +4611,9 @@ static resource_size_t 
pci_specified_resource_alignment(struct pci_dev *dev,
         char *p;
         bool invalid = false;

+#ifdef PCIBIOS_MIN_ALIGNMENT
+       align = PCIBIOS_MIN_ALIGNMENT;
+#endif
         spin_lock(&resource_alignment_lock);
         p = resource_alignment_param;
         while (*p) {
@@ -4617,7 +4622,7 @@ static resource_size_t 
pci_specified_resource_alignment(struct pci_dev *dev,
                                                         p[count] == '@') {
                         p += count + 1;
                 } else {
-                       align_order = -1;
+                       align_order = DEFAULT_ALIGNMENT;
                 }
                 if (p[0] == '*' && p[1] == ':') {
                         seg = -1;
@@ -4673,8 +4678,10 @@ static resource_size_t 
pci_specified_resource_alignment(struct pci_dev *dev,
                         (bus == dev->bus->number || bus == -1) &&
                         (slot == PCI_SLOT(dev->devfn) || slot == -1) &&
                         (func == PCI_FUNC(dev->devfn) || func == -1)) {
-                       if (align_order == -1)
+                       if (align_order == DEFAULT_ALIGNMENT)
                                 align = PAGE_SIZE;
+                       else if (align_order == DISABLE_ARCH_ALIGNMENT)
+                               align = 0;
                         else
                                 align = 1 << align_order;
                         if (!pci_resources_page_aligned &&
>>   				PCI-PCI bridge can be specified, if resource
>>   				windows need to be expanded.
>>   				noresize: Don't change the resources' sizes when
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 760cce5..44ab59f 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -102,6 +102,8 @@ unsigned int pcibios_max_latency = 255;
>>   /* If set, the PCIe ARI capability will not be used. */
>>   static bool pcie_ari_disabled;
>>   
>> +bool pci_resources_page_aligned;
>> +
>>   /**
>>    * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
>>    * @bus: pointer to PCI bus structure to search
>> @@ -4604,6 +4606,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>>   	int seg, bus, slot, func, align_order, count;
>>   	resource_size_t align = 0;
>>   	char *p;
>> +	bool invalid = false;
>>   
>>   	spin_lock(&resource_alignment_lock);
>>   	p = resource_alignment_param;
>> @@ -4615,16 +4618,49 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>>   		} else {
>>   			align_order = -1;
>>   		}
>> -		if (sscanf(p, "%x:%x:%x.%x%n",
>> -			&seg, &bus, &slot, &func, &count) != 4) {
>> +		if (p[0] == '*' && p[1] == ':') {
>> +			seg = -1;
>> +			count = 1;
>> +		} else if (sscanf(p, "%x%n", &seg, &count) != 1 ||
>> +				p[count] != ':') {
>> +			invalid = true;
>> +			break;
>> +		}
>> +		p += count + 1;
>> +		if (*p == '*') {
>> +			bus = -1;
>> +			count = 1;
>> +		} else if (sscanf(p, "%x%n", &bus, &count) != 1) {
>> +			invalid = true;
>> +			break;
>> +		}
>> +		p += count;
>> +		if (*p == '.') {
>> +			slot = bus;
>> +			bus = seg;
>>   			seg = 0;
>> -			if (sscanf(p, "%x:%x.%x%n",
>> -					&bus, &slot, &func, &count) != 3) {
>> -				/* Invalid format */
>> -				printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n",
>> -					p);
>> +			p++;
>> +		} else if (*p == ':') {
>> +			p++;
>> +			if (p[0] == '*' && p[1] == '.') {
>> +				slot = -1;
>> +				count = 1;
>> +			} else if (sscanf(p, "%x%n", &slot, &count) != 1 ||
>> +					p[count] != '.') {
>> +				invalid = true;
>>   				break;
>>   			}
>> +			p += count + 1;
>> +		} else {
>> +			invalid = true;
>> +			break;
>> +		}
>> +		if (*p == '*') {
>> +			func = -1;
>> +			count = 1;
>> +		} else if (sscanf(p, "%x%n", &func, &count) != 1) {
>> +			invalid = true;
>> +			break;
>>   		}
>>   		p += count;
>>   		if (!strncmp(p, ":noresize", 9)) {
>> @@ -4632,23 +4668,34 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>>   			p += 9;
>>   		} else
>>   			*resize = true;
>> -		if (seg == pci_domain_nr(dev->bus) &&
>> -			bus == dev->bus->number &&
>> -			slot == PCI_SLOT(dev->devfn) &&
>> -			func == PCI_FUNC(dev->devfn)) {
>> +		if ((seg == pci_domain_nr(dev->bus) || seg == -1) &&
>> +			(bus == dev->bus->number || bus == -1) &&
>> +			(slot == PCI_SLOT(dev->devfn) || slot == -1) &&
>> +			(func == PCI_FUNC(dev->devfn) || func == -1)) {
>>   			if (align_order == -1)
>>   				align = PAGE_SIZE;
>>   			else
>>   				align = 1 << align_order;
>> +			if (!pci_resources_page_aligned &&
>> +				(align >= PAGE_SIZE &&
>> +				seg == -1 && bus == -1 &&
>> +				slot == -1 && func == -1))
>> +				pci_resources_page_aligned = true;
>>   			/* Found */
>>   			break;
>>   		}
>>   		if (*p != ';' && *p != ',') {
>>   			/* End of param or invalid format */
>> +			invalid = true;
>>   			break;
>>   		}
>>   		p++;
>>   	}
>> +	if (invalid == true) {
>> +		/* Invalid format */
>> +		printk(KERN_ERR "PCI: Can't parse resource_alignment parameter:%s\n",
>> +				p);
>> +	}
>>   	spin_unlock(&resource_alignment_lock);
>>   	return align;
>>   }
>> @@ -4769,6 +4816,27 @@ static int __init pci_resource_alignment_sysfs_init(void)
>>   }
>>   late_initcall(pci_resource_alignment_sysfs_init);
>>   
>> +/*
>> + * This function checks whether PCI BARs' mmio page will be shared
>> + * with other BARs.
>> + */
>> +bool pci_resources_share_page(struct pci_dev *dev, int resno)
>> +{
>> +	struct resource *res = dev->resource + resno;
>> +
>> +	if (resource_size(res) >= PAGE_SIZE)
>> +		return false;
>> +	if (pci_resources_page_aligned && !(res->start & ~PAGE_MASK) &&
>> +		res->flags & IORESOURCE_MEM) {
>> +		if (res->sibling)
>> +			return (res->sibling->start & ~PAGE_MASK);
>> +		else
>> +			return false;
>> +	}
>> +	return true;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_resources_share_page);
> This should be a separate patch, there's no mention of this part of the
> change at all in the commitlog.  Also, pci_resource_page_aligned is
> only set if we use the magic wildcards to set alignment for all
> devices, it's not set if we use a specific seg:bus:dev.fn.  That's
> not consistent.  Can't we make use of the STARTALIGN flag for this or
> did that get removed when resources were assigned?

I only set pci_resource_page_aligned when using magic wildcards
because I must make sure pci_resources_share_page() can return
correctly when a hotplug event happens.

For example, there is only one half-page resource in the system.
And we use a specific seg:bus:dev.fn to force it to be page aligned.
So we set pci_resource_page_aligned because all devices are page
aligned now. It will return false when we call
pci_resources_share_page() for this resource. But this return value
will be not correct if we hot add a new device which also have
a half-page resource which will share one page with the previous
resource.

> The test itself is not entirely reassuring, I'd like some positive
> indication that the device has been aligned, not simply that it should
> have been and the start alignment appears that it might have happened.
> Apparently you don't trust pci_resources_page_aligned either since you
> still test the start alignment.

Yes, I don't trust pci_resources_page_aligned. Because
resource_alignment will not work in some case, e.g.
PCI_PROBE_ONLY or IORESOURCE_PCI_FIXED is set.

But I think *start* of the resource can give the positive
indication. The resource should not share page with
others if the start of the resource and its sibling are both
page aligned.

>> +
>>   static void pci_no_domains(void)
>>   {
>>   #ifdef CONFIG_PCI_DOMAINS
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 2771625..064a1b6 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1511,6 +1511,10 @@ static inline int pci_get_new_domain_nr(void) { return -ENOSYS; }
>>   	 (pci_resource_end((dev), (bar)) -		\
>>   	  pci_resource_start((dev), (bar)) + 1))
>>   
>> +extern bool pci_resources_page_aligned;
> I'm not seeing why this shouldn't have been static since you're
> providing a function that tests it, there shouldn't really be any
> external consumers.

I made a mistake here. I will fix it in next version.

Thanks,
Yongji Xie.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 5/7] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
  2016-03-16 16:30   ` Alex Williamson
@ 2016-03-17 11:29     ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-17 11:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On 2016/3/17 0:30, Alex Williamson wrote:
> On Mon,  7 Mar 2016 15:48:36 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> Current vfio-pci implementation disallows to mmap
>> sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
>> page may be shared with other BARs.
>>
>> But we should allow to mmap these sub-page MMIO BARs if PCI
>> resource allocator can make sure these BARs' mmio page will
>> not be shared with other BARs.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci.c |    7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 1ce1d36..49d7a69 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -589,7 +589,8 @@ static long vfio_pci_ioctl(void *device_data,
>>   				     VFIO_REGION_INFO_FLAG_WRITE;
>>   			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>>   			    pci_resource_flags(pdev, info.index) &
>> -			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
>> +			    IORESOURCE_MEM && !pci_resources_share_page(pdev,
>> +			    info.index)) {
> this would be a preferable line wrap:
>
>                              IORESOURCE_MEM &&
>                              !pci_resources_share_page(pdev, info.index)) {

OK. I'll fix it.

Thanks,
Yongji Xie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 6/7] vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set
  2016-03-16 16:31   ` Alex Williamson
@ 2016-03-17 11:32     ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-17 11:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj,
	Eric Auger, Will Deacon

On 2016/3/17 0:31, Alex Williamson wrote:
> [cc+ Eric, Will]
>
> On Mon,  7 Mar 2016 15:48:37 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> Current vfio-pci implementation disallows to mmap MSI-X
>> table in case that user get to touch this directly.
>>
>> But we should allow to mmap these MSI-X tables if IOMMU
>> supports interrupt remapping which can ensure that a
>> given pci device can only shoot the MSIs assigned for it.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci.c      |    8 +++++---
>>   drivers/vfio/pci/vfio_pci_rdwr.c |    4 +++-
>>   2 files changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 49d7a69..d6f4788 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -592,13 +592,14 @@ static long vfio_pci_ioctl(void *device_data,
>>   			    IORESOURCE_MEM && !pci_resources_share_page(pdev,
>>   			    info.index)) {
>>   				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>> -				if (info.index == vdev->msix_bar) {
>> +				if (!iommu_capable(pdev->dev.bus,
>> +					IOMMU_CAP_INTR_REMAP) &&
>> +					info.index == vdev->msix_bar) {
> We only need to test the IOMMU capability if it's the msix BAR, so why
> test these in the reverse order?  It should be:
>
>                                      info.index == vdev->msix_bar &&
>                                      !iommu_capable(pdev->dev.bus,
>                                                     IOMMU_CAP_INTR_REMAP)
>
> Same below.

OK. I'll fix it.

> I think we also have the problem that ARM SMMU is setting this
> capability when it's really not doing anything at all to provide
> interrupt isolation.  Adding Eric and Will to the Cc for comment.
>
> I slightly dislike using an IOMMU API interface here to determine if
> it's safe to allow user access to the MSIx vector table, but it seems
> like the best option we have at this point, if it's actually true for
> all the IOMMU drivers participating in the IOMMU API.
>
>>   					ret = msix_sparse_mmap_cap(vdev, &caps);
>>   					if (ret)
>>   						return ret;
>>   				}
>>   			}
>> -
>>   			break;
>>   		case VFIO_PCI_ROM_REGION_INDEX:
>>   		{
>> @@ -1029,7 +1030,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>>   	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>>   		return -EINVAL;
>>   
>> -	if (index == vdev->msix_bar) {
>> +	if (!iommu_capable(pdev->dev.bus, IOMMU_CAP_INTR_REMAP) &&
>> +		index == vdev->msix_bar) {
>>   		/*
>>   		 * Disallow mmaps overlapping the MSI-X table; users don't
>>   		 * get to touch this directly.  We could find somewhere
>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>> index 5ffd1d9..1c46c29 100644
>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>> @@ -18,6 +18,7 @@
>>   #include <linux/uaccess.h>
>>   #include <linux/io.h>
>>   #include <linux/vgaarb.h>
>> +#include <linux/iommu.h>
>>   
>>   #include "vfio_pci_private.h"
>>   
>> @@ -164,7 +165,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
>>   	} else
>>   		io = vdev->barmap[bar];
>>   
>> -	if (bar == vdev->msix_bar) {
>> +	if (!iommu_capable(pdev->dev.bus, IOMMU_CAP_INTR_REMAP) &&
>> +		bar == vdev->msix_bar) {
> Do we really want to test this on *every* read/write to any BAR (order
> of tests matter)?  Even in the case of the MSIx BAR, should we cache
> this when the device is first opened?

I will cache this in vfio_pci_open().

Thanks,
Yongji Xie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 3/7] PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set
  2016-03-16 16:31   ` Alex Williamson
@ 2016-03-17 11:35     ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-17 11:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On 2016/3/17 0:31, Alex Williamson wrote:
> On Mon,  7 Mar 2016 15:48:34 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> The resource_alignment will releases memory resources
>> allocated by firmware so that kernel can reassign new
>> resources later on. But this will cause the problem
>> that no resources can be allocated by kernel if
>> PCI_PROBE_ONLY was set, e.g. on pSeries platform
>> because PCI_PROBE_ONLY force kernel to use firmware
>> setup and not to reassign any resources.
>>
>> To solve this problem, this patch ignores
>> resource_alignment if PCI_PROBE_ONLY was set.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   Documentation/kernel-parameters.txt |    2 ++
>>   drivers/pci/probe.c                 |    3 ++-
>>   2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index d8b29ab..8028631 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2922,6 +2922,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>   				windows need to be expanded.
>>   				noresize: Don't change the resources' sizes when
>>   				reassigning alignment.
>> +				Note that this option will not work if
>> +				PCI_PROBE_ONLY is set.
> How would a user have any idea if this is set?

I found the PCI_PROBE_ONLY is set on pSeries, maple and
arm with "firmware" kernel parameter enabled.

So can we say: Note that this option will not work on pSeries,
maple and arm with "firmware" kernel parameter enabled?
Or do you have any suggestion?

Thanks,
Yongji Xie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
  2016-03-16 16:32   ` Alex Williamson
@ 2016-03-17 11:38     ` Yongji Xie
  2016-03-17 12:48       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Yongji Xie @ 2016-03-17 11:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On 2016/3/17 0:32, Alex Williamson wrote:
> On Mon,  7 Mar 2016 15:48:38 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> This patch adds IOMMU_CAP_INTR_REMAP for IODA host bridge so that
>> we can mmap MSI-X table in vfio driver.
>>
>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/platforms/powernv/pci-ioda.c |   17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index f90dc04..f01b9ab 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1955,6 +1955,20 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>>   	.free = pnv_ioda2_table_free,
>>   };
>>   
>> +static bool pnv_ioda_iommu_capable(enum iommu_cap cap)
>> +{
>> +	switch (cap) {
>> +	case IOMMU_CAP_INTR_REMAP:
>> +		return true;
>> +	default:
>> +		return false;
>> +	}
>> +}
>> +
>> +static struct iommu_ops pnv_ioda_iommu_ops = {
>> +	.capable = pnv_ioda_iommu_capable,
>> +};
>> +
>>   static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>   				      struct pnv_ioda_pe *pe, unsigned int base,
>>   				      unsigned int segs)
>> @@ -3078,6 +3092,9 @@ static void pnv_pci_ioda_fixup(void)
>>   
>>   	/* Link NPU IODA tables to their PCI devices. */
>>   	pnv_npu_ioda_fixup();
>> +
>> +	/* Add IOMMU_CAP_INTR_REMAP */
>> +	bus_set_iommu(&pci_bus_type, &pnv_ioda_iommu_ops);
>>   }
>>   
>>   /*
>
> Doesn't this set you up for a world of hurt?  bus_set_iommu() calls
> iommu_bus_init() which sets up notifiers, which maybe you don't care
> about, but it also means that iommu_domain_alloc(&pci_bus_type) will
> segfault because you're not providing a domain_alloc callback here.

It seems to be hard to add IOMMU_CAP_INTR_REMAP on
PPC64 platform.

And can we add a new ioctl in vfio_iommu_driver to check
if interrupt remapping is supported so that we can use our
own way to determine that on PPC64 platform?

Thanks,
Yongji Xie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices
  2016-03-17 11:28     ` Yongji Xie
@ 2016-03-17 12:40       ` Alex Williamson
  2016-03-18 15:04         ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2016-03-17 12:40 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On Thu, 17 Mar 2016 19:28:34 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> On 2016/3/17 0:30, Alex Williamson wrote:
> > On Mon,  7 Mar 2016 15:48:35 +0800
> > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> >  
> >> When vfio passthrough a PCI device of which MMIO BARs
> >> are smaller than PAGE_SIZE, guest will not handle the
> >> mmio accesses to the BARs which leads to mmio emulations
> >> in host.
> >>
> >> This is because vfio will not allow to passthrough one
> >> BAR's mmio page which may be shared with other BARs.
> >>
> >> To solve this performance issue, this patch modifies
> >> resource_alignment to support syntax where multiple
> >> devices get the same alignment. So we can use something
> >> like "pci=resource_alignment=*:*:*.*:noresize" to
> >> enforce the alignment of all MMIO BARs to be at least
> >> PAGE_SIZE so that one BAR's mmio page would not be
> >> shared with other BARs.
> >>
> >> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> >> ---
> >>   Documentation/kernel-parameters.txt |    2 +
> >>   drivers/pci/pci.c                   |   90 ++++++++++++++++++++++++++++++-----
> >>   include/linux/pci.h                 |    4 ++
> >>   3 files changed, 85 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> >> index 8028631..74b38ab 100644
> >> --- a/Documentation/kernel-parameters.txt
> >> +++ b/Documentation/kernel-parameters.txt
> >> @@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >>   				aligned memory resources.
> >>   				If <order of align> is not specified,
> >>   				PAGE_SIZE is used as alignment.
> >> +				<domain>, <bus>, <slot> and <func> can be set to
> >> +				"*" which means match all values.  
> > I don't see anywhere that you're automatically enabling this for your
> > platform, so presumably you're expecting users to determine on their
> > own that they have a performance problem and hoping that they'll figure
> > out that they need to use this option to resolve it.  The first irate
> > question you'll get back is why doesn't this happen automatically?  
> 
> Actually, I just want to make the code simple. Maybe it is
> not a good idea to let users enable this option manually.
> I will try to fix it like that:

That's not entirely what I meant, I think having a way for a user to
enable it is a good thing, but maybe there need to be cases where it's
enabled automatically.  With 4k pages, this is often not an issue since
the PCI spec recommends 4k or 8k alignment for resources, but that
doesn't preclude an unusual device where a user might want to enable it
anyway.  At 64k page size, problems become more common, so we need to
think about either enabling it automatically or somehow making it more
apparent to the user that the option is available for this purpose.

> 
> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
> index 6f8065a..6659752 100644
> --- a/arch/powerpc/include/asm/pci.h
> +++ b/arch/powerpc/include/asm/pci.h
> @@ -30,6 +30,8 @@
>   #define PCIBIOS_MIN_IO         0x1000
>   #define PCIBIOS_MIN_MEM                0x10000000
> 
> +#define PCIBIOS_MIN_ALIGNMENT  PAGE_SIZE
> +
>   struct pci_dev;
> 
>   /* Values for the `which' argument to sys_pciconfig_iobase syscall.  */
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index dadd28a..9f644e4 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4593,6 +4593,8 @@ EXPORT_SYMBOL_GPL(pci_ignore_hotplug);
>   static char resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0};
>   static DEFINE_SPINLOCK(resource_alignment_lock);
> 
> +#define DISABLE_ARCH_ALIGNMENT -1
> +#define DEFAULT_ALIGNMENT      -2
>   /**
>    * pci_specified_resource_alignment - get resource alignment specified 
> by user.
>    * @dev: the PCI device to get
> @@ -4609,6 +4611,9 @@ static resource_size_t 
> pci_specified_resource_alignment(struct pci_dev *dev,
>          char *p;
>          bool invalid = false;
> 
> +#ifdef PCIBIOS_MIN_ALIGNMENT
> +       align = PCIBIOS_MIN_ALIGNMENT;
> +#endif
>          spin_lock(&resource_alignment_lock);
>          p = resource_alignment_param;
>          while (*p) {
> @@ -4617,7 +4622,7 @@ static resource_size_t 
> pci_specified_resource_alignment(struct pci_dev *dev,
>                                                          p[count] == '@') {
>                          p += count + 1;
>                  } else {
> -                       align_order = -1;
> +                       align_order = DEFAULT_ALIGNMENT;
>                  }
>                  if (p[0] == '*' && p[1] == ':') {
>                          seg = -1;
> @@ -4673,8 +4678,10 @@ static resource_size_t 
> pci_specified_resource_alignment(struct pci_dev *dev,
>                          (bus == dev->bus->number || bus == -1) &&
>                          (slot == PCI_SLOT(dev->devfn) || slot == -1) &&
>                          (func == PCI_FUNC(dev->devfn) || func == -1)) {
> -                       if (align_order == -1)
> +                       if (align_order == DEFAULT_ALIGNMENT)
>                                  align = PAGE_SIZE;
> +                       else if (align_order == DISABLE_ARCH_ALIGNMENT)
> +                               align = 0;
>                          else
>                                  align = 1 << align_order;
>                          if (!pci_resources_page_aligned &&
> >>   				PCI-PCI bridge can be specified, if resource
> >>   				windows need to be expanded.
> >>   				noresize: Don't change the resources' sizes when
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index 760cce5..44ab59f 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -102,6 +102,8 @@ unsigned int pcibios_max_latency = 255;
> >>   /* If set, the PCIe ARI capability will not be used. */
> >>   static bool pcie_ari_disabled;
> >>   
> >> +bool pci_resources_page_aligned;
> >> +
> >>   /**
> >>    * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
> >>    * @bus: pointer to PCI bus structure to search
> >> @@ -4604,6 +4606,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
> >>   	int seg, bus, slot, func, align_order, count;
> >>   	resource_size_t align = 0;
> >>   	char *p;
> >> +	bool invalid = false;
> >>   
> >>   	spin_lock(&resource_alignment_lock);
> >>   	p = resource_alignment_param;
> >> @@ -4615,16 +4618,49 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
> >>   		} else {
> >>   			align_order = -1;
> >>   		}
> >> -		if (sscanf(p, "%x:%x:%x.%x%n",
> >> -			&seg, &bus, &slot, &func, &count) != 4) {
> >> +		if (p[0] == '*' && p[1] == ':') {
> >> +			seg = -1;
> >> +			count = 1;
> >> +		} else if (sscanf(p, "%x%n", &seg, &count) != 1 ||
> >> +				p[count] != ':') {
> >> +			invalid = true;
> >> +			break;
> >> +		}
> >> +		p += count + 1;
> >> +		if (*p == '*') {
> >> +			bus = -1;
> >> +			count = 1;
> >> +		} else if (sscanf(p, "%x%n", &bus, &count) != 1) {
> >> +			invalid = true;
> >> +			break;
> >> +		}
> >> +		p += count;
> >> +		if (*p == '.') {
> >> +			slot = bus;
> >> +			bus = seg;
> >>   			seg = 0;
> >> -			if (sscanf(p, "%x:%x.%x%n",
> >> -					&bus, &slot, &func, &count) != 3) {
> >> -				/* Invalid format */
> >> -				printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n",
> >> -					p);
> >> +			p++;
> >> +		} else if (*p == ':') {
> >> +			p++;
> >> +			if (p[0] == '*' && p[1] == '.') {
> >> +				slot = -1;
> >> +				count = 1;
> >> +			} else if (sscanf(p, "%x%n", &slot, &count) != 1 ||
> >> +					p[count] != '.') {
> >> +				invalid = true;
> >>   				break;
> >>   			}
> >> +			p += count + 1;
> >> +		} else {
> >> +			invalid = true;
> >> +			break;
> >> +		}
> >> +		if (*p == '*') {
> >> +			func = -1;
> >> +			count = 1;
> >> +		} else if (sscanf(p, "%x%n", &func, &count) != 1) {
> >> +			invalid = true;
> >> +			break;
> >>   		}
> >>   		p += count;
> >>   		if (!strncmp(p, ":noresize", 9)) {
> >> @@ -4632,23 +4668,34 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
> >>   			p += 9;
> >>   		} else
> >>   			*resize = true;
> >> -		if (seg == pci_domain_nr(dev->bus) &&
> >> -			bus == dev->bus->number &&
> >> -			slot == PCI_SLOT(dev->devfn) &&
> >> -			func == PCI_FUNC(dev->devfn)) {
> >> +		if ((seg == pci_domain_nr(dev->bus) || seg == -1) &&
> >> +			(bus == dev->bus->number || bus == -1) &&
> >> +			(slot == PCI_SLOT(dev->devfn) || slot == -1) &&
> >> +			(func == PCI_FUNC(dev->devfn) || func == -1)) {
> >>   			if (align_order == -1)
> >>   				align = PAGE_SIZE;
> >>   			else
> >>   				align = 1 << align_order;
> >> +			if (!pci_resources_page_aligned &&
> >> +				(align >= PAGE_SIZE &&
> >> +				seg == -1 && bus == -1 &&
> >> +				slot == -1 && func == -1))
> >> +				pci_resources_page_aligned = true;
> >>   			/* Found */
> >>   			break;
> >>   		}
> >>   		if (*p != ';' && *p != ',') {
> >>   			/* End of param or invalid format */
> >> +			invalid = true;
> >>   			break;
> >>   		}
> >>   		p++;
> >>   	}
> >> +	if (invalid == true) {
> >> +		/* Invalid format */
> >> +		printk(KERN_ERR "PCI: Can't parse resource_alignment parameter:%s\n",
> >> +				p);
> >> +	}
> >>   	spin_unlock(&resource_alignment_lock);
> >>   	return align;
> >>   }
> >> @@ -4769,6 +4816,27 @@ static int __init pci_resource_alignment_sysfs_init(void)
> >>   }
> >>   late_initcall(pci_resource_alignment_sysfs_init);
> >>   
> >> +/*
> >> + * This function checks whether PCI BARs' mmio page will be shared
> >> + * with other BARs.
> >> + */
> >> +bool pci_resources_share_page(struct pci_dev *dev, int resno)
> >> +{
> >> +	struct resource *res = dev->resource + resno;
> >> +
> >> +	if (resource_size(res) >= PAGE_SIZE)
> >> +		return false;
> >> +	if (pci_resources_page_aligned && !(res->start & ~PAGE_MASK) &&
> >> +		res->flags & IORESOURCE_MEM) {
> >> +		if (res->sibling)
> >> +			return (res->sibling->start & ~PAGE_MASK);
> >> +		else
> >> +			return false;
> >> +	}
> >> +	return true;
> >> +}
> >> +EXPORT_SYMBOL_GPL(pci_resources_share_page);  
> > This should be a separate patch, there's no mention of this part of the
> > change at all in the commitlog.  Also, pci_resource_page_aligned is
> > only set if we use the magic wildcards to set alignment for all
> > devices, it's not set if we use a specific seg:bus:dev.fn.  That's
> > not consistent.  Can't we make use of the STARTALIGN flag for this or
> > did that get removed when resources were assigned?  
> 
> I only set pci_resource_page_aligned when using magic wildcards
> because I must make sure pci_resources_share_page() can return
> correctly when a hotplug event happens.
> 
> For example, there is only one half-page resource in the system.
> And we use a specific seg:bus:dev.fn to force it to be page aligned.
> So we set pci_resource_page_aligned because all devices are page
> aligned now. It will return false when we call
> pci_resources_share_page() for this resource. But this return value
> will be not correct if we hot add a new device which also have
> a half-page resource which will share one page with the previous
> resource.

That's a good point, it should be documented in a comment.

> > The test itself is not entirely reassuring, I'd like some positive
> > indication that the device has been aligned, not simply that it should
> > have been and the start alignment appears that it might have happened.
> > Apparently you don't trust pci_resources_page_aligned either since you
> > still test the start alignment.  
> 
> Yes, I don't trust pci_resources_page_aligned. Because
> resource_alignment will not work in some case, e.g.
> PCI_PROBE_ONLY or IORESOURCE_PCI_FIXED is set.
> 
> But I think *start* of the resource can give the positive
> indication. The resource should not share page with
> others if the start of the resource and its sibling are both
> page aligned.

If you can't trust pci_resource_page_aligned then the alignment of
start seems like it only indicates that it's likely aligned, not that
it is aligned.

> >> +
> >>   static void pci_no_domains(void)
> >>   {
> >>   #ifdef CONFIG_PCI_DOMAINS
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 2771625..064a1b6 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -1511,6 +1511,10 @@ static inline int pci_get_new_domain_nr(void) { return -ENOSYS; }
> >>   	 (pci_resource_end((dev), (bar)) -		\
> >>   	  pci_resource_start((dev), (bar)) + 1))
> >>   
> >> +extern bool pci_resources_page_aligned;  
> > I'm not seeing why this shouldn't have been static since you're
> > providing a function that tests it, there shouldn't really be any
> > external consumers.  
> 
> I made a mistake here. I will fix it in next version.
> 
> Thanks,
> Yongji Xie.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
  2016-03-17 11:38     ` Yongji Xie
@ 2016-03-17 12:48       ` Alex Williamson
  2016-03-18 11:51           ` Yongji Xie
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2016-03-17 12:48 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On Thu, 17 Mar 2016 19:38:29 +0800
Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:

> On 2016/3/17 0:32, Alex Williamson wrote:
> > On Mon,  7 Mar 2016 15:48:38 +0800
> > Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
> >  
> >> This patch adds IOMMU_CAP_INTR_REMAP for IODA host bridge so that
> >> we can mmap MSI-X table in vfio driver.
> >>
> >> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
> >> ---
> >>   arch/powerpc/platforms/powernv/pci-ioda.c |   17 +++++++++++++++++
> >>   1 file changed, 17 insertions(+)
> >>
> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> index f90dc04..f01b9ab 100644
> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> @@ -1955,6 +1955,20 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> >>   	.free = pnv_ioda2_table_free,
> >>   };
> >>   
> >> +static bool pnv_ioda_iommu_capable(enum iommu_cap cap)
> >> +{
> >> +	switch (cap) {
> >> +	case IOMMU_CAP_INTR_REMAP:
> >> +		return true;
> >> +	default:
> >> +		return false;
> >> +	}
> >> +}
> >> +
> >> +static struct iommu_ops pnv_ioda_iommu_ops = {
> >> +	.capable = pnv_ioda_iommu_capable,
> >> +};
> >> +
> >>   static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >>   				      struct pnv_ioda_pe *pe, unsigned int base,
> >>   				      unsigned int segs)
> >> @@ -3078,6 +3092,9 @@ static void pnv_pci_ioda_fixup(void)
> >>   
> >>   	/* Link NPU IODA tables to their PCI devices. */
> >>   	pnv_npu_ioda_fixup();
> >> +
> >> +	/* Add IOMMU_CAP_INTR_REMAP */
> >> +	bus_set_iommu(&pci_bus_type, &pnv_ioda_iommu_ops);
> >>   }
> >>   
> >>   /*  
> >
> > Doesn't this set you up for a world of hurt?  bus_set_iommu() calls
> > iommu_bus_init() which sets up notifiers, which maybe you don't care
> > about, but it also means that iommu_domain_alloc(&pci_bus_type) will
> > segfault because you're not providing a domain_alloc callback here.  
> 
> It seems to be hard to add IOMMU_CAP_INTR_REMAP on
> PPC64 platform.
> 
> And can we add a new ioctl in vfio_iommu_driver to check
> if interrupt remapping is supported so that we can use our
> own way to determine that on PPC64 platform?

I'd prefer not.  At the vfio user API level, the question is whether
the user can mmap over the msix table, testing a property/ioctl on the
iommu driver seems like an odd way to discover that.  We should be
determining whether that's safe in the kernel and exporting that info
on the vfio device itself, where it seems like we have various ways we
could do this within the existing ioctls.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
  2016-03-17 12:48       ` Alex Williamson
@ 2016-03-18 11:51           ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-18 11:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On 2016/3/17 20:48, Alex Williamson wrote:
> On Thu, 17 Mar 2016 19:38:29 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> On 2016/3/17 0:32, Alex Williamson wrote:
>>> On Mon,  7 Mar 2016 15:48:38 +0800
>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>   
>>>> This patch adds IOMMU_CAP_INTR_REMAP for IODA host bridge so that
>>>> we can mmap MSI-X table in vfio driver.
>>>>
>>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>>>> ---
>>>>    arch/powerpc/platforms/powernv/pci-ioda.c |   17 +++++++++++++++++
>>>>    1 file changed, 17 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> index f90dc04..f01b9ab 100644
>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> @@ -1955,6 +1955,20 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>>>>    	.free = pnv_ioda2_table_free,
>>>>    };
>>>>    
>>>> +static bool pnv_ioda_iommu_capable(enum iommu_cap cap)
>>>> +{
>>>> +	switch (cap) {
>>>> +	case IOMMU_CAP_INTR_REMAP:
>>>> +		return true;
>>>> +	default:
>>>> +		return false;
>>>> +	}
>>>> +}
>>>> +
>>>> +static struct iommu_ops pnv_ioda_iommu_ops = {
>>>> +	.capable = pnv_ioda_iommu_capable,
>>>> +};
>>>> +
>>>>    static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>>>    				      struct pnv_ioda_pe *pe, unsigned int base,
>>>>    				      unsigned int segs)
>>>> @@ -3078,6 +3092,9 @@ static void pnv_pci_ioda_fixup(void)
>>>>    
>>>>    	/* Link NPU IODA tables to their PCI devices. */
>>>>    	pnv_npu_ioda_fixup();
>>>> +
>>>> +	/* Add IOMMU_CAP_INTR_REMAP */
>>>> +	bus_set_iommu(&pci_bus_type, &pnv_ioda_iommu_ops);
>>>>    }
>>>>    
>>>>    /*
>>> Doesn't this set you up for a world of hurt?  bus_set_iommu() calls
>>> iommu_bus_init() which sets up notifiers, which maybe you don't care
>>> about, but it also means that iommu_domain_alloc(&pci_bus_type) will
>>> segfault because you're not providing a domain_alloc callback here.
>> It seems to be hard to add IOMMU_CAP_INTR_REMAP on
>> PPC64 platform.
>>
>> And can we add a new ioctl in vfio_iommu_driver to check
>> if interrupt remapping is supported so that we can use our
>> own way to determine that on PPC64 platform?
> I'd prefer not.  At the vfio user API level, the question is whether
> the user can mmap over the msix table, testing a property/ioctl on the
> iommu driver seems like an odd way to discover that.  We should be
> determining whether that's safe in the kernel and exporting that info
> on the vfio device itself, where it seems like we have various ways we
> could do this within the existing ioctls.  Thanks,
>
> Alex
>

Yes, you are right. It's not a good idea to add a new ioctl in
vfio_iommu_driver. Now I'd like to talk about the way to
determining whether it's safe to mmap over the msix table.

We currently use IOMMU_CAP_INTR_REMAP to determine that.
But there are some problems on PPC64 which never set
iommu_ops and ARM SMMU which set this capability but not
provide interrupt isolation. Can we add a variable/property
which can be set in vfio_iommu_driver->ops->attach_group()
and used in vfio_pci_driver to determine whether we can allow
mmapping msix table?  If so, we can still use
IOMMU_CAP_INTR_REMAP, or some arch-independent ways
when IOMMU_CAP_INTR_REMAP doesn't work.

Thanks,
Yongji Xie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge
@ 2016-03-18 11:51           ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-18 11:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: nikunj, zhong, kvm, linux-doc, aik, linux-pci, corbet,
	linux-kernel, warrier, paulus, bhelgaas, linuxppc-dev

On 2016/3/17 20:48, Alex Williamson wrote:
> On Thu, 17 Mar 2016 19:38:29 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> On 2016/3/17 0:32, Alex Williamson wrote:
>>> On Mon,  7 Mar 2016 15:48:38 +0800
>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>   
>>>> This patch adds IOMMU_CAP_INTR_REMAP for IODA host bridge so that
>>>> we can mmap MSI-X table in vfio driver.
>>>>
>>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>>>> ---
>>>>    arch/powerpc/platforms/powernv/pci-ioda.c |   17 +++++++++++++++++
>>>>    1 file changed, 17 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> index f90dc04..f01b9ab 100644
>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> @@ -1955,6 +1955,20 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>>>>    	.free = pnv_ioda2_table_free,
>>>>    };
>>>>    
>>>> +static bool pnv_ioda_iommu_capable(enum iommu_cap cap)
>>>> +{
>>>> +	switch (cap) {
>>>> +	case IOMMU_CAP_INTR_REMAP:
>>>> +		return true;
>>>> +	default:
>>>> +		return false;
>>>> +	}
>>>> +}
>>>> +
>>>> +static struct iommu_ops pnv_ioda_iommu_ops = {
>>>> +	.capable = pnv_ioda_iommu_capable,
>>>> +};
>>>> +
>>>>    static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>>>    				      struct pnv_ioda_pe *pe, unsigned int base,
>>>>    				      unsigned int segs)
>>>> @@ -3078,6 +3092,9 @@ static void pnv_pci_ioda_fixup(void)
>>>>    
>>>>    	/* Link NPU IODA tables to their PCI devices. */
>>>>    	pnv_npu_ioda_fixup();
>>>> +
>>>> +	/* Add IOMMU_CAP_INTR_REMAP */
>>>> +	bus_set_iommu(&pci_bus_type, &pnv_ioda_iommu_ops);
>>>>    }
>>>>    
>>>>    /*
>>> Doesn't this set you up for a world of hurt?  bus_set_iommu() calls
>>> iommu_bus_init() which sets up notifiers, which maybe you don't care
>>> about, but it also means that iommu_domain_alloc(&pci_bus_type) will
>>> segfault because you're not providing a domain_alloc callback here.
>> It seems to be hard to add IOMMU_CAP_INTR_REMAP on
>> PPC64 platform.
>>
>> And can we add a new ioctl in vfio_iommu_driver to check
>> if interrupt remapping is supported so that we can use our
>> own way to determine that on PPC64 platform?
> I'd prefer not.  At the vfio user API level, the question is whether
> the user can mmap over the msix table, testing a property/ioctl on the
> iommu driver seems like an odd way to discover that.  We should be
> determining whether that's safe in the kernel and exporting that info
> on the vfio device itself, where it seems like we have various ways we
> could do this within the existing ioctls.  Thanks,
>
> Alex
>

Yes, you are right. It's not a good idea to add a new ioctl in
vfio_iommu_driver. Now I'd like to talk about the way to
determining whether it's safe to mmap over the msix table.

We currently use IOMMU_CAP_INTR_REMAP to determine that.
But there are some problems on PPC64 which never set
iommu_ops and ARM SMMU which set this capability but not
provide interrupt isolation. Can we add a variable/property
which can be set in vfio_iommu_driver->ops->attach_group()
and used in vfio_pci_driver to determine whether we can allow
mmapping msix table?  If so, we can still use
IOMMU_CAP_INTR_REMAP, or some arch-independent ways
when IOMMU_CAP_INTR_REMAP doesn't work.

Thanks,
Yongji Xie

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices
  2016-03-17 12:40       ` Alex Williamson
@ 2016-03-18 15:04         ` Yongji Xie
  0 siblings, 0 replies; 28+ messages in thread
From: Yongji Xie @ 2016-03-18 15:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-pci, linuxppc-dev, linux-doc, bhelgaas,
	corbet, aik, benh, paulus, mpe, warrier, zhong, nikunj

On 2016/3/17 20:40, Alex Williamson wrote:
> On Thu, 17 Mar 2016 19:28:34 +0800
> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>
>> On 2016/3/17 0:30, Alex Williamson wrote:
>>> On Mon,  7 Mar 2016 15:48:35 +0800
>>> Yongji Xie <xyjxie@linux.vnet.ibm.com> wrote:
>>>   
>>>> When vfio passthrough a PCI device of which MMIO BARs
>>>> are smaller than PAGE_SIZE, guest will not handle the
>>>> mmio accesses to the BARs which leads to mmio emulations
>>>> in host.
>>>>
>>>> This is because vfio will not allow to passthrough one
>>>> BAR's mmio page which may be shared with other BARs.
>>>>
>>>> To solve this performance issue, this patch modifies
>>>> resource_alignment to support syntax where multiple
>>>> devices get the same alignment. So we can use something
>>>> like "pci=resource_alignment=*:*:*.*:noresize" to
>>>> enforce the alignment of all MMIO BARs to be at least
>>>> PAGE_SIZE so that one BAR's mmio page would not be
>>>> shared with other BARs.
>>>>
>>>> Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
>>>> ---
>>>>    Documentation/kernel-parameters.txt |    2 +
>>>>    drivers/pci/pci.c                   |   90 ++++++++++++++++++++++++++++++-----
>>>>    include/linux/pci.h                 |    4 ++
>>>>    3 files changed, 85 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>>>> index 8028631..74b38ab 100644
>>>> --- a/Documentation/kernel-parameters.txt
>>>> +++ b/Documentation/kernel-parameters.txt
>>>> @@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>>>    				aligned memory resources.
>>>>    				If <order of align> is not specified,
>>>>    				PAGE_SIZE is used as alignment.
>>>> +				<domain>, <bus>, <slot> and <func> can be set to
>>>> +				"*" which means match all values.
>>> I don't see anywhere that you're automatically enabling this for your
>>> platform, so presumably you're expecting users to determine on their
>>> own that they have a performance problem and hoping that they'll figure
>>> out that they need to use this option to resolve it.  The first irate
>>> question you'll get back is why doesn't this happen automatically?
>> Actually, I just want to make the code simple. Maybe it is
>> not a good idea to let users enable this option manually.
>> I will try to fix it like that:
> That's not entirely what I meant, I think having a way for a user to
> enable it is a good thing, but maybe there need to be cases where it's
> enabled automatically.  With 4k pages, this is often not an issue since
> the PCI spec recommends 4k or 8k alignment for resources, but that
> doesn't preclude an unusual device where a user might want to enable it
> anyway.  At 64k page size, problems become more common, so we need to
> think about either enabling it automatically or somehow making it more
> apparent to the user that the option is available for this purpose.

OK. I see. I will add a new arch-independent macro to indicate
the min alignments of all PCI BARs. And we can also specify the
alignments through the resource_alignment.

>> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
>> index 6f8065a..6659752 100644
>> --- a/arch/powerpc/include/asm/pci.h
>> +++ b/arch/powerpc/include/asm/pci.h
>> @@ -30,6 +30,8 @@
>>    #define PCIBIOS_MIN_IO         0x1000
>>    #define PCIBIOS_MIN_MEM                0x10000000
>>
>> +#define PCIBIOS_MIN_ALIGNMENT  PAGE_SIZE
>> +
>>    struct pci_dev;
>>
>>    /* Values for the `which' argument to sys_pciconfig_iobase syscall.  */
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index dadd28a..9f644e4 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -4593,6 +4593,8 @@ EXPORT_SYMBOL_GPL(pci_ignore_hotplug);
>>    static char resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0};
>>    static DEFINE_SPINLOCK(resource_alignment_lock);
>>
>> +#define DISABLE_ARCH_ALIGNMENT -1
>> +#define DEFAULT_ALIGNMENT      -2
>>    /**
>>     * pci_specified_resource_alignment - get resource alignment specified
>> by user.
>>     * @dev: the PCI device to get
>> @@ -4609,6 +4611,9 @@ static resource_size_t
>> pci_specified_resource_alignment(struct pci_dev *dev,
>>           char *p;
>>           bool invalid = false;
>>
>> +#ifdef PCIBIOS_MIN_ALIGNMENT
>> +       align = PCIBIOS_MIN_ALIGNMENT;
>> +#endif
>>           spin_lock(&resource_alignment_lock);
>>           p = resource_alignment_param;
>>           while (*p) {
>> @@ -4617,7 +4622,7 @@ static resource_size_t
>> pci_specified_resource_alignment(struct pci_dev *dev,
>>                                                           p[count] == '@') {
>>                           p += count + 1;
>>                   } else {
>> -                       align_order = -1;
>> +                       align_order = DEFAULT_ALIGNMENT;
>>                   }
>>                   if (p[0] == '*' && p[1] == ':') {
>>                           seg = -1;
>> @@ -4673,8 +4678,10 @@ static resource_size_t
>> pci_specified_resource_alignment(struct pci_dev *dev,
>>                           (bus == dev->bus->number || bus == -1) &&
>>                           (slot == PCI_SLOT(dev->devfn) || slot == -1) &&
>>                           (func == PCI_FUNC(dev->devfn) || func == -1)) {
>> -                       if (align_order == -1)
>> +                       if (align_order == DEFAULT_ALIGNMENT)
>>                                   align = PAGE_SIZE;
>> +                       else if (align_order == DISABLE_ARCH_ALIGNMENT)
>> +                               align = 0;
>>                           else
>>                                   align = 1 << align_order;
>>                           if (!pci_resources_page_aligned &&
>>>>    				PCI-PCI bridge can be specified, if resource
>>>>    				windows need to be expanded.
>>>>    				noresize: Don't change the resources' sizes when
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 760cce5..44ab59f 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -102,6 +102,8 @@ unsigned int pcibios_max_latency = 255;
>>>>    /* If set, the PCIe ARI capability will not be used. */
>>>>    static bool pcie_ari_disabled;
>>>>    
>>>> +bool pci_resources_page_aligned;
>>>> +
>>>>    /**
>>>>     * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
>>>>     * @bus: pointer to PCI bus structure to search
>>>> @@ -4604,6 +4606,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>>>>    	int seg, bus, slot, func, align_order, count;
>>>>    	resource_size_t align = 0;
>>>>    	char *p;
>>>> +	bool invalid = false;
>>>>    
>>>>    	spin_lock(&resource_alignment_lock);
>>>>    	p = resource_alignment_param;
>>>> @@ -4615,16 +4618,49 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>>>>    		} else {
>>>>    			align_order = -1;
>>>>    		}
>>>> -		if (sscanf(p, "%x:%x:%x.%x%n",
>>>> -			&seg, &bus, &slot, &func, &count) != 4) {
>>>> +		if (p[0] == '*' && p[1] == ':') {
>>>> +			seg = -1;
>>>> +			count = 1;
>>>> +		} else if (sscanf(p, "%x%n", &seg, &count) != 1 ||
>>>> +				p[count] != ':') {
>>>> +			invalid = true;
>>>> +			break;
>>>> +		}
>>>> +		p += count + 1;
>>>> +		if (*p == '*') {
>>>> +			bus = -1;
>>>> +			count = 1;
>>>> +		} else if (sscanf(p, "%x%n", &bus, &count) != 1) {
>>>> +			invalid = true;
>>>> +			break;
>>>> +		}
>>>> +		p += count;
>>>> +		if (*p == '.') {
>>>> +			slot = bus;
>>>> +			bus = seg;
>>>>    			seg = 0;
>>>> -			if (sscanf(p, "%x:%x.%x%n",
>>>> -					&bus, &slot, &func, &count) != 3) {
>>>> -				/* Invalid format */
>>>> -				printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n",
>>>> -					p);
>>>> +			p++;
>>>> +		} else if (*p == ':') {
>>>> +			p++;
>>>> +			if (p[0] == '*' && p[1] == '.') {
>>>> +				slot = -1;
>>>> +				count = 1;
>>>> +			} else if (sscanf(p, "%x%n", &slot, &count) != 1 ||
>>>> +					p[count] != '.') {
>>>> +				invalid = true;
>>>>    				break;
>>>>    			}
>>>> +			p += count + 1;
>>>> +		} else {
>>>> +			invalid = true;
>>>> +			break;
>>>> +		}
>>>> +		if (*p == '*') {
>>>> +			func = -1;
>>>> +			count = 1;
>>>> +		} else if (sscanf(p, "%x%n", &func, &count) != 1) {
>>>> +			invalid = true;
>>>> +			break;
>>>>    		}
>>>>    		p += count;
>>>>    		if (!strncmp(p, ":noresize", 9)) {
>>>> @@ -4632,23 +4668,34 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
>>>>    			p += 9;
>>>>    		} else
>>>>    			*resize = true;
>>>> -		if (seg == pci_domain_nr(dev->bus) &&
>>>> -			bus == dev->bus->number &&
>>>> -			slot == PCI_SLOT(dev->devfn) &&
>>>> -			func == PCI_FUNC(dev->devfn)) {
>>>> +		if ((seg == pci_domain_nr(dev->bus) || seg == -1) &&
>>>> +			(bus == dev->bus->number || bus == -1) &&
>>>> +			(slot == PCI_SLOT(dev->devfn) || slot == -1) &&
>>>> +			(func == PCI_FUNC(dev->devfn) || func == -1)) {
>>>>    			if (align_order == -1)
>>>>    				align = PAGE_SIZE;
>>>>    			else
>>>>    				align = 1 << align_order;
>>>> +			if (!pci_resources_page_aligned &&
>>>> +				(align >= PAGE_SIZE &&
>>>> +				seg == -1 && bus == -1 &&
>>>> +				slot == -1 && func == -1))
>>>> +				pci_resources_page_aligned = true;
>>>>    			/* Found */
>>>>    			break;
>>>>    		}
>>>>    		if (*p != ';' && *p != ',') {
>>>>    			/* End of param or invalid format */
>>>> +			invalid = true;
>>>>    			break;
>>>>    		}
>>>>    		p++;
>>>>    	}
>>>> +	if (invalid == true) {
>>>> +		/* Invalid format */
>>>> +		printk(KERN_ERR "PCI: Can't parse resource_alignment parameter:%s\n",
>>>> +				p);
>>>> +	}
>>>>    	spin_unlock(&resource_alignment_lock);
>>>>    	return align;
>>>>    }
>>>> @@ -4769,6 +4816,27 @@ static int __init pci_resource_alignment_sysfs_init(void)
>>>>    }
>>>>    late_initcall(pci_resource_alignment_sysfs_init);
>>>>    
>>>> +/*
>>>> + * This function checks whether PCI BARs' mmio page will be shared
>>>> + * with other BARs.
>>>> + */
>>>> +bool pci_resources_share_page(struct pci_dev *dev, int resno)
>>>> +{
>>>> +	struct resource *res = dev->resource + resno;
>>>> +
>>>> +	if (resource_size(res) >= PAGE_SIZE)
>>>> +		return false;
>>>> +	if (pci_resources_page_aligned && !(res->start & ~PAGE_MASK) &&
>>>> +		res->flags & IORESOURCE_MEM) {
>>>> +		if (res->sibling)
>>>> +			return (res->sibling->start & ~PAGE_MASK);
>>>> +		else
>>>> +			return false;
>>>> +	}
>>>> +	return true;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(pci_resources_share_page);
>>> This should be a separate patch, there's no mention of this part of the
>>> change at all in the commitlog.  Also, pci_resource_page_aligned is
>>> only set if we use the magic wildcards to set alignment for all
>>> devices, it's not set if we use a specific seg:bus:dev.fn.  That's
>>> not consistent.  Can't we make use of the STARTALIGN flag for this or
>>> did that get removed when resources were assigned?
>> I only set pci_resource_page_aligned when using magic wildcards
>> because I must make sure pci_resources_share_page() can return
>> correctly when a hotplug event happens.
>>
>> For example, there is only one half-page resource in the system.
>> And we use a specific seg:bus:dev.fn to force it to be page aligned.
>> So we set pci_resource_page_aligned because all devices are page
>> aligned now. It will return false when we call
>> pci_resources_share_page() for this resource. But this return value
>> will be not correct if we hot add a new device which also have
>> a half-page resource which will share one page with the previous
>> resource.
> That's a good point, it should be documented in a comment.

OK. I will do it.

>>> The test itself is not entirely reassuring, I'd like some positive
>>> indication that the device has been aligned, not simply that it should
>>> have been and the start alignment appears that it might have happened.
>>> Apparently you don't trust pci_resources_page_aligned either since you
>>> still test the start alignment.
>> Yes, I don't trust pci_resources_page_aligned. Because
>> resource_alignment will not work in some case, e.g.
>> PCI_PROBE_ONLY or IORESOURCE_PCI_FIXED is set.
>>
>> But I think *start* of the resource can give the positive
>> indication. The resource should not share page with
>> others if the start of the resource and its sibling are both
>> page aligned.
> If you can't trust pci_resource_page_aligned then the alignment of
> start seems like it only indicates that it's likely aligned, not that
> it is aligned.

Sorry. I don't get your point why start of resource can't
indicates it's aligned.  The res->start is equal to the start of
allocated PCI address.  Shouldn't it mean mmio page of the
resource is exclusive if res->start is page aligned and
res->sibling->start - res->start > PAGE_SIZE?

Thanks,
Yongji Xie

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-03-18 15:04 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-07  7:48 [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
2016-03-07  7:48 ` [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment Yongji Xie
2016-03-10  2:19   ` Alexey Kardashevskiy
2016-03-10  4:47     ` Yongji Xie
2016-03-07  7:48 ` [RFC PATCH v4 2/7] PCI: Use IORESOURCE_WINDOW to identify bridge resources Yongji Xie
2016-03-07  7:48 ` [RFC PATCH v4 3/7] PCI: Ignore resource_alignment if PCI_PROBE_ONLY was set Yongji Xie
2016-03-16 16:31   ` Alex Williamson
2016-03-17 11:35     ` Yongji Xie
2016-03-07  7:48 ` [RFC PATCH v4 4/7] PCI: Modify resource_alignment to support multiple devices Yongji Xie
2016-03-16 16:30   ` Alex Williamson
2016-03-17 11:28     ` Yongji Xie
2016-03-17 12:40       ` Alex Williamson
2016-03-18 15:04         ` Yongji Xie
2016-03-07  7:48 ` [RFC PATCH v4 5/7] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Yongji Xie
2016-03-16 16:30   ` Alex Williamson
2016-03-17 11:29     ` Yongji Xie
2016-03-07  7:48 ` [RFC PATCH v4 6/7] vfio-pci: Allow to mmap MSI-X table if IOMMU_CAP_INTR_REMAP was set Yongji Xie
2016-03-16 16:31   ` Alex Williamson
2016-03-17 11:32     ` Yongji Xie
2016-03-07  7:48 ` [RFC PATCH v4 7/7] powerpc/powernv/pci-ioda: Add IOMMU_CAP_INTR_REMAP for IODA host bridge Yongji Xie
2016-03-16 16:32   ` Alex Williamson
2016-03-17 11:38     ` Yongji Xie
2016-03-17 12:48       ` Alex Williamson
2016-03-18 11:51         ` Yongji Xie
2016-03-18 11:51           ` Yongji Xie
2016-03-16 10:51 ` [RFC PATCH v4 0/7] vfio-pci: Allow to mmap sub-page MMIO BARs and MSI-X table Yongji Xie
2016-03-16 14:10   ` Bjorn Helgaas
2016-03-17 10:46     ` Yongji Xie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.