* [PATCH v12 00/21] Enable SRIOV on Power8
@ 2015-02-24  8:32 Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 01/21] PCI: Print more info in sriov_enable() error message Bjorn Helgaas
                   ` (20 more replies)
  0 siblings, 21 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:32 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

Wei Yang's most recent POWER8 SR-IOV patchset was v11, posted on Jan 15,
2015.

I'm having a hard time keeping everything straight between the tweaks I've
made on my branch and the incremental updates.  I think it's easier to
repost the whole series so one can easily collect everything that goes
together.  So here's a v12 with the changes I've made.

Wei, please follow up with a v13 to fix anything I broke here.  Here's how
I would do that using stgit:

  git checkout -b pci/virtualization-v13 pci/virtualization-v12
  stg init
  stg uncommit -n 21
  <hack on the patches>
  stg mail -v v13 ... pci-print-more-info-in..powerpc-pci-add-pci-resource

I put v10, v11, and v12 on branches based on v4.0-rc1:

  pci/virtualization-v10	(posted 12/22/2014)
  pci/virtualization-v11	(posted 01/15/2015)
  pci/virtualization-v12	(this posting)

This makes it relatively easy to diff the versions, e.g.,

  git diff pci/virtualization-v11 pci/virtualization-v12

These branches are at

  https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/

v12:
   * remove "align" parameter from pcibios_iov_resource_alignment();
     the default version returns pci_iov_resource_size() instead of the
     "align" parameter
   * in powerpc pcibios_iov_resource_alignment(), return
     pci_iov_resource_size() if there's no ppc_md function pointer
   * in pci_sriov_resource_alignment(), don't re-read base, since we
     saved the required alignment when reading it the first time
   * remove "vf_num" parameter from add_dev_pci_info() and
     remove_dev_pci_info(); use pci_sriov_get_totalvfs() instead
   * use dev_warn() instead of pr_warn() when possible
   * check to be sure IOV BAR is still in range after shifting, change
     pnv_pci_vf_resource_shift() from void to int
   * improve sriov_enable() error message
   * improve SR-IOV BAR sizing message
   * index IOV resources in conventional style
   * include preamble patches (refresh offset/stride when updating NumVFs,
     calculate max buses required)
   * restructure pci_iov_max_bus_range() to return a value instead of
     updating one internally, and rename it to virtfn_max_buses()
   * fix typos & formatting
   * expand documentation

Bjorn

---

Bjorn Helgaas (2):
      PCI: Print more info in sriov_enable() error message
      PCI: Index IOV resources in the conventional style

Gavin Shan (1):
      powerpc/pci: Refactor pci_dn

Wei Yang (18):
      PCI: Print PF SR-IOV resource that contains all VF(n) BAR space
      PCI: Keep individual VF BAR size in struct pci_sriov
      PCI: Refresh First VF Offset and VF Stride when updating NumVFs
      PCI: Calculate maximum number of buses required for VFs
      PCI: Export pci_iov_virtfn_bus() and pci_iov_virtfn_devfn()
      PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()
      PCI: Add pcibios_iov_resource_alignment() interface
      PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
      powerpc/pci: Don't unset PCI resources for VFs
      powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor
      powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
      powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
      powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
      powerpc/powernv: Shift VF resource with an offset
      powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
      powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
      powerpc/pci: Remove unused struct pci_dn.pcidev field
      powerpc/pci: Add PCI resource alignment documentation


 .../powerpc/pci_iov_resource_on_powernv.txt        |  305 ++++++++
 arch/powerpc/include/asm/device.h                  |    3 
 arch/powerpc/include/asm/iommu.h                   |    3 
 arch/powerpc/include/asm/machdep.h                 |    5 
 arch/powerpc/include/asm/pci-bridge.h              |   24 +
 arch/powerpc/kernel/pci-common.c                   |   19 
 arch/powerpc/kernel/pci_dn.c                       |  256 ++++++-
 arch/powerpc/platforms/powernv/eeh-powernv.c       |   14 
 arch/powerpc/platforms/powernv/pci-ioda.c          |  777 +++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c               |   87 +-
 arch/powerpc/platforms/powernv/pci.h               |   13 
 drivers/pci/iov.c                                  |  155 +++-
 drivers/pci/pci.h                                  |    2 
 drivers/pci/setup-bus.c                            |   83 ++
 include/linux/pci.h                                |   15 
 15 files changed, 1622 insertions(+), 139 deletions(-)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v12 01/21] PCI: Print more info in sriov_enable() error message
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 02/21] PCI: Print PF SR-IOV resource that contains all VF(n) BAR space Bjorn Helgaas
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

If we don't have space for all the bus numbers required to enable VFs,
print the largest bus number required and the range available.

No functional change; improved error message only.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4b3a4eaad996..c4c33ead03bc 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -180,6 +180,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_dev *pdev;
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
+	u8 bus;
 
 	if (!nr_virtfn)
 		return 0;
@@ -216,8 +217,10 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;
 
-	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
-		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
+	bus = virtfn_bus(dev, nr_virtfn - 1);
+	if (bus > dev->bus->busn_res.end) {
+		dev_err(&dev->dev, "can't enable %d VFs (bus %02x out of range of %pR)\n",
+			nr_virtfn, bus, &dev->bus->busn_res);
 		return -ENOMEM;
 	}
 



* [PATCH v12 02/21] PCI: Print PF SR-IOV resource that contains all VF(n) BAR space
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 01/21] PCI: Print more info in sriov_enable() error message Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 03/21] PCI: Keep individual VF BAR size in struct pci_sriov Bjorn Helgaas
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

When we size VF BAR0, VF BAR1, etc., from the SR-IOV Capability of a PF, we
learn the alignment requirement and amount of space consumed by a single
VF.  But when VFs are enabled, *each* of the NumVFs consumes that amount of
space, so the total size of the PF resource is "VF BAR size * NumVFs".

Add a printk of the total space consumed by the VFs, matching what we
already do for normal non-IOV BARs.

No functional change; new message only.

[bhelgaas: split out into its own patch]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index c4c33ead03bc..05f9d97e4175 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -372,6 +372,8 @@ found:
 			goto failed;
 		}
 		res->end = res->start + resource_size(res) * total - 1;
+		dev_info(&dev->dev, "VF(n) BAR%d space: %pR (contains BAR%d for %d VFs)\n",
+			 i, res, i, total);
 		nres++;
 	}
 



* [PATCH v12 03/21] PCI: Keep individual VF BAR size in struct pci_sriov
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 01/21] PCI: Print more info in sriov_enable() error message Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 02/21] PCI: Print PF SR-IOV resource that contains all VF(n) BAR space Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 04/21] PCI: Index IOV resources in the conventional style Bjorn Helgaas
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

Currently we don't store the individual VF BAR size.  We calculate it when
needed by dividing the PF's IOV resource size (which contains space for
*all* the VFs) by total_VFs or by reading the BAR in the SR-IOV capability
again.

Keep the individual VF BAR size in struct pci_sriov.barsz[], add
pci_iov_resource_size() to retrieve it, and use that instead of doing the
division or reading the SR-IOV capability BAR.

[bhelgaas: rename to "barsz[]", simplify barsz[] index computation, remove
SR-IOV capability BAR sizing]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c   |   39 ++++++++++++++++++++-------------------
 drivers/pci/pci.h   |    1 +
 include/linux/pci.h |    3 +++
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 05f9d97e4175..5bca0e1a2799 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -57,6 +57,14 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
 		pci_remove_bus(virtbus);
 }
 
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	if (!dev->is_physfn)
+		return 0;
+
+	return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
+}
+
 static int virtfn_add(struct pci_dev *dev, int id, int reset)
 {
 	int i;
@@ -92,8 +100,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
 		virtfn->resource[i].flags = res->flags;
-		size = resource_size(res);
-		do_div(size, iov->total_VFs);
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
 		virtfn->resource[i].start = res->start + size * id;
 		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
 		rc = request_resource(res, &virtfn->resource[i]);
@@ -311,7 +318,7 @@ static void sriov_disable(struct pci_dev *dev)
 
 static int sriov_init(struct pci_dev *dev, int pos)
 {
-	int i;
+	int i, bar64;
 	int rc;
 	int nres;
 	u32 pgsz;
@@ -360,29 +367,29 @@ found:
 	pgsz &= ~(pgsz - 1);
 	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
 
+	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+	if (!iov)
+		return -ENOMEM;
+
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = dev->resource + PCI_IOV_RESOURCES + i;
-		i += __pci_read_base(dev, pci_bar_unknown, res,
-				     pos + PCI_SRIOV_BAR + i * 4);
+		bar64 = __pci_read_base(dev, pci_bar_unknown, res,
+					pos + PCI_SRIOV_BAR + i * 4);
 		if (!res->flags)
 			continue;
 		if (resource_size(res) & (PAGE_SIZE - 1)) {
 			rc = -EIO;
 			goto failed;
 		}
+		iov->barsz[i] = resource_size(res);
 		res->end = res->start + resource_size(res) * total - 1;
 		dev_info(&dev->dev, "VF(n) BAR%d space: %pR (contains BAR%d for %d VFs)\n",
 			 i, res, i, total);
+		i += bar64;
 		nres++;
 	}
 
-	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
-	if (!iov) {
-		rc = -ENOMEM;
-		goto failed;
-	}
-
 	iov->pos = pos;
 	iov->nres = nres;
 	iov->ctrl = ctrl;
@@ -414,6 +421,7 @@ failed:
 		res->flags = 0;
 	}
 
+	kfree(iov);
 	return rc;
 }
 
@@ -510,14 +518,7 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
  */
 resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
-	struct resource tmp;
-	int reg = pci_iov_resource_bar(dev, resno);
-
-	if (!reg)
-		return 0;
-
-	 __pci_read_base(dev, pci_bar_unknown, &tmp, reg);
-	return resource_alignment(&tmp);
+	return pci_iov_resource_size(dev, resno);
 }
 
 /**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4091f82239cd..57329645dd01 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -247,6 +247,7 @@ struct pci_sriov {
 	struct pci_dev *dev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */
 	struct mutex lock;	/* lock for VF bus */
+	resource_size_t barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
 };
 
 #ifdef CONFIG_PCI_ATS
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 211e9da8a7d7..15596582e575 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1675,6 +1675,7 @@ int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
@@ -1686,6 +1687,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
 { return 0; }
 static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 { return 0; }
+static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{ return 0; }
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)



* [PATCH v12 04/21] PCI: Index IOV resources in the conventional style
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (2 preceding siblings ...)
  2015-02-24  8:33 ` [PATCH v12 03/21] PCI: Keep individual VF BAR size in struct pci_sriov Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 05/21] PCI: Refresh First VF Offset and VF Stride when updating NumVFs Bjorn Helgaas
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

Most of PCI uses "res = &dev->resource[i]", not "res = dev->resource + i".
Use that style in iov.c also.

No functional change.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5bca0e1a2799..27b98c361823 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -95,7 +95,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	virtfn->multifunction = 0;
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		if (!res->parent)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
@@ -212,7 +212,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		bars |= (1 << (i + PCI_IOV_RESOURCES));
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		if (res->parent)
 			nres++;
 	}
@@ -373,7 +373,7 @@ found:
 
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		bar64 = __pci_read_base(dev, pci_bar_unknown, res,
 					pos + PCI_SRIOV_BAR + i * 4);
 		if (!res->flags)
@@ -417,7 +417,7 @@ found:
 
 failed:
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		res->flags = 0;
 	}
 



* [PATCH v12 05/21] PCI: Refresh First VF Offset and VF Stride when updating NumVFs
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (3 preceding siblings ...)
  2015-02-24  8:33 ` [PATCH v12 04/21] PCI: Index IOV resources in the conventional style Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 06/21] PCI: Calculate maximum number of buses required for VFs Bjorn Helgaas
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

The First VF Offset and VF Stride fields depend on the NumVFs setting, so
refresh the cached fields in struct pci_sriov when updating NumVFs.  See
the SR-IOV spec r1.1, sec 3.3.9 and 3.3.10.

[bhelgaas: changelog, remove kernel-doc comment marker]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c |   23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 27b98c361823..a8752c2c2b53 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,21 @@ static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
 		dev->sriov->stride * id) & 0xff;
 }
 
+/*
+ * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
+ * change when NumVFs changes.
+ *
+ * Update iov->offset and iov->stride when NumVFs is written.
+ */
+static inline void pci_iov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
+{
+	struct pci_sriov *iov = dev->sriov;
+
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &iov->offset);
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &iov->stride);
+}
+
 static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
 {
 	struct pci_bus *child;
@@ -253,7 +268,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 			return rc;
 	}
 
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+	pci_iov_set_numvfs(dev, nr_virtfn);
 	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
@@ -282,7 +297,7 @@ failed:
 	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);
+	pci_iov_set_numvfs(dev, 0);
 	ssleep(1);
 	pci_cfg_access_unlock(dev);
 
@@ -313,7 +328,7 @@ static void sriov_disable(struct pci_dev *dev)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");
 
 	iov->num_VFs = 0;
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);
+	pci_iov_set_numvfs(dev, 0);
 }
 
 static int sriov_init(struct pci_dev *dev, int pos)
@@ -452,7 +467,7 @@ static void sriov_restore_state(struct pci_dev *dev)
 		pci_update_resource(dev, i);
 
 	pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, iov->num_VFs);
+	pci_iov_set_numvfs(dev, iov->num_VFs);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
 	if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
 		msleep(100);



* [PATCH v12 06/21] PCI: Calculate maximum number of buses required for VFs
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (4 preceding siblings ...)
  2015-02-24  8:33 ` [PATCH v12 05/21] PCI: Refresh First VF Offset and VF Stride when updating NumVFs Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 07/21] PCI: Export pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() Bjorn Helgaas
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

An SR-IOV device can change its First VF Offset and VF Stride based on the
values of ARI Capable Hierarchy and NumVFs.  The number of buses required
for all VFs is determined by NumVFs, First VF Offset, and VF Stride (see
SR-IOV spec r1.1, sec 2.1.2).

Previously pci_iov_bus_range() computed how many buses would be required by
TotalVFs, but this was based on a single NumVFs value and may not have been
the maximum for all NumVFs configurations.

Iterate over all valid NumVFs and calculate the maximum number of bus
numbers that could ever be required for VFs of this device.

[bhelgaas: changelog, compute busnr of NumVFs, not TotalVFs, remove
kernel-doc comment marker]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c |   31 +++++++++++++++++++++++++++----
 drivers/pci/pci.h |    1 +
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index a8752c2c2b53..2ae921f84bd3 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -46,6 +46,30 @@ static inline void pci_iov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &iov->stride);
 }
 
+/*
+ * The PF consumes one bus number.  NumVFs, First VF Offset, and VF Stride
+ * determine how many additional bus numbers will be consumed by VFs.
+ *
+ * Iterate over all valid NumVFs and calculate the maximum number of bus
+ * numbers that could ever be required.
+ */
+static inline u8 virtfn_max_buses(struct pci_dev *dev)
+{
+	struct pci_sriov *iov = dev->sriov;
+	int nr_virtfn;
+	u8 max = 0;
+	u8 busnr;
+
+	for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
+		pci_iov_set_numvfs(dev, nr_virtfn);
+		busnr = virtfn_bus(dev, nr_virtfn - 1);
+		if (busnr > max)
+			max = busnr;
+	}
+
+	return max;
+}
+
 static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
 {
 	struct pci_bus *child;
@@ -427,6 +451,7 @@ found:
 
 	dev->sriov = iov;
 	dev->is_physfn = 1;
+	iov->max_VF_buses = virtfn_max_buses(dev);
 
 	return 0;
 
@@ -556,15 +581,13 @@ void pci_restore_iov_state(struct pci_dev *dev)
 int pci_iov_bus_range(struct pci_bus *bus)
 {
 	int max = 0;
-	u8 busnr;
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (!dev->is_physfn)
 			continue;
-		busnr = virtfn_bus(dev, dev->sriov->total_VFs - 1);
-		if (busnr > max)
-			max = busnr;
+		if (dev->sriov->max_VF_buses > max)
+			max = dev->sriov->max_VF_buses;
 	}
 
 	return max ? max - bus->number : 0;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 57329645dd01..bae593c04541 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -243,6 +243,7 @@ struct pci_sriov {
 	u16 stride;		/* following VF stride */
 	u32 pgsz;		/* page size for BAR alignment */
 	u8 link;		/* Function Dependency Link */
+	u8 max_VF_buses;	/* max buses consumed by VFs */
 	u16 driver_max_VFs;	/* max num VFs driver supports */
 	struct pci_dev *dev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */



* [PATCH v12 07/21] PCI: Export pci_iov_virtfn_bus() and pci_iov_virtfn_devfn()
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (5 preceding siblings ...)
  2015-02-24  8:33 ` [PATCH v12 06/21] PCI: Calculate maximum number of buses required for VFs Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 08/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable() Bjorn Helgaas
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

On PowerNV, some resource reservation is needed for SR-IOV VFs that don't
exist at boot time.  To match resources with VFs, the code needs to know
each VF's BDF in advance.

Rename virtfn_bus() and virtfn_devfn() to pci_iov_virtfn_bus() and
pci_iov_virtfn_devfn() and export them.

[bhelgaas: changelog, make "busnr" int]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c   |   28 ++++++++++++++++------------
 include/linux/pci.h |   11 +++++++++++
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 2ae921f84bd3..5643a1011e23 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -19,16 +19,20 @@
 
 #define VIRTFN_ID_LEN	16
 
-static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+int pci_iov_virtfn_bus(struct pci_dev *dev, int vf_id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
-				    dev->sriov->stride * id) >> 8);
+				    dev->sriov->stride * vf_id) >> 8);
 }
 
-static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return (dev->devfn + dev->sriov->offset +
-		dev->sriov->stride * id) & 0xff;
+		dev->sriov->stride * vf_id) & 0xff;
 }
 
 /*
@@ -58,11 +62,11 @@ static inline u8 virtfn_max_buses(struct pci_dev *dev)
 	struct pci_sriov *iov = dev->sriov;
 	int nr_virtfn;
 	u8 max = 0;
-	u8 busnr;
+	int busnr;
 
 	for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
 		pci_iov_set_numvfs(dev, nr_virtfn);
-		busnr = virtfn_bus(dev, nr_virtfn - 1);
+		busnr = pci_iov_virtfn_bus(dev, nr_virtfn - 1);
 		if (busnr > max)
 			max = busnr;
 	}
@@ -116,7 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	struct pci_bus *bus;
 
 	mutex_lock(&iov->dev->sriov->lock);
-	bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+	bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
 	if (!bus)
 		goto failed;
 
@@ -124,7 +128,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	if (!virtfn)
 		goto failed0;
 
-	virtfn->devfn = virtfn_devfn(dev, id);
+	virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
 	virtfn->vendor = dev->vendor;
 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
 	pci_setup_device(virtfn);
@@ -186,8 +190,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	struct pci_sriov *iov = dev->sriov;
 
 	virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
-					     virtfn_bus(dev, id),
-					     virtfn_devfn(dev, id));
+					     pci_iov_virtfn_bus(dev, id),
+					     pci_iov_virtfn_devfn(dev, id));
 	if (!virtfn)
 		return;
 
@@ -226,7 +230,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_dev *pdev;
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
-	u8 bus;
+	int bus;
 
 	if (!nr_virtfn)
 		return 0;
@@ -263,7 +267,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;
 
-	bus = virtfn_bus(dev, nr_virtfn - 1);
+	bus = pci_iov_virtfn_bus(dev, nr_virtfn - 1);
 	if (bus > dev->bus->busn_res.end) {
 		dev_err(&dev->dev, "can't enable %d VFs (bus %02x out of range of %pR)\n",
 			nr_virtfn, bus, &dev->bus->busn_res);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 15596582e575..99ea94835fb6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1669,6 +1669,9 @@ int pci_ext_cfg_avail(void);
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
 #ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
@@ -1677,6 +1680,14 @@ int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
+static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
+static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }



* [PATCH v12 08/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (6 preceding siblings ...)
  2015-02-24  8:33 ` [PATCH v12 07/21] PCI: Export pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:39   ` Bjorn Helgaas
  2015-02-24  8:33 ` [PATCH v12 09/21] PCI: Add pcibios_iov_resource_alignment() interface Bjorn Helgaas
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

VFs are dynamically created when a driver enables them.  On some platforms,
like PowerNV, special resources are necessary to enable VFs.

Add platform hooks for enabling and disabling VFs.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5643a1011e23..cc6fedf4a1b9 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -220,6 +220,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	pci_dev_put(dev);
 }
 
+int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	return 0;
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
 	int rc;
@@ -231,6 +236,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
 	int bus;
+	int retval;
 
 	if (!nr_virtfn)
 		return 0;
@@ -307,6 +313,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	if (nr_virtfn < initial)
 		initial = nr_virtfn;
 
+	if ((retval = pcibios_sriov_enable(dev, initial))) {
+		dev_err(&dev->dev, "failure %d from pcibios_sriov_enable()\n",
+			retval);
+		return retval;
+	}
+
 	for (i = 0; i < initial; i++) {
 		rc = virtfn_add(dev, i, 0);
 		if (rc)
@@ -335,6 +347,11 @@ failed:
 	return rc;
 }
 
+int __weak pcibios_sriov_disable(struct pci_dev *pdev)
+{
+	return 0;
+}
+
 static void sriov_disable(struct pci_dev *dev)
 {
 	int i;
@@ -346,6 +363,8 @@ static void sriov_disable(struct pci_dev *dev)
 	for (i = 0; i < iov->num_VFs; i++)
 		virtfn_remove(dev, i, 0);
 
+	pcibios_sriov_disable(dev);
+
 	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);



* [PATCH v12 09/21] PCI: Add pcibios_iov_resource_alignment() interface
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (7 preceding siblings ...)
  2015-02-24  8:33 ` [PATCH v12 08/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable() Bjorn Helgaas
@ 2015-02-24  8:33 ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning Bjorn Helgaas
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:33 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

Per the SR-IOV spec r1.1, sec 3.3.14, the required alignment of a PF's IOV
BAR is the size of an individual VF BAR, and the size consumed is the
individual VF BAR size times NumVFs.

The PowerNV platform has additional alignment requirements to help support
its Partitionable Endpoint device isolation feature (see
Documentation/powerpc/pci_iov_resource_on_powernv.txt).

Add a pcibios_iov_resource_alignment() interface to allow platforms to
request additional alignment.

[bhelgaas: changelog, adapt to reworked pci_sriov_resource_alignment(),
drop "align" parameter]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/iov.c   |    8 +++++++-
 include/linux/pci.h |    1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index cc6fedf4a1b9..bde0f02cae32 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -569,6 +569,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
 		4 * (resno - PCI_IOV_RESOURCES);
 }
 
+resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
+						      int resno)
+{
+	return pci_iov_resource_size(dev, resno);
+}
+
 /**
  * pci_sriov_resource_alignment - get resource alignment for VF BAR
  * @dev: the PCI device
@@ -581,7 +587,7 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
  */
 resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
-	return pci_iov_resource_size(dev, resno);
+	return pcibios_iov_resource_alignment(dev, resno);
 }
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 99ea94835fb6..4e1f17db1a81 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1174,6 +1174,7 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 					 unsigned long type);
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)

* [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (8 preceding siblings ...)
  2015-02-24  8:33 ` [PATCH v12 09/21] PCI: Add pcibios_iov_resource_alignment() interface Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  8:41   ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 11/21] powerpc/pci: Don't unset PCI resources for VFs Bjorn Helgaas
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

When sizing and assigning resources, we divide the resources into two
lists: the requested list and the additional list.  We don't consider the
alignment of additional VF(n) BAR space.

This is reasonable because the alignment required for the VF(n) BAR space
is the size of an individual VF BAR, not the size of the space for *all*
VFs.  But some platforms, e.g., PowerNV, require additional alignment.

Consider the additional IOV BAR alignment when sizing and assigning
resources.  When there is not enough system MMIO space, the PF's IOV BAR
alignment will not contribute to the bridge.  When there is enough system
MMIO space, the additional alignment will contribute to the bridge.

Also, take advantage of pci_dev_resource::min_align to store this
additional alignment.

[bhelgaas: changelog, printk cast]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/setup-bus.c |   83 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 70 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index e3e17f3c0f0f..affbceae560f 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
 	}
 }
 
-static resource_size_t get_res_add_size(struct list_head *head,
-					struct resource *res)
+static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
+					       struct resource *res)
 {
 	struct pci_dev_resource *dev_res;
 
@@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
 			int idx = res - &dev_res->dev->resource[0];
 
 			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
-				 "res[%d]=%pR get_res_add_size add_size %llx\n",
+				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
 				 idx, dev_res->res,
-				 (unsigned long long)dev_res->add_size);
+				 (unsigned long long)dev_res->add_size,
+				 (unsigned long long)dev_res->min_align);
 
-			return dev_res->add_size;
+			return dev_res;
 		}
 	}
 
-	return 0;
+	return NULL;
+}
+
+static resource_size_t get_res_add_size(struct list_head *head,
+					struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->add_size : 0;
+}
+
+static resource_size_t get_res_add_align(struct list_head *head,
+					 struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->min_align : 0;
 }
 
+
 /* Sort resources by alignment */
 static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
 {
@@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
 	LIST_HEAD(save_head);
 	LIST_HEAD(local_fail_head);
 	struct pci_dev_resource *save_res;
-	struct pci_dev_resource *dev_res, *tmp_res;
+	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
 	unsigned long fail_type;
+	resource_size_t add_align, align;
 
 	/* Check if optional add_size is there */
 	if (!realloc_head || list_empty(realloc_head))
@@ -384,10 +405,38 @@ static void __assign_resources_sorted(struct list_head *head,
 	}
 
 	/* Update res in head list with add_size in realloc_head list */
-	list_for_each_entry(dev_res, head, list)
+	list_for_each_entry_safe(dev_res, tmp_res, head, list) {
 		dev_res->res->end += get_res_add_size(realloc_head,
 							dev_res->res);
 
+		/*
+		 * There are two kinds of additional resources in the list:
+		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
+		 * 2. SR-IOV resource   -- IORESOURCE_SIZEALIGN
+		 * Here just fix the additional alignment for bridge
+		 */
+		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
+			continue;
+
+		add_align = get_res_add_align(realloc_head, dev_res->res);
+
+		/* Reorder the list by their alignment */
+		if (add_align > dev_res->res->start) {
+			resource_size_t r_size = resource_size(dev_res->res);
+			dev_res->res->start = add_align;
+			dev_res->res->end = add_align + r_size - 1;
+
+			list_for_each_entry(dev_res2, head, list) {
+				align = pci_resource_alignment(dev_res2->dev,
+							       dev_res2->res);
+				if (add_align > align)
+					list_move_tail(&dev_res->list,
+						       &dev_res2->list);
+			}
+		}
+
+	}
+
 	/* Try updated head list with add_size added */
 	assign_requested_resources_sorted(head, &local_fail_head);
 
@@ -962,6 +1011,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	struct resource *b_res = find_free_bus_resource(bus,
 					mask | IORESOURCE_PREFETCH, type);
 	resource_size_t children_add_size = 0;
+	resource_size_t children_add_align = 0;
+	resource_size_t add_align = 0;
 
 	if (!b_res)
 		return -ENOSPC;
@@ -986,6 +1037,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			/* put SRIOV requested res to the optional list */
 			if (realloc_head && i >= PCI_IOV_RESOURCES &&
 					i <= PCI_IOV_RESOURCE_END) {
+				add_align = max(pci_resource_alignment(dev, r), add_align);
 				r->end = r->start - 1;
 				add_to_list(realloc_head, dev, r, r_size, 0/* don't care */);
 				children_add_size += r_size;
@@ -1016,19 +1068,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			if (order > max_order)
 				max_order = order;
 
-			if (realloc_head)
+			if (realloc_head) {
 				children_add_size += get_res_add_size(realloc_head, r);
+				children_add_align = get_res_add_align(realloc_head, r);
+				add_align = max(add_align, children_add_align);
+			}
 		}
 	}
 
 	min_align = calculate_mem_align(aligns, max_order);
 	min_align = max(min_align, window_alignment(bus, b_res->flags));
 	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
+	add_align = max(min_align, add_align);
 	if (children_add_size > add_size)
 		add_size = children_add_size;
 	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
 		calculate_memsize(size, min_size, add_size,
-				resource_size(b_res), min_align);
+				resource_size(b_res), add_align);
 	if (!size0 && !size1) {
 		if (b_res->start || b_res->end)
 			dev_info(&bus->self->dev, "disabling bridge window %pR to %pR (unused)\n",
@@ -1040,10 +1096,11 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	b_res->end = size0 + min_align - 1;
 	b_res->flags |= IORESOURCE_STARTALIGN;
 	if (size1 > size0 && realloc_head) {
-		add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align);
-		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx\n",
+		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
+		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx add_align %llx\n",
 			   b_res, &bus->busn_res,
-			   (unsigned long long)size1-size0);
+			   (unsigned long long)(size1 - size0),
+			   (unsigned long long)add_align);
 	}
 	return 0;
 }


* [PATCH v12 11/21] powerpc/pci: Don't unset PCI resources for VFs
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (9 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  8:44   ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 12/21] powerpc/pci: Refactor pci_dn Bjorn Helgaas
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

When we reassign resources with the PCI_REASSIGN_ALL_RSRC flag, all
resources are cleared out during device header fixup and then reassigned by
the PCI core.  However, VF resources are not reassigned by the core, so we
shouldn't clear them out.

If the pci_dev is a VF, skip the resource unset process.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/kernel/pci-common.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 2a525c938158..82031011522f 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
 		       pci_name(dev));
 		return;
 	}
+
+	if (dev->is_virtfn)
+		return;
+
 	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
 		struct resource *res = dev->resource + i;
 		struct pci_bus_region reg;


* [PATCH v12 12/21] powerpc/pci: Refactor pci_dn
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (10 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 11/21] powerpc/pci: Don't unset PCI resources for VFs Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 13/21] powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor Bjorn Helgaas
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Gavin Shan <gwshan@linux.vnet.ibm.com>

pci_dn is an extension of the PCI device node and is normally created from
the device node.  Unfortunately, VFs are enabled dynamically by the PF's
driver, so they have no corresponding device nodes, and hence no pci_dn.
Refactor pci_dn to support VFs:

   * pci_dn is organized as a hierarchical tree.  A VF's pci_dn is put
     on the child list of the pci_dn of the PF's bridge; every other
     device's pci_dn is put on the child list of the pci_dn of its
     upstream bridge.

   * A VF's pci_dn is created dynamically when the PF enables VFs and
     destroyed when the PF disables them.  The pci_dn of every other
     device is still created from its device node as before.

   * For any PCI device (VF or not), its pci_dn can be found via
     pdev->dev.archdata.firmware_data, PCI_DN(devnode), or the
     parent's child list.  The fast path (fetching the pci_dn through
     the PCI device instance) is populated at early fixup time.

[bhelgaas: add ifdef around add_one_dev_pci_info(), use dev_printk()]
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/include/asm/device.h         |    3 
 arch/powerpc/include/asm/pci-bridge.h     |   14 +-
 arch/powerpc/kernel/pci_dn.c              |  245 +++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |   16 ++
 4 files changed, 272 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h
index 38faeded7d59..29992cd020bb 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -34,6 +34,9 @@ struct dev_archdata {
 #ifdef CONFIG_SWIOTLB
 	dma_addr_t		max_direct_dma_addr;
 #endif
+#ifdef CONFIG_PPC64
+	void			*firmware_data;
+#endif
 #ifdef CONFIG_EEH
 	struct eeh_dev		*edev;
 #endif
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 546d036fe925..513f8f27060d 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -89,6 +89,7 @@ struct pci_controller {
 
 #ifdef CONFIG_PPC64
 	unsigned long buid;
+	void *firmware_data;
 #endif	/* CONFIG_PPC64 */
 
 	void *private_data;
@@ -154,9 +155,13 @@ static inline int isa_vaddr_is_ioport(void __iomem *address)
 struct iommu_table;
 
 struct pci_dn {
+	int     flags;
+#define PCI_DN_FLAG_IOV_VF     0x01
+
 	int	busno;			/* pci bus number */
 	int	devfn;			/* pci device and function number */
 
+	struct  pci_dn *parent;
 	struct  pci_controller *phb;	/* for pci devices */
 	struct	iommu_table *iommu_table;	/* for phb's or bridges */
 	struct	device_node *node;	/* back-pointer to the device_node */
@@ -171,14 +176,19 @@ struct pci_dn {
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
 #endif
+	struct list_head child_list;
+	struct list_head list;
 };
 
 /* Get the pointer to a device_node's pci_dn */
 #define PCI_DN(dn)	((struct pci_dn *) (dn)->data)
 
+extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+					   int devfn);
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
-
-extern void * update_dn_pci_info(struct device_node *dn, void *data);
+extern struct pci_dn *add_dev_pci_info(struct pci_dev *pdev);
+extern void remove_dev_pci_info(struct pci_dev *pdev);
+extern void *update_dn_pci_info(struct device_node *dn, void *data);
 
 static inline int pci_device_from_OF_node(struct device_node *np,
 					  u8 *bus, u8 *devfn)
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 83df3075d3df..f3a1a81d112f 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -32,12 +32,223 @@
 #include <asm/ppc-pci.h>
 #include <asm/firmware.h>
 
+/*
+ * The function is used to find the firmware data of one
+ * specific PCI device, which is attached to the indicated
+ * PCI bus.  For VFs, the firmware data is linked to that
+ * of the PF's bridge.  For other devices, the firmware
+ * data is linked to that of their own upstream bridge.
+ */
+static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
+{
+	struct pci_bus *pbus;
+	struct device_node *dn;
+	struct pci_dn *pdn;
+
+	/*
+	 * We probably have a virtual bus, which doesn't
+	 * have an associated bridge.
+	 */
+	pbus = bus;
+	while (pbus) {
+		if (pci_is_root_bus(pbus) || pbus->self)
+			break;
+
+		pbus = pbus->parent;
+	}
+
+	/*
+	 * Except for virtual buses, all PCI buses should
+	 * have device nodes.
+	 */
+	dn = pci_bus_to_OF_node(pbus);
+	pdn = dn ? PCI_DN(dn) : NULL;
+
+	return pdn;
+}
+
+struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+				    int devfn)
+{
+	struct device_node *dn = NULL;
+	struct pci_dn *parent, *pdn;
+	struct pci_dev *pdev = NULL;
+
+	/* Fast path: fetch from PCI device */
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		if (pdev->devfn == devfn) {
+			if (pdev->dev.archdata.firmware_data)
+				return pdev->dev.archdata.firmware_data;
+
+			dn = pci_device_to_OF_node(pdev);
+			break;
+		}
+	}
+
+	/* Fast path: fetch from device node */
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/* Slow path: fetch from firmware data hierarchy */
+	parent = pci_bus_to_pdn(bus);
+	if (!parent)
+		return NULL;
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == bus->number &&
+		    pdn->devfn == devfn)
+			return pdn;
+	}
+
+	return NULL;
+}
+
 struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
 {
-	struct device_node *dn = pci_device_to_OF_node(pdev);
-	if (!dn)
+	struct device_node *dn;
+	struct pci_dn *parent, *pdn;
+
+	/* Search device directly */
+	if (pdev->dev.archdata.firmware_data)
+		return pdev->dev.archdata.firmware_data;
+
+	/* Check device node */
+	dn = pci_device_to_OF_node(pdev);
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/*
+	 * VFs don't have device nodes, so we hook their
+	 * firmware data to the PF's bridge.
+	 */
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
 		return NULL;
-	return PCI_DN(dn);
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == pdev->bus->number &&
+		    pdn->devfn == pdev->devfn)
+			return pdn;
+	}
+
+	return NULL;
+}
+
+#ifdef CONFIG_PCI_IOV
+static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
+					   struct pci_dev *pdev,
+					   int busno, int devfn)
+{
+	struct pci_dn *pdn;
+
+	/* Except PHB, we always have parent firmware data */
+	if (!parent)
+		return NULL;
+
+	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
+	if (!pdn) {
+		dev_warn(&pdev->dev, "%s: Out of memory!\n", __func__);
+		return NULL;
+	}
+
+	pdn->phb = parent->phb;
+	pdn->parent = parent;
+	pdn->busno = busno;
+	pdn->devfn = devfn;
+#ifdef CONFIG_PPC_POWERNV
+	pdn->pe_number = IODA_INVALID_PE;
+#endif
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	list_add_tail(&pdn->list, &parent->child_list);
+
+	/*
+	 * If we already have the PCI device instance,
+	 * bind it to the new pci_dn.
+	 */
+	if (pdev)
+		pdev->dev.archdata.firmware_data = pdn;
+
+	return pdn;
+}
+#endif
+
+struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent, *pdn;
+	int i;
+
+	/* Only support IOV for now */
+	if (!pdev->is_physfn)
+		return pci_get_pdn(pdev);
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return NULL;
+
+	pdn->flags |= PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return NULL;
+
+	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
+		pdn = add_one_dev_pci_info(parent, NULL,
+					   pci_iov_virtfn_bus(pdev, i),
+					   pci_iov_virtfn_devfn(pdev, i));
+		if (!pdn) {
+			dev_warn(&pdev->dev, "%s: Cannot create firmware data for VF#%d\n",
+				 __func__, i);
+			return NULL;
+		}
+	}
+#endif
+
+	return pci_get_pdn(pdev);
+}
+
+void remove_dev_pci_info(struct pci_dev *pdev)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent;
+	struct pci_dn *pdn, *tmp;
+	int i;
+
+	/* Only support IOV PF for now */
+	if (!pdev->is_physfn)
+		return;
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return;
+
+	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return;
+
+	/*
+	 * We might introduce a flag to pci_dn in the
+	 * future so that we can release VFs' firmware
+	 * data in batch mode.
+	 */
+	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
+		list_for_each_entry_safe(pdn, tmp,
+			&parent->child_list, list) {
+			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
+			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
+				continue;
+
+			if (!list_empty(&pdn->list))
+				list_del(&pdn->list);
+			kfree(pdn);
+		}
+	}
+#endif
 }
 
 /*
@@ -49,6 +260,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	struct pci_controller *phb = data;
 	const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
 	const __be32 *regs;
+	struct device_node *parent;
 	struct pci_dn *pdn;
 
 	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
@@ -70,6 +282,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	}
 
 	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
+
+	/* Attach to parent node */
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	parent = of_get_parent(dn);
+	pdn->parent = parent ? PCI_DN(parent) : NULL;
+	if (pdn->parent)
+		list_add_tail(&pdn->list, &pdn->parent->child_list);
+
 	return NULL;
 }
 
@@ -147,8 +368,11 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	/* PHB nodes themselves must not match */
 	update_dn_pci_info(dn, phb);
 	pdn = dn->data;
-	if (pdn)
+	if (pdn) {
 		pdn->devfn = pdn->busno = -1;
+		pdn->phb = phb;
+		phb->firmware_data = pdn;
+	}
 
 	/* Update dn->phb ptrs for new phb and children devices */
 	traverse_pci_devices(dn, update_dn_pci_info, phb);
@@ -171,3 +395,16 @@ void __init pci_devs_phb_init(void)
 	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
 		pci_devs_phb_init_dynamic(phb);
 }
+
+static void pci_dev_pdn_setup(struct pci_dev *pdev)
+{
+	struct pci_dn *pdn;
+
+	if (pdev->dev.archdata.firmware_data)
+		return;
+
+	/* Setup the fast path */
+	pdn = pci_get_pdn(pdev);
+	pdev->dev.archdata.firmware_data = pdn;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6c9ff2b95119..58c4fc4ab63c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -974,6 +974,22 @@ static void pnv_pci_ioda_setup_PEs(void)
 	}
 }
 
+#ifdef CONFIG_PCI_IOV
+int pcibios_sriov_disable(struct pci_dev *pdev)
+{
+	/* Release firmware data */
+	remove_dev_pci_info(pdev);
+	return 0;
+}
+
+int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	/* Allocate firmware data */
+	add_dev_pci_info(pdev);
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev)
 {
 	struct pci_dn *pdn = pci_get_pdn(pdev);


* [PATCH v12 13/21] powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (11 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 12/21] powerpc/pci: Refactor pci_dn Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically Bjorn Helgaas
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

The PCI config accessors previously relied on device_node.  Unfortunately,
VFs don't have a corresponding device_node, so change the accessors to use
pci_dn instead.

[bhelgaas: changelog]
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/platforms/powernv/eeh-powernv.c |   14 +++++
 arch/powerpc/platforms/powernv/pci.c         |   69 ++++++++++----------------
 arch/powerpc/platforms/powernv/pci.h         |    4 +-
 3 files changed, 40 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index e261869adc86..7a5021b95a14 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -430,21 +430,31 @@ static inline bool powernv_eeh_cfg_blocked(struct device_node *dn)
 static int powernv_eeh_read_config(struct device_node *dn,
 				   int where, int size, u32 *val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn)) {
 		*val = 0xFFFFFFFF;
 		return PCIBIOS_SET_FAILED;
 	}
 
-	return pnv_pci_cfg_read(dn, where, size, val);
+	return pnv_pci_cfg_read(pdn, where, size, val);
 }
 
 static int powernv_eeh_write_config(struct device_node *dn,
 				    int where, int size, u32 val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn))
 		return PCIBIOS_SET_FAILED;
 
-	return pnv_pci_cfg_write(dn, where, size, val);
+	return pnv_pci_cfg_write(pdn, where, size, val);
 }
 
 /**
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index e69142f4af08..6c20d6e70383 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -366,9 +366,9 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
 	spin_unlock_irqrestore(&phb->lock, flags);
 }
 
-static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
-				     struct device_node *dn)
+static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
 {
+	struct pnv_phb *phb = pdn->phb->private_data;
 	u8	fstate;
 	__be16	pcierr;
 	int	pe_no;
@@ -379,7 +379,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	 * setup that yet. So all ER errors should be mapped to
 	 * reserved PE.
 	 */
-	pe_no = PCI_DN(dn)->pe_number;
+	pe_no = pdn->pe_number;
 	if (pe_no == IODA_INVALID_PE) {
 		if (phb->type == PNV_PHB_P5IOC2)
 			pe_no = 0;
@@ -407,8 +407,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 
 	cfg_dbg(" -> EEH check, bdfn=%04x PE#%d fstate=%x\n",
-		(PCI_DN(dn)->busno << 8) | (PCI_DN(dn)->devfn),
-		pe_no, fstate);
+		(pdn->busno << 8) | (pdn->devfn), pe_no, fstate);
 
 	/* Clear the frozen state if applicable */
 	if (fstate == OPAL_EEH_STOPPED_MMIO_FREEZE ||
@@ -425,10 +424,9 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 }
 
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 	s64 rc;
@@ -462,10 +460,9 @@ int pnv_pci_cfg_read(struct device_node *dn,
 	return PCIBIOS_SUCCESSFUL;
 }
 
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 
@@ -489,18 +486,17 @@ int pnv_pci_cfg_write(struct device_node *dn,
 }
 
 #if CONFIG_EEH
-static bool pnv_pci_cfg_check(struct pci_controller *hose,
-			      struct device_node *dn)
+static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	struct eeh_dev *edev = NULL;
-	struct pnv_phb *phb = hose->private_data;
+	struct pnv_phb *phb = pdn->phb->private_data;
 
 	/* EEH not enabled ? */
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
 		return true;
 
 	/* PE reset or device removed ? */
-	edev = of_node_to_eeh_dev(dn);
+	edev = pdn->edev;
 	if (edev) {
 		if (edev->pe &&
 		    (edev->pe->state & EEH_PE_CFG_BLOCKED))
@@ -513,8 +509,7 @@ static bool pnv_pci_cfg_check(struct pci_controller *hose,
 	return true;
 }
 #else
-static inline pnv_pci_cfg_check(struct pci_controller *hose,
-				struct device_node *dn)
+static inline bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	return true;
 }
@@ -524,32 +519,26 @@ static int pnv_pci_read_config(struct pci_bus *bus,
 			       unsigned int devfn,
 			       int where, int size, u32 *val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
 	*val = 0xFFFFFFFF;
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_read(dn, where, size, val);
-	if (phb->flags & PNV_PHB_FLAG_EEH) {
+	ret = pnv_pci_cfg_read(pdn, where, size, val);
+	phb = pdn->phb->private_data;
+	if (phb->flags & PNV_PHB_FLAG_EEH && pdn->edev) {
 		if (*val == EEH_IO_ERROR_VALUE(size) &&
-		    eeh_dev_check_failure(of_node_to_eeh_dev(dn)))
+		    eeh_dev_check_failure(pdn->edev))
                         return PCIBIOS_DEVICE_NOT_FOUND;
 	} else {
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 	}
 
 	return ret;
@@ -559,27 +548,21 @@ static int pnv_pci_write_config(struct pci_bus *bus,
 				unsigned int devfn,
 				int where, int size, u32 val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_write(dn, where, size, val);
+	ret = pnv_pci_cfg_write(pdn, where, size, val);
+	phb = pdn->phb->private_data;
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 
 	return ret;
 }
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6c02ff8dd69f..e5b75b298d95 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -219,9 +219,9 @@ extern struct pnv_eeh_ops ioda_eeh_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val);
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val);
 extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (12 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 13/21] powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  8:46   ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe Bjorn Helgaas
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

The iommu_table of a PE is currently a static field embedded in struct
pnv_ioda_pe.  That is a problem when iommu_free_table() is called on it,
since the table was never dynamically allocated.

Allocate iommu_table dynamically.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/include/asm/iommu.h          |    3 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++++++++++++++------------
 arch/powerpc/platforms/powernv/pci.h      |    2 +-
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9cfa3706a1b8..5574eeb97634 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,9 @@ struct iommu_table {
 	struct iommu_group *it_group;
 #endif
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
+#ifdef CONFIG_PPC_POWERNV
+	void           *data;
+#endif
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 58c4fc4ab63c..cd1a56160ded 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -916,6 +916,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 		return;
 	}
 
+	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+			GFP_KERNEL, hose->node);
+	pe->tce32_table->data = pe;
+
 	/* Associate it with all child devices */
 	pnv_ioda_setup_same_PE(bus, pe);
 
@@ -1005,7 +1009,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1032,7 +1036,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, pe->tce32_table);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -1069,9 +1073,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       pe->tce32_table);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, pe->tce32_table);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1161,8 +1165,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 				 __be64 *startp, __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1228,7 +1231,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1266,8 +1269,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1312,10 +1314,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1363,7 +1365,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index e5b75b298d95..731777734bca 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -53,7 +53,7 @@ struct pnv_ioda_pe {
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct iommu_table	*tce32_table;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (13 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  8:52   ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 16/21] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Bjorn Helgaas
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

On PHB3, the PF IOV BAR will be covered by an M64 window to provide better
PE isolation.  The total_pe number is usually different from total_VFs,
which can lead to a conflict between the MMIO space and the PE number.

For example, if total_VFs is 128 and total_pe is 256, the second half of
the M64 window would cover MMIO space of other PCI devices, which may
already belong to other PEs.

Prevent the conflict by reserving additional space for the PF IOV BAR:
total_pe times the VF BAR size.

[bhelgaas: make dev_printk() output more consistent, index resource[]
conventionally]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/include/asm/machdep.h        |    4 ++
 arch/powerpc/include/asm/pci-bridge.h     |    3 ++
 arch/powerpc/kernel/pci-common.c          |    5 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   58 +++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index c8175a3fe560..965547c58497 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -250,6 +250,10 @@ struct machdep_calls {
 	/* Reset the secondary bus of bridge */
 	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
 
+#ifdef CONFIG_PCI_IOV
+	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Called to shutdown machine specific hardware not already controlled
 	 * by other drivers.
 	 */
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 513f8f27060d..de11de7d4547 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -175,6 +175,9 @@ struct pci_dn {
 #define IODA_INVALID_PE		(-1)
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
+#ifdef CONFIG_PCI_IOV
+	u16     max_vfs;		/* number of VFs IOV BAR expanded to */
+#endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
 	struct list_head list;
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 82031011522f..022e9feeb1f2 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1646,6 +1646,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
 	if (ppc_md.pcibios_fixup_phb)
 		ppc_md.pcibios_fixup_phb(hose);
 
+#ifdef CONFIG_PCI_IOV
+	if (ppc_md.pcibios_fixup_sriov)
+		ppc_md.pcibios_fixup_sriov(bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Configure PCI Express settings */
 	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
 		struct pci_bus *child;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index cd1a56160ded..36c533da5ccb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1749,6 +1749,61 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
 static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
 #endif /* CONFIG_PCI_MSI */
 
+#ifdef CONFIG_PCI_IOV
+static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
+{
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+	struct resource *res;
+	int i;
+	resource_size_t size;
+	struct pci_dn *pdn;
+
+	if (!pdev->is_physfn || pdev->is_added)
+		return;
+
+	hose = pci_bus_to_host(pdev->bus);
+	phb = hose->private_data;
+
+	pdn = pci_get_pdn(pdev);
+	pdn->max_vfs = 0;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &pdev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, "Skipping expanding VF BAR%d: %pR\n",
+				 i, res);
+			continue;
+		}
+
+		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
+		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
+		res->end = res->start + size * phb->ioda.total_pe - 1;
+		dev_dbg(&pdev->dev, "                       %pR\n", res);
+		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
+				i, res, phb->ioda.total_pe);
+	}
+	pdn->max_vfs = phb->ioda.total_pe;
+}
+
+static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
+{
+	struct pci_dev *pdev;
+	struct pci_bus *b;
+
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		b = pdev->subordinate;
+
+		if (b)
+			pnv_pci_ioda_fixup_sriov(b);
+
+		pnv_pci_ioda_fixup_iov_resources(pdev);
+	}
+}
+#endif /* CONFIG_PCI_IOV */
+
 /*
  * This function is supposed to be called on basis of PE from top
 * to bottom style. So the I/O or MMIO segment assigned to
@@ -2125,6 +2180,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
 	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
+#ifdef CONFIG_PCI_IOV
+	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+#endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
 	/* Reset IODA tables to a clean state */


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v12 16/21] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (14 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  8:34 ` [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset Bjorn Helgaas
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

Implement pcibios_iov_resource_alignment() on powernv platform.

On PowerNV platform, there are 3 cases for the IOV BAR:
1. initial state, the IOV BAR size is multiple times of VF BAR size
2. after expanded, the IOV BAR size is expanded to meet the M64 segment size
3. sizing stage, the IOV BAR is truncated to 0

pnv_pci_iov_resource_alignment() handle these three cases respectively.

[bhelgaas: adjust to drop "align" parameter, return pci_iov_resource_size()
if no ppc_md machdep_call version]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/include/asm/machdep.h        |    1 +
 arch/powerpc/kernel/pci-common.c          |   10 ++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |   20 ++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 965547c58497..045448f9e8b2 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -252,6 +252,7 @@ struct machdep_calls {
 
 #ifdef CONFIG_PCI_IOV
 	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *, int resno);
 #endif /* CONFIG_PCI_IOV */
 
 	/* Called to shutdown machine specific hardware not already controlled
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 022e9feeb1f2..2f1ad9ef4402 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -130,6 +130,16 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
 	pci_reset_secondary_bus(dev);
 }
 
+#ifdef CONFIG_PCI_IOV
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev, int resno)
+{
+	if (ppc_md.pcibios_iov_resource_alignment)
+		return ppc_md.pcibios_iov_resource_alignment(pdev, resno);
+
+	return pci_iov_resource_size(pdev, resno);
+}
+#endif /* CONFIG_PCI_IOV */
+
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 {
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 36c533da5ccb..6a86690bb8de 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1980,6 +1980,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }
 
+#ifdef CONFIG_PCI_IOV
+static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
+						      int resno)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	resource_size_t align, iov_align;
+
+	iov_align = resource_size(&pdev->resource[resno]);
+	if (iov_align)
+		return iov_align;
+
+	align = pci_iov_resource_size(pdev, resno);
+	if (pdn->max_vfs)
+		return pdn->max_vfs * align;
+
+	return align;
+}
+#endif /* CONFIG_PCI_IOV */
+
 /* Prevent enabling devices for which we couldn't properly
  * assign a PE
  */
@@ -2182,6 +2201,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
 #ifdef CONFIG_PCI_IOV
 	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
 #endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (15 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 16/21] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Bjorn Helgaas
@ 2015-02-24  8:34 ` Bjorn Helgaas
  2015-02-24  9:00   ` Bjorn Helgaas
  2015-02-24  9:03   ` Bjorn Helgaas
  2015-02-24  8:35 ` [PATCH v12 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported Bjorn Helgaas
                   ` (3 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:34 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

On the PowerNV platform, a resource's position in the M64 window determines
the PE# the resource belongs to.  In some cases, a resource must be adjusted
to locate it at the correct position in M64.

Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
according to an offset.

[bhelgaas: rework loops, rework overlap check, index resource[]
conventionally, remove pci_regs.h include, squashed with next patch]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    4 
 arch/powerpc/kernel/pci_dn.c              |   11 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  520 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c      |   18 +
 arch/powerpc/platforms/powernv/pci.h      |    7 
 5 files changed, 543 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index de11de7d4547..011340df8583 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -177,6 +177,10 @@ struct pci_dn {
 	int	pe_number;
 #ifdef CONFIG_PCI_IOV
 	u16     max_vfs;		/* number of VFs IOV BAR expanded to */
+	u16     vf_pes;			/* VF PE# under this PF */
+	int     offset;			/* PE# for the first VF PE */
+#define IODA_INVALID_M64        (-1)
+	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index f3a1a81d112f..5faf7ca45434 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -217,6 +217,17 @@ void remove_dev_pci_info(struct pci_dev *pdev)
 	struct pci_dn *pdn, *tmp;
 	int i;
 
+	/*
+	 * VF and VF PE are created/released dynamically, so we need to
+	 * bind/unbind them.  Otherwise the VF and VF PE would be mismatched
+	 * when re-enabling SR-IOV.
+	 */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		pdn->pe_number = IODA_INVALID_PE;
+		return;
+	}
+
 	/* Only support IOV PF for now */
 	if (!pdev->is_physfn)
 		return;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6a86690bb8de..a3c2fbe35fc8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -44,6 +44,9 @@
 #include "powernv.h"
 #include "pci.h"
 
+/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -56,11 +59,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV)
 		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
-	else
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
 		sprintf(pfix, "%04x:%02x     ",
 			pci_domain_nr(pe->pbus), pe->pbus->number);
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		sprintf(pfix, "%04x:%02x:%2x.%d",
+			pci_domain_nr(pe->parent_dev->bus),
+			(pe->rid & 0xff00) >> 8,
+			PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
+#endif /* CONFIG_PCI_IOV*/
 
 	printk("%spci %s: [PE# %.3d] %pV",
 	       level, pfix, pe->pe_number, &vaf);
@@ -591,7 +601,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 			      bool is_add)
 {
 	struct pnv_ioda_pe *slave;
-	struct pci_dev *pdev;
+	struct pci_dev *pdev = NULL;
 	int ret;
 
 	/*
@@ -630,8 +640,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 
 	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
 		pdev = pe->pbus->self;
-	else
+	else if (pe->flags & PNV_IODA_PE_DEV)
 		pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
 	while (pdev) {
 		struct pci_dn *pdn = pci_get_pdn(pdev);
 		struct pnv_ioda_pe *parent;
@@ -649,6 +663,87 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 	return 0;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	int64_t rc;
+	long rid_end, rid;
+
+	/* Currently, we just deconfigure VF PEs. Bus PEs will always be there. */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;         break;
+		case  2: bcomp = OpalPciBus7Bits;       break;
+		case  4: bcomp = OpalPciBus6Bits;       break;
+		case  8: bcomp = OpalPciBus5Bits;       break;
+		case 16: bcomp = OpalPciBus4Bits;       break;
+		case 32: bcomp = OpalPciBus3Bits;       break;
+		default:
+			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
+			        count);
+			/* Do an exact match only */
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear the reverse map */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = 0;
+
+	/* Release from all parents PELT-V */
+	while (parent) {
+		struct pci_dn *pdn = pci_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+
+	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Disassociate PE in PELT */
+	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
+				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
+
+	pe->pbus = NULL;
+	pe->pdev = NULL;
+	pe->parent_dev = NULL;
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -675,15 +770,19 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 		case 16: bcomp = OpalPciBus4Bits;	break;
 		case 32: bcomp = OpalPciBus3Bits;	break;
 		default:
-			pr_err("%s: Number of subordinate busses %d"
-			       " unsupported\n",
-			       pci_name(pe->pbus->self), count);
+			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
+			        count);
 			/* Do an exact match only */
 			bcomp = OpalPciBusAll;
 		}
 		rid_end = pe->rid + (count << 8);
 	} else {
-		parent = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif /* CONFIG_PCI_IOV */
+			parent = pe->pdev->bus->self;
 		bcomp = OpalPciBusAll;
 		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
 		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
@@ -774,6 +873,74 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 	return 10;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
+{
+	struct pci_dn *pdn = pci_get_pdn(dev);
+	int i;
+	struct resource *res, res2;
+	resource_size_t size;
+	u16 vf_num;
+
+	if (!dev->is_physfn)
+		return -EINVAL;
+
+	/*
+	 * "offset" is in VFs.  The M64 windows are sized so that when they
+	 * are segmented, each segment is the same size as the IOV BAR.
+	 * Each segment is in a separate PE, and the high order bits of the
+	 * address are the PE number.  Therefore, each VF's BAR is in a
+	 * separate PE, and changing the IOV BAR start address changes the
+	 * range of PEs the VFs are in.
+	 */
+	vf_num = pdn->vf_pes;
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		/*
+		 * The actual IOV BAR range is determined by the start address
+		 * and the actual size for vf_num VFs BAR.  This check is to
+		 * make sure that after shifting, the range will not overlap
+		 * with another device.
+		 */
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
+		res2.flags = res->flags;
+		res2.start = res->start + (size * offset);
+		res2.end = res2.start + (size * vf_num) - 1;
+
+		if (res2.end > res->end) {
+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
+				i, &res2, res, vf_num, offset);
+			return -EBUSY;
+		}
+	}
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
+		res2 = *res;
+		res->start += size * offset;
+
+		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
+			 i, &res2, res, vf_num, offset);
+		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
+	}
+	pdn->max_vfs -= offset;
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 #if 0
 static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 {
@@ -979,8 +1146,312 @@ static void pnv_pci_ioda_setup_PEs(void)
 }
 
 #ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    i;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		if (pdn->m64_wins[i] == IODA_INVALID_M64)
+			continue;
+		opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
+		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+	}
+
+	return 0;
+}
+
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	unsigned int           win;
+	struct resource       *res;
+	int                    i;
+	int64_t                rc;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	/* Initialize the m64_wins to IODA_INVALID_M64 */
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &pdev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		do {
+			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+					phb->ioda.m64_bar_idx + 1, 0);
+
+			if (win >= phb->ioda.m64_bar_idx + 1)
+				goto m64_failed;
+		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+		pdn->m64_wins[i] = win;
+
+		/* Map the M64 here */
+		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+						 OPAL_M64_WINDOW_TYPE,
+						 pdn->m64_wins[i],
+						 res->start,
+						 0, /* unused */
+						 resource_size(res));
+		if (rc != OPAL_SUCCESS) {
+			dev_err(&pdev->dev, "Failed to map M64 window #%d: %lld\n",
+				win, rc);
+			goto m64_failed;
+		}
+
+		rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
+		if (rc != OPAL_SUCCESS) {
+			dev_err(&pdev->dev, "Failed to enable M64 window #%d: %llx\n",
+				win, rc);
+			goto m64_failed;
+		}
+	}
+	return 0;
+
+m64_failed:
+	pnv_pci_vf_release_m64(pdev);
+	return -EBUSY;
+}
+
+static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct iommu_table    *tbl;
+	unsigned long         addr;
+	int64_t               rc;
+
+	bus = dev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	tbl = pe->tce32_table;
+	addr = tbl->it_base;
+
+	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+				   pe->pe_number << 1, 1, __pa(addr),
+				   0, 0x1000);
+
+	rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+				        pe->pe_number,
+				        (pe->pe_number << 1) + 1,
+				        pe->tce_bypass_base,
+				        0);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
+
+	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	free_pages(addr, get_order(TCE32_TABLE_SIZE));
+	pe->tce32_table = NULL;
+}
+
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe, *pe_n;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+
+	if (!pdev->is_physfn)
+		return;
+
+	pdn = pci_get_pdn(pdev);
+	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
+		if (pe->parent_dev != pdev)
+			continue;
+
+		pnv_pci_ioda2_release_dma_pe(pdev, pe);
+
+		/* Remove from list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_del(&pe->list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_ioda_deconfigure_pe(phb, pe);
+
+		pnv_ioda_free_pe(phb, pe->pe_number);
+	}
+}
+
+void pnv_pci_sriov_disable(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	struct pci_sriov      *iov;
+	u16 vf_num;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+	iov = pdev->sriov;
+	vf_num = pdn->vf_pes;
+
+	/* Release VF PEs */
+	pnv_ioda_release_vf_PE(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+
+		/* Release M64 windows */
+		pnv_pci_vf_release_m64(pdev);
+
+		/* Release PE numbers */
+		bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->offset = 0;
+	}
+}
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+				       struct pnv_ioda_pe *pe);
+static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe;
+	int                    pe_num;
+	u16                    vf_index;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (!pdev->is_physfn)
+		return;
+
+	/* Reserve PE for each VF */
+	for (vf_index = 0; vf_index < vf_num; vf_index++) {
+		pe_num = pdn->offset + vf_index;
+
+		pe = &phb->ioda.pe_array[pe_num];
+		pe->pe_number = pe_num;
+		pe->phb = phb;
+		pe->flags = PNV_IODA_PE_VF;
+		pe->pbus = NULL;
+		pe->parent_dev = pdev;
+		pe->tce32_seg = -1;
+		pe->mve_number = -1;
+		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
+			   pci_iov_virtfn_devfn(pdev, vf_index);
+
+		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%d\n",
+			hose->global_number, pdev->bus->number,
+			PCI_SLOT(pci_iov_virtfn_devfn(pdev, vf_index)),
+			PCI_FUNC(pci_iov_virtfn_devfn(pdev, vf_index)), pe_num);
+
+		if (pnv_ioda_configure_pe(phb, pe)) {
+			/* XXX What do we do here ? */
+			if (pe_num)
+				pnv_ioda_free_pe(phb, pe_num);
+			pe->pdev = NULL;
+			continue;
+		}
+
+		pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+				GFP_KERNEL, hose->node);
+		pe->tce32_table->data = pe;
+
+		/* Put PE to the list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_add_tail(&pe->list, &phb->ioda.pe_list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+	}
+}
+
+int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    ret;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		/* Calculate available PE for required VFs */
+		mutex_lock(&phb->ioda.pe_alloc_mutex);
+		pdn->offset = bitmap_find_next_zero_area(
+			phb->ioda.pe_alloc, phb->ioda.total_pe,
+			0, vf_num, 0);
+		if (pdn->offset >= phb->ioda.total_pe) {
+			mutex_unlock(&phb->ioda.pe_alloc_mutex);
+			dev_info(&pdev->dev, "Failed to enable VF%d\n", vf_num);
+			pdn->offset = 0;
+			return -EBUSY;
+		}
+		bitmap_set(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->vf_pes = vf_num;
+		mutex_unlock(&phb->ioda.pe_alloc_mutex);
+
+		/* Assign M64 window accordingly */
+		ret = pnv_pci_vf_assign_m64(pdev);
+		if (ret) {
+			dev_info(&pdev->dev, "Not enough M64 window resources\n");
+			goto m64_failed;
+		}
+
+		/* Do some magic shift */
+		ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
+		if (ret)
+			goto m64_failed;
+	}
+
+	/* Setup VF PEs */
+	pnv_ioda_setup_vf_PE(pdev, vf_num);
+
+	return 0;
+
+m64_failed:
+	bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+	pdn->offset = 0;
+
+	return ret;
+}
+
 int pcibios_sriov_disable(struct pci_dev *pdev)
 {
+	pnv_pci_sriov_disable(pdev);
+
 	/* Release firmware data */
 	remove_dev_pci_info(pdev);
 	return 0;
@@ -990,6 +1461,8 @@ int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 {
 	/* Allocate firmware data */
 	add_dev_pci_info(pdev);
+
+	pnv_pci_sriov_enable(pdev, vf_num);
 	return 0;
 }
 #endif /* CONFIG_PCI_IOV */
@@ -1186,9 +1659,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	int64_t rc;
 	void *addr;
 
-	/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
-#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
-
 	/* XXX FIXME: Handle 64-bit only DMA devices */
 	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
 	/* XXX FIXME: Allocate multi-level tables on PHB3 */
@@ -1251,12 +1721,19 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	return;
  fail:
@@ -1383,12 +1860,19 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	/* Also create a bypass window */
 	if (!pnv_iommu_bypass_disabled)
@@ -2083,6 +2567,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	phb->hub_id = hub_id;
 	phb->opal_id = phb_id;
 	phb->type = ioda_type;
+	mutex_init(&phb->ioda.pe_alloc_mutex);
 
 	/* Detect specific models for error handling */
 	if (of_device_is_compatible(np, "ibm,p7ioc-pciex"))
@@ -2142,6 +2627,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
 	INIT_LIST_HEAD(&phb->ioda.pe_list);
+	mutex_init(&phb->ioda.pe_list_mutex);
 
 	/* Calculate how many 32-bit TCE segments we have */
 	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 6c20d6e70383..a88f915fc603 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -714,6 +714,24 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 {
 	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
 	struct pnv_phb *phb = hose->private_data;
+#ifdef CONFIG_PCI_IOV
+	struct pnv_ioda_pe *pe;
+	struct pci_dn *pdn;
+
+	/* Fix the VF pdn PE number */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		WARN_ON(pdn->pe_number != IODA_INVALID_PE);
+		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
+			if (pe->rid == ((pdev->bus->number << 8) |
+			    (pdev->devfn & 0xff))) {
+				pdn->pe_number = pe->pe_number;
+				pe->pdev = pdev;
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_PCI_IOV */
 
 	/* If we have no phb structure, try to setup a fallback based on
 	 * the device-tree (RTAS PCI for example)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 731777734bca..39d42f2b7a15 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -23,6 +23,7 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
 #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
 #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
+#define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 		*/
 
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
@@ -34,6 +35,9 @@ struct pnv_ioda_pe {
 	 * entire bus (& children). In the former case, pdev
 	 * is populated, in the later case, pbus is.
 	 */
+#ifdef CONFIG_PCI_IOV
+	struct pci_dev          *parent_dev;
+#endif
 	struct pci_dev		*pdev;
 	struct pci_bus		*pbus;
 
@@ -165,6 +169,8 @@ struct pnv_phb {
 
 			/* PE allocation bitmap */
 			unsigned long		*pe_alloc;
+			/* PE allocation mutex */
+			struct mutex		pe_alloc_mutex;
 
 			/* M32 & IO segment maps */
 			unsigned int		*m32_segmap;
@@ -179,6 +185,7 @@ struct pnv_phb {
 			 * on the sequence of creation
 			 */
 			struct list_head	pe_list;
+			struct mutex            pe_list_mutex;
 
 			/* Reverse map of PEs, will have to extend if
 			 * we are to support more than 256 PEs, indexed


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v12 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (16 preceding siblings ...)
  2015-02-24  8:34 ` [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset Bjorn Helgaas
@ 2015-02-24  8:35 ` Bjorn Helgaas
  2015-02-24  9:06   ` Bjorn Helgaas
  2015-02-24  8:35 ` [PATCH v12 19/21] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Bjorn Helgaas
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:35 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

The M64 aperture size is limited on PHB3.  When the IOV BAR is too big, it
exceeds this limit and fails to be assigned.

Introduce a different mechanism based on the IOV BAR size:

  - if the IOV BAR size is smaller than 64MB, expand it to total_pe VFs
  - if the IOV BAR size is bigger than 64MB, expand it to total_vfs rounded
    up to a power of two
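The sizing rule above can be sketched as follows.  This is a standalone
illustration with made-up names, not the kernel code; `roundup_pow_of_two_u64()`
stands in for the kernel's __roundup_pow_of_two():

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper; the kernel uses __roundup_pow_of_two(). */
static uint64_t roundup_pow_of_two_u64(uint64_t x)
{
	uint64_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/*
 * Sketch of the expansion rule: the multiplier applied to the
 * single-VF BAR size when reserving space in the PF's IOV BAR.
 */
static uint64_t iov_bar_multiplier(uint64_t vf_bar_size, int total_pe,
				   int total_vfs)
{
	if (vf_bar_size > (1ULL << 26))		/* bigger than 64MB */
		return roundup_pow_of_two_u64(total_vfs);
	return total_pe;			/* one segment per possible PE */
}
```

With a 256-PE PHB, a 1MB per-VF BAR is expanded 256x, while a 128MB per-VF
BAR on a device with 100 total VFs is only expanded 128x.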

[bhelgaas: make dev_printk() output more consistent, use PCI_SRIOV_NUM_BARS]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 011340df8583..d824bb184ab8 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -179,6 +179,8 @@ struct pci_dn {
 	u16     max_vfs;		/* number of VFs IOV BAR expended */
 	u16     vf_pes;			/* VF PE# under this PF */
 	int     offset;			/* PE# for the first VF PE */
+#define M64_PER_IOV 4
+	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
 	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a3c2fbe35fc8..30b7c3909746 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2242,6 +2242,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	int i;
 	resource_size_t size;
 	struct pci_dn *pdn;
+	int mul, total_vfs;
 
 	if (!pdev->is_physfn || pdev->is_added)
 		return;
@@ -2252,6 +2253,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	pdn = pci_get_pdn(pdev);
 	pdn->max_vfs = 0;
 
+	total_vfs = pci_sriov_get_totalvfs(pdev);
+	pdn->m64_per_iov = 1;
+	mul = phb->ioda.total_pe;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &pdev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, " non M64 VF BAR%d: %pR\n",
+				 i, res);
+			continue;
+		}
+
+		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
+
+		/* bigger than 64M */
+		if (size > (1 << 26)) {
+			dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size is bigger than 64M, roundup power2\n",
+				 i, res);
+			pdn->m64_per_iov = M64_PER_IOV;
+			mul = __roundup_pow_of_two(total_vfs);
+			break;
+		}
+	}
+
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &pdev->resource[i + PCI_IOV_RESOURCES];
 		if (!res->flags || res->parent)
@@ -2264,12 +2291,12 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
 		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
 		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
-		res->end = res->start + size * phb->ioda.total_pe - 1;
+		res->end = res->start + size * mul - 1;
 		dev_dbg(&pdev->dev, "                       %pR\n", res);
 		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
-				i, res, phb->ioda.total_pe);
+			 i, res, mul);
 	}
-	pdn->max_vfs = phb->ioda.total_pe;
+	pdn->max_vfs = mul;
 }
 
 static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)



* [PATCH v12 19/21] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (17 preceding siblings ...)
  2015-02-24  8:35 ` [PATCH v12 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported Bjorn Helgaas
@ 2015-02-24  8:35 ` Bjorn Helgaas
  2015-02-24  8:35 ` [PATCH v12 20/21] powerpc/pci: Remove unused struct pci_dn.pcidev field Bjorn Helgaas
  2015-02-24  8:35 ` [PATCH v12 21/21] powerpc/pci: Add PCI resource alignment documentation Bjorn Helgaas
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:35 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

When the IOV BAR is big, each one is covered by 4 M64 windows.  This leads
to several VF PEs sitting in one PE in terms of the M64 mapping.

Group VF PEs according to the M64 allocation.
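The grouping arithmetic can be sketched as a standalone fragment (names are
illustrative; the kernel logic lives in pnv_pci_vf_assign_m64()):

```c
#include <assert.h>

#define M64_PER_IOV 4	/* M64 windows per IOV BAR, from the patch */

static int roundup_pow_of_two_int(int x)
{
	int p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/*
 * With at most M64_PER_IOV VFs, each VF gets its own window; with
 * more, the VFs are split into M64_PER_IOV equal groups, each group
 * sharing one window (and thus, in M64 terms, one PE).
 */
static void vf_grouping(int vf_num, int *vf_groups, int *vf_per_group)
{
	if (vf_num <= M64_PER_IOV) {
		*vf_groups = vf_num;
		*vf_per_group = 1;
	} else {
		*vf_groups = M64_PER_IOV;
		*vf_per_group = roundup_pow_of_two_int(vf_num) / M64_PER_IOV;
	}
}
```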

[bhelgaas: use dev_printk() when possible]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 
 arch/powerpc/platforms/powernv/pci-ioda.c |  197 +++++++++++++++++++++++------
 2 files changed, 154 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d824bb184ab8..958ea8675691 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -182,7 +182,7 @@ struct pci_dn {
 #define M64_PER_IOV 4
 	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
-	int     m64_wins[PCI_SRIOV_NUM_BARS];
+	int     m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 30b7c3909746..b265d5da601b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1152,26 +1152,27 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pci_dn         *pdn;
-	int                    i;
+	int                    i, j;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
 
-	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		if (pdn->m64_wins[i] == IODA_INVALID_M64)
-			continue;
-		opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
-		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
-		pdn->m64_wins[i] = IODA_INVALID_M64;
-	}
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		for (j = 0; j < M64_PER_IOV; j++) {
+			if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+				continue;
+			opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
+			clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+		}
 
 	return 0;
 }
 
-static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
@@ -1179,17 +1180,33 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 	struct pci_dn         *pdn;
 	unsigned int           win;
 	struct resource       *res;
-	int                    i;
+	int                    i, j;
 	int64_t                rc;
+	int                    total_vfs;
+	resource_size_t        size, start;
+	int                    pe_num;
+	int                    vf_groups;
+	int                    vf_per_group;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
+	total_vfs = pci_sriov_get_totalvfs(pdev);
 
 	/* Initialize the m64_wins to IODA_INVALID_M64 */
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-		pdn->m64_wins[i] = IODA_INVALID_M64;
+		for (j = 0; j < M64_PER_IOV; j++)
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+
+	if (pdn->m64_per_iov == M64_PER_IOV) {
+		vf_groups = (vf_num <= M64_PER_IOV) ? vf_num: M64_PER_IOV;
+		vf_per_group = (vf_num <= M64_PER_IOV)? 1:
+			__roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+	} else {
+		vf_groups = 1;
+		vf_per_group = 1;
+	}
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &pdev->resource[i + PCI_IOV_RESOURCES];
@@ -1199,35 +1216,61 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 		if (!pnv_pci_is_mem_pref_64(res->flags))
 			continue;
 
-		do {
-			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-					phb->ioda.m64_bar_idx + 1, 0);
-
-			if (win >= phb->ioda.m64_bar_idx + 1)
-				goto m64_failed;
-		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+		for (j = 0; j < vf_groups; j++) {
+			do {
+				win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+						phb->ioda.m64_bar_idx + 1, 0);
+
+				if (win >= phb->ioda.m64_bar_idx + 1)
+					goto m64_failed;
+			} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+			pdn->m64_wins[i][j] = win;
+
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				size = pci_iov_resource_size(pdev,
+							PCI_IOV_RESOURCES + i);
+				size = size * vf_per_group;
+				start = res->start + size * j;
+			} else {
+				size = resource_size(res);
+				start = res->start;
+			}
 
-		pdn->m64_wins[i] = win;
+			/* Map the M64 here */
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				pe_num = pdn->offset + j;
+				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+						pe_num, OPAL_M64_WINDOW_TYPE,
+						pdn->m64_wins[i][j], 0);
+			}
 
-		/* Map the M64 here */
-		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+			rc = opal_pci_set_phb_mem_window(phb->opal_id,
 						 OPAL_M64_WINDOW_TYPE,
-						 pdn->m64_wins[i],
-						 res->start,
+						 pdn->m64_wins[i][j],
+						 start,
 						 0, /* unused */
-						 resource_size(res));
-		if (rc != OPAL_SUCCESS) {
-			dev_err(&pdev->dev, "Failed to map M64 window #%d: %lld\n",
-				win, rc);
-			goto m64_failed;
-		}
+						 size);
 
-		rc = opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
-		if (rc != OPAL_SUCCESS) {
-			dev_err(&pdev->dev, "Failed to enable M64 window #%d: %llx\n",
-				win, rc);
-			goto m64_failed;
+
+			if (rc != OPAL_SUCCESS) {
+				dev_err(&pdev->dev, "Failed to map M64 window #%d: %lld\n",
+					win, rc);
+				goto m64_failed;
+			}
+
+			if (pdn->m64_per_iov == M64_PER_IOV)
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 2);
+			else
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 1);
+
+			if (rc != OPAL_SUCCESS) {
+				dev_err(&pdev->dev, "Failed to enable M64 window #%d: %llx\n",
+					win, rc);
+				goto m64_failed;
+			}
 		}
 	}
 	return 0;
@@ -1269,22 +1312,53 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 	pe->tce32_table = NULL;
 }
 
-static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pnv_ioda_pe    *pe, *pe_n;
 	struct pci_dn         *pdn;
+	u16                    vf_index;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
 
 	if (!pdev->is_physfn)
 		return;
 
-	pdn = pci_get_pdn(pdev);
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++){
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_REMOVE_PE_FROM_DOMAIN);
+
+					if (rc)
+					    dev_warn(&pdev->dev, "%s: Failed to unlink same group PE#%d(%lld)\n",
+						__func__,
+						pdn->offset + vf_index1, rc);
+				}
+	}
+
 	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
 		if (pe->parent_dev != pdev)
 			continue;
@@ -1319,10 +1393,11 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev)
 	vf_num = pdn->vf_pes;
 
 	/* Release VF PEs */
-	pnv_ioda_release_vf_PE(pdev);
+	pnv_ioda_release_vf_PE(pdev, vf_num);
 
 	if (phb->type == PNV_PHB_IODA2) {
-		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, -pdn->offset);
 
 		/* Release M64 windows */
 		pnv_pci_vf_release_m64(pdev);
@@ -1344,6 +1419,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 	int                    pe_num;
 	u16                    vf_index;
 	struct pci_dn         *pdn;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
@@ -1392,6 +1468,37 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
 	}
+
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++) {
+			for (vf_index = vf_group * vf_per_group;
+			     vf_index < (vf_group + 1) * vf_per_group &&
+			     vf_index < vf_num;
+			     vf_index++) {
+				for (vf_index1 = vf_group * vf_per_group;
+				     vf_index1 < (vf_group + 1) * vf_per_group &&
+				     vf_index1 < vf_num;
+				     vf_index1++) {
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_ADD_PE_TO_DOMAIN);
+
+					if (rc)
+					    dev_warn(&pdev->dev, "%s: Failed to link same group PE#%d(%lld)\n",
+						__func__,
+						pdn->offset + vf_index1, rc);
+				}
+			}
+		}
+	}
 }
 
 int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
@@ -1424,16 +1531,18 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_alloc_mutex);
 
 		/* Assign M64 window accordingly */
-		ret = pnv_pci_vf_assign_m64(pdev);
+		ret = pnv_pci_vf_assign_m64(pdev, vf_num);
 		if (ret) {
 			dev_info(&pdev->dev, "Not enough M64 window resources\n");
 			goto m64_failed;
 		}
 
 		/* Do some magic shift */
-		ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
-		if (ret)
-			goto m64_failed;
+		if (pdn->m64_per_iov == 1) {
+			ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
+			if (ret)
+				goto m64_failed;
+		}
 	}
 
 	/* Setup VF PEs */



* [PATCH v12 20/21] powerpc/pci: Remove unused struct pci_dn.pcidev field
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (18 preceding siblings ...)
  2015-02-24  8:35 ` [PATCH v12 19/21] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Bjorn Helgaas
@ 2015-02-24  8:35 ` Bjorn Helgaas
  2015-02-24  8:35 ` [PATCH v12 21/21] powerpc/pci: Add PCI resource alignment documentation Bjorn Helgaas
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:35 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

In struct pci_dn, the pcidev field is assigned but not used, so remove it.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    1 -
 arch/powerpc/platforms/powernv/pci-ioda.c |    1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 958ea8675691..109efbaf384d 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -168,7 +168,6 @@ struct pci_dn {
 
 	int	pci_ext_config_space;	/* for pci devices */
 
-	struct	pci_dev *pcidev;	/* back-pointer to the pci device */
 #ifdef CONFIG_EEH
 	struct eeh_dev *edev;		/* eeh device */
 #endif
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index b265d5da601b..58d4ca01bfd9 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1024,7 +1024,6 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 				pci_name(dev));
 			continue;
 		}
-		pdn->pcidev = dev;
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)



* [PATCH v12 21/21] powerpc/pci: Add PCI resource alignment documentation
  2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
                   ` (19 preceding siblings ...)
  2015-02-24  8:35 ` [PATCH v12 20/21] powerpc/pci: Remove unused struct pci_dn.pcidev field Bjorn Helgaas
@ 2015-02-24  8:35 ` Bjorn Helgaas
  20 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:35 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Wei Yang <weiyang@linux.vnet.ibm.com>

In order to enable SRIOV on the PowerNV platform, the PF's IOV BAR needs to
be adjusted:

    1. size expanded
    2. aligned to M64BT size

This patch documents the reason for this change and how it is done.
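The two adjustments can be sketched as follows (hypothetical helper names,
assuming each VF BAR must start on its own M64 segment boundary):

```c
#include <assert.h>
#include <stdint.h>

/* 1. Size expansion: reserve one per-VF slot for each M64 segment. */
static uint64_t iov_bar_expanded_size(uint64_t single_vf_size,
				      int num_segments)
{
	return single_vf_size * num_segments;
}

/*
 * 2. Alignment: the default pcibios_iov_resource_alignment() in this
 * series returns the single-VF size, so each VF BAR lands on its own
 * M64 segment boundary.
 */
static uint64_t iov_bar_alignment(uint64_t single_vf_size)
{
	return single_vf_size;
}
```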

[bhelgaas: reformat, clarify, expand]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 .../powerpc/pci_iov_resource_on_powernv.txt        |  305 ++++++++++++++++++++
 1 file changed, 305 insertions(+)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 000000000000..4e9bb2812238
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,305 @@
+Wei Yang <weiyang@linux.vnet.ibm.com>
+Benjamin Herrenschmidt <benh@au1.ibm.com>
+26 Aug 2014
+
+This document describes the hardware requirements for PCI MMIO resource
+sizing and assignment on the PowerNV platform, and how the generic PCI
+code handles them.  The first two sections describe the concepts of
+Partitionable Endpoints and the implementation on P8 (IODA2).
+
+1. Introduction to Partitionable Endpoints
+
+A Partitionable Endpoint (PE) is a way to group the various resources
+associated with a device or a set of devices to provide isolation between
+partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
+to freeze a device that is causing errors in order to limit the possibility
+of propagation of bad data.
+
+There is thus, in HW, a table of PE states that contains a pair of "frozen"
+state bits (one for MMIO and one for DMA, they get set together but can be
+cleared independently) for each PE.
+
+When a PE is frozen, all stores in any direction are dropped and all loads
+return all 1's value.  MSIs are also blocked.  There's a bit more state
+that captures things like the details of the error that caused the freeze
+etc., but that's not critical.
+
+The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
+are matched to their corresponding PEs.
+
+The following section provides a rough description of what we have on P8
+(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each
+PHB is a completely separate HW entity that replicates the entire logic,
+so has its own set of PEs, etc.
+
+2. Implementation of Partitionable Endpoints on P8 (IODA2)
+
+P8 supports up to 256 Partitionable Endpoints per PHB.
+
+  * Inbound
+
+    For DMA, MSIs and inbound PCIe error messages, we have a table (in
+    memory but accessed in HW by the chip) that provides a direct
+    correspondence between a PCIe RID (bus/dev/fn) with a PE number.
+    We call this the RTT.
+
+    - For DMA we then provide an entire address space for each PE that can
+      contain two "windows", depending on the value of PCI address bit 59.
+      Each window can be configured to be remapped via a "TCE table" (IOMMU
+      translation table), which has various configurable characteristics
+      not described here.
+
+    - For MSIs, we have two windows in the address space (one at the top of
+      the 32-bit space and one much higher) which, via a combination of the
+      address and MSI value, will result in one of the 2048 interrupts per
+      bridge being triggered.  There's a PE# in the interrupt controller
+      descriptor table as well which is compared with the PE# obtained from
+      the RTT to "authorize" the device to emit that specific interrupt.
+
+    - Error messages just use the RTT.
+
+  * Outbound.  That's where the tricky part is.
+
+    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
+    from the CPU address space to the PCI address space.  There is one M32
+    window and sixteen M64 windows.  They have different characteristics.
+    First what they have in common: they forward a configurable portion of
+    the CPU address space to the PCIe bus and must be a naturally aligned
+    power of two in size.  The rest is different:
+
+    - The M32 window:
+
+      * Is limited to 4GB in size.
+
+      * Drops the top bits of the address (above the size) and replaces
+	them with a configurable value.  This is typically used to generate
+	32-bit PCIe accesses.  We configure that window at boot from FW and
+	don't touch it from Linux; it's usually set to forward a 2GB
+	portion of address space from the CPU to PCIe
+	0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
+	reserved for MSIs but this is not a problem at this point; we just
+	need to ensure Linux doesn't assign anything there, the M32 logic
+	ignores that however and will forward in that space if we try).
+
+      * It is divided into 256 segments of equal size.  A table in the chip
+	maps each segment to a PE#.  That allows portions of the MMIO space
+	to be assigned to PEs on a segment granularity.  For a 2GB window,
+	the segment granularity is 2GB/256 = 8MB.
+
+    Now, this is the "main" window we use in Linux today (excluding
+    SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
+    onto a segment alignment/granularity so that the space behind a bridge
+    can be assigned to a PE.
+
+    Ideally we would like to be able to have individual functions in PEs
+    but that would mean using a completely different address allocation
+    scheme where individual function BARs can be "grouped" to fit in one or
+    more segments.
+
+    - The M64 windows:
+
+      * Must be at least 256MB in size.
+
+      * Do not translate addresses (the address on PCIe is the same as the
+	address on the PowerBus).  There is a way to also set the top 14
+	bits which are not conveyed by PowerBus but we don't use this.
+
+      * Can be configured to be segmented.  When not segmented, we can
+	specify the PE# for the entire window.  When segmented, a window
+	has 256 segments; however, there is no table for mapping a segment
+	to a PE#.  The segment number *is* the PE#.
+
+      * Support overlaps.  If an address is covered by multiple windows,
+	there's a defined ordering for which window applies.
+
+    We have code (fairly new compared to the M32 stuff) that exploits that
+    for large BARs in 64-bit space:
+
+    We configure an M64 window to cover the entire region of address space
+    that has been assigned by FW for the PHB (about 64GB; ignore the space
+    for the M32, which comes out of a different "reserve").  We configure
+    it as segmented.
+
+    Then we do the same thing as with M32, using the bridge alignment
+    trick to match those giant segments.
+
+    Since we cannot remap, we have two additional constraints:
+
+    - We do the PE# allocation *after* the 64-bit space has been assigned
+      because the addresses we use directly determine the PE#.  We then
+      update the M32 PE# for the devices that use both 32-bit and 64-bit
+      spaces or assign the remaining PE# to 32-bit only devices.
+
+    - We cannot "group" segments in HW, so if a device ends up using more
+      than one segment, we end up with more than one PE#.  There is a HW
+      mechanism to make the freeze state cascade to "companion" PEs but
+      that only works for PCIe error messages (typically used so that if
+      you freeze a switch, it freezes all its children).  So we do it in
+      SW.  We lose a bit of effectiveness of EEH in that case, but that's
+      the best we found.  So when any of the PEs freezes, we freeze the
+      other ones for that "domain".  We thus introduce the concept of
+      "master PE" which is the one used for DMA, MSIs, etc., and "secondary
+      PEs" that are used for the remaining M64 segments.
+
+    We would like to investigate using additional M64 windows in "single
+    PE" mode to overlay over specific BARs to work around some of that, for
+    example for devices with very large BARs, e.g., GPUs.  It would make
+    sense, but we haven't done it yet.
+
+3. PowerNV Platform Considerations for SR-IOV
+
+  * SR-IOV Background
+
+    The PCIe SR-IOV feature allows a single Physical Function (PF) to
+    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
+    Capability control the number of VFs and whether they are enabled.
+
+    When VFs are enabled, they appear in Configuration Space like normal
+    PCI devices, but the BARs in VF config space headers are unusual.  For
+    a non-VF device, software uses BARs in the config space header to
+    discover the BAR sizes and assign addresses for them.  For VF devices,
+    software uses VF BAR registers in the *PF* SR-IOV Capability to
+    discover sizes and assign addresses.  The BARs in the VF's config space
+    header are read-only zeros.
+
+    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
+    base address for all the corresponding VF(n) BARs.  For example, if the
+    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
+    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
+    This region is divided into eight contiguous 1MB regions, each of which
+    is a BAR0 for one of the VFs.  Note that even though the VF BAR
+    describes an 8MB region, the alignment requirement is for a single VF,
+    i.e., 1MB in this example.
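
The VF(n) BAR layout described above can be sketched as simple offset
arithmetic (illustrative only; `vf_bar_addr` is a hypothetical helper, not
a kernel API).  With 8 VFs and a 1MB VF BAR0, the base programmed into the
PF's VF BAR0 defines an 8MB region, and VF i's BAR0 sits at base + i * 1MB:

```python
# Sketch of the VF(n) BAR address layout (illustrative, not kernel code).
MB = 1024 * 1024

def vf_bar_addr(vf_bar_base, vf_bar_size, vf_index):
    """Address of BAR0 for VF number vf_index, given the base programmed
    into the PF's SR-IOV Capability VF BAR0."""
    return vf_bar_base + vf_index * vf_bar_size

base = 0x9000_0000          # example base; only 1MB alignment is required
num_vfs = 8
size = 1 * MB

region = [vf_bar_addr(base, size, i) for i in range(num_vfs)]
# 8 contiguous 1MB BAR0s covering one 8MB region
print([hex(a) for a in region])
```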
+
+  There are several strategies for isolating VFs in PEs:
+
+  - M32 window: There's one M32 window, and it is split into 256
+    equally-sized segments.  The finest granularity possible is a 256MB
+    window with 1MB segments.  VF BARs that are 1MB or larger could be
+    mapped to separate PEs in this window.  Each segment can be
+    individually mapped to a PE via the lookup table, so this is quite
+    flexible, but it works best when all the VF BARs are the same size.  If
+    they are different sizes, the entire window has to be small enough that
+    the segment size matches the smallest VF BAR, which means larger VF
+    BARs span several segments.
+
+  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
+    to a single PE, so it could only isolate one VF.
+
+  - Single segmented M64 windows: A segmented M64 window could be used just
+    like the M32 window, but the segments can't be individually mapped to
+    PEs (the segment number is the PE#), so there isn't as much
+    flexibility.  A VF with multiple BARs would have to be in a "domain" of
+    multiple PEs, which is not as well isolated as a single PE.
+
+  - Multiple segmented M64 windows: As usual, each window is split into 256
+    equally-sized segments, and the segment number is the PE#.  But if we
+    use several M64 windows, they can be set to different base addresses
+    and different segment sizes.  If we have VFs that each have a 1MB BAR
+    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
+    another M64 window to assign 32MB segments.
+
+  Finally, the plan is to use M64 windows for SR-IOV, which will be
+  described in more detail in the next two sections.  For a given VF BAR,
+  we need to effectively reserve all 256 segments (256 * VF BAR size) and
+  position the VF BAR to start at the beginning of a free range of
+  segments/PEs inside that M64 window.
+
+  The goal is of course to be able to give each VF a separate PE.
+
+  The PowerNV IODA2 platform has 16 M64 windows, which are used to map
+  MMIO ranges to PE#s.  Each M64 window defines one MMIO range, and this
+  range is divided into 256 segments, with each segment corresponding to
+  one PE.
+
+  We decided to leverage these M64 windows to map VFs to individual PEs,
+  since the SR-IOV VF BARs are all the same size.
+
+  But doing so introduces another problem: total_VFs is usually smaller
+  than the number of M64 window segments, so if we map one VF BAR directly
+  to one M64 window, some part of the M64 window will map to another
+  device's MMIO range.
+
+  IODA supports 256 PEs, so a segmented window contains 256 segments.  If
+  total_VFs is less than 256, we have the situation shown in Figure 1.0,
+  where segments [total_VFs, 255] of the M64 window may map to the MMIO
+  range of other devices:
+
+     0      1                     total_VFs - 1
+     +------+------+-     -+------+------+
+     |      |      |  ...  |      |      |
+     +------+------+-     -+------+------+
+
+                           VF(n) BAR space
+
+     0      1                     total_VFs - 1                255
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 window
+
+		Figure 1.0 Direct map VF(n) BAR space
+
+  Our current solution is to allocate 256 segments even if the VF(n) BAR
+  space doesn't need that much, as shown in Figure 1.1:
+
+     0      1                     total_VFs - 1                255
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           VF(n) BAR space + extra
+
+     0      1                     total_VFs - 1                255
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+			   M64 window
+
+		Figure 1.1 Map VF(n) BAR space + extra
+
+  Allocating the extra space ensures that the entire M64 window will be
+  assigned to this one SR-IOV device and none of the space will be
+  available for other devices.  Note that this only expands the space
+  reserved in software; there are still only total_VFs VFs, and they only
+  respond to segments [0, total_VFs - 1].  There's nothing in hardware that
+  responds to segments [total_VFs, 255].
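
The reservation arithmetic above can be sketched as follows (illustrative
only, not kernel code; the helpers are hypothetical).  With a 1MB VF BAR
and 128 VFs, we still reserve 256MB so the whole window belongs to this
device, and the segment number of an address directly gives its PE#:

```python
# Sketch: reserving a full 256-segment M64 window per VF BAR
# (illustrative, not kernel code).
NUM_SEGMENTS = 256
MB = 1024 * 1024

def reserved_size(vf_bar_size):
    """Space reserved in software: 256 * (single VF BAR size), even when
    total_VFs is much smaller."""
    return NUM_SEGMENTS * vf_bar_size

def pe_number(window_base, vf_bar_size, addr):
    """In a segmented M64 window, the segment number *is* the PE#."""
    return (addr - window_base) // vf_bar_size

vf_bar = 1 * MB
total_vfs = 128

print(reserved_size(vf_bar) // MB)   # 256 (MB) reserved for 128 real VFs
# VFs occupy segments/PEs [0, total_vfs - 1]; segments [total_vfs, 255]
# are reserved but nothing in hardware responds there.
print(pe_number(0xA000_0000, vf_bar, 0xA000_0000 + 5 * MB))   # PE# 5
```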
+
+4. Implications for the Generic PCI Code
+
+The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
+aligned to the size of an individual VF BAR.
+
+On PowerNV, we want to align the base of the VF(n) BAR space to the size of
+an M64 window, not just the size of an individual VF BAR.  This means the
+VF(n) BAR space can fit exactly in the M64 window.
+
+>>>> Bjorn's speculation follows:
+
+In IODA2, the MMIO address determines the PE#.  If the address is in an M32
+window, we can set the PE# by updating the table that translates segments
+to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
+set the PE# for the window.  But if it's in a segmented M64 window, the
+segment number is the PE#.
+
+Therefore, the only way to control the PE# for a VF is to change the base
+of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
+amount of space required for the VF(n) BAR space, the VF BAR value is fixed
+and cannot be changed.
+
+On the other hand, if the PCI core allocates additional space, the VF BAR
+value can be changed as long as the entire VF(n) BAR space remains inside
+the space allocated by the core.
+
+Ideally the segment size will be the same as an individual VF BAR size.
+Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
+are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
+allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.
+
+If the segment size is smaller than the VF BAR size, it will take several
+segments to cover a VF BAR, and a VF will be in several PEs.  This is
+possible, but the isolation isn't as good, and it reduces the number of
+PE# choices because instead of consuming only numVFs segments, the VF(n)
+BAR space will consume (numVFs * k) segments, where k is the number of
+segments needed to cover one VF BAR.  That means there aren't as many
+available segments for adjusting the base of the VF(n) BAR space.
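
As a rough sketch of that segment-count tradeoff (this whole subsection is
explicitly speculation, so treat the arithmetic as illustrative rather than
the kernel's actual policy): the slack left for moving the base shrinks as
each VF BAR needs more segments.

```python
# Illustrative count of slack segments left for placing the VF(n) BAR
# space within a 256-segment M64 window (not kernel code).
NUM_SEGMENTS = 256

def base_slack(num_vfs, segs_per_vf_bar=1):
    """Segments left over after the VF(n) BAR space is placed; more slack
    means more choices for the PE# of VF0."""
    return NUM_SEGMENTS - num_vfs * segs_per_vf_bar

print(base_slack(128))      # 128 segments of slack when segment == VF BAR
print(base_slack(128, 2))   # 0: two segments per VF BAR fills the window
```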


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 08/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()
  2015-02-24  8:33 ` [PATCH v12 08/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable() Bjorn Helgaas
@ 2015-02-24  8:39   ` Bjorn Helgaas
  2015-03-02  6:53       ` Wei Yang
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:39 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:33:52AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> VFs are dynamically created when a driver enables them.  On some platforms,
> like PowerNV, special resources are necessary to enable VFs.
> 
> Add platform hooks for enabling and disabling VFs.
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/iov.c |   19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 5643a1011e23..cc6fedf4a1b9 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -220,6 +220,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>  	pci_dev_put(dev);
>  }
>  
> +int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)

I think this "vf_num" parameter should be renamed to something like
"num_vfs" instead.  It's subtle, but "vf_num" suggests that we're talking
about one of several VFs, e.g., VF1 or VF2.  But here we really mean the
total number of VFs that we're enabling.

There's similar code in the powerpc implementation that should be
renamed the same way.

> +{
> +       return 0;
> +}
> +
>  static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  {
>  	int rc;
> @@ -231,6 +236,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  	struct pci_sriov *iov = dev->sriov;
>  	int bars = 0;
>  	int bus;
> +	int retval;
>  
>  	if (!nr_virtfn)
>  		return 0;
> @@ -307,6 +313,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  	if (nr_virtfn < initial)
>  		initial = nr_virtfn;
>  
> +	if ((retval = pcibios_sriov_enable(dev, initial))) {
> +		dev_err(&dev->dev, "failure %d from pcibios_sriov_enable()\n",
> +			retval);
> +		return retval;
> +	}
> +
>  	for (i = 0; i < initial; i++) {
>  		rc = virtfn_add(dev, i, 0);
>  		if (rc)
> @@ -335,6 +347,11 @@ failed:
>  	return rc;
>  }
>  
> +int __weak pcibios_sriov_disable(struct pci_dev *pdev)
> +{
> +       return 0;
> +}
> +
>  static void sriov_disable(struct pci_dev *dev)
>  {
>  	int i;
> @@ -346,6 +363,8 @@ static void sriov_disable(struct pci_dev *dev)
>  	for (i = 0; i < iov->num_VFs; i++)
>  		virtfn_remove(dev, i, 0);
>  
> +	pcibios_sriov_disable(dev);
> +
>  	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
>  	pci_cfg_access_lock(dev);
>  	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
> 


* Re: [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
  2015-02-24  8:34 ` [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning Bjorn Helgaas
@ 2015-02-24  8:41   ` Bjorn Helgaas
  2015-03-02  7:32       ` Wei Yang
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:41 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:34:06AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> When sizing and assigning resources, we divide the resources into two
> lists: the requested list and the additional list.  We don't consider the
> alignment of additional VF(n) BAR space.
> 
> This is reasonable because the alignment required for the VF(n) BAR space
> is the size of an individual VF BAR, not the size of the space for *all*
> VFs.  But some platforms, e.g., PowerNV, require additional alignment.
> 
> Consider the additional IOV BAR alignment when sizing and assigning
> resources.  When there is not enough system MMIO space, the PF's IOV BAR
> alignment will not contribute to the bridge.  When there is enough system
> MMIO space, the additional alignment will contribute to the bridge.

I don't understand the "when there is not enough system MMIO space" part.
How do we tell if there's enough MMIO space?

> Also, take advantage of pci_dev_resource::min_align to store this
> additional alignment.

This comment doesn't seem to make sense; this patch doesn't save anything
in min_align.

Another question below...

> [bhelgaas: changelog, printk cast]
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/setup-bus.c |   83 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 70 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index e3e17f3c0f0f..affbceae560f 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
>  	}
>  }
>  
> -static resource_size_t get_res_add_size(struct list_head *head,
> -					struct resource *res)
> +static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
> +					       struct resource *res)
>  {
>  	struct pci_dev_resource *dev_res;
>  
> @@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
>  			int idx = res - &dev_res->dev->resource[0];
>  
>  			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
> -				 "res[%d]=%pR get_res_add_size add_size %llx\n",
> +				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
>  				 idx, dev_res->res,
> -				 (unsigned long long)dev_res->add_size);
> +				 (unsigned long long)dev_res->add_size,
> +				 (unsigned long long)dev_res->min_align);
>  
> -			return dev_res->add_size;
> +			return dev_res;
>  		}
>  	}
>  
> -	return 0;
> +	return NULL;
> +}
> +
> +static resource_size_t get_res_add_size(struct list_head *head,
> +					struct resource *res)
> +{
> +	struct pci_dev_resource *dev_res;
> +
> +	dev_res = res_to_dev_res(head, res);
> +	return dev_res ? dev_res->add_size : 0;
> +}
> +
> +static resource_size_t get_res_add_align(struct list_head *head,
> +					 struct resource *res)
> +{
> +	struct pci_dev_resource *dev_res;
> +
> +	dev_res = res_to_dev_res(head, res);
> +	return dev_res ? dev_res->min_align : 0;
>  }
>  
> +
>  /* Sort resources by alignment */
>  static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
>  {
> @@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
>  	LIST_HEAD(save_head);
>  	LIST_HEAD(local_fail_head);
>  	struct pci_dev_resource *save_res;
> -	struct pci_dev_resource *dev_res, *tmp_res;
> +	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
>  	unsigned long fail_type;
> +	resource_size_t add_align, align;
>  
>  	/* Check if optional add_size is there */
>  	if (!realloc_head || list_empty(realloc_head))
> @@ -384,10 +405,38 @@ static void __assign_resources_sorted(struct list_head *head,
>  	}
>  
>  	/* Update res in head list with add_size in realloc_head list */
> -	list_for_each_entry(dev_res, head, list)
> +	list_for_each_entry_safe(dev_res, tmp_res, head, list) {
>  		dev_res->res->end += get_res_add_size(realloc_head,
>  							dev_res->res);
>  
> +		/*
> +		 * There are two kinds of additional resources in the list:
> +		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
> +		 * 2. SR-IOV resource   -- IORESOURCE_SIZEALIGN
> +		 * Here just fix the additional alignment for bridge
> +		 */
> +		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
> +			continue;
> +
> +		add_align = get_res_add_align(realloc_head, dev_res->res);
> +
> +		/* Reorder the list by their alignment */

Why do we need to reorder the list by alignment?

> +		if (add_align > dev_res->res->start) {
> +			dev_res->res->start = add_align;
> +			dev_res->res->end = add_align +
> +				            resource_size(dev_res->res);
> +
> +			list_for_each_entry(dev_res2, head, list) {
> +				align = pci_resource_alignment(dev_res2->dev,
> +							       dev_res2->res);
> +				if (add_align > align)
> +					list_move_tail(&dev_res->list,
> +						       &dev_res2->list);
> +			}
> +               }
> +
> +	}
> +
>  	/* Try updated head list with add_size added */
>  	assign_requested_resources_sorted(head, &local_fail_head);
>  
> @@ -962,6 +1011,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>  	struct resource *b_res = find_free_bus_resource(bus,
>  					mask | IORESOURCE_PREFETCH, type);
>  	resource_size_t children_add_size = 0;
> +	resource_size_t children_add_align = 0;
> +	resource_size_t add_align = 0;
>  
>  	if (!b_res)
>  		return -ENOSPC;
> @@ -986,6 +1037,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>  			/* put SRIOV requested res to the optional list */
>  			if (realloc_head && i >= PCI_IOV_RESOURCES &&
>  					i <= PCI_IOV_RESOURCE_END) {
> +				add_align = max(pci_resource_alignment(dev, r), add_align);
>  				r->end = r->start - 1;
>  				add_to_list(realloc_head, dev, r, r_size, 0/* don't care */);
>  				children_add_size += r_size;
> @@ -1016,19 +1068,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>  			if (order > max_order)
>  				max_order = order;
>  
> -			if (realloc_head)
> +			if (realloc_head) {
>  				children_add_size += get_res_add_size(realloc_head, r);
> +				children_add_align = get_res_add_align(realloc_head, r);
> +				add_align = max(add_align, children_add_align);
> +			}
>  		}
>  	}
>  
>  	min_align = calculate_mem_align(aligns, max_order);
>  	min_align = max(min_align, window_alignment(bus, b_res->flags));
>  	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
> +	add_align = max(min_align, add_align);
>  	if (children_add_size > add_size)
>  		add_size = children_add_size;
>  	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
>  		calculate_memsize(size, min_size, add_size,
> -				resource_size(b_res), min_align);
> +				resource_size(b_res), add_align);
>  	if (!size0 && !size1) {
>  		if (b_res->start || b_res->end)
>  			dev_info(&bus->self->dev, "disabling bridge window %pR to %pR (unused)\n",
> @@ -1040,10 +1096,11 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>  	b_res->end = size0 + min_align - 1;
>  	b_res->flags |= IORESOURCE_STARTALIGN;
>  	if (size1 > size0 && realloc_head) {
> -		add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align);
> -		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx\n",
> +		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
> +		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx add_align %llx\n",
>  			   b_res, &bus->busn_res,
> -			   (unsigned long long)size1-size0);
> +			   (unsigned long long) (size1 - size0),
> +			   (unsigned long long) add_align);
>  	}
>  	return 0;
>  }
> 


* Re: [PATCH v12 11/21] powerpc/pci: Don't unset PCI resources for VFs
  2015-02-24  8:34 ` [PATCH v12 11/21] powerpc/pci: Don't unset PCI resources for VFs Bjorn Helgaas
@ 2015-02-24  8:44   ` Bjorn Helgaas
  2015-03-02  7:34       ` Wei Yang
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:44 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:34:13AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> If we're going to reassign resources with flag PCI_REASSIGN_ALL_RSRC, all
> resources will be cleaned out during device header fixup time and then get
> reassigned by PCI core.  However, the VF resources won't be reassigned and
> thus, we shouldn't clean them out.
> 
> If the pci_dev is a VF, skip the resource unset process.

I think this patch is correct, but we should include a little more detail
in the changelog to answer questions like mine and Ben's
(http://lkml.kernel.org/r/1423528584.4924.70.camel@au1.ibm.com).

> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  arch/powerpc/kernel/pci-common.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 2a525c938158..82031011522f 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
>  		       pci_name(dev));
>  		return;
>  	}
> +
> +	if (dev->is_virtfn)
> +		return;
> +
>  	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
>  		struct resource *res = dev->resource + i;
>  		struct pci_bus_region reg;
> 


* Re: [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
  2015-02-24  8:34 ` [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically Bjorn Helgaas
@ 2015-02-24  8:46   ` Bjorn Helgaas
  2015-03-02  7:50       ` Wei Yang
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:46 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:34:35AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> Current iommu_table of a PE is a static field.  This will have a problem
> when iommu_free_table() is called.
> 
> Allocate iommu_table dynamically.

I'd like a little more explanation about why we're calling
iommu_free_table() now when we didn't call it before.  Maybe this happens
when we disable SR-IOV and the VFs go away?

Is there a hotplug remove path where we should also be calling
iommu_free_table()?

> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  arch/powerpc/include/asm/iommu.h          |    3 +++
>  arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++++++++++++++------------
>  arch/powerpc/platforms/powernv/pci.h      |    2 +-
>  3 files changed, 18 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 9cfa3706a1b8..5574eeb97634 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -78,6 +78,9 @@ struct iommu_table {
>  	struct iommu_group *it_group;
>  #endif
>  	void (*set_bypass)(struct iommu_table *tbl, bool enable);
> +#ifdef CONFIG_PPC_POWERNV
> +	void           *data;
> +#endif
>  };
>  
>  /* Pure 2^n version of get_order */
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 58c4fc4ab63c..cd1a56160ded 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -916,6 +916,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>  		return;
>  	}
>  
> +	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
> +			GFP_KERNEL, hose->node);
> +	pe->tce32_table->data = pe;
> +
>  	/* Associate it with all child devices */
>  	pnv_ioda_setup_same_PE(bus, pe);
>  
> @@ -1005,7 +1009,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
>  
>  	pe = &phb->ioda.pe_array[pdn->pe_number];
>  	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
> -	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
> +	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
>  }
>  
>  static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
> @@ -1032,7 +1036,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
>  	} else {
>  		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
>  		set_dma_ops(&pdev->dev, &dma_iommu_ops);
> -		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
> +		set_iommu_table_base(&pdev->dev, pe->tce32_table);
>  	}
>  	*pdev->dev.dma_mask = dma_mask;
>  	return 0;
> @@ -1069,9 +1073,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		if (add_to_iommu_group)
>  			set_iommu_table_base_and_group(&dev->dev,
> -						       &pe->tce32_table);
> +						       pe->tce32_table);
>  		else
> -			set_iommu_table_base(&dev->dev, &pe->tce32_table);
> +			set_iommu_table_base(&dev->dev, pe->tce32_table);
>  
>  		if (dev->subordinate)
>  			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
> @@ -1161,8 +1165,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>  void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
>  				 __be64 *startp, __be64 *endp, bool rm)
>  {
> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> -					      tce32_table);
> +	struct pnv_ioda_pe *pe = tbl->data;
>  	struct pnv_phb *phb = pe->phb;
>  
>  	if (phb->type == PNV_PHB_IODA1)
> @@ -1228,7 +1231,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  	}
>  
>  	/* Setup linux iommu table */
> -	tbl = &pe->tce32_table;
> +	tbl = pe->tce32_table;
>  	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
>  				  base << 28, IOMMU_PAGE_SHIFT_4K);
>  
> @@ -1266,8 +1269,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  
>  static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>  {
> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> -					      tce32_table);
> +	struct pnv_ioda_pe *pe = tbl->data;
>  	uint16_t window_id = (pe->pe_number << 1 ) + 1;
>  	int64_t rc;
>  
> @@ -1312,10 +1314,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>  	pe->tce_bypass_base = 1ull << 59;
>  
>  	/* Install set_bypass callback for VFIO */
> -	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
> +	pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
>  
>  	/* Enable bypass by default */
> -	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
> +	pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
>  }
>  
>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> @@ -1363,7 +1365,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  	}
>  
>  	/* Setup linux iommu table */
> -	tbl = &pe->tce32_table;
> +	tbl = pe->tce32_table;
>  	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
>  			IOMMU_PAGE_SHIFT_4K);
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index e5b75b298d95..731777734bca 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -53,7 +53,7 @@ struct pnv_ioda_pe {
>  	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>  	int			tce32_seg;
>  	int			tce32_segcount;
> -	struct iommu_table	tce32_table;
> +	struct iommu_table	*tce32_table;
>  	phys_addr_t		tce_inval_reg_phys;
>  
>  	/* 64-bit TCE bypass region */
> 


* Re: [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-02-24  8:34 ` [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe Bjorn Helgaas
@ 2015-02-24  8:52   ` Bjorn Helgaas
  2015-03-02  7:41       ` Wei Yang
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:52 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:34:42AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> On PHB3, PF IOV BAR will be covered by M64 window to have better PE
> isolation.  The total_pe number is usually different from total_VFs, which
> can lead to a conflict between MMIO space and the PE number.
> 
> For example, if total_VFs is 128 and total_pe is 256, the second half of
> M64 window will be part of other PCI device, which may already belong
> to other PEs.

I'm still trying to wrap my mind around the explanation here.

I *think* what's going on is that the M64 window must be a power-of-two
size.  If the VF(n) BAR space doesn't completely fill it, we might allocate
the leftover space to another device.  Then the M64 window for *this*
device may cause the other device to be associated with a PE it didn't
expect.

But I don't understand this well enough to describe it clearly.

More serious code question below...

> Prevent the conflict by reserving additional space for the PF IOV BAR,
> which is total_pe number of VF's BAR size.
> 
> [bhelgaas: make dev_printk() output more consistent, index resource[]
> conventionally]
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  arch/powerpc/include/asm/machdep.h        |    4 ++
>  arch/powerpc/include/asm/pci-bridge.h     |    3 ++
>  arch/powerpc/kernel/pci-common.c          |    5 +++
>  arch/powerpc/platforms/powernv/pci-ioda.c |   58 +++++++++++++++++++++++++++++
>  4 files changed, 70 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index c8175a3fe560..965547c58497 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -250,6 +250,10 @@ struct machdep_calls {
>  	/* Reset the secondary bus of bridge */
>  	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
>  
> +#ifdef CONFIG_PCI_IOV
> +	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
> +#endif /* CONFIG_PCI_IOV */
> +
>  	/* Called to shutdown machine specific hardware not already controlled
>  	 * by other drivers.
>  	 */
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index 513f8f27060d..de11de7d4547 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -175,6 +175,9 @@ struct pci_dn {
>  #define IODA_INVALID_PE		(-1)
>  #ifdef CONFIG_PPC_POWERNV
>  	int	pe_number;
> +#ifdef CONFIG_PCI_IOV
> +	u16     max_vfs;		/* number of VFs IOV BAR expended */
> +#endif /* CONFIG_PCI_IOV */
>  #endif
>  	struct list_head child_list;
>  	struct list_head list;
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 82031011522f..022e9feeb1f2 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -1646,6 +1646,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
>  	if (ppc_md.pcibios_fixup_phb)
>  		ppc_md.pcibios_fixup_phb(hose);
>  
> +#ifdef CONFIG_PCI_IOV
> +	if (ppc_md.pcibios_fixup_sriov)
> +		ppc_md.pcibios_fixup_sriov(bus);
> +#endif /* CONFIG_PCI_IOV */
> +
>  	/* Configure PCI Express settings */
>  	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
>  		struct pci_bus *child;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index cd1a56160ded..36c533da5ccb 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1749,6 +1749,61 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
>  static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
>  #endif /* CONFIG_PCI_MSI */
>  
> +#ifdef CONFIG_PCI_IOV
> +static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
> +{
> +	struct pci_controller *hose;
> +	struct pnv_phb *phb;
> +	struct resource *res;
> +	int i;
> +	resource_size_t size;
> +	struct pci_dn *pdn;
> +
> +	if (!pdev->is_physfn || pdev->is_added)
> +		return;
> +
> +	hose = pci_bus_to_host(pdev->bus);
> +	phb = hose->private_data;
> +
> +	pdn = pci_get_pdn(pdev);
> +	pdn->max_vfs = 0;
> +
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> +		if (!res->flags || res->parent)
> +			continue;
> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
> +			dev_warn(&pdev->dev, "Skipping expanding VF BAR%d: %pR\n",
> +				 i, res);
> +			continue;
> +		}
> +
> +		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
> +		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
> +		res->end = res->start + size * phb->ioda.total_pe - 1;
> +		dev_dbg(&pdev->dev, "                       %pR\n", res);
> +		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)\n",
> +				i, res, phb->ioda.total_pe);
> +	}
> +	pdn->max_vfs = phb->ioda.total_pe;
> +}
> +
> +static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
> +{
> +	struct pci_dev *pdev;
> +	struct pci_bus *b;
> +
> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
> +		b = pdev->subordinate;
> +
> +		if (b)
> +			pnv_pci_ioda_fixup_sriov(b);
> +
> +		pnv_pci_ioda_fixup_iov_resources(pdev);

I'm not sure this happens at the right time.  We have this call chain:

  pcibios_scan_phb
    pci_create_root_bus
    pci_scan_child_bus
    pnv_pci_ioda_fixup_sriov
      pnv_pci_ioda_fixup_iov_resources
	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
	  increase res->size to accommodate 256 PEs (or roundup(totalVFs))

so we only do the fixup_iov_resources() when we scan the PHB, and we
wouldn't do it at all for hot-added devices.

> +	}
> +}
> +#endif /* CONFIG_PCI_IOV */
> +
>  /*
>   * This function is supposed to be called on basis of PE from top
>   * to bottom style. So the I/O or MMIO segment assigned to
> @@ -2125,6 +2180,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
>  	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
> +#ifdef CONFIG_PCI_IOV
> +	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
> +#endif /* CONFIG_PCI_IOV */
>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>  
>  	/* Reset IODA tables to a clean state */
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-02-24  8:34 ` [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset Bjorn Helgaas
@ 2015-02-24  9:00   ` Bjorn Helgaas
  2015-02-24 17:10     ` Bjorn Helgaas
  2015-03-04  3:01       ` Wei Yang
  2015-02-24  9:03   ` Bjorn Helgaas
  1 sibling, 2 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  9:00 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> On the PowerNV platform, a resource's position in M64 implies the PE# the
> resource belongs to.  In some cases, a resource must be adjusted to place
> it at the correct position in M64.
> 
> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
> according to an offset.
> 
> [bhelgaas: rework loops, rework overlap check, index resource[]
> conventionally, remove pci_regs.h include, squashed with next patch]
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

...

> +#ifdef CONFIG_PCI_IOV
> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> +{
> +	struct pci_dn *pdn = pci_get_pdn(dev);
> +	int i;
> +	struct resource *res, res2;
> +	resource_size_t size;
> +	u16 vf_num;
> +
> +	if (!dev->is_physfn)
> +		return -EINVAL;
> +
> +	/*
> +	 * "offset" is in VFs.  The M64 windows are sized so that when they
> +	 * are segmented, each segment is the same size as the IOV BAR.
> +	 * Each segment is in a separate PE, and the high order bits of the
> +	 * address are the PE number.  Therefore, each VF's BAR is in a
> +	 * separate PE, and changing the IOV BAR start address changes the
> +	 * range of PEs the VFs are in.
> +	 */
> +	vf_num = pdn->vf_pes;
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
> +		/*
> +		 * The actual IOV BAR range is determined by the start address
> +		 * and the size needed for vf_num VF BARs.  This check is to
> +		 * make sure that after shifting, the range will not overlap
> +		 * with another device.
> +		 */
> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> +		res2.flags = res->flags;
> +		res2.start = res->start + (size * offset);
> +		res2.end = res2.start + (size * vf_num) - 1;
> +
> +		if (res2.end > res->end) {
> +			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
> +				i, &res2, res, vf_num, offset);
> +			return -EBUSY;
> +		}
> +	}
> +
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> +		res2 = *res;
> +		res->start += size * offset;

I'm still not happy about this fiddling with res->start.

Increasing res->start means that in principle, the "size * offset" bytes
that we just removed from res are now available for allocation to somebody
else.  I don't think we *will* give that space to anything else because of
the alignment restrictions you're enforcing, but "res" now doesn't
correctly describe the real resource map.

Would you be able to just update the BAR here while leaving the struct
resource alone?  In that case, it would look a little funny that lspci
would show a BAR value in the middle of the region in /proc/iomem, but
the /proc/iomem region would be more correct.

> +
> +		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
> +			 i, &res2, res, vf_num, offset);
> +		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
> +	}
> +	pdn->max_vfs -= offset;
> +	return 0;
> +}
> +#endif /* CONFIG_PCI_IOV */


* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-02-24  8:34 ` [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset Bjorn Helgaas
  2015-02-24  9:00   ` Bjorn Helgaas
@ 2015-02-24  9:03   ` Bjorn Helgaas
  1 sibling, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  9:03 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> On the PowerNV platform, a resource's position in M64 implies the PE# the
> resource belongs to.  In some cases, a resource must be adjusted to place
> it at the correct position in M64.
> 
> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
> according to an offset.

I think I squashed "powerpc/powernv: Allocate VF PE" into this one, but I
didn't merge its changelog.  The two patches don't seem closely related,
but I think there really was some dependency.

> [bhelgaas: rework loops, rework overlap check, index resource[]
> conventionally, remove pci_regs.h include, squashed with next patch]
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  arch/powerpc/include/asm/pci-bridge.h     |    4 
>  arch/powerpc/kernel/pci_dn.c              |   11 +
>  arch/powerpc/platforms/powernv/pci-ioda.c |  520 ++++++++++++++++++++++++++++-
>  arch/powerpc/platforms/powernv/pci.c      |   18 +
>  arch/powerpc/platforms/powernv/pci.h      |    7 
>  5 files changed, 543 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index de11de7d4547..011340df8583 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -177,6 +177,10 @@ struct pci_dn {
>  	int	pe_number;
>  #ifdef CONFIG_PCI_IOV
> +	u16     max_vfs;		/* number of VFs IOV BAR expanded */
> +	u16     vf_pes;			/* VF PE# under this PF */
> +	int     offset;			/* PE# for the first VF PE */
> +#define IODA_INVALID_M64        (-1)
> +	int     m64_wins[PCI_SRIOV_NUM_BARS];
>  #endif /* CONFIG_PCI_IOV */
>  #endif
>  	struct list_head child_list;
> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index f3a1a81d112f..5faf7ca45434 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -217,6 +217,17 @@ void remove_dev_pci_info(struct pci_dev *pdev)
>  	struct pci_dn *pdn, *tmp;
>  	int i;
>  
> +	/*
> +	 * VF and VF PE are created/released dynamically, so we need to
> +	 * bind/unbind them.  Otherwise the VF and VF PE would be mismatched
> +	 * when re-enabling SR-IOV.
> +	 */
> +	if (pdev->is_virtfn) {
> +		pdn = pci_get_pdn(pdev);
> +		pdn->pe_number = IODA_INVALID_PE;
> +		return;
> +	}
> +
>  	/* Only support IOV PF for now */
>  	if (!pdev->is_physfn)
>  		return;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 6a86690bb8de..a3c2fbe35fc8 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -44,6 +44,9 @@
>  #include "powernv.h"
>  #include "pci.h"
>  
> +/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
> +#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
> +
>  static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>  			    const char *fmt, ...)
>  {
> @@ -56,11 +59,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>  	vaf.fmt = fmt;
>  	vaf.va = &args;
>  
> -	if (pe->pdev)
> +	if (pe->flags & PNV_IODA_PE_DEV)
>  		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
> -	else
> +	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
>  		sprintf(pfix, "%04x:%02x     ",
>  			pci_domain_nr(pe->pbus), pe->pbus->number);
> +#ifdef CONFIG_PCI_IOV
> +	else if (pe->flags & PNV_IODA_PE_VF)
> +		sprintf(pfix, "%04x:%02x:%2x.%d",
> +			pci_domain_nr(pe->parent_dev->bus),
> +			(pe->rid & 0xff00) >> 8,
> +			PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
> +#endif /* CONFIG_PCI_IOV */
>  
>  	printk("%spci %s: [PE# %.3d] %pV",
>  	       level, pfix, pe->pe_number, &vaf);
> @@ -591,7 +601,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>  			      bool is_add)
>  {
>  	struct pnv_ioda_pe *slave;
> -	struct pci_dev *pdev;
> +	struct pci_dev *pdev = NULL;
>  	int ret;
>  
>  	/*
> @@ -630,8 +640,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>  
>  	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>  		pdev = pe->pbus->self;
> -	else
> +	else if (pe->flags & PNV_IODA_PE_DEV)
>  		pdev = pe->pdev->bus->self;
> +#ifdef CONFIG_PCI_IOV
> +	else if (pe->flags & PNV_IODA_PE_VF)
> +		pdev = pe->parent_dev->bus->self;
> +#endif /* CONFIG_PCI_IOV */
>  	while (pdev) {
>  		struct pci_dn *pdn = pci_get_pdn(pdev);
>  		struct pnv_ioda_pe *parent;
> @@ -649,6 +663,87 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PCI_IOV
> +static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
> +{
> +	struct pci_dev *parent;
> +	uint8_t bcomp, dcomp, fcomp;
> +	int64_t rc;
> +	long rid_end, rid;
> +
> +	/* Currently, we just deconfigure VF PEs.  Bus PEs will always be there. */
> +	if (pe->pbus) {
> +		int count;
> +
> +		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
> +		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
> +		parent = pe->pbus->self;
> +		if (pe->flags & PNV_IODA_PE_BUS_ALL)
> +			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
> +		else
> +			count = 1;
> +
> +		switch(count) {
> +		case  1: bcomp = OpalPciBusAll;         break;
> +		case  2: bcomp = OpalPciBus7Bits;       break;
> +		case  4: bcomp = OpalPciBus6Bits;       break;
> +		case  8: bcomp = OpalPciBus5Bits;       break;
> +		case 16: bcomp = OpalPciBus4Bits;       break;
> +		case 32: bcomp = OpalPciBus3Bits;       break;
> +		default:
> +			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
> +			        count);
> +			/* Do an exact match only */
> +			bcomp = OpalPciBusAll;
> +		}
> +		rid_end = pe->rid + (count << 8);
> +	} else {
> +		if (pe->flags & PNV_IODA_PE_VF)
> +			parent = pe->parent_dev;
> +		else
> +			parent = pe->pdev->bus->self;
> +		bcomp = OpalPciBusAll;
> +		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
> +		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
> +		rid_end = pe->rid + 1;
> +	}
> +
> +	/* Clear the reverse map */
> +	for (rid = pe->rid; rid < rid_end; rid++)
> +		phb->ioda.pe_rmap[rid] = 0;
> +
> +	/* Release from all parents' PELT-V */
> +	while (parent) {
> +		struct pci_dn *pdn = pci_get_pdn(parent);
> +		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> +			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
> +						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
> +			/* XXX What to do in case of error ? */
> +		}
> +		parent = parent->bus->self;
> +	}
> +
> +	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
> +				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> +
> +	/* Disassociate PE in PELT */
> +	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
> +				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
> +	if (rc)
> +		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
> +	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
> +			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
> +	if (rc)
> +		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
> +
> +	pe->pbus = NULL;
> +	pe->pdev = NULL;
> +	pe->parent_dev = NULL;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_PCI_IOV */
> +
>  static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>  {
>  	struct pci_dev *parent;
> @@ -675,15 +770,19 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>  		case 16: bcomp = OpalPciBus4Bits;	break;
>  		case 32: bcomp = OpalPciBus3Bits;	break;
>  		default:
> -			pr_err("%s: Number of subordinate busses %d"
> -			       " unsupported\n",
> -			       pci_name(pe->pbus->self), count);
> +			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
> +			        count);
>  			/* Do an exact match only */
>  			bcomp = OpalPciBusAll;
>  		}
>  		rid_end = pe->rid + (count << 8);
>  	} else {
> -		parent = pe->pdev->bus->self;
> +#ifdef CONFIG_PCI_IOV
> +		if (pe->flags & PNV_IODA_PE_VF)
> +			parent = pe->parent_dev;
> +		else
> +#endif /* CONFIG_PCI_IOV */
> +			parent = pe->pdev->bus->self;
>  		bcomp = OpalPciBusAll;
>  		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>  		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
> @@ -774,6 +873,74 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>  	return 10;
>  }
>  
> +#ifdef CONFIG_PCI_IOV
> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> +{
> +	struct pci_dn *pdn = pci_get_pdn(dev);
> +	int i;
> +	struct resource *res, res2;
> +	resource_size_t size;
> +	u16 vf_num;
> +
> +	if (!dev->is_physfn)
> +		return -EINVAL;
> +
> +	/*
> +	 * "offset" is in VFs.  The M64 windows are sized so that when they
> +	 * are segmented, each segment is the same size as the IOV BAR.
> +	 * Each segment is in a separate PE, and the high order bits of the
> +	 * address are the PE number.  Therefore, each VF's BAR is in a
> +	 * separate PE, and changing the IOV BAR start address changes the
> +	 * range of PEs the VFs are in.
> +	 */
> +	vf_num = pdn->vf_pes;
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
> +		/*
> +		 * The actual IOV BAR range is determined by the start address
> +		 * and the size needed for vf_num VF BARs.  This check is to
> +		 * make sure that after shifting, the range will not overlap
> +		 * with another device.
> +		 */
> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> +		res2.flags = res->flags;
> +		res2.start = res->start + (size * offset);
> +		res2.end = res2.start + (size * vf_num) - 1;
> +
> +		if (res2.end > res->end) {
> +			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
> +				i, &res2, res, vf_num, offset);
> +			return -EBUSY;
> +		}
> +	}
> +
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> +		res2 = *res;
> +		res->start += size * offset;
> +
> +		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
> +			 i, &res2, res, vf_num, offset);
> +		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
> +	}
> +	pdn->max_vfs -= offset;
> +	return 0;
> +}
> +#endif /* CONFIG_PCI_IOV */
> +
>  #if 0
>  static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>  {
> @@ -979,8 +1146,312 @@ static void pnv_pci_ioda_setup_PEs(void)
>  }
>  
>  #ifdef CONFIG_PCI_IOV
> +static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
> +{
> +	struct pci_bus        *bus;
> +	struct pci_controller *hose;
> +	struct pnv_phb        *phb;
> +	struct pci_dn         *pdn;
> +	int                    i;
> +
> +	bus = pdev->bus;
> +	hose = pci_bus_to_host(bus);
> +	phb = hose->private_data;
> +	pdn = pci_get_pdn(pdev);
> +
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		if (pdn->m64_wins[i] == IODA_INVALID_M64)
> +			continue;
> +		opal_pci_phb_mmio_enable(phb->opal_id,
> +				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
> +		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
> +		pdn->m64_wins[i] = IODA_INVALID_M64;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
> +{
> +	struct pci_bus        *bus;
> +	struct pci_controller *hose;
> +	struct pnv_phb        *phb;
> +	struct pci_dn         *pdn;
> +	unsigned int           win;
> +	struct resource       *res;
> +	int                    i;
> +	int64_t                rc;
> +
> +	bus = pdev->bus;
> +	hose = pci_bus_to_host(bus);
> +	phb = hose->private_data;
> +	pdn = pci_get_pdn(pdev);
> +
> +	/* Initialize the m64_wins to IODA_INVALID_M64 */
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
> +		pdn->m64_wins[i] = IODA_INVALID_M64;
> +
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
> +		do {
> +			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
> +					phb->ioda.m64_bar_idx + 1, 0);
> +
> +			if (win >= phb->ioda.m64_bar_idx + 1)
> +				goto m64_failed;
> +		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
> +
> +		pdn->m64_wins[i] = win;
> +
> +		/* Map the M64 here */
> +		rc = opal_pci_set_phb_mem_window(phb->opal_id,
> +						 OPAL_M64_WINDOW_TYPE,
> +						 pdn->m64_wins[i],
> +						 res->start,
> +						 0, /* unused */
> +						 resource_size(res));
> +		if (rc != OPAL_SUCCESS) {
> +			dev_err(&pdev->dev, "Failed to map M64 window #%d: %lld\n",
> +				win, rc);
> +			goto m64_failed;
> +		}
> +
> +		rc = opal_pci_phb_mmio_enable(phb->opal_id,
> +				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
> +		if (rc != OPAL_SUCCESS) {
> +			dev_err(&pdev->dev, "Failed to enable M64 window #%d: %llx\n",
> +				win, rc);
> +			goto m64_failed;
> +		}
> +	}
> +	return 0;
> +
> +m64_failed:
> +	pnv_pci_vf_release_m64(pdev);
> +	return -EBUSY;
> +}
> +
> +static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
> +{
> +	struct pci_bus        *bus;
> +	struct pci_controller *hose;
> +	struct pnv_phb        *phb;
> +	struct iommu_table    *tbl;
> +	unsigned long         addr;
> +	int64_t               rc;
> +
> +	bus = dev->bus;
> +	hose = pci_bus_to_host(bus);
> +	phb = hose->private_data;
> +	tbl = pe->tce32_table;
> +	addr = tbl->it_base;
> +
> +	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
> +				   pe->pe_number << 1, 1, __pa(addr),
> +				   0, 0x1000);
> +
> +	rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
> +				        pe->pe_number,
> +				        (pe->pe_number << 1) + 1,
> +				        pe->tce_bypass_base,
> +				        0);
> +	if (rc)
> +		pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
> +
> +	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> +	free_pages(addr, get_order(TCE32_TABLE_SIZE));
> +	pe->tce32_table = NULL;
> +}
> +
> +static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
> +{
> +	struct pci_bus        *bus;
> +	struct pci_controller *hose;
> +	struct pnv_phb        *phb;
> +	struct pnv_ioda_pe    *pe, *pe_n;
> +	struct pci_dn         *pdn;
> +
> +	bus = pdev->bus;
> +	hose = pci_bus_to_host(bus);
> +	phb = hose->private_data;
> +
> +	if (!pdev->is_physfn)
> +		return;
> +
> +	pdn = pci_get_pdn(pdev);
> +	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
> +		if (pe->parent_dev != pdev)
> +			continue;
> +
> +		pnv_pci_ioda2_release_dma_pe(pdev, pe);
> +
> +		/* Remove from list */
> +		mutex_lock(&phb->ioda.pe_list_mutex);
> +		list_del(&pe->list);
> +		mutex_unlock(&phb->ioda.pe_list_mutex);
> +
> +		pnv_ioda_deconfigure_pe(phb, pe);
> +
> +		pnv_ioda_free_pe(phb, pe->pe_number);
> +	}
> +}
> +
> +void pnv_pci_sriov_disable(struct pci_dev *pdev)
> +{
> +	struct pci_bus        *bus;
> +	struct pci_controller *hose;
> +	struct pnv_phb        *phb;
> +	struct pci_dn         *pdn;
> +	struct pci_sriov      *iov;
> +	u16 vf_num;
> +
> +	bus = pdev->bus;
> +	hose = pci_bus_to_host(bus);
> +	phb = hose->private_data;
> +	pdn = pci_get_pdn(pdev);
> +	iov = pdev->sriov;
> +	vf_num = pdn->vf_pes;
> +
> +	/* Release VF PEs */
> +	pnv_ioda_release_vf_PE(pdev);
> +
> +	if (phb->type == PNV_PHB_IODA2) {
> +		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
> +
> +		/* Release M64 windows */
> +		pnv_pci_vf_release_m64(pdev);
> +
> +		/* Release PE numbers */
> +		bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
> +		pdn->offset = 0;
> +	}
> +}
> +
> +static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> +				       struct pnv_ioda_pe *pe);
> +static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
> +{
> +	struct pci_bus        *bus;
> +	struct pci_controller *hose;
> +	struct pnv_phb        *phb;
> +	struct pnv_ioda_pe    *pe;
> +	int                    pe_num;
> +	u16                    vf_index;
> +	struct pci_dn         *pdn;
> +
> +	bus = pdev->bus;
> +	hose = pci_bus_to_host(bus);
> +	phb = hose->private_data;
> +	pdn = pci_get_pdn(pdev);
> +
> +	if (!pdev->is_physfn)
> +		return;
> +
> +	/* Reserve PE for each VF */
> +	for (vf_index = 0; vf_index < vf_num; vf_index++) {
> +		pe_num = pdn->offset + vf_index;
> +
> +		pe = &phb->ioda.pe_array[pe_num];
> +		pe->pe_number = pe_num;
> +		pe->phb = phb;
> +		pe->flags = PNV_IODA_PE_VF;
> +		pe->pbus = NULL;
> +		pe->parent_dev = pdev;
> +		pe->tce32_seg = -1;
> +		pe->mve_number = -1;
> +		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
> +			   pci_iov_virtfn_devfn(pdev, vf_index);
> +
> +		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%d\n",
> +			hose->global_number, pdev->bus->number,
> +			PCI_SLOT(pci_iov_virtfn_devfn(pdev, vf_index)),
> +			PCI_FUNC(pci_iov_virtfn_devfn(pdev, vf_index)), pe_num);
> +
> +		if (pnv_ioda_configure_pe(phb, pe)) {
> +			/* XXX What do we do here ? */
> +			if (pe_num)
> +				pnv_ioda_free_pe(phb, pe_num);
> +			pe->pdev = NULL;
> +			continue;
> +		}
> +
> +		pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
> +				GFP_KERNEL, hose->node);
> +		pe->tce32_table->data = pe;
> +
> +		/* Put PE to the list */
> +		mutex_lock(&phb->ioda.pe_list_mutex);
> +		list_add_tail(&pe->list, &phb->ioda.pe_list);
> +		mutex_unlock(&phb->ioda.pe_list_mutex);
> +
> +		pnv_pci_ioda2_setup_dma_pe(phb, pe);
> +	}
> +}
> +
> +int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
> +{
> +	struct pci_bus        *bus;
> +	struct pci_controller *hose;
> +	struct pnv_phb        *phb;
> +	struct pci_dn         *pdn;
> +	int                    ret;
> +
> +	bus = pdev->bus;
> +	hose = pci_bus_to_host(bus);
> +	phb = hose->private_data;
> +	pdn = pci_get_pdn(pdev);
> +
> +	if (phb->type == PNV_PHB_IODA2) {
> +		/* Calculate available PE for required VFs */
> +		mutex_lock(&phb->ioda.pe_alloc_mutex);
> +		pdn->offset = bitmap_find_next_zero_area(
> +			phb->ioda.pe_alloc, phb->ioda.total_pe,
> +			0, vf_num, 0);
> +		if (pdn->offset >= phb->ioda.total_pe) {
> +			mutex_unlock(&phb->ioda.pe_alloc_mutex);
> +			dev_info(&pdev->dev, "Failed to enable VF%d\n", vf_num);
> +			pdn->offset = 0;
> +			return -EBUSY;
> +		}
> +		bitmap_set(phb->ioda.pe_alloc, pdn->offset, vf_num);
> +		pdn->vf_pes = vf_num;
> +		mutex_unlock(&phb->ioda.pe_alloc_mutex);
> +
> +		/* Assign M64 window accordingly */
> +		ret = pnv_pci_vf_assign_m64(pdev);
> +		if (ret) {
> +			dev_info(&pdev->dev, "Not enough M64 window resources\n");
> +			goto m64_failed;
> +		}
> +
> +		/* Do some magic shift */
> +		ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
> +		if (ret)
> +			goto m64_failed;
> +	}
> +
> +	/* Setup VF PEs */
> +	pnv_ioda_setup_vf_PE(pdev, vf_num);
> +
> +	return 0;
> +
> +m64_failed:
> +	bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
> +	pdn->offset = 0;
> +
> +	return ret;
> +}
> +
>  int pcibios_sriov_disable(struct pci_dev *pdev)
>  {
> +	pnv_pci_sriov_disable(pdev);
> +
>  	/* Release firmware data */
>  	remove_dev_pci_info(pdev);
>  	return 0;
> @@ -990,6 +1461,8 @@ int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
>  {
>  	/* Allocate firmware data */
>  	add_dev_pci_info(pdev);
> +
> +	pnv_pci_sriov_enable(pdev, vf_num);
>  	return 0;
>  }
>  #endif /* CONFIG_PCI_IOV */
> @@ -1186,9 +1659,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  	int64_t rc;
>  	void *addr;
>  
> -	/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
> -#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
> -
>  	/* XXX FIXME: Handle 64-bit only DMA devices */
>  	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
>  	/* XXX FIXME: Allocate multi-level tables on PHB3 */
> @@ -1251,12 +1721,19 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  				 TCE_PCI_SWINV_PAIR);
>  	}
>  	iommu_init_table(tbl, phb->hose->node);
> -	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>  
> -	if (pe->pdev)
> +	if (pe->flags & PNV_IODA_PE_DEV) {
> +		iommu_register_group(tbl, phb->hose->global_number,
> +				     pe->pe_number);
>  		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> -	else
> +	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
> +		iommu_register_group(tbl, phb->hose->global_number,
> +				     pe->pe_number);
>  		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
> +	} else if (pe->flags & PNV_IODA_PE_VF) {
> +		iommu_register_group(tbl, phb->hose->global_number,
> +				     pe->pe_number);
> +	}
>  
>  	return;
>   fail:
> @@ -1383,12 +1860,19 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>  	}
>  	iommu_init_table(tbl, phb->hose->node);
> -	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>  
> -	if (pe->pdev)
> +	if (pe->flags & PNV_IODA_PE_DEV) {
> +		iommu_register_group(tbl, phb->hose->global_number,
> +				     pe->pe_number);
>  		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> -	else
> +	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
> +		iommu_register_group(tbl, phb->hose->global_number,
> +				     pe->pe_number);
>  		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
> +	} else if (pe->flags & PNV_IODA_PE_VF) {
> +		iommu_register_group(tbl, phb->hose->global_number,
> +				     pe->pe_number);
> +	}
>  
>  	/* Also create a bypass window */
>  	if (!pnv_iommu_bypass_disabled)
> @@ -2083,6 +2567,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	phb->hub_id = hub_id;
>  	phb->opal_id = phb_id;
>  	phb->type = ioda_type;
> +	mutex_init(&phb->ioda.pe_alloc_mutex);
>  
>  	/* Detect specific models for error handling */
>  	if (of_device_is_compatible(np, "ibm,p7ioc-pciex"))
> @@ -2142,6 +2627,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  
>  	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
>  	INIT_LIST_HEAD(&phb->ioda.pe_list);
> +	mutex_init(&phb->ioda.pe_list_mutex);
>  
>  	/* Calculate how many 32-bit TCE segments we have */
>  	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 6c20d6e70383..a88f915fc603 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -714,6 +714,24 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
>  {
>  	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>  	struct pnv_phb *phb = hose->private_data;
> +#ifdef CONFIG_PCI_IOV
> +	struct pnv_ioda_pe *pe;
> +	struct pci_dn *pdn;
> +
> +	/* Fix the VF pdn PE number */
> +	if (pdev->is_virtfn) {
> +		pdn = pci_get_pdn(pdev);
> +		WARN_ON(pdn->pe_number != IODA_INVALID_PE);
> +		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
> +			if (pe->rid == ((pdev->bus->number << 8) |
> +			    (pdev->devfn & 0xff))) {
> +				pdn->pe_number = pe->pe_number;
> +				pe->pdev = pdev;
> +				break;
> +			}
> +		}
> +	}
> +#endif /* CONFIG_PCI_IOV */
>  
>  	/* If we have no phb structure, try to setup a fallback based on
>  	 * the device-tree (RTAS PCI for example)
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 731777734bca..39d42f2b7a15 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -23,6 +23,7 @@ enum pnv_phb_model {
>  #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
>  #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
>  #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
> +#define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 		*/
>  
>  /* Data associated with a PE, including IOMMU tracking etc.. */
>  struct pnv_phb;
> @@ -34,6 +35,9 @@ struct pnv_ioda_pe {
>  	 * entire bus (& children). In the former case, pdev
>  	 * is populated, in the later case, pbus is.
>  	 */
> +#ifdef CONFIG_PCI_IOV
> +	struct pci_dev          *parent_dev;
> +#endif
>  	struct pci_dev		*pdev;
>  	struct pci_bus		*pbus;
>  
> @@ -165,6 +169,8 @@ struct pnv_phb {
>  
>  			/* PE allocation bitmap */
>  			unsigned long		*pe_alloc;
> +			/* PE allocation mutex */
> +			struct mutex		pe_alloc_mutex;
>  
>  			/* M32 & IO segment maps */
>  			unsigned int		*m32_segmap;
> @@ -179,6 +185,7 @@ struct pnv_phb {
>  			 * on the sequence of creation
>  			 */
>  			struct list_head	pe_list;
> +			struct mutex            pe_list_mutex;
>  
>  			/* Reverse map of PEs, will have to extend if
>  			 * we are to support more than 256 PEs, indexed
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
  2015-02-24  8:35 ` [PATCH v12 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported Bjorn Helgaas
@ 2015-02-24  9:06   ` Bjorn Helgaas
  2015-03-02  7:55       ` Wei Yang
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  9:06 UTC (permalink / raw)
  To: Wei Yang, benh, gwshan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:35:04AM -0600, Bjorn Helgaas wrote:
> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> M64 aperture size is limited on PHB3.  When the IOV BAR is too big, this
> will exceed the limitation and fail to be assigned.
> 
> Introduce a different mechanism based on the IOV BAR size:
> 
>   - if IOV BAR size is smaller than 64MB, expand to total_pe
>   - if IOV BAR size is bigger than 64MB, roundup power2
> 
> [bhelgaas: make dev_printk() output more consistent, use PCI_SRIOV_NUM_BARS]
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  arch/powerpc/include/asm/pci-bridge.h     |    2 ++
>  arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++++++++++++++++++++++++++---
>  2 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index 011340df8583..d824bb184ab8 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -179,6 +179,8 @@ struct pci_dn {
>  	u16     max_vfs;		/* number of VFs IOV BAR expended */
>  	u16     vf_pes;			/* VF PE# under this PF */
>  	int     offset;			/* PE# for the first VF PE */
> +#define M64_PER_IOV 4
> +	int     m64_per_iov;
>  #define IODA_INVALID_M64        (-1)
>  	int     m64_wins[PCI_SRIOV_NUM_BARS];
>  #endif /* CONFIG_PCI_IOV */
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index a3c2fbe35fc8..30b7c3909746 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2242,6 +2242,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  	int i;
>  	resource_size_t size;
>  	struct pci_dn *pdn;
> +	int mul, total_vfs;
>  
>  	if (!pdev->is_physfn || pdev->is_added)
>  		return;
> @@ -2252,6 +2253,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  	pdn = pci_get_pdn(pdev);
>  	pdn->max_vfs = 0;
>  
> +	total_vfs = pci_sriov_get_totalvfs(pdev);
> +	pdn->m64_per_iov = 1;
> +	mul = phb->ioda.total_pe;
> +
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> +		if (!res->flags || res->parent)
> +			continue;
> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
> +			dev_warn(&pdev->dev, " non M64 VF BAR%d: %pR\n",
> +				 i, res);

Why is this a dev_warn()?  Can the user do anything about it?  Do you want
bug reports if users see this message?  There are several other instances
of this in the other patches, too.

> +			continue;
> +		}
> +
> +		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
> +
> +		/* bigger than 64M */
> +		if (size > (1 << 26)) {
> +			dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size is bigger than 64M, roundup power2\n",
> +				 i, res);
> +			pdn->m64_per_iov = M64_PER_IOV;
> +			mul = __roundup_pow_of_two(total_vfs);

Why is this __roundup_pow_of_two() instead of roundup_pow_of_two()?
I *think* __roundup_pow_of_two() is basically a helper function for
implementing roundup_pow_of_two() and not intended to be used by itself.

I think there are other patches that use __roundup_pow_of_two(), too.
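
For illustration, the distinction can be modeled in a standalone userspace sketch (the names mirror the kernel helpers in include/linux/log2.h, but this is a hedged model, not the kernel implementation; `model_fls` and `model_roundup_pow_of_two` are hypothetical names):

```c
#include <assert.h>

/*
 * Standalone model: the double-underscore kernel helper is the raw
 * computation (1UL << fls(n - 1)); the public roundup_pow_of_two()
 * wraps it and can constant-fold for compile-time constants, which is
 * why callers are expected to use the public name.
 */
static unsigned long model_fls(unsigned long n)
{
	unsigned long r = 0;

	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

static unsigned long model_roundup_pow_of_two(unsigned long n)
{
	return 1UL << model_fls(n - 1);
}
```

So a total_vfs of 5 would round up to 8 segments, and an exact power of two (64) stays unchanged.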

> +			break;
> +		}
> +	}
> +
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = &pdev->resource[i + PCI_IOV_RESOURCES];
>  		if (!res->flags || res->parent)
> @@ -2264,12 +2291,12 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  
>  		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
>  		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
> -		res->end = res->start + size * phb->ioda.total_pe - 1;
> +		res->end = res->start + size * mul - 1;
>  		dev_dbg(&pdev->dev, "                       %pR\n", res);
>  		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
> -				i, res, phb->ioda.total_pe);
> +			 i, res, mul);
>  	}
> -	pdn->max_vfs = phb->ioda.total_pe;
> +	pdn->max_vfs = mul;
>  }
>  
>  static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-02-24  9:00   ` Bjorn Helgaas
@ 2015-02-24 17:10     ` Bjorn Helgaas
  2015-03-02  7:58         ` Wei Yang
  2015-03-04  3:01       ` Wei Yang
  1 sibling, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2015-02-24 17:10 UTC (permalink / raw)
  To: Wei Yang, Benjamin Herrenschmidt, Gavin Shan; +Cc: linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 3:00 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>>
>> On PowerNV platform, resource position in M64 implies the PE# the resource
>> belongs to.  In some cases, adjustment of a resource is necessary to locate
>> it to a correct position in M64.
>>
>> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
>> according to an offset.
>>
>> [bhelgaas: rework loops, rework overlap check, index resource[]
>> conventionally, remove pci_regs.h include, squashed with next patch]
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
> ...
>
>> +#ifdef CONFIG_PCI_IOV
>> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> +{
>> +     struct pci_dn *pdn = pci_get_pdn(dev);
>> +     int i;
>> +     struct resource *res, res2;
>> +     resource_size_t size;
>> +     u16 vf_num;
>> +
>> +     if (!dev->is_physfn)
>> +             return -EINVAL;
>> +
>> +     /*
>> +      * "offset" is in VFs.  The M64 windows are sized so that when they
>> +      * are segmented, each segment is the same size as the IOV BAR.
>> +      * Each segment is in a separate PE, and the high order bits of the
>> +      * address are the PE number.  Therefore, each VF's BAR is in a
>> +      * separate PE, and changing the IOV BAR start address changes the
>> +      * range of PEs the VFs are in.
>> +      */
>> +     vf_num = pdn->vf_pes;
>> +     for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> +             res = &dev->resource[i + PCI_IOV_RESOURCES];
>> +             if (!res->flags || !res->parent)
>> +                     continue;
>> +
>> +             if (!pnv_pci_is_mem_pref_64(res->flags))
>> +                     continue;
>> +
>> +             /*
>> +              * The actual IOV BAR range is determined by the start address
>> +              * and the actual size for vf_num VFs BAR.  This check is to
>> +              * make sure that after shifting, the range will not overlap
>> +              * with another device.
>> +              */
>> +             size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> +             res2.flags = res->flags;
>> +             res2.start = res->start + (size * offset);
>> +             res2.end = res2.start + (size * vf_num) - 1;
>> +
>> +             if (res2.end > res->end) {
>> +                     dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>> +                             i, &res2, res, vf_num, offset);
>> +                     return -EBUSY;
>> +             }
>> +     }
>> +
>> +     for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> +             res = &dev->resource[i + PCI_IOV_RESOURCES];
>> +             if (!res->flags || !res->parent)
>> +                     continue;
>> +
>> +             if (!pnv_pci_is_mem_pref_64(res->flags))
>> +                     continue;
>> +
>> +             size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> +             res2 = *res;
>> +             res->start += size * offset;
>
> I'm still not happy about this fiddling with res->start.
>
> Increasing res->start means that in principle, the "size * offset" bytes
> that we just removed from res are now available for allocation to somebody
> else.  I don't think we *will* give that space to anything else because of
> the alignment restrictions you're enforcing, but "res" now doesn't
> correctly describe the real resource map.
>
> Would you be able to just update the BAR here while leaving the struct
> resource alone?  In that case, it would look a little funny that lspci
> would show a BAR value in the middle of the region in /proc/iomem, but
> the /proc/iomem region would be more correct.

I guess this would also require a tweak where we compute the addresses
of each of the VF resources.  Today it's probably just "base + VF_num
* size", where "base" is res->start.  We'd have to account for the
offset there if we don't adjust it here.
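
To make that arithmetic concrete, here is a hedged sketch of the two addressing schemes being compared (function names are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only.  With the patch as posted, res->start is shifted
 * by "offset" VF-sized segments, so VF i lives at
 * shifted_start + i * size.  If res->start were left untouched and
 * only the BAR register updated, the VF address computation would
 * have to fold the offset in explicitly instead.
 */
static uint64_t vf_addr_shifted(uint64_t shifted_start, uint64_t size,
				unsigned int vf)
{
	return shifted_start + (uint64_t)vf * size;
}

static uint64_t vf_addr_unshifted(uint64_t orig_start, uint64_t size,
				  unsigned int offset, unsigned int vf)
{
	return orig_start + ((uint64_t)offset + vf) * size;
}
```

Both schemes produce the same VF address when shifted_start == orig_start + offset * size; the difference is only in which struct carries the offset.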

>> +
>> +             dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
>> +                      i, &res2, res, vf_num, offset);
>> +             pci_update_resource(dev, i + PCI_IOV_RESOURCES);
>> +     }
>> +     pdn->max_vfs -= offset;
>> +     return 0;
>> +}
>> +#endif /* CONFIG_PCI_IOV */

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 08/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()
  2015-02-24  8:39   ` Bjorn Helgaas
@ 2015-03-02  6:53       ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  6:53 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:39:21AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:33:52AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> VFs are dynamically created when a driver enables them.  On some platforms,
>> like PowerNV, special resources are necessary to enable VFs.
>> 
>> Add platform hooks for enabling and disabling VFs.
>> 
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>>  drivers/pci/iov.c |   19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>> 
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 5643a1011e23..cc6fedf4a1b9 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -220,6 +220,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>>  	pci_dev_put(dev);
>>  }
>>  
>> +int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
>
>I think this "vf_num" parameter should be renamed to something like
>"num_vfs" instead.  It's subtle, but "vf_num" suggests that we're talking
>about one of several VFs, e.g., VF1 or VF 2.  But here we really mean the
>total number of VFs that we're enabling.
>
>There's similar code in the powerpc implementation that should be
>renamed the same way.
>

Agreed, I will take this suggestion into consideration.

>> +{
>> +       return 0;
>> +}
>> +
>>  static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>  {
>>  	int rc;
>> @@ -231,6 +236,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>  	struct pci_sriov *iov = dev->sriov;
>>  	int bars = 0;
>>  	int bus;
>> +	int retval;
>>  
>>  	if (!nr_virtfn)
>>  		return 0;
>> @@ -307,6 +313,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>  	if (nr_virtfn < initial)
>>  		initial = nr_virtfn;
>>  
>> +	if ((retval = pcibios_sriov_enable(dev, initial))) {
>> +		dev_err(&dev->dev, "failure %d from pcibios_sriov_enable()\n",
>> +			retval);
>> +		return retval;
>> +	}
>> +
>>  	for (i = 0; i < initial; i++) {
>>  		rc = virtfn_add(dev, i, 0);
>>  		if (rc)
>> @@ -335,6 +347,11 @@ failed:
>>  	return rc;
>>  }
>>  
>> +int __weak pcibios_sriov_disable(struct pci_dev *pdev)
>> +{
>> +       return 0;
>> +}
>> +
>>  static void sriov_disable(struct pci_dev *dev)
>>  {
>>  	int i;
>> @@ -346,6 +363,8 @@ static void sriov_disable(struct pci_dev *dev)
>>  	for (i = 0; i < iov->num_VFs; i++)
>>  		virtfn_remove(dev, i, 0);
>>  
>> +	pcibios_sriov_disable(dev);
>> +
>>  	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
>>  	pci_cfg_access_lock(dev);
>>  	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
>> 
>--
>To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
  2015-02-24  8:41   ` Bjorn Helgaas
@ 2015-03-02  7:32       ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  7:32 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:41:52AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:34:06AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> When sizing and assigning resources, we divide the resources into two
>> lists: the requested list and the additional list.  We don't consider the
>> alignment of additional VF(n) BAR space.
>> 
>> This is reasonable because the alignment required for the VF(n) BAR space
>> is the size of an individual VF BAR, not the size of the space for *all*
>> VFs.  But some platforms, e.g., PowerNV, require additional alignment.
>> 
>> Consider the additional IOV BAR alignment when sizing and assigning
>> resources.  When there is not enough system MMIO space, the PF's IOV BAR
>> alignment will not contribute to the bridge.  When there is enough system
>> MMIO space, the additional alignment will contribute to the bridge.
>
>I don't understand the ""when there is not enough system MMIO space" part.
>How do we tell if there's enough MMIO space?
>

__assign_resources_sorted() takes two resource lists: one for the requested
resources (head) and one for the additional resources (realloc_head). The
function first tries to combine them and assign the result. If that fails,
it means we don't have enough MMIO space.
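
As a rough sketch of that control flow (an illustrative model of the shape of __assign_resources_sorted(), not its actual code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model: try the required and optional (additional) sizes
 * together first; if the combined request does not fit the window,
 * fall back to the required size alone.  Failing the combined attempt
 * is what "not enough MMIO space" means in this discussion.
 */
static bool assign(unsigned long window, unsigned long required,
		   unsigned long optional, unsigned long *assigned)
{
	if (required + optional <= window) {
		*assigned = required + optional;	/* enough space */
		return true;
	}
	if (required <= window) {
		*assigned = required;	/* not enough for the extras */
		return true;
	}
	return false;	/* even the required size does not fit */
}
```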

>> Also, take advantage of pci_dev_resource::min_align to store this
>> additional alignment.
>
>This comment doesn't seem to make sense; this patch doesn't save anything
>in min_align.
>

At the end of this patch:

   add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);

The add_align value is stored in pci_dev_resource::min_align by add_to_list()
and retrieved by get_res_add_align() in the code below. This field was not
used before, so I took advantage of it to store the alignment of the
additional resources.

>Another question below...
>
>> [bhelgaas: changelog, printk cast]
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>>  drivers/pci/setup-bus.c |   83 ++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 70 insertions(+), 13 deletions(-)
>> 
>> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
>> index e3e17f3c0f0f..affbceae560f 100644
>> --- a/drivers/pci/setup-bus.c
>> +++ b/drivers/pci/setup-bus.c
>> @@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
>>  	}
>>  }
>>  
>> -static resource_size_t get_res_add_size(struct list_head *head,
>> -					struct resource *res)
>> +static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
>> +					       struct resource *res)
>>  {
>>  	struct pci_dev_resource *dev_res;
>>  
>> @@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
>>  			int idx = res - &dev_res->dev->resource[0];
>>  
>>  			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
>> -				 "res[%d]=%pR get_res_add_size add_size %llx\n",
>> +				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
>>  				 idx, dev_res->res,
>> -				 (unsigned long long)dev_res->add_size);
>> +				 (unsigned long long)dev_res->add_size,
>> +				 (unsigned long long)dev_res->min_align);
>>  
>> -			return dev_res->add_size;
>> +			return dev_res;
>>  		}
>>  	}
>>  
>> -	return 0;
>> +	return NULL;
>> +}
>> +
>> +static resource_size_t get_res_add_size(struct list_head *head,
>> +					struct resource *res)
>> +{
>> +	struct pci_dev_resource *dev_res;
>> +
>> +	dev_res = res_to_dev_res(head, res);
>> +	return dev_res ? dev_res->add_size : 0;
>> +}
>> +
>> +static resource_size_t get_res_add_align(struct list_head *head,
>> +					 struct resource *res)
>> +{
>> +	struct pci_dev_resource *dev_res;
>> +
>> +	dev_res = res_to_dev_res(head, res);
>> +	return dev_res ? dev_res->min_align : 0;
>>  }
>>  
>> +
>>  /* Sort resources by alignment */
>>  static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
>>  {
>> @@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
>>  	LIST_HEAD(save_head);
>>  	LIST_HEAD(local_fail_head);
>>  	struct pci_dev_resource *save_res;
>> -	struct pci_dev_resource *dev_res, *tmp_res;
>> +	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
>>  	unsigned long fail_type;
>> +	resource_size_t add_align, align;
>>  
>>  	/* Check if optional add_size is there */
>>  	if (!realloc_head || list_empty(realloc_head))
>> @@ -384,10 +405,38 @@ static void __assign_resources_sorted(struct list_head *head,
>>  	}
>>  
>>  	/* Update res in head list with add_size in realloc_head list */
>> -	list_for_each_entry(dev_res, head, list)
>> +	list_for_each_entry_safe(dev_res, tmp_res, head, list) {
>>  		dev_res->res->end += get_res_add_size(realloc_head,
>>  							dev_res->res);
>>  
>> +		/*
>> +		 * There are two kinds of additional resources in the list:
>> +		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
>> +		 * 2. SR-IOV resource   -- IORESOURCE_SIZEALIGN
>> +		 * Here just fix the additional alignment for bridge
>> +		 */
>> +		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
>> +			continue;
>> +
>> +		add_align = get_res_add_align(realloc_head, dev_res->res);
>> +
>> +		/* Reorder the list by their alignment */
>
>Why do we need to reorder the list by alignment?

Resource list "head" is sorted by the alignment, while the alignment would be
changed after we considering the additional resource.

Take powernv platform as an example. The IOV BAR is expanded and need to be
aligned with its total size instead of the individual VF BAR size. If we don't
reorder it, the IOV BAR would be assigned after some other resources, which
may cause the real assignment fail even the total size is enough.
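
A minimal sketch of the reordering idea (array-based and purely illustrative; the patch itself walks a linked list with list_move_tail()):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: after an entry's effective alignment grows, move
 * it earlier so the array stays sorted in decreasing alignment order,
 * which is the invariant the assignment pass relies on.
 */
static void resort_entry(unsigned long *align, size_t n, size_t idx)
{
	(void)n;	/* idx is assumed in range */

	while (idx > 0 && align[idx] > align[idx - 1]) {
		unsigned long tmp = align[idx];

		align[idx] = align[idx - 1];
		align[idx - 1] = tmp;
		idx--;
	}
}
```

For instance, if an IOV BAR's alignment grows from 8 to 128 after expansion, it bubbles from the tail to the head of the list so it is assigned first.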

>
>> +		if (add_align > dev_res->res->start) {
>> +			dev_res->res->start = add_align;
>> +			dev_res->res->end = add_align +
>> +				            resource_size(dev_res->res);
>> +
>> +			list_for_each_entry(dev_res2, head, list) {
>> +				align = pci_resource_alignment(dev_res2->dev,
>> +							       dev_res2->res);
>> +				if (add_align > align)
>> +					list_move_tail(&dev_res->list,
>> +						       &dev_res2->list);
>> +			}
>> +               }
>> +
>> +	}
>> +
>>  	/* Try updated head list with add_size added */
>>  	assign_requested_resources_sorted(head, &local_fail_head);
>>  
>> @@ -962,6 +1011,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  	struct resource *b_res = find_free_bus_resource(bus,
>>  					mask | IORESOURCE_PREFETCH, type);
>>  	resource_size_t children_add_size = 0;
>> +	resource_size_t children_add_align = 0;
>> +	resource_size_t add_align = 0;
>>  
>>  	if (!b_res)
>>  		return -ENOSPC;
>> @@ -986,6 +1037,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  			/* put SRIOV requested res to the optional list */
>>  			if (realloc_head && i >= PCI_IOV_RESOURCES &&
>>  					i <= PCI_IOV_RESOURCE_END) {
>> +				add_align = max(pci_resource_alignment(dev, r), add_align);
>>  				r->end = r->start - 1;
>>  				add_to_list(realloc_head, dev, r, r_size, 0/* don't care */);
>>  				children_add_size += r_size;
>> @@ -1016,19 +1068,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  			if (order > max_order)
>>  				max_order = order;
>>  
>> -			if (realloc_head)
>> +			if (realloc_head) {
>>  				children_add_size += get_res_add_size(realloc_head, r);
>> +				children_add_align = get_res_add_align(realloc_head, r);
>> +				add_align = max(add_align, children_add_align);
>> +			}
>>  		}
>>  	}
>>  
>>  	min_align = calculate_mem_align(aligns, max_order);
>>  	min_align = max(min_align, window_alignment(bus, b_res->flags));
>>  	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
>> +	add_align = max(min_align, add_align);
>>  	if (children_add_size > add_size)
>>  		add_size = children_add_size;
>>  	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
>>  		calculate_memsize(size, min_size, add_size,
>> -				resource_size(b_res), min_align);
>> +				resource_size(b_res), add_align);
>>  	if (!size0 && !size1) {
>>  		if (b_res->start || b_res->end)
>>  			dev_info(&bus->self->dev, "disabling bridge window %pR to %pR (unused)\n",
>> @@ -1040,10 +1096,11 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  	b_res->end = size0 + min_align - 1;
>>  	b_res->flags |= IORESOURCE_STARTALIGN;
>>  	if (size1 > size0 && realloc_head) {
>> -		add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align);
>> -		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx\n",
>> +		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
>> +		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx add_align %llx\n",
>>  			   b_res, &bus->busn_res,
>> -			   (unsigned long long)size1-size0);
>> +			   (unsigned long long) (size1 - size0),
>> +			   (unsigned long long) add_align);
>>  	}
>>  	return 0;
>>  }
>> 

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
@ 2015-03-02  7:32       ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  7:32 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Tue, Feb 24, 2015 at 02:41:52AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:34:06AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> When sizing and assigning resources, we divide the resources into two
>> lists: the requested list and the additional list.  We don't consider the
>> alignment of additional VF(n) BAR space.
>> 
>> This is reasonable because the alignment required for the VF(n) BAR space
>> is the size of an individual VF BAR, not the size of the space for *all*
>> VFs.  But some platforms, e.g., PowerNV, require additional alignment.
>> 
>> Consider the additional IOV BAR alignment when sizing and assigning
>> resources.  When there is not enough system MMIO space, the PF's IOV BAR
>> alignment will not contribute to the bridge.  When there is enough system
>> MMIO space, the additional alignment will contribute to the bridge.
>
>I don't understand the ""when there is not enough system MMIO space" part.
>How do we tell if there's enough MMIO space?
>

In __assign_resources_sorted(), it has two resources list, one for requested
(head) and one for additional (realloc_head). This function will first try to
combine them and assign. If failed, this means we don't have enough MMIO
space.

>> Also, take advantage of pci_dev_resource::min_align to store this
>> additional alignment.
>
>This comment doesn't seem to make sense; this patch doesn't save anything
>in min_align.
>

At the end of this patch:

   add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);

The add_align is stored in pci_dev_resource::min_align in add_to_list(). And
retrieved by get_res_add_align() in below code. This field is not used
previously, so I took advantage of this field to store the alignment of the
additional resources.

>Another question below...
>
>> [bhelgaas: changelog, printk cast]
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>>  drivers/pci/setup-bus.c |   83 ++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 70 insertions(+), 13 deletions(-)
>> 
>> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
>> index e3e17f3c0f0f..affbceae560f 100644
>> --- a/drivers/pci/setup-bus.c
>> +++ b/drivers/pci/setup-bus.c
>> @@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
>>  	}
>>  }
>>  
>> -static resource_size_t get_res_add_size(struct list_head *head,
>> -					struct resource *res)
>> +static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
>> +					       struct resource *res)
>>  {
>>  	struct pci_dev_resource *dev_res;
>>  
>> @@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
>>  			int idx = res - &dev_res->dev->resource[0];
>>  
>>  			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
>> -				 "res[%d]=%pR get_res_add_size add_size %llx\n",
>> +				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
>>  				 idx, dev_res->res,
>> -				 (unsigned long long)dev_res->add_size);
>> +				 (unsigned long long)dev_res->add_size,
>> +				 (unsigned long long)dev_res->min_align);
>>  
>> -			return dev_res->add_size;
>> +			return dev_res;
>>  		}
>>  	}
>>  
>> -	return 0;
>> +	return NULL;
>> +}
>> +
>> +static resource_size_t get_res_add_size(struct list_head *head,
>> +					struct resource *res)
>> +{
>> +	struct pci_dev_resource *dev_res;
>> +
>> +	dev_res = res_to_dev_res(head, res);
>> +	return dev_res ? dev_res->add_size : 0;
>> +}
>> +
>> +static resource_size_t get_res_add_align(struct list_head *head,
>> +					 struct resource *res)
>> +{
>> +	struct pci_dev_resource *dev_res;
>> +
>> +	dev_res = res_to_dev_res(head, res);
>> +	return dev_res ? dev_res->min_align : 0;
>>  }
>>  
>> +
>>  /* Sort resources by alignment */
>>  static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
>>  {
>> @@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
>>  	LIST_HEAD(save_head);
>>  	LIST_HEAD(local_fail_head);
>>  	struct pci_dev_resource *save_res;
>> -	struct pci_dev_resource *dev_res, *tmp_res;
>> +	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
>>  	unsigned long fail_type;
>> +	resource_size_t add_align, align;
>>  
>>  	/* Check if optional add_size is there */
>>  	if (!realloc_head || list_empty(realloc_head))
>> @@ -384,10 +405,38 @@ static void __assign_resources_sorted(struct list_head *head,
>>  	}
>>  
>>  	/* Update res in head list with add_size in realloc_head list */
>> -	list_for_each_entry(dev_res, head, list)
>> +	list_for_each_entry_safe(dev_res, tmp_res, head, list) {
>>  		dev_res->res->end += get_res_add_size(realloc_head,
>>  							dev_res->res);
>>  
>> +		/*
>> +		 * There are two kinds of additional resources in the list:
>> +		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
>> +		 * 2. SR-IOV resource   -- IORESOURCE_SIZEALIGN
>> +		 * Here just fix the additional alignment for bridge
>> +		 */
>> +		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
>> +			continue;
>> +
>> +		add_align = get_res_add_align(realloc_head, dev_res->res);
>> +
>> +		/* Reorder the list by their alignment */
>
>Why do we need to reorder the list by alignment?

Resource list "head" is sorted by alignment, but the alignment can change
once the additional (optional) resource is taken into account.

Take the powernv platform as an example. The IOV BAR is expanded and needs to
be aligned to its total size instead of the individual VF BAR size. If we
don't reorder the list, the IOV BAR would be assigned after some other
resources, which may cause the real assignment to fail even though the total
size is sufficient.

>
>> +		if (add_align > dev_res->res->start) {
>> +			dev_res->res->start = add_align;
>> +			dev_res->res->end = add_align +
>> +				            resource_size(dev_res->res);
>> +
>> +			list_for_each_entry(dev_res2, head, list) {
>> +				align = pci_resource_alignment(dev_res2->dev,
>> +							       dev_res2->res);
>> +				if (add_align > align)
>> +					list_move_tail(&dev_res->list,
>> +						       &dev_res2->list);
>> +			}
>> +               }
>> +
>> +	}
>> +
>>  	/* Try updated head list with add_size added */
>>  	assign_requested_resources_sorted(head, &local_fail_head);
>>  
>> @@ -962,6 +1011,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  	struct resource *b_res = find_free_bus_resource(bus,
>>  					mask | IORESOURCE_PREFETCH, type);
>>  	resource_size_t children_add_size = 0;
>> +	resource_size_t children_add_align = 0;
>> +	resource_size_t add_align = 0;
>>  
>>  	if (!b_res)
>>  		return -ENOSPC;
>> @@ -986,6 +1037,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  			/* put SRIOV requested res to the optional list */
>>  			if (realloc_head && i >= PCI_IOV_RESOURCES &&
>>  					i <= PCI_IOV_RESOURCE_END) {
>> +				add_align = max(pci_resource_alignment(dev, r), add_align);
>>  				r->end = r->start - 1;
>>  				add_to_list(realloc_head, dev, r, r_size, 0/* don't care */);
>>  				children_add_size += r_size;
>> @@ -1016,19 +1068,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  			if (order > max_order)
>>  				max_order = order;
>>  
>> -			if (realloc_head)
>> +			if (realloc_head) {
>>  				children_add_size += get_res_add_size(realloc_head, r);
>> +				children_add_align = get_res_add_align(realloc_head, r);
>> +				add_align = max(add_align, children_add_align);
>> +			}
>>  		}
>>  	}
>>  
>>  	min_align = calculate_mem_align(aligns, max_order);
>>  	min_align = max(min_align, window_alignment(bus, b_res->flags));
>>  	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
>> +	add_align = max(min_align, add_align);
>>  	if (children_add_size > add_size)
>>  		add_size = children_add_size;
>>  	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
>>  		calculate_memsize(size, min_size, add_size,
>> -				resource_size(b_res), min_align);
>> +				resource_size(b_res), add_align);
>>  	if (!size0 && !size1) {
>>  		if (b_res->start || b_res->end)
>>  			dev_info(&bus->self->dev, "disabling bridge window %pR to %pR (unused)\n",
>> @@ -1040,10 +1096,11 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>>  	b_res->end = size0 + min_align - 1;
>>  	b_res->flags |= IORESOURCE_STARTALIGN;
>>  	if (size1 > size0 && realloc_head) {
>> -		add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align);
>> -		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx\n",
>> +		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
>> +		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx add_align %llx\n",
>>  			   b_res, &bus->busn_res,
>> -			   (unsigned long long)size1-size0);
>> +			   (unsigned long long) (size1 - size0),
>> +			   (unsigned long long) add_align);
>>  	}
>>  	return 0;
>>  }
>> 
>--
>To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 11/21] powerpc/pci: Don't unset PCI resources for VFs
  2015-02-24  8:44   ` Bjorn Helgaas
@ 2015-03-02  7:34       ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  7:34 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:44:50AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:34:13AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> If we're going to reassign resources with flag PCI_REASSIGN_ALL_RSRC, all
>> resources will be cleaned out during device header fixup time and then get
>> reassigned by PCI core.  However, the VF resources won't be reassigned and
>> thus, we shouldn't clean them out.
>> 
>> If the pci_dev is a VF, skip the resource unset process.
>
>I think this patch is correct, but we should include a little more detail
>in the changelog to answer questions like mine and Ben's
>(http://lkml.kernel.org/r/1423528584.4924.70.camel@au1.ibm.com).
>

Ok, I will expand the changelog to explain this.

>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>>  arch/powerpc/kernel/pci-common.c |    4 ++++
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>> index 2a525c938158..82031011522f 100644
>> --- a/arch/powerpc/kernel/pci-common.c
>> +++ b/arch/powerpc/kernel/pci-common.c
>> @@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
>>  		       pci_name(dev));
>>  		return;
>>  	}
>> +
>> +	if (dev->is_virtfn)
>> +		return;
>> +
>>  	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
>>  		struct resource *res = dev->resource + i;
>>  		struct pci_bus_region reg;
>> 

-- 
Richard Yang
Help you, Help me




* Re: [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-02-24  8:52   ` Bjorn Helgaas
@ 2015-03-02  7:41       ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  7:41 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:52:34AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:34:42AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> On PHB3, PF IOV BAR will be covered by M64 window to have better PE
>> isolation.  The total_pe number is usually different from total_VFs, which
>> can lead to a conflict between MMIO space and the PE number.
>> 
>> For example, if total_VFs is 128 and total_pe is 256, the second half of
>> M64 window will be part of other PCI device, which may already belong
>> to other PEs.
>
>I'm still trying to wrap my mind around the explanation here.
>
>I *think* what's going on is that the M64 window must be a power-of-two
>size.  If the VF(n) BAR space doesn't completely fill it, we might allocate
>the leftover space to another device.  Then the M64 window for *this*
>device may cause the other device to be associated with a PE it didn't
>expect.

Yes, this is the exact reason.

>
>But I don't understand this well enough to describe it clearly.
>
>More serious code question below...
>
>> Prevent the conflict by reserving additional space for the PF IOV BAR,
>> which is total_pe number of VF's BAR size.
>> 
>> [bhelgaas: make dev_printk() output more consistent, index resource[]
>> conventionally]
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>>  arch/powerpc/include/asm/machdep.h        |    4 ++
>>  arch/powerpc/include/asm/pci-bridge.h     |    3 ++
>>  arch/powerpc/kernel/pci-common.c          |    5 +++
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   58 +++++++++++++++++++++++++++++
>>  4 files changed, 70 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
>> index c8175a3fe560..965547c58497 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -250,6 +250,10 @@ struct machdep_calls {
>>  	/* Reset the secondary bus of bridge */
>>  	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  	/* Called to shutdown machine specific hardware not already controlled
>>  	 * by other drivers.
>>  	 */
>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>> index 513f8f27060d..de11de7d4547 100644
>> --- a/arch/powerpc/include/asm/pci-bridge.h
>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>> @@ -175,6 +175,9 @@ struct pci_dn {
>>  #define IODA_INVALID_PE		(-1)
>>  #ifdef CONFIG_PPC_POWERNV
>>  	int	pe_number;
>> +#ifdef CONFIG_PCI_IOV
>> +	u16     max_vfs;		/* number of VFs IOV BAR expended */
>> +#endif /* CONFIG_PCI_IOV */
>>  #endif
>>  	struct list_head child_list;
>>  	struct list_head list;
>> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>> index 82031011522f..022e9feeb1f2 100644
>> --- a/arch/powerpc/kernel/pci-common.c
>> +++ b/arch/powerpc/kernel/pci-common.c
>> @@ -1646,6 +1646,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
>>  	if (ppc_md.pcibios_fixup_phb)
>>  		ppc_md.pcibios_fixup_phb(hose);
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +	if (ppc_md.pcibios_fixup_sriov)
>> +		ppc_md.pcibios_fixup_sriov(bus);
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  	/* Configure PCI Express settings */
>>  	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
>>  		struct pci_bus *child;
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index cd1a56160ded..36c533da5ccb 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1749,6 +1749,61 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
>>  static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
>>  #endif /* CONFIG_PCI_MSI */
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>> +{
>> +	struct pci_controller *hose;
>> +	struct pnv_phb *phb;
>> +	struct resource *res;
>> +	int i;
>> +	resource_size_t size;
>> +	struct pci_dn *pdn;
>> +
>> +	if (!pdev->is_physfn || pdev->is_added)
>> +		return;
>> +
>> +	hose = pci_bus_to_host(pdev->bus);
>> +	phb = hose->private_data;
>> +
>> +	pdn = pci_get_pdn(pdev);
>> +	pdn->max_vfs = 0;
>> +
>> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> +		res = &pdev->resource[i + PCI_IOV_RESOURCES];
>> +		if (!res->flags || res->parent)
>> +			continue;
>> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
>> +			dev_warn(&pdev->dev, "Skipping expanding VF BAR%d: %pR\n",
>> +				 i, res);
>> +			continue;
>> +		}
>> +
>> +		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
>> +		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>> +		res->end = res->start + size * phb->ioda.total_pe - 1;
>> +		dev_dbg(&pdev->dev, "                       %pR\n", res);
>> +		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>> +				i, res, phb->ioda.total_pe);
>> +	}
>> +	pdn->max_vfs = phb->ioda.total_pe;
>> +}
>> +
>> +static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *pdev;
>> +	struct pci_bus *b;
>> +
>> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
>> +		b = pdev->subordinate;
>> +
>> +		if (b)
>> +			pnv_pci_ioda_fixup_sriov(b);
>> +
>> +		pnv_pci_ioda_fixup_iov_resources(pdev);
>
>I'm not sure this happens at the right time.  We have this call chain:
>
>  pcibios_scan_phb
>    pci_create_root_bus
>    pci_scan_child_bus
>    pnv_pci_ioda_fixup_sriov
>      pnv_pci_ioda_fixup_iov_resources
>	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
>	  increase res->size to accomodate 256 PEs (or roundup(totalVFs)
>
>so we only do the fixup_iov_resources() when we scan the PHB, and we
>wouldn't do it at all for hot-added devices.

Yep, you are right :-)

I had a separate patch to do this in pcibios_add_pci_devices(). It looks
like we could merge them.

>
>> +	}
>> +}
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  /*
>>   * This function is supposed to be called on basis of PE from top
>>   * to bottom style. So the the I/O or MMIO segment assigned to
>> @@ -2125,6 +2180,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>  	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
>>  	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
>>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
>> +#ifdef CONFIG_PCI_IOV
>> +	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
>> +#endif /* CONFIG_PCI_IOV */
>>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>>  
>>  	/* Reset IODA tables to a clean state */
>> 

-- 
Richard Yang
Help you, Help me




* Re: [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
  2015-02-24  8:46   ` Bjorn Helgaas
@ 2015-03-02  7:50       ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  7:50 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 02:46:53AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:34:35AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> Current iommu_table of a PE is a static field.  This will have a problem
>> when iommu_free_table() is called.
>> 
>> Allocate iommu_table dynamically.
>
>I'd like a little more explanation about why we're calling
>iommu_free_table() now when we didn't call it before.  Maybe this happens
>when we disable SR-IOV and the VFs go away?

Yes, it is called in the disable path.

pcibios_sriov_disable
    pnv_pci_sriov_disable
        pnv_ioda_release_vf_PE
	    pnv_pci_ioda2_release_dma_pe
	        iommu_free_table            <--- here it is invoked


>
>Is there a hotplug remove path where we should also be calling
>iommu_free_table()?

When VFs are not involved, nothing calls this on the powernv platform.

Each PCI bus is a PE and has its own iommu table; even if a device is
hot-plugged, the iommu table is not released.

>
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>>  arch/powerpc/include/asm/iommu.h          |    3 +++
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++++++++++++++------------
>>  arch/powerpc/platforms/powernv/pci.h      |    2 +-
>>  3 files changed, 18 insertions(+), 13 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 9cfa3706a1b8..5574eeb97634 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -78,6 +78,9 @@ struct iommu_table {
>>  	struct iommu_group *it_group;
>>  #endif
>>  	void (*set_bypass)(struct iommu_table *tbl, bool enable);
>> +#ifdef CONFIG_PPC_POWERNV
>> +	void           *data;
>> +#endif
>>  };
>>  
>>  /* Pure 2^n version of get_order */
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 58c4fc4ab63c..cd1a56160ded 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -916,6 +916,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>  		return;
>>  	}
>>  
>> +	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
>> +			GFP_KERNEL, hose->node);
>> +	pe->tce32_table->data = pe;
>> +
>>  	/* Associate it with all child devices */
>>  	pnv_ioda_setup_same_PE(bus, pe);
>>  
>> @@ -1005,7 +1009,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
>>  
>>  	pe = &phb->ioda.pe_array[pdn->pe_number];
>>  	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
>> -	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
>> +	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
>>  }
>>  
>>  static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
>> @@ -1032,7 +1036,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
>>  	} else {
>>  		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
>>  		set_dma_ops(&pdev->dev, &dma_iommu_ops);
>> -		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
>> +		set_iommu_table_base(&pdev->dev, pe->tce32_table);
>>  	}
>>  	*pdev->dev.dma_mask = dma_mask;
>>  	return 0;
>> @@ -1069,9 +1073,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
>>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>>  		if (add_to_iommu_group)
>>  			set_iommu_table_base_and_group(&dev->dev,
>> -						       &pe->tce32_table);
>> +						       pe->tce32_table);
>>  		else
>> -			set_iommu_table_base(&dev->dev, &pe->tce32_table);
>> +			set_iommu_table_base(&dev->dev, pe->tce32_table);
>>  
>>  		if (dev->subordinate)
>>  			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
>> @@ -1161,8 +1165,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>>  void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
>>  				 __be64 *startp, __be64 *endp, bool rm)
>>  {
>> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
>> -					      tce32_table);
>> +	struct pnv_ioda_pe *pe = tbl->data;
>>  	struct pnv_phb *phb = pe->phb;
>>  
>>  	if (phb->type == PNV_PHB_IODA1)
>> @@ -1228,7 +1231,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>  	}
>>  
>>  	/* Setup linux iommu table */
>> -	tbl = &pe->tce32_table;
>> +	tbl = pe->tce32_table;
>>  	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
>>  				  base << 28, IOMMU_PAGE_SHIFT_4K);
>>  
>> @@ -1266,8 +1269,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>  
>>  static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>>  {
>> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
>> -					      tce32_table);
>> +	struct pnv_ioda_pe *pe = tbl->data;
>>  	uint16_t window_id = (pe->pe_number << 1 ) + 1;
>>  	int64_t rc;
>>  
>> @@ -1312,10 +1314,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>  	pe->tce_bypass_base = 1ull << 59;
>>  
>>  	/* Install set_bypass callback for VFIO */
>> -	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
>> +	pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
>>  
>>  	/* Enable bypass by default */
>> -	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
>> +	pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
>>  }
>>  
>>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> @@ -1363,7 +1365,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>  	}
>>  
>>  	/* Setup linux iommu table */
>> -	tbl = &pe->tce32_table;
>> +	tbl = pe->tce32_table;
>>  	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
>>  			IOMMU_PAGE_SHIFT_4K);
>>  
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index e5b75b298d95..731777734bca 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -53,7 +53,7 @@ struct pnv_ioda_pe {
>>  	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>>  	int			tce32_seg;
>>  	int			tce32_segcount;
>> -	struct iommu_table	tce32_table;
>> +	struct iommu_table	*tce32_table;
>>  	phys_addr_t		tce_inval_reg_phys;
>>  
>>  	/* 64-bit TCE bypass region */
>> 

-- 
Richard Yang
Help you, Help me



>> +	pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
>>  
>>  	/* Enable bypass by default */
>> -	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
>> +	pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
>>  }
>>  
>>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> @@ -1363,7 +1365,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>  	}
>>  
>>  	/* Setup linux iommu table */
>> -	tbl = &pe->tce32_table;
>> +	tbl = pe->tce32_table;
>>  	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
>>  			IOMMU_PAGE_SHIFT_4K);
>>  
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index e5b75b298d95..731777734bca 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -53,7 +53,7 @@ struct pnv_ioda_pe {
>>  	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>>  	int			tce32_seg;
>>  	int			tce32_segcount;
>> -	struct iommu_table	tce32_table;
>> +	struct iommu_table	*tce32_table;
>>  	phys_addr_t		tce_inval_reg_phys;
>>  
>>  	/* 64-bit TCE bypass region */
>> 
>--
>To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
  2015-02-24  9:06   ` Bjorn Helgaas
@ 2015-03-02  7:55       ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  7:55 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 03:06:57AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:35:04AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> M64 aperture size is limited on PHB3.  When the IOV BAR is too big, it
>> exceeds the limit and fails to be assigned.
>> 
>> Introduce a different mechanism based on the IOV BAR size:
>> 
>>   - if IOV BAR size is smaller than 64MB, expand to total_pe
>>   - if IOV BAR size is bigger than 64MB, roundup power2
>> 
>> [bhelgaas: make dev_printk() output more consistent, use PCI_SRIOV_NUM_BARS]
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>>  arch/powerpc/include/asm/pci-bridge.h     |    2 ++
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++++++++++++++++++++++++++---
>>  2 files changed, 32 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>> index 011340df8583..d824bb184ab8 100644
>> --- a/arch/powerpc/include/asm/pci-bridge.h
>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>> @@ -179,6 +179,8 @@ struct pci_dn {
>>  	u16     max_vfs;		/* number of VFs IOV BAR expended */
>>  	u16     vf_pes;			/* VF PE# under this PF */
>>  	int     offset;			/* PE# for the first VF PE */
>> +#define M64_PER_IOV 4
>> +	int     m64_per_iov;
>>  #define IODA_INVALID_M64        (-1)
>>  	int     m64_wins[PCI_SRIOV_NUM_BARS];
>>  #endif /* CONFIG_PCI_IOV */
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index a3c2fbe35fc8..30b7c3909746 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2242,6 +2242,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  	int i;
>>  	resource_size_t size;
>>  	struct pci_dn *pdn;
>> +	int mul, total_vfs;
>>  
>>  	if (!pdev->is_physfn || pdev->is_added)
>>  		return;
>> @@ -2252,6 +2253,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  	pdn = pci_get_pdn(pdev);
>>  	pdn->max_vfs = 0;
>>  
>> +	total_vfs = pci_sriov_get_totalvfs(pdev);
>> +	pdn->m64_per_iov = 1;
>> +	mul = phb->ioda.total_pe;
>> +
>> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> +		res = &pdev->resource[i + PCI_IOV_RESOURCES];
>> +		if (!res->flags || res->parent)
>> +			continue;
>> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
>> +			dev_warn(&pdev->dev, " non M64 VF BAR%d: %pR\n",
>> +				 i, res);
>
>Why is this a dev_warn()?  Can the user do anything about it?  Do you want
>bug reports if users see this message?  There are several other instances
>of this in the other patches, too.
>

Thanks for your question.

In the current implementation, the powernv platform can't support an SR-IOV
device with a non-M64 IOV BAR, since we map the IOV BAR with an M64 BAR on
the PHB. I am not sure which log level is better, though.

>> +			continue;
>> +		}
>> +
>> +		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>> +
>> +		/* bigger than 64M */
>> +		if (size > (1 << 26)) {
>> +			dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size is bigger than 64M, roundup power2\n",
>> +				 i, res);
>> +			pdn->m64_per_iov = M64_PER_IOV;
>> +			mul = __roundup_pow_of_two(total_vfs);
>
>Why is this __roundup_pow_of_two() instead of roundup_pow_of_two()?
>I *think* __roundup_pow_of_two() is basically a helper function for
>implementing roundup_pow_of_two() and not intended to be used by itself.
>
>I think there are other patches that use __roundup_pow_of_two(), too.

Got it, will change it.

>
>> +			break;
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>>  		res = &pdev->resource[i + PCI_IOV_RESOURCES];
>>  		if (!res->flags || res->parent)
>> @@ -2264,12 +2291,12 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  
>>  		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
>>  		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>> -		res->end = res->start + size * phb->ioda.total_pe - 1;
>> +		res->end = res->start + size * mul - 1;
>>  		dev_dbg(&pdev->dev, "                       %pR\n", res);
>>  		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>> -				i, res, phb->ioda.total_pe);
>> +			 i, res, mul);
>>  	}
>> -	pdn->max_vfs = phb->ioda.total_pe;
>> +	pdn->max_vfs = mul;
>>  }
>>  
>>  static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
>> 

-- 
Richard Yang
Help you, Help me



* Re: [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
  2015-03-02  7:50       ` Wei Yang
@ 2015-03-02  7:56         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-02  7:56 UTC (permalink / raw)
  To: Wei Yang; +Cc: Bjorn Helgaas, gwshan, linux-pci, linuxppc-dev

On Mon, 2015-03-02 at 15:50 +0800, Wei Yang wrote:
> >
> >Is there a hotplug remove path where we should also be calling
> >iommu_free_table()?
> 
> When VF is not introduced, no one calls this on the powernv platform.
> 
> Each PCI bus is a PE and has its own iommu table; even if a device is
> hotplugged, the iommu table will not be released.

Actually, I believe Alexey's patches to add support for dynamic DMA
windows for KVM guests using VFIO will also alloc/free iommu tables. In
fact his patches change quite a few things in that area, and I'm
currently reviewing them.

Wei, can you post a new series when you've finished syncing with
Bjorn? At that point, I'll try to work with Alexey to evaluate the
impact of his changes on your patches.

Cheers,
Ben.




* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-02-24 17:10     ` Bjorn Helgaas
@ 2015-03-02  7:58         ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  7:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Wei Yang, Benjamin Herrenschmidt, Gavin Shan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 11:10:33AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 3:00 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
>>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>>>
>>> On PowerNV platform, resource position in M64 implies the PE# the resource
>>> belongs to.  In some cases, adjustment of a resource is necessary to locate
>>> it to a correct position in M64.
>>>
>>> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
>>> according to an offset.
>>>
>>> [bhelgaas: rework loops, rework overlap check, index resource[]
>>> conventionally, remove pci_regs.h include, squashed with next patch]
>>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>
>> ...
>>
>>> +#ifdef CONFIG_PCI_IOV
>>> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>>> +{
>>> +     struct pci_dn *pdn = pci_get_pdn(dev);
>>> +     int i;
>>> +     struct resource *res, res2;
>>> +     resource_size_t size;
>>> +     u16 vf_num;
>>> +
>>> +     if (!dev->is_physfn)
>>> +             return -EINVAL;
>>> +
>>> +     /*
>>> +      * "offset" is in VFs.  The M64 windows are sized so that when they
>>> +      * are segmented, each segment is the same size as the IOV BAR.
>>> +      * Each segment is in a separate PE, and the high order bits of the
>>> +      * address are the PE number.  Therefore, each VF's BAR is in a
>>> +      * separate PE, and changing the IOV BAR start address changes the
>>> +      * range of PEs the VFs are in.
>>> +      */
>>> +     vf_num = pdn->vf_pes;
>>> +     for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>>> +             res = &dev->resource[i + PCI_IOV_RESOURCES];
>>> +             if (!res->flags || !res->parent)
>>> +                     continue;
>>> +
>>> +             if (!pnv_pci_is_mem_pref_64(res->flags))
>>> +                     continue;
>>> +
>>> +             /*
>>> +              * The actual IOV BAR range is determined by the start address
>>> +              * and the actual size for vf_num VFs BAR.  This check is to
>>> +              * make sure that after shifting, the range will not overlap
>>> +              * with another device.
>>> +              */
>>> +             size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>>> +             res2.flags = res->flags;
>>> +             res2.start = res->start + (size * offset);
>>> +             res2.end = res2.start + (size * vf_num) - 1;
>>> +
>>> +             if (res2.end > res->end) {
>>> +                     dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>>> +                             i, &res2, res, vf_num, offset);
>>> +                     return -EBUSY;
>>> +             }
>>> +     }
>>> +
>>> +     for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>>> +             res = &dev->resource[i + PCI_IOV_RESOURCES];
>>> +             if (!res->flags || !res->parent)
>>> +                     continue;
>>> +
>>> +             if (!pnv_pci_is_mem_pref_64(res->flags))
>>> +                     continue;
>>> +
>>> +             size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>>> +             res2 = *res;
>>> +             res->start += size * offset;
>>
>> I'm still not happy about this fiddling with res->start.
>>
>> Increasing res->start means that in principle, the "size * offset" bytes
>> that we just removed from res are now available for allocation to somebody
>> else.  I don't think we *will* give that space to anything else because of
>> the alignment restrictions you're enforcing, but "res" now doesn't
>> correctly describe the real resource map.
>>
>> Would you be able to just update the BAR here while leaving the struct
>> resource alone?  In that case, it would look a little funny that lspci
>> would show a BAR value in the middle of the region in /proc/iomem, but
>> the /proc/iomem region would be more correct.
>
>I guess this would also require a tweak where we compute the addresses
>of each of the VF resources.  Today it's probably just "base + VF_num
>* size", where "base" is res->start.  We'd have to account for the
>offset there if we don't adjust it here.
>

Oh, this is really an interesting idea.

I will do some tests to see the result.

>>> +
>>> +             dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
>>> +                      i, &res2, res, vf_num, offset);
>>> +             pci_update_resource(dev, i + PCI_IOV_RESOURCES);
>>> +     }
>>> +     pdn->max_vfs -= offset;
>>> +     return 0;
>>> +}
>>> +#endif /* CONFIG_PCI_IOV */

-- 
Richard Yang
Help you, Help me



* Re: [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
  2015-03-02  7:56         ` Benjamin Herrenschmidt
@ 2015-03-02  8:02           ` Wei Yang
  -1 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-02  8:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, Bjorn Helgaas, gwshan, linux-pci, linuxppc-dev

On Mon, Mar 02, 2015 at 06:56:19PM +1100, Benjamin Herrenschmidt wrote:
>On Mon, 2015-03-02 at 15:50 +0800, Wei Yang wrote:
>> >
>> >Is there a hotplug remove path where we should also be calling
>> >iommu_free_table()?
>> 
>> When VF is not introduced, no one calls this on powernv platform.
>> 
>> Each PCI bus is a PE and it has its own iommu table, even a device is
>> hotpluged, the iommu table will not be released.
>
>Actually, I believe Alexey patches to add support for dynamic DMA
>windows for KVM guests using VFIO will also alloc/free iommu tables. In
>fact his patches somewhat change quite a few things in that area, and
>I'm currently reviewing them.

Yes, I have seen these changes before.

>
>Wei, can you post a new series when you've finished sync'ing with
>Bjorn ? At that point, I'll try to work with Alexey to evaluate the
>impact of his changes on your patches.

Sure, I will do it ASAP.

>
>Cheers,
>Ben.
>

-- 
Richard Yang
Help you, Help me



* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-02-24  9:00   ` Bjorn Helgaas
@ 2015-03-04  3:01       ` Wei Yang
  2015-03-04  3:01       ` Wei Yang
  1 sibling, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-04  3:01 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 24, 2015 at 03:00:37AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
>> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> 
>> On PowerNV platform, resource position in M64 implies the PE# the resource
>> belongs to.  In some cases, adjustment of a resource is necessary to locate
>> it to a correct position in M64.
>> 
>> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
>> according to an offset.
>> 
>> [bhelgaas: rework loops, rework overlap check, index resource[]
>> conventionally, remove pci_regs.h include, squashed with next patch]
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
>...
>
>> +#ifdef CONFIG_PCI_IOV
>> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> +{
>> +	struct pci_dn *pdn = pci_get_pdn(dev);
>> +	int i;
>> +	struct resource *res, res2;
>> +	resource_size_t size;
>> +	u16 vf_num;
>> +
>> +	if (!dev->is_physfn)
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * "offset" is in VFs.  The M64 windows are sized so that when they
>> +	 * are segmented, each segment is the same size as the IOV BAR.
>> +	 * Each segment is in a separate PE, and the high order bits of the
>> +	 * address are the PE number.  Therefore, each VF's BAR is in a
>> +	 * separate PE, and changing the IOV BAR start address changes the
>> +	 * range of PEs the VFs are in.
>> +	 */
>> +	vf_num = pdn->vf_pes;
>> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
>> +		if (!res->flags || !res->parent)
>> +			continue;
>> +
>> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> +			continue;
>> +
>> +		/*
>> +		 * The actual IOV BAR range is determined by the start address
>> +		 * and the actual size for vf_num VFs BAR.  This check is to
>> +		 * make sure that after shifting, the range will not overlap
>> +		 * with another device.
>> +		 */
>> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> +		res2.flags = res->flags;
>> +		res2.start = res->start + (size * offset);
>> +		res2.end = res2.start + (size * vf_num) - 1;
>> +
>> +		if (res2.end > res->end) {
>> +			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>> +				i, &res2, res, vf_num, offset);
>> +			return -EBUSY;
>> +		}
>> +	}
>> +
>> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
>> +		if (!res->flags || !res->parent)
>> +			continue;
>> +
>> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> +			continue;
>> +
>> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> +		res2 = *res;
>> +		res->start += size * offset;
>
>I'm still not happy about this fiddling with res->start.
>
>Increasing res->start means that in principle, the "size * offset" bytes
>that we just removed from res are now available for allocation to somebody
>else.  I don't think we *will* give that space to anything else because of
>the alignment restrictions you're enforcing, but "res" now doesn't
>correctly describe the real resource map.
>
>Would you be able to just update the BAR here while leaving the struct
>resource alone?  In that case, it would look a little funny that lspci
>would show a BAR value in the middle of the region in /proc/iomem, but
>the /proc/iomem region would be more correct.

Bjorn,

I did some tests, but the result is not good.

What I did was still write the shifted resource address to the device with
pci_update_resource(), but revert res->start to the original value.  If this
step is not correct, please let me know.

This can't work: after we revert res->start, the VFs will be given resources
starting at res->start instead of (res->start + offset * size), which is not
what we expect.

I have rebased, cleaned up, and changed the code according to your comments on
this patch set.  Will send out v13 soon.
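
For reference, the shift and overlap check being discussed can be sketched
outside the kernel with plain arithmetic.  The struct and names below are
illustrative stand-ins for the real pnv_pci_vf_resource_shift() logic, not
kernel code:

```c
#include <stdint.h>

/* Simplified model of a resource range; "size" is one VF BAR segment
 * and "offset" is measured in VFs, as in the patch above. */
struct res_range { uint64_t start, end; };

/* Window that vf_num VFs actually occupy after shifting by offset. */
static struct res_range vf_window(struct res_range iov, uint64_t size,
                                  int offset, int vf_num)
{
    struct res_range r;
    r.start = iov.start + size * (uint64_t)offset;
    r.end   = r.start + size * (uint64_t)vf_num - 1;
    return r;
}

/* The overlap check: the shifted range must stay inside the original
 * IOV BAR, otherwise the caller returns -EBUSY. */
static int shift_fits(struct res_range iov, uint64_t size,
                      int offset, int vf_num)
{
    return vf_window(iov, size, offset, vf_num).end <= iov.end;
}
```

With a 16-segment window (segment size 0x100, range 0x1000-0x1FFF), shifting
8 VFs by 4 segments fits, while shifting by 12 runs past the end of the PF's
IOV BAR, which is exactly the -EBUSY case in the patch.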

>
>> +
>> +		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
>> +			 i, &res2, res, vf_num, offset);
>> +		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
>> +	}
>> +	pdn->max_vfs -= offset;
>> +	return 0;
>> +}
>> +#endif /* CONFIG_PCI_IOV */

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
  2015-03-02  7:32       ` Wei Yang
@ 2015-03-11  2:36         ` Bjorn Helgaas
  -1 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-03-11  2:36 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Mon, Mar 02, 2015 at 03:32:47PM +0800, Wei Yang wrote:
> On Tue, Feb 24, 2015 at 02:41:52AM -0600, Bjorn Helgaas wrote:
> >On Tue, Feb 24, 2015 at 02:34:06AM -0600, Bjorn Helgaas wrote:
> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> 
> >> When sizing and assigning resources, we divide the resources into two
> >> lists: the requested list and the additional list.  We don't consider the
> >> alignment of additional VF(n) BAR space.
> >> 
> >> This is reasonable because the alignment required for the VF(n) BAR space
> >> is the size of an individual VF BAR, not the size of the space for *all*
> >> VFs.  But some platforms, e.g., PowerNV, require additional alignment.
> >> 
> >> Consider the additional IOV BAR alignment when sizing and assigning
> >> resources.  When there is not enough system MMIO space, the PF's IOV BAR
> >> alignment will not contribute to the bridge.  When there is enough system
> >> MMIO space, the additional alignment will contribute to the bridge.
> >
>I don't understand the "when there is not enough system MMIO space" part.
>How do we tell if there's enough MMIO space?
> >
> 
> In __assign_resources_sorted(), there are two resource lists, one for the
> requested resources (head) and one for the additional ones (realloc_head).
> The function first tries to combine them and assign.  If that fails, it
> means we don't have enough MMIO space.

How about this text:

  This is because the alignment required for the VF(n) BAR space is the size
  of an individual VF BAR, not the size of the space for *all* VFs.  But we
  want additional alignment to support partitioning on PowerNV.

  Consider the additional IOV BAR alignment when sizing and assigning
  resources.  When there is not enough system MMIO space to accommodate both
  the requested list and the additional list, the PF's IOV BAR alignment will
  not contribute to the bridge.  When there is enough system MMIO space for
  both lists, the additional alignment will contribute to the bridge.

We're doing something specifically for PowerNV.  I would really like to be
able to read this patch and say "Oh, here's the hook where we get the
PowerNV behavior, and it's obvious that other platforms are unaffected."
But I don't see a pcibios or similar hook, so I don't know where that
PowerNV behavior is.

Is it something to do with get_res_add_align()?  That uses min_align, but I
don't know how that's connected ...  ah, I see, "add_align" is computed
from pci_resource_alignment(), which has this path:

  pci_resource_alignment
    pci_sriov_resource_alignment
      pcibios_iov_resource_alignment

and powerpc has a special pcibios_iov_resource_alignment() for PowerNV.
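
As a rough sketch of that hook chain, the default-plus-override pattern looks
something like this.  This is a simplified user-space model: the function
pointer stands in for the ppc_md machdep vector, and the sizes are made up:

```c
#include <stddef.h>
#include <stdint.h>

/* Platform override hook; NULL on platforms with no special needs,
 * analogous to ppc_md.pcibios_iov_resource_alignment on powerpc. */
typedef uint64_t (*iov_align_hook)(int resno);
static iov_align_hook platform_iov_align;

/* Stub for pci_iov_resource_size(): size of a single VF BAR. */
static uint64_t iov_resource_size(int resno)
{
    (void)resno;
    return 0x1000;
}

/* Generic default: the required alignment is just the single-VF BAR
 * size, unless the platform installed an override. */
static uint64_t iov_resource_alignment(int resno)
{
    if (platform_iov_align)
        return platform_iov_align(resno);
    return iov_resource_size(resno);
}

/* A PowerNV-like override: align the IOV BAR to the whole expanded
 * window (total_pe segments, here hard-coded to 256). */
static uint64_t pnv_iov_align(int resno)
{
    return iov_resource_size(resno) * 256;
}
```

Other platforms never install the hook, so they keep the default behavior;
that is the "obviously unaffected" property being asked for above.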

> >> Also, take advantage of pci_dev_resource::min_align to store this
> >> additional alignment.
> >
> >This comment doesn't seem to make sense; this patch doesn't save anything
> >in min_align.
> 
> At the end of this patch:
> 
>    add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
> 
> The add_align is stored in pci_dev_resource::min_align by add_to_list() and
> retrieved by get_res_add_align() in the code below.  This field was not used
> previously, so I took advantage of it to store the alignment of the
> additional resources.

Hmm.  pci_dev_resource::min_align *is* already used in
reassign_resources_sorted().  Maybe there's no overlap; I gave up the
analysis before I could convince myself.

The changelog needs to mention the add_to_list() connection.

> >> +		/*
> >> +		 * There are two kinds of additional resources in the list:
> >> +		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
> >> +		 * 2. SR-IOV resource   -- IORESOURCE_SIZEALIGN
> >> +		 * Here just fix the additional alignment for bridge
> >> +		 */
> >> +		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
> >> +			continue;
> >> +
> >> +		add_align = get_res_add_align(realloc_head, dev_res->res);
> >> +
> >> +		/* Reorder the list by their alignment */
> >
> >Why do we need to reorder the list by alignment?
> 
> Resource list "head" is sorted by alignment, but the alignment can change
> once we consider the additional resources.
> 
> Take the powernv platform as an example.  The IOV BAR is expanded and needs
> to be aligned to its total size instead of the individual VF BAR size.  If
> we don't reorder the list, the IOV BAR would be assigned after some other
> resources, which may cause the real assignment to fail even though the
> total size is sufficient.
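
The reordering problem Wei describes can be reproduced with a toy first-fit
allocator (illustrative only; the real logic lives in
__assign_resources_sorted() and friends):

```c
#include <stdint.h>

/* A resource request: a size and a power-of-two start alignment. */
struct req { uint64_t size, align; };

/* Bump-allocate the requests in the given order inside a region of
 * region_size bytes starting at 0; return 1 if everything fits.
 * The align-up trick assumes align is a power of two. */
static int assign_in_order(const struct req *r, int n, uint64_t region_size)
{
    uint64_t cur = 0;
    for (int i = 0; i < n; i++) {
        uint64_t start = (cur + r[i].align - 1) & ~(r[i].align - 1);
        if (start + r[i].size > region_size)
            return 0;               /* assignment failed */
        cur = start + r[i].size;
    }
    return 1;
}
```

In a 0x3000 region, placing the 0x1000-aligned resource first leaves the
0x2000-aligned one no room (alignment padding pushes it past the end), while
the descending-alignment order fits both, even though the total size is the
same either way.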

This is worthy of a comment in the code.

Bjorn

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
  2015-03-02  7:50       ` Wei Yang
@ 2015-03-11  2:47         ` Bjorn Helgaas
  -1 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-03-11  2:47 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Mon, Mar 02, 2015 at 03:50:37PM +0800, Wei Yang wrote:
> On Tue, Feb 24, 2015 at 02:46:53AM -0600, Bjorn Helgaas wrote:
> >On Tue, Feb 24, 2015 at 02:34:35AM -0600, Bjorn Helgaas wrote:
> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> 
> >> Current iommu_table of a PE is a static field.  This will have a problem
> >> when iommu_free_table() is called.
> >> 
> >> Allocate iommu_table dynamically.
> >
> >I'd like a little more explanation about why we're calling
> >iommu_free_table() now when we didn't call it before.  Maybe this happens
> >when we disable SR-IOV and the VFs go away?
> 
> Yes, it is called in disable path.
> 
> pcibios_sriov_disable
>     pnv_pci_sriov_disable
>         pnv_ioda_release_vf_PE
> 	    pnv_pci_ioda2_release_dma_pe
> 	        iommu_free_table            <--- here it is invoked
> 
> 
> >
> >Is there a hotplug remove path where we should also be calling
> >iommu_free_table()?
> 
> Before VFs were introduced, nothing called it on the powernv platform.
> 
> Each PCI bus is a PE with its own iommu table; even when a device is
> hot-unplugged, the iommu table is not released.

None of this explanation made it into the v13 patch.  And I don't quite
understand it anyway.

Something like "Previously the iommu_table had the same lifetime as a
struct pnv_ioda_pe and was embedded in it.  The pnv_ioda_pe was allocated
when XXX and freed when YYY.  This no longer works: we can't allocate the
iommu_table at the same time as the pnv_ioda_pe because XXX, so we allocate
it when XXX and free it when YYY."
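
The lifetime issue itself is easy to model: with the table embedded in the
PE, iommu_free_table() would have to free memory it does not own.  A minimal
sketch, with simplified user-space stand-ins for the kernel structures:

```c
#include <stdlib.h>
#include <stddef.h>

struct iommu_table { int placeholder; };

/* With a pointer member (instead of the old embedded
 * "struct iommu_table table;"), the table's lifetime is decoupled
 * from the PE's and it can be freed on the SR-IOV disable path. */
struct pnv_ioda_pe {
    struct iommu_table *table;
};

static void pe_init(struct pnv_ioda_pe *pe)
{
    pe->table = calloc(1, sizeof(*pe->table));
}

/* Analogue of the pnv_pci_ioda2_release_dma_pe() ->
 * iommu_free_table() path: frees only the table, leaving the PE
 * bookkeeping to be released separately. */
static void pe_release_dma(struct pnv_ioda_pe *pe)
{
    free(pe->table);
    pe->table = NULL;
}
```

Had the table stayed embedded, the free() above would be impossible without
tearing down the whole PE at the same time.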

Bjorn

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-03-02  7:41       ` Wei Yang
@ 2015-03-11  2:51         ` Bjorn Helgaas
  -1 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-03-11  2:51 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Mon, Mar 02, 2015 at 03:41:32PM +0800, Wei Yang wrote:
> On Tue, Feb 24, 2015 at 02:52:34AM -0600, Bjorn Helgaas wrote:
> >On Tue, Feb 24, 2015 at 02:34:42AM -0600, Bjorn Helgaas wrote:
> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> 
> >> On PHB3, PF IOV BAR will be covered by M64 window to have better PE
> >> isolation.  The total_pe number is usually different from total_VFs, which
> >> can lead to a conflict between MMIO space and the PE number.
> >> 
> >> For example, if total_VFs is 128 and total_pe is 256, the second half of
> >> M64 window will be part of other PCI device, which may already belong
> >> to other PEs.
> >
> >I'm still trying to wrap my mind around the explanation here.
> >
> >I *think* what's going on is that the M64 window must be a power-of-two
> >size.  If the VF(n) BAR space doesn't completely fill it, we might allocate
> >the leftover space to another device.  Then the M64 window for *this*
> >device may cause the other device to be associated with a PE it didn't
> >expect.
> 
> Yes, this is the exact reason.

Can you include some of this text in your changelog, then?  I can wordsmith
it and try to make it fit together better.
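
Numerically, the expansion under discussion looks like this (a sketch with
made-up sizes; the real code is in pnv_pci_ioda_fixup_iov_resources()):

```c
#include <stdint.h>

/* End address of an IOV BAR grown to cover "segments" segments of
 * vf_bar_size bytes each, mirroring
 *     res->end = res->start + size * phb->ioda.total_pe - 1;
 * in the patch above. */
static uint64_t expanded_end(uint64_t start, uint64_t vf_bar_size,
                             int segments)
{
    return start + vf_bar_size * (uint64_t)segments - 1;
}
```

With a 64 KB VF BAR, 128 VFs cover only the first half of a 256-segment M64
window; expanding the IOV BAR to total_pe segments keeps the second half
from being allocated to another device and landing in an unexpected PE.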

> >> +#ifdef CONFIG_PCI_IOV
> >> +static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
> >> +{
> >> +	struct pci_controller *hose;
> >> +	struct pnv_phb *phb;
> >> +	struct resource *res;
> >> +	int i;
> >> +	resource_size_t size;
> >> +	struct pci_dn *pdn;
> >> +
> >> +	if (!pdev->is_physfn || pdev->is_added)
> >> +		return;
> >> +
> >> +	hose = pci_bus_to_host(pdev->bus);
> >> +	phb = hose->private_data;
> >> +
> >> +	pdn = pci_get_pdn(pdev);
> >> +	pdn->max_vfs = 0;
> >> +
> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> >> +		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> >> +		if (!res->flags || res->parent)
> >> +			continue;
> >> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
> >> +			dev_warn(&pdev->dev, "Skipping expanding VF BAR%d: %pR\n",
> >> +				 i, res);
> >> +			continue;
> >> +		}
> >> +
> >> +		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
> >> +		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
> >> +		res->end = res->start + size * phb->ioda.total_pe - 1;
> >> +		dev_dbg(&pdev->dev, "                       %pR\n", res);
> >> +		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
> >> +				i, res, phb->ioda.total_pe);
> >> +	}
> >> +	pdn->max_vfs = phb->ioda.total_pe;
> >> +}
> >> +
> >> +static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
> >> +{
> >> +	struct pci_dev *pdev;
> >> +	struct pci_bus *b;
> >> +
> >> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
> >> +		b = pdev->subordinate;
> >> +
> >> +		if (b)
> >> +			pnv_pci_ioda_fixup_sriov(b);
> >> +
> >> +		pnv_pci_ioda_fixup_iov_resources(pdev);
> >
> >I'm not sure this happens at the right time.  We have this call chain:
> >
> >  pcibios_scan_phb
> >    pci_create_root_bus
> >    pci_scan_child_bus
> >    pnv_pci_ioda_fixup_sriov
> >      pnv_pci_ioda_fixup_iov_resources
> >	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
>	  increase res->size to accommodate 256 PEs (or roundup(totalVFs))
> >
> >so we only do the fixup_iov_resources() when we scan the PHB, and we
> >wouldn't do it at all for hot-added devices.
> 
> Yep, you are right :-)
> 
> I had a separate patch to do this in pcibios_add_pci_devices().  Looks like
> we could merge them.

Did you fix this in v13?  I don't see the change if you did.

> >> +	}
> >> +}
> >> +#endif /* CONFIG_PCI_IOV */
> >> +
> >>  /*
> >>   * This function is supposed to be called on basis of PE from top
> >>   * to bottom style. So the the I/O or MMIO segment assigned to
> >> @@ -2125,6 +2180,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
> >>  	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
> >>  	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
> >>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
> >> +#ifdef CONFIG_PCI_IOV
> >> +	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
> >> +#endif /* CONFIG_PCI_IOV */
> >>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
> >>  
> >>  	/* Reset IODA tables to a clean state */
> >> 
> 
> -- 
> Richard Yang
> Help you, Help me
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-03-04  3:01       ` Wei Yang
@ 2015-03-11  2:55         ` Bjorn Helgaas
  -1 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-03-11  2:55 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Wed, Mar 04, 2015 at 11:01:24AM +0800, Wei Yang wrote:
> On Tue, Feb 24, 2015 at 03:00:37AM -0600, Bjorn Helgaas wrote:
> >On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> 
> >> On PowerNV platform, resource position in M64 implies the PE# the resource
> >> belongs to.  In some cases, adjustment of a resource is necessary to locate
> >> it to a correct position in M64.
> >> 
> >> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
> >> according to an offset.
> >> 
> >> [bhelgaas: rework loops, rework overlap check, index resource[]
> >> conventionally, remove pci_regs.h include, squashed with next patch]
> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> >
> >...
> >
> >> +#ifdef CONFIG_PCI_IOV
* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
@ 2015-03-11  2:55         ` Bjorn Helgaas
  0 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-03-11  2:55 UTC (permalink / raw)
  To: Wei Yang; +Cc: linux-pci, benh, linuxppc-dev, gwshan

On Wed, Mar 04, 2015 at 11:01:24AM +0800, Wei Yang wrote:
> On Tue, Feb 24, 2015 at 03:00:37AM -0600, Bjorn Helgaas wrote:
> >On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> 
> >> On PowerNV platform, resource position in M64 implies the PE# the resource
> >> belongs to.  In some cases, adjustment of a resource is necessary to locate
> >> it to a correct position in M64.
> >> 
> >> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
> >> according to an offset.
> >> 
> >> [bhelgaas: rework loops, rework overlap check, index resource[]
> >> conventionally, remove pci_regs.h include, squashed with next patch]
> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> >
> >...
> >
> >> +#ifdef CONFIG_PCI_IOV
> >> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> >> +{
> >> +	struct pci_dn *pdn = pci_get_pdn(dev);
> >> +	int i;
> >> +	struct resource *res, res2;
> >> +	resource_size_t size;
> >> +	u16 vf_num;
> >> +
> >> +	if (!dev->is_physfn)
> >> +		return -EINVAL;
> >> +
> >> +	/*
> >> +	 * "offset" is in VFs.  The M64 windows are sized so that when they
> >> +	 * are segmented, each segment is the same size as the IOV BAR.
> >> +	 * Each segment is in a separate PE, and the high order bits of the
> >> +	 * address are the PE number.  Therefore, each VF's BAR is in a
> >> +	 * separate PE, and changing the IOV BAR start address changes the
> >> +	 * range of PEs the VFs are in.
> >> +	 */
> >> +	vf_num = pdn->vf_pes;
> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> >> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
> >> +		if (!res->flags || !res->parent)
> >> +			continue;
> >> +
> >> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * The actual IOV BAR range is determined by the start address
> >> +		 * and the actual size for vf_num VFs BAR.  This check is to
> >> +		 * make sure that after shifting, the range will not overlap
> >> +		 * with another device.
> >> +		 */
> >> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> >> +		res2.flags = res->flags;
> >> +		res2.start = res->start + (size * offset);
> >> +		res2.end = res2.start + (size * vf_num) - 1;
> >> +
> >> +		if (res2.end > res->end) {
> >> +			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
> >> +				i, &res2, res, vf_num, offset);
> >> +			return -EBUSY;
> >> +		}
> >> +	}
> >> +
> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> >> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
> >> +		if (!res->flags || !res->parent)
> >> +			continue;
> >> +
> >> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> >> +			continue;
> >> +
> >> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> >> +		res2 = *res;
> >> +		res->start += size * offset;
> >
> >I'm still not happy about this fiddling with res->start.
> >
> >Increasing res->start means that in principle, the "size * offset" bytes
> >that we just removed from res are now available for allocation to somebody
> >else.  I don't think we *will* give that space to anything else because of
> >the alignment restrictions you're enforcing, but "res" now doesn't
> >correctly describe the real resource map.
> >
> >Would you be able to just update the BAR here while leaving the struct
> >resource alone?  In that case, it would look a little funny that lspci
> >would show a BAR value in the middle of the region in /proc/iomem, but
> >the /proc/iomem region would be more correct.
> 
> Bjorn,
> 
> I did some tests, but the result is not good.
> 
> What I did was still write the shifted resource address to the device with
> pci_update_resource(), but revert res->start to the original value. If
> this step is not correct, please let me know.
> 
> This can't work: after we revert res->start, the VFs will be given
> resources starting from res->start instead of (res->start + offset * size),
> which is not what we expect.

Hmm, yes, I suppose we'd have to have a hook in pci_bus_alloc_from_region()
or something.  That's getting a little messy.  I still don't like messing
with the resource after it's in the resource tree, but I don't have a
better idea right now.  So let's just go with what you have.
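
[Editor's note: the range check being debated above can be sketched in plain
user-space C. The types, names, and error value below are illustrative
stand-ins, not the kernel's definitions.]

```c
#include <assert.h>

typedef unsigned long long resource_size_t;

struct range {
	resource_size_t start, end;
};

#define MY_EBUSY 16	/* stand-in for the kernel's EBUSY */

/* The shifted window for vf_num VFs must still fit inside the PF's
 * (expanded) IOV BAR; otherwise it would overlap another device. */
static int check_shift(const struct range *res, resource_size_t size,
		       int offset, int vf_num)
{
	resource_size_t new_start = res->start + size * (resource_size_t)offset;
	resource_size_t new_end = new_start + size * (resource_size_t)vf_num - 1;

	return (new_end > res->end) ? -MY_EBUSY : 0;
}
```

With a 16-segment BAR of segment size 0x100 at 0x1000, shifting 12 VFs by 4
segments still fits, while shifting by 8 runs past the end and fails.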

> >> +
> >> +		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
> >> +			 i, &res2, res, vf_num, offset);
> >> +		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
> >> +	}
> >> +	pdn->max_vfs -= offset;
> >> +	return 0;
> >> +}
> >> +#endif /* CONFIG_PCI_IOV */
> 
> -- 
> Richard Yang
> Help you, Help me
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
  2015-03-11  2:47         ` Bjorn Helgaas
@ 2015-03-11  6:13           ` Wei Yang
  -1 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-11  6:13 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, linux-pci, benh, linuxppc-dev, gwshan

On Tue, Mar 10, 2015 at 09:47:37PM -0500, Bjorn Helgaas wrote:
>On Mon, Mar 02, 2015 at 03:50:37PM +0800, Wei Yang wrote:
>> On Tue, Feb 24, 2015 at 02:46:53AM -0600, Bjorn Helgaas wrote:
>> >On Tue, Feb 24, 2015 at 02:34:35AM -0600, Bjorn Helgaas wrote:
>> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> 
>> >> Current iommu_table of a PE is a static field.  This will have a problem
>> >> when iommu_free_table() is called.
>> >> 
>> >> Allocate iommu_table dynamically.
>> >
>> >I'd like a little more explanation about why we're calling
>> >iommu_free_table() now when we didn't call it before.  Maybe this happens
>> >when we disable SR-IOV and the VFs go away?
>> 
>> Yes, it is called in the disable path.
>> 
>> pcibios_sriov_disable
>>     pnv_pci_sriov_disable
>>         pnv_ioda_release_vf_PE
>> 	    pnv_pci_ioda2_release_dma_pe
>> 	        iommu_free_table            <--- here it is invoked
>> 
>> 
>> >
>> >Is there a hotplug remove path where we should also be calling
>> >iommu_free_table()?
>> 
>> Before VFs are introduced, no one calls this on the powernv platform.
>> 
>> Each PCI bus is a PE and has its own iommu table; even if a device is
>> hot-plugged, the iommu table will not be released.
>
>None of this explanation made it into the v13 patch.  And I don't quite
>understand it anyway.
>
>Something like "Previously the iommu_table had the same lifetime as a
>struct pnv_ioda_pe and was embedded in it.  The pnv_ioda_pe was allocated
>when XXX and freed when YYY.  This no longer works: we can't allocate the
>iommu_table at the same time as the pnv_ioda_pe because XXX, so we allocate
>it when XXX and free it when YYY."

Got it, I will put the explanation in the change log in the next version.
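
[Editor's note: the lifetime issue discussed above can be illustrated with a
small user-space sketch. The structures below are toy stand-ins, not the
kernel's pnv_ioda_pe or iommu_table.]

```c
#include <assert.h>
#include <stdlib.h>

struct iommu_tbl { int entries; };

/* With the table embedded, it lives and dies with the PE and cannot be
 * freed on the SR-IOV disable path while the PE itself persists. */
struct pe_embedded { struct iommu_tbl tbl; };

/* With a pointer, the table can be allocated when DMA is set up for a
 * VF PE and freed in the release path, independent of the PE. */
struct pe_dynamic { struct iommu_tbl *tbl; };

static void pe_setup_dma(struct pe_dynamic *pe)
{
	pe->tbl = calloc(1, sizeof(*pe->tbl));
}

static void pe_release_dma(struct pe_dynamic *pe)
{
	free(pe->tbl);	/* analogous to iommu_free_table() */
	pe->tbl = NULL;
}
```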

>
>Bjorn
>_______________________________________________
>Linuxppc-dev mailing list
>Linuxppc-dev@lists.ozlabs.org
>https://lists.ozlabs.org/listinfo/linuxppc-dev

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-03-11  2:51         ` Bjorn Helgaas
@ 2015-03-11  6:22           ` Wei Yang
  -1 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-11  6:22 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Mar 10, 2015 at 09:51:25PM -0500, Bjorn Helgaas wrote:
>On Mon, Mar 02, 2015 at 03:41:32PM +0800, Wei Yang wrote:
>> On Tue, Feb 24, 2015 at 02:52:34AM -0600, Bjorn Helgaas wrote:
>> >On Tue, Feb 24, 2015 at 02:34:42AM -0600, Bjorn Helgaas wrote:
>> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> 
>> >> On PHB3, PF IOV BAR will be covered by M64 window to have better PE
>> >> isolation.  The total_pe number is usually different from total_VFs, which
>> >> can lead to a conflict between MMIO space and the PE number.
>> >> 
>> >> For example, if total_VFs is 128 and total_pe is 256, the second half of
>> >> M64 window will be part of other PCI device, which may already belong
>> >> to other PEs.
>> >
>> >I'm still trying to wrap my mind around the explanation here.
>> >
>> >I *think* what's going on is that the M64 window must be a power-of-two
>> >size.  If the VF(n) BAR space doesn't completely fill it, we might allocate
>> >the leftover space to another device.  Then the M64 window for *this*
>> >device may cause the other device to be associated with a PE it didn't
>> >expect.
>> 
>> Yes, this is the exact reason.
>
>Can you include some of this text in your changelog, then?  I can wordsmith
>it and try to make it fit together better.
>

Sure, I will do this.
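
[Editor's note: the expansion described above — sizing the IOV BAR for
total_pe segments rather than total_VFs — can be sketched as below. Names
are illustrative, not the kernel's.]

```c
#include <assert.h>

typedef unsigned long long resource_size_t;

/* Reserve total_pe segments rather than total_VFs, so the M64 window
 * segmented over this BAR never covers another device's MMIO.
 * Returns the new end address of the IOV BAR. */
static resource_size_t expand_iov_bar(resource_size_t start,
				      resource_size_t seg_size,
				      unsigned int total_pe)
{
	return start + seg_size * total_pe - 1;
}
```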

>> >> +#ifdef CONFIG_PCI_IOV
>> >> +static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>> >> +{
>> >> +	struct pci_controller *hose;
>> >> +	struct pnv_phb *phb;
>> >> +	struct resource *res;
>> >> +	int i;
>> >> +	resource_size_t size;
>> >> +	struct pci_dn *pdn;
>> >> +
>> >> +	if (!pdev->is_physfn || pdev->is_added)
>> >> +		return;
>> >> +
>> >> +	hose = pci_bus_to_host(pdev->bus);
>> >> +	phb = hose->private_data;
>> >> +
>> >> +	pdn = pci_get_pdn(pdev);
>> >> +	pdn->max_vfs = 0;
>> >> +
>> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> >> +		res = &pdev->resource[i + PCI_IOV_RESOURCES];
>> >> +		if (!res->flags || res->parent)
>> >> +			continue;
>> >> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
>> >> +			dev_warn(&pdev->dev, "Skipping expanding VF BAR%d: %pR\n",
>> >> +				 i, res);
>> >> +			continue;
>> >> +		}
>> >> +
>> >> +		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
>> >> +		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>> >> +		res->end = res->start + size * phb->ioda.total_pe - 1;
>> >> +		dev_dbg(&pdev->dev, "                       %pR\n", res);
>> >> +		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>> >> +				i, res, phb->ioda.total_pe);
>> >> +	}
>> >> +	pdn->max_vfs = phb->ioda.total_pe;
>> >> +}
>> >> +
>> >> +static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
>> >> +{
>> >> +	struct pci_dev *pdev;
>> >> +	struct pci_bus *b;
>> >> +
>> >> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
>> >> +		b = pdev->subordinate;
>> >> +
>> >> +		if (b)
>> >> +			pnv_pci_ioda_fixup_sriov(b);
>> >> +
>> >> +		pnv_pci_ioda_fixup_iov_resources(pdev);
>> >
>> >I'm not sure this happens at the right time.  We have this call chain:
>> >
>> >  pcibios_scan_phb
>> >    pci_create_root_bus
>> >    pci_scan_child_bus
>> >    pnv_pci_ioda_fixup_sriov
>> >      pnv_pci_ioda_fixup_iov_resources
>> >	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
>> >	  increase res->size to accommodate 256 PEs (or roundup(totalVFs))
>> >
>> >so we only do the fixup_iov_resources() when we scan the PHB, and we
>> >wouldn't do it at all for hot-added devices.
>> 
>> Yep, you are right :-)
>> 
>> I had a separate patch to do this in pcibios_add_pci_devices(). Looks we could
>> merge them.
>
>Did you fix this in v13?  I don't see the change if you did.
>

I added this in [PATCH v13 15/21].

In arch/powerpc/kernel/pci-hotplug.c, when a device is hot-plugged, the
fixup will be called on that bus too.
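
[Editor's note: the recursive bus walk in pnv_pci_ioda_fixup_sriov above can
be modeled with a toy tree. The structure below is a stand-in, not the
kernel's pci_bus/pci_dev lists.]

```c
#include <assert.h>
#include <stddef.h>

/* Visit every device on a bus, recursing into subordinate buses of
 * bridges first, just as the fixup above does. */
struct dev_node {
	struct dev_node *next;		/* sibling on the same bus */
	struct dev_node *subordinate;	/* child bus, if a bridge */
};

static int fixup_count(struct dev_node *bus)
{
	int n = 0;

	for (struct dev_node *d = bus; d; d = d->next) {
		if (d->subordinate)
			n += fixup_count(d->subordinate);
		n++;	/* pnv_pci_ioda_fixup_iov_resources(d) would run here */
	}
	return n;
}
```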

>> >> +	}
>> >> +}
>> >> +#endif /* CONFIG_PCI_IOV */
>> >> +
>> >>  /*
>> >>   * This function is supposed to be called on basis of PE from top
>> >>   * to bottom style. So the the I/O or MMIO segment assigned to
>> >> @@ -2125,6 +2180,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>> >>  	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
>> >>  	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
>> >>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
>> >> +#ifdef CONFIG_PCI_IOV
>> >> +	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
>> >> +#endif /* CONFIG_PCI_IOV */
>> >>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>> >>  
>> >>  	/* Reset IODA tables to a clean state */
>> >> 
>> 
>> -- 
>> Richard Yang
>> Help you, Help me
>> 

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
  2015-03-11  2:55         ` Bjorn Helgaas
@ 2015-03-11  6:42           ` Wei Yang
  -1 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-11  6:42 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Mar 10, 2015 at 09:55:19PM -0500, Bjorn Helgaas wrote:
>On Wed, Mar 04, 2015 at 11:01:24AM +0800, Wei Yang wrote:
>> On Tue, Feb 24, 2015 at 03:00:37AM -0600, Bjorn Helgaas wrote:
>> >On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
>> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> 
>> >> On PowerNV platform, resource position in M64 implies the PE# the resource
>> >> belongs to.  In some cases, adjustment of a resource is necessary to locate
>> >> it to a correct position in M64.
>> >> 
>> >> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
>> >> according to an offset.
>> >> 
>> >> [bhelgaas: rework loops, rework overlap check, index resource[]
>> >> conventionally, remove pci_regs.h include, squashed with next patch]
>> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> >
>> >...
>> >
>> >> +#ifdef CONFIG_PCI_IOV
>> >> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >> +{
>> >> +	struct pci_dn *pdn = pci_get_pdn(dev);
>> >> +	int i;
>> >> +	struct resource *res, res2;
>> >> +	resource_size_t size;
>> >> +	u16 vf_num;
>> >> +
>> >> +	if (!dev->is_physfn)
>> >> +		return -EINVAL;
>> >> +
>> >> +	/*
>> >> +	 * "offset" is in VFs.  The M64 windows are sized so that when they
>> >> +	 * are segmented, each segment is the same size as the IOV BAR.
>> >> +	 * Each segment is in a separate PE, and the high order bits of the
>> >> +	 * address are the PE number.  Therefore, each VF's BAR is in a
>> >> +	 * separate PE, and changing the IOV BAR start address changes the
>> >> +	 * range of PEs the VFs are in.
>> >> +	 */
>> >> +	vf_num = pdn->vf_pes;
>> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> >> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
>> >> +		if (!res->flags || !res->parent)
>> >> +			continue;
>> >> +
>> >> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> >> +			continue;
>> >> +
>> >> +		/*
>> >> +		 * The actual IOV BAR range is determined by the start address
>> >> +		 * and the actual size for vf_num VFs BAR.  This check is to
>> >> +		 * make sure that after shifting, the range will not overlap
>> >> +		 * with another device.
>> >> +		 */
>> >> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> >> +		res2.flags = res->flags;
>> >> +		res2.start = res->start + (size * offset);
>> >> +		res2.end = res2.start + (size * vf_num) - 1;
>> >> +
>> >> +		if (res2.end > res->end) {
>> >> +			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>> >> +				i, &res2, res, vf_num, offset);
>> >> +			return -EBUSY;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> >> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
>> >> +		if (!res->flags || !res->parent)
>> >> +			continue;
>> >> +
>> >> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> >> +			continue;
>> >> +
>> >> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> >> +		res2 = *res;
>> >> +		res->start += size * offset;
>> >
>> >I'm still not happy about this fiddling with res->start.
>> >
>> >Increasing res->start means that in principle, the "size * offset" bytes
>> >that we just removed from res are now available for allocation to somebody
>> >else.  I don't think we *will* give that space to anything else because of
>> >the alignment restrictions you're enforcing, but "res" now doesn't
>> >correctly describe the real resource map.
>> >
>> >Would you be able to just update the BAR here while leaving the struct
>> >resource alone?  In that case, it would look a little funny that lspci
>> >would show a BAR value in the middle of the region in /proc/iomem, but
>> >the /proc/iomem region would be more correct.
>> 
>> Bjorn,
>> 
>> I did some tests, but the result is not good.
>> 
>> What I did was still write the shifted resource address to the device with
>> pci_update_resource(), but revert res->start to the original value. If
>> this step is not correct, please let me know.
>> 
>> This can't work: after we revert res->start, the VFs will be given
>> resources starting from res->start instead of (res->start + offset * size),
>> which is not what we expect.
>
>Hmm, yes, I suppose we'd have to have a hook in pci_bus_alloc_from_region()
>or something.  That's getting a little messy.  I still don't like messing
>with the resource after it's in the resource tree, but I don't have a
>better idea right now.  So let's just go with what you have.
>

Thanks  :-)

I will state this in the change log and add a comment in the code to note
it. Hope this will be a little helpful.
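
[Editor's note: the "position in M64 implies the PE#" relationship from the
comment above can be sketched as simple arithmetic. Names and values are
illustrative, not the kernel's.]

```c
#include <assert.h>

typedef unsigned long long u64;

/* The M64 window is segmented; the segment index (the high-order
 * address bits relative to the window base) is the PE number. */
static int addr_to_pe(u64 m64_base, u64 seg_size, u64 addr)
{
	return (int)((addr - m64_base) / seg_size);
}

/* After shifting the IOV BAR start by "offset" segments, VF i's BAR
 * lands in PE (offset + i) relative to the window base. */
static int vf_pe(u64 m64_base, u64 seg_size, u64 bar_start,
		 int offset, int i)
{
	return addr_to_pe(m64_base, seg_size,
			  bar_start + seg_size * (u64)(offset + i));
}
```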

>> >> +
>> >> +		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
>> >> +			 i, &res2, res, vf_num, offset);
>> >> +		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
>> >> +	}
>> >> +	pdn->max_vfs -= offset;
>> >> +	return 0;
>> >> +}
>> >> +#endif /* CONFIG_PCI_IOV */
>> 
>> -- 
>> Richard Yang
>> Help you, Help me
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset
@ 2015-03-11  6:42           ` Wei Yang
  0 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-11  6:42 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Tue, Mar 10, 2015 at 09:55:19PM -0500, Bjorn Helgaas wrote:
>On Wed, Mar 04, 2015 at 11:01:24AM +0800, Wei Yang wrote:
>> On Tue, Feb 24, 2015 at 03:00:37AM -0600, Bjorn Helgaas wrote:
>> >On Tue, Feb 24, 2015 at 02:34:57AM -0600, Bjorn Helgaas wrote:
>> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> 
>> >> On PowerNV platform, resource position in M64 implies the PE# the resource
>> >> belongs to.  In some cases, adjustment of a resource is necessary to locate
>> >> it to a correct position in M64.
>> >> 
>> >> Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
>> >> according to an offset.
>> >> 
>> >> [bhelgaas: rework loops, rework overlap check, index resource[]
>> >> conventionally, remove pci_regs.h include, squashed with next patch]
>> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> >
>> >...
>> >
>> >> +#ifdef CONFIG_PCI_IOV
>> >> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >> +{
>> >> +	struct pci_dn *pdn = pci_get_pdn(dev);
>> >> +	int i;
>> >> +	struct resource *res, res2;
>> >> +	resource_size_t size;
>> >> +	u16 vf_num;
>> >> +
>> >> +	if (!dev->is_physfn)
>> >> +		return -EINVAL;
>> >> +
>> >> +	/*
>> >> +	 * "offset" is in VFs.  The M64 windows are sized so that when they
>> >> +	 * are segmented, each segment is the same size as the IOV BAR.
>> >> +	 * Each segment is in a separate PE, and the high order bits of the
>> >> +	 * address are the PE number.  Therefore, each VF's BAR is in a
>> >> +	 * separate PE, and changing the IOV BAR start address changes the
>> >> +	 * range of PEs the VFs are in.
>> >> +	 */
>> >> +	vf_num = pdn->vf_pes;
>> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> >> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
>> >> +		if (!res->flags || !res->parent)
>> >> +			continue;
>> >> +
>> >> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> >> +			continue;
>> >> +
>> >> +		/*
>> >> +		 * The actual IOV BAR range is determined by the start address
>> >> +		 * and the actual size for vf_num VFs BAR.  This check is to
>> >> +		 * make sure that after shifting, the range will not overlap
>> >> +		 * with another device.
>> >> +		 */
>> >> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> >> +		res2.flags = res->flags;
>> >> +		res2.start = res->start + (size * offset);
>> >> +		res2.end = res2.start + (size * vf_num) - 1;
>> >> +
>> >> +		if (res2.end > res->end) {
>> >> +			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>> >> +				i, &res2, res, vf_num, offset);
>> >> +			return -EBUSY;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>> >> +		res = &dev->resource[i + PCI_IOV_RESOURCES];
>> >> +		if (!res->flags || !res->parent)
>> >> +			continue;
>> >> +
>> >> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> >> +			continue;
>> >> +
>> >> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> >> +		res2 = *res;
>> >> +		res->start += size * offset;
>> >
>> >I'm still not happy about this fiddling with res->start.
>> >
>> >Increasing res->start means that in principle, the "size * offset" bytes
>> >that we just removed from res are now available for allocation to somebody
>> >else.  I don't think we *will* give that space to anything else because of
>> >the alignment restrictions you're enforcing, but "res" now doesn't
>> >correctly describe the real resource map.
>> >
>> >Would you be able to just update the BAR here while leaving the struct
>> >resource alone?  In that case, it would look a little funny that lspci
>> >would show a BAR value in the middle of the region in /proc/iomem, but
>> >the /proc/iomem region would be more correct.
>> 
>> Bjorn,
>> 
>> I did some tests, but the result is not good.
>> 
>> What I did: still write the shifted resource address to the device with
>> pci_update_resource(), but revert res->start to the original value.  If
>> this step is not correct, please let me know.
>> 
>> This can't work: after we revert res->start, the VFs are given resources
>> starting from res->start instead of (res->start + offset * size), which
>> is not what we expect.
>
>Hmm, yes, I suppose we'd have to have a hook in pci_bus_alloc_from_region()
>or something.  That's getting a little messy.  I still don't like messing
>with the resource after it's in the resource tree, but I don't have a
>better idea right now.  So let's just go with what you have.
>

Thanks  :-)

I will state this in the change log and add a comment in the code to note
it.  Hopefully that helps a little.

>> >> +
>> >> +		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
>> >> +			 i, &res2, res, vf_num, offset);
>> >> +		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
>> >> +	}
>> >> +	pdn->max_vfs -= offset;
>> >> +	return 0;
>> >> +}
>> >> +#endif /* CONFIG_PCI_IOV */
>> 
>> -- 
>> Richard Yang
>> Help you, Help me
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
  2015-03-11  2:36         ` Bjorn Helgaas
@ 2015-03-11  9:17           ` Wei Yang
  -1 siblings, 0 replies; 69+ messages in thread
From: Wei Yang @ 2015-03-11  9:17 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Tue, Mar 10, 2015 at 09:36:58PM -0500, Bjorn Helgaas wrote:
>On Mon, Mar 02, 2015 at 03:32:47PM +0800, Wei Yang wrote:
>> On Tue, Feb 24, 2015 at 02:41:52AM -0600, Bjorn Helgaas wrote:
>> >On Tue, Feb 24, 2015 at 02:34:06AM -0600, Bjorn Helgaas wrote:
>> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> 
>> >> When sizing and assigning resources, we divide the resources into two
>> >> lists: the requested list and the additional list.  We don't consider the
>> >> alignment of additional VF(n) BAR space.
>> >> 
>> >> This is reasonable because the alignment required for the VF(n) BAR space
>> >> is the size of an individual VF BAR, not the size of the space for *all*
>> >> VFs.  But some platforms, e.g., PowerNV, require additional alignment.
>> >> 
>> >> Consider the additional IOV BAR alignment when sizing and assigning
>> >> resources.  When there is not enough system MMIO space, the PF's IOV BAR
>> >> alignment will not contribute to the bridge.  When there is enough system
>> >> MMIO space, the additional alignment will contribute to the bridge.
>> >
>I don't understand the "when there is not enough system MMIO space" part.
>> >How do we tell if there's enough MMIO space?
>> >
>> 
>> In __assign_resources_sorted(), there are two resource lists: one for the
>> requested resources (head) and one for the additional resources
>> (realloc_head).  The function first tries to combine them and assign both.
>> If that fails, it means we don't have enough MMIO space.
>
>How about this text:
>
>  This is because the alignment required for the VF(n) BAR space is the size
>  of an individual VF BAR, not the size of the space for *all* VFs.  But we
>  want additional alignment to support partitioning on PowerNV.
>
>  Consider the additional IOV BAR alignment when sizing and assigning
>  resources.  When there is not enough system MMIO space to accommodate both
>  the requested list and the additional list, the PF's IOV BAR alignment will
>  not contribute to the bridge.  When there is enough system MMIO space for
>  both lists, the additional alignment will contribute to the bridge.
>
>We're doing something specifically for PowerNV.  I would really like to be
>able to read this patch and say "Oh, here's the hook where we get the
>PowerNV behavior, and it's obvious that other platforms are unaffected."
>But I don't see a pcibios or similar hook, so I don't know where that
>PowerNV behavior is.
>
>Is it something to do with get_res_add_align()?  That uses min_align, but I
>don't know how that's connected ...  ah, I see, "add_align" is computed
>from pci_resource_alignment(), which has this path:
>
>  pci_resource_alignment
>    pci_sriov_resource_alignment
>      pcibios_iov_resource_alignment
>
>and powerpc has a special pcibios_iov_resource_alignment() for PowerNV.
>

Thanks for the text. I have added it to the change log, along with some
description of how this gives the arch a chance to be involved.

>> >> Also, take advantage of pci_dev_resource::min_align to store this
>> >> additional alignment.
>> >
>> >This comment doesn't seem to make sense; this patch doesn't save anything
>> >in min_align.
>> 
>> At the end of this patch:
>> 
>>    add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
>> 
>> The add_align is stored in pci_dev_resource::min_align in add_to_list(). And
>> retrieved by get_res_add_align() in below code. This field is not used
>> previously, so I took advantage of this field to store the alignment of the
>> additional resources.
>
>Hmm.  pci_dev_resource::min_align *is* already used in
>reassign_resources_sorted().  Maybe there's no overlap; I gave up the
>analysis before I could convince myself.
>

Bjorn,

I know you may have some concerns about this; let me explain how I
understand the code.  If my understanding is not correct, please let me
know.

In __assign_resources_sorted(), we pass two resource lists: the required
list and the additional list.  First, we try our best to assign both of
them by merging them together.  If this fails, we assign the required list
first and then take care of the additional list.

There is one interesting thing in the first step.  We merge the two lists
into the required list (the "head" list in the code), and this patch fixes
up the alignment in the required list.  Before doing so, we save the
original information in "save_head".  When we fail to assign the merged
list, we restore the required list; this means we undo the alignment
adjustment made by this patch and make sure the required resources are
assigned with just the basic alignment.

The use of min_align in reassign_resources_sorted() happens in the second
part, which assigns the additional list individually.  In the realloc_head
list, those resources still carry the add_align calculated in
pbus_size_mem(), and we try to allocate them with that alignment, which is
exactly what we want.

BTW, reading the code again, it looks like I missed changing one place in
reassign_resources_sorted().  In the (!resource_size(res)) case, we rely
on res->start being the alignment.  Since the alignment is no longer the
start address, we need to fix that part too.

I may be missing some background on the code; if my understanding is not
correct, I'd be glad to hear from you.

>The changelog needs to mention the add_to_list() connection.
>

Added to the change log.

>> >> +		/*
>> >> +		 * There are two kinds of additional resources in the list:
>> >> +		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
>> >> +		 * 2. SR-IOV resource   -- IORESOURCE_SIZEALIGN
>> >> +		 * Here just fix the additional alignment for bridge
>> >> +		 */
>> >> +		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
>> >> +			continue;
>> >> +
>> >> +		add_align = get_res_add_align(realloc_head, dev_res->res);
>> >> +
>> >> +		/* Reorder the list by their alignment */
>> >
>> >Why do we need to reorder the list by alignment?
>> 
>> The resource list "head" is sorted by alignment, but the alignment can
>> change after we consider the additional resources.
>> 
>> Take the powernv platform as an example.  The IOV BAR is expanded and
>> needs to be aligned to its total size instead of the individual VF BAR
>> size.  If we don't reorder the list, the IOV BAR would be assigned after
>> some other resources, which may cause the real assignment to fail even
>> though the total size is enough.
>
>This is worthy of a comment in the code.
>
>Bjorn

-- 
Richard Yang
Help you, Help me



* Re: [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-03-11  6:22           ` Wei Yang
@ 2015-03-11 13:40             ` Bjorn Helgaas
  -1 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2015-03-11 13:40 UTC (permalink / raw)
  To: Wei Yang; +Cc: Benjamin Herrenschmidt, Gavin Shan, linux-pci, linuxppc-dev

On Wed, Mar 11, 2015 at 1:22 AM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:
> On Tue, Mar 10, 2015 at 09:51:25PM -0500, Bjorn Helgaas wrote:
>>On Mon, Mar 02, 2015 at 03:41:32PM +0800, Wei Yang wrote:
>>> On Tue, Feb 24, 2015 at 02:52:34AM -0600, Bjorn Helgaas wrote:
>>> >On Tue, Feb 24, 2015 at 02:34:42AM -0600, Bjorn Helgaas wrote:
>>> >> From: Wei Yang <weiyang@linux.vnet.ibm.com>
>>> >>
>>> >> On PHB3, PF IOV BAR will be covered by M64 window to have better PE
>>> >> isolation.  The total_pe number is usually different from total_VFs, which
>>> >> can lead to a conflict between MMIO space and the PE number.
>>> >>
>>> >> For example, if total_VFs is 128 and total_pe is 256, the second half of
>>> >> the M64 window may cover other PCI devices, which may already belong
>>> >> to other PEs.
>>> >
>>> >I'm still trying to wrap my mind around the explanation here.
>>> >
>>> >I *think* what's going on is that the M64 window must be a power-of-two
>>> >size.  If the VF(n) BAR space doesn't completely fill it, we might allocate
>>> >the leftover space to another device.  Then the M64 window for *this*
>>> >device may cause the other device to be associated with a PE it didn't
>>> >expect.
>>>
>>> Yes, this is the exact reason.
>>
>>Can you include some of this text in your changelog, then?  I can wordsmith
>>it and try to make it fit together better.
>>
>
> Sure, I will do this.
>
>>> >> +#ifdef CONFIG_PCI_IOV
>>> >> +static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>> >> +{
>>> >> + struct pci_controller *hose;
>>> >> + struct pnv_phb *phb;
>>> >> + struct resource *res;
>>> >> + int i;
>>> >> + resource_size_t size;
>>> >> + struct pci_dn *pdn;
>>> >> +
>>> >> + if (!pdev->is_physfn || pdev->is_added)
>>> >> +         return;
>>> >> +
>>> >> + hose = pci_bus_to_host(pdev->bus);
>>> >> + phb = hose->private_data;
>>> >> +
>>> >> + pdn = pci_get_pdn(pdev);
>>> >> + pdn->max_vfs = 0;
>>> >> +
>>> >> + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>>> >> +         res = &pdev->resource[i + PCI_IOV_RESOURCES];
>>> >> +         if (!res->flags || res->parent)
>>> >> +                 continue;
>>> >> +         if (!pnv_pci_is_mem_pref_64(res->flags)) {
>>> >> +                 dev_warn(&pdev->dev, "Skipping expanding VF BAR%d: %pR\n",
>>> >> +                          i, res);
>>> >> +                 continue;
>>> >> +         }
>>> >> +
>>> >> +         dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
>>> >> +         size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>>> >> +         res->end = res->start + size * phb->ioda.total_pe - 1;
>>> >> +         dev_dbg(&pdev->dev, "                       %pR\n", res);
>>> >> +         dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>>> >> +                         i, res, phb->ioda.total_pe);
>>> >> + }
>>> >> + pdn->max_vfs = phb->ioda.total_pe;
>>> >> +}
>>> >> +
>>> >> +static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
>>> >> +{
>>> >> + struct pci_dev *pdev;
>>> >> + struct pci_bus *b;
>>> >> +
>>> >> + list_for_each_entry(pdev, &bus->devices, bus_list) {
>>> >> +         b = pdev->subordinate;
>>> >> +
>>> >> +         if (b)
>>> >> +                 pnv_pci_ioda_fixup_sriov(b);
>>> >> +
>>> >> +         pnv_pci_ioda_fixup_iov_resources(pdev);
>>> >
>>> >I'm not sure this happens at the right time.  We have this call chain:
>>> >
>>> >  pcibios_scan_phb
>>> >    pci_create_root_bus
>>> >    pci_scan_child_bus
>>> >    pnv_pci_ioda_fixup_sriov
>>> >      pnv_pci_ioda_fixup_iov_resources
>>> >    for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
>>> >      increase res->size to accommodate 256 PEs (or roundup(totalVFs))
>>> >
>>> >so we only do the fixup_iov_resources() when we scan the PHB, and we
>>> >wouldn't do it at all for hot-added devices.
>>>
>>> Yep, you are right :-)
>>>
>>> I had a separate patch to do this in pcibios_add_pci_devices().  Looks
>>> like we could merge them.
>>
>>Did you fix this in v13?  I don't see the change if you did.
>>
>
> I added this in [PATCH V13 15/21].
>
> In arch/powerpc/kernel/pci-hotplug.c, when hot-plugging a device, the
> fixup is called on that bus too.

Ah, OK, thanks for the pointer.

>>> >> + }
>>> >> +}
>>> >> +#endif /* CONFIG_PCI_IOV */
>>> >> +
>>> >>  /*
>>> >>   * This function is supposed to be called on basis of PE from top
>>> >>   * to bottom style. So the the I/O or MMIO segment assigned to
>>> >> @@ -2125,6 +2180,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>> >>   ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
>>> >>   ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
>>> >>   ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>> >> +#ifdef CONFIG_PCI_IOV
>>> >> + ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
>>> >> +#endif /* CONFIG_PCI_IOV */
>>> >>   pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>>> >>
>>> >>   /* Reset IODA tables to a clean state */
>>> >>
>>>
>>> --
>>> Richard Yang
>>> Help you, Help me
>>>
>
> --
> Richard Yang
> Help you, Help me
>


end of thread, other threads:[~2015-03-11 13:41 UTC | newest]

Thread overview: 69+ messages
-- links below jump to the message on this page --
2015-02-24  8:32 [PATCH v12 00/21] Enable SRIOV on Power8 Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 01/21] PCI: Print more info in sriov_enable() error message Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 02/21] PCI: Print PF SR-IOV resource that contains all VF(n) BAR space Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 03/21] PCI: Keep individual VF BAR size in struct pci_sriov Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 04/21] PCI: Index IOV resources in the conventional style Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 05/21] PCI: Refresh First VF Offset and VF Stride when updating NumVFs Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 06/21] PCI: Calculate maximum number of buses required for VFs Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 07/21] PCI: Export pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() Bjorn Helgaas
2015-02-24  8:33 ` [PATCH v12 08/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable() Bjorn Helgaas
2015-02-24  8:39   ` Bjorn Helgaas
2015-03-02  6:53     ` Wei Yang
2015-03-02  6:53       ` Wei Yang
2015-02-24  8:33 ` [PATCH v12 09/21] PCI: Add pcibios_iov_resource_alignment() interface Bjorn Helgaas
2015-02-24  8:34 ` [PATCH v12 10/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning Bjorn Helgaas
2015-02-24  8:41   ` Bjorn Helgaas
2015-03-02  7:32     ` Wei Yang
2015-03-02  7:32       ` Wei Yang
2015-03-11  2:36       ` Bjorn Helgaas
2015-03-11  2:36         ` Bjorn Helgaas
2015-03-11  9:17         ` Wei Yang
2015-03-11  9:17           ` Wei Yang
2015-02-24  8:34 ` [PATCH v12 11/21] powerpc/pci: Don't unset PCI resources for VFs Bjorn Helgaas
2015-02-24  8:44   ` Bjorn Helgaas
2015-03-02  7:34     ` Wei Yang
2015-03-02  7:34       ` Wei Yang
2015-02-24  8:34 ` [PATCH v12 12/21] powerpc/pci: Refactor pci_dn Bjorn Helgaas
2015-02-24  8:34 ` [PATCH v12 13/21] powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor Bjorn Helgaas
2015-02-24  8:34 ` [PATCH v12 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically Bjorn Helgaas
2015-02-24  8:46   ` Bjorn Helgaas
2015-03-02  7:50     ` Wei Yang
2015-03-02  7:50       ` Wei Yang
2015-03-02  7:56       ` Benjamin Herrenschmidt
2015-03-02  7:56         ` Benjamin Herrenschmidt
2015-03-02  8:02         ` Wei Yang
2015-03-02  8:02           ` Wei Yang
2015-03-11  2:47       ` Bjorn Helgaas
2015-03-11  2:47         ` Bjorn Helgaas
2015-03-11  6:13         ` Wei Yang
2015-03-11  6:13           ` Wei Yang
2015-02-24  8:34 ` [PATCH v12 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe Bjorn Helgaas
2015-02-24  8:52   ` Bjorn Helgaas
2015-03-02  7:41     ` Wei Yang
2015-03-02  7:41       ` Wei Yang
2015-03-11  2:51       ` Bjorn Helgaas
2015-03-11  2:51         ` Bjorn Helgaas
2015-03-11  6:22         ` Wei Yang
2015-03-11  6:22           ` Wei Yang
2015-03-11 13:40           ` Bjorn Helgaas
2015-03-11 13:40             ` Bjorn Helgaas
2015-02-24  8:34 ` [PATCH v12 16/21] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Bjorn Helgaas
2015-02-24  8:34 ` [PATCH v12 17/21] powerpc/powernv: Shift VF resource with an offset Bjorn Helgaas
2015-02-24  9:00   ` Bjorn Helgaas
2015-02-24 17:10     ` Bjorn Helgaas
2015-03-02  7:58       ` Wei Yang
2015-03-02  7:58         ` Wei Yang
2015-03-04  3:01     ` Wei Yang
2015-03-04  3:01       ` Wei Yang
2015-03-11  2:55       ` Bjorn Helgaas
2015-03-11  2:55         ` Bjorn Helgaas
2015-03-11  6:42         ` Wei Yang
2015-03-11  6:42           ` Wei Yang
2015-02-24  9:03   ` Bjorn Helgaas
2015-02-24  8:35 ` [PATCH v12 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported Bjorn Helgaas
2015-02-24  9:06   ` Bjorn Helgaas
2015-03-02  7:55     ` Wei Yang
2015-03-02  7:55       ` Wei Yang
2015-02-24  8:35 ` [PATCH v12 19/21] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Bjorn Helgaas
2015-02-24  8:35 ` [PATCH v12 20/21] powerpc/pci: Remove unused struct pci_dn.pcidev field Bjorn Helgaas
2015-02-24  8:35 ` [PATCH v12 21/21] powerpc/pci: Add PCI resource alignment documentation Bjorn Helgaas
